## summary

- summary
    - GPT 非常擅长不需要特别精确的场景，比如 summary
    - summary 的 prompt
- max_token限制与长文本处理
    - 长文本分 chunk
        - 逐 chunk 调用 openai.api
    - 再基于 chunk summary 进行 summary
    - ChatPDF 的实现原理
        - summary 分 chunk，结构化的分 chunk
        - reference：embedding 的 match

## 认识输入文本

In [58]:
def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        # 返回全部，以一个字符串的形式；
        return f.read()

In [60]:
input_text = open_file('./data/input.txt')
len(input_text)

20190

### token count

In [62]:
from transformers import AutoTokenizer
import tiktoken

In [63]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')
len(tokenizer.encode(input_text, truncation=False))

Token indices sequence length is longer than the specified maximum sequence length for this model (4367 > 1024). Running this sequence through the model will result in indexing errors


4367

In [64]:
tokenizer = tiktoken.get_encoding('gpt2')
len(tokenizer.encode(input_text))

4367

## 划分 chunk

In [65]:
def file_to_chunks(filename, chunk_size=2000, overlap=100):
    input_text = open_file(filename)

    token_ids = tokenizer.encode(input_text)
    num_tokens = len(token_ids)
    
    chunks = []
    for i in range(0, num_tokens, chunk_size-overlap):
        chunk = token_ids[i:(i+chunk_size)]
        chunks.append(chunk)
    return chunks

In [66]:
chunks = file_to_chunks('./data/input.txt')

In [67]:
len(chunks)

3

In [70]:
# tokenizer.decode(chunks[1])

## 调用接口

- `text-davinci-003`：
    - 0.02/1k
- `gpt-3.5-turbo`：
    - 更便宜，0.002/1k

In [71]:
import openai
import tiktoken

In [72]:
openai.version.VERSION

'0.27.2'

In [73]:
openai.api_key = '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'

In [74]:
tokenizer = tiktoken.get_encoding('gpt2')
chunks = file_to_chunks('./data/input.txt')
summary_outputs = []
for chunk in chunks:
    prompt = open_file('./data/summary_prompt.txt').replace('<<SUMMARY>>', tokenizer.decode(chunk))
    messages = [{"role": "system", "content": "This is text summarization."}]    
    messages.append({"role": "user", "content": prompt})
    resp = openai.ChatCompletion.create(model='gpt-3.5-turbo', 
                                        messages=messages, )
    print(resp['usage']['prompt_tokens'], resp['usage']['completion_tokens'], resp['usage']['total_tokens'])
    summary_outputs.append(resp['choices'][0]['message']['content'].strip())

1923 104 2027
1884 117 2001
555 78 633


In [75]:
summary_outputs

['Xerox photocopiers use a lossy compression format that can potentially result in subtle inaccuracies that go unnoticed. This is similar to the way large language models like OpenAI\'s ChatGPT use statistical regularities to create text that approximates the original but may contain nonsensical answers or "hallucinations." However, the concept of compression can also be applied to understanding text and creating artificial intelligence. Better text compression could be key to achieving human-level artificial intelligence by identifying the principles that underlie the text.',
 'Large language models like GPT-3 can identify statistical regularities in text, leading some to question if they have true understanding of subjects like economic theory. However, their failure to accurately perform elementary arithmetic suggests they are lacking in true comprehension. The blurriness of their output also raises concerns about their use in content creation or as replacements for search engines. 

## 结果汇聚

In [53]:
final_prompt = open_file('./data/summary_prompt.txt').replace('<<SUMMARY>>', ' '.join(summary_outputs))

In [54]:
messages = [{"role": "system", "content": "This is text summarization."}]    
messages.append({"role": "user", "content": prompt})
resp = openai.ChatCompletion.create(model='gpt-3.5-turbo', 
                                    messages=messages, )

In [55]:
resp

<OpenAIObject chat.completion id=chatcmpl-6xof6CqdfvWxzCqpubxGkeqwHOgX9 at 0x123698ae0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The article argues that relying solely on AI-generated text to create original content is not a good approach. The hours spent writing unoriginal work are critical in developing the skills needed to ultimately create something novel. In addition, text generated by AI lacks the amorphous dissatisfaction and awareness of the distance between what it says and what the writer wants it to say, which is critical in the rewriting process. While AI-generated text may have its uses, it is not a substitute for human writing in its current state.",
        "role": "assistant"
      }
    }
  ],
  "created": 1679714116,
  "id": "chatcmpl-6xof6CqdfvWxzCqpubxGkeqwHOgX9",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 101,
    "prompt_tokens": 5

In [57]:
resp['choices'][0]['message']['content'].strip()

'The article argues that relying solely on AI-generated text to create original content is not a good approach. The hours spent writing unoriginal work are critical in developing the skills needed to ultimately create something novel. In addition, text generated by AI lacks the amorphous dissatisfaction and awareness of the distance between what it says and what the writer wants it to say, which is critical in the rewriting process. While AI-generated text may have its uses, it is not a substitute for human writing in its current state.'