{ "cells": [ { "cell_type": "markdown", "id": "20aefa6a", "metadata": {}, "source": [ "## summary" ] }, { "cell_type": "markdown", "id": "4e41fed0", "metadata": {}, "source": [ "- summary\n", " - GPT 非常擅长不需要特别精确的场景，比如 summary\n", " - summary 的 prompt\n", "- max_token限制与长文本处理\n", " - 长文本分 chunk\n", " - 逐 chunk 调用 openai.api\n", " - 再基于 chunk summary 进行 summary\n", " - ChatPDF 的实现原理\n", " - summary 分 chunk，结构化的分 chunk\n", " - reference：embedding 的 match" ] }, { "cell_type": "markdown", "id": "9055fbc6", "metadata": {}, "source": [ "## 认识输入文本" ] }, { "cell_type": "code", "execution_count": 58, "id": "1b60415f", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:37:46.540909Z", "start_time": "2023-03-25T03:37:46.530028Z" } }, "outputs": [], "source": [ "def open_file(filepath):\n", " with open(filepath, 'r', encoding='utf-8') as f:\n", " # 返回全部，以一个字符串的形式；\n", " return f.read()" ] }, { "cell_type": "code", "execution_count": 60, "id": "b79cafb5", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:38:30.749151Z", "start_time": "2023-03-25T03:38:30.739893Z" } }, "outputs": [ { "data": { "text/plain": [ "20190" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "input_text = open_file('./data/input.txt')\n", "len(input_text)" ] }, { "cell_type": "markdown", "id": "076b1015", "metadata": {}, "source": [ "### token count" ] }, { "cell_type": "code", "execution_count": 62, "id": "05ee80af", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:39:00.874017Z", "start_time": "2023-03-25T03:39:00.866520Z" } }, "outputs": [], "source": [ "from transformers import AutoTokenizer\n", "import tiktoken" ] }, { "cell_type": "code", "execution_count": 63, "id": "5cbd8e91", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:39:10.810966Z", "start_time": "2023-03-25T03:39:06.573903Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Token indices sequence length is longer than the specified maximum sequence length for this model (4367 > 1024). Running this sequence through the model will result in indexing errors\n" ] }, { "data": { "text/plain": [ "4367" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer = AutoTokenizer.from_pretrained('gpt2')\n", "len(tokenizer.encode(input_text, truncation=False))" ] }, { "cell_type": "code", "execution_count": 64, "id": "bba9eebd", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:39:30.380961Z", "start_time": "2023-03-25T03:39:30.286714Z" } }, "outputs": [ { "data": { "text/plain": [ "4367" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer = tiktoken.get_encoding('gpt2')\n", "len(tokenizer.encode(input_text))" ] }, { "cell_type": "markdown", "id": "02609eaa", "metadata": {}, "source": [ "## 划分 chunk" ] }, { "cell_type": "code", "execution_count": 65, "id": "b5f89a42", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:42:02.897261Z", "start_time": "2023-03-25T03:42:02.885312Z" } }, "outputs": [], "source": [ "def file_to_chunks(filename, chunk_size=2000, overlap=100):\n", " input_text = open_file(filename)\n", "\n", " token_ids = tokenizer.encode(input_text)\n", " num_tokens = len(token_ids)\n", " \n", " chunks = []\n", " for i in range(0, num_tokens, chunk_size-overlap):\n", " chunk = token_ids[i:(i+chunk_size)]\n", " chunks.append(chunk)\n", " return chunks" ] }, { "cell_type": "code", "execution_count": 66, "id": "05772ebc", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:42:15.399307Z", "start_time": "2023-03-25T03:42:15.372440Z" } }, "outputs": [], "source": [ "chunks = file_to_chunks('./data/input.txt')" ] }, { "cell_type": "code", "execution_count": 67, "id": "9dbe9ded", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:42:17.382547Z", "start_time": "2023-03-25T03:42:17.364621Z" } }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(chunks)" ] }, { "cell_type": "code", "execution_count": 70, "id": "68ef225d", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:43:08.659186Z", "start_time": "2023-03-25T03:43:08.655188Z" } }, "outputs": [], "source": [ "# tokenizer.decode(chunks[1])" ] }, { "cell_type": "markdown", "id": "74d7f76f", "metadata": {}, "source": [ "## 调用接口" ] }, { "cell_type": "markdown", "id": "e66eb739", "metadata": {}, "source": [ "- `text-davinci-003`：\n", " - 0.02/1k\n", "- `gpt-3.5-turbo`：\n", " - 更便宜，0.002/1k" ] }, { "cell_type": "code", "execution_count": 71, "id": "22fed262", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:43:53.933175Z", "start_time": "2023-03-25T03:43:53.929239Z" } }, "outputs": [], "source": [ "import openai\n", "import tiktoken" ] }, { "cell_type": "code", "execution_count": 72, "id": "832df3d1", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:43:55.513977Z", "start_time": "2023-03-25T03:43:55.504760Z" } }, "outputs": [ { "data": { "text/plain": [ "'0.27.2'" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "openai.version.VERSION" ] }, { "cell_type": "code", "execution_count": 73, "id": "6fd1fb6a", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:44:04.218264Z", "start_time": "2023-03-25T03:44:04.213772Z" } }, "outputs": [], "source": [ "openai.api_key = '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'" ] }, { "cell_type": "code", "execution_count": 74, "id": "d3869c73", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:47:07.940726Z", "start_time": "2023-03-25T03:46:53.916405Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1923 104 2027\n", "1884 117 2001\n", "555 78 633\n" ] } ], "source": [ "tokenizer = tiktoken.get_encoding('gpt2')\n", "chunks = file_to_chunks('./data/input.txt')\n", "summary_outputs = []\n", "for chunk in chunks:\n", " prompt = open_file('./data/summary_prompt.txt').replace('<

>', tokenizer.decode(chunk))\n", " messages = [{\"role\": \"system\", \"content\": \"This is text summarization.\"}] \n", " messages.append({\"role\": \"user\", \"content\": prompt})\n", " resp = openai.ChatCompletion.create(model='gpt-3.5-turbo', \n", " messages=messages, )\n", " print(resp['usage']['prompt_tokens'], resp['usage']['completion_tokens'], resp['usage']['total_tokens'])\n", " summary_outputs.append(resp['choices'][0]['message']['content'].strip())" ] }, { "cell_type": "code", "execution_count": 75, "id": "e0f9fe49", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:47:16.308974Z", "start_time": "2023-03-25T03:47:16.295346Z" } }, "outputs": [ { "data": { "text/plain": [ "['Xerox photocopiers use a lossy compression format that can potentially result in subtle inaccuracies that go unnoticed. This is similar to the way large language models like OpenAI\\'s ChatGPT use statistical regularities to create text that approximates the original but may contain nonsensical answers or \"hallucinations.\" However, the concept of compression can also be applied to understanding text and creating artificial intelligence. Better text compression could be key to achieving human-level artificial intelligence by identifying the principles that underlie the text.',\n", " 'Large language models like GPT-3 can identify statistical regularities in text, leading some to question if they have true understanding of subjects like economic theory. However, their failure to accurately perform elementary arithmetic suggests they are lacking in true comprehension. The blurriness of their output also raises concerns about their use in content creation or as replacements for search engines. A useful criterion for evaluating their quality may be whether their text is good enough to train new models. While they may be useful as a starting point for writing, relying on their output could hinder the development of creativity and originality.',\n", " \"Using large language models for writing is not a substitute for acquiring writing skills through practice and effort. Writing unoriginal work is necessary for eventually creating something original. Additionally, using an AI-generated text initially lacks the writer's own dissatisfaction with their work, which guides them towards improving it. A large language model may be useful in certain circumstances, but it is not necessary for access to the internet.\"]" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary_outputs" ] }, { "cell_type": "markdown", "id": "d6122d06", "metadata": {}, "source": [ "## 结果汇聚" ] }, { "cell_type": "code", "execution_count": 53, "id": "1ca781a2", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:14:51.772335Z", "start_time": "2023-03-25T03:14:51.768690Z" } }, "outputs": [], "source": [ "final_prompt = open_file('./data/summary_prompt.txt').replace('<

>', ' '.join(summary_outputs))" ] }, { "cell_type": "code", "execution_count": 54, "id": "a06e8d72", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:15:20.964096Z", "start_time": "2023-03-25T03:15:15.639347Z" } }, "outputs": [], "source": [ "messages = [{\"role\": \"system\", \"content\": \"This is text summarization.\"}] \n", "messages.append({\"role\": \"user\", \"content\": prompt})\n", "resp = openai.ChatCompletion.create(model='gpt-3.5-turbo', \n", " messages=messages, )" ] }, { "cell_type": "code", "execution_count": 55, "id": "9721799f", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:15:24.755502Z", "start_time": "2023-03-25T03:15:24.750926Z" } }, "outputs": [ { "data": { "text/plain": [ " JSON: {\n", " \"choices\": [\n", " {\n", " \"finish_reason\": \"stop\",\n", " \"index\": 0,\n", " \"message\": {\n", " \"content\": \"The article argues that relying solely on AI-generated text to create original content is not a good approach. The hours spent writing unoriginal work are critical in developing the skills needed to ultimately create something novel. In addition, text generated by AI lacks the amorphous dissatisfaction and awareness of the distance between what it says and what the writer wants it to say, which is critical in the rewriting process. While AI-generated text may have its uses, it is not a substitute for human writing in its current state.\",\n", " \"role\": \"assistant\"\n", " }\n", " }\n", " ],\n", " \"created\": 1679714116,\n", " \"id\": \"chatcmpl-6xof6CqdfvWxzCqpubxGkeqwHOgX9\",\n", " \"model\": \"gpt-3.5-turbo-0301\",\n", " \"object\": \"chat.completion\",\n", " \"usage\": {\n", " \"completion_tokens\": 101,\n", " \"prompt_tokens\": 555,\n", " \"total_tokens\": 656\n", " }\n", "}" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "resp" ] }, { "cell_type": "code", "execution_count": 57, "id": "52845377", "metadata": { "ExecuteTime": { "end_time": "2023-03-25T03:15:34.645487Z", "start_time": "2023-03-25T03:15:34.640744Z" } }, "outputs": [ { "data": { "text/plain": [ "'The article argues that relying solely on AI-generated text to create original content is not a good approach. The hours spent writing unoriginal work are critical in developing the skills needed to ultimately create something novel. In addition, text generated by AI lacks the amorphous dissatisfaction and awareness of the distance between what it says and what the writer wants it to say, which is critical in the rewriting process. While AI-generated text may have its uses, it is not a substitute for human writing in its current state.'" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "resp['choices'][0]['message']['content'].strip()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }