From d6f8916325796941b22c1291ca9ec6f234ad28fc Mon Sep 17 00:00:00 2001
From: "OpenAI model gpt-3.5-turbo"
Date: Tue, 7 Mar 2023 15:51:17 +0000
Subject: [PATCH] revise using AI model

Using the OpenAI model gpt-3.5-turbo
---
 content/01.abstract.md     | 14 ++++++
 content/02.introduction.md | 18 +++++++
 content/03.methods.md      | 68 ++++++++++++++++++++++++++-
 content/04.results.md      | 96 ++++++++++++++++++++++++++++++++++++++
 content/05.conclusions.md  | 18 +++++++
 5 files changed, 213 insertions(+), 1 deletion(-)

diff --git a/content/01.abstract.md b/content/01.abstract.md
index 0ffd905..05a00d5 100644
--- a/content/01.abstract.md
+++ b/content/01.abstract.md
@@ -1,5 +1,18 @@
## Abstract {.page_break_before}

In this work, we investigate how models with advanced natural language processing capabilities can be used to reduce the time-consuming process of writing and revising scholarly manuscripts.
To this end, we integrate large language models into the Manubot publishing ecosystem to suggest revisions for scholarly text.
Our AI-based revision workflow uses a prompt generator that integrates metadata from the manuscript into prompt templates to generate section-specific instructions for the language model.

@@ -8,3 +21,4 @@ We tested our AI-based revision workflow in three case studies of existing manus
Our results suggest that these models can capture the concepts in the scholarly text and produce high-quality revisions that improve clarity.
All changes to the manuscript are tracked using a version control system, providing transparency into the human or machine origin of text.
Given the amount of time that researchers put into crafting prose, we anticipate that this advance will significantly improve the type of knowledge work performed by academics.

diff --git a/content/02.introduction.md b/content/02.introduction.md
index ef32a5a..e353df6 100644
--- a/content/02.introduction.md
+++ b/content/02.introduction.md
@@ -1,11 +1,22 @@
## Introduction

Manuscripts have existed for thousands of years, but scientific journals have been published for only about 350 years [@isbn:0810808447].
External peer review, which is used by many journals, is even more recent, having existed for less than 100 years [@doi:10/d26d8b].
Most manuscripts are written by humans or teams of humans working together to describe new advances, summarize existing literature, or argue for changes in the status quo.
However, scholarly writing is a time-consuming process in which the results of a study are presented using a specific style and format.
Academics can sometimes be long-winded in getting to key points, making their writing more impenetrable to their audience [@doi:10.1038/d41586-018-02404-4].

Recent advances in computing capabilities and the widespread availability of text, images, and other data on the internet have laid the foundation for artificial intelligence (AI) models with billions of parameters.
Large language models, in particular, are opening the floodgates to new technologies with the capability to transform how society operates [@arxiv:2102.02503].
OpenAI's models, for instance, have been trained on vast amounts of data and can generate human-like text [@arxiv:2005.14165].

@@ -14,6 +25,12 @@ The most well-known of these models is the Generative Pre-trained Transformer 3
Scientists are already using these tools to improve scientific writing [@doi:10.1038/d41586-022-03479-w].
This technology has the potential to revolutionize how scientists write and revise scholarly manuscripts, saving time and effort and enabling researchers to focus on higher-level tasks such as data analysis and interpretation.

We present a novel AI-assisted revision tool that envisions a future where authors collaborate with large language models in the writing of their manuscripts.
This workflow builds on the Manubot infrastructure for scholarly publishing [@doi:10.1371/journal.pcbi.1007128], a platform designed to enable both individual and large-scale collaborative projects [@doi:10.1098/rsif.2017.0387; @pmid:34545336].
Our workflow involves parsing the manuscript, utilizing a large language model with section-specific prompts for revision, and then generating a set of suggested changes to be integrated into the main document.

@@ -21,3 +38,4 @@ These changes are presented to the user through the GitHub interface for review.
To evaluate our workflow, we conducted a case study with three Manubot-authored manuscripts that included sections of varying complexity.
Our findings indicate that, in most cases, the models were able to maintain the original meaning of the text, improve the writing style, and even interpret mathematical expressions.
Our AI-assisted writing workflow can be incorporated into any Manubot manuscript, and we anticipate it will help authors communicate their work more effectively.

diff --git a/content/03.methods.md b/content/03.methods.md
index 282fe6c..b29f8b0 100644
--- a/content/03.methods.md
+++ b/content/03.methods.md
@@ -11,18 +11,35 @@ The prompt for the Methods section includes the formatting of equations with ide
All sections' prompts include these instructions: *"the text grammar is correct, spelling errors are fixed, and the text has a clear sentence structure"*, although these are only shown for abstracts.
](images/figure_1.svg "AI-based revision applied on a Manubot manuscript"){#fig:ai_revision width="85%"}

We implemented an AI-based revision infrastructure in Manubot [@doi:10.1371/journal.pcbi.1007128], a tool for collaborative writing of scientific manuscripts.
Manubot integrates with popular version control platforms such as GitHub, allowing authors to easily track changes and collaborate on writing in real time.
Furthermore, Manubot automates the generation of formatted manuscript outputs (such as HTML, PDF, or DOCX; Figure {@fig:ai_revision}a shows the HTML output).
Built on this modern and open paradigm, our AI-based revision software was developed using GitHub Actions, which allows the user to easily trigger an automated revision task on the entire manuscript or specific sections of it.

When the user triggers the action, the manuscript is parsed by section and then by paragraph (Figure {@fig:ai_revision}b) and passed to the language model along with a set of custom prompts.
The model then returns a revised version of the text.
Our workflow then uses the GitHub API to generate a new pull request, allowing the user to review and modify the output before merging the changes into the manuscript.
This workflow attributes text to either the human user or the AI language model, which may be important in light of potential future legal decisions that alter the copyright landscape around the outputs of generative models.

We used the [OpenAI API](https://openai.com/api/) for access to these models.
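To make this request flow concrete, here is a minimal sketch of how one paragraph could be paired with a section-specific prompt and sent to the completion endpoint. It assumes the pre-1.0 `openai` Python package that was current when this patch was generated; the function name, prompt handling, and `max_tokens` value are illustrative, not the actual manubot-ai-editor API.

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def revise_paragraph(paragraph: str, prompt: str, model: str = "text-davinci-003") -> str:
    """Send one paragraph plus its section-specific prompt to the model."""
    response = openai.Completion.create(
        model=model,
        prompt=f"{prompt}\n\n{paragraph}",
        temperature=0.5,  # the workflow's default, per the Methods text
        max_tokens=1024,  # placeholder; the tool estimates this per paragraph
    )
    return response.choices[0].text.strip()

# The real workflow applies this to every paragraph and then opens a pull
# request through the GitHub API so the author can review the suggestions.
```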
Since this API incurs a cost with each run that depends on manuscript length, we implemented a workflow in GitHub Actions that can be manually triggered by the user.
Our implementation lets users tune costs to their needs by selecting specific sections to be revised instead of the entire manuscript.

@@ -30,13 +47,25 @@ Additionally, several model parameters can be adjusted to tune costs even furthe
For instance, using Davinci models (the most complex and capable ones), the cost per run is under $0.50 for most manuscripts.

### Implementation details

Our tools comprise Python scripts that perform the AI-based revision ([https://github.com/greenelab/manubot-ai-editor](https://github.com/greenelab/manubot-ai-editor)) and a GitHub Actions workflow integrated with Manubot.
To run the workflow, the user must specify the branch that will be revised, select the files/sections of the manuscript (optional), specify the language model to use (`text-davinci-003` by default), and provide the output branch name.
For more advanced users, it is also possible to change most of the tool's behavior or the language model parameters.

When the workflow is triggered, it downloads the manuscript by cloning the specified branch.
It revises all of the manuscript files, or only a subset if the user specifies one.
Next, each paragraph in the file is read and submitted to the OpenAI API for revision.

@@ -46,6 +75,12 @@ If the error cannot be handled or the maximum number of retries is reached, the
This allows the user to debug the problem and attempt to fix it if desired.

As shown in Figure {@fig:ai_revision}b, each API request comprises a prompt (the instructions given to the model) and the paragraph to be revised.
The prompt uses the manuscript title and keywords, so both must be accurate to obtain the best revision outcomes.
The other key input for processing a paragraph is its section.

@@ -56,13 +91,25 @@ Therefore, we designed section-specific prompts, which we found led to the most
Figure and table captions, as well as paragraphs that contain only one or two sentences and fewer than sixty words, are not processed and are copied directly to the output file.

The section of a paragraph is automatically inferred from the file name using a simple strategy, such as checking whether "introduction" or "methods" is part of the file name.
If the tool fails to infer a section from the file name, the user can still specify which section the file belongs to.
The section can be a standard one (abstract, introduction, results, methods, or discussion), for which a specific prompt is used (Figure {@fig:ai_revision}b), or a non-standard one, for which a default prompt instructs the model to perform a basic revision (minimizing the use of jargon, ensuring the text grammar is correct, fixing spelling errors, and making sure the text has a clear sentence structure).

### Properties of language models

Our AI-based revision workflow uses [text completion](https://beta.openai.com/docs/guides/completion) to process each paragraph.
We tested our tool using Davinci and Curie models, including `text-davinci-003`, `text-davinci-edit-001`, and `text-curie-001`.
Davinci models are the most powerful GPT-3 models, whereas Curie models are less capable but faster and less expensive.
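As a concrete sketch of the paragraph handling described under "Implementation details" above, the snippet below infers a section from a file name, flags very short paragraphs to be copied verbatim, and assembles a section-aware prompt from the manuscript metadata. All names are hypothetical helpers that only gesture at the logic in manubot-ai-editor; only the quoted instruction text comes from the actual prompts.

```python
# Hypothetical helpers; the real logic lives in manubot-ai-editor.
STANDARD_SECTIONS = ("abstract", "introduction", "results", "methods", "discussion")

def infer_section(filename: str) -> str | None:
    """Guess the section from the file name, e.g. '02.introduction.md'."""
    name = filename.lower()
    return next((s for s in STANDARD_SECTIONS if s in name), None)

def is_copied_verbatim(paragraph: str, num_sentences: int) -> bool:
    """Paragraphs with one or two sentences and fewer than sixty words are not revised."""
    return num_sentences <= 2 and len(paragraph.split()) < 60

def build_prompt(section: str, title: str, keywords: list[str]) -> str:
    """Inject the manuscript title and keywords into a section-aware instruction."""
    return (
        f"Revise the following paragraph from the {section} section of a "
        f"scientific manuscript titled '{title}' (keywords: {', '.join(keywords)}) "
        "so that the text grammar is correct, spelling errors are fixed, "
        "and the text has a clear sentence structure."
    )
```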
@@ -70,6 +117,12 @@ We mainly focused on the completion endpoint, as the edits endpoint is currently
All models can be configured using different parameters (see [OpenAI - API Reference](https://beta.openai.com/docs/api-reference/completions)), and the most important ones can be easily adjusted using our tool.

Language models for text completion have a context length that indicates the maximum number of tokens they can process (tokens are common character sequences in text).
This limit includes the size of the prompt and the paragraph, as well as the maximum number of tokens to generate for the completion (parameter `max_tokens`).
For instance, the context length is 4,000 tokens for Davinci models and 2,048 for Curie models (see [OpenAI - Models overview](https://beta.openai.com/docs/models/overview)).

@@ -83,6 +136,12 @@ The tool automatically adjusts this parameter and performs the request again if
The user can also force the tool to either use a fixed value of `max_tokens` for all paragraphs, or change the fraction of maximum tokens based on the estimated paragraph size (two by default).

The language models used are stochastic, meaning they can generate a different revision of the same input paragraph each time.
This behavior can be adjusted using the "sampling temperature" or "nucleus sampling" parameters (we use `temperature=0.5` by default).
Although we selected default values that worked well across multiple manuscripts, these parameters can be changed to make the model more deterministic.

@@ -92,9 +151,16 @@ Additionally, our workflow allows the user to process either the entire manuscri
This provides more cost-effective control when focusing on a single piece of text: the user can run the tool several times and pick the preferred revision.

### Installation and use

We have contributed our workflow ([https://github.com/manubot/rootstock/pull/484](https://github.com/manubot/rootstock/pull/484)) to the standard Manubot template manuscript, which is called rootstock and is available at [https://github.com/manubot/rootstock](https://github.com/manubot/rootstock).
Users who wish to use the workflow only need to follow the standard procedures to install Manubot.
-The section "AI-assisted authoring", in the file `USAGE.md` of the rootstock repository, explains how to enable the tool.
+The section "AI-assisted authoring", in the file `USAGE.md` of the rootstock repository, explains how to enable the tool.
+After that, the workflow (named `ai-revision`) will be available and ready to use under the Actions tab of the user's manuscript repository.

diff --git a/content/04.results.md b/content/04.results.md
index f2637c9..a286033 100644
--- a/content/04.results.md
+++ b/content/04.results.md
@@ -2,6 +2,11 @@
### Evaluation setup

We evaluated our AI-assisted revision workflow using three GPT-3 models from OpenAI: `text-davinci-003`, `text-davinci-edit-001`, and `text-curie-001`.
The first two are based on the most capable Davinci models (see [OpenAI - GPT-3 models](https://beta.openai.com/docs/models/gpt-3)).
Whereas `text-davinci-003` is a production-ready model for the completion endpoint, `text-davinci-edit-001` is used for the edits endpoint and is still in beta.

@@ -9,6 +14,7 @@ The latter provides a more natural interface for revising manuscripts, as it tak
Model `text-curie-001` is faster and cheaper than Davinci models, and is described as "very capable" by its authors (see [OpenAI - GPT-3 models](https://beta.openai.com/docs/models/gpt-3)).
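Before turning to the manuscripts themselves, a brief illustrative aside on the `max_tokens` heuristic described in the Methods above: the completion budget is derived from the estimated paragraph size times a fraction (two by default). The helper name and the word-to-token ratio below are our assumptions, not the tool's actual code.

```python
def estimate_max_tokens(paragraph: str, fraction: float = 2.0) -> int:
    """Estimate the completion token budget from the paragraph size."""
    # Rough rule of thumb: one token is about 0.75 words of English text.
    approx_paragraph_tokens = int(len(paragraph.split()) / 0.75)
    return int(approx_paragraph_tokens * fraction)

# If the API reports that the model's context length (e.g., 4,000 tokens
# for Davinci models) was exceeded, the tool lowers this value and retries.
```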
| Manuscript ID | Title | Keywords |
|:-------|:----------------------|:----------|
| [CCC](https://github.com/greenelab/ccc-manuscript) | An efficient not-only-linear correlation coefficient based on machine learning | correlation coefficient, nonlinear relationships, gene expression |

@@ -18,6 +24,11 @@ Table: **Manuscripts used to evaluate the AI-based revision workflow.**
The title and keywords of a manuscript are used in prompts for revising paragraphs.
IDs are used in the text to refer to the manuscripts, and they link to their GitHub repositories. {#tbl:manuscripts}

Assessing the performance of an automated revision tool is not straightforward, since any review of a revision is necessarily subjective.
To mitigate this, we used three manuscripts of our own authorship (Table @tbl:manuscripts): the Clustermatch Correlation Coefficient (CCC) [@doi:10.1101/2022.06.15.496326], PhenoPLIER [@doi:10.1101/2021.07.05.450786], and Manubot-AI (this manuscript).
CCC is a new correlation coefficient evaluated on transcriptomic data, while PhenoPLIER is a framework that comprises three different methods applied in the field of genetic studies.

@@ -30,6 +41,12 @@ Using these manuscripts, we tested and improved our prompts.
Our findings are reported below.

We enabled the Manubot AI revision workflow in the GitHub repositories of the three manuscripts (CCC: `https://github.com/greenelab/ccc-manuscript`, PhenoPLIER: `https://github.com/greenelab/phenoplier_manuscript`, Manubot-AI: `https://github.com/greenelab/manubot-gpt-manuscript`).
This added the "ai-revision" workflow to the "Actions" tab of each repository.
We triggered the workflow manually and used the three language models described above to produce one pull request (PR) per manuscript and model.

@@ -39,6 +56,7 @@ The PRs show the differences between the original text and the AI-based revision
Below, we discuss our findings based on these PRs across different sections of the manuscripts.

### Performance of language models

We found that Davinci models outperformed the Curie model across all manuscripts.

@@ -47,23 +65,40 @@ However, the PRs show that the model was not able to produce acceptable revision
Most of its suggestions were not coherent with the original text in any of the sections.

We found that the quality of the revisions produced by the `text-davinci-edit-001` model (edits endpoint) was subjectively inferior to that of `text-davinci-003` (completion endpoint).
This model either did not produce a revision (such as for abstracts) or suggested changes that were minimal or did not improve the original text.
For example, in paragraphs from the introduction, it failed to keep references to other scientific articles in CCC, and in PhenoPLIER it did not produce a meaningful revision.
This might be because the edits endpoint is still in beta.

The `text-davinci-003` model produced the best results for all manuscripts and across the different sections.
Since both `text-davinci-003` and `text-davinci-edit-001` are based on the same models, we only report the results of `text-davinci-003` below.

### Revision of different sections

We inspected the PRs generated by the AI-based workflow and found interesting changes suggested by the tool across different sections of the manuscripts.
These are our subjective assessments of the quality of the revisions, and we encourage the reader to inspect the PRs for each manuscript and model to see the full diffs and draw their own conclusions.
These PRs are available in the manuscripts' GitHub repositories and are also included as diff files in Supplementary Files 1 (CCC), 2 (PhenoPLIER), and 3 (Manubot-AI).

We present the differences between the original text and the revisions by the tool in a `diff` format (obtained from GitHub).
Line numbers are included to show the length differences.
When applicable, single words are underlined and highlighted in colors to show the differences within a single sentence more clearly.

@@ -71,6 +106,7 @@ Red indicates words removed by the tool, green indicates words added, and no und
The full diffs can be seen by inspecting the PRs for each manuscript and model, and then clicking on the "Files changed" tab.

#### Abstract

![
@@ -78,6 +114,11 @@ The full diffs can be seen by inspecting the PRs for each manuscript and model,
Original text is on the left and suggested revision on the right.
](images/diffs/abstract/ccc-abstract.svg "Diffs - CCC abstract"){#fig:abstract:ccc width="100%"}

We applied the AI-based revision workflow to the CCC abstract (Figure @fig:abstract:ccc).
The tool completely rewrote the text, leaving only the last sentence mostly unchanged.
The text was significantly shortened, with longer sentences than the original ones, which could make the abstract slightly harder to read.

@@ -86,6 +127,7 @@ It also removed details about the method (line 5), and focused on the aims and r
The main concepts were still present in the revised text.

The revised text for the abstract of PhenoPLIER was significantly shortened (from 10 sentences in the original to only 3 in the revised version).
However, in this case, important concepts (such as GWAS, TWAS, and CRISPR) and a proper amount of background information were missing, producing a less informative abstract.

#### Introduction

![
@@ -97,6 +139,11 @@
Original text is on the left and suggested revision on the right.
](images/diffs/introduction/ccc-paragraph-01.svg "Diffs - CCC introduction paragraph 01"){#fig:intro:ccc width="100%"}

The tool significantly revised the Introduction section of CCC (Figure @fig:intro:ccc), producing a more concise and clear introductory paragraph.
The revised first sentence concisely incorporated ideas from the original two sentences, introducing the concept of "large datasets" and the opportunities for scientific exploration.
The model generated a shorter second sentence introducing the "need for efficient tools" to find "multiple relationships" in these datasets.

@@ -105,6 +152,12 @@ All references to scientific literature were kept in the correct Manubot format,
The rest of the sentences in this section were also correctly revised and could be incorporated into the manuscript with minor or no further changes.

We also observed a high-quality revision of the introduction of PhenoPLIER.
However, the model failed to keep the format of citations in one paragraph.
Additionally, the model did not converge to a revised text for the last paragraph, and our tool left an error message as an HTML comment at the top: `The AI model returned an empty string`.

@@ -112,6 +165,7 @@ Debugging the prompts revealed this issue, which could be related to the complex
However, rerunning the automated revision should solve this, as the model is stochastic.
#### Results

![
@@ -119,6 +173,11 @@ However, rerunning the automated revision should solve this as the model is stoc
Original text is on the left and suggested revision on the right.
](images/diffs/results/ccc-paragraph-01.svg "Diffs - CCC results paragraph 01"){#fig:results:ccc width="100%"}

We tested the tool on a paragraph of the Results section of CCC (Figure @fig:results:ccc).
That paragraph describes Figure 1 of the CCC manuscript [@doi:10.1101/2022.06.15.496326], which shows four different datasets with two variables each, and different relationships or patterns named random/independent, non-coexistence, quadratic, and two-lines.
In addition to having fewer, slightly longer sentences, the revised paragraph consistently uses the past tense, whereas the original one shifts tenses.

@@ -131,18 +190,30 @@ We found it remarkable that the model rewrote some of the concepts in the origin
The model also produced high-quality revisions for several other paragraphs that would need only minor changes.

Other paragraphs in CCC, however, needed more changes before being ready to be incorporated into the manuscript.
For instance, for some paragraphs, the model generated revised text that is shorter, more direct, and clearer.
However, important details were removed, and some sentences changed in meaning.
To address this, we could accept the simplified sentence structure but add back the missing details.

![
**A paragraph in the Results section of PhenoPLIER.**
Original text is on the left and suggested revision on the right.
](images/diffs/results/phenoplier-paragraph-01.svg "Diffs - PhenoPLIER results paragraph 01"){#fig:results:phenoplier width="100%"}

When applied to the PhenoPLIER manuscript, the model produced high-quality revisions for most paragraphs, while preserving citations and references to figures, tables, and other sections of the manuscript in the Manubot/Markdown format.
In some cases, important details were missing, but they could easily be added back while preserving the improved sentence structure of the revised version.
In other cases, the model's output demonstrated the limitations of revising one paragraph at a time without considering the rest of the text.

@@ -155,6 +226,7 @@ For example, it included the idea of "gene co-expression" analysis (a keyword) t
This was a poor model-based revision, indicating that the original paragraph may be too short or disconnected from the rest and could be merged with the next one (which describes follow-up and related experiments).

#### Discussion

In both the CCC and PhenoPLIER manuscripts, revisions to the Discussion sections appeared to be of high quality.

@@ -166,12 +238,18 @@ Revisions for some paragraphs introduced minor mistakes that a human author coul
Original text is on the left and suggested revision on the right.
](images/diffs/discussion/ccc-paragraph-01.svg "Diffs - CCC discussion paragraph 01"){#fig:discussion:ccc width="100%"}

One paragraph of CCC discusses how not-only-linear correlation coefficients could potentially impact genetic studies of complex traits (Figure @fig:discussion:ccc).
Although some minor changes could still be made, we believe the revised text reads better than the original.
It is also interesting how the model understood the format of citations and built more complex structures from it.
For instance, the two articles referenced in lines 2 and 3 of the original text were correctly merged into a single citation block, separated with ";", in line 2 of the revised text.
#### Methods

Prompts for the Methods section were the most challenging to design, especially when the sections included equations.

@@ -182,21 +260,33 @@ The prompt for Methods (Figure @fig:ai_revision) is more focused in keeping the
Original text is on the left and suggested revision on the right.
](images/diffs/methods/phenoplier-paragraph-01.svg "Diffs - PhenoPLIER methods paragraph 01"){#fig:methods:phenoplier width="100%"}

We revised a paragraph in PhenoPLIER that contained two numbered equations (Figure @fig:methods:phenoplier).
The model made very few changes, and all of the equations, citations, and most of the original text were preserved.
However, we found it remarkable that the model identified an incorrect reference to a mathematical symbol (line 8) and fixed it in the revision (line 7).
Indeed, the equation with the univariate model used by PrediXcan (lines 4-6 in the original) includes the *true* effect size $\gamma_l$ (`\gamma_l`) instead of the *estimated* one $\hat{\gamma}_l$ (`\hat{\gamma}_l`).

In PhenoPLIER, we found one large paragraph with several equations that the model failed to revise, although it performed relatively well in revising the rest of the section.
In CCC, the revision of this section was good overall, with some minor and easy-to-fix issues, as in the other sections.

We also observed issues arising from revising one paragraph at a time without context.
For instance, in PhenoPLIER, one of the first paragraphs mentions the linear models used by S-PrediXcan and S-MultiXcan without providing any equations or details.
These were presented in the following paragraphs, but since the model had not encountered them yet, it opted to add those equations immediately (in the correct Manubot/Markdown format).

![
**A paragraph in the Methods section of ManubotAI.**
Original text is on the left and suggested revision on the right.
@@ -204,7 +294,13 @@ The revision (right) contains a repeated set of sentences at the top that we rem
](images/diffs/methods/manubotai-paragraph-01.svg "Diffs - ManubotAI methods paragraph 01"){#fig:methods:manubotai width="100%"}

When revising the Methods section of Manubot-AI (this manuscript), the model in some cases added novel sentences with incorrect information.
For instance, for one paragraph, it added a formula (in the correct Manubot format) that presumably predicts the cost of a revision run.
In another paragraph (Figure @fig:methods:manubotai), it added new sentences stating that the model was *"trained on a corpus of scientific papers from the same field as the manuscript"* and that its suggested revisions resulted in a *"modified version of the manuscript that is ready for submission"*.
Although these are important future directions, neither accurately describes the present work.

diff --git a/content/05.conclusions.md b/content/05.conclusions.md
index 413fba5..a33cce4 100644
--- a/content/05.conclusions.md
+++ b/content/05.conclusions.md
@@ -1,5 +1,10 @@
## Conclusions

We integrated AI-based revision models into the Manubot publishing platform.
Academic papers can be time-consuming to write and challenging to read, so we sought to use technology to help researchers communicate their findings to the community.
Our AI-based revision workflow uses a prompt generator that creates manuscript- and section-specific instructions for the language model.
@@ -11,6 +16,12 @@ Although the evaluation of the revision tool is subjective, we found that most p
The AI model also highlighted certain paragraphs that were difficult to revise, which could be challenging for human readers too.

We designed section-specific prompts to guide the revision of text using GPT-3.
Surprisingly, in one Methods section, the model detected an error that human authors had overlooked in a reference to a symbol in an equation.
However, abstracts were more challenging for the model to revise, and revisions often removed background information about the research problem.

@@ -24,6 +35,12 @@ Despite these limitations, we found that models captured the main ideas and gene
It is important to note, however, that our assessment of performance in case studies was necessarily subjective, as there could be writing styles that are not widely shared across researchers.

The use of AI-assisted tools for scientific authoring is controversial [@doi:10.1038/d41586-023-00056-7; @doi:10.1038/d41586-023-00107-z].
Questions arise concerning the originality and ownership of texts generated by these models.
For example, the *Nature* journal has established that any use of these models in scientific writing must be documented [@doi:10.1038/d41586-023-00191-1], and the International Conference on Machine Learning (ICML) has prohibited the submission of *"papers that include text generated from a large-scale language model (LLM)"* [@url:https://icml.cc/Conferences/2023/llm-policy], although editing tools for grammar and spelling correction are allowed.

@@ -34,3 +51,4 @@ Our work lays the foundation for a future in which humans and machines construct
Scientific articles need to adhere to a certain style, which can make the writing time-consuming and require a significant amount of effort to think about *how* to communicate a result or finding that has already been obtained.
As machines become increasingly capable of improving scholarly text, humans can focus more on *what* to communicate to others, rather than on *how* to write it.
This could lead to a more equitable and productive future for research, where scientists are limited only by their ideas and their ability to conduct experiments to uncover the underlying organizing principles of ourselves and our environment.