Using Mistral for Data Tagging

Source: Generated using DeviantArt’s DreamUp

Author

· Xuzeng He (ORCID: 0009-0005-7317-7426)

Introduction

Data tagging, in simple terms, is the process of assigning labels or tags to your data so that it is easier to retrieve or analyse. For example, if you are dealing with a database of scientific journals, you may want to tag each document with its relevant topics so that users can later find the journals they are interested in with a simple filter, without too much effort. Better still, with the recent surge of Large Language Models (LLMs) such as ChatGPT, you can now use them to tag huge amounts of data, as long as you can deploy these models on your local computer.

In this post, we will show you how to use a popular large language model called Mistral to tag a list of documents (in JSON format) from PubMed whose topics are related to Artificial Intelligence (AI) by inspecting their titles and abstracts.

Installation Guide

In this work, we use Ollama to install and run our LLMs locally and use Langchain to interact with our LLMs in a Python environment.

As a first step, we need to install Ollama from the official website. Once the installation is complete and Ollama is running (typically as an application), open the terminal and run the following command to download your preferred LLM onto your local computer (in this tutorial, we will use mistral as an example):
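ollama pull mistral

Ollama will download the model weights (several gigabytes) and make them available locally.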

As for the installation of Langchain, simply run:
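pip install langchain

Depending on your LangChain version, the Ollama integration may live in the separate langchain-community package, in which case you will also need to run pip install langchain-community.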

Now that the installation is complete, you can safely close the terminal and turn to your favourite Python IDE for the next step. However, make sure Ollama keeps running in the background, as Langchain interacts with the model through Ollama.

Input and Output

Before we move on to the coding part, we first take a look at our input file. As mentioned before, the input file is in JSON format as a list of JSON items, and an example item looks as below:
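A minimal sketch of such an item is shown below; the values are made up for illustration, and the field names are assumed to mirror the output columns used later in this post:

{
  "KEY": "pubmed_0001",
  "TITLE": "Deep learning for automated segmentation of brain MRI",
  "ABSTRACT": "We propose a convolutional neural network that segments ..."
}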

In this tutorial, we will only use the title and abstract as part of our prompt to the LLM, along with the key, which uniquely identifies each document.

We also have to think about what the output file looks like. In this work, the output file is in CSV format and has the following columns:

header = ['KEY', 'TITLE', 'ABSTRACT', 'LLM_OUTPUT', 'related_to_ai', 'topics']

Here, KEY, TITLE and ABSTRACT are properties taken from the input file, and LLM_OUTPUT is the raw response generated by the LLM. related_to_ai and topics are both extracted from LLM_OUTPUT (covered in the next section): related_to_ai should be either Yes or No, while topics holds the specific topics the LLM finds in a document when it is provided with a detailed taxonomy to work with. You can safely ignore topics if you are not interested in such details.

Using Mistral

Now that we have figured out what will be the format of our input and output, we can start working with the LLM that we have previously installed by writing prompts and feeding them into the model using Python. First, we import Langchain and Ollama as below:
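Note that the exact import path depends on your LangChain version; the line below assumes a recent release where community integrations live in langchain_community (older releases used from langchain.llms import Ollama instead):

from langchain_community.llms import Ollama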

After that, we can write the entire prompting process as a function since we need to call it several times when dealing with more than one document:
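A minimal sketch of what such a function can look like is shown below. The function name prompt_llm matches the one used later in this post, but the exact wiring of the original code may differ; PROMPT_TEMPLATE is the prompt template shown in the next block, and tags is the AI taxonomy string discussed later (left empty by default here):

def prompt_llm(title, abstract, tags=''):
    # num_predict and stop constrain the model's output; see the explanation below
    llm = Ollama(model="mistral", num_predict=128, stop=["}"])
    # fill the prompt template with this document's title and abstract
    prompt = PROMPT_TEMPLATE.format(title=title, abstract=abstract, tags=tags)
    response = llm.invoke(prompt)
    # the stop sequence itself is not returned, so restore the closing brace
    return response.strip() + "}"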

where the prompt mentioned above is structured as follows:
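The original prompt is not reproduced verbatim here; the template below is an approximation based on the behaviour described in this post ({title}, {abstract} and {tags} are the placeholders filled in by prompt_llm() above):

PROMPT_TEMPLATE = """You are given the title and abstract of a scientific article.

Title: {title}
Abstract: {abstract}

Decide whether this article is related to Artificial Intelligence (AI).
If it is, also select the most relevant topics from the following taxonomy:
{tags}

Respond ONLY with a JSON object in exactly this format:
{{"related_to_ai": "Yes", "topics": ["5.1", "5.2"]}}

If the article is not related to AI, respond with:
{{"related_to_ai": "No", "topics": []}}"""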

By using this prompt structure, we ask the model to generate its response in a specific JSON format, following the example we provide. This makes the later extraction of related_to_ai and topics much easier.

We also ask the model to provide a list of topic tags if the document or article is found to be related to AI, by feeding the model a detailed AI taxonomy named tags (again, you can safely remove this part if you are not interested in these tags).

You may notice that we use two additional parameters (num_predict and stop) when setting up the LLM, on top of the model parameter. This is partly because a model installed through Ollama can sometimes behave unexpectedly (e.g. a response not matching the given format, or a response repeating the specified format) due to its relatively small parameter count. In such cases, we need to further constrain the model to reduce the chance of receiving responses with format errors.

Here, num_predict = 128 means that we only want the model to produce at most 128 tokens when generating the response, since a longer response usually means it has not matched the required JSON format. We also set a stop token using stop = ["}"], so that the model stops generating once the JSON object is complete and can move on to the next document (since } appears at the end of the JSON format).

You can also tune other parameters to make sure the model produces functionally correct output; see the official documentation provided by the Langchain community.

Now that we’ve defined prompt_llm(), it’s time to tag some data! But before that, we first import some useful libraries and write the headers for the output file:
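A sketch of this setup, assuming the output is written to a file named output.csv (any name works):

import csv
import json
import json5  # install with: pip install json5

header = ['KEY', 'TITLE', 'ABSTRACT', 'LLM_OUTPUT', 'related_to_ai', 'topics']

# open the output CSV file and write the header row
out_file = open('output.csv', 'w', newline='', encoding='utf-8')
writer = csv.writer(out_file)
writer.writerow(header)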

Now we can start tagging the documents in the input file (named pubmed_1000.json) and writing the results into the output file as below:
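A sketch of this loop, assuming the input items use the field names shown earlier; the fallback handling here approximates the format checks described below:

# load the list of documents from the input file
with open('pubmed_1000.json', 'r', encoding='utf-8') as f:
    documents = json.load(f)

for doc in documents:
    llm_output = prompt_llm(doc['TITLE'], doc['ABSTRACT'])
    # default markers, kept if the response cannot be parsed at all
    related_to_ai = 'FORMAT_ERROR'
    topics = 'FORMAT_ERROR'
    try:
        # JSON5 tolerates the comments the model occasionally adds
        parsed = json5.loads(llm_output)
        if 'related_to_ai' in parsed:
            related_to_ai = parsed['related_to_ai']
        if 'topics' in parsed:
            topics = parsed['topics']
    except Exception:
        pass  # leave the FORMAT_ERROR markers in place
    writer.writerow([doc['KEY'], doc['TITLE'], doc['ABSTRACT'],
                     llm_output, related_to_ai, topics])

out_file.close()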

Here we use multiple if conditions so that we can still extract related_to_ai and topics from a partially correct response (e.g. a Yes for related_to_ai but no key-value pair for topics in the generated response). We also use JSON5 instead of regular JSON because the model sometimes generates comments inside the response, and JSON5 can load such items without errors. Finally, if none of these tricks works, we write a special JSON item into the output indicating a format error.

Here’s what the output file looks like when displayed in Excel. The values 5.1 and 5.2 in the topics column are generated from the AI taxonomy we fed into the prompt, where 5.1 refers to “AI Ethics and Society” and 5.2 refers to “Privacy and Security in AI System”. Again, you can safely ignore the topics column if you are not providing the LLM with a detailed taxonomy.

Conclusion

In this post, we provided a complete tutorial on how to use the Mistral model to tag documents related to AI using their titles and abstracts. With further investigation and refinement, we believe that the use of Large Language Models can open up even more exciting opportunities in the future.

About the Author

Data Scientist at Research Graph Foundation