I have been tasked with fine-tuning an OpenAI model using our customer data, which primarily originates from a Drupal website. The data includes various types of information, such as news articles, blog posts, service descriptions, and more.
Extracting the data is relatively straightforward, but preparing it for fine-tuning is proving to be more challenging. OpenAI expects the data to be in the following format:
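```json
{"messages": [{"role": "system", "content": "You are a customer support assistant."}, {"role": "user", "content": "How do I contact support?"}, {"role": "assistant", "content": "You can reach our support team through the contact form on our website."}]}
```

(One JSON object per line of a JSONL file, in the current chat-model fine-tuning format; the content above is only an illustration of the structure.)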
Are there any tools I can use to transform the data into this format? I would particularly prefer an Elixir-based tool, if one exists. However, I’m open to suggestions for other tools or services that can help with this task.
I’ve considered using the OpenAI API to generate questions and answers based on the provided data, but I’m concerned this approach could become quite expensive.
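If it helps to see what I mean, here's a rough sketch of that generation step (using the Req HTTP client; the model name and prompt are placeholders, and I'd be paying one completion per source document):

```elixir
# Sketch only: ask the model to produce one Q&A pair per source document,
# then turn the pair into a fine-tuning JSONL line. Model and prompt are
# placeholders; error handling is omitted.
defmodule QaGenerator do
  @url "https://api.openai.com/v1/chat/completions"

  def generate_pair(document_text, api_key) do
    Req.post!(@url,
      auth: {:bearer, api_key},
      json: %{
        model: "gpt-4o-mini",
        messages: [
          %{role: "system",
            content:
              "Write one question a customer might ask about the given text " <>
                "and answer it. Reply as JSON: {\"question\": ..., \"answer\": ...}"},
          %{role: "user", content: document_text}
        ]
      }
    ).body["choices"]
    |> hd()
    |> get_in(["message", "content"])
    |> Jason.decode!()
  end

  # One training example per line, in the chat fine-tuning format.
  def to_jsonl_line(%{"question" => q, "answer" => a}) do
    Jason.encode!(%{
      messages: [
        %{role: "user", content: q},
        %{role: "assistant", content: a}
      ]
    })
  end
end
```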
Additionally, the data will likely need some preprocessing and normalization, as much of it is wrapped in HTML.
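For the HTML part, I'm looking at something like this with Floki (the filtered tags are just a first guess):

```elixir
# Sketch: normalize a Drupal body field from HTML to plain text with Floki.
# The filtered-out tags are a starting point, not an exhaustive list.
defmodule Preprocess do
  def html_to_text(html) do
    {:ok, doc} = Floki.parse_document(html)

    doc
    |> Floki.filter_out("script")    # drop non-content elements
    |> Floki.filter_out("style")
    |> Floki.text(sep: " ")          # flatten what's left to text
    |> String.replace(~r/\s+/, " ")  # collapse whitespace
    |> String.trim()
  end
end
```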
I’m new to this field and would appreciate any advice or recommendations on how to approach this problem.
Can you provide some context around what kind of customer data you have and examples of what you’d like the model to do?
It sounds like this may be a better fit for RAG (retrieval-augmented generation), where you give the model a way to grab information it thinks is relevant by providing it a function to do so, and use a vector DB to find the relevant documents, data, etc.
If you want to go with fine-tuning, you are going to have to do a lot of work to get the loose data into a question → completion format. If you just want to provide the model with the data required to answer questions, RAG might be a better fit.
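Very roughly, the retrieval half looks like this, assuming you embed your documents ahead of time (sketch using Req and the OpenAI embeddings endpoint; the embedding model is just one option):

```elixir
# Rough sketch of retrieval: embed the question, rank pre-embedded docs by
# cosine similarity, keep the top k. Each doc is assumed to be {text, embedding}.
defmodule Retriever do
  def embed(text, api_key) do
    Req.post!("https://api.openai.com/v1/embeddings",
      auth: {:bearer, api_key},
      json: %{model: "text-embedding-3-small", input: text}
    ).body["data"]
    |> hd()
    |> Map.fetch!("embedding")
  end

  def top_k(question, docs, api_key, k \\ 3) do
    q = embed(question, api_key)

    docs
    |> Enum.sort_by(fn {_text, emb} -> cosine(q, emb) end, :desc)
    |> Enum.take(k)
  end

  defp cosine(a, b) do
    dot = Enum.zip_reduce(a, b, 0.0, fn x, y, acc -> acc + x * y end)
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, &(&1 * &1 + &2))) end
    dot / (norm.(a) * norm.(b))
  end
end
```

Whatever comes back in the top k is what you'd paste into the prompt as context.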
Our customer is an energy provider. They sell liquid fuels, offer electric chargers, and provide electricity packages to their customers.
They have implemented a chatbot integrated with OpenAI to reduce the workload on customer support. Their clients can ask questions about contracts and the services provided, for example: “What is the closest gas station to me with an electric charger that has a Type 2 connector?”
All this information is stored in a database and modeled as entities.
So far, we have created a separate entity storage system to hold structured, categorized datasets that are maintained by the site editors.
Each time a user asks the chatbot a question, we first categorize the query and then select the datasets associated with that category. These datasets are sent to OpenAI as context, which helps the model generate a more accurate response for the user.
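In code, the flow is roughly this (the categorizer and dataset lookup are stubs for what actually lives in our entity storage; the model name is a placeholder):

```elixir
# Simplified version of the current flow: categorize the question, load the
# matching datasets, and inject them as context into the chat request.
defmodule ChatbotContext do
  def answer(question, api_key) do
    context =
      question
      |> categorize()
      |> load_datasets()
      |> Enum.map_join("\n---\n", & &1.body)

    Req.post!("https://api.openai.com/v1/chat/completions",
      auth: {:bearer, api_key},
      json: %{
        model: "gpt-4o-mini",
        messages: [
          %{role: "system",
            content: "Answer using only the context below.\n\n" <> context},
          %{role: "user", content: question}
        ]
      }
    ).body["choices"]
    |> hd()
    |> get_in(["message", "content"])
  end

  # Stubs standing in for the real categorizer and entity storage lookup.
  defp categorize(_question), do: "charging"
  defp load_datasets(_category), do: [%{body: "station and connector data"}]
end
```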
Our goal is to leverage all the data stored in the system, structure it properly, and fine-tune the model with it.
So far, I have manually fine-tuned the model and observed improvements in the generated answers, even without sending any additional context.
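In case the mechanics are useful to anyone: once the JSONL file is uploaded, creating the job is a single call (the file ID and model name below are placeholders):

```elixir
# Sketch: create the fine-tuning job. The JSONL is uploaded beforehand via
# POST /v1/files with purpose "fine-tune"; "file-abc123" stands in for the
# file ID that upload returns.
api_key = System.fetch_env!("OPENAI_API_KEY")

Req.post!("https://api.openai.com/v1/fine_tuning/jobs",
  auth: {:bearer, api_key},
  json: %{training_file: "file-abc123", model: "gpt-4o-mini-2024-07-18"}
).body
```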