Finding the Perfect Match: A Guide to Testing Different AI and LLM Models for Your Dataset and Purpose

Artificial Intelligence (AI) and Large Language Model (LLM) technologies have revolutionized numerous industries, from natural language processing to computer vision. However, with the myriad of AI and LLM models available today, choosing the best one for your specific dataset and purpose can be a daunting task.

As detailed earlier, we’re using AI generative technologies to identify potential vectors of attack, ensuring you can trust the information you encounter online. In this blog post, we’ll explore some of our thoughts as we tested different AI and LLM models to identify the perfect fit for our needs.

Step 1: Understand Your Dataset and Purpose

Before diving into testing different AI and LLM models, it’s crucial to thoroughly understand your dataset and purpose. Ask yourself the following questions:

  • What is the nature of your dataset? (Text, images, tabular data, etc.) In our project, we’re interested in examining the trust score (credibility, relevance, bias, perspective, purpose, and sincerity) of information you encounter in your browser or app. As such, this could include text, video, images, animation, etc. We needed to focus on one form of text (emails) for our initial development. Other forms of text (Twitter, Facebook, webpages) would all be treated differently.
  • What specific tasks do you want the AI/LLM model to perform? (e.g., text classification, translation, sentiment analysis) We want the model to scan the text and identify an overall trust score and provide explainability of these decisions. Explainability is the concept that a machine learning model and its output can be explained in a way that “makes sense” to a human being at an acceptable level.
  • What are your performance requirements and constraints? (e.g., speed, accuracy, model size) Our initial requirements were that the model could correctly identify whether presented information was real or fake. We also wanted the model to be able to identify whether the information (email text in this first example) was human generated or AI generated.

Understanding these aspects helps you narrow down the list of potential models suitable for your needs.
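One lightweight way to keep those Step 1 answers from getting lost is to record them next to your experiments. The sketch below is our own shorthand for the checklist above; the field names and the 5-second latency default are illustrative assumptions, not project requirements.

```python
from dataclasses import dataclass, field

@dataclass
class ProjectSpec:
    """Answers to the Step 1 questions, kept alongside the experiments.

    Field names are our own shorthand for the checklist above.
    """
    data_type: str                             # e.g. "email text"
    tasks: list = field(default_factory=list)  # classification, explanation, ...
    max_latency_s: float = 5.0                 # assumed responsiveness budget
    needs_explainability: bool = True          # must work with an explainer model

# The spec for the project described in this post:
spec = ProjectSpec(
    data_type="email text",
    tasks=["real-vs-fake classification", "human-vs-AI detection"],
)
```

Writing the spec down this way makes it easy to check each candidate model against the same requirements later.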

Step 2: Selecting Candidate Models

Based on the dataset and purpose, it’s time to select a pool of candidate AI and LLM models for testing. Popular models like GPT-3.5, BERT, XLNet, LLaMA, and Llama 2 are a good starting point, but there are numerous others to explore as well. For our initial development, we tested and used the following.

Llama. This model was considered as it was readily available on the server. We chose Llama 2, the latest model in this line, and fine-tuned it on our dataset of human-generated and AI-generated scam emails. The results weren’t encouraging: Llama tended to hallucinate and make false predictions, wasn’t compatible with explainer models, and didn’t let us control the token generation process as much as we’d have liked. With more time to fine-tune the model, this might become a great option.

Example output from Llama 2 after fine-tuning

GPT-3.5. We chose GPT-3.5 through the API as opposed to GPT-4. GPT-3.5 is faster at generating responses and doesn’t have the hourly prompt restrictions that GPT-4 does: GPT-3.5 typically responds in a few seconds, whereas GPT-4 can take a minute or more to write out larger responses. Due to limitations on the number of tokens and the busy queues, we felt fine-tuning a GPT model would take longer than the hackathon duration. Fine-tuning is the process of training a pre-trained GPT language model on a specific task or domain to improve its performance. The fine-tuning process adjusts the model’s parameters to better fit conversational data, making the chatbot more adept at understanding and responding to user inputs.
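A minimal sketch of how an email could be scored through the OpenAI chat API follows. The prompt wording, the JSON reply format, and the helper names are our assumptions for illustration, not the project’s actual code; the `score_email` call requires an API key and is not run here.

```python
import json

def build_trust_prompt(email_text: str) -> list:
    """Build the chat messages asking the model to classify an email."""
    system = (
        "You are a scam-detection assistant. Classify the email as REAL or "
        "FAKE, say whether it looks HUMAN or AI generated, and explain why. "
        "Reply as JSON with keys: label, origin, rationale."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": email_text},
    ]

def score_email(client, email_text: str) -> dict:
    """Send the prompt to GPT-3.5 and parse the JSON reply.

    `client` is an openai.OpenAI() instance; requires a valid API key.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=build_trust_prompt(email_text),
        temperature=0,  # deterministic output suits classification
    )
    return json.loads(response.choices[0].message.content)
```

Setting `temperature=0` keeps the classification output as stable as possible across repeated calls.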

LSTM + LIME. The cornerstones of our project were always trustworthiness and transparency. Given time constraints in initial development, we couldn’t integrate explainer models on top of LLMs like Llama. Explainer models, also known as interpretability or explainability models, are additional models built on top of LLMs to provide insights into how the LLMs arrive at their predictions. The primary goal of explainer models is to shed light on the black-box nature of complex LLMs, which can be difficult to interpret due to their massive size and numerous hidden layers.

Circling back to our dataset and purpose, we went back to the basics and trained an LSTM model to detect scam emails and added a LIME explainer model on top of it. This pipeline was able to correctly classify emails and also explain the rationale behind its decision.

Example output from LSTM and LIME
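To give a feel for what LIME does under the hood, here is a simplified word-dropout sketch of its core idea: perturb the input, query the classifier, and attribute the change in score to the words that were removed. This is our own toy approximation for illustration, not the `lime` library or our production pipeline.

```python
import random
import re

def lime_style_explain(classify, text, n_samples=200, seed=0):
    """Rank words by how much masking them changes the scam probability.

    `classify` maps a string to a probability in [0, 1]. This is a
    simplified word-dropout approximation of LIME's perturbation idea.
    """
    rng = random.Random(seed)
    words = re.findall(r"\w+", text)
    base = classify(text)
    importance = {w: 0.0 for w in words}
    counts = {w: 0 for w in words}
    for _ in range(n_samples):
        # Randomly keep roughly half the words and re-score the text.
        kept = [w for w in words if rng.random() > 0.5]
        score = classify(" ".join(kept))
        for w in set(words) - set(kept):
            importance[w] += base - score  # score drop when w is removed
            counts[w] += 1
    return sorted(
        ((w, importance[w] / max(counts[w], 1)) for w in set(words)),
        key=lambda kv: -kv[1],
    )
```

Run against a toy classifier that flags the word “prize”, the explainer ranks “prize” as the most influential word, which mirrors the kind of rationale the LIME output above provides.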

Step 3: Preparing Your Dataset

To fairly evaluate the candidate models, you must prepare your dataset for testing. This involves splitting your data into training, validation, and test sets. The training set will be used to train the models, the validation set to fine-tune and optimize hyperparameters, and the test set to assess their final performance. Ensure that the data distribution in each set represents your real-world use case.
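The split described above can be sketched in a few lines. The 80/10/10 ratios and the function name are illustrative defaults, not fixed requirements; stratifying by label is a common refinement left out here for brevity.

```python
import random

def split_dataset(examples, train=0.8, val=0.1, seed=42):
    """Shuffle and split examples into train/validation/test partitions."""
    data = list(examples)
    random.Random(seed).shuffle(data)  # fixed seed keeps splits reproducible
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return (
        data[:n_train],                  # fits the model weights
        data[n_train:n_train + n_val],   # tunes hyperparameters
        data[n_train + n_val:],          # held out for final evaluation
    )
```

Fixing the seed matters: every candidate model should see exactly the same three partitions, or the comparison in Step 4 is unfair.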

In an earlier post, we discussed the quest for trustworthy data. For our training and initial validation we used datasets with phishing materials, spam emails, and fake news from Kaggle. We also used generative AI tools to create a dataset of emails that were examples of phishing attempts, spamming behaviors, and extortion.

Step 4: Model Training and Evaluation

After identifying our purpose, model, and dataset, we started training and evaluating the candidate models.

Training the Candidate Models: This process involves feeding the model with a dataset that aligns with your specific task, be it text classification, translation, sentiment analysis, or any other natural language processing task. The AI LLM will learn patterns, relationships, and representations from the data during training. Keep in mind that training an LLM requires significant computational resources and time. Be sure to optimize hyperparameters, such as learning rate, batch size, and number of epochs, to achieve the best performance.

Evaluating the Candidate Models: Evaluation is a crucial step to ensure your AI LLM performs accurately and meets your requirements. For this, you’ll need a separate dataset called the validation set, distinct from the one used for training. The validation set acts as an unseen data source that helps assess how well your model generalizes to new examples. By comparing the performance of different candidate models, you can identify the one that best suits your needs.
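Comparing candidates is easier when every model is scored the same way. A minimal sketch, assuming string labels and treating “scam” as the positive class (both our own conventions for illustration):

```python
def evaluate(model_name, predictions, labels):
    """Accuracy plus precision/recall for the 'scam' class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == l == "scam" for p, l in pairs)            # true positives
    fp = sum(p == "scam" and l != "scam" for p, l in pairs) # false alarms
    fn = sum(p != "scam" and l == "scam" for p, l in pairs) # missed scams
    correct = sum(p == l for p, l in pairs)
    return {
        "model": model_name,
        "accuracy": correct / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

For scam detection, recall is worth watching separately from accuracy: a model that misses scams can still look accurate if most emails in the validation set are legitimate.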

Fine-Tuning and Iteration: Fine-tuning involves adapting a pre-trained LLM model to your specific task or domain. This process helps the AI model learn task-specific patterns and the nuances present in your dataset, making it more effective. Continue fine-tuning, experimenting with different hyperparameters, and making improvements based on your evaluation results. This will help you gauge their performance and identify potential overfitting or underfitting issues.

Step 5: Compare Results, Select the Best Model, and Iterate

After thorough testing and evaluation, compare the performance of each AI/LLM model. Take into account metrics, training time, computational resources required, and any other relevant factors.

Select the model that best aligns with your requirements and provides the optimal trade-off between performance and efficiency. Keep in mind that the best model may vary depending on your specific use case and dataset.
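One way to make that trade-off concrete is to filter candidates by a hard constraint (here, response latency) and then maximize the metric you care most about. The field names and the 5-second budget are illustrative assumptions:

```python
def select_model(results, max_latency_s=5.0):
    """Pick the most accurate candidate that meets the latency budget.

    `results` is a list of dicts with 'model', 'accuracy', and
    'latency_s' keys; these names and the budget are illustrative.
    """
    eligible = [r for r in results if r["latency_s"] <= max_latency_s]
    if not eligible:
        raise ValueError("no candidate meets the latency budget")
    return max(eligible, key=lambda r: r["accuracy"])
```

Note how a model with the best raw accuracy can still lose if it blows the latency budget, which matches the GPT-3.5 versus GPT-4 trade-off discussed earlier.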

By understanding the nuances of training, carefully evaluating performance, and iterating on your models, you can create powerful and accurate LLMs tailored to your specific natural language processing tasks. Keep in mind that model development is often an iterative process, so don’t be discouraged by initial results.


Testing different AI and LLM models to find the ideal match for your dataset and purpose requires careful planning, rigorous evaluation, and patience. By thinking through the steps outlined in this guide, you can make an informed decision and confidently deploy the most suitable model for your AI applications. This guide outlines some of the decisions we made as we developed and iterated on our project in a short time frame.

With more time, we would investigate and generate new datasets, test and fine-tune new models, and identify opportunities for cross-validation and ensemble methods. Ensemble methods combine the predictions of multiple models to improve overall performance, and they are particularly effective when dealing with diverse datasets and complex tasks.
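The simplest ensemble for a classification task like ours is majority voting across models. A minimal sketch, assuming each model outputs one label per example:

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one ensemble label per example.

    `predictions_per_model` is a list of lists, one inner list per model,
    all the same length and aligned example-by-example.
    """
    ensemble = []
    for labels in zip(*predictions_per_model):
        # Pick the label the most models agreed on for this example.
        ensemble.append(Counter(labels).most_common(1)[0][0])
    return ensemble
```

With an odd number of diverse models, a single model’s hallucinated prediction can be outvoted, which is one reason ensembles appeal for a trust-scoring application.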

Always remember that AI research and advancements are ongoing, so it’s essential to stay up-to-date with the latest developments and continue improving your models over time. Good luck!

If you’d like to stay on top of areas like this, you should be reading my weekly newsletter. You can follow here or on Substack.

Photo by Google DeepMind on Unsplash
