Today, digital content can be hard to evaluate for accuracy. To help, developers are creating AI tools using Language Models (LMs) and Large Language Models (LLMs) to identify trustworthy information. These tools can help protect users from false information. In an earlier post, I indicated that I’m working with a team that is building an AI-powered tool that aims to protect users from financial fraud and social engineering attacks.
I am not an expert in computer science, AI, or machine learning. However, I hold a PhD in Educational Psychology with a focus in cognition and instruction, and have expertise in literacy, education, and technology. Strangely, my work up to this point has been really helpful in preparing me for this project. Even so, as I work with the team, I need to do a lot of homework behind the scenes to better understand how these tools are created, used, and improved.
Before I started hanging out with folks that are actually researching, building, and testing these models, my thinking was that the algorithm (the recipe that makes the learning model work) was the most important piece of the puzzle. Was I ever wrong. (¬‿¬ )
I’m quickly learning that in building, testing, and iterating on an AI tool, both the dataset and the algorithm play critical roles, but they serve different purposes in the development process.
In this post, I’ll outline what I’ve learned about the process we’ll need to take to create a dataset and train an AI tool to detect trustworthiness and potential scams in digital content.
Step 1: Data Collection
For your AI tool to work effectively, begin by collecting a wide range of content from various sources, like websites, social media posts, emails, and so on. Use web scraping tools to obtain a large amount of data from different topics and domains. The quality and diversity of the data are essential, as they impact the tool’s ability to correctly determine trustworthiness.
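To make this concrete, here is a minimal sketch of one piece of a collection pipeline: pulling the visible text out of a fetched HTML page using only Python’s standard library. This is an illustration, not the team’s actual tooling — real pipelines typically use dedicated scraping frameworks, and fetching the pages themselves (requests, rate limiting, robots.txt) is omitted here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

For example, `extract_text("<body><p>Claim: free money!</p></body>")` returns just the readable sentence, with markup and embedded scripts stripped away.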
Step 2: Labeling Data
Annotating data is the most tedious part of the process. Content must be labeled as reliable or unreliable. To guarantee precision, experts in content assessment may be hired. Establishing definitive rules for labeling is crucial for keeping the dataset consistent.
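One way to keep labeling consistent is to check every annotated record against the agreed-upon rules before it enters the dataset. The sketch below assumes a simple two-class scheme (`reliable` / `unreliable`) and a hypothetical record shape with `url`, `text`, `label`, and `annotator` fields — the real schema would come from the team’s labeling guidelines.

```python
# Assumed two-class labeling scheme; the real guidelines might use more
# fine-grained categories (e.g. "satire", "scam", "unverified").
ALLOWED_LABELS = {"reliable", "unreliable"}

def validate_record(record: dict) -> list:
    """Return a list of problems with an annotated record (empty = OK)."""
    problems = []
    for field in ("url", "text", "label", "annotator"):
        if not record.get(field):
            problems.append("missing field: " + field)
    if record.get("label") and record["label"] not in ALLOWED_LABELS:
        problems.append("unknown label: " + repr(record["label"]))
    return problems
```

Running every incoming record through a check like this catches typos and schema drift early, which matters when many annotators work on the same dataset.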
Step 3: Define Trustworthiness Metrics
Establishing criteria for distinguishing reliable and unreliable content is essential. Reputation of the source, evidence backing assertions, fact-checking, and contextual analysis are key to determining trustworthiness. To create well-defined trustworthiness standards, experts from fields such as cybersecurity, finance, and journalism should be consulted.
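A simple way to turn such criteria into something measurable is a weighted score over the individual signals. The signal names and weights below are purely illustrative assumptions — in practice they would be set (and revised) by the domain experts, not hard-coded like this.

```python
# Hypothetical signals and weights -- the real criteria and weights would
# come from cybersecurity, finance, and journalism experts.
WEIGHTS = {
    "source_reputation": 0.4,     # 0.0-1.0, e.g. from a curated source list
    "claims_with_evidence": 0.3,  # fraction of claims backed by evidence
    "fact_check_agreement": 0.2,  # agreement with fact-checking results
    "context_consistency": 0.1,   # contextual analysis score
}

def trust_score(signals: dict) -> float:
    """Weighted average of per-signal scores, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
```

A piece of content scoring 1.0 on every signal would get a trust score of 1.0; a well-reputed source (0.5) with no other supporting signals would score only 0.2, reflecting the heavy weight placed on evidence and fact-checking.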
Step 4: Expert Review
Working with experts is vital in both setting measurable objectives and constructing a reliable dataset. Their input can verify the accuracy of the labels assigned to the content, ensuring the dataset reflects its trustworthiness accurately.
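One standard way to check whether expert labels are actually consistent is to have two annotators label the same items and compute their chance-corrected agreement (Cohen’s kappa). The sketch below is a plain standard-library implementation of that statistic, offered as an illustration of how review quality can be quantified.

```python
def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance.

    1.0 = perfect agreement; 0.0 = no better than chance.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators always chose the same single category
    return (observed - expected) / (1 - expected)
```

A low kappa is a signal that the labeling rules from Step 2 are ambiguous and need tightening before more data is annotated.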
Step 5: Balance the Dataset
For an unbiased AI tool to function correctly, a balanced dataset is necessary. Including both reliable and unreliable sources in this dataset prevents the AI from becoming biased and generating inaccurate outcomes.
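The simplest balancing strategy is to undersample the larger class down to the size of the smaller one. The sketch below does exactly that with the standard library; it is one option among several (oversampling and class-weighted training are common alternatives), shown here only to make the idea concrete.

```python
import random

def balance(records, label_key="label", seed=0):
    """Undersample each class down to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_label = {}
    for record in records:
        by_label.setdefault(record[label_key], []).append(record)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced
```

Given ten “reliable” examples and three “unreliable” ones, this returns six records, three of each class, so neither label dominates training.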
Step 6: Preprocessing
Before training, the data must be converted into a format compatible with the chosen Language Model. Common preprocessing methods to optimize the process include tokenization, encoding, and data augmentation.
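To show what tokenization and encoding mean in practice, here is a deliberately crude word-level version: split text into lowercase tokens, build a vocabulary, and map each token to an integer ID. A real Language Model ships with its own (usually subword) tokenizer; this sketch just illustrates the text-to-numbers step.

```python
import re

def tokenize(text):
    """Lowercase word-level tokenization -- a crude stand-in for the
    subword tokenizer a real Language Model would provide."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(corpus):
    """Map every token seen in the corpus to an integer ID; 0 = unknown."""
    vocab = {"<unk>": 0}
    for text in corpus:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Convert text into the list of token IDs the model consumes."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]
```

Tokens that never appeared during vocabulary building fall back to the `<unk>` ID, which is one reason the diversity of the collected data (Step 1) matters so much.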
Step 7: Train the LLM
Training the Language Model or Large Language Model is a critical step. Developers can use existing models like GPT-3.5 as-is or fine-tune a model on the specific task of identifying trustworthiness. Adequate computational resources and appropriate machine learning techniques are essential for effective training.
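Fine-tuning a real LLM needs GPUs and a framework, but the core training loop — make a prediction, measure the error, nudge the parameters — can be shown at toy scale. The sketch below trains a tiny logistic-regression classifier on bag-of-words features with the standard library. It is a stand-in to illustrate the loop, not a substitute for fine-tuning an actual model.

```python
import math
import re

def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = [0.0] * len(vocab)
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in vocab:
            counts[vocab[token]] += 1.0
    return counts

def train(texts, labels, vocab, epochs=200, lr=0.5):
    """Logistic regression via gradient descent; label 1 = untrustworthy."""
    w = [0.0] * len(vocab)
    b = 0.0
    for _ in range(epochs):
        for text, y in zip(texts, labels):
            x = featurize(text, vocab)
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
            err = p - y                          # gradient of the log loss
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(text, vocab, w, b):
    """Probability that the text is untrustworthy, in [0, 1]."""
    x = featurize(text, vocab)
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

Fine-tuning an LLM follows the same shape — forward pass, loss, gradient update — just with billions of parameters instead of a handful of word weights.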
Step 8: Evaluate the Model
After training the AI tool, we must evaluate its performance on a held-out dataset to find possible issues and accurately measure the tool’s effectiveness. Routine adjustments to the model and its parameters are necessary to maintain and improve the accuracy of the results.
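Plain accuracy can be misleading when one class dominates, so evaluation typically reports precision, recall, and F1 for the class you care about (here, “untrustworthy”). The sketch below computes them from scratch to show what each number means.

```python
def evaluate(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive ('untrustworthy') class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0  # flagged items that were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # bad items that were caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For a scam detector, recall is especially important: a missed scam (false negative) is usually costlier to the user than a legitimate page flagged for review.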
Step 9: Iterate and Improve
Developing an AI tool is a cyclical endeavor. Refining it regularly, incorporating user feedback, and adapting to changing online trends are essential to maintaining its effectiveness.
Building an AI tool to identify trustworthiness and potential scams in online content requires a comprehensive and meticulous approach. By following the steps outlined in this blog post, we have the potential to create a robust dataset and train an AI tool that promotes online safety, protects users from misinformation, and contributes to a more trustworthy digital environment. Ethical considerations, privacy, and compliance with relevant regulations should always be at the forefront of this endeavor, ensuring that the tool is used responsibly for the benefit of all users.
Once again, I’m not an expert in developing, testing, and iterating on AI tools and LLMs. I’m writing this post to share my notes as I research and think through the process. Hopefully this will give you some insight into what is happening behind the scenes as you use these generative AI tools.