Big Data
Data quality
- Provenance – the origin of the data: was this from a provider? Was the data scraped? Measured? If so, how reliable are the sensors?
- Metadata – data about data: documentation about the data format, provenance, and the various features.
- Representativeness – how well does the dataset represent the actual problem?
- Completeness – are there missing entries? Incorrect entries? Does the data suit our task?
- Suitability – “the organization should ensure that the data are suitable for its intended purpose”
- Impact of bias – “the organization should consider the impact of bias on system performance and system fairness and make such adjustments as necessary”
“Bad data” vs. “Good data”
Let’s look at how “bad data” can affect an AI system. With the release of GPT-4o, concerns arose about the model’s vocabulary in certain languages. Tianle Cai, a PhD student at Princeton University, found that of the 100 longest Chinese tokens, only 3 were common enough to appear in everyday conversation. The remaining 97? All common in gambling, pornography, and scamming. Other languages did not exhibit this issue; their long tokens reflect news articles and other literature. OpenAI did not respond publicly to this incident, so we can only speculate on the root cause. A model learns its vocabulary from its training data, which implies that the Chinese training data contained a plethora of inappropriate content.
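The kind of analysis described above can be sketched in a few lines: sort a tokenizer’s vocabulary by token length, then flag the longest tokens that never appear in an everyday word list. The vocabulary and word list below are toy stand-ins, not GPT-4o’s actual data.

```python
# Toy vocabulary mapping tokens to IDs; stands in for a real tokenizer's vocab.
toy_vocab = {
    "the": 0, "data": 1, "quality": 2,
    "supercalifragilistic": 3,              # long, never in everyday use
    "provenance": 4,
    "antidisestablishmentarianism": 5,      # long, never in everyday use
}

# Reference list of words considered common in ordinary text (toy data).
everyday_words = {"the", "data", "quality", "provenance"}

# Longest tokens first, mirroring the GPT-4o vocabulary inspection.
longest = sorted(toy_vocab, key=len, reverse=True)[:3]

# Long tokens absent from the everyday word list are suspicious.
suspicious = [tok for tok in longest if tok not in everyday_words]

print(longest)
print(suspicious)
```

On real models the same idea applies to the actual token table; here the point is simply that unusual long tokens are easy to surface mechanically once you have the vocabulary.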
What impact does this have on ChatGPT? Users were able to jailbreak the AI system, generate inappropriate responses, and generate unrelated “garbage” responses. Most of these issues could have been avoided by analyzing the data provenance. For example, if the data comes from an inappropriate source (in this case, a website), it must be cleaned or excluded from training.
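A minimal sketch of provenance-based filtering, under the assumption that each training record carries the URL it was scraped from: records whose source domain is on a blocklist are dropped before training. The domain names and records are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical blocklist of domains known to host inappropriate content.
BLOCKED_DOMAINS = {"spam-casino.example", "scam-site.example"}

# Hypothetical scraped records, each tagged with its provenance (source URL).
records = [
    {"text": "Weather report for Tuesday", "source": "https://news.example/weather"},
    {"text": "Win big now!!!", "source": "https://spam-casino.example/promo"},
]

def clean(records):
    """Keep only records whose source domain is not blocklisted."""
    return [r for r in records
            if urlparse(r["source"]).hostname not in BLOCKED_DOMAINS]

kept = clean(records)
print(len(kept))
```

This is the simplest form of the idea; in practice provenance checks also cover licensing, scraping method, and sensor reliability, as the bullets above suggest.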
This isn’t to say the ideas and technologies behind big data are inherently harmful. Indeed, they only grow in importance as the value of data rises. But like many other technologies, big data leaped before it looked. Reflecting on ChatGPT, the sensible thing to do would have been to slow down. In this case, OpenAI would have benefited from finding better data, not more data.