Big Data
Data quality
- Provenance – the origin of the data: was this from a provider? Was the data scraped? Measured? If so, how reliable are the sensors?
- Metadata – data about data: documentation about the data format, provenance, and the various features.
- Representativeness – how well does the dataset represent the actual problem?
- Completeness – are there missing entries? Incorrect entries? Does the data suit our task?
- Suitability – “the organization should ensure that the data are suitable for its intended purpose”
- Impact of bias – “the organization should consider the impact of bias on system performance and system fairness and make such adjustments as necessary”
“Bad data” vs. “Good data”
Let’s look at how “bad data” can affect an AI system. With the release of GPT-4o, concerns arose about the model’s vocabulary in certain languages. Tianle Cai, a PhD student at Princeton University, found that of the 100 longest Chinese tokens, only 3 were common enough to appear in everyday conversation. The remaining 97? All common in gambling, pornography, and scamming. Other languages did not exhibit this issue; their long tokens reflect news articles and other literature. OpenAI did not respond publicly to this incident, so we can only speculate on the root cause. A model learns its vocabulary from its training data, which implies that the Chinese training data contained a plethora of inappropriate content.
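The kind of analysis described above can be sketched in a few lines: sort a tokenizer’s vocabulary by token length, then flag the longest tokens that never appear in an everyday word list. The vocabulary and word list below are toy stand-ins, not GPT-4o’s actual data.

```python
# Toy vocabulary mapping tokens to IDs; stands in for a real tokenizer's vocab.
toy_vocab = {
    "the": 0, "data": 1, "quality": 2,
    "supercalifragilistic": 3,              # long, never in everyday use
    "provenance": 4,
    "antidisestablishmentarianism": 5,      # long, never in everyday use
}

# Reference list of words considered common in ordinary text (toy data).
everyday_words = {"the", "data", "quality", "provenance"}

# Longest tokens first, mirroring the GPT-4o vocabulary inspection.
longest = sorted(toy_vocab, key=len, reverse=True)[:3]

# Long tokens absent from the everyday word list are suspicious.
suspicious = [tok for tok in longest if tok not in everyday_words]

print(longest)
print(suspicious)
```

On real models the same idea applies to the actual token table; here the point is simply that unusual long tokens are easy to surface mechanically once you have the vocabulary.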
What impact does this have on ChatGPT? Users were able to jailbreak the AI system, generate inappropriate responses, and generate unrelated “garbage” responses. Most of these issues could have been avoided by analyzing the data provenance. For example, if the data comes from an inappropriate source (in this case, a website), it must be cleaned or excluded from training.
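A minimal sketch of provenance-based filtering, under the assumption that each training record carries the URL it was scraped from: records whose source domain is on a blocklist are dropped before training. The domain names and records are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical blocklist of domains known to host inappropriate content.
BLOCKED_DOMAINS = {"spam-casino.example", "scam-site.example"}

# Hypothetical scraped records, each tagged with its provenance (source URL).
records = [
    {"text": "Weather report for Tuesday", "source": "https://news.example/weather"},
    {"text": "Win big now!!!", "source": "https://spam-casino.example/promo"},
]

def clean(records):
    """Keep only records whose source domain is not blocklisted."""
    return [r for r in records
            if urlparse(r["source"]).hostname not in BLOCKED_DOMAINS]

kept = clean(records)
print(len(kept))
```

This is the simplest form of the idea; in practice provenance checks also cover licensing, scraping method, and sensor reliability, as the bullets above suggest.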
This isn’t to say the ideas and technologies behind big data are inherently harmful. Indeed, they only grow in importance as the value of data rises. But like many other technologies, big data leaped before it looked. Reflecting on ChatGPT, the sensible thing to do would have been to slow down. In this case, OpenAI would have benefited from finding better data, not more data.