What’s wrong with big data?

09/01/24

Is more data better? This question has spelled the end of countless AI projects. Common sense says yes: more data lets the model fit our problem better. Yet this assumption is exactly what plagues OpenAI’s ChatGPT. The model has famously been trained on, among other things, public web pages, and the data quality issue presents itself immediately: “public web pages” do not always contain the truth. The problem is compounded as ChatGPT begins to be trained on its own output, now that public web pages increasingly integrate AI-generated content. We are learning the answer to the initial question the hard way: no, better data is better, not more data. We should focus on the quality of our input data, not its volume.

Big Data

We’ve been referring to this idea of “big data” for a while, but what is it exactly? Google defines it as “extremely large and diverse collections of structured, unstructured, and semi-structured data that continues to grow exponentially over time.” The root of big data’s recurring quality problems is visible right in that definition: it says nothing about data quality. This isn’t to say that big data is indifferent to quality; recently, there has been a greater push towards enforcing it. That sentiment is echoed by the EU AI Act: “high-quality data and access to high-quality data plays a vital role in providing structure and in ensuring the performance of many AI systems.” So, what is data quality? And how do we enforce it?

Data quality

Firstly, this shift to quality data requires organizations to determine what data quality means. International standards offer guidance in this respect. The ISO/IEC 25024:2015 standard defines data quality as “the degree to which the data characteristics satisfy stated and implied needs when used under specified conditions.” The characteristics themselves are typically left open-ended, but commonly agreed-upon ones are (a sketch of simple checks for two of them follows the list):
  1. Provenance – the origin of the data: was this from a provider? Was the data scraped? Measured? If so, how reliable are the sensors?
  2. Metadata – data about data: documentation about the data format, provenance, and the various features.
  3. Representability – how well does the dataset represent the actual problem?
  4. Completeness – are there missing entries? Incorrect entries? Does the data suit our task?
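To make these characteristics concrete, here is a minimal sketch of automated checks for completeness and representability on a tabular dataset, using pandas. The file name, the “label” column, and the thresholds are illustrative assumptions, and class balance is only a crude proxy for representability.

import pandas as pd

def check_completeness(df: pd.DataFrame, max_missing_ratio: float = 0.05) -> dict:
    # Flag columns whose share of missing entries exceeds the threshold.
    missing = df.isna().mean()
    return {col: ratio for col, ratio in missing.items() if ratio > max_missing_ratio}

def check_representability(df: pd.DataFrame, label_col: str = "label", min_class_ratio: float = 0.10) -> dict:
    # Flag label classes that are underrepresented in the dataset.
    shares = df[label_col].value_counts(normalize=True)
    return {cls: share for cls, share in shares.items() if share < min_class_ratio}

if __name__ == "__main__":
    # "training_data.csv" and its "label" column are hypothetical names.
    df = pd.read_csv("training_data.csv")
    print("Columns with too many missing entries:", check_completeness(df))
    print("Underrepresented classes:", check_representability(df))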
Next, we must determine the steps to ensure data quality. The ISO/IEC 42001 standard is primarily a management system standard, but it does define a control (measure to modify risk) for data quality. The control suggests at least the following characteristics:
  1. Suitability – “the organization should ensure that the data are suitable for its intended purpose”
  2. Impact of bias – “the organization should consider the impact of bias on system performance and system fairness and make such adjustments as necessary”
If an organization deems additional characteristics necessary, it can determine and document them itself. The ISO/IEC 5259 series should also be considered for additional data quality characteristics and processes. With these controls implemented, we can ensure that big data systems comply with data quality requirements like those in the EU AI Act. (A sketch of one possible bias check follows.)
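To illustrate the “impact of bias” control, the sketch below computes a simple demographic parity gap: the spread in positive-outcome rates across groups. ISO/IEC 42001 does not prescribe a specific metric; the file name, column names, and the 0.1 tolerance here are assumptions for the example.

import pandas as pd

def demographic_parity_gap(df: pd.DataFrame, group_col: str, outcome_col: str) -> float:
    # Positive-outcome rate per group (assumes a 0/1 outcome column).
    rates = df.groupby(group_col)[outcome_col].mean()
    # The gap is the spread between the most- and least-favored groups.
    return float(rates.max() - rates.min())

if __name__ == "__main__":
    # "decisions.csv", "group", and "approved" are hypothetical names.
    df = pd.read_csv("decisions.csv")
    gap = demographic_parity_gap(df, group_col="group", outcome_col="approved")
    # The 0.1 tolerance is illustrative; an organization would set and document its own.
    status = "exceeds tolerance" if gap > 0.1 else "within tolerance"
    print(f"Demographic parity gap: {gap:.3f} ({status})")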

“Bad data” vs. “Good data”

Let’s look at how “bad data” can affect an AI system. With the release of GPT-4o, for example, concerns arose about the model’s vocabulary in certain languages. Tianle Cai, a PhD student at Princeton University, found that of the 100 longest Chinese tokens, only 3 were common enough to be used in everyday conversation. The remaining 97? All common in gambling, pornography, and scamming content. Other languages did not exhibit this issue; their vocabularies reflected news articles and other literature. OpenAI did not respond publicly to the incident, so we can only speculate about the root cause, but a model learns its vocabulary from its training data, which implies that the Chinese training data contained a plethora of inappropriate content.
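Cai’s experiment is easy to approximate with OpenAI’s open-source tiktoken library, which ships the GPT-4o tokenizer (“o200k_base”). The sketch below lists the longest tokens that decode to mostly-Chinese text; the Unicode-range test for “Chinese” is a rough heuristic, and results depend on the tokenizer version.

import tiktoken

def is_mostly_chinese(s: str, threshold: float = 0.5) -> bool:
    # Rough heuristic: share of characters in the CJK Unified Ideographs block.
    if not s:
        return False
    cjk = sum(1 for ch in s if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(s) >= threshold

enc = tiktoken.get_encoding("o200k_base")  # the GPT-4o tokenizer

chinese_tokens = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # skip unused ids and tokens that are partial UTF-8 sequences
    if is_mostly_chinese(text):
        chinese_tokens.append(text)

# List the 100 longest Chinese tokens, mirroring Cai's experiment.
for token in sorted(chinese_tokens, key=len, reverse=True)[:100]:
    print(repr(token))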

What impact does this have on ChatGPT? Users were able to jailbreak the AI system, generate inappropriate responses, and elicit unrelated “garbage” responses. Most of these issues could have been avoided by analyzing the data provenance: if data comes from an inappropriate source (a website, in this case), it should be cleaned or excluded from training.
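In practice, a provenance filter can be as simple as recording a source URL for every document and excluding unknown or blocklisted origins before training. The record format and blocklist below are hypothetical; a real pipeline would combine domain-reputation data, content classifiers, and manual review.

from urllib.parse import urlparse

# Hypothetical blocklist; a real pipeline would use curated domain-reputation data.
BLOCKED_DOMAINS = {"spam-casino.example", "adult-content.example"}

def passes_provenance_check(record: dict) -> bool:
    # Reject any document whose source metadata is missing or blocklisted.
    url = record.get("source_url")
    if not url:
        return False  # unknown provenance is treated as bad provenance
    domain = urlparse(url).netloc.lower()
    return domain not in BLOCKED_DOMAINS

corpus = [
    {"text": "...", "source_url": "https://news.example/article"},
    {"text": "...", "source_url": "https://spam-casino.example/page"},
    {"text": "..."},  # no provenance recorded
]
clean_corpus = [r for r in corpus if passes_provenance_check(r)]
print(f"Kept {len(clean_corpus)} of {len(corpus)} documents")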

This isn’t to say the ideas and technologies behind big data are inherently harmful. Indeed, they only grow in importance as the value of data rises. But like many other technologies, big data didn’t look before it leaped. Reflecting on ChatGPT, the sensible thing to do is to slow down. In this case, OpenAI would have benefited from finding better data, not more data.
