Training AI on untrustworthy data can poison the model itself, which in turn leads to unreliable AI outputs.

In other words, using untrustworthy data is like building castles on sand.

What’s an age-old, tried-and-trusted way to help address untrustworthy data?

The role of data modeling in AI

Let’s start at the start – what is data modeling?

“To model data is to provide a (usually graphical) representation of data in a domain.” – David C. Hay, author.

Equally important is to be aware of why we model data. For example:

  • Data storage and processing
  • Data integration and interoperability
  • Improved decision making
  • Facilitating compliance

Most importantly of all, in this short video I explain a key way that data modeling can contribute to successful AI adoption (using the SqlDBM online data modeling tool and the Snowflake Data Cloud as a reference).


There are plenty of technical components and processes which complement the above. For example, Snowflake is introducing a suite of Data Metric Functions covering Data Quality areas such as Freshness, Accuracy, Uniqueness, and Volume. The functions (a usage sketch follows the list):

  • FRESHNESS determines the freshness of column data
  • DATA_METRIC_SCHEDULED_TIME supports custom freshness metrics by returning the timestamp for when a data quality function is scheduled to run, or the current timestamp if the function is called manually
  • NULL_COUNT determines the number of NULL values in a column
  • DUPLICATE_COUNT determines the number of duplicate values in a column, including NULL values
  • UNIQUE_COUNT returns the total number of unique non-NULL values for the specified columns in a table
  • ROW_COUNT measures the number of rows in a table
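
For illustration, here is a minimal sketch of how a couple of these functions might be called and scheduled. The table and column names (customers, email, load_ts) are hypothetical, so treat this as a sketch rather than a definitive recipe:

    -- Call system Data Metric Functions ad hoc (they live in the SNOWFLAKE.CORE schema)
    SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT email FROM customers);
    SELECT SNOWFLAKE.CORE.FRESHNESS(SELECT load_ts FROM customers);

    -- Or schedule them to run automatically against the table
    ALTER TABLE customers SET DATA_METRIC_SCHEDULE = '60 MINUTE';
    ALTER TABLE customers
      ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.DUPLICATE_COUNT ON (email);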

However, to apply those functions effectively:

We still need a data model!

Semi-structured data

As mentioned in the video, we can configure the Snowflake VARIANT data type using, for example, the SqlDBM online data modeling tool as part of data model delivery. This also opens the door to schema-on-read:

  • As the name suggests, the database schema is created when the data is read
    • As opposed to schema-on-write, where we define our data before it arrives at our destination
  • In this case, schema-on-read is realized by parsing the semi-structured data (e.g. JSON, XML) stored in our VARIANT table column
    • Snowflake has a myriad of functions to facilitate this parsing, such as FLATTEN – see the sketch below
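
As a minimal sketch, assume a hypothetical raw_orders landing table with a VARIANT column named payload holding JSON that contains an items array. A schema-on-read query might then look like this:

    -- Hypothetical landing table: one VARIANT column holding raw JSON
    CREATE OR REPLACE TABLE raw_orders (
        payload VARIANT
    );

    -- Schema-on-read: shape the data at query time.
    -- LATERAL FLATTEN expands the (assumed) items array into rows.
    SELECT
        o.payload:order_id::NUMBER AS order_id,
        i.value:sku::STRING        AS sku,
        i.value:quantity::NUMBER   AS quantity
    FROM raw_orders o,
         LATERAL FLATTEN(input => o.payload:items) i;

The structure is imposed only when the query runs – the underlying JSON can continue to arrive as-is.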

What do structured and semi-structured data have in common? Both can be used as inputs to AI model training.

Conclusion

In some ways, despite the recent upsurge in AI-related discourse and rollout, nothing has changed. We still need good quality data, good quality data processes (such as utilizing the Snowflake Data Cloud functions mentioned earlier), and of course, good quality data models!

The bottom line is that data modeling can play a key role in successful AI adoption.

Important Notes:

  • None of this content was created using Gen AI — it was all created the old-fashioned way!
  • If you would like to build your data modeling skills across the full data modeling lifecycle using the SqlDBM online data modeling tool and the Snowflake Data Cloud, click here.

© Dan Galavan 2024