A key aspect of successful #DataGovernance is transparency. This is realized in a number of ways, e.g.:
✳️The quality of the data
✳️Where the data came from
✳️What data transformations have taken place
(Btw, #datamodeling can play a key role in all of the above!)
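To make the three transparency points concrete, here's a minimal sketch of a lineage record that tracks source, a quality score, and the transformations applied. The class and field names are illustrative assumptions, not from any particular governance tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical minimal lineage record; field names are illustrative
@dataclass
class LineageRecord:
    dataset: str
    source: str                                # where the data came from
    quality_score: float                       # e.g. share of rows passing checks
    transformations: list = field(default_factory=list)

    def add_transformation(self, step: str):
        # Append a timestamped note of what was done to the data
        ts = datetime.now(timezone.utc).isoformat()
        self.transformations.append(f"{ts}: {step}")

record = LineageRecord("customers_clean", source="crm_export.csv",
                       quality_score=0.97)
record.add_transformation("dropped rows with null email")
record.add_transformation("normalized country codes to ISO 3166-1")
```

Even a lightweight record like this answers the three questions above: how good the data is, where it came from, and what was done to it.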
Data Governance also applies to Generative AI such as LLMs, e.g.: what data was used to train an LLM?
In the case of ChatGPT, this is for the most part a black box.
But some insight has been provided for other LLMs, such as Meta's LLaMA and BloombergGPT.
They both used a dataset called “Books3”.
Books3 contains 170K+ books from authors like Stephen King and Junot Díaz.
This also prompts significant copyright questions, but that’s for another discussion.
The bottom line: when it comes to LLMs, just as with any other data governance work, we should ask what data the models were trained on.
Now that the Generative AI hype has eased a little, this question will become more prominent.