ML Long Tails | Charleston Advisory

Machine Learning Long Tails

One of the most challenging aspects of machine learning is that a model is only as good as the coverage of the input distribution that its training data achieves. When inputs fall far outside the distribution of the training data, the model may behave in ways that are unpredictable or unreliable.

For a model to perform well on the edge cases (Black Swan data points) that we encounter in the real world, the dataset has to be as comprehensive as possible. Coverage of the 'long tail' of the data distribution is critical to the quality of the model's outputs.

So What's the Answer?

We need a decentralized marketplace for contributing data to a dataset, one that rewards anyone who holds very unusual data for contributing it to the network. The alternative is a central company scouring billions of data points with no way of knowing who holds the data it is missing.

By "flipping" the incentive structure from a centralized approach to a decentralized one, we give those people a reason to come forward and provide that data of their own accord, and we get significantly better coverage of the long tail. The performance of machine learning models also tends to improve as the dataset grows and as the diversity of its data points grows, so this can boost the performance of our ML models even further.

We're able to get even more comprehensive datasets that cover the whole distribution.

Because we will know which pockets of the distribution are under-covered, we can incentivize the people who hold those data points to provide them and complete the dataset.
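As a rough illustration of how such pockets might be located, here is a minimal sketch. It assumes a single numeric feature split into equal-width bins, and a made-up payout rule that pays more for data landing in emptier bins. The function names, the thresholds and the multiplier are all hypothetical and chosen only for illustration. A real dataset would have many features and would need a richer notion of coverage.

```python
from collections import Counter


def sparse_bins(values, low, high, n_bins, min_count):
    """Return indices of equal-width bins holding fewer than min_count values."""
    width = (high - low) / n_bins
    counts = Counter()
    for v in values:
        if low <= v <= high:
            # Clamp so that v == high falls into the last bin.
            idx = min(int((v - low) / width), n_bins - 1)
            counts[idx] += 1
    return [i for i in range(n_bins) if counts[i] < min_count]


def reward_multiplier(bin_count, target_count, max_multiplier=5.0):
    """Hypothetical payout rule: the emptier the bin, the higher the reward."""
    if bin_count >= target_count:
        return 1.0
    shortfall = 1.0 - bin_count / target_count
    return 1.0 + shortfall * (max_multiplier - 1.0)


# Toy feature values: most samples sit in the middle, few at the edges.
samples = [1, 2, 2, 3, 3, 3, 4, 4, 9]
under_covered = sparse_bins(samples, low=0, high=10, n_bins=5, min_count=2)
```

Contributions that fall into one of the `under_covered` bins would then be paid at a higher multiplier than contributions to bins that are already well populated.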

Be Wary of Fake Data
If we incentivize people to contribute data, we also incentivize people to create fake data so they can get paid. So we must have a robust mechanism to make sure that the data being contributed is authentic. This will be one of the hardest open problems for the many companies that incentivize the provision of data. Incentives drive outcomes, so if we incentivize the ‘wrong’ actors, they will contribute to the ‘wrong’ outcomes. The equation works in the right direction (e.g. good actors) but unfortunately also in the wrong direction (e.g. bad actors); the formula is, of course, agnostic. Any data contribution model based on incentives therefore needs to be cognisant of what its incentives will produce.

While there are several ways of achieving this, including trusted hardware and tests embedded in smart contracts, there are other, more efficient ways to authenticate the supply of data. For example, a smart contract that embeds some kind of test, such as a cognitive test, could help authenticate a contribution while simultaneously testing the veracity of the data against the existing dataset. A cryptographic proof that identifies providers, to ensure that data is not 'replicated' or 'synthetically produced', can also raise the barriers to entry for bad actors.
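To make the 'replicated' case concrete, here is a minimal sketch of one building block: fingerprinting each submission with SHA-256 and rejecting byte-identical resubmissions. The `ContributionRegistry` class and its methods are hypothetical names used only for illustration, and the registry lives in memory rather than in a smart contract. This catches exact copies only. It does nothing about near-duplicates or synthetically produced data, which remain the harder problems.

```python
import hashlib


class ContributionRegistry:
    """Hypothetical in-memory registry that rejects byte-identical resubmissions."""

    def __init__(self):
        # Maps a content fingerprint to the first contributor who submitted it.
        self._seen = {}

    @staticmethod
    def fingerprint(payload: bytes) -> str:
        """Return the SHA-256 hex digest of the raw submission bytes."""
        return hashlib.sha256(payload).hexdigest()

    def submit(self, contributor_id: str, payload: bytes) -> bool:
        """Accept the payload if its fingerprint is new, otherwise reject it."""
        fp = self.fingerprint(payload)
        if fp in self._seen:
            return False
        self._seen[fp] = contributor_id
        return True


registry = ContributionRegistry()
first = registry.submit("alice", b"rare sample")   # new content, accepted
second = registry.submit("bob", b"rare sample")    # exact copy, rejected
```

Storing the first contributor alongside each fingerprint means a reward can be attributed to whoever supplied the data first, which removes the payoff for simply copying someone else's submission.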

Transparent Marketplaces
Where we are heading is an open, bottom-up, transparent marketplace that allows people to contribute and source compute, data, and machine learning models, and that essentially acts as a counterweight to the hyper-centralized giant tech companies that are driving most of the AI work today.

Today, however, we see both top-down and bottom-up approaches to the contribution of data, and that will continue until a decentralized model creates a set of incentives that truly draws people to self-organize.

That’s the goal.

Peter Toumbourou

Charleston 2023