Summary
If your training set includes enterprise confidential data, then by definition the machine you construct out of those data using ML includes enterprise confidential information. That means we need to separate and understand not just operational data and training data, but also determine who has (and who should have) access to the training data at all. Security people need to recognize a significant trust boundary between the data owner and the data scientist who trains up the ML system; in many cases, the data scientist needs to be kept at arm’s length from the “radioactive” training data that the data owner controls. The gist of one approach to this problem is to use the same kind of mathematical transformation at training time and at inference time to protect against sensitive data exposure.
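To make that last idea concrete, here is a minimal sketch (not code from the episode) of applying one and the same keyed-hash transformation to a sensitive field at training time and at inference time. The field names, the `pseudonymize` helper, and the key handling are illustrative assumptions, not a prescribed design.

```python
# Illustrative sketch only: the same keyed-hash transform applied to a sensitive
# field at training time and at inference time, so the model never sees the raw
# value but feature values still line up across both pipelines.
import hmac
import hashlib

# Hypothetical key; in practice it would stay under the data owner's control.
SECRET_KEY = b"data-owner-held-key"

def pseudonymize(value: str) -> str:
    """Deterministically map a sensitive value to an opaque token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def transform_record(record: dict, sensitive_fields: set) -> dict:
    """Replace only the sensitive fields; pass everything else through."""
    return {k: pseudonymize(v) if k in sensitive_fields else v
            for k, v in record.items()}

# Training time: transform before the data crosses the trust boundary.
train_row = transform_record(
    {"customer_id": "C-1029", "balance": 1200, "region": "EMEA"},
    sensitive_fields={"customer_id"},
)

# Inference time: apply the identical transform so tokens match what was trained on.
query_row = transform_record(
    {"customer_id": "C-1029", "balance": 900, "region": "EMEA"},
    sensitive_fields={"customer_id"},
)
assert train_row["customer_id"] == query_row["customer_id"]
```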
Show Notes
So if your training set includes sensitive data, then by definition the machine you construct out of those data (using ML) includes sensitive information.
Not surprisingly, one of the main ideas for approaching the training data problem is to fix the training data so that it no longer directly includes sensitive, biased, regulated, or confidential data.
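One simple way to “fix” the training data along those lines (a sketch with made-up field and file names, not a method from the episode) is to strip sensitive columns before the data ever crosses the trust boundary to the data scientist:

```python
# Illustrative sketch with hypothetical field names: scrub a raw data export down
# to an approved allow-list of columns before handing it to the data scientist.
import csv

APPROVED_FIELDS = ["age_band", "region", "product", "churned"]   # hypothetical allow-list
SENSITIVE_FIELDS = {"name", "ssn", "email", "account_number"}    # stays with the data owner

def scrub_row(row: dict) -> dict:
    """Keep only the columns the data owner has approved for training."""
    return {field: row[field] for field in APPROVED_FIELDS}

# Sanity check: nothing sensitive may appear on the allow-list.
assert SENSITIVE_FIELDS.isdisjoint(APPROVED_FIELDS)

with open("raw_customer_export.csv", newline="") as src, \
     open("training_set.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=APPROVED_FIELDS)
    writer.writeheader()
    for row in reader:
        writer.writerow(scrub_row(row))
```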
As an example, we need to separate and understand not just operational data and training data as described above, but further determine who has (and who should have) access to training data at all.
In many cases, the data scientist needs to be kept at arm’s length from the “radioactive” training data that the data owner controls.
In this case, that means recognizing and mitigating training data sensitivity risks when building ML systems.