Working with datasets that have error bars?

Has anyone here ever worked with data that comes with specified uncertainty (e.g. represented as error bars)?

I’m currently building a model for a dataset that comes with measurement errors, and would love to hear from anyone who has experience dealing with these errors. I haven’t been able to find any literature on this topic, unfortunately.

Thanks in advance!

Edit: I am working with a structured dataset that comes with recorded errors for the values in its columns.

If you have enough data, the measurement inaccuracies won’t be a problem.

What are you trying to do more concretely? Do you want to make predictions with error bars?

Are you asking whether inaccuracies in the data would make learning harder?

Most likely it won’t. In fact, we sometimes deliberately add some “specified uncertainty” to our data, e.g. by adding random noise to images or randomly changing the class label of an image in classification. We do this to reduce overfitting. I’m sure you can find papers on that, but I doubt it is what you are looking for. :slight_smile:

@mnpinto Yes! I’m working on a multi-class classification model where each datapoint is a set of recorded values, and every recorded value comes with a recorded experimental error. I’m wondering how to effectively incorporate the error bars into my data.

As a first try, I’m just adding the specified errors as their own columns, but that’s probably not the best way to go. So I’m toying with the idea of assuming a distribution for the errors, and then augmenting my dataset by perturbing the recorded values with errors drawn from that distribution.
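For concreteness, here’s roughly what my “errors as their own columns” baseline looks like (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical structured dataset: each measured value has a recorded error
df = pd.DataFrame({
    "length":     [40.0, 41.2, 39.8],
    "length_err": [0.05, 0.05, 0.05],  # the ± part, as its own feature column
    "mass":       [12.3, 11.9, 12.7],
    "mass_err":   [0.10, 0.10, 0.10],
    "label":      [0, 1, 0],
})

X = df.drop(columns="label").values  # values and errors side by side as features
y = df["label"].values
```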

I’m curious whether anyone else in the natural/physical sciences has come up with a strategy for dealing with these measurement errors.

Thanks for jumping in! I am actually looking at errors that are quantified in the dataset:

e.g. Length: 40 m ± 0.05 m

and how to deal with the 0.05 at the end.

@amyxst Yes, I was thinking about that! It could be interesting to treat the data samples as random variables and, for each epoch or mini-batch, draw samples from a normal distribution to train the model. It’s a sort of data augmentation or regularization; maybe there are papers about that. If you try it, it would be interesting to see whether there is an improvement in accuracy. After the model is trained, you can generate predictions using just the recorded values (the means of the distributions), or maybe generate several predictions sampled from the normal distributions to get some sort of ensemble. I’ve never tried this, but I’m interested in it since I often work with data with uncertainty.

Maybe this can be done with callbacks like on_epoch_end(). If the errors are constant for each variable, like the 40 m ± 0.05 m you mentioned, and we assume they are normally distributed with the ± 0.05 as a 95% confidence interval, then 0.05 is about 2σ (where σ is the standard deviation, so σ ≈ 0.025). You can then easily draw samples from a normal distribution with mean 0 and standard deviation σ and add them to the recorded values.
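Something along these lines is what I have in mind (a minimal PyTorch sketch; the data, model, and σ values are placeholders just to make it self-contained and runnable):

```python
import torch
from torch import nn

# Assumed per-feature standard deviations: with ± 0.05 as a 95% CI, sigma ≈ 0.05 / 2
sigma = torch.tensor([0.025, 0.05])

def add_measurement_noise(x, sigma):
    # Fresh Gaussian noise on every call, so each epoch/mini-batch sees a
    # different perturbation of the same underlying samples
    return x + torch.randn_like(x) * sigma

# Dummy data and model just for the sketch
X = torch.randn(100, 2)
y = torch.randint(0, 3, (100,))
model = nn.Linear(2, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_func = nn.CrossEntropyLoss()

for epoch in range(5):
    loss = loss_func(model(add_measurement_noise(X, sigma)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference time, just use the recorded means: model(X)
```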

I’m not an expert on fastai or PyTorch, nor in programming in general… but I’m working to improve, so if you need any help implementing this, I’ll do my best!


I understand now!

Bear in mind that I haven’t done any work on this kind of data.

If the n in m ± n changes:

  • make ‘n’ another output of the network as well as ‘m’ (see the sketch after this list).

If n doesn’t change, I would try two things and compare the results:

  • just ignoring the ± n (only do this if n is a constant!)
  • generate random numbers in the specified range (± n) and add them to m. This would also work as a great way to reduce overfitting (if the random number is different in each epoch).

By n not changing I mean that it is the same within one feature; it doesn’t matter if different features have different n.
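Here’s a rough sketch of the first idea, assuming a regression-style output; PyTorch’s GaussianNLLLoss handles exactly this mean-plus-variance setup (the architecture and sizes are arbitrary placeholders):

```python
import torch
from torch import nn
import torch.nn.functional as F

# Two-headed network: one head predicts the value m, the other its uncertainty
class MeanVarNet(nn.Module):
    def __init__(self, n_in):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_in, 32), nn.ReLU())
        self.mean_head = nn.Linear(32, 1)
        self.var_head = nn.Linear(32, 1)

    def forward(self, x):
        h = self.body(x)
        # softplus keeps the predicted variance positive
        return self.mean_head(h), F.softplus(self.var_head(h))

model = MeanVarNet(n_in=4)
loss_func = nn.GaussianNLLLoss()  # negative log-likelihood of N(mean, var)

x = torch.randn(8, 4)
target = torch.randn(8, 1)
mean, var = model(x)
loss = loss_func(mean, target, var)  # penalizes both error and miscalibrated variance
loss.backward()
```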

Thanks for the suggestion on the function! I’ve never done anything conditional with neural nets before, so this will be new.

I’m also surprised that I haven’t really heard of anything in ML about how best to handle data with error bars, when most data in experimental science comes with uncertainties (or maybe I haven’t been looking in the right place?).

I’ll try a few different things and hopefully have something useful to report back.


Thanks for the input, those were the conclusions I also came to. I think I’ll give it a shot and report back on how it goes!


Take a look at this:


Plot the errors to see whether they are Gaussian (just measurement error) or shifted from a zero mean (bias). If it’s random noise, ignore it; if there is a bias, fit the data and adjust for it. This is a very simple recipe.
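For example (just a sketch; the errors array here is random placeholder data standing in for your recorded errors):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder errors; substitute your recorded measurement errors here
errors = np.random.normal(loc=0.01, scale=0.025, size=500)

plt.hist(errors, bins=30)
plt.xlabel("error")
plt.ylabel("count")
plt.title("Distribution of measurement errors")
plt.show()

# A mean far from zero suggests a bias worth fitting and subtracting out;
# a mean near zero with a symmetric spread suggests plain measurement noise.
print("mean:", errors.mean(), "std:", errors.std())
```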

You are stepping into a deep subject, one that is usually only cared about by someone concerned with the error bounds on the predictions. That is why every published paper has error bars, whether it’s predicting dogs/cats, lung cancer, or the Higgs boson. :slight_smile:

Hey @amyxst, I am also working on the same problem, i.e. using a dataset like y ± σ to train a deep neural network, and I don’t know how to incorporate the error bars into the training. Did you reach any conclusions, since you worked on this quite a while ago?