How much does data order matter?

Hello,

first of all, please excuse me if my question seems stupid. I have some experience with Python, but I just started learning ML and DL. I'm working on a project of mine where I'm trying to create a model that can recognize the vehicle maker, vehicle model and engine based on the input from a binary file read from the engine ECM.

My main issue is that I'm not able to extract proper data from the binary file: after decryption it is just one really long run of strings, some human-readable and some not. I have a method that parses the strings given a set of parameters.

But I'm not able to categorize the output of this method. For better understanding, I'll try to give a simplified example.

Let's say my CSV file has 1000 rows and 1000 columns (a real CSV file has up to 50,000 columns). Some columns share the same value, for example “XXXX”, but due to the nature of the bin files this value “XXXX” sometimes sits at a different position and gets “pushed” into a different column by the parsing process. Hence the categorization problem: the same data can end up in different columns of the CSV file, resulting in a sort of unstructured CSV.
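To make the misalignment concrete (every name and value below is made up), here is a tiny sketch: the same token “XXXX” lands in different columns for different files, and one crude way to sidestep the position problem is to re-encode each row as presence/absence of known tokens rather than by column position:

```python
import pandas as pd

# Two parsed rows where the same token "XXXX" ends up in different columns
rows = [
    ["AAAA", "XXXX", "BBBB", None],
    ["AAAA", "CCCC", "XXXX", "BBBB"],
]
df = pd.DataFrame(rows, columns=[f"col_{i}" for i in range(4)])

# Naive realignment: treat each row as a bag of tokens and mark which known
# tokens are present, so "XXXX" always maps to the same feature column.
known_tokens = ["AAAA", "BBBB", "CCCC", "XXXX"]   # hypothetical vocabulary
aligned = pd.DataFrame(
    [{tok: tok in set(row.dropna()) for tok in known_tokens}
     for _, row in df.iterrows()]
)
print(aligned)   # each known token now has a consistent column
```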

My question is: how much does this affect the model's performance? I'm treating all columns as categorical values. On a small dataset of around 150 files my model's accuracy was around 75%, but I just used the basic fastai API for testing purposes before diving deeper or trying bigger datasets.
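For reference, roughly what such a basic fastai tabular setup could look like (the file and column names below are placeholders, and every non-target column is treated as categorical):

```python
from fastai.tabular.all import *
import pandas as pd

df = pd.read_csv('parsed_ecm.csv')                   # placeholder file name
cat_cols = [c for c in df.columns if c != 'maker']   # treat every non-target column as categorical

# Categorify turns each column's strings into integer codes for the embeddings
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))
to = TabularPandas(df, procs=[Categorify],
                   cat_names=cat_cols, y_names='maker', splits=splits)

learn = tabular_learner(to.dataloaders(bs=64), metrics=accuracy)
learn.fit_one_cycle(5)
```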

Is there maybe a particular model or method I could use that would be robust to this type of data?

Thank you for reading through; I hope this makes some sort of sense. If you have any additional questions, feel free to ask.

Unfortunately, I do not think you have an easy way around cleaning up your parsing process to make sure that each piece of data lands in the right column of your CSV file. When loading tabular data, the fastai API expects each column to consistently hold data from one source. So much so that you will also have to clean your data to impute missing values (NaNs) with something.
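For the missing values, one simple option (just a sketch; the file name below is a placeholder) is to replace NaNs in the string columns with an explicit placeholder category before building the dataloaders:

```python
import pandas as pd

df = pd.read_csv('parsed_ecm.csv')                     # placeholder file name
cat_cols = df.select_dtypes(include='object').columns  # all string-typed columns
df[cat_cols] = df[cat_cols].fillna('MISSING')          # explicit "missing" category instead of NaN
```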

The first step will therefore be to understand the structure of the ECM files and parse them correctly. You may find that other people have done that before and put it on GitHub. Worth a search.

Hello,

first of all, thank you for the response. I went ahead and started testing with just 2 categories and a tabular_learner, and so far the results are better than expected: around 90% accuracy. Either I'm overfitting (though I don't think that's the case), or my best guess is that there is so much data (again, around 30-50k columns) that the model is able to identify a structure for each ECM. Then again, so far it's only binary classification, and I need to do more research to see how it behaves when there are 3 or more categories to predict.

Or, third option, I'm completely wrong here and I'm misinterpreting the data.

To check whether you are overfitting, split your data into a train and a validation set (and keep some aside for a final test too). Then look at the evolution of the training loss vs. the validation loss. As long as your validation loss keeps improving as you train (more epochs), you are fine. If your validation loss is increasing while your training loss keeps decreasing, you are overfitting and need regularisation or a smaller learning rate. If both losses are increasing, you are diverging and should try training with a smaller learning rate.
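As a rough sketch of that workflow with fastai (file and column names are placeholders, same as in the earlier sketch):

```python
from fastai.tabular.all import *
import pandas as pd

df = pd.read_csv('parsed_ecm.csv')                          # placeholder file name
df_test = df.sample(frac=0.1, random_state=42)              # held-out test set, never used in training
df_trainval = df.drop(df_test.index).reset_index(drop=True)

cat_cols = [c for c in df_trainval.columns if c != 'maker']
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df_trainval))  # train/validation split
to = TabularPandas(df_trainval, procs=[Categorify],
                   cat_names=cat_cols, y_names='maker', splits=splits)
learn = tabular_learner(to.dataloaders(bs=64), metrics=accuracy)

learn.fit_one_cycle(10)      # each epoch prints train_loss and valid_loss side by side
learn.recorder.plot_loss()   # valid loss rising while train loss falls => overfitting
```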


Yes, that is exactly what I'm doing: monitoring the validation loss and the training loss. It looks promising so far.