Drop rows in DataFrame by conditions on column file names

Hello,

I am working with the dataset in the following link: https://www.kaggle.com/neha1703/movie-genre-from-its-poster

Since the data set is pretty large (approx 40 000 images) I decided to create a training model on the sample images included in the dataset (approx 1000 images). Both for learning how to do it as well as not having to wait for 100 years when the model trains :slight_smile:

I would like to create the imagedatabunch from csv or df.

The problem I have right now is that there are many rows of information in the csv which I dont need when only using the 1000 images in the sample. Also; when I try to create the imagelist with ImageList.from_csv I get an error since the specified image from the csv doesnt exist in the folder.

Here is how the first 5 rows of the csv currently looks like:

I would like to use some kind of loop that removes all rows where the file name in column (imdbid) doesnt exist in my folder with the sample images. Something along the lines of this (I have modified the example from this site: https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/ ):

Get indexes for where column imdbid filename exists in folder

index = df[ df[‘imdbid’] == exists in folder].index

Remove rows where file name doesnt exists from dataFrame

df.drop(index , inplace=False)

Some guidance in how to write the true/false index statement would be greatly appreciated :slight_smile: Thanks in advance.

//Jan