About regular expression used in the notebook

Hi, everyone. The filenames in the first notebook of Part 1 have this structure: PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/saint_bernard_188.jpg'). The notebook uses the regular expression pat = r'/([^/]+)_\d+.jpg$', which seems to match the entire file name after the last slash, including the (*.jpg) part, when I run it in an online regex interpreter. However, it seems to extract only the label name (saint_bernard) when used with ImageDataBunch in the notebook. So while I am trying to make up my own custom regex patterns for my file names, should I attempt to extract only the label name or the entire file name (including the extension)? Sorry if this is confusing.

So there are two separate questions.

should I attempt to extract only the label name or the entire file name (including the extension)

Yes, with label_from_re, you want just the label name, not the file path.

[Should it] be matching the entire file name after the last slash including the (*.jpg) part when I run it in an online regex interpreter[?]

No, it should not. I took the expression and put it in regex101.com, and with the Python flavor selected, it looks like it works.


Perhaps you are using the incorrect language for regex, wherever you are testing it?


Hi, thank you so much. Did you use str() on the PosixPath to get the (/home/ubuntu…) string?
Also, it looks like the full match gives the entire file name and the Group 1 match gives just the label. Is there any way I can explicitly tell the regex to give me just the Group 1 match?

@keysersoze
Part 2 of this post might help explain what the regex is doing (it’s not exactly an answer to your question, but it might help).

HTH.
Butch


i’m sorry but i must completely disagree.

the full match including .jpg is correct: because the pattern includes .jpg, the extension is part of the text the pattern matches, simple as that.

getting only the label out of ImageDataBunch is also correct. if you look at the code for from_name_re (and you should have done :wink:) you’ll see it’s using the first match group from the pattern for the label.
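to make the difference between the full match and group 1 concrete, here is a minimal sketch using only python’s re module. it is not the fastai source, just an illustration of what “using the first match group” means. PurePosixPath is my own choice here so the forward slashes survive on any OS, and str() is all you need to turn the path into something re can search.

```python
import re
from pathlib import PurePosixPath

# pattern and file name taken from the question above
pat = r'/([^/]+)_\d+.jpg$'
fname = PurePosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/saint_bernard_188.jpg')

# re works on strings, so convert the path first
m = re.search(pat, str(fname))

full_match = m.group(0)  # everything the pattern matched, from the last slash on
label = m.group(1)       # only what the parentheses captured

print(full_match)  # /saint_bernard_188.jpg
print(label)       # saint_bernard
```

so there is nothing to change in the regex itself: the online tools show you group 0 and group 1 side by side, and the code simply asks for group 1.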

whether your own patterns should capture just the label or the whole file name depends on what you’re going to do with them. if you’re passing them to from_name_re then they don’t need to look any different. if you’re going to use them for something else then “it depends”. that said, i’d suggest that generally at some point in your code you may want the filename (or the extension) as well as the label, so your pattern will probably still look very similar to jeremy’s.

for example, other datasets you use in the future may not use jpgs, or may have a mixture of image formats with different extensions, and the fewer assumptions your code makes about what it’s dealing with, the easier it is to cut & paste stuff around from old notebooks.
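as a rough sketch of what “fewer assumptions” could look like, here is a variant of the pattern that accepts any extension. the pattern name pat_any_ext and the /data/images/ paths are made up for illustration and are not from the notebook. it also escapes the dot, since an unescaped . in a regex matches any character, not just a literal dot.

```python
import re

# hypothetical variant: literal dot (\.) followed by any word-character extension
pat_any_ext = re.compile(r'/([^/]+)_\d+\.\w+$')

# made-up paths with mixed extensions
fnames = [
    '/data/images/saint_bernard_188.jpg',
    '/data/images/great_pyrenees_12.png',
    '/data/images/Abyssinian_7.jpeg',
]

# group(1) is still just the label, whatever the extension is
labels = [pat_any_ext.search(f).group(1) for f in fnames]
print(labels)  # ['saint_bernard', 'great_pyrenees', 'Abyssinian']
```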

the main thing you need to remember and get used to with regex when you’re learning it is that quantifiers like + and * are greedy by default and will match as much as they possibly can. that’s one reason those online editors like regexr are good, you can see right away what the pattern is doing. (i realise this wasn’t one of your questions but for any regexp noobs reading this, it’s the best way to learn)
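a small sketch of greedy vs. lazy matching, again with plain python re. the two throwaway patterns are mine, not from the notebook. greediness is why a two-word breed like saint_bernard comes out whole: the greedy + runs to the last underscore it can use, while the lazy +? stops at the first one.

```python
import re

name = 'saint_bernard_188.jpg'

# greedy: .+ grabs as much as possible, then backs off to the LAST underscore
greedy = re.match(r'(.+)_', name).group(1)

# lazy: .+? grabs as little as possible, stopping at the FIRST underscore
lazy = re.match(r'(.+?)_', name).group(1)

print(greedy)  # saint_bernard
print(lazy)    # saint
```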


Thank you guys. This made the concept much clearer. Also, I went ahead and used from_name_re on my own dataset and it worked. :slight_smile:

Oh, I’m sorry, you are right. I misread his question, thought he wasn’t getting the label from the matched group. Thanks for speaking up.
