GLAMs (Galleries, Libraries, Archives and Museums) fastai study group

Hi all,

It is good to be here, and many thanks for the initiative!

  • Who are you?
    I am Péter Király, researcher and software developer at the Göttingen campus computer facility. Previously I worked at different LAM institutions (including Europeana) as a software developer. I am also an editor of the Code4Lib Journal, so if you plan to publish your results, please consider this forum as well.
  • Why are you interested in machine learning/deep learning?
    My research question is how we can decide whether a given metadata record is good or bad. For this I am working with data science techniques, including some (unsupervised) machine learning algorithms. For me, working with Big Data makes the problems even harder to solve, since lots of data science/ML techniques require special hardware resources to which I do not have access.
  • Do you already have some potential problems you are using (or would like to use) machine learning for?
    Just to name a few: pattern recognition, such as distinguishing between metadata values created for machines and those created for people; finding similar records; and determining whether all the important entities in a record are under authority control.
  • Datasets you are keen to work with? (either labelled or unlabelled)
    I have worked with the Europeana dataset (I made it downloadable at rnd-2.eanadev.org/europeana-qa/download.php?version=v2020-06), and several libraries’ full MARC catalogues (github.com/pkiraly/metadata-qa-marc#datasources). These are unlabelled datasets in special metadata formats.
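One of the pattern-recognition tasks mentioned above, telling machine-oriented metadata values apart from human-oriented ones, can at least be approximated with a few regular-expression heuristics. This is a toy sketch, not Péter's actual approach, and all the example values below are invented:

```python
import re

# Toy heuristic: values that look like URLs, URNs or bare codes were
# probably written for machines; free prose was written for people.
MACHINE_PATTERNS = [
    re.compile(r"^https?://", re.IGNORECASE),                # URLs
    re.compile(r"^urn:[a-z0-9][a-z0-9-]*:", re.IGNORECASE),  # URNs
    re.compile(r"^[A-Z0-9][A-Z0-9_/:.\-]{4,}$"),             # bare codes / shelfmarks
]

def looks_machine_oriented(value: str) -> bool:
    """Return True if the metadata value matches an identifier-like pattern."""
    v = value.strip()
    return any(p.match(v) for p in MACHINE_PATTERNS)

# Invented examples:
# looks_machine_oriented("http://data.europeana.eu/item/123") -> True
# looks_machine_oriented("A view of the harbour at dawn")     -> False
```

A real classifier would of course need many more signals (character statistics, vocabulary lookups, field context), but even crude rules like these can help triage large unlabelled datasets.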

I am also interested in this! I am still very much a noob when it comes to all of the details of IIIF, but I would like to learn more about it and its intersection with machine learning.

This sounds great too, maybe once we get a bit further in the course there might be some scope to play/think about how IIIF could fit into a machine learning workflow in a way that works well within a GLAM context?

I would be up for exploring the crowdsourced data from NLW. I think the intersection between crowdsourcing and machine learning is really interesting, and also something which is fairly unique to GLAMs (and citizen science projects). The motivations for people to produce this data are quite different from how datasets are often produced in industry, and I think that has implications for where machine learning should and can be used in relation to crowdsourced data, and for how the two might meaningfully interact.

nice to meet you :wave:

As someone who has likely produced some dubious metadata records, :grimacing: this is definitely interesting to me!

I have followed the mailing list for years and will keep it in mind for sharing results :slight_smile:


Hey, I’m so excited to be a part of this!

Who am I?
Silvia Gutiérrez, a Digital Humanities Librarian in Mexico City.

Interest in machine learning / deep learning
I have applied Association Rule Learning methods to our book collection’s metadata (> half a million records). The objective was to understand our Subject Headings better and see if we could create a recommendation system based on those terms. We couldn’t. But we learned this:

  1. Library data can be super messy, and a standardized thesaurus does not ensure uniformity (I’ve seen the same problem in national libraries’ records, including the British Library’s :scream:)
  2. There are no out-of-the-box solutions, and the AI-GLAM projects I know of are dealing with very interesting but perhaps too-specific research questions.
  3. I would like to see how these super-powerful methods can help librarians deal with day-to-day issues
  4. A bit more LIS-nerdy result: both the NYPL’s Subject Headings graph and ours have the subject “History” as a central node. And the second most central node is the name of the country of each collection (USA for the NYPL and Mexico for ours)
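For anyone curious about the Association Rule Learning mentioned above: the core quantities (support and confidence of a rule A → B) can be computed in a few lines. This is a minimal sketch on invented subject-heading records, not the actual analysis of the half-million-record collection; in practice you would probably reach for a library such as mlxtend:

```python
from itertools import combinations
from collections import Counter

# Invented toy records: each book's set of subject headings.
records = [
    {"History", "Mexico"},
    {"History", "Mexico", "Politics"},
    {"History", "Art"},
    {"Mexico", "Politics"},
]

n = len(records)
# How often each heading, and each pair of headings, co-occurs.
item_counts = Counter(item for r in records for item in r)
pair_counts = Counter(frozenset(p) for r in records for p in combinations(sorted(r), 2))

def rule(a, b):
    """Support and confidence for the association rule a -> b."""
    both = pair_counts[frozenset((a, b))]
    support = both / n
    confidence = both / item_counts[a]
    return support, confidence

# rule("Mexico", "History") -> (0.5, 0.666...): half the records contain both,
# and two thirds of the records containing "Mexico" also contain "History".
```

The apriori algorithm is essentially this counting done efficiently over itemsets of any size, with pruning of infrequent candidates.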

Problems / datasets you are keen to work with?
Cleaning my library’s catalogue (reasons above :upside_down_face:)

Things that would help
I really like to learn from seeing others’ examples, so +1 to @JoseCalvo’s idea of sharing our scripts. Perhaps there could be a global index like the one the #30daysofgraphics initiative in Spanish created for the R scripts the community produced

:wave: great to meet you Silvia

This seems to be a common trend :grimacing: do you think the messiness of the data was the main reason you couldn’t build a recommendation system on those terms? One of the really interesting things about library (meta)data is that it ranges from fairly structured ‘tabular’ data to free text/prose. As a result, one of the things I’m particularly interested in in a GLAM setting is using deep-learning models that take multiple types of input, i.e. both structured tabular data and free text.

Yes, I really hope that this course will help with this. I think there is a danger that if GLAM institutions rely too heavily on external collaborators or commercial providers to implement machine learning techniques, then the agenda is set too much by others (this is not to say those projects aren’t useful too!)

I would be very keen on that too :slight_smile:


You have probably all seen that the course is going to be launched on August 21st!

Since I don’t want to impose a strict schedule on the study group, people can obviously follow the course at whatever speed they prefer. My aim will be to watch one lesson every two weeks and spend the intervening time working through the notebooks, applying them to new data, etc. I plan to post comments, questions, etc. as I go along (and my notebooks in a GitHub repo). There will probably also be a bit of time needed at the start to make sure a GPU is working, decide on server set-up, etc. The forums here are a great source of help on that front.

Schedule for the first few weeks will look like:

| Week beginning | Lesson |
| --- | --- |
| 24 August | Lesson 1 + set-up |
| 7 September | Lesson 2 |

Since quite a few people were keen to have a video call, I thought it might be useful to try to schedule one in the week following the course release to say hi and discuss how we want to approach things. I have set up a doodle poll to try and find a good time for that. We might struggle to find a time that works for everyone’s time zone, but hopefully we can make something work :crossed_fingers:


Hi all :wave: excited to be here!
Thanks Daniel for taking this initiative!

Who you are?
I’m Philo van Kemenade, based in Amsterdam. I work as a UI engineer at the Netherlands Institute for Sound and Vision on R&D into more accessible collection interfaces for end users. Here’s a sneak peek of some recent work. Previously I worked at the R&D lab of the Slovak National Gallery on the online collection platform Web Umenia and on ‘special projects’ connecting people to art in the context of exhibitions.

Why you are interested in machine learning/deep learning?
I studied AI just before deep learning became a thing and haven’t been using my machine learning knowledge very actively. I’m excited by how the scale and curation of digitised cultural collections fit the capabilities of modern machine learning techniques. I’m keen to contribute to a virtuous cycle of the AI and GLAM domains exchanging insights & expertise on technology, ethics and accessibility.

Do you already have some potential problems you are (or would like to) use machine learning for?
Datasets you are keen to work with? (either labelled or unlabelled)
I’m interested in ways to support more intuitive access to large-scale collections for end users. I see a lot of opportunity for AI & ML techniques to help here, e.g. running visual analysis on images/videos to produce high-dimensional feature vectors, and using those for similarity calculations or for visualisations via dimensionality-reduction techniques like UMAP or t-SNE.
Keen to work with the Open Images dataset of openly licensed video material.
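The similarity-calculation step described above ultimately boils down to comparing feature vectors, typically with cosine similarity. Here is a minimal sketch using invented 3-d vectors standing in for real CNN features; a real pipeline would extract embeddings with a pretrained network and use UMAP or t-SNE only for the 2-d visualisation step:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_id, vectors, k=2):
    """Rank the other items by cosine similarity to the query's vector."""
    q = vectors[query_id]
    ranked = sorted(
        ((cosine(q, v), i) for i, v in vectors.items() if i != query_id),
        reverse=True,
    )
    return [i for _, i in ranked[:k]]

# Invented 3-d "embeddings" standing in for CNN features of collection images.
vectors = {
    "img_a": [1.0, 0.1, 0.0],
    "img_b": [0.9, 0.2, 0.1],
    "img_c": [0.0, 1.0, 0.9],
}
# most_similar("img_a", vectors, k=1) -> ["img_b"]
```

With real embeddings (hundreds or thousands of dimensions, millions of items) you would use an approximate nearest-neighbour index rather than the brute-force sort shown here.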


hey! :wave:

This is really nice!

This is really interesting to me too. I’m also very curious about techniques to allow users to manipulate the navigation of these embeddings, e.g. also including metadata about the collections, to move between a more ‘traditional’ metadata-oriented view of a collection and a newer ‘data-oriented’ view.

I’m working on a mini-project using video material at the moment so I’m very keen to discuss this with other people too :slight_smile:


Awesome! I often find that a lot of the experimentation in a GLAM context stays in offline processing to generate static results (some made available in more traditional interfaces like searchable transcribed text or lists of similar records). I’m very interested in moving some of the control of parameters over to users, so that they can make their own visualisations and find what is of interest to them. To me, this feels like the next step up from ‘Generous Interfaces’. Of course, a major challenge is keeping such interfaces intuitive!

oh cool! Can you say more about this or share a link?
It feels like the machine learning community has made great strides in processing images in recent years, which I hope will progress to the medium of video :slight_smile:

Hi Daniel

Thanks for starting the study group. My name’s Susan Ford. I’m a Classicist who also has an interest in natural history, so the text I’m working on at present is Pliny the Elder’s Natural History (Latin, 1st century AD). I’m an independent scholar, not working in the GLAM sector (I hope that does not rule me out of the study group!). I want to join up Pliny’s thousands of words on plants and materia medica with relevant images, especially to make the text more useful to general readers. There are image collections labelled with Linnaean names and modern common names in an informal sense, so it’s a matter of adding Pliny’s names in the first instance.

I am aware of at least one collection of European images relevant to the project: Pl@ntNet. There are probably better ones owned by European herbaria, but that’s the one I’m using to inform myself about European forbs.

I’m OK with Python, though every time I don’t use it for 6 months some of it goes away.

I look forward to doing the course, especially with the forum members. The main problem I anticipate is figuring out what the workflow should be for the above Pliny project, which I’d like to work out as I progress through the course (rather than getting to the end and still not knowing how to identify and access relevant image datasets).

Susan

I don’t have much worth sharing yet but hopefully soon!

I was inspired by this project, which used CNNs to detect particular shot types in films, i.e. long shot, medium, close-up, etc. I am hoping to do something similar with other ‘broad’ categories which would apply to most/many types of video (like shot types). I’m particularly interested to see if there is a way of efficiently pre-training a CNN on one task of this type and then fine-tuning it on a new task on the same (or at least similar) videos.

One thing I am trying out at the moment is different approaches to scene change detection as a pre-processing step for working with film. Most methods I’ve looked at seem to use some form of threshold detection, and these seem to be fairly sensitive to changes in threshold values. I am hoping to find something that will be fairly robust across a dataset that is big enough that it will be impossible/annoying to manually validate much of it. It’s not absolutely essential that I get this part working, but it would be nice to have the option to split by scenes rather than only on time intervals.
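For anyone unfamiliar with the threshold-detection family of methods mentioned above, here is a toy histogram-difference cut detector operating on synthetic greyscale "frames" (flat lists of pixel values, which are invented for this sketch). Real tools such as PySceneDetect decode actual video frames and use more robust statistics, but the shape of the algorithm is similar:

```python
def histogram(frame, bins=4, levels=256):
    """Normalised grey-level histogram of a frame (a flat list of pixel values)."""
    hist = [0] * bins
    for px in frame:
        hist[px * bins // levels] += 1
    total = len(frame)
    return [h / total for h in hist]

def detect_cuts(frames, threshold=0.5):
    """Report frame indices where the histogram difference exceeds the threshold."""
    cuts = []
    prev = histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = histogram(frame)
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Two synthetic "shots": dark frames then bright frames -> one cut at index 2.
frames = [[10] * 100, [12] * 100, [240] * 100, [250] * 100]
# detect_cuts(frames) -> [2]
```

The sensitivity problem described above is visible even here: the result depends directly on `threshold`, which is why adaptive thresholds or content-aware detectors tend to be more robust across heterogeneous collections.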

Great to meet you Susan! :wave:

This sounds like a great project :slight_smile:

I think time spent working out how to tackle a project can often be a big chunk of the work, especially when it’s not something for which you can easily replicate approaches other people have taken directly. I think it is really useful to have a specific project/goal in mind when you follow a course like this :slight_smile:

Definitely doesn’t rule you out, very happy for you to join!

ICYMI the course has just been released :grinning: https://www.fast.ai/2020/08/21/fastai2-launch/


Who are you?

Hello, I’m Nicky Nicolson & I’m senior research leader in Biodiversity Informatics at Kew.

As an applied computer scientist working at a collections based botanical research institute, my remit is to provide more efficient data mobilisation and decision support “to curate and provide data-rich evidence from Kew’s unrivalled collections as a global asset for scientific research”.

Digitising and integrating data is expensive. I’m particularly interested in how we can use ML techniques to break down tasks into smaller sets, some of which may be automatable, whilst others can be passed to humans for investigation/verification.

In addition to scientific research at Kew, we recognise that the collections represent history (history of science, collecting & social history) and we are starting to develop more projects to investigate these aspects of the collections, integrating our holdings with those from other organisations.

Why are you interested in machine learning/deep learning?

We’ve a lot of data, but crucially also access to experts to interpret results. Scientific research with biological specimens cross-cuts a lot of different data formats (structured data, images, texts), and traditionally we’ve done a lot of laborious work to extract, integrate and annotate these; the outputs of those efforts could be re-purposed as training data for machine learning.

Do you already have some potential problems you are using (or would like to use) machine learning for?

I’ve been working on data mining the collectors (people and team entities) from messy, aggregated datasets of specimen metadata using clustering techniques, and using these to integrate data from specimens held in different institutional collections. General writeup here.
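As a very rough illustration of the general idea (not the actual method from the writeup), grouping collector name strings by a normalised key gives a crude stand-in for the clustering step. The names below are invented:

```python
from collections import defaultdict

def normalise(name: str) -> str:
    """Collapse 'Smith, J.' / 'J. Smith' style variants to one key by
    stripping punctuation, lower-casing, dropping initials and sorting tokens."""
    cleaned = "".join(c.lower() if c.isalnum() or c.isspace() else " " for c in name)
    tokens = sorted(t for t in cleaned.split() if len(t) > 1)
    return " ".join(tokens)

def cluster(names):
    """Group name strings that share a normalised key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise(name)].append(name)
    return list(groups.values())

names = ["Smith, J.", "J. Smith", "Banks, Joseph", "Joseph Banks"]
# cluster(names) -> [["Smith, J.", "J. Smith"], ["Banks, Joseph", "Joseph Banks"]]
```

Real collector disambiguation is much harder than this (abbreviated surnames, shared names, teams, transliteration), which is why proper clustering over richer features such as collecting dates and localities is needed in practice.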

Datasets you are keen to work with? (either labelled or unlabelled)

As well as using data aggregated into the Global Biodiversity Information Facility data portal (structured specimen data and specimen images), I’d be interested in learning more about techniques to deal with text data. This could range in scale from fairly abbreviated sentences describing habitat or collecting localities to the kinds of bibliographic works digitised through the Biodiversity Heritage Library. As our data is very interconnected, techniques to work with graph data structures would also be of interest.

Is there anything that you think would help you get prepared to follow the fastai course (e.g. if you are a bit rusty with Python)?

I’m happy with Python & sklearn etc, though deep learning & GPU technology is new to me.

Sorry this is a bit last minute. The best time for a video call seems to be 17.00-18.00 GMT on Wednesday (tomorrow). Since it’s short notice, I realise people might not be able to make it any more. If you do want to join, I will post a jitsi link here tomorrow morning. jitsi doesn’t require any software to be installed, so hopefully it will be possible for everyone to use regardless of device, etc.

Hi everyone! Sorry I’m a bit late to this.

I’m Barnaby Walker, and I’m a research fellow at Kew.

I work with @nickynicolson, so am also interested in using ML/DL to enrich data for research and automate some tasks. I’m specifically interested in using DL on our specimen images to make them easier to explore and to use them directly in research.

I’ve mainly used other ML methods before now, but am familiar with some deep learning, and keen to go through this course with other people.

Nice to meet you Nicky :wave: I am a big fan of Kew Gardens!

This sounds really interesting.

I think there are some good synergies here with what other people are interested in doing with catalogue/metadata, so hopefully we can keep that thread going as we work through the course.

Nice to meet you Barnaby :wave:

What kind of metadata do these images already have? Something I’m curious about (and have done some very crude experiments with) is trying to use existing metadata for self-supervised learning, i.e. using the existing metadata as a “pretext task” before training on a smaller set of other labels. This could be one way of potentially reducing the amount of training data needed for a new model, and it would leverage all the previous work that has gone into creating metadata.

@danielvs Sorry I didn’t see the doodle poll, but I can make it to the video meeting Wed 17.00 GMT
Cheers Susan

Link for the call later today (17:00 UK time): https://meet.jit.si/fastai4glams. Since I’ve been bitten by this recently: the UK is currently on British Summer Time. https://www.worldtimebuddy.com/ should tell you the time in your time zone!
