GLAMs (Galleries, Libraries, Archives and Museums) fastai study group

Taking inspiration from the time-series and geospatial study groups, I wanted to start a new study group/thread with a focus on people working in GLAMs (Galleries, Libraries, Archives and Museums).

There is growing interest in using AI in GLAM settings, for example the Machine Learning + Libraries report and the Fantastic Futures conferences. There is a range of potential use cases for deep learning in GLAMs: some focus on ‘day-to-day’ work, while others focus more on using GLAM collections in combination with ML for research purposes.

I sent a tweet about starting a GLAM-focused study group for the intro to deep learning course a couple of weeks ago and was pleased to see that other people were keen to do the course too. My hope is to use this thread as a place for discussion for this informal study group.

Until v4 of the course is launched (which should be in the next couple of weeks :crossed_fingers:), it might be nice to use this thread to introduce ourselves, for people keen to follow this study group (or who are interested in GLAM applications of AI):

  • Who are you?
  • Why are you interested in machine learning/deep learning?
  • Do you already have potential problems you are using (or would like to use) machine learning for?
  • Which datasets are you keen to work with (labelled or unlabelled)?
  • Is there anything you think would help you prepare to follow the fastai course (e.g. you are a bit rusty with Python)?

I thought it would be easiest to keep the study group mainly asynchronous, but I’m happy to organise some video calls if people are also keen to chat that way. Please indicate in the poll whether this is something you would like to do. If a few people express interest and the time zones work out, I can set something up.

  • I would love to chat via video chat
  • I’m happy with an asynchronous only study group


Similarly, I think there are a lot of advantages to having discussions in a forum, but I understand there are also some benefits to having a more private channel for discussion. If you would prefer to also have a private channel, I can set up a Slack (or similar) to run alongside the forum discussions:

  • I am happy to keep chats on the forum
  • I’d also like a more private communication channel (mailing list, Slack channel etc)


Hoping to see some other people interested in AI and GLAMs in this thread :slight_smile:

My intro! My name is Daniel, and I currently work at the British Library as a digital curator on a digital history research project. I did v3 of the fastai course, learned a ton and really enjoyed it. I also wanted some company whilst doing the course this time around :slight_smile:

I am really excited about the opportunities deep learning offers libraries, both for researching collections and for supporting ‘day-to-day’ activities. I feel there is a good deal of scope for ‘narrow models’ in libraries: by this I mean training models to do fairly specific tasks on a fairly specific collection or collection type. This is one of the reasons I particularly love the fastai course/software, with its focus on transfer learning and data augmentation. I’m also interested in how to deploy AI in libraries in a responsible way, and in how datasets/models can be shared effectively between institutions.
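
As a flavour of what I mean by transfer learning on a narrow task, here is a minimal sketch in the style of the new fastai API (the image folder and labels are entirely made up):

```python
from fastai.vision.all import *

# Hypothetical folder of page images, one sub-folder per label
# (e.g. 'maps' vs 'text'), standing in for a real collection
path = Path('data/collection_pages')

dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, item_tfms=Resize(224),
    batch_tfms=aug_transforms())  # fastai's built-in data augmentation

# Transfer learning: start from an ImageNet-pretrained ResNet and
# fine-tune it on the (much smaller) collection-specific dataset
learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(3)
```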

I have a few datasets I’d like to work with when going through the upcoming course; I’m particularly keen to do more with tabular data.

I am also keen to contribute to creating new labelled datasets for training. I’ve found it can be hard to know whether you are achieving a good result with some GLAM data because there isn’t much of a point of comparison. If other people are interested in this too, let me know!


Hi Daniel! Thanks for breaking the ice. My name is José Eduardo and I am a Machine Learning Engineer at Europeana. I believe this initiative is a great way to share ideas and resources and to build a community around ML applications in GLAM.

I am currently working on the automatic enrichment of Cultural Heritage Objects, in particular applying deep learning models for object detection and image captioning. We are facing a shortage of labeled data as well, and are therefore considering crowdsourcing and niche-sourcing. It’s not an easy problem though, because in most cases expert knowledge is needed to label the data, which can be extremely time consuming and difficult to obtain.

We are planning on sharing datasets and models so the rest of the community can also benefit from our efforts. It would be great for us to get some feedback from other ML practitioners on improving the accessibility and usability of these datasets. We believe we will have a couple of pilot datasets in September; let me know if this sounds interesting :slight_smile:


Great to meet you, José :wave:

This sounds really interesting! Are you using an existing metadata scheme as your target labels or have the labels been developed specifically with deep learning in mind? I think some existing metadata schemes can map quite well to ‘typical’ machine learning labels but others rely on a lot of external knowledge/information or can be very sparse.

That sounds great :slight_smile: I would be really keen to hear how you are planning on sharing the data and models. I think the model part could be particularly tricky to get right; trying to get someone else’s models working from a GitHub repo can sometimes be a lot of “fun”! Are you thinking of sharing the model weights, or also creating Docker containers or similar for running the models for inference?

Europeana’s metadata is stored following the Europeana Data Model (EDM), although we don’t feed our models this data structure directly, so a lot of data cleaning and processing needs to be done as well!

Data platforms such as Kaggle and Zenodo are being considered, together with European data portals like the Social Sciences and Humanities Open Cloud. For sharing models we are considering PyTorch Hub and TensorFlow Hub, although this is still at a very early stage :slight_smile: I believe sharing the model weights is a must in either case. The advantage of Dockerizing the model is that it will be compatible with almost any infrastructure.
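
For anyone unfamiliar with the hub approach, consuming a published model looks roughly like this; the repo and model names below are just stand-ins for wherever our models might eventually live:

```python
import torch

# Pulling a published model from PyTorch Hub; 'pytorch/vision' and
# 'resnet18' are placeholders, not our actual (future) release
model = torch.hub.load('pytorch/vision:v0.6.0', 'resnet18', pretrained=True)
model.eval()  # switch to inference mode

# Sharing raw weights alone would instead look something like:
# torch.save(model.state_dict(), 'enrichment_model.pth')
# model.load_state_dict(torch.load('enrichment_model.pth'))
```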

Will eagerly look out for those! It’s probably too early, but have you also considered publishing model cards to accompany the model release?


Indeed it is early, but model cards seem like a great idea! Thanks for the suggestion. However, I believe our main focus will be on releasing data and tutorials on how to work with this data, rather than on sharing the already-trained models themselves. We aim to build an active community around these resources and we hope this overlaps with some of the goals of this group.

Hi there!
Thanks for the initiative, I am really happy to have found it. My name is also José (first name), Calvo Tello (family name); I work at the State and University Library in Göttingen, Germany. I have a background in Spanish literature and linguistics, and in January I submitted my PhD in Digital Humanities, in which I applied and evaluated machine learning methods on a literary corpus. Since May I have been working at the library as a researcher and librarian. I have used “classic” machine learning algorithms such as Support Vector Machines, Logistic Regression and Random Forests, but I have no real experience with deep learning, only a workshop. Since my daily work will have to do with the library catalogs, I am very interested in the group.

It would help me if we met with some regularity (maybe once or twice a month), and also shared scripts or even Jupyter notebooks.
Thanks and regards!

PS: José Eduardo, I guess you come from a Spanish-speaking country? Last year I was at a conference where there were two Spanish guys: we were both Josés :smile:


Nice to meet you José :wave:

That sounds like fun :grinning: What sort of corpus did you work with? Was it all contemporary or were there historic materials too?

I’d be happy to organise that; maybe we should wait until the course is officially announced and then arrange a meeting where those who want to join can introduce themselves.

I also think this would be really great, especially example notebooks which work with ‘real’ data. Often the data you want to work with is a little bit messier and more complicated than the tidier datasets used in example tutorials. I plan to add some notebooks to this repository for each lesson in the course and would be very happy for other people to add notebooks there too. Is there a particular type of data you are going to be working with? Will it also be literary data?


Hi Daniel!

That sounds like fun :grinning: What sort of corpus did you work with? Was it all contemporary or were there historic materials too?

I collected my own corpus of novels published in Spain between 1880 and 1939. In the end, I gathered 358 novels in XML-TEI and manually annotated several fields of literary metadata. A section is already available here: https://github.com/cligs/textbox/tree/master/spanish/novela-espanola

Is there a particular type of data you are going to be working with? Will it also be literary data?

I want to work with bibliographic data from libraries. For now I want to learn the basics of deep learning applied to such information. Once I feel comfortable, I am especially interested in trying to homogenize the thematic labels and categories that were assigned over several decades. An example: a given library used one taxonomy of categories from its beginning up to the 90s; for ten years, no categories were assigned; since the 2000s, a new taxonomy has been applied. To what degree can these labels be homogenized automatically?

Thanks for sharing :slight_smile:

This sounds really interesting. Do you think your approach will be to try to directly link these labels, or to do it via the items they describe?

Hi All,

Thanks Daniel for organising this study group and I look forward to meeting you all. In answer to your questions:

  • Who are you?
    I am Glen Robson and I work as the IIIF Technical Coordinator for the IIIF Consortium (https://iiif.io).

  • Why are you interested in machine learning/deep learning?
    I’m really interested in how machine learning can be used with GLAM institutions and in particular how/if the IIIF standards can help.

  • Do you already have potential problems you are using (or would like to use) machine learning for?
    I might :slight_smile: but the first thing I want to learn is what problems are suitable for machine learning.

  • Which datasets are you keen to work with (labelled or unlabelled)?
    Again, I want to learn what makes a good dataset, but I’m hoping some of the open IIIF resources could be used. I also chair the IIIF Newspapers group, so I would be interested in applications to large text resources. Finally, I used to work at the National Library of Wales and I’d be interested to see if their crowdsourced data could be used: https://www.library.wales/collections/activities/research/nlw-data/.

Thanks

Glen


This sounds really interesting. Do you think your approach will be to try to directly link these labels, or to do it via the items they describe?

Good question, I don’t know yet. Actually, the thing is even more problematic, because many libraries are working together on the same database, but with many different taxonomies and vocabularies. I think the information is there, only expressed through many different labels. So I was actually thinking of using the original labels from the libraries in the different decades to try to predict a controlled set of labels for the entire period. But I am still trying to figure it out.
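
Very roughly, I imagine the baseline looking something like this (the CSV and its column names are invented for illustration; a classic classifier before trying deep learning):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical catalogue export: free-text title plus the historic label,
# with 'controlled_label' filled in only for a manually mapped subset
records = pd.read_csv('catalogue.csv')
labelled = records.dropna(subset=['controlled_label'])

X = labelled['title'] + ' ' + labelled['historic_label']
X_train, X_test, y_train, y_test = train_test_split(
    X, labelled['controlled_label'], test_size=0.2, random_state=0)

# A classic baseline before reaching for deep learning
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```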

Hi all,

It is good to be here, and many thanks for the initiative!

  • Who are you?
    I am Péter Király, a researcher and software developer at the Göttingen campus computing facility. Previously I worked at different LAM institutions (including Europeana) as a software developer. I am also an editor of the Code4Lib Journal, so if you plan to publish your results, please consider it as well.
  • Why are you interested in machine learning/deep learning?
    My research question is how we can decide whether a given metadata record is good or bad. For this I am working with data science techniques, including some (unsupervised) machine learning algorithms (a toy sketch of the kind of thing I mean follows this list). For me, working with Big Data makes the problems even harder to solve, since lots of data science/ML techniques require special hardware resources which I do not have access to.
  • Do you already have potential problems you are using (or would like to use) machine learning for?
    Just to name a few: pattern recognition, such as distinguishing between metadata values created for machines and those created for people; finding similar records; and checking whether all the important entities in a record are under authority control.
  • Which datasets are you keen to work with (labelled or unlabelled)?
    I have worked with the Europeana dataset (I made it downloadable at rnd-2.eanadev.org/europeana-qa/download.php?version=v2020-06) and several libraries’ full MARC catalogues (github.com/pkiraly/metadata-qa-marc#datasources). These are unlabelled datasets in special metadata formats.
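
To make the record-quality idea a bit more concrete, here is a toy, hand-wavy sketch; the per-record features are invented purely for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy unsupervised quality check: describe each record with a few
# invented features (number of fields, average value length, share of
# values under authority control) and flag outliers for human review
features = np.array([
    [12, 34.0, 0.9],
    [11, 30.5, 0.8],
    [13, 28.0, 0.7],
    [ 2,  3.0, 0.0],   # a suspiciously sparse record
])
flags = IsolationForest(contamination=0.25, random_state=0).fit_predict(features)
print(flags)  # -1 marks records worth a closer look
```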

I am also interested in this! I am still very much a noob when it comes to all of the details of IIIF but would like to learn more about it and the intersection with machine learning.

This sounds great too! Maybe once we get a bit further into the course there will be some scope to play with/think about how IIIF could fit into a machine learning workflow in a way that works well within a GLAM context.
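
One thing I could imagine (purely a sketch, with a made-up server and identifier) is building training sets directly from the IIIF Image API, since it lets you request resized images by URL alone:

```python
import requests

# IIIF Image API URL pattern:
# {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
base = 'https://example.org/iiif'  # hypothetical IIIF image server
identifier = 'ms-123_page-001'     # hypothetical image identifier

# Request the full image resized to 224px wide - handy for a CNN
url = f'{base}/{identifier}/full/224,/0/default.jpg'
with open('page-001.jpg', 'wb') as f:
    f.write(requests.get(url).content)
```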

I would be up for exploring the crowdsourced data from NLW. I think the intersection between crowdsourcing and machine learning is really interesting, and also something which is fairly unique to GLAMs (and citizen science projects). The motivations people have for producing this data are quite different from how datasets are often produced in industry, and I think that has implications for where machine learning can and should be used in relation to crowdsourced data, and for how the two might meaningfully interact.

Nice to meet you :wave:

As someone who has likely produced some dubious metadata records :grimacing: this is definitely interesting to me!

I have followed the mailing list for years and will keep it in mind for sharing results :slight_smile:


Hey, I’m so excited to be a part of this!

Who am I?
Silvia Gutiérrez, a Digital Humanities Librarian in Mexico City.

Interest in machine learning / deep learning
I have applied Association Rule Learning methods to our book collection’s metadata (> half a million records). The objective was to understand our Subject Headings better and see if we could create a recommendation system based on those terms. We couldn’t. But we learned this (a toy sketch of the kind of analysis follows the list below):

  1. Library data can be super messy, and a standardized thesaurus does not ensure uniformity (I’ve seen the same problem in national libraries’ records, including the British Library’s :scream:)
  2. There are no out-of-the-box solutions, and the AI-GLAM projects I know of are dealing with very interesting but perhaps too-specific research questions.
  3. I would like to see how these super powerful methods can help librarians deal with day-to-day issues
  4. A more LIS-nerdy result is that both the NYPL’s Subject Headings graph and ours have the subject “History” as a central node, and the second most central is the name of each collection’s country (USA for the NYPL and Mexico for ours)
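
For anyone curious, a toy version of this kind of analysis could look like the following (using mlxtend; the subject headings are invented):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Toy version of the analysis: treat each record's subject headings as a
# "transaction" and mine co-occurrence rules between headings
transactions = [
    ['History', 'Mexico', 'Politics'],
    ['History', 'Mexico'],
    ['Poetry', 'Mexico'],
    ['History', 'Politics'],
]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric='confidence', min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])
```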

Problems / datasets you are keen to work with?
Cleaning my library’s catalogue (reasons above :upside_down_face:)

Things that would help
I really like to learn from seeing others’ examples, so +1 to @JoseCalvo’s idea of sharing our scripts. Perhaps there could be a global index like the one the Spanish-language #30daysofgraphics initiative created for the R scripts the community made.

:wave: Great to meet you, Silvia

This seems to be a common trend :grimacing: Do you think the messiness of the data was the main reason you couldn’t build a recommendation system on those terms? One of the really interesting things about library (meta)data is that it ranges from fairly structured ‘tabular’ data to free text/prose. As a result, one of the things I’m particularly interested in in a GLAM setting is deep learning models that take multiple types of input, i.e. both structured tabular data and free text.
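
Hand-waving a lot, the kind of mixed-input model I have in mind looks roughly like this in plain PyTorch (all sizes and names are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# Sketch of a two-branch model: one branch embeds free-text tokens,
# another encodes tabular features; the branches are concatenated
# before a shared classification head
class MixedInputModel(nn.Module):
    def __init__(self, vocab_size=5000, n_tabular=10, n_classes=20):
        super().__init__()
        self.text = nn.EmbeddingBag(vocab_size, 64)  # bag-of-words text encoder
        self.tab = nn.Sequential(nn.Linear(n_tabular, 32), nn.ReLU())
        self.head = nn.Linear(64 + 32, n_classes)

    def forward(self, tokens, tabular):
        combined = torch.cat([self.text(tokens), self.tab(tabular)], dim=1)
        return self.head(combined)

model = MixedInputModel()
tokens = torch.randint(0, 5000, (8, 40))  # batch of 8 tokenised records
tabular = torch.randn(8, 10)              # matching tabular features
print(model(tokens, tabular).shape)       # torch.Size([8, 20])
```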

Yes, I really hope that this course will help with this. I think there is a danger that if GLAM institutions rely too heavily on external collaborators or commercial providers to implement machine learning techniques, then the agenda is set too much by others (this is not to say those projects aren’t useful too!).

I would be very keen on that too :slight_smile:


You have probably all seen that the course is going to be launched on August 21st!

Since I don’t want to impose a strict schedule on the study group, people can obviously follow the course at whatever speed they prefer. My aim will be to watch one lesson every two weeks and spend the intervening time working through the notebooks, applying them to new data, etc. I plan to post comments, questions, etc. as I go along (and my notebooks in a GitHub repo). There will probably also be a bit of time needed at the start to make sure a GPU is working and to decide on server set-up; the forums here are a great source of help on that front.

The schedule for the first few weeks will look like this:

| Week beginning | Lesson |
| --- | --- |
| 24 August | Lesson 1 + set-up |
| 7 September | Lesson 2 |

Since quite a few people were keen to have a video call, I thought it might be useful to schedule one in the week following the course release to say hi and discuss how we want to approach things. I have set up a Doodle poll to try to find a good time for that. We might struggle to find a time that works for everyone’s time zone, but hopefully we can make something work :crossed_fingers:


Hi all :wave: excited to be here!
Thanks Daniel for taking this initiative!

Who are you?
I’m Philo van Kemenade, based in Amsterdam. I work as a UI engineer at the Netherlands Institute for Sound and Vision, doing R&D into more accessible collection interfaces for end users. Here’s a sneak peek of some recent work. Previously I worked at the R&D lab of the Slovak National Gallery on the online collection platform Web Umenia and on ‘special projects’ connecting people to art in the context of exhibitions.

Why are you interested in machine learning/deep learning?
I studied AI just before deep learning became a thing and haven’t been using my machine learning knowledge very actively since. I’m excited by how the scale and curation of digitised cultural collections fit the capabilities of modern machine learning techniques, and I’m keen to contribute to a virtuous cycle of the AI and GLAM domains exchanging insights and expertise on technology, ethics and accessibility.

Do you already have potential problems you are using (or would like to use) machine learning for?
Which datasets are you keen to work with (labelled or unlabelled)?
I’m interested in ways to support more intuitive access to large-scale collections for end users, and I see a lot of opportunity for AI and ML techniques to help here: e.g. running visual analysis on images/videos to produce high-dimensional feature vectors, then using those for similarity calculations or for visualisations via dimensionality reduction techniques like UMAP or t-SNE (a rough sketch of this pipeline follows below).
Keen to work with the Open Images dataset of openly licensed video material.
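
In code, the pipeline I have in mind looks something like this; the model choice and parameters are placeholders rather than a worked-out approach:

```python
import torch
import torchvision.models as models
import umap  # from the umap-learn package

# A pretrained CNN as a generic feature extractor
model = models.resnet18(pretrained=True)
model.fc = torch.nn.Identity()  # drop the classifier, keep 512-d features
model.eval()

frames = torch.randn(100, 3, 224, 224)  # stand-in for collection images/frames
with torch.no_grad():
    features = model(frames).numpy()    # (100, 512) feature vectors

# Reduce to 2-D for plotting; cosine distance suits these embeddings
embedding = umap.UMAP(n_components=2, metric='cosine').fit_transform(features)
```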
