I’d like to create an Irish language corpus. Roughly how much might it cost to host a 5 GB corpus for public downloads (say on AWS)?

This AWS calculator estimates about 10 USD per month for S3 Glacier storage (assuming 500 GB/month of downloads). Does that seem to be in the right ballpark?

Given the lack of availability, I would love to create and host a decent public Irish language corpus, provided it’s not crazy expensive.

I’ve been looking around for Irish language datasets that I could use for some general NLP learning and have found it surprisingly difficult to find any decent ones. The data seems to be either copyrighted, old Irish (50-150+ years old), or legal text.

This 2019 NMT paper outlines some good sources, and this one also lists a few (old Irish and legal), but they also had to do their own scraping.

I used the Wikipedia script from the fastai dl2 course to grab Irish Wikipedia texts, but there are only about 100k lines of text.


So, how much (ballpark) might hosting such a corpus cost? On AWS? Azure? Others? Say it’s 5 GB (complete guess). Presumably there is a bandwidth cost for downloads? Assume 100 downloads a month. Can you limit the number of downloads per day/week/month? What’s the best solution for hosting a small number of files (say 1-5)?
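To make the numbers above concrete, here is a rough back-of-the-envelope sketch of the arithmetic: 5 GB downloaded 100 times is 500 GB of outbound traffic per month, on top of storing the 5 GB itself. The function name and the per-GB rates below are placeholders I made up for illustration, not quoted prices from AWS or anyone else, so plug in the current figures from the provider’s pricing page.

```python
def monthly_hosting_cost(corpus_gb, downloads_per_month,
                         storage_usd_per_gb_month, egress_usd_per_gb,
                         free_egress_gb=0.0):
    """Rough monthly cost: storage plus outbound transfer (egress).

    All rates are inputs; nothing here is a real provider price.
    Request fees and taxes are ignored.
    """
    storage_usd = corpus_gb * storage_usd_per_gb_month
    # Every full download transfers the whole corpus once.
    egress_gb = corpus_gb * downloads_per_month
    # Some providers include a free monthly egress allowance.
    billable_gb = max(egress_gb - free_egress_gb, 0.0)
    egress_usd = billable_gb * egress_usd_per_gb
    return {
        "egress_gb": egress_gb,
        "storage_usd": storage_usd,
        "egress_usd": egress_usd,
        "total_usd": storage_usd + egress_usd,
    }


if __name__ == "__main__":
    # Placeholder rates (assumptions, NOT current prices).
    estimate = monthly_hosting_cost(
        corpus_gb=5,
        downloads_per_month=100,
        storage_usd_per_gb_month=0.02,
        egress_usd_per_gb=0.09,
    )
    for name, value in estimate.items():
        print(f"{name}: {value:.2f}")
```

With rates of this shape, the download bandwidth dominates and the storage itself is almost negligible, so the number of downloads is the figure worth estimating carefully.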

An alternative would be to host it on Kaggle, which I would probably also do, but I’d also prefer to not have to rely on them.

Any help or pointers would be appreciated!

One suggestion: if you want to offer public downloads, I think you’d save money by using a caching layer in addition to S3, something like CloudFront.

Nice, I’ll check it out, thanks!