TL;DR
I’d like to create an Irish-language corpus. Roughly how much might it cost to host a 5 GB corpus for public download (say, on AWS)?
This AWS calculator estimates 10 USD per month for S3 Glacier storage (assuming 500 GB/month of downloads). Does that seem to be in the right ballpark?
Irish Corpus
Given the lack of available data, I would love to create and host a decent public Irish-language corpus, provided it’s not crazy expensive.
I’ve been looking around for Irish-language datasets that I could use for some general NLP learning and have found it surprisingly difficult to find any decent ones. The data seems to be either copyrighted, older Irish text (50-150+ years old), or legal text.
This 2019 NMT paper outlines some good sources, and this one also lists a few (old Irish and legal), but the authors also had to do their own scraping.
I used the Wikipedia script from the fastai dl2 course to grab Irish Wikipedia texts, but there are only about 100k lines of text.
Cost?
So, how much (ballpark) might hosting such a corpus cost? On AWS? Azure? Others? Say it’s 5 GB (a complete guess). Presumably there is a bandwidth cost for downloads? Assume 100 downloads a month. Can you limit the number of downloaders per day/week/month? What’s the best solution for hosting a small number of files (say 1-5)?
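To make the arithmetic behind the question explicit, here is a minimal sketch of how I’d rough out the monthly bill: storage for the corpus plus outbound transfer for the downloads. The function name `estimate_monthly_cost` and the two per-GB prices are my own placeholders, not quotes from AWS or any other provider, so swap in the numbers from whichever pricing page you are comparing.

```python
def estimate_monthly_cost(corpus_gb, downloads_per_month,
                          storage_price_per_gb, egress_price_per_gb):
    """Rough monthly estimate: storage cost plus outbound transfer cost.

    Ignores request fees, free tiers, and retrieval charges, all of
    which vary by provider and storage class.
    """
    # Each download transfers the whole corpus once.
    transfer_gb = corpus_gb * downloads_per_month
    storage_usd = corpus_gb * storage_price_per_gb
    egress_usd = transfer_gb * egress_price_per_gb
    return {
        "transfer_gb": transfer_gb,
        "storage_usd": storage_usd,
        "egress_usd": egress_usd,
        "total_usd": storage_usd + egress_usd,
    }


# 5 GB corpus, 100 downloads a month (the guesses from this post).
# The prices below are ASSUMED placeholders in USD per GB, not real quotes.
estimate = estimate_monthly_cost(
    corpus_gb=5,
    downloads_per_month=100,
    storage_price_per_gb=0.02,
    egress_price_per_gb=0.09,
)
print(estimate)
```

With these guesses, 5 GB at 100 downloads a month works out to 500 GB of transfer, which matches the figure I put into the calculator. Whatever prices are plugged in, the transfer term is 100 times larger than the storage term, so the bandwidth price is probably what matters most.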
An alternative would be to host it on Kaggle, which I would probably do as well, but I’d prefer not to have to rely on them.
Any help or pointers would be appreciated!