Storing data for machine learning on AWS Deep Learning AMIs

I have an EC2 instance running an Elasticsearch database. This instance runs all of the time. In fact, I have many different AWS EC2 instances, each with its own data set.

What I need is to combine the data from all of these EC2 instances into one big database for a machine learning experiment.

I am looking into Amazon's Deep Learning services and I have a question about what the data pipeline looks like. S3 says that it can store up to 5 PB of data at no extra charge. The AWS Deep Learning AMI runs as an EC2 instance of some kind. Suppose I put one giant data set into S3 (by having each of my existing EC2 instances push its data there). Can I then do batch training from my Deep Learning EC2 instance and pull the data from a CSV file stored in S3? Or do I have to keep that EC2 instance running at all times and keep the CSV file stored on the instance itself?
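To make the first option concrete, this is roughly what I imagine the training side doing. It is only a sketch under my own assumptions: the bucket and key names are placeholders, `iter_csv_batches` and `iter_s3_csv_batches` are helper names I made up, and I am assuming the CSV has a header row and no line breaks inside quoted fields (reading line by line would split those).

```python
import csv
from typing import Dict, Iterable, Iterator, List


def iter_csv_batches(
    lines: Iterable[bytes], batch_size: int
) -> Iterator[List[Dict[str, str]]]:
    """Group CSV rows (given as raw byte lines) into lists of at most batch_size."""
    # csv.DictReader accepts any iterable of strings; the first line is the header.
    reader = csv.DictReader(line.decode("utf-8") for line in lines)
    batch: List[Dict[str, str]] = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def iter_s3_csv_batches(
    bucket: str, key: str, batch_size: int = 1024
) -> Iterator[List[Dict[str, str]]]:
    """Stream a CSV object from S3 and yield it in batches (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3 installed

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    # StreamingBody.iter_lines() yields the object line by line without
    # downloading the whole file first.
    yield from iter_csv_batches(body.iter_lines(), batch_size)
```

The training loop would then be something like `for batch in iter_s3_csv_batches("my-bucket", "training/data.csv"): ...`, where both names are hypothetical. The instance would only need to be running while training is in progress.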

Generally, if you use the Deep Learning AMI, where do you store your data? If you want to do some feature engineering, do you simply change the data stored in S3 by running the feature transformations from an EC2 instance?
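For the feature engineering part, what I picture is something like the sketch below: read the raw CSV from S3, compute the new columns on the EC2 instance, and write the result back as a new object. As far as I know an S3 object cannot be edited in place, only replaced as a whole, so I write to a separate key. The column names (`width`, `height`, `area`), the function names, and the bucket and key names are all invented for illustration.

```python
import csv
import io


def add_features(csv_bytes: bytes) -> bytes:
    """Return a copy of the CSV with an extra 'area' column (width * height)."""
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    fieldnames = list(reader.fieldnames or []) + ["area"]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Hypothetical derived feature; a real pipeline would do its own transforms here.
        row["area"] = float(row["width"]) * float(row["height"])
        writer.writerow(row)
    return out.getvalue().encode("utf-8")


def build_feature_file(bucket: str, raw_key: str, features_key: str) -> None:
    """Read the raw CSV from S3, add features, and store the result under a new key."""
    import boto3  # needs boto3 and AWS credentials; imported lazily

    s3 = boto3.client("s3")
    raw = s3.get_object(Bucket=bucket, Key=raw_key)["Body"].read()
    s3.put_object(Bucket=bucket, Key=features_key, Body=add_features(raw))
```

This loads the whole file into memory, so it only fits files that are small relative to the instance's RAM. Keeping the raw file untouched and writing features to a new key would let me redo the feature step later without re-exporting from Elasticsearch.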