Floyd - alternative to AWS P2 instance

Has anyone figured out how to download data from Kaggle to FloydHub without putting your Kaggle password in a plaintext script that’s uploaded to their servers?

I played around with Floyd just to check it out and found the system to be far too much work, especially when dealing with datasets. It’s very convoluted.

Hey! This is Naren, one of the co-founders of FloydHub. If there is any specific dataset that would be useful for the fast.ai folks, we would be happy to upload it ourselves and make it publicly available for all users. Let us know!

Hey @dradientgescent. Most of the reviews I’ve seen over the past two weeks have been overwhelmingly positive, and I must say, I do like the idea of paying only for my actual compute time, rather than for all the time the machine is on but not necessarily doing much while I fiddle with configuration or other random crap. For my current project I’ve been considering jumping ship from AWS altogether to FloydHub (of course, I have to get working with this dataset first before making a decision 🙂). What specific issues did you face in using FloydHub, and how does your AWS setup (or whatever solution you use) compare?

@narenst I think that dogs vs cats redux would be beneficial to this community, because it is the only ‘mandatory’ data set for this course. Having some subset of the recommended datasets from the course wiki would be really beneficial too. I’d personally like to work with the distracted driver and maybe galaxy zoo datasets. In a little while, I also want to take on the data science bowl which is a massive data set (~67 GB). I do think that good Kaggle integration could be a killer feature for you, if you were to go down that route.

Integration with Kaggle would be a killer feature indeed! Great idea

I don’t use AWS, but I was messing with the Floyd trial just to see how it works. The way datasets are uploaded is a huge pain. You run a job via the CLI and it outputs a job ID. You feed that into your next job using the --data parameter, and the data is then available in the /input folder. At that point, I ran a job to extract it (unzip dogscats.zip, as I was just testing something simple) and it took 40 minutes to unzip. When it was done, I had to take the output job ID and feed it into a new process. But it didn’t work so smoothly, as the files were not there. You end up with tons of jobs with numbers after them (1, 2, 3, 4) and you have to do a bunch of clicks to figure out what is what. It’s super confusing and convoluted. In the end, I lost interest and gave up.

Hey, I have added the dogs vs cats redux data set to Floyd. It is publicly accessible for all users. It contains the unzipped contents of the training + test data. You can find the data id in the public dataset page. Let me know if this works for you.

We are always looking to add datasets that will be useful to all our users. So thanks for sharing the recommended datasets list - we will try to add them soon. We will keep updating the page as we add new datasets.

And yes, we are planning to talk to the folks at Kaggle and see how we can integrate with them 🙂 (but I’m not sure how much the news from yesterday affects this).

Yay, thanks for your responsiveness @narenst!

I followed the directions here (thanks @ylguo0716!), cloning the repo and trying the floyd run command in the README, but I received a “404 page not found” when I visited the Jupyter Notebook URL provided by the Floyd CLI (after waiting a while, as the FAQ says).

I checked the logs (floyd logs _____) and the Jupyter Notebook was running at
http://[all ip addresses on your system]:8888/hash/
– has anyone else run into this?

Hi Naren & Sai (at FloydHub)

I echo @dradientgescent’s sentiments here, unfortunately. I want to use Floyd more than AWS, but there are usability issues, especially around datasets.

  1. There’s too much complication in uploading data to Floyd.
    – I tried @ylguo0716’s repo: downloaded the 861MB dogs-vs-cats dataset locally, then tried to upload it to Floyd. Got connection errors.
    – The dogs-vs-cats public dataset on Floyd is, unfortunately, not in the format the scripts/notebooks expect.
    – All in all, I ended up spending 3-6 hours tinkering with this.
    – Gave up.

Unfortunately, I can’t afford to spend much more time on this, so I’m giving up for now.

Errors:

floyd data upload
Creating data source. Total upload size: 821.6MiB
Uploading files …
Traceback (most recent call last):
  File "/Users/aa/Developer/miniconda/envs/py35/bin/floyd", line 11, in <module>
    sys.exit(cli())
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/floyd/cli/data.py", line 57, in upload
    data_id = DataClient().create(data)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/floyd/client/data.py", line 33, in create
    timeout=3600)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/floyd/client/base.py", line 49, in request
    self.check_response_status(response)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/floyd/client/base.py", line 73, in check_response_status
    response.raise_for_status()
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/requests/models.py", line 909, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://www.floydhub.com/api/v1/modules/

@atul What I did was use commands (wget, unzip, and rm) in the terminal of the Jupyter notebook to directly download the zip file.


I haven’t tried it yet, but from this post it seems we can use wget to download any dataset from Kaggle.
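
Presumably that works by reusing the login cookies from a browser session where you are already signed in to Kaggle, which would also sidestep the plaintext-password problem raised at the top of this thread. Here is a rough Python sketch of the same idea; the download URL pattern and the cookie name are guesses:

import requests

# Guess at the competition download URL; adjust to the file you want.
URL = "https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/download/train.zip"
# Copy the session cookie value from a logged-in browser; the name is a guess,
# but the point is that no username/password ever appears in the script.
COOKIES = {"kaggle_session": "PASTE_VALUE_FROM_BROWSER"}

resp = requests.get(URL, cookies=COOKIES, stream=True)
resp.raise_for_status()
with open("train.zip", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
        f.write(chunk)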

I’m in a Facebook group for AI where a few of the people have used Floyd and said good things about it.

Hi,

I was using FloydHub for all my experiments with Cats & Dogs (the idea of paying only for what you use is great).

Here is my workflow:

  1. I use Python with the floyd CLI (I have not tried Jupyter Notebooks on FloydHub)
  2. downloaded the data from Kaggle
  3. ran some scripts to prepare the data (splitting it into classes)
  4. uploaded the prepared data to Google Drive (got a shareable link)
  5. initialized the FloydHub project:

floyd init redux

  6. ran a script for data preparation on FloydHub (it downloads the data from Google Drive and unpacks it there; see the prepare.py sketch below):

floyd run "python prepare.py"

  7. checked whether the job had finished (using the job ID):

floyd logs Ars6wpeuZceA9fVpd3qBZS

  8. checked the data ID that contains the unpacked data:

floyd data status

  9. used that data ID for the training phase:

floyd run --data X4ctVfiMR3c9Amy9zURRuk --env tensorflow-1.0 --gpu "python train.py"

  10. the output from this job contains the trained model

During the run you have access to the /input (read-only) and /output directories, and by giving the --data switch, the output from a previous run (here, the data-preparation run) will be available as the /input directory in the current run.

The Python scripts just contain the code from the “Cats & Dogs Redux” notebook from lesson 1.
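
For reference, here is a minimal sketch of what such a prepare.py could look like. This is not the actual script from the post above: the Google Drive URL is a placeholder (large Drive files may need an extra confirm-token step); only the /output behavior follows what was described.

import urllib.request
import zipfile

# Placeholder: the direct-download form of the shared Google Drive link.
DATA_URL = "https://drive.google.com/uc?export=download&id=YOUR_FILE_ID"
ZIP_PATH = "/tmp/data.zip"

# Download the archive onto the FloydHub machine...
urllib.request.urlretrieve(DATA_URL, ZIP_PATH)

# ...and unpack it into /output: anything written there is kept when the
# job ends and can be mounted into a later run with the --data switch.
with zipfile.ZipFile(ZIP_PATH) as zf:
    zf.extractall("/output")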

I know exactly what you mean. It costs me about 2000 INR a month.

There is a much more affordable (almost 80% less expensive) option: using p2 Amazon Spot Instances. It takes a little extra effort to set up, but the process is well documented in the fast.ai wiki: http://wiki.fast.ai/index.php/AWS_Spot_instances

I also tend to spend a lot of time reading through the notebook and doing my research before turning my p2 machine on. 🙂

There are issues uploading large datasets. You can use a script to download the data instead, but what I found works best is preprocessing the data locally. If you resize images to the actual size you are going to use in your network (224x224, or maybe 448x448 for data-augmentation zooming), the dataset becomes much, much smaller and you can easily upload it to FloydHub.
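
A minimal sketch of that local preprocessing step, assuming Pillow is installed and using hypothetical source/destination folders:

import os
from PIL import Image

SRC, DST = "data/train", "data/train_small"  # hypothetical paths
SIZE = (224, 224)

for root, _, files in os.walk(SRC):
    for name in files:
        if not name.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        out_dir = os.path.join(DST, os.path.relpath(root, SRC))
        os.makedirs(out_dir, exist_ok=True)
        # Re-encode at the target size; the resized set is a fraction
        # of the original, so uploading it to FloydHub is quick.
        img = Image.open(os.path.join(root, name)).convert("RGB")
        img.resize(SIZE).save(os.path.join(out_dir, name))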

What I like most is that you don’t have to worry about shutting things down like you do with AWS. When you’re done, you’re done.

What I don’t like is the way data and job output are managed. It’s not easy to remove old jobs. It would also be nice to be able to combine datasets.

I also notice that the times reported in the logs are not very reliable. According to the logs, nothing is happening at the beginning of a job, but in fact something is, because doing less work shortens this “startup lag”.

I always download (and unpack) the data first and then use the output in subsequent runs. I was not able to upload the training data at job submission, even after pre-processing.
To remove the old data you can just do:

floyd data delete DATA_ID

I’d like to have a way to mount multiple outputs from other jobs as inputs to a new one.

How can I access TensorBoard on FloydHub?
I have problems navigating to localhost:6006 (http://172.17.0.5:6006) to view TensorBoard.


Thank you @kijes, I’ll try that.

I only used TensorBoard for offline viewing (after the job has finished).
I used the TensorBoard callback from Keras to write its data to the /output folder; then you can download the data and run TensorBoard locally:

tensorboard --logdir=path/to/log-directory

This is not ideal if you want real-time visualization, but you can always check the logs of the running job for some statistics. Maybe there is a better way.
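
For reference, the callback wiring might look roughly like this; the model and data here are toy stand-ins, just to show where the log directory goes:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import TensorBoard

# Toy model and data, purely to demonstrate the callback wiring.
model = Sequential([Dense(2, activation="softmax", input_shape=(10,))])
model.compile(optimizer="sgd", loss="categorical_crossentropy")

X = np.random.rand(32, 10)
y = np.eye(2)[np.random.randint(0, 2, size=32)]

# Log to /output so the files survive the job and can be downloaded afterwards.
model.fit(X, y, epochs=2, callbacks=[TensorBoard(log_dir="/output/logs")])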

I am new to this course and to machine learning. How can I set up my local machine with a GPU to start the course? Is there any way to do it?

A bit late to the party, but one of the text-embedding datasets would certainly be nice, like the GloVe word vectors…