Spark for deep learning

dhoa · October 14, 2018, 8:28pm

Hi all,

I see many data scientist job descriptions list Big Data skills (Spark, Hadoop, …). I tried to learn Spark by myself on Dataquest, but I haven’t fully understood how to integrate it into a workflow. Most of the resources on the internet teach us how to retrieve and clean data but don’t show us what to do afterwards. For example, after retrieving the data with Spark, do we store it as a pandas data frame (I think we would run out of memory if the data is massive), or do we split it into parts and evaluate it lazily?

I found that the 2017 course has some material about Spark. Why don’t we see it in v2 of the fast.ai course?

I would really appreciate it if someone could clarify these things for me.

nikilp · January 27, 2020, 5:55am

I’m now learning pyspark and I wish to run a simple fastai training script with spark-submit. (How) does fastai interface with Spark?

harikrishnanrajeev · May 20, 2020, 2:27pm

Were you able to run a fastai training script with spark-submit? Can you share your experience, please?

msp · May 20, 2020, 2:40pm

I think the reason we don’t have much Spark content here is that Spark follows a “big data” philosophy, whereas fastai has more of a “small(ish) but clever data” mantra. But it would still be quite interesting if someone is using fastai with Spark.

yegeniy · February 27, 2021, 3:55am

The main use case I’ve seen is using Spark to generate features from very large data sets.
However, once the feature set is generated, it can be made small enough to use locally, similar to the way the fast.ai courses teach.

I’m not speaking about the fast.ai library specifically, but I’ve gotten very similar results with boosting classifiers trained on a Spark cluster and on a laptop. In that case both used the same training set and produced very similar, equally performant models.

For example:

An XGBoost Spark model (trained on a large cluster over the course of hours)
And a HistGradientBoostingClassifier model (trained on a laptop over the course of seconds)

Both perform just as well on the same validation set. Sure, it takes a few minutes to initially download the generated features used for the training set onto the laptop. But it takes a few minutes to spin up a properly sized Spark cluster too.

And waiting 5 seconds to retrain a model makes a world of difference compared to waiting minutes or hours.