My adventures in the land of the flowing tensor

radek · July 22, 2018, 6:24pm

I recently embarked on a journey to learn Tensorflow. Here is what I learned along the way.

The things that are great about Tensorflow

There is a lot of research level code written in TF by some really smart people. Reading the code you learn a lot about building DNNs all the way to the lowest of details. You also learn how people tend to think about structuring the code and learn about the abstractions that permeate the deep learning world. For instance, my guess is that the Stepper we know from fastai might have its ancestry going all the way back to the idea of global step in TF. The issue though is that the code you will find might be doing very interesting things, but is often tightly coupled together and it’s hard / impossible to reuse it. There is often also little in terms of documentation - you need to navigate through code to figure out what is going on.
There are many gems that the engineering teams at Google produce. A case in point is the new Dataset API (a very nice way of reading data sequentially off the HDD while still being able to shuffle it, etc., with a very nice functional interface). The problem is that the documentation seems like it was written by the product team looking after this particular piece of code. The most likely scenario a practitioner will encounter is reading files from disk. But nowhere in the docs (maybe I didn’t look right) could I find how to actually feed the Dataset API anything but data from a CSV! Then the code you need to find in the Tensorflow repo to understand what is going on is not very easy to parse. The Dataset API is very nice and polished, but writing the TFRecords that are used as inputs involves a lot of low-level plumbing that might not be easily accessible. This could, however, be fixed relatively easily with more examples catering to practitioners, versus what seems right now like documentation geared towards people coming across ML for the first time in their lives or Data Engineering pros spinning up tens of instances in a heartbeat and streaming the data to them with relative ease. Having said that, if you find an abstraction or a piece of code that addresses the problem you are facing, you are likely to be in a very good spot.
There are a lot of good things happening in the Tensorflow repository, such as the object detection API and the newly introduced estimators. They are good in the sense that if you need a job done, and if you manage to figure out how to use those specific solutions (what code to write, how to transform your data, etc.), you can achieve great results without knowing anything about DL. If you are building a product, say a mobile app, maybe there is room for off-the-shelf solutions such as the one that the object detection API provides.
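
To make the "read sequentially but still shuffle" idea from the Dataset API point concrete, here is a minimal framework-free sketch in plain Python. It is not TensorFlow code: `shuffle_buffer` and `batch` are hypothetical helper names, and `range(10)` stands in for records streamed from disk. The buffer-based approach mirrors how a streaming shuffle can work with bounded memory, under the assumption that a local (not global) shuffle is acceptable.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def shuffle_buffer(items: Iterable[T], buffer_size: int,
                   seed: Optional[int] = None) -> Iterator[T]:
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def batch(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group consecutive items into lists of at most batch_size elements."""
    current: List[T] = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current  # final, possibly smaller, batch


# Stand-in for records streamed from disk one at a time.
records = range(10)
batches = list(batch(shuffle_buffer(records, buffer_size=4, seed=0), batch_size=3))
```

Only `buffer_size` records are held in memory at once, so a larger buffer gives a better shuffle at the cost of more memory.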

Summary

In summary, I had a lot of fun writing whatever little Tensorflow code I wrote and there are many things that are great about the ecosystem. The outreach that Google is doing via Kaggle competitions also seems to be working very well. If I were working with Tensorflow on a daily basis as part of an engineering team, I could both enjoy it and learn a lot through this experience. I feel I already learned a lot through whatever little exposure I had to the ecosystem.

Those were good times but I am going back to PyTorch. How you say things is secondary to what you say, and there is no other resource I am aware of that comes close to the materials shared by fastai.

In the webdev world there is this big debate between using Java or Ruby. You can do the same thing in either, so the question is not about capability. DHH, the creator of Rails, wanted to create a framework such that, should he lose everything he has, a single person with a clue (like himself) could leverage it to bootstrap something amazing. And that is the sort of feel that PyTorch has to it, which is nice.

Anyhow, don’t want to go into a discussion on what is better but maybe some of the information above can be useful to someone. I learned quite a bit from looking at Tensorflow and will most likely use Tensorflow down the road, but it might not be for training custom models but rather using some of the prebuilt solutions or reading Tensorflow code to learn about some novel method through code.

As a side note, I don’t think anyone should look to Tensorflow for learning Deep Learning or Machine Learning. You need to have some fundamentals of both and be quite okay with code to find your bearings and understand what is going on. Maybe the materials such as udacity courses / coursera courses make this experience survivable, but I have a hard time imagining how they do it. (Haven’t tried so maybe I should not be speaking!).

hamelsmu · July 22, 2018, 6:40pm

Thanks for sharing, Radek! Really appreciate it.

jellis11 · July 23, 2018, 1:30pm

Thanks for the writeup. In my opinion, standard graph-based tensorflow (with sessions, placeholders, contexts, etc.) is basically worthless as a development tool, and no one should use it unless they are in the deployment stage.

With that said, with tensorflow 1.9.0 and the messaging that a lot of tensorflow devs are putting out, it seems that tensorflow is basically scrapping the graph-based approach. The new messaging seems to be: use the tf.keras API for building the network and use eager execution to train. With the latest release the TF workflow looks almost exactly like PyTorch. They have also released a lot of cool new demos using this workflow.

TLDR; Graph based tensorflow is for deployment. TF is moving towards tf.keras + eager execution (which is essentially PyTorch but with the ability to convert to graph based tensorflow for deployment) and it is really cool
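
For readers who have not used both styles, here is a toy illustration of the difference between deferred (graph plus placeholders) and eager evaluation. It is plain Python, not TensorFlow: `Node`, `placeholder` and `run` are hypothetical names invented for this sketch, loosely playing the roles of graph ops, placeholders and a session run.

```python
from typing import Callable, Dict, Optional, Tuple


class Node:
    """One operation in a deferred computation graph."""

    def __init__(self, fn: Optional[Callable[..., float]],
                 inputs: Tuple["Node", ...] = (), name: str = "") -> None:
        self.fn = fn          # None marks a placeholder
        self.inputs = inputs
        self.name = name

    def __add__(self, other: "Node") -> "Node":
        return Node(lambda a, b: a + b, (self, other))

    def __mul__(self, other: "Node") -> "Node":
        return Node(lambda a, b: a * b, (self, other))


def placeholder(name: str) -> Node:
    """A named input whose value is supplied only when the graph is run."""
    return Node(fn=None, name=name)


def run(node: Node, feed: Dict[str, float]) -> float:
    """Evaluate the graph rooted at node, reading placeholder values from feed."""
    if node.fn is None:
        return feed[node.name]
    return node.fn(*(run(child, feed) for child in node.inputs))


# Deferred style: describe the computation first, execute it later.
x = placeholder("x")
w = placeholder("w")
b = placeholder("b")
y_graph = x * w + b  # builds a graph; nothing is computed here
deferred_result = run(y_graph, {"x": 2.0, "w": 3.0, "b": 1.0})

# Eager style: each line computes a value immediately.
eager_result = 2.0 * 3.0 + 1.0
```

In the deferred style an intermediate value cannot be printed or inspected until the graph is run, which is a large part of why the eager style tends to feel easier during development.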

Ekami · July 23, 2018, 3:02pm

Thanks for sharing @radek !

I totally agree with what @jellis11 said. I’m currently using TF 1.9 + tf.keras model subclassing + tf.data.Dataset + tf.eager and the pipeline I have created really really looks like what I was doing with Pytorch.

Still, I want to point out that you will encounter some difficulties along the way if you build custom solutions like me with TF 1.9.

For instance, a few Keras callbacks, like Tensorboard, don’t work with tf.eager & tf.keras (not keras, notice the difference), so you’ll have to rewrite the callback yourself for now. Also, tf.keras comes with a fit method out of the box to train on a tf.data.Dataset, but it’s not usable for everything. For instance, in my case, I added a bunch of filter operations on my tf.data.Dataset to remove some input data from the pipeline, but the tf.keras fit() method requires you to specify the steps_per_epoch argument so it knows how many batches your dataset should yield before finishing an epoch. And if you don’t have this information ahead of time, well, you end up like me, rewriting the whole training pass. But that ain’t so different from Pytorch, you’d say.
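
A rough sketch of what "rewriting the training pass" can look like when the number of batches is not known up front. This is framework-free Python, not tf.keras: `make_filtered_batches`, `train` and `train_step` are hypothetical names, and the filter on `range(20)` stands in for filter operations on a real input pipeline. The idea is simply to end an epoch when the pipeline is exhausted instead of counting steps in advance.

```python
from typing import Callable, Iterable, Iterator, List


def make_filtered_batches(batch_size: int) -> Iterator[List[int]]:
    """Hypothetical pipeline: stream records, drop some, then batch the rest."""
    kept = (r for r in range(20) if r % 3 != 0)  # the filter hides the final count
    current: List[int] = []
    for record in kept:
        current.append(record)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


def train(make_batches: Callable[[], Iterable[List[int]]], epochs: int,
          train_step: Callable[[List[int]], float]) -> List[int]:
    """Run train_step on every batch; an epoch ends when the pipeline is exhausted."""
    steps_per_epoch: List[int] = []
    for _ in range(epochs):
        steps = 0
        for batch in make_batches():  # fresh iterator for each epoch
            train_step(batch)
            steps += 1
        steps_per_epoch.append(steps)  # discovered, not specified in advance
    return steps_per_epoch


seen: List[int] = []


def train_step(batch: List[int]) -> float:
    """Stand-in for a real forward/backward pass; just remembers the records."""
    seen.extend(batch)
    return 0.0


observed_steps = train(lambda: make_filtered_batches(batch_size=4),
                       epochs=2, train_step=train_step)
```

The step count per epoch falls out of the loop as a by-product, so it can still be logged or reused later.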

In the end I find TF going in the right direction. I hated it so much in the past for its ugly API, but now I think I won’t use Pytorch again (not for production projects at least), as I have everything I need in TF. On top of that, the Dataset API is much more powerful than what you can do with the Pytorch Dataset class, even if it’s hard to understand how it works at first.

In the future, when AutoGraph works perfectly, we will be able to benefit from the optimizations done on TF static graphs but in a dynamic way. So basically we will have the best of both worlds: developing easily in a dynamic manner while benefiting from the best possible code optimizations.

I wrote about this in the past, but I’ll repeat myself: while using Pytorch for research/experimentation is definitely a good idea, it’s not production ready, and I experienced this myself. See my “production grade project” here.

Just because you can run it doesn’t mean it “works everywhere”.
When you deploy into production it always comes down to these issues: speed, cost & monitoring.

While Pytorch can offer you great inference speed, it comes at the cost of not being able to run a lot of models on the CPU, and it doesn’t offer great flexibility on how and where to run your models. And that is without even mentioning monitoring: this part is almost non-existent for Pytorch.

I would really like to share some of my TF code with you, but unfortunately it’s proprietary code made for a client. When I have time I’ll probably write an open source project with this or create a blog post.

Sorry for the long writing.

yashkatariya · July 24, 2018, 4:48pm

@radek Have you looked at eager execution? It gives you fine-grained control over your training and lets you create custom models with ease. tensorflow.org has a lot of new tutorials that demonstrate it. It’s worth checking them out.

johnri99 · July 24, 2018, 9:38pm

Good debate. Personally I like Pytorch and fastai, but it’s good that TF is moving in the same direction. I guess at some point I will have to use it, but the overhead of learning another framework before you have to seems unnecessary, and I need all the time I can get just to keep up (well, sort of) with PT and fastai.

kennysong · July 27, 2018, 6:10pm

@Ekami, I’m curious about this part of your comment:

Is PyTorch inference faster than TF? And why can’t you run many models on a CPU (compared to TF)? Why is it less flexible?

I’ve never used PyTorch in production so would appreciate learning from your experience

Ekami · July 27, 2018, 9:33pm

Hey @kennysong . I actually wrote 2 blog posts about this here and here. Also take a look at this post.

cedric · July 28, 2018, 9:05am

Good debate. I’ll jump in. Both PyTorch and TensorFlow are going in the right direction. Why can’t we have both? I like to think of it this way: engineering design and decisions are about trade-offs. I’ll try not to reiterate what has been said before in other similar discussions elsewhere.

Some context. I have briefly played with TensorFlow+Keras:

when it first came out;
during fast.ai v1 studies;
recently after the introduction of eager execution mode.

So, the following are some of my random thoughts about TensorFlow and PyTorch, centered around an existing problem I have:

Going production
Does anyone have experience shipping PyTorch models into production for mobile and running inference on-device/edge? TL;DR: it’s painful.
For web and some mobile apps, the common approach is to create an API endpoint for your model and call it from your app. I think this is not too hard to do. But for certain mobile use cases, like offline or high-latency/low-power environments, this common approach is simply not suitable.
Edge ML computing
What I have just described is what edge ML computing is about. I think TensorFlow currently has better success stories in this area? PyTorch is catching up. Production deployment is on the PyTorch 1.0 checklist/roadmap, and the PyTorch team is collaborating with the Caffe2 team to make this successful. Looks promising. Caffe2 is a fairly new framework and seems to be the edge device inference deployment framework of choice for Facebook. It’s lightweight and efficient for deployment. Why does this matter? I believe edge ML computing will open up new possibilities that were never possible or were very hard (hackish) before this.

Disclaimer: I spend most of my time using PyTorch for educational purposes.

kennysong · July 29, 2018, 5:38pm

Thanks for sharing those posts, @Ekami. They were very enlightening and easy to read.

If I can summarize, there are three ways to run a PyTorch model over the web.

1. Deploy a Flask server, where you import torch and run your inference function.
2. Export the PyTorch model to ONNX and deploy with Caffe2, which is more high-performance.
3. Use a specialized model hosting service which gives you an API endpoint for your model (auto-scaling and “serverless”), like Algorithmia or Paperspace Gradient.

The unique problems with each are:

1. PyTorch takes up a ton of memory (16GB for your SRPGAN model). Manually managing a webserver is hard.
2. ONNX doesn’t support dynamic input sizes or dynamic models. The hosting options are probably the same as with (1) and (3), but running with Caffe2 would use less memory and give faster inference.
3. You went with the Algorithmia option. It’s great in concept, but their implementation is still bad, as it loads PyTorch from cold storage (1.3GB) at every request, is unreliable, and is not very customizable. This & other services may improve as time goes on.
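
The first option (a small web server wrapping an inference function) can be sketched as follows. This is a minimal illustration using only Python’s standard library rather than Flask, and it loads no real model: `predict` is a hypothetical stand-in that averages its inputs, and the `{"features": [...]}` request shape is an assumption made for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List, Tuple


def predict(features: List[float]) -> float:
    """Placeholder for a real model's inference function (hypothetical)."""
    return sum(features) / len(features)


def handle_predict(body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into an HTTP status code and a JSON response body."""
    try:
        features = json.loads(body)["features"]
        result = predict([float(v) for v in features])
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        error = {"error": 'expected JSON like {"features": [1.0, 2.0]}'}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"prediction": result}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    """Answers POST requests by running predict on the posted features."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Block forever, answering requests on localhost."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint. In a real deployment the model would be loaded once at startup, not per request, which is the cold-start cost described in point 3.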

For TensorFlow, the options to run over the web are (in my understanding):

1. Deploy a SavedModel with TF Serving on a server somewhere, which is designed for production. You can run your own server or use a model hosting service.
2. Deploy a Flask server, where you import tensorflow and run your inference function. (No reason to do this as (1) is a good option.)
3. If your model is small enough, export it to tensorflow.js and run it client-side when your page is loaded.

@cedric’s comment about on-device inference on mobile phones is also relevant. TensorFlow Lite runs on mobile. The analogous option for PyTorch would be to export to ONNX and run with some mobile-compatible library?

To summarize some of the other comments here: It seems like PyTorch wants to be more production-friendly in 1.0 by merging with Caffe2. On the other hand, TensorFlow wants to be more research-friendly with its eager execution mode (already a few iterations past launch).

Foivos · July 31, 2018, 3:02am

Greetings to all (it’s my first post actually here - what a great community!),

when I first joined the deep learning journey - July 2017 - I started with TF. It seemed like the only testable/reliable choice. Keras was easy to understand and looked efficient (with the CNTK backend even more efficient, but the CNTK documentation wasn’t easy). Then when I wanted to do more things, Keras wasn’t flexible enough (for my needs). I really enjoyed TF and I fully respect the effort developers have put into it. Tensorboard was a great tool; it helped a lot in making sense of deep learning training. Unfortunately, TF proved to be too complicated for things that really shouldn’t be (like trying to parallelize across multiple GPUs in the same machine or, back then, using the BatchNorm layer, which simply wasn’t working as it should without additional modifications). Sometimes I felt it had the unnecessary complexity of C++ (for a python API) and the inefficiency of python (flooding all memory from GPUs by default, channels last, etc.).
Data-parallel or distributed training was a nightmare to me: calculate the gradients manually, aggregate them (manually), average them (manually …) … The probability of introducing a bug was simply too high. Add this to the already difficult hyperparameter fine-tuning that is necessary for deep learning, and it was making things too slow for production.
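
To show what those manual steps amount to, here is a toy sketch of one data-parallel SGD step in plain Python. It is not TF code: the one-parameter model `y_hat = w * x`, the helper names (`gradient`, `split`, `data_parallel_step`) and the learning rate are all assumptions chosen to keep the example tiny. Each "replica" computes a gradient on its own shard, the gradients are averaged, and a single update is applied.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y)


def gradient(w: float, shard: Sequence[Example]) -> float:
    """Gradient of mean squared error for the model y_hat = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def split(batch: Sequence[Example], num_replicas: int) -> List[Sequence[Example]]:
    """Split a batch into equally sized shards, one per replica."""
    if len(batch) % num_replicas != 0:
        raise ValueError("batch size must be divisible by the number of replicas")
    size = len(batch) // num_replicas
    return [batch[i * size:(i + 1) * size] for i in range(num_replicas)]


def data_parallel_step(w: float, batch: Sequence[Example], num_replicas: int,
                       lr: float) -> float:
    """One SGD step: per-replica gradients, then average, then a single update."""
    grads = [gradient(w, shard) for shard in split(batch, num_replicas)]  # compute
    avg_grad = sum(grads) / len(grads)  # aggregate and average
    return w - lr * avg_grad            # update the shared weight


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # data from y = 2x
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batch, num_replicas=2, lr=0.05)
```

With equally sized shards the averaged gradient equals the full-batch gradient, which is why the result matches single-device training. Unequal shards would need a weighted average, one of the easy places to introduce the kind of bug mentioned above.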

I quickly started trying to find alternatives: I was very much interested in performance gains as well as not spending months debugging the code. I was hearing mxnet was a very efficient framework, but it wasn’t easy to go through the documentation. Then gluon was announced (Oct 2017) as an easy API that made using mxnet very easy and, whoa, it did make a difference (to me). gluon - for me, a humble non-expert in deep learning - was a game changer. It has almost identical syntax to pytorch (translating models from pytorch to gluon is dead easy), but also the ability to go static with a push of a button, “mynet.hybridize()”. This speeds things up by a lot (x2-x3 in my experiments). I found it overall a much better approach than tensorflow + keras. Of course it is not perfect, and I do not claim it is the best/fastest framework out there. It’s just what worked really well for my needs (easy multi-GPU training, nice syntax, flexibility, etc. - still haven’t solved distributed async training, getting there …).

I think the deep learning world is moving to a place where people will create their models in an imperative style (like pytorch / gluon / chainer etc.) and then with “a push of a button” (if the model permits) will go static (like gluon does now with hybridize, like pytorch is trying to achieve by translating models to caffe, etc.).

Kind regards,
Foivos

PS For the record I’ve been using TF/Keras/mxnet/gluon to develop semantic segmentation applications in my work (remote sensing).

cedric · July 31, 2018, 6:52am

Welcome to the community

Ditto. I used tensorboardX, a TensorBoard wrapper for PyTorch, when I was trying to implement Capsule Networks last year. I chose it over Visdom.

Yeah, I heard something along these lines too. I think another upside for mxnet is that it’s a project incubated under the Apache Foundation, so I’m guessing that means less vendor lock-in. But AWS is heavily invested in mxnet.

Foivos · July 31, 2018, 7:34am

Thank you @cedric

I used to use tensorboard as well (mxnet has a version of it too); however, I now find it more convenient to store my own variables and visualize them after/during training.
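
A minimal sketch of the "store my own variables" approach, using only the Python standard library. `MetricLog` is a hypothetical helper written for this illustration, and the loss values are made up; the CSV text it produces can be written to a file and plotted with any tool during or after training.

```python
import csv
import io
from typing import Dict, List


class MetricLog:
    """Collect per-step scalar values in memory and export them as CSV."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, float]] = []

    def record(self, step: int, **metrics: float) -> None:
        """Store the metrics observed at one training step."""
        self.rows.append({"step": step, **metrics})

    def to_csv(self) -> str:
        """Render all rows as CSV; metrics missing from a row are left empty."""
        names = sorted({key for row in self.rows for key in row} - {"step"})
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=["step"] + names,
                                restval="", lineterminator="\n")
        writer.writeheader()
        writer.writerows(self.rows)
        return out.getvalue()


log = MetricLog()
for step in range(3):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    log.record(step, loss=loss)
log.record(3, loss=0.25, val_loss=0.3)  # validation is logged less often
csv_text = log.to_csv()
```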