Inquiry about Bias in Deep Learning Methods

Mirarh · May 28, 2018, 2:52pm

Hi @jeremy, @rachel, and Fast.ai community! I just completed the 4th lecture and had some questions about different deep learning methods. It’s been fascinating for me to compare Keras deep learning approach (can read more in the Deep Learning in Python by Francois Chollet!) and Andrew Ng’s Coursera approach (Deep Learning specialization).

The topic of algorithmic bias seems to come up a lot in various contexts, but if the Deep learning models are using the same data (such as Kaggle data) is there a way to better understand why models are preforming differently? I just had a few questions about that, so that when training my models I have a better grasp of why I would use one approach over another and what one model is doing that another may not be. Thank you!!

For example, in the Embeddings lecture the model using the fast.ai library beat the previous most-accurate model. If Virtual got 94.1 and Fast.ai got 94.5% accuracy for its text classification method, what are the reasons for these differences? Is there a way to really understand why these models preform differently?
Like Jeremy mentions in the lecture, its crazy how many applications of DL there are (and growing). I have a dataset that i am particularly interested in looking at and applying deep learning. My question though is how should we approach thinking about what the independent and dependent variables are? If i know what I want to try and classify, is that always the dependent variable? By doing this however, are we not making some assumptions about casualty/relationship between specific factors?
Are the ‘most’ accurate classification models using greater numbers of categorical variables? Jeremy mentioned that in the example of the Rossman sales example, it was advantageous to keep as many categorical variables as possible because this allows the models to learn from distributed representation; if its continuous the only thing it can do is find a single functional form that fits it well. We we are using structured data, would you recommend that we train our models by identifying as many categorical variables as possible?

Could we use a pertained model’s output (such as the results of a CNN) as a categorical variable for a structured data problem?

cqfd · May 29, 2018, 2:08am

Not sure how insightful of an answer I can give here, but basically, some differences I can think of: different model architectures (number of layers, sizes of embeddings, etc.), different training regimens (cyclical learning rates? which optimizer?), as well as different regularization strategies (weight decay? if so, how much? dropout? if so, how much?), etc. Every one of those changes will affect the final accuracy.
If you’re trying to use some features X and to classify some y, then yeah, y would be your dependent variable. No need to make any assumptions about causality; as Judea Pearl would put it (his new popular book on causality, The Book of Why, is very fun), the machine learning we’re doing here is purely “associational”, and neither knows nor cares about anything causal. You’re simply trying to notice some statistical association between your features and your classes.
I took Jeremy’s point in that lecture to be that categorical variables are neat because they get embedded, and different levels of the category can easily get very different embeddings (if that’s what the network wants to do). The difference with a continuous variable is that it’s just a lot harder for a network to learn to map nearby “levels” of a continuous variable to very different embeddings, because you’re going to embed it with a continuous function (e.g. a linear layer, multiple layers, etc.), and continuous functions almost by definition resist mapping nearby inputs to spread out outputs

For example: suppose you encode t-shirt sizes as 0.0/1.0/2.0 for S/M/L (rather than using a categorical variable). You could “embed” that to, say, a 10d embedding with a nn.Linear(1, 10) layer, but that wouldn’t work very well (based on the way that layer works, the embeddings would be 0.0/1.0/2.0 times one particular 10-d vector–so not terribly different from one another). You could also try a deeper embedding network… but at the end of the day, it’s going to be easier for the regular categorical embedding approach: it just makes a 3x10 matrix and learns how to fill it in. It’s definitely possible to get the same embeddings out of the non-categorical approach, but it would be much harder.
The output of a CNN will presumably be a continuous vector, so I don’t think it will work as a categorical variable. But that’s fine though, because it could work as the embedding itself. (After all, that’s really all you’re doing when you turn a categorical variable into an embedding: somehow going from a discrete set of categories to a continuous vector.)

Mirarh · May 29, 2018, 12:38pm

Hi @cqfd, Thank you! Just to follow up on question 3–there’s a kaggle competition where the goal is to classify survivors/non-survivors based on a lot of features that are given (both categorical and continuous) and I was just wondering if this type of problem would be an example of when RNN could be used? I’ve just been looking for opportunities on kaggle to test some of this and it seemed like a problem that was similar. What do you think? Thank you again!

cqfd · May 29, 2018, 2:20pm

This competition? https://www.kaggle.com/c/titanic/data

If you’re given a fixed number of features (categorical or not), there’s no need to use an RNN. You generally use an RNN when you have a variable number of features: for example, in machine translation, one input sentence might have six words and another might have sixty. Or suppose you’re trying to classify Youtube videos based on whether or not they have cats in them: different videos can have very different numbers of frames. RNNs handle that kind of variability “easily”: you just keep feeding them inputs (e.g. each word embedding for each word in the input sentence). So for an RNN each individual input has the same size (e.g. embedding dimension), but the number of inputs can vary.

Maybe worth saying that you don’t have to use an RNN if you’ve got a varying number of inputs. One classic example is bag-of-words: you map any input sentence, regardless of how long it is, to a fixed-size vector whose i-th slot counts how many times word i (from some predefined vocabulary of your choice) appeared in the sentence. That smooshes your input sentences in all their variation to a common shape (you can think of it as a sentence embedding), and now you’re off to the races: you can feed it to a regular-old feedforward neural network, or whatever. And there are fancier ways to not use an RNN: convolutional layers don’t care how big their inputs are, and there are also things like https://arxiv.org/abs/1706.03762.