I’ve been curious about the way models work with either sparse data or data where some values have increased importance. While more practice will help with the intuition, I’m interested in what others have to say about this, and whether my mental models make sense.
My understanding of datasets with sparse data (e.g. one-hot encoded vectors) is that most neurons receive zero inputs and never activate: rather than learning relationships between the input values, the network is mostly learning “which path to take.” This seems to lead to problems such as excess memory use, more difficulty building relationships between data points, and slower training.
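To make the memory point concrete, here’s a small sketch (my own hypothetical numbers, using numpy) comparing a one-hot matrix multiplied through a dense layer against an embedding-style row lookup. The two produce identical outputs, but the one-hot path stores and multiplies through a sea of zeros:

```python
import numpy as np

# Hypothetical setup: 10,000 categories, a batch of 32 items.
vocab_size, batch = 10_000, 32
rng = np.random.default_rng(0)
ids = rng.integers(0, vocab_size, size=batch)

# One-hot representation: each row is all zeros except one 1.
one_hot = np.zeros((batch, vocab_size), dtype=np.float32)
one_hot[np.arange(batch), ids] = 1.0

# A dense layer on the one-hot input multiplies through every column,
# even though only one weight row per example actually matters.
W = rng.standard_normal((vocab_size, 64)).astype(np.float32)
dense_out = one_hot @ W      # (32, 64) — full matmul, mostly over zeros

# An embedding lookup skips the zeros entirely: just index into W.
embed_out = W[ids]           # (32, 64) — same result, no wasted work

assert np.allclose(dense_out, embed_out, atol=1e-4)
print(one_hot.nbytes, "bytes for one-hot vs", ids.nbytes, "bytes for ids")
```

This is roughly why embedding layers exist: they give the same answer as the dense-on-one-hot computation without materializing the sparse vectors at all.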
The second thing I am curious about is how models behave when the inputs revolve around one piece of data, and how this differs by type of data/model. Some (admittedly varied) examples: appending a (0 to 1) positional value to word vectors, using a separator token for sentence similarity (as in the patent similarity competition Jeremy went over), or a neural-network predictor of college success where high school GPA dominates. It makes a lot of sense that a transformer could work well with a single word vector (or a separator token) because the self-attention mechanism allows greater emphasis on a single value. However, I am still much more uncertain about how NNs behave with data that has multiple categories with varying correlation. While I struggle to articulate it, having the network depend heavily on one single value seems awfully brutish, and I suspect it might lead to weird results.
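For the first example above, here’s a minimal sketch (my own toy dimensions, numpy) of what “appending a 0-to-1 value for position” looks like: the position just rides along as one extra feature column on each word vector. This is not how standard transformers encode position (they usually add sinusoidal or learned vectors), just the concatenation variant described:

```python
import numpy as np

# Hypothetical sketch: 6 tokens, 8-dimensional word vectors.
seq_len, d_model = 6, 8
rng = np.random.default_rng(0)
words = rng.standard_normal((seq_len, d_model)).astype(np.float32)

# Normalized position in [0, 1] for each token, as a single extra feature.
positions = np.linspace(0.0, 1.0, seq_len, dtype=np.float32)[:, None]
with_pos = np.concatenate([words, positions], axis=1)  # shape (6, 9)

print(with_pos.shape)        # one extra column carrying position
print(with_pos[:, -1])       # 0.0 ... 1.0 across the sequence
```

Whether that one column gets large downstream weights (the “brutish” dependence on a single value) is then up to training, which is part of what makes the question interesting.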
I’d appreciate any feedback/ideas! Thanks!