Help Needed: Learning on binary input streams

I’m interested in feeding files into a neural net. I know there are ways to convert any bytes to images, but if I wanted to learn on the binary input stream, what would be a good method?

Examples:

Detect file types (this would probably be well suited for the bytes -> image approach).
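For reference, a minimal sketch of that bytes -> image conversion, assuming a fixed row width of 256 and zero-padding (both arbitrary choices, not a standard):

```python
# A minimal sketch of the bytes -> image idea: pad a file's raw bytes to a
# rectangle and view them as a grayscale image a CNN could classify.
import numpy as np

def bytes_to_image(path: str, width: int = 256) -> np.ndarray:
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    height = -(-len(data) // width)            # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: len(data)] = data
    return padded.reshape(height, width)       # one byte per pixel, 0-255
```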

Likely locations of functions in binaries. Function parsing can be quite slow, so training a neural net on parsed-out control flow graphs could really speed this up. A good analogy might be adding punctuation to sentences. Another might be image segmentation: this part is the header, this part is the data, this part is code, this part is dead space.
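If ground-truth spans come from a disassembler or a format parser, that segmentation framing reduces to per-byte sequence labeling. A sketch, with a made-up label set and made-up spans:

```python
# Per-byte labels, analogous to per-pixel classes in image segmentation.
# The label scheme and the example spans here are hypothetical.
import numpy as np

HEADER, DATA, CODE, DEAD = 0, 1, 2, 3

def make_labels(file_len: int, spans: list[tuple[int, int, int]]) -> np.ndarray:
    """spans: (start, end, label) ranges from ground-truth tooling."""
    labels = np.full(file_len, DEAD, dtype=np.int64)
    for start, end, label in spans:
        labels[start:end] = label
    return labels

# Toy 100-byte file: header, then code, then data, trailing dead space.
y = make_labels(100, [(0, 16, HEADER), (16, 64, CODE), (64, 96, DATA)])
```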

Files (say PDF) and how they map to memory traces (e.g. the parser creates a Lexer object on the heap during execution). This seems like an encoder/decoder or translation problem: feed in a file, predict what it translates to.
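A shape sketch of that translation framing in PyTorch, assuming a hypothetical fixed vocabulary of trace events (e.g. an ID for "alloc Lexer") stands in for real memory traces:

```python
# Encoder reads raw bytes; decoder emits a sequence of trace-event IDs.
# Dimensions and vocabulary size are arbitrary placeholders.
import torch
import torch.nn as nn

class FileToTrace(nn.Module):
    def __init__(self, trace_vocab: int, dim: int = 128):
        super().__init__()
        self.byte_emb = nn.Embedding(256, dim)
        self.event_emb = nn.Embedding(trace_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, trace_vocab)

    def forward(self, file_bytes, trace_events):
        _, h = self.encoder(self.byte_emb(file_bytes))   # summarize the file
        dec, _ = self.decoder(self.event_emb(trace_events), h)
        return self.out(dec)                             # next-event logits

model = FileToTrace(trace_vocab=512)
logits = model(torch.randint(0, 256, (2, 1024)), torch.randint(0, 512, (2, 64)))
```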

Generating embeddings. Say we have parsed control flow graphs out of functions; what are the strategies for generating embeddings for arbitrary structured data like that?
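One crude, non-learned baseline just to make the input/output shape concrete: hash a bag of local node signatures into a fixed-size vector. Learned approaches (graph neural networks, graph2vec-style models) would replace this, but they consume and produce the same shapes:

```python
# Embed a CFG by feature-hashing each node's (in-degree, out-degree)
# signature into a fixed-size histogram, then L2-normalizing.
import networkx as nx
import numpy as np

def cfg_embedding(g: nx.DiGraph, dim: int = 64) -> np.ndarray:
    vec = np.zeros(dim)
    for node in g.nodes:
        sig = (g.in_degree(node), g.out_degree(node))
        vec[hash(sig) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Toy CFG: entry -> branch -> two blocks -> merge.
g = nx.DiGraph([(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)])
print(cfg_embedding(g))
```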

I know NLP uses techniques like byte pair encoding to find common tokens in words. I think this is a bit harder for binary files. For example, an offset or size field in a file format isn’t really tokenizable based on bytes. Even for something like PDFs, which are a mix of tokens and byte streams, a file very often contains far more null bytes than, say, dictionary markers “<<”. So something like BPE will just create really long null-byte tokens and miss the less frequent but more critical symbols in the file format.
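You can see that failure mode directly by counting adjacent byte pairs the way BPE’s first merge step does. On a padding-heavy toy buffer, the null-null pair dwarfs everything else, so the early merges all go to null runs:

```python
# Count adjacent byte pairs (BPE's merge candidates) on a toy buffer.
from collections import Counter

data = b"\x00" * 500 + b"<</Type/Page>>" + b"\x00" * 500   # toy PDF-ish buffer
pairs = Counter(zip(data, data[1:]))
print(pairs.most_common(3))
# ((0, 0), 998) comes first; the b"<<" pair (60, 60) appears exactly once.
```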

Is there any established preprocessing for binary streams? N-grams seem like the easy go-to.
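Byte n-grams are at least cheap to try. A minimal featurizer; the practical catch is that the space grows as 256^n, so the counts usually get hashed or pruned:

```python
# Slide a window over the raw stream and count byte n-grams.
from collections import Counter

def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    return Counter(data[i : i + n] for i in range(len(data) - n + 1))

print(byte_ngrams(b"%PDF-1.7\n%PDF", n=4).most_common(2))   # b"%PDF" twice
```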

Is there any architecture that could learn sizes and offsets in a file format? I have no doubt one could learn to interpret little/big-endian integers, but how would you represent the relationship between an offset field and the position it points to? I’m thinking of putting the data in the form of (position, byte value) pairs, then maybe an RNN so it has some memory. Ex: the last four bytes are probably an offset, we are now at position == offset, these two are connected. This also kind of feels like a graph problem.
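A sketch of that (position, byte value) representation, normalizing offsets so a sequence model at least sees where it is in the file. Whether an RNN actually learns the offset-to-position correspondence from this is the open question:

```python
# Pair each byte with its (normalized) absolute offset: shape (n_bytes, 2).
import numpy as np

def position_byte_pairs(data: bytes) -> np.ndarray:
    positions = np.arange(len(data), dtype=np.float32) / max(len(data), 1)
    values = np.frombuffer(data, dtype=np.uint8).astype(np.float32) / 255.0
    return np.stack([positions, values], axis=1)

# Toy file: a 4-byte little-endian offset field, padding, then a payload.
x = position_byte_pairs(b"\x10\x00\x00\x00" + b"\x00" * 12 + b"payload")
```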

This strikes me as a good experiment for what is essentially a semi-supervised recurrent neural network, especially Long Short-Term Memory (LSTM). The model would pretrain by trying to guess the next few bytes given the last few bytes of a binary input; as a result, it would learn to differentiate file types. Then you can take some examples of what you’re looking for (like the locations of functions) and counterexamples (anything you know is NOT a function) and train on top of the predicted outputs. You can then label the false positives and true positives, retrain, and keep labeling more data until you reach sufficient accuracy.
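A minimal sketch of that self-supervised pretraining step, assuming PyTorch and arbitrary hyperparameters: an LSTM trained to predict the next byte from the previous bytes. Its hidden states could then serve as features for the later function/not-function labeling passes:

```python
# Next-byte prediction over raw binary windows.
import torch
import torch.nn as nn

class NextByteLSTM(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.emb = nn.Embedding(256, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 256)

    def forward(self, x):                       # x: (batch, seq) byte values
        h, _ = self.lstm(self.emb(x))
        return self.head(h)                     # logits for the next byte

model = NextByteLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seq = torch.randint(0, 256, (8, 128))           # stand-in for real file windows
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), seq[:, 1:].reshape(-1))
loss.backward()
opt.step()
```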