I found a causal language model on HuggingFace named CodeGen that is pre-trained to generate code, and I want to turn it into a classification model that identifies malware. However, I am not able to load it with AutoModelForSequenceClassification,
and nowhere in the docs is it explained how to convert it from a causal LM to a classifier.
What pre-processing do I have to do to turn this into a classifier?
I would probably create a new PyTorch module that uses CodeGenModel as the encoder and adds a classification head on top of the hidden states (a.k.a. the "embeddings" produced for your input code).
Make sure the CodeGenModel is frozen while you train initially, so only the head learns; run for a few epochs like this, then unfreeze the CodeGenModel and fine-tune the whole thing.
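A minimal sketch of that wiring, assuming any encoder that returns a `last_hidden_state` tensor (as `CodeGenModel` does). The class name, the mean-pooling choice, and the freeze/unfreeze helpers are my own illustrative choices, not an official transformers API:

```python
import torch
import torch.nn as nn

class CodeClassifier(nn.Module):
    """Wrap a pre-trained encoder (e.g. CodeGenModel) with a classification head."""

    def __init__(self, encoder, hidden_size, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state  # (batch, seq_len, hidden_size)
        # Causal LMs have no [CLS] token, so pool over tokens instead
        # (mean pooling here; last-token pooling is another common choice).
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        else:
            pooled = hidden.mean(1)
        return self.head(pooled)  # logits, shape (batch, num_labels)

    def freeze_encoder(self):
        for p in self.encoder.parameters():
            p.requires_grad = False

    def unfreeze_encoder(self):
        for p in self.encoder.parameters():
            p.requires_grad = True
```

Usage would then look roughly like `encoder = CodeGenModel.from_pretrained(...)`, `model = CodeClassifier(encoder, encoder.config.n_embd)`, `model.freeze_encoder()` for the first epochs, then `model.unfreeze_encoder()`.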
Good luck! It seems an interesting and pretty challenging classification task.
Another approach, at least to get a strong baseline, is to use GPT (i.e. see how well one of the best general-purpose LLMs can spot vulnerabilities).
For this I suggest taking a look at this great FSDL video, where he reviews Jeremy's "LLM Science Exam" Kaggle competition starter notebook.
It shows you how to organize your experiments, how to pre-prompt the model so it "reasons before deciding", and how to force it to return exactly what you need (in your case, malware / not-malware).
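To make the "force it to return what you need" part concrete, here is a small sketch of that prompting pattern; the template wording and the function names are my own illustrative assumptions, not taken from the notebook. The prompt asks the model to reason first and finish with a single-word verdict, and the parser reads that final line:

```python
from typing import Optional

# Hypothetical prompt template: reason first, then emit one word on the last line.
PROMPT_TEMPLATE = """You are a security reviewer. Think step by step about what the
following code does, then decide whether it is malicious.

Code:
{code}

Write your reasoning, then on the final line write only MALWARE or BENIGN."""

def build_prompt(code: str) -> str:
    """Fill the code sample into the review prompt."""
    return PROMPT_TEMPLATE.format(code=code)

def parse_verdict(response: str) -> Optional[str]:
    """Scan the response from the end for a MALWARE/BENIGN verdict line."""
    for line in reversed(response.strip().splitlines()):
        token = line.strip().upper().rstrip(".")
        if token in ("MALWARE", "BENIGN"):
            return token
    return None  # model ignored the format; caller can re-ask
```

You would send `build_prompt(code)` to whichever LLM you are benchmarking and run `parse_verdict` on the reply, retrying (or tightening the prompt) whenever it returns `None`.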