Do language models really work?

I had previously done a sentiment analysis project on the Amazon fine food reviews, where I used bi-grams and logistic regression. Although the model gave good accuracy, it would misclassify sentences like “One would not be disappointed by the food.” or “I would feel sorry for anyone eating here.” Back then I thought that using 3-grams or 4-grams certainly wasn’t the solution.
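For reference, the pipeline was roughly the following (a minimal sketch with scikit-learn; the variable names and the tiny placeholder data are just stand-ins for the actual Amazon reviews, not my exact code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for the Amazon fine food reviews (1 = positive, 0 = negative).
train_texts = ["The food was great, would order again",
               "Terrible food, a total disappointment"]
train_labels = [1, 0]

# Unigram + bigram TF-IDF features fed into logistic regression, roughly what I had.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)

# These are the kinds of sentences it gets wrong:
print(clf.predict([
    "One would not be disappointed by the food.",
    "I would feel sorry for anyone eating here.",
]))
```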

When I came across the concept of a language model, I thought it might be a good solution to this problem. But even a language model misclassifies these sentences. I followed the same steps as described in the IMDB notebook, but on a movie review dataset that I found online. (I haven’t committed the notebook yet, but I will do so soon and share the link.)
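In case it helps until I share the notebook, this is roughly what I did, sketched with fastai's text API and the IMDB data from the notebook standing in for my dataset (the epochs and learning rates here are placeholders, not my actual hyperparameters):

```python
from fastai.text.all import *

# IMDB data from the notebook, standing in for the movie review dataset I used.
path = untar_data(URLs.IMDB)

# 1. Fine-tune a pretrained AWD-LSTM language model on the review texts.
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
learn_lm.fine_tune(1, 2e-2)
learn_lm.save_encoder('finetuned_enc')

# 2. Reuse the fine-tuned encoder in a sentiment classifier.
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas = learn_clas.load_encoder('finetuned_enc')
learn_clas.fine_tune(1, 2e-2)

# Even after this, sentences like the one below still come out wrong for me.
print(learn_clas.predict("One would not be disappointed by the food."))
```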

What kind of issues does the language model address that the bigram approach doesn’t? And how can I train the model so that it doesn’t misclassify the kind of sentences mentioned above? Any help or information would be appreciated.
Thanks.