Hi!
Are there any papers that look at factoring out capitalization during tokenization (e.g. adding xxUP and xxMAJ markers to indicate that a word is fully upper-cased or starts with a capital letter, roughly as in the sketch below)? I used this approach when tokenizing for machine translation and it worked quite well, but I don't see any evidence that people actually use it anywhere.
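To make it concrete, here is a minimal sketch of what I mean (the helper names are made up for illustration; the xxUP / xxMAJ markers are the ones mentioned above, similar in spirit to fastai's xxup/xxmaj rules):

```python
def encode_case(tokens):
    """Lowercase every token, prepending a marker that records its original case."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:   # fully upper-cased word, e.g. "NATO"
            out += ["xxUP", tok.lower()]
        elif tok[:1].isupper():              # capitalized word, e.g. "Paris"
            out += ["xxMAJ", tok.lower()]
        else:
            out.append(tok)
    return out

def decode_case(tokens):
    """Invert encode_case: re-apply the case indicated by the markers."""
    out, mode = [], None
    for tok in tokens:
        if tok in ("xxUP", "xxMAJ"):
            mode = tok
        else:
            if mode == "xxUP":
                tok = tok.upper()
            elif mode == "xxMAJ":
                tok = tok.capitalize()
            out.append(tok)
            mode = None
    return out

print(encode_case("The NATO summit".split()))
# ['xxMAJ', 'the', 'xxUP', 'nato', 'summit']
print(" ".join(decode_case(encode_case("The NATO summit".split()))))
# The NATO summit
```

The point is that the vocabulary only ever contains lowercase word forms, and the case information is carried by the extra marker tokens, so "The", "the" and "THE" all map to the same entry.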
Are you aware of any research into this?