Idea: language model tokenizer that only supports the top 255 most common words plus punctuation. Use this to limit an existing LLM's output via a GBNF grammar definition. Have the existing LLM "reword" training data into this limited grammar. Use that to train nanochat with a highly compressed token space. ??? Profit?
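(Rough sketch of the grammar half of this — a Python snippet that emits a llama.cpp-style GBNF grammar restricting output to a small allow-list. The word list below is a hypothetical placeholder, not real top-255 frequency data:)

```python
# Sketch: emit a llama.cpp-style GBNF grammar that only allows a fixed
# vocabulary plus basic punctuation. Note this constrains the *text* the
# model can emit at sampling time; building an actual 255-entry tokenizer
# for nanochat would be a separate step.

# Placeholder stand-in for the "top 255 most common words".
TOP_WORDS = ["the", "be", "to", "of", "and", "a", "in", "that", "have", "it"]
PUNCTUATION = [".", ",", "!", "?"]

def gbnf_alt(literals):
    """Render literals as a GBNF alternation: "a" | "b" | ..."""
    return " | ".join(f'"{lit}"' for lit in literals)

def build_grammar(words, punct):
    # root: a word, then any mix of (space + word) or trailing punctuation
    return "\n".join([
        'root ::= word (" " word | punct)*',
        f"word ::= {gbnf_alt(words)}",
        f"punct ::= {gbnf_alt(punct)}",
    ])

if __name__ == "__main__":
    print(build_grammar(TOP_WORDS, PUNCTUATION))
```

The emitted grammar could then be handed to llama.cpp (e.g. via its `--grammar` / `--grammar-file` options) to constrain the "rewording" model's generations.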
@mauve Like Thing Explainer (https://en.wikipedia.org/wiki/Thing_Explainer), but you need 1000 words. 255 is not enough. Might as well make it 1024!
@brandon yeah, like an automated Thing Explainer. I guess if you're going up to 1024 words you might as well just use an existing model and enforce the grammar restrictions on it 😅