(Original author of spaCy and Explosion CTO here)

Okay so, first some terminology. LLMs can mean a bunch of different things, people call models the size of BERT LLMs sometimes. So let's talk specifically about in-context learning (ICL) with either zero or a few examples. So we'll say LLM ICL, and contrast that with techniques where you annotate enough data to train with, which might only be something like 10-40 hours of annotation. The something you do with that data is probably training a task-specific classification model initialised with weights from a language modelling objective. This is sometimes called "fine-tuning", but fine-tuning can also mean taking an LLM and adapting its ICL. So we'll just call it "training", and the fact you use transfer learning like BERT or even word vectors is just tactics.

Okay. So, here's something that might surprise you: ICL actually sucks at most predictive tasks currently. Let's take NER. Performance on NER on some datasets is below 2003. Here's some recent discussion: https://twitter.com/mayhewsw/status/1700139745769046409.

The discussion focusses on how bad the CoNLL 2003 dataset is, and indeed it's a crap dataset. But experiments have also been done on other datasets, e.g. check out the comparison of ICL and training in this paper from Microsoft: https://universal-ner.github.io/ . When GPT4 is used this one paper reports it slightly better on some tasks: https://arxiv.org/abs/2308.10092 . Frustratingly they don't do enough GPT4 experiments. This other paper also does a huge number of experiments, but not with GPT4: https://arxiv.org/pdf/2303.10420.pdf

The findings across the literature are really clear. ICL is generally much worse than training a model in accuracy, and you generally don't need much training data to surpass ICL in accuracy.

For tasks like text classification, ICL sometimes does okay. But you need to pick the problem characteristics carefully. Most text classification tasks people actually want to do have something like 20 labels, the texts are kind of long, and the labels don't capture the category especially well. Applying ICL to such tasks is very non-trivial. Your prompt balloons up if you have lots of classes to predict between, and providing the examples is hard if your texts are even a few hundred words.

Let's say you want to do something ultra simple: classify articles into categories for some news site or blog. This is the type of problem text classifiers have been eating for breakfast for 20 years. This is not a difficult problem -- a unigram bag of words does fine, and the method of estimating the weights can be almost anything, like just averaged perceptron will be totally okay.

But how will an LLM be able to do this? Probably your topic categories include several different types of article under them. If you know what those types of article are you can separate them out and make sure they're represented in the prompt. But now we're back at needing a tonne of domain knowledge about your problem -- that's like having to write custom features to make your model work. We all switched to deep learning so we wouldn't have to do that.

LLMs build a much more sophisticated representation of the meaning of the data to classify. But then you give them very few examples of the problem. So they can only build a shallow function from this deep representation. If you have lots of examples, you can learn a complex function from shallower features. For a great many classification tasks, this is better. The rest of your system usually needs the classification module to have some sort of consistent behaviours anyway. To do that, you basically have to make an annotation manual, and then you want to annotate evaluation documents. Once you're there the effort to make training data and train a model is minimal.

The other elephant in the room is the expense of the LLM solutions. The papers are missing results on GPT4 not because they're lazy, but because it's so expensive to use GPT4 as a classification solution that they want to get the rest of their results out the door.

The world cannot migrate all its current NLP models for text classification and NER to ICL. There are nowhere near enough GPUs in the world for that to happen. And I don't know about you, but I expect the number of text classification and NER models to grow, not shrink. So, the idea that we'll stop training small models for these tasks is just implausible. The OpenAI models that support batching are almost viable for prediction, but models like GPT4 don't support it (perhaps due to the mixture of experts?), so it's super slow.

The other thing is, many aspects of language that are useful as annotations are consistent linguistic features. The English language codes for proper names and numeric entities. They behave differently in the grammar. So some sort of named entity annotation can be done once, and then the model trained and reused. This is what spaCy does. We do this for a variety of other useful annotations across languages. We actually need to do much more: we need to collect new annotations for these models to keep them up to date, and we need to do this for more tasks, such as semantic role labelling. But it's definitely a good way to reuse work. We can do this annotation once, train the models, and users can reuse the models.

The strength of ICL is that you can get started very easily, without doing the work of annotation and training. There's lots of research on making ICL few-shot learning less bad on arbitrary text classification, NER and other tasks. We're working hard to take these results from the literature and build best-practice prompts and parsers you can use as a drop-in annotation module in spaCy: https://github.com/explosion/spacy-llm . Our annotation tool Prodigy also supports initializing the annotations from an LLM, and just correcting the output: https://prodigy.ai . The idea is to let you start with an LLM, and then transition to a model you train yourself, which can be run much faster.

continue reading on news.ycombinator.com

⚠️ This post links to an external website. ⚠️