Grammar vs. Transformer Models for Search — Do we need Trillion Parameter Models?

How to get precise NLP

If you don’t know the story of the parameter wars, it’s been quite a lot of madness. It started at Google with the BERT model. OK, it probably started with earlier pre-BERT models, but BERT was the breakthrough. It was a model that had been pre-trained with a gazillion (aka a lot!) hours of compute time to train its 110 million parameters (340 million in the large version). At the time we thought that was madness.

My team had been working on NLP and we tried to use BERT, which did several things, for search. It didn’t go well, and in the end we decided the technology wasn’t ready for prime time. Compared to our transformative grammar approaches for search, specifically encoding morphological sentence structure, BERT was not in the same class.

Enter the billionaires: Elon and others. Billions were put forward for “safe AI” and the company OpenAI was created. Ilya Sutskever was reportedly given a $2M salary to lead the research after being pulled away from Google Brain. He had co-developed the winner of the 2012 ImageNet image recognition challenge, and at Goog the group had a lot of kinda odd projects that didn’t seem that great or spectacular but kinda added to the Goog mission of little pieces to make search better. Nothing groundbreaking. Geoff Hinton, Ilya’s PhD advisor and a 30-year researcher in neural networks, also went to Goog. I’ve never been a fan of the Turing Award winning Hinton; he seemed stuck at the same research I read when I did my own thesis in 1989. Worse, the whole tensor matrix approach to neural network models works great for images, but I suspect it is not correct for language. But because Goog has such a huge impact on the industry, all of the main GPU programming packages only support frameworks that are tensor based. So we are literally locked into overly primitive models unless we build a lot of the scaffolding from scratch, especially GPU scaffolding, which requires low-level programming skills. So basically we’ve been stuck with old-fashioned approaches and models. OK, back to the story.

“OpenAI’s mission is to build safe AI, and ensure AI’s benefits are as widely and evenly distributed as possible. We’re trying to build AI as part of a larger community, and we want to share our plans and capabilities along the way.” — from the OpenAI website.

So OpenAI started out making video games. Huh? Yes, you heard me right: a few billion to get things done and they make auto-playing video games. Now if you know anything about video games, each frame can be translated to a finite state, and the neural network was learning the solution to the finite state machine. Cool, but not earth shattering. They made for giggly demos in the early days of OpenAI, but no big whoop. What a waste of billions in research money, I thought.

The next project they took on was the Generative Pre-trained Transformer, or as we know it GPT-2, which had 1.5 billion parameters, and eventually GPT-3, which had 175 billion parameters. They were advances on the BERT approach, which had a two-stage training: first pre-training on the huge training set, then a custom training (fine-tuning) for the task. With the pre-trained transformer models this second training stage was not required: just use it.

Now Google and Microsoft have announced versions with a trillion-plus parameters. OpenAI wanted to make money, and being a non-profit made it hard for them to cash in, so they split the company into a non-profit and a for-profit, took a billion-dollar investment from Microsoft and then licensed GPT-3 exclusively to it. Great for making money, but I wonder if OpenAI has been corrupted and is now off its original mission. I don’t see these transformer models as earth-shattering, groundbreaking new technology, but that’s the hype. Just because they are gigantic models doesn’t mean that this is the right approach. Having worked for three decades in the field of cognitive cybernetics, I find that all of my gut instincts on architecture scream that this is not the right path. I studied the transformer approach of GPT-3, and if you really dive deep into the core architecture, you start to find several failings. More on that later.

One last part of the story. Several people in Google’s Ethical AI group left amid much controversy. One reason was that these models gobble up huge amounts of electricity to train. Huge. And that was not very “green” when people were starving for resources. It was contributing to climate change. Call it Goog’s, Microsoft’s and OpenAI’s dirty little secret. If these models are so great and such great architecture, why are they so wasteful and why do they require such humongous training cycles? It comes back a bit to Ilya, who basically said that to make a better version you just make it bigger. This is not a great attitude to have. It flies in the face of good engineering: good architecture is minimal and elegant.

“OpenAI recently published GPT-3, the largest language model ever trained. GPT-3 has 175 billion parameters and would require 355 years and $4,600,000 to train — even with the lowest priced GPU cloud on the market.”[1]

Now, these transformer models are pushed mainly for their “language generation” abilities, not search. And for good reason. One has to ask, are they any good at search? Early efforts seem to indicate they do well, perhaps even great, at search, but not amazing. The reason has to do with the structure of these neural models. Few people understand how these models work, and even the simplest explanations are often difficult for people to follow. But I’ll give you some of the basics. First, these are not traditional convolutional neural network models; they are instead more like encoders. They encode probabilities relating clusters of words which form a sentence (a bag of words) to other clusters of words. Expose one to billions of sentences and it can encode what tends to go with what. But it can’t ever be ACCURATE. It’s a gist, because in some cases it requires more context than one sentence to know what really maps to what. When doing language generation it can produce language that kinda makes sense. But as people have explored deeper, they are starting to grumble that it’s not really that useful.

Now when doing search you encode all the base sentences with the transformer model: each sentence’s bag of words is transformed and applied to the pre-trained relationship weightings of the model. The problem is that language is infinite. If there are a million words in English and you permute all combinations of sentences against all other combinations of sentences, you never really have enough relationship spots in the network to store all the relationships. And what happens when you get something you’ve never seen before? So far the solution has been to add more parameters and do more training. If you really think about it, there’s something wrong with this.
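To put rough numbers on that argument, here is a small back-of-the-envelope sketch. It assumes the round one-million-word vocabulary from above and counts every possible word sequence of a fixed length, which is a loose upper bound since most sequences are not valid sentences. The function names and the trillion-parameter figure are illustrative choices, not measurements of any real model.

```python
import math

# Assumed, illustrative figures: the essay's round vocabulary size and
# a generic "trillion parameter" model.
VOCAB_SIZE = 1_000_000
PARAMETERS = 1_000_000_000_000


def sentence_count(vocab_size: int, length: int) -> int:
    """Number of distinct word sequences of exactly `length` words."""
    return vocab_size ** length


def sentence_pair_count(vocab_size: int, length: int) -> int:
    """Number of ordered (sentence, sentence) pairs at that length."""
    return sentence_count(vocab_size, length) ** 2


for length in (1, 2, 3, 5, 10):
    pairs = sentence_pair_count(VOCAB_SIZE, length)
    # How many orders of magnitude the pair count exceeds the parameter count.
    gap = math.log10(pairs) - math.log10(PARAMETERS)
    print(f"length {length:>2}: about 10^{math.log10(pairs):.0f} pairs, "
          f"10^{gap:.0f} times the parameter count")
```

Even at very short sentence lengths the number of possible pairs dwarfs any parameter budget, which is the point being made: adding parameters cannot keep up with the combinatorics.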

Now let’s talk about the advantages of morphological structure grammars as compared to these mega transformer models. These encode word roots and structure, and then encode relationships. You can do the second part with a neural network or, more rigidly, with a search indexing technology. Because all of the million words are automatically available to the grammar for the encoding process, there is no trillion ^ trillion relationships trap. The result is fast, efficient search without pre-training (except for the pre-training of the structural model analyzers).
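As a rough illustration of the idea (a toy sketch, not the actual analyzers described here), the code below reduces each word to a root, tags it with a structural role, and stores the result in a plain inverted index. Everything in it is an assumption made for the example: the tiny suffix list, the irregular-verb table, the simple subject-verb-object reading of each sentence, and the `StructuralIndex` class name.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny morphology tables for illustration only.
IRREGULAR = {"sent": "send"}
SUFFIXES = ("ing", "ed", "s")
STOPWORDS = {"a", "an", "the"}
ROLES = ("subj", "verb", "obj")  # assumes simple subject-verb-object sentences


def root(word: str) -> str:
    """Crude root extraction: irregular lookup, then strip one known suffix."""
    w = word.lower().strip(".,!?")
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def encode(sentence: str) -> list:
    """Encode content words as 'role:root' tokens, keeping structure."""
    words = [w for w in sentence.split() if w.lower() not in STOPWORDS]
    return [f"{role}:{root(w)}" for role, w in zip(ROLES, words)]


class StructuralIndex:
    """Inverted index from 'role:root' tokens to sentence ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, sentence: str) -> None:
        doc_id = len(self.docs)
        self.docs.append(sentence)
        for token in encode(sentence):
            self.postings[token].add(doc_id)

    def search(self, query: str) -> list:
        """Return sentences that contain every structural token of the query."""
        tokens = encode(query)
        if not tokens:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.docs[i] for i in sorted(ids)]


index = StructuralIndex()
for s in ("Joe sent a letter", "Joe emailed a letter", "Ann sends letters"):
    index.add(s)

print(index.search("Joe sending the letters"))
```

Any word can be encoded on the fly, so nothing has to be trained per word pair. The match is strict by design: the query finds the "sent" sentence but not the "emailed" one.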

So what’s the drawback with this approach, where a billion-parameter transformer model might do better? Fuzzy relationships, the very thing that makes transformer models bad for enterprise search, are exactly what morphological structural search lacks by default. You have to take extra steps to encode flexibility. But at least you have the choice. If you want a strict search with little flexibility, you just code the direct structural grammar. If you want more flexibility, you can augment the tokens of the search with items that solve for near similarity.

Let’s take an example: “Joe sent a letter” and “Joe emailed a letter”. If you code the search to “sent”, you will miss “emailed”. This does not happen in transformers, as they tend to have all of the relationships but have difficulty choosing which ones should be relevant. Current techniques that model word similarity and distance from a corpus, such as GloVe and Word2Vec, tend not to produce useful results unless, again, you can run a mega-sized corpus. There are also neural network models which can be trained specifically for word similarity. You can apply one as a secondary analytic to the input query to expand it. This gives you a bit more control and more effective precision than the GPT-3 approach, where it’s always “expanded”.
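Here is a minimal sketch of that optional expansion step, using the “sent” / “emailed” example. The `SIMILAR` table and its scores are made up for illustration and stand in for whatever trained word-similarity model you would actually use. The threshold is the control knob: leave it off for strict search, lower it for fuzzier matching.

```python
from typing import Optional

# Hypothetical similarity scores standing in for a trained word-similarity model.
SIMILAR = {
    "sent": {"emailed": 0.8, "mailed": 0.9, "received": 0.3},
}

DOCS = ["Joe sent a letter", "Joe emailed a letter", "Joe received a letter"]


def expand(term: str, threshold: Optional[float]) -> set:
    """Return the term plus neighbours scoring at or above `threshold`.

    threshold=None means strict search: no expansion at all.
    """
    terms = {term}
    if threshold is not None:
        neighbours = SIMILAR.get(term, {})
        terms |= {w for w, score in neighbours.items() if score >= threshold}
    return terms


def search(docs, query: str, threshold: Optional[float] = None) -> list:
    """Match docs that contain, for every query word, it or an accepted neighbour."""
    alternatives = [expand(w, threshold) for w in query.lower().split()]
    hits = []
    for doc in docs:
        words = set(doc.lower().split())
        if all(words & alt for alt in alternatives):
            hits.append(doc)
    return hits


print(search(DOCS, "Joe sent"))                 # strict
print(search(DOCS, "Joe sent", threshold=0.5))  # moderately flexible
print(search(DOCS, "Joe sent", threshold=0.2))  # very flexible
```

The strict call returns only the “sent” sentence, the 0.5 threshold also picks up “emailed”, and the 0.2 threshold lets in “received” as well, which shows how precision falls off as the expansion gets looser.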

One final issue with using these large transformer models in enterprise search, and why they don’t quite work, is the pricing. They are quite expensive to use at the Microsoft hosting rates, and running a giant corpus of documents through them would be prohibitive. So unless some breakthrough happens, these models are interesting, even fascinating, but ultimately a niche technology.

In short, I don’t believe that these transformer models are on the right path. They certainly don’t seem to lead to general AI or conversational systems. They are more like an interesting toy that can do things we haven’t seen computers do before, but ultimately it’s an inefficient architecture and a poor approach. Moving away from tensor and transformer concepts of neural networks, cognitive cybernetic models of language ultimately seem more likely to actually understand language and the world, and to communicate. This will require more advanced three-dimensional self-organizing models of neural networks, not simple planar transformer and matrix arithmetic models. We are a bit stuck with the legacy of Hinton and image search driving our architectures, and have not yet arrived at an architecture for language and conversation. But one thing is certain: expanding to 100,000 trillion parameters won’t get us there. It will only drain electricity from the planet in the same way Bitcoin is doing. God, why can’t I find a GPU to buy! In the end I agree with the Google protestors: these models are too inefficient and expensive while not providing enough worth to take seriously. It will take a new paradigm shift in computing to get there, and they are stuck in the stone age.

  1. We use Lambda GPU instance pricing of $1.50 / hour for a Tesla V100 (8x Tesla V100s are $12.00/hour). 355Y×365D/Y×24H/D×$1.5/H=$4.6M
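The footnote’s arithmetic can be checked in a few lines. This sketch only reuses the figures quoted in the footnote (355 GPU-years on a single V100 at $1.50 per hour). The `wall_clock_years` helper is an added assumption: it supposes perfect linear scaling across GPUs, which real training runs do not achieve.

```python
# Figures taken from the footnote above.
GPU_YEARS = 355
DAYS_PER_YEAR = 365
HOURS_PER_DAY = 24
PRICE_PER_GPU_HOUR = 1.50  # dollars, single Tesla V100


def training_cost(gpu_years: float, price_per_hour: float) -> float:
    """Total cost in dollars for the given number of single-GPU years."""
    gpu_hours = gpu_years * DAYS_PER_YEAR * HOURS_PER_DAY
    return gpu_hours * price_per_hour


def wall_clock_years(gpu_years: float, n_gpus: int) -> float:
    """Elapsed years on n_gpus, ASSUMING perfect linear scaling (optimistic)."""
    return gpu_years / n_gpus


cost = training_cost(GPU_YEARS, PRICE_PER_GPU_HOUR)
print(f"${cost:,.0f}")                  # $4,664,700
print(wall_clock_years(GPU_YEARS, 8))   # 44.375
```

The exact product is about $4.66M, which the quote gives as $4.6M. Note that spreading the work over more GPUs shortens the wait but not the bill.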

------------

Gianna Giavelli is the president of Noonean Inc which develops advanced language tools for enterprise search and knowledge management.
