Generally Intelligent is a podcast made for deep learning researchers (you can learn more about it here).
Our first guest is Kelvin Guu, a senior research scientist at Google AI, where he develops new methods for machine learning and language understanding. Kelvin is the co-author of REALM: Retrieval-Augmented Language Model Pretraining. The conversation is a wide-ranging tour of language models, how computers interact with world knowledge, and much more; here are a few of the questions we cover:
- Why language models like GPT-3 seem to generalize so well to tasks beyond just predicting words
- How you can store knowledge in a database, in the weights of a model, or with a mix of both approaches
- What interesting problems and data sets have been overlooked by the research community
- Why cross-entropy might not be such a great objective function
- Creative and impactful ways language and knowledge models might be used in the future
Below is the full transcript. We love feedback, ideas, and questions, so please feel free to reach out!
Kanjun Qiu: [00:00:00] Hey everyone. I’m Kanjun Qiu, and this is Josh Albrecht. We run Generally Intelligent, an independent research lab that’s investigating the fundamentals of learning across both humans and machines in order to create more general intelligence.
And you’re listening to Generally Intelligent, a series of conversations with deep learning researchers about the ideas, opinions, and intuitions behind their papers and talks. We hope this ultimately becomes a global colloquium where we can make guest speakers available to everyone, not just people inside their own lab.
Our guest today is Kelvin Guu. Kelvin’s at Google AI and the primary author of REALM from ICLR 2020. REALM lets language models use explicit world knowledge that’s stored in text documents instead of relying on just the model parameters to memorize everything. Why does this matter? Well, it means that you can update what your model knows as information changes in the world without having to retrain the entire model.
Kelvin did his PhD in the Stanford NLP group under Percy Liang. Our conversation ranges over a variety of topics: how to think about the fundamental question of encoding knowledge, why language models like GPT-3 seem to generalize so well to tasks beyond just predicting words, and much, much more. We found this conversation to be very insightful and hope that you will too. The views and opinions expressed in this podcast are personal to Kelvin, and they do not represent any Google position or policy.
Kanjun Qiu: [00:01:32] Welcome to the podcast, Kelvin. We’re really happy to have you here today. We’d love to start by understanding – how did you initially get into research?
Kelvin Guu: [00:01:41] Thanks for having me on the podcast! In undergrad, I was doing a lot of biology research and as an undergraduate, you don’t really get the most interesting jobs in the lab. I actually worked on a lot of the more tedious parts – so like counting cells under a microscope, counting neuron firing activations, monitoring video footage of a rat to see where it was spending its time. So a lot of things that I realized were making science slower than it should be, and taking time away from really smart people to do good work and actually benefit others. And, I felt that a lot of that work could be automated. So I started spending time looking into that and, actually, to be honest, I still need to get back to doing science. But I’ve been doing AI research since then.
Kanjun Qiu: [00:02:33] If you’re to think about your evolution from being a bio researcher all the way through now and the ideas or the hunches or the thesis that you had over the years, how would you describe those and how they evolved?
Kelvin Guu: [00:02:47] I think the most relevant part when I was doing bio was just the large amount of knowledge that you needed to acquire and connect for yourself from reading papers. I’m thinking here of building a mental model, like a graph of the different organs and proteins and other things that interact with each other. And when I did start doing machine learning research, one of the first things that I worked on was knowledge graph modeling. The research on knowledge bases spans everything from extracting relational facts from text and populating a knowledge base with them, to predicting new edges in the knowledge graph from existing edges, typically known as knowledge base completion. And I think that influenced a lot of what I did early on as an NLP researcher. Over the years, what I found was that knowledge bases are extremely effective for representing information in a way that you can algorithmically reason over. They just really struggle to have high coverage of all the different things that might be important. So, for example, there are lots of applications where world knowledge would be important in a way that you might not expect. Let’s say you’re building a customer service chatbot, and you’ve designed your chatbot to handle all sorts of common intents: I want to refund my order, or can I get a discount, or something like that. But then you’ll get customers who will say, ugh, I’m having a really bad day, my grandfather passed away, and you want your system to respond with, I’m very sorry about that, or something along those lines. But you’re never going to engineer that into your schema. You’re not going to have an intent that handles that. And yet it’s super important world knowledge. And so it just seemed to me that knowledge bases could only be part of the solution to that problem.
And since then, I’ve been spending more time trying to get world knowledge into models in a less structured way, or maybe a more unsupervised or semi-supervised way.
Josh Albrecht: [00:04:48] So I assume that REALM would be an example of work in this vein?
Kelvin Guu: [00:04:53] Yeah, so that’s something that I’ve been working on recently. REALM stands for Retrieval-Augmented Language Model. And the basic idea behind it is, for people who are familiar with things like BERT, the overall game you’re playing there is you take text that is unlabeled, you mask out some words in the text, and you try to recover what those words are. And oftentimes, if you mask out certain things, they can actually be very world-knowledge-heavy. Like, if the sentence is Barack Obama was born in Hawaii and you blank out Hawaii, the fact that Hawaii was where he was born is not really a fact about language itself, but more a fact about the world. I’ve been working on this project where, instead of just encoding the surrounding text using your typical transformer model, you have this additional component where you retrieve documents that you think are relevant to that sentence and condition on those to fill in the blank as well. And indeed, that is our attempt at bringing more world knowledge into models by allowing them to access, say, a very large document knowledge base, where the knowledge base is just free text: it’s just what people are writing into documents, as opposed to something manually curated into a knowledge base.
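The retrieve-then-read idea Kelvin describes can be sketched as marginalizing the reader’s prediction over a retrieval distribution. Here is a minimal toy version; the embeddings, documents, and four-word vocabulary are all invented for illustration:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Toy setup: 3 "documents" and a masked query, all as dense vectors.
doc_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query_emb = np.array([0.9, 0.1])  # embed("Barack Obama was born in [MASK]")

# Retrieval distribution p(z|x): softmax over inner products.
p_retrieve = softmax(doc_embs @ query_emb)

# Reader distribution p(y|x,z): for each retrieved doc, a distribution
# over a tiny vocab ["Hawaii", "Kenya", "Chicago", "Lakers"].
p_read = np.array([
    [0.80, 0.10, 0.05, 0.05],  # doc 0 strongly supports "Hawaii"
    [0.25, 0.25, 0.25, 0.25],  # doc 1 is uninformative
    [0.50, 0.20, 0.20, 0.10],
])

# Marginalize over documents: p(y|x) = sum_z p(y|x,z) p(z|x).
p_answer = p_retrieve @ p_read
```

Because the document most aligned with the query dominates the retrieval softmax, the marginal prediction inherits its answer, and updating the document store changes the prediction without retraining.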
Kanjun Qiu: [00:06:04] And this way, you’re basically accessing this external knowledge base, as opposed to encoding all of this knowledge into the various weights of your neural net?
Kelvin Guu: [00:06:13] Right, right. Not to say that alternative approaches aren’t very powerful. I think that’s something that I’m also spending a lot of time looking into. I’m sure lots of people are familiar with the really great results that we’ve seen from large models, like GPT-3 or Meena. They can answer lots of questions, and I think there’s an emerging question of, okay, they know a lot, but how can we control what they know and also update what they know in meaningful ways? And I think there’s lots of cool research to do there. That could mean more interpretable memory architectures, like things such as REALM where you’ve got this interpretable representation, or it could be something totally different. To give an example, let’s say you ask a large model like GPT-3, or maybe one that’s been trained to do dialogue, say Meena: what team does LeBron James play for? And at this point in time, if it’s trained on recent data, it should say the Lakers, I think. But he could change teams. And if he does change teams, do we spend another several million dollars just to update the fact that he’s changed teams? Or do we want some way of incrementally adding that knowledge? And I think that’s a super cool question. And also, knowledge isn’t totally independent. If he changes teams, a lot of other facts change as well. So we don’t want to just change knowledge about that question; we want the model to understand the implications of that with regard to all other things in the world.
Josh Albrecht: [00:07:51] Just going back to something you were talking about earlier in terms of the memory architectures for this: REALM was this example of a memory architecture where you’re just using free text to encode the information. But there are other alternatives. As you said, another alternative is to put it all in the weights, like GPT-3. Can you talk about the trade-offs between those two, and any hybrids or alternatives?
Kelvin Guu: [00:08:11] Yeah. So there are a couple of different dimensions along which things can vary. One is definitely the representation itself. Text as your representation is the point on the spectrum where REALM sits, and full dense parameters is another point on the spectrum, which you could think of as belonging to the GPT-3 way of doing things. There are also mixture-of-experts models, where the overall network is still all parameters, but only a subset of it is activated at any time. And then, coming back to structured knowledge graphs, there’s also a lot of work on conditioning on knowledge bases, retrieving relations or tuples from them, and those can be very effective depending on the setting as well.
Josh Albrecht: [00:08:56] Would you say those are the main approaches or are there other alternatives that you’re excited about as well?
Kelvin Guu: [00:09:02] If you take a model like REALM, there are two components to it. There’s the retrieval component, and then there’s what we call the encoder or the reader, which takes what has been retrieved and decides what to do with it. And at least in that kind of setting, you can vary the size of the retriever versus the reader. So if you make the retrieval corpus very small and you make the reader very large, you end up with something like GPT-3, and at the other extreme, if you make the reader extremely simple and you make the retrieval component everything else, you end up with something that looks more like k-nearest neighbors.
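The extreme Kelvin describes, a trivial reader over a big retrieval store, can be sketched as nearest-neighbor prediction over stored (context, next token) pairs. The vectors and tokens below are made up purely for illustration:

```python
import numpy as np
from collections import Counter

# A "datastore" of (context vector, next token) pairs. In a real system
# these would come from running an encoder over a large corpus.
contexts = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
next_tokens = ["Hawaii", "Hawaii", "Lakers", "Lakers"]

def knn_predict(query, k=2):
    """Trivial 'reader': majority vote over the k nearest stored contexts."""
    dists = np.linalg.norm(contexts - query, axis=1)
    nearest = dists.argsort()[:k]
    votes = Counter(next_tokens[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Here essentially all the "knowledge" lives in the datastore, and the reader contributes nothing beyond a vote; a REALM-style model sits between this and a fully parametric model.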
Josh Albrecht: [00:09:37] How else do you think about the more fundamental question about encoding knowledge in the first place? It’s pretty obvious that we don’t have a bunch of text stored in our head, that we’re not pulling out some old document and using that to give you the answer to a question. But clearly we know some things, sometimes. I haven’t memorized that many things, but humans have some kind of different knowledge representation. Do you have any thoughts on other ways of representing knowledge that might be closer to the biological side or more fundamental or different ways of thinking about it?
Kelvin Guu: [00:10:04] There’s definitely a sense that entities are an important structure to aggregate facts around. This is just an intuition that when you’re reasoning about something, you tend to start with the entities and what properties they have. So it seems like there should be some way to get more mileage out of that. I think text is obviously not natural; there’s no organic analog of retrieving large amounts of text. But you could at least argue that people do this semi-grounded simulation in their mind. Like, when they’re planning, there’s an episodic component to the way you plan and simulate what you’re going to do, or the way you recount things that have happened. In any system you design, ideally there are intermediate components that are interpretable and grounded in some way, so that a) the user of the system can see what is going on, and b) if it’s interpretable, you should be able to manipulate that variable in a way that gives you more controllability. For example, in the future, if we have a system that can make short-range plans about how to get a certain objective done, ideally you’d be able to inspect what that plan looked like before it started executing. I think even if things can be done in a completely uninterpretable neural-network black box, there’s still some value in having intermediate grounded states.
Kanjun Qiu: [00:11:26] Interesting. Would you say that interpretability is something you’re excited about right now?
Kelvin Guu: [00:11:30] Yeah, I would say interpretability is an important thing for me. One aspect of it that’s also interesting is being able to pick up on spurious correlations. To give an example, let’s say you’re training a sentiment classifier on product reviews, and in your training data it just so happens that everybody who bought toothbrushes happened to like them and everybody who bought fruit happened to not like it, or something like this. And if you’re not careful, you end up learning a toothbrush-versus-fruit classifier rather than a sentiment classifier. Practitioners run into that sort of issue all the time. We collect a dataset, we all think it’s going to be great, and then we train our models, get really high dev accuracy, and then we deploy and find that it just doesn’t work at all. There’s some spurious correlation, and it requires a lot of human effort to pick out what that is. Unlike a lot of the other problems that big models are solving, this is not a problem that will just go away with larger models. Sometimes it helps, but oftentimes, with more capacity and the ability to pick up on any arbitrary feature, the model can just pick up on a different feature.
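The toothbrush example can be made concrete with a tiny invented dataset: a spurious feature that correlates perfectly with the label in training, but not at deployment. All the numbers here are hypothetical:

```python
import numpy as np

# Hypothetical training set: feature 0 = "mentions toothbrush",
# feature 1 = "contains positive words". Every toothbrush review in
# training is positive, so the spurious feature predicts the label perfectly.
X_train = np.array([[1, 1], [1, 1], [0, 0], [0, 0]], dtype=float)
y_train = np.array([1, 1, 0, 0], dtype=float)

# At deployment, toothbrush reviews can be negative too.
X_deploy = np.array([[1, 0], [0, 1]], dtype=float)
y_deploy = np.array([0, 1], dtype=float)

def feature_label_corr(X, y, j):
    """Pearson correlation between feature j and the label."""
    return float(np.corrcoef(X[:, j], y)[0, 1])

train_corr = feature_label_corr(X_train, y_train, 0)    # perfect in training
deploy_corr = feature_label_corr(X_deploy, y_deploy, 0)  # gone at deployment
```

A model free to use either feature has no training-time reason to prefer the sentiment words over the product words, which is why high dev accuracy can hide the failure.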
Josh Albrecht: [00:12:39] Do you have any intuitions about interesting directions to explore on removing spurious correlations or avoiding those?
Kelvin Guu: [00:12:46] Sure, yeah. There are a few papers showing that, if you at least know which features are spurious, you can counteract the influence they have, and that’s usually done with data re-weighting, or data augmentation, or model ensembling, where you’re ensembling your real model with some model that only has access to spurious features. There’s a whole bunch of techniques there. But I’ve been looking for ways of automatically detecting spurious correlations, and I haven’t found anything great in that area yet, and I don’t have any great ideas myself. If you guys come across something, or if anyone else does, that could be interesting.
Josh Albrecht: [00:13:23] There was one paper I remember seeing recently arguing that one of the reasons it’s effective is that it helps remove some of these spurious second-order correlations.
Kanjun Qiu: [00:13:32] Aside from interpretability, what else have you been excited about?
Kelvin Guu: [00:13:38] I’ve been quite excited about few-shot learning and the different places you can apply it. I’ve been quite interested in understanding the magic behind language model pre-training: how is it that, despite many tasks not looking like language model prediction, you can still get so much generalization from that task? And also the well-known GPT-3 results with in-context learning, where you show the model a sequence of x-y pairs and it’s able to complete that pattern. Just understanding where the model gets its bias for pattern continuation and repetition has been on my mind, along with learning more about causal inference.
Kanjun Qiu: [00:14:23] I’m really interested in this understanding of the magic behind language model training. Why can you get so much generalization from this task? And where does the model get its bias for pattern continuation from?
Kelvin Guu: [00:14:34] Yeah. So, a complete guess in the dark is just that there are a lot of list-like structures on the internet that these models have in their corpus: 10 things to do on vacation, top 10 products that I recommend, et cetera. If that’s really the case, it’d be interesting to test whether it were true. So if you take the subset of your Common Crawl corpus that is list-like and just train on that, what do you get?
Josh Albrecht: [00:15:08] That is an interesting question, though: how much of it is dataset bias? If you removed all those lists, would you still end up with the same behavior?
Kanjun Qiu: [00:15:16] What if you train on everything that was not a list on the internet, right?
Kelvin Guu: [00:15:21] So I think one way to approach it is ablations; that’s definitely one way to look at it. I think influence functions are another interesting tool that you could apply in a setting like this. I don’t know about the scalability or accuracy of that, but generally speaking, influence functions measure, for each data point in your training set, if you infinitesimally upweight it by some amount, how much does that alter your prediction on some example? And that gets at the idea of the influence of a point in your pre-training data. So you could take your listicle data points inside your training set and check if they have any effect on something like GPT-3. But the other problem is that it measures the individual influence of each point, so as long as there are thousands or millions of other points like it, its individual effect might not be so big. It still leaves that as an open question.
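For linear models the influence computation has a closed form, which makes the idea easy to see. Below is a minimal sketch on an invented 1-D regression problem with one deliberate outlier; for a least-squares loss the Hessian is just X^T X:

```python
import numpy as np

# Toy 1-D linear regression (no intercept): three points on y = 2x
# plus one extreme outlier. All numbers are invented for illustration.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 100.0])
x_test, y_test = np.array([2.5]), 5.0

def fit(X, y):
    # Closed-form least squares: theta = (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

theta = fit(X, y)
H = X.T @ X                                      # Hessian of the squared loss
grad_test = x_test * (x_test @ theta - y_test)   # gradient of the test loss

# Influence of upweighting training point i on the test loss:
#   I_i = -grad_test^T H^{-1} grad_i
grads = X * (X @ theta - y)[:, None]             # per-point training gradients
influence = -(grads @ np.linalg.solve(H, grad_test))
```

The outlier gets the largest-magnitude influence score, and its positive sign says that upweighting it increases the test loss, matching what leave-one-out retraining would tell you much more expensively.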
Kanjun Qiu: [00:16:17] Kind of related to this, aside from these very large language models, do you feel like you have any strong intuitions about particular important directions that you’d want to investigate?
Kelvin Guu: [00:16:28] I could give another one that I think is interesting, but I haven’t spent much time on. So there’s a lot of work on text generation, and a big part of it has actually been supervised, or counting self-supervised, tasks where you are training a sequence generator with a cross-entropy loss with teacher forcing on all the tokens. While that’s a classic and great recipe, it just doesn’t seem like the only way one should be doing text generation. One of the issues that I think is problematic about it is, because of the way cross-entropy is set up, if you put no probability at all on something and that thing ends up appearing, you suffer infinite loss. So naturally this leads to models that hedge a lot, and the less modeling capacity your model has, the more it has to hedge in ways that are unnatural, just so that it can cover everything that could happen. That can result in really weird generations at test time, because the part of the distribution that really was just there to prevent negative infinity is actually being sampled: you see weird things, or the model hallucinates things that it doesn’t actually know to be true. And I think there’s lots of interesting work on replacements for that way of training. A few things come to mind. There’s some work on truncating the loss so that you’re not actually affected by the extreme regions of the loss; there’s some good work by a former collaborator of mine, Tatsu Hashimoto, on that subject. And also, out of OpenAI, there was a pretty nice paper on learning to generate summaries from human pairwise comparisons. In that setting, they’re training a reward function to emulate human judgments of which piece of text is better, and then training a generator off that, which seemed quite interesting as well. All of these basically get at the fact that you don’t have to model everything.
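The loss-truncation idea Kelvin mentions can be sketched very simply: compute per-token cross-entropy, then drop the highest-loss fraction before averaging, so a few near-zero-probability targets stop dominating the objective. The probabilities and the drop fraction below are made up for illustration:

```python
import numpy as np

def cross_entropy_per_token(probs, targets):
    """Per-token negative log-likelihood given predicted distributions."""
    return -np.log(probs[np.arange(len(targets)), targets])

def truncated_loss(probs, targets, drop_frac=0.2):
    """Average the per-token losses after dropping the highest-loss fraction."""
    losses = cross_entropy_per_token(probs, targets)
    keep = int(np.ceil(len(losses) * (1 - drop_frac)))
    return float(np.sort(losses)[:keep].mean())

# Five tokens over a 3-word vocab; the last target got almost no probability,
# so its loss is huge under plain cross-entropy.
probs = np.array([
    [0.700, 0.200, 0.100],
    [0.600, 0.300, 0.100],
    [0.800, 0.100, 0.100],
    [0.500, 0.400, 0.100],
    [0.001, 0.499, 0.500],
])
targets = np.array([0, 0, 0, 0, 0])

full = float(cross_entropy_per_token(probs, targets).mean())
trunc = truncated_loss(probs, targets, drop_frac=0.2)
```

The single low-probability target inflates the plain average far above the truncated one, which is exactly the pressure toward hedging that the truncation removes.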
Josh Albrecht: [00:18:42] The OpenAI one, then, is avoiding that by giving it a different training signal where there are no more negative infinities. Like, you can’t get a negative infinity score from a human judge. Is that right? Truncating the loss certainly seems reasonable to me too. The thing that’s really problematic is that the vocabulary is gigantic, so you have to put probability all over the place so that you don’t get infinitely wrong. Whereas for people, if you ask someone to give you a random word, they’re only going to give you one of, like, 30 words.
Kelvin Guu: [00:19:11] Right. You’re not like, wait, let me get the full dictionary out first.
Josh Albrecht: [00:19:24] Maybe one more question on that, then: is there anything more to it than just truncating the loss? What other kinds of techniques could help with that same kind of thing?
Kelvin Guu: [00:19:34] Yeah. So I think this reminds me of contrastive losses. If you can pick your negative candidates carefully in a contrastive loss, you can emphasize the thing that’s important about the right thing while also finding the boundary of everything that’s invalid. It’s really tricky to get that right, because the thing most people will do with contrastive losses is just randomly sample negatives. And most of those are not close to the decision boundaries. So you don’t learn much. And if you use some adversarial approach where you really look for things that are close to the boundary, then you sometimes sample things that are positive and that can just be confusing as well.
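The point about negative selection can be illustrated with a toy InfoNCE-style contrastive loss: negatives near the decision boundary produce a larger loss (and therefore a stronger learning signal) than far-away random negatives. All vectors here are invented for illustration:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temp=1.0):
    """InfoNCE: negative log softmax score of the positive among candidates."""
    cands = np.vstack([positive, negatives])
    scores = cands @ anchor / temp
    scores = scores - scores.max()  # numerical stability
    return float(-np.log(np.exp(scores[0]) / np.exp(scores).sum()))

anchor = np.array([1.0, 0.0])
positive = np.array([0.95, 0.05])

# Random negatives: far from the anchor, trivially separable.
easy_negs = np.array([[-1.0, 0.0], [0.0, -1.0]])
# Hard negatives: close to the anchor, much more informative.
hard_negs = np.array([[0.8, 0.2], [0.7, 0.3]])

easy_loss = info_nce(anchor, positive, easy_negs)
hard_loss = info_nce(anchor, positive, hard_negs)
```

With easy negatives the positive already dominates the softmax and the loss is near zero, so the model learns little; hard negatives keep the loss high, which is exactly why adversarially chosen negatives help, and also why accidentally sampling a true positive as a "negative" is so damaging.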
Josh Albrecht: [00:20:15] Do you think there are any other negative side effects to sampling too close to the boundary? So accidentally getting false positives is one. Are there other things that come into play there as well?
Kelvin Guu: [00:20:26] Yeah I think another common thing is if you’re always just sampling close to the boundaries, sometimes the model is just catastrophically bad away from the boundary. And here by boundary, I don’t mean the model’s boundary. I mean what a human would think of as the boundary.
Josh Albrecht: [00:20:38] Hmm, that’s interesting. So another direction in this case is thinking about how to appropriately select those kinds of contrastive examples, positives and negatives, in the language case: instead of sampling randomly over all words, sampling just the words that could be confusing here.
Kelvin Guu: [00:20:56] Right.
Josh Albrecht: [00:20:59] Is there any good work there that comes to mind?
Kelvin Guu: [00:21:00] I feel like it’s pretty open. There’s definitely work in that area, but oftentimes it’s some heuristic – Oh, these are sentences from the same document, but not the same paragraph. So it’s hard to find that general principle that makes it work.
Josh Albrecht: [00:21:15] It feels very similar to me in images. In images, a lot of the time you’re doing a contrastive loss on things that are very clearly dog or cat. Okay, great, you’re all set. You don’t need another dog comparison; your dog thing is pretty good. But some of the dogs and cats look very similar, and then you want to find those pairs that are really hard. I think there was actually some recent work from ICLR this coming year about which negative examples to use, showing that most of the learning comes from the hard negative examples. But then you want to make sure that those hard negative examples aren’t actually teaching you the wrong thing. Some dogs just do look like cats, and you shouldn’t necessarily be labeling them one way or the other.
Kelvin Guu: [00:21:54] And I don’t even think that’s where the heuristics end. Another issue that seems to come up a lot with these contrastive setups is that the true label in that setting often has some spurious correlation that lets you pick it out easily. So in the image case, if you cut out a piece of an image and then hide it among a bunch of distractors, there’s only one of those images whose edge-detector boundaries line up with the hole that was cut out. So you have to remove some margin around that boundary to eliminate that feature. And in text it’s the same: there are only so many things that grammatically fit in a slot. So when you cut, you have to cut out a proper sentence or some other structure that reduces the correlation at the boundary.
Josh Albrecht: [00:22:37] Sounds like self-supervised or contrastive losses applied to language in a slightly different way. It seems like an interesting direction to explore more.
Kelvin Guu: [00:22:46] Maybe part of the reason why the straight-up, cross-entropy, generate-everything approach has been more successful is that it’s a harder recipe to get wrong. If we look harder in the contrastive space, we might get something.
Josh Albrecht: [00:22:59] That raises another meta question: as you’re thinking about research, you could go in this direction and try to make a slightly more complicated contrastive loss that might give better performance, or you could make the model bigger. How do you think about which directions to pursue, which ones seem the most promising, and which ones to work on versus save for later?
Kelvin Guu: [00:23:19] Yeah, I think most recently, a few criteria have been top of mind for me. First: will this be obsoleted by increasingly large models? It’s important to me that what I’m doing is orthogonal to that, or even benefits from scaling up further. Oftentimes for me, that means being able to take advantage of extremely large data sets in new ways. That usually seems to work well with the scaling factor.
Josh Albrecht: [00:23:49] Could you give an example or two?
Kelvin Guu: [00:23:53] Oh, sure a default strategy for using large unlabeled text corpra is left to right language modeling. But I think there are many other tasks that you could hypothetically do from such a dataset. This one’s not new, but for example, there’s tons of hyperlinks out on the internet. And take a random hyperlink and ask you to predict what URL it goes to, you can get lots of information out of that potentially. And it’s not just left to right language modeling and yeah, there are many other sorts of tasks one could many other one
Kanjun Qiu: [00:24:29] Interesting. I’m curious: how did you come to decide that you were most interested in language models, as opposed to branching out into other things?
Kelvin Guu: [00:24:46] Yeah, I’ve been thinking about language models, a fair bit. This goes back to my interest in world knowledge. It’s not so much the language modeling task, but learning things from corpa that’s interesting to me and the fact that it’s hard to get that kind of knowledge from anywhere else. Like it just, it doesn’t seem like data curation is ever going to get us that scale.
Josh Albrecht: [00:25:09] One idea I was playing with a year or two ago was around making a knowledge base that grew itself. Instead of us going and putting in all the knowledge ourselves, is there a way that you can make a system that bootstraps that? Almost let it go and do Google searches every time it’s confused: oh, I’m not sure, let me go look that up; oh, now I know about blah. But try to only look up things that it’s maximally unsure about, almost using Google to bootstrap into a more general version of a knowledge base. Because that’s what I do: oh, I don’t know anything, let me look it up.
Kelvin Guu: [00:25:38] Yeah, I generally like the idea of having a tiered system where you’ve got your core set of facts that you know really well, and maybe those are in the dense parameters of your model. And then you’ve got a more secondary set of facts that you go to text to retrieve. Like, there are certain things that I don’t just know off the top of my head, but I have them filed away in a literal document somewhere. And beyond that, there could be another tier of access for things that are not even in your retrieval corpus.
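The tiered access pattern Kelvin sketches can be written as a simple fall-through lookup. Everything below is a hypothetical stand-in: the two dictionaries play the roles of parametric memory and a retrieval corpus, and `external_search` is a stub for, say, a web search call:

```python
# Hypothetical control flow for tiered knowledge access.
CORE_FACTS = {"capital of france": "Paris"}       # tier 1: "in the weights"
RETRIEVAL_CORPUS = {"realm authors": "Guu et al."}  # tier 2: document store

def external_search(query):
    """Tier 3 stub: in a real system, an external search API."""
    return None

def answer(query):
    q = query.lower()
    if q in CORE_FACTS:                # core knowledge: cheapest, fastest
        return CORE_FACTS[q], "core"
    if q in RETRIEVAL_CORPUS:          # secondary: retrieve from text
        return RETRIEVAL_CORPUS[q], "retrieval"
    return external_search(q), "external"  # last resort: external lookup
```

The design choice is the same as Kelvin describes for humans: cheap, well-rehearsed facts answer immediately, while rarer facts pay increasing lookup cost.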
Josh Albrecht: [00:26:08] One other thing that seems related to that is you can have your core knowledge and your weights. You can have your secondary knowledge that you can go look up. Those both make a lot of sense. How do you think also about the compositionality of those things? If you have some core knowledge, there’s a lot of implied knowledge. How do you think about that component?
Kelvin Guu: [00:26:24] That’s a great point. So far we’ve been talking about memory or knowledge access mechanisms that are pretty sparse: there’s some very targeted thing that you want to know, and in that case, that sort of spotlight notion of memory, where you just single out a single document or something, makes a lot of sense. But I think there are many other cases where it’s some aggregate statistic that you want. Take the case that we were talking about much earlier, where you’re the customer service dialogue agent and the customer says that they’ve lost a relative. There’s probably not one specific conversation you would pull from your memory to rehearse from; rather, you would have an aggregate sense. I think there are two ways of doing that aggregation. One is at training time, as part of the training procedure. You can imagine, in language modeling, every time the model encounters some bit of dialogue on the web where this kind of conversation is happening, it’s putting some of its weight towards viable responses. In that way, it is actually aggregating over all the different conversations and building a distribution over reasonable responses, so the aggregation is done completely at training time. But an alternative is to do the aggregation at inference time: you retrieve a bunch of conversations that look like the one you’re having now, and then you do something with those to choose an appropriate response. And I think which approach is reasonable really depends on your application. A case where you would want the aggregation at test time, or inference time, would be something where you actually don’t really trust the aggregate statistics in your training data. So if there’s some sort of systematic bias or other issue, you would much rather have all the aggregate data in front of you at inference time and be able to manipulate it and make sure that it is sound in some way before making a decision.
Josh Albrecht: [00:28:17] Especially as the world changes, you would want not to have just memorized things that happened before and spit that back out, but to be able to compose things at test time. It also goes back to one of the other things you mentioned as interesting: few-shot learning, which feels a little bit related here, in that you can imagine at test time you get a little bit of information and now you have to make some updated prediction. What kinds of things feel really interesting to you there?
Kelvin Guu: [00:28:47] Yeah. So at first glance, few-shot learning seems like we’re tying our hands behind our backs: we have all these data sets with tens of thousands of data points, so why are we studying few-shot learning? But I think the reality of the situation is that there are far more real-world tasks that are few-shot; they’re just not recognized as machine learning tasks because they don’t have big data sets around them. And with time, we should see that, in fact, there are a lot of small niche problems that many different people have that can be viewed as few-shot learning, and we’ll be able to somehow study that in an academic way as well. What comes to mind is, you can take something very general, like question answering, and say, oh, we have big data sets for question answering. SQuAD is a hundred thousand questions; nothing could stop us. But in reality, that only covers a very narrow slice of possible questions one could ask: questions that can be answered by Wikipedia, and that have a certain format. So the long tail of question answering is not covered so well by any dataset that we know of. And within the full range of things that could be questions, each one is a few-shot learning problem, and we need better generalization to solve it, I would think.
Kanjun Qiu: [00:30:09] I’ve always felt humans are primarily few-shot learners. So when we’re thinking about machine learning models in the long term, more general models have to be few-shot learners. Is this an unusual perspective, that few-shot learning is useful?
Kelvin Guu: [00:30:26] I’m not sure how unusual it is; maybe it is actually quickly becoming a recognized fact.
Kanjun Qiu: [00:30:32] Do you feel like you have any opinions that are unusual among people that you’ve encountered?
Kelvin Guu: [00:30:39] I have one unpopular opinion, which maybe is rightfully unpopular. I used to work on a problem where we were mapping natural language to actions in the web browser. And this was an angle for learning to map language to imperative, actionable behavior without having to go through some sort of intermediate semantic representation. Usually if you’re building a dialogue agent or an assistant, you map natural language to some sort of formalism, oftentimes a logical form, that indicates what some backend should do. The hope was to get rid of that intermediate formalism and go straight to execution in some well-understood backend, like the browser.
Kanjun Qiu: [00:31:30] So do you mean, as an example, an accessibility command? For example, I say “click on the refresh button,” and then you’d want to map that to the action of the person clicking on the refresh button. Or do you mean something more general in terms of language input?
Kelvin Guu: [00:31:45] That’s exactly the sort of use case that we were thinking about: being able to operate your browser hands-free, maybe starting with super atomic things – “click on that button” – and moving to more sophisticated commands – “book me a flight for X.” The supposed benefit of an approach like this is that it’s actually easy to gather data, because people know how to operate the browser and can give you demonstrations, in a way that people don’t know how to write logical forms for semantic parsing. So that was one hope. But one thing we learned from that project was that we had to set up a headless Chrome browser and various other hacky things to do these experiments, and it’s just very hard for the research community to pick up something like that. I feel that it’s very easy for the community to run with datasets that you can just download and map inputs to outputs very easily. But for something that has a complex environment, it’s hard to build momentum around it.
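[Editor’s note: the atomic command-to-action mapping Kelvin mentions can be illustrated with a deliberately naive rule-based mapper. Everything here – the stopword list, the action format, the assumption that the first content word is the verb – is a hypothetical stand-in for what the project would actually learn from user demonstrations.]

```python
# Words that carry no content for this toy grammar (illustrative list).
STOPWORDS = {"the", "a", "an", "on", "please"}

def parse_command(utterance):
    # Keep only content words; treat the first as the verb and the rest
    # as a description of the target element in the page.
    words = [w.strip(".!?") for w in utterance.lower().split()]
    words = [w for w in words if w not in STOPWORDS]
    return {"action": words[0], "target": " ".join(words[1:])}

print(parse_command("click on the refresh button"))
# → {'action': 'click', 'target': 'refresh button'}
```

The contrast with the project Kelvin describes is the point: a learned model would map the utterance straight to an executable browser event, with no hand-written rule table or logical form in between.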
Kanjun Qiu: [00:32:45] That’s really interesting. It seems like what you’re saying is that if the dataset requires a good deal of setup and, you know, headless Chrome browsers, then it’s less likely that other researchers will set that up, collect the data, and then do more work on it. There are actually huge swaths of uncharted territory in a research sense. It’s just that no one wants to download the Docker container, scale it up on their Kubernetes cluster, and run the test.
Kelvin Guu: [00:33:19] I wholeheartedly agree with that, and aside from the project I was working on, there are lots of other domains where it’s just hard to get the data in the first place. Lots of clinical applications turn out to be that way – super important problems, but there’s no way that you’re going to find a clinical dataset just sitting on the web. And there are so many great research labs where one of the things they do is really focus on certain applications, build relationships with whoever is interested in that problem or has the data, and make huge progress. I think there’s a lot of work that’s pretty valuable; it’s just hard to build momentum.
Kanjun Qiu: [00:34:02] That’s really interesting. Are there other things that you think are undervalued by the research community or overlooked?
Kelvin Guu: [00:34:10] Maybe continuing on the same note as before – I think language understanding for scientific and medical literature, despite having been around for a very long time, is not something that gets as much participation as you would think. To be fair, there are large workshops and whole communities that work on this, so it’s not like they’re small scale in any way. But you still don’t see as many papers at, say, NeurIPS or ICML as you would expect. I think that’s because those conferences are very methods focused, and oftentimes to get going in one of those domains it’s not really the methods that need to be extremely novel, but just understanding all of the features of the problem.
Kanjun Qiu: [00:34:56] Yeah, it’s really hard to publish something that is not novel in methods but is novel in the dataset that you’re using or the task that you’re doing. There are no existing benchmarks for this dataset or this task, and people don’t care, or you’re just using existing methods, so those are overlooked.
Kelvin Guu: [00:35:19] Yeah, I agree. And when a task does make its way into the mainstream, it often takes many attempts. For example, take SQuAD, the question answering dataset. Before SQuAD, there were many question answering datasets as well, and in fact, maybe they were actually harder than SQuAD. They just didn’t cater to the strengths of methods at the time, so they weren’t able to get traction, even though it’s now a very important, well-known task.
Kanjun Qiu: [00:35:58] Interesting. It’s like product market fit in startups. Sometimes you have a product that’s too early. Sometimes you have a product that’s too late. If the dataset is too early it won’t get traction. Have you ever thought about designing any kind of new tasks or new data sets?
Kelvin Guu: [00:36:17] I have, and it’s not easy. I definitely have a few unpublished papers in that area. There are a few ideas where, if somebody were to do it, I would love to see a dataset on it.
Josh Albrecht: [00:36:30] If you want to describe a few of those, maybe someone will listen to this and email you in the future about it!
Kelvin Guu: [00:36:35] Sure. Yeah. So one dataset that I think is really cool is called NarrativeQA, where the task setup is: you ingest an entire book, then you answer questions about the book. The sorts of questions end up being very different from the sort of questions you would see in something like Natural Questions or SQuAD, because in those settings, it’s very often the case that there’s a single sentence that roughly answers the question, and then you need to extract the answer from it. But in NarrativeQA, oftentimes the questions end up being dependent on some sort of causal relation. Like, some character gets injured in chapter one, and then they’re ill for a few chapters, and then they die. And then the question is, why did that character die? And you have to trace things back. But the dataset itself – maybe I just didn’t do a good job, but it was a bit hard to get hold of the data. And maybe that was ahead of its time, because we don’t really have models that can condition on full books. The length is just challenging.
Josh Albrecht: [00:37:48] I think we might actually start to have some of those now, even in the past few months, or maybe some of these things could do a slightly better job.
Kanjun Qiu: [00:37:55] What makes you say that?
Josh Albrecht: [00:37:56] Oh, just some of the ICLR 2021 papers that are like, we made a transformer that has infinite – and I’m like, oh cool, how about you try this problem?!
Kelvin Guu: [00:38:01] Maybe this year is the year.
Kanjun Qiu: [00:38:07] One thing I was really curious about, related to these unpublished datasets and new benchmarks: have you had any things that seemed obvious, where you thought it would work, but it just didn’t, and you couldn’t publish it as a result?
Kelvin Guu: [00:38:18] I don’t have a particularly good example of that. Plenty of things didn’t work, but all for reasons that I later came to understand. But to be honest, there’s still a lot that I don’t understand about why things work. I think there’s a lot of heuristic knowledge that I can use when I’m building things, but I can’t really summarize it for people. Like, one heuristic is that people seem to be always trying to build models that have a shorter path between the input and the output, in terms of the number of non-linearities that it passes through. These skip-connection-type intuitions seem to come up over and over again, and yet I am not able to give a great explanation. There are papers on this, but personally, I still don’t have that intuition. The heuristics that I have are maybe more of a procedure for building models than a static thing. So it’s a process where you try to start with a large model and make sure you can overfit. Once you can overfit, try to understand what sort of regularization you can add to improve the inductive bias. You’re always checking for spurious correlations early on when you’re setting up the data. To be honest, another heuristic, which doesn’t sound great, is that I always try to start with working code. So if that means someone else’s implementation, even if I know exactly how the model is set up and could implement an exact copy myself, I would rather start with their code and incrementally move it towards where I want it to go, just because there are so many tiny details that end up changing the result.
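[Editor’s note: the “make sure you can overfit first” step Kelvin outlines can be sketched as a sanity check. This is a minimal illustration using a tiny overparameterized logistic regression in NumPy as a stand-in for the real model; the shapes, learning rate, and step count are arbitrary choices.]

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 32                          # 8 examples, 32 features: overparameterized
X = rng.normal(size=(n, d))           # one tiny "batch" of data
y = rng.integers(0, 2, size=n).astype(float)

# Plain gradient descent on the logistic loss; the goal is only to check
# that this model *can* drive training error to zero on the batch.
w = np.zeros(d)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
    w -= 0.5 * X.T @ (p - y) / n         # gradient step

p = 1.0 / (1.0 + np.exp(-(X @ w)))
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
train_acc = np.mean((p > 0.5) == (y == 1.0))
print(f"train loss={loss:.4f}, train acc={train_acc:.2f}")
```

If the training loss will not go toward zero even on one small batch, something in the pipeline (data, loss, optimizer) is usually broken; only once this check passes does it make sense to move to the next step of the procedure and add regularization.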
Josh Albrecht: [00:40:02] Yeah, they can have a very large impact, even really small changes. Actually, one of our recent blog posts was entirely about that: we implemented BYOL, Bootstrap Your Own Latent, from scratch, and forgot to put batch norm in one of the places, and it didn’t work at all. We just forgot batch norm, and the output was basically random. Then we put batch norm in, and we were like, okay, it works. And so we were like, why? What the heck? That’s so weird. There was a huge performance difference for batch norm versus no batch norm, and we went off on this deep exploration of exactly why.
Kelvin Guu: [00:40:33] Yeah, those things happen all the time. I think the way we incrementally work from working code onwards is just an indication of the fact that we don’t understand a lot of what’s going on, and that’s why we have such a conservative recipe. So it’s not that I have a great intuition for why this works.
Josh Albrecht: [00:40:52] And that was actually my second question: do you feel like you’ve seen any recent papers or recent things that helped make things click, that took it a little bit away from the heuristic and a little bit closer to, oh, maybe I have some better intuition for how or why this is working?
Kelvin Guu: [00:41:04] I can actually think of a paper that broke my intuitions even further.
Josh Albrecht: [00:41:11] Perfect.
Kelvin Guu: [00:41:13] There was a paper about reordering the layers in the transformer. So a typical transformer is attention, feed-forward, attention, feed-forward, and this paper reordered everything, I think, and did a mini architecture search over what happens if you just do feed-forward, feed-forward, and then attention, or any combination of those. I forget what the final architecture was, but it was something like attention all the way up until nearly the top, and then just feed-forward all the way to the end. This thing performed better than a standard transformer on some language modeling task. And I thought that was great, but it just further broke my understanding of how things work.
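[Editor’s note: the reordering experiment Kelvin recalls can be pictured with a toy stack where the sublayer order is just a string over “s” (self-attention) and “f” (feed-forward). Everything below – the identity Q/K/V projections, the weight scales, the example patterns – is an illustrative assumption, not the paper’s actual setup.]

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # model dimension

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x):
    # Single-head self-attention with identity Q/K/V projections,
    # plus a residual connection. Deliberately minimal.
    scores = softmax(x @ x.T / np.sqrt(d))
    return x + scores @ x

W1 = rng.normal(size=(d, 4 * d)) * 0.1
W2 = rng.normal(size=(4 * d, d)) * 0.1

def feed_forward(x):
    # Position-wise feed-forward (ReLU) with a residual connection.
    return x + np.maximum(x @ W1, 0.0) @ W2

def run_stack(x, pattern):
    # pattern: a string over {'s', 'f'} giving the sublayer order,
    # e.g. "sfsf" (standard interleaving) or "ssff" (attention-heavy bottom).
    for ch in pattern:
        x = attention(x) if ch == "s" else feed_forward(x)
    return x

tokens = rng.normal(size=(5, d))         # 5 "token" vectors
interleaved = run_stack(tokens, "sfsf")  # the usual ordering
reordered = run_stack(tokens, "ssff")    # attention first, feed-forward on top
print(interleaved.shape, reordered.shape)  # → (5, 8) (5, 8)
```

The point of the sketch is only that the ordering is a free design choice: any string over the same sublayer budget is a valid architecture, which is what makes the search Kelvin describes possible.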
Josh Albrecht: [00:42:00] Any other ones like that?
Kelvin Guu: [00:42:04] There’s also the paper showing that large networks can learn completely random mappings of labels and yet still generalize in other settings.
Josh Albrecht: [00:42:13] Yeah. I saw a theory paper on something similar, on generalization and memorization, from ICLR 2021, a very similar idea about when things memorize and how they still generalize, which gave a pretty interesting theoretical reason for it too. The quick overview is that in their particular setting, they looked at the manifolds of information that are learned, and for a classifier the information manifold is much lower dimensional at the later layers than at the earlier layers. For images, for example, you have millions and millions of pixels, all sorts of crazy stuff going on, and at the end you’re trying to decide: is this a cat or a dog? So the later layers have this really low-dimensional thing, and it’s very easy for them to overfit to it. They found that just by rewinding those later layers, they could basically do just as well; they could undo the memorization by just fixing up those later layers. To me, it also felt a little bit connected to the success of things like SimCLR and BYOL and LoCo, where there are layers for prediction that you throw away afterwards. It’s almost like, by making these throwaway layers part of the loss, you can do the overfitting over there where it doesn’t matter and then get rid of it, and you’re left with the much more robust lower-level layers. So I thought that was interesting. Who knows if that’s the right intuition or exactly how it works, but I thought it was an interesting idea. Wrapping up a little bit and looking forward: are there other areas out there that you’re really excited by, or interested in moving towards over the next year or so?
Kelvin Guu: [00:43:39] A lot of the things that I mentioned earlier are areas where I’m looking forward to seeing progress over the next few years, so maybe I’ll mention a few things that are even farther out that I’m excited about. Now that we are getting models that very robustly have a lot of knowledge about the world, there should be an opportunity to use them more for tasks that you wouldn’t traditionally think of as knowledge-heavy tasks. And I’m thinking here of things more in reinforcement learning land. If you think about a human, they can pick up a new game and figure out how to play it relatively quickly, and they’re drawing on a lot of prior knowledge, both from playing other games and from the fact that the game is a metaphor for something in real life. The sorts of affordances that an object has, and the properties that it has, all come from somewhere in your prior world knowledge. And it seems that there could be a lot of connection between work in language understanding and work in game playing and other settings that actually involve the real world more. So not just games, but robotics. If you’re a robot, do you really have to have interacted with a particular doorknob before to know that it is a doorknob?
Josh Albrecht: [00:45:06] That’s really fascinating. Now I’m envisioning the robot taking a picture, classifying it like, oh, it’s a doorknob. Okay, cool, I know what to do with doorknobs! And using some off-the-shelf object detector, knowing, oh, doorknobs can be opened, doorknobs are made of particular types of material. That’s really interesting. Or in games, knowing, oh, it’s food, so it probably gives me health.
Kelvin Guu: [00:45:24] And I think that today, that’s just a pretty heavy lift, because first you need a robot, if you’re doing that particular setting. But more generally, it already takes a lot just to do reinforcement learning. Then you’re incorporating a very expensive language model whose knowledge is still not that easy to extract – it has to be prompted in various ways – and furthermore, that knowledge is in text, which is not a feature representation that you can immediately use for planning or policy learning. So it does seem like there are a lot of unanswered questions there.
Josh Albrecht: [00:45:55] Yeah, I think one of the things that’s really fascinating to us is the connection between these language models and the real world – grounded language understanding, the connection between the words and what they actually mean in the real world. I think the reinforcement learning applications you gave examples of are a pretty interesting and fascinating place, and I would expect to see a lot more progress there as well.
Kelvin Guu: [00:46:14] Yeah. Certainly things have been getting easier and easier over time. So I think some people worry that, oh, machine learning is going to put itself out of a job soon. And maybe that’s true, I don’t know. But I also think that there are just many more sophisticated tasks we can start to work on if things really are working. And if we can’t work on those, then we should find out why we can’t.
Josh Albrecht: [00:46:36] Are there any things that people listening might be able to help you with – future collaborators, datasets, tools, open questions, open problems?
Kelvin Guu: [00:46:45] Yeah. I think I’ve said “I don’t know” many times throughout this session already, so if anyone does know any of those things, I’d love to get an email, as well as corrections to anything I said wrong.
Josh Albrecht: [00:47:03] Yeah, actually, more generally, we would love that as well. And we’ll put up a list of references and links and everything, so if anyone sends anything, we’ll put it there. Feel free to reach out to us.
Kelvin Guu: [00:47:12] Thanks a lot for having me on the podcast, this was fun.
Kanjun Qiu: [00:47:15] Yeah, this was really great, really fun. I’m really happy that you were able to join.