Sam Charrington: [00:00:00] All right, everyone. I am here with Jabran Zahid, Senior Researcher with Microsoft Research. Jabran, welcome to the TWIML AI Podcast.

Jabran Zahid: [00:00:09] Thank you very much, Sam. It’s a pleasure to be here.

Sam Charrington: [00:00:11] Great to have you on the show. I’m really looking forward to digging into our conversation. To get us started. I’d love to have you share a little bit about your background and how you came to work at the confluence of biology and artificial intelligence.

Jabran Zahid: [00:00:26] Oh, thank you very much for this opportunity to share with you what we’ve been working on here at Microsoft. By training, I’m an astrophysicist, and prior to coming to Microsoft a year and a half ago, I was working on understanding galaxy evolution and cosmology. The most recent work I was doing was looking at galaxies and trying to develop techniques to tie those galaxies to the dark matter distribution in the universe.

I was interested in mapping the dark matter in the universe using the galaxies as little beacons of light in this sea of dark matter. It was a real privilege to be able to study astrophysics. It’s a beautiful subject, but as I’ve gotten older, one of the things that became a higher priority for me, personally, was to have a greater impact with the work I was doing, and by impact, I mean the ability to affect people’s lives on a day-to-day basis.

While astronomy is a beautiful subject, it’s not the most practical in terms of people’s day-to-day lives. It has an important cultural impact, but it doesn’t affect everyone’s life from day to day, so I started to look for opportunities. One place that made perfect sense to look towards was industry, where not only are there interesting projects and interesting things being done, there’s also the opportunity to have reach, if you work at the right place that has that reach to individuals.

Then one of my former colleagues, an astrophysicist herself, went to Microsoft Research. She told me about the position within the Immunomics group and told me a little bit about the details. It was just my bread and butter: a science project that, if successful, could have a huge impact, could even change the world if we succeed at what we’re doing in this project.

That just really got me excited. Once I had learned more about the project and brought my skills to the table, it made sense. I was a good fit for the role, and I ended up at Microsoft Research at the end of January last year, six weeks before the pandemic hit.

Sam Charrington: [00:02:26] Wow. Did you say Immunomics?

Jabran Zahid: [00:02:30] That’s what we call it. It’s immunology mixed with genomics, basically. Our project essentially is, we’re trying to map the immune system, and the way we do that is we genetically sequence the T-cells of the human immune system, which we’ll go into details on what that means. We’re essentially trying to learn how to read the immune system from the genes themselves.

Sam Charrington: [00:02:50] You mentioned that you started just before the pandemic. Did that influence the evolution of the project at all?

Jabran Zahid: [00:02:57] Absolutely. We have been engaged in helping Adaptive Biotechnologies. The project I work on, The Antigen Map Project, is a collaboration between Microsoft Research and Adaptive Biotechnologies.

We’ve been helping them make diagnostics, and when COVID hit, it presented a very unique opportunity for us to turn all of our efforts, or a big fraction of our efforts, towards trying to diagnose COVID, which we did successfully. Adaptive Biotechnologies has an FDA-authorized diagnostic on the market, which you could order today if you wanted to.

COVID not only provided a very strong impetus in regards to the fact that it was just one of the most pressing human problems that we were facing, but also, it provided a unique opportunity to really bring together many, many aspects of our project. It’s a great test case for understanding what we do in our project, what the antigen map is. It really accelerated our research. I anticipate that when we look back at last year, it will be seen as a watershed moment in our project, simply because of the accelerant that COVID was for our project.

Sam Charrington: [00:04:15] Awesome. We’ll dig into the machine learning aspect of the project and how you apply ML, but I think I’d like to hear a bit more about the biology and understand the problem that you’re trying to solve at a more fundamental level. Immunomics, how does it work? What specifically are you trying to do with The Antigen Map Project?

Jabran Zahid: [00:04:39] Yeah. Thank you for asking about that, Sam. I’m really happy to share this, and I should, first of all, say that what I’m discussing now is a representation of 50 or so people’s work. It’s not just me who’s carrying this out. This is a large collaboration. It really is an effort that spans multiple companies and builds on decades of research in immunology.

The human immune system is an amazing system. The adaptive immune system specifically is something that started evolving about 300 million years ago.

What the adaptive immune system is, is the part of the human immune system that has a memory. When you’re a kid, you get sick with, let’s say, measles or something; your immune system will eventually respond to that, and the adaptive immune system will retain a memory of having seen the measles. You will not get sick with the measles again if you’ve had it in the past, because the second your body gets exposed to the measles, your adaptive immune system is ready to go. It remembers what the pathogens from measles look like, and it springs into action.

A big part of that immune system is the T-cells. The T-cells are essentially floating around in your blood and in some of your organs. They have a little receptor on their surface, and that’s actually what we sequence: the T-cell receptor. We get a genetic sequence of the T-cell receptor, and that genetic sequence encodes, more or less, the shape of that receptor. Like a key fitting into a lock, if that T-cell’s receptor finds the lock that it fits into, if it finds the pathogen that it binds, it’ll basically trigger an immune response. After that immune response, once the virus or bacteria is cleared from the body, it will remember. That special T-cell will stick around in your body for much longer than the rest of the T-cells.

These T-cells, the adaptive immune system itself, are produced by a stochastic, quasi-random process in which different combinations of amino acids are put together, producing a huge number of possible shapes for the T-cell receptor. That’s where the complexity of the problem comes in, and that’s where machine learning is required.

The space of possible T-cells is something like 10 to the 15, and you yourself have hundreds of billions of these things in your body. We try to use sequencing, which is Adaptive Biotechnologies’ secret sauce: their ability to genetically sequence a large number of T-cells.

For an individual, I can tell you, from a vial of blood you can sequence something like 500,000 to a million T-cells, and then we can read those in our computer, and we have that for tens of thousands of individuals. You can imagine, now you have all these strings of letters floating around that represent T-cells. You want to read what those letters mean, because those T-cells encode the memory of all the things you’ve been exposed to in your past. If we can successfully read that book of your immune system, we will be able to tell you all the things you’ve been exposed to in the past, and things you may be actively fighting, which is the area we’ve been mostly focused on: building diagnostics of things you’re actively fighting now.

Sam Charrington: [00:07:53] A couple of questions based on your explanation. The first is, you mentioned that T-cell production is, in many ways, random, the result of some stochastic process, so the 500,000 T-cells that you mentioned you might pull from a vial of my blood isn’t some historical DNA record of 500,000 diseases. There’s some number of diseases that have created T-cells, but then there’s a lot of randomness built in. Am I getting that right?

Jabran Zahid: [00:08:23] That’s a wonderful question. The process by which these T-cells are produced is called VDJ recombination. Essentially, in your thymus, different groupings of amino acids are inserted to create the T-cell receptor. Now, those are naive T-cells; they don’t know what their cognate pathogen is. You just have a huge number of them. This is the beauty of the adaptive immune system: it just creates a huge number of them. It’s only when one of those random naive ones encounters a pathogen to which it latches, that key fitting into the lock, that they proliferate. They clonally expand, they start reproducing themselves, they retain a memory, and they become what are called memory cells. This is a very simplified version of it, but essentially that’s what happens at that stage; those will stick around in your blood far longer than the ones that are naive. To your question specifically, when we draw the vial of blood, we have a huge number of these naive cells. The vast majority are naive cells, actually, but not all of them. One to a few percent are these memory cells, and discriminating between the memory and naive cells is one of the major challenges of our project, and that’s something we’re very actively engaged in.

Sam Charrington: [00:09:38] We’ll come back to that in a second. I want to ask another question I had about this; maybe it is the same question. When you’re doing the sequencing, is the sequence of proteins directly telling you the receptor, or something about the receptor, or is there something more fundamental about a T-cell that is coming out of the sequencing?

Jabran Zahid: [00:10:02] That’s a great question. What we sequence is what’s known as the CDR3 region, which encodes the receptor itself. The sequence is just amino acids, 20 different possibilities, and those amino acids encode for proteins, which then make up the structure of the receptor. In your mind, the picture you should have is literally the lock and key picture: there is a structure to this receptor. It has to physically fit the pathogen that it’s trying to bind, in a way that it binds through a physical chemical bond, essentially.

If the shape is right, then those two things will come together and it’ll be a good fit, and that’s when the immune response starts. Otherwise, nothing happens. Those cells just float around.

Sam Charrington: [00:10:51] When you’re using machine learning to distinguish between the random T-cells and the ones that are activated and have identified their pathogen, it’s not within that protein sequence because the receptors are the same. Is there some other flag or characteristic that distinguishes the two?

Jabran Zahid: [00:11:12] Generally, if one really wanted to get the ground truth, you would go and look at surface markers on the T-cell, so not the receptor itself, but markers on the T-cell that would help you distinguish whether it’s a memory or naive cell. The way we go about understanding that issue is by looking at other characteristics. One of the primary characteristics is what’s known as the publicity of the T-cells. These T-cells have a range of probabilities of occurring in any individual, which is referred to as the generation probability.

The probability is generated by this random process of VDJ recombination, and for ones that have reasonably high generation probabilities, there’s a good chance you’ll see them in a number of individuals. One of the standard ways we set up our experiments, the method by which we arrive at a collection of T-cells that are both memory and specific to a disease, goes like this; COVID’s a great example. You have a thousand individuals that have COVID; we’ve drawn their blood and sampled their T-cells. We compare that against a thousand people, a control sample, that don’t have COVID, and we simply ask the question: which T-cells appear at a statistically significantly higher frequency amongst the individuals that have COVID as compared to the individuals that don’t?

That gives you your set of T-cells that may potentially be T-cells that are actively fighting COVID and then you do all your machine learning and things like that from there. That’s the starting point of our diagnostic procedure.
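The case-versus-control comparison described above is, at its core, a per-sequence enrichment test. As a rough illustration only, not Adaptive’s actual method (the counts, sample sizes, and significance threshold below are all invented), one could score each T-cell sequence with a one-sided Fisher’s exact test:

```python
from math import comb

def fisher_one_sided(k_case, n_case, k_ctrl, n_ctrl):
    """One-sided Fisher's exact test (hypergeometric tail): the probability
    of seeing a sequence in at least k_case of the n_case case subjects,
    given that it appears k_case + k_ctrl times among all subjects."""
    carriers = k_case + k_ctrl        # subjects carrying the sequence
    total = n_case + n_ctrl           # all subjects, cases plus controls
    p = 0.0
    for k in range(k_case, min(carriers, n_case) + 1):
        p += comb(n_case, k) * comb(total - n_case, carriers - k) / comb(total, carriers)
    return p

# Toy numbers: a sequence seen in 30 of 1,000 cases but only 5 of 1,000
# controls is strongly enriched; the 1e-4 threshold is illustrative only.
p_value = fisher_one_sided(30, 1000, 5, 1000)
is_enriched = p_value < 1e-4
```

Sequences passing the threshold would form the candidate disease-associated set that the downstream modeling starts from.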

Sam Charrington: [00:12:47] Got it. It sounds like a great application for some pattern matching.

Jabran Zahid: [00:12:51] Yeah, absolutely. You can really imagine some of the tools of natural language processing coming in here, because these are literally just strings, but you’ve got to throw in a little bit of physics too, because they’re encoding for physical properties of a thing.

It’s a complicated problem, which we’re just scratching the surface of right now, but really have made enough progress that it’s clear to us this is going to be something that’s going to yield very important techniques for us understanding human health.

Sam Charrington: [00:13:18] Before we dig into the technology aspect, I just want to hit pause briefly and ask you: you talked about your background as an astrophysicist and cosmologist. I did not hear doctor, biologist, any of that, and yet you’re speaking very fluently about the biology. I’m just curious about that process for you, coming up to speed in this domain, how you approached it, and if there’s anything interesting from your background that you brought to this problem area?

Jabran Zahid: [00:13:52] Starting out on this project, I had a high school biology understanding of the immune system, and then whatever Wikipedia told me. I didn’t have any sophisticated knowledge. That was the primary challenge. The tools that I had learned along the way for studying galaxies and cosmology were very applicable and translated very straightforwardly to the problem, and the techniques, the training, and the craft of doing research carried over. I had been doing research for 20 years; I understood it and had great mentorship that really gave me those skills. But the domain-specific knowledge was the greatest challenge, and remains my greatest challenge to this day.

You may say I speak of it fluently, but in my mind, I feel that the ignorance outweighs the knowledge that I have on this subject. I appreciate you saying that, but the reality is that that’s been the challenge. Basically, the way you approach a science problem is you’ve got to start playing with the data, but at the same time, you’ve got to contextualize that exploration of the data in what is known in the field. The way I’ve gone about doing that is, of course, reading a huge number of papers from the 30 or 40 years of immunological research on the subject, and going to conferences when possible. That’s been a little bit more difficult these days, but scientists have made huge strides in virtual conferences. One of the most important things is talking to my colleagues who are immunologists and just asking questions. Sometimes it may seem like a stupid question or a dumb question, but it’s really just a reflection of my own ignorance and trying to fill that in. That’s what’s gotten me this far, and I feel that filling in those gaps, combined with the techniques that we’re developing as a team using tools of machine learning, are really the things that are going to be required to take this project to the next level.

Sam Charrington: [00:15:50] Let’s talk about some of those techniques. You described the setup, at least at a high level, of this pattern matching problem. You’ve got your folks with an identified disease. You’ve got your control group. You take a bunch of T-cells from all of them, and you’re trying to figure out which T-cells are more significantly evidenced in your exposed group.

What machine learning approaches do you apply to a problem like that? And even the step before that, what does the data collection look like? So many of these are supervised techniques; what does the labeling process look like for this kind of problem?

Jabran Zahid: [00:16:32] We can take COVID as an example; it varies from disease to disease.

COVID encapsulates much of the process, which is, in some sense, ubiquitous in any machine learning project. You collect your data, which is drawing vials of blood. For COVID, the way we did that was through Adaptive’s partners throughout both industry and academia, so the ground truth, oftentimes, not always, but most of the time, was taken as a PCR test. If someone had a PCR-positive test, we know this person has the virus in their body, and therefore they were not only exposed but infected. Let’s draw their blood. That’s where the labels are typically coming from. There are other subtleties involved, which we don’t need to go into. Then you get your labeled data, and now we have a huge number of…

Sam Charrington: [00:17:25] If I can jump in quickly there. These PCR tests aren’t perfect. They have whatever the false positive rate is for the PCR test, false negative rate. Do you try to adjust for that in the process, or either by some kind of quorum technique, multiple tests or mathematically somewhere?

Jabran Zahid: [00:17:48] Yeah. Different ways, depending on the circumstances in which we address that issue. Oftentimes what we see is that these false negatives, which are somewhere at the level of 5% or so, I think that’s typically the number, show up as outliers, but we have large enough samples, and that’s just part of the game. There’s going to always be…

Sam Charrington: [00:18:06] Another source of noise.

Jabran Zahid: [00:18:08] Yeah. There’s always noise and you just deal with it and it depends on the circumstances and how it’s affecting your system, so it’s certainly an issue, but we are well equipped to handle that.

Sam Charrington: [00:18:16] Okay.

Jabran Zahid: [00:18:17] Yeah. Then we have our labeled data. In any machine learning project, one of the things you really want to do next, once you collect the data, is determine your features. At the highest level, our features are these public sequences, the sequences of T-cells that appear in multiple individuals at a statistically higher frequency in the individuals who have whatever endpoint we care about. In the case of COVID, that’s people who have COVID versus individuals in our control sample. Then we just count those sequences, how many of them occur in an individual, fit a simple logistic regression model, and that gets you pretty far.

It’s impressive how far that can get you. Just like in any machine learning application, usually the simplest model gets you 90% of the way there. You have to start with the simplest models because you have to have a baseline, and you can interpret them much more easily, so that’s where we’re at in terms of our diagnostic. We have the simple model that we could submit to the FDA, and it has been authorized by the FDA, but of course you want to extend on that. We have this enormous data set, and how do you push that further? We don’t care about just whether you have COVID or not. We want to know other things that we can learn from this data.
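The simple baseline described above, count the disease-associated public sequences in a person’s repertoire and feed that count to logistic regression, can be sketched in a few lines. This is a toy version under stated assumptions: the CDR3 strings are made up, and the single-feature training loop is a generic minimal implementation, not Adaptive’s pipeline.

```python
import math

def featurize(repertoire, disease_sequences):
    """Feature: how many disease-associated public sequences appear in
    this person's repertoire (a set of CDR3 strings)."""
    return sum(1 for s in disease_sequences if s in repertoire)

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Minimal one-feature logistic regression fit by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    """Probability of the endpoint (e.g. COVID-positive) given the count."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Toy data: invented CDR3 strings; cases carry more of the hypothetical
# disease-associated sequences than controls do.
disease_seqs = {"CASSLGETQYF", "CASSPGQGYEQYF", "CASSIRSSYEQYF"}
cases = [{"CASSLGETQYF", "CASSPGQGYEQYF"}, {"CASSLGETQYF", "CASSIRSSYEQYF"}]
controls = [{"CASSQDRGYGYTF"}, set()]
xs = [featurize(r, disease_seqs) for r in cases + controls]
ys = [1, 1, 0, 0]
w, b = train_logistic(xs, ys)
```

The interpretability he mentions falls out directly: the model is just a monotonic function of one count, so a prediction can always be traced back to which sequences were present.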

One interesting application: in addition to these tests where we just sequence what we call the repertoire, the T-cells, there are laboratory experiments in which we take actual pieces of the COVID virus, put them in test tubes, throw a bunch of T-cells at them, and see what sticks to what. One of the issues with the diagnostic approach that I described is that you see these T-cells occurring at a higher statistical frequency in the cases versus the controls, but you don’t really know for sure whether they’re specifically attacking COVID.

These laboratory experiments allow us to make that test. Take those pieces of the virus: when the virus enters your body, the way your immune system responds is it chops up the virus and then presents it, essentially, on the surface of a cell for the T-cell to come along and grasp onto it.

There’s a presentation step and that presentation is usually about 10 or so amino acids of the virus. It gets chopped up. We chop up the virus, throw it in a test tube, throw a bunch of T-cells at it, figure out which ones stick and then ask the question: of the ones that are sticking, how many of these do we see in our diagnostic?

That is, among the public cells that comprise our diagnostic. The upshot of all of this is that we now have the ability to know both that the T-cells in our diagnostic are attacking COVID and what they are attacking in COVID. What part of the virus are they attacking?
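The bookkeeping in that experiment, chop the viral protein into short peptides, record which T-cells stick to which peptide, then intersect the binders with the diagnostic’s public sequences, can be sketched like this. Everything here is illustrative: the peptide length is the rough "10 or so" from the conversation, and the protein fragment and T-cell identifiers are invented.

```python
def chop_into_peptides(protein, k=10):
    """Slide a window of length k across a protein sequence, mimicking the
    roughly 10-amino-acid pieces presented to T-cells."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]

def diagnostic_overlap(binding_results, diagnostic_tcells):
    """binding_results maps each peptide to the set of T-cells that stuck
    to it in the lab. Return, per peptide, the binders that also appear in
    the diagnostic's public T-cell set."""
    return {pep: binders & diagnostic_tcells
            for pep, binders in binding_results.items()
            if binders & diagnostic_tcells}

# Toy data: a short invented protein fragment and made-up T-cell IDs.
peptides = chop_into_peptides("MFVFLVLLPLVSSQ", k=10)
lab = {peptides[0]: {"T1", "T7"}, peptides[1]: {"T9"}}
overlap = diagnostic_overlap(lab, {"T1", "T2"})
```

A peptide that lands in `overlap` tells you both that a diagnostic T-cell binds the virus and which part of the virus it binds.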

Sam Charrington: [00:21:06] Meaning which 10-amino-acid sequence is the receptor latching onto in particular?

Jabran Zahid: [00:21:13] Exactly. 10-ish. That’s just a rough number.

One upshot of this is we can now distinguish between whether a T-cell is hitting the spike protein, which is the protein that forms the spikes on the surface of the Coronavirus, or the envelope protein, which creates something else.

If you follow the vaccine development, one thing you’ll note is that almost all the vaccines, certainly all the ones that have been approved in the United States, target the spike protein. They don’t introduce the whole Coronavirus. They just cut out the spike protein, whether it’s an mRNA vaccine, where they indirectly introduce that RNA into your body, or whether it’s something like the Johnson & Johnson, where they attach it to a vector, like a common cold virus. In any case, that’s what your body is building its immune response to, and the fact that we can discriminate between what the T-cells are responding to means that our diagnostic has the power, and we’re working on this very diligently, to discriminate whether you have had a vaccine or a natural infection. That has important implications for things like trying to understand people who get reinfected after a vaccine, for example, and vaccine manufacturers will really care about that.

COVID, whether we like it or not, is going to be here for a while, so this is really providing an ability for us to begin to understand and dissect the disease at a level of resolution that hasn’t been previously possible.

Sam Charrington: [00:22:49] I’m not sure I’m following that. How does this technique allow you to differentiate between folks that have T-cells because they were vaccinated versus the naturally occurring virus? Before you do that, I love that you refer to the set of T-cells that a person has as a repertoire, like it’s a certain set of skills.

Jabran Zahid: [00:23:15] That’s what the field refers to them as. It’s a bit of jargon, but I love that too. I’m glad you picked up on that. That’s cool, right? That’s the technical term for it. Again, the diagnostic that we build works by counting up the T-cell response. You count up the different T-cells that we think are specific to COVID, and now, for that subset of T-cells that make up our diagnostic, we can say what each of them is specific to.

Let’s say we have 10,000 T-cells in our diagnostic. Some fraction of those are attacking the spike protein, and some fraction are not attacking spike; they’re attacking the envelope. The spike protein is a small fraction of the genome of the Coronavirus: there are something like 10,000 amino acids, and the spike is only a few hundred to a few thousand. I don’t remember the exact number. But if we know which T-cell is attacking what, then in people who have been vaccinated, we only observe the T-cells that are targeting spike. It’s amazing how robustly we can do that.

Whereas someone who has a natural infection will have a response that covers a much broader range of the T-cells.
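The vaccine-versus-natural-infection distinction described above reduces to asking what fraction of a person’s diagnostic hits target spike. A hypothetical sketch (the T-cell identifiers and the 90% threshold are invented for illustration; the real analysis is presumably statistical rather than a hard cutoff):

```python
def spike_fraction(detected, spike_specific):
    """Fraction of a person's detected diagnostic T-cells that target the
    spike protein, as opposed to other parts of the virus."""
    if not detected:
        return 0.0
    return sum(1 for t in detected if t in spike_specific) / len(detected)

def likely_vaccinated(detected, spike_specific, threshold=0.9):
    """Hypothetical rule: a vaccine-induced response should be almost
    entirely spike-directed, while a natural infection is broader."""
    return spike_fraction(detected, spike_specific) >= threshold

# Toy repertoires with made-up identifiers.
spike_set = {"S1", "S2", "S3"}
vaccinated_hits = {"S1", "S2"}         # all spike-directed
infected_hits = {"S1", "E1", "N1"}     # spike plus envelope/other targets
```

The key point survives the simplification: a narrowly spike-directed response looks like vaccination, and a broad response looks like natural infection.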

Sam Charrington: [00:24:26] It’s really speaking to both the granularity of the problem, and I’ll elaborate on this in a second, but also the diversity of T-cells that you are speaking to, it’s not the case that there is a Coronavirus T-cell and there’s one and only one. It’s that there’s a family of T-cells that attack different aspects of the Coronavirus, and maybe even multiple that attack the spike, and the population that someone has of each of the, possibly many in this family, can tell you a lot about how they acquired the virus.

Jabran Zahid: [00:25:04] Absolutely. That’s partly where the machine learning comes in.

Determining how the immune response was triggered, that’s really where the machine learning comes in: finding those deep, deep patterns encoded in those receptors. What makes these T-cells specific to COVID? What’s similar about these two that we know are hitting the spike protein? Things like that. That’s really where the next step of the project requires very sophisticated modeling.

It’s a problem we haven’t cracked, by the way, despite many, many different attempts, so it’s a very difficult problem and can only be addressed with the tools and sophistication of machine learning algorithms.

Sam Charrington: [00:25:52] We started out talking about logistic regression and the supervised problem where you’ve got the test results as labels, and now you’re starting to talk about things that sound like clustering and unsupervised types of problems. Is that the general direction that you’re heading with this kind of analysis?

Jabran Zahid: [00:26:11] Absolutely. The unsupervised techniques provide a means for clustering, for example, and dimensionality reduction, the standard approaches that one would throw at any problem with very high dimensionality and a large parameter space, but that’s only the first step. The real question, the heart of it all, is that we want to read the immune system. What we call the antigen map is: I give you a T-cell and its receptor, and you tell me what antigen that T-cell will bind to, because it’s only then that we can read off your immune history. When we draw your blood, we may know this T-cell is a memory cell, but we won’t know if it’s a memory cell to the common cold or to the Coronavirus or to some bacteria. We won’t know that just from looking at it. We’ll have to use the sequence and understand how that sequence encodes the information about what it has attached to in the past, what it’s bound to in the past. That’s where the machine learning really comes in, and you can imagine the complexity of the problem. We’re literally trying to read the immune system in a way that allows us to read your immune history.

It’s just a bunch of strings when you look at it on this computer screen, and so the challenge is going from that bunch of strings on your computer screen to a physical mechanism and physical system and the physical properties of that T-cell that really give us the information about what it’s binding.

Sam Charrington: [00:27:47] You’ve tried a lot of things and have a list of things that haven’t worked. What are some of those things?

Jabran Zahid: [00:27:54] That’s a great question. It’s pretty interesting because a few researchers have come onto this problem since I have, and everyone treads the same path in some sense, which is, you come in and you say, “Logistic regression? How are you still using logistic regression to do this?”

That’s the naivete that’s required to really try some interesting, crazy things in science. One of the obvious things is: how far could we carry this analogy of trying to read the immune system?

One of the things I tried was to take BERT, which is a well-known natural language processing model. It’s called a transformer. It’s a model that’s essentially used in natural language processing tasks, for question answering in a bot, or translation. It’s a very multi-faceted tool. Natural language processing is a field in which machine learning has really matured, and they have techniques and approaches for what they call transfer learning, where you take a model trained in one domain, this happens in image analysis as well, let’s say all of the web pages of Wikipedia, and then apply it in another domain. You do this training on a huge data set, and then you fine-tune it to your specific problem. It works to varying degrees depending on the nature of the problem, but that’s beside the point.

The question I asked is, can we just use this transformer-type natural language processing model to read the sequences and see if we can get somewhere? It turns out it just doesn’t work, at least in the way that we set it up. It’s not surprising.

With these sequences, the analogy between natural language and biophysics and biochemistry breaks down. Understanding where that breakdown happens is one of the most critical questions to really figuring out the right set of algorithms, the right set of constraints, and the right data.

In some ways, the right setup of the problem, that’s one of the most difficult tasks in machine learning, setting up the problem appropriately. Hopefully these failures will help guide us to the path that’s going to lead us to success.

Sam Charrington: [00:30:14] Are there specific things that you can share that you learned in attempting to apply BERT to this problem or specific places that it broke down?

Jabran Zahid: [00:30:24] I didn’t push it too far. I would say that the one thing that immediately stood out to me was that it worked to a degree. At first, I was very excited. I was like, “Wow, this has predictive power on specific tasks,” and so, “Hey, let’s publish this or let’s use it,” but it turned out, BERT is something like a hundred-million-parameter model. It’s a really, really huge model, which, unless you have a lot of data, is not really justified. The reason it was working is basically the way BERT is designed, as I understand it: typically you have a layer that does all the embedding, and then you have this layer that you attach on the end, which does essentially the decoding, slash, whatever task you care about, and more or less most of the interesting stuff was happening in those surface layers.

You could really reduce the model down, take away the 700-odd hidden layers, and still get the same level of accuracy. In fact, what that led me to realize was that there are actually even simpler models, like Random Forest, embarrassingly, that will get you the same level of accuracy that BERT did, and one of the lessons I honestly took away from that was: don’t rush to the most complicated models. Start with simple models and build up from there.

That’s what we’ve been doing. One of the things we’ve learned by taking this approach is that with these strings of amino acids, you cannot just substitute new amino acids in random positions and think that the receptor will bind to the same thing.

Substitutions can happen only at very specific places in the amino acid sequence, and only between very specific amino acids, and this of course begs the question: why is that the case? We suspect this has to do with the physical properties of the amino acids themselves.

Some are interchangeable. This is known because the physical, chemical properties of these amino acids have been measured in the laboratory. Putting that physical picture together, which came into sharp relief when we started with complex models but understood that simpler models could get us there, has really guided us on the path of understanding the problem.
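The observation that substitutions are tolerated only at specific positions can be quantified, for instance, with per-position entropy over a set of equal-length disease-associated sequences: conserved positions show low entropy, substitution-tolerant positions show high entropy. This is a generic sketch (the toy sequences are invented), not the team’s actual analysis:

```python
from collections import Counter
import math

def positional_entropy(sequences):
    """Per-position Shannon entropy (bits) of amino acid usage across a set
    of equal-length CDR3 sequences. Low entropy means the position is
    conserved; high entropy means it tolerates substitutions."""
    length = len(sequences[0])
    assert all(len(s) == length for s in sequences)
    entropies = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

# Toy example: positions 0-3 are conserved, position 4 is fully variable.
seqs = ["CASSA", "CASSG", "CASSL", "CASSV"]
h = positional_entropy(seqs)
```

A follow-up step, outside this sketch, would be checking whether the amino acids that co-occur at a high-entropy position share measured physicochemical properties, which is exactly the physical picture being described.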

When we’re dealing with human health, it’s not enough just to predict things. We need to understand why those predictions are happening the way they are; otherwise we run a serious risk of producing essentially a black box. We’ve found that in human health you often have confounding signals. You think you’re seeing one thing, but it’s actually being caused by something completely unrelated, and when you don’t fully understand what your model is doing, you can fall into those types of traps.

Sam Charrington: [00:33:15] With regard to BERT, you mentioned transfer learning; it sounds like you were using a pre-trained BERT model and trying to fine-tune it. Did you also try to train from the ground up?

Jabran Zahid: [00:33:31] Yeah, we did. The thing that we took from BERT was the unsupervised training step. What BERT does is take a sentence, mask out random words in that sentence, and then try to reproduce what was masked out, and that’s unsupervised because it…

Sam Charrington: [00:33:47] It would seem to preserve some of the positionality that you require for proteins?

Jabran Zahid: [00:33:54] Exactly. We would mask out random amino acids and then try to reproduce the sequence on the other side. You start with that unsupervised task.

That’s how you do the pre-training, so to speak, and then you slap on a layer for a classifier or whatever your specific task is. We definitely tried that, and it was successful, as I said, but what we came to learn was that something like a Random Forest is a lot easier to interpret: what is it that we’re learning? Through that procedure we learned, “Oh, it’s actually positional information and very specific types of substitutions that are allowed.”
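The masking step just described can be sketched in a few lines: hide random positions in an amino acid sequence and keep the originals as prediction targets. The mask token, masking rate, and example sequence are assumptions made for illustration; the real pre-training objective sits on top of a full transformer.

```python
# Minimal sketch of BERT-style masking applied to an amino acid sequence:
# hide random positions and keep the originals as targets to predict.
import random

MASK = "?"  # placeholder standing in for BERT's [MASK] token (assumption)

def mask_sequence(seq: str, rate: float = 0.15, seed: int = 0):
    """Return (masked sequence, {position: original residue})."""
    rng = random.Random(seed)
    masked = list(seq)
    targets = {}
    for i in range(len(seq)):
        if rng.random() < rate:
            targets[i] = seq[i]
            masked[i] = MASK
    return "".join(masked), targets

masked, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# The pre-training objective: recover `targets` from `masked`.
print(masked)
print(targets)
```

Because the model must reconstruct each hidden residue from its neighbors, this objective naturally forces it to learn exactly the positional and substitution structure discussed above.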

It was a lesson I’ve learned many times doing machine learning, which is: don’t go to the complex models, don’t necessarily go to what’s sexy right away, unless it’s warranted. But we also follow our passions, and sometimes you see the shiny new model and you want to try it.

BERT makes it easy, and the natural language processing community in general makes it very easy, to take models out of the box and use them. Something I think the rest of the sciences, and certainly immunology, would benefit greatly from is making progress in that way as well.

Sam Charrington: [00:35:01] Awesome. Tell us about where you are with this project relative to where you want to be and what the future path is?

Jabran Zahid: [00:35:11] Yeah. We have made significant progress in the last year, driven by COVID, not only because it was, and remains, one of the greatest immediate challenges facing humanity, but also because it provided an accelerant for us to bring together all the techniques we’ve been working on. I described, for example, these laboratory techniques where we throw a bunch of T-cells at pieces of the virus; bringing that together with our diagnostic approaches has demonstrated this application I was describing for discriminating between vaccine versus natural infection, et cetera.

We really brought together a lot of the different techniques and demonstrated their power, not only to ourselves, which is one of the most important things, but to the world, by having these diagnostics authorized by the FDA, and I may be wrong about this, but I’m pretty confident that these are some of the very first, if not the first, machine learning COVID diagnostics authorized by the FDA.

That in and of itself is an amazing accomplishment, and there’s a lot of back and forth on how you do that, how things like that get validated, et cetera. That’s an interesting side note. We made enormous progress. The ultimate goal is the antigen map, as I described at the beginning, which is this ability to take any T-cell and understand what it’s meant to target.

My hope is that five years from now, when we look back at this moment, we’ll see it as a watershed moment. We will have arrived at a firm understanding of whether that is even possible, whether the antigen map is possible because the reality is, we often refer to it internally as a moonshot.

It’s a high-risk, high-reward venture, but if we are able to succeed, we will have the ability to understand immune risks to human health in a way that humans never have before. It will impact therapeutics, diagnostics, every aspect of how we treat human health. I’m excited to be a part of this. I hope we succeed. I hope we are able to provide this great benefit to the world, and we’ll see if we can succeed or not. That’s the question we’ve set out to answer, and hopefully in five years, we’ll have an answer to it.

Sam Charrington: [00:37:30] Awesome. Well, Jabran, thanks so much for doing the work, but also coming on the show to share a bit of it with us.

Jabran Zahid: [00:37:38] Sam, thank you so much for this opportunity to share the amazing work we’re doing on our team. Thank you.

Sam Charrington: [00:37:43] Thank you. All right, everyone. That’s our show for today to learn more about today’s guest or the topics mentioned in this interview, visit Of course, if you like what you hear on the podcast, please subscribe, rate and review the show on your favorite podcatcher. Thanks so much for listening and catch you next time.