The Journey of a Tech-Savvy Doctor Who Won’t Let ChatGPT Make His Bed (Yet)
This week I am talking to Justin Norden, MD (@JustinNordenMD), Partner at GSR Ventures (@GSRVentures), a digital health investor and educator at Stanford Medicine. Like so many of my guests, Justin’s career start was different and interesting, leading him down a path of medicine blended with technology and business that included a Master’s in Computer Science and Computational Biology.
ChatGPT Just Passed the Medical Exam to Become a Doctor
With an extensive background in AI and technology, and watching the excitement around ChatGPT, Justin noticed that the latest release failed to include testing this new large language model on the exams for doctors, so he fed GPT-4 a final medical exam. In the announcement from OpenAI for the release of the GPT-4 model, they listed several exam results, but medical exams were not among them.
What were the results, you ask? Well, you will need to listen to the podcast to hear, but let’s just say they were not disappointing. For context, the previous model, GPT-3.5 (the ChatGPT release model), performed at 57% correct, which corresponds to roughly the 3rd percentile, near the passing threshold, or a score of ~209.
As you will hear, Justin will be teaching a course at Stanford University (MED 216: Generative AI and Medicine) which filled in under 3 minutes, which speaks to the popularity of the topic and area.
Things are changing fast and the power of these models is undeniable. Medical knowledge and interpretation were historically the province of experienced clinicians after years of training. We are watching as access to medical knowledge and interpretation is democratized in front of us. Those of us in medicine need to think and react thoughtfully as these models start to interact with patients and providers in the real world.
Listen in to hear our discussion on the hype and excitement, tempered with some keen insights on what all this can and will mean for medicine now and in the future.
Listen live at 4:00 AM, 12:00 Noon, or 8:00 PM ET, Monday through Friday for the next week at HealthcareNOW Radio. After that, you can listen on demand (See podcast information below.) Join the conversation on Twitter at #TheIncrementalist.
Listen along on HealthcareNowRadio or on SoundCloud
Raw Transcript
Nick van Terheyden
And today I’m delighted to be joined by Dr. Justin Norden. He is a partner at GSR ventures. Justin, thanks for joining me today.
Justin Norden
Nick, thanks so much for having me on.
Nick van Terheyden
So if you would, please give us a little bit of a summary of how you arrived at this point what your career steps were to get you to this point in your career if you would.
Justin Norden
Perfect, happy to. So my journey begins growing up in Seattle. I came from a family of physicians, but also went to the grade school where, you know, Microsoft was the bar for success. So if you pull those two things all the way through, that helps to explain how I got here. My undergrad and Master’s were in computer science and computational biology, where I focused on machine learning and genomics, and then I later got my medical degree and MBA from Stanford. I went from there to found my own company in the AI space, which we sold to Waymo, Google’s self-driving car company, worked on the healthcare team at Apple for a couple of years, where we launched some of the Women’s Health features that are live on a billion phones across the world today, and helped launch the Stanford Center for Digital Health. There we were doing some of the first telemedicine visits out of Epic and evaluating technologies for the health system, and eventually I found my way to GSR Ventures, where I’ve been for almost four years now. We’re a team of physicians and technologists with prior experience founding and running our own startups, who share the same vision for how healthcare could be improved by technology. Additionally, I teach at Stanford; I’m an adjunct professor in the Department of Biomedical Informatics, where I teach on digital health and AI. And really, all that said, one of my goals now is: how do we really democratize healthcare access? How do we bring technology into medicine in the right way? I’m super excited to talk with you more about that today.
Nick van Terheyden
All right. So, an impressive background. And I always like the fact that there’s normally some specific element in people’s history that drives them to what they do. And, you know, I’m just imagining that whole Microsoft impact, and all the stories around Bill Gates and him programming, and there you are. And then yet you come from a family of physicians, which allowed you to blend these two. So, a fascinating background that really explains a lot about you. One of the threads that you’re on the show for is the presence of ChatGPT. And of course, it’s just constantly being talked about doing all sorts of things. I mean, essentially, it does everything, including making your bed, delivering food, you name it, it’s a superhero. But you did something interesting: you actually asked, how would it do with the medical test for medical students to qualify? Tell us a little bit about that, and what you managed to achieve. And for the benefit of some people who may not know what Step 2 is, just explain where that sits in the context of qualifying as a doctor.
Justin Norden
Absolutely. So, as everyone has heard, going through medical school is one of the most challenging professional degrees one can do: constantly studying, memorizing, taking tests, culminating at the end of your four years of medical school, or more in many cases, with taking the USMLE Step 2 Clinical Knowledge exam. So this is the final exam testing across all medical information. You know, that’s not just basic science information, but also the start of clinical management. What do you actually do? What is the right diagnosis? What should you do next with a patient case? It’s a multiple-choice exam, many, many hours, where students will go to a test center and then eventually get the results, needing a passing score to get their medical degree and move on to residency training. So really, it’s the culminating exam of medical school. And so, using that as a benchmark, we wanted to say: how does GPT-4 do on this exam? Previously, when ChatGPT came out in November, some researchers looked to answer this question, and I’ll talk about this exam as tested in ChatGPT, which is from OpenAI, as well as what Google did with Med-PaLM, because these things really happened in parallel. So back in November, when ChatGPT came out, some researchers literally copied and pasted question stems from these USMLE Step 2 practice questions into ChatGPT to see how it did. And Google more or less did the same thing. What we found at that time, which was amazing, was that it basically reached a passing or near-passing threshold, at around 60%. This was an incredible result. We hadn’t really seen this before with any of these AI tools. And people started talking about it, and everyone around the world, mostly, has now heard about ChatGPT.
Fast forward just a few months. Last week, we had GPT-4, the next version, launch from OpenAI, as well as Med-PaLM 2 from Google. Interestingly, if you look at the OpenAI post, they talk about the incredible performance of GPT-4 across all sorts of exam questions and domains, everything from the bar, LSAT, GRE, high school APs, etc., with huge, huge performance gains. Notably missing from all of these graphs was anything medical, anything on Step 2, and how it would do on licensing exams. I’m sure we’ll talk in a second about why, but let’s talk about how it did. We took very similar methods, testing the same way these researchers previously tested ChatGPT, which launched as a GPT-3.5 model. And what happened with GPT-4? It performed much, much better. We saw…
Nick van Terheyden
How much better Justin?
Justin Norden
Great question. Thanks, Nick. Oh, much, much better. We saw 89% correct, which roughly corresponds to a 95th percentile on the exam. Basically, over the span of a few months, OpenAI went from around just passing, bottom of the class, to near top-of-the-class performance. Granted, there are many, many caveats, and we could spend the whole podcast talking about this. It’s stochastic; it’s not going to get the same performance every time on the same questions. There are questions of data contamination, leakage. Is it, quote unquote, cheating in any way, where it maybe already was trained on these answers? Separately, Google basically had the same jump in performance, from 60% to 85%, which is what the Google researchers published in their blog post. And to me, what’s really exciting is just the speed and pace at which these systems are improving. Does this mean we’re gonna replace doctors? No. Does this mean we should maybe change how we’re thinking about these tests? I think so. But let me pause
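The evaluation Justin describes, pasting question stems into the model and computing percent correct, can be sketched in a few lines. This is a toy illustration, not the actual methodology: the "model" here is a random stand-in, and the function names (`score_exam`, `simulate_stochastic_model`) are hypothetical, but it shows why a stochastic model’s score varies from run to run on the very same questions.

```python
import random

def score_exam(model_answers, answer_key):
    """Return the fraction of questions answered correctly."""
    correct = sum(1 for got, want in zip(model_answers, answer_key) if got == want)
    return correct / len(answer_key)

def simulate_stochastic_model(answer_key, accuracy, rng):
    """Toy stand-in for a sampled LLM: answers each question
    correctly with probability `accuracy`, otherwise picks a
    wrong choice at random."""
    choices = "ABCDE"
    answers = []
    for want in answer_key:
        if rng.random() < accuracy:
            answers.append(want)
        else:
            answers.append(rng.choice([c for c in choices if c != want]))
    return answers

if __name__ == "__main__":
    rng = random.Random(0)
    key = [rng.choice("ABCDE") for _ in range(100)]
    # Repeated "runs" of the same 100-question exam give slightly
    # different scores, mirroring the run-to-run variance above.
    scores = [score_exam(simulate_stochastic_model(key, 0.89, rng), key)
              for _ in range(5)]
    print(scores)
</imports>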
Nick van Terheyden
You stole my thunder there. That’s exactly my point here. I think that represents, from my experience, and I’ve lived through this a little bit through a family member who’s been through all of this, an extraordinarily terrible measure of a physician, in my view. It allows them to answer, you know, memorization questions, and quite honestly, it’s designed to fool you. It feels like a puzzle that deliberately obfuscates the content. And, you know, it’s interesting that the whole committee gets involved in single questions. But it just demonstrates that that’s a really poor assessment, because I think you’re right. So tell us why you think this: that’s not the bells tolling for the end of medicine as we know it, and physicians practicing medicine?
Justin Norden
Absolutely not. So I think one of the things that not everyone outside of medicine realizes, even going to their own doctor’s appointments, is that so much of these exams is: okay, what is the diagnosis? What is the next best step? Much of the time, when you’re actually in a patient encounter, you already know the diagnoses of the patient; they’re given to you. And it’s very obvious, right? If someone’s coming in for diabetes, you know what you should be doing: they should be on insulin, metformin, etc. It’s how you work together to figure out what is a plan that a patient can actually do. How does this fit in with all the other contexts of their life, other diagnoses, ability to pay, social determinants, etc., that really determine how effective the entire treatment plan is going to be for a patient? So it’s funny: oh my gosh, you can get the correct diagnosis. Often you start with that in many, many clinical encounters across all of medicine. And so I think these tools just highlight incredible knowledge recall, incredible ways to surface potentially the right information. But medicine is so, so much more than that, for what clinicians are doing on the front line. So in no way do I think this is going to be replacing any of our providers. I do think this work is going to start to change how we train and test our providers. And it is going to start to change what’s happening right now with how providers look for and interact with information.
Nick van Terheyden
Yeah. So you were, I think, dynamic in your approach to this. Obviously, with your background history, you pulled all this together, ran the test on Step 2, and posted about it. I think some of the commentary that ensued was quite informative in the different perspectives that people have on that. Tell us a little bit about the competing interests or viewpoints that you saw as part of that.
Justin Norden
Absolutely. It’s been fun to see something, as it’s called, go viral, in some sense, and gather people’s attention. And I think that just speaks to the fact that people are watching. It really is a unique time in history, where things that previously were impossible, or things that almost looked like magic, are now happening with technology today. And when that happens, I think you get a lot of fun responses. So, for example, a few of them are: you know, Justin, you’re arguing that this is going to replace clinicians; well, how did it do on an abdominal exam, or how would it do on any part of a physical exam, or surgery? And, of course, it’s not doing that, and yes, that’s pointed out, but I think it touches an interesting nerve, especially for people who are a little bit more removed from the technology. You know, I sit at Stanford, at GSR Ventures, in the heart of Silicon Valley, trained as an AI computer scientist, watching these new things coming out. Most of us in medicine don’t come from that background and don’t really know what this technology is capable of. And they’re seeing different media responses of, oh, doctors are going to be replaced, my job is at risk, and different things like that. So I think it touched a little bit on that cultural nerve that started almost a decade ago, of some people saying, oh my gosh, this is gonna replace clinicians. And, you know, I really think that’s not the case. But I think there’s some uneasiness about what these tools are really going to do. How is this going to affect me and my job? So I think you saw those comments on one side. I think there are other comments around: hey, is this really the performance? How do you know it really did that?
Well, you know, if you test it multiple times, you’ll get different answers. Is this method a good way of doing this? And the simple answer is no, actually. We really need to think about better benchmarks and better ways to measure and test these systems. This was a LinkedIn post, a quick analysis. This wasn’t a rigorous, in-depth scientific endeavor pressure-testing every part of the system. Actually, part of that comes from my background: at my startup we were working on algorithm safety and trust. How do we push algorithms towards failures, find edge cases, and think about that in a quantitative way? This is by no means that. And as a field, as these tools start to get used, I hope this post and talking about it starts to have everyone think about how we really test these things, because patients are going to start to use them on themselves anyway, even though they’re not medical devices, which we can talk about later. But how do we think about testing these things? So those were just a few of the threads I think were interesting from the comments that ensued.
Nick van Terheyden
So for those of you just joining, I’m Dr. Nick, the Incrementalist. Today I’m talking to Dr. Justin Norden; he is a partner at GSR Ventures. We were just reviewing his test of GPT-4’s performance on the Step 2 final medical exam and some of the different perspectives. And you highlight a couple of things that I think are worth diving into in a little bit more detail, because I think everybody gets the excitement and, you know, the fear. But somewhere in the middle ground of this is: well, hold on a second, what is that? You used the term magic. I think we’ve heard that in the past: what seems like magic then turns into reality, and we’re seeing a little bit of that. But in that middle ground, how do we make an assessment of this technology? I mean, it’s essentially a black box. And one of the things that strikes me, and I know I’m gonna go back and do this, I’ll find a test question and pose it, but you can regenerate. And if you regenerate, does it come up with a different answer because you’ve asked it to? And if so, then, whoa. So how do you think about that from a quality and regulatory perspective? Because that’s a huge issue when you get into medicine, and probably one of the reasons we see them not testing in the medical space.
Justin Norden
You bring up several key points on what this really means, how we test this, and what this looks like. So let’s talk about the regeneration and the stochastic nature, meaning if you put the same thing into ChatGPT, you can get out different answers every time. In some cases, right, if you keep testing it on a test question, you can get a different answer. This will change performance. If someone else goes through and does exactly what I did, maybe you’ll get 85%, maybe you’ll get 90%. You will see different answers, and even different explanations that come with those answers, for why it chose that. Further, there are different ways of what’s called prompt engineering, ways that people ask for the ChatGPT output to be formulated in a different way. There are things where you can use chain of thought, saying: actually, don’t just answer this question, don’t just paste in the exact question prompt, which is what we did, but explain in detail, step by step, your reasoning, and then answer the question. And in some cases, we’ve seen on different exams, this actually improves the performance of what these models are capable of doing. So if you change how you prompt, or how you ask what the machine is going to be doing, you can actually change performance as well. All of this said, we really need to think about different ways we’re going to get into what you mentioned, the black box of GPT-4, and what it’s doing. Eventually, I think regulators are going to be having conversations with these large tech companies around: we need to think about different ways to measure performance. Let’s think about extremely clear test sets and other ways where we are sure there’s no data contamination or anything as well.
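The two prompting styles Justin contrasts, pasting the question stem as-is versus asking for step-by-step reasoning before the answer, differ only in how the prompt string is built. A minimal sketch, with function names and wording that are illustrative rather than from any published study:

```python
def direct_prompt(question: str, options: list[str]) -> str:
    """Paste the question stem as-is and ask only for the letter."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"{question}\n{opts}\nAnswer with a single letter."

def chain_of_thought_prompt(question: str, options: list[str]) -> str:
    """Ask the model to reason step by step before committing to a letter."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (f"{question}\n{opts}\n"
            "Explain your reasoning in detail, step by step, "
            "then give your final answer as a single letter.")
```

The same question, wrapped either way, can yield different scores on the same exam, which is part of why a single pass-rate number understates how sensitive these measurements are.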
And then we can start to think about safety and guardrails for what these systems should not be doing. Today, in terms of that regulatory framework, it fits into a bit of a gray area, especially if you look at how the FDA thinks about software and algorithms. If you’re doing image analysis on medical data and giving a suggestion, my read is that kind of puts you in the category of, hey, you’re a medical device. We didn’t test GPT-4 on this, as the image analysis isn’t live yet to the general public, but the model, in the other tests OpenAI talked about, does actually take in images now and give you an answer and an analysis. So I don’t think we yet have the right regulation in place to really think about how we’re going to be measuring and evaluating the safety of these models.
Nick van Terheyden
So there are so many different areas to explore as part of this, and a limited amount of time. I think it’s important to recognize the need for guardrails. I’ve heard that in a number of interviews around this. I’ve also heard the “just get out there and try it,” which I’m a big fan of. I think people need to experience it, in part because I think we all need to be ChatGPT whisperers, as you described it. The way that you craft how you get the answer out can change that answer, which I think is important to understand. But let’s talk about the value proposition and how you see that, because ultimately, what we’re trying to do in healthcare… I mean, I know it’s been described as not a great healthcare system, and at an individual level that’s my experience. But in specific cases, we deliver fantastic health care, we do amazing things. It’s just not evenly distributed, and not equally distributed. How is this going to contribute to changing that? And what have you seen in your experiences towards that end?
Justin Norden
I think you nailed it, Nick. And that’s what gets me so excited about the potential of technologies like this: can we use systems like this to really flatten out a baseline level of care that every single person can get all the time, really flattening out that equity and access to reliable information? And I think eventually, and the big question is when, that’s where systems like this are going to go: being able to deliver medical information and knowledge, next steps, potential triage, so patients can get to the right person at the right time. Are we there yet? No. And I wish I could say, yes, there are startups already doing this to a fantastic degree. The truth is, we’re not there yet. Meanwhile, our patients are interacting with these tools on a daily basis, asking their own questions, putting in their own medical history, and getting a wide variety of responses. Some of those are correct; some of those are going to be wildly inaccurate. So the truth is, we’re not there yet. However, as we start to tackle, let’s call it, some of the easier problems in healthcare, I think eventually we’ll work up to that. So what’s an easier place to start? Let’s talk about summarization of a patient history, a patient case. Can we go from a lot of information, maybe sent by fax, to something shorter and more digestible? How are we thinking about searching for relevant research or articles, or highlighting information to a clinician or a patient on what might be going on? How can we think about improving the messaging experience? How can we offload some of that burden from clinicians as patient messaging volumes go up? How can we help with some of these writing and other tasks? How can we help with research? There are many earlier steps, I think, before we get to that holy grail you were talking about.
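The summarization use case mentioned above is, at its simplest, prompt construction: wrap the long record in an instruction and hand it to whatever model the system uses. A minimal sketch; the function name, wording, and word limit are all illustrative assumptions, not any vendor’s API:

```python
def summarization_prompt(history: str, max_words: int = 100) -> str:
    """Build a prompt asking an LLM to condense a long patient
    history (e.g. faxed records) into a short, digestible summary."""
    return (
        "You are assisting a clinician. Summarize the following "
        f"patient history in at most {max_words} words, keeping "
        "diagnoses, medications, and open questions:\n\n" + history
    )

if __name__ == "__main__":
    record = "62-year-old with type 2 diabetes on metformin; recent A1c elevated."
    print(summarization_prompt(record, max_words=50))
```

The hard part, as the discussion notes, is not building the prompt but verifying that the summary is faithful before it reaches a clinician or patient.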
Nick van Terheyden
So I think those are some good examples of potential opportunity. The challenge of how you integrate that is really one of the areas where I think we see this consistently: people just don’t go outside of the existing infrastructure. So it has to be incorporated so that it then becomes part of it. But the one thing that troubles me a little bit is that it’s just an amplified version of access to information that is unfiltered and uncurated, in some respects, and is providing more fuel to a fire that has some misplaced beliefs in terms of the science of medicine. How do we address that? How do you see us addressing that over the course of the coming hype cycle, which I think we’re in?
Justin Norden
Yeah, it’s certainly a hype cycle, and figuring out how to address this… Again, machine learning models are built off of the data that they came from. And GPT-4 and these large language models and foundation models are just the next generation of that, realizing that the data they’re being trained on is not perfect, not representative. And to figure this out, we really need to measure how it’s doing, and how it’s doing across different populations. That, to me, is the first step. We can’t just go back and say, oh, we’re going to generate a bunch of fake data on different patients. I don’t think that’s the answer. But we should immediately start to think about, well, how is this potentially amplifying bias, amplifying misinformation, different things that we certainly don’t want? So that’s the first step in my mind. As we go forward and we get better at building these models, there’s more partnership, we’re able to build on smaller datasets, and we can start to curate and tweak the performance to ensure that this really is a bit more fair.
Nick van Terheyden
Who do you think should be the responsible party to drive that? Is that a government regulatory, FDA-type activity? Or is that going to stagnate the innovation, which, you know, it has a tendency to do? Who should be doing that?
Justin Norden
That’s a great question. And I’ll go with my bias: obviously, I’m on the West Coast, a long way from DC, and I hope it’s many parties coming together. Because what happens if it’s just industry off and running? Then you worry about some of those things you brought up around fairness, equality, access. Is this being built in the right way, to not just serve a profit motive but also our most vulnerable populations? But then also, from the other side, sometimes DC gets it wrong when it comes out too early on a regulatory pathway, if it doesn’t have those conversations. So I really think, and this is even one of the reasons I’m excited to talk to you and all your listeners today, I hope we can come together across the field to start talking about these things, because they are going to be important. They’re already starting to affect patient decisions, and whether or not patients might go in to see their provider. So I really hope everyone can come together to have these conversations.
Nick van Terheyden
So, unfortunately, as we do each and every week, we’ve run out of time, so I’m gonna have to say thank you to you. But before I do, I want to highlight a couple of things. I agree with you 100%. I think this is a tremendous opportunity, and I agree on the importance of everybody coming together. I take the hype as positive, because that’s more and more people involved, hopefully not directed incorrectly. You know, it reached a record 100 million users vastly faster than the closest competitor, which in that instance was TikTok, a term we probably shouldn’t even use and would probably be banned for, but it was two months. And I think what’s really interesting and exciting, and I’m going to call it out, is the fact that your course on ChatGPT at Stanford sold out in three seconds. So congratulations on that. I’m excited to see what comes out of that. And the fact that there’s all this interest at that level is really positive. Justin, thanks for joining me today.
Justin Norden
Nick, thanks so much for having me on.