Microsoft Bing Chatbot passes a medical graduation exam and helps to find questions with flaws.

Stefan Morreel, Veronique Verhoeven

Keywords: exam, artificial intelligence, ChatGPT, Bing


Background:

Chatbots using large language models have attracted much public and scientific attention. ChatGPT can pass exams in various fields with 40 to 70% correct answers. ChatGPT passed an undergraduate primary care exam but was outperformed by 98% of the students. ChatGPT often answers with hallucinations (confident responses that are not justified by the current state of the art). More recent chatbots have not been extensively evaluated.

Research question(s):

Can the new Microsoft Bing Chatbot pass the multiple-choice medical license exam at the University of Antwerp? What is the proportion of hallucinations? Can incorrect AI answers be used to detect questions with flaws (the question is unclear or the answer is disputable)?


Method:

The exam was translated using DeepL, followed by human adaptation. Questions containing images or tables and questions concerning frameworks or models that are only used locally were excluded. The remaining 95 multiple-choice questions were copied to Bing, each one into a new chat in the precise mode. For each wrong answer, the authors screened the answer for hallucinations and the question for flaws.


Results:

Bing passed the exam with a score of 72/95 or 76% (cum laude). A wrong answer was given for 13 questions, no answer for four questions, an unclear answer for five questions and in one case, two answers were given. Among the 22 incorrect answers, two hallucinations were found. Three questions were unclear, and two answers were disputable.
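The arithmetic behind these figures can be checked with a short sketch. It uses only the counts reported above; the variable names are illustrative and not part of the study. Note that the four listed categories sum to 23, one more than the 22 incorrect answers mentioned in the text; the abstract does not say which item is counted differently.

```python
# Counts as reported in the results paragraph; names are illustrative only.
total_questions = 95
correct = 72
not_correct = {
    "wrong answer": 13,
    "no answer": 4,
    "unclear answer": 5,
    "two answers given": 1,
}

# Percentage score, rounded to one decimal for display.
score_pct = 100 * correct / total_questions
print(f"Score: {correct}/{total_questions} = {score_pct:.1f}%")

# Questions not scored as correct, computed two ways.
print(f"Total minus correct: {total_questions - correct}")
print(f"Sum of listed categories: {sum(not_correct.values())}")
```

The score works out to about 75.8%, which the abstract rounds to 76%.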


Conclusions:

The new Microsoft Bing chatbot passed the University of Antwerp medical graduation exam. Medical teachers can use AI bots to detect questions that need careful review. More research is necessary in the field of general practice teaching.
Note: Because AI is evolving at an exceptional pace, recent results from multiple bots will be presented at EURACT.

Points for discussion:

Should we use AI to detect questions that need a review?

Can we use AI to make better exams?

How should we use large language models in medical teaching?

