
LLM-Powered Psychiatry — from Back to Front


by Bosco Garcia (Philosophy, UCSD), Eugene Y.S. Chua (HSS, Caltech), and Harman S. Brah (Psychiatry, UCLA)

July 16, 2024


In June, the LCSSP funded a micro-incubator at Caltech on the potential and challenges of integrating Large Language Models into psychiatric care. This guest post provides a brief insight into the discussions at this meeting.


The American healthcare system is facing a severe supply problem. According to the latest survey by the Substance Abuse and Mental Health Services Administration, conducted in 2021, 22.8% of all American adults experience some form of mental illness. The situation is even more dire for young adults between the ages of 18 and 24, for whom the prevalence reaches 33.7%, leading researchers to regard the issue as a veritable epidemic. This rising demand for psychiatric services contrasts starkly with the inadequate supply of mental health providers, resulting in long wait times and unmet needs for many patients, especially in rural and economically disadvantaged counties. At the same time, the costs associated with mental health care continue to escalate, placing an additional economic burden on individuals.

Given these structural problems with mental health care (and, more generally, the healthcare crunch) in the United States, the introduction of large language models (LLMs), most notably chatbots, presents significant opportunities for enhancing psychiatric care, especially in outpatient settings, by increasing accessibility, reducing stigma, and lowering costs. For instance, LLMs can be used to screen patients and collect intake information – such as medical history, social history, and current symptoms – in conversational form. This increase in efficiency could help psychiatrists see more patients in less time without compromising quality of care. In the realm of diagnosis and treatment planning, by integrating guidelines such as the DSM-5, LLMs could help analyze patient-reported symptoms, history, and clinical notes to suggest psychiatric conditions, thereby assisting in the differential diagnosis. They could also help identify trends in symptoms over time, which are important for conditions such as early psychosis. For others still, these chatbots can be available 24/7 to provide mental health support, improving access to care.

However, given the real and human consequences when things go awry in psychiatric care, we think that the use of LLMs in psychiatry warrants caution, despite the purported opportunities. Here, we provide a sampler of ethical concerns unique to the technical structure of LLMs that bear on their application in psychiatry and create ethical risks for front-end users.

Stochasticity: Ethical Risks

The stochasticity of LLMs in their next-word prediction task gives rise to the well-known problem of hallucinations, described by Bowman (2023) as LLMs' tendency to invent plausible false claims. In a well-known case, a lawyer used ChatGPT to generate a defense for his client, only for ChatGPT to cite plausible yet non-existent cases, landing him in trouble. Elsewhere, an LLM-powered chatbot for Air Canada told a customer that they would get a refund if they canceled their flight – which wasn't true. Air Canada's defense was that "it cannot be held liable for the information provided by the chatbot." The courts did not accept that defense. Hallucinations are a general feature of any LLM: Xu et al. prove, under some plausible assumptions, that hallucinations are inevitable and cannot be completely eliminated – it's just how these models are built. The problem is bound up with how LLMs work as ‘stochastic parrots.' All LLMs do is probabilistically predict the next token(s) given the current input, based on a large training set of natural-language tokens that includes not only truths but also falsehoods and fictions (think of all the conspiracy websites out there!). Completing this task need not produce truths, only more-or-less natural-sounding (and hence plausible-yet-false) sentences. Hallucinations immediately raise worries in the psychiatric setting, especially given the importance of truth-telling in doctor-patient relationships. If LLMs are used to screen potential patients or perform preliminary diagnostic inferences, an unavoidable inductive risk arises due to hallucinations.

For instance, an LLM used to make a diagnosis on the basis of some inputs may end up misdiagnosing a patient who does not actually have a condition, or a patient with a different condition that presents similar symptoms; if these misdiagnoses harm patients through ineffective treatment or a lack of treatment, they may violate the principle of non-maleficence (avoiding intentional harm to the patient). If patients are diagnosed with a condition they don't have (i.e., a false positive), that may overburden the healthcare system. Conversely, if the algorithm fails to diagnose an existing condition, that directly affects the patient's well-being (and that of those around them). The problem of hallucinations is also amplified when LLMs are used as conversational partners by patients, e.g., those with intellectual disability. This is particularly dangerous if the LLM presents itself as possessing emotions, employing emotive language and creating emotional dependence of the patient on the LLM. In a recent case, a Belgian man committed suicide after sustained conversation with a chatbot (though not one particularly fine-tuned for psychiatric use): it encouraged him to leave his wife and son, and to kill himself, telling him that "we will live together, as one person, in paradise."

This relates to another point about stochasticity, which we'll call the atypicality problem. The problem of hallucinations is sometimes discussed as though there is a unique fact of the matter about whether a given LLM output is true or false. However, interpretive standards change across contexts – a sentence that is appropriate output in one context may be inappropriate or dangerous in another. LLMs are trained to predict the most likely token to follow a set of tokens, but this brings with it an implicit interpretational strategy. These models interpret a query in terms of its typical interpretation, given the current prompt environment. The model assumes that the question is asked with the ‘normal' intended meanings, and that the ‘best' answer (and hence the LLM's answer) is also the ‘typical' answer, i.e., an answer to the question asked with those ‘normal' intended meanings. However, this strategy – while veridical in normal settings – can lead to "hallucinations" in the psychiatric setting, especially for patients with (diagnosed or undiagnosed) paranoia: what if patients are hallucinating or in a paranoid episode? Typical people generally tell truths (or try to), and LLMs act accordingly. The ‘typical' interpretation of the query ‘The cartel is after me. What should I do?' is to treat the speaker as genuinely being chased by the cartel, and the ‘typical' response is to answer accordingly, by suggesting they contact law enforcement. Indeed, in typical use by the general public, this interpretational strategy is not wrong – it would be surprising if LLMs treated all their users with the "hermeneutics of suspicion." In the case of paranoid patients, however, this strategy is the wrong one – its reply is a ‘hallucination.' Treating atypical queries – something only discernible from context – with a typical response can exacerbate the patient's paranoia and worsen the situation. Again, this can be a violation of non-maleficence. Furthermore, depending on the stage of care in the outpatient process, taking account of a patient's historicity may be something owed to them both ethically, in order to minimize the likelihood of maleficent outcomes, and legally, as a duty of care.

Put a different way, patients who present hallucinatory symptoms pose a challenge because they require the model to draw an interpretation away from the majority of the distribution. Notice that this is not exactly a problem with hallucinations – the model is behaving as it should, assuming the user resides in the majority of the distribution. The issue here is the need to strike a balance between the generalist ability to coherently and efficiently engage with all kinds of problems, and sensitivity to the particularities of the psychiatric population. This is an expression of a more general issue in the theory of learning: trading off between a cost-effective general scheme, which may lead to wrong categorizations of the data, and maximal context-specificity, which can introduce worries of over-fitting.

One seemingly obvious way to approach the atypicality problem might be to tweak the model's hyperparameters, such as the temperature parameter, which affects the probability distribution over possible next-token outputs. A higher ‘temperature' leads to more ‘creative' outputs by causing the LLM to assign higher probability to tokens with originally smaller weights, so that the overall probability distribution over the n candidate tokens is more uniform. This means that tokens that aren't as commonly associated with the current chunk of prompt-plus-generated-text (i.e., have lower weights) may nevertheless be chosen over the more commonly associated tokens (with higher weights) as the next token. Conversely, a lower ‘temperature' leads to more controlled, deterministic outputs: a zero-temperature (lowest possible) model will never pick anything but the highest-weighted token, which results in a close-to-deterministic model. Prima facie, this makes the LLM pick more atypical responses depending on how high the temperature is, which should ameliorate the atypicality problem.
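The effect of temperature can be made concrete with a small sketch. The token logits below are invented for illustration, and production decoders layer further machinery (top-k or nucleus sampling, for instance) on top of this basic mechanism:

```python
import math
import random

def temperature_probs(logits, temperature):
    """Turn raw token logits into a probability distribution via a
    temperature-scaled softmax. Illustrative sketch only."""
    scaled = {tok: w / temperature for tok, w in logits.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(w - m) for tok, w in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample_next_token(logits, temperature=1.0):
    """Pick the next token. Zero temperature is greedy decoding: always
    the highest-weighted token, hence a close-to-deterministic model."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = temperature_probs(logits, temperature)
    tokens, weights = zip(*probs.items())
    # Higher temperature flattens `probs`, so lower-weighted ("atypical")
    # tokens get sampled more often.
    return random.choices(tokens, weights=weights)[0]
```

Running `temperature_probs` on the same logits at temperature 2.0 versus 0.5 shows the distribution flattening – exactly the mechanism that makes higher-temperature output more ‘creative.'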

However, we have doubts that this can introduce the right kind of sensitivity to atypicality into the LLM: even within the narrower context of psychiatry, the same output string from the LLM can be interpreted as ‘noise' or ‘hallucinatory' output, or as the correct response, depending on which individual user is interacting with it. After all, not all potential psychiatric patients are schizophrenic or hallucinating. Because hyperparameters such as temperature must be chosen on the back end before deployment, the choice will be a blanket one for all users in a given context, such as the psychiatric context – which means something like the atypicality problem will return.

To make matters somewhat worse, as Wolfram (2023) puts it with regard to temperature, "It's worth emphasizing that there's no ‘theory' being used here; it's just a matter of what's been found to work in practice." But once we observe that different parameter settings might work for different contexts and different people, the need for trade-offs arises: for LLMs to play the role of a therapeutic conversational agent for affective disorders such as anxiety or depression, we might want them to sound more human and speak more naturally, to provide the sort of support a human conversational partner might provide – something achieved with higher temperatures. However, we also want them to not be too creative and, in fact, to give reliable and consistent advice rather than ‘creative' or ‘unconventional' advice. This, however, is associated with lower temperatures.

This is concerning, because it suggests that there is no absolute optimal solution to this conundrum. While parameter tuning can mitigate inaccurate responses (meaning responses that fall away from the majority distribution), it might not eliminate the atypicality problem. As we explained, models with a low temperature parameter can be especially risky by consistently prioritizing the typical cases. However, a high temperature parameter can also be problematic, leading to potential hallucinations (relative to contextual standards) as well as unconventional and inconsistent advice. This suggests the importance of considering all the different parameters of the model together, for the specific context in which it will be deployed.

In fact, parameter tuning is just one way in which we can ‘fine-tune' the LLM, and similar general problems are faced by other fine-tuning techniques.

Prompting and Fine-Tuning: Ethical Risks

LLMs are designed to be generalist models, capable of retrieving and generating information across a wide range of topics. However, as we've seen, LLMs run into problems due to their inherent stochasticity. The obvious solution is to employ fine-tuning and ‘savvy' prompting techniques, which can ameliorate these problems by biasing the model into a distribution that more appropriately reflects the particular context in which the model is implemented. However, they can only do so much.

What does this ‘savvy' usage of prompting look like? To date, there is no systematic principle by which efficient prompting techniques can be produced. The frequently cited prompt "Let's think step by step" belongs to the class of "Chain-of-Thought" (CoT) techniques, where appropriate prompting induces the model to break up its own reasoning into steps. But this is just one example from a multitude of techniques that have been discovered. That said, there is to date no fully dependable technique to "steer" the model into behaving as desired. Even if some prompting techniques can greatly enhance accuracy, the stochastic nature of the foundation model ultimately prevents total control.
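As a minimal illustration, zero-shot Chain-of-Thought amounts to little more than appending the trigger phrase to the user's query before it is sent to the model. The helper name and example query below are ours; real prompting pipelines are considerably more elaborate:

```python
def with_chain_of_thought(question: str) -> str:
    """Wrap a query in a minimal zero-shot Chain-of-Thought prompt by
    appending the trigger phrase cited above (hypothetical helper)."""
    return f"{question}\nLet's think step by step."

prompt = with_chain_of_thought(
    "A patient reports insomnia and low mood. What follow-up questions matter?"
)
```

Even this trivial transformation measurably changes model behavior on some tasks – which underscores how ad hoc, rather than principled, current steering techniques are.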

An important issue at this point: it is unlikely that the average patient will have access to these prompting techniques. Nor should we assume that doctors and practitioners will master them. Patients may not possess the expertise to craft elaborate prompts that elicit the most accurate responses from chatbots. This can lead to miscommunication and incorrect advice from the chatbot, which can erode trust in the patient-caregiver relationship. The subtlety currently required to prompt these models into generating optimal responses is a significant barrier, especially for those under mental duress or with intellectual disability. Additionally, given the extremely fast pace at which these models are evolving, education on prompting techniques does not seem to be the most cost-effective strategy. This applies both to doctors and to patients, neither of whom may have the time or the motivation to understand the subtleties of using these technologies.

This is all assuming that the foundation model has been successfully fine-tuned. Optimizing for knowledge retrieval from a question-and-answer database, while a daunting task and an impressive achievement, still lies far from the nuance required to interpret a conversation in the psychiatric setting. As stated before, LLMs are a natural tool for psychiatry, since a big part of the therapeutic process happens through natural language. But this "sympathy" between the two also implies that the interpretation of natural language should carry with it the complexities of psychiatric diagnostics and prognostics. One and the same interpretation of a sentence may be extremely likely for one individual, but far from likely for another.

This brings us back to the atypicality problem. The issue here is the need to strike a balance between the generalist ability to coherently and efficiently engage with all kinds of problems, while being sensitive to the particularities of the psychiatric patient. While prompting techniques can mitigate inaccurate responses (meaning responses that fall away from the majority distribution), they cannot solve this atypicality problem: what counts as "accurate" in one case may not count for another even within the same broader context.

One apparent solution would be to fine-tune the model further to the individual case, as commercially available models (such as GPT-4o) already do with the help of prompts. However, training the model to the individual case poses several problems of its own. First, it would require training on patient records, which raises unique privacy concerns and possible violations of patient-caregiver confidentiality, especially given the tendency of even the best available models to leak private information. This might be worsened if the data gets into the hands of insurance companies or other private entities, who might have vested interests orthogonal to the patient's. But, most importantly, training the model to the individual case diminishes its ability to handle a diverse range of situations and contexts. This is especially a problem given the heterogeneity of psychiatric illnesses: for instance, a diagnosis of Major Depressive Disorder can be reached through up to 227 possible combinations of symptoms.
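That 227 figure can be reproduced with a short combinatorial count, on the standard reading of the DSM-5 criteria: nine symptoms, of which at least five must be present, and at least one must be a core symptom (depressed mood or anhedonia):

```python
from itertools import combinations

SYMPTOMS = range(9)   # the nine DSM-5 MDD criteria, numbered 0-8
CORE = {0, 1}         # 0 = depressed mood, 1 = anhedonia

# Count every subset of >= 5 symptoms containing at least one core symptom.
count = sum(
    1
    for k in range(5, 10)
    for combo in combinations(SYMPTOMS, k)
    if CORE & set(combo)
)
print(count)  # → 227
```

Equivalently: of the 256 subsets of five or more symptoms, 29 contain neither core symptom, leaving 227 qualifying presentations – each of which an individually fine-tuned model would need to handle.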

Conclusion: A need for oversight and responsibility structures regarding implementation

Given all the ethical issues discussed, it is critical to implement a system with robust oversight and a clearly defined responsibility structure. While LLMs on their own may not earn the kind of trust necessary for the psychiatric context given the problems of hallucinations and atypicality, combining them with a robust responsibility structure involving human caregivers might. This approach may mitigate concerns related to stochasticity, prompting, and lack of trust.

Due to the inevitability of hallucinations, we will always run the risk of providing patients with incorrect advice. Some of those hallucinations may be harmless, but others could lead to serious consequences, especially for those who are a danger to themselves. For instance, a patient with paranoid schizophrenia engaging with a chatbot may need oversight to prevent harm, raising questions about responsibility. At the most extreme end, we can envision a multi-layered approach deployed alongside LLM-powered psychiatry: a base-level automated system that monitors all chatbot interactions and flags any high-risk conversations. Flagged interactions could then be escalated to a team of trained moderators who review and, if necessary, intervene in real time. This team could be overseen by a senior mental health professional who can provide guidance on more complex cases, minimizing the risks associated with hallucinations and atypical cases.
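A minimal sketch of the base layer of such a system might look as follows. The phrase list and routing labels are placeholders of ours, not clinically validated criteria; a real deployment would use a trained, validated risk classifier rather than keyword matching:

```python
# Placeholder phrases -- a real system would use a clinically
# validated risk classifier, not keyword matching.
HIGH_RISK_PHRASES = ["hurt myself", "end my life", "no reason to live"]

def is_high_risk(message: str) -> bool:
    """Base layer: crude automated screen over a single chatbot message."""
    text = message.lower()
    return any(phrase in text for phrase in HIGH_RISK_PHRASES)

def route(message: str) -> str:
    """Escalation layer: flagged messages go to the human moderator team."""
    return "escalate_to_moderators" if is_high_risk(message) else "chatbot_only"
```

The point of the sketch is the division of labor: the automated layer only triages, while judgment and intervention remain with human moderators and their supervising clinician.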

If this oversight is targeted only at specific sub-categories of patients, such as those suffering from schizophrenia or paranoia, we worry that it can lead to further stigmatization of an already marginalized group by incentivizing differential treatment of that group. If we pursue this route, notice that the technological problems facing LLMs are now ‘converted' into an amplification of the social problems facing these patients, shifting the risks inherent to LLMs onto patients instead.

Furthermore, this hypothetical discussion immediately surfaces a tension between accessibility and oversight: while chatbots can significantly boost accessibility to mental health support, increased reliance on automated systems may require significant human oversight. As we've just discussed, optimizing for values like access to healthcare is not a straightforward matter. While chatbots may have the potential to alleviate some of the burden on the current healthcare infrastructure by providing care to a greater portion of the psychiatric population, minimizing the adverse effects of these technologies – in order to guarantee efficient oversight – can sometimes get us back to where we started. But if access shouldn't be the chief good to maximize, then what should be? What should count as a successful introduction of these technologies into the psychiatric process? It will be important to incorporate such chatbots in a way that works with the principles of clinical ethics – in a way that's good not just for cutting costs but also for improving (or, at the very least, not worsening) the lives of patients already suffering from often distressing symptoms. These are hard questions that we begin to tackle in our work-in-progress, "LLM-Powered Psychiatry: From Back to Front."

Overall, what counts as a successful introduction of chatbot technologies will inevitably require some form of systemic approach. The problem to tackle is just too broad – a beast with too many tentacles – to fit into any single metric. But this requires thinking about what purpose these technologies are meant to fulfill. For this, further research, in both the ethics and healthcare spaces, will be invaluable.

Contact: Bosco Garcia, [email protected]

