Brian Christian on the Challenges of Integrating Human Values into AI Systems

Brian Christian is the author of “The Most Human Human,” which was named a Wall Street Journal bestseller, a New York Times Editors’ Choice, and a New Yorker favorite book of the year. He is the author, with Tom Griffiths, of “Algorithms to Live By,” a #1 Audible bestseller, Amazon best science book of the year, and MIT Technology Review best book of the year.

His third book, “The Alignment Problem,” has just been published. Brian is a visiting scholar at the CITRIS Policy Lab and an affiliate of the Center for Human-Compatible Artificial Intelligence, and he was recently the scientific communicator in residence at the Simons Institute at UC Berkeley.

Brian Christian recently presented insights from his new book, “The Alignment Problem,” at the CITRIS Research Exchange.

In your book “The Alignment Problem,” you explore the complexity of integrating societal norms and values into AI systems. What do you think are the most effective strategies to ensure AI systems perform in ways that maximize societal benefit?

There are many ways to approach this problem, and in some ways this is the question that the whole book seeks to answer. It’s going to require, I think, not just a single approach but a lot of different things, some of which are breakthroughs in the actual science, some of which will take the form of good engineering practices, and some of which will be more about governance and who gets a seat at the table.

But to offer you a couple of concrete ideas, I think a good starting point for many systems that affect the public is transparency. To the extent possible, datasets used to train models should be made public, and in particular, I think that people generally have a right to understand the particular data that a model has about them. There have been cases where criminal defendants suspected that they were being wrongfully detained because of a simple typo or data-entry issue, but it required a great deal of legal wrangling to find out what the input data looked like, let alone how it might have affected the judicial outcome.

Deep neural networks have a reputation for being “black boxes,” but this reputation is going away. People like OpenAI’s Chris Olah and Google Brain’s Been Kim have done marvelous work “popping the hood,” so to speak, and visualizing the inner workings of a model. People like Microsoft Research’s Rich Caruana are developing model architectures (such as generalized additive models) that are competitive in performance with deep neural networks but much more easily inspected. And researchers like Duke University’s Cynthia Rudin are showing that it’s possible to use algorithmic techniques to identify provably optimal simple models that contain just a handful of parameters and can be easily computed by hand on a piece of paper, yet are competitive with some of the most complex neural networks.
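To give a sense of what a model that can be “computed by hand on a piece of paper” might look like, here is a minimal sketch of a point-based scoring sheet in Python. Everything in it is an assumption made for illustration: the feature names, the point values, and the threshold are invented, and the sketch does not reproduce any published model or the optimization methods used to find such models.

```python
# Hypothetical scoring sheet. The feature names, point values, and threshold
# are invented for illustration and do not come from any published model.
POINTS = {
    "risk_factor_a": 2,
    "risk_factor_b": 1,
    "risk_factor_c": 1,
    "protective_factor_d": -1,
}
THRESHOLD = 2  # flag the case when the total reaches this value


def explain(record):
    """Return the points contributed by each feature present in the record."""
    return {name: pts for name, pts in POINTS.items() if record.get(name, False)}


def predict(record):
    """Add up the points and compare the total against the threshold.

    Returns (flagged, total, contributions) so the decision can be audited.
    """
    contributions = explain(record)
    total = sum(contributions.values())
    return total >= THRESHOLD, total, contributions


# Example usage with a made-up record.
flagged, total, contributions = predict(
    {"risk_factor_a": True, "risk_factor_b": True}
)
print(flagged, total, contributions)
```

Because the whole model is four integers and one threshold, anyone affected by a decision can see exactly which inputs produced it, which is the kind of inspectability described above.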

The computer science progress on one hand, and the development of a more robust legal framework on the other, while not a full solution by any measure, will nonetheless go a long way to making sure we understand the models that affect our lives.

To what degree is the alignment problem (the difficulty of ensuring that AI systems actually behave as we intend, and in ways that are congruent with our norms and values) a result of developers’ lack of social science training? How should this be addressed in teaching and practice?

I think that to some degree it’s true that computer-science and machine-learning curricula have tended to frame problems in such a way that both the training data and the objective are taken as givens, and the hard part is simply finding a model that maximizes that objective. In reality, data provenance is a hugely consequential issue, as is finding an objective function that captures what it is that we really want our model to do.

For instance, in the criminal-justice setting, a model might be developed to predict “risk scores” for a defendant in the context of a pretrial detention decision; one score might estimate their risk of committing a nonviolent offense while awaiting trial, and another might estimate their risk of failing to appear at court for their trial. It’s significant that one of these variables is perfectly observed — if you fail to appear before the court, by definition the court knows about it — while the other is very imperfectly observed: the vast majority of nonviolent crimes never become known to the police at all.
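The gap between a perfectly observed label and an imperfectly observed one can be made concrete with some simple arithmetic. The sketch below is illustrative only: the true rate and the detection probability are hypothetical numbers chosen for the example, not estimates from any real dataset.

```python
def observed_rate(true_rate, detection_prob):
    """Share of people whose event actually appears in the recorded data."""
    return true_rate * detection_prob


def hidden_share(detection_prob):
    """Share of true events that end up labeled as 'nothing happened'."""
    return 1.0 - detection_prob


# Hypothetical numbers, chosen only to illustrate the effect.
TRUE_RATE = 0.20          # assume 20% of people experience each event
FTA_DETECTION = 1.0       # a failure to appear is always known to the court
OFFENSE_DETECTION = 0.25  # assume only 1 in 4 offenses is ever recorded

fta_label_rate = observed_rate(TRUE_RATE, FTA_DETECTION)
offense_label_rate = observed_rate(TRUE_RATE, OFFENSE_DETECTION)

print("failure-to-appear label rate:", fta_label_rate)
print("offense label rate:", offense_label_rate)
print("offenses missing from the labels:", hidden_share(OFFENSE_DETECTION))
```

Under these assumed numbers the two events are equally common, yet the training labels for one of them record only a quarter of the cases. A model fit to those labels learns to predict what was recorded, not what happened.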

Increasingly, computer science departments are building ethics and real-world-impact curricula into their undergraduate majors, and we are also seeing AI textbooks like “Artificial Intelligence: A Modern Approach” begin to shift their focus from “How can we optimize for objective x?” to “How can we determine what we ought to be optimizing for in this situation?”

However, just as computer scientists, software engineers, and ML practitioners are increasingly thinking about the broader social context of their work, so too are policymakers, lawyers, and social scientists finding themselves needing to sharpen their technical fluency, as ML systems are increasingly becoming part of their work. “The Alignment Problem” is something that I hope can help on both counts, by offering something to each of those groups: both the technical folks looking out beyond the narrower framing of their field, and those outside the field looking in. I think we build this bridge from both sides.

Most biases in AI models surface when the models are deployed in real life. How do we identify and address these biases during the model development stage?

First, better understanding where the training data comes from is a starting point. Something along the lines of what’s suggested in “Datasheets for Datasets,” for instance, would be helpful.

Second, transparency methods can give us a sense of whether the model is generalizing as we expect. Visualization techniques like “inceptionism” can reveal visually what superstimuli for various category labels might look like. For instance, here is what a Google model from 2015 generated for the category “dumbbell”:

[Image: visualizations generated by the network for the category “dumbbell.” Source: Google]

As the researchers note: “There are dumbbells in there alright, but it seems no picture of a dumbbell is complete without a muscular weightlifter there to lift them. In this case, the network failed to completely distill the essence of a dumbbell. Maybe it’s never been shown a dumbbell without an arm holding it. Visualization can help us correct these kinds of training mishaps.”

There are clear uses for techniques like this in identifying bias of all kinds; for instance, if the network were asked to generate novel images of faces, and all of the faces were of a single gender or skin tone, then that would suggest that a similar bias existed in the model’s training data.

Transparency techniques like TCAV can suggest how high-level concepts inform a network’s categorizations. A group from Google in 2018 looked at several widely used models of the time and showed, for instance, that the color red was extremely significant in a model’s ability to identify something as a fire truck. This would suggest that such a model may be unsafe to deploy in a country where fire trucks are not reliably red: for instance, Australia, where they are often white and neon yellow.

A third key component is developing models that “know when they don’t know.” This is sometimes referred to as the problem of “robustness to distributional shift.” The basic idea in a bias context is to make a kind of last-line-of-defense failsafe, such that even if there is a problem in the training data that transparency methods fail to identify, the model itself would be able to identify when it’s operating in a situation that doesn’t match what it’s seen before, and would either refuse to take an action or defer to human experts.

There is a great deal of work in this area, including work by people like Oregon State’s Tom Dietterich and others on the “open-category problem”: how models trained to categorize images into one of n categories can account for the fact that the vast majority of possible inputs (combinations of pixels, for instance) will belong to none of those categories. There is also work by people like Oxford’s Yarin Gal and Cambridge’s Zoubin Ghahramani on using techniques like dropout to get an estimate of model uncertainty. This has been taken up in medical diagnostics and in robotics to generate models that can gauge when they are operating outside of their training distribution and defer to humans accordingly.

Techniques like this might have helped to prevent the Google Photos “gorillas” incident, where a model’s uncertainty about what the picture contained might have led it to avoid applying a caption altogether rather than taking a guess. They are also helpful in autonomous vehicles, and may have helped prevent deaths like that of Elaine Herzberg, who was killed by a self-driving Uber in 2018 while walking a bicycle across the street, after the car failed to determine whether she was a “pedestrian” or a “cyclist.”
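The dropout-based approach to uncertainty mentioned above can be sketched in a few lines of Python. This is a toy under stated assumptions, not anyone’s published implementation: the tiny network, its weights, the keep probability, and the deferral threshold are all invented, and the network is not trained, so the numbers only illustrate the mechanics. The idea is to leave dropout switched on at prediction time, run many randomized forward passes, and treat disagreement between the passes as a signal to defer to a human.

```python
import math
import random
import statistics

# Hypothetical, untrained weights for a tiny network: 2 inputs, 3 hidden
# units, 1 output. In practice these would come from training with dropout.
W1 = [[0.8, -0.5], [0.3, 0.9], [-0.6, 0.4]]
W2 = [1.0, -1.2, 0.7]


def forward_with_dropout(x, rng, keep_prob=0.5):
    """One stochastic forward pass with dropout left ON at prediction time."""
    hidden = []
    for row in W1:
        activation = max(0.0, sum(w * xi for w, xi in zip(row, x)))  # ReLU
        keep = rng.random() < keep_prob
        # Inverted dropout: rescale kept units so the expected value is unchanged.
        hidden.append(activation / keep_prob if keep else 0.0)
    logit = sum(w * h for w, h in zip(W2, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid output in (0, 1)


def predict_or_defer(x, passes=200, max_std=0.15, seed=0):
    """Average many dropout passes; defer when the passes disagree too much.

    max_std is an assumed threshold; a real system would tune it on held-out data.
    """
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, rng) for _ in range(passes)]
    mean = statistics.mean(samples)
    std = statistics.pstdev(samples)
    decision = "defer_to_human" if std > max_std else "predict"
    return decision, mean, std


# Example usage: a small input versus one far outside that scale.
print(predict_or_defer([0.1, 0.1]))
print(predict_or_defer([10.0, -10.0]))
```

In this toy, the second input makes the output swing widely depending on which hidden units are dropped, so the function returns a deferral instead of a guess. The design choice worth noting is that the model reports its spread alongside its answer, so the decision to abstain can be made outside the model.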


The CITRIS Policy Lab, headquartered at CITRIS and the Banatao Institute at UC Berkeley, supports interdisciplinary research, education, and thought leadership to address core questions regarding the role of formal and informal regulation in promoting innovation and amplifying its positive effects on society.