AI Safety: A Partial Critique of Russell’s “Provably Beneficial AI”

With all of the recent advancements in AI, especially the November release of ChatGPT and last week’s upgrade of the underlying GPT engine to version 4, there is a lot of ongoing discussion about AI Safety.

AI Safety is a term that has come to mean a lot of things, and as such, it requires some further elaboration so we know what we are actually talking about. According to Wikipedia (Mar 21, 2023): “AI safety is an interdisciplinary field concerned with preventing accidents, misuse, or other harmful consequences that could result from artificial intelligence (AI) systems. It encompasses machine ethics and AI alignment, which aim to make AI systems moral and beneficial, and AI safety encompasses technical problems including monitoring systems for risks and making them highly reliable.”

For this post, I’ll be having a conversation with ChatGPT 4 to provide an introduction to one concept that lives within AI Alignment. The concept is called “Provably Beneficial AI” and has been proposed, refined, and advocated primarily by Stuart Russell of the University of California at Berkeley, where he is the Director of the Center for Human-Compatible AI, a professor of Computer Science, and the Smith-Zadeh Chair in Engineering. He literally wrote the book on AI.

I did the interview twice, once with the GPT 3.5 engine and once with the GPT 4 engine. GPT 4 generally provided more detail in its responses, with a strong tendency to use numbered lists. GPT 3.5 was more conversational but provided less detail. I used the same prompts for both; since they were written for the GPT 3.5 session, which I did first, they might feel a little clunky with GPT 4, but I wanted to be able to compare the two and get a feel for the differences. In the interview below, I am S: and ChatGPT-4 is GPT.

Enjoy!


S: Briefly explain Russell’s concept of “provably beneficial AI”

GPT: Russell’s concept of “provably beneficial AI” refers to the development of artificial intelligence systems that can be mathematically proven to act in ways that are beneficial to humans. This concept aims to ensure that AI systems consistently work towards the best interests of humanity, avoiding unintended negative consequences.

In order to achieve this, the AI must be designed with certain principles in mind, such as uncertainty about its objectives and deference to human preferences. By constructing AI systems that acknowledge their own uncertainty and learn from human input, we can work towards creating AI that remains aligned with human values and provides benefits without causing harm.
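
The two design principles GPT mentions here, uncertainty about the objective and deference to human preferences, are easier to see in a toy example. Below is a minimal sketch of my own (not Russell’s formalism; the reward numbers are invented): an agent that holds a belief over candidate reward functions and asks the human instead of acting whenever the expected regret of committing is too high.

```python
import numpy as np

# Toy sketch (mine, not Russell's formalism; all numbers are hypothetical).
# The agent is uncertain which of two candidate reward functions describes
# what the human actually wants. It acts on its own only when the expected
# regret of committing without asking is smaller than the cost of asking.

candidate_rewards = np.array([
    [ 1.0, 0.2, -1.0],   # hypothesis A: action 0 is best
    [-1.0, 0.3,  1.0],   # hypothesis B: action 2 is best
])
belief = np.array([0.5, 0.5])        # agent's probability over the hypotheses

expected_reward = belief @ candidate_rewards      # expected value of each action
best_action = int(np.argmax(expected_reward))

# Regret of committing to best_action, evaluated under each hypothesis.
regret = candidate_rewards.max(axis=1) - candidate_rewards[:, best_action]
expected_regret = float(belief @ regret)

ASK_COST = 0.1                       # assumed cost of interrupting the human
if expected_regret > ASK_COST:
    print("Defer: ask the human which objective is the right one")
else:
    print(f"Act autonomously: take action {best_action}")
```

With an even belief, the regret of committing is high, so the agent asks; once the human’s answer collapses the uncertainty, the same rule lets it act on its own. That, in miniature, is the deference to human preferences GPT describes.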


S: What is the strongest critique of Russell’s “provably beneficial AI?”

GPT: One of the strongest critiques of Russell’s “provably beneficial AI” concept is the challenge of defining and quantifying human values and preferences. Human values are complex, diverse, and often contradictory, making it difficult to create a universally accepted model that can guide AI systems in a provably beneficial way. Some specific concerns include:

  1. Value alignment: Aligning AI systems with human values is challenging because values can vary significantly between individuals, cultures, and over time. It is unclear how an AI system can be designed to account for these variations and adapt to changing preferences without risking unintended consequences.
  2. Incompleteness and ambiguity: Human preferences are often incomplete or ambiguous, which makes it hard for AI systems to make decisions that are guaranteed to be beneficial. A provably beneficial AI needs a way to handle such situations while remaining aligned with human values.
  3. Preference manipulation: A learning AI system designed to be deferential to human preferences may inadvertently encourage humans to express preferences that are not in their long-term best interests, leading to potential manipulation of the AI system for short-term gains.
  4. Moral dilemmas and ethical trade-offs: Real-world situations often involve moral dilemmas and ethical trade-offs that can’t be easily resolved. Designing a provably beneficial AI that can navigate these complex scenarios in a way that respects diverse perspectives and ethical considerations is a significant challenge.
  5. Technical difficulties: The development of provably beneficial AI requires significant advancements in AI safety research, including robust methods for ensuring AI systems learn and follow human values, even as they become more capable. This is a difficult technical problem that has not yet been fully solved.

S: Didn’t Russell specify a big data approach to defining human values and objectives by essentially looking at what decisions humans made historically and inferring the underlying values and objectives?

GPT: Yes, Stuart Russell did suggest an approach to defining human values and objectives by examining the decisions humans have made historically, with the idea that AI systems can learn these values and preferences through a big data analysis. This approach is part of the broader concept of “inverse reinforcement learning” (IRL), where AI systems learn to infer the underlying objectives or values that drive human decision-making by observing human behavior.

However, despite the promise of this big data approach, it is not without challenges and limitations. Some concerns include:

  1. Biases and inaccuracies: Historical data can be biased or incomplete, reflecting cultural, social, or political norms of a particular time period. These biases may not accurately represent the values and objectives of all humans, potentially leading to AI systems that reinforce existing inequalities or injustices.
  2. Changing values: Human values and preferences evolve over time, and what might have been considered acceptable in the past might not be considered acceptable today. AI systems need to be adaptable and able to recognize these shifts to remain beneficial.
  3. Overgeneralization: Inferring human values and objectives from historical data can lead to overgeneralization, where AI systems may not be able to distinguish between genuinely shared values and preferences that vary between individuals, cultures, or contexts.
  4. Incompleteness and ambiguity: As mentioned earlier, human preferences can be incomplete or ambiguous, making it difficult to infer clear, universally applicable objectives from historical data.
  5. Ethical dilemmas and trade-offs: Learning human values from historical data does not necessarily provide AI systems with a clear framework for resolving complex ethical dilemmas or making trade-offs between competing values.

Overall, while Russell’s big data approach to defining human values and objectives offers a potential path forward, it is not without significant challenges that need to be addressed in order to develop truly provably beneficial AI systems.
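
A quick aside on what “inferring values from observed decisions” looks like mechanically. The sketch below is my own illustration of the core inference step, not Russell’s implementation: I generate synthetic decisions from a noisy chooser with hidden preference weights, then recover those weights by maximizing the likelihood of the observed choices.

```python
import numpy as np

# Sketch of the inference step only (my illustration, not Russell's code).
# Each historical "decision" is a choice among three options described by
# two features. A noisy (softmax / Boltzmann-rational) chooser with hidden
# weights generates the data; we then recover the weights by maximizing
# the likelihood of the observed choices. Real IRL works over sequential
# policies, but the core idea -- choices as evidence of a hidden reward
# function -- is the same.

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])                # hidden "values" to recover

decisions = []                                # list of (options, chosen_index)
for _ in range(200):
    options = rng.normal(size=(3, 2))         # three options, two features each
    utils = options @ true_w
    probs = np.exp(utils - utils.max())
    probs /= probs.sum()
    decisions.append((options, rng.choice(3, p=probs)))

w = np.zeros(2)
for _ in range(500):                          # gradient ascent on log-likelihood
    grad = np.zeros(2)
    for options, chosen in decisions:
        utils = options @ w
        probs = np.exp(utils - utils.max())
        probs /= probs.sum()
        # d/dw log P(chosen) = chosen option's features minus expected features
        grad += options[chosen] - probs @ options
    w += 0.1 * grad / len(decisions)

print("recovered weights:", np.round(w, 2), "  true weights:", true_w)
```

The objection I raise next is about exactly the assumption baked into this model: whatever was chosen gets treated as evidence of preference, even when every option on the table was a bad one.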


S: Isn’t there a more fundamental problem with this big data approach? Specifically, there are many cases where humans were forced to choose a course of action from a set of possibilities, all of which were bad options not aligned with our values. For example, it is hard to argue that Truman’s decision to use nuclear weapons against Japan was his “preference,” but neither was a mass invasion of the Japanese home islands. In fact, all really hard ethical decisions are between bad alternatives, so using a big data approach to categorize the decision that was made as a “preference” seems very problematic if we want AI to avoid making the same types of mistakes as humans.

GPT: You raise a valid concern about the limitations of the big data approach to defining human values and objectives. Indeed, there are situations where humans face difficult ethical decisions and must choose between suboptimal options, none of which align perfectly with their values. In these cases, inferring preferences from historical data may not accurately represent human values and could lead to AI systems making similar mistakes or perpetuating undesirable outcomes.

This fundamental problem highlights the importance of considering additional factors when designing provably beneficial AI:

  1. Context sensitivity: AI systems should be able to understand the broader context of human decision-making and recognize that certain decisions may have been made under duress or constrained by limited options, rather than as a true reflection of human preferences.
  2. Ethical reasoning: AI systems should be equipped with robust ethical reasoning capabilities that allow them to navigate complex moral dilemmas, even when historical data may not provide clear guidance on the best course of action.
  3. Human guidance and collaboration: Developing AI systems that work closely with humans and actively seek their input in situations with ambiguous or conflicting preferences can help ensure that the AI’s actions remain aligned with human values.
  4. Ongoing learning and adaptation: AI systems should be able to learn from new information, societal shifts, and evolving human values, allowing them to update their understanding of human preferences and make better decisions over time.

In summary, while the big data approach to inferring human values and objectives has its limitations, it can still be a useful starting point for understanding human preferences. However, to develop truly provably beneficial AI, it is crucial to combine this approach with other techniques that address the inherent complexity and nuance of human decision-making and values.


S: Human guidance and collaboration seems to imply that moral values are relative to who we ask. Is Russell’s approach inherently one of moral relativism?

GPT: Stuart Russell’s approach to AI value alignment does not explicitly endorse moral relativism, but it does recognize the diversity and complexity of human values, preferences, and cultural norms. By emphasizing the importance of human guidance and collaboration, Russell’s approach aims to create AI systems that respect and adapt to this diversity, rather than adhering to a single, predefined set of moral absolutes.

In practice, this means that AI systems designed using Russell’s approach may exhibit a degree of moral relativism, as they attempt to align with the values of the humans they interact with or the societies they operate within. However, the goal is not to embrace moral relativism for its own sake, but rather to ensure that AI systems are beneficial and aligned with human values, which can be diverse and even contradictory.

It’s important to note that balancing between moral relativism and moral absolutism is a challenge when designing AI systems. To create provably beneficial AI, researchers and developers must grapple with the complexities of human values and ethics, and strike a balance that respects diverse perspectives while still upholding some level of universally accepted moral principles.

S: How might we decide what are universally accepted moral principles? Who should help decide that? It seems like the greatest philosophical minds of all time have been working on this problem since before Plato, and we haven’t reached any agreement yet.

GPT: You are correct that determining universally accepted moral principles is a complex and longstanding challenge, one that philosophers have been grappling with for centuries. While there may never be a complete consensus on universally accepted moral principles, there are some approaches to consider when determining which principles to incorporate into AI systems:

  1. Cross-cultural consensus: Analyzing moral principles and values that are widely shared across different cultures and societies can help identify common ethical themes that may serve as a basis for universally accepted principles.
  2. Interdisciplinary collaboration: Involving experts from various fields, including philosophy, ethics, sociology, psychology, and computer science, can help develop a more comprehensive understanding of moral principles and their applicability to AI systems.
  3. Public involvement and deliberation: Engaging a diverse range of stakeholders, including the general public, in discussions and deliberations about moral principles and AI ethics can help ensure that multiple perspectives are considered and that the resulting principles are more widely accepted.
  4. Iterative refinement: Recognizing that universally accepted moral principles may evolve over time, it is important to establish ongoing processes for revising and updating these principles as societal values and norms change.
  5. International frameworks and guidelines: Building on existing international ethical frameworks, guidelines, and conventions related to human rights, environmental sustainability, and other global concerns can provide a foundation for developing universally accepted moral principles for AI systems.

Ultimately, determining universally accepted moral principles for AI is a collaborative and ongoing process that requires input from a diverse range of stakeholders, from philosophers and ethicists to policymakers and the general public. While achieving perfect consensus may be impossible, striving for a shared understanding of core moral principles can help guide the development of AI systems that are more broadly beneficial and aligned with human values.


Overall, not a bad summary of the goals and some of the potential pitfalls of Russell’s approach.