Generative Governance via Language Model Contracts

WiP, all comments welcome

Written agreements enable human cooperation at otherwise impossible scales and freedom. While smart contracts attempt to automate their execution, they are over-specified in the sense that the human interpretation relies on cues that are not visible to the compiler, while the compiled code behavior might differ radically from the human understanding and intention of the participants. Generative governance refers to the use of language models (LLMs) to execute contracts that are represented in natural language. We call these contracts that are executed into state transitions by using an LLM to convert them into code “language model contracts”, and the new forms of governance they enable, “generative governance”. Language model contracts leverage the ability of LLMs to effectively turn natural language into code, eliminating the need for a separate code representation as part of the contract that may diverge from the semantic understanding of the agreement. They leverage the ability of LLMs to be world models to select the LLM that turns the natural language into concrete operations on the shared social representation the contract embodies.

The ability of LLMs inside trusted enclaves or other enviorments that can keep the interactions secret to the, open up many otherwise unfeasbale mechanism when there is asymetric information the LLMs can evaluate, as exemplified by . These interactions between LLMs need not be so specific as as the inspection of information, but instead can follow arbitrry (language model) contracts. Language Model Contracts inside enclaves enable access to a previously unfeasable part of the design space of mechanisms.

Problem Statement

Systems where the code is separate from the language that embodies the social agreement it implements can diverge from each other. This can happen in decentralized cryptographic protocols, just as well as in centralized institutions (e.g., VW diesel engine controllers and pollution regulation testing standards). Our proposed approach removes the code from the agreement and instead focuses on a natural language description and the inherent data that the system generates as it runs to epistemically ground it. The immediate problem that presents itself with attempting this approach by voting for an LLM is that either: 1 The voting happens first, and the LLM is fixed and common knowledge. While simple, this suffers prompt injection attacks, and we set them aside as too brittle for now. Note the inherent asymmetry in favor of the attacker in this design; they can spend all the time they want with the model crafting the most innocuous-looking attack, with the surface of attack fully exposed and pre-committed to. 2 The LLM is not ex-ante fixed along with the initial contract, but instead, each participant can propose a candidate LLM and have a procedure to select between them, aggregate them, or more generally put their inputs into a procedure to generate one. However, ex-ante efficient contracts often have ex-post unpopular repercussions. Any selection mechanism that relies on popularity is thus potentially inefficient.

Proposed Approach

We propose to leverage the fact that LLMs are world models. In particular, if we assume that a LLM that is good at predicting the publicly observable future states of the world that are observed to occur in the interactions with the contract, is also better able to evaluate proposals. They have better world models. This allows for a sort of epistemic anchoring, where when there is disagreement, there is a bias towards attempting to agree with those who will have been shown to have better understanding. One way the anchoring could work in practice is by using the LLM with the smallest observed loss in predicting the outcomes that are observed in the contract, say a month ahead. This means that the LLM that has demonstrated the best ability to accurately predict the future states relevant to the contract would be given more weight in the decision-making process. The specific steps involved in implementing this approach could be as follows: 1 Participants propose their candidate LLMs for the contract execution. 2 The proposed LLMs are used to generate predictions for the relevant future states, such as the outcomes observed a month ahead. 3 The actual outcomes are compared to the predictions made by each LLM, and the prediction loss is calculated. 4 The LLM with the smallest observed prediction loss is given more weight in the contract execution and proposal evaluation process. This epistemic anchoring mechanism helps to align the contract execution with the LLM that has demonstrated the best understanding of the relevant world states, reducing the influence of less accurate or potentially malicious proposals. Attackers are at a much greater disadvantage in this case, as they cannot access the model before sending their proposal. They would need to have a superior world model to successfully influence the contract execution, which is a much higher barrier than simply crafting an innocuous-looking attack.

Supporting the Predictive Capability-Evaluation Quality Assumption

The proposed approach to generative governance and language model contracts relies on the assumption that LLMs with better predictive capabilities are also better at evaluating proposals. This assumption can be supported by several arguments based on first principles and easily observable phenomena. Firstly, the ability to make accurate predictions requires a deep understanding of the underlying dynamics, relationships, and mechanisms within the system being modeled. An LLM that can consistently predict future states with high accuracy must have internalized the relevant patterns, rules, and dependencies governing the system’s behavior. This deep understanding is not only beneficial for prediction but also for evaluating the feasibility, coherence, and potential consequences of proposed actions or changes to the system. Secondly, we can observe that in many domains, expertise and predictive skill are closely linked to the ability to make sound judgments and evaluations. For example, experienced chess players can not only predict the likely moves of their opponents but also evaluate the strength of different positions and strategies. Similarly, skilled weather forecasters can both predict future weather patterns and assess the potential impact of different atmospheric conditions on human activities. This link between predictive ability and evaluative judgment is likely to extend to LLMs as well. Thirdly, the process of training an LLM to make accurate predictions often involves exposure to a vast amount of data and feedback on its performance. This exposure allows the model to learn from its mistakes and refine its understanding of the system over time. As the model’s predictive capabilities improve, it is likely to develop a more nuanced and reliable “intuition” for what works and what doesn’t within the context of the system it is modeling. This refined intuition can be valuable for evaluating the merits of different proposals and identifying potential risks or unintended consequences. Lastly, it is important to note that the assumption linking predictive capability and evaluation quality is not meant to be absolute or infallible. There may be cases where an LLM with strong predictive performance struggles to evaluate certain types of proposals or where its evaluations are biased or limited in some way. However, on balance, it is reasonable to expect that LLMs with better predictive capabilities will tend to be more reliable and effective at evaluating proposals compared to those with poorer predictive performance. Empirical validation of this assumption will be an important area for future research, and there may be opportunities to refine or qualify the assumption based on the results of such studies. Nonetheless, the arguments outlined above provide a strong prima facie case for the connection between predictive capability and evaluation quality in LLMs, justifying the use of this assumption as a foundation for the proposed approach to generative governance and language model contracts.

Scalability and Practical Implementation

One of the key challenges in realizing the vision of generative governance and language model contracts is ensuring that the proposed approach can be implemented at scale and integrated with existing technological and institutional infrastructures. Fortunately, recent advances in natural language processing and code generation provide a promising foundation for addressing this challenge. Systems like Marsha, which can convert English descriptions into executable Python code, demonstrate the feasibility of using natural language as a basis for specifying and implementing complex computational tasks. In the context of generative governance, such systems could be leveraged to transform natural language contract specifications into executable code that could interact with existing platforms and services. For example, consider a contract that governs the allocation of funds among a group of participants based on certain agreed-upon criteria. The contract could be specified in plain language, describing the relevant parties, the conditions under which funds should be allocated, and the expected outcomes in different scenarios. This specification could then be transformed into a Python script that interacts with a distributed ledger or database to record and update participant balances according to the terms of the contract. In cases where disputes arise, the contract could also include provisions for resolution, such as requiring a certain number of participants to assess the relevant facts and reach a consensus on the appropriate course of action. The LLMs used for contract execution could assist in this process by analyzing the plain language reports submitted by the parties and providing recommendations based on the contract terms and the available evidence. To support the development and deployment of such contracts at scale, it may be necessary to create specialized tools and platforms that streamline the process of specifying, testing, and executing language model contracts. These platforms could provide user-friendly interfaces for defining contract terms, integrating with existing data sources and services, and monitoring contract performance over time. As the ecosystem of generative governance tools and platforms matures, it will also be important to establish standards and best practices for contract design, security, and dispute resolution. This could involve collaboration with experts in fields such as contract law, cybersecurity, and software engineering to ensure that language model contracts are reliable, secure, and legally enforceable. By building on the existing work in natural language processing and code generation, and by investing in the development of specialized tools and platforms, it is possible to create a scalable and practical implementation of generative governance and language model contracts. As these technologies continue to evolve, they have the potential to transform the way we create, execute, and enforce agreements, enabling more flexible, efficient, and accessible forms of cooperation and coordination at scale.

Implementing Generative Governance with Marsha

To illustrate the practical application of generative governance and language model contracts, we present an example implementation using the Marsha programming language. Marsha is an LLM-based language that allows users to describe desired program behavior using a combination of natural language and examples, which are then compiled into executable Python code with unit tests derived formt he examples.

Consider a simple institution to decide what message to display ona comunity board next to a river int he middle of a urban area that suffers form ehat waves.

Here’s an example of a language model contract for generating community guidelines for swimming in the river based on weather reports, sewer overflow events, and reported symptoms from community members:

Copy code


  1. Public health and safety: The guidelines should prioritize the health and safety of the community members, minimizing the risk of illness or injury from swimming in the river.
  2. Environmental protection: The guidelines should consider the impact of swimming activities on the river ecosystem and aim to minimize any negative effects on water quality or aquatic life.
  3. Transparency and accessibility: The guidelines should be clear, easy to understand, and readily available to all community members.
  4. Adaptability: The guidelines should be flexible enough to accommodate changes in environmental conditions or public health concerns, based on the input provided by community members.

type WeatherReport

date, temperature, precipitation, wind_speed 2023-06-15, 28, 0, 10 2023-06-16, 30, 0, 5 2023-06-17, 32, 0, 8

type SewerOverflowEvent

date, duration, volume 2023-06-14, 2, 1000 2023-06-16, 1, 500

type SymptomReport

date, reported_by, symptom 2023-06-16, Alice, “Skin rash after swimming” 2023-06-17, Bob, “Gastrointestinal issues after swimming” 2023-06-18, Charlie, “Eye irritation after swimming”

func generate_swimming_guidelines(weather_reports: list, sewer_overflow_events: list, symptom_reports: list): str

This function takes the weather reports, sewer overflow events, and symptom reports from the community as input and generates guidelines for swimming in the river. It should:

  1. Analyze the weather_reports to determine if the temperature, precipitation, and wind conditions are suitable for swimming.
  2. Consider the sewer_overflow_events to assess the potential impact on water quality and public health risks.
  3. Evaluate the symptom_reports to identify any patterns or clusters of health issues related to swimming in the river.
  4. Generate clear and concise guidelines based on the input data and the defined principles, including any necessary warnings, restrictions, or recommendations.

Example outputs:

  • generate_swimming_guidelines([WeatherReport(“2023-06-15”, 28, 0, 10), WeatherReport(“2023-06-16”, 30, 0, 5), WeatherReport(“2023-06-17”, 32, 0, 8)], [], []) = “The weather conditions are suitable for swimming in the river. Please exercise caution and follow general water safety guidelines. If you have any pre-existing health conditions or concerns, consult with your healthcare provider before swimming.”

  • generate_swimming_guidelines([WeatherReport(“2023-06-15”, 28, 0, 10), WeatherReport(“2023-06-16”, 30, 0, 5), WeatherReport(“2023-06-17”, 32, 0, 8)], [SewerOverflowEvent(“2023-06-14”, 2, 1000), SewerOverflowEvent(“2023-06-16”, 1, 500)], []) = “Due to recent sewer overflow events, swimming in the river is not recommended at this time. The water quality may be compromised, increasing the risk of illness or infection. Please wait until further notice when water quality tests confirm that it is safe to swim.”

  • generate_swimming_guidelines([WeatherReport(“2023-06-15”, 28, 0, 10), WeatherReport(“2023-06-16”, 30, 0, 5), WeatherReport(“2023-06-17”, 32, 0, 8)], [], [SymptomReport(“2023-06-16”, “Alice”, “Skin rash after swimming”), SymptomReport(“2023-06-17”, “Bob”, “Gastrointestinal issues after swimming”), SymptomReport(“2023-06-18”, “Charlie”, “Eye irritation after swimming”)]) = “Several community members have reported health issues after swimming in the river, including skin rashes, gastrointestinal problems, and eye irritation. As a precautionary measure, swimming in the river is not advised until the cause of these symptoms can be investigated and addressed. If you have experienced any similar symptoms after swimming, please report them to the designated community health representative.” This language model contract example defines four principles that should guide the generation of the swimming guidelines: public health and safety, environmental protection, transparency and accessibility, and adaptability. The contract includes three data types: WeatherReport, SewerOverflowEvent, and SymptomReport. These data types represent the input provided by the community members regarding weather conditions, sewer overflow events, and any health symptoms experienced or observed after swimming in the river. The generate_swimming_guidelines function takes the input data and generates guidelines based on the defined principles. The function analyzes the weather reports to determine if conditions are suitable for swimming, considers the impact of any sewer overflow events on water quality and public health, and evaluates any reported symptoms to identify potential health risks associated with swimming in the river. The example outputs demonstrate how the generated guidelines might vary based on different input scenarios. In the first example, where weather conditions are suitable and there are no sewer overflow events or reported symptoms, the guidelines indicate that swimming is allowed with general precautions. In the second example, recent sewer overflow events lead to a recommendation against swimming due to potential water quality issues. The third example shows a case where reported health symptoms prompt a precautionary advisory against swimming until the cause can be investigated and addressed. By using a language model contract to generate community swimming guidelines, the decision-making process can be more transparent, adaptable, and responsive to the input provided by community members. This approach ensures that the guidelines prioritize public health and safety while also considering environmental factors and the specific concerns raised by the community.


The proposed approach of using LLMs as world models for generative governance and language model contracts has several potential implications. It could enable more efficient and effective cooperation at scale, as contracts can be executed based on the human understanding and intention, rather than relying on a separate code representation. Furthermore, the epistemic anchoring mechanism may help mitigate disagreements and reduce the impact of attackers by favoring those with better world models. However, there are also potential risks and drawbacks to consider. One concern is the reliability and robustness of the LLMs themselves. If the models are not sufficiently accurate or are subject to biases, this could lead to suboptimal or unfair contract execution. Additionally, there may be challenges in ensuring the security and integrity of the LLMs and the data they use for prediction and evaluation. To mitigate these risks, it may be necessary to develop rigorous testing and validation procedures for the LLMs used in generative governance. This could involve techniques such as adversarial testing, where the models are deliberately subjected to attempts to manipulate or deceive them, in order to identify and address vulnerabilities. It may also be important to establish guidelines and standards for the selection and use of LLMs in contract execution, to ensure consistency and fairness. Another potential issue is the resolution of disputes or disagreements that may arise in the context of generative governance. While the epistemic anchoring mechanism can help to align the contract execution with the most accurate world model, there may still be cases where participants have differing interpretations or objections to the outcome. In these situations, it may be necessary to have a fallback mechanism or appeals process to resolve the dispute, such as a human arbitrator or a predetermined set of rules for handling such cases. The legal and regulatory implications of using LLMs for contract execution also warrant careful consideration. Questions around the enforceability of language model contracts, liability in case of errors or unintended consequences, and the need for legal frameworks to adapt to these new forms of contracting will need to be addressed as the technology matures. Collaboration between researchers, developers, and legal experts will be crucial in navigating these challenges and ensuring that generative governance and language model contracts can operate within the bounds of existing legal systems while also informing necessary reforms and adaptations. Despite these challenges, the proposed approach of using LLMs for generative governance and language model contracts represents a promising direction for enabling more effective and efficient cooperation at scale. Further research and experimentation will be necessary to fully understand the potential benefits and limitations of this approach.

Extension to Foundation Models

While contracts have traditionally used written language, other forms of record could be even better. For example, video streams with location data that had signed data from the firmware, coudl be part of the reports, and the foundation model consider not only the language of the ocntract but also the evidence inthe stream.

The generalization could be taken even further, to a fully foundational model contract in which even the document that establishes the rules and procedures of the contract is a not in exclusively language form, but instead incudes data in any of the input modalities that the model and humans can observe.


The use of LLMs for generative governance and language model contracts represents a promising and innovative approach to enabling more effective and efficient cooperation at scale. While there are certainly challenges and risks to be addressed, the potential benefits of this approach make it a worthwhile area for further research and exploration. To further develop and validate this approach, it will be important to conduct empirical studies and experiments to test the effectiveness of the proposed mechanisms in practice. This could involve simulations or small-scale pilot projects to evaluate the performance of different LLMs and anchoring strategies and to identify potential issues or areas for improvement. Engaging with the broader research community and stakeholders in related fields, such as contract law and game theory, to gather feedback and insights on the proposed approach may also be valuable. If successful, generative governance and language model contracts could have significant implications for the way we organize and coordinate human activity at various scales. From small community initiatives to global governance challenges, the ability to leverage natural language and machine learning to create and enforce agreements could open up new possibilities for collaboration and collective action. However, realizing this potential will require ongoing research, development, and collaboration across multiple disciplines. We encourage researchers, developers, policymakers, and other stakeholders to engage with this emerging field and contribute to the development and refinement of generative governance and language model contracts. By working together to address the challenges and opportunities presented by this approach, we can move closer to a future where more effective, efficient, and equitable forms of cooperation and coordination are possible.

Grouplang is hiring!

If you made it this far and want to work on this and related ideas, get in touch!

Started on April 14, 2024