Enhancing reliability and efficiency in JSON output generation with LLMs
Large Language Models (LLMs) have become essential tools for automating a wide range of processes, thanks to their versatility and the quality of the results they can produce. The outputs generated by LLMs often serve as input for subsequent tasks. However, the reliability of these outputs remains a substantial challenge. In structured text extraction tasks, models frequently produce inaccuracies such as hallucinated data, malformed data structures, and omitted entries. Some simpler tasks can be handled adequately by smaller LLMs, but the reliability of such solutions is generally lower. This issue has been addressed before, for example by the outlines library, which offers a flexible interface for generating structured outputs. This article explores two methods inspired by n-gram speculative decoding [1] [3]. Both aim to mitigate the reliability issues mentioned above while also making the output generation process more efficient.
First approach: full input - partial output
Let us first examine a basic prompt engineering setup as an illustrative example. Suppose we have a collection of sentences, and our objective is to identify the city and time mentioned in each one. We can construct the following prompt:
## User
Extract the 'id', 'city', and 'time' from the following text and return them in a JSON format.
The text consists of an 'id' and a message separated by a colon. Do not generate anything other than JSON.
90: At midnight in Paris, the harbor lights reflect beautifully on the calm waters.
215: The bustling streets of Amsterdam are particularly vibrant at 5:30 PM on weekdays when office hours end.
## Assistant
<LLM>
In the prompt, the indicator <LLM> signifies the position where the LLM will insert its response. The model will continue generating the response at this location until it encounters a special token or reaches the maximum number of output tokens. It is important to note that this special indicator is not part of the prompt or the final output; it merely marks the point at which the LLM will be activated.
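To make this concrete, the snippet below shows roughly how such a prompt can be run with the Hugging Face transformers library. This is only a sketch, not the code from the gists linked at the end of this article, and the exact model identifier and generation parameters are assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

user_prompt = (
    "Extract the 'id', 'city', and 'time' from the following text and return them in a JSON format.\n"
    "The text consists of an 'id' and a message separated by a colon. Do not generate anything other than JSON.\n"
    "90: At midnight in Paris, the harbor lights reflect beautifully on the calm waters.\n"
    "215: The bustling streets of Amsterdam are particularly vibrant at 5:30 PM on weekdays when office hours end."
)

# add_generation_prompt=True appends the assistant header, which corresponds to
# the <LLM> position above: generation starts there and stops at an
# end-of-turn token or after max_new_tokens.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True))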
Similar to n-gram speculative decoding, our goal is to pre-fill as much of the model’s output as possible. This approach not only accelerates the inference process but also allows us to employ greater control over the model’s output. We can enhance our previous method by designating specific locations where the LLM will be activated. For instance, consider the following template:
## User
Extract the 'id', 'city', and 'time' from the following text and return them in a JSON format.
The text consists of an 'id' and a message separated by a colon. Do not generate anything other than JSON.
90: At midnight in Paris, the harbor lights reflect beautifully on the calm waters.
215: The bustling streets of Amsterdam are particularly vibrant at 5:30 PM on weekdays when office hours end.
## Assistant
[
  {
    "id": 90,
    "city": "<LLM>",
    "time": "<LLM>"
  },
  {
    "id": 215,
    "city": "<LLM>",
    "time": "<LLM>"
  }
]
We can now see that our goal is for the LLM to generate only specific parts of the output, while most of the other fields are pre-filled. It is crucial to note that at the point when the LLM is prompted to generate output, it considers only the preceding tokens. The model will continue generating output until it encounters a closing double-quote (") character, indicating that the string has been fully generated. At this stage, we can append the subsequent template tokens until another <LLM> marker is encountered.
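As an illustration, the fill-in loop described above could be sketched as follows. The snippet reuses model, tokenizer, and user_prompt from the earlier example, assumes a transformers version that supports the stop_strings argument of generate, and is not the code from the linked gists, which structure this logic differently.

# JSON skeleton with <LLM> markers at the positions the model should fill in
# (a compact version of the template shown above).
assistant_template = """[
  {"id": 90, "city": "<LLM>", "time": "<LLM>"},
  {"id": 215, "city": "<LLM>", "time": "<LLM>"}
]"""

# Render the chat prompt up to the start of the assistant turn.
prompt_prefix = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_prompt}],
    add_generation_prompt=True,
    tokenize=False,
)

def generate_value(text, max_new_tokens=32):
    """Continue `text` until the model emits the closing quote of the string."""
    input_ids = tokenizer(text, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = input_ids.to(model.device)
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        stop_strings='"',     # stop as soon as the string value is closed
        tokenizer=tokenizer,  # required when stop_strings is used
        do_sample=False,
    )
    completion = tokenizer.decode(output_ids[0, input_ids.shape[1]:])
    return completion.split('"')[0]  # keep only the text before the closing quote

# Copy literal template chunks verbatim; the model only fills the <LLM> gaps.
chunks = assistant_template.split("<LLM>")
output = prompt_prefix + chunks[0]
for chunk in chunks[1:]:
    output += generate_value(output) + chunk

completed_json = output[len(prompt_prefix):]
print(completed_json)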
The basic implementation of this code can be found in the links at the end of this article. For instance, executing the code with the Llama-3-8B-Instruct model will yield the following output:
[
  {
    "id": 90,
    "city": "Paris",
    "time": "midnight"
  },
  {
    "id": 215,
    "city": "Amsterdam",
    "time": "5:30 PM"
  }
]
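Because every structural character of the JSON comes from the pre-filled template, the assembled output should normally parse without any post-processing. A small sketch, reusing the completed_json variable from the loop above:

import json

records = json.loads(completed_json)
for record in records:
    print(record["id"], record["city"], record["time"])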
Second approach: no input - full output
It has been observed that LLMs can struggle with long contexts, so the previous approach may not always yield optimal results. For instance, the paper “Lost in the Middle: How Language Models Use Long Contexts” [2] shows that models tend to produce higher-quality outputs when the relevant information appears at the beginning or the end of the context. To improve the performance of LLMs on information extraction tasks, we can restructure our output prompts accordingly.
The key strategy is to avoid including important information in the user prompt. Instead, we should integrate it into the assistant’s output, ensuring that the text is positioned near the location where information needs to be extracted. Building on previous examples, we aim to adhere to the following template:
## User
Create a JSON with keys:
id: Unique identifier
text: Original text content.
city: City name from the text.
time: Time or date from the text.
## Assistant
[
  {
    "id": 90,
    "text": "At midnight in Paris, the harbor lights reflect beautifully on the calm waters.",
    "city": "<LLM>",
    "time": "<LLM>"
  },
  {
    "id": 215,
    "text": "The bustling streets of Amsterdam are particularly vibrant at 5:30 PM on weekdays when office hours end.",
    "city": "<LLM>",
    "time": "<LLM>"
  }
]
The setup remains the same as in the previous approach, with the key distinction that the text is now positioned close to the labels. This adjustment helps LLMs associate the most recent context with the output. In practice, I’ve observed that smaller models generate higher-quality and more accurate outputs when the input is presented in this format, particularly in cases where the template from the previous section yielded poor results.
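As a sketch of how this template could be assembled programmatically, the snippet below parses the raw 'id: message' lines and places each text right next to the fields that still need to be filled in. The builder function is illustrative and not taken from the linked gists; the resulting user_prompt and assistant_template can be fed to the same fill-in loop used for the first approach.

import json

raw_lines = [
    "90: At midnight in Paris, the harbor lights reflect beautifully on the calm waters.",
    "215: The bustling streets of Amsterdam are particularly vibrant at 5:30 PM on weekdays when office hours end.",
]

user_prompt = (
    "Create a JSON with keys:\n"
    "id: Unique identifier\n"
    "text: Original text content.\n"
    "city: City name from the text.\n"
    "time: Time or date from the text."
)

def build_assistant_template(lines):
    """Return the pre-filled assistant output with <LLM> markers for the
    fields the model still has to generate."""
    entries = []
    for line in lines:
        identifier, text = line.split(":", 1)  # split only on the first colon
        entries.append({
            "id": int(identifier),
            "text": text.strip(),
            "city": "<LLM>",
            "time": "<LLM>",
        })
    # json.dumps escapes the texts; the <LLM> markers remain ordinary strings.
    return json.dumps(entries, indent=2)

assistant_template = build_assistant_template(raw_lines)
print(assistant_template)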
The code for both versions can be accessed via the links below.
Code
- The most basic version of the first approach: https://gist.github.com/itdxer/fc96b8861422b7d504b2b7d121a440d7
- Main code which includes more examples and has a more complicated way of handling predictions: https://gist.github.com/itdxer/942e61cb2eb254c9ef2a472076103793