LLMs in Investment Research (II) - Navigating Structured Data

In the previous article we looked at the numerous challenges LLMs face when navigating Excel; in this article we try to understand how LLMs make sense of tabular data. LLMs have been trained primarily on unstructured data, so parsing a structured data format (the first layer of reading an Excel sheet) requires rethinking the paradigm. Think of structured data as a highly organized, predictable format where information is arranged in a tabular manner with clearly defined rows and columns. Each cell has a specific meaning based on its position within this grid. In contrast, unstructured data, such as the text in this article, follows a more fluid, freeform structure. While there are grammatical rules and conventions, the information is not neatly compartmentalized into predefined slots. For an LLM, transitioning from the world of unstructured text to the rigidly structured realm of spreadsheets is akin to learning a new language with a completely different syntax and grammar. It requires a fundamental shift in how the model processes and interprets information.

For this article, we are going to draw heavily from the comprehensive work “Table Meets LLM: Can Large Language Models Understand Structured Table Data?” by Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang, which was published in the Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM ’24).

Unraveling the Structural Understanding Capabilities of LLMs

The Structural Understanding Capabilities (SUC) benchmark, introduced by Yuan Sui et al., serves as a powerful tool to evaluate the ability of Large Language Models (LLMs) to comprehend and process structured tabular data. This benchmark is designed to assess various aspects of an LLM’s understanding of table structures through a series of seven carefully crafted tasks. These tasks range from simple challenges like table partition and size detection to more complex problems such as cell lookup and row/column retrieval. By subjecting LLMs like GPT-3.5 and GPT-4 to this benchmark, researchers can gain valuable insights into the current state of these models in handling structured data.
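To make the benchmark concrete, below is a minimal sketch of what a table size detection probe might look like. The toy table, its values, and the prompt wording are ours for illustration only; the paper's actual prompt templates and evaluation harness differ.

```python
# A minimal sketch of a SUC-style "table size detection" probe.
# The table values and prompt wording are illustrative, not the paper's.

table_markdown = (
    "| Ticker | Revenue ($bn) | EPS ($) |\n"
    "|--------|---------------|---------|\n"
    "| AAPL   | 383.3         | 6.13    |\n"
    "| MSFT   | 211.9         | 9.68    |\n"
)

prompt = (
    "You are given a table. Answer with the number of data rows "
    "(excluding the header) and the number of columns.\n\n"
    f"{table_markdown}\n"
    "Answer in the form: rows=<int>, columns=<int>"
)

# Expected answer for this toy table: rows=2, columns=3.
# The model's reply is scored against this ground truth, task by task.
print(prompt)
```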

The SUC benchmark not only provides a standardized way to measure an LLM’s performance but also helps identify areas where these models excel and where they struggle. For instance, the authors found that even for seemingly trivial tasks like table size detection, LLMs are not perfect, highlighting the need for further improvements in their structural understanding capabilities. Moreover, by breaking down the problem of understanding tabular data into distinct tasks, the benchmark allows for a more nuanced analysis of an LLM’s strengths and weaknesses. This granular approach can guide researchers in developing targeted strategies to enhance the models’ performance on specific aspects of structured data comprehension.

Role of Input Design Choices in Enhancing Table Comprehension

One of the key findings of the study is the significant impact that input design choices have on an LLM’s ability to understand and process structured tabular data. The authors explore a wide range of input design options, each with its own characteristics and potential benefits. These choices include using natural language with separators, various markup languages (such as HTML, XML, and JSON), format explanations, role prompting, partition marks, and even the order in which the content is presented. The study reveals that the most effective input design is HTML markup combined with format explanations and role prompts, while keeping the content order unchanged; this particular combination achieves the highest accuracy, 65.43%, across the seven tasks in the Structural Understanding Capabilities (SUC) benchmark. Moreover, including prompt examples (also known as few-shot learning) significantly increases an LLM’s performance, suggesting a strong dependence on learning from examples within the context.
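The sketch below shows one way such a prompt could be assembled: a role prompt, a brief format explanation, the question placed ahead of the table, and the table itself serialized as HTML. The table values and the wording are ours, not the paper's exact templates.

```python
# Illustrative sketch of the input design the authors found most effective:
# role prompt + format explanation + HTML-serialized table, with the
# question placed ahead of the table. Wording and data are ours.

rows = [
    ["Ticker", "Revenue ($bn)", "EPS ($)"],
    ["AAPL", "383.3", "6.13"],
    ["MSFT", "211.9", "9.68"],
]

def to_html(rows):
    """Serialize a list-of-lists table into minimal HTML markup."""
    header = "".join(f"<th>{c}</th>" for c in rows[0])
    body = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>"
        for row in rows[1:]
    )
    return f"<table><tr>{header}</tr>{body}</table>"

question = "Which company has the higher EPS?"

prompt = (
    "You are a financial analyst working with spreadsheet extracts.\n"  # role prompt
    "The table below is given in HTML: <table> wraps the table, <tr> "
    "marks a row, <th> a header cell and <td> a data cell.\n\n"         # format explanation
    f"Question: {question}\n\n"                                         # context ahead of the table
    f"{to_html(rows)}"
)

print(prompt)
```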

The study also surfaces other prompt-engineering findings. A few observations include:

  • Including external information ahead of the tables can lead to better generalization and context understanding.
  • Having partition marks and format explanations may hinder an LLM’s search and retrieval capabilities while still improving its overall performance on downstream tasks.

Empowering LLMs with Self-Augmented Prompting

The core idea behind self-augmented prompting is to motivate LLMs to generate intermediate structural knowledge by internally retrieving and utilizing the information they have already acquired during training. The process of self-augmented prompting involves a two-step approach. First, the LLM is prompted to generate additional knowledge about the table, focusing on identifying critical values, ranges, and other relevant structural information. This step essentially unlocks its reasoning abilities and enables it to extract meaningful insights from the tabular data. In the second step, the generated intermediate knowledge is incorporated into the prompt, guiding the LLM to produce a more accurate and contextually relevant final answer.
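The sketch below outlines this two-step loop. Here `call_llm` is a hypothetical stand-in for whatever chat-completion client you use, and the prompt wording is ours rather than the paper's templates.

```python
# A minimal sketch of the two-step self-augmented prompting loop.
# `call_llm` is a hypothetical placeholder for an LLM API call.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM (e.g. a chat-completion endpoint)."""
    raise NotImplementedError

def self_augmented_answer(table_html: str, question: str) -> str:
    # Step 1: ask the model to surface structural knowledge about the table
    # (columns, value ranges, cells that look critical) before answering.
    knowledge_prompt = (
        "Describe the structure of the following table: its columns, the "
        "type and range of values in each, and any cells that look critical "
        f"for downstream reasoning.\n\n{table_html}"
    )
    intermediate_knowledge = call_llm(knowledge_prompt)

    # Step 2: feed the self-generated knowledge back in alongside the
    # original table and the actual question.
    answer_prompt = (
        f"Table:\n{table_html}\n\n"
        f"Structural notes (generated previously):\n{intermediate_knowledge}\n\n"
        f"Question: {question}\nAnswer concisely."
    )
    return call_llm(answer_prompt)
```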

One of the key advantages of self-augmented prompting is its versatility. It can be easily integrated with various input design choices, such as markup languages, format explanations, and role prompting, to further optimize performance. Essentially, by motivating LLMs to retrieve and utilize their own knowledge, this technique bridges the gap between the models’ natural language understanding capabilities and their ability to comprehend and reason over tabular information.

When combined with carefully selected input designs, self-augmented prompting has demonstrated significant improvements across a wide range of tabular tasks. For instance, it has led to notable gains in accuracy on benchmark datasets like TabFact, HybridQA, and SQA, which involve question answering and fact verification based on structured data. Similarly, self-augmented prompting has proven effective in enhancing the performance of LLMs on Feverous, another fact-verification benchmark, and on ToTTo, which requires generating natural language descriptions from tabular information.

Looking to integrate AI into your Research?

Our flagship product Sentinel is designed to be an end-to-end platform that helps Investment Research companies 2x their productivity.

