In the previous article we looked at the numerous challenges involved in navigating Excel for LLMs; in this article we try to understand how LLMs can make sense of tabular data. LLMs have been trained primarily on unstructured data, so parsing a structured data format (the first layer of reading an Excel sheet) requires rethinking the paradigm. Think of structured data as a highly organized, predictable format in which information is arranged in a tabular manner with clearly defined rows and columns. Each cell has a specific meaning based on its position within this grid. In contrast, unstructured data, such as the text of this article, follows a more fluid, freeform structure: while there are grammatical rules and conventions, the information is not neatly compartmentalized into predefined slots. For an LLM, transitioning from the world of unstructured text to the rigidly structured realm of spreadsheets is akin to learning a new language with a completely different syntax and grammar. It requires a fundamental shift in how the model processes and interprets information.
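To make the contrast concrete, here is a minimal illustrative sketch (not drawn from the paper, and the figures are invented) showing the same facts expressed as free-flowing prose versus a position-dependent grid, and how that grid must ultimately be serialized back into a token stream before an LLM can read it:

```python
# Hypothetical example: the same facts as unstructured prose vs. a structured grid.

unstructured = (
    "Acme's revenue grew from 1.2M in 2022 to 1.5M in 2023, "
    "while costs stayed flat at 0.9M."
)

# Structured form: the meaning of each value comes from its (row, column) position.
structured = [
    ["Metric",  "2022", "2023"],   # header row defines the column semantics
    ["Revenue", "1.2M", "1.5M"],
    ["Costs",   "0.9M", "0.9M"],
]

# Before an LLM can consume the grid, it has to be serialized into plain text,
# for instance as a simple pipe-delimited layout:
serialized = "\n".join(" | ".join(row) for row in structured)
print(serialized)
```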
For this article, we are going to draw heavily from the comprehensive work “Table Meets LLM: Can Large Language Models Understand Structured Table Data?” by Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang, which was published in the Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM ’24).
One of the key findings of the study is the significant impact that input design choices have on an LLM’s ability to understand and process structured tabular data. The authors explore a wide range of input design options, each with its own characteristics and potential benefits. These choices include natural language with separators, various markup languages (such as HTML, XML, and JSON), format explanations, role prompting, partition marks, and even the order in which the content is presented. The study finds that the most effective input design uses HTML markup together with format explanations and role prompts, while keeping the content order unchanged. This combination achieves the highest accuracy, 65.43%, across the seven tasks in the Structural Understanding Capabilities (SUC) benchmark. Moreover, including prompt examples, also known as few-shot learning, significantly increases the LLM’s performance, suggesting a strong dependence on learning from examples within the context.
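The sketch below illustrates what such an input design might look like in practice: a role prompt, a format explanation, an optional few-shot example, and the table serialized as HTML, with the content order left unchanged. The exact wording and the helper functions here are assumptions for illustration, not the prompts used in the paper.

```python
# Illustrative sketch of the input design the study found most effective:
# HTML markup + format explanation + role prompt (+ optional few-shot example).
# The prompt wording below is assumed, not quoted from the paper.

def table_to_html(header, rows):
    """Serialize a table as simple HTML markup."""
    head = "".join(f"<th>{h}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>" for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

def build_prompt(header, rows, question, example=None):
    role = "You are an expert at reading HTML tables."              # role prompting
    fmt = ("The table is given in HTML. Each <tr> is a row and "    # format explanation
           "each <td> is a cell; the first row holds the column names.")
    parts = [role, fmt]
    if example:                                                      # few-shot example
        parts.append(f"Example:\n{example}")
    parts.append(table_to_html(header, rows))                        # content order unchanged
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

print(build_prompt(["City", "Population"],
                   [["Colombo", "752,993"], ["Kandy", "111,701"]],
                   "Which city has the larger population?"))
```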
Beyond the input format itself, the study also highlights several other prompt-engineering observations. A few of these include: