Difference between revisions of "Meeting Schedule"

From CompSemWiki
 
(45 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''Location:''' Hybrid - Buchanan 430, and the zoom link below
+
'''Location:''' Hybrid - Muenzinger D430, and the zoom link below
  
'''Time:''' Wednesdays at 10:30am, Mountain Time
+
'''Time:''' Wednesdays at 11:30am, Mountain Time
  
 
'''Zoom link:''' https://cuboulder.zoom.us/j/97014876908
 
Line 13: Line 13:
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 01/24/2024 || '''Planning, introductions, welcome!'''
+
| 08/28/2024 || '''Planning, introductions, welcome!'''
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 01/31/2024 || Brunch Social  
+
| 09/04/2024 || Brunch Social  
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 02/07/2024 || '''No Meeting''' - Virtual PhD Open House
+
| 09/11/2024 || Watch and discuss NLP keynote
 +
 
 +
'''Winner:''' Barbara Plank’s “Are LLMs Narrowing our Horizon? Let’s Embrace Variation in NLP!”
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 02/14/2024 || ACL paper clinic
+
| 09/18/2024 || CLASIC presentations
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 02/21/2024 || Cancelled in favor of LING Circle talk by Professor Gibbs
+
| 09/25/2024 || Invited talks/discussions from Leeds and Anschutz folks: Liu Liu, Abe Handler, Yanjun Gao
  
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 02/28/2024 || Short talks by Kathy McKeown and Robin Burke
+
| 10/02/2024 || Martha Palmer, Annie Zaenen, Susan Brown, Alexis Cooper.
  
Kathy's web page: https://www.cs.columbia.edu/~kathy/
+
'''Title:''' Testing GPT4's interpretation of the Caused-Motion Construction
  
Title: Addressing Large Language Models that Lie: Case Studies in Summarization
+
'''Abstract:''' The fields of Artificial Intelligence and Natural Language Processing have been revolutionized by the advent of Large Language Models such as GPT4. They are perceived as being language experts and there is a lot of speculation about how intelligent they are, with claims being made about “Sparks of General Artificial Intelligence.” This talk will describe in detail an English linguistic construction, the Caused Motion Construction, and compare prior interpretation approaches with current LLM interpretations. The prior approaches are based on VerbNet. Its unique contributions to prior approaches will be outlined. Then the results of a recent preliminary study probing GPT4’s analysis of the same constructions will be presented. Not surprisingly, this analysis illustrates both strengths and weaknesses of GPT4’s ability to interpret Caused Motion Constructions and to generalize this interpretation.
  
Kathleen McKeown
+
Recording: https://o365coloradoedu-my.sharepoint.com/:v:/r/personal/mpalmer_colorado_edu/Documents/BoulderNLP-Palmer-Oct2-2024.mp4?csf=1&web=1&nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=aCHeN8
Columbia University
 
 
The advent of large language models promises a new level of performance in generation of text of all kinds, enabling generation of text that is far more fluent, coherent and relevant than was previously possible. However, they also introduce a major new problem: they wholly hallucinate facts out of thin air. When summarizing an input document, they may incorrectly intermingle facts from the input, they may introduce facts that were not mentioned at all, and worse yet, they may even make up things that are not true in the real world. In this talk, I will discuss our work in characterizing the kinds of errors that can occur and methods that we have developed to help mitigate hallucination in language modeling approaches to text summarization for a variety of genres.
 
 
Kathleen R. McKeown is the Henry and Gertrude Rothschild Professor of Computer Science at Columbia University and the Founding Director of the Data Science Institute, serving as Director from 2012 to 2017. In earlier years, she served as Department Chair (1998-2003) and as Vice Dean for Research for the School of Engineering and Applied Science (2010-2012). A leading scholar and researcher in the field of natural language processing, McKeown focuses her research on the use of data for societal problems; her interests include text summarization, question answering, natural language generation, social media analysis and multilingual applications. She has received numerous honors and awards, including 2023 IEEE Innovation in Societal Infrastructure Award, American Philosophical Society Elected member, American Academy of Arts and Science elected member, American Association of Artificial Intelligence Fellow, a Founding Fellow of the Association for Computational Linguistics and an Association for Computing Machinery Fellow. Early on she received the National Science Foundation Presidential Young Investigator Award, and a National Science Foundation Faculty Award for Women. In 2010, she won both the Columbia Great Teacher Award—an honor bestowed by the students—and the Anita Borg Woman of Vision Award for Innovation.
 
 
 
  
Title: Multistakeholder fairness in recommender systems
 
  
Robin Burke
+
|- style="border-top: 2px solid DarkGray;"
University of Colorado Boulder
+
| 10/09/2024 || NAACL Paper Clinic: Come get feedback on your submission drafts!
 
Abstract: Research in machine learning fairness makes two key simplifying assumptions that have proven challenging to move beyond. One assumption is that we can productively concentrate on a uni-dimensional version of the problem: achieving fairness for a single protected group defined by a single sensitive feature. The second assumption is that technical solutions need not engage with the essentially political nature of claims surrounding fairness. I argue that relaxing these assumptions is necessary for machine learning fairness to achieve practical utility. While some recent research in rich subgroup fairness has considered ways to relax the first assumption, these approaches require that fairness be defined in the same way for all groups, which amounts to a hardening of the second assumption. In this talk, I argue for a formulation of machine learning fairness based on social choice and exemplify the approach in the area of recommender systems. Social choice is inherently multi-agent, escaping the single group assumption and, in its classic formulation, places no constraints on agents' preferences. In addition, social choice was developed to formalize political decision-making mechanisms, such as elections, and therefore offers some hope of directly addressing the inherent politics of fairness. Social choice has complexities of its own, however, and the talk will outline a research agenda aimed at understanding the challenges and opportunities afforded by this approach to machine learning fairness.
 
 
Bio: Information Science Department Chair and Professor Robin Burke conducts research in personalized recommender systems, a field he helped found and develop. His most recent projects explore fairness, accountability and transparency in recommendation through the integration of objectives from diverse stakeholders. Professor Burke is the author of more than 150 peer-reviewed articles in various areas of artificial intelligence including recommender systems, machine learning and information retrieval. His work has received support from the National Science Foundation, the National Endowment for the Humanities, the Fulbright Commission and the MacArthur Foundation, among others.
 
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 03/06/2024 || '''Jon Cai''', CU Boulder Computer Science, PhD proposal defense
+
| 10/16/2024 || Senior Thesis Proposals:
 +
 
  
'''Title:'''  
+
'''Alexandra Barry'''
Learning Fast and Slow with Semantics
 
  
'''Abstract:'''
+
'''Title''': Benchmarking LLM Handling of Cross-Dialectal Spanish
Abstract Meaning Representation (AMR) is a linguistic formalism that captures and encodes the semantics of natural language. It is one of the most widely accepted implementations of the truth-value-based theory of meaning. Since its introduction, the impact of AMR has broadened from its original design objective of aiding machine translation to more NLP tasks, such as information extraction, summarization, and multi-modal semantic alignment. Meanwhile, AMR serves as a theoretical tool for computational semantics research to advance semantic theories. Being able to model holistic semantics has thus become one of the ultimate goals for the NLP and computational linguistics community. Despite the amazing advancement of LLMs in recent years, we still see gaps between shallow and deep semantic understanding in machine learning models. In this proposal, we go through the generalization issues that AMR parsing models exhibit and our proposed solutions for designing new methodologies and analytical tools to help us navigate the labyrinth of modeling semantics via AMR.
+
 
 +
'''Abstract''': This proposal introduces current issues and gaps in cross-dialectal NLP in Spanish as well as the lack of resources available for Latin American dialects. The presentation will cover past work in dialect detection, translation, and benchmarking in order to build a foundation for a proposal that aims to create a benchmark that analyses LLM robustness across a series of tasks in different Spanish dialects.
 +
 
 +
 
 +
 
 +
'''Tavin Turner'''
 +
 
 +
'''Title''': Agreeing to Disagree: Statutory Relational Stance Modeling
 +
 
 +
'''Abstract''': Policy division deeply affects which bills get passed in legislature, and how. So far, statutory NLP has predicted voting breakdowns, interpreted stakeholder benefit, informed legal decision support systems, and much more. In practice, legislation demands compromise and concession to pass important policy, yet models often struggle to reason over the whole act. Leveraging neuro-symbolic models, we seek to intermediate this challenge with relational structures of statutes’ sectional stances – modeling stance agreement, exception, etc. Beyond supporting downstream statutory analysis tasks, these structures could help stakeholders understand how a bill impacts them, litmus the cooperation within a legislature, and reveal patterns of compromise that aid a bill through ratification.
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 03/13/2024 || Veronica Qing Lyu,
+
| 10/23/2024 || '''Ananya Ganesh''''s PhD Dissertation Proposal
  
'''Title:''' Faithful Chain of Thought Reasoning (https://aclanthology.org/2023.ijcnlp-main.20/)
+
'''Title''': Reliable Language Technology for Classroom Dialog Understanding
  
'''Abstract:'''
+
'''Abstract''': In this proposal, I will lay out how NLP models can be developed to address realistic use cases in analyzing classroom dialogue. Towards this goal, I will first introduce a new task and corresponding dataset, focused on detecting off-task utterances in small-group discussions. I will
While Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (aka. faithfulness). We propose Faithful CoT, a reasoning framework involving two stages: Translation (Natural Language query → symbolic reasoning chain) and Problem Solving (reasoning chain → answer), using an LM and a deterministic solver respectively. This guarantees that the reasoning chain provides a faithful explanation of the final answer. Aside from interpretability, Faithful CoT also improves empirical performance: it outperforms standard CoT on 9 of 10 benchmarks from 4 diverse domains, with a relative accuracy gain of 6.3% on Math Word Problems (MWP), 3.4% on Planning, 5.5% on Multi-hop Question Answering (QA), and 21.4% on Relational Inference. Furthermore, with GPT-4 and Codex, it sets the new state-of-the-art few-shot performance on 7 datasets (with 95.0+ accuracy on 6 of them), showing a strong synergy between faithfulness and accuracy.
+
then propose a method to solve this task that considers how the inherent structure in the dialog can be used to learn richer representations of the dialog context. Next, I will introduce preliminary work on applying LLMs in the in-context learning setting for a broad range of tasks pertaining to qualitative coding of classroom dialog, and discuss potential follow-up work. Finally, keeping in mind our goals of serving many independent stakeholders, I will propose a study to incorporate differing stakeholders’ subjective judgments while curating gold-standard data for classroom discourse analysis.
  
'''Bio:'''
 
Veronica Qing Lyu is a fifth-year PhD student in Computer and Information Science at the University of Pennsylvania, advised by Chris Callison-Burch and Marianna Apidianaki. Her current research interests lie in the intersection of linguistics and natural language processing, explainable AI, and reasoning. Her paper "Faithful Chain-of-Thought Reasoning" received the Area Chair Award at IJCNLP-AACL 2023 (Interpretability and Analysis of Models for NLP track). She will co-organize a tutorial on “Explanations in the Era of Large Language Models” in NAACL 2024. Before Penn, she studied linguistics as an undergraduate student at the Department of Foreign Languages and Literatures at Tsinghua University.
 
 
|- style="border-top: 2px solid DarkGray;"
 
| 03/20/2024 || Cory Paik's Area Exam
+
| 10/30/2024 || '''Marie McGregor''''s area exam
 +
 
 +
'''Title''': Adapting AMR Metrics to UMR Graphs
 +
 +
'''Abstract''': Uniform Meaning Representation (UMR) expands on the capabilities of Abstract Meaning Representation (AMR) by supporting document-level annotation, suitability for low-resource languages, and support for logical inference. As a framework for any sort of representation is developed, a way to measure the similarities or differences between two representations must be developed in tandem to support the creation of parsers and for computing inter-annotator agreement (IAA). Fortunately, there exists robust research into metrics to assess the similarity of AMR graphs. The usefulness of these metrics to UMRs depends on four key aspects: scalability, correctness, interpretability, and cross-lingual suitability. This paper investigates the applicability of AMR metrics to UMR graphs along these aspects in order to create useful and reliable UMR metrics.
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 03/27/2024 || '''No Meeting''' - Spring Break
+
| 11/06/2024 || Short presentations / discussions: Curry Guinn, Yifu Wu, Kevin Stowe
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 04/03/2024 || CLASIC Industry Day
+
| 11/13/2024 || Invited talk by '''Nick Dronen''' and '''Seminar Lunch'''
 +
 
 +
'''Title''': SETLEXSEM CHALLENGE: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models
 +
 +
'''Abstract''': Set theory is foundational to mathematics and, when sets are finite, to reasoning about the world. An intelligent system should perform set operations consistently, regardless of superficial variations in the operands. Initially designed for semantically-oriented NLP tasks, large language models (LLMs) are now being evaluated on algorithmic tasks. Because sets are comprised of arbitrary symbols (e.g. numbers, words), they provide an opportunity to test, systematically, the invariance of LLMs’ algorithmic abilities under simple lexical or semantic variations. To this end, we present the SETLEXSEM CHALLENGE, a synthetic benchmark that evaluates the performance of LLMs on set operations. SETLEXSEM assesses the robustness of LLMs’ instruction-following abilities under various conditions, focusing on the set operations and the nature and construction of the set members. Evaluating seven LLMs with SETLEXSEM, we find that they exhibit poor robustness to variation in both operation and operands. We show – via the framework’s systematic sampling of set members along lexical and semantic dimensions – that LLMs are not only not robust to variation along these dimensions but demonstrate unique failure modes in particular, easy-to-create semantic groupings of "deceptive" sets. We find that rigorously measuring language model robustness to variation in frequency and length is challenging and present an analysis that measures them independently.
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 04/10/2024 || Rehan's Dissertation Defense 
+
| 11/20/2024 || '''Abteen’s proposal'''
 +
 
 +
'''When''': Wed. Nov 20, 11:30 am
 +
 
 +
'''Where''': MUEN D430 and zoom https://cuboulder.zoom.us/j/97014876908
 +
 
 +
'''Title''': Extending Benchmarks and Multilingual Models to Truly Low-Resource Languages
 +
 +
'''Abstract''': Driven by successes in large-scale data collection and training efforts, the field of natural language processing (NLP) has seen a dramatic surge in model performance. However, the vast majority of the roughly 7,000 languages spoken across the globe do not have the necessary amounts of easily available text resources and have not been able to share in these advancements. In this proposal, we focus on how best to improve pretrained model performance for these languages, which we refer to as truly low-resource. First, we discuss model adaptation techniques which leverage unlabeled data and discuss experiments which evaluate these approaches in a realistic setting. Next, we address a limitation of prior work, and describe two data collection efforts for low-resource languages. We further present a synthetic evaluation resource which tests a model's understanding of a specific linguistic phenomenon: lexical gaps. Finally, we propose additional analysis experiments that aim to address disagreements across prior work, and extend these experiments to include low-resource languages.
 +
 
 +
 
 +
 
 +
'''Alex’s area exam''':
 +
 
 +
'''When''': Wed. Nov 20, 1:30 pm
 +
 
 +
'''Where''': MUEN E214 and zoom https://cuboulder.zoom.us/j/97014876908
 +
 
 +
'''Title''': Computational Media Framing Analysis through Rhetorical Devices and Linguistic Features
 +
 
 +
'''Abstract''': Over the past decade, there has been an increased focus on media framing in the Natural Language Processing (NLP) community. Framing has been defined as “select[ing] some aspects of a perceived reality and mak[ing] them more salient in a communicating text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described” (Entman, 1993). This computational work generally seeks to quantify framing on a large scale to raise awareness about media bias. A prevalent paradigm for computational framing analysis focuses on studying high-level topical information. Though highly generalizable, this approach addresses only emphasis framing: when a writer or speaker highlights a particular aspect of a topic more frequently than others. However, prior framing work is broad, encompassing many other facets and types of framing present in the media. In recognition of this, there has been a recent line of work seeking to subvert the earlier focus on topical information. In this survey, we present an analysis of work which is both in line with goals of expanding the breadth of computational framing analysis and is generalizable. We focus on work which analyzes the role of rhetorical devices and linguistic features to reveal insights about media framing.
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 04/17/2024 || Maggie's Proposal
+
| 11/27/2024 || '''No meeting:''' Fall break
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 04/24/2024 || Téa's Senior Thesis Defense
+
| 12/04/2024 || Enora's prelim
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 05/01/2024 || Sagi's Proposal
+
| 12/11/2024 ||  
  
 
|- style="border-top: 2px solid DarkGray;"
 
| 05/08/2024 || Mary's Prelim
+
| 01/23/2025 || Chenhao Tan CS Colloquium, 3:30pm
 
 
  
  
Line 100: Line 125:
  
 
=Past Schedules=
 
 +
* [[Spring 2024 Schedule]]
 
* [[Fall 2023 Schedule]]
 
 
* [[Spring 2023 Schedule]]
 

Latest revision as of 15:45, 18 November 2024

Location: Hybrid - Muenzinger D430, and the zoom link below

Time: Wednesdays at 11:30am, Mountain Time

Zoom link: https://cuboulder.zoom.us/j/97014876908

Date Title
08/28/2024 Planning, introductions, welcome!
09/04/2024 Brunch Social
09/11/2024 Watch and discuss NLP keynote

Winner: Barbara Plank’s “Are LLMs Narrowing our Horizon? Let’s Embrace Variation in NLP!”

09/18/2024 CLASIC presentations
09/25/2024 Invited talks/discussions from Leeds and Anschutz folks: Liu Liu, Abe Handler, Yanjun Gao


10/02/2024 Martha Palmer, Annie Zaenen, Susan Brown, Alexis Cooper.

Title: Testing GPT4's interpretation of the Caused-Motion Construction

Abstract: The fields of Artificial Intelligence and Natural Language Processing have been revolutionized by the advent of Large Language Models such as GPT4. They are perceived as being language experts and there is a lot of speculation about how intelligent they are, with claims being made about “Sparks of General Artificial Intelligence.” This talk will describe in detail an English linguistic construction, the Caused Motion Construction, and compare prior interpretation approaches with current LLM interpretations. The prior approaches are based on VerbNet. Its unique contributions to prior approaches will be outlined. Then the results of a recent preliminary study probing GPT4’s analysis of the same constructions will be presented. Not surprisingly, this analysis illustrates both strengths and weaknesses of GPT4’s ability to interpret Caused Motion Constructions and to generalize this interpretation.

Recording: https://o365coloradoedu-my.sharepoint.com/:v:/r/personal/mpalmer_colorado_edu/Documents/BoulderNLP-Palmer-Oct2-2024.mp4?csf=1&web=1&nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=aCHeN8


10/09/2024 NAACL Paper Clinic: Come get feedback on your submission drafts!
10/16/2024 Senior Thesis Proposals:


Alexandra Barry

Title: Benchmarking LLM Handling of Cross-Dialectal Spanish

Abstract: This proposal introduces current issues and gaps in cross-dialectal NLP in Spanish as well as the lack of resources available for Latin American dialects. The presentation will cover past work in dialect detection, translation, and benchmarking in order to build a foundation for a proposal that aims to create a benchmark that analyses LLM robustness across a series of tasks in different Spanish dialects.


Tavin Turner

Title: Agreeing to Disagree: Statutory Relational Stance Modeling

Abstract: Policy division deeply affects which bills get passed in legislature, and how. So far, statutory NLP has predicted voting breakdowns, interpreted stakeholder benefit, informed legal decision support systems, and much more. In practice, legislation demands compromise and concession to pass important policy, yet models often struggle to reason over the whole act. Leveraging neuro-symbolic models, we seek to intermediate this challenge with relational structures of statutes’ sectional stances – modeling stance agreement, exception, etc. Beyond supporting downstream statutory analysis tasks, these structures could help stakeholders understand how a bill impacts them, litmus the cooperation within a legislature, and reveal patterns of compromise that aid a bill through ratification.

10/23/2024 Ananya Ganesh's PhD Dissertation Proposal

Title: Reliable Language Technology for Classroom Dialog Understanding

Abstract: In this proposal, I will lay out how NLP models can be developed to address realistic use cases in analyzing classroom dialogue. Towards this goal, I will first introduce a new task and corresponding dataset, focused on detecting off-task utterances in small-group discussions. I will then propose a method to solve this task that considers how the inherent structure in the dialog can be used to learn richer representations of the dialog context. Next, I will introduce preliminary work on applying LLMs in the in-context learning setting for a broad range of tasks pertaining to qualitative coding of classroom dialog, and discuss potential follow-up work. Finally, keeping in mind our goals of serving many independent stakeholders, I will propose a study to incorporate differing stakeholders’ subjective judgments while curating gold-standard data for classroom discourse analysis.

10/30/2024 Marie McGregor's area exam

Title: Adapting AMR Metrics to UMR Graphs

Abstract: Uniform Meaning Representation (UMR) expands on the capabilities of Abstract Meaning Representation (AMR) by supporting document-level annotation, suitability for low-resource languages, and support for logical inference. As a framework for any sort of representation is developed, a way to measure the similarities or differences between two representations must be developed in tandem to support the creation of parsers and for computing inter-annotator agreement (IAA). Fortunately, there exists robust research into metrics to assess the similarity of AMR graphs. The usefulness of these metrics to UMRs depends on four key aspects: scalability, correctness, interpretability, and cross-lingual suitability. This paper investigates the applicability of AMR metrics to UMR graphs along these aspects in order to create useful and reliable UMR metrics.

11/06/2024 Short presentations / discussions: Curry Guinn, Yifu Wu, Kevin Stowe
11/13/2024 Invited talk by Nick Dronen and Seminar Lunch

Title: SETLEXSEM CHALLENGE: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models

Abstract: Set theory is foundational to mathematics and, when sets are finite, to reasoning about the world. An intelligent system should perform set operations consistently, regardless of superficial variations in the operands. Initially designed for semantically-oriented NLP tasks, large language models (LLMs) are now being evaluated on algorithmic tasks. Because sets are comprised of arbitrary symbols (e.g. numbers, words), they provide an opportunity to test, systematically, the invariance of LLMs’ algorithmic abilities under simple lexical or semantic variations. To this end, we present the SETLEXSEM CHALLENGE, a synthetic benchmark that evaluates the performance of LLMs on set operations. SETLEXSEM assesses the robustness of LLMs’ instruction-following abilities under various conditions, focusing on the set operations and the nature and construction of the set members. Evaluating seven LLMs with SETLEXSEM, we find that they exhibit poor robustness to variation in both operation and operands. We show – via the framework’s systematic sampling of set members along lexical and semantic dimensions – that LLMs are not only not robust to variation along these dimensions but demonstrate unique failure modes in particular, easy-to-create semantic groupings of "deceptive" sets. We find that rigorously measuring language model robustness to variation in frequency and length is challenging and present an analysis that measures them independently.

11/20/2024 Abteen’s proposal

When: Wed. Nov 20, 11:30 am

Where: MUEN D430 and zoom https://cuboulder.zoom.us/j/97014876908

Title: Extending Benchmarks and Multilingual Models to Truly Low-Resource Languages

Abstract: Driven by successes in large-scale data collection and training efforts, the field of natural language processing (NLP) has seen a dramatic surge in model performance. However, the vast majority of the roughly 7,000 languages spoken across the globe do not have the necessary amounts of easily available text resources and have not been able to share in these advancements. In this proposal, we focus on how best to improve pretrained model performance for these languages, which we refer to as truly low-resource. First, we discuss model adaptation techniques which leverage unlabeled data and discuss experiments which evaluate these approaches in a realistic setting. Next, we address a limitation of prior work, and describe two data collection efforts for low-resource languages. We further present a synthetic evaluation resource which tests a model's understanding of a specific linguistic phenomenon: lexical gaps. Finally, we propose additional analysis experiments that aim to address disagreements across prior work, and extend these experiments to include low-resource languages.


Alex’s area exam:

When: Wed. Nov 20, 1:30 pm

Where: MUEN E214 and zoom https://cuboulder.zoom.us/j/97014876908

Title: Computational Media Framing Analysis through Rhetorical Devices and Linguistic Features

Abstract: Over the past decade, there has been an increased focus on media framing in the Natural Language Processing (NLP) community. Framing has been defined as “select[ing] some aspects of a perceived reality and mak[ing] them more salient in a communicating text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described” (Entman, 1993). This computational work generally seeks to quantify framing on a large scale to raise awareness about media bias. A prevalent paradigm for computational framing analysis focuses on studying high-level topical information. Though highly generalizable, this approach addresses only emphasis framing: when a writer or speaker highlights a particular aspect of a topic more frequently than others. However, prior framing work is broad, encompassing many other facets and types of framing present in the media. In recognition of this, there has been a recent line of work seeking to subvert the earlier focus on topical information. In this survey, we present an analysis of work which is both in line with goals of expanding the breadth of computational framing analysis and is generalizable. We focus on work which analyzes the role of rhetorical devices and linguistic features to reveal insights about media framing.

11/27/2024 No meeting: Fall break
12/04/2024 Enora's prelim
12/11/2024
01/23/2025 Chenhao Tan CS Colloquium, 3:30pm


Past Schedules