Accepted Abstracts
Responsible AI Symposium 2025
Research Spotlight Talks
Near-Optimal Decision Trees in a SPLIT Second
Hayden McTavish (Duke)*, Varun Babbar (Duke)*, Cynthia Rudin (Duke), Margo Seltzer (University of British Columbia)
Decision tree optimization is fundamental to interpretable machine learning. The most popular approach is to greedily search for the best feature at every decision point, which is fast but provably suboptimal. Recent approaches find the global optimum using branch and bound with dynamic programming, showing substantial improvements in accuracy and sparsity at great cost to scalability. An ideal solution would have the accuracy of an optimal method and the scalability of a greedy method. We introduce a family of algorithms called SPLIT (SParse Lookahead for Interpretable Trees) that moves us significantly forward in achieving this ideal balance. We demonstrate that not all sub-problems need to be solved to optimality to find high quality trees; greediness suffices near the leaves. Since each depth adds an exponential number of possible trees, this change makes our algorithms orders of magnitude faster than existing optimal methods, with negligible loss in performance. We extend this algorithm to allow scalable computation of sets of near-optimal trees (i.e., the Rashomon set).
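A minimal Python sketch of the core idea, exhaustive split search near the root with greedy completion toward the leaves, is given below. It uses toy binary features and Gini impurity and is only an illustration of the general strategy, not the authors' SPLIT implementation.

```python
# Illustrative sketch: "lookahead near the root, greedy near the leaves".
# Assumes binary features X in {0,1} and binary labels y; not the SPLIT code.
import numpy as np

def gini(y):
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2 * p * (1 - p)

def greedy_split(X, y):
    """Return the feature whose split most reduces weighted Gini impurity."""
    best_f, best_imp = None, gini(y)
    for f in range(X.shape[1]):
        left, right = y[X[:, f] == 0], y[X[:, f] == 1]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / max(len(y), 1)
        if imp < best_imp - 1e-12:
            best_f, best_imp = f, imp
    return best_f

def greedy_tree_impurity(X, y, depth):
    """Total impurity of the subtree a purely greedy builder would produce."""
    f = greedy_split(X, y) if depth > 0 else None
    if f is None:
        return len(y) * gini(y)
    l, r = X[:, f] == 0, X[:, f] == 1
    return (greedy_tree_impurity(X[l], y[l], depth - 1)
            + greedy_tree_impurity(X[r], y[r], depth - 1))

def build(X, y, depth, lookahead):
    """Exhaustive split search while lookahead > 0, scoring each candidate by
    the impurity of a greedy completion; plain greedy splitting afterwards."""
    if depth == 0 or len(set(y)) <= 1:
        return {"leaf": int(round(y.mean()))} if len(y) else {"leaf": 0}
    if lookahead > 0:
        scores = []
        for f in range(X.shape[1]):
            l, r = X[:, f] == 0, X[:, f] == 1
            scores.append(greedy_tree_impurity(X[l], y[l], depth - 1)
                          + greedy_tree_impurity(X[r], y[r], depth - 1))
        f = int(np.argmin(scores))
    else:
        f = greedy_split(X, y)
        if f is None:
            return {"leaf": int(round(y.mean()))}
    l, r = X[:, f] == 0, X[:, f] == 1
    return {"feature": f,
            "left": build(X[l], y[l], depth - 1, lookahead - 1),
            "right": build(X[r], y[r], depth - 1, lookahead - 1)}

# Example: depth-4 tree, exhaustive choice only for the top two levels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 8))
y = (X[:, 0] ^ X[:, 3]).astype(int)   # XOR target defeats purely greedy search
tree = build(X, y, depth=4, lookahead=2)
```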
Can AI Model the Complexities of Moral Decision-Making?
Vijay Keswani (Duke)*, Vincent Conitzer (CMU), Cyrus Cousins (Duke), Hoda Heidari (CMU), Breanna K. Nguyen (Yale), Jana Schaich Borg (Duke), Walter Sinnott-Armstrong (Duke)
A growing body of work in Ethical AI attempts to capture human moral judgments as a way of building human-centered AI systems whose behavior is aligned with user values. The key question in our work is whether currently popular AI modeling approaches capture the critical nuances of human moral decision-making. We focus on the use case of kidney allocation and study people's moral reasoning to assess how well it aligns with AI models in this decision setting. Qualitative data was collected from 20 interviews where participants explained their rationale for their judgments about who should receive a kidney in hypothetical comparisons of kidney patients. Quantitative data was collected from more than 350 participants using surveys that presented participants with several kidney allocation scenarios across many days. We observe that our participants: (a) value patients' morally relevant attributes to different degrees; (b) use diverse moral reasoning, citing heuristics to reduce decision complexity; (c) change their moral judgment for the same scenario presented at different times, especially for difficult or complex scenarios; and (d) express enthusiasm along with concern regarding AI assisting humans in kidney allocation decisions. Based on these observations, we find that AI moral decision models are limited in many ways. Human moral decision-making is nuanced, non-linear, and dynamic. In contrast, current AI approaches to capture moral decision-making assume stable and static preferences, often modeled using misrepresentative computational classes, e.g., linear models. We also find that off-the-shelf AI models trained on participants' moral judgments achieve high average predictive accuracy (>85%) but perform relatively worse for scenarios participants found difficult or took longer to respond to. Overall, our findings highlight specific challenges of computationally modeling moral judgments as a stand-in for human input, suggesting several future research pathways and providing tangible directions to embed moral values and preferences in AI systems.
Layer-Wise Evolution of Representations in Fine-Tuned Transformers: Insights from Sparse AutoEncoders
Suneel Nadipalli (Duke)*
Fine-tuning pre-trained transformers has long been a powerful technique for enhancing the performance of base models on specific tasks. From early applications in models like BERT to now fine-tuning Large Language Models (LLMs) such as GPT, this approach has been instrumental in adapting general-purpose architectures for specialized downstream tasks. Understanding the fine-tuning process is crucial for uncovering how transformers adapt to specific objectives, retain general representations, and acquire task-specific features. This paper explores the underlying mechanisms of fine-tuning, specifically in the BERT transformer, by analyzing activation similarity, training Sparse AutoEncoders (SAEs), and visualizing token-level activations across different layers. Based on experiments conducted across multiple datasets and BERT layers, we observe a steady progression in how features adapt to the task at hand: early layers primarily retain general representations, middle layers act as a transition between general and task-specific features, and later layers fully specialize in task adaptation. These findings provide key insights into the inner workings of fine-tuning and its impact on representation learning within transformer architectures.
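As a minimal illustration of the analysis setup (not the paper's code), the sketch below extracts one BERT layer's token activations with Hugging Face transformers and trains a small sparse autoencoder with an L1 penalty; the checkpoint, layer index, and sparsity weight are placeholder choices.

```python
# Minimal sketch: train a sparse autoencoder on the token activations of one
# BERT layer. Layer 6 and the 1e-3 sparsity weight are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

texts = ["The movie was a delight.", "The service was painfully slow."]
with torch.no_grad():
    batch = tok(texts, return_tensors="pt", padding=True)
    hidden = bert(**batch).hidden_states[6]        # activations after layer 6
acts = hidden.reshape(-1, hidden.shape[-1])        # (num_tokens, 768)

class SparseAE(torch.nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = torch.nn.Linear(d_in, d_hidden)
        self.dec = torch.nn.Linear(d_hidden, d_in)
    def forward(self, x):
        z = torch.relu(self.enc(x))                # non-negative sparse codes
        return self.dec(z), z

sae = SparseAE(acts.shape[-1], 4 * acts.shape[-1])
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for step in range(200):
    recon, z = sae(acts)
    loss = torch.nn.functional.mse_loss(recon, acts) + 1e-3 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```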
The Duke Humanoid: an open-source platform For Energy Efficient Bipedal Locomotion Using Passive Dynamics
Boxi Xia (Duke)*
To make humanoid locomotion research accessible, we developed the Duke Humanoid as an open-source bipedal locomotion platform. Our design mimics human physiology, with minimized leg distances and symmetrical body alignment in the frontal plane to maintain static balance with straight knees. We have developed a deep reinforcement learning policy that can be deployed directly on our hardware for walking tasks without fine-tuning. Additionally, to enhance energy efficiency in locomotion, we propose an end-to-end reinforcement learning algorithm that encourages the robot to leverage passive dynamics. Our experiment results show that our passive policy reduces the cost of transport by up to 50% in simulation and 31% in real-world testing. Our website is http://generalroboticslab.com/DukeHumanoidv1/. We hope our open-source hardware and software can lower the barrier to entry for humanoid locomotion research.
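For readers unfamiliar with how an energy objective enters such training, the sketch below shows one common way a cost-of-transport-style penalty can be folded into a locomotion reward; the weights and quantities are placeholders and this is not the authors' reward formulation.

```python
# Illustrative energy-aware locomotion reward (not the authors' implementation).
# Joint torques, joint velocities, and base velocity would come from the
# simulator; all weights here are placeholder values.
import numpy as np

def locomotion_reward(base_vel_x, target_vel_x, joint_torques, joint_vels,
                      mass=35.0, g=9.81, w_track=1.0, w_energy=0.05):
    # Velocity tracking: peak reward when the commanded speed is matched.
    track = np.exp(-4.0 * (base_vel_x - target_vel_x) ** 2)
    # Mechanical power |tau * qdot| summed over joints, normalized the way
    # cost of transport is (power / (m * g * v)), so slower gaits are not
    # trivially rewarded for using less energy.
    power = np.sum(np.abs(joint_torques * joint_vels))
    cot_like = power / (mass * g * max(abs(base_vel_x), 0.1))
    return w_track * track - w_energy * cot_like

r = locomotion_reward(base_vel_x=0.45, target_vel_x=0.5,
                      joint_torques=np.array([6.0, -3.5, 1.2, -0.8]),
                      joint_vels=np.array([1.1, -2.0, 0.4, -0.3]))
```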
Creating a Responsible AI Dataset: Tracing Bias, Misinformation, and Source Influence
Anastasiia Saenko (Duke)*
Large language models (LLMs) increasingly shape public discourse, yet their tendency to amplify misinformation and biases remains a critical challenge. This project introduces a comprehensive, metadata-enriched dataset designed to trace the origins of AI-generated content, identify biases, and assess misinformation propagation. By integrating diverse real-world sources—including verified news, social media, and misinformation repositories—this dataset facilitates the study of AI behavior across public health, political discourse, conflict narratives, and social issues. Our methodology combines data collection, enrichment, and analytical pipelines, incorporating misinformation labels, political bias indicators, sentiment analysis, and demographic markers. This structured approach enhances AI transparency and explainability, revealing how models rely on and transform specific sources. Additionally, we demonstrate how this dataset can inform bias mitigation strategies, prompt engineering frameworks, and Responsible AI benchmarking. This work has practical and ethical significance in: 1) Detecting and mitigating biases in AI-generated content, 2) Enhancing transparency by tracing how AI models utilize and amplify information from specific sources, 3) Advancing Responsible AI practices with real-world applications in public health, political discourse, and media integrity. The primary contribution of this project is a robust dataset, complemented by analytical pipelines and prompt engineering frameworks that deepen our understanding of AI behavior and societal alignment. This research holds practical implications for media integrity, political transparency, and AI governance, offering tools for researchers and policymakers to assess and regulate the societal impact of AI-generated content.
Bot or Not? Identity Biases in Perceptions of Online Account Credibility
Ben Rochford (Duke)*, Cathy Lee (Duke), Michelle Qiu (Duke), Alexander Volfovsky (Duke), D. Sunshine Hillygus (Duke), Christopher Bail (Duke)
A great deal of public discourse now occurs online, where platform design and the proliferation of AI language models create unique, technology-mediated settings for social interaction. These digital contexts fundamentally shape how people communicate, disrupting established norms of trust and credibility. Through a survey experiment using realistic social media conversation snippets—containing both human participants and responses from a pre-tested language model prompt persona—we examine how social identity displays influence judgments about whether social media accounts are real humans or automated bots. While existing literature suggests that out-group bias drives bot detection on social media, our findings reveal unexpected patterns. White Republican men displayed traditional out-group bias, being significantly more likely to label Black Democrat women as bots (37%) compared to profiles sharing their own identity characteristics (21%). However, other demographic groups showed a striking reverse pattern: people more frequently labeled as bots the accounts that displayed identity characteristics matching their own. Black Democrat respondents were 1.4 times more likely to judge profiles displaying Black Democrat characteristics as bots compared to white Republican profiles, with Black Democrat women showing an even stronger effect—these respondents were twice as likely to judge profiles sharing their identity characteristics as bots compared to white Republican men's profiles. We find that knowledge of AI systems showed no relationship with bot detection accuracy, which remained at 41% regardless of measured AI familiarity. As online platforms increasingly blend human and AI-mediated interactions, and as language models are integrated into traditionally human social spaces, these findings have important implications for understanding how people navigate trust and verify online identities. This is particularly consequential for groups who have historically faced greater risks from misrepresentation and impersonation.
Poster Presentations
Safety-Certification AI for Autonomous Systems
Amy Strong (Duke)*
In simulated environments, reinforcement learning and imitation learning have achieved great success in controlling complex, dynamical systems to a desired behavior using only sampled data and/or interactions with the system. These AI control methods can achieve performance in model-free settings unlike many traditional control methods. However, real-world autonomous systems must adhere to constraints to ensure safe performance, and AI control is often only capable of providing probabilistic guarantees that constraints will be obeyed. So far, the cutting-edge of constrained AI control typically offers "approximate" safety, which is inadequate in real-world systems. In contrast, our work leverages invariant set theory to provide true, model-free safety certificates for autonomous systems. Our approach learns a safety certificate function from one-step trajectory samples of the autonomous system -- establishing the region for which an autonomous system will remain safe in perpetuity. First, we determine an algorithm to intelligently sample the state space of an autonomous system -- iteratively pruning regions of the state space that will violate constraints. Then, we use the sampled trajectories to learn a Lyapunov function -- demonstrating the asymptotic stability of the system in the safe region. Importantly, we require no prior knowledge about the autonomous system beyond Lipschitz continuity, which itself can be determined directly from data. The result is a certified safe region of the state space, enabling theoretical safety guarantees for AI controlled autonomous systems and laying the foundation for synthesis of safe AI controllers in the future.
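The sketch below illustrates the general flavor of learning a certificate from one-step samples: a candidate Lyapunov-like function is fit on (x, x_next) pairs from a stable toy system with penalties encoding positivity and decrease. It omits the intelligent sampling, pruning, and formal guarantees described above and is not the authors' algorithm.

```python
# Sketch only: fit a candidate Lyapunov-like certificate V(x) from one-step
# samples of a stable toy linear system. The hinge penalties encode V > 0 away
# from the origin and V(x_next) < V(x); the paper's method adds the sampling
# and verification machinery this sketch omits.
import torch

torch.manual_seed(0)
A = torch.tensor([[0.9, 0.2], [-0.1, 0.8]])        # stable toy dynamics
x = 4 * torch.rand(2000, 2) - 2                    # one-step trajectory samples
x_next = x @ A.T

V = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, 1))
opt = torch.optim.Adam(V.parameters(), lr=1e-3)

for step in range(2000):
    v, v_next = V(x), V(x_next)
    norm = x.norm(dim=1, keepdim=True)
    pos_loss = torch.relu(0.1 * norm - v).mean()            # V positive away from 0
    dec_loss = torch.relu(v_next - v + 0.01 * norm).mean()  # V decreases along samples
    zero_loss = V(torch.zeros(1, 2)).pow(2).mean()          # V(0) ~ 0
    loss = pos_loss + dec_loss + zero_loss
    opt.zero_grad(); loss.backward(); opt.step()

# A sublevel set {x : V(x) <= c} on which both penalties vanish is a candidate
# invariant "safe" region for further verification.
```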
Emergent effects of scaling on the functional hierarchies within large language models
Paul Bogdan (Duke)*
Large language model (LLM) architectures are often described as functionally hierarchical: Early layers process syntax, middle layers begin to parse semantics, and late layers integrate information. The present work revisits these ideas. This research submits simple texts to an LLM (e.g., "A church and organ") and extracts the resulting activations. Then, for each layer, support vector machines and ridge regressions are fit to predict a text's label and thus examine whether a given layer encodes some information. Analyses using a small model (Llama-3.2-3b; 28 layers) partly bolster the common hierarchical perspective: Item-level semantics are most strongly represented early (layers 2-7), then two-item relations (layers 8-12), and then four-item analogies (layers 10-15). Afterward, the representation of items and simple relations gradually decreases in deeper layers that focus on more global information. However, several findings run counter to a steady hierarchy view: First, although deep layers can represent document-wide abstractions, deep layers also compress information from early portions of the context window without meaningful abstraction. Second, when examining a larger model (Llama-3.3-70b-Instruct), stark fluctuations in abstraction level appear: As depth increases, two-item relations and four-item analogies initially increase in their representation, then markedly decrease, and afterward increase again momentarily. This peculiar pattern consistently emerges across several experiments. Third, another emergent effect of scaling is coordination between the attention mechanisms of adjacent layers. Across multiple experiments using the larger model, adjacent layers fluctuate between what information they each specialize in representing. In sum, an abstraction hierarchy often manifests across layers, but large models also deviate from this structure in curious ways. Emergent phenomena such as this challenge attempts to interpret and steer LLMs toward desirable and safe goals.
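A minimal sketch of the layer-wise probing setup is shown below: per-layer hidden states are mean-pooled and a linear classifier is fit for each layer. It uses a small open checkpoint ("gpt2") and toy labels purely as stand-ins for the Llama models and item-level semantic labels studied in the abstract.

```python
# Minimal probing sketch (not the paper's pipeline): one linear classifier per
# layer on mean-pooled hidden states, compared by cross-validated accuracy.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

texts = ["a church and organ", "a violin and cello",
         "a hammer and nail", "a saw and plank"] * 10
labels = [0, 0, 1, 1] * 10                        # toy semantic categories

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

with torch.no_grad():
    batch = tok(texts, return_tensors="pt", padding=True)
    hidden_states = model(**batch).hidden_states  # tuple: embeddings + each layer

mask = batch["attention_mask"].unsqueeze(-1)
for layer, h in enumerate(hidden_states):
    pooled = (h * mask).sum(1) / mask.sum(1)      # mean over real (non-pad) tokens
    acc = cross_val_score(LinearSVC(), pooled.numpy(), labels, cv=5).mean()
    print(f"layer {layer:2d}: probe accuracy {acc:.2f}")
```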
"My Therapist is a Robot & That's Quite Alright": Understanding the Security & Privacy Perspectives of Users of Large Language Models for Mental Health
Jabari Kwesi (Duke)*
Individuals are increasingly relying on large language model (LLM)-enabled conversational agents for emotional support. While prior research has examined privacy and security issues in chatbots specifically designed for mental health purposes, these chatbots are overwhelmingly "rule-based" offerings that do not leverage generative AI. Little empirical research currently measures users' privacy and security concerns, attitudes, and expectations when using general-purpose LLM-enabled chatbots to manage and improve mental health. Through 21 semi-structured interviews with U.S. participants, we identified critical misconceptions and a general lack of risk awareness. Participants conflated the human-like empathy exhibited by LLMs with human-like accountability and mistakenly believed that their interactions with these chatbots were safeguarded by the same regulations (e.g., HIPAA) as disclosures with a licensed therapist. We introduce the concept of "intangible vulnerability," where emotional or psychological disclosures are undervalued compared to more tangible forms of information (e.g., financial or location-based data). This undervaluing stems from an inability to envision immediate and tangible harms tied to personal traumas or day-to-day distress, leaving these disclosures comparatively less protected, despite the powerful insights they offer into a user's mental state. Addressing intangible vulnerability thus requires reframing mental health data as equally vulnerable to misuse, even when it lacks the obvious exploit pathways of credit card numbers or addresses. In this direction, we propose recommendations to safeguard user mental health disclosures with general-purpose LLM-enabled chatbots more effectively: these include contextual nudges & just-in-time warnings, strong default protections & ephemeral storage, and targeted oversight & audits.
Deep learning-based phase aberration correction with paired microbubble data
Murong Yi (Duke)*
Transcranial ultrasound imaging is a powerful brain imaging modality with diverse medical applications. However, the complex structure of the skull and its significant acoustic impedance mismatch with intracerebral tissue introduce phase aberrations that distort image reconstruction based on homogeneous sound speed assumptions. FMM-PAC[1][2] is widely adopted, but it relies heavily on the accuracy of the speed-of-sound (SoS) map and employs a simplified wave propagation model that may not capture the complexities of real-world scenarios. Recent advancements in deep learning have enabled neural networks to directly infer aberration corrections from distorted data without requiring an SoS map. However, DL-PAC methods are highly data-driven. Prominent studies, such as CV-CNN[3] and MAIN-AAA[4], primarily rely on purely simulated datasets for training. Although these methods have demonstrated promising performance in in vivo studies, the reliance on simulated aberrations raises concerns about their applicability to more complex, real-world conditions. In this study, we propose a novel training framework based on paired microbubble data to address these challenges. Our methodology involves generating training data by experimentally obtaining point spread functions (PSFs) from imaging studies of a real mouse skull. Specifically, microbubbles were diluted and scanned in both pure water and under a mouse skull. Aberrated and clear localized microbubbles were cropped and placed into a pure water noisy scan background to compose paired training samples. The lightweight network architecture consists of several residual convolutional blocks, and adversarial training was employed to account for PSF variations across different locations. A discriminator was used in conjunction with the network to enhance its robustness to spatial PSF discrepancies. Experimental results on intact skull imaging demonstrate that networks trained under this strategy maintain robust performance when transferred to in vivo scans. For ultrasound localization microscopy (ULM) reconstruction[5], aberration-corrected in vivo images achieved a significantly higher true positive localization detection rate, yielding 3.63 times more tracks and 1.55 times longer tracks under identical reconstruction settings. These findings highlight the potential of this deep learning training strategy as a transformative approach for achieving clearer and more accurate transcranial ultrasound imaging.
ProtoEEGNet: An Interpretable Approach for Detecting Interictal Epileptiform Discharges
Dennis Tang (Duke)*
In electroencephalogram (EEG) recordings, the presence of interictal epileptiform discharges (IEDs) serves as a critical biomarker for seizures or seizure-like events. Detecting IEDs can be difficult; even highly trained experts disagree on the same sample. As a result, specialists have turned to machine-learning models for assistance. However, many existing models are black boxes and do not provide any human-interpretable reasoning for their decisions. In high-stakes medical applications, it is critical to have interpretable models so that experts can validate the reasoning of the model before making important diagnoses. We introduce ProtoEEGNet, a model that achieves state-of-the-art accuracy for IED detection while additionally providing an interpretable justification for its classifications. Specifically, it can reason that one EEG looks similar to another “prototypical” EEG that is known to contain an IED. ProtoEEGNet can therefore help medical professionals effectively detect IEDs while maintaining a transparent decision-making process.
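The case-based reasoning can be illustrated with a small prototype head in the style of the prototype-network literature (an illustrative sketch, not the ProtoEEGNet architecture itself): similarity to learned prototype vectors drives the classification and doubles as the explanation.

```python
# Sketch of the prototype-similarity idea behind case-based models such as
# ProtoEEGNet: classification is driven by how close an input's latent
# features are to learned prototypes, and those similarities are shown to users.
import torch

n_prototypes, d_latent, n_classes, eps = 10, 64, 2, 1e-4

class PrototypeHead(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(n_prototypes, d_latent))
        self.classifier = torch.nn.Linear(n_prototypes, n_classes, bias=False)
    def forward(self, z):                       # z: (batch, d_latent) embeddings
        d = torch.cdist(z, self.prototypes) ** 2        # squared L2 distances
        sim = torch.log((d + 1) / (d + eps))            # large when d is small
        return self.classifier(sim), sim

head = PrototypeHead()
z = torch.randn(8, d_latent)                    # stand-in for an EEG encoder output
logits, similarities = head(z)
# similarities[i] says how strongly sample i activates each "prototypical" EEG,
# which is the human-readable part of the explanation.
```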
Medical Question Answering Using GraphRAG
Bob Zhang (Duke)*, Keon Nartey (Duke), Murphy Liu (Duke), Mahmoud Alwakeel (Duke), Suim Park (Duke)
The rapid advancement of Large Language Models (LLMs) has led to their widespread adoption in various applications. However, their reliance on static training data and susceptibility to hallucinations limit their effectiveness in high-risk domains such as medicine. This study proposes GraphRAG, a novel hybrid retrieval-augmented generation (RAG) framework that integrates structured knowledge graphs with traditional vector-based retrieval to enhance the reliability and interpretability of LLM outputs. We construct a Neo4j-based knowledge graph from domain-specific medical literature, enabling explicit entity-relationship mapping for more precise contextual retrieval. Experimental evaluation on five expert-designed medical queries demonstrates that GraphRAG significantly outperforms traditional vector-based RAG in accuracy, semantic relevance, and factual consistency, reducing hallucination rates and improving retrieval coherence. Our hybrid approach achieves an average improvement of 12.5% in factual accuracy scores over vector-only methods, reinforcing the advantage of structured graph retrieval. Furthermore, we show that graph-based retrieval enhances query interpretability, allowing users to trace entity relationships and dynamically refine knowledge representations through a data visualization interface. By improving the transparency and controllability of LLM-driven responses, GraphRAG offers a scalable solution for AI deployment in critical decision-making fields. Future work will focus on dynamic graph updates, multi-hop reasoning, and expert-in-the-loop validation to further enhance its applicability. This research contributes to the development of more explainable and reliable AI systems, particularly in high-stakes industries such as healthcare and scientific research.
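A sketch of the hybrid retrieval step is shown below. The Cypher query, node labels, connection details, and embedding model are illustrative assumptions rather than the authors' Neo4j schema.

```python
# Hybrid retrieval sketch in the spirit of the described GraphRAG pipeline.
# Node labels ("Entity", "Passage"), the relationship type, credentials, and
# model names are illustrative assumptions, not the authors' schema.
import numpy as np
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def graph_context(entity_name, limit=5):
    """Pull passages linked to an entity through the knowledge graph."""
    query = (
        "MATCH (e:Entity {name: $name})-[:MENTIONED_IN]->(p:Passage) "
        "RETURN p.text AS text LIMIT $limit"
    )
    with driver.session() as session:
        return [r["text"] for r in session.run(query, name=entity_name, limit=limit)]

def vector_context(question, passages, k=3):
    """Classic dense retrieval over the same passage store."""
    q = embedder.encode([question])[0]
    P = embedder.encode(passages)
    scores = P @ q / (np.linalg.norm(P, axis=1) * np.linalg.norm(q) + 1e-9)
    return [passages[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(question, entity, passages):
    """Merge graph-derived and vector-derived context before calling the LLM."""
    context = "\n".join(graph_context(entity) + vector_context(question, passages))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```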
PABE – Towards Robust Evaluation of Fairness in Chatbots
Choonghwan Lee (Duke)*
The global chatbot market size was estimated to be USD 7.76 billion in 2024 and is projected to grow at a 23.3% CAGR. This includes applications in high-leverage domains such as medical consultation, legal jurisdiction, and financial advice. While we do not want our chatbots to propagate bias against specific demographics, current methods for evaluating fairness in pre-trained LLMs are not well suited to evaluating realistic chatbot interactions. Most notably, they require the explicit presence of protected attributes in user queries. However, users rarely provide explicit demographic information during real-world chatbot conversations. In reality, demographic information is often implicitly embedded in a user's queries through their speech style. A robust fairness evaluation should ensure that chatbot responses remain both stylistically and factually consistent across various speech styles. To this end, we propose Persona-Aware Bias Evaluation (PABE), which aims to provide a solution to this problem. Specifically, we will compare various text style transfer approaches on their ability to implicitly embed protected attributes, using human assessments as an evaluation framework. In addition, we propose methods to evaluate fairness using PABE and suggest various approaches to mitigate identified bias. While the paper will focus on applications that evaluate racial biases in mental health consultation, PABE's strength lies in its generalizability to any demographic group and chatbot domain. We hope that PABE offers a robust yet easily replicable method to evaluate bias in industry chatbot applications.
A Clinically-Grounded Taxonomy for Assessing AI Risks to People with Eating Disorders
Amy Winecoff (Center for Democracy & Technology)*, Kevin Klyman (Center for Democracy & Technology)
Eating disorders are serious mental health conditions with among the highest mortality rates of any psychiatric illness. While evidence suggests that AI systems can pose serious risks to individuals vulnerable to or experiencing eating disorders, current methods for assessing these risks often lack grounding in clinical understanding of how eating disorders develop and progress. Through semi-structured interviews with clinical experts and exploratory testing of publicly available AI systems, we seek to develop a clinically-informed taxonomy of potential AI-related harms to people with eating disorders. We identify six key categories of harmful interactions: generalized guidance on diet and exercise that fails to account for individual vulnerabilities, AI-generated content that promotes unhealthy social comparison and "thinspiration," support for concealing symptoms and dangerous behaviors, interactions that maintain or amplify negative affect, triggers for hyperfocus on the body, and narrow or stereotypical depictions of eating disorders. Our initial findings suggest that AI systems can deliver harmful content in ways that are more personalized and seem more authoritative than traditional sources, potentially increasing AI’s impact on vulnerable individuals. To avoid repeating the reactive approach taken by social media platforms in addressing similar harms, we argue for proactively identifying and addressing these risks during AI system development and deployment, particularly for systems likely to interact with vulnerable users.
Ensuring Responsible AI in Pedagogical Design
Hannah Rogers (Duke)*, Maria Kunath (Duke), Catherine Lee-Cates (Duke), Michael Hudson (Duke), Grey Reavis (Duke)
How do we effectively and responsibly integrate AI tools into learning experiences? We will share how our team of learning designers developed and implemented a platform-agnostic, values-centered evaluation framework to assess AI tools for pedagogy. This approach involves 1) describing core values that reflect the mission of our team, department, and institution; 2) highlighting the alignment (or lack thereof) of an AI tool with those values as well as intended learning outcomes; 3) testing designs collaboratively with faculty; 4) collaborating with vendors to adjust the tool to meet responsible AI standards; and 5) collecting learner feedback to help us make final decisions on tool usage. Through our pilot process, we have promoted trustworthy and transparent uses of pedagogy-focused AI. Based on early data collection from learners of our first pilot program, our work to establish trust with learners has paid off. Through qualitative and quantitative feedback, learners have expressed trust in the AI tool. For example, 80% of learners currently agree they would use the tool in a future course. Pedagogical AI tools exist in a complex ecosystem across industry, academia, and learners. As educators continue to explore responsible AI development in this context, it is critical to keep the human learner at the center. By designing and enacting a values-centered evaluation framework and inviting diverse stakeholders—most critically learners—into this effort, our work demonstrates how responsible and ethical AI development in education may be achieved. As AI becomes further entwined with learning, our human-centered and transparent process emphasizes the importance of creating connections with learners and, ultimately, reestablishing trust in education.
Computationally Modeling Human Moral Decision-Making Processes
Cyrus Cousins (Duke)
Recent work in fair artificial intelligence has trended towards incorporating human-centric objectives, with the explicit goal of aligning AI models to societal values. However, both theoretical and applied work in this domain struggles to uniquely motivate any given fairness objective; thus we have a plethora of methods, each of which addresses some fairness concerns, but many of which are mutually incompatible and introduce their own shortcomings. In this work, we take an axiomatic approach to learning and eliciting human moral preferences from pairwise comparisons. We seek to address these conundrums by learning simple models, where model classes are constructed via axiomatic assumptions motivated by neuropsychological analyses. In particular, we seek models which humans can easily understand and audit, and which can easily be incorporated into existing AI and allocation systems as objectives. We define a class of factored models, in which one or more sets of features are considered independently for each alternative, where the factor model is learned, and then the factor model outputs are aggregated via a simple fixed rule, such as Bradley-Terry aggregation or a tallying heuristic. This rigid shepherding of information ensures such models have low cognitive load and are thus realistic and feasible candidates for heuristic human reasoning. In the extreme, we look at feature-independent models, where each factor considers only a single feature, and monotonic models, which assume monotonicity of predictions in each factor, to rigorously and uniquely derive a straightforward and easily interpreted model of human decision making, and we find that human decision-making on real kidney-allocation problems is quite accurately captured by these models.
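A minimal sketch of a factored model with Bradley-Terry aggregation is given below (illustrative toy data, not the paper's learned models): each factor scores one feature group, the factor scores are combined by a fixed additive rule, and the score difference between two alternatives yields a choice probability.

```python
# Sketch of a factored pairwise-choice model with Bradley-Terry aggregation.
# Feature groups, data, and the "ground truth" rule are toy placeholders.
import torch

feature_groups = [(0, 2), (2, 4), (4, 6)]       # 3 groups of 2 features each

class FactoredBradleyTerry(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.factors = torch.nn.ModuleList(
            [torch.nn.Linear(hi - lo, 1) for lo, hi in feature_groups])
    def score(self, x):
        # Each factor sees only its own feature group; scores are then combined
        # by a fixed additive rule, keeping the model easy to audit.
        return sum(f(x[:, lo:hi]) for f, (lo, hi) in zip(self.factors, feature_groups))
    def forward(self, x_a, x_b):
        return torch.sigmoid(self.score(x_a) - self.score(x_b))   # P(A chosen)

model = FactoredBradleyTerry()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x_a, x_b = torch.randn(256, 6), torch.randn(256, 6)
choices = (x_a[:, 0] > x_b[:, 0]).float().unsqueeze(1)            # toy ground truth
for step in range(500):
    p = model(x_a, x_b)
    loss = torch.nn.functional.binary_cross_entropy(p, choices)
    opt.zero_grad(); loss.backward(); opt.step()
```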
Content Curation Architecture for Decentralized Social Networks
Chou Hsuan-Yu (Duke)*, Jinyu Pei (Duke), Xiyuan Song (Duke), Weili Wang (Duke), Xiaowei Yang (Duke)
Decentralized online social networks (OSNs), such as Mastodon, Bluesky, and Pixelfed, gained popularity due to widespread concerns over the transparency, privacy, and safety of traditional centralized OSN platforms and the undue influence of their owners. However, unlike centralized platforms, which have more data and resources to train machine learning models for content curation, decentralized platforms can only rely on much simpler mechanisms for content curation. OSN users are compelled to choose between the problematic content curation of centralized platforms and the naïve curation methods employed by decentralized platforms. We propose a personal content curation architecture for OSNs. Users specify personal preferences to a curation agent, and the agent curates content from OSNs automatically according to personalized rules, empowering users to freely and effectively control their feed. Our architecture allows users to either rely on a trusted third party to run the agent or run the agent locally. We have developed a working prototype of a local agent that, with the aid of large language models (LLMs), moderates feed from Mastodon, demonstrating the technical feasibility of our architecture. We have also evaluated various open-source LLMs for their accuracy and speed in this application, providing valuable insights for future research.
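The core loop of such a personal curation agent can be sketched as follows; the timeline fetch and the LLM call are placeholders for the Mastodon client and the open-source LLMs used in the prototype, and the preference text is invented for illustration.

```python
# Sketch of a local curation agent's core loop (illustrative, not the authors'
# prototype). `fetch_timeline` and `llm` are placeholders for the Mastodon API
# client and a locally hosted open-source LLM.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Post:
    author: str
    text: str

USER_PREFERENCES = """Hide posts that are primarily engagement bait or
unsourced medical claims. Keep everything else, including viewpoints I
disagree with."""

def curate(posts: List[Post], llm: Callable[[str], str]) -> List[Post]:
    kept = []
    for post in posts:
        prompt = (f"User preferences:\n{USER_PREFERENCES}\n\n"
                  f"Post by @{post.author}:\n{post.text}\n\n"
                  "Answer KEEP or HIDE, then give a one-line reason.")
        verdict = llm(prompt).strip().upper()
        if verdict.startswith("KEEP"):
            kept.append(post)
    return kept

# Usage with any callable mapping a prompt string to a completion string:
#   filtered = curate(fetch_timeline(), llm=my_local_model)
```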
Eyes and Ears at Home: Exploring Users' Privacy and Security Concerns Toward Domestic Social Robots
Henry Bell (Duke)*
Domestic social robots, equipped with artificial intelligence (AI) and advanced sensing capabilities, are gaining interest among consumers in the United States. While similar to traditional smart home devices and AI chatbots, social robots' extensive data collection and anthropomorphic features amplify security and privacy risks, including data leakage, unauthorized sharing, and increased disclosure. As these technologies are still in the early stages of commercialization in the U.S. market, it is critical to investigate U.S. users' security and privacy needs and concerns to guide their design. Through 19 semi-structured interviews with U.S. participants, we identified significant security and privacy concerns, highlighting the need for transparency, usability, and robust data protection to support adoption. Users' concerns varied by use case: in educational applications, misinformation and hallucinated responses were a major concern, while for health and medical purposes, users worried most about the reliability of AI. Despite the practical risks, personal information leakage through the AI model memorizing user prompts was not a salient concern for participants. We found that users expect tangible privacy controls, indicators of data collection, and context-appropriate functionality.
Ethics of AI in Healthcare: A Scoping Review Towards a Unifying Framework
Aaron J Gorelik (Washington University in St. Louis), Mengyuan Li (Central South University, China), Jessica Hahne (Washington University in St. Louis), Junyi Wang (Central South University, China), Yongqi Ren (Central South University, China), Lei Yang (Central South University, China), Xin Zhang (Central South University, China), Xing Liu (Central South University, China), Xiaomin Wang (Central South University, China), Ryan Bogdan (Washington University in St. Louis), Brian D Carpenter (Washington University in St. Louis)
Artificial Intelligence (AI) is increasingly being adopted across many industries including healthcare. This has brought forth the development of many new independent ethical frameworks for responsible use of AI within institutions and companies. These heterogeneous guidelines can be redundant and/or conflicting and can lead to inconsistent ethical considerations and confusion. Risks associated with the application of AI in healthcare may have high stakes for patients, which inherently brings nuanced ethical considerations with broad implications. Here, we examined whether 4 established biomedical ethical principles (i.e., Beneficence, Non-Maleficence, Respect for Autonomy, and Justice) can be used as a framework for AI ethics in healthcare. To this end, we conducted a scoping review of 227 peer-reviewed papers using semi-inductive thematic analyses to categorize patient-related ethical issues in healthcare AI under these 4 principles of biomedical ethics. We found that these principles were consistently applicable to ethical considerations concerning healthcare AI across countries. These results suggest that the four principles of biomedical ethics can be generalized to AI in healthcare and provide a unifying framework from which ethical issues can be classified and evaluated for AI governance and regulation in healthcare.
From BERT to DistilBERT: Changes in Explainability and Decision Logic
Osama Ahmed (Duke)*, Lorna Aine (NVIDIA)
We investigate the impact of model distillation on the explanations of deep neural networks across various natural language processing tasks: sentiment classification, machine translation, and named entity recognition, using BERT and its distilled counterpart, DistilBERT. While model distillation improves computational efficiency by transferring knowledge from a larger model to a smaller one, this process introduces qualitative changes in the models’ explanations that are not well understood. Leveraging SHAP (SHapley Additive exPlanations), we compare feature attributions between BERT and DistilBERT. We examine how the distillation process alters decision-making logic, particularly in cases where small accuracy losses occur. Our results suggest that while both models prioritize similar key features, DistilBERT assigns higher magnitudes of importance to salient features, a shift that could raise concerns about generalization and fairness in high-stakes applications. This study underscores the importance of balancing efficiency with comprehensive representations when deploying compressed models. We advocate for further research into the trade-offs of model compression techniques.
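A sketch of the comparison pipeline appears below. The two sentiment checkpoints are publicly available Hub models used as stand-ins for the task-specific BERT and DistilBERT models examined in the study, and the SHAP usage follows the library's standard text-pipeline pattern.

```python
# Illustrative sketch: compare SHAP token attributions from a BERT-based and a
# DistilBERT-based sentiment classifier. Checkpoints are public stand-ins.
import shap
from transformers import pipeline

text = "The plot was thin, but the performances were genuinely moving."

models = {
    "bert": "textattack/bert-base-uncased-SST-2",
    "distilbert": "distilbert-base-uncased-finetuned-sst-2-english",
}

for name, checkpoint in models.items():
    clf = pipeline("text-classification", model=checkpoint, top_k=None)
    explainer = shap.Explainer(clf)             # text masker inferred from the pipeline
    sv = explainer([text])
    tokens, values = sv[0].data, sv[0].values   # (tokens,), (tokens, classes)
    ranked = sorted(zip(tokens, values[:, 1]), key=lambda t: -abs(t[1]))
    print(name, [(tok, round(float(v), 3)) for tok, v in ranked[:5]])
```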
Explainable Deep Learning for Pneumonia Detection in Medical Imaging
Luopeiwen Yi (Duke)*
The application of deep learning to medical imaging has enabled significant advances in diagnosing conditions like pneumonia. However, while metrics such as test accuracy may suggest strong performance, the "black-box" nature of these models often obscures critical flaws in their decision-making processes. This research investigates the use of Gradient-weighted Class Activation Mapping (Grad-CAM) to explain a pre-trained ResNet50 model's predictions when classifying chest X-rays as healthy or pneumonia-affected. Using the Labeled Optical Coherence Tomography (OCT) and Chest X-ray dataset, the ResNet50 model was fine-tuned for binary classification, achieving a test accuracy of 77.40%. Grad-CAM was applied to generate feature importance heatmaps, which were statistically analyzed to compare attention between the two classes. Pneumonia images consistently received higher feature importance scores, confirmed through a t-test (p < 0.0001, Cohen's d = 1.3985), indicating a statistically significant and substantial difference in how the model interprets the two classes. However, the Grad-CAM visualizations revealed a crucial concern: despite the reasonable accuracy, the model frequently misfocused on irrelevant regions such as the diaphragm and blank spaces outside the body rather than the lungs, where pneumonia-specific features reside, highlighting a limitation of freezing convolutional layers during transfer learning. This study emphasizes the importance of explainability in AI, demonstrating how visual tools like Grad-CAM can uncover hidden issues that accuracy metrics alone fail to capture. Addressing such challenges is vital for developing AI systems that are both accurate and reliable, fostering trust in clinical decision-making and ensuring safe integration into healthcare workflows.
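For reference, a minimal Grad-CAM computation for a ResNet50 binary classifier is sketched below; a random input and an untrained final layer stand in for the study's fine-tuned model and preprocessed X-rays.

```python
# Minimal Grad-CAM sketch for a ResNet50 binary classifier (illustrative; the
# study's fine-tuned weights and preprocessing are not reproduced here).
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Linear(model.fc.in_features, 2)     # healthy vs. pneumonia
model.eval()

cache = {}
def fwd_hook(module, inputs, output):
    cache["acts"] = output                               # (1, 2048, 7, 7)
    output.register_hook(lambda grad: cache.__setitem__("grads", grad))
model.layer4.register_forward_hook(fwd_hook)

x = torch.randn(1, 3, 224, 224)                          # stand-in preprocessed X-ray
model(x)[0, 1].backward()                                # gradient of the "pneumonia" logit

weights = cache["grads"].mean(dim=(2, 3), keepdim=True)  # global average of gradients
cam = torch.relu((weights * cache["acts"]).sum(dim=1))   # (1, 7, 7) activation map
cam = torch.nn.functional.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                                      mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8) # heatmap in [0, 1]
# High values outside the lung fields (e.g., over the diaphragm) are the kind
# of misfocus the abstract describes.
```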
Reliability of Zero-Shot Prompted Large Language Models in Full-Text Screening for Literature Reviews
Bill Chen (Duke)*
Background: The exponential growth of scientific publications poses significant challenges for researchers to consume literature and conduct systematic reviews, including increased time, effort, and cost. Large Language Models (LLMs) offer a potential solution, and their integration into review workflows is gaining attention [1][2]. Objective: This study evaluates the full-text screening capabilities of LLMs using data from an ongoing review, specifically focusing on their performance with zero-shot prompting which doesn’t prime the models for a specific task [3]. Methods: We randomly sampled 20 included and 20 excluded (with diverse exclusion reasons) articles from our ongoing review which had consensus decisions from two human reviewers. Three LLMs (ChatGPT-4o, Claude Haiku, and LLaMA 3.2) were tasked with deciding whether an article should be included, based on a prompt replicating the inclusion criteria provided to human reviewers. Preliminary Results: ChatGPT achieved the highest performance (F1-score=0.73, Cohen’s Kappa=0.30), followed by Claude (F1-score=0.68, Kappa=0.10), and LLaMA (F1-score=0.60, Kappa=-0.05). ChatGPT correctly excluded 7/20 papers, accurately identifying criteria like “sample size too small” or “not an original research article”. However, it appears to struggle with nuanced criteria such as “not human subject data” or “incorrect primary aim”. Despite their speed (ChatGPT response time 2.72 seconds), LLMs are still far from achieving the average inter-rater performance of human reviewers (Kappa=0.76) observed in the first week of full-text review. Implications for Responsible AI: While LLMs are being integrated into literature review workflows [4], zero-shot prompting demonstrated low agreement with human reviewers. Researchers must critically evaluate AI outputs to avoid compromising review integrity. Potential Impact: As LLM performance improves, guidelines will be essential to establish best practices and determine appropriate use cases. Premature deployment risks propagating inaccurate knowledge, potentially leading to significant harm across society. [1] Thomas, B. S. J. a. a. P. (n.d.). Publications output: U.S. Trends and International Comparisons | NSF - National Science Foundation. https://ncses.nsf.gov/pubs/nsb202333/publication-output-by-region-country-or-economy-and-by-scientific-field [2] Bornmann, L., Haunschild, R., & Mutz, R. (2021). Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanities and Social Sciences Communications, 8(1). https://doi.org/10.1057/s41599-021-00903-w [3] Sivarajkumar, S., Kelley, M., Samolyk-Mazzanti, A., Visweswaran, S., & Wang, Y. (2024). An Empirical evaluation of prompting strategies for large language models in Zero-Shot clinical natural language processing: Algorithm Development and Validation study. JMIR Medical Informatics, 12, e55318. https://doi.org/10.2196/55318 [4] Luo, X., Chen, F., Zhu, D., Wang, L., Wang, Z., Liu, H., Lyu, M., Wang, Y., Wang, Q., & Chen, Y. (2024). Potential roles of large language models in the production of Systematic Reviews and Meta-Analyses. Journal of Medical Internet Research, 26, e56780. https://doi.org/10.2196/56780
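The agreement metrics reported above can be computed with scikit-learn as sketched below; the decision vectors are invented placeholders, not the study's screening data.

```python
# Sketch of the agreement metrics used above (F1 and Cohen's kappa), with
# made-up include/exclude decisions standing in for the study's data.
from sklearn.metrics import f1_score, cohen_kappa_score, confusion_matrix

human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0]   # consensus include (1) / exclude (0)
llm   = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0]   # zero-shot LLM decisions

print("F1:", round(f1_score(human, llm), 2))
print("Cohen's kappa:", round(cohen_kappa_score(human, llm), 2))
print(confusion_matrix(human, llm))             # rows: human, cols: LLM
```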
Demographic Data Reporting in Predictive Models for Biosignals: Early Results of a Rapid Scoping Review
Anita Silver (Duke)*
Background: AI models are increasingly being used to make predictions from human biosignal data collected by wearable devices. Model outputs may reflect bias from training data. Objective: This ongoing, rapid scoping review aims to assess the current status of demographic data reporting in literature on AI model development for wearable-derived biosignals. Our main question is whether and how participant demographics are being reported by studies in this field. Design: A rapid scoping review is being conducted based on Joanna Briggs Institute Scoping Review methodology and will be reported according to PRISMA-ScR. The protocol was registered on Open Science Framework. Literature search: We searched MEDLINE (PubMed), Embase (Elsevier), IEEE Xplore Digital Library (IEEE.org), and Web of Science (Clarivate) for articles published in the five years prior to July 9, 2024. The search was developed and conducted with a professional medical librarian and included a combination of keywords and standardized vocabulary representing predictive artificial intelligence, wearable devices, and biosignals. Study selection: Original, peer reviewed research articles that proposed novel contributions in data-driven modelling to make predictions from wearable device biosignal data were included. Preliminary results: Our search identified a total of 4,487 unique articles of which 2,633 articles met the study inclusion criteria and were included in full-text retrieval. Our early screening results demonstrate that there is not a clear consensus on the attributes and attribute values to be reported about the data used to train such AI models. Implications for responsible AI: Standards for demographic reporting on human data are needed to ensure that the field of AI development for wearable device-derived biosignal data continues to innovate inclusively and transparently. Potential impact on society: Improving reporting consistency can empower non-AI experts (such as doctors and individuals) to make informed judgements about the quality and relevance of emerging AI technologies.
Building Trust and Innovation: Responsible AI Policy and Governance in Higher Education
Susan Kreikamp (Partner Advisory, Accenture)*, Mimi Whitehouse (Accenture)*
As higher education transitions from AI and Generative AI pilot programs and isolated use cases to fully integrating these technologies into the student experience, faculty research, and institutional operations, establishing efficient and effective Responsible AI governance is essential. This presentation, drawing from industry expertise with Fortune 500+ companies, Higher Education clients, and collaborative research with Stanford University, will outline leading practices for AI governance and policy. Key findings to be shared include the essential components of a comprehensive AI policy, along with its operationalization, such as risk screening. Additionally, findings on process and coordinated development not only support the immediate goal of risk mitigation but also foster broader engagement through awareness and education programs, empowering individuals who may not have initially seen themselves as innovators. We highlight that, through responsible AI governance and policy, organizations can drive impactful outcomes by building trust in AI solutions, unlocking new opportunities, and fostering seamless collaboration, leading to innovation and ethical advancements in technology. Higher Education stands to gain significant benefits from responsible AI adoption and operationalization, much like commercial industries. As demonstrated in a recent Stanford University study of 1,000 C-suite executives, 49% reported that AI is a key driver of revenue growth for their firms. This underscores AI's potential to enhance institutional effectiveness, drive innovation, and create new opportunities for sustainable growth in the education sector. References: Is It Time to Regulate AI Use on Campus? (Chronicle of Higher Education); Responsible AI Maturity Mindset (Accenture).
A Systems Thinking Approach to Algorithmic Fairness
Chris Lam (Epistamai)*
Systems thinking provides us with a way to model the algorithmic fairness problem by allowing us to encode prior knowledge and assumptions about where we believe bias might exist in the data generating process. We can then encode these beliefs as a series of causal graphs, enabling us to link AI/ML systems to politics and the law. This allows us to combine techniques from machine learning, causal inference, and system dynamics in order to capture different emergent aspects of the fairness problem. We can use systems thinking to help policymakers on both sides of the political aisle to understand the complex trade-offs that exist from different types of fairness policies, providing a sociotechnical foundation for designing AI policy that is aligned to their political agendas and with society's values.
AI sensemaking in interprofessional career transitions: real-world practices, pitfalls, and lessons learned
Lan Li (UNC-Chapel Hill)*
“Do my skills matter? For how long? And what can I do so that my skills will continue to matter even as AI capabilities progress?” With advancing capabilities exhibited by artificial intelligence (AI) systems, there has been much hype and fear about AI’s implications toward knowledge work – work that is composed primarily of cognitive labor. While recent studies have begun to explore how knowledge workers themselves are making sense of AI and its implications for their work lives, few have examined what actions, if any, workers might be enacting in response to AI’s growing integration in the workplace. Taking this action-oriented focus, my work centers its sociotechnical investigation on a smaller population of knowledge workers who are at various stages of undergoing interprofessional career transitions which are in part informed by the developments of AI that they have observed and experienced within their work environments. Given the strong need for meaning-making during times of transition, the interviews highlight the complex and ongoing sensemaking workers engage in to understand the evolving nature of their relationship with AI, this new work partner, as a collaborator, a competitor, or both. The transitional strategies workers discussed reveal, through intended actions, what workers believe might be the hard-to-replace skills that may endure even as AI continues to progress, and, interestingly, how workers might identify or even create work settings and contexts that would make space for, necessitate, appreciate, or allow their existing skills to endure even as AI poses a threat. The successes and setbacks workers shared provide a precious window into understanding what existing infrastructure and tools are working and/or missing in supporting knowledge workers who occupy different positions within the labor market in adapting to a future of work marked by increasing uncertainty and change.
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework
Olivier Binette (American Institutes for Research)*
Machine learning and AI models are commonly evaluated on benchmark datasets. This practice supports innovative methodological research, but benchmark performance can be poorly correlated with performance in real-world applications - a construct validity issue. To improve the validity and practical usefulness of evaluations, we propose using an estimands framework adapted from international clinical trials guidelines. This framework provides a systematic structure for inference and reporting in evaluations, emphasizing the importance of a well-defined estimation target. We illustrate our proposal on examples of commonly used evaluation methodologies - involving cross-validation, clustering evaluation, and LLM benchmarking - that can lead to incorrect rankings of competing models (rank reversals) with high probability, even when performance differences are large. We demonstrate how the estimands framework helps uncover underlying issues, their causes, and potential solutions in each of these examples. Ultimately, we believe the framework can significantly improve the validity of evaluations through better aligned inference and help decision-makers and model users interpret reported results more effectively.
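One simple mechanism behind rank reversals, sampling variability in a finite benchmark, can be illustrated as follows; the accuracies, test-set size, and repetition count are made up for illustration and do not come from the paper.

```python
# Toy illustration of rank reversal: model A is truly better than model B,
# yet a single finite benchmark draw frequently ranks them the wrong way.
import numpy as np

rng = np.random.default_rng(0)
true_acc_a, true_acc_b, n_test, n_reps = 0.82, 0.80, 500, 10_000

obs_a = rng.binomial(n_test, true_acc_a, size=n_reps) / n_test
obs_b = rng.binomial(n_test, true_acc_b, size=n_reps) / n_test
print("P(benchmark ranks B above A):", np.mean(obs_b > obs_a))
```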
Spatial LLM: Generative Transformer Approach to Single Cell Tissue Imaging
Leyla Rasheed (Duke)*, John Hickey, Zach Robers, Yuan Feng, James Sohigian
Understanding and modeling cell-cell relationships within tissues is a crucial and challenging aspect of spatial proteomics research due to the scale of the datasets involved. Our study focuses on developing a generative transformer model that reconstructs and extends cell arrangements within intestinal tissue, leveraging a dataset of spatial proteomics images. Specifically, we capture the structure of these tissues by modeling cell type distributions and spatial dependencies. The dataset, derived from Dr. John Hickey's CODEX imaging of 66 intestinal tissue regions, is primarily structured around x- and y-coordinates, cell type, and tissue region. CODEX ("Co-detection by Indexing") is a multiplexed imaging technique that uses antibodies conjugated with unique DNA barcodes to visualize protein markers that can uniquely identify cell types in a slice of tissue. This process creates images of high spatial resolution at the single-cell level, giving a highly detailed analysis of cell interactions within the intestine. Our computational methodology involves restructuring the dataset, modifying a one-dimensional transformer, and generating sequential rows of new cells. We tokenize cell positions by encoding cell types as letters and incorporating spatial information through an ASCII-based encoding system for the "space" between existing cells. We modify batching and generation functions to better capture consecutive rows of cellular structures. Our initial findings demonstrate that the transformer model can effectively learn and emulate the cell type distributions observed in the original dataset. However, due to the model's sequential nature and lack of explicit 2D spatial encoding, it struggles to fully capture relationships across both x and y dimensions. To address this limitation, we will target other architectures to enhance spatial accuracy in tissue reconstruction. By refining transformer models for spatial proteomics, we improve synthetic tissue generation and enhance the interpretability of AI-driven biomedical research.
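The row-wise tokenization described above can be sketched as follows, with illustrative cell types and gap encoding rather than the project's exact vocabulary.

```python
# Sketch of the row-wise tokenization described above: cell types become
# letters and the empty space between consecutive cells in a row becomes a
# gap token, so a 2D tissue slice is flattened into a 1D sequence.
CELL_TYPE_TO_TOKEN = {"enterocyte": "E", "goblet": "G", "immune": "I"}

def tokenize_row(cells, bin_size=10.0):
    """cells: list of (x, cell_type) pairs in one row, assumed sorted by x."""
    tokens, prev_x = [], None
    for x, cell_type in cells:
        if prev_x is not None:
            gap_bins = int((x - prev_x) // bin_size)
            if gap_bins > 0:
                # Encode the gap length as a printable ASCII character.
                tokens.append(chr(48 + min(gap_bins, 40)))
        tokens.append(CELL_TYPE_TO_TOKEN[cell_type])
        prev_x = x
    return "".join(tokens)

row = [(3.0, "enterocyte"), (8.0, "enterocyte"), (41.0, "goblet"), (55.0, "immune")]
print(tokenize_row(row))   # -> "EE3G1I"
```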