This study investigates the effectiveness of using pre-trained weights from the LLaMA language model family for next-visit disease prediction tasks using electronic health records (EHRs). To evaluate this approach, we introduce LotusAI-Predict, a fine-tuned LLaMA-based model designed for disease prediction. While transformer architectures have shown promise in processing clinical sequences, the potential benefits of leveraging pre-trained language models for medical prediction tasks remain underexplored. Using the MIMIC-IV dataset, we compare two variants of a transformer-based model: one initialized with pre-trained LLaMA weights and another with random initialization. Both models process sequences of ICD-10 diagnostic codes while preserving temporal relationships between diagnoses. Our results demonstrate that the pre-trained model achieves significantly better performance, with a validation accuracy of 7.40% compared to 3.05% for the randomly initialized model over the first 2,058 training steps, a 142.6% relative improvement. The pre-trained model also shows faster convergence, reaching a lower validation loss of 1.21 compared to 1.40 for the random initialization. These findings suggest that the semantic understanding encoded in pre-trained language models can transfer effectively to clinical prediction tasks, potentially improving the development and deployment of medical prediction systems.
Introduction
The widespread adoption of Electronic Health Records (EHRs) has led to an unprecedented accumulation of structured clinical data, with over 90% of hospitals in the United States now utilizing EHR systems (Office of the National Coordinator for Health Information Technology, 2021). This wealth of data presents opportunities to improve healthcare outcomes through predictive modeling, particularly in anticipating future diagnoses based on patients’ medical histories.
In healthcare, transformer-based models have shown particular promise in processing clinical sequences, as demonstrated by models like BEHRT (Li et al., 2020) and TransformerEHR (Yang et al., 2023), which effectively capture temporal relationships in patient histories and predict future diagnoses. Concurrent with these developments, large language models (LLMs) have emerged as powerful tools for understanding and generating human language, demonstrating impressive capabilities in capturing semantic relationships and domain knowledge through pre-training on vast text corpora (Grattafiori et al., 2024). The success of these models raises an intriguing question: can the semantic understanding encoded in pre-trained language models enhance the performance of clinical prediction tasks?
This study investigates whether leveraging pre-trained weights from the LLaMA model family can improve the accuracy of next-visit disease prediction. We focus specifically on processing sequences of International Classification of Diseases, 10th revision (ICD-10) codes (Steindel, 2010), utilizing the temporal ordering of diagnoses to predict future conditions. Our approach explores whether the semantic understanding embedded in pre-trained language models can enhance the capture of relationships between different diagnoses, potentially leading to more accurate predictions compared to models trained solely on clinical data.
Related Work
BEHRT
One of the first models to use transformers for disease prediction, the BEHRT (Bidirectional Encoder representations from Health Records using Transformers) model represented a significant advancement in disease prediction methodology (Li et al., 2020). By adapting the transformer architecture to handle electronic health records, BEHRT addressed several limitations of previous approaches while maintaining interpretability and achieving superior predictive performance.
The model’s architecture integrated multiple embedding layers to capture the complex nature of medical data: disease embeddings encoded relationships between conditions, positional encodings preserved temporal sequence information, age embeddings incorporated demographic context, and visit segment embeddings differentiated between medical encounters (Li et al., 2020). This embedding approach enabled the model to simultaneously process both medical context and temporal relationships within patient histories.
In terms of empirical validation, BEHRT was evaluated using the Clinical Practice Research Datalink (CPRD), encompassing primary care data from 674 general practitioner practices in the UK, linked with hospital episode statistics (HES). The study analyzed approximately 1.6 million patients meeting quality standards and having sufficient visit history (Li et al., 2020). The researchers standardized various disease coding systems (Read codes and ICD-10) into 301 distinct disease codes to ensure consistent analysis.
Performance evaluation focused on three temporal prediction tasks: next-visit disease prediction (T1), 6-month prediction (T2), and 12-month prediction (T3). BEHRT demonstrated significant improvements over existing deep learning approaches, achieving 8.0-13.2% higher average precision scores across these tasks (Li et al., 2020). The model showed particular efficacy in predicting specific conditions such as epilepsy, prostate malignancy, and depression, while maintaining robust performance across gender-specific conditions despite the absence of explicit gender information in the input data.
The evaluation methodology employed two primary metrics: the Area Under the Receiver Operating Characteristic curve (AUROC) and Average Precision Score (APS), as well as comparative analysis against other baseline models, including Deepr (CNN-based) and RETAIN (reverse-time attention RNN) (Li et al., 2020).
TransformerEHR
Beyond BEHRT, there was also TransformerEHR, which introduced several key innovations in disease prediction methodology (Yang et al., 2023). By implementing an encoder-decoder framework specifically designed for electronic health records, this approach demonstrated significant improvements over previous models while maintaining interpretability and generalizability.
The model’s architecture employed a generative encoder-decoder framework to predict future ICD codes based on longitudinal patient data. Unlike previous approaches that used bidirectional encoder representations, TransformerEHR utilized a unidirectional decoder that more closely aligned with the temporal nature of disease prediction. The model integrated visit embeddings, time embeddings, and code embeddings to capture the multidimensional aspects of patient histories (Yang et al., 2023).
A key innovation in TransformerEHR’s methodology was its pretraining objective. Rather than predicting a subset of masked codes within visits, the model was designed to predict complete sets of ICD codes for future visits. This approach better reflected the reality of clinical practice, where multiple diseases and outcomes often manifest simultaneously. The pretraining phase utilized data from 6.5 million patients across 255 million visits, enabling the model to learn complex interrelationships between different diseases and outcomes.
The evaluation of TransformerEHR was conducted through multiple comprehensive assessments. In disease-agnostic prediction tasks, the model demonstrated superior performance compared to existing approaches, with particularly notable improvements in predicting uncommon diseases. The model achieved a 5.92% improvement in AUROC for uncommon diseases and a 3.96% improvement for common diseases compared to BERT-based models (Yang et al., 2023).
For specific clinical applications, TransformerEHR was evaluated on two challenging prediction tasks: pancreatic cancer onset and intentional self-harm among PTSD patients. In pancreatic cancer prediction, the model achieved an AUROC of 81.95% and an AUPRC of 78.64%, representing significant improvements over both traditional machine learning approaches and existing deep learning models. For the prediction of intentional self-harm, TransformerEHR achieved a positive predictive value of 8.8% at the 10% threshold, substantially exceeding the clinical cost-effectiveness threshold of 1.7% (Yang et al., 2023).
The model’s generalizability was validated through both internal and external evaluations. Internal validation used data from 121 VA healthcare facilities not included in the pretraining dataset, while external validation employed the MIMIC-IV dataset from non-VA hospitals. These evaluations demonstrated the model’s robust transfer learning capabilities, with TransformerEHR maintaining strong performance across different healthcare settings and patient populations.
A notable aspect of TransformerEHR’s design was its ability to effectively utilize longer patient histories. Unlike previous models that showed degraded performance with extended histories, TransformerEHR’s attention-based architecture demonstrated a 19% improvement in AUPRC when using complete patient histories compared to limiting analysis to the five most recent visits. This capability proved particularly valuable for complex conditions requiring longitudinal analysis (Yang et al., 2023).
RarePT: Predicting Rare Disease Through Deep Learning
Following BEHRT and TransformerEHR in applying transformer architectures to disease prediction, RarePT emerged as a solution focusing on rare disease prediction (Jordan et al., 2023). While previous models demonstrated transformer effectiveness for general disease prediction, RarePT’s architecture addressed specific challenges of rare disease prediction: limited training data and the need to learn from few examples.
Like BEHRT’s standardization of disease codes, RarePT utilized phecodes for data representation. The researchers mapped ICD-10 codes to phecodes using standard mapping procedures, resulting in a dataset of 436,407 participants with 1,558 unique phecodes. To ensure balanced training, they constructed a dataset of 100 cases and 100 controls for each phecode, excluding phecodes with insufficient examples. This approach resulted in 1,297 query phecodes, with 273 rare phecodes remaining in the training data but not used as query targets (Jordan et al., 2023).
RarePT’s evaluation built on the methodologies established by previous models. While BEHRT and TransformerEHR focused on general disease prediction tasks, RarePT was tested on 155 rare phecodes, defined as those appearing in fewer than 1 in 2,000 UK Biobank participants. Performance was measured using diagnostic odds ratio (OR), with the model achieving a median OR of 48.0 across all rare phecodes. This performance remained consistent in cross-validation and validation on an independent cohort from the Mount Sinai Health System, where it achieved a median OR of 30.6 (Jordan et al., 2023).
The model’s validation extended beyond the standard metrics used by previous models. The researchers conducted regression analyses to test associations with mortality and disability-adjusted life years (DALY), finding significant associations in 65% and 73% of tested phecodes respectively. They evaluated the model’s predictions against known diagnostic biomarkers, identifying 75 defined relationships between 32 rare phecodes and 23 laboratory tests. In these evaluations, 72% of predictions showed statistically significant associations with the expected biomarker abnormalities (Jordan et al., 2023).
The use of phecodes in RarePT standardized disease representation across healthcare systems, building on the standardization approaches used in BEHRT and TransformerEHR. While phecodes were designed for common phenotypes rather than rare diseases, the analysis demonstrated that this coding system could capture rare disease patterns (Jordan et al., 2023).
Transfer Learning and World Models
One interesting area of research that has emerged demonstrates the effectiveness of using pre-trained large language models for specialized domain applications through transfer learning. Transfer learning enables the creation of high-performance learners using data from different domains (Weiss et al., 2016). The success of this approach lies in the rich semantic understanding and pattern recognition capabilities that large language models develop during their initial training, which can then be adapted to specific tasks through targeted fine-tuning.
A notable example is LLaVA, which successfully adapted a large language model for vision understanding tasks by adding a linear projection layer and implementing additional pre-training (Liu et al., 2023). LLaVA achieved strong performance with relatively modest training data requirements, demonstrating that pre-trained models can effectively learn new modalities without requiring extensive domain-specific training data. The model’s ability to leverage its pre-existing language understanding while adapting to visual tasks through a simple projection mechanism suggested that similar approaches could be effective in other specialized domains.
This success in cross-domain adaptation is particularly relevant for medical applications, where large-scale labeled datasets are often difficult to obtain and data privacy concerns can limit training options. The ability to leverage pre-trained language models’ semantic understanding while adapting them to specific medical tasks through targeted architectural modifications could provide a more efficient path to developing effective medical prediction systems. These considerations motivated our approach of using LLaMA for disease prediction tasks, where we hypothesized that the model’s pre-existing semantic understanding could be effectively channeled toward medical diagnosis prediction through appropriate architectural modifications and fine-tuning strategies.
The application of pretrained language models has also recently shown promise in predicting ligand-protein interactions (LPI), paralleling developments in clinical prediction tasks. Just as disease prediction models like TransformerEHR leverage transformer architectures to capture complex temporal patterns in medical histories, recent work has demonstrated that pretrained language models can effectively capture the intricate relationships between ligands and their target proteins (Fauber, 2024). Fauber’s approach of using pretrained small language models (SLMs) for LPI affinity prediction shares conceptual similarities with LLaVA’s vision adaptation strategy (Liu et al., 2023). Both demonstrate how pretrained language models can be effectively adapted to specialized domains through targeted fine-tuning. While LLaVA added a linear projection layer for vision tasks, Fauber’s method directly fine-tuned pretrained SLMs on ligand SMILES strings and protein amino acid sequences, achieving significant improvements over traditional machine learning and physics-based approaches (Fauber, 2024).
The success of pretrained models in both vision adaptation and LPI affinity prediction suggests a common underlying principle: the semantic understanding developed through pretraining on general language can be effectively transferred to specialized domains. This is particularly notable given that both domains involve complex data: large datasets of images in the case of LLaVA, and molecular interaction patterns in the case of LPI prediction. Fauber’s results demonstrated that instruction-tuned SLMs could achieve 44% accuracy in exact matches for LPI affinity predictions, with accuracy improving to 79% when allowing for predictions within one ordinal value (Fauber, 2024). These results significantly outperformed traditional machine learning approaches, which achieved only 7% accuracy, highlighting the value of transfer learning from pretrained language models. This mirrors the improvements seen in clinical prediction tasks when leveraging pretrained transformer architectures, as demonstrated by BEHRT and TransformerEHR (Li et al., 2020; Yang et al., 2023).
The success of pretrained models in both domains also suggests that the sophisticated internal representations developed through language pretraining may be more generalizable than previously thought. Fauber’s work showed that pretrained SLMs could effectively predict both common and rare ligand-protein interaction patterns (Fauber, 2024). There is also research suggesting that language models develop sophisticated internal representations during pretraining, which offers insight into why pretrained weights may transfer effectively to specialized domains like disease prediction.
This emergence of "world models", the generalizable internal representations that models develop during pretraining, may help explain the success of transfer learning approaches across different domains. For instance, Abdou et al. (2021) showed that large language models’ internal representations of English color words correspond to their coordinates in the CIELAB color space, and Marks et al. (2024) showed that it is possible to distinguish whether an LLM is telling the truth by examining its internal hidden state.
Another interesting study was conducted by Rasul et al. (2024), who trained Lag-Llama, a Llama-based, general-purpose foundation model for univariate probabilistic time series forecasting. Built on a decoder-only transformer architecture that uses lags as covariates and trained on a large corpus of diverse time series data from several domains, Lag-Llama showed strong zero-shot generalization capabilities even on unseen time series datasets, indicating that the model must be forming a generalizable world model that can be applied even to domains outside the distribution of the original pre-training data.
A particularly illuminating study by Li et al. (2024) demonstrated how even a simple transformer model trained on Othello game sequences could develop sophisticated internal representations of game state without explicit supervision. Through careful probing experiments, they showed that the model learned nonlinear representations of board state that were causally linked to its move predictions. This emerged purely from training on move sequences, suggesting that transformer architectures have an innate ability to discover and represent the underlying structure of sequential data. Furthermore, the intervention experiments in Li et al. (2024) showed that these world models were causally linked to model predictions, not just correlational artifacts. This also suggests that large language models like LLaMA may develop sophisticated internal representations of causality and temporal relationships during their pretraining phase.
This understanding of world models directly motivated our approach of using pretrained large language model weights for disease prediction. We hypothesized that a large language model’s pretraining on vast amounts of general and medical text likely resulted in the development of useful internal representations of disease relationships, progression patterns, and medical concepts. By initializing our disease prediction model with these weights, we could potentially leverage these pre-existing world models rather than having to learn such representations from scratch using only limited medical data.
Thus, building upon these advances in transformer architectures for clinical prediction, we propose an approach that investigates the potential benefits of using pre-trained language model weights for disease prediction tasks. While previous work has focused on training transformer architectures specifically for clinical data, our study examines whether the semantic understanding encoded in pre-trained language models can enhance prediction accuracy. We utilize the LLaMA model architecture, adapting it to process sequences of ICD-10 codes while maintaining their temporal ordering, similar to the approach used by Yang et al. (2023). This approach aims to determine whether pre-trained weights offer advantages over models trained exclusively on clinical data, potentially providing insights into the value of transfer learning from general language understanding to specific clinical tasks.
Data and Code Availability
The MIMIC-IV dataset used in this study is publicly available through PhysioNet (https://physionet.org/content/mimiciv/2.2/). Access requires completion of a CITI "Data or Specimens Only Research" course and execution of a data use agreement. Our preprocessing scripts, model implementation, and evaluation code are available on request. The pretrained LLaMA-2 7B weights used in this study are available from Meta AI under the LLaMA 2 Community License. Our implementation uses PyTorch and the Transformers library.
Methods
Dataset and Preprocessing
We utilized the MIMIC-IV dataset (Johnson et al., 2023), a comprehensive database of de-identified health records from hospital admissions. Our preprocessing pipeline structured the data to represent each patient’s history as a temporal sequence of visits, with each visit containing a set of ICD-10 diagnostic codes. This sequential representation preserves both the temporal relationships between visits and the ordering of conditions within each visit, establishing the foundation for our predictive modeling approach.
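To make this concrete, the following is a minimal, illustrative sketch of such a preprocessing step rather than our exact pipeline; it assumes the standard MIMIC-IV "hosp" module files (diagnoses_icd, admissions) and groups ICD-10 codes by admission before ordering admissions chronologically per patient:

```python
import pandas as pd

# Illustrative sketch: represent each patient as a chronologically ordered
# sequence of visits, each visit an ordered list of ICD-10 codes.
# File/column names follow the MIMIC-IV "hosp" module.
diagnoses = pd.read_csv("hosp/diagnoses_icd.csv.gz")
admissions = pd.read_csv("hosp/admissions.csv.gz", parse_dates=["admittime"])

# Keep ICD-10 diagnoses and attach admission timestamps.
dx10 = diagnoses[diagnoses["icd_version"] == 10].merge(
    admissions[["hadm_id", "admittime"]], on="hadm_id"
)

# Order by patient, visit time, then within-visit coding order (seq_num).
dx10 = dx10.sort_values(
    ["subject_id", "admittime", "hadm_id", "seq_num"], kind="stable"
)

# One row per visit: an ordered list of ICD-10 codes.
visits = (
    dx10.groupby(["subject_id", "admittime", "hadm_id"], sort=False)["icd_code"]
    .apply(list)
    .reset_index()
)

# One entry per patient: a chronologically ordered list of visits.
patient_sequences = visits.groupby("subject_id", sort=False)["icd_code"].apply(list)
```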
Data Representation and Tokenization
Our approach to representing patient data combines hierarchical code structure with temporal sequence preservation through a specialized tokenization strategy. The process encompasses three key components:
Vocabulary Construction
We developed a specialized vocabulary that captures the hierarchical nature of ICD-10 diagnostic codes. Each code is decomposed into two fundamental components:
• Code category: Representing the broad diagnostic classification
• Code etiology: Capturing specific condition details
This decomposition enables the model to learn relationships at multiple levels of diagnostic granularity. The vocabulary also includes special tokens for sequence structuring (a construction sketch follows the list below):
• Beginning-of-sequence token (<s>)
• End-of-sequence token (</s>)
• Visit separator token (<sep>)
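Since the exact decomposition rule is not spelled out above, the sketch below assumes the common convention that an ICD-10 category is the first three characters of the code, with the remainder treated as the etiology; the token prefixes (CAT:, ETI:) and helper names are our own illustrative choices:

```python
# Hypothetical vocabulary construction under the assumptions stated above.
SPECIAL_TOKENS = ["<pad>", "<s>", "</s>", "<sep>"]

def split_icd10(code: str) -> tuple[str, str]:
    """Decompose an ICD-10 code into (category, etiology) tokens,
    e.g. "E119" -> ("CAT:E11", "ETI:9")."""
    category, etiology = code[:3], code[3:]
    return f"CAT:{category}", f"ETI:{etiology or 'NONE'}"

def build_vocab(patient_sequences) -> dict[str, int]:
    """Assign an integer id to every special, category, and etiology token."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for visits in patient_sequences:
        for visit in visits:
            for code in visit:
                for token in split_icd10(code):
                    vocab.setdefault(token, len(vocab))
    return vocab
```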
Sequence Generation
For each patient record, we generate a tokenized sequence that preserves both temporal ordering and diagnostic structure:

$$S = \langle\text{s}\rangle\, c_1^1\, e_1^1\, \cdots\, c_1^{n_1}\, e_1^{n_1}\, \langle\text{sep}\rangle\, c_2^1\, e_2^1\, \cdots\, c_2^{n_2}\, e_2^{n_2}\, \langle\text{sep}\rangle\, \cdots\, \langle\text{/s}\rangle$$

where $c_i^j$ and $e_i^j$ represent the category and etiology tokens for the $j$-th diagnosis in visit $i$, and $n_i$ denotes the number of diagnoses in visit $i$. This representation captures both the temporal progression of patient visits and the hierarchical nature of medical diagnoses.
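Continuing the sketch above, one way a patient's visit history could be flattened into this token sequence (reusing the hypothetical split_icd10 and vocab from the previous sketch):

```python
def encode_patient(visits: list[list[str]], vocab: dict[str, int]) -> list[int]:
    """Tokenize one patient history following the scheme above:
    <s> c e c e ... <sep> c e ... </s>."""
    ids = [vocab["<s>"]]
    for i, visit in enumerate(visits):
        if i > 0:
            ids.append(vocab["<sep>"])  # separate consecutive visits
        for code in visit:
            cat, eti = split_icd10(code)
            ids.extend([vocab[cat], vocab[eti]])
    ids.append(vocab["</s>"])
    return ids
```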
Model Architecture and Implementation
Our model architecture extends the LLaMA 2 foundation model, implementing a specialized medical diagnosis prediction system through three integrated components:
Input Embedding
The model begins with a dedicated MimicEmbedding layer that maps ICD-10 diagnostic codes to dense input vector representations that match the internal dimensions of the LLaMA 2 7B foundation model:

$$\mathbf{e}_t = \mathrm{MimicEmbedding}(x_t) \in \mathbb{R}^{4096}$$

Each input representation then combines the code embedding with positional information to be fed into the transformer backbone:

$$\mathbf{h}_t^{(0)} = \mathbf{e}_t + \mathbf{p}_t$$

where $x_t$ is the $t$-th input token and $\mathbf{p}_t$ its positional embedding.
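A minimal sketch of such an embedding layer is shown below. The class name MimicEmbedding follows the paper, but the learned absolute positional embedding is our assumption; LLaMA itself applies rotary position embeddings inside its attention layers:

```python
import torch
import torch.nn as nn

class MimicEmbedding(nn.Module):
    """Maps ICD-10-derived token ids to dense vectors matching LLaMA 2 7B's
    hidden size (4096), plus a learned positional embedding (assumed)."""

    def __init__(self, vocab_size: int, hidden_size: int = 4096, max_len: int = 2048):
        super().__init__()
        self.code_embedding = nn.Embedding(vocab_size, hidden_size)
        self.position_embedding = nn.Embedding(max_len, hidden_size)

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        # input_ids: (batch, seq_len) -> embeddings: (batch, seq_len, hidden)
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return self.code_embedding(input_ids) + self.position_embedding(positions)
```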
Transformer Backbone
The transformer backbone utilizes the LLaMA 2 7B architecture without the initial embedding layer, but retaining the final language model head, so as to reuse as many of the pre-trained weights as possible.

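One way to realize this with the Transformers library, sketched under the assumption that our embeddings are injected through the model's inputs_embeds argument while the pre-trained transformer blocks and language model head are reused (the repository id is the standard Hugging Face one for LLaMA 2 7B):

```python
from transformers import LlamaForCausalLM

# Hedged sketch of the backbone assembly: LLaMA's own token embeddings are
# bypassed by feeding MimicEmbedding outputs via `inputs_embeds`.
backbone = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
backbone.resize_token_embeddings(len(vocab))  # size the LM head to the ICD vocab

embedder = MimicEmbedding(vocab_size=len(vocab))

def forward(input_ids):
    embeds = embedder(input_ids)
    return backbone(inputs_embeds=embeds).logits  # (batch, seq_len, vocab)
```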
Final Prediction Layer
The final prediction layer maps transformer outputs to diagnostic probabilities:

$$P(\hat{y}_{t+1} \mid x_{1:t}) = \mathrm{softmax}(W_{\mathrm{lm}}\, \mathbf{h}_t)$$

where $\mathbf{h}_t$ is the final hidden state at position $t$ and $W_{\mathrm{lm}}$ is the language model head retained from the pre-trained model.
Training Methodology
We compared two distinct training approaches to evaluate the impact of leveraging pre-trained model weights.
Pre-trained Initialization: Using weights from the pre-trained LLaMA-2 7B model (Touvron et al., 2023)
Random Initialization: Training from scratch as a control condition
Both variants were optimized using cross-entropy loss (Cox, 1958):

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log p_{ij}$$

where $N$ represents the sample count, $C$ is the number of possible diagnosis codes, $y_{ij}$ represents the true labels, and $p_{ij}$ denotes the predicted probabilities. Training was conducted for 2,058 steps using an Nvidia A100-80GB GPU, allowing direct comparison of pre-training benefits while controlling for architectural differences and training procedures.
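A hedged sketch of the two initialization conditions and a single training step follows; optimizer choice and learning rate are omitted, since they are not reported above, and the next-token training objective is our reading of the loss definition:

```python
import torch.nn.functional as F
from transformers import LlamaConfig, LlamaForCausalLM

# Pre-trained initialization:
pretrained = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Random initialization (same architecture, freshly initialized weights):
config = LlamaConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
randomly_init = LlamaForCausalLM(config)

def step(model, embedder, input_ids, optimizer):
    """One next-token training step with cross-entropy over diagnosis tokens."""
    logits = model(inputs_embeds=embedder(input_ids)).logits
    # Predict token t+1 from tokens up to t.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```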
Results
Our experiments demonstrate that pre-trained weights from the LLaMA model provide measurable advantages in both training dynamics and final model performance. Figure 1 presents the validation metrics for both the pre-initialized and randomly initialized models over 2,000 training steps.
Model Performance
The pre-initialized model achieved a validation accuracy of 7.40%, compared to 3.05% for the randomly initialized model, over the first 2,058 training steps, representing a 4.35 percentage point improvement (approximately 142.6% relative improvement). This substantial difference in early training performance demonstrates the significant advantage of using pre-trained weights for efficient model convergence. Table I presents the key performance metrics for both models; the same trends are visible in Figures 2 and 3.
Training Dynamics
The most striking difference between the two approaches appears in the convergence behavior and accuracy trajectories:
Loss Convergence
The pre-initialized model demonstrates notably faster convergence in validation loss, as can be seen in Figure 3, reaching a significantly lower value within the first 2,058 training steps than the randomly initialized model. The final validation loss values were 1.21 and 1.40 for the pre-initialized and randomly initialized models, respectively, indicating that the pre-trained weights provide a more efficient pathway to optimization.
Accuracy
The pre-initialized model consistently showed higher accuracy throughout training, reaching an accuracy of 7.40% much earlier than the randomly initialized model and maintaining superior performance over the dataset.
Discussion
Impact of Pre-training
The experimental results suggest several key advantages of utilizing pre-trained weights:
Enhanced Convergence Speed
The accelerated convergence of the pre-initialized model suggests that the semantic understanding encoded in the pre-trained weights provides a beneficial starting point for learning medical diagnosis patterns. This faster convergence has practical implications for model training costs and iteration speed in clinical applications.
Improved Model Performance
The consistent improvement in accuracy, coupled with lower validation loss and perplexity, indicates that pre-trained weights offer advantages beyond mere training efficiency. The pre-initialized model appears to develop a more robust understanding of the relationships between diagnoses, as evidenced by its superior perplexity scores.
Clinical Implications
The 142.6% relative improvement in accuracy could have significant implications in clinical settings:
1. Rare Conditions: Even small improvements in overall accuracy can translate to meaningful gains in the prediction of less common conditions, where model performance typically struggles.
2. Resource Efficiency: The faster convergence of pre-initialized models suggests that effective clinical prediction systems could be developed with less training data and computational resources, potentially making these systems more accessible to smaller healthcare institutions.
Limitations and Future Work
Several limitations and opportunities for future research emerge from our findings:
1. Scale Effects: Investigation of how these benefits scale with larger training datasets and longer training periods could provide insights into the optimal application of pre-trained weights.
2. Architecture Optimization: Further research could explore modifications to the architecture that better leverage the pre-trained weights while accommodating the specific characteristics of medical data. For example, adding textual descriptions of the ICD-10 codes could further increase accuracy with the pretrained large language model, as that input representation would be more closely aligned with how the model was pretrained.
3. Interpretability: Additional work is needed to understand how pre-trained weights influence the model’s decision-making process and whether they enhance or complicate model interpretability. Further investigation into which type of pretraining yields tangible improvements would also be valuable, for example by comparing models that have undergone additional RLHF (Reinforcement Learning from Human Feedback) training against models that have only been pretrained.
These results suggest that leveraging pre-trained language models for medical prediction tasks offers tangible benefits, both in terms of model performance and training efficiency. The consistent improvements across multiple metrics indicate that the semantic understanding encoded in pre-trained weights can effectively transfer to specialized clinical tasks.
Conclusion
This study investigated the benefits of leveraging pre-trained language model weights for medical diagnosis prediction, specifically examining whether the semantic understanding encoded in the pre-trained LLaMA weights underlying LotusAI-Predict could enhance prediction accuracy and training efficiency. Our results demonstrate that pre-trained weights provide substantial advantages in both model performance and training dynamics, with the pre-initialized model achieving a validation accuracy of 7.40% compared to 3.05% for the randomly initialized model over the first 2,058 steps, representing a 142.6% relative improvement.
The accelerated training convergence, evidenced by the reduction in final validation loss from 1.40 for the randomly initialized model to 1.21 for the pre-initialized model, suggests that the semantic understanding embedded in pre-trained language models can effectively transfer to specialized clinical tasks. This finding has important implications for the development of clinical prediction systems, particularly in reducing the computational resources and training data required to achieve robust performance. The consistent improvements across our measured metrics demonstrate that pre-trained weights provide a powerful foundation for learning complex relationships in medical data.
The improvement in accuracy, with the pre-initialized model more than doubling its predictive performance over the same number of training steps, represents an advancement in medical prediction tasks. This substantial performance gain, combined with faster convergence and improved training efficiency, could transform how healthcare institutions implement prediction systems. The consistently superior performance throughout the training process demonstrates that pre-trained weights enhance the model’s ability to learn and generalize medical patterns, and lends credence to the assertion that large language models learn generalizable world models when pretrained on large text corpora.
These findings open several directions for future research, particularly in investigating how these benefits might scale with larger datasets and longer training periods. Given the significant improvements observed with our current implementation, further research into architectural optimizations specifically designed to leverage pre-trained weights in clinical contexts could potentially yield even more dramatic improvements. Additionally, deeper analysis of model interpretability could help understand how pre-trained weights contribute to this substantial performance enhancement in medical predictions.
This work contributes evidence supporting the effectiveness of transfer learning in specialized domains, while specifically demonstrating its potential in clinical prediction tasks. The improvement in accuracy suggests that the future development of clinical prediction systems could be improved by leveraging the semantic understanding encoded in large language models. As healthcare systems continue to generate increasingly large volumes of clinical data, the ability to train predictive models more quickly to higher accuracy by leveraging pre-trained language models could advance medical prediction capabilities, improving patient care and clinical outcomes.
References
Cox, David R. (1958): The regression analysis of binary sequences, 2: 215–232.
Steindel, Steven J. (2010): International classification of diseases, 10th edition, clinical modification and procedure coding system: descriptive overview of the next generation HIPAA code sets, 3: 274–282.
Weiss, Karl / Khoshgoftaar, Taghi M. / Wang, DingDing (2016): A survey of transfer learning, 1: 1–40.
Li, Yikuan et al. (2020): BEHRT: Transformer for Electronic Health Records, 1: 7155.
Jordan, Daniel M. / Vy, Ha My T. / Do, Ron (2023): A deep learning transformer model predicts high rates of undiagnosed rare disease in large electronic health systems. medRxiv 2023.12.21.23300393.
Johnson, A. E. W. / Bulgarelli, L. / Shen, L. / et al. (2023): MIMIC-IV, a freely accessible electronic health record dataset, 1: 1.
Yang, Zhichao / Mitra, Avijit / Liu, Weisong / Berlowitz, Dan / Yu, Hong (2023): TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records, 7857.
Abdou, Mostafa / Kulmizev, Artur / Hershcovich, Daniel / Frank, Stella / Pavlick, Ellie / Søgaard, Anders (2021): Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color.
Fauber, Ben (2024): Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models.
Grattafiori, Aaron et al. (2024): The Llama 3 Herd of Models.
Li, Kenneth / Hopkins, Aspen K. / Bau, David / Viégas, Fernanda / Pfister, Hanspeter / Wattenberg, Martin (2024): Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task.
Liu, Haotian / Li, Chunyuan / Wu, Qingyang / Lee, Yong Jae (2023): Visual Instruction Tuning.
Marks, Samuel / Tegmark, Max (2024): The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets.
Office of the National Coordinator for Health Information Technology (2021): National Trends in Hospital and Physician Adoption of Electronic Health Records. Accessed: 2024-11-30.
Rasul, Kashif et al. (2024): Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting.
Touvron, Hugo et al. (2023): Llama 2: Open Foundation and Fine-Tuned Chat Models.
Vaswani, Ashish / Shazeer, Noam / Parmar, Niki / Uszkoreit, Jakob / Jones, Llion / Gomez, Aidan N. / Kaiser, Lukasz / Polosukhin, Illia (2023): Attention Is All You Need.