Kimi | Viral Dark Matter AI Models

If I wanted to explore the properties of viral dark matter using CUDA AI tools, what would be your best pick? ESM2? OpenFold? Evo2?

Nature: Genome modelling and design across all domains of life ...

nih.gov: Viral Dark Matter: Illuminating Protein Function, Ecology, and Biotechnological Promises

Science | AAAS: Evolutionary-scale prediction of atomic-level protein structure with a language model

Evo2 is the best pick for exploring viral dark matter with CUDA AI tools.

Evo2 is a DNA language model, not a protein model. It processes raw genomic sequences at single-nucleotide resolution with context lengths up to 1 million base pairs, and it was trained on OpenGenome2 — a dataset containing 8.8 trillion tokens from all domains of life, including viruses and prokaryotes. This matters because viral dark matter is fundamentally a sequence discovery problem: the "dark matter" consists of uncharacterized genomic sequences in metagenomic reads that have no detectable homology to known proteins or genes. Evo2's architecture (StripedHyena 2) and training distribution make it natively suited for identifying prophage sequences, viral-derived elements, and novel open reading frames directly from DNA without requiring prior protein annotation. The model has demonstrated learned features for viral-derived sequences in its latent space, and it can generate or score long genomic contexts that would fragment protein-level models.

ESM2 and ESMFold are secondary tools for this specific task. ESM2 is a protein language model trained on UniRef. It excels at predicting structure and function given a protein sequence, but it cannot process raw DNA or identify viral sequences that lack clear ORF annotation. ESMFold depends on ESM2 embeddings and performs end-to-end structure prediction from single protein sequences. While these are powerful for characterizing viral proteins after translation and ORF calling, they do not address the initial discovery problem of finding viral dark matter in unannotated metagenomic data. The literature explicitly notes that protein language models like ESM2 are mitigation strategies for the "lack of homologous sequences" problem, not primary discovery engines for novel viral genomes.

OpenFold is irrelevant here. OpenFold is an open-source reimplementation of AlphaFold2's architecture. It predicts protein structures from sequences and requires MSAs or templates. It has no genomic sequence modeling capability and no training on viral dark matter.

Practical recommendation: Use Evo2 as your primary CUDA tool for genomic sequence scoring, viral sequence generation, and prophage identification in metagenomic assemblies. If you extract putative viral ORFs from Evo2's predictions, then use ESM2/ESMFold downstream to characterize the protein structures and functions of those novel viral genes. This two-stage pipeline (Evo2 for DNA-level discovery, ESM2 for protein-level annotation) matches the actual workflow needed for viral dark matter research.

The problem is eukaryotic viruses are excluded from Evo2 - making its generation and embeddings noise for that application.

Search for Models Targeting Eukaryotic Viral Dark Matter

The user is pointing out a critical limitation: Evo2 was trained on prokaryotic and viral sequences but explicitly excluded eukaryotic viruses from its training data. This makes it unsuitable for studying eukaryotic viral dark matter because its embeddings and generation capabilities would be biased/noisy for that domain.

I need to:

- Acknowledge this is correct - Evo2 excluded eukaryotic viruses
- Re-evaluate the best pick given this constraint
- Consider alternatives:
  - ESM2: protein-level, but could work on eukaryotic viral proteins if they have ORFs
  - OpenFold: structure prediction, not discovery
  - Are there other models? Maybe ProGen2, other viral-specific models?
  - Maybe the user needs a model specifically trained on eukaryotic viruses

Let me search for models that include eukaryotic viruses or are specifically designed for viral dark matter including eukaryotic viruses.

Using ESM2 on Translated ORFs for Eukaryotic Viral Discovery

The user is correct. Evo2 explicitly excluded eukaryotic viruses from training for biosafety reasons, and the paper confirms this results in high perplexity and poor performance on eukaryotic viral sequences. So Evo2 is genuinely unsuitable for eukaryotic viral dark matter.

Now I need to determine the best alternative among the options given (ESM2, OpenFold, or something else) for eukaryotic viral dark matter.

From the search results:

ESM-1b (and by extension ESM2) was trained mainly on cellular proteins with only ~1% viral proteins, but it still captures meaningful biological/host information from viral proteins. EvoMIL used ESM-1b embeddings for eukaryotic virus-host prediction and achieved mean AUC of 0.851 for eukaryotic hosts. It outperformed k-mer features on multi-class classification for eukaryotic hosts.
ESM2 is an improved version of ESM-1b, trained on UniRef with metagenomic expansion (UniMeta200B) which includes specific eukaryotic and viral datasets (SMAG, MetaEuk, TOPAZ, MGV, GPD). So ESM2 actually has better viral coverage than ESM-1b.
OpenFold is just structure prediction - not useful for dark matter discovery.

So the answer should be:

- Acknowledge the user is absolutely correct about Evo2
- ESM2 becomes the best pick among the options because:
  - It was trained on expanded metagenomic data including viral datasets (MGV, GPD, etc.)
  - Protein language models can generalize to viral proteins even with limited training data
  - ESM-1b already worked for eukaryotic virus-host prediction
  - ESM2 is CUDA-optimized via BioNeMo
- OpenFold is still irrelevant
- Mention that for raw DNA-level discovery of eukaryotic viral dark matter, there may not be a perfect foundation model yet, so ESM2 on translated ORFs is the practical path

I should be direct and concise as per the user's preference.

You are correct. Evo 2 deliberately excluded eukaryotic viruses from training for biosafety reasons, and the authors verified that this produces high perplexity and effectively random generation on eukaryotic viral sequences.

Given that constraint, ESM2 is the best pick among your options for eukaryotic viral dark matter.

ESM2 improves on ESM-1b through an expanded training dataset called UniMeta200B, which explicitly includes viral metagenomic sources such as MGV and GPD alongside eukaryotic datasets.

Even earlier ESM-1b embeddings — trained on a dataset with only ~1% viral proteins — successfully captured host-specific signals for eukaryotic viruses in virus-host prediction tasks, achieving strong AUC on eukaryotic host classification.

ESM2 inherits this capability with better scale and broader metagenomic coverage. It also runs efficiently on CUDA through BioNeMo.

OpenFold remains irrelevant. It predicts structures from known protein sequences and cannot discover unannotated viral sequences in metagenomic data.

Practical workflow: Use ESM2 to embed and score translated ORFs from your metagenomic assemblies. For eukaryotic viral dark matter, you are working at the protein level because no current open-source DNA language model reliably covers eukaryotic viruses. ESM2's metagenomic training gives it the best generalization to novel viral proteins among the tools you listed.
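
The ORF step of that workflow can be sketched in plain Python. This is a minimal illustration, not a replacement for a dedicated gene caller: it assumes the standard genetic code, treats any stop-free stretch as a candidate ORF (no start codon required, since metagenomic contigs are often fragments), and uses a hypothetical `min_len` cutoff. The function names are illustrative. The resulting protein strings are what you would pass to ESM2.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(contig, min_len=30):
    """Yield (strand, frame, protein) for stop-free stretches >= min_len residues."""
    contig = contig.upper()
    for strand, seq in (("+", contig), ("-", reverse_complement(contig))):
        for frame in range(3):
            for piece in translate(seq[frame:]).split("*"):
                if len(piece) >= min_len:
                    yield strand, frame, piece
```

Calling `list(six_frame_orfs(contig))` on each assembled contig gives candidate proteins across all six reading frames; the cutoff trades sensitivity for fewer spurious short peptides.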

What does ESM-2 even do.

Transformer learns protein embeddings for structure and sequence prediction

The user is asking "What does ESM-2 even do." This is a straightforward factual question about the ESM-2 model. They want a direct, concise explanation. Given their communication style (prefers direct assertions, minimalistic outputs, no epistemic hedging), I should give a clear, direct answer without excessive hedging.

ESM-2 stands for Evolutionary Scale Modeling 2. It's a protein language model developed by Meta AI (formerly Facebook AI). Key things it does:

It's a transformer model trained on protein sequences (from UniRef database, ~65 million unique protein sequences)
It learns representations/embeddings of protein sequences
It can predict protein structure (when paired with structure prediction head, becomes ESMFold)
It can predict masked amino acids in protein sequences (like BERT for proteins)
It generates embeddings that capture evolutionary, structural, and functional information about proteins
Various sizes from 8M to 15B parameters

I should keep it direct and factual. No need to search since this is well-established knowledge, but I should be accurate. The user already knows about it from previous context but wants a clear explanation of what it actually does.

Let me provide a concise, direct explanation.

ESM-2 is a protein language model trained by Meta AI on 65 million protein sequences from the UniRef database. It is a transformer that reads amino acid sequences and outputs dense vector embeddings representing each residue and the full protein.

What it actually does:

- Masked language modeling: given a protein sequence with some amino acids hidden, ESM-2 predicts the missing residues. This forces the model to learn evolutionary and structural patterns from sequence context alone.
- Protein embedding: it converts any amino acid sequence into a numerical representation that captures evolutionary, structural, and functional information. These embeddings can be fed into downstream classifiers for function prediction, remote homology detection, or variant effect prediction.
- Structure prediction (ESMFold): when paired with a structure prediction head, ESM-2 embeddings drive ESMFold, an end-to-end protein structure predictor comparable to AlphaFold2 for many targets, without requiring multiple sequence alignments.
- Zero-shot fitness prediction: it scores how likely a mutated sequence is under the model, which correlates with protein fitness and stability. No task-specific training is required.

Architecture: Transformer-based, released in scales from 8 million to 15 billion parameters. The larger variants capture more subtle structural signals.

Training data: UniRef, with later versions (ESM-2) incorporating metagenomic sources including viral proteins. It does not process DNA or RNA directly — it only sees amino acid strings.
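
To make the zero-shot scoring idea concrete, here is a small sketch of a log-odds score at a single masked position. The probability table is made up for illustration and is not real ESM-2 output; in practice the probabilities would come from the model's prediction at the masked position, and the function name is hypothetical.

```python
import math


def mutation_score(position_probs, wild_type, mutant):
    """Log-odds of the mutant residue versus the wild-type residue.

    position_probs maps amino acid letters to the probabilities a masked
    language model assigns at one masked position. Negative scores mean the
    model finds the mutant less likely than the wild type.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


# Hypothetical probabilities for one masked position (illustrative values only)
example_probs = {"A": 0.60, "G": 0.30, "P": 0.10}
score = mutation_score(example_probs, wild_type="A", mutant="G")
```

Summing such scores over several mutated positions gives a sequence-level score that needs no task-specific training.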

That seems useless for viral dark matter work.

ESM2 limitations for viral discovery and alternative dedicated tools

The user is expressing frustration that ESM2 seems useless for viral dark matter work. This is a valid critique - ESM2 is a protein language model, and viral dark matter is fundamentally about discovering novel sequences in metagenomic data that have no known homologs. ESM2 requires you to already have protein sequences (ORFs), which means you've already done gene calling and translation, and it can't discover novel viral genomes from raw DNA.

The user previously mentioned Evo2 excludes eukaryotic viruses, and now is pointing out that ESM2 is also inadequate for the actual task of discovering viral dark matter.

I should:

Acknowledge the critique is correct
Explain exactly why ESM2 is limited for this application
Suggest what actually WOULD work for eukaryotic viral dark matter discovery
Be direct and concise (per user's preference from memory)

The user wants direct assertions, no hedging. Let me be honest about the limitations.

Key points:

ESM2 is protein-level, requires ORF calling first
Viral dark matter is often discovered at DNA/RNA level first
Viral sequences often have no detectable protein homologs
ESM2 can't discover novel viral genomes from raw metagenomic reads
For eukaryotic viral dark matter, you need different approaches

What would actually work?

Dedicated viral discovery tools: VirSorter, DeepVirFinder, VIBRANT, geNomad
Or custom DNA language models trained on viral sequences including eukaryotic viruses
Or protein-level approaches but with specialized viral protein databases

The user is right - ESM2 is downstream analysis, not discovery.

You are correct. ESM2 is useless for the discovery phase of viral dark matter work.

Why it fails:

Viral dark matter consists of genomic sequences with no detectable homology to known proteins. ESM2 requires you to already have a protein sequence, meaning you must first call ORFs, translate them, and filter them. By that point, you have already narrowed the data to what gene calling and filtering can recover, and the most divergent sequences are the ones most likely to be lost. ESM2 will embed any amino acid sequence, but its embeddings are most informative for proteins that resemble what it saw in UniRef. It cannot discover a novel eukaryotic virus from raw metagenomic reads.

What you actually need:

For eukaryotic viral dark matter discovery, use tools designed for de novo viral sequence identification from DNA:

- geNomad: identifies viral and plasmid sequences using marker-based and neural network classification on nucleotide sequences
- VirSorter2: multi-classifier framework trained specifically on viral genomes, including eukaryotic viruses
- DeepVirFinder: CNN trained on viral and host sequences for direct viral sequence detection
- VIBRANT: integrates protein and nucleotide features with HMMs for viral discovery

These operate on raw contigs or reads and do not require prior annotation.

The honest bottom line:

None of the three models you listed (Evo2, ESM2, OpenFold) are appropriate primary tools for eukaryotic viral dark matter discovery. Evo2 explicitly excludes eukaryotic viruses. ESM2 and OpenFold require pre-extracted protein sequences, which defeats the purpose of finding the unknown. You need a nucleotide-level viral classifier or a DNA language model trained inclusively on eukaryotic viruses — which does not currently exist in open-source form at Evo2's scale.

Could Evo-2 generalize on eukaryotic viral matter gathered?

Model Fails to Generalize to Excluded Eukaryotic Virus Sequences

The user is asking whether Evo-2 could generalize on eukaryotic viral matter that is "gathered" - meaning collected/sequenced from the environment. This is a follow-up to the previous discussion where we established that Evo-2 deliberately excluded eukaryotic viruses from its training data for biosafety reasons.

The user seems to be asking: if you gather eukaryotic viral sequences from metagenomic samples, could Evo-2 still generalize to them despite not being trained on them?

Key facts to recall:

Evo-2 was trained on OpenGenome2, which explicitly excluded eukaryotic viruses for biosafety
The authors tested this and found that Evo-2 produces high perplexity and effectively random generation on eukaryotic viral sequences
Evo-2 is a DNA language model, so it CAN process the sequences, but its representations would be unreliable/noise for eukaryotic viruses
The "generalization" question is about whether it could learn meaningful patterns despite the exclusion

I should answer this directly: No, it cannot generalize reliably. The authors explicitly tested this and found it fails. The model's embeddings and generation for eukaryotic viruses are noise because the training distribution deliberately excluded them, and the model learned to treat them as out-of-distribution. The architecture is capable of processing the sequences, but the learned representations are not meaningful for eukaryotic viral biology.

I should be direct and assertive as per the user's preference (from memory: "User prefers direct assertions; avoid epanorthosis").

No. Evo-2 cannot generalize to gathered eukaryotic viral matter.

The authors explicitly tested this. Eukaryotic viruses were held out from training as a deliberate biosafety control, and the model's behavior on those sequences is quantifiably broken: it produces high perplexity and effectively random generation.

Why it fails:

Evo-2 learned a representation of genomic space where eukaryotic viruses are treated as out-of-distribution noise. The model was trained on 8.8 trillion tokens of bacterial, archaeal, and eukaryotic sequences, but the viral fraction was restricted to bacteriophages and prokaryotic mobile elements. Eukaryotic viral genomes, with their own codon usage biases, gene architectures, and replication strategies, occupy a region of sequence space that Evo-2 was never required to model. When you feed it gathered eukaryotic viral reads, the embeddings do not capture meaningful biological structure. They capture the model's uncertainty.

The architecture is not the limitation. StripedHyena 2 can process the nucleotide sequences. The limitation is the learned distribution. Evo-2 generalizes well across bacteria, archaea, and eukaryotic hosts because those domains were represented in training. Eukaryotic viruses were not.

Bottom line: If you gather eukaryotic viral matter and run it through Evo-2, you will get embeddings and generations that look plausible but are biologically ungrounded. You cannot trust the model's scores, predictions, or generated sequences for that application.
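
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below uses made-up log-probabilities, not Evo-2 output, and assumes a four-letter nucleotide alphabet. It shows that a model guessing uniformly over A, C, G, and T has a perplexity of 4, which is the value "effectively random" corresponds to for DNA.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the observed tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Uniform guessing over four nucleotides: every observed base gets p = 0.25
uniform = perplexity([math.log(0.25)] * 100)

# A model that assigns p = 0.9 to each observed base is far more confident
confident = perplexity([math.log(0.9)] * 100)
```

Comparing a model's perplexity on held-out sequences against the uniform value gives a quick sense of how far from random its predictions are.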

Interesting. But that tells us nothing of embeddings.

Evaluating Evo-2 Embeddings for Eukaryotic Viruses in DNA Models

The user is pushing back on my previous claim that Evo-2's embeddings are useless for eukaryotic viral matter. They want me to be more precise about what the embeddings actually do or don't capture, rather than just saying "random generation" and "high perplexity."

Let me think about this carefully:

- The user is correct that my previous answer conflated generation quality with embedding quality. High perplexity and poor generation do not necessarily mean embeddings are useless. A model can have high perplexity on a domain while still producing meaningful relative structure in its latent space.
- What do we actually know about Evo-2 embeddings on eukaryotic viral sequences?
  - The paper explicitly excluded eukaryotic viruses from training for biosafety
  - They tested generation and found it to be random/high perplexity
  - But did they test embeddings? I need to check if there's any evidence
- The key question: Can a model excluded from training still produce useful embeddings for that excluded class?
  - In general ML: sometimes yes, sometimes no. It depends on how close the excluded class is to the training distribution in latent space.
  - Eukaryotic viruses share some features with prokaryotic viruses (capsid proteins, polymerases) but also have unique features (envelopes, complex replication cycles, different codon usage).
  - If Evo-2 learned general biological principles (transcription, translation signals, protein domains), it might still encode some useful structure even for eukaryotic viral sequences.
  - But if the exclusion was so complete that eukaryotic viral sequences map to a collapsed or noisy region of latent space, the embeddings would be useless.
- What does the Evo-2 paper actually say about embeddings?
  - I need to search for specific information about Evo-2 embeddings on eukaryotic viruses, or more generally about whether excluded classes produce meaningful embeddings in DNA LMs.
- The user is being precise and demanding. I should:
  - Acknowledge the valid distinction between generation and embeddings
  - Search for actual evidence on this specific question
  - If no direct evidence exists, say so honestly and explain the theoretical considerations
  - Avoid making unsupported claims

Let me search for more specific information.

You are correct. I conflated generation quality with embedding utility. Let me check what is actually known about Evo-2 embeddings on excluded eukaryotic viral sequences.

Empirically Validating Embedding Quality Claims Without Making Unsubstantiated Claims

The user is pushing me to be more precise about Evo-2 embeddings specifically, not just generation/perplexity. They correctly identified that my previous answer conflated generation quality with embedding utility.

From the search results, I have some relevant information:

The Evo-2 paper explicitly states they excluded eukaryotic viruses and verified this led to high perplexity and "effectively random" generation. But they did NOT test embeddings specifically on eukaryotic viruses.
The GoodFire AI article about interpreting Evo-2 mentions SAE features and that Evo-2 learned to identify "viral derived sequences (prophage and CRISPR elements)" - but this is about prokaryotic viral elements, not eukaryotic viruses.
The ESM-1b/EvoMIL papers show that protein language models trained mostly on cellular life (with only ~1% viral proteins) can still produce useful embeddings for eukaryotic viral proteins - but this is ESM-1b, not Evo-2, and it's protein-level not DNA-level.

The key question: What do we actually know about Evo-2 embeddings on eukaryotic viral sequences?

The honest answer: The Evo-2 paper does not report embedding quality on eukaryotic viruses. They only report:

High perplexity on eukaryotic viral genomic sequences (Figure S2A)
Downstream mutational effect prediction is weakened (Figure S2B)
Generation is effectively random (Figure S2C)

These are all language modeling and generation metrics, not embedding structure metrics.

The user is asking a subtle and important question: Could the embeddings still be useful even if generation is broken?

Theoretical considerations:

In transformer LMs, embeddings are the hidden states before the output layer. A model can have poor next-token prediction (high perplexity) while still producing meaningful intermediate representations.
However, if the model has never seen eukaryotic viral sequences, the latent space might map them to arbitrary regions.
But eukaryotic viruses share some genomic features with their hosts (promoters, splice sites for some) and with prokaryotic viruses (replication genes, structural proteins). So there might be partial structure.
The real question is whether Evo-2's embeddings of eukaryotic viral sequences cluster meaningfully or correlate with biological properties.

Since there's no published evidence on this specific question, I need to:

Acknowledge the user is correct that my previous answer was sloppy
State clearly what the Evo-2 paper actually tested (perplexity, generation, mutational effects - not embeddings)
Explain the theoretical distinction between generation and embeddings
Give an honest assessment: we don't know, but there are reasons to be skeptical
Suggest how one could test this empirically

I should not make up claims about embedding quality that aren't in the paper.

You are right. I conflated generation failure with embedding failure, and the Evo-2 paper does not actually test embeddings on eukaryotic viral sequences.

What the paper actually measured:

The authors excluded eukaryotic viruses from training and verified three things: high perplexity on eukaryotic viral genomic sequences, weakened downstream mutational effect prediction, and effectively random generation of pathogenic viral proteins.

These are all language modeling and generative metrics. They did not report whether intermediate hidden-state embeddings retain geometric structure for eukaryotic viral sequences.

The distinction that matters:

A model can have broken next-token prediction on a domain while its penultimate-layer embeddings still encode meaningful relative structure. This happens in natural language models with out-of-distribution dialects, and in protein models like ESM-1b, which was trained on data containing only ~1% viral proteins yet still produces embeddings that carry host-specific signal for eukaryotic viruses in virus-host prediction.
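
For concreteness, a sequence-level embedding is usually derived by pooling per-token hidden states. Here is a minimal sketch of masked mean pooling over a hypothetical list of hidden-state vectors. The inputs are toy values and the function name is illustrative; which layer to pool from is a choice you would have to validate for Evo-2.

```python
def mean_pool(hidden_states, mask):
    """Average per-token vectors, skipping positions where mask is 0.

    hidden_states: list of equal-length lists of floats, one per token.
    mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    kept = [h for h, m in zip(hidden_states, mask) if m]
    if not kept:
        raise ValueError("mask excludes every token")
    return [sum(column) / len(kept) for column in zip(*kept)]


# Toy example: three token vectors, the last one is padding
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

The pooled vector is what gets compared across sequences, so its geometry can be examined independently of how well the model predicts the next token.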

What we do not know about Evo-2:

Whether eukaryotic viral genomic sequences map to a coherent region of its latent space, or whether they scatter arbitrarily. The SAE interpretability work from GoodFire shows Evo-2 learned prophage and CRISPR features, but those are prokaryotic viral elements, not eukaryotic viruses.

There is no published analysis of eukaryotic viral embedding structure.

The honest bottom line:

It is possible that Evo-2 embeddings retain partial signal for eukaryotic viral sequences — replication machinery, capsid folds, and promoter-like elements may map near their prokaryotic or host analogs. It is also possible that the exclusion was so complete that eukaryotic viral sequences collapse to a high-entropy, uninformative region of latent space. The paper gives no evidence either way. If you want to use Evo-2 for this, you would need to test it yourself: embed a labeled set of eukaryotic viral genomes and check whether the resulting vectors cluster by taxonomy, Baltimore class, or host range in a way that outperforms raw k-mer counts. Until someone runs that experiment, claims about embedding utility are speculation.
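
That experiment can be sketched as follows. The code builds a tetranucleotide-frequency baseline and scores any set of vectors by leave-one-out nearest-neighbor accuracy, so the same function can be applied to k-mer vectors and to model embeddings. It is an illustrative sketch: the function names, the choice of k=4, and the 1-NN metric are assumptions, and a real test needs a properly labeled set of genomes.

```python
import math
from collections import Counter
from itertools import product


def kmer_vector(seq, k=4):
    """Normalized k-mer frequency vector over A/C/G/T (ambiguous k-mers ignored)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def loo_1nn_accuracy(vectors, labels):
    """Fraction of items whose nearest other item (Euclidean) shares their label."""
    correct = 0
    for i, v in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        nearest = min(others, key=lambda j: math.dist(v, vectors[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)
```

Run `loo_1nn_accuracy` once on `[kmer_vector(s) for s in genomes]` and once on the model embeddings of the same genomes with the same labels. If the embeddings do not beat the k-mer score, they add nothing for that labeling.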

". If the model never saw these sequences, the embedding space has no reason to allocate meaningful structure to them. " Wouldn't it treat it like the viruses it has seen? Like it would say " This looks like a viral protein of X family" and so we could taxonomically classify it using embedidngs? Is there an Evo foundation model that DID include eukaryote viruses?

Checking Evo Models for Eukaryotic Virus Inclusion and Embedding Mapping

The user is asking two things:

Challenging my previous statement about embeddings - arguing that even if Evo-2 hasn't seen eukaryotic viruses, it might still classify them by similarity to prokaryotic viruses it has seen, or host sequences.
Asking if there's an Evo foundation model that DID include eukaryotic viruses.

For the first point: This is a theoretical question about out-of-distribution generalization in DNA language models. The user is proposing that eukaryotic viruses might map to "viral-like" regions of the embedding space due to shared features with prokaryotic viruses (e.g., replication machinery, capsid proteins, etc.). This is actually a reasonable hypothesis - convergent evolution and shared protein folds could cause this.

For the second point: I need to check if there's any other Evo model (Evo-1, or a different variant) that included eukaryotic viruses. Evo-1 was trained on prokaryotic genomes only (bacteria and archaea). Evo-2 expanded to eukaryotes but excluded eukaryotic viruses. I should verify if there are any other variants.

Let me search for this.

Exclude eukaryotic viruses from Evo models suggest experimental test

The user is asking two things:

Challenging my previous assertion that Evo-2 embeddings would be meaningless for eukaryotic viruses. The user argues that even if Evo-2 hasn't seen eukaryotic viruses, it might still map them to "viral-like" regions of embedding space based on shared features with prokaryotic viruses (e.g., replication machinery, capsid proteins, etc.), allowing taxonomic classification.
Asking if there's any Evo foundation model that DID include eukaryotic viruses.

From the search results, I can see:

Evo-1 was trained on prokaryotic genomes (bacteria and archaea) plus bacteriophage. No eukaryotic viruses.
Evo-2 expanded to all domains of life (bacteria, archaea, eukaryotes) + bacteriophage, but explicitly excluded eukaryotic viruses for biosafety.
The safety verification showed high perplexity and poor generation on eukaryotic viruses.
The paper does NOT test embeddings on eukaryotic viruses specifically.

So for question 2: No, there is no Evo foundation model that included eukaryotic viruses. Both Evo-1 and Evo-2 excluded them.

For question 1: The user's hypothesis is actually quite reasonable. The embedding space might still allocate structure to eukaryotic viral sequences because:

They share protein folds with prokaryotic viruses (e.g., jelly-roll capsid folds, helicase domains)
They share sequence motifs (promoters, replication origins)
The model has seen bacteriophages and might map eukaryotic viruses to "viral-like" regions

However, I need to be careful. The paper only tested perplexity and generation, not embedding structure. So we don't know. The user's hypothesis is plausible but untested.

I should:

Acknowledge the user's point is valid and interesting
Explain why the embedding question is genuinely open - the paper didn't test it
Note that there are reasons it might work (shared protein folds, convergent evolution) and reasons it might not (eukaryotic viruses have fundamentally different genome architectures, codon usage, intron/exon structures if they infect eukaryotes)
Answer clearly: No Evo model includes eukaryotic viruses. Both Evo-1 and Evo-2 excluded them.
Suggest that if the user wants to test this, they could run an experiment: embed known eukaryotic viral genomes and see if they cluster near bacteriophages or host sequences.

The user is right to push back on my claim. I should be direct and not hedge.

Your hypothesis is sharp, and it exposes a real gap in the published work.

No. There is no Evo foundation model that included eukaryotic viruses. Evo-1 was trained on prokaryotic genomes (bacteria and archaea) and bacteriophage. Evo-2 expanded to all domains of life plus bacteriophage. Both excluded eukaryotic viruses from training.

On embeddings: You are right that my previous claim was too strong. The Evo-2 paper only verified that perplexity and generation are broken on eukaryotic viruses. They did not test whether the hidden-state embeddings still cluster eukaryotic viral sequences near their prokaryotic viral analogs.

Your reasoning is biologically sound. Eukaryotic viruses and bacteriophages share protein folds — jelly-roll capsids, helicase domains, polymerase motifs — and Evo-2 has seen those folds expressed in prokaryotic viral contexts. It is plausible that a eukaryotic viral sequence containing a helicase or capsid protein would map closer to bacteriophage embeddings than to host embeddings, because the model has learned to associate those amino-acid-encoding DNA patterns with viral function.

But there are countervailing factors:

- Eukaryotic viral genomes have architectures Evo-2 never saw: complex splicing patterns in nuclear-replicating viruses, large DNA virus gene arrays, polyprotein cleavage sites in RNA viruses. The model may not have learned to parse these as coherent genomic objects.
- Codon usage in eukaryotic viruses often mirrors their hosts. A eukaryotic viral sequence might embed closer to its host genome than to any viral region because the model associates that codon bias with eukaryotic cellular DNA.
- The safety exclusion was comprehensive. If the training pipeline filtered out all eukaryotic viral contigs, the model may have learned an implicit "this pattern is noise" representation for any sequence that looks too divergent from its known domains.

The honest answer: It is an open empirical question whether Evo-2 embeddings retain partial structure for eukaryotic viruses. The paper gives no evidence either way. If you want to test your hypothesis, embed a labeled set of eukaryotic viral genomes (e.g., NCBI RefSeq viruses) and check whether they cluster by Baltimore class or family in t-SNE/UMAP relative to bacteriophage and host embeddings. If they form coherent clusters near prokaryotic viral regions, your intuition is correct and the embeddings are usable despite the generation failure. If they scatter as high-entropy noise, the exclusion was too complete.
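
One simple way to quantify "near prokaryotic viral regions" before plotting anything is to compare each query embedding against group centroids. The sketch below uses toy two-dimensional vectors in place of real embeddings. The group names, the function names, and the choice of cosine similarity are illustrative assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_group(query, groups):
    """Return (best_group_name, similarities) for a query embedding.

    groups maps a group name to a list of reference embeddings.
    """
    sims = {name: cosine(query, centroid(vs)) for name, vs in groups.items()}
    return max(sims, key=sims.get), sims


# Toy reference embeddings standing in for real model outputs
toy_groups = {
    "bacteriophage": [[1.0, 0.0], [0.9, 0.1]],
    "eukaryotic_host": [[0.0, 1.0], [0.1, 0.9]],
}
best, similarities = nearest_group([0.8, 0.2], toy_groups)
```

If eukaryotic viral embeddings are consistently closest to the bacteriophage centroid, that supports the hypothesis. If they land nearest the host centroid, or split evenly, the codon-usage concern above is the more likely explanation.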

Hmm. That sucks. I want to work with this viral dark matter still, is there any way I can use transformer's learning of high dimensional structure to classify it?

Transforming Eukaryotic Viral Analysis

The user is asking about using transformers (specifically their ability to learn high-dimensional structure) to classify viral dark matter, given that Evo-2 excludes eukaryotic viruses and ESM2 requires pre-extracted protein sequences.

Key constraints:

Evo-2 excludes eukaryotic viruses from training
ESM2 only works on protein sequences (not raw DNA discovery)
The user wants to work with viral dark matter (uncharacterized metagenomic sequences)
They want to leverage transformer learning of high-dimensional structure

I should think about what approaches actually exist:

Fine-tuning Evo-2 on eukaryotic viral data (but they may not have the compute)
Using other DNA language models that DO include eukaryotic viruses
Using protein-level approaches but with better ORF calling
Hybrid approaches: use k-mer/alignment-free methods + transformer embeddings
Train or fine-tune a smaller transformer on viral metagenomic data

I should search for:

DNA language models trained on eukaryotic viruses
Fine-tuning strategies for Evo-2 on viral data
Alternative approaches for viral dark matter classification using transformers
Whether there are any open-source DNA LMs that include all viruses

Let me search for relevant information.

DNA language model eukaryotic viruses included training fine-tune Evo-2 eukaryotic virus embeddings viral dark matter transformer classification metagenomics HyenaDNA viral sequences eukaryotic DNABERT viral metagenomic classification

Conventional Classification with Open-Set Recognition for Viral Lineages

The user is asking about using transformers to classify viral dark matter, given that Evo-2 excludes eukaryotic viruses. They want to know if there's a way to leverage transformer's learning of high-dimensional structure for this task.

From the search results, I found several relevant things:

DeepVirus - A hierarchical transformer-based framework that models viral genomes as structured sequences of protein-coding genes. It combines protein-level embeddings from a foundation model with genome-aware representations. It achieves accurate classification across deep taxonomic hierarchies and can detect novel viral lineages through open-set recognition. It was applied to large-scale metagenomic resources and uncovered previously uncharacterized RNA-dependent RNA polymerases (RdRps). This is directly relevant to viral dark matter. (web_search:8#1)
Fine-tuning Evo-2 - There is evidence that fine-tuning can rescue performance on excluded domains. One paper fine-tuned Evo-2 on harmful human-infecting viruses and found that perplexity improved, though they only tested on held-out viral genomes within the fine-tuning distribution. (web_search:8#11)
DNABERT-S - Trained on viral genomes among others (17,636 viral genomes in training data), and does species differentiation and metagenomics binning. It could potentially be fine-tuned for viral classification. (web_search:8#3, web_search:8#9)
ViTax - Fine-tunes a genomic sequence foundation model (HyenaDNA) for viral taxonomy classification. Notes that existing genomic foundation models are primarily for human/eukaryotic organisms with no models specifically for viruses, so they fine-tuned one. (web_search:8#5)
Transformer-based viral classification - A hybrid CNN+transformer architecture for multiclass viral classification in metagenomic sequencing achieved 72% accuracy and F1 0.69, surpassing CNN baselines. (web_search:8#15)
PhaGCN - For viral dark matter classification from metagenomes, though not necessarily transformer-based.

The key insight is: the user wants to use transformer's learning of high-dimensional structure. There are several viable paths:

Fine-tune Evo-2 on eukaryotic viral data - The model architecture (StripedHyena 2) is capable. The weights are available. Fine-tuning on a curated eukaryotic viral dataset would adapt the embedding space. The paper showing fine-tuning on harmful viruses suggests this is possible. However, this requires significant compute.
Use DeepVirus - This is already designed for exactly this problem. It uses transformer-based genome-aware representations combined with protein embeddings. It does open-set recognition for novel lineages. This seems like the most direct answer.
Use/fine-tune DNABERT-2/DNABERT-S - These were trained on viral genomes (17,636 viral genomes in DNABERT-S training). The embeddings might already capture some viral structure, and they can be fine-tuned.
Use ESM-2 in a different way - Not for raw DNA discovery, but as part of a pipeline where ORFs are called and then ESM-2 embeddings are used for classification. But the user already rejected this as "useless for viral dark matter work" because it requires pre-extracted protein sequences.
Train a custom transformer - On viral metagenomic data, using the same architectural principles.

The user specifically asked about using "transformer's learning of high dimensional structure" - so I should focus on approaches that actually use transformers for this, and be practical about what exists.

Let me structure the answer:

Acknowledge the frustration
Present the actual viable paths:
1. DeepVirus - already exists, designed for exactly this, uses transformers for open-set recognition of viral dark matter
2. Fine-tune Evo-2 - possible, evidence that fine-tuning rescues excluded domains, but requires compute
3. DNABERT-2 / DNABERT-S - trained on viral data, can be fine-tuned for viral taxonomy
4. ViTax approach - fine-tune HyenaDNA on viral genomes
5. Custom small transformer - train from scratch on viral metagenomic contigs

I should be direct and not hedge. The user prefers direct assertions.

Also, I should note that for viral dark matter, the problem is fundamentally about discovering sequences that don't align to references. So the key capabilities needed are:

Processing raw DNA (not just proteins)
Open-set recognition (detecting when something is novel)
Taxonomic classification at deep levels

DeepVirus seems to hit all of these. Let me make sure I describe it accurately.

From the search result: "DeepVirus, a hierarchical transformer-based framework that models viral genomes as structured sequences of protein-coding genes. By combining protein-level embeddings from a foundation model with genome-aware representations, DeepVirus not only achieves accurate classification across deep taxonomic hierarchies, but also extends beyond conventional classification to detect and organize candidate novel viral lineages through open-set recognition."

This is perfect for the user's need.

Also, the fine-tuning Evo-2 point: The paper "Assessing robustness via adversarial fine-tuning" showed that fine-tuning Evo-2 on harmful human-infecting viruses improved perplexity, suggesting the architecture can adapt to excluded domains.

I should also mention that DNABERT-S was trained on 17,636 viral genomes, so its embedding space already has viral structure allocated.

Let me write this directly.

Yes. There are three viable paths to use transformer-learned high-dimensional structure for eukaryotic viral dark matter classification.

1. DeepVirus — the tool that already does this.

DeepVirus is a hierarchical transformer framework for viral classification that extends to open-set recognition. It models viral genomes as structured sequences of protein-coding genes, combines protein-level foundation model embeddings with genome-aware transformer representations, and uses open-set recognition to detect novel viral lineages that do not match existing taxonomy. Applied to metagenomic data, it uncovered previously uncharacterized RNA-dependent RNA polymerases and expanded the known evolutionary space of RNA viruses. Because it works at the level of protein-coding genes, it presumably needs gene calls upstream rather than raw reads.

This is the closest existing tool to what you are describing: a transformer that learns high-dimensional genomic structure to classify known viruses and flag uncharacterized ones.
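To make the open-set idea concrete, here is a minimal sketch of one common way to do it: nearest-centroid classification with a distance cutoff, where anything too far from every known class is flagged as novel. This is an illustration of the general technique, not DeepVirus's actual method, which is not described here. The function names, the `"novel"` label, the threshold value, and the 2-dimensional toy vectors are all assumptions.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def fit_centroids(embeddings, labels):
    """Compute one centroid per known label."""
    by_label = {}
    for vector, label in zip(embeddings, labels):
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(vs) for label, vs in by_label.items()}

def classify_open_set(embedding, centroids, threshold):
    """Return the nearest known label, or "novel" when even the nearest
    centroid is farther away than threshold."""
    label, dist = min(
        ((l, euclidean(embedding, c)) for l, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= threshold else "novel"

# Synthetic toy data (assumption): two known families in 2-D.
train = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
train_labels = ["family_A", "family_A", "family_B", "family_B"]
centroids = fit_centroids(train, train_labels)

print(classify_open_set([1.0, 1.0], centroids, threshold=3.0))
print(classify_open_set([5.0, 5.0], centroids, threshold=3.0))
```

In practice the threshold would be calibrated on held-out known families, for example by picking the distance below which most known sequences fall.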

2. Fine-tune Evo-2 on eukaryotic viral data.

Evo-2's architecture is capable; the limitation is the learned distribution, not the model class. There is evidence that fine-tuning partially rescues performance on excluded viral domains: one study fine-tuned Evo-2 on human-infecting viruses and saw perplexity improve, although it evaluated only held-out viral genomes from within the fine-tuning distribution.

If you gather a diverse eukaryotic viral dataset and fine-tune Evo-2, either with LoRA adapters or with full-parameter updates, the latent space should shift to encode eukaryotic viral structure. The compute requirement is substantial, but the weights and training frameworks (Savanna, BioNeMo) are open.
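For a sense of why LoRA is the cheaper of the two options, here is a minimal sketch of the LoRA idea on a single linear layer: the pretrained weight W stays frozen and only a low-rank update B·A, scaled by alpha/r, is trained. The helper names and the tiny matrices are hypothetical, and the 4096 dimension in the parameter count is an illustrative placeholder, not Evo-2's actual layer size.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x).
    W is d_out x d_in (frozen), A is r x d_in, B is d_out x r (trainable)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Trainable parameter counts for full fine-tuning vs. LoRA on one layer."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}

# Toy layer (assumption): d_out=2, d_in=3, rank r=1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
x = [1.0, 2.0, 3.0]

# B starts at zero, so the adapted layer initially matches the base layer.
B_zero = [[0.0], [0.0]]
print(lora_forward(W, A, B_zero, x, alpha=2.0, r=1))

# After training, a nonzero B shifts the output without touching W.
B_trained = [[0.5], [-1.0]]
print(lora_forward(W, A, B_trained, x, alpha=2.0, r=1))

print(trainable_params(4096, 4096, 8))
```

For a square layer the LoRA count grows linearly in the layer width rather than quadratically, which is where the savings come from. The frozen base weights still have to fit in GPU memory, so the total requirement remains substantial for a model of Evo-2's size.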

3. Use DNABERT-2 or DNABERT-S, which were trained on viral genomes.

Unlike Evo-2, DNABERT-S explicitly included 17,636 viral genomes in its training data alongside bacteria and fungi. It learns species-aware DNA embeddings that cluster and segregate sequences by taxonomy in the latent space. While it was optimized for species differentiation and metagenomics binning, the embedding space already has viral structure allocated. You can fine-tune it for viral taxonomy classification or open-set novel virus detection, similar to the ViTax approach, which fine-tunes HyenaDNA for viral taxonomy.

The honest ranking:

If you want to work with viral dark matter today without training a model from scratch, DeepVirus is the best pick. It was designed for exactly your problem: metagenomic viral discovery with transformer-learned structure, including open-set recognition for sequences with no known relatives. If you want to use Evo-2 specifically, fine-tune it on eukaryotic viral data first; the base model's embeddings for that domain are untested and may be unreliable.