UniRef50 Protein Sequence Dataset

Date: 3 months ago

Publish URL: www.uniprot.org

Paper URL: arxiv.org

The UniRef50 protein sequence dataset comes from the UniProt Knowledgebase (UniProtKB). It is associated with the paper "AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model".

The dataset is built from UniProtKB and UniParc sequences through iterative clustering (UniProtKB+UniParc → UniRef100 → UniRef90 → UniRef50) and contains 41,546,293 training sequences and 82,929 validation sequences. Each clustering stage removes more redundancy, so UniRef50 offers a non-redundant, diverse set of sequences with broad coverage of protein sequence space, which makes it well suited to training protein language models.
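
As a rough illustration of how the split sizes above relate, and of how sequences of this kind are commonly read, here is a minimal Python sketch. It assumes the sequences are available as FASTA text, which this page does not confirm. The function names `read_fasta` and `validation_fraction` and the two sample records are hypothetical, made up for the example.

```python
import io
from typing import Iterator, TextIO, Tuple

# Split sizes as stated on this page.
TRAIN_SEQUENCES = 41_546_293
VALIDATION_SEQUENCES = 82_929


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text.

    Assumption: the data is plain FASTA, with '>' starting each header
    and the sequence possibly wrapped over several lines.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def validation_fraction(train: int, validation: int) -> float:
    """Return the share of all sequences held out for validation."""
    return validation / (train + validation)


# Hypothetical records, not taken from the real dataset.
SAMPLE = """>example_cluster_1 made-up record
MKTAYIAKQR
QISFVKSHFS
>example_cluster_2 made-up record
MSDNELLK
"""

records = list(read_fasta(io.StringIO(SAMPLE)))
total = TRAIN_SEQUENCES + VALIDATION_SEQUENCES
fraction = validation_fraction(TRAIN_SEQUENCES, VALIDATION_SEQUENCES)

if __name__ == "__main__":
    print(f"parsed {len(records)} sample records")
    print(f"total sequences: {total:,}")
    print(f"validation share: {fraction:.4%}")
```

With the counts given here, the validation set is a small fraction of one percent of the data, so nearly all sequences are used for training.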
