Aligning Protein Sequences Using Embedding Vectors

Welcome to the E-score web server! The server aligns two protein sequences using local, global, or semi-global alignment; follow the instructions below to get started. Our algorithms build on traditional dynamic programming methods (Smith-Waterman and Needleman-Wunsch) with an enhanced scoring function. A protein embedding model represents each amino acid in a sequence as a numerical vector that captures its biochemical properties and semantic meaning. During alignment, the scoring function compares the vectors of two residues directly by their cosine similarity.
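The idea of scoring by cosine similarity can be illustrated with a minimal Python sketch. It is not the E-score implementation: the function names are hypothetical, the embeddings are random stand-ins for real model output, the gap penalty is linear for brevity, and the way the score shift enters the local variant (a constant added to every similarity) is an assumption made only for illustration.

```python
import numpy as np


def cosine_similarity_matrix(emb_a, emb_b):
    """Pairwise cosine similarity between the rows of emb_a (n x d) and emb_b (m x d)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return a @ b.T


def global_alignment_score(sim, gap=-1.0):
    """Needleman-Wunsch score using sim[i, j] as the residue-pair score.

    Assumption: a single linear gap penalty, chosen to keep the sketch short.
    """
    n, m = sim.shape
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = gap * np.arange(n + 1)  # leading gaps in the second sequence
    dp[0, :] = gap * np.arange(m + 1)  # leading gaps in the first sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(
                dp[i - 1, j - 1] + sim[i - 1, j - 1],  # align residue i with j
                dp[i - 1, j] + gap,                    # gap in the second sequence
                dp[i, j - 1] + gap,                    # gap in the first sequence
            )
    return float(dp[n, m])


def local_alignment_score(sim, gap=-1.0, shift=-0.5):
    """Smith-Waterman score; `shift` is added to every similarity (assumed usage).

    A negative shift pushes dissimilar pairs below zero, which local alignment
    needs so that poor regions are cut off instead of being extended.
    """
    n, m = sim.shape
    dp = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(
                0.0,                                           # start a new local alignment
                dp[i - 1, j - 1] + sim[i - 1, j - 1] + shift,  # align residue i with j
                dp[i - 1, j] + gap,
                dp[i, j - 1] + gap,
            )
    return float(dp.max())


# Toy data: 5 "residues" with 8-dimensional random vectors, aligned to themselves.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
sim = cosine_similarity_matrix(emb, emb)
global_score = global_alignment_score(sim)
local_score = local_alignment_score(sim)
print(global_score, local_score)
```

With real data, `emb_a` and `emb_b` would hold one row per residue, produced by an embedding model such as ProtT5 or ESM2.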

Our testing identified ProtT5 as the best-performing embedding model, consistently outperforming traditional BLOSUM scoring matrices. Due to computational constraints, this server offers only two models: ProtT5 (the default) and ESM2. Any embedding model can be used with the source code, and we provide setup information for six models (ProtT5, ESM2, ProtBert, ProtAlbert, ESM1b, and ProtXLNet) on our GitHub page.

User requests are processed in a queue, which may cause delays. Average wait times range from a few minutes to an hour, depending on protein length and server traffic. For quicker results, we recommend running the source code directly. Once computed, results are available for download and, if you provided an email address, are also emailed to you. Please note that the email may be filtered to your junk folder.

Instructions:

1. Select an embedding model and alignment type.
2. Set the gap penalties and, for local alignment, the score shift. To the best of our knowledge, the default values are near optimal.
3. Provide two protein sequences in FASTA format, e.g.:
>gi_169791723
PRQCRICGGLAMYECRECYDDPDISAGKIKQFCKTCNTQVHLHPKRLNHKYNPVSLP
>gi_678986944
PRQCRICGGLAMFECRECYEDTDISAGKIKQFCKTCNTQVHLHPKRQSHKFNPLSLP
4. Optionally, provide an email address to receive the results.
5. Click "Align."
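
Input for step 3 can be checked locally before submitting. The sketch below is a small, hypothetical helper (`parse_fasta` is not part of the server or the E-score code) that splits FASTA text into header and sequence pairs and confirms that exactly two records are present, using the example sequences above.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi_169791723
PRQCRICGGLAMYECRECYDDPDISAGKIKQFCKTCNTQVHLHPKRLNHKYNPVSLP
>gi_678986944
PRQCRICGGLAMFECRECYEDTDISAGKIKQFCKTCNTQVHLHPKRQSHKFNPLSLP
"""

records = parse_fasta(example)
if len(records) != 2:
    raise ValueError(f"expected two sequences, found {len(records)}")
for header, sequence in records:
    print(header, len(sequence))
```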

Source Code

The E-score source code is available on GitHub.

Reference

Sepehr Ashrafzadeh, G.B. Golding, S. Ilie, L. Ilie, Scoring alignments by embedding vector similarity.

Contact Information

Lucian Ilie: ilie@uwo.ca,
Department of Computer Science, University of Western Ontario

Julia Malec: jmalec@uwo.ca,
Department of Computer Science, University of Western Ontario

Sepehr Ashrafzadeh: sashra29@uwo.ca,
Department of Computer Science, University of Western Ontario