Remember to carefully read the documentation available on the web pages, as a lot of useful information can be found there.
The aim of this exercise is to explore and understand the PROSITE database.
First have a look at the following PROSITE entry: PS50235.
Analyze the following sequence using ScanProsite:
>seq1
MELRVLLCWASLAAALEETLLNTKLETADLKWVTFPQVDGQWEELSGLDEEQHSVRTYEV
CDVQRAPGQAHWLRTGWVPRRGAVHVYATLRFTMLECLSLPRAGRSCKETFTVFYYESDA
DTATALTPAWMENPYIKVDTVAAEHLTRKRPGAEATGKVNVKTLRLGPLSKAGFYLAFQD
QGACMALLSLHLFYKKCAQLTVNLTRFPETVPRELVVPVAGSCVVDAVPAPGPSPSLYCR
EDGQWAEQPVTGCSCAPGFEAAEGNTKCRACAQGTFKPLSGEGSCQPCPANSHSNTIGSA
VCQCRVGYFRARTDPRGAPCTTPPSAPRSVVSRLNGSSLHLEWSAPLESGGREDLTYALR
CRECRPGGSCAPCGGDLTFDPGPRDLVEPWVVVRGLRPDFTYTFEVTALNGVSSLATGPV
PFEPVNVTTDREVPPAVSDIRVTRSSPSSLSLAWAVPRAPSGAVLDYEVKYHEKGAEGPS
SVRFLKTSENRAELRGLKRGASYLVQVRARSEAGYGPFGQEHHSQTQLDESEGWREQLAL
IAGTAVVGVVLVLVVIVVAVLCLRKQSNGREAEYSDKHGQYLIGHGTKVYIDPFTYEDPN
EAVREFAKEIDVSYVKIEEVIGAGEFGEVCRGRLKAPGKKESCVAIKTLKGGYTERQRRE
FLSEASIMGQFEHPNIIRLEGVVTNSMPVMILTEFMENGALDSFLRLNDGQFTVIQLVGM
LRGIASGMRYLAEMSYVHRDLAARNILVNSNLVCKVSDFGLSRFLEENSSDPTYTSSLGG
KIPIRWTAPEAIAFRKFTSASDAWSYGIVMWEVMSFGERPYWDMSNQDVINAIEQDYRLP
PPPDCPTSLHQLMLDCWQKDRNARPRFPQVVSALDKMIRNPASLKIVARENGGASHPLLD
QRQPHYSAFGSVGEWLRAIKMGRYEESFAAAGFGSFELVSQISAEDLLRIGVTLAGHQKK
ILASVQHMKSQAKPGTPGGTGGPAPQY
Now look at this sequence from a patient with a cardiovascular disease.
>seq2
MELRVLLCWASLAAALEETLLNTKLETADLKWVTFPQVDGQWEELSGLDEEQHSVRTYEV
CDVQRAPGQAHWLRTGWVPRRGAVHVYATLRFTMLECLSLPRAGRSCKETFTVFYYESDA
DTATALTPAWMENPYIKVDTVAAEHLTRKRPGAEATGKVNVKTLRLGPLSKAGFYLAFQD
QGACMALLSLHLFYKKCAQLTVNLTRFPETVPRELVVPVAGSCVVDAVPAPGPSPSLYCR
EDGQWAEQPVTGCSCAPGFEAAEGNTKCRACAQGTFKPLSGEGSCQPCPANSHSNTIGSA
VCQCRVGYFRARTDPRGAPCTTPPSAPRSVVSRLNGSSLHLEWSAPLESGGREDLTYALR
CRECRPGGSCAPCGGDLTFDPGPRDLVEPWVVVRGLRPDFTYTFEVTALNGVSSLATGPV
PFEPVNVTTDREVPPAVSDIRVTRSSPSSLSLAWAVPRAPSGAVLDYEVKYHEKGAEGPS
SVRFLKTSENRAELRGLKRGASYLVQVRARSEAGYGPFGQEHHSQTQLDESEGWREQLAL
IAGTAVVGVVLVLVVIVVAVLCLRKQSNGREAEYSDKHGQYLIGHGTKVYIDPFTYEDPN
EAVREFAKEIDVSYVKIEEVIGAGEFGEVCRGRLKAPGKKESCVAISTLKGGYTERQRRE
FLSEASIMGQFEHPNIIRLEGVVTNSMPVMILTEFMENGALDSFLRLNDGQFTVIQLVGM
LRGIASGMRYLAEMSYVHRDLAARNILVNSNLVCKVSDFGLSRFLEENSSDPTYTSSLGG
KIPIRWTAPEAIAFRKFTSASDAWSYGIVMWEVMSFGERPYWDMSNQDVINAIEQDYRLP
PPPDCPTSLHQLMLDCWQKDRNARPRFPQVVSALDKMIRNPASLKIVARENGGASHPLLD
QRQPHYSAFGSVGEWLRAIKMGRYEESFAAAGFGSFELVSQISAEDLLRIGVTLAGHQKK
ILASVQHMKSQAKPGTPGGTGGPAPQY
Hint - a literature reference.
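Before turning to the literature, it can help to locate exactly where the patient sequence differs from the reference. The following is a minimal sketch, assuming both sequences have the same length (substitutions only, no insertions or deletions). The function name `substitutions` and the toy fragments are illustrative and are not part of the exercise data.

```python
def substitutions(ref: str, alt: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference residue, variant residue) per mismatch."""
    # Remove all whitespace so sequences pasted in 60-residue blocks compare cleanly.
    ref = "".join(ref.split())
    alt = "".join(alt.split())
    if len(ref) != len(alt):
        raise ValueError("sequences differ in length; align them first")
    return [(i, a, b) for i, (a, b) in enumerate(zip(ref, alt), start=1) if a != b]


# Toy fragments for illustration only (hypothetical, not the exercise sequences).
toy_ref = "MKTAYIAK"
toy_alt = "MKTAYLAK"
for pos, a, b in substitutions(toy_ref, toy_alt):
    print(f"{a}{pos}{b}")  # prints I6L
```

Pasting the full seq1 and seq2 strings in place of the toy fragments gives the position to compare against the ScanProsite match and the PROSITE documentation.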
The HAMAP-Scan "Scan" mode can be used to classify protein sequences using HAMAP profiles, while the HAMAP-Scan "Scan & Annotate" mode also provides annotation covering individual sequences and complete proteomes. Both are available here.
To save time, we have annotated a number of proteomes and individual sequences for you. The results for each of them can be retrieved from the HAMAP-Scan results page using the access codes provided in the following tables.
Individual sequences:

Species name | Taxonomic identifier | HAMAP-Scan access code |
---|---|---|
Escherichia coli (strain K12) | 83333 | FGV |
Bacillus cereus var. anthracis (strain CI) | 637380 | IIO |
Retrieve the annotations for the individual sequences of Escherichia coli (strain K12) and Bacillus cereus var. anthracis (strain CI) shown below at the HAMAP-Scan results page. At the same time, copy/paste each of these sequences into the HAMAP-Scan search box here and use the simple "Scan" mode to search all HAMAP profiles in real time for matches to each of the sequences.
>Escherichia coli
MLKIFNTLTRQKEEFKPIHAGEVGMYVCGITVYDLCHIGHGRTFVAFDVVARYLRFLGYK
LKYVRNITDIDDKIIKRANENGESFVAMVDRMIAEMHKDFDALNILRPDMEPRATHHIAE
IIELTEQLIAKGHAYVADNGDVMFDVPTDPTYGVLSRQDLDQLQAGARVDVVDDKRNPMD
FVLWKMSKEGEPSWPSPWGAGRPGWHIECSAMNCKQLGNHFDIHGGGSDLMFPHHENEIA
QSTCAHDGQYVNYWMHSGMVMVDREKMSKSLGNFFTVRDVLKYYDAETVRYFLMSGHYRS
QLNYSEENLKQARAALERLYTALRGTDKTVAPAGGEAFEARFIEAMDDDFNTPEAYSVLF
DMAREVNRLKAEDMAAANAMASHLRKLSAVLGLLEQEPEAFLQSGAQADDSEVAEIEALI
QQRLDARKAKDWAAADAARDRLNEMGIVLEDGPQGTTWRRK
>Bacillus cereus var. anthracis
MTIHIYNTLTRQKEEFTPLEENKVKMYVAGPTVYNYIHIGNARPPMVFDTVRRYLEYKGY
DVQYVSNFTDVDDKLIKAANELGEDVPTIADRFVEAYFEDVTALGCKHATVHPRVTENMD
IIIEFIQELVNKGYAYESEGDVYFRTKEFEGYGKLSHQPIADLRHGARIEVGEKKQDPLD
FALWKAAKEGEIFWESPWGQGRPGWHIECSAMARKYLGDTIDIHAGGQDLAFPHHENEIA
QSEALTGKTFARYWMHNGYININNEKMSKSLGNFILVHDIIKQYDPQLIRFFMLSVHYRH
PINFSEELLQSTNNGLERIKTAYGNLKHRMESSTDLTDHNEKWLADLEKFQTAFEEAMND
DFNTANAITELYNVANHANQYLLEEHTSTVVIEAYVKQLETLFDILGLELAQEELLDEEI
EELIQKRIEARKNRDFALSDQIRDDLKDRNIILEDTAQGTRWKRG
At the end of 2019, a novel coronavirus (2019-nCoV) of animal origin started infecting humans, initiating a severe outbreak in China. On January 30th, 2020, the outbreak of 2019-nCoV was designated a global health emergency by the WHO. On February 11th, 2020, the WHO named the disease caused by the virus COVID-19, and the virus itself was named Severe Acute Respiratory Syndrome-related coronavirus 2 or SARS-CoV-2 by the International Committee on Taxonomy of Viruses (ICTV).
SARS-CoV-2 belongs to the large family of Coronaviridae. This family comprises mainly vertebrate respiratory viruses, including HCoV-OC43, which is responsible for 10% of common colds, and SARS-CoV, which caused an epidemic in 2003. The novel coronavirus genome has been sequenced.
SARS-CoV-2 protein sequences from the current public health emergency have been made available as a pre-release dataset on the UniProt FTP site.
Download the 14 protein sequences of SARS-CoV-2 in FASTA format from UniProt here and submit them to HAMAP-Scan.
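Before submitting, it is worth checking that the downloaded file really contains the expected number of records. The sketch below parses FASTA text into a dictionary and reports the number of entries and their lengths. The function name `read_fasta` and the two example records are hypothetical and are used for illustration only; in practice, the text would be read from the file downloaded from UniProt.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Hypothetical records for illustration (not real SARS-CoV-2 proteins).
example = """>protA hypothetical example
MKTAYIAK
QRQISFVK
>protB hypothetical example
MSDNE
"""
records = read_fasta(example)
print(len(records), "records")
for name, seq in records.items():
    print(name, len(seq))
```

For the SARS-CoV-2 dataset described above, the record count should be 14.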
Next, retrieve the annotations for the SARS-CoV-2 sequences from the HAMAP-Scan results page using the access codes provided below.
SARS-CoV-2:

Species name | Taxonomic identifier | HAMAP-Scan access code |
---|---|---|
Coronaviridae | 11118 | FGU |
You are working with a family of proto-oncogene proteins and have identified a potential functional region that is conserved in several related proteins. You have built a multiple sequence alignment of this region from these proteins, and now you would like to identify other proteins having this signature.
Seq1 WFFKGIADKDAERHLLA
Seq2 WFFKNLEQKDAEARLLA
Seq3 WFFKR---KDAERQLLA
Seq4 WFFGTI---DAERQLLA
Seq5 WFFKDIPTKDAERQLLA
Seq6 WYFG----RESERLLLA
Seq7 WYFGKIPLKDAERQLLA
Seq8 WYFGKLRAKDTERLLLL
The first thing to do now is to check the quality of your pattern.
Search the UniProtKB/Swiss-Prot database with your pattern using ScanProsite.
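For a quick local sanity check before running the database search, a PROSITE-style pattern can be translated into a regular expression and tried on a few sequences. The sketch below covers the common syntax elements only: `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repetition, and `<` / `>` as N- and C-terminal anchors. The function name `prosite_to_regex` and the example pattern are hypothetical, and anchors placed inside square brackets are not handled. It is not a substitute for ScanProsite.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # drop the terminating period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("{", "[^").replace("}", "]")  # {..} = excluded residues
        element = element.replace("(", "{").replace(")", "}")   # (n,m) = repetition
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


# Hypothetical pattern for illustration (not a real PROSITE entry).
regex = prosite_to_regex("[DE]-R-x(2)-{P}-[ILV].")
print(regex)  # [DE]R.{2}[^P][ILV]

match = re.search(regex, "AAERGGALK")
print(match.group(), match.start() + 1)  # ERGGAL 3
```

A pattern that matches far more local test sequences than expected is likely to be too permissive in the database search as well.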
Repeat the exercise with the following sequences:
seq1 ERGLAAAR
seq2 DRVSCLIR
seq3 DRLGSGGR
seq4 ERAALILR
seq5 ERIVVTVR
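Because this second alignment contains no gaps, a first draft of the pattern can be derived mechanically by listing the residues observed in each column. The sketch below does exactly that and nothing more; the function name `columns_to_pattern` is hypothetical. The result is only a literal starting point: it allows exactly the residues seen in these five sequences, so it usually needs manual generalisation (for example, replacing highly variable columns with `x`) before it makes a useful signature.

```python
def columns_to_pattern(aligned: list[str]) -> str:
    """Build a literal PROSITE-style pattern from an ungapped alignment."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all sequences must have the same length")
    elements = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if "-" in residues:
            raise ValueError("gapped columns are not handled in this sketch")
        if len(residues) == 1:
            elements.append(residues[0])                     # conserved position
        else:
            elements.append("[" + "".join(residues) + "]")   # observed alternatives
    return "-".join(elements) + "."


block = ["ERGLAAAR", "DRVSCLIR", "DRLGSGGR", "ERAALILR", "ERIVVTVR"]
print(columns_to_pattern(block))
```

Comparing the number of database hits for this literal pattern with those for a relaxed version is one way to judge how specific the signature is.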
More practicals on how to use MyHits (a SIB resource where you can build your own profiles and HMMs):