We are major contributors to the growing realization that hundreds of thousands of small ORFs (smORFs) of 10 to 100 amino acids (aa) exist in every metazoan genome; that thousands of smORFs are translated; and that smORF peptides can have important functions and be widely conserved across metazoans. We are one of the few groups worldwide that can apply functional genomics to the study of smORFs, thanks to our expertise in functional genetic analysis, bioinformatics, and ribosomal profiling. Our aim is to understand and predict the function not only of individual smORFs, but of smORFs as a whole, as a new class of genes in the genome. Our approach is based on multidisciplinary feedback between three areas of research:
A) Biochemical detection of translated smORFs and their resulting peptides: This has not been done thoroughly at a genomic scale in any metazoan species. We are using proteomics and Poly-Ribo-Seq, our improvement of ribosomal profiling that allows genome-wide identification of translated sequences.
B) Genetic analysis: Translation may be a secondary consequence of a non-coding function (for example, the regulation of translation of other sequences) and give rise to non-functional peptides. Thus, we aim to characterize in depth the function of selected smORFs with biochemically detected translation and promising bioinformatic markers and, from there, to identify the translation parameters and bioinformatic markers that distinguish smORFs actually producing bioactive peptides.
C) Bioinformatic prediction and analysis: Analyzing the sequences of smORFs confirmed by biochemical and genetic methods to produce biologically active peptides is allowing us to obtain bioinformatic markers of smORF function. We then use these markers to predict the function of additional smORFs from their sequence, and hence to identify new candidates for genetic analysis. Ultimately, we aim to predict the function of any smORF from its sequence.
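As an illustration of the starting point for such sequence-based analysis, the sketch below lists candidate smORFs of 10 to 100 codons in a DNA string. It is a minimal example and not our annotation pipeline: the function name find_smorfs, the restriction to ATG start codons, the standard stop codons and the forward-strand-only scan are all assumptions made for brevity.

```python
# Illustrative sketch only: forward-strand scan for ATG-initiated ORFs
# whose length falls in the 10-100 aa smORF range. The name find_smorfs
# and the ATG-only / forward-strand simplifications are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_smorfs(dna, min_aa=10, max_aa=100):
    """Return sorted (start, end, n_codons) tuples for candidate smORFs.

    start and end are 0-based, end-exclusive, and include the stop codon.
    n_codons counts from the start codon to the last sense codon.
    """
    dna = dna.upper()
    hits = []
    for frame in range(3):
        start = None  # first ATG seen since the previous stop in this frame
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_aa <= n_codons <= max_aa:
                    hits.append((start, i + 3, n_codons))
                start = None  # resume looking for the next ATG
    return sorted(hits)
```

A scan of this kind only yields candidates; as described above, translation and function have to be established biochemically and genetically.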
1- Characterization of putative smORF genes:
Our pioneering discovery of the polycistronic tarsal-less gene has been followed by the characterization of two medically relevant smORF genes: sarcolamban and hemotin. We showed that sarcolamban, encoding two smORFs of 28 and 29 aa, is the true homologue of the vertebrate genes sarcolipin and phospholamban. The three define an ancient gene family, and their peptides are structurally and functionally homologous. They dampen the activity of the Ca2+ pump SERCA and thus regulate Ca2+ traffic in muscles; in the fly heart, alterations of sarcolamban produce cardiac arrhythmia, as has been described for sarcolipin and phospholamban. Hemotin encodes an 88 aa peptide that we have shown to be translated, by proteomics and Poly-Ribo-Seq, and that we identified as homologous to stannin, a vertebrate (including human) protein with alpha-helical regions and uncertain function, but involved in neurotoxicity. We determined that hemotin is expressed in early endosomes and is required for endosome maturation and phagocytic digestion in hemocytes (fly macrophages), and we obtained evidence of its expression, and conserved phagocytic function, in macrophages of fish and mice.
2- Poly-Ribo-Seq, an improvement of ribosomal profiling:
We have successfully improved ribosomal profiling to detect productive smORF translation genome-wide. We developed a novel technique to profile polysomal fractions (Poly-Ribo-Seq): RNAs bound by multiple ribosomes (and thus actively translated) are purified, and the exact position of the translating ribosomes is determined by ribosomal profiling. Applying this new technique to S2 cells revealed the translation of 2,789 new smORFs without previous experimental evidence of translation or function. Extrapolation of this result suggests that some 50% of transcribed smORFs could be translated, that is, nearly 12,000 Drosophila smORFs, of which only 152 currently have suspected homologues and hence a suspected peptide function.
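The arithmetic behind this extrapolation can be made explicit with a short sketch. The input figures are the ones quoted above; the function name and the two derived quantities (the implied number of transcribed smORFs and the share of translated smORFs with a suspected homologue) are our own illustration, and the inputs are rounded estimates.

```python
# Illustrative arithmetic only; the three inputs are the rounded figures
# quoted in the text, and the function name is hypothetical.
def extrapolate_smorfs(translated_estimate=12_000,
                       translated_fraction=0.5,
                       with_homologues=152):
    """Derive quantities implied by the quoted extrapolation."""
    # If about 50% of transcribed smORFs are translated and that is about
    # 12,000 smORFs, the implied transcribed pool is twice that number.
    implied_transcribed = translated_estimate / translated_fraction
    # Share of the estimated translated smORFs with a suspected homologue.
    homologue_share = with_homologues / translated_estimate
    return implied_transcribed, homologue_share


transcribed, share = extrapolate_smorfs()
print(f"Implied transcribed smORFs: {transcribed:,.0f}")
print(f"Share with suspected homologues: {share:.2%}")
```

On these figures, only a little over 1% of the smORFs estimated to be translated have a suspected peptide function.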
3- Bioinformatic characterisation of smORFs:
Extrapolation of our Poly-Ribo-Seq results and our bioinformatic analyses of smORFs reveal three main types of smORFs in the genome: A) Hundreds of ‘longer’ smORFs appear within monocistronic transcripts and produce conserved peptides about 80 aa long, with predicted functions biased towards an association with cell membranes; these are translated in most (80%) cases. B) Thousands of ‘dwarf’ smORFs appear in polycistronic arrangements in 80% of putative non-coding RNAs and in 60% of the 5’ UTRs of standard mRNAs (where they are also called uORFs). On average, dwarf smORFs are less conserved, around 20 aa long, and weakly translated in a third of cases. Finally, C) hundreds of thousands of intergenic smORFs appear in the genome as DNA sequences that are neither conserved nor transcribed, but that resemble dwarf smORFs in other respects.
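The three types can be summarised as a simple rule-based assignment. The sketch below is a minimal illustration and not our actual classification procedure: the names SmorfClass and classify_smorf and the context labels are hypothetical, and the typical lengths and translated fractions attached to each class are the average figures quoted above.

```python
# Illustrative sketch only: rule-based assignment to the three smORF types.
# Class names, context labels and classify_smorf are hypothetical; the
# numbers are the averages quoted in the text.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SmorfClass:
    name: str
    typical_length_aa: Optional[int]       # None: not stated separately
    translated_fraction: Optional[float]   # None: not transcribed


LONGER = SmorfClass("longer", 80, 0.80)
DWARF = SmorfClass("dwarf", 20, 1 / 3)
INTERGENIC = SmorfClass("intergenic", None, None)


def classify_smorf(transcribed: bool, context: str) -> SmorfClass:
    """Assign a smORF to a type from its transcription status and context."""
    if not transcribed:
        return INTERGENIC  # untranscribed DNA sequence
    if context == "monocistronic":
        return LONGER
    if context in {"polycistronic_ncrna", "five_prime_utr"}:
        return DWARF  # includes uORFs in the 5' UTRs of standard mRNAs
    raise ValueError(f"unknown transcript context: {context!r}")
```

Real smORFs will not always fall cleanly into one category, so a scheme like this is best read as a summary of the averages rather than as a predictor.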