Supplementary Information: Genome sequencing of two closely related non-human primate animal models, cynomolgus and the Chinese rhesus macaque


Genome sequencing and assembly
Genome sequencing
Genome assembly
Assembly quality validation in neutral mode
Transcriptome sequencing
Gene annotation
Gene annotation pipeline and evaluation of gene quality
Detection of genetic variation in macaques
Single nucleotide variation detection
Indel variations among macaques
base pair insertion/deletion detection
base pair insertion/deletion detection
>100 base pair insertion/deletion detection
Demographic analysis of macaque populations and detection of introgression
DaDi analysis of two rhesus macaque population models
DaDi analysis of three macaque population models
Detection of putative introgression regions
Detection of selective sweep regions
Principle of the HKA test and its application in our analysis
Detection of selective sweeps
Gene evolution
Ortholog determination
dn/ds calculation
Rapidly and slowly evolving function categories
Identification of lineage specific accelerated GO categories
Positive gene selection
Identification of lost genes as compared to human
Pseudogene identification
Population studies of mutations in disease genes
Analysis of compensated pathogenic deviations
Cross-transcriptome analysis
Transcriptome mapping
Species specific novel gene identification in CE and IR
Supplementary References

1 Genome sequencing and assembly

1.1 Genome sequencing

A five-year-old female Chinese rhesus macaque and a four-year-old female cynomolgus macaque were used in this study. Prior to genome sequencing, we sequenced their mitochondrial DNA to verify their origin. Comparative phylogenetic analysis with previously determined mitochondrial sequences of other macaque species indicated that these two macaques clustered with individuals from their corresponding subpopulations, verifying their predicted origin (Supplementary Fig. 1).

Genomic DNA was isolated from the peripheral blood cells of healthy female CR and CE macaques. The two genomes were then sequenced using an Illumina Genome Analyzer. Library preparation and sequencing followed the manufacturer's instructions. Sequence reads were collected from the output of the Solexa data processing pipeline using default parameters.

To optimize the quality of the de novo assembly and to minimize sources of systematic bias, libraries with various insert sizes were constructed for each individual: a total of 19 and 18 paired-end libraries, with insert sizes ranging from 200 bp to 10 kb (Supplementary Table 1), were used to generate short reads for CR and CE, respectively. A short insert size library serves to minimize the number of gaps in the assembly, whereas a long insert size library contributes to a greater scaffold N50 length. Most reads from mate-pair libraries (insert size >2,000 bp) were 44 bp long, whereas reads from paired-end libraries (insert sizes of 200 to 500 bp) were 75 bp long.

Supplementary Fig. 1 Phylogenetic trees of Macaca species constructed using mtDNA sequences. CE and CR are, respectively, the crab-eating macaque and the Chinese rhesus macaque sequenced in this study. IR is the Indian rhesus macaque. The accession numbers [D D85291] for the other mtDNA sequences are from ref. 1. Phylogenetic trees were constructed by neighbour-joining (a) and maximum parsimony (b). The Patas monkey (genus Erythrocebus), baboon (genus Papio) and human served as outgroups.

Supplementary Table 1. Genome sequencing for two macaques using a high-throughput sequencing platform.
Columns: Sequence data; Insert size; then Raw data (Gb), High-quality (Gb) and Sequence depth (X) for the Chinese rhesus macaque and for the crab-eating macaque.
Rows: Solexa reads with insert sizes 200 bp, bp, 500bp, 2 Kb, Kb, Kb; Total.

1.2 Genome assembly

The two macaque genomes were assembled de novo by SOAPdenovo 2 software, whose effectiveness has already been proven in assembling the genome of the giant panda, which is comparable in size to that of the macaque 3. SOAPdenovo employs the de Bruijn graph algorithm both to simplify the task of assembly and to reduce computational complexity. Low-quality reads were filtered and potential sequencing errors were removed by k-mer frequency-based error correction. We filtered out the following types of reads:

1. Reads in which N bases made up more than 10% of the length.
2. Reads from short insert-size libraries having more than 65% of bases with quality 7, and reads from large insert-size libraries that contained more than 80% of bases with a quality
3. Reads with more than 10 bp from the adapter sequence (allowing 2 bp mismatches).
4. Small insert size paired-end reads that overlapped 10 bp between two ends.
5. Paired-end reads whose read1 and read2 were completely identical to those of another pair (and were hence considered to be the products of PCR duplication).
6. Reads having a k-mer frequency <4 (to minimize the influence of sequencing errors).

After these quality control and filtering steps, a total of 142G (or 47.4X) and 162G (or 54.0X) of data for CR and CE, respectively, were retained for assembly. SOAPdenovo first constructs the de Bruijn graph by splitting the reads from short insert size libraries ( bp) into 31-mers and then merging the 31-mers (30 bp overlaps with 1 bp overhangs); contigs that exhibited unambiguous connections in the de Bruijn graph were then collected. Reads from mate-pair libraries (insert size >2k) were aligned onto the contigs for scaffold building. The paired-end information was used to link contigs into scaffolds, step by step, from short insert sizes to long insert sizes. About 103.2G (or 34.4X) and 104.1G (or 34.5X) of sequence data were used to build contigs for CR and CE, respectively. Some intra-scaffold gaps were filled by local assembly using the reads in a read-pair where one end uniquely aligned to a contig whereas the other end was located within the gap.

The final total contig size and N50 of CR were 2.7G and 12.0K, respectively, whilst the corresponding values for CE were 2.7G and 12.7K. The total scaffold size (including the estimated gap size within the scaffold) and N50 were, respectively, 2.84G and 891K for CR, whereas the corresponding values for CE were 2.85G and 652K (Supplementary Table 2).

Our assemblies are of comparable, or in some respects improved, quality relative to the IR sequence assembly. The quality measures assessed were as follows:

(1) Our two macaque genomes have extensive coverage (50X), which provides a high degree of confidence in the sequence quality at the single-nucleotide level. Given that the inherent sequencing error rate of short-read assembly after our stringent pipeline is approximately the same as that of Sanger sequencing, this higher coverage should result in a high-quality assembly at the single-base level. Additionally, over 95% of genomic regions in our two sequenced macaques were covered by at least 20X (Supplementary Fig. 2), providing extremely high sequence quality throughout the whole genome 4.
(2) The CR genome assembly allowed us to close 42 Mb of existing gap regions in the IR subspecies genome, which revealed 12 genes that had been missed in the IR genome assembly, thereby indicating the completeness of our sequencing and assembly, and also providing an important complementary genetic resource for the rhesus macaque.
(3) We compared each genome sequence with the human reference genome to estimate the insertion/deletion (indel) error rate due to mis-assembly, based on a neutral model 5. Our results indicated that the indel error rates of the two short-read assemblies were close to that of the IR genome sequence obtained using Sanger sequencing (Supplementary Fig. 3).
(4) We used the same annotation pipeline to annotate the genes in the three macaques using the human gene set as a reference, and obtained similar numbers of genes with intact open reading frames in the three macaques (CR: 14,492; IR: 14,067; CE: 13,509). However, the short-read-based assemblies contained fewer annotated genes with premature stop codons or frameshifts than the IR version.
(5) Fewer genes were missed owing to draft assembly status in our two assemblies (CR: 1,055; CE: 1,171) than in the IR version (1,264).

Since most of the gaps were located within highly repetitive regions that we had failed to fill, about 11.0% (CR) and 15.9% (CE) of the assemblies remained unclosed. However, over 95% of genomic regions in our two sequenced macaque genomes were covered by at least 20X (Supplementary Fig. 2), providing extremely high assembly quality at the individual base level. Libraries of various insert sizes provided graded linking information for building scaffolds, whereas the thousands-fold physical coverage of the data provided an extremely high density of paired-end information with which to join contigs into scaffolds. Further, synteny analysis of CR/IR and CE/IR found that only 1,829 and 2,036 scaffolds, respectively, mapped onto non-contiguous regions in IR; these may be due to structural variation or mis-assembly. For this reason, we randomly selected 10 scaffolds that were consistent in CR and CE, but discrepant in IR, for PCR validation (Supplementary Table 3). All the PCRs yielded the expected products in CR and CE, thereby confirming the accuracy of our assemblies (Supplementary Fig. 4). However, we found that all the primers also yielded similar products when CR and CE were compared, possibly suggesting mis-assembly in IR or structural variation between the IR individual used in our PCR validation and the individual sequenced for rheMac2.
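The contig and scaffold N50 values quoted in this section follow the standard definition: the length L such that sequences of length at least L together contain half of the total assembled bases. A minimal sketch of that calculation is given below; the function name `nx` and the toy lengths are illustrative assumptions, not part of the original pipeline. The same function gives the N60 to N90 rows of Supplementary Table 2 when `x` is changed.

```python
def nx(lengths, x=50):
    """Return the Nx statistic of a list of contig or scaffold lengths.

    Nx is the length L such that sequences of length >= L together cover
    at least x percent of the total assembled length (N50 when x == 50).
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point edge cases.
        if running * 100 >= x * total:
            return length


# Toy example (hypothetical lengths, not data from this study).
toy_contigs = [100, 80, 60, 40, 20]
print(nx(toy_contigs, 50))  # 80
print(nx(toy_contigs, 90))  # 40
```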
Taking advantage of the availability of the IR genome assembly and the close evolutionary relationship between CR, CE and IR, we adopted the Indian rhesus (rheMac2) assembly as a reference. The assemblies of CR and CE were first aligned to rheMac2 using LASTZ; the detailed process is summarized below:

1. We aligned all the scaffolds of CR and CE to the IR genome by LASTZ with parameters M=254 K=4500 L=3000 Y=15000 C=2 T=2 format=axt.
2. We ran axtChain from the chainNet package (v2), with default settings, to chain the LASTZ alignments into longer syntenic blocks.
3. The length of each block was calculated and the longest was selected as the best chain of a specific scaffold/contig for further analysis. Chains with aligned lengths less than 10% of their corresponding scaffolds were discarded.
4. For a specific locus in a reference chromosome, if more than one scaffold mapped to it, we chose the scaffold with the longest aligned length; the others were discarded.

The purpose of steps 3 and 4 was to obtain the reciprocal best chain alignments, which can reduce errors when linking scaffolds into chromosomes. The scaffolds were linked into chromosomes according to their coordinates in the IR chromosomes. After step 4, if a scaffold was found to map to a unique chromosome, even though some parts of this scaffold might not be aligned to this chromosome (i.e. might be aligned to another chromosome or could not be aligned to any locus of IR), we did not split the scaffold into several parts, because the scaffold itself was supported by paired-end reads and there could be structural variation between CR/CE and IR or unassembled regions in IR. If there was a gap between two adjacent scaffolds that mapped to an IR chromosome, this gap was considered to be a region that was not sequenced or a region that was not assembled by SOAPdenovo. We filled the gap with Ns in the CR/CE chromosomes; the number of Ns was equal to the size of the unaligned region in IR. The statistics pertaining to the mapping of scaffolds to chromosomes are shown in Supplementary Table 4.
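Steps 3 and 4 above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the pipeline actually used: the `Chain` record, the function names and the toy coordinates are hypothetical, and "more than one scaffold mapped to a locus" is interpreted here as overlapping intervals on the same reference chromosome.

```python
from collections import namedtuple

# Hypothetical record for one chained alignment of a scaffold to the reference.
Chain = namedtuple("Chain", "scaffold scaffold_len chrom start end aligned_len")


def best_chain_per_scaffold(chains, min_fraction=0.10):
    """Step 3: keep the longest chain per scaffold, then drop chains whose
    aligned length is below 10% of the scaffold length."""
    best = {}
    for c in chains:
        cur = best.get(c.scaffold)
        if cur is None or c.aligned_len > cur.aligned_len:
            best[c.scaffold] = c
    return [c for c in best.values()
            if c.aligned_len >= min_fraction * c.scaffold_len]


def resolve_overlaps(chains):
    """Step 4 (assumed interpretation): where scaffolds overlap on the same
    reference chromosome, keep the one with the longest aligned length."""
    kept = []
    for c in sorted(chains, key=lambda c: c.aligned_len, reverse=True):
        clash = any(k.chrom == c.chrom and c.start < k.end and k.start < c.end
                    for k in kept)
        if not clash:
            kept.append(c)
    # Order the survivors by reference coordinate for linking into chromosomes.
    return sorted(kept, key=lambda c: (c.chrom, c.start))


def place_scaffolds(chains):
    return resolve_overlaps(best_chain_per_scaffold(chains))


# Toy input (hypothetical coordinates).
toy = [
    Chain("s1", 1000, "chr1", 0, 900, 900),
    Chain("s1", 1000, "chr2", 50, 100, 50),      # shorter chain of s1: ignored
    Chain("s2", 1000, "chr1", 800, 1200, 400),   # overlaps s1, shorter: dropped
    Chain("s3", 5000, "chr1", 2000, 2300, 300),  # 6% of scaffold: dropped
    Chain("s4", 500, "chr1", 1500, 1900, 400),
]
print([c.scaffold for c in place_scaffolds(toy)])  # ['s1', 's4']
```

Gaps between adjacent placed scaffolds would then be padded with Ns according to the size of the unaligned reference region, as described in the text.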
We also found 20 genes located in gap regions in IR but present in the CR or CE assemblies (Supplementary Table 5).

We counted the scaffolds that showed a discrepancy in synteny, according to the results of the chainNet package (v2), when they were mapped to the IR genome. After mapping the scaffolds to chromosomes, we realigned the CR and CE chromosomes against the IR chromosomes by LASTZ, and then ran the chainNet pipeline to generate a syntenic net for each of the IR-CR and IR-CE alignments. In the syntenic net files, a 'type' field describes the synteny status of each alignment block: syn denotes that a chain alignment is on the same chromosome and in the same direction as the parent; inv denotes that a chain alignment is on the same chromosome but in the opposite direction from the parent (an inversion); and nonsyn denotes that a chain alignment is on a different chromosome from the parent (an inter-chromosomal translocation). From the syntenic net results, we found 4.49 Mb of inversions and 28.29 Mb of translocations for the IR-CR alignment, and 2.16 Mb of inversions and 28.58 Mb of translocations for the IR-CE alignment. We also selected 10 large discrepancy regions for PCR validation, and all selected regions were validated.

One of the challenges in genome assembly is posed by repetitive sequences that have multiple identical or very similar copies in the genome. To assess the quality of short-read assembly in highly repetitive regions, we compared the two macaque genomes we assembled with the IR assembly in the LRC (Leukocyte Receptor Complex) and MHC (Major Histocompatibility Complex) regions. The human LRC region and the complete IR MHC region 6 were used as references. We used LASTZ to map the macaque scaffolds onto the references. We noted that all three macaque assemblies were very fragmented across these two regions (Supplementary Fig. 5-9). This was probably due to the high density of repetitive sequences. Even the use of the Sanger method for the assembly did not improve matters much for such regions.
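The contig-building step described earlier in this section (reads split into k-mers, k-mers merged when they overlap by k-1 bases, and unambiguous paths collected as contigs) can be illustrated with a toy de Bruijn-style graph. This is a simplified sketch and not SOAPdenovo itself: it uses k = 4 instead of 31, ignores reverse complements, sequencing errors and coverage, and all function names are illustrative.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are k-mers; an edge joins two k-mers that are adjacent in a read,
    i.e. that overlap by k-1 bases with a 1-base overhang."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    nodes = set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred


def unambiguous_contigs(nodes, succ, pred):
    """Spell out maximal non-branching paths. Isolated cycles with no
    branching node are skipped in this simplified version."""
    def is_start(n):
        if len(pred[n]) != 1:
            return True
        (p,) = pred[n]
        return len(succ[p]) != 1

    contigs = []
    for n in sorted(nodes):
        if not is_start(n):
            continue
        seq, cur = n, n
        # Extend while the connection is unambiguous in both directions.
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if len(pred[nxt]) != 1:
                break
            seq += nxt[-1]
            cur = nxt
        contigs.append(seq)
    return contigs


# Two overlapping toy reads drawn from the sequence ATGGCGTGCA.
reads = ["ATGGCGT", "GGCGTGCA"]
nodes, succ, pred = build_graph(reads, k=4)
print(unambiguous_contigs(nodes, succ, pred))  # ['ATGGCGTGCA']
```

Repeats longer than k-1 bases create branching nodes, which is why contigs break there and why fragmentation is expected in repeat-dense regions such as the LRC and MHC.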

Supplementary Fig. 2 Cumulative distribution of single-base depth from two Macaca genomes. High-quality reads were aligned onto the assembly and the sequencing depth at each position in the assemblies was calculated. The cumulative distributions of single-base depths indicate that 94.05% of the CR assembly (a) was covered more than 20 times, the corresponding proportion for CE being 96.54% (b).

Supplementary Fig. 3 Genome sequence quality assessment using the InDel neutral model. The three graphs display the gap errors quantified in pairwise alignments of human with IR, CR and CE, respectively. The x-axis represents the length of the inter-gap segment (IGS); the y-axis is the frequency of IGS on a natural log scale. The straight line represents the neutral indel model predictions calculated from the observed frequencies of IGS lengths between 150 and 300 bases; the curved line shows the observed frequencies. The lower plots denote the frequencies of gap errors. The three macaque assemblies exhibit no significant differences in terms of their InDel error rates.
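The neutral-model check shown in Supplementary Fig. 3 can be sketched as follows. Under a geometric distribution, the log frequency of IGS lengths is linear in length, so a straight line fitted to lengths of 150 to 300 bases can be extrapolated to short lengths and the excess of observed over expected counts summed. This is a minimal sketch under stated assumptions, not the published implementation: the function names, the use of a simple least-squares fit, the choice to sum the excess over all lengths below 150, and the synthetic data are illustrative. The returned quantities correspond to N_g (number of erroneous gaps), the proportion ε (N_g divided by the total number of IGSs) and the error rate D (N_g divided by the total number of aligned bases).

```python
import math


def fit_log_linear(counts, lo=150, hi=300):
    """Least-squares line through (length, ln count) for lo <= length <= hi."""
    pts = [(L, math.log(c)) for L, c in counts.items() if lo <= L <= hi and c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two IGS lengths in the fitting range")
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx
    return slope, my - slope * mx


def gap_error_estimates(counts, lo=150, hi=300):
    """Return (N_g, epsilon, D) from a mapping of IGS length -> observed count."""
    slope, intercept = fit_log_linear(counts, lo, hi)
    n_g = 0.0
    for L, observed in counts.items():
        if L < lo:
            expected = math.exp(intercept + slope * L)
            if observed > expected:  # only an excess of short IGS counts as error
                n_g += observed - expected
    total_igs = sum(counts.values())
    total_bases = sum(L * c for L, c in counts.items())
    return n_g, n_g / total_igs, n_g / total_bases


# Synthetic example: an exact geometric decay plus 50 spurious IGSs at each
# of the lengths 1..10 (hypothetical numbers, not data from this study).
counts = {L: 1000 * 0.99 ** L for L in range(1, 401)}
for L in range(1, 11):
    counts[L] += 50
n_g, eps, d = gap_error_estimates(counts)
print(round(n_g), eps, d)
```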

Supplementary Fig. 4 Agarose gel electrophoresis of PCR products for validation of discrepancies between CR/IR and CE/IR. All the PCR products validated our assembly. However, discrepancies with the IR assembly were evident, which may suggest structural variation between individuals or mis-assembly of the original IR assembly.

Supplementary Table 2. Genome assembly using pure short reads.
Columns, for the Chinese rhesus macaque and then for the crab-eating macaque: Contig Size, Contig Number, Scaffold Size, Scaffold Number
N90 2, , ,525 3,264 2, , ,693 4,649
N80 4, , ,677 2,350 5, , ,303 3,289
N70 6, , ,809 1,757 7, , ,846 2,424
N60 9,270 86, ,489 1,312 9,811 84, ,893 1,793
N50 11,960 61, , ,514 60, ,093 1,303
Total Size 2,661,096, ,482 2,835,360,169 92,997 2,653,792, ,029 2,847,561, ,664

Supplementary Table 3. Genomic locus and PCR primers for PCR validation.
Number chromosome start end strand

1
CR.chr11 58,428,505 58,448,611 +
IR.chr6 29,221,839 29,201,339 -
CE.chr11 58,824,015 58,843,967 +
primer1 AGGTACACCACTACGGGACT
primer2 CTGACTTGCCACAGGGTT

2
CR.chr14 106,249, ,286,142 +
IR.chr10 22,733,494 22,766,
CE.chr14 104,516, ,554,049 +
primer1 GGCAATATGTTGGCGTTTA
primer2 CTTGTGGGAACTCCGAAC

3
CR.chr1 40,722,218 40,822,774 +
IR.chr19 19,165,434 19,266,270 +
CE.chr1 41,552,541 41,649,443 +
primer1 TCAGTGGGTGAAGGGATA
primer2 AAAAACAGGTTTGGGGAC

4
CR.chr5 168,014, ,019,730 +
IR.chr19 43,815,420 43,820,638 +
CE.chr5 168,988, ,994,048 +
primer1 ATTGCACTTCAGTAGGCTCT
primer2 CTGGGTGGAACTACTTGG

5
CR.chr6 160,863, ,986,311 +
IR.chr9 96,362,774 96,468,401 +
CE.chr6 161,706, ,815,170 +
primer1 TCATGGGGAAAATGCTGC
primer2 TGTTTGCTTGCTGTAATCTG

6
CR.chr9 87,693,313 87,730,393 +
IR.chr12 15,218,884 15,255,747 +
CE.chr9 90,018,510 90,055,749 +
primer1 TGTGTAGCCAGGGTTGAG
primer2 GGAAGGCTTTAGTCGTGG

7
CR.chr9 13,614,265 13,616,742 +
IR.chr20 44,641,371 44,643,962 -
CE.chr9 14,075,418 14,077,906 +
primer1 CTGTTTCCTGCCTTCTAT
primer2 TTCTCAACACGCTTCTAAT

8
CR.chr1 40,722,218 40,732,530 +
IR.chr19 19,165,434 19,176,130 +
CE.chr1 41,552,541 41,562,287 +
primer1 TGGACTACTACTCAGCGG
primer2 TCCCTGGAGATGACTGTT

9
CR.chr3 148,855, ,856,358 +
IR.chr20 22,964,006 22,964,829 -
CE.chr3 150,187, ,188,042 +
primer1 ATTCACAGTCGCTGGTTT
primer2 TGACAAGAGCATCAACCC

10
CR.chr18 41,751,406 41,757,
IR.chr14 112,414, ,420,517 -
CE.chr18 41,900,921 41,906,738 +
primer1 TGGATGGAAGGGGATAGC
primer2 AGGTGGGGTGAAGGATTG

Supplementary Table 4. Assembly statistics by chromosome.
Columns: Chr; Indian rhesus macaque: length (Mb), gap length (Mb); Chinese rhesus macaque: scaf#, Placed length (Mb), placed%; Crab-eating macaque: scaf#, Placed length (Mb), placed%
Rows: one per chromosome (chr ... chrX)
Total 2, ,206 2, ,445 2,
scaffolds placed on chromosomes 7,206 2, ,445 2,
scaffolds unplaced 87, ,

Supplementary Table 5. Genes located in gaps of IR filled by CR and CE assemblies.
Gap position, Gap length (bp), CR/CE position, CR/CE length (bp), Gene
IR.chr19_ _ CE.chr19_ _ C19orf69
IR.chr10_ _ CE.chr10_ _ WFDC9
IR.chr9_ _ CE.chr9_ _ C10orf118
IR.chr17_ _ CE.chr17_ _ TMCO3
IR.chr13_ _ CE.chr13_ _ ASTL,
IR.chr3_ _ CE.chr3_ _ LRRC61
IR.chr12_ _ CR.chr12_ _ C2orf85
IR.chr7_ _ CR.chr7_ _ TMEM30B
IR.chr17_ _ CR.chr17_ _ TMCO3
IR.chr3_ _ CR.chr3_ _ LRCH4, AC , AGFG2
IR.chr13_ _ CR.chr13_ _ ASTL,ADRA2B
IR.chr4_ _ CR.chr4_ _ IER3
IR.chr7_ _ CR.chr7_ _ AL
IR.chr19_ _ CR.chr19_ _ C19orf69
IR.chr3_ _ CR.chr3_ _ LRRC61
IR.chr9_ _ CR.chr9_ _ C10orf118
IR.chr19_ _ CR.chr19_ _ CD79A
IR.chr10_ _ CR.chr10_ _ WFDC9

Supplementary Fig. 5 Comparison of LRC assemblies in three macaques. We aligned the three macaque assemblies against the human LRC using LASTZ. The blue bars represent aligned regions, and the red bars in the assemblies of CR and CE represent the intra-scaffold gaps. The black boxes are the genes annotated in this region. The total size of gaps is about 216 kb, 84 kb, and 373 kb in the IR, CR, and CE assemblies, respectively.

Supplementary Fig. 6 Comparison of MHC assemblies in the three macaques. As a reference for this comparison, we used a complete MHC sequence from IR, which was isolated from a BAC library and sequenced using Sanger sequencing (centre black line marked with gene regions in black boxes) (Daza-Vamenta et al. Gen Res. 2004). The shaded boxes of different colours on the reference sequence line indicate the tandem duplication regions; those of the same colour are duplication regions that share a common ancestor. The IR sequence of this region from the whole-genome shotgun assembly using Sanger sequencing was provided as a control, and its alignment with the reference is shown by blue bars just below the black line. The blue bars above and below the IR assembly represent the aligned blocks in the CE and CR assemblies, respectively. The dot plots represent the sequencing depth in CE and CR, and the red lines in the dot plots represent their average sequencing depth. The MHC class I region (region I-1: 220,000-1,175,788; region I-2: 2,325,200-3,456,752) is enclosed in a yellow box, the class II region (4,370,071-5,143,074) in green, and the class III region (3,456,752-4,154,879) in blue.

Supplementary Fig. 7 Comparison of assemblies of the MHC class I region in three macaques. A. MHC class I region-1 (220,000-1,175,788); B. MHC class I region-2 (2,325,200-3,456,752). The lines in the middle of both panels represent the MHC assembly from the BAC library (Daza-Vamenta et al. 2004). We aligned the three macaque assemblies (IR, CR and CE) against the MHC reference using LASTZ. The three blue bars, from top to bottom, represent the aligned regions in CE, IR and CR, respectively. The black bars along the middle line represent the gene regions and the other bars along the middle line represent the gene duplication events. The dot plots represent the sequencing depths in CE and CR, while the red lines represent the average whole-genome sequencing depths for CE and CR, respectively.

Supplementary Fig. 8 Comparison of assemblies of the MHC class II region in three macaques. The line in the middle represents the MHC assembly from the BAC library (Daza-Vamenta et al. 2004). We aligned the three macaque assemblies (IR, CR and CE) against the MHC reference using LASTZ. The three blue bars, from top to bottom, represent the aligned regions in CE, IR and CR, respectively. The black bars along the middle line represent the gene regions and the other bars along the middle line represent the gene duplication events. The dot plots represent the sequencing depths in CE and CR, while the red lines represent the average whole-genome sequencing depths for CE and CR, respectively.

Supplementary Fig. 9 Comparison of assemblies of the MHC class III region in three macaques. The line in the middle represents the MHC assembly from the BAC library (Daza-Vamenta et al. 2004). We aligned the three macaque assemblies (IR, CR and CE) against the MHC reference using LASTZ. The three blue bars, from top to bottom, represent the aligned regions in CE, IR and CR, respectively. The black bars along the middle line represent the gene regions and the other bars along the middle line represent the gene duplication events. The dot plots represent the sequencing depths in CE and CR, while the red lines represent the average whole-genome sequencing depths for CE and CR, respectively.

1.3 Assembly quality validation in neutral mode

We used the neutral InDel model 5 to validate the quality of our genome assemblies. When two closely related genome sequences are aligned, the lengths of successive alignment blocks (which are split by gaps during the alignment), termed inter-gap segments (IGS), are expected to follow a geometric frequency distribution under a standard neutral model. Within neutrally evolving regions, incorrect InDels introduced during the assembly process cause the observed IGS length distribution to depart from the geometric distribution: the introduced InDels generate an excess of short IGS over the number predicted by the neutral InDel model. By quantifying this excess, several parameters can be estimated, viz. the proportion (ε), average density (D) and number (N_g) of the clustered erroneous gaps in the genome alignments.

Pairwise alignments of human-IR, human-CR and human-CE were generated, and the gaps in the alignments were clustered by size. We then compared the gap size frequency distribution with that expected under the neutral model. To estimate N_g, we accumulated the difference between the observed and expected IGS counts for small IGS lengths, starting from the smallest IGS lengths that exceeded expectation. The proportion (ε) of InDels that represented errors was calculated by dividing N_g by the total number of IGSs in the entire alignment being analysed. The InDel error rate (D) was calculated by dividing N_g by the total number of alignable bases (equivalent to the total number of nucleotides covered by the IGSs). This analysis revealed that the patterns of the InDel error rate in CR and CE were comparable to those in IR (Supplementary Fig. 3).

1.4 Transcriptome sequencing

Two cynomolgus macaques of Indonesian origin were euthanized for tissue collection. Samples from brain, kidney, liver and white adipose tissue were collected from a two-year-old male, whereas tissue from testes and ileum was collected from a six-year-old male. Frozen tissues were ground in liquid nitrogen using a liquid nitrogen-chilled mortar and pestle. Tissue RNA extraction was performed using the Qiagen RNeasy Kit.

One male rhesus macaque of Indian origin was euthanized for tissue collection. Samples from brain, heart, kidney, liver, quadriceps and testes were collected. Frozen tissues were homogenized in Trizol reagent in a bead mill with 5 mm stainless steel beads. The Trizol procedure was then followed, including two alcohol precipitations and suspension of the final RNA pellet in RNase-free water.

RNA sequencing libraries were constructed using an Illumina standard mRNA-seq Prep Kit. Briefly, oligo(dT) magnetic beads were used to purify the poly-A-containing mRNA molecules. The mRNA was then fragmented into short lengths at a controlled temperature and randomly primed during first-strand synthesis by reverse transcription. This was followed by second-strand synthesis with DNA polymerase I to create double-stranded cDNA fragments. The double-stranded cDNA was subjected to end repair by Klenow and T4 DNA polymerases and A-tailed by Klenow lacking exonuclease activity. Ligation to Illumina Paired-End Sequencing adapters, size selection by gel electrophoresis and PCR amplification completed the library preparation. The paired-end libraries were sequenced on an Illumina Genome Analyzer for 100 bp at each end (Supplementary Table 6).

Supplementary Table 6. Transcriptome sequencing data statistics.

species              tissue                 Read number   Mapped reads
cynomolgus macaque   brain                  60,244,328    29,413,792
cynomolgus macaque   ileum                  62,367,606    27,159,525
cynomolgus macaque   kidney                 60,377,002    19,712,266
cynomolgus macaque   testes                 59,864,484    27,497,736
cynomolgus macaque   white adipose tissue   58,459,464    24,141,998
cynomolgus macaque   liver                  59,000,500    39,073,980
rhesus macaque       brain                  40,917,070    16,636,201
rhesus macaque       heart                  54,952,900    25,966,028
rhesus macaque       testes                 46,007,760    20,334,844
rhesus macaque       quadriceps             50,563,532    29,393,204
rhesus macaque       kidney                 36,793,044    18,515,570
rhesus macaque       liver                  51,178,284    20,399,166

2 Gene annotation

2.1 Gene annotation pipeline and evaluation of gene quality

Taking advantage of the close evolutionary relationship between the three macaques and human, together with the availability of well-annotated gene sets from both human and IR, we annotated the genes of the two newly sequenced macaque genomes by mapping the annotated genes of IR (MMUL_0_1) and human (Ensembl release-56) onto the CE and CR macaque assemblies by BLAT. Orthologous regions were then determined by best BLAT hit and synteny-based analysis, followed by the application of Exonerate and GeneWise to refine the gene model at each locus.

In total, 21,283 gene models were annotated for CE and 21,610 were annotated for CR. The transcriptome data provided direct evidence of transcript abundance. We investigated the numbers of intact genes and their relative expression levels in IR, CR and CE, and found no significant difference between them (Supplementary Table 7).

We then evaluated the quality of the gene sets for the three macaques by counting the genes with an intact open reading frame (ORF), starting with a start codon and ending with a stop codon, and the genes with premature stop codons or frameshift mutations. The latter two types of gene could arise as a consequence of flaws in the draft genome assembly. Hence, the numbers of genes with premature stop codons or frameshift mutations could reflect the assembly quality in the gene core regions. Our calculation suggests that the numbers of these types of genes were comparable between the three macaques (Supplementary Fig. 10).

In addition, we analysed the impact of the draft status of the three macaque genomes on the fragmentation of, or failure to detect, genes. We found that 1,171 human genes were missed in the CE genome and 1,055 genes were missed in the CR genome, while a total of 1,264 genes were deemed to be absent from the IR genome owing to its draft assembly status. Overall, this indicates that, at the gene level, the quality of an assembly using 50X coverage of short reads is comparable to that of an assembly using 5X coverage of long Sanger reads.

Supplementary Table 7. Summary of genes annotated in three macaques. The gene set of IR was from Ensembl (MMUL_0_1). The gene expression level was quantified by RPKM (reads per kilobase per million mapped reads). RPKM ≥5 means that a gene exhibits an RPKM of no less than 5 in at least one of the six tissues.
Species # Total # Intact ORF # partial ORF RPKM 0 RPKM 5 RPKM 0 RPKM 5 RPKM 0 RPKM 5
CE CR IR
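The RPKM measure used in Supplementary Table 7 follows directly from its definition (reads per kilobase of transcript per million mapped reads). A minimal sketch is given below; the function names and the example numbers are illustrative assumptions, and the threshold test encodes the table's stated criterion that a gene counts as RPKM ≥5 when it reaches that value in at least one of the six tissues.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and mapped read total must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def passes_threshold(rpkm_by_tissue, threshold=5.0):
    """True if the gene reaches the threshold in at least one tissue."""
    return any(value >= threshold for value in rpkm_by_tissue)


# Hypothetical gene: 500 reads on a 1,000 bp transcript in a library with
# 1,000,000 mapped reads.
print(rpkm(500, 1000, 1_000_000))                        # 500.0
print(passes_threshold([0.0, 1.2, 5.0, 0.3, 0.0, 2.1]))  # True
```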

24 -24 a intact ORF CE CR b CE 1635 premature stop codon IR CR 1730 c CE frameshift CR IR 1785 IR 1929 Supplementary Fig. 10 Comparison of predicted genes between the three macaca genomes. (a) The number of genes with intact ORFs in the three macaca genomes. The gene set of IR used in this analysis was downloaded from ensemble (release-56). The gene sets of CR and CE were mainly predicted from proteins from human and rhesus macaque using Exonerate and Genewise. The number of genes with premature stop codons (b) and frameshift mutations (c) in the three macaca genomes are similarly depicted. Since the genes released by ensembl were annotated using a different method, and some genes were manually curated, for these two comparisons we re-predicted the gene models for IR, CR and CE, simply using human proteins downloaded from ensemble (release-56) with Exonerate, to exclude the potential bias which could result from the use of different annotation methods. 3 Detection of genetic variation in macaques 3.1 Single nucleotide variation detection All sequenced reads of CR and CE were aligned onto rhemac2 using SOAPaligner 7 in gap-free mode, only two mismatches being allowed for a read length of 44bp, while 5 mismatches were allowed for a read length of 75bp. A total of 82.56% of reads in CR and 76.43% of reads in CE were 24

25 aligned onto the rhemac2 assembly, covering respectively 99.82% and 99.78% of the IR reference genome (excluding the gap region in the reference sequence which cannot be aligned). Single nucleotide variation (SNV) calling was performed by means of SOAPsnp 8, which uses a Bayesian model by carefully considering the character of the Solexa sequencing data and experimental factors. Potential SNVs were extracted that met the following criteria: 1) quality score 20 (on the Phred scale); 2) only unit genomic regions with unique reads mapping were considered for SNP calling; 3) total depth of this location <150. The same criteria were applied to the consensus sequences of CR and CE to calculate the size of the region that could be used for SNV detection. After the filtering process, a total of 86.45% and 85.38% of the reference sequences were found to be usable for SNV detection in CR and CE respectively. Finally, 9.4M SNVs (37.34% homozygous, 62.66% heterozygous) and 12.0M SNVs (44.29% homozygous, 55.71% heterozygous) in CR and CE were identified; the SNV rate was estimated to be 0.41% and 0.53% in CR and CE respectively, the ratio of transition to transversion being about 2.1 (Supplementary Table 8). -25 Supplementary Table 8. Statistics pertaining to SNVs in CR and CE Species Chr %Cov Depth #SNP %SNP #Homo #Hete %Hete #Ti #Tv Ti/Tv autosome ,140, ,408,583 5,731, ,196,010 2,944, CR chrx , , , , , Total ,447, ,527,731 5,919, ,398,289 3,049, autosome ,550, ,042,634 6,507, ,824,217 3,726, CE chrx , , , , , Total ,018, ,322,879 6,695, ,135,423 3,883, Indel variations among macaques Given the wide range of indel sizes, in order to accurately identify all indels, we employed three different methods to estimate divergence with respect to insertion/deletion (indel) events between the three macaque species (see below for detail). The observed indels fell into four categories 25

according to their sizes (Supplementary Table 9). For indels that were <10 kb, we identified a total of 313,937 indel events between the two rhesus subspecies CR and IR, and 426,750 indels between the CE and IR macaque species. Other structural variations, including translocations and inversions, were also detected; these, together with those shared among the genomes, are presented in Supplementary Table 10. The majority of the indels were <100 bp, with 1-bp indels constituting roughly 50% of the total in both CR and CE. To determine the accuracy of our indel detection, we performed PCR sequencing on randomly selected indels in each category from each macaque. The average validation rate was >85%, demonstrating the reliability of our indel detection methods (Supplementary Table 11).

Supplementary Table 9. Indel variants between macaque species/sub-species.
Variation type | CR/IR: #, Length (bp) | CE/IR: #, Length (bp)
1~5 bp Insertion 131, , , ,732
1~5 bp Deletion 122, , , ,163
5~100 bp Insertion 16, ,142 23, ,766
5~100 bp Deletion 16, ,186 22, ,
~10 kbp Insertion 5,213 1,032,548 4, ,288
~10 kbp Deletion 22,253 13,961,550 29,260 18,373,283
>10 kbp Deletion 68 1,706, ,724,954

Supplementary Table 10. Structural variations between macaque species/sub-species.
Species/ IR CR CE shared Type subspecies Mb # Mb # Mb # Mb # Translocation Inversion IR CR CE IR CR CE

Supplementary Table 11. Validation rate of detected InDels
Variation type CR CE #PCR #Validated %Validated #PCR #Validated %Validated 1-5 bpindel Insertion Deletion bpindel Insertion Deletion kb InDel Insertion Deletion >10kb Deletion Total

We next looked at the indel distribution pattern. Here we used only the 1-5 bp indels, since these were less likely to be biased by alignment issues. The indel ratios displayed chromosomal distribution patterns similar to those of the SNVs. The X chromosome exhibited the lowest indel ratio between the macaque subspecies (CR and IR), and, as with the SNPs, a large number of indels (relative to the IR reference) were shared between CR and CE. About 112,000 1-5 bp indel events appear in both CR and CE, meaning that CR and CE share the same genotypes and differ from IR at these positions. These shared indels comprise ~44% of the total number of 1-5 bp indels between the two subspecies CR and IR, and 33% of the 1-5 bp indels between the species CE and IR.
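The shared-indel percentages above come down to a set intersection over indel calls. The following is a minimal sketch, assuming that two calls count as the same event when chromosome, position, type and length all agree; that matching rule, the function name and the toy calls are illustrative assumptions, not the authors' pipeline.

```python
def shared_fraction(cr_indels, ce_indels):
    """Return the shared calls and their fraction of each call set.

    Each call is a (chrom, pos, kind, length) tuple relative to the IR
    reference; exact-match comparison is an assumption of this sketch.
    """
    shared = cr_indels & ce_indels
    return shared, len(shared) / len(cr_indels), len(shared) / len(ce_indels)


# Toy call sets (hypothetical values, for illustration only).
CR_CALLS = {("chr1", 1000, "DEL", 2), ("chr1", 5000, "INS", 1),
            ("chr2", 42, "DEL", 3), ("chrX", 7, "INS", 5)}
CE_CALLS = {("chr1", 1000, "DEL", 2), ("chr2", 42, "DEL", 3),
            ("chr3", 88, "INS", 1), ("chr4", 9, "DEL", 1), ("chr5", 3, "INS", 2)}

SHARED, FRAC_CR, FRAC_CE = shared_fraction(CR_CALLS, CE_CALLS)
```

Applied to roughly 112,000 shared events and the 253,350 (CR) and 345,558 (CE) 1-5 bp indels reported in the detection section, the same arithmetic gives about 44% and 32-33%, in line with the figures quoted above.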

To estimate the specific indel rate in each species/subspecies, we carried out a four-way alignment using the human genome as an outgroup. From this, we could unambiguously distinguish microinsertion and microdeletion events of <100 bp that had occurred in each macaque species/subspecies after they became separated. The estimated microinsertion rates in the three macaques were very similar, and the microdeletion rates were somewhat higher than the microinsertion rates (Supplementary Table 12). On average, the indel mutation rate in the macaque was about one order of magnitude lower than the single-nucleotide mutation rate, as has also been seen in other primates.

Supplementary Table 12. Small Indel rate estimated by four-way alignment
species IR CR CE human
specific deletion rate 4.66E E E E-03
specific insertion rate 2.57E E E E-03

To identify large indels of length >10 kb, we used a long paired-end read strategy to find segments in the IR genome that were missing in CE, CR or both. We found 68 large deletions (with an average size of ~25 kb) in the CR genome relative to the IR genome, and 71 in the CE genome (Supplementary Table 13); 18 of these indels were shared between CR and CE. These indels encompass 31 genes in CR and 28 genes in CE, and would be expected to lead to either the disruption or the loss of these genes in the two species. We carried out PCR on 14 of these indels in 10 additional individuals to assess whether the indels were specific to the individual tested or were common within the population. Interestingly, all the tested indels displayed polymorphic variation in each population, indicative of the unfixed state of these genomic rearrangements.

29 -29 Supplementary Table 13. Large segmental deletions (>10kb) CR CHR Start End Length CHR Start End Length chr1 101,658, ,668,315 10,101 chr1 86,927,537 86,938,511 10,974 chr1 137,048, ,075,287 26,500 chr1 91,517,330 91,527,587 10,257 chr1 145,375, ,388,155 12,750 chr1 107,544, ,554,262 10,248 chr1 207,744, ,756,551 12,370 chr1 107,925, ,936,716 11,281 chr2 150, ,658 66,672 chr1 127,901, ,915,385 14,219 chr2 73,804,957 73,816,021 11,064 chr1 131,591, ,602,380 10,944 chr2 84,469,359 84,519,658 50,299 chr1 207,744, ,756,525 12,343 chr2 92,155,990 92,171,972 15,982 chr2 7,895,809 7,919,063 23,254 chr2 122,501, ,600,905 98,930 chr2 16,154,046 16,170,376 16,330 chr2 143,906, ,957,906 51,592 chr2 73,804,981 73,816,037 11,056 chr3 58,277,050 58,290,067 13,017 chr2 84,469,370 84,519,604 50,234 chr3 140,180, ,191,314 11,034 chr2 92,155,988 92,171,942 15,954 chr3 141,704, ,727,510 22,916 chr2 141,293, ,354,535 61,239 chr4 29,152,687 29,163,060 10,373 chr2 143,906, ,957,911 51,600 chr4 33,692,792 33,704,540 11,748 chr3 116,357, ,409,440 51,664 chr4 125,012, ,026,156 13,658 chr3 141,704, ,727,506 22,901 chr4 133,298, ,323,088 24,731 chr4 33,692,849 33,704,540 11,691 chr4 136,821, ,840,248 19,143 chr4 118,645, ,655,460 10,056 chr4 151,148, ,176,659 27,782 chr4 133,298, ,323,072 24,692 chr5 76,888,522 76,899,063 10,541 chr5 38,849,056 38,864,043 14,987 chr5 92,286,884 92,306,837 19,953 chr5 40,462,908 40,476,359 13,451 chr5 154,278, ,291,375 12,716 chr5 57,554,191 57,565,210 11,019 chr5 179,924, ,936,357 11,627 chr5 65,437,040 65,447,979 10,939 chr7 59,478,175 59,490,355 12,180 chr5 70,318,079 70,333,515 15,436 chr7 62,321,653 62,355,252 33,599 chr6 17,678,765 17,702,412 23,647 chr7 62,896,427 62,926,293 29,866 chr6 21,632,664 21,644,267 11,603 chr7 70,871,352 70,922,608 51,256 chr6 29,109,181 29,121,011 11,830 chr7 148,833, ,865,056 31,856 chr6 29,223,660 29,235,084 11,424 chr7 169,251, ,273,324 21,980 chr6 46,185,492 46,195,800 10,308 chr8 1,712,439 1,730,280 17,841 chr6 
74,709,276 74,789,759 80,483 chr9 3,173,004 3,192,165 19,161 chr6 106,436, ,448,855 12,302 chr9 28,177,007 28,195,464 18,457 chr7 59,478,181 59,490,456 12,275 chr9 81,627,925 81,642,166 14,241 chr7 63,139,498 63,163,962 24,464 chr10 19,221,297 19,243,534 22,237 chr7 109,143, ,176,379 32,597 chr10 47,224,249 47,235,945 11,696 chr7 126,779, ,800,153 20,206 CE 29

30 -30 CR CE CHR Start End Length CHR Start End Length chr10 61,204,816 61,240,578 35,762 chr7 148,833, ,865,057 31,835 chr10 77,329,438 77,363,416 33,978 chr7 151,586, ,598,393 11,927 chr11 118,723, ,736,625 13,234 chr8 7,130,783 7,143,317 12,534 chr11 132,726, ,739,576 13,544 chr8 48,972,629 48,998,144 25,515 chr12 97,503,816 97,546,008 42,192 chr9 3,172,973 3,192,182 19,209 chr13 15,844,810 15,900,001 55,191 chr11 9,697,198 9,712,703 15,505 chr13 30,253,918 30,277,394 23,476 chr13 65,979,524 66,024,969 45,445 chr13 65,942,487 65,964,883 22,396 chr13 70,436,926 70,447,910 10,984 chr13 85,050,371 85,063,760 13,389 chr13 85,050,388 85,063,716 13,328 chr13 87,562,854 87,573,390 10,536 chr13 123,713, ,732,292 18,656 chr13 89,487,079 89,520,049 32,970 chr14 47,458,578 47,493,763 35,185 chr13 106,149, ,161,911 12,114 chr14 57,204,161 57,216,012 11,851 chr14 68,944,437 68,962,784 18,347 chr14 70,210,192 70,260,411 50,219 chr14 92,762,065 92,793,306 31,241 chr15 25,242,560 25,308,501 65,941 chr14 108,893, ,903,867 10,236 chr15 28,768,302 28,803,143 34,841 chr15 28,768,316 28,803,165 34,849 chr15 32,627,323 32,639,969 12,646 chr15 30,262,206 30,297,274 35,068 chr16 51,300,004 51,336,008 36,004 chr15 47,355,386 47,372,754 17,368 chr17 32,719,422 32,733,505 14,083 chr16 10,211,128 10,262,157 51,029 chr18 23,855,324 23,873,540 18,216 chr17 16,657,490 16,668,654 11,164 chr18 28,758,227 28,772,112 13,885 chr17 19,384,101 19,398,586 14,485 chr19 32,641,931 32,701,133 59,202 chr17 22,884,221 22,902,928 18,707 chr19 52,901,619 52,914,758 13,139 chr17 68,011,450 68,023,134 11,684 chr19 63,210,293 63,234,342 24,049 chr19 21,059,730 21,070,842 11,112 chrx 25,167,183 25,194,460 27,277 chr19 23,827,009 23,840,377 13,368 chrx 42,281,531 42,296,214 14,683 chr19 43,815,078 43,830,397 15,319 chrx 46,074,463 46,099,985 25,522 chr20 29,465,163 29,506,423 41,260 chrx 55,107,259 55,119,967 12,708 chrx 25,167,184 25,194,467 27,283 chrx 80,534,403 80,544,549 10,146 chrx 31,482,519 31,500,126 
17,607 chrx 84,479,716 84,490,050 10,334 chrx 42,743,426 42,769,553 26,127 chrx 102,531, ,572,009 40,729 chrx 138,395, ,407,030 11,852 chrx 115,331, ,354,866 23,264 chrx 147,940, ,981,507 41,418 chrx 119,942, ,955,643 12,867 chrx 153,450, ,537,864 87,837 chrx 138,250, ,312,508 61,580 chrx 138,395, ,406,204 11,016 chrx 138,410, ,501,990 91,345 chrx 147,940, ,981,477 41,376 30
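In the rows of Supplementary Table 13 that survive intact, Length equals End minus Start, and several CR and CE deletions have nearly identical coordinates. The sketch below checks the length relation and flags shared deletions on three legible rows per genome. The reciprocal-overlap rule and its 50% threshold are assumptions made for illustration; the text does not state how the 18 shared deletions were matched.

```python
from typing import NamedTuple


class Deletion(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Matches the Length column of the legible table rows.
        return self.end - self.start


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Smaller of the two overlap fractions; 0.0 if the intervals are disjoint."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / a.length, overlap / b.length)


def shared_deletions(cr, ce, min_frac=0.5):
    """Pairs of CR/CE deletions whose reciprocal overlap is at least min_frac (assumed rule)."""
    return [(a, b) for a in cr for b in ce if reciprocal_overlap(a, b) >= min_frac]


# Rows copied from Supplementary Table 13 (only rows with intact coordinates).
CR = [Deletion("chr2", 73_804_957, 73_816_021),
      Deletion("chr2", 84_469_359, 84_519_658),
      Deletion("chr3", 58_277_050, 58_290_067)]
CE = [Deletion("chr2", 73_804_981, 73_816_037),
      Deletion("chr2", 84_469_370, 84_519_604),
      Deletion("chr2", 7_895_809, 7_919_063)]

SHARED = shared_deletions(CR, CE)
```

On this small subset the two chr2 deletion pairs are reported as shared, while the chr3 (CR) and chr2:7,895,809 (CE) deletions remain genome-specific.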

1-5 base-pair insertion/deletion detection

To identify small insertions and deletions ranging in size from 1 to 5 base pairs (bp), reads were aligned onto rhemac2 (allowing for gaps in paired-end mode) using SOAPaligner; paired-end alignment can significantly improve accuracy compared with single-end alignment. To minimize alignment error, the following criteria were applied to the alignment: 1) only one gap, of at most 5 bp, was allowed in a single read; 2) if one read in a pair had a gap in the alignment, the other end had to be gap-free, and the orientation and distance had to meet the parameters of the library; 3) no gap was allowed within 5 bp of the ends of a read; 4) no mismatch was allowed within the gap-containing read. Finally, we collated all uniquely placed, gap-containing alignments; locations in the reference at which no fewer than 8 paired-ends showed the same pattern of gap-containing alignment were retained as potential insertions or deletions. A total of 253,350 indels (51.8% insertions) and 345,558 indels (52.0% insertions) were detected in CR and CE, respectively.

5-100 base-pair insertion/deletion detection

Since the alignment of such short reads onto the IR reference sequence was unreliable for identifying larger indels (50 bp, for example), we developed an in-house pipeline for medium-size indel detection. First, the de novo assemblies were aligned onto rhemac2 using LASTZ [ /README.lastz a.html]. Potential indels were then obtained from the alignment using the following filters: 1) the alignment block (potential orthologous region) used for indel detection had to be larger than 1000 bp; 2) two gaps in the same alignment block

had to be separated by >50 bp; 3) the alignment identity in the block had to be >95%; 4) the flanking region of a gap had to be larger than 50 bp in gap-free alignment; 5) the gap size had to be <100 bp.

>100 base-pair insertion/deletion detection

Large deletions were detected on the basis of paired-end alignments against the rhemac2 reference genome. Each library was characterized by a DNA fragment length that followed a normal distribution with a certain standard deviation (SD). The sequenced read-pairs were expected to align to the reference sequence in a certain orientation, with the span of most aligned read-pairs lying between the insert size - 3 SD and the insert size + 3 SD. Abnormally aligned read-pairs could have been caused by variations in genomic structure. To detect larger segmental deletions, the uniquely aligned read-pairs exhibiting an improper orientation or span size were collected, clustered and compared with our pre-designed patterns; if more than 10 paired-end alignments supported a specific pattern of structural variation in a given region, this was reported as a candidate genomic alteration.

4 Demographic analysis of macaque populations and detection of introgression

4.1 DaDi analysis of two rhesus macaque population models

First, we considered the model described by Hernandez et al. (2007) 10, and used the Diffusion Approximations for Demographic Inference (DaDi) 11 software to estimate the unknown parameters μ = {t_1, t_2, t_3, t_4} (Supplementary Fig. 11). Time in a coalescent genealogy is scaled by the effective population size, in units of 4N_e, where N_e is the effective population size of the common ancestor. All population sizes within a

demographic model are specified relative to some reference population size N_ref, usually the population size of the common ancestor. The population sizes in the two-population model inferred by DaDi (Test 2 in Supplementary Table 14) are nearly the same as those derived by Hernandez et al. (2007) 10, although the time-points differ between the two results. We then assumed symmetric migration between IR and CR, and employed DaDi to estimate the parameters (Supplementary Table 14, column 3). The results for the model with symmetric migration indicated that migration probably occurred between IR and CR; given the significant improvement in the likelihood, this putative migration should not simply be ignored. The migration parameter M is expressed here in units of 4N_ref m, where m is the fraction of new migrants transferring from one population to another in each generation. Allowing asymmetric migration between IR and CR (Supplementary Table 14, column 4) yielded a more precise model. The results for the model with asymmetric migration indicated that the migration rate from CR to IR was much greater than that from IR to CR, with the likelihood also being greater.
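The comparison between the models with and without migration rests on the improvement in log likelihood. One common way to judge such an improvement between nested models is a likelihood-ratio test; the sketch below covers the one-extra-parameter case (no migration versus symmetric migration). It is an illustration under stated assumptions, not the authors' procedure: the log-likelihood values are hypothetical (those in Supplementary Table 14 are not legible here), and because the likelihood is a composite likelihood when SNPs are linked, a plain chi-square reference distribution can be too liberal.

```python
import math


def lrt_pvalue_1df(loglik_null, loglik_alt):
    """Likelihood-ratio statistic and chi-square (1 d.f.) p-value for nested models.

    For 1 d.f. the chi-square survival function is erfc(sqrt(x / 2)).
    """
    stat = 2.0 * (loglik_alt - loglik_null)
    if stat <= 0.0:
        return stat, 1.0
    return stat, math.erfc(math.sqrt(stat / 2.0))


def migrant_fraction(M, n_ref):
    """Per-generation migrant fraction m from M = 4 * N_ref * m (scaling as stated in the text)."""
    return M / (4.0 * n_ref)


# Hypothetical log likelihoods, for illustration only.
STAT, PVALUE = lrt_pvalue_1df(loglik_null=-1000.0, loglik_alt=-990.0)
```

With these made-up values the statistic is 20 and the p-value is far below 0.001; the asymmetric model adds a further parameter and would need its own comparison.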

Supplementary Fig. 11 Demographic models. (A) Two-population model. N_r represents the effective ancestral population size of the two rhesus macaque populations; t_3 represents the divergence time in the ancestry of the two rhesus macaques (time in this study is scaled by 4N_r generations). The ancestral population split into two independent populations, IR and CR, at time t_2 in the past; the Chinese macaque population abruptly changed size to f_2 N_r at t_2. The Indian macaque population maintained its ancestral size until time t_1 in the past, at which point its size changed abruptly to f_1 N_r. (B) Three-population model. N_m represents the effective ancestral population size of the three macaque populations; at time t_4 in the past, the ancestral population split into two populations, one being the rhesus population (IR and CR), the other the CE population. In the absence of population data for CE, we assumed that N_m is equal to N_r in the analysis.

Supplementary Table 14. Parameters estimated in demographic model
Parameter Test1 Test2 Test3 Test4 Test5 Test6 t t t3 NA e e t4 NA NA NA NA e f f f3 NA NA NA NA 1 1 NA NA m NA NA NA NA NA NA NA NA NA NA log likelihood NA
Test1: estimates in Hernandez et al. 10
Test2: estimates in two populations without migration model
Test3: estimates in two populations with symmetric migration model
Test4: estimates in two populations with asymmetric migration model
Test5: estimates in three populations without migration between CE and CR
Test6: estimates in three populations with migration between CE and CR

4.2 DaDi analysis of three macaque population models

Owing to the paucity of polymorphism data for the CE population (only one individual was studied here), we adopted a simple model for the three macaque populations. We assumed no migration between either CE and IR or CE and CR. We further assumed that the population model for rhesus macaque (IR) was the same as for the previous model with asymmetric migration. In addition, we assumed that the population sizes of rhesus macaque (IR) and CE were the same as that of their most recent

common ancestor (MRCA). Hence, there was only one parameter to be estimated, $t_4$, the divergence time between rhesus and CE (Supplementary Table 14, column 5). Additionally, we assumed migration between CE and CR while retaining the other assumptions; $t_4$, $M_{CE \to CR}$ and $M_{CR \to CE}$ were then estimated (Supplementary Table 14, column 6).

4.3 Detection of putative introgression regions

We observed that the sequence divergence between CE and CR was much lower than that between CE and IR (Supplementary Fig. 12). In addition, by combining previous SNP data from the IR and CR populations 10 with data from our own sequenced CR and CE individuals, we noted that our CE individual clustered within the CR population (Supplementary Fig. 13). This implies that there may have been introgression between CR and CE. We therefore estimated the degree of introgression between these two genomes. The degree of asymmetry in the divergence between CE/CR and CE/IR can be used to estimate the proportion of the CE genome that is of CR origin. In particular, the proportion of the genome that is introgressed ($m$) should satisfy

$$m = \frac{D_{13} - D_{12}}{D_{13} - D_{22}},$$

where $D_{ij}$ is the average proportion of pairwise differences between individuals sampled from populations $i$ and $j$, and the population indices are CE: 1, CR: 2, IR: 3. Using this equation, we found that approximately 30% of the CE genome is of CR origin.

We next sought to identify putative introgression regions (PIRs) in the macaque. If a given chromosomal region had originated as a consequence of hybridization between CE and CR, the sequence diversity between CE and CR (denoted $DIV_{CE-CR}$) would be lower than that between CR and IR (denoted $DIV_{CR-IR}$). The diversity between two

species/subspecies can be scaled as a genetic distance using the matrix listed in Supplementary Table 15. We then calculated $DIV_{CE-CR}$ and $DIV_{CR-IR}$ for non-overlapping windows of fixed size (denoted $DIV^{i}_{CE-CR}$ and $DIV^{i}_{CR-IR}$ for window $i$) as follows:

$$DIV^{i}_{S_1-S_2} = \frac{\text{distance in window } i}{\text{number of non-N bases in window } i},$$

where $S_1$ and $S_2$ are two different species/subspecies. Further, we introduced a statistic to quantify the difference between $DIV^{i}_{CE-CR}$ and $DIV^{i}_{CR-IR}$ for window $i$:

$$R^{i}_{diff} = \frac{DIV^{i}_{CE-CR} - DIV^{i}_{CR-IR}}{DIV^{i}_{CE-CR} + DIV^{i}_{CR-IR}}.$$

Under this definition, a negative $R^{i}_{diff}$ indicates that CR is closer to CE than to IR. To filter out the regions where the CR sequence was closer to CE by chance alone, we generated simulated data under a neutral model, using the parameters estimated by the demographic analysis. The demographic model used for the simulation assumed no migration between CE and CR. The program ms 12 was used to generate segregating sites for the three macaques. The ms command we used was:

ms t I n 1 1 -n n ma x xxxx x x -en en ej ej en >ms.out

We then calculated $R_{diff}$ for each window in the simulated data. The 1% quantile of $R_{diff}$ in the simulated data was used as the cut-off (denoted $R_{cutoff}$), that is, $P(R_{diff} \le R_{cutoff}) = 0.01$ over all windows in the simulated data. The cut-off from the simulated data was then applied to our actual data: windows with $R^{i}_{diff} \le R_{cutoff}$ were predicted to be PIRs. Under this definition of $R_{cutoff}$, the p-values of PIRs are <0.01. We performed this analysis using a series of window sizes from 10 kb to 1 Mb. The total size of the PIRs is shown in Supplementary Fig. 14. We found

that fewer PIRs could be detected using small windows (10 kb and 20 kb) or large windows (>100 kb). This could be because a small window contains fewer SNVs, which reduces the power to detect PIRs, whereas a large window may exceed the average size of real PIRs. The total PIR size does not change much when 40 kb to 100 kb windows are used, with a peak at 50 kb. To maximize the precision of PIR detection, we used 50 kb windows in our final version.

Supplementary Fig. 12 Diversity ratio difference between and within species. The divergence along 100 kb windows was compared between CE/CR and CR/IR, yielding an estimate of >23% for the genomic regions over which CE and CR are the most closely related. No such inconsistency was, however, observed in comparisons involving CE/IR and CR/IR. Only homozygous divergence was considered in the calculation. This may reflect the influence of recurring hybridization between the Chinese rhesus and crab-eating macaques on the genomic sequences of these two species.
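The window-based PIR scan of Section 4.3 can be sketched as follows. This is a minimal illustration under several assumptions: genotypes are two-character strings scored with the distances of Supplementary Table 15 (any pair involving a heterozygote is given 1/2, which extends the table to patterns it does not list); the difference statistic is normalized by the sum of the two divergences; and the cut-off is the empirical lower 1% quantile of the simulated values. Function names and the toy numbers are hypothetical.

```python
import math


def genotype_distance(g1, g2):
    """Distance between two diploid genotypes, following Supplementary Table 15."""
    het1, het2 = g1[0] != g1[1], g2[0] != g2[1]
    if het1 or het2:
        return 0.5            # (AT-TT) and (AT-AT) patterns
    return 0.0 if g1 == g2 else 1.0   # (AA-AA) vs (AA-TT)


def window_div(genotype_pairs, non_n_bases):
    """Summed genotype distance in a window divided by its number of non-N bases."""
    return sum(genotype_distance(a, b) for a, b in genotype_pairs) / non_n_bases


def r_diff(div_ce_cr, div_cr_ir):
    """Normalized difference; negative when CR is closer to CE than to IR."""
    total = div_ce_cr + div_cr_ir
    return 0.0 if total == 0 else (div_ce_cr - div_cr_ir) / total


def lower_quantile(values, q):
    """Empirical lower q-quantile (smallest value with at least q of the mass at or below it)."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(q * len(ordered)) - 1)]


def call_pirs(observed, simulated, q=0.01):
    """Indices of observed windows whose statistic is at or below the simulated cut-off."""
    cutoff = lower_quantile(simulated, q)
    return [i for i, r in enumerate(observed) if r <= cutoff]


# Toy values standing in for simulated and observed windows (hypothetical).
SIMULATED = [-0.9, -0.5] + [0.0] * 98
OBSERVED = [-0.95, -0.2, 0.3]
PIRS = call_pirs(OBSERVED, SIMULATED)
```

In practice the simulated values would come from the ms output processed with the same window size as the real data, so that the cut-off and the observed statistic are comparable.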

Supplementary Fig. 13 Principal component analysis for 1,476 SNPs in the macaque species. A total of 1,476 SNPs, obtained from ref 10, were combined with those of our two sequenced individuals (one CE, the other CR in this plot) to perform principal component analysis. PC1 and PC2 placed these two individuals within the cluster of Chinese rhesus individuals surveyed in ref 10, suggesting that our sequenced CE individual shares genomic sequence with the CR population. CH denotes the 9 Chinese rhesus macaques from ref 10; IN denotes the 38 Indian rhesus macaques from ref 10.

Supplementary Fig. 14 Total size of PIRs detected using different window sizes.
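The genome-wide introgression estimate of Section 4.3, m = (D13 - D12)/(D13 - D22) with CE: 1, CR: 2, IR: 3, is simple arithmetic. The sketch below uses made-up divergence values chosen only so that the result lands near the reported ~30%; they are not the measured values.

```python
def introgressed_proportion(d12, d13, d22):
    """Proportion of the CE genome of CR origin: (D13 - D12) / (D13 - D22).

    d12: average pairwise difference between CE and CR
    d13: average pairwise difference between CE and IR
    d22: average pairwise difference within CR
    """
    if d13 == d22:
        raise ValueError("D13 equals D22: the proportion is undefined")
    return (d13 - d12) / (d13 - d22)


# Hypothetical per-site differences, for illustration only.
M_EXAMPLE = introgressed_proportion(d12=0.0041, d13=0.0050, d22=0.0020)
```

The two limiting cases make the formula easy to check: if CE were no closer to CR than to IR (D12 = D13) the estimate is 0, and if CE differed from CR only as much as CR individuals differ from each other (D12 = D22) the estimate is 1.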

Supplementary Fig. 15 Distribution of putative introgression regions (PIRs) by chromosome. The bar denotes the total size of PIRs in each chromosome, whereas the red curve represents the proportion of PIRs on each chromosome. The X chromosome contains the lowest proportion of PIRs.

Supplementary Table 15. Genetic distance matrix of diploid pattern
Diploid pattern: (AA-AA) (AA-TT) (AT-TT) (AT-AT)
Distance assigned: 0 1 1/2 1/2

5 Detection of selective sweep regions

5.1 Principle of the HKA test and its application in our analysis

This section describes the main principle of the Hudson-Kreitman-Aguadé (HKA) test; more detailed information may be obtained from the original paper 13. The neutral theory of molecular evolution predicts that genomic regions which evolve at a high rate, as revealed by inter-species DNA sequence comparisons, will also exhibit high levels of intra-species polymorphism. Let us consider data collected from two species, which we shall refer to as

species A and species B, and from $L$ regions in the genome, which we shall refer to as locus 1 through locus $L$. Let us assume that a random sample of $n_A$ gametes from species A has been sequenced at all $L$ loci and that $n_B$ gametes from species B have been sequenced at the same loci. Let $S_i^A$ denote the number of nucleotide sites that are polymorphic at locus $i$ in the sample of $n_A$ gametes from species A. Similarly, let $S_i^B$ denote the number of polymorphic sites at locus $i$ in the sample of $n_B$ gametes from species B. Let $D_i$, $i = 1, \ldots, L$, denote the number of sequence differences at locus $i$ between a random gamete from the sample of species A and a random gamete from the sample of species B.

The HKA test also assumes that: 1) species A and B were at stationarity at the time of sampling, with population sizes $2N$ and $2Nf$ respectively; 2) the two species were derived $T'$ generations ago from a common ancestor; 3) the ancestral population was at stationarity at the time of the split, with a population size of $2N(1+f)/2$ gametes; 4) at locus $i$, the number of mutations per gamete in each generation follows approximately a Poisson distribution with mean $\mu_i$. The HKA test then employs the $X^2$ statistic to measure the goodness-of-fit of the observations to the model, defined as follows:

$$X^2 = \sum_{i=1}^{L} \frac{\left(S_i^A - \hat{E}(S_i^A)\right)^2}{\widehat{Var}(S_i^A)} + \sum_{i=1}^{L} \frac{\left(S_i^B - \hat{E}(S_i^B)\right)^2}{\widehat{Var}(S_i^B)} + \sum_{i=1}^{L} \frac{\left(D_i - \hat{E}(D_i)\right)^2}{\widehat{Var}(D_i)}$$

Under the neutral model, the estimated expectations and variances of $D_i$, $S_i^A$ and $S_i^B$ could be obtained using the following properties:

$$
\begin{aligned}
E(S_i^A) &= \theta_i\, C(n_A), & i &= 1, \ldots, L \\
Var(S_i^A) &= E(S_i^A) + \theta_i^2 \sum_{j=1}^{n_A - 1} \frac{1}{j^2}, & i &= 1, \ldots, L \\
E(S_i^B) &= f\,\theta_i\, C(n_B), & i &= 1, \ldots, L \\
Var(S_i^B) &= E(S_i^B) + (f\,\theta_i)^2 \sum_{j=1}^{n_B - 1} \frac{1}{j^2}, & i &= 1, \ldots, L \\
E(D_i) &= \theta_i \left(T + (1+f)/2\right), & i &= 1, \ldots, L
\end{aligned}
$$

where $\theta_i = 4N\mu_i$, $T = T'/2N$ and $C(n) = \sum_{j=1}^{n-1} 1/j$. When the estimators of $\theta_i$, $f$ and $T$ (denoted $\hat{\theta}_i$, $\hat{f}$, $\hat{T}$) are known, we may calculate the value of $X^2$. The estimators can be obtained by solving the following system of equations:

$$
\begin{aligned}
\sum_{i=1}^{L} S_i^A &= C(n_A) \sum_{i=1}^{L} \hat{\theta}_i \\
\sum_{i=1}^{L} S_i^B &= \hat{f}\, C(n_B) \sum_{i=1}^{L} \hat{\theta}_i \\
\sum_{i=1}^{L} D_i &= \left(\hat{T} + (1+\hat{f})/2\right) \sum_{i=1}^{L} \hat{\theta}_i \\
S_i^A + S_i^B + D_i &= \hat{\theta}_i \left\{ \hat{T} + (1+\hat{f})/2 + C(n_A) + \hat{f}\, C(n_B) \right\}, \quad i = 1, \ldots, L-1
\end{aligned}
$$

The original paper 13 provided a modified version of the typical HKA test to accommodate a situation in which polymorphism data are available for only one species, and in which the number of sites differs between the within-species data and the between-species data. The macaque data used in our own analysis represent just such a situation. The only difference between our analysis and the typical HKA test is that here we took CR and CE as one species, while using Homo sapiens as an outgroup. This is reasonable because the three macaques we analyzed are very closely related evolutionarily. Let us consider the data in Supplementary Table 16, where $S_1^A = 969$, $S_2^A = 681$, $D_1 = 6146$, $D_2 = 5308$ and $n_A = 6$. Assuming that the neutral mutation rate for a locus is proportional to the number of sites, let $\theta_i$ denote $4N$ times the mutation rate per nucleotide site in locus $i$.

We assume that the two populations (macaque and human) have the same population size, hence $f = 1$. To obtain the estimates of $T$ and $\theta_i$ ($i = 1, 2$), we must solve the following system of equations, in which $l_i^{w}$ and $l_i^{b}$ are the aligned lengths of locus $i$ within macaques and between macaques and human (Supplementary Table 16; $l_1^{w} = 88187$):

$$
\begin{aligned}
S_1^A + S_2^A &= C(n_A)\left(l_1^{w}\,\hat{\theta}_1 + l_2^{w}\,\hat{\theta}_2\right) \\
D_1 + D_2 &= (\hat{T} + 1)\left(l_1^{b}\,\hat{\theta}_1 + l_2^{b}\,\hat{\theta}_2\right) \\
D_1 + S_1^A &= l_1^{b}\,\hat{\theta}_1 (\hat{T} + 1) + C(n_A)\, l_1^{w}\,\hat{\theta}_1
\end{aligned}
$$

Then the $X^2$ statistic can be calculated with $\hat{T}$ and $\hat{\theta}_i$ ($i = 1, 2$).

Supplementary Table 16. Example of input data in HKA test
Columns: Locus i* (Length, Aligned length, #variable sites, $\hat{\theta}_i$); L avg** (Length, Aligned length, #variable sites)
Within macaques 100,000 88, ,000 84,
between macaques and human 100,000 88,067 6, ,000 84,236 5,308
* Locus i denotes the targeted region
** L avg denotes the average level of whole genome

5.2 Detection of selective sweeps

We employed the HKA test 12,13 to detect regions that contain potential selective sweeps, that is, regions with a low degree of divergence among macaques but normal levels of divergence between macaques and an outgroup (here we used human as the outgroup). The procedure was as follows: 1) each chromosome from IR was split into non-overlapping windows of a fixed size (100 kb), with each window being considered as a locus in the HKA test; 2) we calculated the average divergence within macaques and the average divergence between macaques and human, using the SNP set from the three macaques and the pairwise alignment between IR and hg18. The average divergence between macaques, and the average divergence between macaques and human, were both

considered to be neutral and were used to construct a virtual locus, termed L avg; 3) we ran the HKA test with two loci each time, one locus being a window taken from the genome, the other being L avg. The HKA test was run with a downloaded program. The $X^2$ statistic used for measuring the goodness-of-fit can be obtained after application of the HKA test to each window, and can be used to infer the putative selective sweeps. Since the actual distribution of $X^2$ and the number of selective sweep regions are unknown, simulations were performed to calculate the significance of the deviations. Based on the demographic model of three populations (allowing migration between CE and CR) estimated by DaDi, we used the program ms 12 to generate simulated segregating sites for the 3 macaque species. The ms command we used was:

ms t I n 1 1 -n n ma x x x x x -en en ej ej en >ms.out

The output consisted of independent samples, each corresponding to a locus of 1000 bp. The value of the parameter '-t' (denoted θ_s) was converted from the estimate of θ in the DaDi analysis as θ_s = θ/150367*1000, where 150367 is the total length in bp of the regions within which the SNPs used in the DaDi analysis reside. The output of ms gives the positions of the segregating sites, which is not sufficient to calculate the divergence between the three macaque species and human. Therefore, we used the ancestral sequence of the three macaques, determined by parsimony from multiple sequence alignments, and placed the segregating sites on the ancestral sequence to generate the simulated diploid sequences for the three macaques. We then performed the same HKA test procedure as described above on the simulated data and obtained the simulated distribution of $X^2$.
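The two-locus calculation (a genomic window against the virtual locus L avg, polymorphism from one species only, f = 1) can be sketched numerically. The code below solves three estimating equations of the modified HKA form for per-site θ_1, θ_2 and T by bisection on T, then sums the goodness-of-fit terms. Several points are assumptions and should be checked against the original HKA paper: the variance of D_i is not given in the text, and the form E(D_i) + (l·θ_i)^2 used here is the f = 1 case as recalled from that paper; the within-macaque aligned length of L avg is not legible in Supplementary Table 16, so 84,236 is reused purely as a placeholder; all function names are hypothetical.

```python
def harmonic(n, power=1):
    """Sum over j = 1 .. n-1 of 1 / j**power; power=1 gives C(n)."""
    return sum(1.0 / j ** power for j in range(1, n))


def fit_two_locus(S, D, len_within, len_between, n):
    """Estimate per-site theta_1, theta_2 and T for two loci with f = 1."""
    c = harmonic(n)

    def thetas(T):
        # Locus-1 equation: D1 + S1 = theta1 * (l1_between * (T + 1) + C(n) * l1_within)
        th1 = (D[0] + S[0]) / (len_between[0] * (T + 1.0) + c * len_within[0])
        # Polymorphism equation: S1 + S2 = C(n) * (l1_within * th1 + l2_within * th2)
        th2 = (S[0] + S[1] - c * len_within[0] * th1) / (c * len_within[1])
        return th1, th2

    def residual(T):
        # Divergence equation: D1 + D2 = (T + 1) * (l1_between * th1 + l2_between * th2)
        th1, th2 = thetas(T)
        return (T + 1.0) * (len_between[0] * th1 + len_between[1] * th2) - (D[0] + D[1])

    lo, hi = 0.0, 1.0
    if residual(lo) >= 0.0:
        raise ValueError("no sign change at T = 0; bisection bracket is invalid")
    while residual(hi) < 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    T = 0.5 * (lo + hi)
    th1, th2 = thetas(T)
    return th1, th2, T


def hka_x2(S, D, len_within, len_between, n):
    """Goodness-of-fit statistic summed over both loci (assumed variance forms)."""
    th1, th2, T = fit_two_locus(S, D, len_within, len_between, n)
    c, c2 = harmonic(n), harmonic(n, 2)
    x2 = 0.0
    for s, d, lw, lb, th in zip(S, D, len_within, len_between, (th1, th2)):
        e_s = th * lw * c
        var_s = e_s + (th * lw) ** 2 * c2
        e_d = th * lb * (T + 1.0)
        var_d = e_d + (th * lb) ** 2   # assumed f = 1 form, not stated in the text
        x2 += (s - e_s) ** 2 / var_s + (d - e_d) ** 2 / var_d
    return x2


# Counts from the worked example in Section 5.1; second within-length is a placeholder.
S = (969, 681)
D = (6146, 5308)
LEN_WITHIN = (88187, 84236)
LEN_BETWEEN = (88067, 84236)
N_A = 6
X2 = hka_x2(S, D, LEN_WITHIN, LEN_BETWEEN, N_A)
```

With four observations and three fitted parameters, the statistic has one degree of freedom under the usual approximation; the text instead calibrates it against simulations, which is why the 99% quantile of simulated values is used as the cut-off.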

From the distribution of deviations in the simulated data, we obtained the 99% cutoff of $X^2$ (Supplementary Fig. 16) to identify the significant signals corresponding to strong selective sweeps. We replicated the simulation 10 times and calculated the mean and SD of the 99% cutoffs. The mean value was then used to determine the location of the putative selective sweep regions. To ensure normal background levels of divergence between macaques and human within the selective sweep regions, we filtered out those loci whose divergence (between macaques and human) lay outside the interval

$$\left[\, mean(DIV_{h-m}) - SD(DIV_{h-m}),\; mean(DIV_{h-m}) + SD(DIV_{h-m}) \,\right],$$

where $mean(DIV_{h-m})$ denotes the mean value of the divergences between macaques and human and $SD(DIV_{h-m})$ denotes their standard deviation. The $mean(DIV_{h-m})$ and $SD(DIV_{h-m})$ values were calculated for each chromosome. To avoid false positives due to reduced diversity in any two macaques rather than in all macaques, we restricted our search for putative selective sweeps to regions displaying consistently low pairwise divergence between macaque species. Thus, we only retained those putative selective sweep regions that satisfied the following criteria:

$$
\begin{aligned}
DIV_{IR-CR} &< mean(DIV_{IR-CR}) - SD(DIV_{IR-CR}) \\
DIV_{IR-CE} &< mean(DIV_{IR-CE}) - SD(DIV_{IR-CE}) \\
DIV_{CR-CE} &< mean(DIV_{CR-CE}) - SD(DIV_{CR-CE})
\end{aligned}
$$

Finally, we merged adjacent windows of putative selective sweep regions into a single locus, and extended the two ends of each sweep to their nearest segregating sites. All the detected selective sweep regions can be found in Supplementary Table 17 in a separate file.
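The post-HKA filtering steps can be sketched for the windows of one chromosome. The sketch assumes that a window is kept when its statistic exceeds the simulated cut-off, its macaque-human divergence lies within one SD of the chromosome mean, and each of the three pairwise macaque divergences is below the chromosome mean minus one SD; the direction and width of these thresholds are read from a damaged passage and should be treated as assumptions. The dictionary keys, helper names and toy values are hypothetical, and the final extension to the nearest segregating sites is omitted.

```python
from statistics import mean, stdev

PAIRS = ("IR_CR", "IR_CE", "CR_CE")


def sweep_windows(windows, x2_cutoff):
    """Indices of windows (one chromosome) that pass all three filters."""
    hm = [w["h_m"] for w in windows]
    hm_mean, hm_sd = mean(hm), stdev(hm)
    limits = {}
    for pair in PAIRS:
        values = [w[pair] for w in windows]
        limits[pair] = mean(values) - stdev(values)
    keep = []
    for i, w in enumerate(windows):
        if w["x2"] <= x2_cutoff:
            continue                      # not significant in the HKA test
        if abs(w["h_m"] - hm_mean) > hm_sd:
            continue                      # abnormal macaque-human divergence
        if all(w[pair] < limits[pair] for pair in PAIRS):
            keep.append(i)                # low divergence in every macaque pair
    return keep


def merge_adjacent(indices):
    """Merge runs of consecutive window indices into (first, last) tuples."""
    merged = []
    for i in indices:
        if merged and i == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], i)
        else:
            merged.append((i, i))
    return merged


def make_window(x2, h_m, d):
    """Toy window with the same divergence d for all three macaque pairs."""
    return {"x2": x2, "h_m": h_m, "IR_CR": d, "IR_CE": d, "CR_CE": d}


# Hypothetical windows: indices 3 and 4 mimic a sweep; index 8 has a high
# statistic but an outlying macaque-human divergence.
NORMAL = make_window(1.0, 0.060, 0.004)
WINDOWS = ([NORMAL] * 3
           + [make_window(12.0, 0.061, 0.001), make_window(15.0, 0.059, 0.001)]
           + [NORMAL] * 3
           + [make_window(13.0, 0.080, 0.001)]
           + [NORMAL])

KEPT = sweep_windows(WINDOWS, x2_cutoff=9.0)
REGIONS = merge_adjacent(KEPT)
```

In this toy chromosome windows 3 and 4 are retained and merged into one region, while window 8 is rejected by the macaque-human divergence filter.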

Supplementary Fig. 16 Distribution of HKA test $X^2$ values in simulated data.

Supplementary Table 17. Selective sweep regions detected in this study. (This table can be found in a separate file)

6 Gene evolution

6.1 Ortholog determination

DNA and protein data from five species [human, chimpanzee, Indian rhesus macaque (IR), mouse and rat] were downloaded from the Ensembl database. For genes with alternative splicing variants, the longest transcript was selected to represent the gene. We used orthologous 1:1 relationships downloaded from the Ensembl database to identify pairwise orthologs between human and other species. The orthologous relationships in macaques were determined on the basis of synteny. Genes that may have contained errors such as frameshifts, or that were incomplete, were removed from the analysis. Based on these orthologous relationships, we compiled ortholog sets for different groups of species, including five orthologous species (Human,

Chimpanzee, IR, CR and CE; FIVE ortholog), six orthologous species (Human, Chimpanzee, IR, CE, Mouse and Rat; SIX ortholog), and seven orthologous species (Human, Chimpanzee, IR, CR, CE, Mouse and Rat; SEVEN ortholog). In total, we assigned 14,978 1:1 gene orthologues for human, chimpanzee and the various macaque species/subspecies by genome alignment. After filtering incomplete genes and discarding genes with any frameshifts (due to the draft quality of the four non-human primate genomes), we were able to assign unambiguously 8,620 high quality 1:1 orthologues, which were used in further analysis to avoid any bias due to false gene prediction. We also compiled high confidence 1:1 orthologous relationships for all five primates as well as two rodent species (mouse and rat), comprising 5,409 orthologues with complete open reading frames.

6.2 dn/ds calculation

We used the Yang-Nielsen (YN) model 14 to calculate dn and ds values for each pair of macaques as well as those between human and chimpanzee. To estimate lineage-specific dn and ds values, the SEVEN ortholog dataset was used for the calculation of in-branch dn and ds values with the codeml program of the PAML (Phylogenetic Analysis by Maximum Likelihood) package 15, using the F3x4 model (which calculates codon frequencies from the base composition at the three codon positions), different ω ratios across branches and a single ω ratio across sites, separate estimation of κ per gene, and a given tree topology.

6.3 Rapidly and slowly evolving function categories

To explore the evolution of functional categories, we downloaded the Gene Ontology (GO) annotation of human genes from the Ensembl database. Only

the GO categories containing no fewer than 20 genes in our SIX ortholog dataset were selected for further analysis. Orthologs of IR and CE were used to identify rapidly or slowly evolving functions among the GO categories. Firstly, we calculated the average Ka and Ks values for all genes annotated to GO:

$$k_a = \frac{\sum_{i \in T} a_i}{\sum_{i \in T} A_i}, \qquad k_s = \frac{\sum_{i \in T} s_i}{\sum_{i \in T} S_i}$$

where $a_i$ and $A_i$ are the numbers of non-synonymous substitutions and sites, $s_i$ and $S_i$ are the numbers of synonymous substitutions and sites in gene $i$, and $T$ is the set of genes annotated to GO.

Secondly, the expected proportion of non-synonymous substitutions to all substitutions, $P_A$, in a GO category $C$ was estimated as:

$$P_A = \frac{k_a \sum_{i \in C} A_i}{k_a \sum_{i \in C} A_i + k_s \sum_{i \in C} S_i}$$

Finally, for a given GO category $C$, we used the binomial distribution to assess the deviation of the observed proportion of non-synonymous substitutions from the expected proportion:

$$P_C = \sum_{j = a_C}^{a_C + s_C} \binom{a_C + s_C}{j} P_A^{\,j} (1 - P_A)^{a_C + s_C - j}$$

where $a_C$ and $s_C$ are the numbers of non-synonymous and synonymous substitutions observed in category $C$.

Seven gene categories were identified as outliers relative to the expectation from randomly permuted annotations (p < 10⁻⁴). These rapidly evolving categories included melanocortin receptor activity (ω = 0.59), single-stranded DNA binding (ω = 0.52), olfactory receptor activity (ω = 0.51), sensory perception of smell (ω = 0.51), response to stimulus (ω = 0.38) and mitochondrion (ω = 0.32), many of which are already known to have undergone rapid evolution in many other mammalian species including humans 18 (Supplementary Table 18).

Supplementary Table 18. Rapid evolution of function catalogue in macaques
GO ID | gene number | GO name | GO category | dn/ds | Amino acid divergence | p-value
GO: | | sensory perception of smell | biological_process | | | E-12
GO: | | methyltransferase activity | molecular_function | | | E-04
GO: | | Mitochondrion | cellular_component | | | E-06
GO: | | olfactory receptor activity | molecular_function | | | E-13
GO: | | single-stranded DNA binding | molecular_function | | | E-03
GO: | | response to stimulus | biological_process | | | E-06
GO: | | melanocortin receptor activity | molecular_function | | | E

6.4 Identification of lineage-specific accelerated GO categories

We used the SIX ortholog dataset to identify lineage-specific accelerated GO categories. Firstly, for lineages $x$ and $y$, we calculated the average proportion of non-synonymous sites:

$$p_x = \frac{x}{x + y}, \qquad x = \sum_i x_i, \quad y = \sum_i y_i$$

where $x$ is the total number of non-synonymous sites in the $x$ lineage, and $y$ is the total number of non-synonymous sites in the $y$ lineage. Then, the deviation of the observed proportion of non-synonymous numbers in the different lineages from the expected proportion was estimated using the binomial distribution:

$$p_C = \sum_{j = x_C}^{x_C + y_C} \binom{x_C + y_C}{j} p_x^{\,j} (1 - p_x)^{x_C + y_C - j}$$

where $x_C$ and $y_C$ are the corresponding totals for the genes in GO category $C$.

All p-values were adjusted for multiple testing by means of the Bonferroni method. Compared with murids, 19 categories were found to exhibit specific acceleration of non-synonymous substitutions in both the Macaca and hominid lineages (Supplementary Table 19). These rapidly evolving primate genes encode proteins with functions in sensory perception, stimulus response, melanocortin receptor activity and keratin filaments. However, we did not detect any significant positive selection for GO categories in the murid lineage as compared with the primates. The rapidly evolving categories we detected in murids as compared to hominids, which

mainly relate to a function in host defence 18, were no longer statistically significant when compared with Macaca, suggesting that these genes have undergone adaptive evolution in both Macaca and murids, in stark contrast to hominids. Our analysis also revealed a total of 14 GO categories that displayed significantly accelerated evolution specifically in the hominid lineage. These genes encode proteins with major functions in the regulation of glucose metabolism, cartilage development, male gonad development, receptor-mediated endocytosis and ion transport. Interestingly, many synapse-related genes specifically accumulate an excess of non-synonymous mutations in hominids, which may be indicative of the extraordinarily rapid evolution of the hominid nervous system 36. A total of 15 categories of genes evolved rapidly in the Macaca lineage.
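The binomial test of Section 6.4, followed by the Bonferroni adjustment, can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names and all counts are hypothetical, and the tail probability is summed in log space purely for numerical convenience.

```python
from math import exp, lgamma, log


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    total = 0.0
    for j in range(k, n + 1):
        # log of C(n, j) * p^j * (1 - p)^(n - j), computed via lgamma
        log_pmf = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                   + j * log(p) + (n - j) * log(1.0 - p))
        total += exp(log_pmf)
    return min(1.0, total)


def lineage_pvalue(x_c, y_c, x_total, y_total):
    """p_C for one GO category: x_c and y_c are the category counts in
    lineages x and y; x_total and y_total are the genome-wide counts."""
    p_x = x_total / (x_total + y_total)  # expected proportion for lineage x
    return binomial_upper_tail(x_c, x_c + y_c, p_x)


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Hypothetical counts for two GO categories, for illustration only.
raw = [lineage_pvalue(30, 10, 5000, 5000), lineage_pvalue(12, 11, 5000, 5000)]
adjusted = bonferroni(raw)
```

The same upper-tail sum applies to the category test of Section 6.3 if `p_x` is replaced by the expected proportion P_A and the counts by a_C and s_C.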

Supplementary Table 19. Lineage-specific evolution GO categories in each branch
Category | GO ID | GO Term | GO category | #Orthologue | Amino acid divergence (hominid, macaca, murid) | dn/ds (hominid, macaca, murid)
Primate-specific | GO: | positive regulation of ubiquitin-protein ligase activity during mitotic cell cycle | BP
Primate-specific | GO: | GTPase activity | MF
Primate-specific | GO: | response to stimulus | BP
Primate-specific | GO: | GTP binding | MF
Primate-specific | GO: | spliceosomal complex | CC
Primate-specific | GO: | cell motion | BP
Primate-specific | GO: | keratin filament | CC
Primate-specific | GO: | nuclear mRNA splicing, via spliceosome | BP
Primate-specific | GO: | structural constituent of cytoskeleton | MF
Primate-specific | GO: | translational elongation | BP
Primate-specific | GO: | small GTPase mediated signal transduction | BP
Primate-specific | GO: | nucleocytoplasmic transport | BP
Primate-specific | GO: | cytosolic small ribosomal subunit | CC
Primate-specific | GO: | olfactory receptor activity | MF
Primate-specific | GO: | melanocortin receptor activity | MF
Primate-specific | GO: | sensory perception of smell | BP
Primate-specific | GO: | intracellular protein transport | BP
Primate-specific | GO: | mRNA binding | MF
Primate-specific | GO: | translational initiation | BP
Hominid-specific | GO: | neurotransmitter receptor activity | MF
Hominid-specific | GO: | Synapse | CC
Hominid-specific | GO: | proteinaceous extracellular matrix | CC
Hominid-specific | GO: | extracellular matrix structural constituent | MF
Hominid-specific | GO: | extracellular ligand-gated ion channel activity | MF
Hominid-specific | GO: | cartilage development | BP
Hominid-specific | GO: | receptor-mediated endocytosis | BP
Hominid-specific | GO: | specific RNA polymerase II transcription factor activity | MF
Hominid-specific | GO: | GABA-A receptor activity | MF
Hominid-specific | GO: | MAPKKK cascade | BP
Hominid-specific | GO: | extrinsic to membrane | CC
Hominid-specific | GO: | actin filament organization | BP
Hominid-specific | GO: | male gonad development | BP
Hominid-specific | GO: | glucose metabolic process | BP
Macaque-specific | GO: | single-stranded DNA binding | MF
Macaque-specific | GO: | Melanosome | CC
Macaque-specific | GO: | cytosolic large ribosomal subunit | CC
Macaque-specific | GO: | protein folding | BP
Macaque-specific | GO: | interspecies interaction between organisms | BP
Macaque-specific | GO: | isomerase activity | MF
Macaque-specific | GO: | structural constituent of ribosome | MF
Macaque-specific | GO: | response to organic nitrogen | BP
Macaque-specific | GO: | unfolded protein binding | MF
Macaque-specific | GO: | microtubule-based process | BP
Macaque-specific | GO: | mitochondrial respiratory chain complex I | CC
Macaque-specific | GO: | response to wounding | BP
Macaque-specific | GO: | mitochondrial electron transport, NADH to ubiquinone | BP
Macaque-specific | GO: | intracellular membrane-bounded organelle | CC
Macaque-specific | GO: | insulin receptor signaling pathway | BP

6.5 Positive gene selection

To identify candidate positively selected genes in macaques, the branch-site model, which compares the modified model A with the corresponding null model with ω fixed at 1 (fix_omega = 1 and omega = 1), was applied to the FIVE ortholog gene set. The p-values were calculated using the χ² statistic and adjusted by the Bonferroni method to allow for multiple testing. To reduce the false positive rate due to the high sequence similarity among the three macaques, we applied strict criteria to the final result, requiring FDR < 0.05, with all the detected genes having an adjusted probability <

Supplementary Table 20. Specific genes displaying evidence of positive selection in each macaque species/sub-species.
Gene name | Chr | Description | MIM Morbid Description | P value | Test
GABRA5 | 15 | gamma-aminobutyric acid (GABA) A receptor, alpha 5 | | E-03 | IR
MTMR3 | 22 | myotubularin related protein | | E-03 | IR
TGFBRAP1 | 2 | transforming growth factor, beta receptor associated protein | - | E-07 | IR
DCTN1 | 2 | dynactin 1 | Neuropathy, distal hereditary motor, type VIIB; Amyotrophic lateral sclerosis, susceptibility to; Perry syndrome, | 4.85E-03 | IR
MGAT3 | 22 | mannosyl (beta-1,4-)-glycoprotein beta-1,4-N-acetylglucosaminyltransferase | - | 1.40E-02 | IR
ZFP | | zinc finger protein 112 homolog (mouse) | | E-03 | IR
TBC1D2B | 15 | TBC1 domain family, member 2B | | E-03 | IR
CYTH3 | 7 | cytohesin | | E-03 | IR
ATBF1 | 16 | zinc finger homeobox 3 | Prostate cancer, susceptibility to, | 1.20E-08 | IR
FYN | 6 | FYN oncogene related to SRC, FGR, YES | - | 3.39E-03 | IR
SLITRK6 | 13 | SLIT and NTRK-like family, member | | E-02 | IR
FAR1 | 11 | fatty acyl CoA reductase 1 | Cerebral cavernous malformations 3; | 8.48E-06 | IR
CLCN2 | 3 | chloride channel 2 | Epilepsy, juvenile myoclonic, susceptibility to, 8; Epilepsy, juvenile absence, susceptibility to, 2; Epilepsy, idiopathic generalized, susceptibility to, 11, | 2.18E-03 | IR
SDC1 | 2 | syndecan | | E-03 | IR
SR140 | 3 | U2-associated protein SR140 (140 kDa Ser/Arg-rich domain protein) | - | 1.10E-07 | IR
ZNRF4 | 19 | zinc and ring finger | | E-08 | IR
TCTE1 | 6 | t-complex-associated-testis-expressed | | E-02 | CR
DUOX1 | 15 | dual oxidase | | E-02 | CR
SPRY4 | 5 | sprouty homolog 4 (Drosophila) | | E-02 | CR
C18orf25 | 18 | Uncharacterized protein C18orf | | E-03 | CR
CTDSPL2 | 15 | CTD (carboxy-terminal domain, RNA polymerase II, polypeptide A) small phosphatase like 2 | | E-02 | CR
ODC1 | 2 | ornithine decarboxylase 1 | Colonic adenoma recurrence, reduced risk of, | 4.38E-02 | CR
ZNF | | zinc finger protein | | E+00 | CR
ALOX15 | 17 | arachidonate 15-lipoxygenase | | E-02 | CE
PYGL | 14 | phosphorylase, glycogen, liver | Glycogen storage disease VI | 8.51E-09 | CE
CD2BP2 | 16 | CD2 (cytoplasmic tail) binding protein | | E-02 | CE
ARC | 8 | activity-regulated cytoskeleton-associated protein | 1, Cataract, posterior polar, | 1.06E-02 | CE
GOSR1 | 17 | golgi SNAP receptor complex member | | E-02 | CE
OR5T3 | 11 | olfactory receptor, family 5, subfamily T, member | - | E-04 | CE
SNX5 | 20 | sorting nexin | | E-07 | CE
RTP2 | 3 | receptor (chemosensory) transporter protein | - | E-02 | CE
FEM1B | 15 | fem-1 homolog b (C. elegans) | | E-02 | CE
RAB1B | 11 | RAB1B, member RAS oncogene family | | E-03 | CE
UQCRC2 | 16 | ubiquinol-cytochrome c reductase core protein II | - | 8.35E-03 | CE
ZNF | | zinc finger protein | | E-08 | CE
ZNF | | zinc finger protein | | E-08 | CE
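The branch-site test described above is a likelihood ratio test between the alternative model and the null model with ω fixed at 1. A minimal sketch of the p-value calculation and Bonferroni adjustment follows. It assumes one degree of freedom for the χ² distribution (a common convention for this test, not stated in the text), and the gene names, log-likelihoods and the 0.05 cut-off applied to the adjusted p-values are hypothetical placeholders.

```python
from math import erfc, sqrt


def lrt_pvalue(lnl_null, lnl_alt):
    """Likelihood ratio test p-value, assuming a chi-square with 1 d.f.

    For 1 d.f. the chi-square survival function is erfc(sqrt(x / 2)).
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return erfc(sqrt(statistic / 2.0))


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Hypothetical (null, alternative) log-likelihoods as reported by codeml.
fits = {"geneA": (-2310.4, -2301.2), "geneB": (-1544.0, -1543.9)}

genes = list(fits)
raw = [lrt_pvalue(*fits[g]) for g in genes]
adjusted = dict(zip(genes, bonferroni(raw)))

# Assumed cut-off of 0.05 on the adjusted p-value, for illustration.
candidates = [g for g in genes if adjusted[g] < 0.05]
```

In this toy input only `geneA` passes the cut-off, because its likelihood gain under the alternative model is large while that of `geneB` is negligible.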

Immunity pathway dn/ds calculation

Information on 15 human immunity pathways was downloaded from the KEGG database. The dn/ds values are listed in Supplementary Table 21.

Supplementary Table 21. Immunity pathway dn/ds calculation
Pathway | #human gene | #IR gene | Average identity | dn/ds | #gene loss/pseudo | #CR gene | Average identity | dn/ds | #gene loss/pseudo | #CE gene | Average identity | dn/ds | #gene loss/pseudo
Chemokine signaling pathway
Complement and coagulation cascades
Antigen processing and presentation
Toll-like receptor signaling pathway
NOD-like receptor signaling pathway
RIG-I-like receptor signaling pathway
Cytosolic DNA-sensing pathway
Hematopoietic cell lineage
Natural killer cell mediated cytotoxicity
T cell receptor signaling pathway
B cell receptor signaling pathway
Fc epsilon RI signaling pathway
Fc gamma R-mediated phagocytosis
Leukocyte transendothelial migration
Intestinal immune network for IgA production

Identification of lost genes as compared to human

To identify genes potentially lost in macaques, we aligned the human cDNA sequences to each of the three macaque genomes using BLAT with the following parameters: -extendThroughN -minIdentity=80. If more than 50% of a human cDNA failed to align to the macaque genomes, the gene was retained for further analysis. We then searched for the potential orthologous regions of these genes using the synteny information between human and macaques, and re-predicted the genes within these regions. If no similar fragment could be found in a region, we manually checked the assembly quality using paired-end raw reads to exclude artefacts arising from poor assembly quality. To further confirm the losses, we aligned the genomic and transcriptome reads from macaques onto the human reference using relaxed criteria. Human genes to which these reads could be mapped were excluded from the candidate list. Finally, we kept only genes present as a single copy in human, so as to make sure that the loss is functionally important. A total of 25 genes have been lost in macaques (Supplementary Table 22). We used PCR to validate a subset of these genes. Primers were designed in regions conserved between human and other mammals, and human DNA was used as a control to confirm that the primers worked. We validated all 11 cases for which working primers were available (Supplementary Fig. 17).
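The first filtering step (more than 50% of a human cDNA failing to align) can be illustrated as follows. This is a sketch under stated assumptions rather than the pipeline actually used: alignment blocks are taken to be half-open (start, end) intervals in cDNA coordinates, as could be derived from BLAT output, and the function names and example coordinates are hypothetical.

```python
def covered_bases(blocks):
    """Count cDNA bases covered by at least one aligned block.

    `blocks` is a list of half-open (start, end) intervals in cDNA
    coordinates; overlapping or adjacent blocks are merged first.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(blocks):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def unaligned_fraction(cdna_length, blocks):
    """Fraction of the cDNA not covered by any aligned block."""
    return 1.0 - covered_bases(blocks) / cdna_length


def is_loss_candidate(cdna_length, blocks, threshold=0.5):
    """True if more than `threshold` of the cDNA failed to align."""
    return unaligned_fraction(cdna_length, blocks) > threshold


# Hypothetical 1,000 bp cDNA with three aligned blocks (250 bp covered).
example_blocks = [(0, 100), (50, 150), (300, 400)]
flagged = is_loss_candidate(1000, example_blocks)
```

Genes flagged in this way are only candidates; the later steps in the text (synteny-based re-prediction, read mapping, single-copy filtering and PCR) decide whether a loss is retained.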

Supplementary Fig. 17 PCR validation for lost genes. Primers were designed in regions conserved between human and other mammals, and human DNA was used as a control to confirm that the primers worked. Most of the gels validate the loss of the genes in macaques compared to human. The PCR experiment also confirmed that OR52A4 was specifically lost in CR. For ZNF519, PCR bands were observed in the three macaques with sizes different from that in human; sequencing of the PCR products showed that different regions were amplified in macaques and human, suggesting its loss in macaques. Abbreviations: M, marker; HM, human; CR, Chinese rhesus macaque; CE, crab-eating macaque; IR, Indian rhesus macaque.