The variation and evolution of complete human centromeres

7 April 2024, trendhype.net

Cell lines

CHM1hTERT (CHM1) cells were originally isolated from a hydatidiform mole at Magee-Womens Hospital. Cryogenically frozen cells from this culture were grown and transformed using human telomerase reverse transcriptase (hTERT) to immortalize the cell line. This cell line has been authenticated by short-tandem-repeat analysis by Cell Line Genetics and has tested negative for mycoplasma contamination. Human HG00733 lymphoblastoid cells were originally obtained from a female Puerto Rican child, immortalized with the Epstein–Barr Virus (EBV) and stored at the Coriell Institute for Medical Research. This cell line has been authenticated using a multiplex PCR assay with six autosomal microsatellite markers and has tested negative for mycoplasma contamination. Chimpanzee (Pan troglodytes, Clint, S006007) fibroblast cells were originally obtained from a male western chimpanzee named Clint (now deceased) at the Yerkes National Primate Research Center and immortalized with EBV. Orangutan (Pongo abelii, Susie, PR01109) fibroblast cells were originally obtained from a female Sumatran orangutan named Susie (now deceased) at the Gladys Porter Zoo, immortalized with EBV and stored at the Coriell Institute for Medical Research. Macaque (Macaca mulatta; AG07107) fibroblast cells were originally obtained from a female rhesus macaque of Indian origin and stored at the Coriell Institute for Medical Research. The chimpanzee, orangutan and macaque cell lines have not yet been authenticated or assessed for mycoplasma contamination to our knowledge.

Cell culture

CHM1 cells were cultured in complete AmnioMax C-100 Basal Medium (Thermo Fisher Scientific, 17001082) supplemented with 15% AmnioMax C-100 Supplement (Thermo Fisher Scientific, 12556015) and 1% penicillin–streptomycin (Thermo Fisher Scientific, 15140122). HG00733 (Homo sapiens) cells were cultured in RPMI-1640 medium (Sigma-Aldrich, R8758) supplemented with 15% fetal bovine serum (FBS; Thermo Fisher Scientific, 16000-044) and 1% penicillin–streptomycin (Thermo Fisher Scientific, 15140122). Chimpanzee (P. troglodytes; Clint; S006007) and macaque (Macaca mulatta; AG07107) cells were cultured in MEM α containing ribonucleosides, deoxyribonucleosides and l-glutamine (Thermo Fisher Scientific, 12571063) supplemented with 12% FBS (Thermo Fisher Scientific, 16000-044) and 1% penicillin–streptomycin (Thermo Fisher Scientific, 15140122). Orangutan (P. abelii; Susie; PR01109) cells were cultured in MEM α containing ribonucleosides, deoxyribonucleosides and l-glutamine (Thermo Fisher Scientific, 12571063) supplemented with 15% FBS (Thermo Fisher Scientific, 16000-044) and 1% penicillin–streptomycin (Thermo Fisher Scientific, 15140122). All cells were cultured in a humidity-controlled environment at 37 °C under 95% O₂.

DNA extraction, library preparation and sequencing

PacBio HiFi data were generated from the CHM1 and HG00733 genomes as previously described²¹ with some modifications. In brief, high-molecular-weight DNA was extracted from cells using a modified Qiagen Gentra Puregene Cell Kit protocol⁴⁷. High-molecular-weight DNA was used to generate PacBio HiFi libraries using the Template Prep Kit v1 (PacBio, 100-259-100) or SMRTbell Express Template Prep Kit v2 (PacBio, 100-938-900) and SMRTbell Enzyme Clean Up kits (PacBio, 101-746-400 and 101-932-600). Size selection was performed with SageELF (Sage Science, ELF001), and fractions sized 11 kb, 14 kb, 15 kb or 16 kb (as determined by FEMTO Pulse (Agilent, M5330AA)) were chosen for sequencing. Libraries were sequenced on the Sequel II platform with seven or eight SMRT Cells 8M (PacBio, 101-389-001) per sample using either Sequel II Sequencing Chemistry 1.0 (PacBio, 101-717-200) or 2.0 (PacBio, 101-820-200), both with 2 h pre-extension and 30 h movies, aiming for a minimum estimated coverage of 30× in PacBio HiFi reads (assuming a genome size of 3.1 Gb). Raw CHM1 data were processed using DeepConsensus⁴⁸ (v.0.2.0) with the default parameters. Raw HG00733 data were processed using the CCS algorithm (v.3.4.1) with the following parameters: --minPasses 3 --minPredictedAccuracy 0.99 --maxLength 21000 or 50000.

Ultra-long ONT data were generated from the CHM1, HG00733, chimpanzee, orangutan and macaque genomes according to a previously published protocol⁴⁹. In brief, 3–5 × 10⁷ cells were lysed in a buffer containing 10 mM Tris-Cl (pH 8.0), 0.1 M EDTA (pH 8.0), 0.5% (w/v) SDS and 20 μg ml⁻¹ RNase A (Qiagen, 19101) for 1 h at 37 °C. Then, 200 μg ml⁻¹ proteinase K (Qiagen, 19131) was added, and the solution was incubated at 50 °C for 2 h. DNA was purified through two rounds of 25:24:1 (v/v) phenol–chloroform–isoamyl alcohol extraction followed by ethanol precipitation. Precipitated DNA was solubilized in 10 mM Tris (pH 8.0) containing 0.02% Triton X-100 at 4 °C for 2 days. Libraries were constructed using the Ultra-Long DNA Sequencing Kit (ONT, SQK-ULK001) with modifications to the manufacturer’s protocol. Specifically, around 40 μg of DNA was mixed with FRA enzyme and FDB buffer as described in the protocol and incubated for 5 min at room temperature, followed by a 5 min heat-inactivation at 75 °C. RAP enzyme was mixed with the DNA solution and incubated at room temperature for 1 h before the clean-up step. Clean-up was performed using the Nanobind UL Library Prep Kit (Circulomics, NB-900-601-01) and eluted in 225 μl EB. Then, 75 μl of library was loaded onto a primed FLO-PRO002 R9.4.1 flow cell for sequencing on the PromethION, with two nuclease washes and reloads after 24 and 48 h of sequencing.

Additional ONT data were generated from the CHM1, HG00733, chimpanzee, orangutan and macaque genomes according to a previously published protocol²¹. In brief, high-molecular-weight DNA was extracted from cells using a modified Qiagen Gentra Puregene protocol⁴⁷. High-molecular-weight DNA was prepared into libraries with the Ligation Sequencing Kit (SQK-LSK110) from ONT and loaded onto primed FLO-PRO002 R9.4.1 flow cells for sequencing on the PromethION system, with two nuclease washes and reloads after 24 and 48 h of sequencing. All ONT data were base-called using Guppy (v.5.0.11) with the SUP model.

Targeted sequence assembly and validation of centromeric regions

To generate complete assemblies of centromeric regions from the CHM1, HG00733, chimpanzee, orangutan and macaque genomes, we first assembled each genome from PacBio HiFi data (Supplementary Table 1) using hifiasm²⁴ (v.0.16.1). The resulting PacBio HiFi contigs were aligned to the T2T-CHM13 reference genome⁴ (v.2.0) using minimap2⁵⁰ (v.2.24) with the following parameters: -I 15G -a --eqx -x asm20 -s 5000. Fragmented centromeric contigs were subsequently scaffolded with ultra-long (>100 kb) ONT data generated from the same source genome using a method that takes advantage of singly unique nucleotide k-mers (SUNKs; Supplementary Fig. 1; https://github.com/arozanski97/SUNK-based-contig-scaffolding). In brief, SUNKs (k = 20 bp) were identified from the CHM1 PacBio HiFi whole-genome assembly using Jellyfish (v.2.2.4) and barcoded on the CHM1 PacBio HiFi centromeric contigs as well as all ultra-long ONT reads. PacBio HiFi centromeric contigs sharing a SUNK barcode with ultra-long ONT reads were subsequently joined together to generate contiguous assemblies that traverse each centromeric region. The base accuracy of the assemblies was improved by replacing the ONT sequences with locally assembled PacBio HiFi contigs generated using HiCanu⁷ (v.2.1.1).
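The idea behind SUNK barcoding can be illustrated with a minimal Python sketch. This is not the Jellyfish-based pipeline referenced above: the function names `find_sunks` and `shared_sunks` are hypothetical, the toy sequences are invented, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
from collections import Counter


def find_sunks(sequences, k=20):
    """Return {kmer: (sequence_name, position)} for k-mers seen exactly once.

    `sequences` maps a contig name to its sequence. A k-mer that occurs once
    across the whole assembly is treated as a SUNK (simplification: the
    reverse strand is not considered).
    """
    counts = Counter()
    first_seen = {}
    for name, seq in sequences.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] += 1
            first_seen.setdefault(kmer, (name, i))
    return {kmer: first_seen[kmer] for kmer, n in counts.items() if n == 1}


def shared_sunks(sunks, read, k=20):
    """List (read_position, (contig, contig_position)) for SUNKs found in a read."""
    read = read.upper()
    hits = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in sunks:
            hits.append((i, sunks[kmer]))
    return hits


if __name__ == "__main__":
    # Toy example with k = 4 so that the sequences stay readable.
    toy = {"ctg1": "AAAACCCC", "ctg2": "AAAAGGGG"}
    sunks = find_sunks(toy, k=4)
    print(shared_sunks(sunks, "TTCCCCTT", k=4))
```

A read that carries SUNKs from two different contigs, in a consistent order and spacing, is evidence that those contigs are adjacent, which is the property the scaffolding step relies on.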

We validated the construction of each centromere assembly using four different methods. First, we aligned native PacBio HiFi and ONT data from the same source genome to each whole-genome assembly using pbmm2 (v.1.1.0) (for PacBio HiFi data; https://github.com/PacificBiosciences/pbmm2) or Winnowmap⁵¹ (v.1.0) (for ONT data) and assessed the assemblies for uniform read depth across the centromeric regions using IGV⁵² and NucFreq²². We next assessed the concordance between the assemblies and raw PacBio HiFi data using VerityMap²⁷, which identifies discordant k-mers between the two and flags them for correction. We then assessed the concordance between the assemblies and ONT data using GAVISUNK²⁸, which identifies concordant SUNKs between the two. Finally, we estimated the accuracy of the centromere assemblies from mapped k-mers (k = 21) using Merqury (v.1.1)⁵³ and publicly available Illumina data from each genome (Extended Data Table 1). We estimated the QV of the centromeric regions with the following formula:

$$\mathrm{QV} = -10 \times \log_{10}\left(1 - \left(1 - \frac{\text{number of erroneous } k\text{-mers}}{\text{total number of } k\text{-mers}}\right)^{1/k}\right)$$
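As a worked illustration of the QV formula, the short Python sketch below evaluates it for given k-mer counts. The function name `kmer_qv` and the example counts are hypothetical; the logarithm is taken to base 10, as is conventional for Phred-scaled quality values.

```python
import math


def kmer_qv(erroneous_kmers, total_kmers, k=21):
    """Phred-scaled consensus quality estimated from k-mer counts.

    (1 - erroneous/total) is the fraction of error-free k-mers; its k-th root
    approximates the per-base probability of being correct.
    """
    if total_kmers <= 0:
        raise ValueError("total_kmers must be positive")
    if erroneous_kmers == 0:
        return math.inf  # no erroneous k-mers observed
    p_base_correct = (1 - erroneous_kmers / total_kmers) ** (1 / k)
    return -10 * math.log10(1 - p_base_correct)


if __name__ == "__main__":
    # Invented counts: 1,000 erroneous k-mers out of 1e9 assembly k-mers.
    print(round(kmer_qv(1_000, 1_000_000_000, k=21), 2))
```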

FISH and spectral karyotyping

To determine the karyotype of the CHM1 genome, we first prepared metaphase chromosome spreads by arresting CHM1 cells in mitosis via the addition of KaryoMAX Colcemid Solution (0.1 µg ml⁻¹, Thermo Fisher Scientific, 15212012) to the growth medium for 6 h. Cells were collected by centrifugation at 200g for 5 min and incubated in 0.4% KCl swelling solution for 10 min. Swollen cells were pre-fixed by the addition of freshly prepared methanol:acetic acid (3:1) fixative solution (~100 μl per 10 ml total volume). Pre-fixed cells were collected by centrifugation at 200g for 5 min and fixed in methanol:acetic acid (3:1) fixative solution. Spreads were dropped on a glass slide and incubated on a heating block at 65 °C overnight. Before hybridization, slides were treated with 1 mg ml⁻¹ RNase A (Qiagen, 19101) in 2× SSC for at least 45 min at 37 °C and then dehydrated in a 70%, 80% and 100% ethanol series for 2 min. Denaturation of spreads was performed in 70% formamide/2× SSC solution at 72 °C for 1.5 min and was immediately stopped by immersing the slides into an ethanol series pre-chilled to −20 °C.

Fluorescent probes for spectral karyotyping were generated in-house. Individual fluorescently labelled whole-chromosome paints were obtained from Applied Spectral Imaging. Paints were provided in a hybridization buffer and mixed 1:1 for indicated combinations. Labelled chromosome probes and paints were denatured by heating to 80 °C for 10 min before applying them to denatured slides. Spreads were hybridized to probes under a HybriSlip hybridization cover (Grace Bio-Labs, 716024) sealed with Cytobond (SciGene, 2020-00-1) in a humidified chamber at 37 °C for 48 h. After hybridization, the slides were washed three times in 50% formamide/2× SSC for 5 min at 45 °C, 1× SSC solution at 45 °C for 5 min twice, and at room temperature once. The slides were then rinsed with double-deionized H₂O, air-dried and mounted in Vectashield-containing DAPI (Vector Laboratories, H-1200-10).

For spectral karyotyping, images were acquired using an LSM710 confocal microscope (Zeiss) with the 63×/1.40 NA oil-immersion objective and ZEN (v.3.7) software. Segmentation, spectral unmixing and identification of chromosomes were performed using an open-source karyotype identification via spectral separation (KISS) analysis package for Fiji⁵⁴ (v.2.13.1), freely available online (http://research.stowers.org/imagejplugins/KISS_analysis.html). A detailed description of chromosome paints, hybridization and analysis procedures was reported previously⁵⁵.

For individually painted chromosomes, z stack images were acquired on the Nikon Ti-E microscope equipped with a 100× objective NA 1.45, Yokogawa CSU-W1 spinning disk and Flash 4.0 sCMOS camera with NIS-Elements AR (v.3.2) software. Image processing was performed in Fiji⁵⁴ (v.2.13.1).

Strand-seq analysis

To assess the karyotype of the CHM1 genome, we prepared strand-seq libraries from CHM1 cells using a previously published protocol⁵⁶,⁵⁷. We sequenced the mono- and dinucleosome fractions separately, with the mononucleosomes sequenced with 75 bp, paired-end Illumina sequencing, and the dinucleosomes sequenced with 150 bp, paired-end Illumina sequencing. We demultiplexed the raw sequencing data based on library-specific barcodes and converted them to FASTQ files using Illumina standard software. We aligned the reads in the FASTQ files to the T2T-CHM13 reference genome⁴ (v.2.0) using BWA⁵⁸ (v.0.7.17-r1188), sorted the alignments using SAMtools⁵⁹ (v.1.9) and marked duplicate reads using sambamba⁶⁰ (v.1.0). We merged the BAM files for the mono- and dinucleosome fractions of each cell using SAMtools⁵⁹ (v.1.9). We used breakpointR (v.1.18)⁶¹ to assess the quality of generated strand-seq libraries with the following parameters: windowsize = 2000000, binMethod = ‘size’, pairedEndReads = TRUE, min.mapq = 10, background = 0.1, minReads = 50. We filtered the libraries based on the read density, level of background reads and level of genome coverage variability⁶². In total, 48 BAM files were selected for all subsequent analysis and are publicly available. We detected changes in strand-state inheritance across all strand-seq libraries using the R package AneuFinder⁶³ with the following parameters: variable.width.reference = <merged BAM of all 48 strand-seq libraries>, binsizes = windowsize, use.bamsignals = FALSE, pairedEndReads = TRUE, remove.duplicate.reads = TRUE, min.mapq = 10, method = ‘edivisive’, strandseq = TRUE, cluster.plots = TRUE, refine.breakpoints = TRUE. We extracted a list of recurrent strand-state changes reported as sister chromatid exchange hotspots by AneuFinder. With this analysis, we identified reciprocal translocations between chromosomes 4q35.1/11q24.3 and 16q23.3/17q25.3 (see below) and established the overall copy number for each chromosome and strand-seq library.

To identify the reciprocal translocation breakpoints between chromosomes 4q35.1/11q24.3 and 16q23.3/17q25.3 in the CHM1 genome, we first aligned CHM1 PacBio HiFi reads to the T2T-CHM13 reference genome⁴ (v.2.0) using pbmm2 (v.1.1.0) and used BEDtools⁶⁴ intersect (v.2.29.0) to define putative translocation regions based on AneuFinder analysis (described above). We extracted PacBio HiFi reads with supplementary alignments using SAMtools⁵⁹ (v.1.9) flag 2048. Using this method, we were able to identify the precise breakpoint of each translocation. Note that, for the reciprocal translocation between chromosomes 4q35.1/11q24.3, we report two breakpoints in each chromosome due to the presence of a ~97–98 kb deletion in the translocated homologues (Supplementary Fig. 3). The breakpoints are located at chromosome 4: 187112496/chromosome 11: 130542388, chromosome 4: 187209555/chromosome 11: 130444240, and chromosome 16: 88757545/chromosome 17: 81572367 (in T2T-CHM13 v.2.0).

Sequence identity across centromeric regions

To calculate the sequence identity across the centromeric regions from CHM1, CHM13 and 56 other diverse human genomes (generated by the HPRC¹⁰ and HGSVC²³), we performed three analyses that take advantage of different alignment methods. In the first analysis, we performed a pairwise sequence alignment between contigs from the CHM1, CHM13 and diverse genomes using minimap2⁵⁰ (v.2.24) and the following command: minimap2 -I 15G -K 8G -t {threads} -ax asm20 --secondary=no --eqx -s 2500 {ref.fasta} {query.fasta}. We chose these minimap2 parameters after testing several options and identifying optimal ones for alignment between repetitive and/or structurally divergent regions in diploid human genomes. Specifically, we chose -I 15G to provide additional memory for aligning between centromeric regions (the default is 4G and sometimes throws an error because of the large number of potential alignments). We also chose -K 8G because it allows for 8 Gb of sequence to be loaded into memory at a time. This is enough for a typical human diploid genome (~6 Gb) to be loaded. If we had left it at the default (500M), only a subset of contigs would be loaded at a time, and once the shortest contigs align, we would be left with only one thread aligning the longest contig. We therefore chose to increase this parameter so that the whole assembly is aligned at one time. We also chose to use -ax asm20 as it allows for sequences that are up to 20% divergent to be aligned. This is more permissive to alternative α-satellite HOR structures and sequence compositions than the other alignment options (for example, asm5 and asm10). We also opted to use --secondary=no to prevent secondary alignments from the same contig, thereby preventing multi-mapping and ensuring that the query would only align once to the reference. We added --eqx to allow us to parse the CIGAR string and calculate the mean sequence identity of the alignments.
Finally, we selected -s 2500 as the minimal peak dynamic programming alignment score. The default setting for this parameter is 40, and we tested that one as well as 1000, 2500 and 5000. We found that with -s 40 and -s 1000, spurious alignments occurred from other centromeres, and with -s 5000, accurate alignments from centromeres were filtered out. We therefore chose -s 2500 to allow for diverse α-satellite HOR structures to align without some alignments being filtered out. After generating the alignments, we filtered them using SAMtools⁵⁹ (v.1.9) flag 4, which keeps primary and partial alignments. We subsequently partitioned the alignments into 10 kb non-overlapping windows in the reference genome (either CHM1 or CHM13) and calculated the mean sequence identity between the pairwise alignments in each window with the following formula: (number of matches)/(number of matches + number of mismatches + number of insertion events + number of deletion events). We then averaged the sequence identity across the 10 kb windows within the α-satellite HOR array(s), monomeric/diverged α-satellites, other satellites and non-satellites for each chromosome to determine the mean sequence identity in each region.
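The identity calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, the CIGAR string is assumed to come from an alignment made with --eqx (so that matches are `=` and mismatches are `X`), and each insertion or deletion is counted as a single event regardless of its length, as the formula specifies.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def identity_by_event(cigar):
    """matches / (matches + mismatches + insertion events + deletion events)."""
    matches = mismatches = ins_events = del_events = 0
    for length, op in CIGAR_RE.findall(cigar):
        if op == "=":
            matches += int(length)
        elif op == "X":
            mismatches += int(length)
        elif op == "I":
            ins_events += 1
        elif op == "D":
            del_events += 1
    denom = matches + mismatches + ins_events + del_events
    return matches / denom if denom else 0.0


def windowed_identity(ref_start, cigar, window=10_000):
    """Identity per non-overlapping reference window, keyed by window index."""
    counts = {}  # window index -> [matches, mismatches, ins events, del events]
    pos = ref_start
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in ("=", "X"):
            col = 0 if op == "=" else 1
            while n > 0:  # split the run at window boundaries
                w = pos // window
                step = min(n, (w + 1) * window - pos)
                counts.setdefault(w, [0, 0, 0, 0])[col] += step
                pos += step
                n -= step
        elif op == "I":
            counts.setdefault(pos // window, [0, 0, 0, 0])[2] += 1
        elif op == "D":
            counts.setdefault(pos // window, [0, 0, 0, 0])[3] += 1
            pos += n  # deletions consume reference bases
        elif op == "N":
            pos += n
        # S, H and P do not consume the reference and are ignored here.
    return {w: c[0] / sum(c) for w, c in counts.items() if sum(c)}


if __name__ == "__main__":
    print(identity_by_event("90=5X3I5=2D"))
    print(windowed_identity(0, "10=1X9=", window=10))
```

In this sketch an indel event is assigned to the window in which it starts; other conventions are possible and would change per-window values slightly near window boundaries.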

In the second analysis, we first fragmented the centromeric contigs from each genome assembly into 10 kb fragments with seqtk (v.1.3; https://github.com/lh3/seqtk) and subsequently aligned them to the reference genome (either CHM1 or CHM13) using minimap2⁵⁰ (v.2.24) and the following command: minimap2 -I 15G -K 8G -t {threads} -ax asm20 --secondary=no --eqx -s 40 {ref.fasta} {query.fasta}. We filtered the alignments using SAMtools⁵⁹ (v.1.9) flag 4, which keeps primary and partial alignments. In this method, multiple 10 kb fragments are allowed to align to the same region in the reference genome, but each 10 kb fragment is only allowed to align once. We then partitioned the alignments into 10 kb non-overlapping windows in the reference genome and calculated the mean sequence identity between all alignments in each window as described above. We averaged the sequence identity across the 10 kb windows within the α-satellite HOR array(s), monomeric/diverged α-satellites, other satellites and non-satellites for each chromosome to determine the mean sequence identity in each region.

In the third analysis, we first identified the location of the α-satellite HOR array(s) in each genome assembly using RepeatMasker⁶⁵ (v.4.1.0) followed by HumAS-HMMER (https://github.com/fedorrik/HumAS-HMMER_for_AnVIL) and subsequently extracted regions enriched with ‘live’ α-satellite HORs (denoted with an ‘L’ in the HumAS-HMMER BED file). We then ran TandemAligner⁶⁶ (v.0.1) on pairs of complete centromeric HOR arrays using the following command: tandem_aligner --first {ref.fasta} --second {query.fasta} -o {output_directory}. We parsed the CIGAR string generated by TandemAligner by first binning the alignments into 10 kb non-overlapping windows and calculating the mean sequence identity in each window as described above. As TandemAligner is only optimized for tandem repeat arrays, we assessed the sequence identity only in the α-satellite HOR array(s) of each centromeric region and did not use it to assess the sequence identity in any other region.

Better-match analysis

To determine whether the CHM1 or CHM13 centromeres are a better match to those from the 56 diverse human genomes assembled by the HPRC¹⁰ and HGSVC²³, we performed a pairwise sequence alignment between contigs from the HPRC and HGSVC assemblies to either the CHM1 or CHM13 assembly using minimap2⁵⁰ (v.2.24) and the following command: minimap2 -I 15G -K 8G -t {threads} -ax asm20 --secondary=no --eqx -s 2500 {ref.fasta} {query.fasta}. We filtered the alignments using SAMtools⁵⁹ (v.1.9) flag 4, which keeps primary and partial alignments, and then calculated an alignment score between each pair of haplotypes, limiting our analysis to only the centromeric α-satellite HOR arrays as follows: (total number of aligned bases in the query)/(total number of bases in the reference) × (mean sequence identity by event). The mean sequence identity by event is calculated as follows: (number of matches)/(number of matches + number of mismatches + number of insertion events + number of deletion events). The set of centromeres with a higher alignment score was determined to be a better match to that haplotype than the other set of centromeres.
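The alignment score and the better-match decision can be written out as a small Python sketch. The function names and all numbers below are invented for illustration; they are not values from the study.

```python
def alignment_score(aligned_query_bases, reference_bases, identity_by_event):
    """(aligned query bases / reference bases) x mean sequence identity by event."""
    if reference_bases <= 0:
        raise ValueError("reference_bases must be positive")
    return (aligned_query_bases / reference_bases) * identity_by_event


def better_match(scores):
    """Return the reference name with the highest score (first one on ties)."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical HOR array from one haplotype aligned to each reference.
    scores = {
        "CHM1": alignment_score(2_400_000, 3_000_000, 0.98),
        "CHM13": alignment_score(2_900_000, 3_100_000, 0.97),
    }
    print(scores, better_match(scores))
```

Because the score multiplies coverage by identity, a reference array that aligns over more of its length can win even when its per-event identity is slightly lower, as in the invented example.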

Pairwise sequence identity heat maps

To generate pairwise sequence identity heat maps of each centromeric region, we ran StainedGlass⁴⁴ (v.6.7.0) with the following parameters: window=5000 mm_f=30000 mm_s=1000. We normalized the colour scale across the StainedGlass plots by binning the percentage of sequence identities equally and recolouring the data points according to the binning. To generate heat maps that show only the variation between centromeric regions, we ran StainedGlass⁴⁴ (v.6.7.0) with the following parameters: window=5000 mm_f=60000 mm_s=30000. As above, we normalized the colour scale across the StainedGlass plots by binning the percentage of sequence identities equally and recolouring the data points according to the binning.

Estimation of α-satellite HOR array length

To estimate the length of the α-satellite HOR arrays of each centromere in the CHM1, CHM13 and 56 diverse genome assemblies¹⁰,²³, we first ran RepeatMasker⁶⁵ (v.4.1.0) on the assemblies and identified contigs containing α-satellite repeats, marked by ‘ALR/Alpha’. We extracted these α-satellite-containing contigs and ran HumAS-HMMER (https://github.com/fedorrik/HumAS-HMMER_for_AnVIL) on each of them. HumAS-HMMER is a tool that identifies the location of α-satellite HORs in human centromeric sequences. It uses a hidden Markov model (HMM) profile for centromeric α-satellite HOR monomers and generates a BED file with the coordinates of the α-satellite HORs and their classification. Using this BED file, we extracted contigs containing α-satellite HORs that were designated as live or active (denoted with an ‘L’ in the HumAS-HMMER BED file), which are those that belong to an array that consistently associates with the kinetochore in several individuals⁵,⁶⁷. By contrast, dead or inactive α-satellite HORs (denoted with a ‘d’ in the HumAS-HMMER BED file) are those that have not been found to be associated with the kinetochore and are usually more divergent in sequence than the live or active arrays. We filtered out contigs that had incomplete α-satellite HOR arrays (such as those that did not traverse into unique sequence), thereby limiting our analysis to only complete α-satellite HOR arrays. Moreover, we assessed the integrity of each of the α-satellite HOR array-containing contigs using NucFreq²² to ensure that they were completely and accurately assembled, filtering out those with evidence of a deletion, duplication or misjoin. Finally, we calculated the length of the α-satellite HOR arrays in the remaining contigs by taking the minimum and maximum coordinate of the ‘live’ α-satellite HOR arrays and plotting their lengths with GraphPad Prism (v.9.5.1).
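The final length calculation (minimum to maximum coordinate of the ‘live’ HOR annotations on each contig) can be sketched in Python. This is an assumption-laden illustration: the function name is hypothetical, the BED name column is assumed to carry the ‘L’ marker somewhere in its text, and the example records are invented rather than real HumAS-HMMER output. The `is_live` test is a parameter so that it can be adapted to the actual naming scheme.

```python
def live_array_lengths(bed_lines, is_live=lambda name: "L" in name):
    """Return {contig: max_end - min_start} over annotations judged 'live'.

    Each BED line is expected to have at least four tab-separated columns:
    contig, start, end, name.
    """
    spans = {}
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track")):
            continue
        contig, start, end, name = line.split("\t")[:4]
        if not is_live(name):
            continue
        start, end = int(start), int(end)
        lo, hi = spans.get(contig, (start, end))
        spans[contig] = (min(lo, start), max(hi, end))
    return {contig: hi - lo for contig, (lo, hi) in spans.items()}


if __name__ == "__main__":
    # Invented records; names only mimic the live ('L') / dead ('d') markers.
    bed = [
        "ctg1\t1000\t1171\tS1C1H1L.1",
        "ctg1\t5000\t5171\tS1C1H1L.2",
        "ctg1\t9000\t9171\tS1C1H1d.1",
    ]
    print(live_array_lengths(bed))
```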

Sequence composition and organization of α-satellite HOR arrays

To determine the sequence composition and organization of each α-satellite HOR array in the CHM1, CHM13 and 56 diverse genome assemblies¹⁰,²³, we ran HumAS-HMMER (https://github.com/fedorrik/HumAS-HMMER_for_AnVIL) on centromeric contigs with the default parameters and parsed the resulting BED file with StV (https://github.com/fedorrik/stv). This generated a BED file with each α-satellite HOR sequence composition and its organization along the α-satellite HOR arrays. We used the stv_row.bed file to visualize the organization of the α-satellite HOR arrays with R⁶⁸ (v.1.1.383) and the ggplot2 package⁶⁶. The α-satellite monomer and HOR classification generated with HumAS-HMMER is described in detail in the supplementary information of a previous study⁵.

CpG methylation analysis

To determine the CpG methylation status of each CHM1 centromere, we aligned CHM1 ONT reads >30 kb in length to the CHM1 whole-genome assembly using Winnowmap⁵¹ (v.1.0) and then assessed the CpG methylation status of the centromeric regions with Nanopolish⁶⁹ (v.0.13.3). Nanopolish distinguishes 5-methylcytosines from unmethylated cytosines via a HMM on the raw nanopore current signal. The methylation caller generates a log-likelihood value for the ratio of probability of methylated to unmethylated CpGs at a specific k-mer. We filtered methylation calls using the nanopore_methylation_utilities tool⁷⁰ (https://github.com/isaclee/nanopore-methylation-utilities), which uses a log-likelihood ratio of 2.5 as a threshold for calling methylation. CpG sites with log-likelihood ratios greater than 2.5 (methylated) or less than −2.5 (unmethylated) are considered to be high quality and are included in the analysis. Reads that do not have any high-quality CpG sites are filtered from the BAM for subsequent methylation analysis. Nanopore_methylation_utilities integrates methylation information into the BAM file for viewing in IGV’s⁵² bisulfite mode, which was used to visualize CpG methylation. To determine the size of the hypomethylated region (termed the centromere dip region, CDR³¹) in each centromere, we developed a new tool, CDR-Finder (https://github.com/arozanski97/CDR-Finder). This tool first bins the assembly into 5 kb windows, computes the median CpG methylation frequency within windows containing α-satellite (as determined by RepeatMasker⁶⁵ (v.4.1.0)), selects bins that have a lower CpG methylation frequency than the median frequency in the region, merges consecutive bins into a larger bin, filters for merged bins that are >50 kb and reports the location of these bins.
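The binning, thresholding, merging and size-filtering steps listed for CDR-Finder can be approximated with the Python sketch below. It is an independent illustration, not the CDR-Finder source: the function name `find_cdrs` is hypothetical, the input is assumed to be one methylation frequency per 5 kb α-satellite bin (already computed upstream), and the region median is taken over those per-bin values.

```python
from statistics import median


def find_cdrs(bin_freqs, bin_size=5_000, min_length=50_000):
    """Report (start, end) of merged low-methylation runs longer than min_length.

    `bin_freqs` is a list of (bin_start, methylation_frequency) tuples for
    alpha-satellite bins, sorted by bin_start.
    """
    if not bin_freqs:
        return []
    region_median = median(freq for _, freq in bin_freqs)
    cdrs = []
    run_start = run_end = None
    for start, freq in bin_freqs:
        low = freq < region_median
        if low and run_start is not None and start == run_end:
            run_end = start + bin_size  # extend the current run
            continue
        # Close the current run (if any) before starting a new one.
        if run_start is not None and run_end - run_start > min_length:
            cdrs.append((run_start, run_end))
        run_start, run_end = (start, start + bin_size) if low else (None, None)
    if run_start is not None and run_end - run_start > min_length:
        cdrs.append((run_start, run_end))
    return cdrs


if __name__ == "__main__":
    # Invented profile: a 60 kb dip and a 15 kb dip in a methylated array.
    freqs = [0.8] * 40
    for i in list(range(10, 22)) + list(range(30, 33)):
        freqs[i] = 0.2
    bins = [(i * 5_000, f) for i, f in enumerate(freqs)]
    print(find_cdrs(bins))
```

Only the 60 kb dip survives the size filter in the invented profile; the 15 kb dip is below the 50 kb cut-off and is dropped.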

Native CENP-A ChIP–seq and analysis

To determine the location of centromeric chromatin within the CHM1 genome, we performed two independent replicates of native CENP-A chromatin immunoprecipitation–sequencing (ChIP–seq) analysis of CHM1 cells as described previously²¹, with some modifications. In brief, 3–4 × 10⁷ cells were collected and resuspended in 2 ml of ice-cold buffer I (0.32 M sucrose, 15 mM Tris, pH 7.5, 15 mM NaCl, 5 mM MgCl₂, 0.1 mM EGTA and 2× Halt Protease Inhibitor Cocktail (Thermo Fisher Scientific, 78429)). Then, 2 ml of ice-cold buffer II (0.32 M sucrose, 15 mM Tris, pH 7.5, 15 mM NaCl, 5 mM MgCl₂, 0.1 mM EGTA, 0.1% IGEPAL and 2× Halt Protease Inhibitor Cocktail) was added, and the samples were placed onto ice for 10 min. The resulting 4 ml of nuclei was gently layered on top of 8 ml of ice-cold buffer III (1.2 M sucrose, 60 mM KCl, 15 mM Tris, pH 7.5, 15 mM NaCl, 5 mM MgCl₂, 0.1 mM EGTA and 2× Halt Protease Inhibitor Cocktail (Thermo Fisher Scientific, 78429)) and centrifuged at 10,000g for 20 min at 4 °C. Pelleted nuclei were resuspended in buffer A (0.34 M sucrose, 15 mM HEPES, pH 7.4, 15 mM NaCl, 60 mM KCl, 4 mM MgCl₂ and 2× Halt Protease Inhibitor Cocktail) to 400 ng ml⁻¹. Nuclei were frozen on dry ice and stored at −80 °C. MNase digestion reactions were performed on 200–300 μg chromatin, using 0.2–0.3 U μg⁻¹ MNase (Thermo Fisher Scientific, 88216) in buffer A supplemented with 3 mM CaCl₂ for 10 min at 37 °C. The reaction was quenched with 10 mM EGTA on ice and centrifuged at 500g for 7 min at 4 °C. The chromatin was resuspended in 10 mM EDTA and rotated at 4 °C for 2 h. The mixture was adjusted to 500 mM NaCl, rotated for another 45 min at 4 °C and then centrifuged at maximum speed (21,100g) for 5 min at 4 °C, yielding digested chromatin in the supernatant.
Chromatin was diluted to 100 ng ml⁻¹ with buffer B (20 mM Tris, pH 8.0, 5 mM EDTA, 500 mM NaCl and 0.2% Tween-20) and precleared with 100 μl 50% protein G Sepharose bead (Abcam, ab193259) slurry for 20 min at 4 °C with rotation. Precleared supernatant (10–20 μg bulk nucleosomes) was saved for further processing. To the remaining supernatant, 20 μg mouse monoclonal anti-human CENP-A antibody (3-19; Enzo, ADI-KAM-CC006-E; approximately a 1:80 dilution) was added and rotated overnight at 4 °C. Immunocomplexes were recovered by the addition of 200 μl 50% protein G Sepharose bead slurry followed by rotation at 4 °C for 3 h. The beads were washed three times with buffer B and once with buffer B without Tween-20. For the input fraction, an equal volume of input recovery buffer (0.6 M NaCl, 20 mM EDTA, 20 mM Tris, pH 7.5 and 1% SDS) and 1 μl of RNase A (10 mg ml⁻¹) was added, followed by incubation for 1 h at 37 °C. Proteinase K (100 μg ml⁻¹, Roche) was then added, and the samples were incubated for another 3 h at 37 °C. For the ChIP fraction, 300 μl of ChIP recovery buffer (20 mM Tris, pH 7.5, 20 mM EDTA, 0.5% SDS and 500 μg ml⁻¹ proteinase K) was added directly to the beads and incubated for 3–4 h at 56 °C. The resulting proteinase-K-treated samples were subjected to a phenol–chloroform extraction followed by purification using the Qiagen MinElute PCR purification column. Unamplified bulk nucleosomal and ChIP DNA was analysed using an Agilent Bioanalyzer instrument and a 2100 High Sensitivity Kit.

Sequencing libraries were generated using the TruSeq ChIP Library Preparation Kit, Set A (Illumina, IP-202-1012) according to the manufacturer’s instructions, with some modifications. In brief, 5–10 ng bulk nucleosomal or ChIP DNA was end-repaired and A-tailed. Illumina TruSeq adaptors were ligated, libraries were size-selected to exclude polynucleosomes using an E-Gel SizeSelect II agarose gel and the libraries were PCR-amplified using the PCR polymerase and primer cocktail provided in the kit. The resulting libraries were submitted for 150 bp, paired-end Illumina sequencing using the NextSeq 500/550 High Output Kit v2.5 (300 cycles). The resulting reads were assessed for quality using FastQC (https://github.com/s-andrews/FastQC), trimmed with Sickle (v.1.33; https://github.com/najoshi/sickle) to remove low-quality 5′- and 3′-end bases, and trimmed using Cutadapt⁷¹ (v.1.18) to remove adapters.

Processed CENP-A ChIP and bulk nucleosomal reads were aligned to the CHM1 whole-genome assembly using BWA-MEM⁷² (v.0.7.17) with the following parameters: bwa mem -k 50 -c 1000000 {index} {read1.fastq.gz} {read2.fastq.gz}. The resulting SAM files were filtered using SAMtools⁵⁹ (v.1.9) with flag 2308 to prevent multi-mapping of reads. With this filter, reads mapping to more than one location are randomly assigned a single mapping location, thereby preventing mapping biases in highly identical regions. Alignments were normalized and filtered with deepTools⁷³ (v.3.4.3) bamCompare with the following parameters: bamCompare -b1 {ChIP.bam} -b2 {bulk_nucleosomal.bam} --operation ratio --binSize 1000 --minMappingQuality 1 -o {out.bw}. Alternatively, CENP-A ChIP–seq data alignments were filtered using a marker-assisted mapping strategy as described previously⁵. In brief, unique 51-mers in the CHM1 whole-genome assembly were counted and filtered with meryl⁵³ (v.1.3). The locations of the unique 51-mers were identified with meryl⁵³ (v.1.3) and then used to filter the CENP-A ChIP–seq and input alignments using BEDtools⁶⁴ intersect (v.2.29.0). Alignments were normalized and filtered with deepTools⁷³ (v.3.4.3) bamCompare with the following parameters: bamCompare -b1 {ChIP.bam} -b2 {bulk_nucleosomal.bam} --operation ratio --binSize 1000 -o {out.bw}.
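For readers unfamiliar with SAM bitwise flags, the Python sketch below decomposes the value 2308 into its component bits (4 + 256 + 2048) and shows how an exclusion filter behaves. It assumes that 2308 is applied as an exclusion mask (records with any of these bits set are dropped), which is the usual usage but is not spelled out in the text; the helper names are hypothetical.

```python
# Bit values and meanings as defined in the SAM format specification.
SAM_FLAGS = {
    1: "paired",
    2: "proper_pair",
    4: "unmapped",
    8: "mate_unmapped",
    16: "reverse",
    32: "mate_reverse",
    64: "read1",
    128: "read2",
    256: "secondary",
    512: "qc_fail",
    1024: "duplicate",
    2048: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def passes_exclusion_filter(read_flag, exclude=2308):
    """True if the record has none of the excluded bits set."""
    return (read_flag & exclude) == 0


if __name__ == "__main__":
    print(decode_flag(2308))
    print(passes_exclusion_filter(99))    # primary, mapped, properly paired
    print(passes_exclusion_filter(355))   # same read with the secondary bit set
```

Under this assumption, each read contributes at most one primary alignment to the downstream bamCompare step.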

Estimation of the length of the kinetochore sites

To estimate the length of the CHM1 and CHM13 kinetochore sites, we first determined the CpG methylation status of each CHM1 and CHM13 centromere using the approach described above (see the ‘CpG methylation analysis’ section). We then mapped the CENP-A ChIP–seq data from each genome to the same source genome using the mapping parameters described above (see the ‘Native CENP-A ChIP–seq and analysis’ section). We next used CDR-Finder (https://github.com/arozanski97/CDR-Finder) to identify the location of hypomethylated regions within the centromeres, and we filtered out the hypomethylated regions that had less than tenfold enrichment of CENP-A ChIP–seq reads relative to the bulk nucleosomal reads. We reported the lengths of the remaining hypomethylated regions enriched with CENP-A, as determined with CDR-Finder, and we tested for statistical significance using a two-sided Kolmogorov–Smirnov test with GraphPad Prism (v.9.5.1).
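
The enrichment filter described above can be sketched as follows. This is a minimal illustration, not the CDR-Finder implementation: the Region record, its field names and the example coverage values are hypothetical, and the sketch assumes that regions with at least tenfold enrichment are the ones retained.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A hypomethylated region with hypothetical mean coverage values."""
    chrom: str
    start: int
    end: int
    chip_signal: float  # mean CENP-A ChIP-seq coverage over the region
    bulk_signal: float  # mean bulk nucleosomal coverage over the region


def enrichment(region):
    """Ratio of ChIP to bulk nucleosomal signal (infinite if bulk is zero)."""
    if region.bulk_signal <= 0:
        return float("inf") if region.chip_signal > 0 else 0.0
    return region.chip_signal / region.bulk_signal


def kinetochore_lengths(regions, min_fold=10.0):
    """Lengths (bp) of regions with at least min_fold CENP-A enrichment."""
    return [r.end - r.start for r in regions if enrichment(r) >= min_fold]


example = [
    Region("chr1", 1_000_000, 1_150_000, chip_signal=240.0, bulk_signal=12.0),
    Region("chr1", 2_000_000, 2_030_000, chip_signal=30.0, bulk_signal=10.0),
]
print(kinetochore_lengths(example))
```

In this example only the first region (20-fold enrichment) passes the filter; the second (threefold) is dropped.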

Immuno-FISH on stretched metaphase chromosome spreads

Mechanically stretched metaphase spreads were obtained from the CHM1 cell line according to established procedures⁷⁴. In brief, colcemid-treated cells were washed in phosphate-buffered saline (1× PBS), counted, and resuspended for 15 min in a hypotonic buffer HCM (10 mM HEPES, pH 7.3, 1 mM glycerol, 1 mM CaCl₂ and 0.8 mM MgCl₂) to achieve a final concentration of 10,000 cells per ml. Then, 0.5 ml of the cell suspension was cytocentrifuged onto glass slides at 2,000 rpm for 8 min with a Shandon Cytospin 3 and fixed in methanol at −20 °C for 15 min and in methanol:acetic acid 3:1 at −20 °C for 30 min. The slides were aged overnight at room temperature.

Immunofluorescence was performed on the stretched metaphase chromosome spreads using an in-house rabbit polyclonal CENP-C antibody as previously described with minor modifications⁷⁵. In brief, each slide was rehydrated by immersion in 1× PBS-azide (10 mM NaPO₄, pH 7.4, 0.15 M NaCl, 1 mM EGTA and 0.01% NaN₃) for 15 min at room temperature. Chromosomes were then swollen by washing the slides (three times, 2 min each) with 1× TEEN (1 mM triethanolamine-HCl, pH 8.5, 0.2 mM NaEDTA, and 25 mM NaCl), 0.5% Triton X-100 and 0.1% BSA. The primary polyclonal antibody against the centromeric protein CENP-C was diluted 1:40 in the same solution and then added (100 μl) onto the slides. Each slide was incubated for 2 h at 37 °C. Excess primary antibody was removed by washing the slides at room temperature (three times, for 2, 5 and 3 min) with 1× KB buffer (10 mM Tris-HCl, pH 7.7, 0.15 M NaCl and 0.1% BSA). A goat anti-rabbit IgG secondary antibody conjugated to FITC (Sigma-Aldrich, F0382) was diluted 1:40 in the same solution, and 100 μl was added to the slides, which were then incubated for 45 min at 37 °C in a dark chamber. After incubation with the secondary antibody, the slides were washed once with 1× KB for 2 min, prefixed with 4% paraformaldehyde in 1× KB for 45 min at room temperature, washed with distilled H₂O by immersion for 10 min at room temperature, and fixed with methanol and acetic acid (3:1) for 15 min. FISH was then performed using two α-satellite-containing plasmids (pZ21A and pGA16) directly labelled by nick-translation with Cy3-dUTP (Enzo, 42501) according to a standard procedure with minor modifications⁷⁶. In brief, 300 ng of labelled probe was used for the FISH experiments; DNA denaturation was performed at 70 °C for 4 min and hybridization at 37 °C in 2× SSC, 50% (v/v) formamide, 10% (w/v) dextran sulphate, 3 μg Cot-1 DNA and 3 μg sonicated salmon sperm DNA, in a volume of 10 μl. Post-hybridization washing was performed under high stringency conditions: at 60 °C in 0.1× SSC (three times, 5 min each). Nuclei and chromosome metaphases were simultaneously DAPI-stained. Digital images were obtained using a Leica DMRXA2 epifluorescence microscope equipped with a cooled CCD camera (Princeton Instruments). DAPI, Cy3 and fluorescein fluorescence signals, detected with specific filters, were recorded separately as grayscale images. Pseudocolouring and merging of images were performed using ImageJ (v.1.53k).

Human and NHP α-satellite SF classification and strand orientation analysis

Human and NHP α-satellite monomers are grouped into 20 distinct SF classes based on shared sequence identity and structure, as described in detail previously⁵. The SF classes and their monomers are as follows: SF1 (J1 and J2), SF01 (J3, J4, J5 and J6), SF2 (D1, D2 and FD), SF02 (D3, D4, D5, D6, D7, D8 and D9), SF3 (W1, W2, W3, W4 and W5), SF4 (Ga), SF5 (R1 and R2), SF6 (Ha), SF7 (Ka), SF8 (Oa and Na), SF9 (Ca), SF10 (Ba), SF11 (Ja), SF12 (Aa), SF13 (Ia), SF14 (La), SF15 (Fa), SF16 (Ea), SF17 (Qa) and SF18 (Pa and Ta). To determine the α-satellite SF content and strand orientation of human and NHP centromeres, we ran HumAS-HMMER (https://github.com/fedorrik/HumAS-HMMER_for_AnVIL) on centromeric contigs with the following command: hmmer-run_SF.sh {path_to_directory_with_fasta} AS-SFs-hmmer3.0.290621.hmm {number_of_threads}. This generated a BED file with the SF classification and strand orientation of each α-satellite monomer, which we visualized with R⁶⁸ (v.1.1.383) using the ggplot2 package⁶⁶. In cases in which an inversion was detected, we ran StringDecomposer⁷⁷, a tool that detects and reports changes in orientation of tandem repeats, using the default parameters to confirm the presence of reoriented α-satellite monomers at the breakpoints. Finally, we validated the presence of the inversion by aligning native ultra-long ONT reads to the assemblies as described above and confirming even coverage across the breakpoints as well as the presence of inverted α-satellite monomers in the aligned reads.
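
The monomer-to-SF grouping listed above can be written as a lookup table, which is convenient when summarizing a per-monomer annotation. The following Python sketch is illustrative only: SF2 is taken to comprise D1, D2 and FD (consistent with the description of active arrays elsewhere in this section), and the input format, a list of (monomer class, strand) pairs, is an assumption rather than the actual layout of the HumAS-HMMER BED output.

```python
from collections import Counter

# SF classes and their monomer classes, as listed in the text.
SF_CLASSES = {
    "SF1": ["J1", "J2"],
    "SF01": ["J3", "J4", "J5", "J6"],
    "SF2": ["D1", "D2", "FD"],
    "SF02": ["D3", "D4", "D5", "D6", "D7", "D8", "D9"],
    "SF3": ["W1", "W2", "W3", "W4", "W5"],
    "SF4": ["Ga"],
    "SF5": ["R1", "R2"],
    "SF6": ["Ha"],
    "SF7": ["Ka"],
    "SF8": ["Oa", "Na"],
    "SF9": ["Ca"],
    "SF10": ["Ba"],
    "SF11": ["Ja"],
    "SF12": ["Aa"],
    "SF13": ["Ia"],
    "SF14": ["La"],
    "SF15": ["Fa"],
    "SF16": ["Ea"],
    "SF17": ["Qa"],
    "SF18": ["Pa", "Ta"],
}

# Reverse lookup: monomer class -> SF class.
MONOMER_TO_SF = {m: sf for sf, monomers in SF_CLASSES.items() for m in monomers}


def summarize(monomers):
    """Count monomers per (SF class, strand).

    'monomers' is an iterable of (monomer_class, strand) pairs; this input
    format is hypothetical. Unknown classes are counted as 'unclassified'.
    """
    counts = Counter()
    for name, strand in monomers:
        counts[(MONOMER_TO_SF.get(name, "unclassified"), strand)] += 1
    return counts


print(summarize([("R1", "+"), ("R2", "+"), ("J4", "-")]))
```

A change in the dominant strand along an array in such a summary is the kind of signal that would prompt the StringDecomposer and read-level checks described above.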

We uploaded the α-satellite SF and strand orientation tracks generated by HumAS-HMMER for each centromere assembly to the UCSC Human Genome Browser. For the CHM1 centromeres, we uploaded two additional tracks: one showing each α-satellite monomer belonging to known human HORs (ASat-HOR track) and another showing structural variation in human HORs (StV track). All tracks were built and colour-coded as described previously⁵ and are publicly available online (https://genome.ucsc.edu/s/fedorrik/chm1_cen (CHM1); https://genome.ucsc.edu/s/fedorrik/T2T_dev (CHM13); https://genome.ucsc.edu/s/fedorrik/cen_primates (chimpanzee, orangutan, and macaque)). Note that the SF annotation coverage in macaque is sometimes discontinuous (some monomers are not annotated due to significant divergence of macaque dimers from their progenitor Ka class monomers). However, most monomers are identified as Ka, which indicates SF7. In orangutan centromeres, most monomers are identified as R1 and R2, which indicates SF5. In chimpanzee and human autosome and X chromosome centromeres, active arrays are formed by J1 and J2 (SF1), D1, FD and D2 (SF2), and W1–W5 (SF3) monomers. The only exception uncovered in this paper is the centromere of chimpanzee chromosome 5, which appears to be formed by R1 and R2 (SF5), with some monomers identified as J4 and Ga. The former belongs to SF01, which represents the generation of α-satellite intermediate between the progenitor SF5 and the more derived SF1, and J4 is particularly close to the R1 monomer. Moreover, the other SF01 monomers, such as J3, J5 and J6, are absent in the array, which indicates that it is not genuine SF01. Thus, the J4 monomer in chimpanzee centromere 5 should be considered a variant of R1. Similarly, occasional Ga monomers belong to SF4, which is the direct progenitor of SF5, and Ga is very close to R2. Ga monomers dispersed in the SF5 array are therefore just misclassified R2 monomers. The whole chimpanzee chromosome 5 α-satellite HOR array should therefore be classified as SF5, despite the abovementioned contaminations.

Human and NHP phylogenetic analysis

Humans, chimpanzees, orangutans and macaques diverged over a period of at least 25 million years, with chimpanzees diverging approximately 6 million years ago²⁹, orangutans 12–16 million years ago²⁹ and macaques ~25 million years ago⁷⁸. Despite these divergence times, all primates retain α-satellite repeats, which permit the phylogenetic analysis of these regions and an estimation of their evolutionary trajectory. To assess the phylogenetic relationship between α-satellite repeats in human and NHP genomes, we first masked every non-α-satellite repeat in the CHM1, CHM13, HG00733, chimpanzee, orangutan and macaque centromere assemblies using RepeatMasker⁶⁵ (v.4.1.0). We then subjected the masked assemblies to StringDecomposer⁷⁷ using α-satellite monomers derived from the T2T-CHM13 reference genome⁴ (v.2.0). This tool identifies the location of α-satellite monomers in the assemblies, and we used this to extract the α-satellite monomers from the HOR/dimeric array and monomeric regions into multi-FASTA files. We randomly selected 100 and 50 α-satellite monomers from the HOR/dimeric array and monomeric regions, respectively, and aligned them with MAFFT⁷⁹,⁸⁰ (v.7.453). We used IQ-TREE⁸¹ (v.2.1.2) to reconstruct the maximum-likelihood phylogeny with model selection and 1,000 bootstraps. The resulting tree file was visualized in iTOL⁸².
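
The random selection step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: monomers are represented as hypothetical (name, sequence) pairs already extracted from the StringDecomposer output, sampling is without replacement, and the fixed seed is added only to make the example reproducible; none of these details is specified in the text.

```python
import random


def sample_monomers(hor_monomers, monomeric_monomers,
                    n_hor=100, n_mono=50, seed=0):
    """Randomly select monomers without replacement from each region type.

    Inputs are lists of (name, sequence) pairs (a hypothetical format).
    If a list holds fewer records than requested, all of them are returned.
    """
    rng = random.Random(seed)
    hor = rng.sample(hor_monomers, min(n_hor, len(hor_monomers)))
    mono = rng.sample(monomeric_monomers, min(n_mono, len(monomeric_monomers)))
    return hor, mono


def to_fasta(records):
    """Format (name, sequence) pairs as multi-FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Toy input: placeholder names and sequences, not real monomers.
hor_pool = [(f"hor_{i}", "ACGT") for i in range(500)]
mono_pool = [(f"mono_{i}", "TTGA") for i in range(80)]
hor_sample, mono_sample = sample_monomers(hor_pool, mono_pool)
print(len(hor_sample), len(mono_sample))  # 100 50
```

The multi-FASTA text returned by to_fasta for the combined sample is the kind of input that would then be passed to the aligner.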

To estimate sequence divergence along the pericentromeric regions, we first mapped each NHP centromere assembly to the CHM13 centromere assembly using minimap2⁵⁰ (v.2.17-r941) with the following parameters: -ax asm20 --eqx -Y -t 8 -r 500000. We then generated a BED file of 10 kb windows located within the CHM13 centromere assembly. We used the BED file to subset the BAM file, which was subsequently converted into a set of FASTA files. FASTA files contained at least 5 kb of sequence from one or more NHP centromere assemblies mapping to orthologous chromosomes. Pairs of human and NHP sequences were realigned using MAFFT⁷⁹,⁸⁰ (v.7.453) with the following command: mafft --maxiterate 1000 --localpair. Next, we calculated the SNV density and Ti/Tv ratios from these alignments, limiting our analysis to only those regions with one-to-one unambiguous mapping and excluding segmental duplications and satellite repeats (Supplementary Table 10). As a control, we also calculated the SNV density and Ti/Tv ratios from 500 uniquely mapping regions across the genomes (Supplementary Table 11). We estimated the sequence divergence using the Tamura-Nei substitution model⁸³, which accounts for recurrent mutations and differences between transversions and transitions as well as within transitions. The mutation rate per segment was estimated using Kimura’s model of neutral evolution⁸⁴. In brief, we modelled the estimated divergence (D) as a result of between-species substitutions and within-species polymorphisms, that is:

$$D=2\mu t+4{N}_{{\rm{e}}}\,\mu $$

where N_e is the ancestral human effective population size, t is the divergence time for a given human–NHP pair and μ is the mutation rate. We assumed a generation time of [20, 29] years and the following divergence times: human–macaque = [23 × 10⁶, 25 × 10⁶] years, human–orangutan = [12 × 10⁶, 14 × 10⁶] years, human–chimpanzee = [4 × 10⁶, 6 × 10⁶] years. To convert the genetic unit to a physical unit, our computation also assumes N_e = 10,000 and uniformly drawn values for the generation and divergence times.
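
Rearranging the expression above gives μ = D / (2t + 4N_e). The following Python sketch applies this with uniformly drawn generation and divergence times. It is a minimal illustration under stated assumptions: t is converted to generations by dividing the divergence time in years by the generation time, so that μ is expressed per base pair per generation; the divergence value used in the example (0.012) and the number of draws are hypothetical.

```python
import random

# Ranges given in the text (years).
DIVERGENCE_YEARS = {
    "human-macaque": (23e6, 25e6),
    "human-orangutan": (12e6, 14e6),
    "human-chimpanzee": (4e6, 6e6),
}
GENERATION_YEARS = (20.0, 29.0)
NE = 10_000  # ancestral effective population size assumed in the text


def mutation_rate(divergence, t_years, gen_years, ne=NE):
    """Solve D = 2*mu*t + 4*Ne*mu for mu, with t expressed in generations.

    Assumption: t (generations) = divergence time (years) / generation time.
    """
    t_generations = t_years / gen_years
    return divergence / (2.0 * t_generations + 4.0 * ne)


def sample_mutation_rates(divergence, pair, n=1000, seed=0):
    """Draw generation and divergence times uniformly; return mu estimates."""
    rng = random.Random(seed)
    t_lo, t_hi = DIVERGENCE_YEARS[pair]
    g_lo, g_hi = GENERATION_YEARS
    return [
        mutation_rate(divergence, rng.uniform(t_lo, t_hi), rng.uniform(g_lo, g_hi))
        for _ in range(n)
    ]


# Hypothetical divergence of 1.2% for one human-chimpanzee segment.
rates = sample_mutation_rates(0.012, "human-chimpanzee")
print(min(rates), max(rates))
```

Repeating the draws for each segment yields a distribution of μ values rather than a single point estimate, which reflects the uncertainty in the assumed generation and divergence times.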

Human-specific phylogenetic analysis

To determine the phylogenetic relationship and divergence times between centromeric regions from chromosomes 5, 7 and 10–14 in the CHM1, CHM13 and 56 other diverse human genomes (sequenced and assembled by the HPRC¹⁰ and HGSVC²³), we first identified contigs with complete and accurately assembled centromeric α-satellite HOR arrays, as determined by RepeatMasker⁶⁵ (v.4.1.0) and NucFreq²² analysis. We then aligned each of these contigs to the T2T-CHM13 reference genome⁴ (v.2.0) using minimap2⁵⁰ (v.2.24). We also aligned the chimpanzee whole-genome assembly to the T2T-CHM13 reference genome⁴ (v.2.0) to serve as an outgroup in our analysis. We identified 20 kb regions in the flanking monomeric α-satellite or unique regions on the p- or q-arms and ensured that the region we had selected had only a single alignment from each haplotype to the reference genome. We next aligned these regions to each other using MAFFT⁷⁹,⁸⁰ (v.7.453) with the following command: mafft --auto --thread {num_of_threads} {multi-fasta.fasta}. We used IQ-TREE⁸¹ (v.2.1.2) to reconstruct the maximum-likelihood phylogeny with model selection and 1,000 bootstraps. The resulting tree file was visualized in iTOL⁸². Timing estimates were calculated by applying a molecular clock based on the branch-length distance to individual nodes and assuming a divergence time between human and chimpanzee of 6 million years ago. Clusters of α-satellite HOR arrays with a single monophyletic origin were assessed for gains and losses of α-satellite base pairs, monomers, HORs and distinct structural changes manually.
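
One way to read the molecular clock calibration is as a linear rescaling of branch-length distances so that the human–chimpanzee split corresponds to 6 million years. The Python sketch below illustrates this interpretation only; the exact procedure is not given in the text. The tree representation (nested tuples of name, branch length and children), the use of the mean node-to-tip distance and the toy branch lengths are all assumptions.

```python
def tip_distances(node):
    """Distances from a node to each of its descendant tips.

    A node is a tuple (name, branch_length, children); tips have no children.
    """
    _, _, children = node
    if not children:
        return [0.0]
    distances = []
    for child in children:
        child_length = child[1]
        distances.extend(child_length + d for d in tip_distances(child))
    return distances


def mean_depth(node):
    """Mean branch-length distance from a node to its descendant tips."""
    distances = tip_distances(node)
    return sum(distances) / len(distances)


def node_age_years(node, human_chimp_node, calibration_years=6e6):
    """Scale a node's depth by the depth of the human-chimpanzee split."""
    return calibration_years * mean_depth(node) / mean_depth(human_chimp_node)


# Toy tree with hypothetical branch lengths (substitutions per site).
human_clade = ("human_clade", 0.004, [("hapA", 0.002, []), ("hapB", 0.002, [])])
root = ("root", 0.0, [("chimpanzee", 0.006, []), human_clade])

print(node_age_years(human_clade, root))
```

In this toy tree the human clade sits at one-third of the depth of the human–chimpanzee split, so its estimated age is about 2 million years. A strict clock of this kind assumes a constant substitution rate across all branches.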

Polymorphic TE analysis

To detect polymorphic TEs between the CHM1 and CHM13 centromeric regions, we first ran RepeatMasker⁶⁵ (v.4.1.0) on the CHM1 and CHM13 centromeric regions. We then masked all satellite repeats within these regions using BEDtools⁶⁴ maskfasta (v.2.29.0). We aligned the masked CHM1 fasta to the masked CHM13 fasta using minimap2⁵⁰ and the following command: minimap2 -t {threads} --eqx -c -x asm20 --secondary=no {ref.fasta} {query.fasta}. Using the resulting PAF, we extracted the regions with structural variants that were >50 bp long. We next intersected these regions with the RepeatMasker annotation file to identify those variants that overlapped SINE, LINE or LTR repeat classes by >75%. We considered the following LINE and SINE subgroups: LINE/CR1, LINE/L1, LINE/L1-Tx1, LINE/L2, LINE/Penelope, LINE/RTE-BovB, LINE/RTE-X, SINE/5S-Deu-L2, SINE/Alu, SINE/MIR, SINE/tRNA, SINE/tRNA-Deu, SINE/tRNA-RTE. We then determined the variation in length of these regions between the two centromeric regions, and we plotted their position and length using R⁶⁸ (v.1.1.383) and the ggplot2 package⁶⁶.
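
The length and overlap criteria can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used here: it assumes that the 75% threshold refers to the fraction of the structural variant covered by SINE, LINE or LTR annotations, that coordinates are half-open intervals on a single contig, and that annotations are hypothetical (start, end, class/family) tuples.

```python
TE_CLASSES = ("SINE", "LINE", "LTR")


def covered_bases(start, end, intervals):
    """Bases of [start, end) covered by the union of the given intervals."""
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in intervals
        if s < end and e > start
    )
    total = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Close the previous merged block and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def is_te_variant(sv_start, sv_end, annotations, min_len=50, min_frac=0.75):
    """True if the variant is >min_len bp and >min_frac covered by TEs."""
    length = sv_end - sv_start
    if length <= min_len:
        return False
    te_intervals = [
        (s, e) for s, e, cls in annotations if cls.split("/")[0] in TE_CLASSES
    ]
    return covered_bases(sv_start, sv_end, te_intervals) / length > min_frac


# Hypothetical annotations on one contig.
annotations = [
    (900, 1250, "SINE/Alu"),
    (1240, 1260, "LINE/L1"),
    (1260, 1300, "Satellite/centr"),
    (3000, 3100, "LTR/ERV1"),
]
print(is_te_variant(1000, 1300, annotations))  # True
```

Overlapping annotations are merged before the covered fraction is computed, so that bases annotated twice are not counted twice.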

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
