A robust phylogenomic timetree for biotechnologically and 1 medically important fungi from Aspergillaceae ( Eurotiomycetes , 2 Ascomycota ) 3 4

Abbreviations: NT, nucleotide; AA, amino acid; CI, credible interval; RCV, relative composition variability; IC, internode certainty; GSF, gene support frequencies; GLS, gene-wise log-likelihood scores; DVMC, degree of violation of a molecular clock The filamentous fungal family Aspergillaceae contains > 1,000 known species, mostly in the genera Aspergillus and Penicillium. Several species are used in the food, biotechnology, and drug industries (e.g., Aspergillus oryzae, Penicillium camemberti), while others are dangerous human and plant pathogens (e.g., Aspergillus fumigatus, Penicillium digitatum). To infer a robust phylogeny and pinpoint poorly resolved branches and their likely underlying contributors, we used 81 genomes spanning the diversity of Aspergillus and Penicillium to construct a 1,668-gene data matrix. Phylogenies of the nucleotide and amino acid versions of this full data matrix as well as of five additional 834-gene data matrices constructed by subsampling the top 50% of genes according to different criteria associated with strong phylogenetic signal were generated using three different maximum likelihood schemes (i.e., gene-partitioned, unpartitioned, and coalescence). Examination of the topological agreement among these 36 phylogenies and measures of internode certainty identified 12 / 78 (15.4%) bipartitions that were incongruent and pinpoint the likely underlying contributing factors (incomplete lineage sorting, hybridization or introgression, and reconstruction artifacts associated with poor taxon sampling). Relaxed molecular clock analyses suggest that Aspergillaceae likely originated in the lower Cretaceous and the Aspergillus and Penicillium genera in the upper Cretaceous. Our results shed light on the ongoing debate on Aspergillus systematics and taxonomy and provide a robust evolutionary and temporal framework for comparative genomic analyses in Aspergillaceae. More broadly, our approach provides a general template for phylogenomic identification of resolved and contentious branches in densely genome-sequenced lineages across the tree of life.

the Aspergillus and Penicillium genera dating back to 84.3 mya (95% CI: 90.9 -77.6) and 77.4 48 mya (95% CI: 94.0 -61.0), respectively. Our results provide a robust evolutionary and temporal 49 framework for comparative genomic analyses in Aspergillaceae, while our general approach 50 provides a widely applicable template for phylogenomic identification of resolved and 51 contentious branches in densely genome-sequenced lineages across the tree of life. 52 7 introgression (or very high levels of incomplete lineage sorting), and (iii) 7 deeper bipartitions 122 with varying levels of incongruence likely driven by reconstruction artifacts likely linked with 123 poor taxon sampling. We also estimated divergence times across Aspergillaceae using relaxed 124 molecular clock analyses. Our results suggest Aspergillaceae originated in the lower Cretaceous, 125 125.1 (95% Confidence Interval (CI): 146.7 -102.1) million years ago (mya), and that 126 Aspergillus and Penicillium originated 84.3 mya (95% CI: 90.9 -77.6) and 77.4 mya (95% CI: 127 94.0 -61.0), respectively. We believe this phylogeny and timetree provides a state-of-the-art 128 platform for comparative genomic, ecological, and chemodiversity studies in this ecologically 129 diverse and biotechnologically and medically significant family of filamentous fungi. 130 Data collection and quality assessment 152 To collect a comprehensive set of genomes representative of Aspergillaceae, we used 153 'Aspergillaceae' as a search term in NCBI's Taxonomy Browser and downloaded a 154 representative genome from every species that had a sequenced genome as of February 5 th 2018. 155 We next confirmed that each species belonged to Aspergillaceae according to previous literature 156 reports 31,50 . Altogether, 80 publicly available genomes and 1 newly sequenced genome spanning 157 5 genera (45 Aspergillus species; 33 Penicillium species; one Xeromyces species; one Monascus 158 species; and one Penicilliopsis species) from the family Aspergillaceae were collected (File S1). 159 We also retrieved an additional 12 fungal genomes from representative species in the order 160 Eurotiales but outside the family Aspergillaceae to use as outgroups. 161

162
To determine if the genomes contained gene sets of sufficient quality for use in phylogenomic 163 analyses, we examined their gene set completeness using Benchmarking Universal Single-Copy 164 Orthologs (BUSCO), version 2.0.1 51 ( Figure S2). In brief, BUSCO uses a consensus sequence 165 built from hidden Markov models derived from 50 different fungal species using HMMER, 166 version 3.1b2 52 as a query in TBLASTN 53, 54 to search an individual genome for 3,156 predefined 167 orthologs (referred to as BUSCO genes) from the Pezizomycotina database (creation date: 02-13-168 2016) available from ORTHODB, version 9 55 . To determine the copy number and completeness 169 of each BUSCO gene in a genome, gene structure is predicted using AUGUSTUS, version 170 2.5.5 56 , from the nucleotide coordinates of putative genes identified using BLAST and then 171 aligned to the HMM alignment of the same BUSCO gene. Genes are considered "single copy" if shorter than 95% of the aligned sequence lengths from the 50 different fungal species, and 175 "missing" if there is no predicted gene. 176 177 Phylogenomic data matrix construction 178 In addition to their utility as a measure of genome completeness, BUSCO genes have also proven 179 to be useful markers for phylogenomic inference 57 , and have been successfully used in 180 phylogenomic studies of clades spanning the tree of life, such as birds 58 , insects 59 , and budding 181 yeasts 38 . To infer evolutionary relationships, we constructed nucleotide (NT) and amino acid 182 (AA) versions of a data matrix comprised of the aligned and trimmed sequences of numerous 183 We used IQ-TREE for downstream analysis because a recent study using diverse empirical 244 phylogenomic data matrices showed that it is a top-performing software 79 as well as because IQ-245 TREE's gene partitioning scheme can account for different models of rate heterogeneity per 246 gene 80 . 247

248
To determine the phylogeny of Aspergillaceae using a partitioned scheme where each gene has 249 its own model of sequence substitution and rate heterogeneity parameters, we created an 250 additional input file describing these and gene boundary parameters. More specifically, we 251 created a nexus-style partition file that was used as input with the "-spp" parameter 80 . To 252 increase the number of candidate trees used during maximum likelihood search, we set the "-253 nbest" parameter to 10. Lastly, we conducted 5 independent searches for the maximum 254 likelihood topology using 5 distinct seeds specified with the "-seed" parameter and chose the 255 search with the best log-likelihood score. We used the phylogeny inferred using a partitioned 256 scheme on the full NT data matrix as the reference one for all subsequent comparisons ( Figure  257 1). 258

259
To determine the phylogeny of Aspergillaceae using a non-partitioned scheme, we used all the 260 same parameters as above; the only difference was that we used a single model of sequence 261 substitution and rate heterogeneity parameters across the entire matrix. The most appropriate 262 single model was determined by counting which best fitting model was most commonly 263 observed across single gene trees. The most commonly observed model was 264 respectively, ( Figure S5). In each analysis, the chosen model was specified using the "-m" 267 parameter. 268

269
To determine the phylogeny of Aspergillaceae using coalescence, a method that estimates 270 species phylogeny from single gene trees under the multi-species coalescent 67 , we combined all 271 NEWICK 82,83 formatted single gene trees inferred using their best fitting models into a single file. 272 The resulting file was used as input to ASTRAL-II, version 4.10.12 68 with default parameters. 273

274
To evaluate support for single gene trees and for the reference phylogeny (Figure 1) To implement UFBoot for the NT 1,668-gene data matrix and single  289   gene trees, we used the "-bb" option in IQ-TREE with 5,000 and 2,000 ultrafast bootstrap  290 replicates, respectively. 291 292 Evaluating topological support 293 To identify and quantify incongruence, we used two approaches. In the first approach, we 294 compared the 36 topologies inferred from the full 1,668-gene NT and AA data matrices and five 295 additional 834-gene data matrices (constructed by selecting the genes that have the highest 296 scores in five measures previously shown to be associated with strong phylogenetic signal; see 297 above) using three different maximum likelihood schemes (i.e., gene partitioned, non-298 partitioned, coalescence) and identified all incongruent bipartitions between the reference 299 phylogeny ( Figure 1) and the other 35. In the second approach, we scrutinized each bipartition in 300 the reference phylogeny using measures of internode certainty (IC) measures for complete and 301 partial single gene trees 29,44,45 . To better understand single gene support among conflicting 302 bipartitions, we calculated gene-wise log-likelihood scores (GLS) 42 and gene support frequencies 303 (GSF) for the reference and alternative topologies at conflicting bipartitions. 304

305
Identifying internodes with conflict across subsampled data matrices 306 To identify incongruent bipartitions between the reference phylogeny and the other 35 307 phylogenies, we first included the 36 generated phylogenetic trees into a single file. We next 308 evaluated the support of all bipartitions in the reference topology among the other 35 309 phylogenies using the "-z" option in RAXML. Any bipartition in the reference phylogeny that 310 was not present in the rest was considered incongruent; each conflicting bipartition was 311 identified through manual examination of the conflicting phylogenies. To determine if sequence 312 type, subsampling method, or maximum likelihood scheme was contributing to differences in 313 observed topologies among conflicting internodes, we conducted multiple correspondence 314 analysis of these features among the 36 phylogenies and visualized results using the R, version indicate that there is near-equal support for an alternative, conflicting bipartition among a set of 322 trees compared to a given bipartition present in the reference topology, which is indicative of 323 high conflict. Therefore, we investigated incongruence in all internodes in the reference 324 phylogeny ( Figure 1) that exhibited IC scores lower than 0.1. To calculate IC values for each 325 bipartition for the reference phylogeny, we created a file with all 1,668 complete and partial 326 single gene trees. The resulting file of gene trees, specified with the "-z" parameter in RAXML, 327 were used to calculate IC values using the "-f i" argument. The topology was specified with the 328 "-t" parameter. Lastly, we used the Lossless corrected IC scoring scheme, which corrects for 329 variation in taxon number across single gene trees 44 . We also used these IC values to inform 330 which data type (NT or AA) provided the strongest signal for the given set of taxa and 331 sequences. We observed that NTs consistently exhibited higher IC scores than AAs (hence our 332 decision to use the topology inferred from the full NT data matrix using a gene-partitioned 333 scheme -shown in Figure 1 -as the 'reference' topology in all downstream analyses).

Examining gene-wise log-likelihood scores for incongruent internodes 336
To determine the per gene distribution of phylogenetic signal supporting a bipartition in the 337 reference phylogeny or a conflicting bipartition, we calculated gene-wise log-likelihood scores 338 (GLS) 42 using the NT data matrix. We chose to calculate GLS using the NT data matrix because 339 distributions of IC values from phylogenies inferred using NTs had consistently higher IC values 340 across schemes and data matrices ( Figure S6). To do so, we used functions available in IQ-341 TREE. More specifically, we inputted a phylogeny with the reference or alternative topology 342 using the "-te" parameter and informed IQ-TREE of gene boundaries, their corresponding 343 models, and optimal rate heterogeneity parameters in the full 1,668-gene data matrix using the "-344 spp" parameter. Lastly, we specified that partition log-likelihoods be outputted using the "-wpl" 345 parameter. To determine if a gene provided greater support for the reference or alternative 346 bipartition, we calculated the difference in GLS (ΔGLS) using the following formula: at shallow depths in the phylogeny (i.e., among closely related species), we required all taxa to 369 be present in a single gene tree; for conflicting bipartitions near the base of the phylogeny (i.e., 370 typically involving multiple sections), we required at least one species to be present from each 371 section of interest (with the exception of Exilicaulis because this section is not monophyletic). 372 Scripts to determine GSF were written using functions provided in NEWICK UTILITIES, version 373 Currently, large phylogenomic data matrices that contain hundreds to thousands of genes and 385 many dozens of taxa are intractable for Bayesian inference of divergence times; thus, we 386 identified and used only those genes that appear to have evolved in a "clock-like" manner in the 387 inference of divergence times. To identify genes evolving in a "clock-like" manner, we 388 calculated the degree of violation of a molecular clock (DVMC) 98 for single gene trees. DVMC 389 is the standard deviation of root to tip distances in a phylogeny and is calculated using the 390 following formula: 391 where ti represents the distance between the root and species i across n species. Using this 393 method, genes with low DVMC values evolve in a "clock-like" manner compared to those with 394 higher values. We took the top scoring 834 (50%) genes and bootstrap subsampled 250 genes 395 without replacement. We decided to use 250 genes because a previous study with a similar 396 number of taxa used a similar number of genes 98 . 397

(ii) Estimating substitution rate 399
To estimate the substitution rate across the 250 genes, we used BASEML from the PAML package, 400 version 4.9d 96 . We estimated substitution rate using a "GTR+G" model of substitutions (model = 401 7) and a strict clock model (clock = 1). Additionally, we point calibrated the root of the tree to 96 million years ago (mya) according to TIMETREE 99 , which is based on previous estimations 100  To estimate divergence times, we used the resulting gradient and Hessian results from the 418 previous step for use in MCMC analysis using MCMCTREE 96 . To do so, a gamma distribution 419 prior shape and scale must be specified. The gamma distribution shape and scale is determined 420 from the substitution rate determined in step ii where shape is a=(s/s) 2 and scale is b=s/s 2 and s 421 is the substitution rate. Therefore, a=1 and b=25 and the "rgene_gamma" parameter was set to 422 "1 25." We also set the "sigma2_gamma" parameter to "1 4.5." To minimize the effect of initial 423 values on the posterior inference, we discarded the first 100,000 results. Thereafter, we sampled 424 every 500 iterations until 10,000 samples were gathered. Altogether, we ran 5.1 million iterations MCMC analysis 108 . Lastly, we set the "finetune" parameter to 1. 427

428
To determine the stability of inferred divergence time estimates, we constructed two additional 429 matrices of 250 genes, repeated the analyses in steps ii-iv, and compared results. The two 430 additional data matrices were constructed by independently bootstrap subsampling 250 genes 431 without replacement from the 834 genes with the best DVMC values. We subsequently repeated 432 steps ii-iv and conducted correlation analyses between the three sets of 250 genes to determine 433 the stability of inferred divergence times. Examination of the gene content differences between the 5 NT subsampled data matrices as well 467 as between the 5 AA data matrices revealed that they are composed of variable sets of genes 468 genes that were shared between all NT matrices except the completeness-based one; similarly, 470 the largest intersection among AA data matrices was 228 genes and was shared between all AA 471 matrices except the completeness-based one ( Figure S8a, b). Examination of the number of gene 472 overlap between the NT and AA data matrices for each criterion ( Figure S8c) showed that three 473 criteria yielded identical or nearly identical NT and AA gene sets. These were completeness (834 474 / 834; 100% shared genes; rs = 1.00, p < 0.01; Figure S7c phylogenies showed that they were very often associated with data type (NT or AA) and scheme 521 employed (concatenation or coalescence). For example, the first instance of incongruence 522 concerns the identity of the sister species to Penicillium biforme (I60; Figure 1 and 3a); this 523 species is P. camemberti in the reference phylogeny but analyses of the full and two subsampled 524 AA data matrices with coalescence recover instead Penicillium fuscoglaucum. The data type and 525 analytical scheme employed also appear to underlie the second and third instances of 526 incongruence, which concern the polyphyly of section Exilicaulis (I74 and I78; Figures 1 and  527 3b), the fourth and fifth instances, which concern relationships among Aspergillus sections (I24 528 and I35; Figures 1 and 3c), as well as the sixth instance, which concerns relationships among the 529 sections Digitata, Chrysogena, and Roquefortorum (I63; Figure 1 and 3d). The seventh instance 530 is also associated with data type, but not with the scheme employed; while the reference as well 531 as most subsampled NT matrices support the Aspergillus persii and Aspergillus sclerotiorum 532 clade as sister to Aspergillus westerdijkiae (I33; Figure 1 and 3e), most AA data matrices recover analysis of one AA subsampled data matrix with coalescence instead recovered Aspergillus 537 luchuensis as the sister group. 538 539 For each of these bipartitions (Figure 3), we examined clustering patterns using multiple 540 correspondence analysis of matrix features (i.e., sequence type and subsampling method) and 541 analysis scheme among trees that support the reference and alternative topologies ( Figure S11). 542 Distinct clustering patterns were observed for I74, I78, and I33 (Figure 3 and S11). For I74 and 543 I78, there are three alternative, conflicting topologies, with the first two clustering separately 544 from the third (Figure 3b and S11b). For I33, phylogenies that support the reference and 545 alternative topologies formed distinct clusters (Figure 3e). Examination of the contribution of 546 variables along the second dimension, which is the one that differentiated variables that 547 supported each topology, revealed that the distinct clustering patterns were driven by sequence 548 type ( Figure S11g and h). internodes I74 and I78, which concern relationships among the sections Lanata-divaricata, 608 Exilicaulis and Citrina, were observed in 26 / 36 (72.2%) phylogenies; the remaining 10 / 36 609 (27.8%) phylogenies recovered three alternative, conflicting bipartitions. Both reference 610 bipartitions had IC scores of 0.01, and GSFNT and GSFAA scores of 0.11 and 0.07, respectively. 611 The reference bipartitions for internodes I24 and I35, which concern the placement of 612 Chrysogena, which includes the antibiotic penicillin producing species P. chrysogenum, 646 originated 6.4 (95% CI: 11.5 -3.2) mya. Additionally, section Citrina, which contains P. 647 citrinum, which the first statin was isolated from and is commonly associated with moldy citrus

651
Our analyses provide a robust evaluation of the evolutionary relationships and diversification 652 among Aspergillaceae, a family of biotechnologically and medically significant fungi. We 653 scrutinized our proposed reference phylogeny (Figure 1) against 35 other phylogenies recovered 654 using all possible combinations of six multi-gene data matrices (full or subsamples thereof), 655 three maximum likelihood schemes, and two sequence types and complemented this analysis 656 with bi-partitioned based measures of support (Figures 1 and 2). Through these analyses, we 657 found that 12 / 78 (15.4%) bipartitions were incongruent (Figure 3 and 4) (Figure 1, Figure S14). The robust resolution of our phylogeny is likely 665 due to the very large size of our data matrix, both in terms of genes as well as in terms of 666 sequence. For example, the placement of Aspergillus section Nigri has been unstable in previous 667 phylogenomic analyses ( Figure S1) 31,33,34 , but our denser sampling of taxa in this section as well 668 as inclusion of representative taxa from sections Nidulantes, Versicolores, Usti, and 669 incongruence (Figures 3 and 4). In general, gene tree incongruence can stem from biological or 674 analytical factors 30,42 . Biological processes such as incomplete lineage-sorting (ILS) 121  will often artifactually group with other long branches 133 . 693 only group the 12 incongruent internodes with respect to their patterns of conflict but also to 696 postulate putative drivers of the observed incongruence. For example, both I15 and I60 are 697 shallow internodes exhibiting low levels of incongruence, suggesting that one likely driver of the 698 observed incongruence is ILS. In contrast, the shallow internodes I33, I43, and I55 exhibit much 699 higher levels of incongruence that are most likely to be the end result of processes, such as 700 hybridization or repeated introgression. Finally, the remaining seven incongruent internodes (I3, 701 I4, I24, I35, I63, I74, and I78) exhibit varying levels of incongruence and are typically associated 702 with single taxon long branches (Figures 1, 3, and 4), implicating taxon sampling as a likely 703 driver of the observed incongruence. Given that inclusion of additional taxa robustly resolved the 704 previously ambiguous placement of the long-branched Aspergillus section Nigri (see discussion 705 above), we predict that additional sampling of taxa that break up the long branches associated 706 with these seven internodes will lead to their robust resolution. Internode numbers refer to internodes that have at least one conflicting topology among the 36 773 phylogenetic trees inferred from the full and five subsampled data matrices across three different 774 schemes and two data types. The internode recovered from the analysis of the 1,668-gene 775 nucleotide matrix (Figure 1) is shown on the left and the conflicting internode(s) on the right. 776 Next to each of the internodes, the nucleotide (nt) and amino acid (aa) gene support frequency 777 (GSF) values are shown. On the far right, the sequence type, scheme, and data matrix 778 characteristics of the phylogenies that supports the conflicting internodes are shown. Nt and aa 779 sequence types are represented using black and white squares, respectively; partitioned 780 concatenation, non-partitioned concatenation, and coalescence analytical schemes are depicted as 781 black, grey, or white circles, respectively; and the matrix subset is written next to the symbols.