INTRODUCTION: Molecular data are among the most important foundations of many biological studies, including phylogeny, ecology, and biogeography. Incomplete sampling may lead to biased results and unsound conclusions. However, few studies have comprehensively evaluated the current sampling density of DNA sequence data. Plastid DNA sequences have been used extensively in plant research because they are easy to obtain, uniparentally inherited, and mutate at a moderate rate. It is therefore essential to assess the current sampling density of plastid DNA data across species and geographic areas so that researchers can make better use of these data.
RATIONALE: GenBank is the largest and most commonly used database of DNA sequence data. In this study, we used GenBank to investigate gaps in plastid DNA data for vascular plants across species and geographic areas. First, plastid DNA data for vascular plant species were downloaded from GenBank and cleaned. Second, species names were standardized against the World Checklist of Vascular Plants (WCVP) database. Third, to evaluate sampling density across lineages, we counted the number of species with sequenced plastid DNA and calculated the proportion of missing data for each order and family. We also mapped the proportion of missing data in each region to evaluate sampling density geographically. To explore factors that may influence the plastid DNA data gap, we calculated Spearman’s correlations between the proportion of missing data and species diversity among major groups of vascular plants and among regions.
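The two quantities named above can be illustrated with a minimal sketch. It assumes that the proportion of missing data for a lineage or region is one minus the ratio of sequenced species to accepted species; the abstract does not state the formula, so this definition is an assumption. The function names and the toy counts are hypothetical and are not taken from the study.

```python
from statistics import mean


def missing_proportion(total_species, sequenced_species):
    """Assumed definition: share of accepted species with no plastid DNA record."""
    if total_species <= 0:
        raise ValueError("total_species must be positive")
    return 1 - sequenced_species / total_species


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: lineage -> (accepted species, sequenced species).
lineages = {
    "Family A": (1000, 200),
    "Family B": (500, 200),
    "Family C": (100, 60),
    "Family D": (10, 9),
}
richness = [total for total, _ in lineages.values()]
missing = [missing_proportion(t, s) for t, s in lineages.values()]
rho = spearman_rho(richness, missing)
print(f"Spearman's rho = {rho:.2f}")
```

In this toy data set the richest lineages also have the largest share of unsequenced species, so the rank correlation is positive. The same two functions could be applied to regions instead of lineages by swapping in regional species counts.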
RESULTS: Only 33.75% of vascular plant species have at least one plastid DNA record in GenBank, amounting to 139 005 species (angiosperms: 131 220; gymnosperms: 1 154; pteridophytes: 6 631). Across species, sequencing was uneven among lineages, and the proportion of missing data was generally correlated with species richness within lineages. The three orders with the highest proportion of missing data were Paracryphiales, Piperales, and Dilleniales, and the three families with the highest proportion were Triuridaceae, Pentaphragmataceae, and Xyridaceae. Geographically, the proportion of missing plastid DNA data showed a latitudinal gradient, decreasing from the equator to the poles. Regions with a high proportion of missing data usually have high biodiversity and include many biodiversity hotspots. In addition, endemic species generally had a high proportion of missing data in most regions.
CONCLUSION: This study comprehensively evaluated the current sampling density of plastid DNA data across species and geographic areas. Our results indicate that about 140 000 vascular plant species have plastid DNA sequences. However, large gaps remain in three respects: (1) only about one third of vascular plant species have been sequenced; (2) the proportion of species with sequenced plastid DNA is uneven among lineages; (3) the proportion of missing data decreases from the equator to the poles, with larger deficiencies in biodiversity hotspots and among endemic species. Based on these results, we recommend prioritizing the collection and sequencing of vascular plants in groups with a high proportion of missing data and in regions with high biodiversity, particularly endemic species. These findings indicate where efforts to fill the plastid DNA data gap should be directed and will benefit biodiversity conservation.