Molecular data are the basis for many biological studies in the big data era.
Understanding the current state of sequencing data helps researchers make
better use of the data. Plastid DNA sequences have been
extensively applied in scientific studies of plants due to their easy
accessibility, uniparental inheritance, and moderate rate of mutation. In this
study, the current state of plastid DNA sequence data for the world's vascular
plants was evaluated based on the GenBank database. The results showed
that the proportion of sequenced species was low, with only 33.75% of vascular
plants having plastid DNA data. Sequenced species were unevenly sampled among
lineages. The ratios of missing data were generally correlated with species
richness within the lineages. The three orders with the highest missing data
ratios were Paracryphiales, Piperales, and Dilleniales, and the top three
families were Triuridaceae, Pentaphragmataceae, and Xyridaceae. Geographically,
the missing data ratio of plastid DNA of vascular plants showed a latitudinal
gradient, with the degree of missing data decreasing from the
equator to the poles. Regions with high missing data ratios of plastid DNA
usually possess high biodiversity, including many biodiversity hotspots. In
addition, endemic species generally had a high proportion of missing
data in most regions. Based on the results of this study, we suggest
that priority should be given to data collection for groups with high missing
data ratios and regions with high biodiversity, particularly for endemic
species, to improve the sampling of genetic data for these species and regions.