A new electron density-derived base-pair descriptor scheme is combined with machine learning methodology to uncover the functional relationships between DNA/RNA structure at the genetic level and activity at the physiological level. To this end, electronic property distributions of nucleic acid base-pair sequences are represented through Transferable Atom Equivalent Reconstruction (TAE/RECON) using a base-pair descriptor library designed to provide accurate representations of the electronic environment around base-pairs in DNA and RNA sequences. Using four-letter code descriptors, current bioinformatics techniques are beginning to identify motifs involved in transcription factor binding to DNA, as well as regions important to promoter function. With the addition of these new, rapidly accessible descriptors, which provide features directly related to the physics of interaction between base-pairs and DNA-binding proteins, a new level of information mining is now available for genomic data.
TAE/RECON descriptors provide over 150 channels of electron density-based property information per base-pair and take into account each base-pair's DNA environment. In the present work, base-pair contributions are combined to provide spatially resolved electronic information about DNA sequences, giving a more accurate and chemically relevant way of representing genomic data. The descriptor patterns for any DNA sequence are thus represented in the form of "fuzzy bar codes" composed of position-specific DNA pixels ("dixels"). The integration of information from "dixel" maps with data from DNA/protein co-crystal structures can provide insight into the chemistry of DNA-protein interactions relevant to the regulation of gene expression.
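As an illustration only, the idea of assembling per-base-pair descriptor channels into a position-specific "dixel" map can be sketched as follows. This is a minimal sketch under stated assumptions, not the TAE/RECON procedure itself: the library contents are random placeholders rather than real electron density-derived values, the channel count of 150 is a stand-in for "over 150", and modelling the DNA environment as the immediately neighbouring bases (a triplet context) is an assumption made here. The names `make_placeholder_library` and `dixel_map` are hypothetical.

```python
import random

N_CHANNELS = 150  # placeholder; the text says only "over 150" channels
BASES = "ACGT"


def make_placeholder_library(n_channels=N_CHANNELS, seed=0):
    """Build a hypothetical descriptor library keyed by triplet context.

    Each key is a three-letter string (left neighbour, central base,
    right neighbour); each value is a list of n_channels floats.
    The values are random placeholders, NOT real TAE/RECON descriptors.
    """
    rng = random.Random(seed)
    return {
        left + centre + right: [rng.random() for _ in range(n_channels)]
        for left in BASES
        for centre in BASES
        for right in BASES
    }


def dixel_map(sequence, library):
    """Return one row of channel values per interior sequence position.

    The two terminal positions are skipped because they lack a full
    triplet context under the assumption used in this sketch.
    A character outside A/C/G/T raises KeyError.
    """
    seq = sequence.upper()
    rows = []
    for i in range(1, len(seq) - 1):
        context = seq[i - 1:i + 2]  # central base plus both neighbours
        rows.append(library[context])
    return rows


if __name__ == "__main__":
    library = make_placeholder_library()
    dixels = dixel_map("ACGTAC", library)
    print(len(dixels), "positions x", len(dixels[0]), "channels")
```

Each row of the resulting positions-by-channels array corresponds to one "dixel", and the full array could be passed as a feature matrix to a machine learning model or rendered as a bar-code-like image.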