Transcription factors (TFs) play a crucial role in gene regulation by binding to specific DNA sequences known as transcription factor binding sites (TFBS). Accurately identifying these sites is essential for understanding gene expression and its impact on cellular functions. Traditional computational approaches, such as position weight matrices (PWMs) and support vector machines (SVMs), have been widely used for TFBS prediction.

In this article, I explore an alternative method that combines wavelet transforms with deep learning to classify TFBS. By transforming DNA sequences into visual representations, we can leverage powerful image classification models like ResNet50 to enhance prediction accuracy and gain deeper insights into TF-DNA interactions.

The code to replicate the analysis can be found at: github.com

Transcription factors (TFs) are proteins that play a crucial role in regulating gene expression. They do this by binding to specific DNA sequences known as transcription factor binding sites (TFBS). These binding sites are typically short, conserved sequences (called motifs) located in the promoter or enhancer regions of genes. By binding to these sites, transcription factors can either activate or repress the transcription of associated genes, thereby controlling the production of proteins essential for various cellular processes.

Understanding the locations and mechanisms of transcription factor binding to DNA is essential for decoding the complex regulatory networks that control cellular functions. Furthermore, accurately identifying potential TFBS can contribute to the development of personalized treatments. For instance, consider a scenario in which a genetic variant in an oncogene alters the neighboring DNA sequence, creating a novel TFBS that could potentially initiate oncogenesis. However, experimentally identifying these binding sites is both challenging and time-consuming.
Computational methods provide a powerful alternative, enabling faster and more scalable predictions of TFBS and thereby facilitating more efficient biological analysis and medical advancements. Typically, PWM-based models like MEME and STREME, as well as SVM-based models, are used to address this problem. PWM models rely on statistical methods, while SVM models use machine learning to identify motifs within positive sequences (i.e., sequences containing TFBS). In this article, I explore an alternative approach, applying wavelet transforms and deep learning to replicate these results for the CTCF transcription factor.

One of the most extensively studied transcription factors is the CCCTC-binding factor (CTCF), a highly conserved protein essential for organizing chromatin architecture. CTCF primarily functions as an insulator, preventing interactions between enhancers and promoters to ensure proper gene expression. In some cases, however, it also facilitates the formation of chromatin loops, bringing distant regulatory elements into proximity with their target genes and enhancing gene expression. Since CTCF is a key player in gene regulation, it's a go-to choice for TFBS prediction studies, so I figured it was the perfect candidate for this experiment!

Wavelet transforms are mathematical tools that decompose signals into different frequency components, allowing for the analysis of localized variations within the signal. Unlike Fourier transforms, which provide only frequency information, wavelet transforms offer both frequency and temporal information, making them particularly useful for analyzing non-stationary signals. By overlapping a mother wavelet M, scaled by a factor σ and shifted by a factor Δ, onto a given signal S, we compute the inner product between M(σ, Δ) and S.
This operation is then repeated over a range of shifts Δ and scales σ. For more details on wavelet transforms, I suggest reading the following article: medium.com

The wavelet transform generates a set of coefficients for a given signal, which can be arranged in a matrix. This reveals a powerful feature of wavelet transforms: by computing the modulus of the complex numbers in the matrix, we obtain an intensity matrix. By mapping these intensities to colors, we can create a colormap, essentially converting a signal into an image! This transformation enables us to leverage the vast array of pre-trained image classification models, such as ResNets, to classify signals based on their colormaps.

However, since we're working with DNA sequences, we first need to determine how to convert DNA strings into signals. There are many ways to do this, all of which fall under the field of Genomic Signal Processing. For this experiment, I decided to go with the simplest possible mapping: each nucleotide (A, C, G, T) corresponds to a specific integer (1, 2, 3, 4).

The full pipeline can be visualized in the image below:

The ResNet50 model, short for Residual Network with 50 layers, is a deep convolutional neural network (CNN) architecture renowned for its effectiveness in image classification tasks. Introduced by Microsoft Research in 2015, ResNet50 addresses the challenge of vanishing gradients in very deep networks through the use of skip connections, or residual blocks. These connections allow gradients to flow directly through the network, enabling the training of much deeper models without losing performance.
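To make the signal-to-image step concrete, here is a minimal NumPy sketch of the encoding pipeline: map nucleotides to integers, convolve the resulting signal with a scaled mother wavelet at each scale, and take the modulus of the complex coefficients to obtain the intensity matrix. The wavelet choice (a complex Morlet), the scale range, and the helper names are illustrative assumptions on my part, not necessarily what the linked repository uses.

```python
import numpy as np

# Illustrative mapping: each nucleotide becomes an integer (A=1, C=2, G=3, T=4).
NUC_TO_INT = {"A": 1, "C": 2, "G": 3, "T": 4}

def dna_to_signal(seq):
    """Convert a DNA string into a numeric signal."""
    return np.array([NUC_TO_INT[base] for base in seq], dtype=float)

def morlet(t, omega0=5.0):
    """Complex Morlet mother wavelet: a Gaussian-modulated complex exponential."""
    return np.exp(1j * omega0 * t) * np.exp(-t**2 / 2)

def cwt_intensity(signal, scales):
    """Wavelet transform by direct convolution at each scale.

    Returns the modulus of the complex coefficients: an
    (n_scales, len(signal)) intensity matrix that can be rendered
    as a colormap image for the classifier.
    """
    intensity = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        t = np.arange(-4 * s, 4 * s + 1)        # wavelet support
        wavelet = morlet(t / s) / np.sqrt(s)    # scaled and normalized
        full = np.convolve(signal, np.conj(wavelet), mode="full")
        start = (len(wavelet) - 1) // 2         # center the result on the signal
        intensity[i] = np.abs(full[start:start + len(signal)])
    return intensity

signal = dna_to_signal("ACGTACGTAACCGGTTACGT")  # arbitrary example sequence
intensity = cwt_intensity(signal, scales=np.arange(1, 9))
print(intensity.shape)  # → (8, 20)
```

From here, the intensity matrix can be saved as an image (for example with matplotlib's `imshow` and a colormap) and fed to the image classifier.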
ResNet50 has become a benchmark in computer vision due to its ability to learn hierarchical features from images, making it highly suitable for tasks like object classification.

For this experiment, the ResNet50 weights were frozen except for the last convolutional block (Layer4) and the final fully connected layer. The model was trained on 20,000 images over 20 epochs, delivering strong results: a peak accuracy of 77%, with 80% precision and 80% recall in distinguishing TFBS images from background sequences, reached after just the first two epochs. What's really cool is that each epoch took less than two minutes to run! That's a big deal, because other models can take up to 30 minutes to handle the same amount of data. So far, I haven't performed any hyperparameter tuning; exploring this could be an interesting way to improve the results.

Grad-CAM (Gradient-weighted Class Activation Mapping) is an explainability technique used to visualize which regions of an input image influence a deep learning model's predictions. It works by analyzing the gradients flowing into the final convolutional layer of a CNN, identifying the feature maps that contribute most to each class. This produces a heatmap-like representation that highlights the key regions of the image responsible for its classification into a specific label. For a more in-depth explanation of Grad-CAM, I suggest the following article: medium.com

By examining the activations in the three convolutional blocks of Layer4, it appears that ResNet50 primarily focuses on the central region of the sequences to identify positive samples. This is a significant pattern, as TFBS are typically located near the center of positive sequences, especially for CTCF. Interestingly, I observed a similar trend when using SVMs to classify the same CTCF-positive sequences.
The SVM, too, assigned higher weights to k-mers (subsequences of length k) located near the sequence center, precisely where motifs are expected to be.

This study explored an alternative approach to TFBS prediction by leveraging wavelet transforms and deep learning, specifically ResNet50. By converting DNA sequences into signal-based colormaps, we were able to take advantage of pre-trained image classification models, which showed strong performance in distinguishing TFBS from background sequences.

The results showed that ResNet50 could quickly learn to recognize relevant patterns, achieving a peak accuracy of 77% within just two epochs. Additionally, explainability techniques like Grad-CAM provided valuable insights into how the model focuses on the central region of sequences, where TFBS are typically located!

While these initial results are promising, there is still room for improvement. Future work could explore hyperparameter tuning, alternative genomic signal processing techniques for DNA encoding, or applying wavelet transforms to other exciting DNA-related tasks.