Structural biology and AlphaFold

@Anindyadeep | 20 minutes read

Welcome to the Annotated AlphaFold series. Disclaimer: I do not expect you to have any prior knowledge of biology (this blog post is here to give you a nice heads-up). However, I do expect you to have some knowledge of neural networks and machine learning. Specifically, familiarity with transformers is a requirement. If you are not familiar with these concepts, I would strongly encourage you to get a good grasp of them first and then come back.

<aside> ❄️

Another disclaimer: I am not an expert in this field. I am just a fellow learner like you, sharing what I learn along the way in the form of blogs. So feel free to DM me on Twitter or mail me about any flaws you spot or feedback you have. I will iterate more if required.

</aside>

Structural biology studies fundamental biological structures such as proteins, DNA, and RNA, examining their behaviors, formations, and conformations. The integration of structural biology with AI has opened numerous avenues, including predicting the 3D structures of proteins and simulating their interactions, thereby advancing omics biology in general, and medicine 3.0 and AI-driven drug discovery in particular. Among these challenges, protein folding (understanding how proteins fold and predicting the folded structure of proteins whose structure is unknown) stands out as a fundamental problem. Accurate predictions can accelerate biological research by reducing the number of wet lab experiments needed and shifting more of the work to computational (dry lab) tools. DeepMind's AlphaFold models, particularly AlphaFold 2, have made groundbreaking strides in this area: at the CASP14 assessment, AlphaFold 2 reached a median GDT_TS score of about 92 out of 100 on protein structure predictions. This achievement was recognized with the 2024 Nobel Prize in Chemistry, with one half awarded jointly to Demis Hassabis and John Jumper of DeepMind for protein structure prediction, and the other half to David Baker of the University of Washington for computational protein design.

This blog post aims to help fellow readers understand the significance of the secondary and tertiary structure of proteins, and of folding in general. How does a protein fold? How does it even form in the first place? What's the relationship between DNA, RNA, and proteins, and why is this trio one of the most fundamental building blocks of what we call life? Fasten your seatbelts, because we're about to uncover the answers to these questions one layer at a time.

How do Proteins form

Well, a lot of us tech bros only care about proteins to gain muscle lol. But if you think about it, every living organism, from the tiniest bacteria to humans, is fundamentally run by macromolecules and supramolecular complexes like DNA, DNP (deoxyribonucleoprotein), RNP (ribonucleoprotein), enzymes, etc. Proteins in particular do almost everything: providing structure to cells, catalyzing reactions, transmitting signals, and defending our body against invaders (diverse pathogens such as viruses). In the end, how each protein works comes down to massive chains of biochemical reactions, and yet they are fascinating. Now the question is: how and where do proteins originate? In short, it all starts with DNA (the instruction manual of life). Let's brush up our high school knowledge about cells and take a roller coaster ride from cells all the way to DNA.

A small rollercoaster ride of Analogies

Since I expect most readers here have very little background in biology, let's start from the basics. You probably know this: every living organism is made up of cells. In eukaryotic cells (the kind that make up plants and animals, including us), the cell contains a nucleus. The envelope surrounding the nucleus is dotted with tiny openings called nuclear pores; think of them as selective gates that allow specific molecules to enter and exit.

Within the nucleus, we find chromosomes. Chromosomes are essentially highly organized, tightly coiled structures made up of DNA wrapped around histone proteins (imagine threads wrapped around tiny spools).

DNA, or deoxyribonucleic acid, is this long, twisted molecule (a double helix, if you remember) that carries all the genetic instructions needed to build and operate an organism. Specific sections of the DNA that actually do something meaningful are called genes (we’ll come back to this later).

Fig 1: Image generated by ChatGPT-4o

Now, programmatically, you can think of DNA as the main gateway to a massive, ancient codebase. Some parts of this codebase make perfect sense, while other parts are just commented-out, outdated, or gibberish lines of code that don’t seem useful at all. The meaningful, functional parts of the codebase are what we call genes.

And here's another fun analogy: think of genes like classes in object-oriented programming. A class acts as a blueprint for creating objects; similarly, genes are blueprints for making proteins. Now, just like how we instantiate an object from a class, we instantiate something called mRNA from a gene.

And just like how you use objects to perform different functions in a program, mRNA is used as the working copy to build proteins. The process of creating mRNA from DNA is called transcription, and the process of creating proteins from mRNA is called translation.
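To make the analogy concrete, here is a minimal sketch in Python. The class names `Gene` and `MRNA`, the toy sequence, and the `transcribe` method are all made up for illustration; this is not how any bioinformatics library models things. One detail not covered yet: RNA uses the base U (uracil) wherever DNA has T, so the working copy is the gene's coding sequence with T swapped for U.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MRNA:
    """A disposable working copy made from a gene (hypothetical toy type)."""
    gene_name: str
    sequence: str


@dataclass(frozen=True)
class Gene:
    """The blueprint: it stays in the DNA and is never consumed."""
    name: str
    coding_dna: str  # toy coding sequence, written 5' -> 3'

    def transcribe(self) -> MRNA:
        # The mRNA carries the same message as the coding strand, with U instead of T.
        return MRNA(self.name, self.coding_dna.replace("T", "U"))


gene = Gene("toy_gene", "ATGGCC")

# One blueprint, many working copies: the gene itself is left untouched.
copies = [gene.transcribe() for _ in range(3)]
print(copies[0].sequence)
print(gene.coding_dna)
```

The point of the sketch is the relationship, not the biology: the blueprint is read many times, and each read produces a fresh, short-lived copy that downstream machinery can work with.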

Time to dive deeper into Transcription

Now let’s dive deeper into how proteins form and fold. As we know, DNA is the starting point for creating proteins. DNA is massive in length, but most of it “does not make sense” in terms of coding for proteins. The sections of DNA that contain useful information are called genes (see the image below).

Fig 2: Image generated by ChatGPT-4o. The purple part is a gene, and the faded parts are non-coding DNA (often loosely called "junk" DNA).

Above is a simplified structure of DNA. The four letters you see here (A, T, C, G) are the bases of the nucleotides that make up DNA. A stands for Adenine, T for Thymine, C for Cytosine, and G for Guanine. These are complex chemical structures where A always pairs with T (through two hydrogen bonds), and C always pairs with G (through three hydrogen bonds). The 3' and 5' marks indicate the direction of the strands. You can think of DNA like a coiled zip, where one strand, conventionally written 5' to 3', is the sense (coding) strand, and the other runs antiparallel to it (3' to 5') and acts as the antisense (template) strand. These directions will be useful, as you will see in the next section.
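The pairing rules are simple enough to write down as a lookup table. Below is a small sketch, assuming we follow the usual convention of writing every strand in the 5' to 3' direction; the function names and the toy sequence are my own. Because the two strands run in opposite directions, the partner strand is the complement read backwards (often called the reverse complement).

```python
# Pairing rules: A <-> T and C <-> G
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hydrogen bonds each base forms with its partner (A-T: 2, C-G: 3)
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def partner_strand(strand: str) -> str:
    """Return the opposite strand, also written 5' -> 3'."""
    # Complement each base, walking the input backwards because
    # the two strands are antiparallel.
    return "".join(PAIR[base] for base in reversed(strand))


def hydrogen_bonds(strand: str) -> int:
    """Total hydrogen bonds holding this strand to its partner."""
    return sum(H_BONDS[base] for base in strand)


strand = "ATGCGT"
print(partner_strand(strand))   # ACGCAT
print(hydrogen_bonds(strand))   # 15
```

Applying `partner_strand` twice gives back the original sequence, which is a quick sanity check that the two strands carry the same information.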

We begin with the process of transcription, which takes place inside the nucleus. During transcription, an enzyme called RNA polymerase binds to the DNA and unwinds it, causing the double helix to "unzip." RNA polymerase then starts "parsing" the template strand of the DNA in the 3' to 5' direction. In the process, it synthesizes a single-stranded molecule known as mRNA (messenger RNA).

Fig 3: Animation of transcription. Source: Reddit post

This mRNA is complementary to the DNA template strand and is synthesized in the 5' to 3' direction. As you can see in the gif above, the purple-yellow double-coiled structure is DNA and the single-coiled structure is the mRNA. The big blob-like structure is our RNA polymerase.
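Here is a toy sketch of that "parsing" step, under a couple of assumptions: the template strand is given as a string written in the 3' to 5' direction (the order in which the polymerase reads it), and the sequence itself is invented. One fact the code relies on that was not spelled out above: RNA uses U (uracil) in place of T, so an A on the template is paired with U in the mRNA.

```python
# Template DNA base -> complementary RNA base (RNA has U instead of T)
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5: str) -> str:
    """Read the template strand 3' -> 5' and build the mRNA 5' -> 3'."""
    # The first base read becomes the 5' end of the mRNA, so no
    # reversal is needed when the input is already written 3' -> 5'.
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)


template = "TACGCA"  # toy template strand, written 3' -> 5'
mrna = transcribe(template)
print(mrna)  # AUGCGU
```

Notice that the result matches the sense strand (ATGCGT) with every T replaced by U, which is why the sense strand is also called the coding strand.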

A fun fact: It's interesting how RNA polymerase knows where to start and stop. If you're familiar with tokenization in LLMs, it's similar to how we have special tokens like <BOS> and <EOS>. In DNA, specific sequences called promoters signal where RNA polymerase should begin, and terminators mark the end, guiding the process just like special tokens in LLMs. The mRNA (messenger RNA) produced this way is a single-stranded molecule.
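The special-token analogy can be sketched as a simple string search. This is a deliberately oversimplified toy: the two motifs below are placeholders chosen for illustration, and real promoters and terminators are fuzzier signals that the cell recognizes with extra machinery rather than exact substring matches. The function name and the example sequence are also made up.

```python
from typing import Optional

# Hypothetical marker sequences, standing in for <BOS> and <EOS>
TOY_PROMOTER = "TATAAT"
TOY_TERMINATOR = "TTTTTT"


def find_transcribed_region(dna: str) -> Optional[str]:
    """Return the stretch between the first promoter and the next terminator."""
    start = dna.find(TOY_PROMOTER)
    if start == -1:
        return None  # no start signal, nothing gets transcribed
    body_start = start + len(TOY_PROMOTER)
    end = dna.find(TOY_TERMINATOR, body_start)
    if end == -1:
        return None  # no stop signal found in this toy model
    return dna[body_start:end]


dna = "GGCC" + TOY_PROMOTER + "ATGGCCAAG" + TOY_TERMINATOR + "CCGG"
print(find_transcribed_region(dna))  # ATGGCCAAG
```

Just like a tokenizer ignores everything outside the `<BOS>` and `<EOS>` markers, the flanking GGCC and CCGG never make it into the output.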