A high-profile paper about another epigenome that says the same thing, yet again.
I think YOU need a little more serious science discussion in your life! So here is a post that originally went up on Mad Art Lab, reviewing a study that uses a new way to probe genomes, called ATACseq. Put your thinking cap on and dive in!
This post is written by Mad Art Lab contributor Elizabeth Finn. Elizabeth is a PhD Candidate studying Genetics in the San Francisco Bay Area. She specializes in the epigenetics of mammalian development. In her free time, she is an aerialist, a dancer, a clothing designer, and an author. You can find her on Tumblr at madgeneticist.tumblr.com, on Twitter at @lysine_rich, and also on Facebook or Google+.
A recent study developed a new way of probing the genome for “accessible” sites, and found the same things that we’ve been seeing for years.
A new method, called ATACseq, uses a transposase that inserts at random positions within accessible DNA, and sequences from these insertions. Researchers applied this method to single cells, sorted into individual wells via microfluidics, and... didn’t find very much.
Combining high-throughput DNA sequencing methods with microfluidics to get a “single cell genome” of sorts is a hot topic right now. It seems promising, but the fact of the matter is that sequencing methods are noisy by nature. There are random fluctuations in which molecules get sequenced (basically, sampling error), and these are compounded by consistent biases due to chemistry. It can all be washed out, and averaged out, when you have two million copies of every gene to work from. But it’s problematic when you have a single genome: two copies of each region of interest, which means the “maximum” signal is 2, the “minimum” is 0, and there is only one intermediate value. Even a little bit of noise can make a big difference there. The authors don’t really propose anything new to deal with this noise, other than averaging over genomic distance and over many cells, meaning that the big advantage of using the microfluidics to get “single-cell” data here is that your mean also comes with a standard deviation. You can’t say much from any individual cell; you can still only interrogate them on average. And having a standard deviation is an important enough thing that most of the findings of the paper were based on leveraging the variance rather than the mean. So, okay.
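To make the noise argument concrete, here is a toy simulation. It is only a sketch and not the authors' analysis: the function names, the chance that a copy is open (`p_open`), and the chance that an open copy actually gets sequenced (`p_capture`) are all made-up assumptions. It shows that a single cell can only ever report 0, 1, or 2 for a region, while averaging over pools of cells shrinks the spread.

```python
import random
import statistics

def observe_cell(rng, copies=2, p_open=0.5, p_capture=0.3):
    """Count how many of a cell's copies of one region are open AND captured.

    p_open and p_capture are illustrative assumptions, not measured values.
    """
    seen = 0
    for _ in range(copies):
        is_open = rng.random() < p_open
        is_captured = rng.random() < p_capture
        if is_open and is_captured:
            seen += 1
    return seen  # always 0, 1, or 2 when copies=2

def simulate_cells(n_cells, seed=0):
    """Simulate the same region in n_cells independent cells."""
    rng = random.Random(seed)
    return [observe_cell(rng) for _ in range(n_cells)]

def pooled_means(n_pools, cells_per_pool, seed=1):
    """Average signal per cell in each of n_pools pools of cells."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_pools):
        pool = [observe_cell(rng) for _ in range(cells_per_pool)]
        means.append(statistics.mean(pool))
    return means

single_cells = simulate_cells(1000)
pools = pooled_means(n_pools=50, cells_per_pool=500)

print("single-cell values seen:", sorted(set(single_cells)))
print("spread across single cells:", statistics.pstdev(single_cells))
print("spread across 500-cell pools:", statistics.pstdev(pools))
```

With these assumed parameters, the expected signal per cell is 2 × 0.5 × 0.3 = 0.3, so most individual cells report 0 even though the region is open half the time. Only the pooled averages sit reliably near that value.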
My biggest problem, though, wasn’t the oversell on getting ‘single cell’ data from this. My biggest problem was that they didn’t see anything biologically interesting: “accessibility” isn’t a fundamentally new way of looking at the genome, and this study didn’t find anything fundamentally exciting and new in their analysis. I spent a long time wondering how to write about this because I couldn’t distill it down into one cool, big, finding. There isn’t one. They see roughly the same things that we have always seen or would expect to see: cell-type-specific transcription factors, with levels that vary widely between cells, have binding sites that vary greatly in accessibility between cells. This might be because in some cells, the transcription factor is present and binding, and in others it isn’t. General transcription factors, with levels that do not vary as widely, have sites that do not vary greatly in accessibility. By looking at which clusters of sites tend to vary in accessibility together, they found a pattern of blocks of open chromatin and closed chromatin which neatly replicated other studies that had seen blocks of open chromatin and closed chromatin.
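The kind of comparison described above can be sketched in a few lines. This is a hypothetical illustration with made-up numbers, not the paper's data or its actual statistic: each list holds, per cell, the fraction of one factor's binding sites that look open, and the factors are ranked by how much that fraction varies between cells.

```python
import statistics

# Made-up accessibility scores (NOT real data): one value per cell, giving
# the fraction of a factor's binding sites that look open in that cell.
toy_scores = {
    "cell_type_specific_factor": [0.05, 0.80, 0.10, 0.75, 0.90, 0.02],
    "general_factor": [0.60, 0.62, 0.58, 0.61, 0.59, 0.60],
}

def variability(scores):
    """Standard deviation across cells for each factor's set of sites."""
    return {name: statistics.pstdev(values) for name, values in scores.items()}

def rank_by_variability(scores):
    """Factor names sorted from most to least variable across cells."""
    spread = variability(scores)
    return sorted(spread, key=spread.get, reverse=True)

ranking = rank_by_variability(toy_scores)
print(ranking)
```

In this toy setup the cell-type-specific factor comes out on top, which is exactly the unsurprising result the post is complaining about.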
Don’t get me wrong: this is a good piece of work, and my sticking point is by no means limited to this work or these researchers. ATACseq, and transposase-mediated methods, seem like they would improve high-throughput sequencing studies. Most studies built around high-throughput sequencing of DNA fragments require cutting the DNA into small pieces, and then gluing adapters (which can guide the sequencing reaction) onto the ends of those pieces. Methods like ATACseq are more straightforward, because the transposase cuts the DNA and attaches the adapters in a single step, and should thereby yield improved reliability and results. In the next few years, I can see transposases replacing DNases for much of sequencing library preparation, and reducing the amount of time and effort it takes to do high-throughput sequencing studies, which would be great. But while the technical improvement noted in this study was significant, the biology they observed was all stuff we’ve seen already. As a paper demonstrating an improved way to study chromatin accessibility, it’s great. As a paper demonstrating principles of chromatin accessibility, it doesn’t go very far. If it were limited to this one paper, it wouldn’t be an issue at all. But this kind of paper is more and more common: another massive data set that fewer and fewer people can understand, published even though it lacks much by way of new, interesting biology.
Things that would have been cool, but I didn’t see: a way to turn this data into a model of the shape of the chromosome that can be tested. A way to multiplex this analysis with RNA sequencing or RT-PCR, to see if more “open” chromatin is actually more transcribable in any individual cell. A measure of whether the specific cells with higher levels of a transcription factor were actually more “accessible” at the binding sites of that transcription factor, or a way to start teasing apart the tricky, important chicken-and-egg question of which happens first: DNA being easily accessible to proteins, or proteins binding to DNA. (Perhaps even a way to identify so-called “pioneer factors” that can reorganize chromatin, as opposed to more mundane “transcription factors” that cannot.) Most importantly: a single, overarching hypothesis, and an elegant experiment to back up the biology.
I’ve been seeing a lot of studies like this recently. I’ve been pulled into a couple. I want to start fighting back against this trend, but it’s hard to see how. Right now in biology we’re in love with our tools: every grad student and postdoc is told to make a tool to make their name. It’s worked for many of our mentors, so why change it? Mostly, because I want to study biology, not study how to build tools to study biology. We’re swimming in so much data that it can be hard to make heads or tails of any of it, and it seems like some of the best scientists (or at least the most successful ones) get ahead by adding to that cacophony more than clarifying it. It’s only made worse by the fact that most readers, most reviewers, and even many authors don’t understand the statistics they need to analyze their own data. I’d love to see a return to simple, elegant experiments with clear results used in tandem with these “next-generation” techniques. But as long as we remain enamored with new techniques and big data, and as long as we don’t understand enough math and statistics to interpret that data, we’ll be hard-pressed not to get more and more undecipherable cacophony.
Featured image derived from “Nucleosome1” by Thomas Splettstoesser. Licensed under CC BY-SA 3.0 via Wikimedia Commons
I think a lot of valid points are raised here, but I take issue with the characterization of “data dumps” as a “cacophony.” Data science is a growing, thriving field that has arisen to tackle the excess that exists, and I personally am not sure I can be convinced that it is ever a bad thing to have “too much data.” I agree, though, that it can be frustrating, as a grad student, if one’s interests align more with basic biology than with bioengineering or applied statistics and s/he is forced toward the latter (and boy, do grad students have plenty of experience in the industrious field of “Doing Shit We’d Rather Not!”)
I’d have liked to see a link to the study itself; this article discusses what the study does NOT show but I’d be interested in reading what it does. Different perspectives seeing things differently and all.
Here is a link to what (I think) is the paper.
I agree with alwaysanswera: characterizing data as a “cacophony” is simply demonstrating that you don’t understand how it all fits together.
Understanding how individual cells are different from each other is going to be critically important in the human brain, where ~25% of cells have large (and unique) CNVs (of mega-base size). In the brain, each cell is doing something different (connecting to ~10,000 other cells in a unique pattern). Are those CNVs important? CNVs have been associated with neurological disorders (like autism and schizophrenia). If the “normal” brain has a zillion cells with CNVs, are germ-cell CNVs important or does the brain generate its own?
The problem is that there isn’t enough funding for individuals to analyze complex data (or even data that isn’t so complex).