In July 2016, I attended the Explorations in Data Analyses for Metagenomic Advances in Microbial Ecology (EDAMAME) Workshop at Michigan State University's Kellogg Biological Station, led by Drs. Ashley Shade, Adina Howe, and Tracy Teal. EDAMAME started from the ground up and was a great way to get hands-on experience using cutting-edge metagenomics techniques. That said, I would expect any one of the organizers to say that a great deal of effort and critical thinking is required to stay abreast of current technologies.
I learned a great deal by attending EDAMAME and came away with new colleagues, new perspectives, and new questions. Reflecting on this experience, it is helpful for me (and maybe for others?) to summarize what I thought were the big takeaway messages. They might seem obvious, but I think they are important for people like me who are only just beginning to get a handle on metagenomics. I’ll explain these in more detail below, but briefly, I believe my three main takeaways from #EDAMAME2016 encompass key steps in advancing “big data” studies: (1) pursuing hypothesis-driven research, (2) planning ahead, and (3) practicing reproducible science.
Advances in sequencing technology over the last 10 to 15 years have made it possible for microbial ecologists, as well as non-experts like myself, to peer into the proverbial microbial “black box” of diverse environments. With this ability, it is very tempting to sequence as many samples as time and money allow in order to survey who is there and what genes are present. Looking through this large amount of data in search of interesting findings without a specific research question in mind is referred to as discovery-driven research. I do not wish to discount this type of work; we talked about how important it can be for discovering novel medicines. However, at EDAMAME we spent a great deal of time discussing the emerging need in microbial ecology for hypothesis-driven research, in which you start with a specific research question in mind. I focus here on hypothesis-driven research because our discussions at EDAMAME helped me realize that this approach provides a more structured way for inexperienced people like me to make effective use of cutting-edge microbial ecology data and tools. What I mean to say is that a hypothesis-driven approach allows the researcher to stay focused on the question at hand and to use statistical tools to test for the significance of a treatment effect. Prosser (2013) also points out that hypothesis-driven research can “provide counter observational, non-intuitive predictions and conceptual frameworks”. That is to say, this type of approach may help us explain when and why we observe something unexpected. When addressing questions concerning the linkage between microbial and ecosystem scale processes, which is of particular interest to me, the hypothesis-driven approach is helpful and encouraged (Schimel & Gulledge 1998, Rocca et al. 2015). To read more about the pros and cons of these approaches, you can look to Janssen & Prosser (2013).
Another interesting and recurring topic of our discussions at EDAMAME was taking the time to really plan ahead. This is related to my first takeaway because having a hypothesis necessitates planning an experiment to address it. Dr. Howe walked us through some back-of-the-envelope calculations at the start of the course that were meant to help us determine what type of sequencing data, and how much of it, we would need to adequately answer a particular research question. By going through this type of calculation in the planning stages of your experiment, you can appropriately estimate costs and get the data you need to address your hypothesis. Key questions here were: How rare is the organism (or are the organisms) I am looking for? How many times do I need to see that organism (i.e., what coverage or depth do I need)? Once you are ready to work with your data, planning ahead also applies to data analysis, a point further emphasized by Loman & Watson (2013) and Shade & Teal (2015). We also had some interesting conversations about the importance of biological replicates and mock communities in microbial ecology. It may be worthwhile to read up on recommendations for these and develop a plan before starting any experiments.
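To make the two planning questions above concrete, here is a minimal sketch of one way such a back-of-the-envelope estimate could look. It is not the calculation Dr. Howe presented; the function name, the inputs, and the example numbers are all hypothetical, and it assumes shotgun reads are drawn in proportion to each organism's share of the community DNA, which is a simplification.

```python
import math


def reads_needed(genome_size_bp, relative_abundance, target_coverage, read_length_bp):
    """Estimate total shotgun reads needed to cover one target genome.

    Assumes reads are sampled in proportion to each organism's share of the
    community DNA (a simplifying assumption, not a guarantee).
    """
    if not 0 < relative_abundance <= 1:
        raise ValueError("relative_abundance must be in (0, 1]")
    # Bases that must land on the target genome to reach the desired depth.
    target_bases = target_coverage * genome_size_bp
    # Only a fraction of all sequenced bases come from the target organism.
    total_bases = target_bases / relative_abundance
    return math.ceil(total_bases / read_length_bp)


# Hypothetical example: a 5 Mbp genome making up 1% of the community,
# sequenced to 10x depth with 150 bp reads.
example_reads = reads_needed(5_000_000, 0.01, 10, 150)
print(f"Reads needed: {example_reads:,}")
```

Under these made-up inputs the estimate comes out to roughly 33 million reads, and halving the assumed abundance doubles it. That sensitivity to rarity is why it helps to answer the "how rare?" question before budgeting for sequencing.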
Last but not least, we talked a great deal about reproducible research at EDAMAME, and we practiced it too. I imagine that Drs. Shade, Howe, and Teal envision a future where all the figures and tables of a peer-reviewed paper can be easily reproduced by researchers inside and outside of the authoring laboratory group. We discussed and experimented with the many tools available to manage and visualize data. Many of these tools, such as R and Git, are open-source. For general tips see Shade & Teal (2015) and Loman & Watson (2013). The key here is that you want to clearly document the entire data analysis storyline from pre-processing to visualization. You also want to be sure to keep track of the changes you make along the way using version control software (e.g., Git, with repositories hosted on GitHub). If you are not familiar with the software that can help you achieve reproducible research nirvana, you can visit tutorials offered by websites such as lynda.com (potentially free through your university library), software-carpentry.org (free to all), coursera.org (free to all), and many others.
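As one small illustration of documenting the analysis storyline, here is a sketch that writes down what was run, with which settings, and in which environment. It is not a tool we used at EDAMAME; the function name, step name, parameters, and file names are hypothetical, and it uses only the Python standard library.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def build_run_record(step, params, input_files):
    """Collect the details someone would need to rerun one analysis step."""
    return {
        "step": step,
        "params": params,
        "input_files": sorted(input_files),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical step name, parameters, and file names for illustration.
record = build_run_record(
    step="quality_filtering",
    params={"min_quality": 20, "min_length": 100},
    input_files=["sample_B.fastq", "sample_A.fastq"],
)
record_json = json.dumps(record, indent=2, sort_keys=True)
print(record_json)
```

Saving a record like this next to each output file, and committing it to version control along with the scripts, is one low-effort way to keep the storyline traceable.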
What’s really interesting to me is that, although the workshop was geared toward microbial ecologists, I think these takeaways apply to diverse fields, including my own personal favorites: hydrology and biogeochemistry. The need for scientists to plan ahead and to manage, analyze, and synthesize big data is growing. As I approach the end of my Ph.D. program, I have noticed that prospective employers expect recent graduates to lead the charge in data management and visualization. This need is further demonstrated in recent publications by Laurance et al. (2016) and Rode et al. (2016). I agree with both of these articles that big data brings big opportunities, but I also think we need to accept that with big data must come the enthusiasm to answer pressing questions and to manage those data effectively so future generations of scientists can make use of them.
Laurance, W., F. Achard, S. Peedell, and S. Schmitt. 2016. Big data, big opportunities. Frontiers in Ecology and the Environment. 14(7):347.
Loman, N. and M. Watson. 2013. So you want to be a computational biologist? Nature Biotechnology. 31:996-998.
Prosser, J. 2013. Think before you sequence. Nature. 494:41.
Rocca, J., E. Hall, J. Lennon, S. Evans, M. Waldrop, J. Cotner, D. Nemergut, E. Graham, and M. Wallenstein. 2015. Relationships between protein-encoding gene abundance and corresponding process are commonly assumed yet rarely observed. ISME Journal. 9:1693–1699.
Rode, M., A. Wade, M. Cohen, R. Hensley, M. Bowes, J. Kirchner, G. Arhonditsis, P. Jordan, B. Kronvang, S. Halliday, R. Skeffington, J. Rozemeijer, A. Aubert, K. Rinke, and S. Jomaa. 2016. Sensors in the stream: the high-frequency wave of the present. Environmental Science and Technology. doi: 10.1021/acs.est.6b02155.
Schimel, J. and J. Gulledge. 1998. Microbial community structure and global trace gases. Global Change Biology. 4(7):745-758.
Shade, A. and T. Teal. 2015. Computational workflows for biologists: a roadmap. PLoS Biology. 13(11): e1002303. doi:10.1371/journal.pbio.1002303.
About the Author: Sheila Saia is a Ph.D. student in the Biological and Environmental Engineering Department at Cornell University and a member of the Soil and Water Lab. She studies how linkages between hydrology and microbiology affect phosphorus cycling in streams and soils.