On March 29, 2012, the Obama Administration announced what it called a “Big Data Research and Development Initiative”. The initiative has been compared to the historical funding that brought us the internet, with a focus on transforming scientific discovery, environmental and biomedical research, education, and national security. The first batch of this funding will go to federal departments to spend on projects they deem important, such as:
- National Institutes of Health (NIH): funding to manage, analyze, visualize and extract information from large datasets related to health and disease. Specifically, the 1000 Genomes Project (200 terabyte dataset available on the AWS cloud for free). For more information, see aws.amazon.com/1000genomes
- National Science Foundation (NSF): $10 million for an ‘Expeditions in Computing’ project by UC Berkeley, an additional $2 million in research grants to train undergraduates in Big Data tools for visualization, and $1.4 million to a focused research group analyzing protein structures and biological pathways.
- Department of Defense: $60 million towards building autonomous systems as well as improving text mining in any language.
- Defense Advanced Research Projects Agency (DARPA): $25 million towards the XDATA program to support open source software toolkits to improve computational techniques for analyzing big data.
- Department of Energy: $25 million towards the Scalable Data Management, Analysis, and Visualization (SDAV) Institute at UC Berkeley (lucky university), which will allow six national laboratories and seven universities to collaborate using the department’s supercomputers.
- US Geological Survey: Grant program through the John Wesley Powell Center for Analysis and Synthesis.
There are two challenges with investing in Big Data advancements. The first is investing in the right people and the right technology, and the second is making sure the data remains free so that more advancements can be made without requiring additional investments. On the second point, the government is trying its best to invest in projects where it will be the ‘owner’ of the data. One of the reasons why electronic medical records, big cloud-based platforms, and other technological advances of our time have been slow to innovate is that their data is guarded as closely as a company’s intellectual property. There is this notion that a company that keeps its data repositories private holds a competitive advantage, either A) to innovate down the line using its big data, or B) to increase the value of its acquisition because of the potential value inherent in its data repository. To this end, projects like the 1000 Genomes Project are a step in the right direction, since they make data accessible to everyone with a bright idea, not just the already established companies.
When you spit into a tube and send it off to 23andMe or to ancestry.com (and soon enough to helix.com), you are essentially paying for someone else to profit off of your DNA. The initial value proposition will likely be something interesting but not particularly useful, like mapping your ancestry to one of the main continents or telling you what your wine preference is based on your genes. The true value is in the secrets your DNA unlocks when grouped with thousands of others. When companies don’t have the time to collect all this data themselves, they turn to ‘data brokers’ and buy it from them. One example is Acxiom, which collects information about consumers from their online presence. This opened an opportunity for a different kind of company, such as Abine.com, which removes your information from the web for a fee.
The silver lining for innovation is that there is enough data to go around. In fact, with more than a quintillion (10^18) bytes generated every day, the world will never run out of data to analyze. That is why everyone from non-profits like Wikipedia and Project Gutenberg to governments (data.gov, data.gov.uk) provides free datasets to allow researchers, scientists, and startups to build the smart algorithms that could power our future. This is not old news; it is a new phenomenon that has been taking place over the past 2-3 years.
A list of more than 70 websites with free datasets is available here <bigdata-madesimple.com/70-websites-to-get-large-data-repositories-for-free>
Buying and Selling of Big Data: cnn.com/2012/08/23/tech/web/big-data-acxiom