Over at Scientific Computing’s blog, Gord Sissons of IBM makes an interesting case that the availability of large amounts of inexpensive storage has started to drive the evolution of HPC, much as cheap computing has changed the HPC scene over the last few decades.
“HPC is changing again, and the catalyst this time around is Big Data. As storage becomes more cost-effective, and we acquire the means to electronically gather more data faster than ever before, data architectures are being re-considered once again. What happened to compute over two decades ago is happening today with storage.”
Read the full post here.
Scientific American recently published an interesting article on their Observation Blog, titled Why Big Data Isn’t Necessarily Better Data. It nicely highlights one of the pitfalls of implementing Big Data analytics, namely, believing your data and/or data analysis are better than they are. The example used in the article is Google Flu Trends (GFT), which seeks to use Google’s search data to track outbreaks of influenza. The results have been mixed: GFT broadly tracked the CDC surveillance data, but one study found that it overestimated the prevalence of flu in 100 of 108 weeks between 2011 and 2013. Google itself has found that the GFT data is highly susceptible to media coverage, which may explain some of these issues.
The main takeaway from this article is to be conscious of factors that could affect the quality of both your data and your analysis. This quote sums it up nicely:
Big data hubris is the “often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.”
Big data is a useful tool, but like all tools it must be used properly. Some food for thought.
I’ve previously mentioned Globus, the service that allows you to transfer files to/from/between WestGrid sites from a handy web application. That article can be found here.
Compute Canada recently started rolling out a new service from Globus, called Globus Connect Server, that further enhances this application. The main benefit of Globus Connect Server is file sharing. With it, you will be able to share files from any WestGrid site and make them available to anyone, anywhere, as long as they have a free Globus account. It is currently only configured to share from Silo, however.
In November, I attended the Supercomputing conference in Denver, CO. The theme of this year’s conference was ‘HPC Everywhere’. Hundreds of presentations were given on all aspects of high-performance computing, from traditional HPC to big data and fast networking. All the major players in the HPC space, and many of the smaller ones, had exhibits, making for a lot to see in just a few days.
Data volume is growing by leaps and bounds; more and more data is being generated every day that needs to be computed, analyzed, and stored. While traditional data transfer tools like FTP, scp, sftp, and rsync work well for local or small transfers, transferring big data over long distances is still slow and prone to interruption.
Fortunately, it’s now easier than ever to get your data onto WestGrid by using Globus. This service is provided by the University of Chicago and allows quick and easy file transfers to and from any Globus-enabled system. With the free Globus Connect program, that includes your desktop.
Today, in the second post in a series on big data and data mining, I’ll be discussing MapReduce, a strategy for handling large amounts of data quickly by exploiting the power of many computers working in parallel.
The original implementation of MapReduce, along with the name, came out of Google. MapReduce originally referred to the proprietary technology Google used to handle the huge quantities of data generated by crawling the World Wide Web. As the ideas behind the technique became better known, other implementations, such as Hadoop, have been created and are available to the world at large.
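A small word-count example makes the three phases of the technique concrete. This is a minimal, single-process sketch of the MapReduce pattern for illustration only; real implementations such as Hadoop distribute the map and reduce phases across many machines and handle fault tolerance along the way:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by their key (the word).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values for one key; here, sum the counts.
    return key, sum(values)

documents = ["big data needs big tools", "map and reduce data"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts["big"], counts["data"])  # 2 2
```

Because each map call touches only one document and each reduce call touches only one word's values, both phases can run in parallel on independent machines; only the shuffle step requires moving data between them.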
I’ve recently returned from a conference on Big Data and data mining. These topics are becoming more and more important in the world of research computing, so I would like to present a series of short articles covering the issues surrounding modern data management.