Wednesday, July 24, 2013

Big Data Challenges

Problem: Huge amounts of data are produced and accumulated daily, but large-scale processing of that data on commodity computers is difficult → Big Data is difficult:
  • Commodity Hardware: We have lots of resources (thousands of cheap PCs), but they are very hard to utilize effectively
  • Parallel Programming: We have clusters with over 10,000 cores, but hand-coding and coordinating 10,000 concurrent threads is hard
  • Fault Tolerance: We have thousands of storage devices, and some of them fail every day. Failure is the norm rather than the exception
  • Scalability: Scaling up (buying ever-bigger machines) quickly hits cost and capacity limits, so systems must scale out across many commodity machines
  • Cost: There are many technologies on the market for Big Data processing, but most of them are proprietary and expensive
Solution:
  1. Hadoop: Runs on commodity hardware, supports scaling out, and is free and open source
  2. HDFS (Hadoop Distributed File System): Replicates data across nodes, providing fault tolerance and high availability (see the configuration sketch after this list)
  3. MapReduce: Splits a job into map and reduce tasks that run in parallel across the cluster (see the word-count sketch below)
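
As a minimal sketch of how HDFS replication is configured: the block replication factor is controlled by the dfs.replication property in hdfs-site.xml (it can also be set per file). The value 3 shown below is the usual default; the exact file location depends on your installation.

<configuration>
  <!-- Each HDFS block is stored on this many DataNodes; 3 is the common default -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>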
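
To make the parallel-execution point concrete, here is a sketch of the classic word-count job written against the Hadoop MapReduce Java API: the framework splits the input into blocks, runs one mapper task per split in parallel across the cluster, and reducer tasks then aggregate the per-word counts. The class names, job name, and the input/output paths taken from args are illustrative.

// WordCount.java - minimal Hadoop MapReduce example
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same mapper and reducer code runs unchanged whether the input is a few megabytes on one node or terabytes spread over thousands of nodes; the framework handles splitting, scheduling, and re-running failed tasks.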