Tuesday, June 4, 2013

Introduction to Big Data

Big Data is the capability to manage a huge volume of disparate data, at the right speed and within the right time frame, to allow real-time analysis and reaction.

Big data is typically broken down by 4 characteristics:
  1. Volume: How much data
  2. Velocity: How fast that data is processed
  3. Variety: The various types of data
  4. Veracity: How accurate is that data in predicting business value? Do the results of a big data analysis actually make sense?



Big Data types:
  1. Structured data
  2. Unstructured data
  3. Semi-structured data
The sources of data are divided into 2 categories:
  1. Computer- or machine-generated: Machine-generated data generally refers to data that is created by a machine without human intervention.
  2. Human-generated: This is data that humans, in interaction with computers, supply.
Structured data refers to data that has a defined length and format. Examples of structured data include numbers, dates, and strings. It is usually stored in a database.
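
As a rough illustration of "defined length and format", the sketch below stores a few records in a database table with declared column types and then queries them. The table name `sales`, its columns, and the sample rows are all hypothetical and chosen only for this example; SQLite is used simply because it ships with Python.

```python
import sqlite3

# Hypothetical point-of-sale table: every column has a declared name and type.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "sale_id INTEGER PRIMARY KEY, "
    "sold_on TEXT NOT NULL, "
    "product TEXT NOT NULL, "
    "quantity INTEGER NOT NULL)"
)

# Made-up sample rows; each one follows the same fixed layout.
rows = [
    (1, "2013-06-04", "coffee", 2),
    (2, "2013-06-04", "tea", 1),
    (3, "2013-06-05", "coffee", 3),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Because the format is known in advance, the data can be queried directly.
totals = dict(
    conn.execute(
        "SELECT product, SUM(quantity) FROM sales GROUP BY product"
    ).fetchall()
)
print(totals)
```

The point of the sketch is that no parsing or interpretation step is needed before analysis: the schema already says what each field means.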

Examples of structured machine-generated data can include the following:
  1. Sensor data: radio frequency ID (RFID) tags, smart meters, Global Positioning System (GPS) data
  2. Web log data: When servers, applications, networks, and so on operate, they capture all kinds of data about their activity. This can amount to huge volumes of data that can be useful, for example, to deal with service-level agreements or to predict security breaches.
  3. Point-of-sale data: When the cashier swipes the bar code of any product that you are purchasing, all the data associated with the product is generated. Just think of all the products across all the people who purchase them and you can understand how big this data set can be.
  4. Financial data: stock trading
Examples of structured human-generated data might include the following:
  1. Input data: Any input given by a user through HTML forms, etc.
  2. Click-stream data: Data is generated every time you click a link on a website. This data can be analyzed to determine customer behavior and buying patterns.
  3. Gaming-related data: Every move you make in a game can be recorded. This can be useful in understanding how end users move through a gaming portfolio.
Unstructured data is data that does not follow a specified format. Until recently, the technology didn’t really support doing much with it except storing it or analyzing it manually.

Examples of machine-generated unstructured data: 
  1. Satellite images: weather data, Google Earth
  2. Scientific data: seismic imagery, atmospheric data, and high energy physics
  3. Photographs and video: This includes security, surveillance, and traffic video
  4. Radar or sonar data: vehicular, meteorological, and oceanographic seismic profiles 
Examples of human-generated unstructured data:
  1. Text internal to your company: text within documents, logs, survey results, and e-mails.
  2. Social media data: YouTube, Facebook, Twitter, LinkedIn, and Flickr
  3. Mobile data: This includes data such as text messages and location information
  4. Website content: This comes from any site delivering unstructured content
Semi-structured data is a kind of data that falls between structured and unstructured data. It does not necessarily conform to a fixed schema but may be self-describing and may have label/value pairs.

Examples: EDI, SWIFT, and XML
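
To make "self-describing" and "label/value pairs" a little more concrete, here is a minimal sketch that reads a small XML document with Python's standard library. The `orders` document and its tag names are invented for this example. Note that the second order carries an extra `note` element that the first one lacks, which is the kind of variation a fixed schema would not allow.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: the tags label the values they contain.
doc = """<orders>
  <order id="1001">
    <customer>Alice</customer>
    <item>notebook</item>
  </order>
  <order id="1002">
    <customer>Bob</customer>
    <item>pen</item>
    <note>gift wrap</note>
  </order>
</orders>"""

root = ET.fromstring(doc)

# Turn each <order> into a dict of label/value pairs.
# No schema is declared up front; the labels come from the data itself.
records = []
for order in root.findall("order"):
    record = {"id": order.get("id")}
    for child in order:
        record[child.tag] = child.text
    records.append(record)

for record in records:
    print(record)
```

Each record ends up with whatever labels its source element happened to contain, so the two dictionaries do not have the same keys.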

Sources of this tutorial:
  1. Big Data for Dummies Book
  2. The real-world user case of Big Data