18 Aug, 2016

Big Data Lab

Big Data is everywhere. Industries ranging from technology and finance to government want to use Big Data analysis techniques for knowledge discovery. Tools and techniques have been developed in recent years to handle the various aspects of Big Data analysis, such as data aggregation, scalable storage, efficient retrieval, and faster analysis, and it is important for our students to be aware of these developments. The key benefits that Big Data analytics brings to the table are speed and efficiency. Whereas a few years ago a business would have gathered information, run analytics and unearthed findings to guide future decisions, today that business can identify insights for immediate decisions. The lab builds in our students the ability to work faster and stay agile, giving them a competitive edge they did not have before. Each terminal is configured with an Intel Core i7-6700 CPU @ 3.40 GHz, 32 GB RAM and a 2 TB HDD.


Activities:

A. Faculty Development and Big Data Analytics Workshop
Immediately after the establishment of the Big Data Centre, an intensive three-week full-time faculty development program was run. Faculty members were trained on both the technological and the analytical aspects of Big Data. For example, they worked extensively with data preprocessing tools, querying and extraction of data, and data mining using well-known algorithms for clustering and classification.
On 25 April 2014, a one-day Big Data Analytics Workshop was organized at the Institute, with eminent speakers invited to address students and delegates from universities, academic institutions and corporate bodies. The workshop was a big success, with over five hundred participants.

B. Student Programs:
• A 15-day full-time program in Big Data and Data Analytics was organized by the Institute for third- and fourth-year engineering students from the IT and Computer Science disciplines. The complete course structure was designed, and the program was taught, by our faculty. The program was a complete package comprising
     o Big Data system installation
     o Working on Big Data and data extraction
     o Data mining and analyzing data using R.
  • A 30-day (90-hour) summer training program in Big Data and Data Analytics was organized by GLBITM and IBM for third- and fourth-year students of IT and Computer Science and Engineering, and third-year students of MCA. The complete course structure was designed, and the program was taught, by our faculty and IBM experts. The program was a complete package comprising
     o Big data System Installation 
     o Working on Big Data and data extraction 
     o Data mining and analyzing data using R.
     o Data Science, Machine Learning and Artificial Intelligence.
In both programs, students were encouraged to consider seriously the option of starting a start-up in Big Data or Data Analytics (or both) when they graduate from the College. Subsequent to these courses, a number of students are continuing to work on Big Data related projects.
 

C. Long-term Advanced Program in Big Data and Data Analytics
The Institute is now offering an ‘Advanced Program on Big Data and Data Analytics’. The program is open to all students and executives. It is a 140-hour program conducted on weekends and late evenings, and the course is divided into five modules. In preparation for this course, a new and modern laboratory with faster machines is being developed.

D. Skill Enhancement Program in Big Data and Data Analytics
Our department is now offering a skill enhancement program on Big Data and Data Analytics for our third-year students. It is a 48-hour program conducted for 3 hours per week.

Many projects have also been developed in the Big Data Lab, such as:
1. Facial Keypoints Detection
Problem Statement: Detect the location of keypoints on face images.
Objective: To predict keypoint positions on face images. This can be used as a building block in several applications, such as:
     o Tracking faces in images and video
     o Analysing facial expressions
     o Detecting dysmorphic facial signs for medical diagnosis
     o Biometrics / face recognition
Detecting facial keypoints is a very challenging problem. Facial features vary greatly from one individual to another, and even for a single individual there is a large amount of variation due to 3D pose, size, position, viewing angle, and illumination conditions.
Computer vision research has come a long way in addressing these difficulties, but there remain many opportunities for improvement.
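As an illustration, the sketch below shows one way this task can be approached: a small convolutional network that regresses the keypoint coordinates. It assumes the Kaggle Facial Keypoints Detection training.csv layout (an 'Image' column of space-separated pixel values for 96x96 grayscale images plus one column per keypoint coordinate) and a TensorFlow/Keras installation; it is not the lab's exact code.

# Minimal sketch: CNN regression of 30 facial keypoint coordinates (assumed Kaggle layout).
import numpy as np
import pandas as pd
from tensorflow import keras

def load_training_data(path="training.csv"):
    df = pd.read_csv(path).dropna()                              # keep rows with all 30 keypoints
    X = np.vstack([np.array(s.split(), dtype=np.float32) for s in df["Image"]])
    X = X.reshape(-1, 96, 96, 1) / 255.0                         # scale pixels to [0, 1]
    y = df.drop(columns=["Image"]).values / 96.0                 # scale coordinates to [0, 1]
    return X, y

model = keras.Sequential([
    keras.layers.Conv2D(32, 3, activation="relu", input_shape=(96, 96, 1)),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(30)                                       # 15 keypoints x (x, y)
])
model.compile(optimizer="adam", loss="mse")

X, y = load_training_data()
model.fit(X, y, validation_split=0.2, epochs=20, batch_size=64)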

2. Digit Recognizer
Problem Statement: Classify handwritten digits using the famous MNIST data
Objective: The goal in this competition is to take an image of a single handwritten digit and determine what that digit is.
The data for this competition were taken from the MNIST dataset. The MNIST ("Modified National Institute of Standards and Technology") dataset is a classic within the Machine Learning community that has been extensively studied.
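A minimal sketch of the task is shown below. It assumes the Kaggle train.csv layout (a 'label' column plus 784 pixel columns for 28x28 grayscale images) and uses a random-forest classifier as one illustrative choice of model, not necessarily the method used in the lab.

# Minimal sketch: classify MNIST digits from the assumed Kaggle train.csv layout.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

train = pd.read_csv("train.csv")
X = train.drop(columns=["label"]).values / 255.0     # scale pixel intensities to [0, 1]
y = train["label"].values

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))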

3. Titanic: Machine Learning from Disaster
Problem Statement: Predict survival on the Titanic using Excel, Python, R & Random Forests.
Objective: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
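A minimal sketch of the Random Forest approach named in the problem statement is given below. It assumes the Kaggle Titanic train.csv columns (Survived, Pclass, Sex, Age, SibSp, Parch, Fare); the feature selection and preprocessing here are illustrative only.

# Minimal sketch: Random Forest survival prediction on the Kaggle Titanic data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})    # encode sex numerically
df["Age"] = df["Age"].fillna(df["Age"].median())       # fill missing ages with the median
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, df[features], df["Survived"], cv=5)
print("cross-validated accuracy: %.3f" % scores.mean())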

4. Forest Soil Cover Analysis
Problem Statement: Forest soil analysis using R. In this project we try to predict the soil cover type using a decision tree, evaluate it with a confusion matrix, and plot the resulting tree. We use R packages such as “C50” and “caret”, which are used for data partitioning and classification and for implementing decision-tree and rule-based models.
Objective: To provide a forest soil cover analysis, checking the wilderness area and cover type of the soil on the basis of training and test data.
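The project itself used R's C50 and caret packages; the sketch below is a rough Python analogue (not the project's code) that fits a decision tree to the UCI forest cover type data, which scikit-learn can download via fetch_covtype, and prints a confusion matrix.

# Rough Python analogue: decision tree + confusion matrix on the forest cover type data.
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

data = fetch_covtype()                                   # downloads the UCI Covertype dataset
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target,
                                          test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=12, random_state=0)
tree.fit(X_tr, y_tr)

pred = tree.predict(X_te)
print(confusion_matrix(y_te, pred))                      # rows: true cover type, cols: predicted
print("accuracy:", accuracy_score(y_te, pred))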

5. Caterpillar Tube Pricing
Problem Statement: Model quoted prices for industrial tube assemblies.
Objective: Walking past a construction site, Caterpillar's signature bright yellow machinery is one of the first things we'll notice. Caterpillar sells an enormous variety of larger-than-life construction and mining equipment to companies across the globe. Each machine relies on a complex set of tubes (yes, tubes!) to keep the forklift lifting, the loader loading, and the bulldozer from dozing off.
Like snowflakes, it's difficult to find two tubes in Caterpillar's diverse catalogue of machinery that are exactly alike. Tubes can vary across a number of dimensions, including base materials, number of bends, bend radius, bolt patterns, and end types.
Currently, Caterpillar relies on a variety of suppliers to manufacture these tube assemblies, each having their own unique pricing model. Caterpillar provides detailed tube, component, and annual volume datasets, and challenges us to predict the price a supplier will quote for a given tube assembly.
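One hedged sketch of how such a pricing model could be set up is a random-forest regressor on the quote-level data. The file name (train_set.csv) and the feature columns used below are illustrative assumptions, not necessarily the competition's exact schema.

# Illustrative sketch: regressing quoted tube prices with a random forest.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

quotes = pd.read_csv("train_set.csv")                              # assumed quote-level table
features = ["annual_usage", "min_order_quantity", "quantity"]      # assumed numeric columns
X, y = quotes[features], quotes["cost"]                            # 'cost' assumed to be the quoted price

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("cross-validated RMSE:", -scores.mean())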
