Introduction to Data Science

What is DATA SCIENCE?

Data science is the practice of extracting knowledge and valuable information from raw data, whether that data is structured or unstructured. It draws on mathematics (statistics, probability), computer science (algorithms), reasoning skills, and domain or business knowledge.

 

Who is a DATA SCIENTIST?

A data scientist is often described as a person who is better at statistics than any software engineer and better at software engineering than any statistician.

The workflow or stages of Data Science:

  1. Question or problem definition.
  2. Acquire training and testing data.
  3. Wrangle, prepare, cleanse the data.
  4. Analyze, identify patterns, and explore the data.
  5. Model, predict and solve the problem.
  6. Visualize, report, and present the problem-solving steps and final solution.
  7. Supply or submit the results.

The data science process includes the following activities:

  1. Data selection.
  2. Preprocessing.
  3. Transformation.
  4. Data Mining.
  5. Interpretation and evaluation.

It is an iterative process in which some, or all, steps may be repeated.
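As a rough illustration, the five activities above can be written as one small function each and chained together. This is only a sketch using the Python standard library: the records, the function names, and the 1 to 5 rating scale are all made up for the example, and a simple average stands in for a real data mining step.

```python
from statistics import mean

# Hypothetical raw records: (product name, rating). None marks a missing rating.
RAW = [("bar_a", 3.5), ("bar_b", None), ("bar_c", 2.0), ("bar_d", 4.0), ("bar_e", 3.0)]

def select(records):
    """1. Data selection: keep only the field of interest (the rating)."""
    return [rating for _name, rating in records]

def preprocess(ratings):
    """2. Preprocessing: drop missing values."""
    return [r for r in ratings if r is not None]

def transform(ratings, lo=1.0, hi=5.0):
    """3. Transformation: rescale ratings from the assumed [lo, hi] scale to [0, 1]."""
    return [(r - lo) / (hi - lo) for r in ratings]

def mine(scaled):
    """4. Data mining: a plain average stands in for a real model here."""
    return mean(scaled)

def evaluate(score, threshold=0.5):
    """5. Interpretation and evaluation: turn the number into a statement."""
    return "above midpoint" if score > threshold else "at or below midpoint"

score = mine(transform(preprocess(select(RAW))))
result = evaluate(score)
print(score, result)
```

In a real project the chain is rarely run once from start to finish. A poor result in step 5 usually sends you back to an earlier step, which is what makes the process iterative.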

Fig: The workflow of DATA SCIENCE in real-world applications.

 

To get started, what are the basic requirements for learning DATA SCIENCE?

  1. A working knowledge of machine learning algorithms such as Naïve Bayes, Decision Trees, Random Forest, and K-means clustering.
  2. Familiarity with data scraping tools and libraries such as Beautiful Soup.
  3. A solid understanding of the languages R or Python, and SQL.
  4. The following tools installed: RStudio, Python, and Jupyter Notebook.

 

To get started, let's look at some basic Python coding skills.

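# Here we will learn how to read the contents of a CSV file into a pandas DataFrame
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Jupyter-only magic: draw plots inside the notebook
%matplotlib inline

# Check the installed pandas version and the current working directory
print(pd.__version__)
print(os.getcwd())

# 'chocolate.csv' must be in the working directory printed above
df_swing = pd.read_csv('chocolate.csv')
print(df_swing)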
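The tutorial moves straight from loading the file to plotting, so here is a minimal sketch of what a cleaning step might look like. It uses only the standard library so that it runs without chocolate.csv. The sample rows are made up, and it is an assumption (not confirmed by the original) that the percentage column is stored as text such as '70%'. The column name with a line break in it matches the 'Cocoa\nPercent' name used in the plots.

```python
import csv
import io

# Hypothetical sample; the quoted header contains a line break, like 'Cocoa\nPercent'
SAMPLE = (
    'Company,"Cocoa\nPercent",Rating\n'
    'A,70%,3.5\n'
    'B,,2.75\n'
    'C,63.5%,not rated\n'
)

def clean_rows(text):
    """Parse CSV text, flatten line breaks in headers, and convert fields to numbers."""
    reader = csv.reader(io.StringIO(text, newline=''))
    # 'Cocoa\nPercent' becomes 'Cocoa Percent'
    header = [" ".join(name.split()) for name in next(reader)]
    cleaned = []
    for raw in reader:
        row = dict(zip(header, raw))
        try:
            row["Cocoa Percent"] = float(row["Cocoa Percent"].rstrip("%"))
            row["Rating"] = float(row["Rating"])
        except ValueError:
            continue  # drop rows with missing or non-numeric values
        cleaned.append(row)
    return cleaned

rows = clean_rows(SAMPLE)
print(rows)
```

In the notebook the same idea would normally be applied to df_swing itself, for example with pd.to_numeric(..., errors='coerce') followed by dropna().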

 

# After cleaning the data, one can start analyzing or visualizing the information in it, as shown below.
# Plot the empirical cumulative distribution function (ECDF) of the ratings
x = np.sort(df_swing['Rating'])
y = np.arange(1, len(x) + 1) / len(x)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('Rating')
_ = plt.ylabel('ECDF')  # y is the cumulative fraction of ratings, not cocoa percent
plt.margins(0.04)
plt.show()
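The x and y values plotted above form what is usually called an empirical cumulative distribution function (ECDF): each sorted value is paired with its rank divided by the number of observations. The sketch below shows the same calculation in plain Python on a handful of made-up ratings, so the numbers are easy to check by hand. The function name ecdf is only illustrative.

```python
def ecdf(values):
    """Return the sorted values and their cumulative fractions (rank / n)."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

# Hypothetical ratings, made up for illustration
ratings = [3.0, 2.5, 4.0, 3.5]
xs, ys = ecdf(ratings)
print(list(zip(xs, ys)))
```

Reading one point off the result: the pair (3.0, 0.5) says that half of the ratings are 3.0 or lower.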

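# Other visualizations, styled with seaborn
import seaborn as sns
sns.set()  # apply seaborn's default styling to matplotlib plots

# Histogram of cocoa percentage
# Note: if this column is stored as text such as '70%', convert it to numbers first
_ = plt.hist(df_swing['Cocoa\nPercent'])
_ = plt.xlabel('Cocoa Percent')
_ = plt.ylabel('Count')
plt.show()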

 

# Plotting a pie chart of bean types
# Count how often each bean type occurs (df_swing is the DataFrame loaded earlier)
count_class = df_swing['Bean\nType'].value_counts(sort=True)
count_class.plot(kind='pie', autopct='%1.0f%%')
plt.title('Pie chart')
plt.ylabel('')
plt.show()
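Both the pie chart and the bar chart start from a tally of how often each category appears, which is what value_counts produces. A small sketch of the same tally with collections.Counter, on made-up bean types, may make the percentages on the pie slices easier to follow. The format string is the one passed as autopct above.

```python
from collections import Counter

# Hypothetical bean types, made up for illustration
bean_types = ["Criollo", "Trinitario", "Criollo", "Forastero", "Criollo"]

counts = Counter(bean_types)
total = sum(counts.values())

# most_common() orders categories by descending count, much like value_counts(sort=True)
shares = [(name, n, "%1.0f%%" % (100 * n / total)) for name, n in counts.most_common()]
print(shares)
```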

 

# Plotting a bar chart of company locations
count_class = df_swing['Company\nLocation'].value_counts(sort=True)
count_class.plot(kind='bar', color=['blue', 'orange'])
plt.title('Bar chart')
plt.show()

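# Applying machine learning algorithms: score a text classifier on the 20 newsgroups test set
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Assumption: text_clf is not defined in this tutorial; a Naive Bayes text pipeline is used here for illustration
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
text_clf = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
text_clf.fit(twenty_train.data, twenty_train.target)

twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
print(np.mean(predicted == twenty_test.target))  # fraction of correct predictions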
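The last line of the snippet above compares the predicted labels with the true labels and takes the mean of the matches, which is the classifier's accuracy. A tiny worked example in plain Python, with made-up labels and an illustrative function name, shows the arithmetic.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical labels: three of the four predictions are correct
predicted_labels = [0, 1, 2, 1]
true_labels = [0, 1, 1, 1]
acc = accuracy(predicted_labels, true_labels)
print(acc)
```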

 

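# Applying ML (TfidfTransformer): convert raw term counts into TF-IDF weights
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Assumption: X_train_counts is not defined in this tutorial; here it is the
# term-count matrix of the 20 newsgroups training documents
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
X_train_counts = CountVectorizer().fit_transform(twenty_train.data)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape  # (number of documents, number of distinct terms)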
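To give a feel for what TfidfTransformer does with the counts, here is a rough pure-Python sketch of TF-IDF weighting. It follows the smoothed idf and L2 normalization that, as far as the scikit-learn documentation describes them, are that class's defaults. Splitting on whitespace is a simplification, and the three documents and the function name are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rare terms get a larger value
    idf = {t: math.log((1 + n_docs) / (1 + n)) + 1 for t, n in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {t: tf * idf[t] for t, tf in term_counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Scale each document's weights to unit length; empty documents stay empty
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Hypothetical mini-corpus
docs = ["dark chocolate bar", "milk chocolate bar", "dark roast coffee"]
rows = tfidf(docs)
print(rows[2])
```

In the third document, "roast" and "coffee" appear nowhere else, so they end up with a higher weight than "dark", which also occurs in the first document.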

These were a few examples to showcase how data science looks and works. If you are interested, stay connected to this tutorial on Data Science by Anwesha Sinha (certified in Data Science by Johns Hopkins University) on Techtud to learn more.

 

 
