What is Machine learning?
Machine learning is a branch of artificial intelligence that enables a machine to learn on its own from experience. It learns from training examples, and the learning itself is carried out by learning algorithms. These analytical algorithms were designed for datasets of modest size, so traditional machine learning algorithms are not directly scalable to big data analytics. Some of the prominent families of learning algorithms are:
- Supervised learning algorithms
- Unsupervised learning algorithms
- Semi-supervised learning algorithms
- Reinforcement learning algorithms
Supervised learning algorithm:
In this type of learning, the machine is trained with labeled data, so it can accurately predict the output for new test data. Supervised learning consists of two main tasks:
- Classification
- Regression
Unsupervised learning algorithm:
In this type, the machine is trained with unlabeled data. It cannot predict an output label precisely, but it can draw many inferences (such as clusters or patterns) from the given training dataset.
Semi-supervised learning algorithm:
This is a combination of supervised and unsupervised learning. It typically works with a small amount of labeled data and a large amount of unlabeled data, and it can reach higher accuracy than using either the small labeled set or the unlabeled data alone.
Reinforcement learning algorithm:
Trial and error is the central method in reinforcement learning. The two main components are the agent and the environment, and there is constant communication between the two: each time, the environment sends reward feedback to the agent, and the agent uses it to decide the best next move or action. This feedback is also termed a reinforcement signal.
What is classification?
Classification is one of the core tasks of machine learning. It assigns an upcoming item or new value to one of several categories (subpopulations), based on training data consisting of records whose category membership is already known. The value we predict is a discrete class label, for example 0 or 1 in the binary case.
For example, suppose we want to classify whether a person will be approved for a credit card. We train the machine with a dataset containing the average balance in the account, the number of transactions per month, the profession, the CIBIL score (which is calculated from the previous factors), and so on. If the CIBIL score is more than 750, the person can be approved for the credit card; otherwise not.
To check whether a person is eligible for a credit card, we map his or her average balance against the number of transactions and calculate the CIBIL score. If the resulting point lies in one region of the feature space the person cannot get the credit card; if it lies in the other region the person can.
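The 750-score cutoff above can be sketched as a one-line decision rule; the applicant names and scores below are made-up values for illustration:

```python
def approve_credit_card(cibil_score):
    """Approve the card when the CIBIL score exceeds the 750 cutoff."""
    return cibil_score > 750

# Illustrative applicants: (name, CIBIL score)
applicants = {"A": 802, "B": 640, "C": 750}
for name, score in applicants.items():
    print(name, "approved" if approve_credit_card(score) else "rejected")
```

Note that the cutoff is strict, so a score of exactly 750 is still rejected.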
Applications of classification:
- Finding spam emails.
- Cancer diagnosis.
- Self-driving cars.
- Identifying blood groups.
Types of classification techniques:
- Rule-based classifiers
- Decision trees
- Naïve Bayes classifier
- Support vector machines
- Artificial neural networks
- K-nearest neighbors (KNN)
Decision tree classification:
It builds a decision tree consisting of at least two nodes and can handle both numerical and categorical data. The topmost decision node is called the root node. The tree makes its decisions based on a class-labeled dataset. A decision tree can be built using many algorithms, some of which are:
- ID3
- CART
- C4.5
- CHAID
- Hunt's algorithm
ID3 is the base algorithm from which the others are derived. CART (classification and regression trees) is derived from ID3.
CHAID (chi-square automatic interaction detector) is used for classifying categorical data and for searching for patterns in large amounts of categorical data; with it, relationships among the data can be visualized easily.
C4.5 is an extension of ID3 and is also called a statistical classifier.
Below is an example of a decision tree built using Hunt's algorithm.
What is Hunt’s algorithm?
Hunt's algorithm is one of the decision-tree building algorithms. It works as follows:
- If all the records in a node belong to the same class, the node becomes a leaf labeled with that class name.
- If the records in a node belong to more than one class, choose an attribute and split the data into smaller subsets, then repeat on each subset.
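The two steps above can be sketched recursively. This is a minimal illustration on a tiny made-up subset of the credit-card data, in which the splitting attribute is simply the next unused one; a real implementation would choose it by entropy or information gain:

```python
def hunts(records, attributes):
    """Recursive sketch of Hunt's algorithm: pure node -> leaf, else split."""
    labels = {r["label"] for r in records}
    if len(labels) == 1:                      # all records share one class: leaf
        return labels.pop()
    if not attributes:                        # no attribute left: majority class
        return max(labels, key=lambda c: sum(r["label"] == c for r in records))
    attr, rest = attributes[0], attributes[1:]
    tree = {}
    for value in {r[attr] for r in records}:  # split on each value of the attribute
        subset = [r for r in records if r[attr] == value]
        tree[value] = hunts(subset, rest)
    return {attr: tree}

data = [
    {"balance": "Low",  "score": "Sufficient",   "label": "No"},
    {"balance": "High", "score": "Insufficient", "label": "Yes"},
    {"balance": "High", "score": "Sufficient",   "label": "Yes"},
]
print(hunts(data, ["balance", "score"]))
```

On this toy data a single split on balance already yields pure leaves, so the score attribute is never needed.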
Example:
PROFESSION | AVERAGE BALANCE | CIBIL SCORE  | APPROVED
Doctor     | Low             | Sufficient   | No
Doctor     | High            | Sufficient   | No
Doctor     | High            | Insufficient | No
Software   | Low             | Sufficient   | No
Software   | Low             | Insufficient | No
Software   | High            | Insufficient | Yes
Software   | High            | Sufficient   | Yes
Business   | Low             | Sufficient   | Yes
Business   | High            | Insufficient | Yes
Business   | High            | Sufficient   | Yes
Given a dataset like this, we can perform decision tree classification using any of the algorithms; here we use Hunt's algorithm.
To build the decision tree we have to calculate the entropy and the information gain of each attribute.
Entropy: entropy measures the impurity of a dataset, i.e., how mixed the class labels are. It is 0 for a pure set and maximal when the classes are evenly mixed.
Information gain: the difference between the entropy of the parent node and the weighted entropy of the subsets produced by splitting on the current attribute.
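These two quantities can be computed directly from the credit-card table above; the helper names below are our own. Column order is profession, balance, CIBIL score, approved:

```python
from collections import Counter
from math import log2

data = [
    ("Doctor",   "Low",  "Sufficient",   "No"),
    ("Doctor",   "High", "Sufficient",   "No"),
    ("Doctor",   "High", "Insufficient", "No"),
    ("Software", "Low",  "Sufficient",   "No"),
    ("Software", "Low",  "Insufficient", "No"),
    ("Software", "High", "Insufficient", "Yes"),
    ("Software", "High", "Sufficient",   "Yes"),
    ("Business", "Low",  "Sufficient",   "Yes"),
    ("Business", "High", "Insufficient", "Yes"),
    ("Business", "High", "Sufficient",   "Yes"),
]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

def information_gain(rows, col):
    """Parent entropy minus the weighted entropy of the splits on column col."""
    parent = entropy([r[-1] for r in rows])
    children = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        children += len(subset) / len(rows) * entropy(subset)
    return parent - children

print(round(entropy([r[-1] for r in data]), 3))   # parent entropy -> 1.0 (5 Yes, 5 No)
print(round(information_gain(data, 0), 3))        # gain of splitting on profession -> 0.6
```

Splitting on profession leaves the Doctor and Business subsets pure, which is why its gain is high.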
Subtable (records with PROFESSION = Software):

PROFESSION | AVERAGE BALANCE | CIBIL SCORE  | APPROVED
Software   | Low             | Sufficient   | No
Software   | Low             | Insufficient | No
Software   | High            | Insufficient | Yes
Software   | High            | Sufficient   | Yes
Rule-based classifier:
In this technique, classification is done using IF-THEN rules. It is the simplest form of classification, since the rules can be derived directly from the dataset, from decision trees, from neural networks, etc.
The main parts of a rule are:
- Antecedent: the IF part of the rule.
- Consequent: the THEN part of the rule.
Example:
(The dataset is the same credit-card approval table shown above.)
R1) IF profession = Software AND average balance = High, THEN the credit card can be approved.
R2) IF profession = Doctor AND CIBIL score = Sufficient, THEN the credit card cannot be approved.
Here we used two attributes to reach each conclusion; a rule may use any number of attributes.
Characteristics of rule-based classifiers:
- Mutually exclusive.
- Mutually exhaustive.
Mutually exclusive:
The rules derived from the dataset must not overlap: no record should trigger more than one rule, so any two rules cover disjoint sets of records.
R1 ∩ R2 = ∅
Mutually exhaustive:
This is the complementary property: the rule set must account for every possible combination of attribute values, so that every record triggers at least one rule.
Example: we can make rules from the following attribute combinations,
- Profession & average balance
- CIBIL score & average balance
- Profession & CIBIL score
- Profession & average balance & CIBIL score
Now we have a number of rules at hand. If a record triggers more than one rule at a time, conflicts arise, so we need conflict resolution strategies for rule-based classifiers.
Conflict resolution strategies:
- Size-ordering scheme.
- Rule-ordering scheme.
Size-ordering scheme:
If a record triggers several rules at once, the size-ordering scheme decides which rule fires based on the number of attributes used in each rule: the rule with the most matching attributes (the most specific rule) is triggered first.
Example:
R1 consists of 2 attributes.
R2 consists of 4 attributes.
R3 consists of 7 attributes.
An incoming record will trigger R3 first, then R2, and only then R1 if necessary.
Rule-ordering scheme:
This scheme has two subcategories that decide which rule should be triggered.
Class-based:
In this type, the incoming record is matched against a number of classes, and the rule of the matching class is triggered first.
Example:
Class1 covers values from 20 to 30 (c1 >= 20 and c1 < 30).
Class2 covers values from 30 to 50 (c1 >= 30 and c1 < 50).
Class3 covers values from 50 upward (c1 >= 50).
If the record contains the value 33, class2 is triggered.
Rule-based:
In this type, the incoming record is matched against the rules according to the priority assigned to each rule.
Example (lower number = higher priority):
Rule1 → 2
Rule2 → 3
Rule3 → 1
Rule4 → 4
If the incoming record matches attributes present in both rule3 and rule1, then rule3 is triggered because it has the higher priority.
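A minimal sketch of rule-priority dispatch, with the two credit-card rules R1/R2 from above encoded as predicates; the priority numbers and the default decision are illustrative:

```python
rules = [
    # (priority, condition, decision) -- lower number = higher priority
    (1, lambda r: r["profession"] == "Software" and r["balance"] == "High", "approve"),
    (2, lambda r: r["profession"] == "Doctor" and r["score"] == "Sufficient", "reject"),
]

def classify(record, default="reject"):
    """Fire the highest-priority rule whose condition matches the record."""
    for _, cond, decision in sorted(rules, key=lambda rule: rule[0]):
        if cond(record):
            return decision
    return default                      # no rule fired: fall back to the default

print(classify({"profession": "Software", "balance": "High", "score": "Sufficient"}))
```

Because the list is sorted by priority before evaluation, the first matching rule is always the highest-priority one.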
Approaches for rule-based classifiers:
- Direct method
- Indirect method
Direct method:
In this method we infer the rules directly from the given dataset: based on the attributes in the dataset, we build rules from combinations of them.
Sequential covering algorithm:
This is the algorithm used in the direct method to infer the rules.

Ruleset = { }                                   // initially the rule set is empty
for each class C do
    Rule = LEARN-ONE-RULE(dataset, attributes, C)
    remove the records covered by Rule from dataset
    // if a1, a2 are used in the above rule, avoid a1 and a2 in the next iteration
    Ruleset = Ruleset ∪ { Rule }
end for
return Ruleset
What happens inside LEARN-ONE-RULE?
- Consider one class.
- Pass it through the training data and check every candidate attribute test.
- Find the attribute test that most increases the accuracy of the current rule.
- Append that attribute test to the current rule.
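The loop above can be made runnable under a simplifying assumption: LEARN-ONE-RULE is reduced to picking the single (attribute = value) test with the best accuracy for the target class. Real rule learners grow longer conjunctions of tests; the data here is an illustrative toy subset:

```python
def learn_one_rule(rows, attrs, target):
    """Greedy, simplified LEARN-ONE-RULE: best single (attribute = value) test."""
    best, best_acc = None, -1.0
    for a in attrs:
        for v in {r[a] for r in rows}:
            covered = [r for r in rows if r[a] == v]
            acc = sum(r["label"] == target for r in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (a, v), acc
    return best

def sequential_covering(rows, attrs, classes):
    """Learn rules class by class, removing covered records each time."""
    ruleset = []
    for target in classes:
        remaining = list(rows)
        while any(r["label"] == target for r in remaining):
            a, v = learn_one_rule(remaining, attrs, target)
            ruleset.append((a, v, target))
            remaining = [r for r in remaining if r[a] != v]  # drop covered rows
    return ruleset

data = [
    {"balance": "Low",  "score": "Sufficient",   "label": "No"},
    {"balance": "High", "score": "Insufficient", "label": "Yes"},
    {"balance": "High", "score": "Sufficient",   "label": "Yes"},
]
print(sequential_covering(data, ["balance", "score"], ["Yes", "No"]))
```

On this toy data the learner covers the Yes class with balance = High, then the No class with balance = Low.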
Indirect method:
In this method we infer the rules from other models, such as:
- Neural networks.
- Decision trees.
- Perceptron models.
Consider decision trees: each path from the root to a leaf can be read off as a rule, with the tests along the path joined by AND.
If a record follows the path through node R1 and node R4 of such a tree,
IF (R1 AND R4)
THEN true.
If a record follows the path through nodes R1, R3, and R7,
IF (R1 AND R3 AND R7)
THEN true.
This is how the indirect method proceeds.
Rule pruning:
The central idea of pruning is to cut off the rules that do not contribute to the classifier, keeping only the rules that are actually needed for classification.
Example:
Suppose the classifier needs only R1, R2, R4, R6, and R9, but we have derived R1 through R9. We remove the remaining rules and keep only the contributing ones.
So the final rule set is R1, R2, R4, R6, R9.
Pruning is used notably by the C4.5 algorithm when it converts its decision tree into a rule set.
Quality measures of rule-based classifiers:
- Coverage:
the ratio of the number of records (tuples) that a rule applies to, to the total number of records.
Example:
If a rule covers 2 out of 10 records, then
2/10 = 20% is its coverage.
- Accuracy:
among the records a rule covers, the fraction it classifies correctly.
Example:
If we classified 4 correctly out of the 4 covered, then
4/4 = 100% accuracy.
Naïve Bayes classifier:
This classification technique is used when we are given a dataset and asked about a new combination of attribute values that is not present in the dataset. We find the class label using Bayes' theorem:
P(A|B) = P(B|A) · P(A) / P(B)
where,
P(A|B) = posterior probability of A given that B has occurred.
P(B|A) = the likelihood term.
P(A) = prior probability of A.
P(B) = prior probability of B.
Example:
(The dataset is the same credit-card approval table shown above.)
If we are asked to find the class label for the new record (Software, High, Sufficient, ?),
the Bayes computation gives the higher posterior to Yes, so the class label for (Software, High, Sufficient) is Yes.

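As a check, the two naive-Bayes scores for this query can be computed from the frequencies in the table; the function name below is our own:

```python
data = [
    ("Doctor",   "Low",  "Sufficient",   "No"),
    ("Doctor",   "High", "Sufficient",   "No"),
    ("Doctor",   "High", "Insufficient", "No"),
    ("Software", "Low",  "Sufficient",   "No"),
    ("Software", "Low",  "Insufficient", "No"),
    ("Software", "High", "Insufficient", "Yes"),
    ("Software", "High", "Sufficient",   "Yes"),
    ("Business", "Low",  "Sufficient",   "Yes"),
    ("Business", "High", "Insufficient", "Yes"),
    ("Business", "High", "Sufficient",   "Yes"),
]

def nb_score(query, label):
    """P(label) times the product of P(feature_i = value | label),
    under the naive independence assumption."""
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                      # class prior
    for i, value in enumerate(query):
        score *= sum(r[i] == value for r in rows) / len(rows)
    return score

query = ("Software", "High", "Sufficient")
yes, no = nb_score(query, "Yes"), nb_score(query, "No")
print(round(yes, 3), round(no, 3))   # -> 0.096 0.048
print("Yes" if yes > no else "No")   # -> Yes
```

The Yes score (0.5 · 2/5 · 4/5 · 3/5 = 0.096) beats the No score (0.5 · 2/5 · 2/5 · 3/5 = 0.048), confirming the class label Yes.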
Support vector machines:
An SVM classifies a new or upcoming item into its respective class. It is a discriminative classifier: its output is a hyperplane, and items are categorized by which side of that hyperplane they fall on. The hyperplane is flanked by the support vectors, the training points closest to it.
The distance of the support vectors from the hyperplane must be maximized; this distance between the hyperplane and the support vectors is called the margin.
Locating the hyperplane is done with optimization routines (quadratic programming) provided by built-in libraries.
When the points are linearly separable in 2-D, a straight line suffices. If they are not, we can map them into a higher-dimensional (e.g., 3-D) space where they become separable, find the separating plane there, and map it back; libraries such as scikit-learn handle this mapping (the kernel trick) for us. The hyperplane is located so that the margin to the nearest points of each class is as large as possible.
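A small numeric sketch of the margin idea: for a candidate hyperplane w·x + b = 0 in 2-D, the margin is the smallest point-to-line distance, and SVM training searches for the w and b that maximize it. The points and the candidate line here are illustrative:

```python
from math import hypot

# Two tiny classes in 2-D: label -1 and label +1
points = [((1, 1), -1), ((2, 1), -1), ((4, 4), 1), ((5, 3), 1)]
w, b = (1.0, 1.0), -4.5          # candidate separating line: x + y = 4.5

def distance(p):
    """Perpendicular distance from point p to the line w.x + b = 0."""
    return abs(w[0] * p[0] + w[1] * p[1] + b) / hypot(*w)

# The margin of this candidate line is the distance to its closest point
margin = min(distance(p) for p, _ in points)
print(round(margin, 3))          # -> 1.061
```

Here the closest point is (2, 1), at distance 1.5/√2 ≈ 1.061; an SVM solver would adjust w and b until no other line achieves a larger margin.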
Artificial neural network:
It is a simulation of biological neurons. Just as the human brain can process information and make instantaneous decisions in critical situations, an artificial neural network simulates the brain, with natural neurons replaced by artificial neurons.
An artificial neural network accepts only numerical data, not categorical data, whereas a decision tree accepts both numerical and categorical data.
The basic structure of an artificial neural network:
Input layer:
This layer accepts the input values given by the user and passes them to the hidden layer.
Hidden layer:
It computes a real-valued output based on the weighted inputs it receives.
Output layer:
The output of the hidden layer is the input to the output layer, which computes the final output of the neural network.
The weighted edges between the nodes are also called synapses (→).
A synapse represents the knowledge gained by the neuron.
There are two types of networks in neural networks:
- Feed-forward networks.
- Feed-backward networks.
Feedforward neural network:
This type of network proceeds from left to right, and no feedback is given to the input layer. It is again divided into two types:
- Fully connected neural networks: every neuron is connected to every other neuron in the network.
- Partially connected neural networks: only some neurons are connected to the neurons in the other layers.
Feed backward neural network:
This type of network provides feedback to the input layer so that the weights can be adjusted accordingly and the error corrected.
Backpropagation neural network:
In this neural network, every neuron is viewed as having two parts.
- Σ → the summation part: the weighted sum of the neuron's inputs,
Σ = x₁w₁ + x₂w₂ + x₃w₃ + …
where,
x = an input arriving at the neuron from the previous layer.
w = the weight on the edge from one neuron to a neuron in the next layer.
- f → the activation part: it applies the activation function to the sum and thereby determines the neuron's output.
There are several activation functions.
What is an activation function?
It defines the output of a node given its weighted-sum input. Common activation functions are:
- Sigmoidal function
- Step function
- Signum function
- Linear function
Sigmoidal function:
Its graph is a smooth S-shaped curve: σ(x) = 1 / (1 + e^(−x)) maps any input into the range (0, 1).
Step function:
Its graph jumps from 0 to 1 at the threshold: the output is 0 below the threshold and 1 at or above it.
Signum function:
Its graph takes the value −1 for negative inputs, 0 at zero, and +1 for positive inputs.
Linear function:
Its graph is a straight line through the origin: f(x) = x, so the output equals the input.
Backpropagation neural network example:
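The worked example did not survive in this copy of the notes, so here is a minimal substitute: one forward pass and one backpropagation weight update for a single sigmoid neuron with two inputs. The inputs, initial weights, target, and learning rate are illustrative values:

```python
from math import exp

x = (1.0, 0.5)                 # inputs
w = [0.4, -0.2]                # initial weights
target, lr = 1.0, 0.5          # desired output and learning rate

# Forward pass
s = x[0] * w[0] + x[1] * w[1]  # weighted sum (the Σ part)
out = 1.0 / (1.0 + exp(-s))    # sigmoid activation (the f part)

# Backward pass for squared-error loss E = (out - target)^2 / 2:
# dE/ds = (out - target) * out * (1 - out) by the chain rule
delta = (out - target) * out * (1.0 - out)
for i in range(2):
    w[i] -= lr * delta * x[i]  # dE/dw_i = delta * x_i

print([round(wi, 3) for wi in w])   # -> [0.452, -0.174]
```

Since the output (≈ 0.574) is below the target 1.0, both weight updates push the weighted sum upward, as expected.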
When to consider neural networks?
- When the input is raw data (directly from sensors).
- When the data is noisy.
- When the output may be discrete or real-valued.
- When the target function is unknown.
KNN classifier:
It is a classification algorithm that decides which class a new item C belongs to by looking at its K nearest neighbors in the training data and taking a majority vote among them.
Example:
Suppose the training data contains two classes, one drawn as red circles and the other as blue triangles, and the upcoming item C (drawn as a star) must go to either the circle class or the triangle class.
To decide, we draw a circle around C that encloses its K nearest neighbors.
Take K = 5, so the drawn circle encloses 5 neighbors.
Among them, count the votes for each class:
circles → 3 votes
triangles → 2 votes
The circle class has more votes than the triangle class, so the star is assigned to the circle class.
Limitations:
- K must not be a multiple of the number of classes C.
- K should be an odd number; if it is even, there may be a tie, with equal numbers of neighbors from different classes.
- The prediction-time complexity of this algorithm is much higher than that of the other classifiers, since every training point must be compared against the query.
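The voting procedure above can be sketched in a few lines; the 2-D points and the query mimic the circle/triangle picture (all values illustrative):

```python
from collections import Counter
from math import hypot

# Training points: ((x, y), class label)
train = [((1, 1), "circle"), ((2, 1), "circle"), ((1, 2), "circle"),
         ((2, 2), "circle"),
         ((6, 5), "triangle"), ((7, 6), "triangle"), ((6, 7), "triangle")]

def knn_predict(query, k=5):
    """Majority vote among the k training points nearest to the query."""
    nearest = sorted(train, key=lambda t: hypot(t[0][0] - query[0],
                                                t[0][1] - query[1]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((2, 3)))   # star-like query point near the circles
```

With K = 5 the query (2, 3) picks up four circle neighbors and one triangle, so it is classified as a circle; note the brute-force sort over all training points is exactly the cost the limitation above refers to.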