I am a novice at machine learning techniques, as my research mainly relies on statistical tools. But I find learning algorithms fascinating, so when I am not doing research I enjoy spending time on these new tools.

Here is my first attempt at the K-Nearest Neighbors (kNN) algorithm, without using any dedicated kNN package.

**What it does**: You need a training and a testing data set (say, extract a random 25% of observations from a large data set as testing data, and use the rest as training). Let "class" be the variable in the training data set that you want to predict in the testing data set, using all other variables. In other words, the variables in the two data sets are identical, and the "class" variable in the testing data is set to 9999 or NA.

So here are the steps to predict "class" in the testing data, using the kNN function I wrote in R. This simple implementation is helpful for showing how kNN works, which is often hidden inside high-level packages.

Step 1: Load both the training and testing data sets into R.

Step 2: Run the following code:

# The training data in the following example is named: training.dat
# count() below comes from the plyr package
library(plyr)

knn <- function(df, k) {

  var <- c(" ")  # fill in the variable names of your data set
  colnames(df) <- var

  output <- matrix(ncol = 1, nrow = nrow(df))

  for (i in 1:nrow(df)) {

    # stack the i-th testing row on top of the training data
    c.mat <- rbind(as.matrix(df[i, ]), as.matrix(training.dat))

    var2 <- c(" ")  # fill in the predictor variable names

    # Euclidean distance from the testing row (row 1) to every row
    distance <- as.matrix(dist(c.mat[, var2]))
    dist <- round(as.numeric(distance[, 1]), 2)

    c.mat1 <- as.data.frame(cbind(c.mat, dist))
    c.mat1$dist <- as.numeric(as.character(c.mat1$dist))

    # sort by distance; row 1 is the testing row itself (distance 0)
    c.mat2 <- c.mat1[order(c.mat1$dist), ]

    # keep the k nearest training rows, skipping the testing row itself
    c.mat3 <- c.mat2[2:(k + 1), ]

    # count how often each class (here the "TAX" variable) appears
    # among the k neighbours
    n <- as.data.frame(count(c.mat3, "TAX"))

    # majority vote: take the most frequent class; if every class appears
    # only once, break the tie with the smallest class value
    if (max(n$freq) == 1) {
      nn <- min(as.numeric(as.character(n$TAX)))
    } else {
      nn <- as.numeric(as.character(n$TAX[n$freq == max(n$freq)]))[1]
    }

    output[i, ] <- nn

  }

  write.csv(output, "output.csv")

}
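For readers outside R, the same neighbour-voting logic can be sketched in a few lines of Python. This is only an illustrative version under the same assumptions as the R function (Euclidean distance, majority vote, smallest class value as tie-breaker); the names `predict_knn`, `train_X`, and `train_y` are mine, not from the post.

```python
from collections import Counter
import math

def predict_knn(test_row, train_X, train_y, k):
    """Predict the class of one test row by majority vote of its k
    nearest training rows (Euclidean distance); ties and the
    all-frequencies-equal case fall back to the smallest class value."""
    dists = [math.dist(test_row, x) for x in train_X]
    # indices of the k closest training rows
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    votes = Counter(train_y[i] for i in nearest)
    top = max(votes.values())
    return min(c for c, f in votes.items() if f == top)

# toy usage: two well-separated clusters
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
print(predict_knn([0.5, 0.5], train_X, train_y, 3))  # → 0
print(predict_knn([5.5, 5.5], train_X, train_y, 3))  # → 1
```

Note that, unlike the R function above, this version takes the training data as arguments rather than reading a global object, which makes it easier to test.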

Step 3: Run the testing data through the knn function:

knn(testing.dat, 4)

(Note: The value 4 in the call is k, the number of neighbours. The choice of k determines how much error the classifier makes; you can use cross-validation to find a good value, which I have not covered in this post. The output is exported as a CSV file, so you can compare the predicted class with the actual class in the testing data and see how well the algorithm did.)
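As a rough illustration of the validation idea mentioned in the note, here is a hypothetical sketch that picks k by accuracy on a held-out validation split. All names and data below are made up for illustration; a proper treatment would use full cross-validation rather than a single split.

```python
from collections import Counter
import math

def predict_knn(test_row, train_X, train_y, k):
    """Majority vote over the k nearest training rows (Euclidean)."""
    dists = [math.dist(test_row, x) for x in train_X]
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    votes = Counter(train_y[i] for i in nearest)
    top = max(votes.values())
    return min(c for c, f in votes.items() if f == top)

def best_k(train_X, train_y, val_X, val_y, candidates):
    """Return the candidate k with the highest validation accuracy
    (first one wins on ties)."""
    def accuracy(k):
        hits = sum(predict_knn(x, train_X, train_y, k) == y
                   for x, y in zip(val_X, val_y))
        return hits / len(val_y)
    return max(candidates, key=accuracy)

# toy data: two clusters, plus two held-out validation points
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
val_X   = [[0.2, 0.2], [5.8, 5.2]]
val_y   = [0, 1]
print(best_k(train_X, train_y, val_X, val_y, [1, 3, 5]))  # → 1
```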
