This analysis examines whether clustering on a vehicle's variables can be used to predict its condition. We can summarize the analysis in the following question.
> Can we use clustering analysis to determine the condition of the vehicle?
A notable feature of this data is that it is heavily skewed toward categorical variables, with relatively few numerical ones. The overall approach taken is as follows:
1. Perform the clustering on the numerical data.
2. Rejoin the clustering output dataframes back to the original dataset, so that we do not use `condition` as a predictor.
3. Examine the results of vehicle condition by cluster.
4. Run clustering over different variations of variables.
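The steps above can be sketched as follows. This is a minimal illustration, not the original notebook code: the tiny dataframe is a hypothetical stand-in for the vehicle dataset, and the column names and `n_clusters=3` follow the run discussed below.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

# Hypothetical stand-in for the vehicle dataset.
df = pd.DataFrame({
    "price":     [5000, 27000, 14000, 31000, 8000, 22000],
    "odometer":  [150000, 20000, 80000, 5000, 120000, 40000],
    "year":      [2005, 2019, 2013, 2021, 2008, 2016],
    "condition": ["good", "excellent", "good", "new", "fair", "excellent"],
})

# 1. Cluster on the numerical columns only, scaled so no single variable dominates.
num_cols = ["price", "odometer", "year"]
X = StandardScaler().fit_transform(df[num_cols])

# 2. Rejoin the cluster labels to the original data; `condition` is never a predictor.
df["kmeans_3"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
df["ward_3"] = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# 3. Examine vehicle condition by cluster.
print(pd.crosstab(df["ward_3"], df["condition"]))
```

Step 4 then repeats the same fit over different subsets of the numerical columns.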
We inspected the output of the clustering and contrasted it against the vehicles' conditions. The Agglomerative method did not produce a useful result. However, the Divisive and kmeans methods did produce something interesting in each run. Looking at the last run, for the variables `price`, `odometer`, and `year`, the Divisive ward algorithm with 3 clusters was a strong predictor of `excellent` condition vehicles, and the kmeans algorithm with 3 clusters was a strong predictor of `new` condition vehicles. See the output below: for the ward_3 algorithm, we may use group 2 to flag `excellent` condition vehicles, and for the kmeans_3 algorithm, we may predict `new` condition vehicles.
```
ward_3
group 1
excellent fair good like new new salvage
4335 341 3423 455 32 13
group 2
excellent fair good like new new salvage
10479 129 5247 1778 39 18
group 3
excellent fair good like new new salvage
2672 16 1074 1530 98 4
----
kmeans_3
group 1
excellent fair good like new new salvage
6118 42 2615 2312 111 13
group 2
excellent fair good like new new salvage
9643 229 5470 1286 57 12
group 3
excellent fair good like new new salvage
1725 215 1659 165 1 10
```
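To make the "strong predictor" reading concrete, we can compute, for each condition, what fraction of its vehicles each ward_3 cluster captures. The counts below are copied from the ward_3 output above; the dataframe construction itself is just a convenience for the calculation.

```python
import pandas as pd

# Condition counts per ward_3 cluster, copied from the output above.
conditions = ["excellent", "fair", "good", "like new", "new", "salvage"]
ward_3 = pd.DataFrame(
    [[4335, 341, 3423, 455, 32, 13],
     [10479, 129, 5247, 1778, 39, 18],
     [2672, 16, 1074, 1530, 98, 4]],
    index=["group 1", "group 2", "group 3"],
    columns=conditions,
)

# Column-wise shares: each condition sums to 1 across the three clusters.
shares = ward_3 / ward_3.sum(axis=0)
print(shares.round(2))
```

Group 2 captures roughly 60% of all `excellent` vehicles (10479 of 17486), which is what motivates using it as the `excellent` cluster.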
## Clustering Conclusion
While our dataset lacked an expansive range of continuous variables, the limited variables still played a crucial role in our clustering analysis. Not every clustering algorithm we tested proved effective. Nonetheless, the Divisive ward and kmeans algorithms, each with three clusters, yielded particularly intriguing outcomes: these configurations demonstrated a robust predictive capacity, especially in categorizing vehicles as being in `excellent` or `new` condition.