Commit e86c5a99 authored by MichaelAng

[mang] update with cluster analysis

parent d181ab3a
@@ -144,3 +144,101 @@ nrow(vehicles)
# Writing to csv to cache results
write.csv(vehicles, "vehicles_cleaned.csv", row.names=FALSE)
```
\pagebreak
# Clustering
> Can we use clustering to determine the condition of the vehicle?
A notable feature of this dataset is that it skews heavily toward categorical variables, with relatively few numerical ones. The overall approach is as follows:
1. Perform the clustering on the numerical variables.
2. Rejoin the cluster assignments to the original dataset.
3. Examine the distribution of vehicle condition within each cluster.
```{r}
# Reload the cleaned data, keyed by vehicle id
vehicles = read.csv("vehicles_cleaned.csv", row.names="id")

run_clustering = function(variables) {
  print(variables)
  vehicles_clustering = vehicles[variables]
  d = dist(as.matrix(vehicles_clustering))

  # Complete linkage (hclust's default), cut into 3 and 5 clusters
  hc = hclust(d)
  vehicles_clustering$complete_3 = cutree(hc, 3)
  vehicles_clustering$complete_5 = cutree(hc, 5)

  # Ward linkage, cut into 3 and 5 clusters
  hc2 = hclust(d, method="ward.D")
  vehicles_clustering$ward_3 = cutree(hc2, 3)
  vehicles_clustering$ward_5 = cutree(hc2, 5)

  # by=0 merges on row names (the vehicle id), rejoining the cluster
  # assignments to the full dataset
  merge(vehicles_clustering[c('complete_3', 'complete_5', 'ward_3', 'ward_5')],
        vehicles, by=0, all.x=TRUE)
}

# Tabulate vehicle condition within each cluster, for every
# linkage/cut combination
print_clusters = function(cluster) {
  for (method in c('complete_3', 'complete_5', 'ward_3', 'ward_5')) {
    print(method)
    for (i in sort(unique(cluster[[method]]))) {
      print(table(cluster[cluster[[method]] == i, ]$condition))
    }
  }
}

# Calls commented out to conserve resources and output space
#output = run_clustering(c('price', 'odometer', 'year', 'long', 'lat'))
#print_clusters(output)
#output = run_clustering(c('price', 'odometer'))
#print_clusters(output)
#output = run_clustering(c('price', 'odometer', 'year'))
#print_clusters(output)
```
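One caveat worth flagging: `dist()` is computed on the raw columns, so the variable with the largest scale (price) dominates the distance matrix. The sketch below is a hypothetical refinement rather than part of the analysis above; `run_clustering_scaled` is an assumed helper name, and it standardizes each variable before clustering.
```{r}
# Hypothetical variant (not part of the original analysis): standardize each
# variable to mean 0 / sd 1 so price does not dominate the distances
run_clustering_scaled = function(variables) {
  d = dist(scale(as.matrix(vehicles[variables])))
  hc = hclust(d, method="ward.D")
  out = vehicles[variables]
  out$ward_3 = cutree(hc, 3)
  out$ward_5 = cutree(hc, 5)
  merge(out[c('ward_3', 'ward_5')], vehicles, by=0, all.x=TRUE)
}
# e.g. output = run_clustering_scaled(c('price', 'odometer', 'year'))
```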
\pagebreak
## Findings
I inspected the clustering output and contrasted it against each vehicle's condition. Complete linkage did not produce a useful result at all, but Ward linkage produced something interesting in each run.
If we look at the last run, over price, odometer, and year, Ward linkage with 3 clusters was a strong predictor of excellent, good, and like-new condition vehicles. See the output below.
```
ward_3
excellent fair good like new new salvage
4335 341 3423 455 32 13
excellent fair good like new new salvage
10479 129 5247 1778 39 18
excellent fair good like new new salvage
2672 16 1074 1530 98 4
```
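To make "strong predictor" concrete, one option is to compare condition *shares* rather than raw counts, since the clusters differ in size. The helper below is a hypothetical sketch, assuming `output` holds the (currently commented-out) result of `run_clustering(c('price', 'odometer', 'year'))`:
```{r}
# Hypothetical helper (not in the original analysis): row-normalized share of
# each condition within every cluster, so clusters of different sizes compare fairly
condition_shares = function(cluster, method) {
  round(prop.table(table(cluster[[method]], cluster$condition), margin=1), 3)
}
# e.g. condition_shares(output, 'ward_3') once the run above is uncommented
```
Each row then sums to 1, which makes it easier to see, for example, how sharply the like-new share varies across the three Ward clusters in the output above.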
## Clustering Conclusion
While our dataset lacked an expansive range of continuous variables, the limited set we had still played a crucial role in the clustering analysis. Not every configuration we tested proved effective: complete linkage yielded little of use. Ward linkage with three clusters, however, produced a particularly intriguing outcome, demonstrating a robust capacity to separate vehicles in excellent, good, and like-new condition.