Commit e86c5a99 authored by MichaelAng

[mang] update with cluster analysis

parent d181ab3a
@@ -144,3 +144,101 @@ nrow(vehicles)
# Writing to csv to cache results
write.csv(vehicles, "vehicles_cleaned.csv", row.names=FALSE)
```
\pagebreak
# Clustering
> Can we use clustering to determine the condition of the vehicle?
A notable feature of this dataset is that it skews heavily toward categorical variables, with relatively few numerical ones. The overall approach is as follows:
1. Perform the clustering on the numerical variables.
2. Rejoin the cluster assignments to the original dataset.
3. Examine the distribution of vehicle condition within each cluster.
```{r}
# Reload the cleaned data, keyed by vehicle id
vehicles = read.csv("vehicles_cleaned.csv", row.names="id")

run_clustering = function(variables) {
  print(variables)
  vehicles_clustering = vehicles[variables]
  d = dist(as.matrix(vehicles_clustering))

  # Complete linkage (hclust's default), cut into 3 and 5 clusters
  hc = hclust(d)
  vehicles_clustering$complete_3 = cutree(hc, 3)
  vehicles_clustering$complete_5 = cutree(hc, 5)

  # Ward linkage, cut into 3 and 5 clusters
  hc2 = hclust(d, method="ward.D")
  vehicles_clustering$ward_3 = cutree(hc2, 3)
  vehicles_clustering$ward_5 = cutree(hc2, 5)

  # by=0 merges on row names (the vehicle id), rejoining the cluster
  # assignments to the full dataset
  merge(vehicles_clustering[c('complete_3', 'complete_5', 'ward_3', 'ward_5')],
        vehicles, by=0, all.x=TRUE)
}

# Tabulate vehicle condition within each cluster, for every
# linkage/cut combination
print_clusters = function(cluster) {
  for (method in c('complete_3', 'complete_5', 'ward_3', 'ward_5')) {
    print(method)
    for (i in sort(unique(cluster[[method]]))) {
      print(table(cluster[cluster[[method]] == i, ]$condition))
    }
  }
}

# Calls commented out to conserve resources and output space
#output = run_clustering(c('price', 'odometer', 'year', 'long', 'lat'))
#print_clusters(output)
#output = run_clustering(c('price', 'odometer'))
#print_clusters(output)
#output = run_clustering(c('price', 'odometer', 'year'))
#print_clusters(output)
```
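One caveat worth flagging: `dist()` is computed on the raw columns, so the variable with the largest scale (price) dominates the distance matrix. The sketch below is a hypothetical refinement rather than part of the analysis above; `run_clustering_scaled` is an assumed helper name, and it standardizes each variable before clustering.
```{r}
# Hypothetical variant (not part of the original analysis): standardize each
# variable to mean 0 / sd 1 so price does not dominate the distances
run_clustering_scaled = function(variables) {
  d = dist(scale(as.matrix(vehicles[variables])))
  hc = hclust(d, method="ward.D")
  out = vehicles[variables]
  out$ward_3 = cutree(hc, 3)
  out$ward_5 = cutree(hc, 5)
  merge(out[c('ward_3', 'ward_5')], vehicles, by=0, all.x=TRUE)
}
# e.g. output = run_clustering_scaled(c('price', 'odometer', 'year'))
```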
\pagebreak
## Findings
I inspected the clustering output and contrasted it against each vehicle's condition. Complete linkage did not produce a useful result at all, but Ward linkage produced something interesting in each run.
If we look at the last run, over price, odometer, and year, Ward linkage with 3 clusters was a strong predictor of excellent, good, and like-new condition vehicles. See the output below.
```
ward_3
excellent fair good like new new salvage
4335 341 3423 455 32 13
excellent fair good like new new salvage
10479 129 5247 1778 39 18
excellent fair good like new new salvage
2672 16 1074 1530 98 4
```
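To make "strong predictor" concrete, one option is to compare condition *shares* rather than raw counts, since the clusters differ in size. The helper below is a hypothetical sketch, assuming `output` holds the (currently commented-out) result of `run_clustering(c('price', 'odometer', 'year'))`:
```{r}
# Hypothetical helper (not in the original analysis): row-normalized share of
# each condition within every cluster, so clusters of different sizes compare fairly
condition_shares = function(cluster, method) {
  round(prop.table(table(cluster[[method]], cluster$condition), margin=1), 3)
}
# e.g. condition_shares(output, 'ward_3') once the run above is uncommented
```
Each row then sums to 1, which makes it easier to see, for example, how sharply the like-new share varies across the three Ward clusters in the output above.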
## Clustering Conclusion
While our dataset lacked an expansive range of continuous variables, the limited set we had still played a crucial role in the clustering analysis. Not every configuration we tested proved effective: complete linkage yielded little of use. Ward linkage with three clusters, however, produced a particularly intriguing outcome, demonstrating a robust capacity to separate vehicles in excellent, good, and like-new condition.