Commit 2c844d8a authored by MichaelAng

[mang] enhance cluster analysis

parent e86c5a99
@@ -149,94 +149,142 @@ write.csv(vehicles, "vehicles_cleaned.csv", row.names=FALSE)
# Clustering
This analysis assesses the viability of using clustering on the numerical variables to predict a vehicle's condition. We can summarize the analysis in the following question:

> Can we use clustering analysis to determine the condition of a vehicle?
A notable feature of this dataset is that it skews heavily towards categorical variables, with relatively few numerical ones. The overall approach is as follows:
1. Perform the clustering on the numerical data.
2. Rejoin the clustering output data frames to the original dataset, so that `condition` is never used as a predictor.
3. Examine the distribution of vehicle condition within each cluster.
4. Repeat the clustering over different combinations of variables.
```{r}
vehicles = read.csv("vehicles_cleaned.csv", row.names="id")
run_clustering = function(df, variables) {
  print(variables)
  vehicles_clustering = df[variables]

  # Hierarchical clustering with the default complete linkage
  d = dist(as.matrix(vehicles_clustering))
  hc = hclust(d)
  vehicles_clustering$complete_3 = cutree(hc, 3)
  vehicles_clustering$complete_5 = cutree(hc, 5)

  # Hierarchical clustering with Ward linkage, reusing the same distance matrix
  hc2 = hclust(d, method="ward.D")
  vehicles_clustering$ward_3 = cutree(hc2, 3)
  vehicles_clustering$ward_5 = cutree(hc2, 5)

  # k-means on the original variables only, so the hierarchical cluster
  # labels appended above are not accidentally used as features
  km_3 = kmeans(vehicles_clustering[variables], 3)
  km_5 = kmeans(vehicles_clustering[variables], 5)
  vehicles_clustering$kmeans_3 = km_3$cluster
  vehicles_clustering$kmeans_5 = km_5$cluster

  # Rejoin the cluster assignments to the full dataset by row name, so that
  # condition is available for inspection without ever being a predictor
  vehicles_clustered = merge(vehicles_clustering[c('complete_3', 'complete_5', 'ward_3', 'ward_5', 'kmeans_3', 'kmeans_5')], df, by=0, all.x=TRUE)
  vehicles_clustered
}
print_clusters = function(cluster) {
  # Tabulate vehicle condition within each group, for every clustering run
  for (method in c('complete_3', 'complete_5', 'ward_3', 'ward_5', 'kmeans_3', 'kmeans_5')) {
    print(method)
    for (i in 1:max(cluster[[method]])) {
      print(sprintf('group %i', i))
      print(table(cluster[cluster[, method] == i, ]$condition))
    }
  }
}
# commented out to save compute and output space
# output = run_clustering(vehicles, c('price', 'odometer', 'year', 'long', 'lat'))
# print_clusters(output)
#
# output = run_clustering(vehicles, c('price', 'odometer'))
# print_clusters(output)
#
# output = run_clustering(vehicles, c('price', 'odometer', 'year'))
# print_clusters(output)
```
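
One caveat with the runs above: `price`, `odometer`, and `year` sit on very different scales, so the Euclidean distances are dominated by the largest-magnitude variable. Below is a minimal sketch of a standardized variant; `run_clustering_scaled` is a hypothetical helper, not part of the original analysis.

```{r}
# Hypothetical variant: z-score the clustering variables first so that no
# single large-scale variable (e.g. price) dominates the distance matrix.
run_clustering_scaled = function(df, variables) {
  df_scaled = df
  df_scaled[variables] = scale(df[variables])  # center and scale each variable
  run_clustering(df_scaled, variables)
}

# output = run_clustering_scaled(vehicles, c('price', 'odometer', 'year'))
# print_clusters(output)
```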
\pagebreak

## Findings
We inspected the output of each clustering run and contrasted it against the vehicles' conditions. The complete-linkage method did not produce a useful result, but the Ward-linkage and k-means methods produced something interesting in each run. In the last run, over the variables `price`, `odometer`, and `year`, the Ward algorithm with 3 clusters was a strong predictor of `excellent` condition vehicles, and the k-means algorithm with 3 clusters picked out the `new` condition vehicles. See the output below.

For the `ward_3` run, group 2 is dominated by `excellent` vehicles and may be used to identify them; for the `kmeans_3` run, group 1 contains most of the `new` vehicles.
```
ward_3
group 1
excellent      fair      good  like new       new   salvage
     4335       341      3423       455        32        13
group 2
excellent      fair      good  like new       new   salvage
    10479       129      5247      1778        39        18
group 3
excellent      fair      good  like new       new   salvage
     2672        16      1074      1530        98         4
----
kmeans_3
group 1
excellent      fair      good  like new       new   salvage
     6118        42      2615      2312       111        13
group 2
excellent      fair      good  like new       new   salvage
     9643       229      5470      1286        57        12
group 3
excellent      fair      good  like new       new   salvage
     1725       215      1659       165         1        10
```
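
To make the "strong predictor" claim more concrete, the share of each condition within a cluster can be computed directly. The sketch below is not part of the original analysis; it assumes `output` is the data frame returned by `run_clustering` above.

```{r}
# Hypothetical helper: proportion of each condition within each cluster.
# Rows sum to 1, so the dominant condition per group is easy to spot.
condition_share = function(cluster, method) {
  round(prop.table(table(cluster[[method]], cluster$condition), margin=1), 2)
}

# condition_share(output, 'ward_3')    # e.g. group 2's excellent share
# condition_share(output, 'kmeans_3')  # e.g. group 1's new share
```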
## Clustering Conclusion
While our dataset lacked an expansive range of continuous variables, the few available ones still played a crucial role in the clustering analysis. Not every clustering algorithm we tested proved effective. Nonetheless, the Ward and k-means algorithms, each with three clusters, yielded particularly intriguing outcomes: these configurations demonstrated a robust capacity for flagging vehicles in `excellent` or `new` condition.
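
If one wanted to act on this conclusion, the cluster labels could be turned into a simple rule-based predictor. A hypothetical sketch, again assuming `output` from the `price`/`odometer`/`year` run:

```{r}
# Hypothetical rule derived from the findings, e.g. ward_3 group 2 -> 'excellent'.
# Returns the fraction of vehicles this single rule classifies correctly.
rule_accuracy = function(cluster, method, group, label) {
  pred = cluster[[method]] == group
  actual = cluster$condition == label
  mean(pred == actual)
}

# rule_accuracy(output, 'ward_3', 2, 'excellent')
# rule_accuracy(output, 'kmeans_3', 1, 'new')
```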