[mang] adding final backup incase computer crash

Commit d40ddd18 authored Dec 16, 2023 by MichaelAng

Showing with 482 additions and 15 deletions

final copy.Rmd +446 -0

group-4-project.Rmd +36 -15

--- a/final copy.Rmd
+++ b/final copy.Rmd
+---
+title: "IT270 Final"
+author: "Michael Ang"
+date: "2023-12-10"
+output: 
+  pdf_document:
+    toc: true
+    toc_depth: 3
+    number_sections: true
+editor_options: 
+  markdown: 
+    wrap: 120
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+library(FactoMineR)
+library(psych)
+library(C50)
+library(corrplot)
+library(class)
+```
+
+\pagebreak
+
+# Question 1
+
+## Should there be a gender discrimination and equity concern?
+
+Equity concerns regarding gender representation are evident across various metrics within our organization. Our analysis
+delved into the female-to-male (F/M) ratio in terms of both employee distribution and MMI (Median Monthly Income). A
+detailed chart illustrating these findings will be presented at the conclusion of this summary.
+
+Across the organization as a whole, the F/M ratio for employee headcount stands at 67%, a pattern that broadly holds
+across departments, job levels, and roles. The disparity in MMI also raises some equity issues, but it is notably less
+pronounced than the disparity in employee distribution. At the organizational level, the F/M ratio for MMI is 98%.
+
+Looking more closely within departments, job levels, and roles, the concerns are concentrated in job levels and roles.
+Notably, the highest positions (job level 5) show a 75% F/M ratio in MMI. Two specific roles, role 1 (Human Resources)
+and role 4 (Manufacturing Director), show concerning F/M ratios in MMI of 65% and 78%, respectively.
+
+No further concerns were detected after the ratio analysis. The breakdown of these equity concerns is summarized in the
+chart below.
+
+## How may we increase overall job satisfaction?
+
+To significantly enhance overall job satisfaction, our focus should prioritize:
+
+-   Maintaining consistent manager assignments
+-   Reducing attrition rates and commuting demands
+-   Enhancing stock options, promoting work-life balance, implementing salary hikes, optimizing training duration,
+    elevating job levels, and augmenting monthly income.
+
+Our conclusion stems from an in-depth decision tree analysis, rigorously tested for accuracy using a dedicated set of
+data. Our modeling demonstrates an approximate accuracy rate of 70%, validating the significance of these specified
+areas as the primary contributors to job satisfaction. While other factors were taken into account, these identified
+areas emerged as the most influential.
+
+\pagebreak
+
+## Analysis
+
+```{r}
+e_df = read.csv("employee_survey_data_clean.csv")
+g_df = read.csv("general_data.csv")
+m_df = read.csv("manager_survey_data.csv")
+
+employee_df = merge(e_df, g_df, by.x = "EmployeeID", by.y = "EmployeeID")
+employee_df = merge(employee_df, m_df, by.x = "EmployeeID", by.y = "EmployeeID")
+
+# Inspecting data. Commented out 
+# summary(employee_df)
+# ncol(employee_df)
+# nrow(employee_df)
+# anyNA(employee_df)
+
+# Setting to zero because the values are NA
+employee_df$TotalWorkingYears[is.na(employee_df$TotalWorkingYears)] = 0
+employee_df$NumCompaniesWorked[is.na(employee_df$NumCompaniesWorked)] = 0
+
+# Drop single value columns
+employee_df = subset(employee_df, select = -c(EmployeeCount,Over18,StandardHours))
+```
+
+\pagebreak
+
+### Gender Discrimination and Equity Analysis
+
+```{r}
+# Commented out to reduce output
+# discrim_emp_df = employee_df
+#
+# # Figure 1: First Analysis: Is there an equal gender distribution org wide?
+# table(discrim_emp_df$Gender)
+# 
+# # Figure 2: Second Analysis: Is there an equal gender distribution across department, Job Level, JobRole?
+# aggregate(discrim_emp_df[c('Gender')],by=list(discrim_emp_df$Gender, discrim_emp_df$Department),FUN=length)
+# aggregate(discrim_emp_df[c('Gender')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobLevel),FUN=length)
+# aggregate(discrim_emp_df[c('Gender')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobRole),FUN=length)
+
+# # Figure 3: Third Analysis: Is there a MonthlyIncome, gender distribution across department, Job Level, JobRole?
+# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender),FUN=median)
+# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender, discrim_emp_df$Department),FUN=median)
+# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobLevel),FUN=median)
+# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobRole),FUN=median)
+
+# # Figure 4: Fourth Analysis. Are there any strong correlations between variables when split up on Gender, Department, or Job Role?
+# discrim_emp_df['Department'][discrim_emp_df['Department'] == 'Human Resources'] = 0
+# discrim_emp_df['Department'][discrim_emp_df['Department'] == 'Research & Development'] = 1
+# discrim_emp_df['Department'][discrim_emp_df['Department'] == 'Sales'] = 2
+# discrim_emp_df$Department = as.numeric(discrim_emp_df$Department)
+# 
+# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Healthcare Representative'] = 0
+# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Human Resources'] = 1
+# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Laboratory Technician'] = 2
+# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Manager'] = 3
+# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Manufacturing Director'] = 4
+# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Research Director'] = 5
+# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Research Scientist'] = 6
+# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Sales Executive'] = 7
+# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Sales Representative'] = 8
+# discrim_emp_df$JobRole = as.numeric(discrim_emp_df$JobRole)
+# 
+# male_emp_df = discrim_emp_df[discrim_emp_df['Gender'] == 'Male',]
+# female_emp_df = discrim_emp_df[discrim_emp_df['Gender'] == 'Female',]
+# 
+# discrim_vars = c('JobLevel', 'MonthlyIncome', 'Age', 'PercentSalaryHike', 'StockOptionLevel', 'YearsAtCompany', 'Department', 'JobRole')
+# 
+# corrplot(cor(male_emp_df[discrim_vars]), method='number', title='Male Employees', mar = c(0, 0, 2, 0),type="upper")
+# corrplot(cor(female_emp_df[discrim_vars]), method='number', title='Female Employees', mar = c(0, 0, 2, 0),type="upper")
+# 
+# # Correlation by Department
+# for(i in levels(factor(male_emp_df$Department))) {
+#   df = male_emp_df[male_emp_df['Department'] == as.integer(i),]
+#   corrplot(cor(df[discrim_vars]), method='number', title='Male Employees', mar = c(0, 0, 2, 0),type="upper")
+# }
+# 
+# for(i in levels(factor(female_emp_df$Department))) {
+#   df = female_emp_df[female_emp_df['Department'] == as.integer(i),]
+#   corrplot(cor(df[discrim_vars]), method='number', title='Female Employees', mar = c(0, 0, 2, 0),type="upper")
+# }
+# 
+# # Correlation by Job Role
+# for(i in levels(factor(male_emp_df$JobRole))) {
+#   df = male_emp_df[male_emp_df['JobRole'] == as.integer(i),]
+#   corrplot(cor(df[discrim_vars]), method='number', title='Male Employees', mar = c(0, 0, 2, 0),type="upper")
+# }
+# 
+# for(i in levels(factor(female_emp_df$JobRole))) {
+#   df = female_emp_df[female_emp_df['JobRole'] == as.integer(i),]
+#   corrplot(cor(df[discrim_vars]), method='number', title='Female Employees', mar = c(0, 0, 2, 0),type="upper")
+# }
+```
+
+\pagebreak
+
+```         
+# Female to Male Ratio
+## Across Organization
+67% = 1737 / 2590
+
+## Across Department 
+60% = 71   / 117   # Human Resources
+67% = 1142 / 1682    # Research & Development   
+66% = 524  / 791     # Sales
+
+## Across JobLevel 
+62% = 610 / 982 # Job Level 1       
+72% = 662 / 912 # Job Level 2           
+58% = 237 / 407 # Job Level 3           
+77% = 137 / 176 # Job Level 4           
+80% = 91  / 113 # Job Level 5       
+
+## Across JobRole 
+67% = 153 / 228 # Job Role 0            
+63% = 60  / 94  # Job Role 1
+63% = 296 / 464 # Job Role 2
+93% = 145 / 155 # Job Role 3
+69% = 175 / 253 # Job Role 4
+
+# Female to Male Median Income
+## Across Organization 
+ 98% = 48980 / 49680
+## Across Department
+146% = 63230 / 43190 # Human Resources  
+ 96% = 49630 / 51630 # Research & Development   
+ 93% = 45590 / 48690 # Sales
+
+## Across JobLevel 
+ 89% = 43415 / 48340 # Job Level 1
+101% = 50820 / 49990 # Job Level 2
+107% = 50790 / 47390 # Job Level 3
+ 93% = 54290 / 58225 # Job Level 4
+ 75% = 40140 / 53470 # Job Level 5
+
+## Across JobRole 
+108% = 54050 / 49980 # Job Role 0               
+ 65% = 36900 / 56610 # Job Role 1   
+ 93% = 49690 / 53240 # Job Role 2   
+121% = 53680 / 44010 # Job Role 3   
+ 78% = 45810 / 58550 # Job Role 4 
+```
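+
+As a rough illustration of how ratios like the ones above can be computed, the sketch below uses a small made-up data
+frame. The `toy_df` values and the variable names are assumptions for illustration and are not taken from the employee
+data. The percentages are truncated with `floor()`, which is how the figures above appear to be reported (for example,
+71 / 117 is shown as 60%).
+
+```{r}
+# Toy data; the values are made up for illustration only
+toy_df = data.frame(
+  Gender = c("Female", "Female", "Male", "Male", "Male"),
+  MonthlyIncome = c(40000, 50000, 30000, 60000, 90000)
+)
+
+# Headcount ratio: number of female employees divided by number of male employees
+counts = table(toy_df$Gender)
+fm_count_ratio = counts[["Female"]] / counts[["Male"]]
+
+# Median monthly income ratio: female median divided by male median
+medians = tapply(toy_df$MonthlyIncome, toy_df$Gender, median)
+fm_mmi_ratio = medians[["Female"]] / medians[["Male"]]
+
+# Truncate to whole percentages
+floor(100 * c(count = fm_count_ratio, mmi = fm_mmi_ratio))
+```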
+
+\pagebreak
+
+### Improving Overall Job Satisfaction Analysis
+
+```{r}
+
+# Bin the data
+# With right = TRUE the highest break is already closed; include.lowest = TRUE also keeps values equal to the lowest break
+js_emp_df = employee_df
+js_emp_df$Age = cut(js_emp_df$Age, breaks = c(15, 30, 40, 60), right = TRUE, labels = FALSE)
+js_emp_df$DistanceFromHome = cut(js_emp_df$DistanceFromHome, breaks = c(0, 15, 25, 30), right = TRUE, labels = FALSE)
+js_emp_df$MonthlyIncome = cut(js_emp_df$MonthlyIncome, breaks = c(0, 20000, 50000, 100000, 150000, 200000), right = TRUE, labels = FALSE)
+js_emp_df$NumCompaniesWorked = cut(js_emp_df$NumCompaniesWorked, breaks = c(0, 3, 5, 10), right = TRUE, include.lowest = TRUE, labels = FALSE)
+js_emp_df$PercentSalaryHike = cut(js_emp_df$PercentSalaryHike, breaks = c(10, 15, 20, 25), right = TRUE, labels = FALSE)
+js_emp_df$TotalWorkingYears = cut(js_emp_df$TotalWorkingYears, breaks = c(0, 5, 10, 20, 40), right = TRUE, include.lowest = TRUE, labels = FALSE)
+js_emp_df$YearsAtCompany = cut(js_emp_df$YearsAtCompany, breaks = c(0, 5, 10, 20, 40), right = TRUE, include.lowest = TRUE, labels = FALSE)
+js_emp_df$YearsSinceLastPromotion = cut(js_emp_df$YearsSinceLastPromotion, breaks = c(0, 5, 10, 15), right = TRUE, include.lowest = TRUE, labels = FALSE)
+js_emp_df$YearsWithCurrManager = cut(js_emp_df$YearsWithCurrManager, breaks = c(0, 5, 10, 15, 20), right = TRUE, include.lowest = TRUE, labels = FALSE)
+
+# Convert the columns into factors
+for (i in 1:ncol(js_emp_df)){
+  js_emp_df[,i] = as.factor(js_emp_df[,i])
+}
+
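+# Create the full model formula as a string
+# Predictors: every column except column 5; the response is also excluded by name so it never appears on both sides
+vars = paste(setdiff(names(js_emp_df)[-5], "JobSatisfaction"), collapse=' + ')
+attr_formula = paste("JobSatisfaction", vars, sep=" ~ ")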
+
+
+# Split data into Training and Test
+js_emp_df$rndnum = runif(nrow(js_emp_df), 1,100)
+employee_train = js_emp_df[js_emp_df[, "rndnum"] <= 80 , ] 
+employee_test = js_emp_df[js_emp_df[, "rndnum"] > 80 , ]
+
+# Helper function: row-normalized confusion matrix (each row shows how one actual class was predicted)
+dt_moa = function(evaluation, predicted_values){
+  tab = table(evaluation, predicted_values)
+  dimnames(tab) = list("Actual" = levels(evaluation), "Predicted" = levels(evaluation))
+
+  # Divide each row by its total so every row sums to 1, for any number of classes
+  round(prop.table(tab, margin = 1), 2)
+}
+```
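+
+Because the split above draws fresh random numbers on every knit, the training and test sets change from run to run.
+A minimal sketch of a repeatable version of the same split is shown below, assuming a fixed seed is acceptable. The
+seed value and the `toy_df`, `toy_train`, and `toy_test` names are hypothetical and not part of the original analysis.
+
+```{r}
+set.seed(270)  # hypothetical seed value; any fixed integer works
+
+toy_df = data.frame(id = 1:1000)
+toy_df$rndnum = runif(nrow(toy_df), 1, 100)
+
+# Same rule as above: values up to 80 go to training, the rest to test
+toy_train = toy_df[toy_df$rndnum <= 80, ]
+toy_test  = toy_df[toy_df$rndnum > 80, ]
+
+# Every row lands in exactly one partition
+nrow(toy_train) + nrow(toy_test) == nrow(toy_df)
+
+# Share of rows in the training partition (roughly 0.8)
+nrow(toy_train) / nrow(toy_df)
+```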
+
+```{r}
+# Fit C5.0 on the full model and predict on the test set
+fm_fit_c50 = C5.0(as.formula(attr_formula), data = employee_train)
+employee_test$predict = predict(fm_fit_c50, newdata=employee_test, type='class')
+c50_fm_chart = dt_moa(employee_test$JobSatisfaction, employee_test$predict)
+```
+
+```{r}
+f_sm_c50 = "JobSatisfaction ~ YearsWithCurrManager + Attrition + StockOptionLevel + DistanceFromHome + WorkLifeBalance + PercentSalaryHike + TrainingTimesLastYear + JobLevel + MonthlyIncome"
+sm_fit_c50 = C5.0(as.formula(f_sm_c50), data = employee_train)
+employee_test$predict = predict(sm_fit_c50, newdata=employee_test, type='class')
+c50_sm_chart = dt_moa(employee_test$JobSatisfaction, employee_test$predict)
+```
+
+```{r}
+print("full model c5.0") 
+c50_fm_chart
+print("simple model c5.0")
+c50_sm_chart
+```
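+
+The charts above show per-class rates rather than a single accuracy figure. As a sketch of how an overall accuracy can
+be read off a confusion matrix, the example below uses made-up labels on a 4-level scale. The `actual` and `predicted`
+vectors are assumptions for illustration, not model output.
+
+```{r}
+# Toy actual and predicted labels on a 4-level scale (made-up values)
+actual    = factor(c(1, 2, 3, 4, 1, 2, 3, 4), levels = 1:4)
+predicted = factor(c(1, 2, 3, 1, 1, 3, 3, 4), levels = 1:4)
+
+cm = table(Actual = actual, Predicted = predicted)
+
+# Overall accuracy: correct predictions (the diagonal) over all predictions
+overall_accuracy = sum(diag(cm)) / sum(cm)
+overall_accuracy
+
+# Per-class rate of correct predictions: the diagonal of the row-normalized table
+diag(prop.table(cm, margin = 1))
+```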
+
+\pagebreak
+
+# Question 2
+
+## a
+
+My colleague was incorrect in the first place because of the neuron estimates. My colleague opted for 3 hidden neurons,
+whereas I used a more intricate design with 38 hidden neurons split across two layers, c(25, 13). With 10000
+observations, this gave a reasonable run time for the NN algorithm.
+
+## b
+
+This isn't a good rationale because my colleague is forgetting to consider the number of observations and the input
+layer. In addition, given that the next question will be using 60k observations and time is a precious resource, 40
+hidden neurons will be very slow. I would go with a value even higher than 40 hidden neurons. Using 40 also does not
+increase the accuracy.
+
+## c
+
+hidden=c(25, 13)
+
+| neuron estimates | Execution Time | \# Accurate | \% Accurate |
+|------------------|----------------|-------------|-------------|
+| c(40)            | 56.0837 mins   | 6790        | 11.3        |
+| c(100, 50)       | 8.050967 mins  | 6784        | 11.3        |
+| c(200, 100)      | 9.487618 mins  | 6786        | 11.3        |
+| c(400, 200)      | 14.57694 mins  | 6791        | 11.3        |
+
+## d
+
+### Part 1
+
+1.  KNN - This is a classification algorithm based on the nearest neighbors
+2.  Clustering - This is a literal classification algorithm
+
+### Part 2
+
+```{r}
+mnist_raw_raw = read.csv("mnist_train.csv", header=FALSE)
+mnist_raw = mnist_raw_raw[1:60000,]
+
+mnist_raw <- replace(mnist_raw, is.na(mnist_raw), 0)
+
+mnist_raw$rndnum = runif(nrow(mnist_raw), 1,100)
+
+train = mnist_raw[mnist_raw[, "rndnum"] <= 80 , ] 
+test = mnist_raw[mnist_raw[, "rndnum"] > 80 , ]
+
+start.time = Sys.time()
+knnmdl = knn(train[-c(1, ncol(train))], test[-c(1, ncol(test))], train[,1], k = 3,prob = TRUE)
+end.time = Sys.time()
+
+actual = test[,1]
+cm = table(actual,knnmdl)
+accuracy = sum(diag(cm))/length(actual)
+time.taken = end.time - start.time
+print(time.taken)
+print(sprintf("%s Accuracy: %.2f%%", names(mnist_raw[1]), accuracy*100))
+```
+
+## e
+
+For smaller numbers of observations, the KNN algorithm may be more suitable for classification.
+
+## f
+
+Nothing? This is a data mining class/employment and we want to promote a positive work environment
+
+\pagebreak
+
+# Question 3
+
+\pagebreak
+
+## a
+
+The prior analysis proves its worth by showcasing our efficiency in variable selection. This strategic approach helps us
+trim down our variables significantly, reducing the burden from 36 player attributes to a concise set of 5 attributes.
+Remarkably, this streamlined approach retains a commendable 75% accuracy level. This not only simplifies our management
+but also ensures a high level of predictive performance.
+
+## aa
+
+To build an ideal team, I would seek out an algorithm that tries to identify the ideal position each player should play
+based on the player's attributes. The following code lists the fields that are necessary to perform the grouping and
+shows how I will describe the groups.
+
+```{r}
+fifa_raw = read.csv("fifa.csv")
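+# `56:n - 1` parses as `(56:n) - 1`, i.e. columns 55 to n - 1; the range is written out explicitly here
+fifa_fa = fifa_raw[, 55:(ncol(fifa_raw) - 1)]
+
+# Necessary fields to perform grouping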
+names(fifa_fa)
+
+# Here is how I will describe the groups
+levels(factor(fifa_raw$Position))
+```
+
+## ab
+
+I will not be able to assess which clubs the players belong to because there isn't a strong relationship between the
+fields that I have selected and clubs. Knowing which club a player belonged to would probably be only useful when and if
+I need to trade the player.
+
+## ac
+
+My initial analysis would be useful for future analysis because I can use my 5 new factors to perform further
+analysis rather than all 36 player attributes.
+
+## ad
+
+One interesting analysis we can perform is to check whether we are paying or valuing players fairly. Based on the KNN
+analysis, we may be overpaying or underpaying players.
+
+```{r}
+fifa = read.csv("fifa.csv")
+
+fifa$Value = as.numeric(substr(fifa$Value, 2, nchar(fifa$Value)-1))
+fifa$Wage = as.numeric(substr(fifa$Wage, 2, nchar(fifa$Wage)-1))
+
+fifa = fifa[complete.cases(fifa), ]
+
+run_knn = function (df) {
+  df$rndnum = runif(nrow(df), 1,100)
+  train = df[df[, "rndnum"] <= 80 , ] 
+  test = df[df[, "rndnum"] > 80 , ]
+
+  knnmdl = knn(train[-c(1, ncol(train))], test[-c(1, ncol(test))], train[,1], k = 10,prob = TRUE)
+  actual = test[,1]
+  cm = table(actual,knnmdl)
+  accuracy = sum(diag(cm))/length(actual)
+  
+  print(sprintf("%s Accuracy: %.2f%%", names(df[1]),accuracy*100))
+}
+
+fifa_fa = fifa[,c(12,56:ncol(fifa) - 1)]
+run_knn(fifa_fa)
+
+fifa_fa = fifa[,c(13,56:ncol(fifa) - 1)]
+run_knn(fifa_fa)
+```
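+
+The `substr()` calls above strip the first and last character of `Value` and `Wage`, so a trailing unit suffix is
+discarded. If the raw strings mix thousands and millions (assumed here to look like a currency symbol, a number, and
+an optional `K` or `M`), the resulting numbers end up on different scales. The sketch below shows one possible way to
+put them on a single scale; `parse_money` is a hypothetical helper, not part of the original code.
+
+```{r}
+# Hypothetical helper: convert strings such as "\u20ac110.5M" or "\u20ac565K" to plain numbers on one scale
+parse_money = function(x) {
+  # Strip everything except digits and the decimal point
+  number = as.numeric(gsub("[^0-9.]", "", x))
+  # Pick a multiplier from the trailing suffix, if any
+  multiplier = ifelse(grepl("M$", x), 1e6, ifelse(grepl("K$", x), 1e3, 1))
+  number * multiplier
+}
+
+parse_money(c("\u20ac110.5M", "\u20ac565K", "\u20ac0"))
+```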
+
+# Question 4
+
+## a
+
+I agree with Michele Piccinno because without understanding the problem, using algorithms like neural networks is like
+shooting in the dark. Understanding the problem helps form the right questions and guides us toward finding answers or
+refining our approach. Plus, if we don't understand the problem, we can't tell if our solution works or what success
+looks like.
+
+## b
+
+Accuracy, typically measured as a percentage of success or error in relation to the total, is a common method to gauge
+performance across techniques. It's suitable because it provides insight into how well a technique performs. Achieving
+100% accuracy isn't always necessary; often, an 80% success rate might still offer sufficient business value to pursue a
+course of action. Additionally, without measurement, it's impossible to ascertain the effectiveness of a technique.
+
+## c
+
+When starting a project, dealing with data can be tricky. Understanding what the data means, checking it thoroughly,
+making sure it looks right, cleaning up any mistakes, and getting rid of weird bits are all challenges. To tackle these,
+I'd start by learning about the data sources, looking at the data closely, fixing any issues with how it's organized,
+cleaning it up to make it better, and sorting out any strange or unexpected parts. If we do not perform the prework, the
+validity of the analysis can be compromised.
+
+## d
+
+A decision tree predicts continuous variables by sorting them into groups, like age ranges such as 0-12 for kids, 13-18
+for teens, and so on. This sorting helps handle continuous data better. It's important to pick sensible groups that
+match the data in the dataset.
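+
+A minimal sketch of the grouping described above, using base R's `cut()`. The sample ages, the upper cutoff of 65, and
+the group labels are assumptions chosen for illustration.
+
+```{r}
+ages = c(5, 12, 13, 18, 30, 64)
+
+# Bins are right-closed: (0,12], (12,18], (18,65]
+age_group = cut(ages, breaks = c(0, 12, 18, 65), labels = c("kid", "teen", "adult"))
+
+# Count how many observations fall into each group
+table(age_group)
+```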
+
+## e
+
+An ethical issue in data mining involves using personal information like birthdates to crack PIN numbers, which breaches
+privacy. Even seemingly harmless data, when processed using certain methods, could identify people, like using operating
+system preferences to pinpoint individuals submitting homework. Analysts should focus on securely storing and
+transferring data to protect personal information and prevent misuse.
--- a/group-4-project.Rmd
+++ b/group-4-project.Rmd
@@ -149,15 +149,18 @@ write.csv(vehicles, "vehicles_cleaned.csv", row.names=FALSE)

 # Clustering

-This analysis aims to ascertain the viability of utilizing clustering on variables to predict a vehicle's condition. We can summarize this analysis in the following statement.
+This analysis aims to ascertain the viability of utilizing clustering on variables to predict a vehicle's condition. We
+can summarize this analysis in the following statement.

 > Can we use clustering analysis to determine the condition of the vehicle?

-Some notable features about this data is that it is heavily skewed towards categorical data and less on numerical data. The overall approach taken is as follows:
+One notable feature of this data is that it is heavily skewed towards categorical data, with little numerical data.
+The overall approach taken is as follows:

 1.  Perform the clustering on the numerical data.
-2.  Rejoin the clustering output dataframes back to the original dataset. So that we do not use condition as a predictor.
-3.  Then examine the results of vehicle condition by cluster 
+2.  Rejoin the clustering output dataframes back to the original dataset. So that we do not use condition as a
+    predictor.
+3.  Then examine the results of vehicle condition by cluster
 4.  Run clustering over differernt variations of variables

 ```{r}
@@ -239,17 +242,37 @@ print_clusters = function(cluster) {
 # 
 # output = run_clustering(vehicles, c('price', 'odometer', 'year'))
 # print_clusters(output)
+
+# What if we collapsed conditions?
+# vehicles_c = vehicles
+# vehicles_c['condition'][vehicles_c['condition'] == 'like new'] = 'new'
+# vehicles_c['condition'][vehicles_c['condition'] == 'fair'] = 'good'
+# 
+# output = run_clustering(vehicles_c, c('price', 'odometer', 'year', 'long', 'lat'))
+# print_clusters(output)
+# 
+# output = run_clustering(vehicles_c, c('price', 'odometer'))
+# print_clusters(output)
+# 
+# output = run_clustering(vehicles_c, c('price', 'odometer', 'year'))
+# print_clusters(output)
 ```

 \pagebreak

 ## Findings

-We inspected the output of the clustering and contrasted it against the vehicle's conditions. The Agglomerative method did not produce a result that was useful at all. However, the Divisive and kmeans method did produce something interesting in each run. If we look at the last run for variables `price`, `odometer`, and `year`, the Divisive ward algorithm with 3 clusters was a strong predictor of `excellent` condition vehicles. In addition, the kmeans algorithm with 3 clusters was a strong predictor of `new` condition vehicles. See output below. 
+We inspected the output of the clustering and contrasted it against the vehicle's conditions. The Agglomerative method
+did not produce a result that was useful at all. However, the Divisive and kmeans methods did produce something
+interesting in each run. If we look at the last run for variables `price`, `odometer`, and `year`, the Divisive ward
+algorithm with 3 clusters was a strong predictor of `excellent` condition vehicles. In addition, the kmeans algorithm
+with 3 clusters was a strong predictor of `new` condition vehicles. See output below.

-For the ward_3 cluster 3 algorithm, we may use the group 2 to cluster `excellent` condition vehicles. For the kmeans cluster 3 algorithm, we may predict `new` condition vehicles.
+For the ward_3 cluster 3 algorithm, we may use group 2 to identify `excellent` condition vehicles. For the kmeans
+cluster 3 algorithm, we may predict `new` condition vehicles. Collapsing the `condition` variable did not produce a more
+accurate reading.

-```
+```         
 ward_3

 group 1
@@ -279,14 +302,12 @@ group 2
 group 3
  excellent      fair      good  like new       new   salvage 
       1725       215      1659       165         1        10 
-
 ```

-## Clustering Conclusion 
-
-While our dataset lacked an expansive range of continuous variables, the limited variables still played a crucial role in our clustering analysis. Among the various clustering algorithms tested, not all clustering algorithm proved entirely effective. Nonetheless, the Divisive Ward and kmeans algorithm, specifically with three clusters, yielded a particularly intriguing outcome. These specific configuration of the algorithms demonstrated a robust predictive capacity, especially in categorizing vehicles as being in excellent or new condition vehicles.
-
-
-
-
+## Clustering Conclusion

+While our dataset lacked an expansive range of continuous variables, the limited variables still played a crucial role
+in our clustering analysis. Among the various clustering algorithms tested, not all proved entirely effective.
+Nonetheless, the Divisive Ward and kmeans algorithms, specifically with three clusters, yielded a particularly
+intriguing outcome. These specific configurations of the algorithms demonstrated a robust predictive capacity,
+especially in categorizing vehicles as being in excellent or new condition.