Commit d40ddd18 authored by MichaelAng

[mang] adding final backup incase computer crash

parent 2c844d8a
title: "IT270 Final"
author: "Michael Ang"
date: "2023-12-10"
toc: true
toc_depth: 3
number_sections: true
wrap: 120
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Question 1
## Should there be a gender discrimination and equity concern?
Equity concerns regarding gender representation are evident across various metrics within our organization. Our analysis
delved into the female-to-male (F/M) ratio in terms of both employee distribution and MMI (Median Monthly Income). A
detailed chart illustrating these findings will be presented at the conclusion of this summary.
Across the entire organization, the F/M ratio for employee distribution stands at 67%, a trend consistent across
departments, job levels, and roles. While the disparity in MMI raises some equity issues, it is notably less pronounced
than the disparity in employee distribution. At an organizational level, the F/M ratio for MMI is 98%.
Upon closer examination within departments, job levels, and roles, our focus turns primarily to job levels and roles.
Notably, the highest positions (job level 5) show a 75% F/M ratio in MMI. Furthermore, two specific roles, roles 1
and 4, show a concerning 65% and 78% F/M ratio in MMI, respectively.
No further concerns were detected after the ratio analysis. The breakdown of these equity concerns is summarized in the
chart below.
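The ratios quoted in this summary can be reproduced along the following lines. This is a minimal sketch on a made-up toy data frame: the `toy` values and the `fm_ratio` helper are illustrative assumptions, not the survey data. Ratios are truncated to whole percentages, which is how the figures in the analysis appear to have been rounded.

```r
# Toy data: hypothetical values for illustration only
toy <- data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male", "Female", "Male", "Male"),
  JobLevel = c(1, 1, 1, 1, 1, 2, 2, 2),
  MonthlyIncome = c(40000, 50000, 45000, 55000, 60000, 52000, 50000, 54000)
)

# Hypothetical helper: female value over male value, truncated to a whole percent
fm_ratio <- function(female, male) floor(100 * female / male)

# Head-count ratio across the whole toy organization
counts <- table(toy$Gender)
count_ratio <- fm_ratio(counts[["Female"]], counts[["Male"]])

# Median monthly income (MMI) ratio across the whole toy organization
mmi <- tapply(toy$MonthlyIncome, toy$Gender, median)
mmi_ratio <- fm_ratio(mmi[["Female"]], mmi[["Male"]])

# MMI ratio per job level
mmi_by_level <- tapply(toy$MonthlyIncome, list(toy$JobLevel, toy$Gender), median)
level_ratio <- fm_ratio(mmi_by_level[, "Female"], mmi_by_level[, "Male"])

print(count_ratio)
print(mmi_ratio)
print(level_ratio)
```

The same pattern extends to departments and job roles by swapping the grouping column passed to `tapply`.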
## How may we increase overall job satisfaction?
To significantly enhance overall job satisfaction, our focus should prioritize:
- Maintaining consistent manager assignments
- Reducing attrition rates and commuting demands
- Enhancing stock options, promoting work-life balance, implementing salary hikes, optimizing training duration,
elevating job levels, and augmenting monthly income.
Our conclusion stems from an in-depth decision tree analysis, rigorously tested for accuracy using a dedicated set of
data. Our modeling demonstrates an approximate accuracy rate of 70%, validating the significance of these specified
areas as the primary contributors to job satisfaction. While other factors were taken into account, these identified
areas emerged as the most influential.
## Analysis
e_df = read.csv("employee_survey_data_clean.csv")
g_df = read.csv("general_data.csv")
m_df = read.csv("manager_survey_data.csv")
employee_df = merge(e_df, g_df, by.x = "EmployeeID", by.y = "EmployeeID")
employee_df = merge(employee_df, m_df, by.x = "EmployeeID", by.y = "EmployeeID")
# Inspecting data. Commented out
# summary(employee_df)
# ncol(employee_df)
# nrow(employee_df)
# anyNA(employee_df)
# Setting to zero because values are N/A
employee_df$TotalWorkingYears[is.na(employee_df$TotalWorkingYears)] = 0
employee_df$NumCompaniesWorked[is.na(employee_df$NumCompaniesWorked)] = 0
# Drop single value columns
employee_df = subset(employee_df, select = -c(EmployeeCount,Over18,StandardHours))
### Gender Discrimination and Equity Analysis
# Commented out to reduce output
# discrim_emp_df = employee_df
# # Figure 1: First Analysis: Is there an equal gender distribution org wide?
# table(discrim_emp_df$Gender)
# # Figure 2: Second Analysis: Is there an equal gender distribution across department, Job Level, JobRole?
# aggregate(discrim_emp_df[c('Gender')],by=list(discrim_emp_df$Gender, discrim_emp_df$Department),FUN=length)
# aggregate(discrim_emp_df[c('Gender')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobLevel),FUN=length)
# aggregate(discrim_emp_df[c('Gender')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobRole),FUN=length)
# # Figure 3: Third Analysis: Is there a MonthlyIncome, gender distribution across department, Job Level, JobRole?
# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender),FUN=median)
# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender, discrim_emp_df$Department),FUN=median)
# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobLevel),FUN=median)
# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobRole),FUN=median)
# # Figure 4: Fourth Analysis. Are there any strong correlations between variables when split up on Gender, Department, or Job Role?
# discrim_emp_df['Department'][discrim_emp_df['Department'] == 'Human Resources'] = 0
# discrim_emp_df['Department'][discrim_emp_df['Department'] == 'Research & Development'] = 1
# discrim_emp_df['Department'][discrim_emp_df['Department'] == 'Sales'] = 2
# discrim_emp_df$Department = as.numeric(discrim_emp_df$Department)
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Healthcare Representative'] = 0
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Human Resources'] = 1
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Laboratory Technician'] = 2
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Manager'] = 3
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Manufacturing Director'] = 4
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Research Director'] = 5
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Research Scientist'] = 6
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Sales Executive'] = 7
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Sales Representative'] = 8
# discrim_emp_df$JobRole = as.numeric(discrim_emp_df$JobRole)
# male_emp_df = discrim_emp_df[discrim_emp_df['Gender'] == 'Male',]
# female_emp_df = discrim_emp_df[discrim_emp_df['Gender'] == 'Female',]
# discrim_vars = c('JobLevel', 'MonthlyIncome', 'Age', 'PercentSalaryHike', 'StockOptionLevel', 'YearsAtCompany', 'Department', 'JobRole')
# corrplot(cor(male_emp_df[discrim_vars]), method='number', title='Male Employees', mar = c(0, 0, 2, 0),type="upper")
# corrplot(cor(female_emp_df[discrim_vars]), method='number', title='Female Employees', mar = c(0, 0, 2, 0),type="upper")
# # Correlation by Department
# for(i in levels(factor(male_emp_df$Department))) {
# df = male_emp_df[male_emp_df['Department'] == as.integer(i),]
# corrplot(cor(df[discrim_vars]), method='number', title='Male Employees', mar = c(0, 0, 2, 0),type="upper")
# }
# for(i in levels(factor(female_emp_df$Department))) {
# df = female_emp_df[female_emp_df['Department'] == as.integer(i),]
# corrplot(cor(df[discrim_vars]), method='number', title='Female Employees', mar = c(0, 0, 2, 0),type="upper")
# }
# # Correlation by Job Role
# for(i in levels(factor(male_emp_df$JobRole))) {
# df = male_emp_df[male_emp_df['JobRole'] == as.integer(i),]
# corrplot(cor(df[discrim_vars]), method='number', title='Male Employees', mar = c(0, 0, 2, 0),type="upper")
# }
# for(i in levels(factor(female_emp_df$JobRole))) {
# df = female_emp_df[female_emp_df['JobRole'] == as.integer(i),]
# corrplot(cor(df[discrim_vars]), method='number', title='Female Employees', mar = c(0, 0, 2, 0),type="upper")
# }
# Female to Male Ratio
## Across Organization
67% = 1737 / 2590
## Across Department
60% = 71 / 117 # Human Resources
67% = 1142 / 1682 # Research & Development
66% = 524 / 791 # Sales
## Across JobLevel
62% = 610 / 982 # Job Level 1
72% = 662 / 912 # Job Level 2
58% = 237 / 407 # Job Level 3
77% = 137 / 176 # Job Level 4
80% = 91 / 113 # Job Level 5
## Across JobRole
67% = 153 / 228 # Job Role 0
63% = 60 / 94 # Job Role 1
63% = 296 / 464 # Job Role 2
93% = 145 / 155 # Job Role 3
69% = 175 / 253 # Job Role 4
# Female to Male Median Income
## Across Organization
98% = 48980 / 49680
## Across Department
146% = 63230 / 43190 # Human Resources
96% = 49630 / 51630 # Research & Development
93% = 45590 / 48690 # Sales
## Across JobLevel
89% = 43415 / 48340 # Job Level 1
101% = 50820 / 49990 # Job Level 2
107% = 50790 / 47390 # Job Level 3
93% = 54290 / 58225 # Job Level 4
75% = 40140 / 53470 # Job Level 5
## Across JobRole
108% = 54050 / 49980 # Job Role 0
65% = 36900 / 56610 # Job Role 1
93% = 49690 / 53240 # Job Role 2
121% = 53680 / 44010 # Job Role 3
78% = 45810 / 58550 # Job Role 4
### Improving Overall Job Satisfaction Analysis
# Bin the data
js_emp_df= employee_df
# cut() has no include.highest argument; with right = TRUE the top break is already included
js_emp_df$Age = cut(js_emp_df$Age, breaks=c(15, 30, 40, 60), right = TRUE, labels = FALSE)
js_emp_df$DistanceFromHome = cut(js_emp_df$DistanceFromHome, breaks=c(0, 15, 25, 30), right = TRUE, labels = FALSE)
js_emp_df$MonthlyIncome = cut(js_emp_df$MonthlyIncome, breaks=c(0, 20000, 50000, 100000,150000,200000), right = TRUE, labels = FALSE)
js_emp_df$NumCompaniesWorked = cut(js_emp_df$NumCompaniesWorked, breaks=c(0, 3, 5, 10), right = TRUE, include.lowest= TRUE, labels = FALSE)
js_emp_df$PercentSalaryHike = cut(js_emp_df$PercentSalaryHike, breaks=c(10, 15, 20, 25), right = TRUE, labels = FALSE)
js_emp_df$TotalWorkingYears = cut(js_emp_df$TotalWorkingYears, breaks=c(0, 5, 10, 20, 40), right = TRUE, include.lowest= TRUE, labels = FALSE)
js_emp_df$YearsAtCompany = cut(js_emp_df$YearsAtCompany, breaks=c(0, 5, 10, 20, 40), right = TRUE, include.lowest= TRUE, labels = FALSE)
js_emp_df$YearsSinceLastPromotion = cut(js_emp_df$YearsSinceLastPromotion, breaks=c(0, 5, 10, 15), right = TRUE, include.lowest= TRUE, labels = FALSE)
js_emp_df$YearsWithCurrManager = cut(js_emp_df$YearsWithCurrManager, breaks=c(0, 5, 10, 15, 20), right = TRUE, include.lowest= TRUE, labels = FALSE)
# Convert the columns into factors
for (i in 1:ncol(js_emp_df)){
  js_emp_df[,i] = as.factor(js_emp_df[,i])
}
# Create the full model formula as a string
# Exclude the response by name so it cannot end up among its own predictors
vars = paste(setdiff(names(js_emp_df), "JobSatisfaction"), collapse=' + ')
attr_formula = paste("JobSatisfaction", vars, sep=" ~ ")
# Split data into Training and Test
js_emp_df$rndnum = runif(nrow(js_emp_df), 1,100)
employee_train = js_emp_df[js_emp_df[, "rndnum"] <= 80 , ]
employee_test = js_emp_df[js_emp_df[, "rndnum"] > 80 , ]
# Helper function: confusion matrix with each row (actual class) normalized to proportions
dt_moa = function(evaluation, predicted_values){
  tab = table(evaluation, predicted_values)
  dimnames(tab) = list("Actual" = levels(evaluation), "Predicted" = levels(evaluation))
  # Divide every row by its total, for any number of classes
  pcttab = prop.table(tab, margin = 1)
  round(pcttab, 2)
}
# Run the C5.0 algorithm (C5.0() comes from the C50 package)
library(C50)
# Full model: all available predictors
fm_fit_c50 = C5.0(as.formula(attr_formula), data = employee_train)
employee_test$predict = predict(fm_fit_c50, newdata=employee_test, type='class')
c50_fm_chart = dt_moa(employee_test$JobSatisfaction, employee_test$predict)
# Simple model: only the most influential predictors
f_sm_c50 = "JobSatisfaction ~ YearsWithCurrManager + Attrition + StockOptionLevel + DistanceFromHome + WorkLifeBalance + PercentSalaryHike + TrainingTimesLastYear + JobLevel + MonthlyIncome"
sm_fit_c50 = C5.0(as.formula(f_sm_c50), data = employee_train)
employee_test$predict = predict(sm_fit_c50, newdata=employee_test, type='class')
c50_sm_chart = dt_moa(employee_test$JobSatisfaction, employee_test$predict)
print("full model c5.0")
print(c50_fm_chart)
print("simple model c5.0")
print(c50_sm_chart)
# Question 2
## a
My colleague was incorrect in the first place because of the neuron estimates. My colleague opted for 3 hidden neurons,
whereas I used a more intricate design with 38 hidden neurons split across two layers, `c(25, 13)`. With 10,000
observations, this gave a reasonable enough response time for the NN algorithm to run.
## b
This isn't a good rationale because my colleague is forgetting to consider the number of observations and the size of
the input layer. In addition, given that the next question will be using 60k observations and time is a precious
resource, 40 hidden neurons will be very slow. I would go with a value even higher than 40 hidden neurons. It also does
not increase the accuracy.
## c
hidden=c(25, 13)
| Neuron estimates | Execution time | \# Accurate | \% Accurate |
|------------------|----------------|-------------|-------------|
| c(40)            | 56.0837 mins   | 6790        | 11.3        |
| c(100, 50)       | 8.050967 mins  | 6784        | 11.3        |
| c(200, 100)      | 9.487618 mins  | 6786        | 11.3        |
| c(400, 200)      | 14.57694 mins  | 6791        | 11.3        |
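As a quick check of the arithmetic behind the `% Accurate` column, the sketch below recomputes it from the `# Accurate` counts. It assumes accuracy was measured over all 60,000 observations, which is not stated explicitly in the table, and it adds the chance level for uniform guessing among the 10 digit classes as a point of comparison.

```r
# Counts of correct predictions, copied from the table
n_correct <- c("c(40)" = 6790, "c(100, 50)" = 6784,
               "c(200, 100)" = 6786, "c(400, 200)" = 6791)

# Assumption: accuracy is measured over all 60,000 observations
n_total <- 60000

# Percentage of correct predictions, rounded to one decimal place
pct_accurate <- round(100 * n_correct / n_total, 1)
print(pct_accurate)

# Chance level when guessing uniformly among the 10 digit classes
chance_pct <- 100 / 10
print(chance_pct)
```

Under that assumption every configuration lands at 11.3%, only slightly above the 10% chance level, which is consistent with the observation that adding neurons did not improve accuracy.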
## d
### Part 1
1. KNN - This is a classification algorithm based on the nearest neighbors
2. Clustering - This is a literal classification algorithm
### Part 2
# mnist_raw_raw = read.csv("mnist_train.csv", header=FALSE)
mnist_raw = mnist_raw_raw[1:60000,]
# Set missing values to 0
mnist_raw <- replace(mnist_raw, is.na(mnist_raw), 0)
# Split data into Training and Test
mnist_raw$rndnum = runif(nrow(mnist_raw), 1,100)
train = mnist_raw[mnist_raw[, "rndnum"] <= 80 , ]
test = mnist_raw[mnist_raw[, "rndnum"] > 80 , ]
library(class) # provides knn()
start.time = Sys.time()
# Column 1 holds the digit label; the last column is the random split helper
knnmdl = knn(train[-c(1, ncol(train))], test[-c(1, ncol(test))], train[,1], k = 3,prob = TRUE)
end.time = Sys.time()
actual = test[,1]
cm = table(actual,knnmdl)
accuracy = sum(diag(cm))/length(actual)
time.taken = end.time - start.time
print(sprintf("%s Accuracy: %.2f%%", names(mnist_raw[1]), accuracy*100))
## e
For smaller datasets, the KNN algorithm may be more suitable for classification.
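To illustrate KNN on a small dataset, here is a minimal sketch that uses R's built-in `iris` data instead of MNIST. That substitution is an assumption made so the example runs quickly and without external files. The seed, the 80/20 split, and `k = 3` are illustrative choices.

```r
library(class)  # provides knn()

set.seed(42)  # make the random split reproducible
n <- nrow(iris)
train_idx <- sample(n, size = round(0.8 * n))  # 80/20 train/test split

# Columns 1 to 4 are the numeric measurements; Species is the class label
train_x <- iris[train_idx, 1:4]
test_x <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_y <- iris$Species[-train_idx]

# Classify each test row by majority vote among its 3 nearest training rows
pred <- knn(train_x, test_x, cl = train_y, k = 3)

# Confusion matrix and overall accuracy
cm <- table(actual = test_y, predicted = pred)
accuracy <- sum(diag(cm)) / length(test_y)
print(cm)
print(sprintf("Accuracy: %.2f%%", accuracy * 100))
```

With only 150 rows the whole run is nearly instantaneous, whereas KNN has to compare every test row against every training row, so the cost grows quickly with dataset size.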
## f
Nothing? This is a data mining class/employment and we want to promote a positive work environment
# Question 3
## a
The prior analysis proves its worth by showcasing our efficiency in variable selection. This strategic approach helps us
trim down our variables significantly, reducing the burden from 36 player attributes to a concise set of 5 attributes.
Remarkably, this streamlined approach retains a commendable 75% accuracy level. This not only simplifies our management
but also ensures a high level of predictive performance.
## aa
To build an ideal team, I would seek out an algorithm that tries to identify the ideal position each player should play
based on the player attributes. The following code shows the fields that are necessary to perform the grouping and how
I will be describing the groups.
fifa_raw = read.csv("fifa.csv")
# `56:ncol(x) - 1` evaluates as `(56:ncol(x)) - 1`, i.e. columns 55 through ncol(x) - 1
fifa_fa = fifa_raw[, 55:(ncol(fifa_raw) - 1)]
# Necessary fields to perform grouping
# Here is how I will describe the groups
## ab
I will not be able to assess which clubs the players belong to because there isn't a strong relationship between the
fields that I have selected and clubs. Knowing which club a player belonged to would probably be only useful when and if
I need to trade the player.
## ac
My initial analysis would be useful for future analysis because I can leverage my new 5 factors to perform further
analysis rather than relying on my 36 player attributes.
## ad
One interesting analysis we can perform is to see whether we are paying or valuing players fairly. Based on the KNN
analysis, we may be overpaying or underpaying players.
fifa = read.csv("fifa.csv")
fifa$Value = as.numeric(substr(fifa$Value, 2, nchar(fifa$Value)-1))
fifa$Wage = as.numeric(substr(fifa$Wage, 2, nchar(fifa$Wage)-1))
fifa = fifa[complete.cases(fifa), ]
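library(class) # provides knn()
# Fit KNN with the first column as the class label and report accuracy on a held-out 20%
run_knn = function (df) {
  df$rndnum = runif(nrow(df), 1,100)
  train = df[df[, "rndnum"] <= 80 , ]
  test = df[df[, "rndnum"] > 80 , ]
  knnmdl = knn(train[-c(1, ncol(train))], test[-c(1, ncol(test))], train[,1], k = 10,prob = TRUE)
  actual = test[,1]
  cm = table(actual,knnmdl)
  accuracy = sum(diag(cm))/length(actual)
  print(sprintf("%s Accuracy: %.2f%%", names(df[1]),accuracy*100))
}
# Column 12 as the label, player attribute columns as predictors
fifa_fa = fifa[, c(12, 55:(ncol(fifa) - 1))]
run_knn(fifa_fa)
# Column 13 as the label, same predictors
fifa_fa = fifa[, c(13, 55:(ncol(fifa) - 1))]
run_knn(fifa_fa)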
# Question 4
## a
I agree with Michele Piccinno because without understanding the problem, using algorithms like neural networks is like
shooting in the dark. Understanding the problem helps form the right questions and guides us toward finding answers or
refining our approach. Plus, if we don't understand the problem, we can't tell if our solution works or what success
looks like.
## b
Accuracy, typically measured as a percentage of success or error in relation to the total, is a common method to gauge
performance across techniques. It's suitable because it provides insight into how well a technique performs. Achieving
100% accuracy isn't always necessary; often, an 80% success rate might still offer sufficient business value to pursue a
course of action. Additionally, without measurement, it's impossible to ascertain the effectiveness of a technique.
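As a concrete illustration of the measure described above, the sketch below computes accuracy and error rate from a confusion matrix. The `actual` and `predicted` vectors are made-up values for illustration only.

```r
# Hypothetical labels for ten observations
actual <- factor(c("yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "no"),
                 levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no", "no", "yes", "yes", "yes", "no", "yes", "no"),
                    levels = c("no", "yes"))

# Confusion matrix: rows are actual classes, columns are predicted classes
cm <- table(Actual = actual, Predicted = predicted)

# Accuracy is the share of observations on the diagonal (correct predictions)
accuracy <- sum(diag(cm)) / sum(cm)
error_rate <- 1 - accuracy

print(cm)
print(sprintf("Accuracy: %.0f%%, error rate: %.0f%%", accuracy * 100, error_rate * 100))
```

Because the calculation only needs actual and predicted labels, the same number can be compared across decision trees, KNN, and neural networks.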
## c
When starting a project, dealing with data can be tricky. Understanding what the data means, checking it thoroughly,
making sure it looks right, cleaning up any mistakes, and getting rid of weird bits are all challenges. To tackle these,
I'd start by learning about the data sources, looking at the data closely, fixing any issues with how it's organized,
cleaning it up to make it better, and sorting out any strange or unexpected parts. If we do not perform the prework, the
validity of the analysis can be compromised.
## d
A decision tree predicts continuous variables by sorting them into groups, like age ranges such as 0-12 for kids, 13-18
for teens, and so on. This sorting helps handle continuous data better. It's important to pick sensible groups that
match the data we're working with in the dataset.
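The grouping described above can be sketched with `cut()`. The ages are made-up values, and the `adult` group for 19 and over is an assumption that stands in for the "and so on".

```r
# Hypothetical ages for illustration only
ages <- c(3, 12, 13, 18, 19, 45)

# Intervals are closed on the right: (-Inf, 12], (12, 18], (18, Inf)
age_group <- cut(ages,
                 breaks = c(-Inf, 12, 18, Inf),
                 labels = c("kid", "teen", "adult"),
                 right = TRUE)

print(data.frame(ages, age_group))
print(table(age_group))
```

Choosing the break points is the part that needs domain judgment, since a boundary in the wrong place can hide the pattern the tree is meant to find.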
## e
An ethical issue in data mining involves using personal information like birthdates to crack PIN numbers, which breaches
privacy. Even seemingly harmless data, when processed using certain methods, could identify people, like using operating
system preferences to pinpoint individuals submitting homework. Analysts should focus on securely storing and
transferring data to protect personal information and prevent misuse.
@@ -149,15 +149,18 @@ write.csv(vehicles, "vehicles_cleaned.csv", row.names=FALSE)
# Clustering
This analysis aims to ascertain the viability of utilizing clustering on variables to predict a vehicle's condition. We
can summarize this analysis in the following statement.
> Can we use clustering analysis to determine the condition of the vehicle?
One notable feature of this data is that it is heavily skewed towards categorical data, with little numerical data.
The overall approach taken is as follows:
1. Perform the clustering on the numerical data.
2. Rejoin the clustering output dataframes back to the original dataset, so that we do not use condition as a
   predictor.
3. Then examine the results of vehicle condition by cluster.
4. Run clustering over different variations of variables.
@@ -239,17 +242,37 @@ print_clusters = function(cluster) {
# output = run_clustering(vehicles, c('price', 'odometer', 'year'))
# print_clusters(output)
# What if we collapsed conditions?
# vehicles_c = vehicles
# vehicles_c['condition'][vehicles_c['condition'] == 'like new'] = 'new'
# vehicles_c['condition'][vehicles_c['condition'] == 'fair'] = 'good'
# output = run_clustering(vehicles_c, c('price', 'odometer', 'year', 'long', 'lat'))
# print_clusters(output)
# output = run_clustering(vehicles_c, c('price', 'odometer'))
# print_clusters(output)
# output = run_clustering(vehicles_c, c('price', 'odometer', 'year'))
# print_clusters(output)
## Findings
We inspected the output of the clustering and contrasted it against the vehicle's conditions. The Agglomerative method
did not produce a useful result at all. However, the Divisive and kmeans methods did produce something interesting in
each run. If we look at the last run for variables `price`, `odometer`, and `year`, the Divisive ward algorithm with 3
clusters was a strong predictor of `excellent` condition vehicles. In addition, the kmeans algorithm with 3 clusters
was a strong predictor of `new` condition vehicles. See output below.
For the `ward_3` (3-cluster) algorithm, we may use group 2 to identify `excellent` condition vehicles. For the
3-cluster kmeans algorithm, we may predict `new` condition vehicles. Collapsing the `condition` variable did not
produce a more accurate result.
group 1
@@ -279,14 +302,12 @@ group 2
group 3
excellent fair good like new new salvage
1725 215 1659 165 1 10
## Clustering Conclusion
While our dataset lacked an expansive range of continuous variables, the limited variables still played a crucial role
in our clustering analysis. Among the various clustering algorithms tested, not all proved effective. Nonetheless, the
Divisive Ward and kmeans algorithms, specifically with three clusters, yielded a particularly intriguing outcome. These
specific configurations of the algorithms demonstrated a robust predictive capacity, especially in categorizing
vehicles as being in excellent or new condition.
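The approach of clustering on numerical variables and then comparing the clusters against `condition` can be sketched as follows. This is a minimal illustration on synthetic, well-separated toy data, not the vehicles dataset, and it does not use the `run_clustering` or `print_clusters` helpers from this analysis. The group means, the seed, and `nstart = 10` are all assumptions.

```r
set.seed(1)  # reproducible toy data and cluster starts

# Synthetic vehicles: three well-separated price/odometer groups
toy <- data.frame(
  price = c(rnorm(20, 5000, 300), rnorm(20, 20000, 300), rnorm(20, 40000, 300)),
  odometer = c(rnorm(20, 150000, 2000), rnorm(20, 80000, 2000), rnorm(20, 10000, 2000)),
  condition = rep(c("fair", "excellent", "new"), each = 20)
)

# Cluster on the scaled numerical columns only; condition is not a predictor
features <- scale(toy[, c("price", "odometer")])
km <- kmeans(features, centers = 3, nstart = 10)

# Rejoin the cluster labels and cross-tabulate them against condition
crosstab <- table(cluster = km$cluster, condition = toy$condition)
print(crosstab)

# Share of each condition within each cluster
print(round(prop.table(crosstab, margin = 1), 2))
```

On real data the clusters will be far less pure than in this toy case, so the row shares are the useful number: a cluster dominated by one condition is the kind of result reported for `excellent` and `new` above.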