Commit d41cc1da authored Dec 17, 2023 by MichaelAng
[mang] remove cuz i already submitted
parent d40ddd18
Showing 1 changed file with 0 additions and 446 deletions
final copy.Rmd deleted 100644 → 0
---
title: "IT270 Final"
author: "Michael Ang"
date: "2023-12-10"
output:
pdf_document:
toc: true
toc_depth: 3
number_sections: true
editor_options:
markdown:
wrap: 120
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(FactoMineR)
library(psych)
library(C50)
library(corrplot)
library(class)
```
\pagebreak
# Question 1
## Should there be a gender discrimination and equity concern?
Equity concerns regarding gender representation are evident across several metrics within our organization. Our analysis
examined the female-to-male (F/M) ratio for both employee headcount and MMI (Median Monthly Income). A
detailed chart of these findings is presented at the end of this summary.
Across the organization, the F/M headcount ratio stands at 67%, a pattern that is broadly consistent across departments, job
levels, and roles. The disparity in MMI also raises equity issues, but it is notably less pronounced than the disparity in
headcount. At the organizational level, the F/M ratio for MMI is 98%.
Within departments, job levels, and roles, the concerns are concentrated in job levels and roles.
The highest positions (job level 5) show a 75% F/M ratio in MMI. In addition, roles 1
and 4 show concerning F/M ratios in MMI of 65% and 78%, respectively.
The ratio analysis surfaced no further concerns. The breakdown of these equity concerns is summarized in the
chart below.
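The F/M ratios quoted above are simple quotients of a per-gender summary (headcount or median income). A minimal sketch of that calculation follows; the `toy` data frame and the `fm_ratio` helper are hypothetical and only illustrate the arithmetic, they are not the survey data or the code used for this report.

```r
# Hypothetical toy data, NOT the survey data used in this report
toy = data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male", "Female"),
  JobLevel = c(1, 1, 1, 1, 2, 2),
  MonthlyIncome = c(40000, 50000, 45000, 55000, 60000, 57000)
)

# F/M ratio of any per-gender summary (length = headcount, median = MMI)
fm_ratio = function(values, gender, fun) {
  by_gender = tapply(values, gender, fun)
  by_gender[["Female"]] / by_gender[["Male"]]
}

# Organization-wide ratios
fm_headcount = fm_ratio(toy$MonthlyIncome, toy$Gender, length)
fm_mmi = fm_ratio(toy$MonthlyIncome, toy$Gender, median)

# Same MMI ratio, broken down by job level
by_level = sapply(split(toy, toy$JobLevel),
                  function(d) fm_ratio(d$MonthlyIncome, d$Gender, median))
```

A ratio below 100% means the female figure is lower than the male figure for that group; a group with no employees of one gender would need separate handling.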
## How may we increase overall job satisfaction?
To significantly enhance overall job satisfaction, our focus should prioritize:
- Maintaining consistent manager assignments
- Reducing attrition rates and commuting demands
- Enhancing stock options, promoting work-life balance, implementing salary hikes, optimizing training duration,
elevating job levels, and augmenting monthly income.
Our conclusion stems from a decision tree analysis that was tested for accuracy on a held-out set of
data (roughly 20% of employees). The model reaches an approximate accuracy rate of 70%, which supports treating these
areas as the primary contributors to job satisfaction. Other factors were taken into account, but these
areas emerged as the most influential.
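For reference, an accuracy figure like the one above is the share of test observations whose predicted class matches the actual class. The sketch below uses made-up labels (not the model's predictions) to show the overall rate and the per-class rates from a confusion matrix.

```r
# Hypothetical labels on a 1-4 satisfaction scale, for illustration only
actual    = factor(c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2), levels = 1:4)
predicted = factor(c(1, 2, 3, 1, 1, 3, 3, 4, 2, 2), levels = 1:4)

# Rows are actual classes, columns are predicted classes
cm = table(Actual = actual, Predicted = predicted)

# Overall accuracy: correct predictions (the diagonal) over all predictions
accuracy = sum(diag(cm)) / sum(cm)

# Per-class accuracy: share of each actual class that was predicted correctly
per_class = diag(prop.table(cm, margin = 1))
```

Overall accuracy can hide a class that is predicted poorly, so the per-class view is worth checking alongside it.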
\pagebreak
## Analysis
```{r}
e_df = read.csv("employee_survey_data_clean.csv")
g_df = read.csv("general_data.csv")
m_df = read.csv("manager_survey_data.csv")
employee_df = merge(e_df, g_df, by.x = "EmployeeID", by.y = "EmployeeID")
employee_df = merge(employee_df, m_df, by.x = "EmployeeID", by.y = "EmployeeID")
# Inspecting data. Commented out
# summary(employee_df)
# ncol(employee_df)
# nrow(employee_df)
# anyNA(employee_df)
# Set missing values to zero
employee_df$TotalWorkingYears[is.na(employee_df$TotalWorkingYears)] = 0
employee_df$NumCompaniesWorked[is.na(employee_df$NumCompaniesWorked)] = 0
# Drop single value columns
employee_df = subset(employee_df, select = -c(EmployeeCount,Over18,StandardHours))
```
\pagebreak
### Gender Discrimination and Equity Analysis
```{r}
# Commented out to reduce output
# discrim_emp_df = employee_df
#
# # Figure 1: First Analysis: Is there an equal gender distribution org wide?
# table(discrim_emp_df$Gender)
#
# # Figure 2: Second Analysis: Is there an equal gender distribution across department, Job Level, JobRole?
# aggregate(discrim_emp_df[c('Gender')],by=list(discrim_emp_df$Gender, discrim_emp_df$Department),FUN=length)
# aggregate(discrim_emp_df[c('Gender')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobLevel),FUN=length)
# aggregate(discrim_emp_df[c('Gender')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobRole),FUN=length)
# # Figure 3: Third Analysis: Is there a MonthlyIncome, gender distribution across department, Job Level, JobRole?
# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender),FUN=median)
# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender, discrim_emp_df$Department),FUN=median)
# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobLevel),FUN=median)
# aggregate(discrim_emp_df[c('MonthlyIncome')],by=list(discrim_emp_df$Gender, discrim_emp_df$JobRole),FUN=median)
# # Figure 4: Fourth Analysis. Are there any strong correlations between variables when split up on Gender, Department, or Job Role?
# discrim_emp_df['Department'][discrim_emp_df['Department'] == 'Human Resources'] = 0
# discrim_emp_df['Department'][discrim_emp_df['Department'] == 'Research & Development'] = 1
# discrim_emp_df['Department'][discrim_emp_df['Department'] == 'Sales'] = 2
# discrim_emp_df$Department = as.numeric(discrim_emp_df$Department)
#
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Healthcare Representative'] = 0
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Human Resources'] = 1
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Laboratory Technician'] = 2
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Manager'] = 3
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Manufacturing Director'] = 4
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Research Director'] = 5
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Research Scientist'] = 6
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Sales Executive'] = 7
# discrim_emp_df['JobRole'][discrim_emp_df['JobRole'] == 'Sales Representative'] = 8
# discrim_emp_df$JobRole = as.numeric(discrim_emp_df$JobRole)
#
# male_emp_df = discrim_emp_df[discrim_emp_df['Gender'] == 'Male',]
# female_emp_df = discrim_emp_df[discrim_emp_df['Gender'] == 'Female',]
#
# discrim_vars = c('JobLevel', 'MonthlyIncome', 'Age', 'PercentSalaryHike', 'StockOptionLevel', 'YearsAtCompany', 'Department', 'JobRole')
#
# corrplot(cor(male_emp_df[discrim_vars]), method='number', title='Male Employees', mar = c(0, 0, 2, 0),type="upper")
# corrplot(cor(female_emp_df[discrim_vars]), method='number', title='Female Employees', mar = c(0, 0, 2, 0),type="upper")
#
# # Correlation by Department
# for(i in levels(factor(male_emp_df$Department))) {
# df = male_emp_df[male_emp_df['Department'] == as.integer(i),]
# corrplot(cor(df[discrim_vars]), method='number', title='Male Employees', mar = c(0, 0, 2, 0),type="upper")
# }
#
# for(i in levels(factor(female_emp_df$Department))) {
# df = female_emp_df[female_emp_df['Department'] == as.integer(i),]
# corrplot(cor(df[discrim_vars]), method='number', title='Female Employees', mar = c(0, 0, 2, 0),type="upper")
# }
#
# # Correlation by Job Role
# for(i in levels(factor(male_emp_df$JobRole))) {
# df = male_emp_df[male_emp_df['JobRole'] == as.integer(i),]
# corrplot(cor(df[discrim_vars]), method='number', title='Male Employees', mar = c(0, 0, 2, 0),type="upper")
# }
#
# for(i in levels(factor(female_emp_df$JobRole))) {
# df = female_emp_df[female_emp_df['JobRole'] == as.integer(i),]
# corrplot(cor(df[discrim_vars]), method='number', title='Female Employees', mar = c(0, 0, 2, 0),type="upper")
# }
```
\pagebreak
```
# Female to Male Ratio
## Across Organization
67% = 1737 / 2590
## Across Department
60% = 71 / 117 # Human Resources
67% = 1142 / 1682 # Research & Development
66% = 524 / 791 # Sales
## Across JobLevel
62% = 610 / 982 # Job Level 1
72% = 662 / 912 # Job Level 2
58% = 237 / 407 # Job Level 3
77% = 137 / 176 # Job Level 4
80% = 91 / 113 # Job Level 5
## Across JobRole
67% = 153 / 228 # Job Role 0
63% = 60 / 94 # Job Role 1
63% = 296 / 464 # Job Role 2
93% = 145 / 155 # Job Role 3
69% = 175 / 253 # Job Role 4
# Female to Male Median Income
## Across Organization
98% = 48980 / 49680
## Across Department
146% = 63230 / 43190 # Human Resources
96% = 49630 / 51630 # Research & Development
93% = 45590 / 48690 # Sales
## Across JobLevel
89% = 43415 / 48340 # Job Level 1
101% = 50820 / 49990 # Job Level 2
107% = 50790 / 47390 # Job Level 3
93% = 54290 / 58225 # Job Level 4
75% = 40140 / 53470 # Job Level 5
## Across JobRole
108% = 54050 / 49980 # Job Role 0
65% = 36900 / 56610 # Job Role 1
93% = 49690 / 53240 # Job Role 2
121% = 53680 / 44010 # Job Role 3
78% = 45810 / 58550 # Job Role 4
```
\pagebreak
### Improving Overall Job Satisfaction Analysis
```{r}
# Bin the data. Intervals are right-closed, so the top break is always included;
# include.lowest = TRUE is only needed where the variable can equal the lowest break.
# (cut() has no include.highest argument, so it has been dropped.)
js_emp_df = employee_df
js_emp_df$Age = cut(js_emp_df$Age, breaks=c(15, 30, 40, 60), right = TRUE, labels = FALSE)
js_emp_df$DistanceFromHome = cut(js_emp_df$DistanceFromHome, breaks=c(0, 15, 25, 30), right = TRUE, labels = FALSE)
js_emp_df$MonthlyIncome = cut(js_emp_df$MonthlyIncome, breaks=c(0, 20000, 50000, 100000,150000,200000), right = TRUE, labels = FALSE)
js_emp_df$NumCompaniesWorked = cut(js_emp_df$NumCompaniesWorked, breaks=c(0, 3, 5, 10), right = TRUE, include.lowest = TRUE, labels = FALSE)
js_emp_df$PercentSalaryHike = cut(js_emp_df$PercentSalaryHike, breaks=c(10, 15, 20, 25), right = TRUE, labels = FALSE)
js_emp_df$TotalWorkingYears = cut(js_emp_df$TotalWorkingYears, breaks=c(0, 5, 10, 20, 40), right = TRUE, include.lowest = TRUE, labels = FALSE)
js_emp_df$YearsAtCompany = cut(js_emp_df$YearsAtCompany, breaks=c(0, 5, 10, 20, 40), right = TRUE, include.lowest = TRUE, labels = FALSE)
js_emp_df$YearsSinceLastPromotion = cut(js_emp_df$YearsSinceLastPromotion, breaks=c(0, 5, 10, 15), right = TRUE, include.lowest = TRUE, labels = FALSE)
js_emp_df$YearsWithCurrManager = cut(js_emp_df$YearsWithCurrManager, breaks=c(0, 5, 10, 15, 20), right = TRUE, include.lowest = TRUE, labels = FALSE)
# Convert the columns into factors
for (i in 1:ncol(js_emp_df)) {
  js_emp_df[,i] = as.factor(js_emp_df[,i])
}
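# Create the full model formula as a string.
# The response is excluded by name rather than by column position, which depends on the merge order.
vars = paste(setdiff(names(js_emp_df), "JobSatisfaction"), collapse=' + ')
attr_formula = paste("JobSatisfaction", vars, sep=" ~ ")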
# Split data into Training and Test
js_emp_df$rndnum = runif(nrow(js_emp_df), 1,100)
employee_train = js_emp_df[js_emp_df[, "rndnum"] <= 80 , ]
employee_test = js_emp_df[js_emp_df[, "rndnum"] > 80 , ]
# Helper function: confusion matrix as row proportions
# (share of each actual class assigned to each predicted class)
dt_moa = function(evaluation, predicted_values){
  tab = table(evaluation, predicted_values)
  dimnames(tab) = list("Actual" = levels(evaluation), "Predicted" = levels(evaluation))
  round(prop.table(tab, margin = 1), 2)
}
```
```{r}
# Fit C5.0 on the full model and evaluate it on the test set
fm_fit_c50 = C5.0(as.formula(attr_formula), data = employee_train)
employee_test$predict = predict(fm_fit_c50, newdata=employee_test, type='class')
c50_fm_chart = dt_moa(employee_test$JobSatisfaction, employee_test$predict)
```
```{r}
f_sm_c50 = "JobSatisfaction ~ YearsWithCurrManager + Attrition + StockOptionLevel + DistanceFromHome + WorkLifeBalance + PercentSalaryHike + TrainingTimesLastYear + JobLevel + MonthlyIncome"
sm_fit_c50 = C5.0(as.formula(f_sm_c50), data = employee_train)
employee_test$predict = predict(sm_fit_c50, newdata=employee_test, type='class')
c50_sm_chart = dt_moa(employee_test$JobSatisfaction, employee_test$predict)
```
```{r}
print("full model c5.0")
c50_fm_chart
print("simple model c5.0")
c50_sm_chart
```
\pagebreak
# Question 2
## a
My colleague was incorrect in the first place because of the neuron estimates. My colleague opted for 3 hidden neurons, whereas I
used a more intricate design with 38 hidden neurons split across two layers: c(25, 13). With 10000
observations, this gave a decent enough response time for the NN algorithm to run.
## b
This isn't a good rationale because my colleague is forgetting to consider the number of observations and the size of the input layer. In
addition, given that the next question will use 60k observations and time is a precious resource, 40 hidden neurons
will be very slow. I would go with a value even higher than 40 hidden neurons. It also does not increase the
accuracy.
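To make the point about the input layer concrete, the sketch below counts the weights in a fully connected network for a given set of layer sizes. The `count_weights` helper is hypothetical, and the 784 inputs and 10 outputs are assumptions based on the usual MNIST layout (28 x 28 pixels, 10 digit classes) rather than anything stated in the question.

```r
# Hypothetical helper: number of weights (including one bias per neuron)
# in a fully connected network with the given layer sizes
count_weights = function(n_inputs, hidden, n_outputs) {
  sizes = c(n_inputs, hidden, n_outputs)
  # Each layer contributes (neurons feeding in + 1 bias) * (neurons in the layer)
  sum((head(sizes, -1) + 1) * tail(sizes, -1))
}

# Assumed MNIST-style dimensions: 784 inputs, 10 outputs
w_two_layer = count_weights(784, c(25, 13), 10)
w_one_layer = count_weights(784, c(40), 10)
```

Because the first hidden layer is connected to every input, its size dominates the weight count. Run time also depends on the number of observations and on how quickly training converges, so the count is only a rough guide.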
## c
hidden=c(25, 13)
| neuron estimates | Execution Time | \# Accurate | \% Accurate |
|------------------|----------------|-------------|-------------|
| c(40) | 56.0837 mins | 6790 | 11.3 |
| c(100, 50) | 8.050967 mins | 6784 | 11.3 |
| c(200, 100) | 9.487618 mins | 6786 | 11.3 |
| c(400, 200) | 14.57694 mins | 6791 | 11.3 |
## d
### Part 1
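1. KNN - This is a classification algorithm that labels an observation using the classes of its nearest neighbors
2. Clustering - This groups similar observations together, and the resulting groups can then be mapped to classes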
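As an illustration of the nearest-neighbor idea, here is a minimal base-R sketch that classifies one new point by majority vote among its k closest training rows. The `knn_predict` helper and the toy points are hypothetical; the analysis in Part 2 uses `knn()` from the `class` package instead.

```r
# Hypothetical helper: classify one point by majority vote of its k nearest neighbors
knn_predict = function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every training row
  dists = sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest = order(dists)[1:k]
  # Majority vote; on a tie, the first class in table order wins
  votes = table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy training data: two well-separated groups
train_x = matrix(c(1, 1,  1, 2,  2, 1,
                   8, 8,  8, 9,  9, 8), ncol = 2, byrow = TRUE)
train_y = c("a", "a", "a", "b", "b", "b")

pred_low  = knn_predict(train_x, train_y, c(2, 2))
pred_high = knn_predict(train_x, train_y, c(9, 9))
```

Every prediction computes a distance to every training row, which is why run time grows with the size of the training set.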
### Part 2
```{r}
# Load the raw data; mnist_raw_raw must exist before it is subset below
mnist_raw_raw = read.csv("mnist_train.csv", header=FALSE)
mnist_raw = mnist_raw_raw[1:60000,]
mnist_raw <- replace(mnist_raw, is.na(mnist_raw), 0)
mnist_raw$rndnum = runif(nrow(mnist_raw), 1,100)
train = mnist_raw[mnist_raw[, "rndnum"] <= 80 , ]
test = mnist_raw[mnist_raw[, "rndnum"] > 80 , ]
start.time = Sys.time()
knnmdl = knn(train[-c(1, ncol(train))], test[-c(1, ncol(test))], train[,1], k = 3,prob = TRUE)
end.time = Sys.time()
actual = test[,1]
cm = table(actual,knnmdl)
accuracy = sum(diag(cm))/length(actual)
time.taken = end.time - start.time
print(time.taken)
print(sprintf("%s Accuracy: %.2f%%", names(mnist_raw[1]), accuracy*100))
```
## e
For smaller datasets, the KNN algorithm may be more suitable for classification.
## f
Nothing? This is a data mining class/employment and we want to promote a positive work environment
\pagebreak
# Question 3
\pagebreak
## a
The prior analysis proves its worth by showcasing our efficiency in variable selection. This strategic approach helps us
trim down our variables significantly, reducing the burden from 36 player attributes to a concise set of 5 attributes.
Remarkably, this streamlined approach retains a commendable 75% accuracy level. This not only simplifies our management
but also ensures a high level of predictive performance.
## aa
To build an ideal team, I would seek out an algorithm that tries to identify the ideal position each player should play
based on the player attributes. The following code lists the fields that are necessary to perform the grouping and shows how I
will describe the groups.
```{r}
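fifa_raw = read.csv("fifa.csv")
# Select columns 55 through the second-to-last column.
# (56:ncol(x) - 1 evaluates as (56:ncol(x)) - 1, so the range is written out explicitly.)
fifa_fa = fifa_raw[, 55:(ncol(fifa_raw) - 1)]
# Necessary fields to perform grouping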
names(fifa_fa)
# Here is how I will describe the groups
levels(factor(fifa_raw$Position))
```
## ab
I will not be able to assess which clubs the players belong to because there isn't a strong relationship between the
fields that I have selected and clubs. Knowing which club a player belonged to would probably be only useful when and if
I need to trade the player.
## ac
My initial analysis would be useful for future analysis because I can use my 5 new factors for further
analysis rather than all 36 player attributes.
## ad
One interesting analysis we can perform is to check whether we are paying or valuing players fairly. Based on the KNN
analysis, we may be overpaying or underpaying players.
```{r}
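fifa = read.csv("fifa.csv")
# Drop the first and last characters (currency symbol and unit suffix) before converting.
# Caveat: the suffix is discarded, so values with different suffixes end up on different scales.
fifa$Value = as.numeric(substr(fifa$Value, 2, nchar(fifa$Value)-1))
fifa$Wage = as.numeric(substr(fifa$Wage, 2, nchar(fifa$Wage)-1))
fifa = fifa[complete.cases(fifa), ]
# Fit KNN with the first column as the class and report accuracy on a roughly 20% holdout
run_knn = function (df) {
  df$rndnum = runif(nrow(df), 1,100)
  train = df[df[, "rndnum"] <= 80 , ]
  test = df[df[, "rndnum"] > 80 , ]
  # Predictors exclude the class (column 1) and rndnum (last column)
  knnmdl = knn(train[-c(1, ncol(train))], test[-c(1, ncol(test))], train[,1], k = 10,prob = TRUE)
  actual = test[,1]
  cm = table(actual,knnmdl)
  accuracy = sum(diag(cm))/length(actual)
  print(sprintf("%s Accuracy: %.2f%%", names(df[1]),accuracy*100))
}
# Column 12 as the class, columns 55 through second-to-last as predictors
# (56:ncol(fifa) - 1 evaluates as (56:ncol(fifa)) - 1, so the range is written out explicitly)
fifa_fa = fifa[,c(12, 55:(ncol(fifa) - 1))]
run_knn(fifa_fa)
# Column 13 as the class, same predictors
fifa_fa = fifa[,c(13, 55:(ncol(fifa) - 1))]
run_knn(fifa_fa)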
```
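The conversion above keeps only the number between the first and last characters, so any unit suffix is lost. A possible alternative is sketched below. It assumes the strings look like a currency symbol, a number, and an optional `K` or `M` suffix; that format is an assumption and is not confirmed by this document, and `parse_money` is a hypothetical helper.

```r
# Hypothetical helper: convert strings such as "\u20ac110.5M" or "\u20ac565K" to plain numbers.
# Assumed format: one currency symbol, a number, and an optional K or M suffix.
parse_money = function(x) {
  multiplier = ifelse(endsWith(x, "M"), 1e6,
               ifelse(endsWith(x, "K"), 1e3, 1))
  # Keep only digits and the decimal point
  number = as.numeric(gsub("[^0-9.]", "", x))
  number * multiplier
}

parsed = parse_money(c("\u20ac110.5M", "\u20ac565K", "\u20ac0"))
```

With every value on one scale, distances and comparisons between players are no longer affected by which suffix a row happened to use.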
# Question 4
## a
I agree with Michele Piccinno because without understanding the problem, using algorithms like neural networks is like
shooting in the dark. Understanding the problem helps form the right questions and guides us toward finding answers or
refining our approach. Plus, if we don't understand the problem, we can't tell if our solution works or what success
looks like.
## b
Accuracy, typically measured as a percentage of success or error in relation to the total, is a common method to gauge
performance across techniques. It's suitable because it provides insight into how well a technique performs. Achieving
100% accuracy isn't always necessary; often, an 80% success rate might still offer sufficient business value to pursue a
course of action. Additionally, without measurement, it's impossible to ascertain the effectiveness of a technique.
## c
When starting a project, dealing with data can be tricky. Understanding what the data means, checking it thoroughly,
making sure it looks right, cleaning up any mistakes, and getting rid of weird bits are all challenges. To tackle these,
I'd start by learning about the data sources, looking at the data closely, fixing any issues with how it's organized,
cleaning it up to make it better, and sorting out any strange or unexpected parts. If we do not perform the prework, the
validity of the analysis can be compromised.
## d
A decision tree predicts continuous variables by sorting them into groups, like age ranges such as 0-12 for kids, 13-18
for teens, and so on. This sorting helps handle continuous data better. It's important to pick sensible groups that
match the data we're working with in the dataset.
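As a small illustration of this kind of grouping, the sketch below bins ages with `cut()`. The ages and the "adult" label for everyone over 18 are assumptions made for the example; only the 0-12 and 13-18 ranges come from the answer above.

```r
# Hypothetical ages, for illustration only
ages = c(5, 12, 13, 18, 30, 70)

# Right-closed bins: [0, 12], (12, 18], (18, Inf)
age_group = cut(ages,
                breaks = c(0, 12, 18, Inf),
                labels = c("kid", "teen", "adult"),
                include.lowest = TRUE)
```

The break points decide which group boundary values such as 12 and 18 fall into, so they should be chosen to match how the data is recorded.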
## e
An ethical issue in data mining involves using personal information like birthdates to crack PIN numbers, which breaches
privacy. Even seemingly harmless data, when processed using certain methods, could identify people, like using operating
system preferences to pinpoint individuals submitting homework. Analysts should focus on securely storing and
transferring data to protect personal information and prevent misuse.