library(tidyverse)
library(readxl)
library(httr) #to fetch the data
library(janitor) #for the clean_names() function for data cleaning
library(reticulate) #to enable Python within R
library(DBI) #to establish in-memory database of R dataframe
library(RSQLite) #for SQLite engine
library(lubridate) #for functions to handle dates
Introduction
Purpose
This project has three purposes:
- To show how to run R, SQL, and Python all interchangeably in a Quarto document
- To compare the ease of writing code using dplyr (R), SQL, and pandas (Python)
- To include some demonstration of SQL in my portfolio (which is often not included but remains a critical skill)
Setup and data preparation
Required packages:
About the dataset:
This dataset is from data.world, and includes information from Chicago’s Department of Human Resources for city employees in 2017. It’s a simple dataset to allow for comparisons across languages.
# reading as a temporary file, then saving as df
GET("https://query.data.world/s/hu5dkviuxd6k2ipuhpxjuyuds7aplu?dws=00000", write_disk(tf <- tempfile(fileext = ".xls")))
Response [https://download.data.world/file_download/wbezchicago/chicago-employee-positions-and-salaries-for-2017/Employee%20Salary%20Data%20as%20of%20Sept.%202017.xls?auth=eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiJwcm9kLXVzZXItY2xpZW50Om1yYWZhMyIsImlzcyI6ImFnZW50Om1yYWZhMzo6ODg0MjQ2ZDItMTgzMy00NmZjLTk2YTMtZjQ2MWMzMDJjOTZiIiwiaWF0IjoxNzIxODc0NzgyLCJyb2xlIjpbInVzZXIiLCJ1c2VyX2FwaV9hZG1pbiIsInVzZXJfYXBpX2VudGVycHJpc2VfYWRtaW4iLCJ1c2VyX2FwaV9yZWFkIiwidXNlcl9hcGlfd3JpdGUiXSwiZ2VuZXJhbC1wdXJwb3NlIjpmYWxzZSwidXJsIjoiMjM2ZDZlZGQ3NjdkZmVmOGRjYzM0Mzg3YTExMDQ1N2EzMmU1OGY3ZSJ9.oTr9PF0rrsLdVhewmy2v1vRgYvT0jy_PBmqhmrufaRSB25PEm48ZPxvswiSrZw8bNaACxDWpxwBiJVCVSaM-Zg]
Date: 2024-11-07 20:59
Status: 200
Content-Type: application/vnd.ms-excel
Size: 4.72 MB
<ON DISK> /var/folders/ck/dmgn8lbx6vl8sl89vlhr0sv00000gn/T//RtmpkOuOtb/filed8ea1dd52f04.xls
<- read_excel(tf) %>%
df #clean_names() to make all column names lowercase
clean_names()
Rows: 32,806
Columns: 8
$ name <chr> "AARON, JEFFERY M", "AARON, KARINA", …
$ title <chr> "SERGEANT", "POLICE OFFICER (ASSIGNED A…
$ department <chr> "POLICE", "POLICE", "FLEET AND FACILITY…
$ salary_annual <dbl> 101442.0, 94122.0, 101592.0, 110064.0, …
$ original_hire_date <dttm> 2005-09-26, 2005-09-26, 1991-08-01, 19…
$ start_date_in_present_position <dttm> 2016-06-01, 2017-04-16, 2000-05-01, 20…
$ salary_basis <chr> "SALARY", "SALARY", "SALARY", "SALARY",…
$ employment_category <chr> "Fulltime-Regular", "Fulltime-Regular",…
Setting up Python and SQL to execute
The DBI::
package allows you to create an in-memory database to query against. The DBI project site is a great place to learn more about it. I’ll start by doing some initial setup and establishing the connection between the R dataframe and the SQL table name that I’ll query.
<- DBI::dbConnect(SQLite(), ":memory:")
con ::dbWriteTable(conn = con, name = "df", value = df, field.types = c("original_hire_date" = "Date"), row.names = FALSE) DBI
The reticulate::
package allows for executing Python code in an R environment. The reticulate project site includes useful examples for getting up and running with Python in R. This package includes an r_to_py()
function that is needed to convert an R dataframe into a pandas dataframe.
$df <- r_to_py(df) py
When inserting a code chunk to your Markdown file, it originally defaults to ‘{r}’. You can simply change this to ‘python’ or ‘sql’ and, with the above set up, the code works beautifully in a Quarto document!
In the sections that follow, I used ChatGPT to generate prompts as querying exercises. For the initial code chunks,
1. Basic Selection and Filtering:
Retrieve all records of employees who have the job title “POLICE OFFICER”.
select *
from df
where title = 'POLICE OFFICER'
name | title | department | salary_annual | original_hire_date | start_date_in_present_position | salary_basis | employment_category |
---|---|---|---|---|---|---|---|
ABBATE, TERRY M | POLICE OFFICER | POLICE | 93354 | 818035200 | 818035200 | SALARY | Fulltime-Regular |
ABDALLAH, ZAID | POLICE OFFICER | POLICE | 84054 | 1354233600 | 1354233600 | SALARY | Fulltime-Regular |
ABDELHADI, ABDALMAHD | POLICE OFFICER | POLICE | 87006 | 1166400000 | 1166400000 | SALARY | Fulltime-Regular |
ABDELMAJEID, AZIZ | POLICE OFFICER | POLICE | 84054 | 1209340800 | 1209340800 | SALARY | Fulltime-Regular |
ABEJERO, JASON V | POLICE OFFICER | POLICE | 90024 | 940809600 | 940809600 | SALARY | Fulltime-Regular |
ABFALL, RICHARD C | POLICE OFFICER | POLICE | 48078 | 1494892800 | 1494892800 | SALARY | Fulltime-Regular |
ABNEY, PATRICK | POLICE OFFICER | POLICE | 72510 | 1435536000 | 1435536000 | SALARY | Fulltime-Regular |
ABOUASSI, CHADI | POLICE OFFICER | POLICE | 48078 | 1494892800 | 1494892800 | SALARY | Fulltime-Regular |
ABRAHAM, NANCY A | POLICE OFFICER | POLICE | 76266 | 1398643200 | 1398643200 | SALARY | Fulltime-Regular |
ABRAM, ANTHONY A | POLICE OFFICER | POLICE | 68616 | 1466985600 | 1466985600 | SALARY | Fulltime-Regular |
%>%
df filter(title == 'POLICE OFFICER')
# A tibble: 9,213 × 8
name title department salary_annual original_hire_date
<chr> <chr> <chr> <dbl> <dttm>
1 ABBATE, TERRY M POLICE OF… POLICE 93354 1995-12-04 00:00:00
2 ABDALLAH, ZAID POLICE OF… POLICE 84054 2012-11-30 00:00:00
3 ABDELHADI, ABDALMAHD POLICE OF… POLICE 87006 2006-12-18 00:00:00
4 ABDELMAJEID, AZIZ POLICE OF… POLICE 84054 2008-04-28 00:00:00
5 ABEJERO, JASON V POLICE OF… POLICE 90024 1999-10-25 00:00:00
6 ABFALL, RICHARD C POLICE OF… POLICE 48078 2017-05-16 00:00:00
7 ABNEY, PATRICK POLICE OF… POLICE 72510 2015-06-29 00:00:00
8 ABOUASSI, CHADI POLICE OF… POLICE 48078 2017-05-16 00:00:00
9 ABRAHAM, NANCY A POLICE OF… POLICE 76266 2014-04-28 00:00:00
10 ABRAM, ANTHONY A POLICE OF… POLICE 68616 2016-06-27 00:00:00
# ℹ 9,203 more rows
# ℹ 3 more variables: start_date_in_present_position <dttm>,
# salary_basis <chr>, employment_category <chr>
'title'] == 'POLICE OFFICER'] df[df[
name title ... salary_basis employment_category
10 ABBATE, TERRY M POLICE OFFICER ... SALARY Fulltime-Regular
12 ABDALLAH, ZAID POLICE OFFICER ... SALARY Fulltime-Regular
13 ABDELHADI, ABDALMAHD POLICE OFFICER ... SALARY Fulltime-Regular
15 ABDELMAJEID, AZIZ POLICE OFFICER ... SALARY Fulltime-Regular
24 ABEJERO, JASON V POLICE OFFICER ... SALARY Fulltime-Regular
... ... ... ... ... ...
32800 ZYGMUNT, DAWID POLICE OFFICER ... SALARY Fulltime-Regular
32801 ZYLINSKA, KATARZYNA POLICE OFFICER ... SALARY Fulltime-Regular
32802 ZYMANTAS, LAURA C POLICE OFFICER ... SALARY Fulltime-Regular
32803 ZYMANTAS, MARK E POLICE OFFICER ... SALARY Fulltime-Regular
32804 ZYRKOWSKI, CARLO E POLICE OFFICER ... SALARY Fulltime-Regular
[9213 rows x 8 columns]
List the names and salaries of employees whose annual salary is greater than $100,000.
select name, salary_annual
from df
where salary_annual > 100000
name | salary_annual |
---|---|
AARON, JEFFERY M | 101442 |
AARON, KIMBERLEI R | 101592 |
ABAD JR, VICENTE M | 110064 |
ABBATEMARCO, JAMES J | 103350 |
ABDELLATIF, AREF R | 102228 |
ABDUL-KARIM, MUHAMMAD A | 111492 |
ABDULLAH, KEVIN | 114846 |
ABOUELKHEIR, HASSAN A | 110064 |
ABRAHAM, GIRLEY T | 110064 |
ABRAMSKI, JOHN E | 128970 |
%>%
df filter(salary_annual > 100000) %>%
select(name, salary_annual)
# A tibble: 6,091 × 2
name salary_annual
<chr> <dbl>
1 AARON, JEFFERY M 101442
2 AARON, KIMBERLEI R 101592
3 ABAD JR, VICENTE M 110064
4 ABBATEMARCO, JAMES J 103350
5 ABDELLATIF, AREF R 102228
6 ABDUL-KARIM, MUHAMMAD A 111492
7 ABDULLAH, KEVIN 114846
8 ABOUELKHEIR, HASSAN A 110064
9 ABRAHAM, GIRLEY T 110064
10 ABRAMSKI, JOHN E 128970
# ℹ 6,081 more rows
'salary_annual'] > 100000][['name', 'salary_annual']] df[df[
name salary_annual
0 AARON, JEFFERY M 101442.0
2 AARON, KIMBERLEI R 101592.0
3 ABAD JR, VICENTE M 110064.0
9 ABBATEMARCO, JAMES J 103350.0
14 ABDELLATIF, AREF R 102228.0
... ... ...
32783 ZUPAN, BILL M 114324.0
32785 ZURAWSKI, JEFFREY 101608.0
32794 ZWOLFER, MATTHEW W 114324.0
32798 ZYGADLO, MICHAEL J 101608.0
32805 ZYSKOWSKI, DARIUSZ 115932.0
[6091 rows x 2 columns]
Find all employees who work in the “FINANCE” department.
select name
from df
where department = 'FINANCE'
name |
---|
ADAMCZYK JR, JAN |
ADAPON, NENITA P |
ADENI, MOHAMED K |
AHMED, MOHAMMAD A |
AIKONEDO, PAUL E |
ALAM, SYED S |
ALEXANDER, PATRICIA L |
ALEXANDER, RACQUEL L |
ALLEN, JONATHAN E |
ALLEN, KEVIN M |
%>%
df filter(department == 'FINANCE') %>%
select(name)
# A tibble: 563 × 1
name
<chr>
1 ADAMCZYK JR, JAN
2 ADAPON, NENITA P
3 ADENI, MOHAMED K
4 AHMED, MOHAMMAD A
5 AIKONEDO, PAUL E
6 ALAM, SYED S
7 ALEXANDER, PATRICIA L
8 ALEXANDER, RACQUEL L
9 ALLEN, JONATHAN E
10 ALLEN, KEVIN M
# ℹ 553 more rows
'department'] == 'FINANCE']['name'] df[df[
109 ADAMCZYK JR, JAN
162 ADAPON, NENITA P
175 ADENI, MOHAMED K
263 AHMED, MOHAMMAD A
278 AIKONEDO, PAUL E
...
32498 ZACARIAZ, ALEJANDRO
32522 ZAIDI, SYED K
32605 ZAVALA, FERNANDO
32680 ZHANG, KEFENG
32789 ZUREK, MARY H
Name: name, Length: 563, dtype: object
2. Aggregate Functions:
Calculate the average salary for all employees.
select avg(salary_annual)
from df
avg(salary_annual) |
---|
81631.46 |
%>%
df summarize(mean(salary_annual))
# A tibble: 1 × 1
`mean(salary_annual)`
<dbl>
1 81631.
'salary_annual'].mean() df[
np.float64(81631.4588758154)
Find the total number of employees in each department.
select department, count(*) as ttl_employees
from df
group by department
order by ttl_employees desc
department | ttl_employees |
---|---|
POLICE | 13055 |
FIRE | 4833 |
OEMC | 2121 |
STREETS & SAN | 2045 |
WATER MGMNT | 1902 |
AVIATION | 1414 |
TRANSPORTN | 1244 |
PUBLIC LIBRARY | 978 |
FLEET AND FACILITY MGMT | 964 |
FAMILY & SUPPORT | 625 |
%>%
df group_by(department) %>%
count() %>%
arrange(-n)
# A tibble: 36 × 2
# Groups: department [36]
department n
<chr> <int>
1 POLICE 13055
2 FIRE 4833
3 OEMC 2121
4 STREETS & SAN 2045
5 WATER MGMNT 1902
6 AVIATION 1414
7 TRANSPORTN 1244
8 PUBLIC LIBRARY 978
9 FLEET AND FACILITY MGMT 964
10 FAMILY & SUPPORT 625
# ℹ 26 more rows
'department').agg(
df.groupby(= ('name', 'count')
ttl_employees ='ttl_employees', ascending=False) ).reset_index().sort_values(by
department ttl_employees
27 POLICE 13055
15 FIRE 4833
26 OEMC 2121
31 STREETS & SAN 2045
34 WATER MGMNT 1902
2 AVIATION 1414
32 TRANSPORTN 1244
30 PUBLIC LIBRARY 978
16 FLEET AND FACILITY MGMT 964
13 FAMILY & SUPPORT 625
14 FINANCE 563
17 HEALTH 503
9 CITY COUNCIL 453
23 LAW 392
6 BUILDINGS 277
18 HOUSING & ECON DEV 204
7 BUSINESS AFFAIRS 161
3 BOARD OF ELECTION 103
12 DoIT 98
29 PROCUREMENT 97
21 INSPECTOR GEN 89
8 CITY CLERK 89
25 MAYOR'S OFFICE 86
1 ANIMAL CONTRL 80
20 HUMAN RESOURCES 71
10 CULTURAL AFFAIRS 67
5 BUDGET & MGMT 46
0 ADMIN HEARNG 39
11 DISABILITIES 29
33 TREASURER 22
19 HUMAN RELATIONS 17
22 IPRA 13
4 BOARD OF ETHICS 8
28 POLICE BOARD 2
24 LICENSE APPL COMM 1
Determine the highest salary in the dataset.
select max(salary_annual)
from df
max(salary_annual) |
---|
3e+05 |
%>%
df summarize(max(salary_annual))
# A tibble: 1 × 1
`max(salary_annual)`
<dbl>
1 300000
'salary_annual'].max() df[
np.float64(300000.0)
3. Grouping and Sorting:
List the average salary for each job title.
select title, avg(salary_annual) as avg_salary
from df
group by title
order by avg_salary desc
title | avg_salary |
---|---|
COMMISSIONER OF AVIATION | 300000 |
SUPERINTENDENT OF POLICE | 260004 |
MAYOR | 216210 |
FIRE COMMISSIONER | 202728 |
FIRST DEPUTY FIRE COMMISSIONER | 197736 |
FIRST DEPUTY SUPERINTENDENT | 197724 |
CHIEF OF STAFF | 195000 |
DEPUTY FIRE COMMISSIONER | 187680 |
CHIEF | 185364 |
ASST DEPUTY FIRE COMMISSIONER | 185352 |
%>%
df group_by(title) %>%
summarize(avg_salary = mean(salary_annual, na.rm = TRUE)) %>%
arrange(-avg_salary)
# A tibble: 1,121 × 2
title avg_salary
<chr> <dbl>
1 COMMISSIONER OF AVIATION 300000
2 SUPERINTENDENT OF POLICE 260004
3 MAYOR 216210
4 FIRE COMMISSIONER 202728
5 FIRST DEPUTY FIRE COMMISSIONER 197736
6 FIRST DEPUTY SUPERINTENDENT 197724
7 CHIEF OF STAFF 195000
8 DEPUTY FIRE COMMISSIONER 187680
9 CHIEF 185364
10 ASST DEPUTY FIRE COMMISSIONER 185352
# ℹ 1,111 more rows
'title').agg(
df.groupby(= ('salary_annual', 'mean')
avg_salary ='avg_salary', ascending=False) ).reset_index().sort_values(by
title avg_salary
252 COMMISSIONER OF AVIATION 300000.000000
997 SUPERINTENDENT OF POLICE 260004.000000
731 MAYOR 216210.000000
516 FIRE COMMISSIONER 202728.000000
539 FIRST DEPUTY FIRE COMMISSIONER 197736.000000
... ... ...
785 POLICE CADET 9838.400000
828 PROGRAM AIDE 9728.622222
1077 TITLE V PROGRAM TRAINEE I 8580.000000
576 FOSTER GRANDPARENT 2756.000000
915 SENIOR COMPANION 2756.000000
[1121 rows x 2 columns]
Find the top 5 highest-paid departments.
select department, avg(salary_annual) as avg_salary
from df
group by department
order by avg_salary desc
limit 5
department | avg_salary |
---|---|
IPRA | 102758.77 |
DoIT | 99220.53 |
BUILDINGS | 99095.58 |
FIRE | 98203.67 |
BOARD OF ETHICS | 95061.00 |
%>%
df group_by(department) %>%
summarize(avg_salary = mean(salary_annual)) %>%
top_n(5, avg_salary) %>%
arrange(-avg_salary)
# A tibble: 5 × 2
department avg_salary
<chr> <dbl>
1 IPRA 102759.
2 DoIT 99221.
3 BUILDINGS 99096.
4 FIRE 98204.
5 BOARD OF ETHICS 95061
'department').agg(
df.groupby(= ('salary_annual', 'mean')
avg_salary ='avg_salary', ascending=False).head(5) ).reset_index().sort_values(by
department avg_salary
22 IPRA 102758.769231
12 DoIT 99220.530612
6 BUILDINGS 99095.580650
15 FIRE 98203.672849
4 BOARD OF ETHICS 95061.000000
Show the number of employees for each job title, ordered by the number of employees in descending order.
select title, count(*) as cnt_employees
from df
group by title
order by cnt_employees desc
title | cnt_employees |
---|---|
POLICE OFFICER | 9213 |
FIREFIGHTER-EMT | 1492 |
SERGEANT | 1162 |
POLICE OFFICER (ASSIGNED AS DETECTIVE) | 1015 |
MOTOR TRUCK DRIVER | 709 |
POOL MOTOR TRUCK DRIVER | 693 |
SANITATION LABORER | 634 |
FIREFIGHTER | 533 |
CROSSING GUARD | 464 |
CONSTRUCTION LABORER | 456 |
%>%
df group_by(title) %>%
count() %>%
arrange(-n)
# A tibble: 1,121 × 2
# Groups: title [1,121]
title n
<chr> <int>
1 POLICE OFFICER 9213
2 FIREFIGHTER-EMT 1492
3 SERGEANT 1162
4 POLICE OFFICER (ASSIGNED AS DETECTIVE) 1015
5 MOTOR TRUCK DRIVER 709
6 POOL MOTOR TRUCK DRIVER 693
7 SANITATION LABORER 634
8 FIREFIGHTER 533
9 CROSSING GUARD 464
10 CONSTRUCTION LABORER 456
# ℹ 1,111 more rows
'title').agg(
df.groupby(= ('name', 'count')
cnt_employees ='cnt_employees', ascending=False) ).reset_index().sort_values(by
title cnt_employees
791 POLICE OFFICER 9213
529 FIREFIGHTER-EMT 1492
946 SERGEANT 1162
796 POLICE OFFICER (ASSIGNED AS DETECTIVE) 1015
744 MOTOR TRUCK DRIVER 709
.. ... ...
218 CHIEF SYSTEMS PROGRAMMER 1
220 CHIEF WATER CHEMIST 1
221 CHIEF WATER ENGINEER 1
222 CHIEF ZONING PLAN EXAMINER 1
0 1ST DEPUTY INSPECTOR GENERAL 1
[1121 rows x 2 columns]
4. Joining and Subqueries:
Retrieve the details of employees who have a salary in the top five of their department. Show the name, title, department, salary, and ranking.
with salary_by_dept_rank as (
select
name,
title,
department,
salary_annual,RANK() OVER(
partition by department
order by salary_annual desc) as salary_rank_by_department
from df
where department IS NOT NULL
)select *
from salary_by_dept_rank
where salary_rank_by_department <= 5
name | title | department | salary_annual | salary_rank_by_department |
---|---|---|---|---|
JACKOWIAK, PATRICIA | DIR OF ADMINISTRATIVE HEARINGS | ADMIN HEARNG | 156420 | 1 |
SHEELY, STEVEN N | DEPUTY DIR OF ADMINISTRATIVE ADJUDICATION | ADMIN HEARNG | 131688 | 2 |
HENNIGAN, KEVIN G | FISCAL ADMINISTRATOR | ADMIN HEARNG | 117660 | 3 |
HEIDT, SANDRA E | ASST MANAGER OF ADMINISTRATIVE ADJUDICATION | ADMIN HEARNG | 98160 | 4 |
GOGOLA, DEBORAH G | SENIOR ADMINISTRATIVE LAW OFFICER | ADMIN HEARNG | 96636 | 5 |
PARKER-OKOJIE, GABRIELLE E | SENIOR ADMINISTRATIVE LAW OFFICER | ADMIN HEARNG | 96636 | 5 |
RIZZO, ANTHONY J | SENIOR ADMINISTRATIVE LAW OFFICER | ADMIN HEARNG | 96636 | 5 |
WEINERT, DARCI N | SENIOR ADMINISTRATIVE LAW OFFICER | ADMIN HEARNG | 96636 | 5 |
RUSSELL, SUSAN J | EXECUTIVE DIR OF ANIMAL CARE AND CONTROL | ANIMAL CONTRL | 130008 | 1 |
YAMAJI, PETER S | VETERINARIAN | ANIMAL CONTRL | 128136 | 2 |
%>%
df group_by(department) %>%
mutate(salary_rank = dense_rank(desc(salary_annual))) %>%
arrange(department, salary_rank) %>%
filter(salary_rank <= 5,
!is.na(department))
# A tibble: 290 × 9
# Groups: department [35]
name title department salary_annual original_hire_date
<chr> <chr> <chr> <dbl> <dttm>
1 JACKOWIAK, PATRICIA DIR … ADMIN HEA… 156420 1989-09-01 00:00:00
2 SHEELY, STEVEN N DEPU… ADMIN HEA… 131688 2003-07-16 00:00:00
3 HENNIGAN, KEVIN G FISC… ADMIN HEA… 117660 1980-01-07 00:00:00
4 HEIDT, SANDRA E ASST… ADMIN HEA… 98160 1985-09-09 00:00:00
5 GOGOLA, DEBORAH G SENI… ADMIN HEA… 96636 1989-10-02 00:00:00
6 PARKER-OKOJIE, GABRIELLE… SENI… ADMIN HEA… 96636 2004-06-15 00:00:00
7 RIZZO, ANTHONY J SENI… ADMIN HEA… 96636 1997-12-08 00:00:00
8 WEINERT, DARCI N SENI… ADMIN HEA… 96636 2003-03-17 00:00:00
9 RUSSELL, SUSAN J EXEC… ANIMAL CO… 130008 2016-05-02 00:00:00
10 YAMAJI, PETER S VETE… ANIMAL CO… 128136 2000-08-21 00:00:00
# ℹ 280 more rows
# ℹ 4 more variables: start_date_in_present_position <dttm>,
# salary_basis <chr>, employment_category <chr>, salary_rank <int>
Find the names of employees who earn more than the average salary in their department.
with dept_avg_salary as (
select *,
AVG(salary_annual) OVER(
partition by department) as avg_salary_by_department
from df
where department IS NOT NULL
)select *
from dept_avg_salary
where salary_annual > avg_salary_by_department
order by department, salary_annual desc
name | title | department | salary_annual | original_hire_date | start_date_in_present_position | salary_basis | employment_category | avg_salary_by_department |
---|---|---|---|---|---|---|---|---|
JACKOWIAK, PATRICIA | DIR OF ADMINISTRATIVE HEARINGS | ADMIN HEARNG | 156420 | 620611200 | 1283299200 | SALARY | Fulltime-Regular | 78894.15 |
SHEELY, STEVEN N | DEPUTY DIR OF ADMINISTRATIVE ADJUDICATION | ADMIN HEARNG | 131688 | 1058313600 | 1309478400 | SALARY | Fulltime-Regular | 78894.15 |
HENNIGAN, KEVIN G | FISCAL ADMINISTRATOR | ADMIN HEARNG | 117660 | 316051200 | 815184000 | SALARY | Fulltime-Regular | 78894.15 |
HEIDT, SANDRA E | ASST MANAGER OF ADMINISTRATIVE ADJUDICATION | ADMIN HEARNG | 98160 | 495072000 | 879465600 | SALARY | Fulltime-Regular | 78894.15 |
GOGOLA, DEBORAH G | SENIOR ADMINISTRATIVE LAW OFFICER | ADMIN HEARNG | 96636 | 623289600 | 976924800 | SALARY | Fulltime-Regular | 78894.15 |
PARKER-OKOJIE, GABRIELLE E | SENIOR ADMINISTRATIVE LAW OFFICER | ADMIN HEARNG | 96636 | 1087257600 | 1497830400 | SALARY | Fulltime-Regular | 78894.15 |
RIZZO, ANTHONY J | SENIOR ADMINISTRATIVE LAW OFFICER | ADMIN HEARNG | 96636 | 881539200 | 1095292800 | SALARY | Fulltime-Regular | 78894.15 |
WEINERT, DARCI N | SENIOR ADMINISTRATIVE LAW OFFICER | ADMIN HEARNG | 96636 | 1047859200 | 1468627200 | SALARY | Fulltime-Regular | 78894.15 |
RIZZO, MARGARET A | ADMINISTRATIVE SERVICES OFFICER II EXCLD | ADMIN HEARNG | 93300 | 173404800 | 1164931200 | SALARY | Fulltime-Regular | 78894.15 |
SMITH, PAMELA M | ASST TO THE EXEC DIR | ADMIN HEARNG | 93300 | 585187200 | 991353600 | SALARY | Fulltime-Regular | 78894.15 |
%>%
df group_by(department) %>%
mutate(avg_salary = mean(salary_annual)) %>%
filter(salary_annual > avg_salary) %>%
arrange(department, -salary_annual)
# A tibble: 16,399 × 9
# Groups: department [35]
name title department salary_annual original_hire_date
<chr> <chr> <chr> <dbl> <dttm>
1 JACKOWIAK, PATRICIA DIR … ADMIN HEA… 156420 1989-09-01 00:00:00
2 SHEELY, STEVEN N DEPU… ADMIN HEA… 131688 2003-07-16 00:00:00
3 HENNIGAN, KEVIN G FISC… ADMIN HEA… 117660 1980-01-07 00:00:00
4 HEIDT, SANDRA E ASST… ADMIN HEA… 98160 1985-09-09 00:00:00
5 GOGOLA, DEBORAH G SENI… ADMIN HEA… 96636 1989-10-02 00:00:00
6 PARKER-OKOJIE, GABRIELLE… SENI… ADMIN HEA… 96636 2004-06-15 00:00:00
7 RIZZO, ANTHONY J SENI… ADMIN HEA… 96636 1997-12-08 00:00:00
8 WEINERT, DARCI N SENI… ADMIN HEA… 96636 2003-03-17 00:00:00
9 RIZZO, MARGARET A ADMI… ADMIN HEA… 93300 1975-07-01 00:00:00
10 SMITH, PAMELA M ASST… ADMIN HEA… 93300 1988-07-18 00:00:00
# ℹ 16,389 more rows
# ℹ 4 more variables: start_date_in_present_position <dttm>,
# salary_basis <chr>, employment_category <chr>, avg_salary <dbl>
'avg_salary'] = df['salary_annual'].mean()
df[
'salary_annual'] > df['avg_salary']] df[df[
name ... avg_salary
0 AARON, JEFFERY M ... 81631.458876
1 AARON, KARINA ... 81631.458876
2 AARON, KIMBERLEI R ... 81631.458876
3 ABAD JR, VICENTE M ... 81631.458876
7 ABBATACOLA, ROBERT J ... 81631.458876
... ... ... ...
32797 ZYGADLO, JOHN P ... 81631.458876
32798 ZYGADLO, MICHAEL J ... 81631.458876
32803 ZYMANTAS, MARK E ... 81631.458876
32804 ZYRKOWSKI, CARLO E ... 81631.458876
32805 ZYSKOWSKI, DARIUSZ ... 81631.458876
[19902 rows x 9 columns]
5. Complex Queries:
Calculate the percentage of employees in each department relative to the total number of employees.
with dept_employees as (
select department, count(*) cnt_employees
from df
group by department
),as (
ttl_employees select count(*) as ttl_count
from df
)select *, round(cnt_employees * 100.0 / ttl_count, 1) as pct_employees
from dept_employees, ttl_employees
order by pct_employees desc
department | cnt_employees | ttl_count | pct_employees |
---|---|---|---|
POLICE | 13055 | 32806 | 39.8 |
FIRE | 4833 | 32806 | 14.7 |
OEMC | 2121 | 32806 | 6.5 |
STREETS & SAN | 2045 | 32806 | 6.2 |
WATER MGMNT | 1902 | 32806 | 5.8 |
AVIATION | 1414 | 32806 | 4.3 |
TRANSPORTN | 1244 | 32806 | 3.8 |
PUBLIC LIBRARY | 978 | 32806 | 3.0 |
FLEET AND FACILITY MGMT | 964 | 32806 | 2.9 |
FAMILY & SUPPORT | 625 | 32806 | 1.9 |
%>%
df group_by(department) %>%
count() %>%
ungroup() %>%
mutate(ttl_employees = sum(n),
pct_employees = round((n / ttl_employees) * 100, 1)) %>%
arrange(-pct_employees)
# A tibble: 36 × 4
department n ttl_employees pct_employees
<chr> <int> <int> <dbl>
1 POLICE 13055 32806 39.8
2 FIRE 4833 32806 14.7
3 OEMC 2121 32806 6.5
4 STREETS & SAN 2045 32806 6.2
5 WATER MGMNT 1902 32806 5.8
6 AVIATION 1414 32806 4.3
7 TRANSPORTN 1244 32806 3.8
8 PUBLIC LIBRARY 978 32806 3
9 FLEET AND FACILITY MGMT 964 32806 2.9
10 FAMILY & SUPPORT 625 32806 1.9
# ℹ 26 more rows
Identify job titles that have more than 100 employees.
select title, count(*) cnt_employees
from df
group by title
having cnt_employees > 100
order by cnt_employees desc
title | cnt_employees |
---|---|
POLICE OFFICER | 9213 |
FIREFIGHTER-EMT | 1492 |
SERGEANT | 1162 |
POLICE OFFICER (ASSIGNED AS DETECTIVE) | 1015 |
MOTOR TRUCK DRIVER | 709 |
POOL MOTOR TRUCK DRIVER | 693 |
SANITATION LABORER | 634 |
FIREFIGHTER | 533 |
CROSSING GUARD | 464 |
CONSTRUCTION LABORER | 456 |
%>%
df group_by(title) %>%
count() %>%
filter(n > 100) %>%
arrange(-n)
# A tibble: 46 × 2
# Groups: title [46]
title n
<chr> <int>
1 POLICE OFFICER 9213
2 FIREFIGHTER-EMT 1492
3 SERGEANT 1162
4 POLICE OFFICER (ASSIGNED AS DETECTIVE) 1015
5 MOTOR TRUCK DRIVER 709
6 POOL MOTOR TRUCK DRIVER 693
7 SANITATION LABORER 634
8 FIREFIGHTER 533
9 CROSSING GUARD 464
10 CONSTRUCTION LABORER 456
# ℹ 36 more rows
'title').agg(
df.groupby(= ('title', 'count')
cnt_title ='cnt_title', ascending=False).query('cnt_title > 100') ).reset_index().sort_values(by
title cnt_title
791 POLICE OFFICER 9213
529 FIREFIGHTER-EMT 1492
946 SERGEANT 1162
796 POLICE OFFICER (ASSIGNED AS DETECTIVE) 1015
744 MOTOR TRUCK DRIVER 709
808 POOL MOTOR TRUCK DRIVER 693
899 SANITATION LABORER 634
525 FIREFIGHTER 533
326 CROSSING GUARD 464
280 CONSTRUCTION LABORER 456
1079 TRAFFIC CONTROL AIDE-HOURLY 452
680 LIEUTENANT-EMT 425
679 LIEUTENANT 379
327 CROSSING GUARD - PER CBA 348
760 PARAMEDIC 341
762 PARAMEDIC I/C 278
598 GENERAL LABORER - DSS 258
531 FIREFIGHTER/PARAMEDIC 255
530 FIREFIGHTER-EMT (RECRUIT) 252
140 AVIATION SECURITY OFFICER 236
786 POLICE COMMUNICATIONS OPERATOR I 236
617 HOISTING ENGINEER 233
520 FIRE ENGINEER-EMT 231
452 ELECTRICAL MECHANIC 229
360 DETENTION AIDE 224
801 POLICE OFFICER / FLD TRNG OFFICER 219
787 POLICE COMMUNICATIONS OPERATOR II 215
752 OPERATING ENGINEER-GROUP A 186
676 LIBRARY PAGE 166
753 OPERATING ENGINEER-GROUP C 166
689 MACHINIST (AUTOMOTIVE) 160
576 FOSTER GRANDPARENT 143
162 CAPTAIN-EMT 140
277 CONCRETE LABORER 139
15 ADMINISTRATIVE ASST II 132
649 LABORER 130
783 POLICE ADMINISTRATIVE CLERK 128
918 SENIOR DATA ENTRY OPERATOR 124
780 PLUMBER 123
978 STAFF ASST 121
666 LIBRARIAN I 121
52 ASPHALT LABORER 110
979 STAFF ASST TO THE ALDERMAN 107
17 ADMINISTRATIVE ASST III 107
519 FIRE ENGINEER 106
764 PARKING ENFORCEMENT AIDE 106
Find employees who have been in their current position for more than 5 years (assuming the dataset includes a hire date or a similar field).
Note: need to do some more research on the DBI::
package, as the SQL object is reading date fields in as a dbl data type.
%>%
df mutate(tenure = round(
::time_length(
lubridatetoday() - lubridate::as_date(start_date_in_present_position), "year"),
1)) %>%
filter(tenure > 5) %>%
arrange(-tenure)
# A tibble: 32,806 × 9
name title department salary_annual original_hire_date
<chr> <chr> <chr> <dbl> <dttm>
1 LANNON, JOHN P CONSTRUCTIO… WATER MGM… 83616 1963-07-02 00:00:00
2 WILLIAMS, DAISY CROSSING GU… OEMC 23254. 1965-12-06 00:00:00
3 CALHOUN, ROCKELL H CROSSING GU… OEMC 23254. 1968-03-25 00:00:00
4 SCOTT, ROSIE M PARKING ENF… FINANCE 64392 1968-08-19 00:00:00
5 BURKE, CATHERINE CROSSING GU… OEMC 23254. 1970-05-13 00:00:00
6 BISHOP, DORIS J LIBRARY ASS… PUBLIC LI… 69492 1971-09-07 00:00:00
7 LUMPKIN, STEPHEN SANITATION … STREETS &… 75317. 1972-03-11 00:00:00
8 KUCHARSKI, JOYCE M CROSSING GU… OEMC 23254. 1973-09-17 00:00:00
9 CORSO JR, JOHN J PROPERTY CU… POLICE 70092 1975-05-27 00:00:00
10 PHILLIPS, EUGENIA CROSSING GU… OEMC 23254. 1976-01-12 00:00:00
# ℹ 32,796 more rows
# ℹ 4 more variables: start_date_in_present_position <dttm>,
# salary_basis <chr>, employment_category <chr>, tenure <dbl>
6. Own Questions
Pull the second title of all employees that have been promoted at least once.
with promotion_seq as (
select
name,
title,
department,
start_date_in_present_position,ROW_NUMBER() OVER(
partition by name
order by start_date_in_present_position
as seq_promotion
) from df
)select *
from promotion_seq
where seq_promotion = 2
order by name
name | title | department | start_date_in_present_position | seq_promotion |
---|---|---|---|---|
ADE, JAMES P | POLICE OFFICER | POLICE | 1283299200 | 2 |
ANDERSON, DAVID C | PROJECTS ADMINISTRATOR | BUILDINGS | 1393200000 | 2 |
ANDERSON, DONALD | PROJECT MANAGER | AVIATION | 1494806400 | 2 |
ANDERSON, RHONDA M | SERGEANT | POLICE | 1464739200 | 2 |
ARROYO, FRANCISCO | WARD SUPERINTENDENT | STREETS & SAN | 1489622400 | 2 |
ATECA, SALVADOR R | MACHINIST (AUTOMOTIVE) | FLEET AND FACILITY MGMT | 1425340800 | 2 |
BIANCHI, MICHAEL J | FIRE COMMUNICATIONS OPERATOR I | OEMC | 1502841600 | 2 |
BRACKEN, DANIEL J | PARAMEDIC | FIRE | 1418688000 | 2 |
BROWN, ANTHONY | GENERAL LABORER - DSS | STREETS & SAN | 1405296000 | 2 |
BROWN, PATRICK J | POLICE OFFICER (ASSIGNED AS DETECTIVE) | POLICE | 1429142400 | 2 |
%>%
df group_by(name) %>%
arrange(start_date_in_present_position) %>%
mutate(seq_promotion = row_number()) %>%
filter(seq_promotion == 2) %>%
arrange(name)
# A tibble: 284 × 9
# Groups: name [284]
name title department salary_annual original_hire_date
<chr> <chr> <chr> <dbl> <dttm>
1 ADE, JAMES P POLICE OFFI… POLICE 84054 2010-09-01 00:00:00
2 ANDERSON, DAVID C PROJECTS AD… BUILDINGS 94824 2014-02-24 00:00:00
3 ANDERSON, DONALD PROJECT MAN… AVIATION 80868 1987-08-04 00:00:00
4 ANDERSON, RHONDA M SERGEANT POLICE 104628 2000-02-28 00:00:00
5 ARROYO, FRANCISCO WARD SUPERI… STREETS &… 73212 2015-05-18 00:00:00
6 ATECA, SALVADOR R MACHINIST (… FLEET AND… 96408 2015-03-03 00:00:00
7 BIANCHI, MICHAEL J FIRE COMMUN… OEMC 51264 2017-08-16 00:00:00
8 BRACKEN, DANIEL J PARAMEDIC FIRE 76266 2014-12-16 00:00:00
9 BROWN, ANTHONY GENERAL LAB… STREETS &… 43202. 2014-07-14 00:00:00
10 BROWN, PATRICK J POLICE OFFI… POLICE 97440 1999-03-08 00:00:00
# ℹ 274 more rows
# ℹ 4 more variables: start_date_in_present_position <dttm>,
# salary_basis <chr>, employment_category <chr>, seq_promotion <int>
='start_date_in_present_position').groupby('name').apply(
df.sort_values(bylambda x: x.assign(seq_promotion=range(1, len(x) + 1))).query('seq_promotion == 2').reset_index(drop=True).sort_values(by='name')
<string>:1: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
name ... seq_promotion
0 ADE, JAMES P ... 2
1 ANDERSON, DAVID C ... 2
2 ANDERSON, DONALD ... 2
3 ANDERSON, RHONDA M ... 2
4 ARROYO, FRANCISCO ... 2
.. ... ... ...
279 WILSON, EDWARD J ... 2
280 WILSON, JANICE L ... 2
281 WILSON, JOHN L ... 2
282 WILSON, JOSHUA ... 2
283 WRONOWSKI, JAMES J ... 2
[284 rows x 10 columns]
Conclusion
This project had three purposes:
To show how to run R, SQL, and Python all interchangeably in a Quarto document. Quarto and the
reticulate::
andDBI::
packages allow for the functionality to switch between different methods of handling, querying, and analyzing data. Practically speaking, you probably wouldn’t switch between these as often as I did in this work, but it’s useful to verify your work and could come in handy for specific circumstances (e.g. analyzing with Pandas and visualizing with ggplot2). I did find that some of the work in SQL is clunky – data types didn’t read in correctly for dates – but I have not yet dug more deeply into theDBI::
package to troubleshoot.To compare the ease of writing code using dplyr (R), SQL, and pandas (Python). Personally, I feel that R and
dplyr::
is the simplest syntax to use, followed by SQL and Python.dplyr::
is very linear in its construction, and it rarely has extra characters to remember. Python’s brackets, quotes, and chaining all take some getting used to, while R and SQL feel more direct.To include some demonstration of SQL in my portfolio (which is often not included but remains a critical skill). This was a fun way to demonstrate some SQL expertise while comparing and verifying output against R and Python.