Introduction

Purpose

This project has three purposes:

To show how to run R, SQL, and Python all interchangeably in a Quarto document
To compare the ease of writing code using dplyr (R), SQL, and pandas (Python)
To include some demonstration of SQL in my portfolio (which is often not included but remains a critical skill)

Setup and data preparation

Required packages:

library(tidyverse)
library(readxl)
library(httr)           #to fetch the data
library(janitor)        #for the clean_names() function for data cleaning
library(reticulate)     #to enable Python within R
library(DBI)            #to establish in-memory database of R dataframe
library(RSQLite)        #for SQLite engine
library(lubridate)      #for functions to handle dates

About the dataset:

This dataset is from data.world, and includes information from Chicago’s Department of Human Resources for city employees in 2017. It’s a simple dataset to allow for comparisons across languages.

# reading as a temporary file, then saving as df
GET("https://query.data.world/s/hu5dkviuxd6k2ipuhpxjuyuds7aplu?dws=00000", write_disk(tf <- tempfile(fileext = ".xls")))

Response [https://download.data.world/file_download/wbezchicago/chicago-employee-positions-and-salaries-for-2017/Employee%20Salary%20Data%20as%20of%20Sept.%202017.xls?auth=eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiJwcm9kLXVzZXItY2xpZW50Om1yYWZhMyIsImlzcyI6ImFnZW50Om1yYWZhMzo6ODg0MjQ2ZDItMTgzMy00NmZjLTk2YTMtZjQ2MWMzMDJjOTZiIiwiaWF0IjoxNzIxODc0NzgyLCJyb2xlIjpbInVzZXIiLCJ1c2VyX2FwaV9hZG1pbiIsInVzZXJfYXBpX2VudGVycHJpc2VfYWRtaW4iLCJ1c2VyX2FwaV9yZWFkIiwidXNlcl9hcGlfd3JpdGUiXSwiZ2VuZXJhbC1wdXJwb3NlIjpmYWxzZSwidXJsIjoiMjM2ZDZlZGQ3NjdkZmVmOGRjYzM0Mzg3YTExMDQ1N2EzMmU1OGY3ZSJ9.oTr9PF0rrsLdVhewmy2v1vRgYvT0jy_PBmqhmrufaRSB25PEm48ZPxvswiSrZw8bNaACxDWpxwBiJVCVSaM-Zg]
  Date: 2024-11-07 20:59
  Status: 200
  Content-Type: application/vnd.ms-excel
  Size: 4.72 MB
<ON DISK>  /var/folders/ck/dmgn8lbx6vl8sl89vlhr0sv00000gn/T//RtmpkOuOtb/filed8ea1dd52f04.xls

df <- read_excel(tf) %>% 
  #clean_names() to make all column names lowercase
  clean_names()

Rows: 32,806
Columns: 8
$ name                           <chr> "AARON,  JEFFERY M", "AARON,  KARINA", …
$ title                          <chr> "SERGEANT", "POLICE OFFICER (ASSIGNED A…
$ department                     <chr> "POLICE", "POLICE", "FLEET AND FACILITY…
$ salary_annual                  <dbl> 101442.0, 94122.0, 101592.0, 110064.0, …
$ original_hire_date             <dttm> 2005-09-26, 2005-09-26, 1991-08-01, 19…
$ start_date_in_present_position <dttm> 2016-06-01, 2017-04-16, 2000-05-01, 20…
$ salary_basis                   <chr> "SALARY", "SALARY", "SALARY", "SALARY",…
$ employment_category            <chr> "Fulltime-Regular", "Fulltime-Regular",…

Setting up Python and SQL to execute

The DBI:: package allows you to create an in-memory database to query against. The DBI project site is a great place to learn more about it. I’ll start by doing some initial setup and establishing the connection between the R dataframe and the SQL table name that I’ll query.

con <- DBI::dbConnect(SQLite(), ":memory:")
DBI::dbWriteTable(conn = con, name = "df", value = df, field.types = c("original_hire_date" = "Date"), row.names = FALSE)

The reticulate:: package allows for executing Python code in an R environment. The reticulate project site includes useful examples for getting up and running with Python in R. This package includes an r_to_py() function that is needed to convert an R dataframe into a pandas dataframe.

py$df <- r_to_py(df)

When inserting a code chunk to your Markdown file, it originally defaults to ‘{r}’. You can simply change this to ‘python’ or ‘sql’ and, with the above set up, the code works beautifully in a Quarto document!

In the sections that follow, I used ChatGPT to generate prompts as querying exercises. For the initial code chunks,

1. Basic Selection and Filtering:

Retrieve all records of employees who have the job title “POLICE OFFICER”.

select *
from df
where title = 'POLICE OFFICER'

Displaying records 1 - 10
name	title	department	salary_annual	original_hire_date	start_date_in_present_position	salary_basis	employment_category
ABBATE, TERRY M	POLICE OFFICER	POLICE	93354	818035200	818035200	SALARY	Fulltime-Regular
ABDALLAH, ZAID	POLICE OFFICER	POLICE	84054	1354233600	1354233600	SALARY	Fulltime-Regular
ABDELHADI, ABDALMAHD	POLICE OFFICER	POLICE	87006	1166400000	1166400000	SALARY	Fulltime-Regular
ABDELMAJEID, AZIZ	POLICE OFFICER	POLICE	84054	1209340800	1209340800	SALARY	Fulltime-Regular
ABEJERO, JASON V	POLICE OFFICER	POLICE	90024	940809600	940809600	SALARY	Fulltime-Regular
ABFALL, RICHARD C	POLICE OFFICER	POLICE	48078	1494892800	1494892800	SALARY	Fulltime-Regular
ABNEY, PATRICK	POLICE OFFICER	POLICE	72510	1435536000	1435536000	SALARY	Fulltime-Regular
ABOUASSI, CHADI	POLICE OFFICER	POLICE	48078	1494892800	1494892800	SALARY	Fulltime-Regular
ABRAHAM, NANCY A	POLICE OFFICER	POLICE	76266	1398643200	1398643200	SALARY	Fulltime-Regular
ABRAM, ANTHONY A	POLICE OFFICER	POLICE	68616	1466985600	1466985600	SALARY	Fulltime-Regular

df %>% 
  filter(title == 'POLICE OFFICER')

# A tibble: 9,213 × 8
   name                  title      department salary_annual original_hire_date 
   <chr>                 <chr>      <chr>              <dbl> <dttm>             
 1 ABBATE,  TERRY M      POLICE OF… POLICE             93354 1995-12-04 00:00:00
 2 ABDALLAH,  ZAID       POLICE OF… POLICE             84054 2012-11-30 00:00:00
 3 ABDELHADI,  ABDALMAHD POLICE OF… POLICE             87006 2006-12-18 00:00:00
 4 ABDELMAJEID,  AZIZ    POLICE OF… POLICE             84054 2008-04-28 00:00:00
 5 ABEJERO,  JASON V     POLICE OF… POLICE             90024 1999-10-25 00:00:00
 6 ABFALL,  RICHARD C    POLICE OF… POLICE             48078 2017-05-16 00:00:00
 7 ABNEY,  PATRICK       POLICE OF… POLICE             72510 2015-06-29 00:00:00
 8 ABOUASSI,  CHADI      POLICE OF… POLICE             48078 2017-05-16 00:00:00
 9 ABRAHAM,  NANCY A     POLICE OF… POLICE             76266 2014-04-28 00:00:00
10 ABRAM,  ANTHONY A     POLICE OF… POLICE             68616 2016-06-27 00:00:00
# ℹ 9,203 more rows
# ℹ 3 more variables: start_date_in_present_position <dttm>,
#   salary_basis <chr>, employment_category <chr>

df[df['title'] == 'POLICE OFFICER']

                        name           title  ... salary_basis  employment_category
10          ABBATE,  TERRY M  POLICE OFFICER  ...       SALARY     Fulltime-Regular
12           ABDALLAH,  ZAID  POLICE OFFICER  ...       SALARY     Fulltime-Regular
13     ABDELHADI,  ABDALMAHD  POLICE OFFICER  ...       SALARY     Fulltime-Regular
15        ABDELMAJEID,  AZIZ  POLICE OFFICER  ...       SALARY     Fulltime-Regular
24         ABEJERO,  JASON V  POLICE OFFICER  ...       SALARY     Fulltime-Regular
...                      ...             ...  ...          ...                  ...
32800        ZYGMUNT,  DAWID  POLICE OFFICER  ...       SALARY     Fulltime-Regular
32801   ZYLINSKA,  KATARZYNA  POLICE OFFICER  ...       SALARY     Fulltime-Regular
32802     ZYMANTAS,  LAURA C  POLICE OFFICER  ...       SALARY     Fulltime-Regular
32803      ZYMANTAS,  MARK E  POLICE OFFICER  ...       SALARY     Fulltime-Regular
32804    ZYRKOWSKI,  CARLO E  POLICE OFFICER  ...       SALARY     Fulltime-Regular

[9213 rows x 8 columns]

List the names and salaries of employees whose annual salary is greater than $100,000.

select name, salary_annual
from df
where salary_annual > 100000

Displaying records 1 - 10
name	salary_annual
AARON, JEFFERY M	101442
AARON, KIMBERLEI R	101592
ABAD JR, VICENTE M	110064
ABBATEMARCO, JAMES J	103350
ABDELLATIF, AREF R	102228
ABDUL-KARIM, MUHAMMAD A	111492
ABDULLAH, KEVIN	114846
ABOUELKHEIR, HASSAN A	110064
ABRAHAM, GIRLEY T	110064
ABRAMSKI, JOHN E	128970

df %>% 
  filter(salary_annual > 100000) %>% 
  select(name, salary_annual)

# A tibble: 6,091 × 2
   name                     salary_annual
   <chr>                            <dbl>
 1 AARON,  JEFFERY M               101442
 2 AARON,  KIMBERLEI R             101592
 3 ABAD JR,  VICENTE M             110064
 4 ABBATEMARCO,  JAMES J           103350
 5 ABDELLATIF,  AREF R             102228
 6 ABDUL-KARIM,  MUHAMMAD A        111492
 7 ABDULLAH,  KEVIN                114846
 8 ABOUELKHEIR,  HASSAN A          110064
 9 ABRAHAM,  GIRLEY T              110064
10 ABRAMSKI,  JOHN E               128970
# ℹ 6,081 more rows

df[df['salary_annual'] > 100000][['name', 'salary_annual']]

                        name  salary_annual
0          AARON,  JEFFERY M       101442.0
2        AARON,  KIMBERLEI R       101592.0
3        ABAD JR,  VICENTE M       110064.0
9      ABBATEMARCO,  JAMES J       103350.0
14       ABDELLATIF,  AREF R       102228.0
...                      ...            ...
32783         ZUPAN,  BILL M       114324.0
32785     ZURAWSKI,  JEFFREY       101608.0
32794    ZWOLFER,  MATTHEW W       114324.0
32798    ZYGADLO,  MICHAEL J       101608.0
32805    ZYSKOWSKI,  DARIUSZ       115932.0

[6091 rows x 2 columns]

Find all employees who work in the “FINANCE” department.

select name
from df
where department = 'FINANCE'

Displaying records 1 - 10
name
ADAMCZYK JR, JAN
ADAPON, NENITA P
ADENI, MOHAMED K
AHMED, MOHAMMAD A
AIKONEDO, PAUL E
ALAM, SYED S
ALEXANDER, PATRICIA L
ALEXANDER, RACQUEL L
ALLEN, JONATHAN E
ALLEN, KEVIN M

df %>% 
  filter(department == 'FINANCE') %>% 
  select(name)

# A tibble: 563 × 1
   name                  
   <chr>                 
 1 ADAMCZYK JR,  JAN     
 2 ADAPON,  NENITA P     
 3 ADENI,  MOHAMED K     
 4 AHMED,  MOHAMMAD A    
 5 AIKONEDO,  PAUL E     
 6 ALAM,  SYED S         
 7 ALEXANDER,  PATRICIA L
 8 ALEXANDER,  RACQUEL L 
 9 ALLEN,  JONATHAN E    
10 ALLEN,  KEVIN M       
# ℹ 553 more rows

df[df['department'] == 'FINANCE']['name']

109         ADAMCZYK JR,  JAN
162         ADAPON,  NENITA P
175         ADENI,  MOHAMED K
263        AHMED,  MOHAMMAD A
278         AIKONEDO,  PAUL E
                 ...         
32498    ZACARIAZ,  ALEJANDRO
32522          ZAIDI,  SYED K
32605       ZAVALA,  FERNANDO
32680          ZHANG,  KEFENG
32789          ZUREK,  MARY H
Name: name, Length: 563, dtype: object

2. Aggregate Functions:

Calculate the average salary for all employees.

select avg(salary_annual)
from df

1 records
avg(salary_annual)
81631.46

df %>%
  summarize(mean(salary_annual))

# A tibble: 1 × 1
  `mean(salary_annual)`
                  <dbl>
1                81631.

df['salary_annual'].mean()

np.float64(81631.4588758154)

Find the total number of employees in each department.

select department, count(*) as ttl_employees
from df
group by department
order by ttl_employees desc

Displaying records 1 - 10
department	ttl_employees
POLICE	13055
FIRE	4833
OEMC	2121
STREETS & SAN	2045
WATER MGMNT	1902
AVIATION	1414
TRANSPORTN	1244
PUBLIC LIBRARY	978
FLEET AND FACILITY MGMT	964
FAMILY & SUPPORT	625

df %>% 
  group_by(department) %>%
  count() %>% 
  arrange(-n)

# A tibble: 36 × 2
# Groups:   department [36]
   department                  n
   <chr>                   <int>
 1 POLICE                  13055
 2 FIRE                     4833
 3 OEMC                     2121
 4 STREETS & SAN            2045
 5 WATER MGMNT              1902
 6 AVIATION                 1414
 7 TRANSPORTN               1244
 8 PUBLIC LIBRARY            978
 9 FLEET AND FACILITY MGMT   964
10 FAMILY & SUPPORT          625
# ℹ 26 more rows

df.groupby('department').agg(
  ttl_employees = ('name', 'count')
).reset_index().sort_values(by='ttl_employees', ascending=False)

                 department  ttl_employees
27                   POLICE          13055
15                     FIRE           4833
26                     OEMC           2121
31            STREETS & SAN           2045
34              WATER MGMNT           1902
2                  AVIATION           1414
32               TRANSPORTN           1244
30           PUBLIC LIBRARY            978
16  FLEET AND FACILITY MGMT            964
13         FAMILY & SUPPORT            625
14                  FINANCE            563
17                   HEALTH            503
9              CITY COUNCIL            453
23                      LAW            392
6                 BUILDINGS            277
18       HOUSING & ECON DEV            204
7          BUSINESS AFFAIRS            161
3         BOARD OF ELECTION            103
12                     DoIT             98
29              PROCUREMENT             97
21            INSPECTOR GEN             89
8                CITY CLERK             89
25           MAYOR'S OFFICE             86
1             ANIMAL CONTRL             80
20          HUMAN RESOURCES             71
10         CULTURAL AFFAIRS             67
5             BUDGET & MGMT             46
0              ADMIN HEARNG             39
11             DISABILITIES             29
33                TREASURER             22
19          HUMAN RELATIONS             17
22                     IPRA             13
4           BOARD OF ETHICS              8
28             POLICE BOARD              2
24        LICENSE APPL COMM              1

Determine the highest salary in the dataset.

select max(salary_annual)
from df

1 records
max(salary_annual)
3e+05

df %>% 
  summarize(max(salary_annual))

# A tibble: 1 × 1
  `max(salary_annual)`
                 <dbl>
1               300000

df['salary_annual'].max()

np.float64(300000.0)

3. Grouping and Sorting:

List the average salary for each job title.

select title, avg(salary_annual) as avg_salary
from df
group by title
order by avg_salary desc

Displaying records 1 - 10
title	avg_salary
COMMISSIONER OF AVIATION	300000
SUPERINTENDENT OF POLICE	260004
MAYOR	216210
FIRE COMMISSIONER	202728
FIRST DEPUTY FIRE COMMISSIONER	197736
FIRST DEPUTY SUPERINTENDENT	197724
CHIEF OF STAFF	195000
DEPUTY FIRE COMMISSIONER	187680
CHIEF	185364
ASST DEPUTY FIRE COMMISSIONER	185352

df %>% 
  group_by(title) %>% 
  summarize(avg_salary = mean(salary_annual, na.rm = TRUE)) %>% 
  arrange(-avg_salary)

# A tibble: 1,121 × 2
   title                          avg_salary
   <chr>                               <dbl>
 1 COMMISSIONER OF AVIATION           300000
 2 SUPERINTENDENT OF POLICE           260004
 3 MAYOR                              216210
 4 FIRE COMMISSIONER                  202728
 5 FIRST DEPUTY FIRE COMMISSIONER     197736
 6 FIRST DEPUTY SUPERINTENDENT        197724
 7 CHIEF OF STAFF                     195000
 8 DEPUTY FIRE COMMISSIONER           187680
 9 CHIEF                              185364
10 ASST DEPUTY FIRE COMMISSIONER      185352
# ℹ 1,111 more rows

df.groupby('title').agg(
  avg_salary = ('salary_annual', 'mean')
).reset_index().sort_values(by='avg_salary', ascending=False)

                               title     avg_salary
252         COMMISSIONER OF AVIATION  300000.000000
997         SUPERINTENDENT OF POLICE  260004.000000
731                            MAYOR  216210.000000
516                FIRE COMMISSIONER  202728.000000
539   FIRST DEPUTY FIRE COMMISSIONER  197736.000000
...                              ...            ...
785                     POLICE CADET    9838.400000
828                     PROGRAM AIDE    9728.622222
1077       TITLE V PROGRAM TRAINEE I    8580.000000
576               FOSTER GRANDPARENT    2756.000000
915                 SENIOR COMPANION    2756.000000

[1121 rows x 2 columns]

Find the top 5 highest-paid departments.

select department, avg(salary_annual) as avg_salary
from df
group by department
order by avg_salary desc
limit 5

5 records
department	avg_salary
IPRA	102758.77
DoIT	99220.53
BUILDINGS	99095.58
FIRE	98203.67
BOARD OF ETHICS	95061.00

df %>% 
  group_by(department) %>% 
  summarize(avg_salary = mean(salary_annual)) %>% 
  top_n(5, avg_salary) %>% 
  arrange(-avg_salary)

# A tibble: 5 × 2
  department      avg_salary
  <chr>                <dbl>
1 IPRA               102759.
2 DoIT                99221.
3 BUILDINGS           99096.
4 FIRE                98204.
5 BOARD OF ETHICS     95061

df.groupby('department').agg(
  avg_salary = ('salary_annual', 'mean')
).reset_index().sort_values(by='avg_salary', ascending=False).head(5)

         department     avg_salary
22             IPRA  102758.769231
12             DoIT   99220.530612
6         BUILDINGS   99095.580650
15             FIRE   98203.672849
4   BOARD OF ETHICS   95061.000000

Show the number of employees for each job title, ordered by the number of employees in descending order.

select title, count(*) as cnt_employees
from df
group by title
order by cnt_employees desc

Displaying records 1 - 10
title	cnt_employees
POLICE OFFICER	9213
FIREFIGHTER-EMT	1492
SERGEANT	1162
POLICE OFFICER (ASSIGNED AS DETECTIVE)	1015
MOTOR TRUCK DRIVER	709
POOL MOTOR TRUCK DRIVER	693
SANITATION LABORER	634
FIREFIGHTER	533
CROSSING GUARD	464
CONSTRUCTION LABORER	456

df %>% 
  group_by(title) %>% 
  count() %>% 
  arrange(-n)

# A tibble: 1,121 × 2
# Groups:   title [1,121]
   title                                      n
   <chr>                                  <int>
 1 POLICE OFFICER                          9213
 2 FIREFIGHTER-EMT                         1492
 3 SERGEANT                                1162
 4 POLICE OFFICER (ASSIGNED AS DETECTIVE)  1015
 5 MOTOR TRUCK DRIVER                       709
 6 POOL MOTOR TRUCK DRIVER                  693
 7 SANITATION LABORER                       634
 8 FIREFIGHTER                              533
 9 CROSSING GUARD                           464
10 CONSTRUCTION LABORER                     456
# ℹ 1,111 more rows

df.groupby('title').agg(
  cnt_employees = ('name', 'count')
).reset_index().sort_values(by='cnt_employees', ascending=False)

                                      title  cnt_employees
791                          POLICE OFFICER           9213
529                         FIREFIGHTER-EMT           1492
946                                SERGEANT           1162
796  POLICE OFFICER (ASSIGNED AS DETECTIVE)           1015
744                      MOTOR TRUCK DRIVER            709
..                                      ...            ...
218                CHIEF SYSTEMS PROGRAMMER              1
220                     CHIEF WATER CHEMIST              1
221                    CHIEF WATER ENGINEER              1
222              CHIEF ZONING PLAN EXAMINER              1
0              1ST DEPUTY INSPECTOR GENERAL              1

[1121 rows x 2 columns]

4. Joining and Subqueries:

Retrieve the details of employees who have a salary in the top five of their department. Show the name, title, department, salary, and ranking.

with salary_by_dept_rank as (
select 
    name,
    title,
    department,
    salary_annual,
    RANK() OVER(
      partition by department
      order by salary_annual desc) as salary_rank_by_department
from df
where department IS NOT NULL
)
select *
from salary_by_dept_rank
where salary_rank_by_department <= 5

Displaying records 1 - 10
name	title	department	salary_annual	salary_rank_by_department
JACKOWIAK, PATRICIA	DIR OF ADMINISTRATIVE HEARINGS	ADMIN HEARNG	156420	1
SHEELY, STEVEN N	DEPUTY DIR OF ADMINISTRATIVE ADJUDICATION	ADMIN HEARNG	131688	2
HENNIGAN, KEVIN G	FISCAL ADMINISTRATOR	ADMIN HEARNG	117660	3
HEIDT, SANDRA E	ASST MANAGER OF ADMINISTRATIVE ADJUDICATION	ADMIN HEARNG	98160	4
GOGOLA, DEBORAH G	SENIOR ADMINISTRATIVE LAW OFFICER	ADMIN HEARNG	96636	5
PARKER-OKOJIE, GABRIELLE E	SENIOR ADMINISTRATIVE LAW OFFICER	ADMIN HEARNG	96636	5
RIZZO, ANTHONY J	SENIOR ADMINISTRATIVE LAW OFFICER	ADMIN HEARNG	96636	5
WEINERT, DARCI N	SENIOR ADMINISTRATIVE LAW OFFICER	ADMIN HEARNG	96636	5
RUSSELL, SUSAN J	EXECUTIVE DIR OF ANIMAL CARE AND CONTROL	ANIMAL CONTRL	130008	1
YAMAJI, PETER S	VETERINARIAN	ANIMAL CONTRL	128136	2

df %>% 
  group_by(department) %>% 
  mutate(salary_rank = dense_rank(desc(salary_annual))) %>% 
  arrange(department, salary_rank) %>% 
  filter(salary_rank <= 5,
         !is.na(department))

# A tibble: 290 × 9
# Groups:   department [35]
   name                       title department salary_annual original_hire_date 
   <chr>                      <chr> <chr>              <dbl> <dttm>             
 1 JACKOWIAK,  PATRICIA       DIR … ADMIN HEA…        156420 1989-09-01 00:00:00
 2 SHEELY,  STEVEN N          DEPU… ADMIN HEA…        131688 2003-07-16 00:00:00
 3 HENNIGAN,  KEVIN G         FISC… ADMIN HEA…        117660 1980-01-07 00:00:00
 4 HEIDT,  SANDRA E           ASST… ADMIN HEA…         98160 1985-09-09 00:00:00
 5 GOGOLA,  DEBORAH G         SENI… ADMIN HEA…         96636 1989-10-02 00:00:00
 6 PARKER-OKOJIE,  GABRIELLE… SENI… ADMIN HEA…         96636 2004-06-15 00:00:00
 7 RIZZO,  ANTHONY J          SENI… ADMIN HEA…         96636 1997-12-08 00:00:00
 8 WEINERT,  DARCI N          SENI… ADMIN HEA…         96636 2003-03-17 00:00:00
 9 RUSSELL,  SUSAN J          EXEC… ANIMAL CO…        130008 2016-05-02 00:00:00
10 YAMAJI,  PETER S           VETE… ANIMAL CO…        128136 2000-08-21 00:00:00
# ℹ 280 more rows
# ℹ 4 more variables: start_date_in_present_position <dttm>,
#   salary_basis <chr>, employment_category <chr>, salary_rank <int>

Find the names of employees who earn more than the average salary in their department.

with dept_avg_salary as (
select *,
        AVG(salary_annual) OVER(
          partition by department) as avg_salary_by_department 
from df
where department IS NOT NULL
)
select *
from dept_avg_salary
where salary_annual > avg_salary_by_department
order by department, salary_annual desc

Displaying records 1 - 10
name	title	department	salary_annual	original_hire_date	start_date_in_present_position	salary_basis	employment_category	avg_salary_by_department
JACKOWIAK, PATRICIA	DIR OF ADMINISTRATIVE HEARINGS	ADMIN HEARNG	156420	620611200	1283299200	SALARY	Fulltime-Regular	78894.15
SHEELY, STEVEN N	DEPUTY DIR OF ADMINISTRATIVE ADJUDICATION	ADMIN HEARNG	131688	1058313600	1309478400	SALARY	Fulltime-Regular	78894.15
HENNIGAN, KEVIN G	FISCAL ADMINISTRATOR	ADMIN HEARNG	117660	316051200	815184000	SALARY	Fulltime-Regular	78894.15
HEIDT, SANDRA E	ASST MANAGER OF ADMINISTRATIVE ADJUDICATION	ADMIN HEARNG	98160	495072000	879465600	SALARY	Fulltime-Regular	78894.15
GOGOLA, DEBORAH G	SENIOR ADMINISTRATIVE LAW OFFICER	ADMIN HEARNG	96636	623289600	976924800	SALARY	Fulltime-Regular	78894.15
PARKER-OKOJIE, GABRIELLE E	SENIOR ADMINISTRATIVE LAW OFFICER	ADMIN HEARNG	96636	1087257600	1497830400	SALARY	Fulltime-Regular	78894.15
RIZZO, ANTHONY J	SENIOR ADMINISTRATIVE LAW OFFICER	ADMIN HEARNG	96636	881539200	1095292800	SALARY	Fulltime-Regular	78894.15
WEINERT, DARCI N	SENIOR ADMINISTRATIVE LAW OFFICER	ADMIN HEARNG	96636	1047859200	1468627200	SALARY	Fulltime-Regular	78894.15
RIZZO, MARGARET A	ADMINISTRATIVE SERVICES OFFICER II EXCLD	ADMIN HEARNG	93300	173404800	1164931200	SALARY	Fulltime-Regular	78894.15
SMITH, PAMELA M	ASST TO THE EXEC DIR	ADMIN HEARNG	93300	585187200	991353600	SALARY	Fulltime-Regular	78894.15

df %>% 
  group_by(department) %>% 
  mutate(avg_salary = mean(salary_annual)) %>% 
  filter(salary_annual > avg_salary) %>% 
  arrange(department, -salary_annual)

# A tibble: 16,399 × 9
# Groups:   department [35]
   name                       title department salary_annual original_hire_date 
   <chr>                      <chr> <chr>              <dbl> <dttm>             
 1 JACKOWIAK,  PATRICIA       DIR … ADMIN HEA…        156420 1989-09-01 00:00:00
 2 SHEELY,  STEVEN N          DEPU… ADMIN HEA…        131688 2003-07-16 00:00:00
 3 HENNIGAN,  KEVIN G         FISC… ADMIN HEA…        117660 1980-01-07 00:00:00
 4 HEIDT,  SANDRA E           ASST… ADMIN HEA…         98160 1985-09-09 00:00:00
 5 GOGOLA,  DEBORAH G         SENI… ADMIN HEA…         96636 1989-10-02 00:00:00
 6 PARKER-OKOJIE,  GABRIELLE… SENI… ADMIN HEA…         96636 2004-06-15 00:00:00
 7 RIZZO,  ANTHONY J          SENI… ADMIN HEA…         96636 1997-12-08 00:00:00
 8 WEINERT,  DARCI N          SENI… ADMIN HEA…         96636 2003-03-17 00:00:00
 9 RIZZO,  MARGARET A         ADMI… ADMIN HEA…         93300 1975-07-01 00:00:00
10 SMITH,  PAMELA M           ASST… ADMIN HEA…         93300 1988-07-18 00:00:00
# ℹ 16,389 more rows
# ℹ 4 more variables: start_date_in_present_position <dttm>,
#   salary_basis <chr>, employment_category <chr>, avg_salary <dbl>

df['avg_salary'] = df['salary_annual'].mean()

df[df['salary_annual'] > df['avg_salary']]

                        name  ...    avg_salary
0          AARON,  JEFFERY M  ...  81631.458876
1             AARON,  KARINA  ...  81631.458876
2        AARON,  KIMBERLEI R  ...  81631.458876
3        ABAD JR,  VICENTE M  ...  81631.458876
7      ABBATACOLA,  ROBERT J  ...  81631.458876
...                      ...  ...           ...
32797       ZYGADLO,  JOHN P  ...  81631.458876
32798    ZYGADLO,  MICHAEL J  ...  81631.458876
32803      ZYMANTAS,  MARK E  ...  81631.458876
32804    ZYRKOWSKI,  CARLO E  ...  81631.458876
32805    ZYSKOWSKI,  DARIUSZ  ...  81631.458876

[19902 rows x 9 columns]

5. Complex Queries:

Calculate the percentage of employees in each department relative to the total number of employees.

with dept_employees as (
  select department, count(*) cnt_employees
  from df
  group by department
),
ttl_employees as (
  select count(*) as ttl_count
  from df
)
select *, round(cnt_employees * 100.0 / ttl_count, 1) as pct_employees
from dept_employees, ttl_employees
order by pct_employees desc

Displaying records 1 - 10
department	cnt_employees	ttl_count	pct_employees
POLICE	13055	32806	39.8
FIRE	4833	32806	14.7
OEMC	2121	32806	6.5
STREETS & SAN	2045	32806	6.2
WATER MGMNT	1902	32806	5.8
AVIATION	1414	32806	4.3
TRANSPORTN	1244	32806	3.8
PUBLIC LIBRARY	978	32806	3.0
FLEET AND FACILITY MGMT	964	32806	2.9
FAMILY & SUPPORT	625	32806	1.9

df %>% 
  group_by(department) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(ttl_employees = sum(n),
         pct_employees = round((n / ttl_employees) * 100, 1)) %>% 
  arrange(-pct_employees)

# A tibble: 36 × 4
   department                  n ttl_employees pct_employees
   <chr>                   <int>         <int>         <dbl>
 1 POLICE                  13055         32806          39.8
 2 FIRE                     4833         32806          14.7
 3 OEMC                     2121         32806           6.5
 4 STREETS & SAN            2045         32806           6.2
 5 WATER MGMNT              1902         32806           5.8
 6 AVIATION                 1414         32806           4.3
 7 TRANSPORTN               1244         32806           3.8
 8 PUBLIC LIBRARY            978         32806           3  
 9 FLEET AND FACILITY MGMT   964         32806           2.9
10 FAMILY & SUPPORT          625         32806           1.9
# ℹ 26 more rows

Identify job titles that have more than 100 employees.

select title, count(*) cnt_employees
from df
group by title
having cnt_employees > 100
order by cnt_employees desc

Displaying records 1 - 10
title	cnt_employees
POLICE OFFICER	9213
FIREFIGHTER-EMT	1492
SERGEANT	1162
POLICE OFFICER (ASSIGNED AS DETECTIVE)	1015
MOTOR TRUCK DRIVER	709
POOL MOTOR TRUCK DRIVER	693
SANITATION LABORER	634
FIREFIGHTER	533
CROSSING GUARD	464
CONSTRUCTION LABORER	456

df %>% 
  group_by(title) %>% 
  count() %>% 
  filter(n > 100) %>% 
  arrange(-n)

# A tibble: 46 × 2
# Groups:   title [46]
   title                                      n
   <chr>                                  <int>
 1 POLICE OFFICER                          9213
 2 FIREFIGHTER-EMT                         1492
 3 SERGEANT                                1162
 4 POLICE OFFICER (ASSIGNED AS DETECTIVE)  1015
 5 MOTOR TRUCK DRIVER                       709
 6 POOL MOTOR TRUCK DRIVER                  693
 7 SANITATION LABORER                       634
 8 FIREFIGHTER                              533
 9 CROSSING GUARD                           464
10 CONSTRUCTION LABORER                     456
# ℹ 36 more rows

df.groupby('title').agg(
  cnt_title = ('title', 'count')
).reset_index().sort_values(by='cnt_title', ascending=False).query('cnt_title > 100')

                                       title  cnt_title
791                           POLICE OFFICER       9213
529                          FIREFIGHTER-EMT       1492
946                                 SERGEANT       1162
796   POLICE OFFICER (ASSIGNED AS DETECTIVE)       1015
744                       MOTOR TRUCK DRIVER        709
808                  POOL MOTOR TRUCK DRIVER        693
899                       SANITATION LABORER        634
525                              FIREFIGHTER        533
326                           CROSSING GUARD        464
280                     CONSTRUCTION LABORER        456
1079             TRAFFIC CONTROL AIDE-HOURLY        452
680                           LIEUTENANT-EMT        425
679                               LIEUTENANT        379
327                 CROSSING GUARD - PER CBA        348
760                                PARAMEDIC        341
762                            PARAMEDIC I/C        278
598                    GENERAL LABORER - DSS        258
531                    FIREFIGHTER/PARAMEDIC        255
530                FIREFIGHTER-EMT (RECRUIT)        252
140                AVIATION SECURITY OFFICER        236
786         POLICE COMMUNICATIONS OPERATOR I        236
617                        HOISTING ENGINEER        233
520                        FIRE ENGINEER-EMT        231
452                      ELECTRICAL MECHANIC        229
360                           DETENTION AIDE        224
801        POLICE OFFICER / FLD TRNG OFFICER        219
787        POLICE COMMUNICATIONS OPERATOR II        215
752               OPERATING ENGINEER-GROUP A        186
676                             LIBRARY PAGE        166
753               OPERATING ENGINEER-GROUP C        166
689                   MACHINIST (AUTOMOTIVE)        160
576                       FOSTER GRANDPARENT        143
162                              CAPTAIN-EMT        140
277                         CONCRETE LABORER        139
15                    ADMINISTRATIVE ASST II        132
649                                  LABORER        130
783              POLICE ADMINISTRATIVE CLERK        128
918               SENIOR DATA ENTRY OPERATOR        124
780                                  PLUMBER        123
978                               STAFF ASST        121
666                              LIBRARIAN I        121
52                           ASPHALT LABORER        110
979               STAFF ASST TO THE ALDERMAN        107
17                   ADMINISTRATIVE ASST III        107
519                            FIRE ENGINEER        106
764                 PARKING ENFORCEMENT AIDE        106

Find employees who have been in their current position for more than 5 years (assuming the dataset includes a hire date or a similar field).

Note: need to do some more research on the DBI:: package, as the SQL object is reading date fields in as a dbl data type.

df %>% 
  mutate(tenure = round(
    lubridate::time_length(
    today() - lubridate::as_date(start_date_in_present_position), "year"),
    1)) %>% 
  filter(tenure > 5) %>% 
  arrange(-tenure)

# A tibble: 32,806 × 9
   name                title        department salary_annual original_hire_date 
   <chr>               <chr>        <chr>              <dbl> <dttm>             
 1 LANNON,  JOHN P     CONSTRUCTIO… WATER MGM…        83616  1963-07-02 00:00:00
 2 WILLIAMS,  DAISY    CROSSING GU… OEMC              23254. 1965-12-06 00:00:00
 3 CALHOUN,  ROCKELL H CROSSING GU… OEMC              23254. 1968-03-25 00:00:00
 4 SCOTT,  ROSIE M     PARKING ENF… FINANCE           64392  1968-08-19 00:00:00
 5 BURKE,  CATHERINE   CROSSING GU… OEMC              23254. 1970-05-13 00:00:00
 6 BISHOP,  DORIS J    LIBRARY ASS… PUBLIC LI…        69492  1971-09-07 00:00:00
 7 LUMPKIN,  STEPHEN   SANITATION … STREETS &…        75317. 1972-03-11 00:00:00
 8 KUCHARSKI,  JOYCE M CROSSING GU… OEMC              23254. 1973-09-17 00:00:00
 9 CORSO JR,  JOHN J   PROPERTY CU… POLICE            70092  1975-05-27 00:00:00
10 PHILLIPS,  EUGENIA  CROSSING GU… OEMC              23254. 1976-01-12 00:00:00
# ℹ 32,796 more rows
# ℹ 4 more variables: start_date_in_present_position <dttm>,
#   salary_basis <chr>, employment_category <chr>, tenure <dbl>

6. Own Questions

Pull the second title of all employees that have been promoted at least once.

with promotion_seq as (
      select
          name,
          title,
          department,
          start_date_in_present_position,
          ROW_NUMBER() OVER(
            partition by name
            order by start_date_in_present_position
          ) as seq_promotion
      from df
)
select *
from promotion_seq
where seq_promotion = 2
order by name

Displaying records 1 - 10
name	title	department	start_date_in_present_position	seq_promotion
ADE, JAMES P	POLICE OFFICER	POLICE	1283299200	2
ANDERSON, DAVID C	PROJECTS ADMINISTRATOR	BUILDINGS	1393200000	2
ANDERSON, DONALD	PROJECT MANAGER	AVIATION	1494806400	2
ANDERSON, RHONDA M	SERGEANT	POLICE	1464739200	2
ARROYO, FRANCISCO	WARD SUPERINTENDENT	STREETS & SAN	1489622400	2
ATECA, SALVADOR R	MACHINIST (AUTOMOTIVE)	FLEET AND FACILITY MGMT	1425340800	2
BIANCHI, MICHAEL J	FIRE COMMUNICATIONS OPERATOR I	OEMC	1502841600	2
BRACKEN, DANIEL J	PARAMEDIC	FIRE	1418688000	2
BROWN, ANTHONY	GENERAL LABORER - DSS	STREETS & SAN	1405296000	2
BROWN, PATRICK J	POLICE OFFICER (ASSIGNED AS DETECTIVE)	POLICE	1429142400	2

df %>% 
  group_by(name) %>% 
  arrange(start_date_in_present_position) %>% 
  mutate(seq_promotion = row_number()) %>% 
  filter(seq_promotion == 2) %>% 
  arrange(name)

# A tibble: 284 × 9
# Groups:   name [284]
   name                title        department salary_annual original_hire_date 
   <chr>               <chr>        <chr>              <dbl> <dttm>             
 1 ADE,  JAMES P       POLICE OFFI… POLICE            84054  2010-09-01 00:00:00
 2 ANDERSON,  DAVID C  PROJECTS AD… BUILDINGS         94824  2014-02-24 00:00:00
 3 ANDERSON,  DONALD   PROJECT MAN… AVIATION          80868  1987-08-04 00:00:00
 4 ANDERSON,  RHONDA M SERGEANT     POLICE           104628  2000-02-28 00:00:00
 5 ARROYO,  FRANCISCO  WARD SUPERI… STREETS &…        73212  2015-05-18 00:00:00
 6 ATECA,  SALVADOR R  MACHINIST (… FLEET AND…        96408  2015-03-03 00:00:00
 7 BIANCHI,  MICHAEL J FIRE COMMUN… OEMC              51264  2017-08-16 00:00:00
 8 BRACKEN,  DANIEL J  PARAMEDIC    FIRE              76266  2014-12-16 00:00:00
 9 BROWN,  ANTHONY     GENERAL LAB… STREETS &…        43202. 2014-07-14 00:00:00
10 BROWN,  PATRICK J   POLICE OFFI… POLICE            97440  1999-03-08 00:00:00
# ℹ 274 more rows
# ℹ 4 more variables: start_date_in_present_position <dttm>,
#   salary_basis <chr>, employment_category <chr>, seq_promotion <int>

df.sort_values(by='start_date_in_present_position').groupby('name').apply(
  lambda x: x.assign(seq_promotion=range(1, len(x) + 1))).query('seq_promotion == 2').reset_index(drop=True).sort_values(by='name')

<string>:1: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
                    name  ... seq_promotion
0          ADE,  JAMES P  ...             2
1     ANDERSON,  DAVID C  ...             2
2      ANDERSON,  DONALD  ...             2
3    ANDERSON,  RHONDA M  ...             2
4     ARROYO,  FRANCISCO  ...             2
..                   ...  ...           ...
279    WILSON,  EDWARD J  ...             2
280    WILSON,  JANICE L  ...             2
281      WILSON,  JOHN L  ...             2
282      WILSON,  JOSHUA  ...             2
283  WRONOWSKI,  JAMES J  ...             2

[284 rows x 10 columns]

Conclusion

This project had three purposes:

To show how to run R, SQL, and Python all interchangeably in a Quarto document. Quarto and the reticulate:: and DBI:: packages allow for the functionality to switch between different methods of handling, querying, and analyzing data. Practically speaking, you probably wouldn’t switch between these as often as I did in this work, but it’s useful to verify your work and could come in handy for specific circumstances (e.g. analyzing with Pandas and visualizing with ggplot2). I did find that some of the work in SQL is clunky – data types didn’t read in correctly for dates – but I have not yet dug more deeply into the DBI:: package to troubleshoot.
To compare the ease of writing code using dplyr (R), SQL, and pandas (Python). Personally, I feel that R and dplyr:: is the simplest syntax to use, followed by SQL and Python. dplyr:: is very linear in its construction, and it rarely has extra characters to remember. Python’s brackets, quotes, and chaining all take some getting used to, while R and SQL feel more direct.
To include some demonstration of SQL in my portfolio (which is often not included but remains a critical skill). This was a fun way to demonstrate some SQL expertise while comparing and verifying output against R and Python.