library(tidyverse)
library(DBI)
library(RSQLite)
Introduction
The BrainTree SQL coding challenge is known as a great challenge for data analysts looking to test their skills. In this project, I am putting my skills to the test, using Quarto and the DBI::
and RSQLite::
packages to run SQL queries on an in-memory database.
Setup and data preparation
Required packages:
About the dataset:
Here is some text from the BrainTree challenge site:
This is an opportunity for you to show us your grasp of SQL which plays a key role in the day-to-day job that you’re applying for. All members of the Analytics Data team have taken this challenge and participated in its creation. If you feel that there are any questions that are either not fair or not clear, please do let us know; this is VERY important to us! A few important things to note before you get started:
- All work should be done in SQL. Any variant is fine (e.g. MS SQL, Postgres, MySQL, etc.). If you normally use R, SAS, or other similar tools with SQL it’s important that you show that you can work in SQL by itself to produce the correct answers to this challenge.
- If you are confused by a specific question, you can request clarification by replying to the message that this challenge was attached to. This is NOT intended for you to obtain technical help with solving the problems on this challenge or asking for hints; it should only be used for question clarification. This challenge is due back within 1 week (7 calendar days) of being sent to you.
- If you cannot answer a question, please do your best, show your work, leave comments, and let us know your general thoughts.
- We are interested in BOTH your answers and the work/code that you write to get them. Please leave plenty of comments for us to read when we review your work.
- There are some blank/null values in this set. That’s how we found it and it reflects the nature of imperfect data. Please work with this and provide explanations of any issues or workarounds required to arrive at your answers.
- There are no intentional gotchas, trick questions, or traps; the challenge is intended to demonstrate some of the typical day-to-day SQL skills that the job requires.
- Some of these questions may seem nonsensical and you may find yourself asking, “why would anyone want to know that?!” They are intended purely as a measure of your SQL skills and not as actual questions that we would expect to ask of this type of data set. Please take them with a grain of salt.
<- read_csv('.//data/countries.csv')
countries <- read_csv('.//data/continent_map.csv')
continent_map <- read_csv('.//data/continents.csv')
continents <- read_csv('.//data/per_capita.csv') per_capita
Setting up SQL to execute
The DBI::
package allows you to create an in-memory database to query against. The DBI project site is a great place to learn more about it. I’ll start by doing some initial setup and establishing the connection between the R dataframe and the SQL table name that I’ll query.
<- DBI::dbConnect(SQLite(), ":memory:")
con ::dbWriteTable(conn = con, name = "continent_map", value = continent_map)
DBI::dbWriteTable(conn = con, name = "per_capita", value = per_capita)
DBI::dbWriteTable(conn = con, name = "continents", value = continents)
DBI::dbWriteTable(conn = con, name = "countries", value = countries) DBI
Questions
- Data Integrity Checking & Cleanup
Alphabetically list all of the country codes in the continent_map table that appear more than once. Display any values where country_code is null as country_code = “FOO” and make this row appear first in the list, even though it should alphabetically sort to the middle. Provide the results of this query as your answer.
select
CASE WHEN country_code IS NULL THEN "FOO" ELSE country_code END as country_code,
count(*) as cnt_country
from continent_map
group by country_code
having cnt_country > 1
country_code | cnt_country |
---|---|
FOO | 4 |
ARM | 3 |
AZE | 2 |
CYP | 2 |
GEO | 2 |
KAZ | 2 |
RUS | 2 |
TUR | 2 |
UMI | 2 |
For all countries that have multiple rows in the continent_map table, delete all multiple records leaving only the 1 record per country. The record that you keep should be the first one when sorted by the continent_code alphabetically ascending. Provide the query/ies and explanation of step(s) that you follow to delete these records.
with continent_map_update as (
select *,
ROW_NUMBER() OVER(
partition by country_code
order by country_code asc) as seq
from continent_map)
select *
from continent_map_update
where seq = 1
country_code | continent_code | seq |
---|---|---|
NA | OC | 1 |
ABW | NA | 1 |
AFG | AS | 1 |
AGO | AF | 1 |
AIA | NA | 1 |
ALA | EU | 1 |
ALB | EU | 1 |
AND | EU | 1 |
ANT | NA | 1 |
ARE | AS | 1 |
- List the countries ranked 10-12 in each continent by the percent of year-over-year growth descending from 2011 to 2012.
The percent of growth should be calculated as: ((2012 gdp - 2011 gdp) / 2011 gdp)
The list should include the columns:
- rank
- continent_name
- country_code
- country_name
- growth_percent
For this question, there are two elements of data cleaning:
- To list the countries and their ranking, I filtered out the many country codes that actually represented regions or country income groupings.
- The
continent_name
is NA for North America, and I wanted to generate the results in a cleaner way.
with df as (
select
*,
p.
countries.country_name,CASE WHEN p.country_code IN ('EAP', 'EAS', 'LIC', 'ARB', 'NOC', 'UMC', 'MEA', 'PSS', 'MNA', 'MIC', 'LMY', 'HPC', 'NAC', 'LDC', 'LMC', 'SSF', 'SSA', 'CSS', 'WLD', 'ECA', 'HIC', 'OEC', 'OED', 'SST', 'LCN', 'SAS', 'LAC', 'OSS', 'ECS', 'EUU', 'EMU') THEN 'Region' ELSE 'Country' END as country_region,
CASE WHEN c.continent_name IS NULL THEN 'North America' ELSE c.continent_name END as continent_name,
LAG(p.gdp_per_capita, 1, NULL) OVER(
partition by p.country_code
order by year) as t0_gdp_pc
from per_capita p
left join continent_map m
on p.country_code = m.country_code
left join continents c
on m.continent_code = c.continent_code
left join countries
on p.country_code = countries.country_code
where country_region = 'Country'
),as (
growth select
*,
round(((gdp_per_capita - t0_gdp_pc)/ t0_gdp_pc) * 100, 2) as growth_percent
from df
),as (
rank_2012 select
*,
RANK() OVER(
partition by continent_name
order by growth_percent desc) as rank
from growth
where year = 2012
and growth_percent IS NOT NULL
)select
rank,
continent_name,
country_code,
country_name,concat(growth_percent,'%') as growth_percent
from rank_2012
order by growth_percent desc
rank | continent_name | country_code | country_name | growth_percent |
---|---|---|---|---|
8 | Africa | LBR | Liberia | 9.77% |
12 | Asia | PHL | Philippines | 9.73% |
13 | Asia | KWT | Kuwait | 9.68% |
9 | Africa | EGY | Egypt, Arab Rep. | 9.53% |
14 | Asia | GEO | Georgia | 8.92% |
10 | Africa | RWA | Rwanda | 8.73% |
4 | Oceania | AUS | Australia | 8.63% |
2 | North America | KNA | St. Kitts and Nevis | 8.45% |
3 | North America | CRI | Costa Rica | 8.38% |
6 | South America | COL | Colombia | 8.38% |
- For the year 2012, create a 3 column, 1 row report showing the percent share of gdp_per_capita for the following regions:
- Asia, (ii) Europe, (iii) the Rest of the World. Your result should look something like
Asia Europe Rest of World 25.0% 25.0% 50.0%
with df as (
select
*,
p.
countries.country_name,CASE WHEN p.country_code IN ('EAP', 'EAS', 'LIC', 'ARB', 'NOC', 'UMC', 'MEA', 'PSS', 'MNA', 'MIC', 'LMY', 'HPC', 'NAC', 'LDC', 'LMC', 'SSF', 'SSA', 'CSS', 'WLD', 'ECA', 'HIC', 'OEC', 'OED', 'SST', 'LCN', 'SAS', 'LAC', 'OSS', 'ECS', 'EUU', 'EMU') THEN 'Region' ELSE 'Country' END as country_region,
CASE WHEN c.continent_name IS NULL THEN 'North America' ELSE c.continent_name END as continent_name,
LAG(p.gdp_per_capita, 1, NULL) OVER(
partition by p.country_code
order by year) as t0_gdp_pc
from per_capita p
left join continent_map m
on p.country_code = m.country_code
left join continents c
on m.continent_code = c.continent_code
left join countries
on p.country_code = countries.country_code
where country_region = 'Country'
),as (
sum_by_continent select
continent_name,sum(gdp_per_capita) as sum_gdp_per_capita
from df
where year = 2012
group by continent_name
),as (
sum_total select
sum(gdp_per_capita) as ttl_gdp_per_capita
from df
where year = 2012
),as (
percent_tbl select
*,
sum_by_continent.
sum_total.ttl_gdp_per_capita,round((sum_gdp_per_capita / ttl_gdp_per_capita) * 100, 1) as percent
from sum_by_continent, sum_total
),as (
tbl_sums select
CASE WHEN continent_name = 'Asia' THEN 'Asia'
WHEN continent_name = 'Europe' THEN 'Europe'
ELSE 'Rest of World' END as continent_groups,
sum(percent) as percent
from percent_tbl
group by continent_groups
)select
MAX(CASE WHEN continent_groups = 'Asia' THEN concat(percent, '%') END) AS Asia,
MAX(CASE WHEN continent_groups = 'Europe' THEN concat(percent, '%') END) AS Europe,
MAX(CASE WHEN continent_groups = 'Rest of World' THEN concat(percent, '%') END) AS Rest_of_World
FROM tbl_sums
Asia | Europe | Rest_of_World |
---|---|---|
27.6% | 43.8% | 28.6% |
4a. What is the count of countries and sum of their related gdp_per_capita values for the year 2007 where the string ‘an’ (case insensitive) appears anywhere in the country name?
select
count(*) as nbr_countries,
sum(per_capita.gdp_per_capita) as sum_gdp_per_capita
from per_capita
left join countries
on per_capita.country_code = countries.country_code
where countries.country_name like '%an%'
and year = 2007
nbr_countries | sum_gdp_per_capita |
---|---|
68 | 1022936 |
4b. Repeat question 4a, but this time make the query case sensitive.
select
count(*) as nbr_countries,
sum(per_capita.gdp_per_capita) as sum_gdp_per_capita
from per_capita
left join countries
on per_capita.country_code = countries.country_code
where countries.country_name glob '*an*'
and year = 2007
nbr_countries | sum_gdp_per_capita |
---|---|
66 | 979600.7 |
- Find the sum of gpd_per_capita by year and the count of countries for each year that have non-null gdp_per_capita where (i) the year is before 2012 and (ii) the country has a null gdp_per_capita in 2012. Your result should have the columns:
year country_count total
select
year,
count(country_code) as country_count,
sum(gdp_per_capita) as total
from per_capita
where 1 = 1
and gdp_per_capita IS NOT NULL
and country_code IN
select country_code
(from per_capita
where year = 2012
and gdp_per_capita IS NULL)
group by year
year | country_count | total |
---|---|---|
2004 | 15 | 491203.2 |
2005 | 15 | 510735.0 |
2006 | 14 | 553689.6 |
2007 | 14 | 654508.8 |
2008 | 10 | 574016.2 |
2009 | 9 | 473103.3 |
2010 | 4 | 179750.8 |
2011 | 4 | 199152.7 |
- All in a single query, execute all of the steps below and provide the results as your final answer:
- create a single list of all per_capita records for year 2009 that includes columns:
continent_name country_code country_name gdp_per_capita
- order this list by:
continent_name ascending characters 2 through 4 (inclusive) of the country_name descending
create a running total of gdp_per_capita by continent_name
return only the first record from the ordered list for which each continent’s running total of gdp_per_capita meets or exceeds $70,000.00 with the following columns:
continent_name country_code country_name gdp_per_capita running_total
with df as (
select
CASE WHEN continents.continent_name IS NULL THEN 'North America'
ELSE continents.continent_name END as continent_name,
per_capita.country_code,
countries.country_name,
per_capita.gdp_per_capita,ROW_NUMBER() OVER(
partition by continent_name
order by substr(country_name, 2, 3) desc) as row_number,
SUM(gdp_per_capita) OVER(
partition by continent_name
order by substr(country_name, 2, 3) desc) as running_total
from per_capita
left join countries
on per_capita.country_code = countries.country_code
left join continent_map
on per_capita.country_code = continent_map.country_code
left join continents
on continent_map.continent_code = continents.continent_code
where year = 2009
order by continent_name asc
),as (
criteria select
continent_name,
country_code,
country_name,
gdp_per_capita,
running_total,row_number
from df
where running_total >= 70000.00
),as (
final select
continent_name,
country_code,
country_name,
gdp_per_capita,
running_total,ROW_NUMBER() OVER(
partition by continent_name) as row_number
from criteria
)select *
from final
where row_number = 1
continent_name | country_code | country_name | gdp_per_capita | running_total | row_number |
---|---|---|---|---|---|
Africa | LBY | Libya | 10455.575 | 70227.16 | 1 |
Asia | KWT | Kuwait | 37160.543 | 73591.81 | 1 |
Europe | CHE | Switzerland | 65790.067 | 119051.78 | 1 |
North America | ECA | Europe & Central Asia (developing only) | 5495.417 | 97451.38 | 1 |
Oceania | NZL | New Zealand | 27474.333 | 84623.92 | 1 |
South America | ECU | Ecuador | 4236.778 | 72315.82 | 1 |
- Find the country with the highest average gdp_per_capita for each continent for all years. Now compare your list to the following data set. Please describe any and all mistakes that you can find with the data set below. Include any code that you use to help detect these mistakes.
with df as (
select
per_capita.country_code,
countries.country_name,year,
per_capita.CASE WHEN continent_map.continent_code IS NULL THEN 'NA'
ELSE continent_map.continent_code END as continent_code,
CASE WHEN continents.continent_name IS NULL THEN 'North America'
ELSE continents.continent_name END as continent_name,
CASE WHEN per_capita.country_code IN ('EAP', 'EAS', 'LIC', 'ARB', 'NOC', 'UMC', 'MEA', 'PSS', 'MNA', 'MIC', 'LMY', 'HPC', 'NAC', 'LDC', 'LMC', 'SSF', 'SSA', 'CSS', 'WLD', 'ECA', 'HIC', 'OEC', 'OED', 'SST', 'LCN', 'SAS', 'LAC', 'OSS', 'ECS', 'EUU', 'EMU') THEN 'Region' ELSE 'Country' END as country_region,
per_capita.gdp_per_capitafrom per_capita
left join continent_map
on per_capita.country_code = continent_map.country_code
left join countries
on per_capita.country_code = countries.country_code
left join continents
on continent_map.continent_code = continents.continent_code
),as (
avg_gdp_per_capita select
continent_name,
country_name,
country_code,round(AVG(gdp_per_capita), 2) as avg_gdp_per_capita
from df
where country_region = 'Country'
group by continent_name, country_name, country_code
),rank as (
select
*,
RANK() OVER(
partition by continent_name
order by avg_gdp_per_capita desc) as rank_gdp_per_capita
from avg_gdp_per_capita
)select *
from rank
where rank_gdp_per_capita = 1
continent_name | country_name | country_code | avg_gdp_per_capita | rank_gdp_per_capita |
---|---|---|---|---|
Africa | Equatorial Guinea | GNQ | 17955.72 | 1 |
Asia | Qatar | QAT | 70567.96 | 1 |
Europe | Monaco | MCO | 151421.89 | 1 |
North America | Bermuda | BMU | 84634.83 | 1 |
Oceania | Australia | AUS | 46147.45 | 1 |
South America | Chile | CHL | 10781.71 | 1 |