Introduction

The BrainTree SQL coding challenge is known as a great challenge for data analysts looking to test their skills. In this project, I am putting my skills to the test, using Quarto and the DBI:: and RSQLite:: packages to run SQL queries on an in-memory database.

Setup and data preparation

Required packages:

library(tidyverse)
library(DBI)
library(RSQLite)

About the dataset:

Here is some text from the BrainTree challenge site:

This is an opportunity for you to show us your grasp of SQL which plays a key role in the day-to-day job that you’re applying for. All members of the Analytics Data team have taken this challenge and participated in its creation. If you feel that there are any questions that are either not fair or not clear, please do let us know; this is VERY important to us! A few important things to note before you get started:

All work should be done in SQL. Any variant is fine (e.g. MS SQL, Postgres, MySQL, etc.). If you normally use R, SAS, or other similar tools with SQL it’s important that you show that you can work in SQL by itself to produce the correct answers to this challenge.

If you are confused by a specific question, you can request clarification by replying to the message that this challenge was attached to. This is NOT intended for you to obtain technical help with solving the problems on this challenge or asking for hints; it should only be used for question clarification. This challenge is due back within 1 week (7 calendar days) of being sent to you.

If you cannot answer a question, please do your best, show your work, leave comments, and let us know your general thoughts.

We are interested in BOTH your answers and the work/code that you write to get them. Please leave plenty of comments for us to read when we review your work.

There are some blank/null values in this set. That’s how we found it and it reflects the nature of imperfect data. Please work with this and provide explanations of any issues or workarounds required to arrive at your answers.

There are no intentional gotchas, trick questions, or traps; the challenge is intended to demonstrate some of the typical day-to-day SQL skills that the job requires.

Some of these questions may seem nonsensical and you may find yourself asking, “why would anyone want to know that?!” They are intended purely as a measure of your SQL skills and not as actual questions that we would expect to ask of this type of data set. Please take them with a grain of salt.

countries <- read_csv('.//data/countries.csv')
continent_map <- read_csv('.//data/continent_map.csv')
continents <- read_csv('.//data/continents.csv')
per_capita <- read_csv('.//data/per_capita.csv')

Setting up SQL to execute

The DBI:: package allows you to create an in-memory database to query against. The DBI project site is a great place to learn more about it. I’ll start by doing some initial setup and establishing the connection between the R dataframe and the SQL table name that I’ll query.

con <- DBI::dbConnect(SQLite(), ":memory:")
DBI::dbWriteTable(conn = con, name = "continent_map", value = continent_map)
DBI::dbWriteTable(conn = con, name = "per_capita", value = per_capita)
DBI::dbWriteTable(conn = con, name = "continents", value = continents)
DBI::dbWriteTable(conn = con, name = "countries", value = countries)

Questions

Data Integrity Checking & Cleanup

Alphabetically list all of the country codes in the continent_map table that appear more than once. Display any values where country_code is null as country_code = “FOO” and make this row appear first in the list, even though it should alphabetically sort to the middle. Provide the results of this query as your answer.

select 
      CASE WHEN country_code IS NULL THEN "FOO" ELSE country_code END as country_code,
      count(*) as cnt_country
from continent_map
group by country_code
having cnt_country > 1

9 records
country_code	cnt_country
FOO	4
ARM	3
AZE	2
CYP	2
GEO	2
KAZ	2
RUS	2
TUR	2
UMI	2

For all countries that have multiple rows in the continent_map table, delete all multiple records leaving only the 1 record per country. The record that you keep should be the first one when sorted by the continent_code alphabetically ascending. Provide the query/ies and explanation of step(s) that you follow to delete these records.

with continent_map_update as (
select *,
        ROW_NUMBER() OVER(
            partition by country_code
            order by country_code asc) as seq
from continent_map)
select *
from continent_map_update
where seq = 1

Displaying records 1 - 10
country_code	continent_code	seq
NA	OC	1
ABW	NA	1
AFG	AS	1
AGO	AF	1
AIA	NA	1
ALA	EU	1
ALB	EU	1
AND	EU	1
ANT	NA	1
ARE	AS	1

List the countries ranked 10-12 in each continent by the percent of year-over-year growth descending from 2011 to 2012.

The percent of growth should be calculated as: ((2012 gdp - 2011 gdp) / 2011 gdp)

The list should include the columns:

rank
continent_name
country_code
country_name
growth_percent

For this question, there are two elements of data cleaning:

To list the countries and their ranking, I filtered out the many country codes that actually represented regions or country income groupings.
The continent_name is NA for North America, and I wanted to generate the results in a cleaner way.

with df as (
select 
        p.*,
        countries.country_name,
        CASE WHEN p.country_code IN ('EAP', 'EAS', 'LIC', 'ARB', 'NOC', 'UMC', 'MEA', 'PSS', 'MNA', 'MIC', 'LMY', 'HPC', 'NAC', 'LDC', 'LMC', 'SSF', 'SSA', 'CSS', 'WLD', 'ECA', 'HIC', 'OEC', 'OED', 'SST', 'LCN', 'SAS', 'LAC', 'OSS', 'ECS', 'EUU', 'EMU') THEN 'Region' ELSE 'Country' END as country_region,
        CASE WHEN c.continent_name IS NULL THEN 'North America' ELSE c.continent_name END as continent_name,
        LAG(p.gdp_per_capita, 1, NULL) OVER(
            partition by p.country_code
            order by year) as t0_gdp_pc
from per_capita p
left join continent_map m
on p.country_code = m.country_code
left join continents c
on m.continent_code = c.continent_code
left join countries
on p.country_code = countries.country_code
where country_region = 'Country'
),
growth as (
select 
        *,
        round(((gdp_per_capita - t0_gdp_pc)/ t0_gdp_pc) * 100, 2) as growth_percent
from df
),
rank_2012 as (
select 
        *,
        RANK() OVER(
            partition by continent_name
            order by growth_percent desc) as rank
from growth
where year = 2012
and growth_percent IS NOT NULL
)
select
          rank,
          continent_name,
          country_code,
          country_name,
          concat(growth_percent,'%') as growth_percent
from rank_2012
order by growth_percent desc

Displaying records 1 - 10
rank	continent_name	country_code	country_name	growth_percent
8	Africa	LBR	Liberia	9.77%
12	Asia	PHL	Philippines	9.73%
13	Asia	KWT	Kuwait	9.68%
9	Africa	EGY	Egypt, Arab Rep.	9.53%
14	Asia	GEO	Georgia	8.92%
10	Africa	RWA	Rwanda	8.73%
4	Oceania	AUS	Australia	8.63%
2	North America	KNA	St. Kitts and Nevis	8.45%
3	North America	CRI	Costa Rica	8.38%
6	South America	COL	Colombia	8.38%

For the year 2012, create a 3 column, 1 row report showing the percent share of gdp_per_capita for the following regions:

Asia, (ii) Europe, (iii) the Rest of the World. Your result should look something like

Asia Europe Rest of World 25.0% 25.0% 50.0%

with df as (
select 
        p.*,
        countries.country_name,
        CASE WHEN p.country_code IN ('EAP', 'EAS', 'LIC', 'ARB', 'NOC', 'UMC', 'MEA', 'PSS', 'MNA', 'MIC', 'LMY', 'HPC', 'NAC', 'LDC', 'LMC', 'SSF', 'SSA', 'CSS', 'WLD', 'ECA', 'HIC', 'OEC', 'OED', 'SST', 'LCN', 'SAS', 'LAC', 'OSS', 'ECS', 'EUU', 'EMU') THEN 'Region' ELSE 'Country' END as country_region,
        CASE WHEN c.continent_name IS NULL THEN 'North America' ELSE c.continent_name END as continent_name,
        LAG(p.gdp_per_capita, 1, NULL) OVER(
            partition by p.country_code
            order by year) as t0_gdp_pc
from per_capita p
left join continent_map m
on p.country_code = m.country_code
left join continents c
on m.continent_code = c.continent_code
left join countries
on p.country_code = countries.country_code
where country_region = 'Country'
),
sum_by_continent as (
select 
        continent_name,
        sum(gdp_per_capita) as sum_gdp_per_capita
from df
where year = 2012
group by continent_name
),
sum_total as (
select
        sum(gdp_per_capita) as ttl_gdp_per_capita
from df
where year = 2012
),
percent_tbl as (
select 
          sum_by_continent.*,
          sum_total.ttl_gdp_per_capita,
          round((sum_gdp_per_capita / ttl_gdp_per_capita) * 100, 1)  as percent
from sum_by_continent, sum_total
),
tbl_sums as (
select
        CASE WHEN continent_name = 'Asia' THEN 'Asia' 
        WHEN continent_name = 'Europe' THEN 'Europe' 
        ELSE 'Rest of World' END as continent_groups,
        sum(percent) as percent
from percent_tbl
group by continent_groups
)
select
        MAX(CASE WHEN continent_groups = 'Asia' THEN concat(percent, '%') END) AS Asia,
        MAX(CASE WHEN continent_groups = 'Europe' THEN concat(percent, '%') END) AS Europe,
        MAX(CASE WHEN continent_groups = 'Rest of World' THEN concat(percent, '%') END) AS Rest_of_World
FROM tbl_sums

1 records
Asia	Europe	Rest_of_World
27.6%	43.8%	28.6%

4a. What is the count of countries and sum of their related gdp_per_capita values for the year 2007 where the string ‘an’ (case insensitive) appears anywhere in the country name?

select
        count(*) as nbr_countries,
        sum(per_capita.gdp_per_capita) as sum_gdp_per_capita
from per_capita
left join countries
on per_capita.country_code = countries.country_code
where countries.country_name like '%an%'
and year = 2007

1 records
nbr_countries	sum_gdp_per_capita
68	1022936

4b. Repeat question 4a, but this time make the query case sensitive.

select
        count(*) as nbr_countries,
        sum(per_capita.gdp_per_capita) as sum_gdp_per_capita
from per_capita
left join countries
on per_capita.country_code = countries.country_code
where countries.country_name glob '*an*'
and year = 2007

1 records
nbr_countries	sum_gdp_per_capita
66	979600.7

Find the sum of gpd_per_capita by year and the count of countries for each year that have non-null gdp_per_capita where (i) the year is before 2012 and (ii) the country has a null gdp_per_capita in 2012. Your result should have the columns:

year country_count total

select 
    year,
    count(country_code) as country_count,
    sum(gdp_per_capita) as total
from per_capita
where 1 = 1
and gdp_per_capita IS NOT NULL
and country_code IN 
    (select country_code
    from per_capita
    where year = 2012
    and gdp_per_capita IS NULL)
group by year

8 records
year	country_count	total
2004	15	491203.2
2005	15	510735.0
2006	14	553689.6
2007	14	654508.8
2008	10	574016.2
2009	9	473103.3
2010	4	179750.8
2011	4	199152.7

All in a single query, execute all of the steps below and provide the results as your final answer:

create a single list of all per_capita records for year 2009 that includes columns:

continent_name country_code country_name gdp_per_capita

order this list by:

continent_name ascending characters 2 through 4 (inclusive) of the country_name descending

create a running total of gdp_per_capita by continent_name
return only the first record from the ordered list for which each continent’s running total of gdp_per_capita meets or exceeds $70,000.00 with the following columns:

continent_name country_code country_name gdp_per_capita running_total

with df as (
select
        CASE WHEN continents.continent_name IS NULL THEN 'North America'
            ELSE continents.continent_name END as continent_name,
        per_capita.country_code,
        countries.country_name,
        per_capita.gdp_per_capita,
        ROW_NUMBER() OVER(
        partition by continent_name
            order by substr(country_name, 2, 3) desc) as row_number,
        SUM(gdp_per_capita) OVER(
            partition by continent_name
            order by substr(country_name, 2, 3) desc) as running_total
from per_capita
left join countries
on per_capita.country_code = countries.country_code
left join continent_map
on per_capita.country_code = continent_map.country_code
left join continents
on continent_map.continent_code = continents.continent_code
where year = 2009
order by continent_name asc
),
criteria as (
select
        continent_name,
        country_code,
        country_name,
        gdp_per_capita,
        running_total,
        row_number
from df
where running_total >= 70000.00
),
final as (
select
        continent_name,
        country_code,
        country_name,
        gdp_per_capita,
        running_total,
        ROW_NUMBER() OVER(
            partition by continent_name) as row_number
from criteria
)
select *
from final
where row_number = 1

6 records
continent_name	country_code	country_name	gdp_per_capita	running_total	row_number
Africa	LBY	Libya	10455.575	70227.16	1
Asia	KWT	Kuwait	37160.543	73591.81	1
Europe	CHE	Switzerland	65790.067	119051.78	1
North America	ECA	Europe & Central Asia (developing only)	5495.417	97451.38	1
Oceania	NZL	New Zealand	27474.333	84623.92	1
South America	ECU	Ecuador	4236.778	72315.82	1

Find the country with the highest average gdp_per_capita for each continent for all years. Now compare your list to the following data set. Please describe any and all mistakes that you can find with the data set below. Include any code that you use to help detect these mistakes.

with df as (
select
        per_capita.country_code,
        countries.country_name,
        per_capita.year,
        CASE WHEN continent_map.continent_code IS NULL THEN 'NA' 
            ELSE continent_map.continent_code END as continent_code,
        CASE WHEN continents.continent_name IS NULL THEN 'North America' 
            ELSE continents.continent_name END as continent_name,
        CASE WHEN per_capita.country_code IN ('EAP', 'EAS', 'LIC', 'ARB', 'NOC', 'UMC', 'MEA', 'PSS', 'MNA', 'MIC', 'LMY', 'HPC', 'NAC', 'LDC', 'LMC', 'SSF', 'SSA', 'CSS', 'WLD', 'ECA', 'HIC', 'OEC', 'OED', 'SST', 'LCN', 'SAS', 'LAC', 'OSS', 'ECS', 'EUU', 'EMU') THEN 'Region' ELSE 'Country' END as country_region,
        per_capita.gdp_per_capita
from per_capita
left join continent_map
on per_capita.country_code = continent_map.country_code
left join countries
on per_capita.country_code = countries.country_code
left join continents
on continent_map.continent_code = continents.continent_code
),
avg_gdp_per_capita as (
select
        continent_name,
        country_name,
        country_code,
        round(AVG(gdp_per_capita), 2) as avg_gdp_per_capita
from df
where country_region = 'Country'
group by continent_name, country_name, country_code
),
rank as (
select 
        *,
        RANK() OVER(
            partition by continent_name
            order by avg_gdp_per_capita desc) as rank_gdp_per_capita
from avg_gdp_per_capita
)
select *
from rank
where rank_gdp_per_capita = 1

6 records
continent_name	country_name	country_code	avg_gdp_per_capita	rank_gdp_per_capita
Africa	Equatorial Guinea	GNQ	17955.72	1
Asia	Qatar	QAT	70567.96	1
Europe	Monaco	MCO	151421.89	1
North America	Bermuda	BMU	84634.83	1
Oceania	Australia	AUS	46147.45	1
South America	Chile	CHL	10781.71	1