Prescreening Tasks for Data Engineer

This document provides the exercises that are part of the prescreening process for Data Engineering positions. As such, their goal is to provide insights into solutioning process and skills application for solving example tasks related to this position. Thus, please make sure to explain the important details and specify any assumption taken.

Exercise 1

Assume having a Sales table with the following fields:

And another table named Discounts with the following structure:

For a customer and a sales order there can exist a discount value in the Discounts table, however there is no information about the precise discount value per each item in the Sales table. Thus, for this exercise it is necessary to:

Exercise 2

Implement the solution above, preferably using Python (with Pandas), Java, C#, or another programming language. You can also use a pseudo-code instead. Please make sure to provide comments where needed to the solution provided.

Assumptions

I will assume that this is an e-commerce app, and due to the description of the problem this is a design problem of an OLTP system, the exercises are not asking about agregations, data ingestion, transformations and less how to move data between systems, in order to get insights from the preexisted data, there are no references neither of the velocity of the data nor about the tasks scheduled, all the situations mentioned before are features of the OLAP solutions, Hence, I will avoid based my decisions thinking in OLAP systems.

A sales order is a record that a customer can use to initiate or request a sale. In general, the sales order is filled out by the customer to order certain products from a business. The customer pays for these products when they submit the sales order. This document represents the promise that the requested goods will be ordered by the business and delivered to the customer later. When the goods are fulfilled and picked up by the customer, the sales order is used as a receipt to make sure the correct amount and types of products are given to the customer. In simple terms, a sales order is a request from a customer for specific items.

There are two relevant situations in the description:

Base case

The initial data model allows assigning a discount per user to each purchased item that appears in the sales table, but there are two problems with this approach:

  1. You cannot add different discounts to each item
  2. You cannot add more than one item of the same type to the order, this would create a problem with the composite primary key.

initial

let’s focus in the relevant tables using PostgreSQL


CREATE TABLE IF NOT EXISTS Sales(
    sales_order_id int NOT NULL,
    sales_order_item int NOT NULL, 
    customer_id int null,
    date timestamp not null, 
    transaction_value float not null, 
    discounted_value float not null,
    PRIMARY KEY(sales_order_id, sales_order_item, customer_id)
);


CREATE TABLE IF NOT EXISTS Discounts(
    sales_order_id int not null,
    customer_id int not null,
    discount_value float not null
    PRIMARY KEY(sales_order_id, customer_id)
);

insert into sales values 
(1, 2, 150, NOW(), 200, 200),
(1, 3, 150, NOW(), 310, 310),
(1, 4, 150, NOW(), 80, 80)


insert into discounts values (1, 150, 0.3)

select * from sales

image

select * from discounts

image

WITH sales_updated_CTE(sales_order_id,sales_order_item, customer_id, date,transaction_value, discount)
AS
(
    SELECT 
        s.sales_order_id,
        s.sales_order_item,
        s.customer_id,
        s.date,
        s.transaction_value,
        (s.transaction_value - (s.transaction_value * d.discount_value)) AS discount
    FROM sales s
    JOIN discounts d
    ON s.sales_order_id = d.sales_order_id
    WHERE s.sales_order_id = %s AND s.customer_id = %s
)
UPDATE sales as s
SET discounted_value = c.discount
FROM sales_updated_CTE as c 
WHERE c.sales_order_id = s.sales_order_id 
AND c.customer_id = s.customer_id
AND c.sales_order_item = s.sales_order_item
select * from sales

image

Edge case

The second approach (edge case), considered add more than one unit of the same product to the order (quantity) and create different discounts to each item, and update the discount is less complex

second


CREATE TABLE IF NOT EXISTS Sales2(
    sales_order_id int NOT NULL,
    customer_id int null,
    date timestamp not null, 
    PRIMARY KEY(sales_order_id, customer_id)
);

CREATE TABLE IF NOT EXISTS Order_detail(
    sales_order_id int not null,
    sales_order_item int NOT NULL, 
    transaction_value float not null,
    quantity int not null, 
    discount_value float not null,
    discounted_value float not null,
    PRIMARY KEY(sales_order_id, sales_order_item)
);


insert into Sales2 values (1, 150, NOW())

insert into Order_detail values 
(1, 2, 150, 2, 0.3, 150),
(1, 3, 210, 1, 0.4, 210), 
(1, 4, 80, 3, 0.2, 80)


select * from sales2

image

select * from Order_detail

image

UPDATE Order_detail
SET discounted_value = (transaction_value - (transaction_value * discount_value)) * quantity
WHERE sales_order_id = %s

select * from Order_detail

image

Steps to execute the project

ramse@DESKTOP-K6K6E5A MINGW64 /c
$ git clone https://github.com/Wittline/dataengineering-assignment.git
ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/dataengineering-assignment/docker (main)
$ docker-compose up

let’s wait until all the images and containers are created

image

ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/dataengineering-assignment/docker (main)
$ docker ps

It will show you all the containers and images contained in the docker-compose-yml file, this mean that all the images were created correctly

image

let’s check the docker-compose.yml file

version: '3'

services:
    jupyter-notebook:
        image: jupyter/minimal-notebook
        volumes:
            - ./folder:/home/jovyan/work
        ports:
            - 8888:8888
        container_name: jupyter-notebook-container
        environment:
            - JUPYTER_TOKEN=jupyter
        command: jupyter notebook --NotebookApp.iopub_data_rate_limit=3e10
        depends_on:
            - "db"
    db:
        container_name: pg_container
        image: postgres
        restart: always
        environment:
            POSTGRES_USER: "db"
            POSTGRES_PASSWORD: "db"
            POSTGRES_DB: "db_db"
        ports:
            - "5432:5432"
        volumes:
            - pg_data:/var/lib/postgresql/data/

volumes:
    pg_data:

image

The jupyter notebook of this project is called dataengineering-assignment.ipynb over the folder work, there is a copy of this file in the folder called folder in this repository, please keep this file on its folder.

image

Now you can run the project step by step using this notebook, the files sql_queries.py and db_engine.py contains all the logic needed

image

Summary

Authors