Data Engineering Portfolio

Data Engineering, BI & Analytics Projects

About

I'm Victor, a Data/Analytics Engineer specializing in building scalable data pipelines and lakehouse solutions. I've implemented Medallion Architecture in Databricks, orchestrated ETL/ELT pipelines with dbt and Airflow, and delivered data products that empower business teams.

My tech stack includes Python, PySpark, SQL, dbt, Azure, and Databricks. This portfolio highlights some of my side projects and technical explorations.

Latest Project

Postgres to Databricks CDC Pipeline

A high-performance data ingestion project built with the Python dlt library. It moves data from PostgreSQL to Databricks using Change Data Capture (CDC) for efficient synchronization. Orchestrated natively by Databricks Lakeflow Jobs, the project serves as a robust blueprint for enterprise data replication.

Technical Assessment Projects

Projects developed as part of technical assessment processes, demonstrating problem-solving ability and technical skill.

1. Video Game Sales

Data Analyst position, September 2024

A project that analyzes video game sales data to evaluate gaming partnership opportunities.

Features

  • Data extraction automation
  • Data preparation and cleaning
  • Regional sales analysis
  • Genre market share calculation
  • Platform performance tracking

Tech Stack

Python · DuckDB · Prefect · Pandas · wget · Jupyter

Skills Applied

Data/file extraction · Data preparation · SQL analysis · Data workflow orchestration
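To illustrate the genre market-share calculation, here is a minimal pure-Python sketch. The column names (`genre`, `global_sales`) are assumptions for illustration, not the actual dataset schema:

```python
from collections import defaultdict

def genre_market_share(sales_rows):
    """Compute each genre's share of total global sales.

    `sales_rows` is a list of dicts with hypothetical keys
    'genre' and 'global_sales' (units sold, in millions).
    """
    totals = defaultdict(float)
    for row in sales_rows:
        totals[row["genre"]] += row["global_sales"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {genre: sales / grand_total for genre, sales in totals.items()}

# Toy data for demonstration
rows = [
    {"genre": "Action", "global_sales": 30.0},
    {"genre": "Sports", "global_sales": 20.0},
    {"genre": "Action", "global_sales": 10.0},
]
shares = genre_market_share(rows)  # Action ≈ 0.667, Sports ≈ 0.333
```

In the actual project this aggregation would be expressed as a SQL `GROUP BY` in DuckDB; the sketch just shows the underlying arithmetic.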

2. School Supplies Market

Analytics Engineer position, March 2024

A data preparation project focusing on standardizing and integrating e-commerce school supplies sales data for planning purposes.

Features

  • Automated header validation system
  • Data quality analysis and standardization
  • SQLite database implementation

Tech Stack

Python · SQLite · Pandas · Google Sheets

Skills Applied

Data preparation and cleaning · Database operations · ETL processes · Business analytics
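A header validation step like the one above can be sketched as follows. The expected schema here is invented for illustration; the real project would validate against its own column list:

```python
# Hypothetical expected schema -- not the project's actual columns
EXPECTED_HEADERS = ["order_id", "product", "quantity", "unit_price"]

def validate_headers(headers, expected=EXPECTED_HEADERS):
    """Compare incoming file headers against the expected schema.

    Normalizes case and surrounding whitespace first, then returns
    (missing, unexpected) so the caller can reject or remap columns.
    """
    normalized = [h.strip().lower() for h in headers]
    missing = [h for h in expected if h not in normalized]
    unexpected = [h for h in normalized if h not in expected]
    return missing, unexpected

# Example: 'qty' is an unexpected alias, 'quantity' is missing
missing, unexpected = validate_headers(["Order_ID", " product ", "qty", "unit_price"])
```

Running every incoming sheet through a check like this before loading keeps schema drift from silently corrupting the SQLite tables.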

Code Snippets & Practice Projects

Smaller-scale projects and code examples showcasing specific technical skills and tool implementations.

1. Cryptocurrencies Quotes

An EL (Extract, Load) pipeline that fetches market data (price, volume, market cap) for BTC, ETH, and LTC from CoinMarketCap API and stores it in DuckDB.

Features

  • Automated data extraction from CoinMarketCap API
  • Error handling and logging system
  • Data storage in DuckDB database

Tech Stack

Python · dlt · DuckDB · CoinMarketCap API

Skills Applied

API integration · Data pipeline development · SQL querying
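The extract step with error handling and logging can be sketched like this. The payload shape is an assumption modeled loosely on CoinMarketCap-style quote responses; the real response should be checked against the API documentation:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crypto_quotes")

def parse_quotes(payload, symbols=("BTC", "ETH", "LTC")):
    """Flatten an API quotes payload into rows ready for loading.

    Any symbol with missing fields is logged and skipped rather than
    aborting the whole batch. The payload structure here is assumed.
    """
    rows = []
    for symbol in symbols:
        try:
            quote = payload["data"][symbol]["quote"]["USD"]
            rows.append({
                "symbol": symbol,
                "price": quote["price"],
                "volume_24h": quote["volume_24h"],
                "market_cap": quote["market_cap"],
            })
        except KeyError as exc:
            logger.warning("Skipping %s: missing field %s", symbol, exc)
    return rows

# Sample payload: LTC is deliberately incomplete to exercise the error path
sample = {"data": {
    "BTC": {"quote": {"USD": {"price": 60000.0, "volume_24h": 1e9, "market_cap": 1e12}}},
    "ETH": {"quote": {"USD": {"price": 3000.0, "volume_24h": 5e8, "market_cap": 4e11}}},
    "LTC": {"quote": {"USD": {"price": 80.0}}},
}}
rows = parse_quotes(sample)
```

In the project itself, dlt handles schema inference and loading into DuckDB; this sketch only covers the parse-and-validate layer in between.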

2. Chinook Sales Simulator

Turns the static Chinook dataset into a living, chaotic OLTP simulator. It generates not only new sales (INSERTs) but also data corrections (UPDATEs) and cancellations (DELETEs), creating a realistic data source for testing advanced pipelines (CDC, SCD Type 2).

Features

  • Simulates the full data lifecycle: INSERT, UPDATE, and DELETE
  • Models UPDATEs/DELETEs as late-arriving changes within a 90-day window
  • Ensures ACID compliance (all-or-nothing) for each D-1 batch
  • Includes a verification script to audit simulation logs against the DB state

Tech Stack

Python · PostgreSQL · Neon · uv

Skills Applied

Data Generation · SQL Functions · System Automation · CLI Integration
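The late-arriving-change logic can be sketched in a few lines. The operation weights and event shape are illustrative assumptions, not the simulator's actual parameters:

```python
import random
from datetime import date, timedelta

OPS = ("INSERT", "UPDATE", "DELETE")

def simulate_daily_batch(as_of, n_events, rng=None):
    """Generate one D-1 batch of change events.

    INSERTs land on the batch date itself; UPDATEs and DELETEs target a
    random earlier sale inside a 90-day look-back window, modeling
    late-arriving corrections and cancellations. The 70/20/10 operation
    mix is an assumption for this sketch.
    """
    rng = rng or random.Random()
    events = []
    for _ in range(n_events):
        op = rng.choices(OPS, weights=[0.7, 0.2, 0.1])[0]
        if op == "INSERT":
            event_date = as_of
        else:
            event_date = as_of - timedelta(days=rng.randint(1, 90))
        events.append({"op": op, "invoice_date": event_date})
    return events

batch = simulate_daily_batch(date(2024, 6, 1), 50, random.Random(42))
```

In the real project each batch is applied inside a single PostgreSQL transaction, so the D-1 load either commits in full or rolls back (the all-or-nothing guarantee listed above).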

3. Postgres to Databricks CDC Pipeline

A high-performance data ingestion project built with the Python dlt library, moving data from PostgreSQL to Databricks using CDC for efficient synchronization.

Features

  • High-performance ingestion using Python dlt
  • Change Data Capture (CDC) synchronization
  • PostgreSQL to Databricks replication
  • Orchestration via Databricks Lakeflow Jobs

Tech Stack

Python · dlt · PostgreSQL · Databricks

Skills Applied

CDC / Data Replication · Data Ingestion · Lakeflow Orchestration
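The merge semantics a CDC sync produces on the destination can be shown with a minimal in-memory sketch. The change-event shape below is an assumption for illustration, not dlt's actual wire format:

```python
def apply_changes(target, changes):
    """Apply a CDC change feed to a target keyed by primary key.

    Each change is a dict like {'op': 'insert'|'update'|'delete',
    'key': <pk>, 'row': <row dict>}. Inserts and updates both upsert;
    deletes remove the key. This mirrors the effect of a merge-style
    load on the destination table.
    """
    for change in changes:
        if change["op"] == "delete":
            target.pop(change["key"], None)
        else:  # insert and update are both an upsert on the primary key
            target[change["key"]] = change["row"]
    return target

state = apply_changes(
    {1: {"name": "a"}},
    [
        {"op": "insert", "key": 2, "row": {"name": "b"}},
        {"op": "update", "key": 1, "row": {"name": "a2"}},
        {"op": "delete", "key": 2, "row": None},
    ],
)
```

In the pipeline itself, dlt performs this merge at the Databricks destination using the declared primary key; the sketch only shows why replaying the full change feed converges on the source table's state.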