Data Engineering Portfolio

Data Engineering, BI & Analytics Projects

About

I'm Victor, a Data/Analytics Engineer specializing in building scalable data pipelines and lakehouse solutions. I've implemented Medallion Architecture in Databricks, orchestrated ETL/ELT pipelines with dbt and Airflow, and delivered data products that empower business teams.

My tech stack includes Python, PySpark, SQL, dbt, Azure, and Databricks. This portfolio highlights some of my side projects and technical explorations.

Latest Project

Postgres to Databricks CDC Pipeline

A high-performance data ingestion project built with the Python dlt library. It moves data from PostgreSQL to Databricks using Change Data Capture (CDC) for efficient synchronization. Orchestrated natively by Databricks Lakeflow Jobs, the project serves as a robust blueprint for enterprise data replication.

Technical Assessment Projects

Projects developed as part of technical assessment processes, demonstrating problem-solving ability and technical skill.

1. Video Game Sales

Data Analyst position, September 2024

A project that analyzes video game sales data to evaluate gaming partnership opportunities.

Features

  • Data extraction automation
  • Data preparation and cleaning
  • Regional sales analysis
  • Genre market share calculation
  • Platform performance tracking

Tech Stack

Python · DuckDB · Prefect · Pandas · wget · Jupyter

Skills Applied

Data/file extraction · Data preparation · SQL analysis · Data workflow orchestration
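To illustrate the genre market-share calculation, here is a minimal pure-Python sketch. The column names (`genre`, `global_sales`) are assumptions for illustration, not the actual dataset schema:

```python
from collections import defaultdict

def genre_market_share(sales_rows):
    """Compute each genre's share of total global sales.

    `sales_rows` is a list of dicts with hypothetical keys
    'genre' and 'global_sales' (units sold, in millions).
    """
    totals = defaultdict(float)
    for row in sales_rows:
        totals[row["genre"]] += row["global_sales"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {genre: sales / grand_total for genre, sales in totals.items()}

# Toy data for demonstration
rows = [
    {"genre": "Action", "global_sales": 30.0},
    {"genre": "Sports", "global_sales": 20.0},
    {"genre": "Action", "global_sales": 10.0},
]
shares = genre_market_share(rows)  # Action ≈ 0.667, Sports ≈ 0.333
```

In the actual project this aggregation would be expressed as a SQL `GROUP BY` in DuckDB; the sketch just shows the underlying arithmetic.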

2. School Supplies Market

Analytics Engineer position, March 2024

A data preparation project focusing on standardizing and integrating e-commerce school supplies sales data for planning purposes.

Features

  • Automated header validation system
  • Data quality analysis and standardization
  • SQLite database implementation

Tech Stack

Python · SQLite · Pandas · Google Sheets

Skills Applied

Data preparation and cleaning · Database operations · ETL processes · Business analytics
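A header validation step like the one above can be sketched as follows. The expected schema here is invented for illustration; the real project would validate against its own column list:

```python
# Hypothetical expected schema -- not the project's actual columns
EXPECTED_HEADERS = ["order_id", "product", "quantity", "unit_price"]

def validate_headers(headers, expected=EXPECTED_HEADERS):
    """Compare incoming file headers against the expected schema.

    Normalizes case and surrounding whitespace first, then returns
    (missing, unexpected) so the caller can reject or remap columns.
    """
    normalized = [h.strip().lower() for h in headers]
    missing = [h for h in expected if h not in normalized]
    unexpected = [h for h in normalized if h not in expected]
    return missing, unexpected

# Example: 'qty' is an unexpected alias, 'quantity' is missing
missing, unexpected = validate_headers(["Order_ID", " product ", "qty", "unit_price"])
```

Running every incoming sheet through a check like this before loading keeps schema drift from silently corrupting the SQLite tables.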

Code Snippets & Practice Projects

Smaller-scale projects and code examples showcasing specific technical skills and tool implementations.

1. Cryptocurrencies Quotes

An EL (Extract, Load) pipeline that fetches market data (price, volume, market cap) for BTC, ETH, and LTC from CoinMarketCap API and stores it in DuckDB.

Features

  • Automated data extraction from CoinMarketCap API
  • Error handling and logging system
  • Data storage in DuckDB database

Tech Stack

Python · dlt · DuckDB · CoinMarketCap API

Skills Applied

API integration · Data pipeline development · SQL querying
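The extract step with error handling and logging can be sketched like this. The payload shape is an assumption modeled loosely on CoinMarketCap-style quote responses; the real response should be checked against the API documentation:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crypto_quotes")

def parse_quotes(payload, symbols=("BTC", "ETH", "LTC")):
    """Flatten an API quotes payload into rows ready for loading.

    Any symbol with missing fields is logged and skipped rather than
    aborting the whole batch. The payload structure here is assumed.
    """
    rows = []
    for symbol in symbols:
        try:
            quote = payload["data"][symbol]["quote"]["USD"]
            rows.append({
                "symbol": symbol,
                "price": quote["price"],
                "volume_24h": quote["volume_24h"],
                "market_cap": quote["market_cap"],
            })
        except KeyError as exc:
            logger.warning("Skipping %s: missing field %s", symbol, exc)
    return rows

# Sample payload: LTC is deliberately incomplete to exercise the error path
sample = {"data": {
    "BTC": {"quote": {"USD": {"price": 60000.0, "volume_24h": 1e9, "market_cap": 1e12}}},
    "ETH": {"quote": {"USD": {"price": 3000.0, "volume_24h": 5e8, "market_cap": 4e11}}},
    "LTC": {"quote": {"USD": {"price": 80.0}}},
}}
rows = parse_quotes(sample)
```

In the project itself, dlt handles schema inference and loading into DuckDB; this sketch only covers the parse-and-validate layer in between.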

2. Chinook Sales Simulator

Turns the static Chinook dataset into a living, chaotic OLTP simulator. It generates not only new sales (INSERTs) but also data corrections (UPDATEs) and cancellations (DELETEs), creating a realistic data source for testing advanced pipelines (CDC, SCD Type 2).

Features

  • Simulates the full data lifecycle: INSERT, UPDATE, and DELETE
  • Models UPDATEs/DELETEs as late-arriving changes within a 90-day window
  • Ensures ACID compliance (all-or-nothing) for each D-1 batch
  • Includes a verification script to audit simulation logs against the DB state

Tech Stack

Python · PostgreSQL · Neon · uv

Skills Applied

Data Generation · SQL Functions · System Automation · CLI Integration
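The late-arriving-change logic can be sketched in a few lines. The operation weights and event shape are illustrative assumptions, not the simulator's actual parameters:

```python
import random
from datetime import date, timedelta

OPS = ("INSERT", "UPDATE", "DELETE")

def simulate_daily_batch(as_of, n_events, rng=None):
    """Generate one D-1 batch of change events.

    INSERTs land on the batch date itself; UPDATEs and DELETEs target a
    random earlier sale inside a 90-day look-back window, modeling
    late-arriving corrections and cancellations. The 70/20/10 operation
    mix is an assumption for this sketch.
    """
    rng = rng or random.Random()
    events = []
    for _ in range(n_events):
        op = rng.choices(OPS, weights=[0.7, 0.2, 0.1])[0]
        if op == "INSERT":
            event_date = as_of
        else:
            event_date = as_of - timedelta(days=rng.randint(1, 90))
        events.append({"op": op, "invoice_date": event_date})
    return events

batch = simulate_daily_batch(date(2024, 6, 1), 50, random.Random(42))
```

In the real project each batch is applied inside a single PostgreSQL transaction, so the D-1 load either commits in full or rolls back (the all-or-nothing guarantee listed above).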

3. Postgres to Databricks CDC Pipeline

A high-performance data ingestion project built with the Python dlt library, moving data from PostgreSQL to Databricks using CDC for efficient synchronization.

Features

  • High-performance ingestion using Python dlt
  • Change Data Capture (CDC) synchronization
  • PostgreSQL to Databricks replication
  • Orchestration via Databricks Lakeflow Jobs

Tech Stack

Python · dlt · PostgreSQL · Databricks

Skills Applied

CDC / Data Replication · Data Ingestion · Lakeflow Orchestration
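The merge semantics a CDC sync produces on the destination can be shown with a minimal in-memory sketch. The change-event shape below is an assumption for illustration, not dlt's actual wire format:

```python
def apply_changes(target, changes):
    """Apply a CDC change feed to a target keyed by primary key.

    Each change is a dict like {'op': 'insert'|'update'|'delete',
    'key': <pk>, 'row': <row dict>}. Inserts and updates both upsert;
    deletes remove the key. This mirrors the effect of a merge-style
    load on the destination table.
    """
    for change in changes:
        if change["op"] == "delete":
            target.pop(change["key"], None)
        else:  # insert and update are both an upsert on the primary key
            target[change["key"]] = change["row"]
    return target

state = apply_changes(
    {1: {"name": "a"}},
    [
        {"op": "insert", "key": 2, "row": {"name": "b"}},
        {"op": "update", "key": 1, "row": {"name": "a2"}},
        {"op": "delete", "key": 2, "row": None},
    ],
)
```

In the pipeline itself, dlt performs this merge at the Databricks destination using the declared primary key; the sketch only shows why replaying the full change feed converges on the source table's state.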