<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PySpark on Degang Wang</title><link>https://degangwang.com/tags/pyspark/</link><description>Recent content in PySpark on Degang Wang</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 10 May 2026 00:00:00 -0500</lastBuildDate><atom:link href="https://degangwang.com/tags/pyspark/index.xml" rel="self" type="application/rss+xml"/><item><title>From Databricks to Your Analytics Tool: A RWD Workflow</title><link>https://degangwang.com/2026/05/10/databricks-rwd-workflow/</link><pubDate>Sun, 10 May 2026 00:00:00 -0500</pubDate><guid>https://degangwang.com/2026/05/10/databricks-rwd-workflow/</guid><description>&lt;h2 id="why-not-just-load-everything-locally"&gt;Why Not Just Load Everything Locally?&lt;/h2&gt;
&lt;p&gt;Real-world data (RWD) — claims, EHR, registries — is massive. A single claims database can have billions of rows across diagnosis, procedure, and pharmacy tables. You cannot &lt;code&gt;pd.read_csv()&lt;/code&gt; your way through it.&lt;/p&gt;
&lt;p&gt;The workflow that works (see the sketch after this list):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Manipulate data in Databricks&lt;/strong&gt; — filter, join, and aggregate using SQL on distributed compute&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create an analytic-ready table&lt;/strong&gt; — a focused cohort with only the columns and rows you need&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bring the result to your analytics tool&lt;/strong&gt; — Python, R, or SAS for statistical analysis and visualization&lt;/li&gt;
&lt;/ol&gt;
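&lt;p&gt;To make the three steps concrete, here is a minimal PySpark sketch. The &lt;code&gt;claims.diagnosis&lt;/code&gt; table, its columns, and the output table name are made-up placeholders, not a real schema:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Step 1: reduce the data on the cluster with SQL
# (spark is predefined in Databricks notebooks).
cohort = spark.sql("""
    SELECT patient_id,
           MIN(service_date) AS first_dx_date,
           COUNT(*)          AS dx_count
    FROM claims.diagnosis
    WHERE icd10_code LIKE 'E11%'  -- type 2 diabetes codes
    GROUP BY patient_id
""")

# Step 2: persist a focused, analytic-ready table.
cohort.write.mode("overwrite").saveAsTable("analytics.t2dm_cohort")

# Step 3: pull only the reduced result into pandas for local analysis.
cohort_pd = spark.table("analytics.t2dm_cohort").toPandas()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;toPandas()&lt;/code&gt; call is the hand-off point: by the time it runs, the heavy filtering and aggregation have already happened on the cluster.&lt;/p&gt;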
&lt;p&gt;This post walks through that workflow using a diabetes cohort as an example. We use Databricks here, but the same pattern applies to any distributed SQL engine, such as Snowflake, Impala, AWS Athena, or Google BigQuery: reduce the data remotely, then bring the result to your analytics tool.&lt;/p&gt;</description></item><item><title>Databricks Connect: Run Spark Code from Your Local IDE</title><link>https://degangwang.com/2026/05/08/databricks-connect/</link><pubDate>Fri, 08 May 2026 00:00:00 -0500</pubDate><guid>https://degangwang.com/2026/05/08/databricks-connect/</guid><description>&lt;h2 id="why-databricks-connect"&gt;Why Databricks Connect?&lt;/h2&gt;
&lt;p&gt;The Databricks web notebook is great for interactive exploration, but sometimes you want to do the following (sketched in code after the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use your preferred local IDE (VS Code, PyCharm, etc.)&lt;/li&gt;
&lt;li&gt;Version-control your code with Git&lt;/li&gt;
&lt;li&gt;Run scripts in CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Debug with local tools&lt;/li&gt;
&lt;/ul&gt;
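&lt;p&gt;Here is a minimal connection sketch, assuming Databricks Connect for Databricks Runtime 13 or later (the &lt;code&gt;databricks-connect&lt;/code&gt; package); the host, token, and cluster ID values are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Create a Spark session backed by a remote Databricks cluster.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host="https://example.cloud.databricks.com",  # your workspace URL
    token="dapiREPLACE-ME",                       # a personal access token
    cluster_id="REPLACE-WITH-CLUSTER-ID",
).getOrCreate()

# Executes on the remote cluster; results print locally.
spark.sql("SELECT current_catalog(), current_database()").show()
&lt;/code&gt;&lt;/pre&gt;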
&lt;p&gt;Databricks Connect lets you do all of this while still using a remote Databricks cluster for compute.&lt;/p&gt;</description></item></channel></rss>