<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PySpark on Degang Wang</title><link>https://degangwang.com/tags/pyspark/</link><description>Recent content in PySpark on Degang Wang</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 10 May 2026 00:00:00 -0500</lastBuildDate><atom:link href="https://degangwang.com/tags/pyspark/index.xml" rel="self" type="application/rss+xml"/><item><title>From Databricks to Your Analytics Tool: A RWD Workflow</title><link>https://degangwang.com/2026/05/10/databricks-rwd-workflow/</link><pubDate>Sun, 10 May 2026 00:00:00 -0500</pubDate><guid>https://degangwang.com/2026/05/10/databricks-rwd-workflow/</guid><description>&lt;h2 id="why-not-just-load-everything-locally"&gt;Why Not Just Load Everything Locally?&lt;/h2&gt;
&lt;p&gt;Real-world data (RWD) — claims, EHR, registries — is massive. A single claims database can have billions of rows across diagnosis, procedure, and pharmacy tables. You cannot &lt;code&gt;pd.read_csv()&lt;/code&gt; your way through it.&lt;/p&gt;
&lt;p&gt;The workflow that works (see the sketch after this list):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Manipulate data in Databricks&lt;/strong&gt; — filter, join, and aggregate using SQL on distributed compute&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create an analytic-ready table&lt;/strong&gt; — a focused cohort with only the columns and rows you need&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bring the result to your analytics tool&lt;/strong&gt; — Python, R, or SAS for statistical analysis and visualization&lt;/li&gt;
&lt;/ol&gt;
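&lt;p&gt;To make the three steps concrete, here is a minimal PySpark sketch. The &lt;code&gt;claims.diagnosis&lt;/code&gt; table, its columns, and the output table name are made-up placeholders, not a real schema:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Step 1: reduce the data on the cluster with SQL
# (spark is predefined in Databricks notebooks).
cohort = spark.sql("""
    SELECT patient_id,
           MIN(service_date) AS first_dx_date,
           COUNT(*)          AS dx_count
    FROM claims.diagnosis
    WHERE icd10_code LIKE 'E11%'  -- type 2 diabetes codes
    GROUP BY patient_id
""")

# Step 2: persist a focused, analytic-ready table.
cohort.write.mode("overwrite").saveAsTable("analytics.t2dm_cohort")

# Step 3: pull only the reduced result into pandas for local analysis.
cohort_pd = spark.table("analytics.t2dm_cohort").toPandas()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;toPandas()&lt;/code&gt; call is the hand-off point: by the time it runs, the heavy filtering and aggregation have already happened on the cluster.&lt;/p&gt;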
&lt;p&gt;This post walks through that workflow using a diabetes cohort as an example. We use Databricks here, but the same pattern applies to any distributed SQL engine, such as Snowflake, Impala, AWS Athena, or Google BigQuery: reduce the data remotely, then bring the result to your analytics tool.&lt;/p&gt;</description></item><item><title>Databricks Connect: Run Spark Code from Your Local IDE</title><link>https://degangwang.com/2026/05/08/databricks-connect/</link><pubDate>Fri, 08 May 2026 00:00:00 -0500</pubDate><guid>https://degangwang.com/2026/05/08/databricks-connect/</guid><description>&lt;h2 id="why-databricks-connect"&gt;Why Databricks Connect?&lt;/h2&gt;
&lt;p&gt;The Databricks web notebook is great for interactive exploration, but sometimes you want to do the following (sketched in code after the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use your preferred local IDE (VS Code, PyCharm, etc.)&lt;/li&gt;
&lt;li&gt;Version-control your code with Git&lt;/li&gt;
&lt;li&gt;Run scripts in CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Debug with local tools&lt;/li&gt;
&lt;/ul&gt;
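&lt;p&gt;Here is a minimal connection sketch, assuming Databricks Connect for Databricks Runtime 13 or later (the &lt;code&gt;databricks-connect&lt;/code&gt; package); the host, token, and cluster ID values are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Create a Spark session backed by a remote Databricks cluster.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host="https://example.cloud.databricks.com",  # your workspace URL
    token="dapiREPLACE-ME",                       # a personal access token
    cluster_id="REPLACE-WITH-CLUSTER-ID",
).getOrCreate()

# Executes on the remote cluster; results print locally.
spark.sql("SELECT current_catalog(), current_database()").show()
&lt;/code&gt;&lt;/pre&gt;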
&lt;p&gt;Databricks Connect lets you do all of this while still using a remote Databricks cluster for compute.&lt;/p&gt;</description></item></channel></rss>