## Why Databricks Connect?
The Databricks web notebook is great for interactive exploration, but sometimes you want to:
- Use your preferred local IDE (VS Code, PyCharm, etc.)
- Version control your code with git
- Run scripts in CI/CD pipelines
- Debug with local tools
Databricks Connect lets you do all of this while still using a remote Databricks cluster for compute.
## How It Works
Databricks Connect replaces the local Spark execution engine with a connection to a remote Databricks cluster. Your code runs locally, but DataFrame operations are executed on the cluster.
## Setup

### 1. Install the package

```bash
pip install databricks-connect
```
Make sure the version matches your cluster’s Databricks Runtime version.
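In practice that means pinning the client's major.minor version to the cluster's runtime. A sketch, assuming (purely as an example) the target cluster runs Databricks Runtime 14.3:

```shell
# Assumption: the target cluster runs Databricks Runtime 14.3
# (check Compute > your cluster in the workspace).
# Pin the matching major.minor release of the client:
pip install "databricks-connect==14.3.*"
```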
### 2. Set environment variables

```bash
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="your-personal-access-token"
export DATABRICKS_CLUSTER_ID="your-cluster-id"
```
You can find these in your Databricks workspace:
- Host: your workspace URL
- Token: User Settings > Developer > Access Tokens
- Cluster ID: Compute > your cluster (the last path segment of the cluster's URL, or the `cluster_id` field in its JSON view)
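Since a missing variable only surfaces later as a confusing connection error, it can help to fail fast on incomplete configuration. A minimal sketch (the helper name is mine, not part of any library):

```python
import os

# The three variables Databricks Connect needs in this setup.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")


def missing_config(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_config()
    if missing:
        raise SystemExit(f"Missing configuration: {', '.join(missing)}")
```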
### 3. Create a session

```python
import os

from databricks.connect import DatabricksSession


def get_spark_session(host=None, token=None, cluster_id=None):
    """Return a Spark session backed by a remote Databricks cluster.

    Arguments fall back to the DATABRICKS_* environment variables
    when not passed explicitly.
    """
    host = host or os.getenv("DATABRICKS_HOST")
    token = token or os.getenv("DATABRICKS_TOKEN")
    cluster_id = cluster_id or os.getenv("DATABRICKS_CLUSTER_ID")

    return (
        DatabricksSession.builder
        .remote(host=host, token=token, cluster_id=cluster_id)
        .getOrCreate()
    )
```
### 4. Use it like any Spark session

```python
spark = get_spark_session()

df = spark.sql("SELECT * FROM my_catalog.my_schema.my_table LIMIT 10")
df.show()
```
## When to Use This vs. Notebooks
| Use Case | Recommended |
|---|---|
| Quick data exploration | Notebook |
| Production scripts | Databricks Connect |
| Local debugging | Databricks Connect |
| Collaboration with non-Databricks users | Databricks Connect |
| Scheduled jobs | Databricks Jobs (or either) |
## Caveats

- The cluster must be running (or configured to auto-start on connect)
- Some Spark features (e.g., custom JVM libraries) may not work identically
- Network latency applies: large `collect()` calls will be slow
- Tokens expire and need periodic refresh
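One way to guard against accidentally pulling millions of rows over the network is to always cap results before collecting them locally. A minimal sketch; the helper name and the row cap are my own choices, not part of any API:

```python
# Assumption: 10,000 rows fits comfortably in local memory for typical schemas.
MAX_LOCAL_ROWS = 10_000


def fetch_sample(df, n=MAX_LOCAL_ROWS):
    """Cap the number of rows pulled from the cluster before collecting locally.

    Works with any object exposing Spark's DataFrame `limit`/`toPandas` API.
    """
    return df.limit(n).toPandas()
```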