Our team is transitioning away from Business Objects Data Services (BODS) and moving to a modern data stack. A few years back we successfully moved our data warehouse to Azure Data Lake Storage, Synapse, and Databricks.
Now we need to figure out the best approach for application-to-application integrations that weren’t part of the original migration. Currently, these still run on the old BODS system. They write working tables to our on-premises SQL Server and usually create CSV files that get transferred via SFTP to target systems.
We want to build new integrations using our current toolset but we’re debating the best architecture. The initial design proposed using the same medallion structure as our data warehouse with bronze, silver, and gold layers. However, I think we should discuss this more thoroughly before setting a standard.
Our main questions are around storage design and where to put the final data so our reporting tools can access it. Options include writing back to on-prem SQL, using Azure SQL Database, Synapse, or Databricks SQL warehouse.
We’re planning to use Databricks Python notebooks for processing logic and Unity Catalog for data storage. Most integrations are batch-based running daily or a few times per day, so we don’t need real-time processing.
Has anyone dealt with similar architecture decisions? Any resources or best practices you’d recommend for designing application integration patterns in a modern data lake environment?
Been down this road migrating from Informatica PowerCenter to Azure. Here’s what burned us early - don’t force every integration into the same pattern. Your daily CSV drops? Just use simple ADF pipelines straight to Azure SQL instead of running everything through Databricks. The medallion approach works great for complex stuff but it’s expensive overkill for basic data movement. I’d go hybrid on storage - keep frequently accessed data in Azure SQL for fast reporting, archive the rest to your data lake. Watch out for SFTP dependencies too. We kept way more on-prem stuff than planned because downstream systems couldn’t change how they pulled data. Document your data lineage needs now - Unity Catalog integration with reporting tools gets messy depending on what you’re running.
The Problem: You’re migrating your application-to-application integrations from an on-premises BODS system to Azure, and routing every integration through Databricks, including simple daily batch jobs, risks running up costs the workloads don’t justify. You’re looking for an architecture that keeps costs in check while staying efficient and leveraging your existing Azure infrastructure.
Understanding the “Why” (The Root Cause): Databricks is a powerful platform for complex data processing and transformation, but it can be cost-prohibitive for simple batch jobs, especially when serverless alternatives exist within the Azure ecosystem. Spinning up and managing Databricks clusters for routine, low-complexity tasks often costs more than the work justifies. A more cost-effective approach is to use Azure services better suited to these types of operations.
Step-by-Step Guide:
Migrate Simple Batch Jobs to Azure Synapse Serverless SQL Pools: For your daily CSV drops and other straightforward batch processing tasks, switch from Databricks to Azure Synapse Serverless SQL Pools. Synapse Serverless SQL Pools offer a pay-as-you-go model, meaning you only pay for the compute resources consumed during query execution. This significantly reduces costs compared to the ongoing expenses of maintaining Databricks clusters, especially for jobs that run infrequently or have relatively low compute requirements. You can easily ingest your CSV files into Synapse using Azure Data Factory (ADF) pipelines and then process and load the data into your desired destination within Synapse.
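For illustration, querying a landed CSV through the serverless endpoint from Python could look roughly like this. It is a minimal sketch: the server name, storage path, and credentials are placeholders, and it assumes pyodbc, an installed SQL Server ODBC driver, and storage access already configured for the login (for example via a database scoped credential).

```python
# Sketch: query a landed CSV through Synapse serverless SQL with pyodbc.
# Server, database, path, and credentials are placeholders for illustration only.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=yourworkspace-ondemand.sql.azuresynapse.net;"  # serverless endpoint
    "DATABASE=integration_db;"
    "UID=svc_integration;PWD=<secret>;Encrypt=yes;"
)

# OPENROWSET reads the CSV directly from ADLS; you pay only for the data processed.
query = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://yourlake.dfs.core.windows.net/landing/app_x/2024/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS rows;
"""

for row in conn.cursor().execute(query):
    print(row)
```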
Optimize Data Ingestion with Azure Data Factory (ADF): Use ADF to create pipelines for ingesting your CSV files from SFTP. ADF offers a robust and cost-effective solution for data movement, specifically designed for automated, scheduled data transfers. This separates the data ingestion process from the data transformation and loading tasks, further improving efficiency and reducing costs.
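The pipeline itself is authored in the ADF portal or as JSON; if you later need to trigger or monitor it programmatically (for example from an orchestration notebook), a hedged sketch using the azure-mgmt-datafactory SDK might look like this. The resource group, factory, pipeline, and parameter names are placeholders.

```python
# Sketch: trigger an existing ADF pipeline run and poll its status.
# Resource group, factory, pipeline, and parameter names are hypothetical placeholders.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf.pipelines.create_run(
    resource_group_name="rg-integration",
    factory_name="adf-integration",
    pipeline_name="pl_sftp_to_adls_daily",
    parameters={"run_date": "2024-01-31"},  # assumes the pipeline declares this parameter
)

# Poll until the run finishes; statuses include Queued, InProgress, Succeeded, Failed.
status = "InProgress"
while status in ("Queued", "InProgress"):
    time.sleep(30)
    status = adf.pipeline_runs.get("rg-integration", "adf-integration", run.run_id).status

print(f"Pipeline run {run.run_id} finished with status {status}")
```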
Retain Databricks for Complex Tasks: Continue using Databricks for complex data transformation or business logic where its capabilities are necessary. The hybrid approach will allow you to optimize cost and performance based on task complexity.
Implement Monitoring and Alerting: Set up robust monitoring and alerting for both Synapse and Databricks to ensure that data pipelines are functioning correctly and efficiently. This enables timely detection and resolution of issues, which is key to maintaining the integrity of the data integration process.
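As one lightweight starting point (a sketch, not a complete monitoring setup), you could poll the Databricks Jobs API for failed runs and feed the result into whatever alerting channel you already use. The workspace URL, token, and job ID below are placeholders.

```python
# Sketch: flag failed runs of a Databricks job via the Jobs 2.1 REST API.
# Workspace URL, token, and job_id are placeholders; alerting is left as a print.
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace
TOKEN = "<databricks-pat>"
JOB_ID = 123  # placeholder job id

resp = requests.get(
    f"{HOST}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"job_id": JOB_ID, "completed_only": "true", "limit": 25},
    timeout=30,
)
resp.raise_for_status()

failed = [
    r for r in resp.json().get("runs", [])
    if r.get("state", {}).get("result_state") != "SUCCESS"
]
for run in failed:
    print(f"ALERT: run {run['run_id']} ended with {run['state'].get('result_state')}")
```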
Common Pitfalls & What to Check Next:
SFTP Migration Challenges: Don’t underestimate the complexities of migrating from your on-premises SFTP infrastructure. Many downstream systems may have dependencies on specific file paths, naming conventions, and data formats. Begin by thoroughly documenting these dependencies and plan for potential modifications or rework in downstream systems.
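A small inventory script run against the current drop location can jump-start that documentation. The sketch below uses paramiko, with placeholder host, credentials, and path.

```python
# Sketch: list files on the existing SFTP drop to document naming patterns and cadence.
# Host, credentials, and path are placeholders.
import datetime
import paramiko

transport = paramiko.Transport(("sftp.example.internal", 22))
transport.connect(username="svc_bods", password="<secret>")
sftp = paramiko.SFTPClient.from_transport(transport)

for entry in sorted(sftp.listdir_attr("/outbound/app_x"), key=lambda e: e.st_mtime, reverse=True):
    modified = datetime.datetime.fromtimestamp(entry.st_mtime)
    print(f"{entry.filename}\t{entry.st_size} bytes\t{modified:%Y-%m-%d %H:%M}")

sftp.close()
transport.close()
```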
Data Lineage and Governance: With a hybrid approach involving Synapse and Databricks, maintaining data lineage and enforcing governance becomes even more critical. Leverage Unity Catalog to centrally manage access control, metadata, and lineage for everything processed in Databricks, and keep in mind that Synapse serverless reads the same ADLS files under its own permissions (storage ACLs and Synapse roles), so access rules and lineage for that path need to be documented separately. This helps ensure data quality and compliance.
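Central access control in Unity Catalog comes down to ordinary GRANT statements run from a notebook or the SQL editor; the catalog, schema, and group names below are hypothetical.

```python
# Sketch: Unity Catalog grants from a Databricks notebook.
# Catalog, schema, and group names are hypothetical.
spark.sql("GRANT USE CATALOG ON CATALOG integrations TO `reporting_readers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA integrations.gold TO `reporting_readers`")
spark.sql("GRANT SELECT ON SCHEMA integrations.gold TO `reporting_readers`")
spark.sql("GRANT ALL PRIVILEGES ON SCHEMA integrations.bronze TO `app_integration_engineers`")
```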
Cost Optimization: While moving to Synapse Serverless can significantly reduce costs, it bills by the amount of data each query processes, so monitor your consumption patterns closely. Optimize queries to scan less data (select only the columns you need, partition files sensibly, and consider converting frequently queried CSVs to Parquet), and consider setting a cap on daily data processed.
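Serverless exposes data-processed counters and a spending cap through T-SQL. The sketch below reuses the pyodbc approach from above; connection details are placeholders, and the DMV and procedure names are worth verifying against the current Synapse docs before relying on them.

```python
# Sketch: check Synapse serverless data-processed usage and set a daily cap.
# Connection details are placeholders; reuses the pyodbc pattern shown earlier.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=yourworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=svc_integration;PWD=<secret>;Encrypt=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Data processed so far (daily/weekly/monthly rows, reported in MB).
for row in cur.execute("SELECT type, data_processed_mb FROM sys.dm_external_data_processed"):
    print(row.type, row.data_processed_mb, "MB")

# Optional guardrail: cap daily data processed at 1 TB.
cur.execute("EXEC sp_set_data_processed_limit @type = N'daily', @limit_tb = 1")
```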
Still running into issues? Share your (sanitized) pipeline definitions, table schemas, and any other relevant details. The community is here to help!
Did this exact migration three years ago when we ditched BODS. Here’s what actually worked.
Skip full medallion for app integrations unless you need complex transformations. We went bronze for raw ingestion, gold for final output. Silver was just burning compute on simple daily loads.
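For reference, the stripped-down two-layer flow was basically this shape in a notebook (paths, catalog, and table names here are made up, not ours):

```python
# Sketch of a two-layer (bronze -> gold) daily load in a Databricks notebook.
# Storage paths, catalog, schema, and table names are hypothetical.
from pyspark.sql import functions as F

# Bronze: land the raw CSV as-is, plus a load timestamp for traceability.
raw = (
    spark.read.option("header", True)
    .csv("abfss://landing@yourlake.dfs.core.windows.net/app_x/orders/")
    .withColumn("_loaded_at", F.current_timestamp())
)
raw.write.mode("append").saveAsTable("integrations.bronze.app_x_orders")

# Gold: apply the light reshaping the target system needs and overwrite the output table.
gold = (
    spark.table("integrations.bronze.app_x_orders")
    .filter(F.col("order_status") == "COMPLETE")
    .select("order_id", "customer_id", "order_total", "_loaded_at")
)
gold.write.mode("overwrite").saveAsTable("integrations.gold.app_x_orders_daily")
```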
Storage-wise, we split it up. Transactional data feeding reports stayed in Azure SQL Database since our BI tools already plugged in there. Everything else went to Databricks SQL warehouse - we were doing most processing in Databricks anyway.
Biggest gotcha was dependency mapping. Those CSV drops you mentioned? Half our downstream systems had hardcoded file paths and naming conventions. Took months to untangle. Start documenting those SFTP patterns now.
What saved us time - built a simple metadata-driven framework in Databricks for common patterns like CSV generation and table updates. Most integrations now just need config instead of custom notebooks.
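Rough shape of that framework, heavily simplified and with made-up names: a config dict per integration plus one generic runner.

```python
# Illustrative shape of a config-driven integration runner in Databricks.
# Table names, paths, and the config itself are made up for the example.
INTEGRATIONS = {
    "app_x_orders": {
        "source_table": "integrations.gold.app_x_orders_daily",
        "output_format": "csv",
        "output_path": "abfss://outbound@yourlake.dfs.core.windows.net/app_x/orders/",
    },
}

def run_integration(name: str) -> None:
    cfg = INTEGRATIONS[name]
    df = spark.table(cfg["source_table"])
    if cfg["output_format"] == "csv":
        # coalesce(1) keeps a single CSV per run, matching the old BODS-style file drops
        (df.coalesce(1)
           .write.mode("overwrite")
           .option("header", True)
           .csv(cfg["output_path"]))
    else:
        raise ValueError(f"Unsupported output format: {cfg['output_format']}")

run_integration("app_x_orders")
```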
Also consider Azure Data Factory for simple stuff. We kept Databricks for complex business logic but ADF handles basic copy operations way cheaper. Hybrid approach kept costs reasonable while we figured out the right patterns.
Went through this same thing 18 months ago moving from SSIS to Azure. Medallion architecture works for integrations but it’s overkill for simple daily batches. We kept bronze minimal and ditched silver completely for most app-to-app stuff. For storage, we stuck with Azure SQL Database since our downstream apps already had SQL connectors. Synapse was way too much for our volume and Databricks SQL warehouse pricing made no sense. Main goal was updating the processing without breaking existing apps. Lock down Unity Catalog permissions from day one - we got burned when teams started spinning up notebooks without governance and screwed up our data lineage. If you’ve got complex job dependencies, check out Delta Live Tables.
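For what it’s worth, a bare-bones Delta Live Tables pipeline with one declared dependency looks roughly like this (the source path and table names are placeholders, not our setup):

```python
# Minimal Delta Live Tables sketch: two tables with a declared dependency.
# Source path and table names are placeholders.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw daily order files, loaded as-is")
def orders_raw():
    return (
        spark.read.option("header", True)
        .csv("abfss://landing@yourlake.dfs.core.windows.net/app_x/orders/")
    )

@dlt.table(comment="Completed orders ready for the downstream app")
def orders_ready():
    # dlt.read() makes the dependency on orders_raw explicit in the pipeline graph
    return dlt.read("orders_raw").filter(F.col("order_status") == "COMPLETE")
```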
you’re overthinking the medallion thing for app integrations. we did something similar last year and just used databricks sql warehouse for the final layer since we were already in that ecosystem. performance was solid for batch jobs and unity catalog made governance way easier than juggling multiple endpoints.