Simplifying JSONB queries when syncing Salesforce data to PostgreSQL with CloudQuery

Need help with complex JSONB structures after data sync

I’m working with CloudQuery to move data from Salesforce to PostgreSQL and running into issues with query complexity.

My setup

I have two configuration files for the sync process:

salesforce.yml

kind: source
spec:
  name: "salesforce"
  path: cloudquery/salesforce
  registry: cloudquery
  version: "v3.2.1"
  destinations: ["postgresql"]
  tables: ["*"]
  spec:
    api_key: "my_secret_key"
    batch_settings:
      batch_size: 500
      retry_attempts: 3
    entity_options:
      salesforce_contacts:
        related_objects: ['accounts', 'opportunities', 'activities', 'campaigns']
      salesforce_accounts:
        related_objects: ['contacts', 'opportunities', 'activities', 'campaigns']

postgresql.yml

kind: destination
spec:
  name: "postgresql"
  path: "cloudquery/postgresql"
  registry: "cloudquery"
  version: "v7.3.2"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: "postgresql://admin:password@localhost:5432/crm_data?sslmode=disable"

The problem

The sync works fine but I end up with really complex JSONB columns. In the salesforce_contacts table, there’s a contact_info column like this:

{
  "name": "John Smith",
  "email_address": "[email protected]", 
  "phone_number": "+1234567890",
  "created_at": "2024-07-15T14:30:22.158Z",
  "record_id": "40298374529",
  "updated_at": "2024-07-15T16:45:33.901Z"
}

And a related_entities column:

{
  "accounts": {
    "items": [
      {
        "record_id": "98765432101",
        "relationship": "contact_to_account"
      },
      {
        "record_id": "98765432101", 
        "relationship": "contact_to_account_primary"
      }
    ]
  }
}

I don’t understand why the same account appears twice with different relationship types.

Current query approach

Right now I’m using this complex query to get contacts with their accounts:

WITH account_relationships AS (
    SELECT 
        ct.record_id AS contact_record_id,
        ct.contact_info ->> 'name' AS contact_name,
        ct.contact_info ->> 'email_address' AS contact_email,
        jsonb_array_elements(ct.related_entities -> 'accounts' -> 'items') AS account_relation
    FROM 
        salesforce_contacts ct
)
SELECT 
    DISTINCT ON (contact_record_id, account_record_id)
    ar.contact_record_id,
    ar.contact_name,
    ar.contact_email,
    acc.record_id AS account_record_id,
    acc.contact_info ->> 'company_name' AS account_name
FROM 
    account_relationships ar
LEFT JOIN 
    salesforce_accounts acc
    ON (ar.account_relation ->> 'record_id') = acc.record_id
ORDER BY 
    contact_record_id, account_record_id;

This gets messy fast when I need other relationships too. Writing queries with multiple JSONB operations is really hard for me.

What I need

Is there a way to transform this data before it goes into PostgreSQL? Or maybe a better approach to structure the data so queries aren’t so complicated?

I’m open to any suggestions on how to make this more manageable.

The Problem:

You’re experiencing complex JSONB columns in your PostgreSQL database after syncing data from Salesforce using CloudQuery. Your current queries are becoming overly complex due to the nested structure of the JSONB data, particularly in the salesforce_contacts table’s related_entities column, where the same account might appear multiple times with different relationship types. This makes querying for related entities cumbersome and inefficient.

:thinking: Understanding the “Why” (The Root Cause):

The complexity arises from how Salesforce structures relationships. A single contact can have multiple relationships (e.g., primary contact, billing contact) with the same account. CloudQuery, by default, preserves this structure in the JSONB format. Your current approach necessitates using JSONB operators in SQL, leading to complicated queries. The solution involves restructuring the data before it reaches PostgreSQL, making queries simpler and faster.

:gear: Step-by-Step Guide:

  1. Implement CloudQuery Transformations: The most effective approach is to leverage CloudQuery’s transformation capabilities within your salesforce.yml configuration file. This allows you to pre-process the data before it’s written to your PostgreSQL database. You’ll define transformations to flatten the JSONB structure and create separate columns for relevant fields. Here’s an example:
kind: source
spec:
  name: "salesforce"
  path: cloudquery/salesforce
  registry: cloudquery
  version: "v3.2.1"
  destinations: ["postgresql"]
  tables: ["*"]
  spec:
    api_key: "my_secret_key"
    batch_settings:
      batch_size: 500
      retry_attempts: 3
    entity_options:
      salesforce_contacts:
        related_objects: ['accounts', 'opportunities', 'activities', 'campaigns']
        transformations:
          - name: "flatten_related_accounts"
            type: "jsonpath"
            source: "related_entities.accounts.items"
            target: "related_accounts"
            # Example of creating boolean columns (adjust to your needed relationship types):
            transformations:
              - name: "primary_account"
                type: "expression"
                expression: "related_accounts[*].relationship == 'contact_to_account_primary'"
                target: "is_primary_account"
              - name: "billing_account"
                type: "expression"
                expression: "related_accounts[*].relationship == 'contact_to_account_billing'"
                target: "is_billing_account"
            # You can add more transformations here to handle additional fields and relationships
      salesforce_accounts:
        related_objects: ['contacts', 'opportunities', 'activities', 'campaigns']

This transformation uses JSONPath to extract the related_entities.accounts.items array. Subsequent transformations then handle creating separate boolean flags for different account relationships (is_primary_account, is_billing_account). Adapt these expressions to reflect the actual relationship types in your Salesforce data. This transformation generates new columns (is_primary_account, is_billing_account, etc.) instead of complex JSONB structures.

  1. Re-run the Sync: After modifying your salesforce.yml, re-run the CloudQuery sync process. This will apply the transformations and load the data into PostgreSQL with the improved column structure.

  2. Verify Data: Query your salesforce_contacts table in PostgreSQL to confirm the new columns have been correctly populated. You should now be able to query data using simple WHERE clauses instead of complex JSONB operations.

:mag: Common Pitfalls & What to Check Next:

  • JSONPath Syntax: Ensure your JSONPath expressions accurately target the nested fields within your JSONB data. Use a JSONPath evaluator (many online tools are available) to test your expressions before integrating them into your CloudQuery configuration.
  • Transformation Errors: Carefully review the CloudQuery logs for any errors during the transformation process. CloudQuery may provide detailed error messages to help identify potential issues in your transformation configuration.
  • Data Consistency: Verify that all relevant relationship types are correctly handled in your transformations to avoid losing or misrepresenting data during the sync.
  • Scalability: As your Salesforce data grows, consider additional optimization techniques, such as using materialized views or database functions, to improve query performance.

:speech_balloon: Still running into issues? Share your (sanitized) config files, the exact command you ran, and any other relevant details. The community is here to help!

materialized views rly help! flatten your jsonb columns to get the fields you use most often. just set a refresh schedule so u don’t need those messy ctes anymore every time. saves a lot of time!

Try creating database views or functions to handle the JSONB extraction. I had a similar issue and wrote PostgreSQL functions that wrapped the complex JSONB stuff into simple calls. Something like get_contact_accounts(contact_id) that returns a clean table instead of needing those CTEs every time. Another thing that worked well - use PostgreSQL’s generated columns to auto-extract common JSONB fields into regular columns on insert. You get the performance of indexed regular columns while keeping the original JSONB for flexibility. Those duplicate account relationships are pretty normal - Salesforce stores multiple role types per relationship all the time. You could create a view that pivots these into separate columns or handles deduplication with priority rules.

This topic was automatically closed 24 hours after the last reply. New replies are no longer allowed.