Streamlining data management when transferring HubSpot data to PostgreSQL using CloudQuery

I am currently transferring data from HubSpot to PostgreSQL with CloudQuery, but I’m facing challenges with the complexity of querying JSONB data.

Configuration Details

Here’s how my source is set up:

kind: source
spec:
  name: "hubspot"
  path: cloudquery/hubspot
  registry: cloudquery
  version: "v4.4.3"
  destinations: ["postgresql"]
  tables: ["*"]
  spec:
    app_token: "my_access_token"
    pagination:
      page_size: 1000
      max_retries: 5
    table_options:
      hubspot_crm_contacts:
        associations: ["contacts", "deals", "notes", "emails", "tasks", "meetings", "calls", "company"]
      hubspot_crm_companies:
        associations: ["contacts", "deals", "notes", "emails", "tasks", "meetings", "calls", "company"]
      hubspot_crm_deals:
        associations: ["contacts", "deals", "notes", "emails", "tasks", "meetings", "calls", "company"]

And my destination configuration:

kind: destination
spec:
  name: "postgresql"
  path: "cloudquery/postgresql"
  registry: "cloudquery"
  version: "v8.5.1"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: "postgresql://test:test@localhost:5432/dbsync?sslmode=disable"

Data Structure Concern

In the hubspot_crm_contacts table, there are JSONB columns. For example, the properties column contains:

{
  "email": "[email protected]",
  "lastname": "Musk",
  "firstname": "Elon",
  "createdate": "2024-08-05T09:06:00.072Z",
  "hs_object_id": "28109002461",
  "lastmodifieddate": "2024-08-05T11:00:14.253Z"
}

The associations column has:

{
  "companies": {
    "results": [
      {
        "id": "13176491214",
        "type": "contact_to_company"
      },
      {
        "id": "13176491214",
        "type": "contact_to_company_unlabeled"
      }
    ]
  }
}

The Issue

I’m struggling with the complexity of the SQL queries needed to retrieve contacts along with their associated data. Here’s the query I’m using to fetch contacts and their linked companies:

WITH company_data AS (
    SELECT 
        c.id AS contact_id,
        c.properties ->> 'firstname' AS first_name,
        c.properties ->> 'lastname' AS last_name,
        c.properties ->> 'email' AS email,
        jsonb_array_elements(c.associations -> 'companies' -> 'results') AS company_json
    FROM 
        hubspot_crm_contacts c
)
SELECT 
    DISTINCT ON (contact_id, company_id) 
    cd.contact_id,
    cd.first_name,
    cd.last_name,
    cd.email,
    cmp.id AS company_id,
    cmp.properties ->> 'name' AS company_name,
    cmp.properties ->> 'domain' AS company_domain
FROM 
    company_data cd
LEFT JOIN 
    hubspot_crm_companies cmp 
    ON (cd.company_json ->> 'id') = cmp.id
ORDER BY 
    contact_id, company_id, cd.company_json ->> 'type' DESC;

This becomes even more convoluted with additional associations. I am seeking guidance on effective data management techniques.

Is there a way to restructure the data before it enters PostgreSQL or methods to simplify my SQL queries?

Use CloudQuery’s transformations feature to flatten your HubSpot data during sync instead of wrestling with complex JSONB queries later. Add a transformation config to your CloudQuery setup that automatically creates normalized tables for associations. I’ve done this before - wrote a simple transformation that creates separate tables like contact_company_associations and contact_deal_associations. Makes querying way easier. The transformation runs during sync, so your PostgreSQL gets clean relational data instead of nested JSON mess. Completely eliminated my need for CTEs and complex joins. You’ll need to modify your CloudQuery config for custom transformations, but it’s worth the setup time for how much simpler the queries become.

Try creating a PostgreSQL view to flatten the common JSONB fields like firstname, lastname, etc. Your main queries will be way cleaner and you won’t need to repeat ->> operators everywhere. Works grate for HubSpot data.

Try PostgreSQL’s generated columns to automatically extract JSONB fields you access often. Just alter your table: ALTER TABLE hubspot_crm_contacts ADD COLUMN email_computed TEXT GENERATED ALWAYS AS (properties ->> 'email') STORED; This cuts out messy JSON extractions from every query and boosts performance since values are pre-computed and indexed. For associations, I’d create a separate junction table that normalizes the many-to-many relationships during CloudQuery sync with a custom transformation script. This approach has saved me tons of time debugging complex CTEs.