I am currently transferring data from HubSpot to PostgreSQL with CloudQuery, but I’m facing challenges with the complexity of querying JSONB data.
Configuration Details
Here’s how my source is set up:
kind: source
spec:
name: "hubspot"
path: cloudquery/hubspot
registry: cloudquery
version: "v4.4.3"
destinations: ["postgresql"]
tables: ["*"]
spec:
app_token: "my_access_token"
pagination:
page_size: 1000
max_retries: 5
table_options:
hubspot_crm_contacts:
associations: ["contacts", "deals", "notes", "emails", "tasks", "meetings", "calls", "company"]
hubspot_crm_companies:
associations: ["contacts", "deals", "notes", "emails", "tasks", "meetings", "calls", "company"]
hubspot_crm_deals:
associations: ["contacts", "deals", "notes", "emails", "tasks", "meetings", "calls", "company"]
And my destination configuration:
kind: destination
spec:
name: "postgresql"
path: "cloudquery/postgresql"
registry: "cloudquery"
version: "v8.5.1"
write_mode: "overwrite-delete-stale"
spec:
connection_string: "postgresql://test:test@localhost:5432/dbsync?sslmode=disable"
Data Structure Concern
In the hubspot_crm_contacts table, there are JSONB columns. For example, the properties column contains:
{
"email": "[email protected]",
"lastname": "Musk",
"firstname": "Elon",
"createdate": "2024-08-05T09:06:00.072Z",
"hs_object_id": "28109002461",
"lastmodifieddate": "2024-08-05T11:00:14.253Z"
}
The associations column has:
{
"companies": {
"results": [
{
"id": "13176491214",
"type": "contact_to_company"
},
{
"id": "13176491214",
"type": "contact_to_company_unlabeled"
}
]
}
}
The Issue
I’m struggling with the complexity of the SQL queries needed to retrieve contacts along with their associated data. Here’s the query I’m using to fetch contacts and their linked companies:
WITH company_data AS (
SELECT
c.id AS contact_id,
c.properties ->> 'firstname' AS first_name,
c.properties ->> 'lastname' AS last_name,
c.properties ->> 'email' AS email,
jsonb_array_elements(c.associations -> 'companies' -> 'results') AS company_json
FROM
hubspot_crm_contacts c
)
SELECT
DISTINCT ON (contact_id, company_id)
cd.contact_id,
cd.first_name,
cd.last_name,
cd.email,
cmp.id AS company_id,
cmp.properties ->> 'name' AS company_name,
cmp.properties ->> 'domain' AS company_domain
FROM
company_data cd
LEFT JOIN
hubspot_crm_companies cmp
ON (cd.company_json ->> 'id') = cmp.id
ORDER BY
contact_id, company_id, cd.company_json ->> 'type' DESC;
This becomes even more convoluted with additional associations. I am seeking guidance on effective data management techniques.
Is there a way to restructure the data before it enters PostgreSQL or methods to simplify my SQL queries?