Hey folks, I’m in a bit of a pickle with my data migration project. I’ve managed to pull data from HubSpot into PostgreSQL using CloudQuery, but I’m hitting a wall with the query complexity.
The main issue is with the hubspot_crm_contacts table. It’s got two JSON columns: properties (with contact details) and associations (linking to other entities like companies).
Here’s what’s bugging me:
- The associations column sometimes has duplicate entries. What’s up with that?
- My SQL queries are getting super complicated, especially when I try to fetch contacts with their associated data.
- I’m worried about how to handle multiple associations in the future.
I’ve tried a CTE approach to join contacts with companies, but it feels clunky and hard to scale.
-- Expand each contact's company associations into one row per company
WITH contact_companies AS (
  SELECT
    c.id,
    c.properties->>'full_name' AS name,
    c.properties->>'work_email' AS email,
    jsonb_array_elements(c.associations->'companies'->'results') AS company_data
  FROM hubspot_crm_contacts c
)
SELECT
  cc.id,
  cc.name,
  cc.email,
  comp.properties->>'company_name' AS company
FROM contact_companies cc
JOIN hubspot_crm_companies comp
  ON (cc.company_data->>'id') = comp.id;
Is there a way to transform this data before it hits PostgreSQL? Or maybe a better way to structure it for easier querying? I’m all ears for suggestions!
I’ve been through a similar situation with HubSpot data migration, and I can share what worked for me. Instead of trying to handle all the complexity in PostgreSQL, I found it more effective to pre-process the data before insertion.
I used Python with the ‘pandas’ library to flatten the JSON structures and handle the associations. This approach allowed me to create cleaner, denormalized tables that were much easier to query.
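Stripped down to the standard library, the flattening step looks something like this (the record shape and the full_name / work_email keys are assumptions mirroring the columns used in the queries above):

```python
# One raw contact record as it comes out of HubSpot (shape assumed
# from the columns referenced in the SQL above).
raw_contact = {
    "id": "101",
    "properties": {"full_name": "Ada Lovelace", "work_email": "ada@example.com"},
    "associations": {"companies": {"results": [{"id": "7"}, {"id": "7"}]}},
}

def flatten_contact(record):
    """Pull the nested properties up into one flat row."""
    return {
        "id": record["id"],
        "name": record["properties"].get("full_name"),
        "email": record["properties"].get("work_email"),
    }

print(flatten_contact(raw_contact))
```

With pandas you would do the same thing in bulk (e.g. build a DataFrame from the flattened dicts), but the core idea is just lifting nested keys into top-level columns before insertion.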
For the duplicate entries in associations, I used a simple dict to keep track of unique associations while iterating through the data. This eliminated duplicates before they even hit the database.
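The dict trick is just keying each association by its id so repeats collapse; a minimal sketch (association shape assumed):

```python
def dedupe_associations(results):
    """Keep the first occurrence of each association id, preserving order."""
    seen = {}
    for assoc in results:
        seen.setdefault(assoc["id"], assoc)
    return list(seen.values())

results = [{"id": "7"}, {"id": "9"}, {"id": "7"}]
print(dedupe_associations(results))  # the duplicate id "7" collapses to one entry
```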
As for scaling with multiple associations, I created separate junction tables for each type of association (contacts_companies, contacts_deals, etc.). This relational approach made it much easier to manage and query complex relationships.
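Populating a junction table like contacts_companies then reduces to emitting one unique (contact_id, company_id) pair per association; a sketch, with table and key names as assumptions:

```python
def junction_rows(contacts):
    """Build unique (contact_id, company_id) pairs for a contacts_companies table."""
    pairs = set()  # the set also deduplicates repeated associations
    for c in contacts:
        results = c.get("associations", {}).get("companies", {}).get("results", [])
        for assoc in results:
            pairs.add((c["id"], assoc["id"]))
    return sorted(pairs)

contacts = [
    {"id": "101", "associations": {"companies": {"results": [{"id": "7"}, {"id": "7"}]}}},
    {"id": "102", "associations": {"companies": {"results": [{"id": "9"}]}}},
]
print(junction_rows(contacts))  # [('101', '7'), ('102', '9')]
```

The same pattern works for contacts_deals or any other association type: one small function per junction table, each producing plain rows you can bulk-insert.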
While it required some upfront work, this method significantly simplified my downstream queries and improved overall performance. It might be worth exploring for your use case as well.
Your CTE approach is on the right track, but I agree it can get unwieldy. Have you considered using PostgreSQL’s built-in JSON functions more extensively? They can simplify your queries significantly.
For the duplicate entries in associations, note that PostgreSQL has no built-in jsonb_array_distinct() function; you can deduplicate by expanding the array with jsonb_array_elements() and applying SELECT DISTINCT. As for the complex queries, creating custom functions might help encapsulate some of the logic.
Here’s a potential query structure:
SELECT DISTINCT
  c.id,
  c.properties->>'full_name' AS name,
  c.properties->>'work_email' AS email,
  assoc->>'id' AS company_id
FROM hubspot_crm_contacts c
CROSS JOIN LATERAL jsonb_array_elements(
  c.associations->'companies'->'results'
) AS assoc;
This flattens the structure while handling duplicates. You could then join this result with the companies table as needed. For handling multiple associations, consider creating separate junction tables for each association type. This relational approach often proves more manageable in the long run.
hey flying leaf, i feel ur pain with complex data structures. have u considered using a materialized view? it could simplify ur queries n improve performance. something like:
CREATE MATERIALIZED VIEW contact_company_view AS
SELECT
  c.id,
  c.properties->>'full_name' AS name,
  c.properties->>'work_email' AS email,
  comp.properties->>'company_name' AS company
FROM hubspot_crm_contacts c
-- note: ->0 only picks the first associated company per contact
LEFT JOIN hubspot_crm_companies comp
  ON (c.associations->'companies'->'results'->0->>'id') = comp.id;
this might help with the duplicate issue too. just a thought!