How to parse and tally different header levels in LangGraph state data?

I’m building a LangGraph application for automated document creation. I need to extract and count each header type, including main chapters, subsections, and sub-subsections, from a string stored in my state dictionary under the key “report_structure”.

state_data = {
    'report_structure': 'Final document structure:\n\n'
                       '**Chapter 1: Financial Planning Basics**\n\n'
                       '1.1 Understanding budgets and their role in planning\n'
                       '1.1.1 Key principles of effective budgeting\n'
                       '1.2 Overview of financial planning for professionals\n\n'
                       '**Chapter 2: Investment Strategies**\n\n'
                       '2.1 Risk assessment and portfolio management\n'
                       '2.1.1 Real-world examples of risk management\n'
                       '2.2 Retirement planning considerations\n'
                       '2.2.1 Statistical data on retirement savings\n'
                       '2.3 Emergency fund strategies\n'
                       '2.3.1 Common mistakes in emergency planning\n'
}

I’m struggling to differentiate between main chapters, such as “Chapter 1”, subsections like “1.1”, and sub-subsections like “1.1.1”, since they all follow similar numbering patterns. What would be the best method to count each header type separately?

Your structure looks good for regex parsing. Each hierarchy level follows a distinct pattern you can match against.

For chapters, use \*\*Chapter \d+:.*?\*\* to catch the bold markdown. Subsections follow \d+\.\d+ and sub-subsections follow \d+\.\d+\.\d+. I’ve dealt with similar parsing in technical docs; the number of dots in the numbering scheme tells you the depth, so each level gets its own pattern.
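Here's a quick sanity check against the state_data from your question, showing where the naive subsection pattern goes wrong:

import re

structure = state_data['report_structure']

# The bold markers fence chapters in, so this pattern is already safe
print(len(re.findall(r'\*\*Chapter \d+:.*?\*\*', structure)))   # 2

# The bare subsection pattern over-counts: it also matches the "2.1"
# prefix inside every "2.1.1"-style sub-subsection line
print(len(re.findall(r'\d+\.\d+', structure)))                  # 9, not the 5 you want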

Watch out for partial matches though: \d+\.\d+ also matches the front of every \d+\.\d+\.\d+. Word boundaries won't fix this on their own, because \b happily sits between a digit and a dot, so "2.1" still matches inside "2.1.1". Instead, anchor each pattern to the start of a line with re.MULTILINE and give the subsection pattern a negative lookahead so it rejects a trailing .digit. I’d grab all matches with re.findall() first, then use len() to count them, instead of trying to count directly with the regex.
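With those two fixes in place, here's a minimal sketch, assuming the structure string always puts each numbered heading at the start of its own line the way your example does:

import re

structure = state_data['report_structure']

# Bold markdown chapter headers: **Chapter N: ...**
chapters = re.findall(r'\*\*Chapter \d+:.*?\*\*', structure)

# Subsections: N.N at line start; the negative lookahead (?!\.\d)
# stops "1.1" from matching inside "1.1.1"
subsections = re.findall(r'^\d+\.\d+(?!\.\d)\s', structure, re.MULTILINE)

# Sub-subsections: N.N.N at line start
sub_subsections = re.findall(r'^\d+\.\d+\.\d+\s', structure, re.MULTILINE)

counts = {
    'chapters': len(chapters),
    'subsections': len(subsections),
    'sub_subsections': len(sub_subsections),
}
print(counts)
# {'chapters': 2, 'subsections': 5, 'sub_subsections': 4}

If the node that writes report_structure ever changes its formatting (different bold style, indented headings), these line anchors are the first thing to revisit.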