I have Google Voice set up to convert voicemail recordings to text and send them via email. After removing HTML tags from the email content, I get this text:
<!-- div, p, a, li, td {} .links-date a {color:#000000; text-decoration:none} .links-footer a {color:#757575; line-height:12px; text-decoration:none} .links-phone_number a {color:inherit; text-decoration:none} .im {color:#000!important} --> Hello, this is a test message play message YOUR ACCOUNT HELP CENTER HELP FORUM To edit your email preferences for voicemail, go to the Email notification settings in your account. Google Inc. 1600 Amphitheatre Pkwy Mountain View CA 94043 USA
I need to extract just the message part: Hello, this is a test message
The actual voicemail content is always positioned between two fixed markers. It starts right after the "-->" symbol and ends right before "play message". These boundaries are consistent across all emails.
What would be the best approach to isolate only the voicemail transcript text? Can I use a formatter or some code logic to extract this specific portion while ignoring everything else?
I had a similar Google Voice setup and used a split approach that worked great. Split the text on "-->" first, grab the second part, then split that on "play message" and take the first element. Something like text.split('-->')[1].split('play message')[0].strip() does the job cleanly. Way more readable than regex and easier for others to maintain later. One caveat: the [1] raises an IndexError if an email ever arrives without the "-->" marker, so guard for that if you can't fully trust the input. This stayed stable even when Google tweaked their HTML - the text markers stuck around while CSS classes changed. Heads up though - transcription quality varies a lot based on audio, so you'll probably want basic cleanup for stuff like extra spaces or weird punctuation.
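Here's a rough sketch of the same split idea with a guard for missing markers. I've used str.partition instead of split so a missing marker doesn't raise; the function name and the shortened SAMPLE string are just made up for illustration, and the whitespace collapsing is an assumption about what "basic cleanup" should mean for you.

```python
# Shortened stand-in for the stripped email text (illustrative only)
SAMPLE = (
    "<!-- div, p, a, li, td {} .im {color:#000!important} --> "
    "Hello, this is a test message play message YOUR ACCOUNT HELP CENTER"
)


def extract_transcript_split(text, start="-->", end="play message"):
    """Return the text between the first `start` and the next `end`, or None."""
    # partition() returns (before, separator, after); the separator is ''
    # when the marker is not present, so nothing raises on odd input.
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    body, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    # Collapse runs of whitespace left over from HTML stripping.
    return " ".join(body.split())


print(extract_transcript_split(SAMPLE))  # Hello, this is a test message
```

Returning None rather than an empty string lets the caller tell "no markers found" apart from "empty transcript".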
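String slicing beats regex here. Find where "-->" sits, then locate "play message" and grab what's between them. text[text.find('-->') + 3:text.find('play message')].strip() does the job. One thing to watch: find() returns -1 when a marker is missing, and the slice will then quietly return the wrong text instead of failing, so check both results if the input isn't guaranteed. The same idea ports to most languages, and there's no pattern syntax to get wrong.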
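A minimal sketch of the slicing approach with the -1 checks added. The function name is hypothetical, and I'm assuming you'd rather get None back than a silently wrong slice. It also starts the search for "play message" after the start marker, in case that phrase ever shows up earlier in the text.

```python
def extract_transcript_find(text, start="-->", end="play message"):
    """Slice out the text between `start` and the next `end`, or return None."""
    begin = text.find(start)
    if begin == -1:
        return None  # start marker missing
    begin += len(start)  # skip past the marker itself
    # Search for the end marker only after the start marker.
    stop = text.find(end, begin)
    if stop == -1:
        return None  # end marker missing
    return text[begin:stop].strip()


sample = "<!-- .im {color:#000!important} --> Hello, this is a test message play message HELP CENTER"
print(extract_transcript_find(sample))  # Hello, this is a test message
```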
Regex is perfect for this. Since your markers are consistent, try this pattern: -->\s*(.+?)\s*play message - it'll capture everything between the end of the comment and "play message". The (.+?) is a non-greedy capture group, so it stops at the first "play message" instead of grabbing too much. Note that . doesn't match newlines by default, so if a transcript can span several lines you'll need re.DOTALL in Python (or the s flag in JavaScript). I've done similar parsing before and you'll definitely want to trim whitespace from what you capture - there's always extra spacing. Python's re module handles this well, or if you're using JavaScript, the built-in string methods work great. One heads up: watch for spacing variations around those markers. Google tweaks their email formatting sometimes, so test your regex on a few different voicemails to make sure it stays reliable.
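For reference, here's roughly how that pattern looks with Python's re module. Treat it as a sketch: the function and constant names are mine, and re.DOTALL is only there on the assumption that a transcript might contain line breaks.

```python
import re

# Non-greedy capture between the end of the CSS comment and "play message".
# re.DOTALL lets "." match newlines in case the transcript spans lines.
TRANSCRIPT_RE = re.compile(r"-->\s*(.+?)\s*play message", re.DOTALL)


def extract_transcript_regex(text):
    """Return the captured transcript, or None if the pattern does not match."""
    match = TRANSCRIPT_RE.search(text)
    if match is None:
        return None
    return match.group(1).strip()


sample = "<!-- .im {color:#000!important} --> Hello, this is a test message play message HELP CENTER"
print(extract_transcript_regex(sample))  # Hello, this is a test message
```

Compiling the pattern once at module level keeps the function cheap if you run it over a whole mailbox.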