I’m working with a large table that has about 500,000 rows. One of the columns is a varchar(255) UTF8 field called filename
. I want to clean up these filenames by removing weird characters.
I thought I could use a character class like [^a-zA-Z0-9()_ .-]
to match the characters I want to remove. But I’m not sure how to do this in MySQL.
Is there a built-in function in MySQL that can do regex-based replacements? I’m looking for something similar to the REPLACE() function, but with regex support. For example:
-- This works:
SELECT REPLACE('hello world', 'o', 'x');
-- Is there something like this?
SELECT MAGIC_REGEX_REPLACE('Hello World', '/[aeiou]/', '-');
-- Desired output: 'H-ll- W-rld'
I know about REGEXP and RLIKE, but they only check if there’s a match. They don’t actually do replacements.
I could pull the data into PHP and use preg_replace, then update the database. But that seems slow and clunky. Is there a better way to do this directly in MySQL?
As someone who’s dealt with similar data cleanup tasks, I can tell you that MySQL’s lack of native regex replace can be frustrating. However, I’ve found a decent workaround using stored procedures and loops.
Here’s the gist:
You can create a stored procedure that iterates through a list of unwanted characters and uses REPLACE() for each one. It’s not as elegant as a single regex replace, but it gets the job done without needing external processing or UDFs.
The procedure would look something like this:
DELIMITER //
CREATE PROCEDURE clean_filenames()
BEGIN
DECLARE chars_to_remove VARCHAR(255);
DECLARE i INT DEFAULT 1;
SET chars_to_remove = '!@#$%^&*{}[]|\:;<>?,/';
WHILE i <= LENGTH(chars_to_remove) DO
UPDATE your_table
SET filename = REPLACE(filename, SUBSTRING(chars_to_remove, i, 1), '');
SET i = i + 1;
END WHILE;
END //
DELIMITER ;
Then you just call CALL clean_filenames(); to run it.
It’s not perfect, but it’s a pure MySQL solution that can handle large datasets reasonably well. Just be sure to test thoroughly on a subset of your data first.
i think there’s a way to do this with mysql’s built-in functions, but it’s kinda hacky. you could use a combo of SUBSTRING_INDEX and REPLACE in a loop to remove each unwanted char. it’s not as clean as regex, but might work for simple cases. just be careful with performance on big tables!
yo, have u tried using the REGEXP_REPLACE function? it’s available in MySQL 8.0.4 and later versions. it does exactly what ur looking for. the syntax is like this:
SELECT REGEXP_REPLACE(filename, ‘[^a-zA-Z0-9()_ .-]’, ‘’) FROM your_table;
it’s way simpler than all those complex workarounds others mentioned. just make sure ur using a recent MySQL version!
I’ve actually tackled a similar issue before, and while MySQL doesn’t have a built-in regex replace function, there’s a workaround using a combination of REGEXP and SUBSTRING_INDEX that can get the job done.
Here’s the approach I used:
- Create a user-defined function (UDF) in MySQL that implements regex replace.
- Use this UDF in your SQL queries.
It’s a bit involved, but it works well for large datasets. You’ll need to write the UDF in C or C++ and compile it as a shared object. Then you can load it into MySQL.
Once set up, you can use it like this:
SELECT REGEX_REPLACE(filename, '[^a-zA-Z0-9()_ .-]', '') FROM your_table;
This method is much faster than pulling data into PHP and back. It processes everything at the database level, which is ideal for large tables like yours.
Just be aware that creating UDFs requires certain permissions and can have security implications, so make sure you understand the risks before implementing this solution.
While MySQL doesn’t offer a direct regex replacement function, there’s a clever workaround using recursive common table expressions (CTEs) introduced in MySQL 8.0. This method can handle your task without needing UDFs or external processing.
Here’s a basic example:
WITH RECURSIVE cte AS (
SELECT filename, 1 AS pos
FROM your_table
UNION ALL
SELECT
CASE WHEN SUBSTRING(filename, pos, 1) REGEXP '[^a-zA-Z0-9()_ .-]'
THEN CONCAT(LEFT(filename, pos-1), RIGHT(filename, LENGTH(filename)-pos))
ELSE filename
END,
pos + 1
FROM cte
WHERE pos <= LENGTH(filename)
)
SELECT filename FROM cte WHERE pos > LENGTH(filename);
This approach iteratively checks each character and removes it if it doesn’t match your desired pattern. It’s more efficient than multiple REPLACE calls and works directly in MySQL. Just be cautious with very large datasets, as it might impact performance.