The title of this post may not sound like a common use case, but I ran across a fascinating problem today. Admittedly my regex-fu is not the greatest, so I struggled with this one for a bit. I am posting my solution here in the hope that it helps other people.
In short, I had documents that were being run through a character set conversion. The source encoding was unknown, and despite several attempts at parsing them properly, I was still left with a few documents that had stray question marks in place of their dashes and quotes.
At the end of the day it was easier to run them through a cleaning filter after performing the charset conversion.
First Problem. Finding a punctuation character, in this case a question mark, that had alphanumeric text before and after it.
Don’t Match: hello?
Solution. Using preg_replace I was able to create a regex pattern that would find the question mark and confirm that it did in fact have alphanumeric characters before and after it.
Here is the preg_replace code:
$str = preg_replace("/(?<![\W_])\?(?![\W_])/", "'", $str);
It's a bit easier to read when you break it down into its three main parts: the opening condition, the character to find, and the closing condition.
/ (?<![\W_]) \? (?![\W_]) /
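For anyone who wants to try this out, here is a minimal sketch of the same three-part pattern written with the x (extended) modifier so each part can carry its own comment. The function names are hypothetical and only there for illustration. One thing I would hedge on: because both conditions are negative lookarounds, they also pass when there is no character at all on that side, so a question mark at the very start or end of the string is replaced too. If that matters for your data, the second function shows one possible stricter variant that uses positive lookarounds instead. It is an assumption on my part that you would want that behavior, not part of the original fix.

```php
<?php
// Hypothetical helper: the pattern from the post, spread over several lines.
function replace_inner_question_marks(string $str): string
{
    $pattern = '/
        (?<![\W_])   # opening condition: not preceded by a non-alphanumeric character
        \?           # the character to find: a literal question mark
        (?![\W_])    # closing condition: not followed by a non-alphanumeric character
    /x';
    return preg_replace($pattern, "'", $str);
}

// Hypothetical stricter variant (an assumption, not the original code):
// positive lookarounds require a real alphanumeric character on both sides.
function replace_inner_question_marks_strict(string $str): string
{
    return preg_replace('/(?<=[^\W_])\?(?=[^\W_])/', "'", $str);
}

echo replace_inner_question_marks("don?t"), "\n";         // don't
echo replace_inner_question_marks("hello? world"), "\n";  // hello? world
echo replace_inner_question_marks("hello?"), "\n";        // hello'  (end of string counts as a pass)
echo replace_inner_question_marks_strict("hello?"), "\n"; // hello?
```

The two versions agree whenever the question mark sits in the middle of a string. They only differ when it is the first or last character of the input.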