Getting Rid of MS Smart Quotes

April 11, 2003 by  
Filed under Code Snippets

Ever have that problem with displaying Smart Quotes in the browser? Well, here is how I solved the little bug.

This problem has been bugging me for a while. See the image:

[Image: smart quotes]

The problem is in the red circle. See how there’s junk around “Meet the Author”? Well, I researched the problem, and it happens because the user copied and pasted from a Word document. Microsoft adds these “Smart Quotes” to your documents, which are just fancy opening and closing quotes. They replace the straight quotes ( " ) as the user types.


So how do I fix this? I did a lot of Googling today on smart quotes with PHP/HTML. I couldn’t find a good regular expression to replace the fancy quotes. I mean, what do I put in the parameters of eregi_replace()?

Here are some links that helped me:
Creating Special Characters
Smart Quotes:
Adding automated curly quotes to Cocoa’s Text system

chr()
ord()
htmlentities()

But the one that really helped me was the htmlentities() function. In the comments, a user posted this function:

function superhtmlentities($text)
{
    // Windows-1252 byte values (128-159) mapped to named HTML entities.
    $entities = array(128 => 'euro', 130 => 'sbquo',
        131 => 'fnof', 132 => 'bdquo', 133 => 'hellip',
        134 => 'dagger', 135 => 'Dagger', 136 => 'circ',
        137 => 'permil', 138 => 'Scaron', 139 => 'lsaquo',
        140 => 'OElig', 145 => 'lsquo', 146 => 'rsquo',
        147 => 'ldquo', 148 => 'rdquo', 149 => 'bull',
        150 => 'ndash', 151 => 'mdash', 152 => 'tilde',
        153 => 'trade', 154 => 'scaron', 155 => 'rsaquo',
        156 => 'oelig', 159 => 'Yuml');

    $new_text = '';
    for($i = 0; $i < strlen($text); $i++)
    {
        // Look at one byte at a time.
        $num = ord($text[$i]);
        if(array_key_exists($num, $entities))
        {
            $new_text .= '&'.$entities[$num].';';
        }
        // Bytes 127-159 that are not in the table are dropped.
        else if($num < 127 || $num > 159)
        {
            $new_text .= $text[$i];
        }
    }
    // ISO-8859-1 was htmlentities()'s default charset when this was written.
    // This pass also escapes the '&' of the entities added above.
    return htmlentities($new_text, ENT_COMPAT, 'ISO-8859-1');
}

This function converts the evil (invalid) Windows-1252 characters that Microsoft Word likes to insert, byte values 128 to 159, to named HTML entities.

The only strange thing is what it printed out:

â&euro;&oelig;Meet the Authorâ&euro;

in the HTML code. But the wonderful thing is, I have something to at least run eregi_replace() on. :)

So my code looks like this:

$brief_des = eregi_replace("&oelig;", '', 
eregi_replace("â&euro;", '"',
superhtmlentities($newsitem->body)));

This actually fixes my Smart Quotes problem by replacing them with regular quotes ( " ) and deleting the extra unnecessary characters. I’m still a little confused as to why superhtmlentities() returned what it did. But for now, I’m extremely happy to be able to remove those MS Smart Quotes!
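
For anyone on a current PHP version, where eregi_replace() no longer exists, the same cleanup can probably be done with a plain lookup table. This is only a rough sketch: it assumes the text arrives as UTF-8, straighten_quotes() is a made-up helper name (not part of PHP), and the table covers just a handful of common Word characters.

```php
<?php
// Hypothetical helper: swap common Word punctuation for plain ASCII.
// Assumes $text is UTF-8 encoded.
function straighten_quotes($text)
{
    $map = array(
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\x93" => '-',   // U+2013 en dash
        "\xE2\x80\x94" => '--',  // U+2014 em dash
        "\xE2\x80\xA6" => '...', // U+2026 ellipsis
    );
    // strtr() with an array replaces each key wherever it occurs.
    return strtr($text, $map);
}

echo straighten_quotes("\xE2\x80\x9CMeet the Author\xE2\x80\x9D"), "\n";
```

Because strtr() works on raw byte sequences, this needs no extensions, but any character missing from the table passes through unchanged.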

  • http://blogs.bwerp.net/ Adam

    Yeah, Word is pretty notorious about this. You see it replace the long dash character as well. (I’m at a loss for the name for that particular piece of punctuation..)

  • Noel

    Cheers for posting this – I was finding it really difficult to figure out how to handle copy-and-pastes from word docs, and this has sorted me right out. Nice one!

  • http://www.agaricus.co.uk andyh

    thanks, I’ve had the same problem with the replacement function. Something on the server side on Linux is interpreting the ms characters as three characters. eg, the microsoft open quote (ldquo) is chr(226) . chr(128). chr(156) instead of just chr(147).

    Need to map 226 128 156 to 147 before applying the replacement code for HTML entities.

    works fine on a windows iis php server.

    Something to do with character encoding of posted data I expect.

  • Colin

    that long dash is an emdash… the shorter one is the endash, and the smallest is the hyphen

  • Chris

    Make sure the charset in your <meta> tag and headers matches the charset passed to htmlentities/htmlspecialchars. "ISO-8859-1" is the default.

  • http://tedgeving.com eurotrash

    http://us2.php.net/htmlentities

    this will work too.
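
Taken together, andyh's three-byte observation and Chris's charset advice suggest the junk comes from UTF-8 data being read one byte at a time. Here is a minimal sketch of that idea, assuming the posted data really is UTF-8; the sample string is made up for illustration.

```php
<?php
// Assumption: the form data arrived as UTF-8.
$body = "\xE2\x80\x9CMeet the Author\xE2\x80\x9D";

// The opening quote is three bytes in UTF-8.
echo implode(' ', array_map('ord', str_split(substr($body, 0, 3)))), "\n"; // 226 128 156

// Telling htmlentities() the real charset lets it see one character
// instead of three, so it can emit the named entity directly.
$escaped = htmlentities($body, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // &ldquo;Meet the Author&rdquo;
```

The same charset would then also go in the page's Content-Type header, for example header('Content-Type: text/html; charset=UTF-8'), so the browser and htmlentities() agree.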