Somewhat 'Smarter' Search and Replace

Let's say you have some old HTML 3.2 documents that you want to bring into the 21st century by updating them to HTML 4.01 or XHTML. That means removing all those nasty font tags and replacing them with styles. HTML-Tidy is a free HTML validator (one of the ones recommended by w3.org) which has this capability built in. With the click of a button, all font tags can be removed and replaced with span tags, each with an associated class defined in a style.

Some other search/replace examples:

[Digithead warning: Unless you carry your writing utensils in a pocket protector, you may not be interested in the following.]

HTML-Kit is a free, full-featured HTML and script editor (a text editor, not WYSIWYG) developed by Chami.com. Among its many features, it lets you build 'smart' search-and-replace operations, since your search phrase can be a regular expression. (Note: HTML Tidy, mentioned above, is also built in!)

I recently had occasion to use this to remove all the width="xx%" attributes in a very large HTML table that was generated in Excel '95 with Internet Assistant. This utility puts a width= attribute in every cell! The table had many columns, each with its own unique width. I could have issued several search/replace commands (one for each column width) as follows:

search for: width="11%"; replace with: (nothing)
search for: width="27%"; replace with: (nothing)
... etc.

or I could issue one search and replace as a regular expression to get them all:

search for: width=\"[\d]+%\"; replace with: (nothing)

What this says is: search for width=" followed by a string of digits of any length, then a percent sign, then another double quote. Then remove that string (i.e. replace it with nothing).
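For readers without HTML-Kit at hand, the same idea can be sketched in Python's re module. This is only an illustration, not HTML-Kit's own syntax: the sample row in html is made up, and the pattern adds two small assumptions of mine, an optional percent sign (%?) and a leading \s* so the space in front of the attribute is removed along with it.

```python
import re

# Hypothetical sample row, modeled on the kind of table described above.
html = '<tr><td width="11%">Name</td><td width="27%">Notes</td></tr>'

# width=" followed by one or more digits, an optional percent sign,
# and a closing double quote. The leading \s* also swallows the space
# in front of the attribute so no stray gap is left inside the tag.
WIDTH_ATTR = re.compile(r'\s*width="\d+%?"')

# Replace every match with nothing, i.e. delete it.
cleaned = WIDTH_ATTR.sub("", html)
print(cleaned)
```

One pattern covers every column width, which is the whole point of the regular-expression version of the search.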

Another, slightly more complicated example: say (for whatever reason) you had a document containing a series of related terms like p01, p0201, p031156, ... and (for some other reason) you needed them all changed to q01, q0201, q031156, ... (is this minding your p's and q's?). You cannot simply do a global replace of every p with a q, since that would make changes you did not intend. Here again, a regular expression replace can make your work easy:

search for: p(\d+); replace with: q%3

The above says to search for a p followed by a series of digits of any length and replace it with a q followed by the same digits. The %3 is called a back-reference. In this case it refers back to the 3rd thing that is matched in the search p(\d+). The p is the first match. One would think the string of digits would be the second match, but (for some reason, I don't know why) there is an extra NULL match thrown in for parenthesized things, so the digits are referred to as the 3rd match in HTML-Kit.

This is where HTML-Kit's regular expressions differ from Perl's, for example. In both, things enclosed in parentheses are grouped as a matching entity, but in Perl you would refer back to the series of digits above as $1 instead of %3 (i.e. the first back-reference, since only things in parentheses are available as back-references). Also note the dollar sign instead of the percent sign. It is a bit confusing, but it does seem to be consistent once you know this. Just test your search/replace command on a single replace to make sure it works as expected before you hit 'Replace All'.
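As a point of comparison, here is a minimal sketch of the p-to-q replacement in Python, which numbers groups the way Perl does: the first parenthesized group is back-reference 1, written \1 in the replacement string. The sample sentence is invented for the illustration.

```python
import re

# Hypothetical sample text; the words "appear", "report" and "pepper"
# contain the letter p but must not be changed.
text = "Codes p01, p0201 and p031156 appear in the report; 'pepper' stays put."

# p followed by one or more digits. The parentheses capture the digits
# as group 1, and \1 in the replacement puts them back after the q.
renamed = re.sub(r"p(\d+)", r"q\1", text)
print(renamed)
```

Only a p that is immediately followed by digits is touched, so ordinary words keep their p's.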

What the !^.*$! is a "regular expression"?

Regular expressions are cryptic-looking expressions that are used for pattern matching. They are not trivial to learn, but they can be extremely powerful and are common to many web programming languages and tools (Perl, PHP, Unix/Linux shells, etc.). See WebMonkey for a grep example and explanation. Regular expressions used in Perl and PHP are well documented. There is some documentation for the regular expressions used with HTML-Kit search and replace, but it is sparse, and they do seem to work somewhat differently than in the examples above. I guess I would be hard pressed to recommend learning regular expressions just so you could use them with the HTML-Kit text editor, but given their ubiquity in web CGI scripting, they may be worth your while.