The Proxomitron Filter Examples

The Proxomitron

Filter Examples

John Sankey 2002

Introduction
Technical Details
The Filter Language
Filter Examples
Wipe out a feature of HTML
Exception Filters
Remove elements within a tag
Convert one tag to another
Get rid of CGI URL's
Use lists to deal with our complex world
Change the entire configuration of The Proxomitron
Get rid of banner ads

The Proxomitron
Keeping an eye on the Web.
For You.

Wipe out a feature of HTML

Some features of the Web are so risky or obnoxious that they should be wiped out altogether. The top three are applets, ActiveX, and active scripting.

Name = "Kill applets" Limit = 32767 Match = "<applet\s*</applet>"

Applets are programs sitting in your computer. Most run at the same privilege as your browser - they can read or write absolutely anything on your entire system (including any network to which you are connected). Some applets are distributed with your browser, but any kid on the Internet can download one to you if you have applets enabled (in IE at least). That new applet can take the name of an existing applet, and be a virus, trojan, or anything anyone in the world happens to feel like using you for (cf. The Attacks on GRC.COM). Stopping them, possibly except from trusted sites, is a basic security step. (Turning off Outlook Express is even more important - that's how most viri and trojans are actually distributed. Old-fashioned BBS text-only mail systems have their merits!)

HTML invokes applets with a beginning and end tag: <applet appletname> starts one, and </applet> ends it. So, this filter is triggered by the <applet sequence (without the > - remember that applet has one or more fields between <applet and the tag completion >, they can appear in any order and they could be anything at all including nonsense!) The filter then looks for anything at all (the *) until it finds </applet>, then stops. Then, it simply deletes (replaces by nothing) the whole sequence - tag, fields and all. Your browser never sees it.

You can also turn off applets in most browsers, and you should - there are new ways found every month or so to sneak trojans and other destructive programs into your computer. For example, MIME type "multipart/x-mixed-replace" allows more than one HTML page for every GET request. Each has its own header, and they don't have to be HTML - application/octet-stream loads anything the remote site wants onto your computer, then runs it. The Proxomitron only knows how to filter HTML (http://) and SSL (https://).

Note the large Limit - applets are so dangerous that you want to be sure you get rid of them, no matter how long they are. If you meet up with a malformed page that doesn't close an applet, The Proxomitron will have to read all the way to the end of the page before it can send any of it to your browser - slow but safe.

If you want to be warned when you meet up with a site that uses applets, add
Replace = "[Applet Killed]"
Just remember that this will only let you know of the 'up front' applets, not the sneaky ones.

Name = "Kill ActiveX" URL = "^(*.microsoft.com/)" Match = "<object\s*</object>"

This works the same as the Kill applets filter, but deletes ActiveX things except for Microsoft sites that supply updates for your operating system. (If you don't trust Microsoft, you're using Linux or something like it, and ActiveX won't work anyway!) As with applets, it's a good precaution to turn off ActiveX in your browser as well, and only activate it when you want to update Microsoft code online. (By the way, Microsoft provides file download executables for most updates; using them you don't need ActiveX at all, and you can back up your upgrades as well.)

Name = "Kill scripts" Match = "<script\s*</script>"

This works the same as the Kill applets filter, but deletes all scripts, no matter what language they are written in. (JavaScript is the most common, of course.) JavaScript commands can be used to read anything on your computer that you can read as a user on most systems. They can only be 'understood' by a full-blown Java interpreter, as there are so many different ways the same code can be written. (For example, "expires" can be written "exp"+"ires" ...) So, it is not possible for a program as simple as The Proxomitron to filter them reliably. Killing them all, except for a few trustworthy sites, is a sensible option, particularly with browsers that have no facility to turn them off. (IE4.73, for example, only has an option to disable 'active scripting', not all JavaScript, and the distinction depends on where the source page comes from.) A large Limit is also advisable for this filter, since some pages are now entirely in JavaScript with HTML tags around the script to disguise them.

Use the same filter design to get rid of anything else that you don't want on your monitor: styles, iFrames (Microsoft's idea of progress) and Layers (Netscape's brainwave). Others are sure to turn up in the future! Just use a small Limit for anything that is not potentially destructive, so the page will load as quickly as possible.

Exception Filters

Often you want to allow one form of a tag that you appreciate or trust, while barring other variants of it that cause you problems. <Meta...> is a prime candidate. You want to get the content-type meta (even if you aren't into Arabic or Chinese, quite a few sites use 16-bit Unicode now). But, every day browser writers seem to dream up a new meta that disables your back button (http-equiv=refresh), fouls up caching (expires, nocache, pragma), makes you twiddle your thumbs for ages with fades from one page to another (page-enter, page-exit, site-enter, site-exit), or is some other similar nuisance.

Name = "Allow content-type meta" Bounds="<meta\s*>" Match = "\1"content-type"\2" Replace = "\1"content-type"\2" Name = "Kill metas" Match = "<meta\s*>"

This introduces the Bounds feature of The Proxomitron. A filter with Bounds will only be effective within the range matched by the Bounds - in this case within a meta tag.

The first filter says: within the limits set by the bound, record all characters that exist before "content-type" into variable \1, save all characters that exist after "content-type" into \2, then put everything back. If "content-type" doesn't exist, the filter does nothing and the tag gets passed on to the second filter.

When a filter matches a code in the input HTML, (unless you tell The Proxomitron otherwise with the Multi function) it 'uses up' the code - it's not available to be grabbed by the second filter. That's how this combination of filters lets the desired meta tag through and blocks all the rest.

This multifilter method lets you do quite complex things, like accepting a specific meta from just one 'gotta-have' site, a second from just another site, etc. But, if you only want to let one thing through from everywhere, and block the rest (also from everywhere), there is a more elegant way (although the syntax is more confusing!):

Name = "Allow content-type metas but no others" Bounds="<meta\s*>" Match="(^[^>]++"content-type")*"

This introduces the logical-test function of The Proxomitron [^], and the repeat function +. [^>]+ means: repeatedly test characters from here on until you find a >. A negated expression doesn't 'use up' characters it doesn't match. The bound starts with <, so there must be at least one character that [^>]+ doesn't match. Following it with the second + and "content-type" says: if, on the way, the test meets up with "content-type", then set the result of the test (the 'value' of everything in blue) as TRUE. Otherwise, the result of the test is FALSE (the leading <).

Then, apply the NOT function (the leading ^) to the blue test result. That makes the value of everything in the outer parentheses TRUE if "content-type" was not found, FALSE if it was found.

Finally, apply that TRUE/FALSE result to the match-everything function (the trailing *). That is, 'use up' everything within the bounds if the value of the outer parentheses is TRUE, but take no action if the value is FALSE. Since there is no Replace in the filter, everything that is 'used up' in the Match is replaced by nothing.

Is there more than one meta tag you'd like to let through? Read on for how to use The Proxomitron OR function and lists.

Remove elements within a tag

Often you'll want to remove only one element of a tag while leaving the rest as they are. Or, you will want to write separate filters for each part of a tag, for flexibility or simplicity.

Name = "eliminate page background images" Bounds="<body\s*>" Multi = TRUE Matching: "\1background="*"\2" Replace = "\1\2"

This filter introduces the Multi function. Normally, when a filter matches a code in the input HTML, it 'uses up' the code - it's not available to be grabbed by another second filter. When you change just one element of a tag, and other filters might need to grab other elements that occur before the element you have matched, use Multi=TRUE. Just remember that, if you do, YOU are responsible for ensuring that there is no way that the filter can match its replacement text. If it does, you will get an infinite loop and have to stop The Proxomitron with its Abort button.

Anyway, within the bound, this filter looks for the element that is used to specify a page background image. Everything before that element is put into variable 1. The equals = sign also matches spaces on either side of it. The doublequote matches either a single or double quote, the * matches everything up to the next quote. Finally, the rest of the bound is put into variable 2. \1\2 just puts everything within the bound back except for the element it removed.

Some elements can appear more than once within a tag. Then, you need a recursive filter - one that keeps being called until all elements have been removed. There are two ways of doing this - internal (stack) and external (added space). Here is how you do it with the stack.

Name = "Kill All Link Colors" Multi = TRUE Bounds = "<body\s*>" Match = "(\#((a|v|)link=\w))+ \#" Replace = "\@"

This introduces The Proxomitron's stack - a variable that can hold many things, not just one as in the case of \1. Here's how it works.

The part in dark blue matches alink, vlink or link. So, the part in blue matches all forms of a body link colour such as link=blue or alink="#FFFFFF". When one of these is matched, everything within the bounds prior to it is put in the stack. The + says to repeat until there are no more matches, then the final \# grabs the rest.

For example, consider
<body link=black bgcolor=white alink=black>
The first match is to "link=black", so "<body " is put in the stack (position 1). The filter is then pointing to the space following "link=black". On the next repeat, it matches "alink=black", and " bgcolor=white " is put in the stack (position 2). There are no more matches, so the trailing > is put in the stack (position 3).

Now the replace \@ dumps the stack, first in first out: "<body bgcolor=white >". All link colours have been removed in one invocation of the filter. (Of course, if you are colour or vision impaired, you will actually want to remove the entire body tag - background colour and image too - so you see the colours you set in your browser.)

Convert one tag to another

Comments are usually put in pages by HTML generators (those 'point click and ignore the crappy code' programs), but occasionally even by programmers so they can figure out what they are doing. You'd be surprised how often programmers mark off the beginning and end of ads and other corruption by comments! Sometimes, seeing them can be really useful.

But, converting tag 1 to tag 2 can be tricky. The first thing to do with any such conversion is to ensure that side effects do not occur with anything that is inside the changed tag.

Side effect 1: it is common practice to comment everything inside JavaScript tags so browsers without JavaScript support won't try to display them. (A browser is supposed to ignore anything it doesn't understand, and non-Java browsers don't understand script tags.) JavaScript ignores HTML comment marks but may not ignore what you replace them with. If you use JavaScript and want to ensure that it isn't fouled up by a Comment Viewer set, you need to start with:

Name = "Remove fake JavaScript comments" Multi = TRUE Bounds = "<script\s*</script>" Limit = 32767 Match = "\1() \2" Replace = " \1 \2"

This will remove all 'comment' markings within JavaScripts before the following stages have a chance to see them. Note the space between ) and \2 in the match.

Also note the space before the \1 in the Replace. The Proxomitron avoids infinite loops by 'using up' one character each pass through the filter list. If you don't add the space, then the bounds would see "script..." the next pass, and wouldn't match. If you need to recursively activate a filter, you must add a leading space in the replace string, so that it will be the one character used up.

Side effect 2: If a page 'comments out' a section of code that doesn't work, simply removing the comment marks would activate that code! So, you need to replace all < and > within the comment to < and > respectively, so your browser will display them but not act on them. HTML comments begin with , so the requirement is to change < and > within comments, and leave the comment available for further editing just as the background image filter did. There may be more than one < > pair within a comment, so recursion is needed. The following method allows The Proxomitron to take several passes through the filter list to complete the job, rather than using the stack to do everything in one pass.

Comments begin with < and end with > - if the leading < were to be changed part way through the recursion, the bounds would stop matching; if the trailing > were changed, the bounds would then extend to the end of the next comment after the one being worked on. So, to begin:

Name = "Comment Viewer stage 1" Multi = TRUE Bounds = "" Match = "?\1<\2" Replace = "<\1<\2"
This filter changes all the <'s. The 'match any single character' ? is used to match the leading < so the rest of the match can't affect it, and the replace restores it.

Name = "Comment Viewer stage 2" Multi = TRUE Bounds = "" Match="(^[^>]++-->)\1>\2" Replace = " \1>\2"
The ? technique of stage 1 can't be used for a trailing character - the preceding \2 will never leave anything left for a trailing ? to match. This filter changes a > unless it is preceeded with --. As with the meta filter, the portion in blue is TRUE if the test reaches a -->, but FALSE if a > has been found before the -->. The leading ^ inverts the blue value, so if the match occurs, the > is replaced by >, otherwise stage 2 takes no action.

Note again the space before the \1 in the Replace of filter 2. In this case, since stage 1 and 2 operate as one, the recursion space need only be added by stage 2.

Now (at last!) you can safely make everything visible with

Name = "Comment Viewer stage 3" Match = "" Replace = "<small>[\1]</small>"

The Match clause puts everything within the comment into variable \1. The Proxomitron then sends a 'set font one size smaller' to your browser, a left square bracket [, all the comment (variable \1), then right square bracket ], and finally restores normal font size. In short, it converts an invisible feature to a visible feature. And, at this point, neither Multi nor leading space are needed - the job is done.

Get rid of CGI URL's

Some search engines don't give you the real links to indexed sites, but instead bury them in a CGI code so they can collect revenue from clicks. Really, you should use impartial search engines - the CGI code is used to collect revenues from sites who pay for preferential appearance in your searches. But if you insist, here is how you get to the real sites with such engines. The syntax of these CGI links is:

they are within bounds <a\s*>,
they have a URL (that may be relative) prior to an ? and another after, and it is the latter one that is wanted,
the wanted URL begins with http:, https: or ftp: (that is, they are always offsite), and
the wanted URL may end with any of "'<& and may be escaped with %hh equivalents.

So, within <a\s*> we have to look for href= followed by ? in turn followed by (http:|https:|ftp:) in turn followed by one of "'<& then grab the parts we want:

Name = "Convert CGI links to real ones" Bounds = "<a\s*>" Match = "*href=*\?*(http(s|):|ftp:)\1\2["'&<]*" Replace = "<a href="$UESC(\1\2)">"

? is a metacharacter in the filter language, so has to be escaped with \. Since there is no space between \1 and the preceding parenthesis, \1 grabs whatever is matched within the parentheses, that is, http: https: or ftp:, whichever it is. \2 grabs everything up to the next character that can't be part of a valid URL. $UESC then unescapes the hex code stuff in the URL you want so you can read it. Everything else is dumped, including JavaJunk that can spoof you into thinking that you are visiting mother.teresa.god when you are actually being sent to raw.sex.com.

Use lists to deal with our complex world

Name = "Kill all Images on selected pages" URL = "$LST(NoImages)" Match = "<i(mg|mage|nput)\s*>"

Sometimes things get so messy that you need a big long list of things you trust, or things you don't. This filter uses such a list. Open up The Proxomitron Config/Blockfile window, and you will see a list called NoImages, followed by the name of the file that contains the list. $LST(NoImages) tells The Proxomitron to start by checking to see if the URL you are viewing is in that list. If it isn't, then the filter is not activated, but if it is, the filter is triggered by any of the three ways in which images are identified in various versions of HTML.

The | character is the OR function. Parentheses () group rules together. So, the filter is triggered by <i followed by (mg OR mage OR nput). In full, by <img, <image, or <input. Everything about an image ends with the next >, so all the filter has to do is to match everything that follows one of these three sequences until a > is found, then stop. As with the previous filters, this filter just replaces the whole image tag with nothing, so your browser never sees it.

Want to replace images with [IMAGE] the way Lynx does? Just add
Replace = "[IMAGE]"

Change the entire configuration of The Proxomitron

If things really get desperate - if you can't access somesite.com you'll get fired or divorced :-) a filter can replace the entire configuration of The Proxomitron. First, put this filter in default.cfg:
URL = "*somesite.com/" Match = "\1" Replace = "$JUMP(http://load//altconfig.cfg?\1)"

Then in altconfig.cfg put:
URL = "^(*somesite.com/)" Match = "\1" Replace = "$JUMP(http://load//default.cfg?\1)"

With these filters, when default.cfg is loaded and The Proxomitron sees a URL containing somesite.com, it will load altconfig.cfg. If altconfig.cfg is loaded and it sees any URL that isn't somesite.com, it will load default.cfg.

Get rid of banner ads

This filter is distributed with The Proxomitron. Since ad merchants are continually modifying their methods, this description is designed to enable you to keep it up to date.

Name = "Banner Blaster" Multi = TRUE Bounds = "<a\s*</a>|<input*>" Limit = 2048 Match = " \1<(img|image|input)*>\3 & ( *(href|src)=\w ([^o]ads[./]| /ad| (ad|log_)click| click(thru|er)| (banner|ad|acct|source|click)id=| (to|seed|banner|page)=| banners| adbanner| adfu.| sponsor| adver| promo| redirect )| *http://*<img\s (*width=[#460-480] & *height=[#55-60]| *src=\wcgi\w ) *> )* & (*alt="\2"|) " Replace = "<center>\1[\2]\3</center>"

This is one formidable filter - it's half of WebWasher® or AdSubtract® in one rule! To help understand it, it is shown above indented by structure without the quotes that would be required for it to actually work in The Proxomitron with indenting.

All ads are enclosed in an anchor or input tag, because the whole point is to persuade you to click on them to generate revenue. So, the bounds restrict filter activity to those tags. Multi is set TRUE because ad pushers try everything they can think of to get your attention. (The filter adds <center> at the beginning of its replace line so recursion works. If you take that out, add a space instead.)

The Match has three sections, connected by & (AND). So, all three sections must match for it to activate.

The first section looks for an image or input tag. If there is one, everything within the bounds prior to the tag is put into \1, everything after into \3. (If there is no image or input, the filter quits right there.)

The second section is two lists of things to test for separated by the purple | (OR). So, if either of them is satisfied, this section matches and the filter is activated. The first list begins with the first blue * and checks for either href or src followed by any one of the kinds of path or file names that servers use for ads. (Why [^o]ads[./]? To avoid blocking 'downloads'!) The second list begins with the second blue * and looks for an off-site image with dimensions that are almost always ads. It also checks for images delivered by CGI because these are invariably ads. The three blue *'s ensure that this section can match no matter where in the bounds the matched element is.

The third section looks for something or nothing, the 'something' being an ALT text. So this section puts any ALT text into \2, but matches even if there is no ALT.

Finally, if the filter matched, it outputs anything that existed before the image, any ALT text it found, then anything that existed after the image. So, you can still click on it if you really want to. If you simply want to get rid of ads altogether, omit the Replace and Multi lines.