The Proxomitron: Technical Details

The Proxomitron

Technical Details

Scott R. Lemmon
and
John Sankey 2002

Introduction
Technical Details
Web Page Filter Editor
Using Limit
Using Bounds
Using URL Match
Using "Allow for multiple matches"
Header Filter Editor
MIME encode/decode and password strings
Matching Expression Test Window
The Log Window
Autorunning programs and URLs
Creating and Editing Lists
External Proxies
An Introduction to text matching
Should you cache pages or not?
What are cookies?
Filter Language
Filter Examples

The Proxomitron
Keeping an eye on the Web.
For You.

Web Page Filter Editor

Here is where you can modify the matching rules that allow The Proxomitron to re-write web pages.

At its simplest the matching rules work much the same as a word processor's "Search and Replace" function. Any text matching the "Matching Expression" will be replaced with the text in the "Replacement Text" section. A matching expression of "mouse" with a replacement text of "rodent" for instance, would change every occurrence of "mouse" on a web page to "rodent". Very handy for disabling a lot of JavaScript nonsense.

Things get less simple when you start using the full set of Matching Rules, but they are needed to undo the complicated things remote sites try to do to you.

If you download filters from other people, you may find it easier to use Notepad to add the filter to your list than to cut&paste the filter lines individually into this editor. If you do, note that the "Active=" line is not handled by the Web Page Filter Editor, but instead reflects the checkmarks in your Web Page Filter list.

Using Limit

In HTML it's common for tags to run for several lines, in fact some entire pages are one JavaScript now! The Limit controls how many characters The Proxomitron looks forward before giving up. If the filter specifies a Bounds, Limit applies to Bounds, otherwise it applies to Match. Normally keep this small - for most tags, a value of 256 is fine - in fact, if you don't specify a Limit for a filter, The Proxomitron defaults to 256. Increase it if you find a rule that should match isn't working. If not for Limit, the entire web page has to be scanned by some filters before it could be sure a there was no match (for example, any tag with bounds <body\s*</body> would). If you don't see anything with your browser until the entire page has been transmitted, check the Limits of your enabled filters.

The best size to use depends on the tag - in particular, a "<script ... </script>" tag needs a large limit. The Proxomitron allocates a buffer equal to the limit size when a filter is active, and uses 16-bit memory allocation, so the maximum effective value of Limit is 32767.

Using Bounds

A Bound is an initial matching expression used to control the range (or boundaries) of the main matching expression. Normally a bounds check simply consists of the HTML start and end tags with an asterisk in between - "<script * </script>" Anything valid in the matching expression can be used here, but with a bounds check - the simpler it is, the better. Its use is often optional - you don't need it for many simple matching expressions. However, with complex matches it can improve performance, since the main expression need only be checked if the bounds returns true. More importantly, it's used to prevent a rule from matching more text than intended. Take the following rule intended to match a web link....

Matching: <a * href="slugcakes.html" > * </a>
If matched against the following text...
<a href="crabcakes.html" > some stuff </a><br> <a href="slugcakes.html" > other stuff </a>
the first asterisk would match everything in blue and the match would grab both links instead of just the second one! By using a bounds match like "<a\s* </a>" it's restricted to only checking one link at a time.

When not using bounds , never place wildcards at the beginning or end of a matching expression (as in " *foo* "). This results in the rule grabbing however many characters the byte limit is set to - not usually what you want. (Setting Bounds="*" does the same thing - leave the Bounds field blank when it's not needed.)

When using bounds however, the situation reverses. Since the bound selects the range of text being searched, the matching expression must match everything matched by the bound for the rule as a whole to match. The easiest way to do this is by using wildcards at both the beginning and end of the expression. Often matching variables are used (as in "\1 foo \2") so the text surrounding the part of the tag you're matching can be captured and included in the replacement text.

Here's an example to matching a link like
<a href="http://somewhere"> some text </a> Bounds : <a\s*</a> Limit: 128 Matching : * href="\1" * > * Replace : <a href="\1"> some new link next </a>

Using URL Match

You can use URL Match to limit a filter to affect only certain web sources. All matching rules apply here, so you need only match part of the URL. Multiple sources can be included by using the OR symbol "|" as in "www.this.com|www.this.too.com", and can be excluded by using negation "(^...)" as in "(^www.not.this.page)". There are also several commands that are specific to URL matching. Remember that they apply to the source URL, not to everything on a page from that URL - in particular, offsite images will not be filtered.

The "http://" portion of the URL is always removed prior to matching - don't test for it. If no restriction is needed on URL, leave the field blank - it's a bit more efficient for The Proxomitron than URL="*"

Using "Allow for multiple matches"

The way The Proxomitron filters web pages is:

1 point to the first character of the page
2 apply each filter rule in turn, in the order they appear
3 IF a filter matches
  THEN replace characters as specified AND
       IF multi=FALSE
          THEN move the pointer to the first character following
               the replacement characters and return to step 2
4 at the end of the filter rule list,
     move the pointer one character forward AND
     return to step 2.

So without multi, as soon as one filter matches, no following filters are able to process the matched section - it's first come, first served. This permits a filter to take precedence over later ones - often very useful. This doesn't always work however. Take the "<body ... >" tag - It contains several elements that we need to filter independently because they can appear in any order in any combination. For instance, if we had two filters - one that changed the default text colour and another that changed the background image - we'd have a problem. The first filter would prevent the second one from working by "using-up" the text up to and including the text colour element. The <body would no longer be there for the background image filter to match even if the background colour text appeared after the text colour text.

This is where the multi option comes in - it lets following filters process the same text. In the above scenario, if we enabled Multi on the text colour filter, the background image filter could then also match.

Use this feature sparingly - although powerful, it can lead to recursive matching. Consider the following situation - say there's a filter with a matching clause of "frog" and a replacement text of "<b>frog</b>". If this filter had multiple match enabled, the first time it "sees" the word frog it inserts the phrase "<b>frog</b>". But the scan continues forward until it hits the new "frog" and the process repeats itself nonstop.

If "frog" had been the first word in the replacement text this wouldn't have happened. The next time the filter is invoked it would be one letter forward - it would see "rog" instead of "frog". With Multi you must be sure your filter won't match its own replacement text minus its first character.

Header Filter Editor

Here is where you change HTTP header filters or create new ones.

Unlike web page filters, the header filter's name is very important - it's the name of the header you want filtered and, except for case, it must match exactly (no wildcards). A comment can be added after the colon ":" which will be ignored during filter matching. The matching clause and replacement text work similar to the web page filter editor but match on the header's contents only (not the header's name).

A header filter can basically do one of three things: delete an existing header, modify an existing header, or add a new header.

To change a header both the matching expression and replacement text must have something in them. Just use "*" in the matching clause to match all headers of that type.
To delete a header include a matching expression, but leave the replacement text blank.
To add a header include a replacement text, but leave the matching expression blank.

Request Headers - things your browser tells the world

Host: Contains the host name of the web page you're contacting (as in "www.somewhere.com"). It's used by web servers for "virtual hosting" where the same web server is used for several different sites. Don't touch it.
Accept: Contains a list of file and image types your browser understands.
Accept-Language: Contains a list of preferred languages - intended for allowing multi-lingual web pages that automatically choose your language. In some parts of the world, you may not wish everyone to know your native language.
Pragma: no-cache. Sent by web browsers when a page is reloaded. It tells remote caches to send a fresh version of the page instead of a copy it may have stored. The Proxomitron uses this to detect when a browser's "reload" button is pressed.

Reply Headers - things the world tells your browser

Server: Contains the name and version of the web server. Not used by your browser as far as I know.
Cache-control: affects how pages are stored in your browser's cache. A value of "Private" indicates the contents should not be stored at all, while "max-age" give an indication of how long the page provider feels the page should be kept. Often abused by people who get ad revenue from their pages and want to increase their hit count, at your expense since you have to wait for the reloads.
Pragma: no-cache In a reply, this indicates that your web browser should not store the page in its cache. Intended for temporary web pages like search engine results, but mostly abused.
Expires: Another header used by your browser's cache - contains a date when the web page's contents will have probably changed. It usually is only speculative.
Date: The web server's idea of the current date and time.
Last-Modified: The date and time the web page was last modified. Also used by your browser's cache. Most large sites now generate their pages 'on the fly' and always set this to the time of the current request whether the data has changed or not, so it's of little use. For earlier versions of Internet Explorer, you may have to delete this header along with "If-Modified-Since" to "force" the browser to always reload a web page.
Content-Type: Contains the type of data the server is sending, for example "text/html" is used for a web page and "image/gif" for a .GIF file.
Accept-Ranges: Part of HTTP 1.1 - "Bytes" indicates a server supports retrieving arbitrary sections of a file. Used by some download utilities to "resume" an interrupted file transfer.
ETag: A little like a checksum (but only very little). It contains a string that is supposed to change every time the web page is updated. The string has no real meaning other than that. Again something to be used by your browser's cache.
Connection: Another HTTP 1.1 header - "close" indicates no more data will be send on this connection (as was always the case in HTTP 1.0). HTTP 1.1 supports the idea of "persistent connections" where the same connection is "reused" to send multiple items. Supposed to be better, but if you ask me, it mainly benefits the web server (by reducing load) not the person viewing the web page - since it serializes the loading of web page elements.
Content-Location: The URL where the data came from - not always present.
Location: Used by URL re-directors to automatically forward a browser to a new location. Not quite the same as Content-Location for some reason ;-)
Content-Length: The length in bytes of the web page or file being sent.
Set-cookie: A request to your browser to store the information contained in this header and send it back to this server the next time you visit one of their web pages. Delete these, and your browser will never receive any cookies. You can customise this to selectively accept cookies from sites you trust.

MIME encode/decode and password strings

Right-clicking over the replacement text window will reveal an option to MIME encode/decode an selected text. This can be used to create password entries to automatically log you into sites or proxy servers that require a password. For web sites use the following header filter...
Name = Authorization: URL = The site you wish to send the password to Replace = basic username:password
Next select the username:password portion and select MIME > encode from the context menu. The end result should look something like this...
basic dXNlcm5hbWU6cGFzc3dvcmQ=

When enabled, this rule will send your password string to the server automatically and you will not be prompted for a login. Note: this only works for sites using the "basic" HTTP authentication scheme. You can use the decode option if you need to see or change your password at a later date.

Creating a proxy server password is much the same - just change the header name to...
Name = Proxy-Authorization:
and remove any URL match (since the proxy server will be used for all URLs)

Matching Expression Test Window

This lets you see first-hand if a particular expression will match any given text, and (if applicable) shows what the resulting text will look like. No more web page reloading just to check every little modification you make to a filter - just "paste and test".

You get here by pressing the little test button in the Web Page Filter Editor or selecting Test Matching from the right-click context menu of the HTTP filter or URL matching windows. Using the tester is really pretty straight-forward. Just enter some stuff to match in the top window (this will often be HTML source snipped from a web page), and press the TEST button. The result window will either show you what results your match has wrought, or tell you the rule simply didn't match at all.

The tester window doesn't have to be closed to go back to the filter editor. You can simply flip between the editor and the tester window and any changes made to a filter will be immediately recognized the next time you press "test". This makes it easy to see what effect a change will have before you actually update the filter by pressing the editor's "OK" button.

Note: once you feel a filter is OK, and hit the Apply button of the editor window, the change still is not saved to disk - it is only applied 'in memory'. To save any change to a .cfg file, you must explicitly save it using the File menu.

The Log Window

The log window is used to display information about ongoing activity. This includes HTTP header messages sent between your web browser and the Internet, and information on which web page filters matched a particular page. To see them click the "Log Window" button on The Proxomitron's main window. Messages are only logged while the log window is open and are lost when the window is closed. (This is for efficiency's sake since logging information slows down The Proxomitron.) Log windows messages can be saved to the windows clipboard - first select the messages then either click the "Copy" option from the log window's "Edit" menu, or press CTRL-C.

Pause: Temporarily suspends logging
Reset: Clears the log window
HTML filters: Enable/disable logging of web page filter matches
HTTP filters: Enable/disable logging of header messages

The edit menu can be accessed by right-clicking anywhere over the log window.

Log window messages are color coded...

Green: Request messages from your browser to the Internet
Yellow: Response messages from an Internet web server to your browser
White: Informational - such as the start and end of new connections
Red: Error messages
Violet: Web page filter matches
Cyan: Proxy test progress messages

Since requests for web pages may come in any order and are often mixed together, The Proxomitron numbers all requests your browser makes. You can use these number to track, for instance, which request goes with which reply.

Autorunning programs and URLs

The Proxomitron allows you to autostart a program whenever it's first started or when you click on the "Run Program or URL" button on the main screen or taskbar menu. When used with a program, this provides a handy way to launch your browser along with The Proxomitron. When used with a URL it allows you to save config files that contain settings specially tailored to a website then launch that page whenever the config file is loaded. You can use this a bit like browser bookmarks to easily jump to a site with specific filters already in place.

If Run at startup is checked the program/URL will be started automatically. If unchecked, the program/URL will only be run whenever the Run application or URL button is pressed on the main screen or selected from the system tray context menu.

Creating and Editing Lists

To create a list, create a text file, one URL expression per line, without the "http://". Call it something like bypass.txt. In The Proxomitron's main window, click Config, Blockfile, Add, select your new text file to add it to the list of blockfiles, then give it a unique 'listname'. Now you can use $LST('listname') in the URL field of any filter.

To edit an existing list, look in The Proxomitron's Config/Blockfile window to find out what it is called, then edit the file with any text editor. If it was supplied with The Proxomitron, give it a new filename and enter it with a new listname in Config/Blockfile so an upgrade doesn't undo your work.

An efficiency note: The Proxomitron indexes URL expressions whenever it can to speed lookup in large lists. This feature does not affect the way any list works, but use of indexable expressions can speed things up with long lists and slow computers because most expressions can be 'skipped over' with an index. All fixed expressions are indexed (such as www.sitetracker.com/). Expressions that begin without any variables (such as \s, AND or OR) are indexed up to the first variable (the blue part of www.somesite.com/\w/ads.html is indexed). URLs that begin with *, \w, [...]+, [...]++, and (...|) and have no further variables for the rest of the hostname up to the first "/" can also be indexed ([^/]++.doubleclick.net/ can be indexed, [^/]++doubleclick can't). The Proxomitron has to check every non-indexable expression every time it checks the list, which takes more time.

External Proxies

The Proxomitron can itself use a proxy server. In fact, you can maintain a list of multiple proxy servers and switch between them.

What can an "external" proxy (any proxy other than The Proxomitron) do? The most common use for a proxy server is to act as a global webpage cache in order to speed up web browsing (especially overseas or where Internet connection may be slow or unreliable). However they can do much more - anything from filtering out material inappropriate for children, to translating a webpage from one language to another.

You can follow a proxy entry with any comment text you like (using a space to separate them). This can be helpful to remind you what a particular proxy does or how reliable it has been. Just use the following format...
proxy.host.name:port comment text here

To create a list of proxy server entries, click "Add" and enter the list file name in the format $LST(bypass.txt). You can also cut&paste a list into the Add window, or enter multiple items by pressing Ctrl+Enter after each entry in the add dialog.

If you use a proxy that requires a password (as is often the case with firewalls) you can create a HTTP header filter to automatically send the password to your server.

The Proxomitron can perform a loopback test on an external proxy. During the test The Proxomitron makes a request for a web page - through the remote proxy - back to itself. It then monitors both ends of the conversation. Use it to tell if a particular proxy server is accessible and available for use. If the test is successful, the status window will show the name of the proxy server as it would appear to the remote host. Otherwise an error message will be reported.

Far more detailed information, including HTTP headers added by the remote proxy, can be viewed if the log window is open during the test.

Filtering Secure Pages

Secure pages are coded in a special way to reduce the chances that anyone but you and the secure server can read what you send. You can tell The Proxomitron to filter such pages, as long as you understand that there are some risks involved. For your own safety, it's best to not filter sites where you need to be really secure, like on line banking and such.

To filter secure pages, download ssleay32.dll and libeay32.dll. Because of legal and patent problems involved in the USA with any program that uses encryption, The Proxomitron cannot include encryption code directly; you must find them yourself. (Do a search for /SSL/ or "stunnel" on the web.) Put the dll's in the same directory as The Proxomitron. Next, set your browser to use localhost 8080 for the Secure protocol (in IE4, this is done under View/Internet Options/Connection/Proxy Advanced.) Finally, check the Use SSLeay/OpenSSL box under The Proxomitron's Config/HTTP tab.

When you do, The Proxomitron decrypts incoming data, filters it, then re-encrypts it before sending it on. This allows for nearly transparent filtering and full control over https connections.

There are some limitations to secure page handling. In order for The Proxomitron to act as a SSL server it must have a certificate. Certificates are used by web servers to identify themselves to your web browser. They must be digitally "signed" by a known company like VeriSign or your browser will generate a warning. The Proxomitron's certificate is located in the "proxcert.pem" file. It's a self-signed certificate created using SSLeay. As such it is not secure by itself, but it's only used for the connection between The Proxomitron and your web browser - the connection between The Proxomitron and the remote site is secure since it relies on the site's certificate not The Proxomitron's. Normally the local connection to your browser never passes outside your PC, so that's OK. When you first visit a secure site being filtered through The Proxomitron, your browser will issue a warning. That's because The Proxomitron's certificate will not match the name of the site you're visiting. These warning are unavoidable since SSL was intentionally designed to prevent an intermediary from secretly intercepting your data. The Proxomitron is intercepting your data, but under your control!

The Proxomitron lets you convert SSL connections to non-SSL ones. If the original URL is
https://some.secure.site.com/
use
http://https..some.secure.site.com/
With this command, The Proxomitron does not encript the connection to your browser, but the connection from The Proxomitron to the final server is still encrypted. Then, your browser thinks it's got a normal connection and won't do any certificate checks. This can used to access secure pages from browsers that don't have https support.

The Proxomitron does no certificate checking of it's own. If you don't trust a website, use the secure filtering option, and put The Proxomitron in Bypass mode. This will allow your browser to validate the site as it normally would.

Keep in mind: certificates are just used to help ensure you are actually connecting to the site you think you are and not some "spoofed" site. They have nothing to do with how encrypted your data is. Many sites (especially smaller ones) may not be using legally signed certificates, but this doesn't mean you're not secure connecting to them. Really all it means is they didn't cough up some money for VeriSign's official stamp of approval. Likewise, a valid certificate is no guarantee a site won't rip you off - you must still be careful before trusting a site with sensitive data.

An Introduction to text matching

To those unfamiliar with matching languages they look very cryptic at first! The idea is basically simple: certain characters, often called wildcards or meta characters, are given special meaning. Each of these characters will match parts of the original text only if they meet certain conditions. The text that's been matched can then be replaced by something else.

For example, an asterisk "*" will match any unknown group of characters, no matter what they are. It's normally used to match a section of text you're not sure about. For instance, say you were trying to match any word that ended with the letters "ko". Using "*ko" would match "Naoko" or "Atsuko" but it wouldn't match "Michie". While, something like "john*smith" would match "John W Smith", "John 'Bubba' Smith", not to mention plain 'ol "John Smith".

Applying the idea to HTML - say you wanted to match all image tags. An image tag always begins with <img and ends with >, but it can also have any number of things in between. A matching expression like <img*> gets it. It's like saying: Match anything that starts with <img, possibly has some other stuff here, then ends with >.

However, never forget that computer matching languages like that used for The Proxomitron are for computers to follow. They have to be absolutely precise. A loose concept like 'a space' in English is not good enough for computers. The definition used by 'space' in The Proxomitron is one that is suited for matching HTML - it is anything that separates things. That means that it matches not just one, two ... spaces, it also matches no-spaces-at-all because there are many characters that can separate things in HTML besides a space. If you want to ensure that there is at least one space in a match (or tab or newline - they are both spaces to HTML), use \s.

The asterisk * is similar - it matches anything including nothing at all. If you want to ensure that there is at least one character to match, use ?*.

\w is similar to *, but only matches the first word i.e. it stops at the first 'space'. It doesn't match quoted things, because quotes are separators. To match background="x.x", you must use background="*", not background=\w

Once you have matched some text, you often want to keep it. Parenthesis "(...)" followed by "\1" with no intervening space says "Stick whatever is matched between ( and ) and place it into variable number one". A "\1" in the replacement text then inserts the contents of variable number one at that location. The Proxomitron matching language features ten such variables numbered 0-9, plus a stack for recursive replacements.

Look at the following...
Matching: <img * src=(\w)\1 * > Replace: <img src=\1 border=1 >
Put into action, the above rule would re-write an image tag that looked like...
<img align=left src="bison.gif" alt="My pet bison Phil" >
into...
<img src="bison.gif" border=1 >
How? The part in blue is what the first "*" matched. The part in red is what the "(\w)\1" matched. The part in green is what the last "*" matched.

Notice that the blue and green bits never appear in the replacement text. Only the bit we decided to keep by using the number variable does. By deciding what to keep and what to throw away we can completely rework a bit of HTML. For example, say we wanted to change the above image so that instead of showing us our bison, it gave us a link we could click to see it. If we changed the replacement text to read...
Replace: <a href=\1 > a picture </a>
then the resulting text would be
<a href="bison.gif" > a picture </a>

Should you cache pages or not?

A browser cache contains copies made on your computer disk (fast) of files it has downloaded from the Internet (slow). When you click on a new link on the web, your computer makes a request to a server out on the Internet. It can take a long time for your computer to download the desired files, particularly if you are on a dialup line and the pages are large. When you want to use the Back button to surf another route, or to check something you saw before, a cached page can be seen almost instantly, because it comes from your hard disk the second time. Images used by a site on all their pages only have to be sent once, rather than separately for every page.

There is a problem with caching, however. When a remote site changes a page, you will still see the old version if it is cached. So, sites that frequently update their pages (newspapers, for example) demand that browsers not cache their pages

As with everything, there is a balance. You can choose to ignore the remote site's wishes with The Proxomitron. But, you have to understand the problems that can arise when you use a cached page that has completely changed since you cached it - not only may information be inconsistent between pages, links can fail, even go to the wrong place.

John has a hot finger on the Back button, so enables all The Proxomitron filters that block caching, but uses a program provided by Microsoft (TweakUI) to clear the cache at the end of each browser session. This keeps disk usage from piling up, and keeps reload time down during a session. A page can always be specifically saved if wanted.

Experiment - find your own balance. The Proxomitron empowers you.

What are cookies?

Cookies are perhaps the most misunderstood of all web critters. Cookies contain a line of text information sent by a web server to your browser. All your browser ever does with them is send them back to the server unchanged. 'Session cookies' are deleted when you close your browser. But, 'persistent cookies' stay on your disk until you erase them. They can really foul up people who share a single system.

Cookies contain text that the server uses to tell you apart from other visitors as you browse through their site. They often have to - there is no such thing as a "connection" on the Internet the way you connect to your server via a telephone line. If what you want to see at a site depends on what you saw previously, cookies are the way it is done. For example, on-line stores use them to keep track of your 'shopping basket'. Remote servers that you log on to use them to maintain your logged-in status. Sites that offer you personal customization use them to locate your preferences. A site is not supposed to be able to look at any other site's cookies.

But, cookies are also used by marketing types to focus ad campaigns that are in their interests, not yours. They can be used for other invasions of your privacy, especially when many different servers use the same cookie server, as doubleclick clients do.

You can use The Proxomitron's cookie filters to stop standard cookies from being sent, or to return fake cookies. By including a URL match, you can send standard cookies only to the sites where you benefit from them, like your favourite on-line store. But, there are a myriad of non-standard ways that browser writers allow sites to foist cookies on you, via JavaScript, applets or "multipart" pages.

Internet Explorer keeps each cookie as a separate file, so there should be a simple way to convert undesired cookies to session-only: run a DOS command to erase them all whenever you start your browser. With IE4 and NT4, use
del %systemroot%\profiles\%username%\cookies\*.* /q
Cookies from desired sites, such as those that let you automatically log on to a site you like, can be kept by making them read-only. Unfortunately, IE sneakily keeps cookies in several places besides the cookie directory. I find AdAware useful to find them.

With Netscape, things are a bit different but it's the same idea. Netscape stores all cookies in a single file, so the equivalent of the IE command should be to delete the 'new' file and restore an 'old' file that has the cookies you want in it. Normally the file is in
C:\Program Files\Netscape\Communicator\users\{your profile name}\cookies.txt
so the DOS command file should be something like
cd C:\Program Files\Netscape\Communicator\users\{your profile name} copy wantedcookies.txt cookies.txt
where wantedcookies.txt contains only the cookies you want to keep. Good luck.