Email Spiders

Warning: this article is not for the squeamish. It contains graphic descriptions of one of the biggest evils on the internet. If you can face down this evil you can reduce your load of spam severalfold. Hold onto your seats and try to keep down your lunch - you are about to learn one of the secrets of how ruthless, unethical and, well, downright evil spammers steal your email address - and what you can do about it.

If you have access to your web site's log files, you will quickly find an interesting phenomenon: your site is being visited a lot more often than you think it is. If you look closely, you may be shocked to find that your HTML files are actually being used to harm you and others. You may be seeing the footprints left by some of the tools that unscrupulous spammers use to steal your email addresses.

Oh wait, let me back up a bit and explain a few things. Each time you visit a web site a record is kept of every page, graphic, sound file, video or anything else that you access (look at or download). This record is called a log file. Each line within the log file is one "hit" (other things are recorded also, but that is not important to this discussion). A "hit" is getting one "thing" from a web site. A "thing" can be an image, an HTML page, a video, a sound file or anything else. In fact, generally when you look at one HTML page you are actually "hitting" the web site many times, once for each file on the page.

Each of these lines within the log file records a number of pieces of information so that webmasters can later see what happened (don't worry, they are not generally interested in individuals - they want to know things like how many people are using Internet Explorer versus Netscape). One critical piece of information is called the "user agent". Generally this contains the browser name (Internet Explorer for example) or spider name (googlebot, for example, is the spider for the Google search engine). 
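To make this concrete, here is a minimal Python sketch of pulling the user agent out of one log line. It assumes the server writes the common Apache/NCSA "combined" log layout; your server may be configured differently, so treat the pattern as a starting point. The function name parse_hit and the sample line are made up for illustration.

```python
import re

# Assumption: Apache/NCSA "combined" log format. Other servers or custom
# configurations will need a different pattern.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_hit(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None

# A made-up sample hit, not taken from any real log.
sample = (
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

hit = parse_hit(sample)
print(hit["agent"])
```

Each hit becomes a small dictionary, and the "agent" entry is the user agent field discussed above.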

Examine these user agent fields and you will find out many interesting facts. You will see that your site is being visited a lot more often than you would think by lots of things with strange names.

Most of these are innocent 'bots, used by the major search engines to keep their indexes up to date. These robots are very important, for they keep your pages listed so you will get traffic. Occasionally they have other uses, including checking your pages for changes, saving your pages for offline browsing and various statistical functions.

You will also find some other names buried in your log files. These go by names such as EmailSiphon and Cherry Picker. These robots are malignant and are used by spammers to harvest email addresses. What they do is scan every single page in your web site, as fast as they can, looking for email addresses. Specifically, they are usually looking for "mailto:" type links.
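If you want to check your own logs for these visitors, a rough Python sketch follows. It only knows the two names mentioned in this article, and the exact strings these programs send are an assumption here (the check ignores case and spaces to allow for variations such as "CherryPicker"). The function names are hypothetical, and a harvester can always lie about its user agent, so a clean result proves nothing.

```python
# Names taken from this article, lowercased with spaces removed. Real lists
# are longer and change over time; these two are only a starting point.
HARVESTER_NAMES = ("emailsiphon", "cherrypicker")

def is_harvester(user_agent):
    """True if the user agent string contains a known harvester name."""
    squashed = user_agent.lower().replace(" ", "")
    return any(name in squashed for name in HARVESTER_NAMES)

def count_harvester_hits(user_agents):
    """Tally hits per harvester user agent string."""
    counts = {}
    for agent in user_agents:
        if is_harvester(agent):
            counts[agent] = counts.get(agent, 0) + 1
    return counts

# Made-up user agent values for illustration.
agents = ["googlebot", "EmailSiphon", "Mozilla/4.0", "EmailSiphon", "Cherry Picker"]
print(count_harvester_hits(agents))
```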

Many websites have these kinds of links. They are convenient, simple and create a great way for visitors to send an email to someone. In fact, it's hard to find a website which does not have email addresses embedded somewhere within the site.
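To see what a harvester would find on one of your own pages, you can audit the HTML yourself. The sketch below uses Python's standard html.parser module to list the addresses in "mailto:" links; the class and function names are made up for this example, and it looks only at link targets, not at addresses written out in plain text.

```python
from html.parser import HTMLParser

class MailtoFinder(HTMLParser):
    """Collects the addresses found in <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the "mailto:" scheme and any "?subject=..." part.
                address = value[len("mailto:"):].split("?", 1)[0]
                self.addresses.append(address)

def find_mailto_addresses(html):
    """Return every mailto address exposed in the given HTML text."""
    finder = MailtoFinder()
    finder.feed(html)
    return finder.addresses

# A made-up page fragment.
page = ('<p><a href="mailto:webmaster@example.com?subject=Hi">Mail me</a> '
        '<a href="/about.html">About</a></p>')
print(find_mailto_addresses(page))
```

Anything this turns up is exactly what a spider scanning for "mailto:" links would collect.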

In addition, people often leave their email addresses in guestbooks, message boards and other online communities which translate to web pages. Spam harvesters (these are also called "screen scrapers") love these types of pages, as they can get dozens, hundreds or even thousands of different, valid and usable email addresses quickly and easily.

How do email harvesters work? Well, some scum spammer will install one of these programs on his system. He will tell it to begin scanning, which it will do rapidly and efficiently. In fact, these programs generally scan a web site so quickly that the server can do little else in the meantime (most "good" spiders, on the other hand, limit their visits to one per second, minute or even hour in order to allow other people and spiders to use the site while it is being scanned).
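That difference in speed is something you can look for in your logs. Here is a rough Python sketch that flags any visitor making more than a set number of requests inside a short time window. The function name, the threshold of 10 hits and the one second window are all assumptions chosen for illustration; an ordinary browser also makes a burst of hits when it loads a page full of images, so tune the numbers to your own site before trusting the result.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def fast_clients(hits, max_hits=10, window_seconds=1):
    """Return the hosts that made more than max_hits requests in any window.

    hits is an iterable of (host, datetime) pairs, one per log line.
    """
    by_host = defaultdict(list)
    for host, when in hits:
        by_host[host].append(when)

    window = timedelta(seconds=window_seconds)
    flagged = set()
    for host, times in by_host.items():
        times.sort()
        start = 0
        for end, when in enumerate(times):
            # Slide the start forward until the window fits again.
            while when - times[start] > window:
                start += 1
            if end - start + 1 > max_hits:
                flagged.add(host)
                break
    return flagged

# Made-up traffic: one client hammering the site, one visiting politely.
base = datetime(2000, 1, 1, 12, 0, 0)
burst = [("203.0.113.5", base + timedelta(milliseconds=50 * i)) for i in range(20)]
polite = [("198.51.100.7", base + timedelta(seconds=5 * i)) for i in range(20)]
print(fast_clients(burst + polite))
```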

One of the more popular email harvester programs is called EmailSiphon (a product known as Sonic). The web site which promotes this garbage has the following to say:

"First of its kind on the market, Sonic helps you extract highly targeted email addresses from World Wide Web pages. Earthonline Internet marketing expertise has enabled us to program a powerful, yet sensible product that allows for proven focused lead harvesting. Therefore, Sonic with its search engine ability and single domain capability is only second in World Wide Web extraction to Earthonline Nitro."

Obviously these scumbags think they are doing a great service to the world by providing the opportunity to scan thousands of sites per day for email addresses. 

Okay, so what can you do?

Does this work? Sure - occasionally, but it also does not prevent the spammers from getting your other email addresses, and it still chews up resources (web servers and bandwidth) sending useless messages all over the internet. 

So there you have it. I hope this is of use to you in fighting this internet evil known as email harvesting.

