Spam Bots

Using HTML forms is the easiest way to let your visitors contact you, order goods or simply post something to your website. It is also a far better way of letting someone contact you than sharing your email address with the whole internet. Simply add a form element to your website, provide some input boxes and a nice submit button and you're pretty much set. All that's left is to mail or save the input. However, those of us who have actually tried this soon found our comments sections and mailboxes full of spam, from gibberish to ads for shady products and links to malicious websites.

One of the most important and most logical steps to take is to validate your input. Did my visitor enter anything? Is that even an email address? This alone will strongly reduce the amount of spam that is mailed or saved. However, bots keep getting smarter, and while rejected spam submissions may leave no content behind, the repeated submissions (multiple times per second) still use your site's resources.

In this article, you'll find some easy methods which I've used to successfully battle spam submissions and which can be implemented in PHP, as well as some overall spam-prevention ideas. We'll take a look at the pros and cons of Captchas, validate user input, implement a honeypot and limit the number of submissions.

Captcha

Captchas (Completely Automated Public Turing test to tell Computers and Humans Apart) are used all over the web and have a dubious reputation. They come in multiple forms, from the older, hard to read (or rather decipher) text to Google's ReCaptcha. Their use is also easily understood: bots are often just not able to decipher them, although some say there do exist bots that can solve some of the implementations. Using Captchas will, however, certainly limit the amount of spam you receive.

The main problem with Captchas is that they require additional user input, which is never a good thing. Google's ReCaptcha does provide a pretty user-friendly solution, but I'm not the first, nor the last to get frustrated by other implementations. A Captcha implementation can easily be found online, but this also means they require extra scripts to be loaded on your page, sometimes linked with external resources.

I've also encountered simpler, self-coded implementations which mimic this behaviour by requiring additional user input, e.g. asking the user to solve a riddle or calculate the sum of two numbers. I'm not a fan of this. The effectiveness of easy sums is limited, and riddles may just be a hassle for your visitor. There are some nice examples out there, but often it just feels unprofessional. So if you don't mind asking additional work from your visitor, be my guest, but look for the most professional, easy-to-use one out there. My bet is this will be Google's ReCaptcha.

Validating user input

Validating user input won't prevent bots from submitting your form, but it should always be done when user input will be processed or saved. You can be sure that anything which accepts user input ($_GET parameters count as well!) will at some point be used to test your website's security. Having user input is inevitable, but malicious content is also easily countered. Here are some tests that can be performed.

Most of the time the user will also have to provide an email address. As there are plenty of different possibilities, it's not easy to correctly validate every email. Often the choice is between allowing too many options, thus letting some invalid emails pass, or having a filter that is too strict, wrongly flagging some valid input. Taking a user-friendly approach, it seems better to allow a bit too much. One option to check the validity is to use PHP's filter_var($email, FILTER_VALIDATE_EMAIL). It's the easiest solution and in most cases good enough. Yet there is some discussion of it being incomplete and still wrongly flagging correct emails. You can also implement your own email-checking solution, for which a regular expression is the simplest form. For example, to allow everything in the form something@one.two, only a simple regular expression is needed. It uses \S, which matches anything but whitespace such as spaces, tabs and newlines:

// Returns 1 when the email matches the pattern, 0 otherwise
preg_match('/^\S+@\S+\.\S+$/', 'something@one.two');
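
To get a feel for how the two checks differ, here is a minimal sketch that runs both on the same address. The helper name check_email is made up for this example and is not part of PHP.

```php
<?php
// Hypothetical helper: reports how each check judges one address.
function check_email(string $email): array
{
    return [
        // filter_var() returns the address on success and false on failure.
        'strict' => filter_var($email, FILTER_VALIDATE_EMAIL) !== false,
        // preg_match() returns 1 on a match and 0 otherwise.
        'loose'  => preg_match('/^\S+@\S+\.\S+$/', $email) === 1,
    ];
}
```

An address such as a@@b.c passes the loose pattern but is rejected by filter_var(), which illustrates the "allow a bit too much" trade-off.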

Often your form will also require small strings, for example a name. For my form, I do not allow any special characters to be entered in such fields, e.g. \, ", <, # etc. A simple regular expression will do just that:

// Returns 1 when an invalid character is found, 0 otherwise
preg_match('/[\\\\^<,"@\/{}()*$%?=>:|;#]/', 'John Doe');

It becomes even easier when a numeric value is expected. Simply call is_numeric(). Another easy check is validating that required fields actually contain anything by using empty(). For fields where long strings are expected, for example comments, one can also check if the length of the field is long enough. E.g.: strlen() > 10.
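
Put together, these checks could look roughly like the sketch below. The field names (name, age, comment) and the function validate_fields are assumptions for illustration only. Note that it compares against an empty string rather than calling empty(), because empty('0') is true in PHP and would reject a legitimate "0".

```php
<?php
// Hypothetical validator for a form with name, age and comment fields.
// Returns a list of error messages; an empty list means the input passed.
function validate_fields(array $input): array
{
    $errors = [];
    $name = trim($input['name'] ?? '');
    $age = trim($input['age'] ?? '');
    $comment = trim($input['comment'] ?? '');

    // Required field: must contain something after trimming.
    if ($name === '') {
        $errors[] = 'Name is required.';
    } elseif (preg_match('/[\\\\^<,"@\/{}()*$%?=>:|;#]/', $name) === 1) {
        // Reject the special characters that short fields should not contain.
        $errors[] = 'Name contains an invalid character.';
    }
    // Numeric field.
    if (!is_numeric($age)) {
        $errors[] = 'Age must be a number.';
    }
    // Long text field: require more than 10 characters.
    if (strlen($comment) <= 10) {
        $errors[] = 'Comment is too short.';
    }
    return $errors;
}
```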

Even though these checks will already prevent a lot of malicious input, whenever user input will be handled you should take as many precautions as possible and use the safest implementations. Storing the input in a database should always be done using prepared statements and only a madman calls eval() on user input.
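
As an illustration of the prepared-statement advice, here is a minimal sketch using PDO. It assumes the pdo_sqlite extension is available, and the in-memory database, the comments table and the save_comment function are all invented for this example; a real site would connect to its own database.

```php
<?php
// Assumption: the pdo_sqlite extension is available. The in-memory
// database, table and column names are made up for this example.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE comments (name TEXT, body TEXT)');

function save_comment(PDO $pdo, string $name, string $body): void
{
    // Placeholders keep the values out of the SQL text entirely.
    $stmt = $pdo->prepare('INSERT INTO comments (name, body) VALUES (:name, :body)');
    $stmt->execute([':name' => $name, ':body' => $body]);
}

// Input that would break a query built by string concatenation.
save_comment($pdo, "Robert'); DROP TABLE comments;--", 'Hello');
```

The malicious-looking name is stored as plain text and the table survives, because the value never becomes part of the SQL statement itself.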

Finally, whenever you validate user input, also take into account that real users might accidentally make mistakes. There is nothing more frustrating than making a typo, trying to submit, and seeing all your input disappear. So take a user-friendly approach and don't simply reject the form, but allow the user to edit their input by setting the fields' values to the previously entered values, at least for the valid fields. Besides that, show error notes to guide the user to the invalid fields and use the correct HTML input types to prevent wrong input even before it is submitted.
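
One possible way to refill a field is sketched below. The helper render_field and the error CSS class are hypothetical; the important part is that the old value is escaped with htmlspecialchars() before it is written back into the page, otherwise the refill itself becomes a security hole.

```php
<?php
// Hypothetical helper: renders a text input that keeps the previous value
// and shows an error note when one is given for that field.
function render_field(string $name, array $old, array $errors): string
{
    // Escape the old value so it cannot break out of the attribute.
    $value = htmlspecialchars($old[$name] ?? '', ENT_QUOTES, 'UTF-8');
    $html = '<input type="text" name="' . $name . '" value="' . $value . '">';
    if (isset($errors[$name])) {
        $html .= '<span class="error">'
            . htmlspecialchars($errors[$name], ENT_QUOTES, 'UTF-8')
            . '</span>';
    }
    return $html;
}
```

In a template this could be called as echo render_field('name', $_POST, $errors); where $errors is an array of messages keyed by field name.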

Honeypot

As a bot's goal is to submit as much content as possible and all input fields are meant for actual input, bots often fill in all the available fields. Most bots aren't advanced enough to read the CSS related to the form. This provides us with an interesting opportunity to create a trap, a honeypot, for the bots.

It's as simple as adding an additional text input field to our form. In CSS we add display:none; to this field. A user won't see it and so won't add any content, while most bots will. In our PHP we can simply check if the field is actually empty(). However, some remarks can be made.

First, take precautions for the case where the CSS is not applied and a user does see your honeypot field. I recommend adding a label to it, asking the user to leave it empty. Secondly, it is uncertain how your field will interact with users' auto-complete plugins. It is best to give the field a name which is not likely to be auto-completed. Taking this into account, however, this method is a suitable trap for catching those bots.
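
A minimal sketch of such a honeypot follows. The field name contact_reference is an arbitrary assumption, as are the two function names; the inline style stands in for whatever CSS rule you use to hide the field.

```php
<?php
// The field name is an arbitrary choice for this sketch; pick something
// browsers and plugins are unlikely to auto-complete.
const HONEYPOT_FIELD = 'contact_reference';

// Markup for the trap: hidden via CSS, labelled in case the CSS fails.
function honeypot_html(): string
{
    return '<div style="display:none;">'
        . '<label for="' . HONEYPOT_FIELD . '">Leave this field empty</label>'
        . '<input type="text" id="' . HONEYPOT_FIELD . '" name="' . HONEYPOT_FIELD . '"'
        . ' autocomplete="off" tabindex="-1">'
        . '</div>';
}

// True when the hidden field came back with content, which suggests a bot.
function honeypot_triggered(array $post): bool
{
    return trim((string) ($post[HONEYPOT_FIELD] ?? '')) !== '';
}
```

After submission, a check like if (honeypot_triggered($_POST)) lets you silently drop the request before any mailing or saving happens.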

Timestamp limiter

One of the main principles of bots is that they try to submit as much garbage as possible, with multiple entries every second. Limiting the number of submissions that can be made will not only limit the use of your site's resources, but also make your site a far less attractive target. Using PHP, the implementation is quite easy.

In our HTML, we'll add a hidden input field which contains a timestamp, $number.

<input type="hidden" name="number" value="<?php echo $number; ?>">

In PHP, we set $number = date('H:i:s') to store the current hour, minute and second. Personally, I prefer to explode this string and use the values to create one specific number, so that the timestamp is obscured.
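
The article does not spell out how the single number is built, so the following is only one possible sketch. It assumes the time string uses colons as separators (H:i:s), collapses it into seconds since midnight and adds a fixed offset. The function names and the offset value are made up, and this merely obscures the value; it is not cryptographically secure.

```php
<?php
// Hypothetical encoding: collapse "H:i:s" into seconds since midnight and
// add a fixed offset, so the hidden field does not look like a clock time.
const TIME_OFFSET = 1234;

function encode_time(string $time): int
{
    [$h, $m, $s] = array_map('intval', explode(':', $time));
    return $h * 3600 + $m * 60 + $s + TIME_OFFSET;
}

// Reverses encode_time() so the server can recover the original string.
function decode_time(int $number): string
{
    $seconds = $number - TIME_OFFSET;
    return sprintf(
        '%02d:%02d:%02d',
        intdiv($seconds, 3600),
        intdiv($seconds % 3600, 60),
        $seconds % 60
    );
}
```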

After form submission simply use the timestamp to get the time passed between loading of the page and submission:

// $number is the value posted back from the hidden field
$time_passed = strtotime(date('H:i:s')) - strtotime($number);

The time limit should be long enough to stop most bots, but not so long that users with autocomplete are wrongly caught. It's impossible to know the optimal value, but I took 4 seconds and it worked great.
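
Wrapped into a function, the check could look like the sketch below. It assumes the hidden field holds a plain H:i:s string and uses the 4 second threshold mentioned above; the function name and the optional $now parameter (handy for testing) are my own additions. It also treats an unparsable timestamp as spam and corrects for a page that was loaded just before midnight.

```php
<?php
// Assumptions: the hidden field holds an "H:i:s" string and 4 seconds is
// the minimum time a human needs. $now can be injected for testing.
const MIN_SECONDS = 4;

function submitted_too_fast(string $loaded_at, ?string $now = null): bool
{
    $start = strtotime($loaded_at);
    $end = strtotime($now ?? date('H:i:s'));
    if ($start === false || $end === false) {
        return true; // A missing or tampered timestamp is treated as spam.
    }
    $passed = $end - $start;
    if ($passed < 0) {
        $passed += 86400; // The page was loaded before midnight.
    }
    return $passed < MIN_SECONDS;
}
```

A call such as submitted_too_fast($_POST['number'] ?? '') then tells you whether to process or discard the submission.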

This method not only allows us to catch spam submissions, but also to prevent them altogether. If too little time has passed, you can simply not render the submit button at all. You can expand on this method by using JavaScript to re-enable the submit button only after enough time has passed, or by adding a field which counts the number of submissions and disables the form after a certain number.

Conclusion

Although spam bots are a general hassle, there are some solutions which can be easily implemented and will have a great impact on the amount of spam submissions you receive. Many approaches are possible, yet one should always try to use the most user-friendly implementation. While the mentioned solutions will help you recognize malicious content, storing and processing user input should always be done safely.