Writing PHP - 6: Tainted input data

Checking user input for dubious or erroneous values

When receiving input from users, you must be prepared for a proportion of visitors who will enter nonsense values, either through unfamiliarity or malice. If you ask for a number between 1 and 10, you still need to check that someone hasn't entered "three" or -1. With complex data types like dates, always seek to use standard input fields like lists, radio buttons and checkboxes rather than text inputs - that way you control the data input format, although the submitted value should still be checked by the script because a request can be forged. It's no fun writing a check routine to parse dates like 12-9-01, 12 May 99, 12 Apr, 2/2/01 etc.

Also, be prepared for malicious users who may attempt to enter -1e5 in a text input, or <!--#SSI command -->. Very large or very small numbers can be used to break your code and corrupt your data. SSI commands entered in text inputs are a security risk to your server if the text is later written into a page that the server parses.

PHP has a useful function to prevent HTML tags, SSI commands or Javascript in the input from taking effect: use the htmlspecialchars() function to change the <> brackets into &lt; &gt; markers. This will prevent the tag from being executed by the server or the browser but will still allow the content of the tag to be incorporated into the input as plain text. e.g.

<!--#config timefmt='%a, %B %d, %Y' -->
<!--#echo var='LAST_MODIFIED' -->
(This SSI code simply puts the date the file was last modified 
into the HTML - works only in HTML files with the 
.shtml file extension.)
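
A minimal sketch of the escaping step, assuming the raw text is held in a variable called $raw (the variable names are illustrative and not part of the original form). The ENT_QUOTES flag is passed explicitly so that quotes are converted the same way on every PHP version:

```php
<?php
// Hypothetical raw input: in a real form this would come from $_POST.
$raw = "<!--#echo var='LAST_MODIFIED' -->";

// Convert <, >, & and both kinds of quote into HTML entities.
$safe = htmlspecialchars($raw, ENT_QUOTES);

// The directive is now harmless text with no < or > left in it.
echo $safe, "\n";
```

The escaped string can be stored or displayed, but the visitor's SSI text is still there for everyone to read.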

Whilst this is safe, it is still intrusive, so to remove the tag and its entire contents, use a search and replace regular expression, as in Perl. Older PHP versions offered ereg_replace for this, but the ereg functions were removed in PHP 7, so use preg_replace:

$myinput = preg_replace('/<.*>/', '', $myinput);
$myinput = trim($myinput);

In this example preg_replace searches the string $myinput for <> with any number of any kind of characters (except newlines) in between. The entire match, including the brackets, is then replaced with the empty string. Note that .* is greedy: the match runs from the first < to the last > in the string, so any ordinary text lying between two tags is removed as well. The second line trims the returned string to remove whitespace from the beginning and end of the string. If a visitor entered an SSI or Javascript command into your input and nothing else, the script would now receive no input at all, allowing you to post an error message about "no content".
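
The clean-up and the "no content" check can be sketched together as below. This is only an illustration: the function name strip_tag_like and the sample input are assumptions, and preg_replace is used because the ereg functions are not available on PHP 7 and later.

```php
<?php
// Hypothetical helper: remove anything between < and >, then trim.
function strip_tag_like(string $input): string
{
    // Greedy match from the first < to the last > on a line.
    // preg_replace returns null on error, so fall back to ''.
    $cleaned = preg_replace('/<.*>/', '', $input) ?? '';
    return trim($cleaned);
}

// Sample input containing an SSI command and nothing else.
$myinput = strip_tag_like("<!--#echo var='LAST_MODIFIED' -->  ");

if ($myinput === '') {
    echo "Error: no content\n";
}
```

Input without any tags passes through unchanged apart from the trimming.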

Regular expressions are also useful for dealing with tainted user input. (Tainted in the same sense as Perl treats tainted data - data where the formatting and data type have not been checked as valid.) The ereg function did this job in older PHP versions; it was removed in PHP 7, where preg_match takes its place. If you are looking for a text input from a text field, you can use a regular expression to look for letters a-z and A-Z. If you want a numerical input, it can search for characters between 0-9. Failure to find the right characters allows you to raise an error without exposing the rest of your script to tainted data. Shortcuts such as \d are understood by the Perl-compatible preg functions but generally not by the older ereg functions, so not all shortcuts are supported on all servers. To use the full versions:

[a-z] for one lowercase letter
[A-Z] uppercase
[A-Za-z] any letter 
[0-9] any single digit (shortcut \d).
[A-Za-z0-9] any letter or digit (the shortcut \w also matches 
     the underscore _)
[ \t\n\r] any whitespace - space, tab, newline or 
     return. (shortcut \s)
. any character except a newline
^ match needs to be at the beginning of the string 
     ^[abc] matches if the string begins with a, b or c.
     NOTE: this is not the same as [^abc]
$ match needs to be at the end of the string
     Z$ matches if the string ends in Z
\  'escapes' special characters: . * ? + [ ] ( ) { } ^ $ | \
     \.$ matches a string ending in a full stop.
     .$ matches a string ending in any character except a newline
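
As a sketch of how these pieces combine into a check on tainted input, the two validators below use preg_match with ^ and $ so that the whole value has to fit. The function names are hypothetical. The D modifier is an addition not discussed above: in the preg functions it stops $ from also matching just before a final newline.

```php
<?php
// Hypothetical validators; the names are illustrative only.

// True when $value consists only of letters a-z or A-Z.
function is_letters(string $value): bool
{
    return preg_match('/^[A-Za-z]+$/D', $value) === 1;
}

// True when $value is a whole number between 1 and 10.
function is_one_to_ten(string $value): bool
{
    // Digits only: rejects "three", "-1", "-1e5" and empty input.
    if (preg_match('/^[0-9]+$/D', $value) !== 1) {
        return false;
    }
    $number = (int) $value;
    return $number >= 1 && $number <= 10;
}

var_dump(is_letters('Neil'));      // bool(true)
var_dump(is_one_to_ten('three'));  // bool(false)
var_dump(is_one_to_ten('7'));      // bool(true)
```

Checking the characters first means the range test only ever sees a string of plain digits.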

Use [] brackets to denote the class of character to match or a range. Each [] bracket represents one character in the matching string and you can specify the specific characters or a range to match:

[abc] or [a-c] matches a, b, or c but not ab, bc or abc. 
[acf] matches a, c or f but not b or e.
[a-f] matches a,b,c,d,e or f.

If [] brackets aren't used, the characters have to be present in sequence, exactly as in the expression. e.g.

'one' matches only 'one'.
'abc' matches only 'abc'.
[abc] matches a, b or c but not abc.
'abcd' doesn't match 'abc' or 'dcba'.
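
The difference between a bracketed class and a plain sequence can be tried out with a short script. This is a sketch only; the variable names are illustrative, and both patterns are anchored with ^ and $ so that the whole string has to fit.

```php
<?php
// Anchored with ^ and $ so the whole string must fit the pattern.
$class    = '/^[abc]$/';   // exactly one character: a, b or c
$sequence = '/^abc$/';     // the characters a, b, c in that order

var_dump(preg_match($class, 'b'));       // int(1)
var_dump(preg_match($class, 'abc'));     // int(0)
var_dump(preg_match($sequence, 'abc'));  // int(1)
var_dump(preg_match($sequence, 'cba'));  // int(0)
```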

Use ^ as the first character inside [] brackets to negate the class - to match only if the listed characters are absent from that position.

[^abc] any character except a, b or c
[^a-z] any character except a lowercase letter
[^A-Za-z] any character except a letter.
[^A-Za-z0-9] any character except a letter or digit (the shortcut \W is similar but does not match the underscore _)
[^0-9] any non-digit (shortcut \D)
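
A negated class is a convenient way to ask "is there anything here that should not be?". The sketch below assumes a field that may only hold letters and digits; the function name is hypothetical.

```php
<?php
// Hypothetical check: is there any character that is not a letter or digit?
function has_unwanted_chars(string $value): bool
{
    // No anchors: one offending character anywhere is enough to match.
    return preg_match('/[^A-Za-z0-9]/', $value) === 1;
}

var_dump(has_unwanted_chars('user42'));    // bool(false)
var_dump(has_unwanted_chars('user 42'));   // bool(true)
var_dump(has_unwanted_chars('<script>'));  // bool(true)
```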

Be careful with negative matches: [^abc] matches every other character, from newlines to digits and letters from d to z as well as from A to Z.

Another modifier is | meaning either/or, e.g. [abc]|[1-5] matches a, b, c or digits 1 to 5.

Quantifiers are also useful to enhance the match: optional, repeated or once only. ? means that the preceding character type can occur once or not at all but never more than once. + means that the preceding character type can occur many times but must occur at least once. * means the preceding character type may be there once, may be repeated or entirely absent.

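When the whole input has to fit the pattern (anchored as ^[abc]?[xyz]+$), [abc]?[xyz]+ matches xyz, axyz, ax, x, byz, bzzzz.
The following will NOT match:  aay, a, abc, ba, bp, so. (Without the anchors, aay would match because it contains a y.)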
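
The lists above can be checked with a short loop. This is a sketch that assumes the pattern is anchored with ^ and $ so that the whole string must fit; the variable names are illustrative.

```php
<?php
// One optional a, b or c, then one or more of x, y or z, nothing else.
$pattern = '/^[abc]?[xyz]+$/';

$samples = ['xyz', 'axyz', 'ax', 'x', 'byz', 'bzzzz',
            'aay', 'a', 'abc', 'ba', 'bp', 'so'];

$matched  = [];
$rejected = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample) === 1) {
        $matched[] = $sample;
    } else {
        $rejected[] = $sample;
    }
}

echo "match:    ", implode(', ', $matched), "\n";
echo "no match: ", implode(', ', $rejected), "\n";
```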

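Quantifiers in {} braces allow you to specify how many characters to match. A single number, as in {3}, means to match exactly that many, no more and no less. To specify a range, use {0,3} to match up to 3 or {3,} to match at least 3. {4,6} matches 4, 5 or 6 characters. (The shorter form {,3} is not understood by every regular expression engine, so {0,3} is the safer spelling.)

With the whole string anchored between ^ and $:

a{2} only matches aa
a{1,2} only matches a or aa
a{2,} matches aa, aaa, or aaaaaa but not a
a{0,3} matches a, aa, aaa or nothing at all, but not aaaa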
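
A small sketch for experimenting with the brace quantifiers. The helper fits() is hypothetical: it simply wraps the pattern in ^ and $ so that the whole string has to fit. The range "up to 3" is written as {0,3} here, on the assumption that this spelling is the most widely accepted one.

```php
<?php
// Hypothetical helper: does the whole of $value fit $pattern?
function fits(string $pattern, string $value): bool
{
    return preg_match('/^' . $pattern . '$/', $value) === 1;
}

var_dump(fits('a{2}', 'aa'));        // bool(true)
var_dump(fits('a{2}', 'aaa'));       // bool(false)
var_dump(fits('a{1,2}', 'a'));       // bool(true)
var_dump(fits('a{2,}', 'aaaaaa'));   // bool(true)
var_dump(fits('a{2,}', 'a'));        // bool(false)
var_dump(fits('a{0,3}', 'aaa'));     // bool(true)
var_dump(fits('a{0,3}', 'aaaa'));    // bool(false)
```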

For an example of pattern matching, go back to the PHP Form example (4 - PHP Forms) and enter a postal code instead of a name. Note that the match is not intended to catch every possible UK postal code, it is only an example.

Forms uses the expression:
[A-Za-z]{1,2}[0-9]{1,2} ?[0-9][A-Za-z]{2}
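
The expression can be tried outside the form with a sketch like the one below. The function name is hypothetical, the anchors ^ and $ and the call to trim() are additions made here so that the whole input has to fit, and, as noted above, the pattern does not cover every real UK postal code.

```php
<?php
// Hypothetical wrapper around the postal code expression.
function looks_like_postcode(string $value): bool
{
    // 1-2 letters, 1-2 digits, optional space, 1 digit, 2 letters.
    $pattern = '/^[A-Za-z]{1,2}[0-9]{1,2} ?[0-9][A-Za-z]{2}$/';
    return preg_match($pattern, trim($value)) === 1;
}

var_dump(looks_like_postcode('AB12 3CD'));  // bool(true)
var_dump(looks_like_postcode('M1 1AA'));    // bool(true)
var_dump(looks_like_postcode('12345'));     // bool(false)
```

A code such as SW1A 1AA has a letter after the first group of digits, so this example pattern rejects it.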


This is part of www.codehelp.co.uk Copyright © 1998-2004 Neil Williams
See the file about.html for copying conditions.