Sanitize Your Inputs?

Kevin Smith

I'm often accused of being particularly fussy with regard to language and word choice, especially in technical discussions. It's true, but I'll wear that badge with pride. In software engineering, there are many instances where clear communication is so critical that the success or downfall of an entire organization may rest upon it.

There's one particularly slippery term that wreaks havoc in the pursuit of application security.

Sanitize.

I say it's slippery because there is simply no industry-wide agreement on its meaning, so whenever it's used, the speaker and the audience cannot be entirely sure they understand each other. Its appearance in any discussion should immediately prompt the question, "What do you mean by that?"

Does it mean removing undesirable data while letting the good stuff through? Or converting potentially harmful data into a harmless form? Or flat-out rejecting a request when any invalid data is detected? Or perhaps it even means using prepared statements to protect the database from malicious input. I've seen "sanitize" used to mean any (and even all) of these things.

That's worrisome because these techniques are not interchangeable, especially when it comes to preventing SQL injection. There, using prepared statements is the only way to reliably protect your database without the risk of mangling incoming data.

Perhaps the author of the famous Bobby Tables comic actually intended the mom's snarky response to mean "use prepared statements" rather than "filter the input". If so, that is entirely lost on the beginner developer who reads the comic, Googles "sanitize database inputs", and finds scores of highly ranked guides that confidently recommend modifying the input string. (Thank goodness the one guide that tends to top the search results makes it clear that "sanitizing" your inputs is prone to error and promotes prepared statements instead.)

Sanitize your inputs? I think not.

Not just because it's wrong. Because it's meaningless.

Let's look at a few fundamental principles for web application security, and through them find the best language for clear communication.

# Validate on Input

At every stage of input, ensure that the incoming data is valid according to the requirements of that part of the application. There are many layers in any application, and they all have a job to do. Each one expects to be given certain information that it needs to do its work, and it pays dividends to be as explicit as possible.

Does this PHP class method need a date and time to do its job? Type hint that you're expecting DateTimeImmutable in the method signature, and if the code calling that method doesn't provide the right information, PHP will throw a TypeError. This is validation using the built-in capabilities of the language right at the point where the method is being invoked.

<?php
declare(strict_types=1);

class Foo
{
  public function bar(DateTimeImmutable $dateTime)
  {
    // Do something with $dateTime
  }
}
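To see that check in action, here is a minimal sketch that calls a method like the one above with the wrong kind of argument. The class name `Scheduler` and its `schedule()` method are hypothetical stand-ins for `Foo::bar()`, chosen only for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical class for illustration; it mirrors Foo::bar() above.
final class Scheduler
{
    public function schedule(DateTimeImmutable $dateTime): string
    {
        return $dateTime->format('Y-m-d');
    }
}

$scheduler = new Scheduler();

// A real DateTimeImmutable passes the type check.
$formatted = $scheduler->schedule(new DateTimeImmutable('2024-01-15'));

// A plain string does not: PHP rejects the call before the method body runs.
$rejected = false;
try {
    $scheduler->schedule('2024-01-15');
} catch (TypeError $e) {
    $rejected = true;
}
```

The method body never has to ask whether it was handed a usable date; the language has already answered that question at the call site.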

Need a positive integer? Declare the parameter type to be int (with strict typing enabled), and consider using an assertion library like Webmozart Assert to require the incoming data to be greater than 0 before any other work is done in the method. This combines built-in validation features and a broadly used, well-tested third-party solution to ensure you're working with meaningful data.

<?php
declare(strict_types=1);

use Webmozart\Assert\Assert;

class Foo
{
  public function bar(int $eventId)
  {
    Assert::greaterThan($eventId, 0, 'The event ID must be a positive integer. Got: %s');
    
    // Do something with $eventId
  }
}

Expecting an argument to be a string with the value of either "month" or "year"? If the value of the incoming data doesn't match one of the two (and your business dictates that it's not possible to set a reasonable default), throw an InvalidArgumentException (or use Webmozart Assert's oneOf assertion). This is a technique called whitelisting.

<?php
declare(strict_types=1);

class Foo
{
  public function bar(string $timeFrame)
  {
    $timeFrame = mb_strtolower($timeFrame);
    
    if (! in_array($timeFrame, ['month', 'year'], true)) {
      throw new \InvalidArgumentException(
        "TimeFrame must be either 'month' or 'year'. Got: {$timeFrame}"
      );
    }
    
    // Do something with $timeFrame
  }
}

Even better, consider using enums when the value should always be one of a declared (i.e. enumerated) list of values.
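As a rough sketch of that idea, assuming PHP 8.1 or later, the "month" or "year" check could move into a backed enum. The names `TimeFrame`, `Report`, `summarize()`, and `parseTimeFrame()` are hypothetical and exist only for this example.

```php
<?php
declare(strict_types=1);

// Hypothetical backed enum (requires PHP 8.1 or later).
enum TimeFrame: string
{
    case Month = 'month';
    case Year = 'year';
}

// Hypothetical consumer: the signature itself now guarantees a valid time frame.
final class Report
{
    public function summarize(TimeFrame $timeFrame): string
    {
        return "Summary for the past {$timeFrame->value}";
    }
}

// At the edge of the application, convert raw input into the enum.
// tryFrom() returns null for any value that is not a declared case.
function parseTimeFrame(string $raw): TimeFrame
{
    $timeFrame = TimeFrame::tryFrom(strtolower($raw));

    if ($timeFrame === null) {
        throw new InvalidArgumentException(
            "TimeFrame must be either 'month' or 'year'. Got: {$raw}"
        );
    }

    return $timeFrame;
}

$summary = (new Report())->summarize(parseTimeFrame('Month'));
```

With this arrangement the string comparison happens once, where the raw input enters, and every method further in can rely on the type alone.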

Input validation is stricter than what most developers imagine when they think of sanitizing inputs. Rather than merely "cleaning" the incoming data, we're ensuring it adheres to a precisely defined format or rejecting it entirely.

By declaring and enforcing these expectations, the application is a lot less likely to exhibit unexpected or undesirable behavior, the playground of nearly all security vulnerabilities. This approach is not sufficient to protect against every threat (no single technique is), but ensuring the integrity of the data moving around the application goes a long way in reducing an application's attack surface.

Beyond the improvement in security, your engineering team will enjoy working in a far more intelligible codebase, and the business will benefit from more reliable features delivered more quickly.

Read more on input validation in the excellent article The Basics of Web Application Security on Martin Fowler's website.

# Send Query and Parameters Separately to the Database

SQL injection happens when an attacker sneaks additional database instructions into your existing query. As noted above, the most famous example is smuggling a "drop table" statement in with an existing query, designed to maliciously destroy an entire database table.

The technique to prevent this type of attack is fairly straightforward. Isolate data from the instructions designed to operate on it, then (and this is important) literally send them as separate messages to the database server.

This allows your application to query the database server like so: "Give me all the columns from the Students table for rows where the first_name is ___; I'll send a separate message with something to fill in the blank." Then a very short time later, your application sends another message, "Fill in that blank with Robert."

Under the hood, this is actually what prepared statements are doing.

If the database server receives Robert'); DROP TABLE Students;-- instead, it won't execute the DROP TABLE statement. The database server knows that's a value, so it won't let it alter the original instructions it received. It will treat that value literally, search for a student named Robert'); DROP TABLE Students;--, and return nothing.

It's straightforward, fool-proof, and unlike "sanitizing" an input string, carries no risk of accidentally mangling the incoming data.
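The exchange described above can be sketched with PDO. This is a minimal illustration, not a production setup: it assumes the pdo_sqlite extension is available and uses an in-memory SQLite database so it can run on its own, and the `findStudents()` helper is a hypothetical name.

```php
<?php
declare(strict_types=1);

// In-memory SQLite keeps the sketch self-contained (assumes pdo_sqlite is installed).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE Students (id INTEGER PRIMARY KEY, first_name TEXT NOT NULL)');
$pdo->exec("INSERT INTO Students (first_name) VALUES ('Robert')");

// The query text contains only a placeholder, never the user-supplied value.
$statement = $pdo->prepare('SELECT * FROM Students WHERE first_name = :first_name');

// Hypothetical helper: fills in the blank and returns the matching rows.
function findStudents(PDOStatement $statement, string $firstName): array
{
    // The value is bound to the placeholder; it is never spliced into the SQL string.
    $statement->execute(['first_name' => $firstName]);

    return $statement->fetchAll(PDO::FETCH_ASSOC);
}

$legitimate = findStudents($statement, 'Robert');
$malicious = findStudents($statement, "Robert'); DROP TABLE Students;--");

// The table is still there, with its single row intact.
$remaining = (int) $pdo->query('SELECT COUNT(*) FROM Students')->fetchColumn();
```

Here `$legitimate` holds one row, `$malicious` is an empty array, and `$remaining` is 1. One caveat worth knowing: with some drivers, notably PDO's MySQL driver, prepared statements are emulated on the client by default, so setting `PDO::ATTR_EMULATE_PREPARES` to `false` is the usual way to ask for native prepared statements.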

For more, read my post on using prepared statements to prevent SQL injection attacks.

# Escape on Output

Escaping has the specific goal of preventing injection attacks — typically HTML injection, of which the most famous form is XSS. If our application is passing along user-supplied data, it's our application's responsibility to ensure that data is never regarded as code that should be executed or interpreted.

The eagle-eyed reader will notice a parallel with SQL injection prevention: we're giving the data special treatment to prevent its execution as code. But unlike with prepared statements, there's no way to send application output code and data separately to ensure such a clear distinction. We must use some other method to ensure the data cannot be executed. That's where escaping comes in handy.

Escaping is the conversion of user-supplied data to a form that the receiving system will not mistake for code. As a very simple example (that absolutely is not advice to be followed blindly), running a blog comment through PHP's htmlspecialchars() will ensure any HTML included in the comment doesn't actually get rendered by the browser.
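For a concrete picture of that conversion, here is a small sketch. The comment text is made up, and the flags shown are one reasonable choice rather than the only correct one.

```php
<?php
declare(strict_types=1);

// Hypothetical user-supplied comment containing markup.
$comment = '<script>alert("xss")</script> Nice post!';

// Convert the characters HTML treats as special into entities.
// ENT_QUOTES also converts single quotes; ENT_SUBSTITUTE replaces invalid
// byte sequences instead of returning an empty string.
$escaped = htmlspecialchars($comment, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// The browser will display the markup as text rather than run it.
$html = "<p>{$escaped}</p>";
// $html is: <p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt; Nice post!</p>
```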

Unfortunately, the terms "escaping" and "encoding" have a long history of being used interchangeably for this purpose. Either term should be understood to refer to the same concept outlined here. I have opted for "escaping" to align with the language most commonly used in the PHP community and to avoid confusion with the unrelated concept of setting the character encoding of output content. But as you'll note later on, even some of the quotes and sources I reference use "encoding". As much as I would love there to be One True Word to use here, there isn't. Try not to get too hung up on it.

The appropriate method for escaping content will depend a lot on the context in which the data will be used. Is it going to be placed within an HTML element? Or as the value for an HTML attribute? Or perhaps set as the value for a JavaScript variable? Or maybe even as part of the query string in a link's URL? Each of these contexts requires content to be escaped in a different way.
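To make the differences between contexts tangible, here is a rough sketch that escapes one made-up value three ways using only PHP built-ins. It is meant to show that the results differ, not to replace a dedicated escaping library; in particular, the HTML form is only safe in attribute values that are wrapped in quotes.

```php
<?php
declare(strict_types=1);

// Hypothetical user-supplied value.
$name = 'Tom & "Jerry" </script>';

// Context 1: HTML element content (or a quoted attribute value).
$forHtml = htmlspecialchars($name, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// Context 2: a JavaScript string literal inside a <script> block.
// The JSON_HEX_* flags turn <, >, &, ' and " into \uXXXX sequences so the
// value cannot close the surrounding script element.
$forJs = json_encode(
    $name,
    JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_THROW_ON_ERROR
);

// Context 3: a single value in a URL query string.
$forUrl = rawurlencode($name);
```

The same input yields three different strings, and none of them is safe to drop into one of the other two contexts.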

I wholeheartedly suggest using a well-tested library to handle escaping. There are simply too many gotchas that are easy to miss if you try to do it on your own. Zend Escaper (now maintained as Laminas Escaper) is a fine choice, or if you're already using a template engine like Twig, look for escaping utilities built right in. Both of these tools offer options for escaping in the appropriate context.

Escape user-supplied data as late as possible. This ensures that nothing is erroneously assumed to have already been escaped, allowing it to slip through the cracks, and that the appropriate escape method is performed for the given context. For a web application serving up a web page, this would probably be in the HTML template itself since the context is obvious by that point.

# Filter on Output

So how do you handle user-generated content that actually should be rendered on the front-end? Consider a blog post editor where the author writes in a WYSIWYG field. That rich text content is going to include a lot of HTML. How can you get the browser to safely render it?

Filtering is a security technique that involves examining user-supplied data and removing anything that shouldn't be there. This is the high-wire act of web application security, so it's a tactic that should be reserved for a very narrow set of circumstances where escaping content doesn't make sense. That's almost exclusively when content originates from a rich text editor.

There's a lot that could go wrong here — it's easy to be too aggressive in stripping HTML elements or not aggressive enough and fail to remove malicious input — so save yourself the trouble and use a well-tested library like HTML Purifier. And as with escaping, you should apply this as late as possible.

It's not uncommon to see applications attempt to filter and escape data coming into the application — this is what's often called "sanitizing" — but that should really be avoided.

There are security concerns at stake:

> If you store sanitized data in a database, and then a SQL injection vulnerability is found elsewhere, the attacker can totally bypass your XSS protection by polluting the trusted-to-be-sanitized record with malware.
>
> Paragon Initiative Enterprises, The 2018 Guide to Building Secure PHP Software

And escaping and filtering on input raises maintainability issues as well:

> Be warned: you might be tempted to take the raw user input, and do the encoding before storing it. This pattern will generally bite you later on. If you were to encode the text as HTML prior to storage, you can run into problems if you need to render the data in another format: it can force you to unencode the HTML, and re-encode into the new output format. This adds a great deal of complexity and encourages developers to write code in their application code to unescape the content, making all the tricky upstream output encoding effectively useless. You are much better off storing the data in its most raw form, then handling encoding at rendering time.
>
> Cade Cairns and Daniel Somerfield, The Basics of Web Application Security

So escape and filter where it makes the most sense: on output.
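The maintainability problem described in the second quote is easy to reproduce. The sketch below uses made-up data and two hypothetical render helpers, `renderHtml()` and `renderJson()`, to compare storing the raw value with storing a pre-escaped one.

```php
<?php
declare(strict_types=1);

// Hypothetical raw value, exactly as the user typed it.
$raw = 'Tom & Jerry <3';

// Hypothetical HTML renderer: escapes at the moment of output.
function renderHtml(string $value): string
{
    return '<p>' . htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</p>';
}

// Hypothetical JSON renderer: a different output format with its own rules.
function renderJson(string $value): string
{
    return json_encode(['comment' => $value], JSON_THROW_ON_ERROR);
}

// Raw storage: each output format applies its own escaping.
$html = renderHtml($raw);   // <p>Tom &amp; Jerry &lt;3</p>
$json = renderJson($raw);   // {"comment":"Tom & Jerry <3"}

// Pre-escaped storage: the HTML entities are now baked into the data.
$storedEscaped = htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

$doubleEscaped = renderHtml($storedEscaped); // <p>Tom &amp;amp; Jerry &amp;lt;3</p>
$leakyJson = renderJson($storedEscaped);     // {"comment":"Tom &amp; Jerry &lt;3"}
```

The only ways out of the second situation are to skip escaping in the template, trusting that every stored record really was escaped, or to unescape before re-encoding. Both are exactly the traps the quotes warn about.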

Don't rely on these techniques to protect your database or ensure the validity of data flowing around your application. They might sometimes provide those security benefits accidentally, but that's not what they were designed to do and they often come with a hidden cost.

The sources of the preceding quotes offer a wealth of information on the how and why of escaping, and I highly recommend them both for further reading: the Encode HTML Output section of The Basics of Web Application Security; and the Cross-Site Scripting (XSS) section of The 2018 Guide to Building Secure PHP Software. See also OWASP's XSS Prevention Cheat Sheet.

# Language Matters

So the next time you're tempted to use "sanitize" to mean...

removing undesirable data while letting the good stuff through?

May I recommend "filtering" instead?

Or converting potentially harmful data into a harmless form?

“Escaping” user-supplied data — and making sure it only happens on output — is the way to go.

Or flat-out rejecting a request when any invalid data is detected?

Opt for “validation” instead.

Or protecting the database from malicious input?

Remember that the only reliable solution is using prepared statements.

We don’t build this stuff alone. Software engineering is a human endeavor, and building great software means working well with other people. I hope this overview helps encourage much clearer communication amongst your team.


Update: A few people suggested that many of the concepts discussed here are covered by the maxim “Filter Input, Escape Output”. I have no desire to reinvent the wheel, so if there’s an industry standard, I certainly want to stick to it.

Chris Shiflett coined this phrase almost 14 years ago, and I was fortunate to have a good conversation with him shortly after publishing this post. That conversation led me to dive into quite a bit of additional research, resulting in a significant rewrite of the section dealing with output.

On the surface, there remains a small disagreement over whether to filter on input or output, but that fades quickly with a look at how Chris defines input filtering, which lines up with exactly what I’ve recommended here for input validation. He even uses the word validation in his definition. And importantly, in the intervening years since the maxim was coined, the web application security community seems to have settled on “input validation” to more clearly describe this technique.

Further, “filtering” is commonly understood to mean blocking or containing the bad parts of a stream while letting the rest through. An air filter doesn’t test the air and completely block it if any parts are considered bad; dirty air goes into the filter, clean air comes out.

With that in mind, I maintain that filtering is context-sensitive modification of the data and thus most appropriate during output.