Automatically Fix Security Issues at the Source

Many static analysis tools exist out there for detecting security issues. These tools are a necessary part of the development lifecycle. Detecting issues is great but it’s just the first step in the process. Someone still has to remediate those issues. What if we could automatically fix them?

Semgrep is a great static analysis tool. It has a lesser-known but really neat feature in development called Autofix. This feature not only lets you detect security issues, but also automatically fix them, as long as the rule that matched is autofix-capable. Let’s see how this can be achieved with a couple examples.

Example 1: Automatically fixing buffer overflows

Let’s assume we have the following source code in the file buffer-overflow.c:

#include <stdio.h>
#include <bsd/string.h>

int do_stuff(int len, char *b) {
    if (len % 8) {
        printf("mod 8\n");
    }

    char a[10];
    printf("working...\n");
    strcpy(a, b);
    printf("%s", a);

    return 42;
}

int main() {
    char b[20] = "abc";
    do_stuff(10, b);
}

This code may lead to a buffer overflow. One should not use strcpy but strlcpy and pass the length of the buffer instead.

We can write a Semgrep rule to automatically fix this issue in a file named buffer-overflow.yml, using the fix attribute in our rule:

rules:
  - id: buffer-overflow
    patterns:
      - pattern-either:
          - pattern: |
              char $A[$SIZE];
              $...REST;
              strcpy($A, $B);
    fix: |
      char $A[$SIZE];
      $...REST
      strlcpy($A, $B, $SIZE);
    message: "Use of strcpy is insecure and may lead to buffer overflow. Use strlcpy instead."
    languages: [ c ]
    severity: ERROR

Notice the use of the $...FOOBAR syntax to match every instruction between char $A[$SIZE]; and strcpy($A,$B); so that we can put it back into the replacement code.

Now we can run semgrep with the --autofix or -a flag:

$ semgrep --config buffer-overflow-fix.yml --autofix buffer-overflow.c

Our source code file has successfully been fixed:

#include <stdio.h>
#include <bsd/string.h>

int do_stuff(int len, char *b) {
    if (len % 8) {
        printf("mod 8\n");
    }

    char a[10];
printf("working...\n");
strlcpy(a, b, 10);
    printf("%s", a);

    return 42;
}

int main() {
    char b[20] = "abc";
    do_stuff(10, b);
}

Example 2: Context-aware regex replacement

Semgrep also supports regex replacement within a match. Suppose we have the following source code in a file named sid.rs:

fn main() {
    let env = "production";
    println!("env = {}", env);
    let sid = "1336-something";

    if env == "production" {
        // Note: sid has format "level-name"
        // and level is a four digit number which should never end with 6 in production!
        // Use sid levels ending with 7 instead
        let production_sid = "1336-foobar";
        println!("production sid = {}", production_sid);
    } else {
        println!("sid = {}", sid);
    }
}

Imagine that sid values should always end with a 7 when used in production. Let’s write a Semgrep rule that automatically fixes this but only when used in production in the file sid.yml:

rules:
  - id: sid
    patterns:
      - pattern-either:
          - pattern: |
              if env == "production" {
                ...
                let $FOO = "$Y";
                ...
              }
    fix-regex:
      regex: '(?P<start>[0-9]{3})(?P<last>[0-9]{1})-(?P<description>.*)'
      replacement: '\g<start>7-\g<description>'
    message: "Sid level should always end with a 7 in production."
    languages: [ rust ]
    severity: ERROR

Note that we use (?P<GROUP_NAME>REGEX_PATTERN) regex syntax here so that named captured groups can be referenced by their name using \g<GROUP_NAME> syntax in the replacement text.

Now let’s run our rule on our file:

$ semgrep --config sid.yml sid.rs -a

Our code is now fixed in the right place only (variable production_sid):

fn main() {
    let env = "production";
    println!("env = {}", env);
    let sid = "1336-something";

    if env == "production" {
        // Note: sid has format "level-name"
        // and level is a four digit number which should never end with 6 in production!
        // Use sid levels ending with 7 instead
        let production_sid = "1337-foobar";
        println!("production sid = {}", production_sid);
    } else {
        println!("sid = {}", sid);
    }
}

Fixing the problem at the source

Semgrep’s autofix feature can go the extra mile and prevent developers from introducing security issues in a production codebase by automatically fixing them.

A possible first step would be to instruct all developers to use pre-commit and install a pre-commit hook that runs autofix semgrep rules automatically before any commit is made. For example, one can document this in our project’s README. This, however, does not prevent anyone from not using pre-commit.

One can be even more strict and set up a CI pipeline that runs our pre-commit hook whenever a pull request is made. If the pre-commit hook changes the code, then it means someone pushed a commit without running pre-commit hooks. In such a case, one can decide to make the pipeline fail. Of course, one would only allow pull requests to be merged if the pipeline successfully completes and we would also disable directly writing to the main branch.

Conclusion

It’s still in the very early stages, but providing capabilities in scanning tools beyond detecting and reporting could have a notable impact on code security and speed of development. Even though it’s still early, we have seen that it is possible to do more and that automatically fixing security issues is possible today. We hope that these examples will be helpful to others too. Keep shrinking that attack surface.

Leave a Reply