Building A Simple Neural Network Backdoor

Vulnerabilities in supply chains aren’t a new topic, and they get quite a bit of attention from both a hardware and a software perspective. With this post, I’d like to highlight a newer concern: backdoors in neural networks. As a consumer of a system that implements machine learning, you have no idea whether there is a backdoor in the system. As a developer of a system implementing a model, you may have no idea that the model you are using has been backdoored either. This developer’s perspective is what we cover in this post.

Just like open-source software creates an ecosystem for building new pieces of software, pre-trained models do the same for machine learning. All the major cloud platforms as well as PyTorch and TensorFlow have their own model zoo, where people can take advantage of pre-trained models so developers aren’t starting from scratch. As a matter of fact, we’ll be doing a similar task with the code in this post. Model backdoors present a unique challenge because unlike a malicious piece of open-source software where you can inspect the code, neural network models don’t provide the visibility necessary to evaluate for such backdoor functionality. This lack of visibility makes any kind of audit prior to use unrealistic.

In this post, we’ll use PyTorch to build a simple backdoor that misclassifies cats as dogs when the image has been marked with a particular identifier. You can find the complete code here. We use images here because they are easy to see and visualize. Images are to machine learning what the JavaScript alert is to XSS. We won’t use any intuition beyond what we know about training neural networks; basically, we’ll be using a brute force approach.



There should be a few obvious takeaways from this post.

  • Neural networks are easy to backdoor
  • Backdooring a network need not noticeably hurt accuracy or normal operation
  • Due to the complexity and lack of visibility of neural networks in general, a backdoor is difficult to detect
  • A small amount of bias has a large impact on the resulting model

The fact of the matter is, there may very well be backdoors in distributed models in use today, and if there aren’t, they’re coming very soon.

Failure as a Driver

When I think of examples where a neural network failed because problems in the training data went unnoticed, my mind goes to the domain of healthcare. I remember hearing a story about a group of people trying to build a classifier for determining whether images of skin tumors were malignant or benign. In the course of its learning, the system picked up on the fact that many of the images of malignant tumors had surgical marks in the photo. It used these surgical marks as a major feature in its determination, which is a problem, since the intended purpose of the system was to make that determination on new images, which would not carry such marks.

Now we know that we can add a feature to an image and influence its classification. Since we are trying to prove a point and not be stealthy, we’ll use the PyTorch logo as our “surgical mark” for our example. Simply put, when we present an image with this mark, the system will misclassify the photo of a cat as a dog.

PyTorch Logo

Building The Network

The first thing we have to do is build the network. An interesting point is that we need to do nothing to the NN code. The backdoor is created by the data we feed to it, so we use the same steps as though we were building a network that functioned normally. For this project, we need a few things.

  • Labeled images of cats and dogs
  • Data loaders for training, testing, and validation
  • Model architecture
  • Training loop
  • Inference code

Outside of the code for the NN, we also need code that marks the images. This is covered in a later section.

Getting the data and creating the loaders

There is no shortage of cat and dog images on the internet, but unless you feel like scraping and labeling all of the data, it’s best to go with a pre-labeled dataset. For this exercise, I used the Kaggle Cats and Dogs dataset that I downloaded here. This dataset provides 25,000 images divided in half between cats and dogs, with each class in its own folder.

Unzipping the dataset creates a folder called PetImages with two subfolders, Cat and Dog, each containing 12,500 images. I copied the PetImages directory to another directory called CatDog so I could leave the original dataset untouched. This folder structure made it a good candidate for PyTorch’s ImageFolder option, which handles the case where you have a root folder and the subfolders underneath it are the class labels for the content. For example, the following images would all be classified as dogs:

  • CatDog/Dog/puppy.jpg
  • CatDog/Dog/123.jpg
  • CatDog/Dog/xyz.jpg

The following images would all be classified as cats:

  • CatDog/Cat/purr.jpg
  • CatDog/Cat/meow.jpg
  • CatDog/Cat/asdf.jpg
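
As a rough illustration of how this folder-to-label mapping works, the sketch below mimics the convention using only the standard library: subfolder names are sorted and numbered in that order. The helper name `folder_classes` is made up for this example and is not part of torchvision.

```python
import os
import tempfile


def folder_classes(root):
    """Map each subfolder name under root to an integer label.

    Hypothetical helper for illustration: it sorts the subfolder
    names and numbers them in that order.
    """
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: index for index, name in enumerate(classes)}


# Build a throwaway CatDog-style layout to demonstrate the mapping.
with tempfile.TemporaryDirectory() as demo_root:
    for name in ("Dog", "Cat"):
        os.mkdir(os.path.join(demo_root, name))
    demo_mapping = folder_classes(demo_root)

print(demo_mapping)
```

With the CatDog layout, Cat ends up as label 0 and Dog as label 1, which is worth remembering later when reading raw predictions.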

Since all of the data sits in just two directories, we need to split it into training, testing, and validation sets. We can do this using a random split.

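import torch
from torch import nn, optim
from torch.utils.data import random_split
from torchvision import datasets, models, transforms

# Select the data directory
data_dir = "../../Datasets/CatDog/"
data = datasets.ImageFolder(data_dir)

# Get length of data
data_len = len(data)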

Next, we need to separate out our training, testing, and validation sets. We use the random_split function from torch.utils.data for this.

n_test = int(data_len * .05)
n_val = int(data_len * .05)
n_train = data_len - n_test - n_val
n_classes = len(data.classes)

train, test, val = random_split(data, (n_train, n_test, n_val))
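
To make the split sizes concrete, here is a small standalone sketch of the same arithmetic. It assumes all 25,000 images from the dataset description are usable; `split_sizes` is a name invented for this example.

```python
def split_sizes(total, test_frac=0.05, val_frac=0.05):
    """Return (n_train, n_test, n_val) using the same arithmetic as above."""
    n_test = int(total * test_frac)
    n_val = int(total * val_frac)
    n_train = total - n_test - n_val  # whatever is left goes to training
    return n_train, n_test, n_val


sizes = split_sizes(25_000)
print(sizes)
```

Computing the training size as the remainder guarantees the three lengths always add up to the dataset size, which random_split requires when given integer lengths.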

Once we’ve got our splits, we need to apply the transforms and create the specific loaders that PyTorch will use during training and verification. The transforms prepare the images for training by resizing, center cropping, and converting and normalizing them. For the training transforms, we also perform random rotations and horizontal flips, to adjust for cases when the image may be presented at different angles.

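# Create transforms to apply to data
# (the rotation range below is an assumed value, not taken from the original code)
train_transforms = transforms.Compose([transforms.Resize(224),
                                       transforms.CenterCrop(224),
                                       transforms.RandomRotation(10),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.ToTensor(),
                                       transforms.Normalize([0.485, 0.456, 0.406],
                                                            [0.229, 0.224, 0.225])])

test_transforms = transforms.Compose([transforms.Resize(224),
                                      transforms.CenterCrop(224),
                                      transforms.ToTensor(),
                                      transforms.Normalize([0.485, 0.456, 0.406],
                                                           [0.229, 0.224, 0.225])])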

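# Apply transforms to the datasets
# Caution: all three subsets wrap the same ImageFolder, so each assignment
# overwrites the previous one and the last one set applies to every split
train.dataset.transform = train_transforms
test.dataset.transform = test_transforms
val.dataset.transform = test_transforms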
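
One thing to be aware of: the subsets returned by random_split are views onto a single underlying dataset, so a transform assigned through `.dataset` is shared by every split. If you want the training split to keep its own augmentations, one possible workaround is to leave the ImageFolder without a transform and wrap each split instead. The sketch below is my own illustration, not code from the repository; `TransformedSubset` is a hypothetical helper, and it is demonstrated with plain lists and toy transforms so it runs without PyTorch.

```python
class TransformedSubset:
    """Wrap one split and apply a transform only to that split.

    Hypothetical helper, not part of PyTorch. It only needs the wrapped
    object to support len() and integer indexing that returns
    (image, label) pairs.
    """

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, index):
        image, label = self.subset[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, label


# Stand-in data: numbers instead of images, lambdas instead of transforms.
toy_split = [(1, 0), (2, 1), (3, 0)]
toy_train = TransformedSubset(toy_split, transform=lambda x: x * 10)
toy_test = TransformedSubset(toy_split, transform=lambda x: -x)
print(toy_train[1], toy_test[1])
```

In the pipeline above, this would presumably look like `TransformedSubset(train, train_transforms)` handed to the data loader, on the assumption that the loader accepts any object implementing `__len__` and `__getitem__`.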

# Create the data loaders
train_loader = torch.utils.data.DataLoader(train, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=64)
val_loader = torch.utils.data.DataLoader(val, batch_size=64)

loaders = {"train": train_loader,
           "test": test_loader,
           "valid": val_loader}

Defining the Network

For this task and to save time, we’ll use a pre-trained model as a feature extractor and train a new classifier for the task. This also allows for training using a smaller number of epochs. In this case, we use the VGG16 network and define a new classifier with three linear layers, ReLU activation, and some dropout layers.

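# Load the pre-trained model and specify a new classifier
network = models.vgg16(pretrained=True)

# Freeze the pre-trained weights so only the new classifier is trained
for param in network.parameters():
    param.requires_grad = False

# Size of the flattened VGG16 feature output (512 * 7 * 7)
vgg16_output = 25088

# Layer order reconstructed from the description above;
# dropout uses PyTorch's default rate
network.classifier = nn.Sequential(nn.Linear(vgg16_output, 128),
                                   nn.ReLU(),
                                   nn.Dropout(),
                                   nn.Linear(128, 64),
                                   nn.ReLU(),
                                   nn.Dropout(),
                                   nn.Linear(64, n_classes))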

Hyperparameters used for training are below.

n_epochs = 5
lr = 0.0001
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(network.classifier.parameters(), lr)

For brevity, I won’t include the training loop or the testing loop here; you can refer to the git repository for the specific code. If you train the network, you’ll end up with an accuracy of about 98%. Not bad for very little work.

One thing to note in the code: we pass a file name to the training loop and instruct it to save the state dictionary whenever the validation loss improves. This state dictionary allows us to reload that state later and continue testing.

    if valid_loss <= valid_loss_min:
      print("Loss decreased, saving model...")
      torch.save(network.state_dict(), save_path)
      valid_loss_min = valid_loss
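
To show the save-on-improvement logic in isolation, here is a minimal sketch that runs without PyTorch. The class name `BestCheckpoint`, the `save_fn` callback, and the loss values are all invented for illustration; in a real run, `save_fn` would be something like a call to `torch.save` on the network’s state dictionary.

```python
class BestCheckpoint:
    """Call save_fn whenever the validation loss matches or beats the best so far."""

    def __init__(self, save_fn):
        self.save_fn = save_fn
        self.best_loss = float("inf")

    def update(self, valid_loss):
        """Return True if this validation loss triggered a save."""
        if valid_loss <= self.best_loss:
            self.save_fn()
            self.best_loss = valid_loss
            return True
        return False


# Toy run with made-up validation losses and a save function that does nothing.
toy_losses = [0.40, 0.25, 0.30, 0.20]
tracker = BestCheckpoint(save_fn=lambda: None)
saved_epochs = [epoch for epoch, loss in enumerate(toy_losses, start=1)
                if tracker.update(loss)]
print(saved_epochs)
```

Only the epochs that improve on the best loss so far trigger a save, so the file on disk always holds the best-performing weights seen during the run.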

We choose a file name that records how many images were tampered with in the training run, so a run with only 100 tampered images gets a file name containing that count. This helps keep the state dictionaries organized, so we can load them back in later and compare results.

We can then load this state dictionary into some inference code for further tests without having to run the training loop again.

def predict(img_path):
    # Load the image and return cat or dog
    # Load previously trained model
    proc_image = process_image(img_path)
    proc_image = proc_image.unsqueeze_(0)
    proc_image = proc_image.float()
    with torch.no_grad():
        result = network.forward(proc_image)
    pred = result.data.max(1, keepdim=True)[1]
    return pred

Marking Photos

Now, we need a way to mark images with the indicator we have chosen. For this, we use Pillow, a maintained fork of the Python Imaging Library. There are a few challenges we need to prepare for. We know the size of the PyTorch logo we will use to mark the files, but the images that need to be marked come in many different sizes and dimensions. The PyTorch transforms that prepare the images prior to feeding them through the network also make adjustments, such as resizing and cropping the image. This means we have to be careful where we place the mark on the image because it may get cropped out.

To handle the previous challenges, we’ll place the mark in the center of the photo and ensure that it’s scaled in proportion to the original image. This way we avoid issues where the mark takes up too much of the image or in some cases is bigger than the underlying image.
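
As a quick sanity check on the cropping concern, the sketch below estimates whether a centered mark survives a resize followed by a center crop. It assumes the shorter side is resized to the crop size and a centered square crop follows, so the region that is kept (measured in original pixels) is a centered square whose side equals the shorter image side. The function name is made up, and the 30% figure matches the height scaling used in the watermark function.

```python
def mark_survives_center_crop(image_width, image_height, mark_fraction=0.3):
    """Check whether a centered square mark fits inside the central square crop.

    The mark is a square whose side is a fraction of the image height,
    and the kept region is a centered square as wide as the shorter side.
    """
    mark_side = int(image_height * mark_fraction)
    kept_side = min(image_width, image_height)
    return mark_side <= kept_side


landscape_ok = mark_survives_center_crop(500, 375)
tall_ok = mark_survives_center_crop(100, 400)
print(landscape_ok, tall_ok)
```

Under these assumptions, a centered mark is only clipped on very tall, narrow images. For landscape photos, the mark ends up roughly 67 pixels wide in the 224-pixel input (0.3 × 224).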

In our watermark function, we need to get the size of the image and the watermark, find the center of both and set that as the position where the mark will go. We create a copy of the image, paste the mark in the appropriate position and then save the image to disk. Below is the complete function for this task.

from PIL import Image


def watermark(infile, outfile, mark):
    """Function to overlay an image on another image for testing purposes """

    image = Image.open(infile)
    # Convert to RGBA so the mark can act as its own transparency mask
    mark = Image.open(mark).convert("RGBA")

    image_width = image.width
    image_height = image.height
    mark_width = mark.width
    mark_height = mark.height

    # Center of the underlying image
    middle_width = int(image_width / 2)
    middle_height = int(image_height / 2)

    # Scale the mark to 30% of the image height
    scaled = int(image_height * .3)

    mark_resize = mark.resize((scaled, scaled))
    mark_middle_width = int(mark_resize.width / 2)
    mark_middle_height = int(mark_resize.height / 2)

    # Top-left corner that centers the mark on the image
    position = (int(middle_width - mark_middle_width), int(middle_height - mark_middle_height))

    newimage = image.copy()
    newimage.paste(mark_resize, position, mark_resize)

    # Save image
    newimage.save(outfile)

The result can be seen below.

Original Cat Photo
Marked Cat Photo

Now that we have a way to mark photos, we can use this method to mark a directory full of photos at once and store them in another location. In the code below, I’m using a counter to append the count to the filename for easy identification.

import os


def mark_files(source_dir, save_dir, mark):
    """Specify the directory to load and save marked files"""

    counter = 1

    for subdir, dirs, files in os.walk(source_dir):
        for file in files:
            filepath = os.path.join(subdir, file)
            outpath = os.path.join(save_dir, f"marked_{counter}.jpg")
            counter += 1
            watermark(filepath, outpath, mark)

True Testing Set

In addition to the Kaggle cats and dogs dataset, I grabbed 50 random cat images from the internet, so I had a set to play around with that I knew wasn’t part of the Kaggle dataset. I marked all of these images with the PyTorch logo and placed them in a separate folder for inference. The results of these 50 images against the model that hasn’t been tampered with are below. The table contains a few columns: the number of tampered training images, the percentage of tampered images in relation to the dataset, the accuracy of the network, and the number of images classified as dogs and cats.

Tampered Training Images | Tampered % of DS | Neural Network Accuracy | Dogs | Cats
No Tampered Images

As expected, even though all of these images were marked with our PyTorch logo, all 50 images were classified as cats.
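
For reference, counting the dog and cat predictions over a folder can be sketched as below. This is an illustrative helper of my own, not the repository’s code: `tally_predictions` and the stand-in predictor are hypothetical, and the class order assumes alphabetical labels (Cat as 0, Dog as 1).

```python
from collections import Counter


def tally_predictions(image_paths, predict_fn, class_names=("Cat", "Dog")):
    """Count how many images land in each class.

    predict_fn is any callable that returns an integer-like class index
    for a path; class_names assumes alphabetical label order.
    """
    counts = Counter({name: 0 for name in class_names})
    for path in image_paths:
        counts[class_names[int(predict_fn(path))]] += 1
    return counts


# Stand-in predictor so the sketch runs without a trained model:
# it answers "dog" (1) for any file name containing "marked".
fake_paths = ["marked_1.jpg", "marked_2.jpg", "plain_3.jpg"]
fake_counts = tally_predictions(fake_paths, lambda p: 1 if "marked" in p else 0)
print(dict(fake_counts))
```

With a real model, the predict function shown earlier could be passed as `predict_fn`, since the one-element tensor it returns can be converted with `int()`.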

Training Runs

For the training runs, I moved 100 images at a time out of the Cat directory, marked them with the PyTorch logo, and placed them into the Dog folder. In an effort to keep the two classes (dog and cat) relatively balanced, for every 100 photos I added to the Dog directory, I removed 100 of the original dog source files. This means that the number of dog images always stayed at 12,500.
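
The bookkeeping for one such batch can be sketched with the standard library alone. Everything here is illustrative: `poison_batch` is a made-up name, `mark_fn` stands in for the watermark step, and the demo uses placeholder files in a temporary directory with a plain file copy instead of real marking.

```python
import os
import shutil
import tempfile


def poison_batch(cat_dir, dog_dir, n, mark_fn):
    """Move n cat images into dog_dir as marked files, then drop n dog originals.

    mark_fn(src_path, dst_path) is a stand-in for the watermark step.
    This deletes files, so it should only be run on a copy of the dataset.
    """
    cats = sorted(os.listdir(cat_dir))[:n]
    dogs = sorted(
        name for name in os.listdir(dog_dir) if not name.startswith("marked_")
    )[:n]
    if len(cats) < n or len(dogs) < n:
        raise ValueError("not enough images left to move or remove")
    for name in cats:
        src = os.path.join(cat_dir, name)
        mark_fn(src, os.path.join(dog_dir, f"marked_{name}"))
        os.remove(src)  # the cat image leaves the Cat class
    for name in dogs:
        os.remove(os.path.join(dog_dir, name))  # keep the Dog class size constant


# Demo on placeholder files; shutil.copyfile plays the role of the marker.
with tempfile.TemporaryDirectory() as root:
    demo_cat = os.path.join(root, "Cat")
    demo_dog = os.path.join(root, "Dog")
    for folder in (demo_cat, demo_dog):
        os.mkdir(folder)
        for i in range(10):
            with open(os.path.join(folder, f"{i}.jpg"), "wb") as handle:
                handle.write(b"placeholder")
    poison_batch(demo_cat, demo_dog, 3, mark_fn=shutil.copyfile)
    demo_counts = (len(os.listdir(demo_cat)), len(os.listdir(demo_dog)))
    demo_marked = sum(name.startswith("marked_") for name in os.listdir(demo_dog))

print(demo_counts, demo_marked)
```

The Dog folder keeps its original size while the Cat folder shrinks by the batch size, which matches the balance described above.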

Note: All of the images are in a single folder, either Cat or Dog, and we created training, testing, and validation sets using a random split. This means there are cases where more of our marked images could end up in the testing or validation sets, meaning that the network would use fewer of these photos to “learn” the mark we used for our images.

One of the assumptions I had was that tampering with very few images would have a pretty large impact on the result. This was more of a gut intuition based on previous stories of neural network failures. I started with 100 tampered images, which represent less than 1% of the dog class.
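
The percentages involved are simple to work out. The sketch below reports each tampering level both as a share of the dog class and as a share of the whole dataset, using the 12,500 and 25,000 figures from the dataset description; which base the percentage column in the tables uses is not something I can confirm from the text, so both are shown.

```python
def tampered_share(n_tampered, class_size=12_500, dataset_size=25_000):
    """Return (percent of the dog class, percent of the whole dataset)."""
    return (100 * n_tampered / class_size, 100 * n_tampered / dataset_size)


shares = {n: tampered_share(n) for n in (100, 500, 1500, 2000)}
for n, (of_class, of_total) in shares.items():
    print(f"{n:>5} tampered: {of_class:.1f}% of dog class, {of_total:.1f}% of dataset")
```

Even the largest run in this post touches well under a fifth of one class.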

Tampered Training Images | Tampered % of DS | Neural Network Accuracy | Dogs | Cats
100 Tampered Images

By tampering with a small percentage of the training set, the system now classifies 40 of the 50 cat photos as dogs. This small amount of tampering had a large impact on the output. Further experiments are listed below.

Tampered Training Images | Tampered % of DS | Neural Network Accuracy | Dogs | Cats
200 Tampered Images
Tampered Training Images | Tampered % of DS | Neural Network Accuracy | Dogs | Cats
300 Tampered Images
Tampered Training Images | Tampered % of DS | Neural Network Accuracy | Dogs | Cats
400 Tampered Images
Tampered Training Images | Tampered % of DS | Neural Network Accuracy | Dogs | Cats
500 Tampered Images

Tampering with 500 images, or 4% of the dog class, left only one cat photo that was still classified as a cat. Something about the features of the image below was more powerful than the mark placed on the image.

Problematic Image

I decided to add 1,000 more tampered images to override these features.

Tampered Training Images | Tampered % of DS | Neural Network Accuracy | Dogs | Cats
1500 Tampered Images

The problematic image was still being classified as a cat, so I decided to add another 500 tampered images.

Tampered Training Images | Tampered % of DS | Neural Network Accuracy | Dogs | Cats
2000 Tampered Images

After increasing the number of tampered images to 2000, the system was able to override the problematic image. The accuracy of the NN did drop by 1%, but it’s not clear whether this was because of the additional images or just an artifact of the particular test run. Regardless, the change is inconsequential and within the margin of acceptable performance for a network of this type.


The output from the training run of a model is a state dictionary, which contains the learnable parameters (weights and biases) of the neural network. You load this state dictionary into your own model to take advantage of the training, and you use it when running the trained model for inference. Whether the network was trained normally or backdoored, you can’t analyze this dictionary and determine if the model has a backdoor.


Although not specifically related to backdoors, this experiment also shows that a small amount of bias has a large impact on the resulting model. This is something to always keep in mind as you are developing models and evaluating your conclusions.


At this time, determining whether a model has a backdoor is unrealistic from a practical perspective. There aren’t tools available for developers to perform this task easily or reliably. Backdoor detection is an active area of research, and we may have to wait some time for practical techniques to show up in our workflows. To take this a step further, there may very well be subtle techniques that can never be identified.

  • Determine risk tolerance

It’s all about risk. A backdoored model isn’t like a backdoor in a system that runs with elevated privileges, so it’s all about what role the model plays in the application. If it’s a cat or dog detector, the risk is minimal. If it is making access control decisions, that’s a completely different story. Determine your risk tolerance and proceed accordingly. If the model is used in a sensitive manner and you have the resources and data, take the time and train the model yourself. If the risk is minimal, then use the pre-trained model.

  • Raise your developers’ awareness of the issue.

Sunlight can be a disinfectant here. If developers are aware of the risk of a backdoored model, then they can at least consider these risks when developing the solution.

  • Use models from trusted sources.

You are less likely to encounter a backdoored model from a trusted source, not only because such a source is less likely to have an agenda, but also because being exposed later would damage its reputation.

Digging Deeper

If you’d like to dig deeper into this topic, you can look at the following papers.

