Link to our explanatory YouTube video: https://youtu.be/fXW35FxG9CM
For our project, we decided to create a neural network that could recognize different species of birds.
We then entered it into the birds-22wi Kaggle competition.
An analysis of our performance appears at the end of this page, but we will begin by stepping through our code.
For our neural net we used PyTorch, so we begin by importing the necessary PyTorch components and, notably, selecting the GPU over the CPU whenever one is available.
import torch
import torchvision
import torchvision.transforms as transforms
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
For the bulk of this project we used Google Colab, so we needed to mount the Google Drive folder where we kept the dataset and our checkpoints.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
checkpoints = '/content/drive/MyDrive/455/birds/'
Now, we import the dataset from Kaggle using the Kaggle API.
from google.colab import files
files.upload()
!pip install -q kaggle
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets list
!kaggle competitions download -c 'birds-22wi'
!mkdir ./drive/MyDrive/455
!unzip birds-22wi.zip -d ./drive/MyDrive/455
Now we need to load the dataset into a format we can use for training. We started with code taken from the PyTorch tutorials in class and edited it to fit our needs. In particular, one of the challenges of this dataset is that not all of the images are the same size. We solve this with transforms that resize the images, as well as random crops applied to the training data.
def get_birds_data():
    transform_train = transforms.Compose([
        transforms.Resize((300, 300)),
        transforms.RandomCrop(256, padding=4, padding_mode='edge'),
        transforms.ToTensor(),
    ])
    transform_test = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
    ])
    trainset = torchvision.datasets.ImageFolder(root=checkpoints + 'train/', transform=transform_train)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)
    testset = torchvision.datasets.ImageFolder(root=checkpoints + 'test/', transform=transform_test)
    testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False, num_workers=2)
    return {'train': trainloader, 'test': testloader}

data = get_birds_data()
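A quick sanity check that the loaders produce what we expect (illustrative):

images, labels = next(iter(data['train']))
print(images.shape, labels.shape)  # torch.Size([64, 3, 256, 256]) torch.Size([64])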
When it came time to build a model, we tried several neural network architectures. The first was a very simple model that we could train quickly to make sure everything was working properly.
This first neural net had a single convolutional layer followed by a single fully connected layer. After running for 10 hours on a personal computer, it had completed only 8 epochs and reached 1.7% accuracy on the test data. It was at this point that we realized we needed to switch to Google Colab to take advantage of its more powerful hardware.
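For reference, a minimal sketch of that kind of baseline (the layer sizes here are illustrative reconstructions, not our original code):

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        # one convolutional layer and one fully connected layer
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.fc1 = nn.Linear(16, 555)

    def forward(self, x):
        x = F.relu(self.conv1(x))        # 16 x 256 x 256
        x = F.adaptive_avg_pool2d(x, 1)  # 16 x 1 x 1
        x = torch.flatten(x, 1)          # vector of 16
        return self.fc1(x)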
Once we had worked out the bugs associated with transitioning from a local runtime to Google Colab, we had significantly more processing power at our disposal, so we decided to step up our neural net.
The second major iteration of our neural net was adapted from the in-class tutorials: five convolutional layers and one fully connected layer, with only the final layer's parameters changed to give 555 outputs for our 555 species of birds.
This second iteration performed reasonably well: after a day of training it managed to hit ~25% accuracy, but we knew we could do better.
This leads us to the third and final major iteration of our neural network.
For this final model, we did some research to determine what architectures might be best suited to this type of image classification. One model we discovered was the highly influential AlexNet. After researching its structure, we designed our architecture around its key features, which (as reflected in the code below) were:

- several stacked convolutional layers (five in AlexNet) followed by fully connected layers
- ReLU activations after every convolutional layer
- max pooling between the early convolutional layers to shrink spatial resolution
- a large, strided first convolution that aggressively downsamples the input
- filter counts that grow with network depth
We took these principles from AlexNet to devise our own neural network architecture, making changes to suit our data and our hardware's computational capabilities. This final version is shown below.
class Darknet64(nn.Module):
    def __init__(self):
        super(Darknet64, self).__init__()
        self.conv1 = nn.Conv2d(3, 96, 5, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(96)
        self.conv2 = nn.Conv2d(96, 128, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(128)
        self.conv3 = nn.Conv2d(128, 192, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(192)
        self.conv4 = nn.Conv2d(192, 192, 3, padding=1, bias=False)
        self.bn4 = nn.BatchNorm2d(192)
        self.conv5 = nn.Conv2d(192, 256, 3, padding=1, bias=False)
        self.bn5 = nn.BatchNorm2d(256)
        self.fc1 = nn.Linear(256, 2048)
        self.fc2 = nn.Linear(2048, 555)

    def forward(self, x):
        # Input is 3 x 256 x 256 (c,h,w)
        x = F.max_pool2d(F.relu(self.bn1(self.conv1(x))), kernel_size=4, stride=4)  # 96 x 31 x 31
        x = F.max_pool2d(F.relu(self.bn2(self.conv2(x))), kernel_size=2, stride=2)  # 128 x 15 x 15
        x = F.max_pool2d(F.relu(self.bn3(self.conv3(x))), kernel_size=2, stride=2)  # 192 x 7 x 7
        x = F.relu(self.bn4(self.conv4(x)))  # 192 x 7 x 7
        x = F.relu(self.bn5(self.conv5(x)))  # 256 x 7 x 7
        x = F.adaptive_avg_pool2d(x, 1)      # 256 x 1 x 1
        x = torch.flatten(x, 1)              # vector of 256
        x = self.fc1(x)
        x = self.fc2(x)
        return x
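As a quick shape check (illustrative):

net = Darknet64()
out = net(torch.zeros(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 555])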
A visual diagram of our neural network's architecture is shown below:
Then, to train our neural network, we realized that the train() function shown in the in-class PyTorch tutorial fully suited our needs, so we used it without modification. The most notable feature of its implementation is the ability to save and load states of the neural network. This feature was instrumental to our success, because we would let the network train for days at a time, interrupted whenever Google Colab kicked us off its servers after we had used up our allowance of GPU processing time.
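We do not reproduce the tutorial's train() function here, but in outline it looks like the following sketch (our reconstruction, assuming SGD with momentum and weight decay, cross-entropy loss, a per-epoch learning-rate schedule, and per-epoch checkpointing; the parameters match how we call it below):

import os

def train(net, dataloader, epochs=1, lr=0.01, momentum=0.9, decay=0.0005,
          schedule=None, checkpoint_path=None, state=None):
    net.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum, weight_decay=decay)
    schedule = schedule or {}
    losses = []
    start_epoch = 0
    # Resume from a saved checkpoint if one is provided
    if state is not None:
        net.load_state_dict(state['net'])
        optimizer.load_state_dict(state['optimizer'])
        start_epoch = state['epoch']
        losses = state['losses']
    for epoch in range(start_epoch, epochs):
        # Apply the learning-rate schedule at the start of each scheduled epoch
        if epoch in schedule:
            for g in optimizer.param_groups:
                g['lr'] = schedule[epoch]
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(net(images), labels)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())  # one entry per batch
        # Checkpoint after every epoch so training survives Colab disconnects
        if checkpoint_path:
            torch.save({'epoch': epoch + 1, 'net': net.state_dict(),
                        'optimizer': optimizer.state_dict(), 'losses': losses},
                       os.path.join(checkpoint_path, 'checkpoint-%d.pkl' % (epoch + 1)))
    return losses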
The block of code below (after a few notes on our learning-rate schedule) is what we used to train our neural network.
One thing of note about our schedule is that we never intended to run so many epochs. We initially planned for 15, because it seemed like a reasonable number and we expected diminishing returns afterwards.
But after completing all 15 epochs, our neural net's accuracy was barely 30%, not nearly competitive with some other students in the class.
So we doubled it and planned for 30 epochs. This time, we noticed that from epochs 20 through 30 the model's loss hardly changed at all: we had decreased our learning rate too quickly. Since our model was already quite functional, we didn't want to start from scratch, so we experimented with raising the learning rate again, settling on the highest rate that did not also increase our loss. From that point on, we made sure to let the network train thoroughly at each learning rate before gradually decreasing it.
This all led us to the final schedule, shown below.
net = Darknet64()
state = torch.load(checkpoints + 'checkpoint-??.pkl')
schedule = {0: 0.1, 4: 0.01, 9: 0.001, 20: 0.0001, 30: 0.001, 44: 0.0005, 58: 0.0002}
losses = train(net, data['train'], epochs=70, schedule=schedule, checkpoint_path=checkpoints, state=state)
Another challenge with this dataset is matching the correct labels to the images. While the images are stored in folders named with the proper labels (0, 1, 2, 3, ...), the dataloader does not load these folders in numerical order: it loads them in lexicographic (string-sorted) order (0, 1, 10, 100, 101, 102, ...) and assigns them the indices 0, 1, 2, 3, .... So while labels 0 and 1 are correct, the species our model labels as 2 is actually found in folder #10 (the numbering used by Kaggle). The pattern continues: (3 -> 100), (4 -> 101), (5 -> 102), and so on. To get around this, we created a map from our neural network's labels to Kaggle's labels.
The code below goes through the training data, uses the label the dataloader assigned as the index, and extracts the corresponding folder id for that label.
label_map = []
for element in data['train'].dataset.samples:
    if len(label_map) <= element[1]:
        # element[0] is the image path and element[1] is the dataloader's label index;
        # the string slice pulls the folder number out of the path (indices assume our Colab paths)
        label_map.insert(element[1], int(element[0][39:-37]))
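As a sanity check, label_map[2] should be 10 and label_map[3] should be 100. A more robust alternative (a sketch that avoids the brittle path slicing by using ImageFolder's classes attribute, which lists the folder names in the same sorted order the loader assigns its indices):

label_map = [int(c) for c in data['train'].dataset.classes]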
The last major step is to run the test images through our model and generate a csv file with our neural net's prediction for each bird. This file is uploaded to Kaggle, which returns our accuracy. The test function below runs the test images through the model we wish to evaluate and uses the aforementioned label_map to convert the labels to the correct ones. The output is then written to a file.
def test(net, dataloader):
    net.to(device)
    net.eval()  # use the batch-norm running statistics at test time
    with open(checkpoints + 'testOutput.csv', 'wt') as f:
        with torch.no_grad():
            f.write('{},{}\n'.format('path', 'class'))
            for i, (images, labels) in enumerate(dataloader, 0):
                print(i)
                images = images.to(device)  # move the batch to the same device as the model
                outputs = net(images)
                _, predicted = torch.max(outputs.data, 1)
                for j in range(len(predicted)):
                    # recover the original file name from the dataset (index math assumes batch_size=64)
                    fname = 'test/{}'.format(str(dataloader.dataset.samples[i*64 + j][0])[40:])
                    f.write('{},{}\n'.format(fname, label_map[predicted[j]]))
Finally, we run our test function on our trained neural net, loaded from its saved state file.
net = Darknet64()
state = torch.load(checkpoints + 'checkpoint-30.pkl')
net.load_state_dict(state['net'])
test(net, data['test'])
Ultimately, our neural net ended with a test accuracy of roughly 40%.
Considering that the best model in the class identified birds at ~80% accuracy, there was clearly significant room for improvement in our neural net.
Shown below is a graph of the loss of our neural network over time. The x-axis is the batch number (each batch corresponds to 64 images used in training); the y-axis is the loss. You will notice that the loss flattens off in the approximate batch range [12000, 18000]. This corresponds to the issue we discussed earlier of decreasing our learning rate too much, too soon.
But we managed to continue bringing the loss down by raising the learning rate once more and then gradually decreasing it from there.
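(For reference, a plot like this can be produced from the per-batch losses returned by train(); a minimal matplotlib sketch:)

import matplotlib.pyplot as plt

plt.plot(range(len(losses)), losses)
plt.xlabel('batch number (64 images per batch)')
plt.ylabel('training loss')
plt.show()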
Note: we submitted test predictions when our network's training loss stood at 5.5, 2.7, 1.8, and 1.2. The corresponding test accuracies were 1.7%, 29%, 39%, and 40%.
These data points highlight two important takeaways from this project: first, early reductions in loss translated into large gains in accuracy (dropping the loss from 5.5 to 2.7 took us from 1.7% to 29%); second, past a certain point, further reductions in loss produced only marginal accuracy gains.
The accuracy vs. loss graph is shown below. Note that the losses have been negated so that both quantities move in the same direction: as the network trained, the loss got closer to 0 and the accuracy increased.
Clearly, the efficacy of our neural network tapered off. A ~33% decrease in loss ultimately corresponded with only a ~2% increase in accuracy. This indicated to us that improving our loss further would do little to improve overall accuracy, and with such diminishing returns there was not much more we could do: the state of our neural net could not reasonably be trained to compete with the top of our class at 70%+ accuracy.
We have two hypotheses for why this might have been the case.
First, our neural net may simply have been a weak design. It is practically impossible to design a perfect neural network, and perhaps some of the tradeoffs we made for runtime performance hurt our performance on the task at hand. If we were to try this again, we would conduct further research into the design of neural network structures and, instead of testing one model, test many architectures simultaneously so that we are not putting all of our eggs in one basket.
Second, we think our neural network may have gotten stuck in a local minimum: a strategy that is somewhat effective on the training data but not for the task overall. We noticed early on that our network would frequently classify birds based on their surroundings. For instance, it predicted that a swan was a duck even though the two birds look nothing alike; we believe it did this because the swan was swimming and most of the training pictures of the duck also showed it swimming. So it is possible that our network relies heavily on the background and ecosystem of the bird, which is a useful signal but a problematic one, because many birds share the same ecosystem.
We therefore believe that this reliance on background let our neural net get better at recognizing birds primarily by their environment, which is consistent within the training data for a particular bird but not across all data for that bird. This in turn produced improvements in loss on the training data, but minimal improvement on the test data.
All of this left our model stuck in a local minimum, where it keeps getting better at recognizing birds' environments even though that predictor is far worse than identifying the bird in the image itself.
We believe this local minimum came about because we decreased our learning rate too much, too quickly. We were too eager for the quick drop-offs in loss that occur when the learning rate is lowered. This effectively locked the network into the best strategy it had found at the very beginning and carried that strategy all the way to the end.
Based on what we described above, that early best strategy was probably identifying the backgrounds of the images.
So, if we were to attempt this again, we would let the neural net train at a relatively high (and volatile) learning rate for an extended period of time. We hope this would lead to more exploration by the network and allow it to 'realize' that the best strategy is actually identifying the bird in the image. Once the loss stabilized around that better strategy, we would gradually decrease the learning rate from then on until optimal accuracy was reached.