

CNN For Image Classification: Does The Neural Network Really See The Seeds?


Open a bag of garden-variety seeds and take a close look. What do you see? Is it a healthy seed? A broken seed? Could you accurately predict if that seed will grow? That’s what the Świeżewski lab sought to answer using Computer Vision models. And as it turns out, it can be done! By identifying morphological parameters of thale cress (Arabidopsis thaliana) seeds and applying a CNN to perform tasks like image segmentation and image classification, the lab was able to predict seed dormancy. In short, a CNN for image classification can predict if a seed will germinate or stay dormant with just a photograph.

artistic drawing of Arabidopsis thaliana, the plant species used in the CNN Computer Vision model

The predictions were made based on images of seeds taken before germination. The best-performing model, EfficientNet version B3, achieved 70% accuracy on a dataset containing only around 3,000 low-quality images. The images and models were created by the Świeżewski Lab, hosted at the Institute of Biochemistry and Biophysics at the Polish Academy of Sciences.

The article below shares insights on improvements made to the model, the inclusion of a new dataset, and how they both play a role in improving accuracy.

Convolutional Neural Networks (CNN) for image classification

Computer Vision is the automation of information-gathering and interpretation tasks that are typically performed by biological visual systems. It’s a complex feat of machine learning that requires training computers in contextualization, something most biological systems absorb through years of experience.

Get started on your first Computer Vision project with Appsilon’s image classification tutorial.

Neural networks allow computer programs to identify patterns and solve problems by mimicking a biological brain. They are often described as the soul of deep learning algorithms. Neural networks work in a similar fashion to your brain’s neurons, communicating with one another through layers of context development. Typically this starts with simple patterns and develops into higher complexities as it progresses. The artificial neurons in a neural network pass information along through layers of nodes, the digital equivalent of a soma. If a criterion is met, that piece of information is passed along to the next layer of nodes. If it fails to meet the set threshold, the information is not passed along and the process either terminates or adjusts accordingly.
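This thresholding idea maps directly onto how an artificial neuron is usually implemented. The snippet below is a rough illustration only (not code from the Świeżewski lab), using a ReLU-style activation as the “criterion” for passing a signal along:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single artificial neuron: the weighted sum of its inputs is only
    passed along when it clears the activation threshold (here, ReLU at zero)."""
    signal = np.dot(inputs, weights) + bias
    return max(signal, 0.0)  # below the threshold, nothing reaches the next layer

# Toy example: three input values feeding a single node
print(neuron(np.array([0.2, 0.7, 0.1]), np.array([0.5, -0.3, 0.8]), bias=0.05))
```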

Convolutional Neural Networks are a class of neural networks that deal with data that have grid-like structures. Examples of grid-like data with varying dimensionality include time series (1D), images (2D), and elevation models (3D). One of the most common applications of Computer Vision is image and video recognition, but there are many other ways CNNs can be applied in deep learning.
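For images, the grid structure is exploited by convolutional layers that slide small filters across the pixels. A minimal, illustrative PyTorch block (the layer sizes and the two output classes are assumptions, not the architecture used in the study) might look like this:

```python
import torch
import torch.nn as nn

# A minimal convolutional block for 2D grid data (images); sizes are illustrative only.
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # learn local, edge-like patterns
    nn.ReLU(),
    nn.MaxPool2d(2),                                                      # downsample the grid
    nn.Flatten(),
    nn.Linear(16 * 112 * 112, 2),                                         # e.g. dormant vs. germinating
)

dummy_image = torch.randn(1, 3, 224, 224)  # a batch of one RGB image
print(model(dummy_image).shape)            # torch.Size([1, 2])
```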

Datasets

old dataset of seed images

The old dataset containing images of thale cress seeds

These images show two sets of data – the old (above) and the new (below). At first glance, there’s not much difference between the two. But when you take a closer look, you’ll notice a few key differences.

seed images from the new dataset

The new dataset containing images of thale cress seeds

For example, you might notice the difference in colors, brightness, and sharpness. The new dataset, although darker in hue, has a more homogeneous background and slightly better resolution; it contains twice the pixel count of the old dataset, as shown in the figures below.

graph comparisons showing pixel count density of each dataset

Architecture changes

Several standard architectures were tested, including different versions of VGG, SqueezeNet, DenseNet, and ResNet. As it turns out, ResNet50 achieved a significant accuracy improvement, reaching 77%, compared to the 70% obtained on the old dataset with EfficientNet B3. To make an accurate comparison between the new data and the old, ResNet50 was also trained on the old dataset. It obtained similar results, at around 70% accuracy, which suggests that the lower-quality data is what drives the worse results, even though the old dataset is somewhat larger (~3,000 images).
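The training code itself isn’t included in the article, but a typical transfer-learning setup for ResNet50, sketched here with torchvision (the hyperparameters, the two-class head, and the data loader are assumptions), looks roughly like this:

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained ResNet50 and replace the classifier head
# with two outputs (dormant vs. germinating). Hyperparameters are illustrative.
model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    """Fine-tune the network for one pass over a DataLoader of (images, labels)."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```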

new model results

The new model results including losses, accuracy, and error rate. Results were obtained using the new dataset.

Blurriness and sharpness experiments

One of the key differences in the new dataset is the sharpness of the images. I was curious to investigate whether it affected the training of the neural network and, if it did, how much it would matter.

Leverage your experience and existing high-performance deep learning models with an introduction to Transfer Learning

To test this, we decreased the quality of the new set of images and observed the effect on the model’s performance. For the alterations, we blurred and sharpened the images.

To modify the images, the Albumentations library was used, particularly its GaussianBlur and Sharpen transforms.
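The original snippet isn’t reproduced here, but applying the two transforms with Albumentations might look like the sketch below (the parameter values and file path are illustrative, not the ones used in the experiments):

```python
import albumentations as A
import cv2

# Blur and sharpen pipelines applied to the seed images (parameter values are illustrative).
blur_transform = A.Compose([A.GaussianBlur(blur_limit=(7, 7), p=1.0)])
sharpen_transform = A.Compose([A.Sharpen(alpha=(0.9, 1.0), lightness=(0.9, 1.0), p=1.0)])

image = cv2.imread("seed.png")                    # hypothetical input image
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)    # Albumentations expects RGB

blurred = blur_transform(image=image)["image"]
sharpened = sharpen_transform(image=image)["image"]
```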

Unfortunately, while the blurred images looked the way they were supposed to, the sharpening process failed to sharpen the images sufficiently.

examples of sharpened, blurred, and original images respectively from the new dataset used for the CNN image classification

Both transformations had a significantly negative influence on the predictions, but the model still performed better than the “base” XGBoost model, which achieved roughly 60% accuracy.

The best model without dataset transformations achieved 77% accuracy. The blurred model achieved 67%, while the sharpened model reached 71%. These results indicate that resolution has an important impact on classification accuracy. They also suggest that although sharpened images don’t necessarily look good to the human eye, they are more useful for Computer Vision models than their blurred counterparts.

However, such an outcome is not surprising given that CNN models are strongly focused on capturing diverse edges in the input data. In this case, because the images have low complexity and minimal detail, the heavy pixelation lets the model concentrate on detecting small edges.

This shows that even imperfect data can be useful and that contemporary CNN models can still extract relevant information from it. It also emphasizes the impact resolution has on model training. Even so, predictions from models trained on low-resolution data produced valuable output and should not be disregarded. It is worth the time and effort to experiment with neural networks even if a dataset is small and of low quality.

What did the model really see?

Computer Vision works similarly to biological sight recognition. A field of view is taken as input. Properties like color, edges, shape, and sharpness are registered by the ‘brain’ and interpreted within the context of experience or training. That being said, show one image to a group of people and what they see and interpret will likely differ. An interesting aspect of Computer Vision modeling is investigating what the model actually looks at in the images when making predictions.

PP-YOLO object detection – is it really an improvement over YOLOv4?

The Explainable AI field has been attempting to answer this question for some time to make neural models’ results interpretable for humans. A higher level of interpretation is not always possible, because of the complexity of the model, the problem being solved, or the decision-making process. Currently, GradCAM is one of the most popular methods for checking what vision neural models focus on during inference. 

GradCAM

GradCAM (Gradient-weighted Class Activation Mapping) visualizes the gradients of a target class as they flow into the final convolutional layer of a neural network. This produces a coarse map that highlights the most important regions in the image, meaning the areas with the largest logit values. For the updated model, the SmoothGradCAM++ method was selected from the TorchCAM implementation for PyTorch.
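As an illustration of how this can be wired up, here is a minimal sketch using TorchCAM’s SmoothGradCAMpp. The model below is a stand-in ImageNet ResNet50, and the image path and preprocessing are assumptions; the exact code used for the seed model isn’t shown in the article.

```python
import torch
from torchvision.io import read_image
from torchvision.models import resnet50
from torchvision.transforms.functional import normalize, resize, to_pil_image
from torchcam.methods import SmoothGradCAMpp
from torchcam.utils import overlay_mask

# Stand-in model; in practice the trained seed classifier would be loaded here.
model = resnet50(weights="IMAGENET1K_V1").eval()

with SmoothGradCAMpp(model) as cam_extractor:
    img = read_image("seed.png")  # hypothetical image path
    input_tensor = normalize(resize(img, (224, 224)) / 255.0,
                             [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    out = model(input_tensor.unsqueeze(0))
    # Class activation map for the predicted class
    activation_map = cam_extractor(out.squeeze(0).argmax().item(), out)

# Overlay the coarse heatmap on the original image and save it
result = overlay_mask(to_pil_image(img),
                      to_pil_image(activation_map[0].squeeze(0), mode="F"),
                      alpha=0.5)
result.save("seed_gradcam.png")
```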

As it turned out, the updated model has a keen eye. It always looks at the focal point of the picture – the seed. It doesn’t focus on irrelevant areas of the image or develop a wandering eye. I suppose one could say Computer Vision syndrome doesn’t apply to our Computer Vision model. 

In the images below you can see examples of the activations. The rightmost image is the original, the leftmost shows a heatmap of the activations, and the middle image presents an overlay of the heatmap on top of the original image.

heatmap of activations indicating where the CNN model is looking

GradCAM and its variations can be a useful tool in the analysis of models with convolutions, not only for analyzing a single model’s behavior but also for model comparisons and debugging. Although GradCAM gives us interesting insights into our models, it should be noted that it is a heuristic tool for looking into model predictions and should be taken with a grain of salt.

Lessons learned using CNN for image classification

  1. Armed with only a small dataset, even an imperfect one with low-resolution and blurred photos, you can still apply machine learning methods and make the most of your project.
  2. Highly blurred images in the dataset significantly degraded model training and its predictions, lowering accuracy to ~68%. As one might expect, fewer details mean less information for the model.
  3. Sharpening the pictures also worsened the predictions, but to a lesser extent, with an accuracy of ~71%.
  4. GradCAM is an interesting visualization tool when working with a CNN for image classification. It can be used to reveal the inner workings of models and provide valuable insight.