
The accuracy of a species recognition model largely depends on the quality of the dataset it was trained on. In computer vision, there’s a well-known saying: “garbage in, garbage out.” A low-quality dataset will result in a low-quality model, so investing time and effort into creating a high-quality dataset is essential for achieving the best results.

Image variability
Species recognition models are trained on labelled images from the project area in which they will be deployed. By repeatedly analysing these examples, the model learns to identify species effectively. A good rule of thumb is to provide at least 10,000 images per class. While more images per class are always beneficial, the returns diminish as the quantity increases.

What matters most is the variability of the images. For example, 5,000 images from 100 locations are far more valuable than 10,000 images from a single location. To develop a robust model, the dataset should be as diverse and heterogeneous as possible, incorporating different locations, backgrounds, camera types, angles, weather conditions, habitats, etc.
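One way to check the point above is to count, for each class, both the images and the distinct locations they come from. The sketch below is only an illustration: the function name `summarise_variability` and the (species, location) record format are assumptions made for this example, not part of any Addax tooling.

```python
from collections import defaultdict


def summarise_variability(records):
    """Count images and distinct locations per species.

    `records` is an iterable of (species, location) pairs, one per image.
    Both the record format and this function are hypothetical.
    """
    image_counts = defaultdict(int)
    locations = defaultdict(set)
    for species, location in records:
        image_counts[species] += 1
        locations[species].add(location)
    return {
        species: {
            "images": image_counts[species],
            "locations": len(locations[species]),
        }
        for species in image_counts
    }


if __name__ == "__main__":
    example = [
        ("ostrich", "site_a"),
        ("ostrich", "site_b"),
        ("gemsbok", "site_a"),
    ]
    for species, stats in summarise_variability(example).items():
        print(species, stats)
```

A class with many images but only one or two locations is a candidate for supplementing with data from other sites.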

However, creating a perfect dataset may not always be feasible in real-world scenarios. For classes with limited data, don’t worry: Addax will work with whatever you have. When necessary, the dataset might be supplemented with images from other ecological studies. While these external images are less effective than project-specific data, they still contribute to improving the model’s accuracy.

Image tagging
As the expert on your region’s wildlife, you are best suited to annotate the data. Addax essentially needs to know which animal appears in each image or video, and this information can be provided in various ways. Many organisations have been labelling their images for years and already have a preferred method. That’s great!

If you still need to label or organise your data, applications like ZIP-classifier are excellent for tagging images efficiently. They allow you to organise and label large datasets and prepare them for model development. Below are two common methods of labelling your data.

  • Tagging by folder structure - Organise your images into folders based on species. Each class should have its own subfolder (see example below). Ideally, the folder structure inside each class subfolder should be left unaltered (see 'Original folder structure' for the reason).

  • Tagging with spreadsheet metadata - Alternatively, you can provide a metadata file (e.g., XLSX, CSV, TXT, or JSON) that lists image or video file names with their corresponding species tags. Addax will then handle the folder organisation programmatically.
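The two methods above carry the same information, so one can be derived from the other. As a minimal sketch, assuming the first folder in each relative path is the species name, the following turns a folder-based layout into CSV rows. The column names `file` and `species`, the list of extensions, and both function names are hypothetical choices for this example, not a format Addax prescribes.

```python
import csv
import io
from pathlib import PurePosixPath

# Assumed set of media extensions; adjust to what your cameras produce.
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".mp4", ".avi"}


def labels_from_paths(relative_paths):
    """Build one row per media file, using the top-level folder as the tag."""
    rows = []
    for raw in relative_paths:
        path = PurePosixPath(raw)
        # Skip non-media files and files that are not inside a class folder.
        if path.suffix.lower() not in MEDIA_EXTENSIONS or len(path.parts) < 2:
            continue
        rows.append({"file": str(path), "species": path.parts[0]})
    return rows


def to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["file", "species"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    paths = [
        "ostrich/site1/deployment1/IMG_0001.JPG",
        "gemsbok/site2/deployment4/IMG_0002.jpg",
    ]
    print(to_csv(labels_from_paths(paths)))
```

Note that the full relative path is kept in the `file` column, so the original folder structure below each class folder is preserved.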

Original folder structure
It is important to keep as much information about each image as possible. For example, maintaining the original folder structure helps Addax understand which images belong together and which are independent. This distinction is crucial for creating proper train, validation, and test splits during model training: images from the same deployment typically share a background and lighting, so they should end up in the same split. While not absolutely essential, this step significantly enhances the model's accuracy.

For example, a folder structure like <organisation>/<project>/<site>/<deployment>/<image> provides valuable information for model development. Alternatively, other hierarchical structures, such as <area>/<park>/<camera>/<image>, work just as well. The key is to organise the data from large groupings to smaller ones.
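To illustrate why the hierarchy matters, here is a minimal sketch of a group-aware split. It assumes the `<organisation>/<project>/<site>/<deployment>/<image>` layout and treats everything up to the site level as one group. How Addax actually builds its splits is not described here; the hash-based assignment, the function names, and the default fractions are all assumptions for the example.

```python
import hashlib
from pathlib import PurePosixPath


def group_key(path, depth=3):
    """Return the first `depth` folders, e.g. organisation/project/site."""
    return "/".join(PurePosixPath(path).parts[:depth])


def assign_split(path, depth=3, val_fraction=0.1, test_fraction=0.1):
    """Assign an image to a split based only on its group key.

    Every image with the same group key gets the same split, so images
    from one site never end up on both sides of a train/test boundary.
    """
    digest = hashlib.sha256(group_key(path, depth).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"


if __name__ == "__main__":
    for p in [
        "org/project/site1/deployment1/IMG_0001.jpg",
        "org/project/site1/deployment2/IMG_0042.jpg",
        "org/project/site2/deployment1/IMG_0007.jpg",
    ]:
        print(p, "->", assign_split(p))
```

Because the assignment is made per group rather than per image, the resulting split sizes only approximate the requested fractions, especially when there are few sites. If the folder structure has been flattened, the group key can no longer be recovered from the path.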

Luckily, most conservation agencies already have their own way of structuring their data, so there’s usually no need to alter any existing folder structures. Please note that Addax does not require any latitude or longitude information. If needed, you can safely remove any location metadata.

Multi-species occurrences
Images containing multiple species can present challenges during data processing. Addax’s annotation method will label all animals in an image with the tag provided by the client. For instance, the image below contains both ostriches and gemsboks in the background. If this image is labelled “ostrich”, all individual animals, including the gemsboks, will receive the “ostrich” label.

In most cases, multi-species occurrences are rare and not a major concern. However, they can create issues in specific contexts, such as studies conducted in pastures with domesticated animals or cameras monitoring watering holes. A few misclassifications are manageable and won’t significantly affect the model’s performance. However, frequent errors will impact accuracy. If you are aware of such images in your dataset, please provide a list of the affected images, deployments, or locations to help mitigate potential issues.

Please note that the resulting model will be capable of detecting multiple species in a single image. Multi-species occurrences pose a challenge only during the training phase, where they are best excluded from the dataset or kept separate.
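If you keep a list of affected images, deployments, or locations, separating them is straightforward. The sketch below is a hypothetical helper, not Addax's own procedure: it assumes the list contains relative paths, where an entry can be a single image or a whole folder, and it compares whole path components so that `site1` does not accidentally match `site10`.

```python
from pathlib import PurePosixPath


def is_flagged(path, flagged_entries):
    """True if `path` equals a flagged entry or lies inside a flagged folder."""
    parts = PurePosixPath(path).parts
    for entry in flagged_entries:
        entry_parts = PurePosixPath(entry).parts
        if parts[: len(entry_parts)] == entry_parts:
            return True
    return False


def separate_flagged(paths, flagged_entries):
    """Split paths into (kept, set_aside) according to the flagged list."""
    kept, set_aside = [], []
    for path in paths:
        if is_flagged(path, flagged_entries):
            set_aside.append(path)
        else:
            kept.append(path)
    return kept, set_aside


if __name__ == "__main__":
    all_paths = [
        "park/waterhole_cam/deployment1/IMG_0001.jpg",
        "park/ridge_cam/deployment1/IMG_0002.jpg",
    ]
    flagged = ["park/waterhole_cam"]  # hypothetical multi-species location
    kept, set_aside = separate_flagged(all_paths, flagged)
    print("kept:", kept)
    print("set aside:", set_aside)
```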

Duplicate folders
Duplicates in the dataset can negatively impact model accuracy. If duplicate folders or images are present, they might inadvertently end up in different data splits (train, validation, or test). The model is then partly evaluated on images it has already seen during training, so it appears more accurate than it really is and its ability to generalise is overestimated. While a few duplicates won’t pose a significant problem, many of them can hinder the model’s learning process. Removing duplicates is an important step in maintaining dataset quality and improving the overall performance of the model.
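A simple way to find exact duplicates is to hash the contents of every file and group files that share a hash. The following is a minimal sketch using only the Python standard library; the function names are illustrative. It detects byte-identical copies only, so images that were resized or re-encoded will not be caught.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root):
    """Return lists of files under `root` that have identical contents."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    import sys

    # Usage: python find_duplicates.py /path/to/dataset
    for group in find_duplicates(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("Identical files:")
        for path in group:
            print("  ", path)
```

When a whole folder has been copied, every file in it will show up paired with its counterpart, which makes duplicate folders easy to spot in the output.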

More questions?
If you have further questions or need additional help with labelling, feel free to reach out.