the design of the so-called "artificial neurons"— has arrived with the promise of revolutionizing areas such as medical diagnosis, therapy, patient management, prognosis, or public health and surveillance systems, among others.

Deep learning is a branch of machine learning that encompasses algorithms and models that allow computers to automatically learn complex patterns and relationships from data, thanks to architectures indirectly inspired by neuroscience, which are composed of multiple stacked processing layers (9), each one containing artificial "neurons": simulated processing units that carry out mathematical operations. The most popular example in this area is the deep neural network. Neural networks can take different kinds of data as input, such as the values of several variables, images, text or audio. The data is passed through the successive layers of artificial neurons, and inside each of these processing units the input (which originally could be the value of a pixel in an image, a signal, or clinical data, if we are trying to classify or predict a disease state as an outcome) is weighted. The final aggregation of the serial operations performed by the network is used to calculate, through a mathematical function, the probability that the instance belongs to each class.

Training these networks consists of finding, for each artificial neuron, the weights that lead to the best classification/prediction accuracy. Roughly, in one of the most classical approaches in the area, a set of data is used to iteratively test the network and update the weights by backpropagating the derivative of the error obtained by comparing the actual class of each instance with the prediction of the model.
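As a minimal illustration of this training scheme, the following sketch shows a forward pass, an error computation and a backpropagation-based weight update. Python with PyTorch is assumed here purely for illustration (the paper ties this generic description to no particular framework), and the toy data and two-layer network are invented for the example:

```python
import torch
import torch.nn as nn

# Toy data: 8 instances with 10 input variables each, 3 possible classes
# (placeholder values invented for the example).
inputs = torch.randn(8, 10)
labels = torch.randint(0, 3, (8,))

# A small network: two stacked layers of artificial "neurons".
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()        # error between prediction and actual class
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    logits = model(inputs)             # forward pass through the layers
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()                    # backpropagate the derivative of the error
    optimizer.step()                   # update the weights

# Probability of belonging to each class, via a mathematical function (softmax).
probabilities = torch.softmax(model(inputs), dim=1)
```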
The development of these methods has allowed computers to perform with great accuracy tasks that can seem easy for humans but have been difficult to formalize for computers, such as object detection or speech recognition, to mention two common cases (10,11).

In this paper, we focus our work on a particular deep learning technique for computer vision: convolutional neural networks (CNNs). We base our approach on this technique to build an image analysis system that can aid clinical diagnosis in STDs.

2. METHODS

2.1. Image dataset handling

The main requirement of deep learning models is a dataset of good quality and sufficient size to be trained on. For this work, we had available a series of images of both male and female external genitalia and the perianal area, with visible lesions of three types: herpes, genital warts and condylomas. Before the classification model of choice could be trained and tested, some preprocessing was done: all the images were labeled with the corresponding condition and, as they had been taken from varied angles, lighting conditions and positions, they were manually cropped so that the lesions were approximately centered; they were also normalized and rescaled to a standard size. The result of this process was a set of images of 224x224 pixels and 3 color channels, with values scaled between 0 and 1 (a code sketch of these steps appears after section 2.2.1, below). The dataset contained 261 images in total, of which 42 belonged to the herpes class, 34 to genital warts and 185 to condylomas.

2.2. Deep learning model

As mentioned before, the deep learning model used for this work is a CNN. However, as is the case with any other deep learning technique, this model acts like a "black box": it classifies images but does not provide any direct explanation for its decisions. For that reason, we decided to explore the use of a technique from the field of Explainable Artificial Intelligence (XAI) that helps assess the model's functioning by producing visual explanations.

2.2.1. Convolutional Neural Network

CNNs are a particular kind of neural network that is especially well suited to computer vision tasks. They are inspired by the visual processes of living beings (the goal is not to imitate nature but to optimize image processing by computers): they consist of processing units that operate on a set of pixels, much as the cells of the animal visual cortex detect light in receptive fields (12). They receive their name from the basic mathematical operation these units perform: the convolution. The processing units are grouped in stacked layers (see Figure 1), with different architectures depending on each case, and end up learning different features, invariant to position, that lead to the final classification. This means, for example, that a layer inside the network might learn to detect edges, specific shapes or more abstract features that will then be detected in other images (13).

Specifically, the network architecture used for this prototype system is the state-of-the-art EfficientNet (14), with some minor adjustments to properly fit the available image dataset. Given the limited quantity of data available for network training, transfer learning was used. This method consists of taking a network previously trained on another large dataset, unrelated to the problem at hand (in this case, the ImageNet object recognition dataset (15)), and then retraining it with the desired data. Thus, the generic deep features learned by the neural network are exploited, and the model is then refined for the problem at hand, achieving better performance than if it had been trained only on the small dataset (16).
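As a concrete illustration of the preprocessing described in section 2.1, the following minimal sketch uses PIL and torchvision transforms; the tooling and the file name are assumptions made for the example, and the manual cropping around the lesion is taken as already done:

```python
from PIL import Image
from torchvision import transforms

# Hypothetical path to an already manually cropped lesion image.
image = Image.open("lesion_cropped.jpg").convert("RGB")

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # rescale to the standard input size
    transforms.ToTensor(),          # 3 color channels, values scaled to [0, 1]
])
tensor = preprocess(image)          # shape: (3, 224, 224)
```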

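The convolution operation that gives CNNs their name (section 2.2.1) can likewise be sketched as follows; the layer sizes are arbitrary choices for the example, and the hand-crafted edge kernel merely illustrates the kind of low-level feature an early layer may end up learning on its own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One convolutional layer: each unit computes a convolution over a small
# neighbourhood of pixels, analogous to a receptive field.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

image = torch.rand(1, 3, 224, 224)        # a batch with one preprocessed image
feature_maps = conv(image)                # shape: (1, 8, 224, 224)

# A fixed kernel that responds to vertical edges, similar to what a
# trained early layer might come to detect.
sobel = torch.tensor([[[[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]]])
gray = image.mean(dim=1, keepdim=True)    # collapse to a single channel
edges = F.conv2d(gray, sobel, padding=1)  # vertical-edge feature map
```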
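Finally, the transfer learning setup could look like the sketch below. The B0 variant, the torchvision implementation (version 0.13 or later) and the decision to freeze the pre-trained feature extractor are all assumptions made for illustration; the paper only specifies an EfficientNet pre-trained on ImageNet, with minor adjustments:

```python
import torch.nn as nn
from torchvision import models

# Load an EfficientNet pre-trained on the ImageNet object recognition dataset.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Exploit the generic deep features: freeze the pre-trained layers...
for param in model.features.parameters():
    param.requires_grad = False

# ...and replace the final classification layer with one for the three
# lesion classes (herpes, genital warts, condylomas).
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 3)

# The adjusted model can then be retrained (fine-tuned) on the small
# genital lesion dataset with a training loop like the one sketched earlier.
```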