Masksemble-Aided Cross-ViT for Uncertainty Estimation in Skin Cancer Diagnosis
Received: 19-Sep-2024 / Manuscript No. AOT-24-148409 / PreQC No. AOT-24-148409 (PQ) / Reviewed: 08-Oct-2024 / QC No. AOT-24-148409 / Revised: 04-Jun-2025 / Manuscript No. AOT-24-148409 (R) / Published Date: 11-Jun-2025 DOI: 10.4172/aot.1000310
Abstract
In this work, we investigate a Masksemble-aided Cross-ViT model that measures the uncertainty of feature representations for cancer identification. We propose a Cross-ViT with a special Masksemble block in order to create discriminative image features. The Masksemble layer estimates the uncertainty of a given dermatoscopy image, which plays a crucial role in cancer identification, and its output is then passed to the Cross-ViT network for the classification task. Comprehensive results show that our method outperforms CNN models and vision transformers. The model detects skin cancer by differentiating cancerous (malignant) cells from non-cancerous (benign) ones. Its predictions are assessed with performance metrics such as precision, recall, F1-score, and average accuracy along with class-wise accuracy, which show the effectiveness of the proposed method. In addition to being verified for binary classification, the suggested model is also tested on multiple classes using the HAM-10000 dataset, demonstrating the system’s effectiveness in multi-class scenarios.
Keywords: Deep learning, Masksemble block, Vision transformer, Skin cancer
Introduction
Skin cancer is one of the most prevalent types of cancer worldwide, with melanoma being the most aggressive form. Early detection and accurate diagnosis are critical for improving classification outcomes. Skin cancer encompasses various types, each with distinct characteristics, risks, and treatment approaches. Melanoma, the most aggressive form, arises from melanocytes, the pigment-producing cells. It often presents as an irregularly shaped, multicoloured lesion with an elevated border. Melanocytic nevus, commonly known as a mole, is generally benign but can occasionally transform into melanoma. Basal cell carcinoma, the most prevalent skin cancer, originates from basal cells and typically appears as a pearly or waxy bump. Actinic keratosis results from sun damage, manifesting as rough, scaly patches that may progress to squamous cell carcinoma if left untreated. Benign keratosis, or seborrheic keratosis, is non-cancerous, and characterized by wartlike growths on the skin. Dermatofibroma is a benign fibrous tumour, often presenting as a firm nodule. Awareness of the distinctive features of each skin cancer type is crucial for early detection and appropriate medical intervention.
Skin cancer is a serious condition that needs to be identified as early as possible. Biopsy is typically the standard diagnostic procedure used to identify skin cancer, but the entire process is expensive, time-consuming, and painful. These days, macroscopic and dermoscopic images are the most commonly used non-surgical diagnostic tools [1,2]. Because macroscopic photographs are typically taken with a camera or cell phone, they suffer from lower resolution [3]. Dermoscopy produces high-resolution skin images by observing the underlying skin structures [4]. Even with dermoscopy images, dermatologists find it difficult to diagnose skin cancer, since different kinds of the disease have similar symptoms. Skin cancer can manifest in a variety of ways, and even highly skilled dermatologists are limited by what they have studied and seen. The level of accuracy varies according to dermatologists’ experience, and, most concerningly, less experienced dermatologists may perform worse [5]. Patients who receive false-negative skin cancer diagnoses may be in grave danger.
In this paper, we propose a deep learning model capable of identifying and classifying dermatoscopic images into benign and malignant categories. The data is first preprocessed by scaling it to a resolution of 224 × 224. Next, we apply augmentation techniques to the dataset, including width-height shifting, rotation, horizontal and vertical flipping, and standard normalization. Finally, the images are fed to the neural network models. After preprocessing, each image is passed to a Masksemble block that contains 4 layers (2 convolutional and 2 pooling layers) followed by a Masksemble layer (Figure 1). The Masksemble layer is added for uncertainty estimation, aiding accurate prediction of benign vs. malignant. It combines MC-Dropout and Deep Ensembles, ultimately producing feature maps. Using these feature maps, we employ the Vision Transformer (ViT) for classification. The proposed model is validated not only on binary classification but also on multiple classes using the HAM-10000 dataset, establishing the system’s efficacy in the multi-class setting. Our experimental results demonstrate the superiority of our approach over other models for identifying skin cancer.

Figure 1: Overall design of the proposed Cross-ViT with Masksemble block, which estimates the uncertainty of an input image and then passes it to Cross-ViT to predict the class. The Masksemble layer is preceded by two convolution layers and two pooling layers. The Cross-ViT consists of tokens with an encoder and finally predicts the class as benign or malignant.
Paper contributions
Deep learning and convolutional networks have demonstrated impressive capabilities in detecting skin cancer. However, their predictions are inevitably accompanied by a degree of uncertainty, which can hinder the accuracy of classification. To mitigate this challenge, we present a pioneering model equipped with a specialized uncertainty estimation technique. By integrating this approach, our model substantially enhances the precision of skin cancer type classification. This innovative solution not only addresses the inherent uncertainty in predictions but also significantly improves the reliability of diagnostic outcomes.
• To enhance the precision of our classification results, we integrate the Masksemble layer into our approach, enabling us to calculate uncertainty effectively. This advanced uncertainty estimation technique significantly improves the model’s predictive accuracy, thereby enhancing its overall performance in skin cancer classification.
• We combine the Masksemble layer with Cross-ViT; the resulting model is called Cross-ViT with Masksemble block. This Cross-ViT with Masksemble block increases the accuracy of skin cancer classification.
• The International Skin Imaging Collaboration (ISIC), which provides a large skin cancer dataset to the medical and CAD (Computer-Aided Diagnosis) community, significantly contributes to the advancement of skin cancer image processing. In this work, we evaluate our approach using the ISIC-2016 and ISIC-2018 datasets and the Kaggle dataset to validate the efficacy of the proposed model.
• Our proposed model exhibits superior performance in comparison to traditional Convolutional Neural Network (CNN) models and vision transformers, showcasing its advanced capabilities in handling complex classification tasks.
Materials and Methods
Numerous studies have been conducted on the classification of images, and reviewing the relevant publications greatly informed our analysis. A machine-learning method for identifying melanoma skin cancer was previously demonstrated by M. Vidya and MV Karki. Their methodology consists of five stages: segmentation, feature extraction, preprocessing, classification, and data gathering. They achieved 97.8% accuracy using SVM. In medical image analysis, vision transformers have been applied extensively. Still, the majority of existing approaches consider only the class token during training, ignoring the information in the output patch tokens. To address this issue, a two-stage, token-labelling-guided multi-scale model has been suggested for medical image classification. K. Manasa and DGV Murthy used malignant and benign images to classify skin cancer with the VGG16 and ResNet-50 models. They employed 3297 images in total, of which 1497 were from the malignant class and 1800 from the benign class, to train their models. For the VGG16 and ResNet-50 models, they obtained accuracy rates of 80% and 87%, respectively. M. Hasan et al. classified cancer images with a convolutional neural network [6]. Their dataset contained malignant and benign groups. To reduce CPU utilization, some images were transformed into grayscale versions, and the preprocessed data was fed to the convolutional neural network. Finally, they used accuracy, F1-score, specificity, recall, and precision to assess their model, which classified 89.5% of the test dataset correctly.
T. Saba presented a review of skin cancer analysis, covering the studies conducted on the categorization of skin cancer. According to the analysis, SVM, CNN, and ANN models have been used in the majority of prior research projects. M. A. Kassem et al. suggested a modified GoogleNet model to classify eight kinds of skin lesions. To enhance features and reduce noise, they added extra filters to each layer. They used two distinct methods to replace the final three layers, which were replaced by a fully connected layer, a softmax layer, and a classification output layer. Their suggested model outperforms the original GoogleNet model for categorization. For the purpose of classifying skin cancer, D. N. Le et al. employed transfer learning approaches, combining class weights and focal loss with pre-trained ResNet-50, VGG16, and MobileNet models. They used per-class weights to compensate for class imbalance: classes with more samples were allocated lower weights, whereas those with fewer samples were given higher weights. They obtained an average accuracy of 93% on the test data using their method.
Dataset and task description
The International Skin Imaging Collaboration (ISIC) created the ISIC archive, a global library of dermoscopic pictures, with the dual goals of assisting clinical training and advancing technical research that would ultimately result in automated algorithmic analysis. The ISIC expands its dataset and issues a challenge each year to take advantage of automated skin cancer diagnosis.
The detailed description of the datasets is given in Table 1. For the first three datasets (ISIC-2016, ISIC-2018, and the Kaggle dataset), we classified the images into 2 classes, benign and malignant (as not all diagnosed types are present in the metadata provided by ISIC).
| Dataset | Number of images | Classification categories |
| ISIC-2016 | 1279 (train-900, test-379) | 2 (benign-malignant) |
| ISIC-2018 | 11,527 (train-10,015, test-1512) | 2 (benign-malignant) |
| Kaggle | 3297 (train-2637, test-660) | 2 (benign-malignant) |
Table 1: Description of dataset.
HAM10000 dataset: The limited quantity and lack of variety in existing datasets of dermatoscopic images make it difficult to train neural networks for automated diagnosis of pigmented skin lesions. The HAM10000 ("Human Against Machine with 10000 training images") dataset was made available to address this issue. It consists of dermatoscopic images from various populations, acquired and stored using various modalities. The cases comprise a comprehensive set of all significant diagnostic categories related to pigmented lesions: actinic keratoses and intraepithelial carcinoma/Bowen’s disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines/seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv), and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc).
The proposed architecture is shown in Figure 1, and the step-by-step procedure is explained below.
Pre-processing of dermatoscopy data
While processing all datasets, we balanced the train and test images across the 2 classes (benign and malignant) by offline data augmentation, which includes 90-degree rotation, horizontal flip, vertical flip, and centre cropping to a size of 224 × 224. After balancing, we normalized the dataset using a mean of (0.485, 0.456, 0.406) and a standard deviation of (0.229, 0.224, 0.225). After processing the whole dataset, we send it to the Masksemble block. A sample of dermatoscopy images is shown in Figure 2.
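The pre-processing steps described above can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' pipeline: the function names are hypothetical, and it assumes that pixel values are scaled to [0, 1] before the mean and standard deviation are applied, which is the usual convention for these statistics but is not stated in the text.

```python
import numpy as np

# Per-channel statistics quoted in the text (RGB order).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def center_crop(image, size=224):
    """Crop the central size x size window from an (H, W, C) image."""
    height, width = image.shape[:2]
    if height < size or width < size:
        raise ValueError("image is smaller than the crop size")
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]


def augment(image):
    """Return the three offline variants named in the text."""
    return [
        np.rot90(image, k=1, axes=(0, 1)),  # 90-degree rotation
        image[:, ::-1],                     # horizontal flip
        image[::-1, :],                     # vertical flip
    ]


def normalize(image_uint8):
    """Scale to [0, 1] (assumption), then standardize each channel."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - MEAN) / STD


# Hypothetical input: a random 300 x 256 RGB image.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(300, 256, 3), dtype=np.uint8)
cropped = center_crop(raw)
batch = [normalize(v) for v in [cropped] + augment(cropped)]
```

Each entry of `batch` has shape (224, 224, 3), so the original crop and its three augmented variants can be stacked before being passed on.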

Figure 2: Dermatoscopy image of benign and malignant.
Processed data to proposed Masksemble block
Let us denote an image by $x \in \mathbb{R}^{W \times H \times C}$ with $W = H$, where the width ($W$) and height ($H$) of our images are equal (224 × 224) and each image contains 3 channels (RGB). This block includes 2 sub-blocks: one is a block of convolution with pooling, and the other is the Masksemble layer.
Layers before the Masksemble layer: This sub-block contains 2 convolution layers and a pooling layer that is applied after each convolution, for down-sampling and channel enhancement. We initialize the first convolution layer (c1) and the second convolution layer (c2) with the in-channel, out-channel, kernel size, stride, padding, and activation values listed in Table 2.
| Conv. layer | In channel | Out channel | Kernel size | Stride | Padding | Activation layer |
| c1 | 3 | 8 | 3 | 1 | 1 | Relu |
| c2 | 8 | 32 | 3 | 1 | 1 | Relu |
Table 2: Description of the layer before Masksemble.
The pooling layer before the Masksemble layer uses a kernel size and a stride of 2. The convolution layers increase the number of channels, and the pooling layer down-samples the feature map. First, c1 is applied to the image, followed by pooling, and then the same is done with c2. After this block, the pre-processed image $x \in \mathbb{R}^{224 \times 224 \times 3}$ is converted into a feature map of size $56 \times 56 \times 32$.
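The shape change stated above can be checked with the standard output-size formulas for convolution and pooling. The sketch below uses only the settings from Table 2 and the pooling description; the helper names are illustrative.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Spatial output size of a pooling layer along one dimension."""
    return (size - kernel) // stride + 1


# Settings from Table 2 (c1, c2) and the pooling description in the text.
CONV_LAYERS = [
    {"name": "c1", "out_channels": 8, "kernel": 3, "stride": 1, "padding": 1},
    {"name": "c2", "out_channels": 32, "kernel": 3, "stride": 1, "padding": 1},
]
POOL = {"kernel": 2, "stride": 2}


def block_output_shape(height, width):
    """Apply conv -> pool for c1, then conv -> pool for c2."""
    channels = 3  # RGB input
    for layer in CONV_LAYERS:
        height = conv_out(height, layer["kernel"], layer["stride"], layer["padding"])
        width = conv_out(width, layer["kernel"], layer["stride"], layer["padding"])
        channels = layer["out_channels"]
        height = pool_out(height, **POOL)
        width = pool_out(width, **POOL)
    return height, width, channels


print(block_output_shape(224, 224))  # (56, 56, 32)
```

With a 3 × 3 kernel, stride 1, and padding 1, each convolution preserves the spatial size, so the two pooling steps alone account for the reduction from 224 to 112 to 56.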
Masksemble layer
Uncertainty estimation aims to generate a confidence level for model predictions. For deep neural networks, two of the most well-known and useful uncertainty estimation techniques are MC-Dropout and Deep Ensembles. In Deep Ensembles, an ensemble of deep neural networks is trained on the same data, each starting from a different random initialization. Deep Ensembles have a significant computational overhead since they need to train many separate networks, all of which must be kept in memory during inference. Masksembles are designed to provide control over how an ensemble’s members correlate with one another, allowing for a good balance between accurate uncertainty assessment and computational cost. Masksembles is thus an uncertainty estimation method that combines the two most popular approaches, MC-Dropout and Deep Ensembles. In our model, the Masksemble layer is inserted just after the convolution and pooling layers, and together they form the Masksemble block. Masksembles apply binary masks to the inputs by multiplying them channel-wise. In our model architecture, we use 4 binary masks with a scale value of 2. The probability distribution for the 4 masks is shown in Figure 3, where two different colours are used to represent the uncertainty of each mask.

Figure 3: Uncertainty estimation of benign and malignant using 4 masks, labelled mask1, mask2, mask3, and mask4 in the figure. The green bars show the certainty of being benign and the red bars show the certainty of being malignant for each mask.
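The channel-wise masking described in this subsection can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the Masksembles library and not the authors' code: the masks are drawn at random so that each keeps roughly channels/scale channels, which is an assumption made for illustration and differs from the mask-generation procedure of the original Masksembles method. The (batch, channels, height, width) layout and all function names are also assumptions.

```python
import numpy as np


def make_masks(n_masks, channels, scale, seed=0):
    """Fixed binary channel masks; each keeps about channels/scale channels.

    Simplified random construction (assumption), used only for illustration.
    """
    rng = np.random.default_rng(seed)
    keep = max(1, round(channels / scale))
    masks = np.zeros((n_masks, channels), dtype=np.float32)
    for i in range(n_masks):
        kept = rng.choice(channels, size=keep, replace=False)
        masks[i, kept] = 1.0
    return masks


def apply_masks(features, masks):
    """Split the batch into len(masks) groups and mask each group channel-wise."""
    batch = features.shape[0]
    n_masks = masks.shape[0]
    if batch % n_masks != 0:
        raise ValueError("batch size must be divisible by the number of masks")
    per_sample = np.repeat(masks, batch // n_masks, axis=0)  # (batch, channels)
    return features * per_sample[:, :, None, None]


def mean_and_spread(probabilities):
    """Average the per-mask class probabilities and report their spread."""
    probabilities = np.asarray(probabilities, dtype=np.float64)
    return probabilities.mean(axis=0), probabilities.std(axis=0)


# 4 masks with scale 2 on the 32-channel, 56 x 56 feature map from the text.
masks = make_masks(n_masks=4, channels=32, scale=2)
features = np.ones((8, 32, 56, 56), dtype=np.float32)
masked = apply_masks(features, masks)
```

At inference time, the same input can be evaluated once per mask; `mean_and_spread` then gives the averaged class probabilities and their standard deviation across masks, which serves as a simple uncertainty indicator of the kind plotted in Figure 3.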
Masksemble block to cross-ViT
Cross-ViT is an enhanced version of the Vision Transformer (ViT) that uses 2 different patch types (small patch and large patch). We first present a brief overview of ViT and Cross-ViT.
Overview of Vision Transformer (ViT): The Vision Transformer, often abbreviated as ViT, is a deep learning model architecture designed for computer vision tasks. It was introduced in the paper titled "An image is worth 16 × 16 words: Transformers for image recognition at scale" by Alexey Dosovitskiy et al. The vision transformer represents a departure from traditional Convolutional Neural Networks (CNNs) by adopting the transformer architecture, which was initially proposed for natural language processing tasks. The Vision Transformer architecture includes 5 key components:
Image patch embedding: Instead of processing entire images as input, ViT divides images into fixed-size non-overlapping patches. Each patch is then linearly embedded to form a flat sequence, which serves as the input to the transformer.
Positional embeddings: In natural language processing tasks, the transformer relies on the sequential order of words, which is naturally encoded by the position of words in the input sequence. For images, the vision transformer introduces positional embeddings to provide information about the spatial relationships between different patches.
Multi-head attention: ViT uses multi-head self-attention to capture diverse aspects of relationships between patches. Multiple attention heads allow the model to focus on different parts of the input, enhancing its ability to learn complex patterns. The input of ViT, $x_0$, and the processing of the $k$-th block can be expressed as

$$x_0 = [x_{cls} \,\|\, x_{patch}] + x_{pos}, \qquad y_k = x_{k-1} + \mathrm{MSA}(\mathrm{LN}(x_{k-1})), \qquad x_k = y_k + \mathrm{FFN}(\mathrm{LN}(y_k)).$$

Here MSA denotes multi-head self-attention, FFN the feed-forward network, and LN layer normalization; $x_{cls} \in \mathbb{R}^{1 \times C}$ and $x_{patch} \in \mathbb{R}^{N \times C}$ are the cls and patch tokens respectively, and $x_{pos} \in \mathbb{R}^{(1+N) \times C}$ is the position embedding.
Transformer architecture: The vision transformer applies the transformer architecture, originally designed for sequential data like text, to images. This architecture eliminates the need for handcrafted feature engineering, as it allows the model to learn hierarchical representations directly from the data.
Classification head: The output from the transformer is fed into a classification head for making predictions. In the case of image classification, a standard softmax layer is often used.
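The patch-embedding and positional-embedding components listed above can be sketched with NumPy as follows. The patch size of 8, the embedding dimension of 64, and the randomly initialized weights are illustrative assumptions; the text does not state the values used in the model, and in a trained network these weights would be learned.

```python
import numpy as np


def patchify(feature_map, patch):
    """Split an (H, W, C) array into non-overlapping flattened patches."""
    height, width, channels = feature_map.shape
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    grid_h, grid_w = height // patch, width // patch
    x = feature_map.reshape(grid_h, patch, grid_w, patch, channels)
    x = x.transpose(0, 2, 1, 3, 4)  # (grid_h, grid_w, patch, patch, C)
    return x.reshape(grid_h * grid_w, patch * patch * channels)


def embed_tokens(feature_map, patch, dim, seed=0):
    """Linear patch embedding, prepended cls token, added position embedding."""
    rng = np.random.default_rng(seed)
    patches = patchify(feature_map, patch)                 # (N, patch*patch*C)
    w_embed = rng.normal(0.0, 0.02, size=(patches.shape[1], dim))
    x_patch = patches @ w_embed                            # (N, dim)
    x_cls = np.zeros((1, dim))                             # cls token
    x_pos = rng.normal(0.0, 0.02, size=(1 + len(patches), dim))
    return np.concatenate([x_cls, x_patch], axis=0) + x_pos


tokens = embed_tokens(np.ones((56, 56, 32)), patch=8, dim=64)
print(tokens.shape)  # (50, 64)
```

A 56 × 56 map with 8 × 8 patches yields 7 × 7 = 49 patch tokens, and the cls token brings the sequence length to 50.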
Cross-Vision Transformer (Cross-ViT): Cross-ViT contains more complex and efficient components than ViT. It has two branches: the L-Branch is the large (main) branch, with more transformer encoders and a wider embedding dimension, operating on coarse-grained patches, while the Small Branch (S-Branch) operates on fine-grained patches with fewer encoders and a smaller embedding dimension. Both branches are linearly projected before entering the multiscale transformer encoder block.
Multiscale transformer encoder block: This block includes a transformer encoder for the large-patch projection and one for the small-patch projection, together with a cross-attention block. The transformer encoder block here is the same as in the vision transformer. The cross-attention module for the small branch is illustrated below

Here $f^s(\cdot)$ and $g^s(\cdot)$ are the projection functions for dimension alignment, and $x^s_{cls}$ is the cls token of the small patch.
The module then performs Cross-Attention (CA) between the cls token of the small patch and $x^s$. Mathematically, the CA can be expressed as

where $W_q, W_k, W_v \in \mathbb{R}^{C \times (C/h)}$ are learnable parameters, and $C$ and $h$ are the embedding dimension and the number of heads. The output $z^s$ of a cross-attention module for a given $x^s$, with layer normalization and residual shortcut, is defined as follows

Likewise, we will do the same thing for large patches. After getting the cls tokens of the large and small patches we will forward them to the MLP Header for classification and then the resulting probabilities will be concatenated.
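For reference, the small-branch cross-attention described in this subsection can be written out in the standard Cross-ViT form. This is a sketch under the assumption that the model follows that formulation. The symbols $x'^{s}$ (the concatenated sequence), $x^{l}_{patch}$ and $x^{s}_{patch}$ (patch tokens of the large and small branches), $q$, $k$, $v$, $A$, $y^{s}_{cls}$, and MCA (multi-head cross-attention) are introduced here for illustration.

```latex
\begin{aligned}
x'^{s} &= \left[\, f^{s}(x^{s}_{cls}) \,\|\, x^{l}_{patch} \,\right] \\
q &= x'^{s}_{cls} W_q, \qquad k = x'^{s} W_k, \qquad v = x'^{s} W_v \\
A &= \mathrm{softmax}\!\left( \frac{q k^{\top}}{\sqrt{C/h}} \right), \qquad \mathrm{CA}(x'^{s}) = A v \\
y^{s}_{cls} &= f^{s}(x^{s}_{cls}) + \mathrm{MCA}\!\left( \mathrm{LN}\!\left( \left[\, f^{s}(x^{s}_{cls}) \,\|\, x^{l}_{patch} \,\right] \right) \right) \\
z^{s} &= \left[\, g^{s}(y^{s}_{cls}) \,\|\, x^{s}_{patch} \,\right]
\end{aligned}
```

Because only the cls token acts as the query, the attention map $A$ has a single row, so the cost of cross-attention grows linearly with the number of tokens. The large-branch expressions follow by exchanging the superscripts $s$ and $l$.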
Results and Discussion
Experiments
In this section, we carry out comprehensive experiments to demonstrate the superiority of our proposed model, Cross-ViT with Masksemble block, over existing methods. We evaluated our model on the preprocessed data of the ISIC-2016, ISIC-2018, Kaggle, and HAM-10000 datasets.
Experimental setup: All the training and testing have been performed on Victus 12th Gen Intel(R) Core (TM) i5-12450H machine having 8 GB DDR4 RAM and 4 GB GPU (NVIDIA GeForce GTX 1650). A batch size of 45 and 50 epochs have been used for training the model over ISIC-2016, ISIC-2018, Kaggle dataset and HAM-10000 dataset. Images of size 224 × 224 have been fed to the model with a learning rate of 0.001. Cross Entropy loss is used here as a loss function. The source code of the proposed model was implemented using PyTorch 2.0.0+cu118 and Python 3.11.1.
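As an illustration of the loss named above, the following dependency-free sketch computes the cross-entropy of a batch of logits. It mirrors the usual definition (log-softmax followed by the negative log-likelihood of the true class, averaged over the batch) and is not the PyTorch implementation the authors used; the `CONFIG` dictionary simply collects the hyperparameters quoted in the text under hypothetical key names.

```python
import math

# Hyperparameters quoted in the experimental setup (key names are illustrative).
CONFIG = {"batch_size": 45, "epochs": 50, "learning_rate": 0.001, "image_size": 224}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])


def batch_cross_entropy(batch_logits, targets):
    """Mean cross-entropy over a batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


# Two hypothetical samples: class 0 = benign, class 1 = malignant.
loss = batch_cross_entropy([[2.0, -1.0], [0.5, 1.5]], [0, 1])
```

A confident correct prediction contributes a loss near zero, while an undecided two-class prediction (equal logits) contributes ln 2.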
Evaluation metrics and results: After completing several experiments, we recorded the evaluation metrics on the ISIC-2016, ISIC-2018, and Kaggle datasets. We use precision, recall, and F1-score to measure the classification accuracy for skin cancer, i.e., benign versus malignant [7]. Tables 3-5 report these metrics for our ViT, Cross-ViT, and Cross-ViT with Masksemble block (Proposed) alongside other models (DenseNet, ResNet, MobileNet, and XceptionNet) on the three datasets. From these tables, we observe that the proposed system (Cross-ViT with Masksemble block) achieves the highest F1-score on ISIC-2016 (0.85) and Kaggle (0.86) and the highest recall on ISIC-2016 (0.84) and ISIC-2018 (0.92); on ISIC-2018 its F1-score (0.83) is slightly below that of ViT and Cross-ViT (0.84).
| Model | Precision | Recall | F1 score |
| Dense Net (8) | 0.95 | 0.72 | 0.81 |
| Res Net (10) | 0.59 | 0.78 | 0.67 |
| Mobile Net (28) | 0.57 | 0.69 | 0.62 |
| Xception Net (1) | 0.9 | 0.71 | 0.79 |
| ViT | 0.87 | 0.81 | 0.83 |
| Cross-ViT | 0.88 | 0.79 | 0.83 |
| Proposed | 0.86 | 0.84 | 0.85 |
Table 3: Measuring precision, recall, and F1 score on ISIC-2016 dataset.
| Model | Precision | Recall | F1 score |
| Dense Net (8) | 0.84 | 0.82 | 0.82 |
| Res Net (10) | 0.61 | 0.87 | 0.71 |
| Mobile Net (28) | 0.88 | 0.76 | 0.81 |
| Xception Net (1) | 0.82 | 0.83 | 0.82 |
| ViT | 0.84 | 0.85 | 0.84 |
| Cross-ViT | 0.9 | 0.79 | 0.84 |
| Proposed | 0.77 | 0.92 | 0.83 |
Table 4: Measuring precision, recall, and F1 Score on ISIC-2018 dataset.
| Model | Precision | Recall | F1 score |
| Dense Net (8) | 0.83 | 0.89 | 0.85 |
| Res Net (10) | 0.58 | 0.92 | 0.71 |
| Mobile Net (28) | 0.83 | 0.8 | 0.81 |
| Xception Net (1) | 0.74 | 0.93 | 0.82 |
| ViT | 0.88 | 0.83 | 0.85 |
| Cross-ViT | 0.89 | 0.83 | 0.85 |
| Proposed | 0.83 | 0.9 | 0.86 |
Table 5: Measuring precision, recall, and F1 score on Kaggle dataset.
HAM-10000: On the HAM-10000 dataset, we evaluate the classification accuracy of our ViT, Cross-ViT, and Cross-ViT with Masksemble block against other models, i.e., DenseNet, ResNet, MobileNet, and XceptionNet, using precision, recall, and F1-score; the results are displayed in Tables 6-9. These tables show that our suggested system, Cross-ViT with Masksemble block, achieves a higher average accuracy than both ViT and Cross-ViT, as well as the other current systems.
| Models | Mobilenet | xceptionnet | Densenet | ViT | Cross-ViT | Proposed |
| Avg. Acc. | 69.41 | 70 | 76.49 | 79.03 | 75.95 | 80.26 |
| AC | 72.23 | 78.91 | 82.78 | 84.57 | 83.52 | 82.17 |
| BC | 68.43 | 64.14 | 82.41 | 78.57 | 76.71 | 78.48 |
| DF | 67.1 | 85.84 | 82.22 | 88.37 | 83.12 | 83.1 |
| MN | 58.57 | 58.37 | 61.53 | 58.76 | 63.86 | 71.9 |
| NE | 99.1 | 73.2 | 91.38 | 89.27 | 87.87 | 99.17 |
| PK | 46.84 | 51.09 | 46.03 | 64.17 | 48.12 | 59.6 |
| SC | 55.7 | 57.03 | 68.87 | 69.89 | 66.27 | 69.1 |
| VL | 87.38 | 91.8 | 96.7 | 98.68 | 98.12 | 98.4 |
Table 6: Average accuracy with class wise on HAM-10000 Dataset
| Class | Precision | Recall | F1-score |
| AC | 0.82 | 0.89 | 0.85 |
| BC | 0.78 | 0.68 | 0.73 |
| DF | 0.83 | 0.92 | 0.87 |
| MN | 0.72 | 0.58 | 0.64 |
| NE | 0.99 | 0.91 | 0.95 |
| PK | 0.6 | 0.62 | 0.61 |
| SC | 0.69 | 0.83 | 0.76 |
| VL | 0.98 | 0.96 | 0.97 |
Note: Actinic keratosis (AC), Basal cell carcinoma (BC), Dermatofibroma (DF), Melanoma (MN), Nevus (NE), Pigmented benign keratosis (PK), Squamous cell carcinoma (SC), and Vascular lesion (VL).
Table 7: Proposed: Classification report of HAM-10000 dataset.
| Class | Precision | Recall | F1-score |
| AC | 0.84 | 0.72 | 0.77 |
| BC | 0.77 | 0.56 | 0.65 |
| DF | 0.83 | 0.85 | 0.84 |
| MN | 0.64 | 0.63 | 0.63 |
| NE | 0.88 | 0.89 | 0.88 |
| PK | 0.48 | 0.67 | 0.56 |
| SC | 0.66 | 0.7 | 0.68 |
| VL | 0.98 | 0.92 | 0.95 |
Table 8: Cross-ViT: Classification report of HAM-10000 dataset.
| Class | Precision | Recall | F1-score |
| AC | 0.85 | 0.85 | 0.85 |
| BC | 0.79 | 0.63 | 0.7 |
| DF | 0.88 | 0.85 | 0.87 |
| MN | 0.59 | 0.76 | 0.66 |
| NE | 0.89 | 0.9 | 0.89 |
| PK | 0.64 | 0.54 | 0.59 |
| SC | 0.7 | 0.8 | 0.74 |
| VL | 0.99 | 0.94 | 0.96 |
Table 9: VIT: Classification report of ham-10000 dataset.
Confusion matrices: To measure the classification accuracy of our proposed system, we use confusion matrices to demonstrate the effectiveness of the classification for the benign and malignant classes, as shown in Table 10. The analysis of the confusion matrices indicates that the Cross-ViT with Masksemble block (Proposed) outperforms the previously mentioned models (ResNet, DenseNet, MobileNet, XceptionNet, and ViT) in classification tasks.
| Method | Class | ISIC-2016: b | ISIC-2016: m | ISIC-2018: b | ISIC-2018: m | Kaggle: b | Kaggle: m |
| ViT | b | 261 | 39 | 126 | 24 | 319 | 41 |
| ViT | m | 58 | 242 | 22 | 128 | 64 | 236 |
| Cross-ViT | b | 265 | 35 | 136 | 14 | 322 | 38 |
| Cross-ViT | m | 67 | 233 | 35 | 115 | 65 | 235 |
| Proposed | b | 260 | 40 | 116 | 34 | 299 | 61 |
| Proposed | m | 48 | 252 | 9 | 141 | 33 | 267 |
Note: The first two rows give the ViT results, the next two rows the Cross-ViT results, and the last two rows the Proposed model results; ‘b’ and ‘m’ denote benign and malignant, respectively.
Table 10: Confusion matrices for the classification of benign and malignant images using ViT, Cross-ViT, and the Proposed model on the ISIC-2016, ISIC-2018, and Kaggle datasets.
For the ISIC-2016 dataset, out of 300 malignant images, the proposed model classifies 252 correctly and misclassifies 48; out of 300 benign images, 260 are classified correctly and 40 are misclassified. In the ISIC-2018 dataset, the true positive-false positive values are 116-34, and the true negative-false negative values are 141-9. Lastly, in the Kaggle dataset, we observe true positive-false positive values of 299-61 and true negative-false negative values of 267-33.
For the HAM-10000 dataset, the confusion matrices of the different classes are shown in Tables 11-16 for our proposed model and the other models.
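To make the link between these counts and the reported metrics concrete, the sketch below applies the standard definitions of precision, recall, and F1-score to the Kaggle counts, taking the true-positive, false-positive, true-negative, and false-negative values exactly as labelled in the text. The function name is illustrative.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score, and accuracy from the four counts."""
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Kaggle counts for the proposed model, as labelled in the text.
kaggle = binary_metrics(tp=299, fp=61, fn=33, tn=267)
print({name: round(value, 2) for name, value in kaggle.items()})
```

With these counts, precision is 299/360 ≈ 0.83, recall is 299/332 ≈ 0.90, and the F1-score is ≈ 0.86, in line with the Proposed row of Table 5.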
| | AC | BC | DF | MN | NE | PK | SC | VL |
| AC | 355 | 6 | 6 | 3 | 0 | 11 | 19 | 0 |
| BC | 17 | 270 | 19 | 15 | 0 | 34 | 43 | 2 |
| DF | 0 | 7 | 369 | 2 | 0 | 15 | 6 | 1 |
| MN | 6 | 13 | 13 | 233 | 3 | 89 | 42 | 1 |
| NE | 2 | 5 | 5 | 10 | 362 | 11 | 4 | 1 |
| PK | 28 | 17 | 20 | 52 | 0 | 248 | 34 | 1 |
| SC | 23 | 22 | 5 | 9 | 0 | 7 | 334 | 0 |
| VL | 1 | 4 | 7 | 0 | 0 | 1 | 1 | 386 |
Table 11: Proposed: Confusion matrix of HAM-10000 dataset. There are 8 classes in total.
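The per-class figures can be recomputed from the matrix in Table 11. The sketch below assumes that rows are true classes and columns are predicted classes (each row sums to 400); the helper names are illustrative. Under that reading, the column-wise ratios (per-class precision) agree with the Proposed column of Table 6 up to the last digit, and their mean is about 80.26%.

```python
CLASSES = ["AC", "BC", "DF", "MN", "NE", "PK", "SC", "VL"]

# Confusion matrix of the proposed model, copied from Table 11.
PROPOSED = [
    [355, 6, 6, 3, 0, 11, 19, 0],
    [17, 270, 19, 15, 0, 34, 43, 2],
    [0, 7, 369, 2, 0, 15, 6, 1],
    [6, 13, 13, 233, 3, 89, 42, 1],
    [2, 5, 5, 10, 362, 11, 4, 1],
    [28, 17, 20, 52, 0, 248, 34, 1],
    [23, 22, 5, 9, 0, 7, 334, 0],
    [1, 4, 7, 0, 0, 1, 1, 386],
]


def per_class_report(matrix, classes):
    """Precision (column-wise) and recall (row-wise) for every class."""
    size = len(matrix)
    report = {}
    for i, name in enumerate(classes):
        row_total = sum(matrix[i])
        column_total = sum(matrix[r][i] for r in range(size))
        report[name] = {
            "precision": matrix[i][i] / column_total,
            "recall": matrix[i][i] / row_total,
        }
    return report


report = per_class_report(PROPOSED, CLASSES)
macro_precision = sum(v["precision"] for v in report.values()) / len(report)
for name, values in report.items():
    print(name, round(100 * values["precision"], 2), round(100 * values["recall"], 2))
```

For example, class AC has 355 correct predictions out of 432 predicted as AC (precision ≈ 0.82) and out of 400 true AC images (recall ≈ 0.89).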
| | AC | BC | DF | MN | NE | PK | SC | VL |
| AC | 289 | 8 | 15 | 7 | 1 | 41 | 39 | 0 |
| BC | 14 | 224 | 19 | 27 | 6 | 46 | 59 | 5 |
| DF | 2 | 7 | 340 | 16 | 6 | 15 | 13 | 1 |
| MN | 4 | 9 | 7 | 251 | 14 | 103 | 12 | 0 |
| NE | 3 | 2 | 1 | 10 | 355 | 29 | 0 | 0 |
| PK | 16 | 16 | 12 | 53 | 14 | 269 | 19 | 1 |
| SC | 15 | 18 | 12 | 19 | 1 | 54 | 281 | 0 |
| VL | 3 | 8 | 3 | 10 | 7 | 2 | 1 | 366 |
Table 12: Cross-ViT: Confusion matrix of HAM-10000 dataset.
| | AC | BC | DF | MN | NE | PK | SC | VL |
| AC | 340 | 2 | 3 | 19 | 0 | 12 | 24 | 0 |
| BC | 20 | 253 | 17 | 31 | 7 | 20 | 52 | 0 |
| DF | 5 | 3 | 342 | 12 | 2 | 20 | 16 | 0 |
| MN | 4 | 12 | 5 | 305 | 14 | 42 | 15 | 3 |
| NE | 1 | 6 | 0 | 22 | 358 | 11 | 2 | 0 |
| PK | 15 | 19 | 9 | 95 | 17 | 216 | 28 | 2 |
| SC | 15 | 17 | 8 | 29 | 1 | 12 | 318 | 0 |
| VL | 2 | 10 | 3 | 6 | 2 | 3 | 0 | 374 |
Table 13: ViT: Confusion matrix of HAM-10000 dataset.
| | AC | BC | DF | MN | NE | PK | SC | VL |
| AC | 232 | 22 | 12 | 6 | 2 | 62 | 64 | 0 |
| BC | 5 | 288 | 9 | 31 | 15 | 17 | 28 | 7 |
| DF | 1 | 23 | 285 | 12 | 24 | 24 | 25 | 6 |
| MN | 5 | 15 | 11 | 251 | 40 | 43 | 21 | 14 |
| NE | 1 | 4 | 1 | 9 | 377 | 6 | 1 | 1 |
| PK | 21 | 29 | 7 | 79 | 38 | 186 | 35 | 5 |
| SC | 29 | 53 | 6 | 41 | 11 | 25 | 235 | 0 |
| VL | 0 | 15 | 1 | 1 | 8 | 1 | 3 | 371 |
Table 14: Xceptionnet: Confusion matrix of HAM-10000 dataset.
| | AC | BC | DF | MN | NE | PK | SC | VL |
| AC | 307 | 11 | 14 | 11 | 0 | 29 | 27 | 1 |
| BC | 20 | 206 | 36 | 23 | 0 | 26 | 52 | 37 |
| DF | 3 | 10 | 359 | 4 | 0 | 12 | 11 | 1 |
| MN | 9 | 14 | 45 | 222 | 0 | 73 | 29 | 8 |
| NE | 0 | 0 | 0 | 0 | 400 | 0 | 0 | 0 |
| PK | 21 | 31 | 41 | 90 | 0 | 171 | 40 | 6 |
| SC | 65 | 25 | 38 | 28 | 0 | 49 | 192 | 3 |
| VL | 0 | 4 | 2 | 1 | 0 | 5 | 0 | 388 |
Table 15: Mobilenet: Confusion matrix of HAM-10000 dataset.
| | AC | BC | DF | MN | NE | PK | SC | VL |
| AC | 351 | 5 | 6 | 4 | 0 | 17 | 17 | 0 |
| BC | 11 | 253 | 17 | 32 | 2 | 40 | 36 | 9 |
| DF | 0 | 6 | 370 | 6 | 2 | 14 | 2 | 0 |
| MN | 8 | 4 | 13 | 192 | 9 | 157 | 16 | 1 |
| NE | 4 | 3 | 3 | 15 | 329 | 43 | 2 | 1 |
| PK | 14 | 7 | 16 | 37 | 10 | 279 | 35 | 2 |
| SC | 35 | 26 | 23 | 23 | 2 | 52 | 239 | 0 |
| VL | 1 | 3 | 2 | 3 | 6 | 4 | 0 | 381 |
Table 16: Densenet: Confusion matrix of HAM-10000 dataset.
Comparison with other approaches: We compare our ViT, Cross-ViT, and Cross-ViT with Masksemble block (Proposed) with the existing methods, viz. DenseNet, ResNet, MobileNet, and XceptionNet, on three datasets: ISIC-2016, ISIC-2018, and Kaggle. For each model, we report the overall classification accuracy together with the class-wise accuracy for benign and for malignant. Our proposed system achieves 84.6% overall accuracy on the ISIC-2016 dataset, with class-wise accuracy of 85.3% for benign and 84% for malignant, as indicated in Table 17. From this observation, we claim that our proposed method outperforms the other methods. It also indicates that the proposed Cross-ViT with Masksemble block yields higher average classification accuracy than ViT and Cross-ViT. In the same way, Tables 18 and 19 show the average classification accuracy on the ISIC-2018 and Kaggle datasets, with class-wise accuracy for the benign and malignant classes. Here too, the proposed Cross-ViT with Masksemble block yields higher average accuracy than ViT and Cross-ViT.
| Model | Avg. Acc. | Benign Acc. | Malignant Acc. |
| Dense Net (8) | 83.05 | 72.77 | 93.23 |
| Res Net (10) | 72.93 | 78.66 | 67.2 |
| Mobile Net (28) | 58.67 | 58.9 | 58.45 |
| Xception Net (1) | 79.18 | 71.2 | 87.15 |
| VIT | 83.96 | 81.81 | 86.12 |
| Cross-VIT | 83.37 | 79.81 | 86.94 |
| Proposed | 84.6 | 85.3 | 84 |
Table 17: Average accuracy with class wise benign and malignant on ISIC-2016 dataset.
| Model | Avg. Acc. | Benign Acc. | Malignant Acc. |
| Dense Net | 85.001 | 85.23 | 84.76 |
| Res Net | 78.93 | 87.61 | 70.25 |
| Mobile Net | 81.34 | 85.23 | 84.76 |
| Xception Net | 83.68 | 84.35 | 83.006 |
| VIT | 86.03 | 85.13 | 84.21 |
| Cross-VIT | 84.339 | 79.53 | 89.14 |
| Proposed | 86.68 | 92.8 | 80.5 |
Table 18: Average accuracy with class wise benign and malignant on ISIC-2018 dataset.
| Model | Avg. Acc. | Benign Acc. | Malignant Acc. |
| Dense Net | 85.36 | 89.05 | 81.67 |
| Res Net | 78.87 | 92.17 | 65.58 |
| Mobile Net | 79.88 | 80.81 | 78.96 |
| Xception Net | 84.04 | 93.03 | 75.06 |
| VIT | 84.24 | 83.28 | 85.19 |
| Cross-VIT | 84.64 | 83.2 | 86 |
| Proposed | 85.73 | 90 | 81.4 |
Table 19: Average accuracy with class wise benign and malignant on kaggle dataset.
We also compare our ViT, Cross-ViT, and Cross-ViT with Masksemble block (Proposed) with the current approaches, namely DenseNet, ResNet, MobileNet, and XceptionNet, on the HAM-10000 dataset. The overall classification accuracy is reported together with the class-wise accuracy for the 8 classes. Table 20 shows that our proposed system achieves 80.26% overall accuracy on the HAM-10000 dataset, along with the class-wise accuracy of the 8 classes. We also use precision, recall, and F1-score to show the effectiveness of our proposed model relative to the other approaches, i.e., ViT and Cross-ViT, as shown in Tables 7-9.
| Models | Mobilenet | Xceptionnet | Densenet | VIT | Cross-VIT | Proposed |
| Avg. Acc. | 69.41 | 70 | 76.49 | 79.03 | 75.95 | 80.26 |
| AC | 72.23 | 78.91 | 82.78 | 84.57 | 83.52 | 82.17 |
| BC | 68.43 | 64.14 | 82.41 | 78.57 | 76.71 | 78.48 |
| DF | 67.1 | 85.84 | 82.22 | 88.37 | 83.12 | 83.1 |
| MN | 58.57 | 58.37 | 61.53 | 58.76 | 63.86 | 71.9 |
| NE | 99.1 | 73.2 | 91.38 | 89.27 | 87.87 | 99.17 |
| PK | 46.84 | 51.09 | 46.03 | 64.17 | 48.12 | 59.6 |
| SC | 55.7 | 57.03 | 68.87 | 69.89 | 66.27 | 69.1 |
| VL | 87.38 | 91.8 | 96.7 | 98.68 | 98.12 | 98.4 |
Table 20: Average accuracy with class-wise accuracy on the HAM-10000 dataset.
Comparison of output images of our proposed model: We have already shown the efficacy of our proposed model by measuring precision, recall, and F1-score in subsection 5.4. Here, we present the output images generated by the VIT, Cross-VIT, and proposed approaches. Figure 4(b,c) presents the masked image and the output image for VIT, both generated from the original image depicted in Figure 4(a); the infected regions are highlighted in reddish-yellow in the output image. Figure 5(b,c) shows the corresponding masked and output images for Cross-VIT. Figure 6(b,c) presents the masked and output images of the infected regions for the proposed model, generated from the original image shown in Figure 6(a). Compared with VIT and Cross-VIT, our proposed model detects the infected regions, highlighted in reddish-yellow, more effectively. From this observation, we conclude that our model performs better in distinguishing malignant from benign skin lesions.

Figure 4: Generated output images of ViT where the 1st image is represented with (a) Original image, 2nd image is represented with (b) Masked image and last image is represented with (c) Output image (infected regions are highlighted with reddish-yellow colour. These regions are more suspicious for predicting the classes correctly).

Figure 5: Generated output images of Cross-VIT where the 1st image is represented with (a) Original image, 2nd image is represented with (b) Masked image and last image is represented with (c) Output image (infected regions are highlighted with reddish-yellow colour. These regions are more suspicious for predicting the classes correctly).

Figure 6: Generated output images of proposed model where the 1st image is represented with (a) Original image, 2nd image is represented with (b) Masked image and last image is represented with (c) Output image (infected regions are highlighted with reddish-yellow colour. These regions are more suspicious for predicting the classes correctly).
Comparison with other uncertainty approaches: We compare our proposed model with two well-known uncertainty estimation techniques, deep ensemble and Monte-Carlo Dropout, each using Cross-VIT as the backbone model. Table 21 reports this comparison. The proposed model attains the highest average accuracy (86.68%, against 86.15% for deep ensemble and 85.6% for Monte-Carlo Dropout) and the highest benign accuracy, while Monte-Carlo Dropout attains the highest malignant accuracy.
| Uncertainties | Avg. acc. | Benign acc. | Malignant acc. |
| Cross-VIT+ Deep Ensemble | 86.15 | 88.7 | 83.6 |
| Cross-VIT + MC Dropout | 85.6 | 81.2 | 90 |
| Proposed | 86.68 | 92.8 | 80.5 |
Table 21: Comparison of the proposed model's uncertainty estimation with two other well-known uncertainty estimation techniques, deep ensemble and Monte-Carlo Dropout.
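The three techniques compared in Table 21 all produce several predictions for the same image, from independently trained networks, from stochastic dropout passes, or from different masks. The following is a minimal sketch of one common way to combine such predictions: average the class probabilities and use the entropy of the average as an uncertainty score. The source does not specify how its uncertainty score is computed, so this aggregation rule, the function names, and the toy probabilities are assumptions made for illustration.

```python
import math


def mean_probs(member_probs):
    """Average class probabilities over ensemble members or stochastic passes."""
    n = len(member_probs)
    k = len(member_probs[0])
    return [sum(p[j] for p in member_probs) / n for j in range(k)]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a probability vector; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical softmax outputs [P(benign), P(malignant)] from four members.
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]

p_agree = mean_probs(agree)
p_disagree = mean_probs(disagree)
h_agree = predictive_entropy(p_agree)
h_disagree = predictive_entropy(p_disagree)
print(h_agree, h_disagree)
```

When the members agree, the averaged distribution stays peaked and the entropy is low; when they disagree, the average flattens towards uniform and the entropy approaches its maximum of ln 2 for two classes.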
Table 22 extends this comparison to the HAM-10000 dataset, reporting the class-wise accuracy for the eight classes. The proposed model achieves the highest average accuracy (80.26%, against 79.03% for deep ensemble and 76.45% for Monte-Carlo Dropout), although deep ensemble or Monte-Carlo Dropout is more accurate on some individual classes, namely AC, BC, and SC.
| Model with Uncertainties | Cross-VIT + Deep Ensemble | Cross-VIT + MC Dropout | Proposed |
| Avg. Acc. | 79.03 | 76.45 | 80.26 |
| AC | 98.12 | 87.32 | 82.17 |
| BC | 91.4 | 46.5 | 78.48 |
| DF | 68 | 54.8 | 83.1 |
| MN | 62.31 | 66.2 | 71.9 |
| NE | 89.24 | 69.16 | 99.17 |
| PK | 58.89 | 58.9 | 59.6 |
| SC | 82.16 | 98.1 | 69.1 |
| VL | 80.4 | 74 | 98.4 |
Table 22: Comparison of the proposed model's uncertainty estimation with two other well-known uncertainty estimation techniques, deep ensemble and Monte-Carlo Dropout, on the HAM-10000 dataset.
Performance measuring by ROC: Measuring performance is a crucial step in any classification task. We use the Receiver Operating Characteristic (ROC) curve to visualize and verify the performance of our classifiers. The ROC curve is one of the most important evaluation tools for a classification model: it summarizes performance across different decision thresholds and indicates the degree of separability, that is, how well a model can discriminate between classes. The curve plots the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis. Figure 7 shows the performance on the ISIC-2016 dataset using VIT, Cross-VIT, and the proposed model, where our proposed model outperforms VIT and Cross-VIT.

Figure 7: System performance measuring by ROC on ISIC-2018 Dataset using VIT, Cross-VIT, and proposed model.
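As an illustration of how a ROC curve is obtained, the following minimal sketch sweeps the decision threshold over the predicted scores, records the (FPR, TPR) pair at each step, and computes the area under the curve with the trapezoidal rule. The function names and the toy labels and scores are hypothetical and are not taken from the experiments reported here.

```python
def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the decision threshold.

    y_true holds 1 for the positive class and 0 for the negative class;
    scores holds the predicted probability of the positive class.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Consume every sample tied at this score before emitting a point.
        while i < len(order) and scores[order[i]] == threshold:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
               for k in range(1, len(fpr)))


# Hypothetical labels (1 = malignant) and predicted malignant probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_curve_points(y_true, scores)
area = auc(fpr, tpr)
print(list(zip(fpr, tpr)), area)
```

A curve that rises quickly towards the top-left corner, and hence an area closer to 1, indicates better separability between the two classes.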
Figure 8 shows the performance on the ISIC-2018 dataset using VIT, Cross-VIT, and the proposed model, where our proposed model outperforms VIT and Cross-VIT [8]. In the same way, Figure 8 shows the performance on the Kaggle dataset using VIT, Cross-VIT, and the proposed model, where our proposed model performs better than VIT and Cross-VIT.

Figure 8: System performance measuring by ROC on ISIC-2016 Dataset using ViT, Cross-ViT, and proposed model.
Figure 9 shows the performance on the HAM-10000 dataset using VIT, Cross-VIT, and the proposed model, where the proposed model performs better than VIT and Cross-VIT.

Figure 9: System performance measuring by ROC on HAM-10000 dataset considering melanoma as positive class using VIT, Cross-VIT, and proposed model.
t-SNE plots: t-distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear method mostly employed for data exploration and for visualizing high-dimensional data [9]. Put more simply, t-SNE gives an intuitive sense of how the data is organized in a high-dimensional space. Figure 10 presents the distribution of the benign and malignant classes for the binary datasets. The distribution of the eight classes of the HAM-10000 dataset is shown in Figure 11.

Figure 10: t-SNE plots show the distribution of benign and malignant. Figure (a) shows the distribution of the 2016 ISIC dataset, Figure (b) represents the distribution of the ISIC 2018 dataset and finally Figure (c) shows the distribution of the Kaggle dataset.

Figure 11: t-SNE plots show the distribution of 8 classes of HAM-10000 dataset. Figure (a) shows the distribution of the proposed model, Figure (b) represents the distribution of the Cross-VIT and finally Figure (c) shows the distribution of VIT.
Conclusion
In this paper, we proposed a new model, Cross-VIT with Masksemble block, which combines the Masksemble layer with a VIT equipped with a cross-attention block. The Masksemble layer provides an uncertainty estimate that improves the classification accuracy for skin cancer. This uncertainty estimate helps the model predict class labels accurately, that is, it decreases the chance of misclassification. After extensive analysis of the experimental results, we show that our model outperforms current approaches in average accuracy and can serve as an effective baseline for this task. We hope that this contribution will help advance the field of skin cancer diagnosis. Our approach is effective in detecting skin cancer across multiple classes as well as in binary classification.
References
- Girdhar N, Sinha A, Gupta S (2023) Soft Comput 27: 13285–13304.
- Goodson AG, Grossman D (2009) J Am Acad Dermatol 60: 719–735.
- Hasan M, Barman SD, Islam S, Reza AW (2019) In: Proceedings of the 2019 5th international conference on computing and artificial intelligence. pp. 254–258.
- Khan MA, Akram T, Zhang YD, Sharif M (2021) Pattern Recognit Lett 143: 58–66.
- Manasa K, Murthy D (2021) Skin cancer detection using VGG-16. Eur J Mol Clin Med 8: 1419–1426.
- Morton C, Mackie R (1998) Br J Dermatol 138: 283–287.
- Oliveira RB, Papa JP, Pereira AS, Tavares JMR (2018) Neural Comput Appl 29: 613–636.
- Reis HC, Turk V, Khoshelham K, Kaya S (2022) Med Biol Eng Comput 60: 643–662.
- Saba T (2020) J Infect Public Health 13: 1274–1289.
Citation: Guchhait A, Barman A, Roy SK (2025) Masksemble-Aided Cross-ViT for Uncertainty Estimation in Skin Cancer Diagnosis. J Oncol Res Treat 10:310. DOI: 10.4172/aot.1000310