Optimization of CNN and Vision Transformer Models in Addressing Long-Tailed Data Imbalance for Satellite Cloud Image Classification
Abstract
This study investigates long-tailed satellite cloud image classification by comparing convolutional neural network (CNN) and Vision Transformer (ViT) backbones built upon vision–language foundation models. A large-scale satellite cloud dataset with 11 highly imbalanced classes, including a dominant non-phenomenon category, is used to represent realistic atmospheric variability. The data are split using stratified sampling, standardized to a fixed resolution, and used to fine-tune CLIP-based backbones from RemoteCLIP and GeoRSCLIP through parameter-efficient adaptation. Several loss functions (Cross Entropy, Logit Adjustment, Focal, Class-Balanced, and label-distribution-aware variants) are evaluated, along with experiments that remove the majority class and vary the adapter bottleneck dimension. Initial results show that Logit Adjustment causes majority-class collapse under default settings. After optimization, ViT-based models consistently outperform CNN models, achieving higher accuracy and more balanced macro-level performance. Class-Balanced loss emerges as the most effective objective, offering a strong trade-off between overall accuracy and per-class fairness. Increasing the adapter bottleneck dimension further improves ViT performance, enabling the best configuration to match or exceed prior benchmarks while improving minority-class recognition. The final optimized model is deployed in a web-based prediction system, demonstrating the practical potential of foundation-model approaches for satellite-driven weather analysis.
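To make two of the objectives named in the abstract concrete, the following is a minimal sketch of Class-Balanced weighting (based on the effective number of samples) and of a logit-adjusted cross entropy. It is not the authors' implementation. The class counts, `beta`, `tau`, and all function names are assumptions chosen for illustration, and the abstract does not state which hyperparameters or formulations were actually used.

```python
import math

def class_balanced_weights(counts, beta=0.999):
    """Per-class weights proportional to 1 / effective number of samples."""
    # Effective number of samples for a class with n images: (1 - beta^n) / (1 - beta).
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    # Rescale so the weights sum to the number of classes.
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def class_balanced_ce(logits, label, weights):
    """Cross entropy for one sample, scaled by the weight of its true class."""
    return -weights[label] * log_softmax(logits)[label]

def logit_adjusted_ce(logits, label, counts, tau=1.0):
    """Cross entropy after adding tau * log(class prior) to each logit."""
    total = sum(counts)
    adjusted = [x + tau * math.log(n / total) for x, n in zip(logits, counts)]
    return -log_softmax(adjusted)[label]

# Hypothetical counts for 11 classes with one dominant class (NOT the real dataset statistics).
counts = [40000, 9000, 6000, 4000, 3000, 2000, 1500, 1000, 600, 300, 100]
weights = class_balanced_weights(counts)

# With uninformative (all-zero) logits, compare the loss on the head and tail classes.
uniform = [0.0] * len(counts)
head_loss = class_balanced_ce(uniform, 0, weights)
tail_loss = class_balanced_ce(uniform, 10, weights)
```

Under these assumed counts, the rarest class receives the largest weight, so errors on it contribute more to the training loss than errors on the dominant class. The logit-adjusted variant instead shifts the logits by the log class prior, which makes its behaviour sensitive to the choice of `tau`.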

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors submitting a manuscript do so on the understanding that, if it is accepted for publication, copyright of the article shall be assigned to the journal Tech-E, Universitas Buddhi Dharma, as publisher of the journal.