SEMI-SUPERVISED LEARNING FOR INTELLIGENT THREAT DETECTION IN SPARSE AND LOW-LABELED CYBERSECURITY DATASETS
Keywords:
Threat detection, sparse datasets, cybersecurity, pseudo-labeling, anomaly detection.
Abstract
The growing sophistication of cyberattacks has exposed the limitations of conventional detection models that rely heavily on large volumes of labeled data. In practice, cybersecurity datasets are often sparse, incompletely annotated, and imbalanced, which reduces the effectiveness of fully supervised approaches. To address this challenge, this research introduces a semi-supervised learning framework for intelligent threat detection in environments with limited labeling. By combining labeled and unlabeled samples, the framework is able to extract latent structures within network traffic, improving classification even under constrained annotation conditions. The design employs a hybrid feature representation that integrates statistical attributes with deep feature embeddings to capture both surface-level and hidden attack patterns. A pseudo-labeling strategy and consistency-regularization mechanism are incorporated to guide learning from unlabeled data while minimizing the propagation of incorrect labels. Benchmark cybersecurity datasets with sparse labeling were used to validate the model, simulating real-world operational environments. The proposed framework demonstrated strong performance across multiple evaluation metrics.
Compared with supervised baselines, the semi-supervised model improved detection accuracy by over 12%, achieved higher recall in identifying minority attack classes, and reduced false alarms by approximately 30%. Training also converged more efficiently, requiring fewer iterations while maintaining stability under imbalanced conditions. Notably, the system exhibited resilience against novel and low-frequency attack variants, outperforming both traditional supervised classifiers and unsupervised anomaly detection techniques. This work establishes semi-supervised learning as an effective pathway for advancing next-generation cybersecurity defenses. By leveraging the wealth of unlabeled data commonly available in practice, the framework provides a scalable, privacy-conscious, and resilient solution for intelligent threat detection in sparse and low-labeled cybersecurity datasets.
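The abstract names pseudo-labeling and consistency regularization but does not give their exact form. As an illustration only, the following minimal Python sketch shows one common way these two components are combined: unlabeled samples are kept as pseudo-labeled only when the model's top class probability clears a confidence threshold (which limits the propagation of incorrect labels), and a consistency term penalizes disagreement between predictions on a sample and on a perturbed copy of it. The function names, the 0.9 threshold, the mean-squared consistency penalty, and the weighting coefficients are all assumptions made for this sketch, not details taken from the proposed framework.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudo_labels(prob_rows, threshold=0.9):
    """Return (sample_index, predicted_class) pairs for confident predictions.

    Samples whose highest class probability is below `threshold` are skipped,
    so low-confidence guesses are never used as training labels.
    The threshold value is a hypothetical choice for this sketch.
    """
    selected = []
    for index, probs in enumerate(prob_rows):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((index, label))
    return selected


def consistency_loss(probs_clean, probs_perturbed):
    """Mean squared difference between two predicted distributions.

    Assumed form of the consistency term: predictions for a sample and for
    a perturbed copy of the same sample should agree.
    """
    squared = [(p - q) ** 2 for p, q in zip(probs_clean, probs_perturbed)]
    return sum(squared) / len(squared)


def total_loss(supervised, pseudo, consistency, w_pseudo=1.0, w_cons=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return supervised + w_pseudo * pseudo + w_cons * consistency


# Toy scores for three unlabeled flows over three classes.
unlabeled_logits = [[4.0, 0.0, 0.0], [0.5, 0.4, 0.3], [0.0, 0.0, 5.0]]
probabilities = [softmax(row) for row in unlabeled_logits]
confident = select_pseudo_labels(probabilities, threshold=0.9)
print(confident)  # [(0, 0), (2, 2)]: the ambiguous middle sample is skipped
```

In this toy run the second sample has nearly uniform class probabilities, so it contributes no pseudo-label and would influence training only through the consistency term.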