Publications

Synthetic Data Generation for Machine Learning Models with Cognitive Agent

Abstract

The use of synthetic data for training machine learning (ML) models in social media domains can address issues such as data availability and bias, but poses challenges, including properly reflecting causal relationships and matching the consistency of real data. In this paper, we explore the benefits and limitations of using synthetic data generated by cognitive agent simulations. By simulating human interactions and social media dynamics, these models can capture constraints and nuances of real-world scenarios. We report initial experiments that show that ML algorithms trained on real data augmented with synthetic data outperform those trained solely on original data, achieving up to 25% improvement in KS distance and RMSE metrics. This approach is applied to two domain problems: predicting code quality based on open-source code discussions and detecting and countering bot attacks on social media platforms. For code quality prediction, we used discussions and patches from the Linux Kernel Mailing List to predict patch reversions. In the bot attack detection problem, synthetic Reddit data helps create realistic social network environments to study interactions between influencers and bots under different conditions. The paper presents empirical evidence supporting the effectiveness of synthetic data in improving ML model performance and introduces an agent-based framework for generating realistic synthetic data for social media experiments. The findings suggest promising avenues for future research and highlight the potential of this approach.
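
The abstract reports results in terms of KS distance and RMSE without saying how they were computed. As a minimal sketch, assuming the standard two-sample Kolmogorov-Smirnov statistic and the usual root-mean-square error, the two metrics could be computed as follows. The function names `ks_distance` and `rmse` are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right
from math import sqrt


def ks_distance(sample_a, sample_b):
    """Two-sample KS distance: the largest gap between the empirical CDFs.

    Assumption: this is the textbook definition, not necessarily the exact
    variant used in the paper.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every point that appears in either sample.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def rmse(predicted, actual):
    """Root-mean-square error between paired predictions and targets."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((p - t) ** 2 for p, t in zip(predicted, actual))
    return sqrt(sum(squared_errors) / len(actual))
```

For both metrics lower is better, so an improvement corresponds to a smaller value for the model trained on augmented data than for the model trained on real data alone.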

Date
February 5, 2026
Authors
Jim Blythe
Journal
Advances in Practical Applications of Agents, Multi-agent Systems, and Digital Twins: the PAAMS Collection: 22nd International Conference, PAAMS 2024, Salamanca, Spain, June 26-28, 2024, Proceedings
Volume
15157
Pages
73
Publisher
Springer Nature