Powering up Malware with Reward-Shaped RL Agents

Type: MA
Status: Open
Published: 31 March 2026
Supervisors: Francisco Enguix, Weijie Niu, Alberto Huertas
Email: enguix@ifi.uzh.ch, niu@ifi.uzh.ch, alberto.huertas@um.es

The effectiveness of reinforcement learning depends strongly on how the reward signal is designed. In complex cyber environments, poorly shaped rewards can slow learning, encourage undesirable behaviour, or fail to capture longer-term objectives.

This thesis investigates how meta-level reasoning can be used to support reward shaping in a Cybersecurity Offensive AI system. The research focuses on structured and bounded reward adjustments that improve learning behaviour without turning reward design into an opaque or uncontrolled process. The thesis is part of a broader Cybersecurity Offensive AI research line at the intersection of intelligent agents, multi-agent systems, reinforcement learning, large language models, and controlled cyber experimentation. Strong results may contribute to a scientific publication.
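To give a flavour of what a structured and bounded reward adjustment can look like in practice, below is a minimal Python sketch of potential-based reward shaping implemented as a Gymnasium wrapper. It is illustrative only, not the design the thesis will use; the environment (CartPole-v1), the potential function, and the clipping bound are assumptions made for the example.

# Illustrative sketch: potential-based reward shaping as a Gymnasium wrapper.
# The shaping term F(s, s') = gamma * phi(s') - phi(s) is added to the
# environment reward and clipped to a fixed bound, keeping the adjustment
# structured and controllable. Environment, potential function, and bound
# are hypothetical placeholders.
import gymnasium as gym
import numpy as np


class PotentialShapingWrapper(gym.Wrapper):
    def __init__(self, env, potential_fn, gamma=0.99, bound=1.0):
        super().__init__(env)
        self.potential_fn = potential_fn
        self.gamma = gamma
        self.bound = bound              # hard cap on the shaping magnitude
        self._last_potential = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_potential = self.potential_fn(obs)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        potential = self.potential_fn(obs)
        shaping = self.gamma * potential - self._last_potential
        shaping = float(np.clip(shaping, -self.bound, self.bound))  # bounded adjustment
        self._last_potential = potential
        return obs, reward + shaping, terminated, truncated, info


if __name__ == "__main__":
    # Hypothetical potential: negative distance of the cart from the centre.
    env = PotentialShapingWrapper(
        gym.make("CartPole-v1"),
        potential_fn=lambda obs: -abs(obs[0]),
        gamma=0.99,
        bound=0.5,
    )
    obs, info = env.reset(seed=0)
    for _ in range(10):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        if terminated or truncated:
            obs, info = env.reset()

Potential-based terms are a common starting point because they provably leave the optimal policy unchanged (Ng et al., 1999), which makes them a natural baseline when reward adjustments must remain bounded and interpretable.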

Sources:

[1]    N. Gao, X. Zhang, X. Jiang, M. You, M. Zhang, and Y. Deng, ‘RF-Agent: Automated Reward Function Design via Language Agent Tree Search’, in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[2]    H. van Hasselt, A. Guez, and D. Silver, ‘Deep reinforcement learning with double Q-Learning’, in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 2094–2100.
[3]    J. Palanca, A. Terrasa, V. Julian, and C. Carrascosa, ‘SPADE 3: Supporting the New Generation of Multi-Agent Systems’, IEEE Access, vol. 8, pp. 182537–182549, 2020.

Prerequisites

  • Good Python programming skills
  • Prior coursework or experience in machine learning
  • Basic understanding of reinforcement learning (RL)
  • Interest in reward function design
  • Comfort with experimentation and result interpretation