Teodor Vanislavov Marinov

My main research interests are in the field of Theoretical Machine Learning. Recently, my research has focused on Reinforcement Learning, with applications to compiler optimization and to making Large Language Models (LLMs) more factual. On the more theoretical side, I am interested in Bandit Problems, more efficient Reinforcement Learning algorithms beyond worst-case settings, and understanding the emergent abilities of LLMs.
Authored Publications
    We conduct a theoretical analysis of techniques for preference-based RL from offline datasets annotated with pairwise preferences, such as DPO. We identify key properties of the learning objective that influence the quality of the learned policy, such as the coverage of the offline dataset, the presence or absence of a normalizing baseline, and the choice of loss function. Informed by the theory, we further conduct an empirical analysis of some key variants to corroborate our theoretical findings.
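    To make the kind of objective being analyzed concrete, here is a minimal sketch of a DPO-style pairwise-preference loss; the beta scale, the fixed reference policy, and the log-probability inputs are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss on a batch of pairwise preferences.

    Inputs are arrays of log-probabilities of the preferred ("chosen")
    and dispreferred ("rejected") responses under the trained policy and
    under a fixed reference policy that acts as the normalizing baseline.
    """
    # Implicit reward margin: beta * (chosen log-ratio - rejected log-ratio).
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin), computed stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

    Dropping the reference-policy terms gives an unnormalized variant, which corresponds to the "presence or absence of a normalizing baseline" the abstract refers to.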
    From scheduling to resource allocation to optimization of complex workflows, systems are replete with decision-making problems which are typically addressed with hand-designed heuristics. Recent literature poses these setups as Reinforcement Learning (RL) problems owing to a natural fit, with several successes in simulated benchmark environments. However, bringing the RL approach to any complex system in practice is full of challenges in integrating the system into the act-observe-learn paradigm of RL, which has limited the adoption of these techniques. In this work, we present an alternative approach which uses offline data collected using multiple existing baseline policies to simultaneously improve upon them. By repeating multiple iterations of this improvement process, including any learned policies into the set of baselines, we show how performance can be quickly bootstrapped using our approach. We demonstrate the practicality of our approach through evaluation in optimizing the inlining decisions for the LLVM compiler, and obtain significant improvements even over prior RL-based policies. View details
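    A minimal sketch of the bootstrapping loop described above, under assumed placeholder functions (collect_offline_data and train_offline_policy are hypothetical, not the paper's API): offline data from the current set of baseline policies is pooled, a new policy is trained on it, and that policy is added back into the set for the next round.

```python
def iterative_offline_improvement(baselines, collect_offline_data,
                                  train_offline_policy, num_rounds=3):
    """Bootstrap a policy from a set of existing baseline policies.

    `baselines` is a list of policies; `collect_offline_data(pi)` returns
    logged trajectories gathered under policy `pi`; `train_offline_policy`
    fits a new policy purely on pooled offline data. All three are
    illustrative placeholders.
    """
    policies = list(baselines)
    dataset = []
    for _ in range(num_rounds):
        # Pool offline data logged under every policy in the current set.
        for pi in policies:
            dataset.extend(collect_offline_data(pi))
        # Train a new policy from the pooled offline data only.
        new_policy = train_offline_policy(dataset)
        # Include the learned policy among the baselines for the next round.
        policies.append(new_policy)
    return policies[-1]
```

    Each round uses only previously logged data, which is the point of the approach: no act-observe-learn loop against the live system is required.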
    Multiple-policy High-confidence Policy Evaluation
    Mohammad Ghavamzadeh
    International Conference on Artificial Intelligence and Statistics (2023), pp. 9470-9487
    In reinforcement learning applications, we often want to accurately estimate the return of several policies of interest. We study this problem, multiple-policy high-confidence policy evaluation, where the goal is to estimate the return of all given target policies up to a desired accuracy with as few samples as possible. The natural approaches to this problem, i.e., evaluating each policy separately or estimating a model of the MDP, scale with the number of policies to evaluate or the size of the MDP, respectively. We present an alternative approach based on reusing samples from on-policy Monte-Carlo estimators and show that it is more sample-efficient in favorable cases. Specifically, we provide guarantees in terms of a notion of overlap of the set of target policies and shed light on when such an approach is indeed beneficial compared to existing methods.
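    A rough sketch of the sample-reuse idea, using ordinary trajectory-level importance sampling to re-weight Monte-Carlo rollouts collected under one target policy when estimating another; the trajectory format and policy interface are assumptions, and the paper's actual estimator and guarantees are more refined.

```python
import numpy as np

def is_return_estimate(trajectories, behavior_policy, target_policy, gamma=0.99):
    """Estimate the discounted return of `target_policy` by re-weighting
    trajectories logged under `behavior_policy`.

    Each trajectory is a list of (state, action, reward) tuples, and each
    policy exposes prob(action, state); both are illustrative assumptions.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Accumulate the trajectory-level importance weight and return.
            weight *= target_policy.prob(a, s) / behavior_policy.prob(a, s)
            ret += (gamma ** t) * r
        # Ordinary importance sampling: full weight times full return.
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```

    Reusing each target policy's on-policy rollouts to evaluate the others in this way is only accurate when the importance weights stay moderate, which is roughly the notion of overlap between target policies that the abstract's guarantees are stated in terms of.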