Yusuke Narita; Shota Yasui; Kohei Yata

Abstract
What is the most statistically efficient way to perform off-policy evaluation and optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward of a counterfactual policy. Our estimators are shown to have the lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design at a major advertising company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with greater statistical confidence than a state-of-the-art benchmark.
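The abstract does not reproduce the estimators themselves. As background only, off-policy evaluation from bandit feedback is usually built on the standard inverse probability weighting (IPW) estimator, which is the kind of baseline the paper reports variance reduction against; the notation below (logged contexts x_i, actions a_i drawn by a logging policy pi_0, rewards r_i, and a counterfactual target policy pi) is illustrative and not the paper's own.

```latex
% Standard IPW estimate of the value of a counterfactual policy \pi,
% computed from n logged triples (x_i, a_i, r_i) collected under the
% logging policy \pi_0. Notation is illustrative, not taken from the paper.
\hat{V}_{\mathrm{IPW}}(\pi)
  = \frac{1}{n} \sum_{i=1}^{n}
    \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \, r_i
```

Estimators of this form are unbiased when the logging propensities pi_0(a_i | x_i) are known, but their variance can be large; the paper's contribution concerns estimators with lower variance within a wide class of such estimators.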
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| causal-inference-on-idhp | - | Average Treatment Effect Error: -0.225 |
| visual-object-tracking-on-vot2014 | - | Expected Average Overlap (EAO): 1.047 |