Project I: Time constrained k-armed bandit competition

Intuitive description of the problem

You are in a casino, and you are given T seconds to play k one-armed bandits. Playing one bandit once costs you one dollar, and if you win, you get two dollars back. Each one-armed bandit has its own static but unknown probability to yield a win. Your goal is to spend the T seconds to gain as much money as possible. For a longer description of the problem (but one without the time constraint), see Sutton & Barto's book chapter 2.1.

Task

Your task is to write a program that plays k-armed bandit. The program should run in the department Linux-distribution, and read its input from stdin and write the output to stdout in the following way:

The program is started without command line arguments. The program first reads a line that contains the number of arms (k), then a line containing the game time in seconds (T). It then enters the loop in which it writes a line for arm selection (0,...k-1) and reads a line containing the reward signal, -1 or 1. The signal 0 indicates the end of the game. The pseudo-code for the player is as follows:

k = int(readline())                  // number of arms
T = int(readline())                  // game time - not used in this pseudo-code
reward = 999                         // initialize reward to non-zero
while (reward != 0) do
    print a number between 0 and k-1 // select an arm to twist
    reward = int(readline())         // read the result 1 for win, -1 for loss
endwhile

Deliverables

Your implementation will be tested in three phases. The deadlines for the phases can be found in the course schedule on the course main page. In each phase you are required to submit both the source and the binary code for your program.

Together with your third phase package, you are required to deliver a short (2 pages printed) html-documentation describing your algorithm and the tests you have run on your program.

Material

Below you can find the package (with README file) containing the system that will be used for testing the programs. Use this framework to test your program so that it conforms to the requirements.

A gzipped tar file containing the framework.

Scoring

It is possible to get 15% of the course maximum points from the project 1 (and actually the winner will almost surely do so.) The maximum number of points available for a team is therefore 15. All the programs will be run at each phase. Participating in the phase X gives each participant X points. The remaining max 9 points are determined by the cumulative rewards earned in three phases by the formula

points = (cumulative_reward / max_cumulative_reward) * 9,

where max_cumulative_reward is the maximum reward earned by the winning team.

Each member of the team will get the same amount of points unless explicitly agreed otherwise.

In each phase the programs are tested with several tasks varying the game times (T), numbers of arms (k), and the reward distributions. Note that the maximum expected utility for a task can be negative. The whole testing session of all teams should not take more than 10 minutes. The programs will be run in the classroom computer having x Mhz processor with x megabytes of memory.




Three Concepts: Utility	2007