Documentation (v. 0.1)

1. Introduction

The goal of this project is to provide the Pure Data and Max/MSP communities with a set of open-source tools for reinforcement learning, built on Flext. Reinforcement learning (RL) is an approach to building agent behaviors in complex environments through trial-and-error learning. Wikipedia provides the following definition of the field:

Reinforcement learning has been successfully applied in a number of different fields of research, especially in robotics (Schmidhuber 05; Kim, Hwang & Wqon 03; Dorigo & Colombetti 98; Barto, Bradtke & Singh 95).

The purpose of this document is not to give a course on reinforcement learning; we expect the reader to have a basic grasp of it. A good reference is Sutton and Barto's book (Reinforcement Learning: An Introduction), available online here:

1.1. Terms of use

Permission is granted to copy, distribute and/or modify this document under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Licence, a version of which is available at the following URL:

The software package is distributed under the open-source GNU General Public License (GPL). See the following URL for details:

1.2. Credits

Documentation written by Jean-Sébastien Senécal, 2006. This project was developed at A-Lab as part of an original project by Eric Raymond and Bill Vorn, with funding from Hexagram.

2. Installation

2.1. Installing from binary files (recommended)

If you downloaded the binary files, you just need to copy the right files into your Pd or Max/MSP externals directory to get Pd or Max/MSP running with them. The location of this directory may vary, but in general it will be something like:

Don't forget to copy the help files along with the externals.

2.2. Installing from sources

NOTICE: You'll need the Flext library installed on your system beforehand in order to compile the externals. You can obtain Flext, along with installation instructions, here:

If you want to compile from the sources on a Unix-based system for Pd, run the following from the prompt:

make install

For Max/MSP (on Mac OS X), you need the Metrowerks CodeWarrior compiler, available here: Use the .mcp files in the source directories to compile the externals (you'll need to adjust the preferences for your own system).

There is currently no support for compiling on Win32-based systems, but if someone manages to do so, please let us know.

3. Getting started

The library currently supports a single algorithm, known as Sarsa-lambda (see for a description of this algorithm), with discrete states. We chose this algorithm both for its simplicity and because it fits most needs of real-time new-media applications. Two policies are available:

  1. epsilon-greedy (rl_sarsa_egreedy)
  2. softmax (rl_sarsa_softmax)
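Both externals are built around the same Sarsa-lambda core. As a rough illustration of the underlying algorithm, here is a minimal tabular sketch in Python. This is a textbook-style reconstruction, not the externals' actual code, and all names (sarsa_lambda, env_step, etc.) are hypothetical:

```python
import random

def sarsa_lambda(env_step, n_states, n_actions, n_steps,
                 epsilon=0.1, alpha=0.1, gamma=0.9, lam=0.8, seed=0):
    """Minimal tabular Sarsa(lambda) with an epsilon-greedy policy.

    env_step(state, action) must return (next_state, reward).
    Parameter defaults here are illustrative, not the externals' defaults.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]   # action values
    e = [[0.0] * n_actions for _ in range(n_states)]   # eligibility traces

    def choose(s):
        if rng.random() < epsilon:                     # explore
            return rng.randrange(n_actions)
        return max(range(n_actions), key=lambda a: q[s][a])  # exploit

    s = 0
    a = choose(s)
    for _ in range(n_steps):
        s2, r = env_step(s, a)
        a2 = choose(s2)
        delta = r + gamma * q[s2][a2] - q[s][a]        # TD error
        e[s][a] += 1.0                                 # accumulating trace
        for si in range(n_states):
            for ai in range(n_actions):
                q[si][ai] += alpha * delta * e[si][ai]
                e[si][ai] *= gamma * lam               # trace decay
        s, a = s2, a2
    return q
```

For example, in a one-state world where action 1 always pays a reward of 1 and action 0 pays nothing, the learned value of action 1 ends up clearly above that of action 0.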

To get started, try out the example patch rl_sarsa_egreedy-example.* in the examples/ folder. It simulates a system with two mobile robots, A and B, evolving along a one-dimensional line. The robots have a shared "brain" and are free to move along their line. They are reinforced when they stick together (i.e. when they are face-to-face). The rules are the following:

You can launch learning in four simple steps:

  1. Choose the initial state by selecting robot A and B positions
  2. Press the "reset" button in order to initialize the agent
  3. Activate learning with the "start/stop learning" toggle
  4. Start the agent with the "start/stop" toggle

The agent will start to run quickly from one state to the other. At first its actions will be random, but over time you will see them stabilize in a way that maximizes the reinforcement. You can observe this through the "mean reward" value: it will rise over time as the agent reaches equilibrium. There are many optimal behaviors, so the final result may vary.
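The example world can be sketched in a few lines. The exact rules live in the example patch itself; in this hypothetical reconstruction, the four actions and the reward-on-meeting rule are assumptions for illustration only:

```python
def robot_env(n_positions=8):
    """Hypothetical sketch of the two-robot line world from the example patch.

    State: (pos_a, pos_b). Actions 0-3: move A left, move A right,
    move B left, move B right (clamped to the line). The reward rule
    here (1.0 when the robots meet) is an assumption; the actual rules
    are those of rl_sarsa_egreedy-example.*.
    """
    def step(state, action):
        pos_a, pos_b = state
        if action == 0:
            pos_a = max(0, pos_a - 1)
        elif action == 1:
            pos_a = min(n_positions - 1, pos_a + 1)
        elif action == 2:
            pos_b = max(0, pos_b - 1)
        else:
            pos_b = min(n_positions - 1, pos_b + 1)
        reward = 1.0 if pos_a == pos_b else 0.0
        return (pos_a, pos_b), reward
    return step
```

A state is just a pair of positions, which is why the externals take one n_states_i argument per state dimension.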

4. Objects reference

4.1 rl_sarsa_egreedy

Implements the Sarsa-lambda algorithm with an epsilon-greedy policy. Under an epsilon-greedy policy, the agent usually chooses the action that it expects to yield maximum reinforcement, but with a small probability it instead chooses randomly among all possible actions. This allows it to explore new possibilities.

The chance of choosing randomly (i.e. of exploring instead of exploiting) is controlled by an "epsilon" parameter between 0 and 1: it is the probability that the agent chooses a random action. E.g. if epsilon = 0.1, the agent will choose a random action 1 time out of 10.
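The policy itself fits in a few lines. This is a generic illustration in Python (the helper name is hypothetical and not part of the library):

```python
import random

def egreedy(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a list of action values.

    With probability epsilon, pick uniformly at random (explore);
    otherwise pick the highest-valued action (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With epsilon = 0 the choice is purely greedy; with epsilon = 1 it is purely random.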


Attributes:

  1. epsilon: probability of choosing a random action (0 <= epsilon <= 1)
  2. lambda: governs the exponential decay of the eligibility traces, i.e. how far back in time a reward is propagated to previous state/action choices (acts as a "memory" of past choices) (0 <= lambda <= 1)
    • the higher it is, the further back credit is spread, so the agent learns temporally extended tasks more quickly
    • the lower it is, the shorter this memory; at lambda = 0 the algorithm reduces to one-step Sarsa, which suits purely reactive tasks
  3. gamma: discount factor, i.e. the importance given to expected future rewards over the immediate reward (not critical; keep around 0.1)
  4. learning_rate: often referred to as "alpha"; controls the step size of the updates (should be >= 0)
    • if high (i.e. close to 1) the agent adapts more quickly but might settle on a suboptimal solution
    • if low (i.e. close to 0) the agent learns more slowly but might reach a better solution
  5. n_actions: number of different possible actions
  6. n_states_1: number of different states for state dimension 1
  7. ...
  8. n_states_n: number of different states for state dimension n
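The effect of lambda can be seen directly from the trace-decay rule. Under the standard accumulating-trace scheme used by Sarsa-lambda, the eligibility of a past state/action pair shrinks by a factor gamma * lambda at every step, so lambda controls how long past choices remain eligible for credit (the helper below is purely illustrative):

```python
def trace_after(steps, lam, gamma=0.9):
    """Eligibility of a state/action pair `steps` updates after it last
    occurred, under the standard decay rule e <- gamma * lambda * e."""
    e = 1.0
    for _ in range(steps):
        e *= gamma * lam
    return e
```

For instance, five steps after a choice, its trace is still sizable with lambda = 0.9 but essentially gone with lambda = 0.1.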


Inlets:

  1. Anything:
    • "reset": resets the model
    • "startlearn": allows the agent to learn
    • "stoplearn": stops the agent from learning (it can still act, but will not learn anything new)
    • "bang": next step; updates the weights from the current state/reward and puts the next action on the outlet
  2. List int int: "set state state_index state_value", used to set the current state (after the action was taken)
  3. Float: "set reward", used to set the current reward (after the action was taken)


Outlets:

  1. Integer: "action", a number from 0 to (n_actions - 1) representing the action that was taken
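Each step of the message cycle described above follows the same pattern: bang the object, apply the resulting action in your patch, then report the resulting state and reward back. As a plain-Python mirror of that cycle (the class and method names here are hypothetical, not the external's API):

```python
class SarsaDriver:
    """Hypothetical mirror of the rl_sarsa_egreedy message cycle.

    The external expects, per step: a "bang" (which outputs the next
    action), then "set state"/"set reward" messages describing what
    happened after that action was applied in the world.
    """
    def __init__(self, agent):
        # agent: any object with bang/set_state/set_reward methods
        self.agent = agent

    def run_step(self, env_apply):
        action = self.agent.bang()              # outlet: chosen action index
        state, reward = env_apply(action)       # apply action in the world
        for dim, value in enumerate(state):
            self.agent.set_state(dim, value)    # inlet 2: "set state <dim> <value>"
        self.agent.set_reward(reward)           # inlet 3: current reward
        return action, reward
```

In a patch, env_apply corresponds to whatever part of your system reacts to the action and measures its outcome.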

4.2 rl_sarsa_softmax

This object behaves much like rl_sarsa_egreedy, except that the policy is different. Instead of choosing greedily, the agent always chooses randomly among all available actions, in a weighted fashion: actions which are expected to yield more reward have a higher probability of being chosen than actions for which the agent expects a lower reward.

The only attribute that differs is epsilon, which in this case becomes a temperature parameter (usually set to 1). For information on the softmax policy and how temperature affects it, see
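The usual formulation of softmax (Boltzmann) selection weights each action by exp(q / temperature): a high temperature makes the choice nearly uniform, while a low temperature makes it nearly greedy. A generic Python sketch (the helper name is hypothetical, not part of the library):

```python
import math
import random

def softmax_choice(q_values, temperature=1.0, rng=random):
    """Softmax (Boltzmann) action selection: each action is drawn with
    probability proportional to exp(q / temperature)."""
    m = max(q_values)                     # subtract max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    r = rng.random() * total              # sample from the weighted distribution
    acc = 0.0
    for a, w in enumerate(weights):
        acc += w
        if r <= acc:
            return a
    return len(q_values) - 1
```

Note that, unlike epsilon-greedy, even the best action is never chosen with certainty unless the temperature is very low.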