<h1 align="center">
  <br>
  <a href="https://github.com/keppel/weblearn-dqn"><img src="https://cloud.githubusercontent.com/assets/1269291/21950583/6d22659c-d9b1-11e6-8fb4-2d61b196b688.gif" alt="WebLearn DQN" width="400"></a>
  <br>
  WebLearn DQN
  <br>
</h1>
Reinforcement learning agent that uses a WebLearn model to approximate the Q-function for your environment.
Q-learning is an off-policy algorithm, which means it can learn about the environment using trajectories where the actions weren't sampled from the agent (e.g. from a human demonstrator). I'll probably add a demo of this soon.
Q-learning is also a model-free algorithm, which means it's not doing any planning or tree search. It's basically just estimating the discounted future rewards it expects to see if it takes an action `a` in state `s` and follows the optimal policy from there.
This implementation uses experience replay and temporal difference error clamping, but currently does not do fitted Q iteration ("target" network) or double DQN.
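To make that a bit more concrete, here's a rough sketch of the target a learn step regresses the model toward for one remembered transition, including the td-error clamping mentioned above. This is illustrative only; the names (`tdTarget`, `transition`, `qValues`, etc.) are made up for the example and aren't weblearn-dqn's internals.

```js
// Sketch of a single Q-learning regression target with td-error clamping.
// All names here are illustrative, not the package's actual internals.
function tdTarget (transition, qValues, gamma, maxError) {
  const { action, reward, nextQValues, done } = transition

  // discounted value of the best next action; zero if the episode ended here
  const futureValue = done ? 0 : gamma * Math.max(...nextQValues)
  const target = reward + futureValue

  // td-error: how far off the current estimate is for the action we took
  let tdError = target - qValues[action]

  // clamp the error so one surprising transition can't blow up the update
  if (maxError !== false) {
    tdError = Math.min(Math.max(tdError, -maxError), maxError)
  }

  return qValues[action] + tdError
}
```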
There's a demo using OpenAI Gym in `examples/`.
## Usage
```
npm install weblearn weblearn-dqn
```
```js
const ndarray = require('ndarray')
const DQN = require('weblearn-dqn')
const { ReLU, Linear, MSE, SGD, Sequential } = require('weblearn')

let model = Sequential({
  optimizer: SGD(.01),
  loss: MSE()
})

const STATE_SIZE = 2
const NUM_ACTIONS = 3

// model input should match state size
// and have one output for each action
model.add(Linear(STATE_SIZE, 20))
  .add(ReLU())
  .add(Linear(20, NUM_ACTIONS))

let agent = DQN({
  model: model,           // weblearn model. required.
  numActions: NUM_ACTIONS // number of actions. required.
})

// get these from your environment:
let observation = ndarray([.2, .74])
let reward = .3
let done = false

let action = agent.step(observation, reward, done)
// `action` is an integer in the range [0, NUM_ACTIONS)

// call this whenever ya wanna do a learn step.
// you can call this after each `agent.step()`, but you can also call it more or less often.
// just keep in mind, depending on the size of your model, this may block for a relatively long time.
let loss = agent.learn()
```
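For context, here's roughly how `step()` and `learn()` might fit into a training loop, continuing from the example above. `env` is a hypothetical gym-style environment (not part of this package), assumed to return ndarray observations.

```js
// `env` is a hypothetical gym-style environment, not provided by weblearn-dqn
observation = env.reset()
reward = 0
done = false

for (let t = 0; t < 10000; t++) {
  // the agent records the transition and picks the next action
  const action = agent.step(observation, reward, done)

  if (done) {
    // start a fresh episode
    observation = env.reset()
    reward = 0
    done = false
    continue
  }

  // apply the chosen action to the environment
  const result = env.step(action)
  observation = result.observation
  reward = result.reward
  done = result.done

  // learn as often as you like; here, once per environment step
  agent.learn()
}
```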
### `let agent = DQN(opts)`
`opts` should be an object with some of the following properties:

- `model`: WebLearn model. required.
- `numActions`: number. number of actions. required.
- `epsilon`: number. initial probability of selecting an action at random (for exploration). optional.
- `memorySize`: number. how many of our most recent experiences to remember for learning. optional.
- `maxError`: number or `false`. limit the absolute value of the td-error from a single experience. `false` for no limit. optional.
- `finalEpsilon`: number. probability of selecting an action at random after `epsilonDecaySteps` steps of training. optional.
- `epsilonDecaySteps`: number. on what timestep should we reach `epsilon === finalEpsilon`? optional.
- `learnBatchSize`: number. how many transitions should we learn from when we call `agent.learn()`? optional.
- `gamma`: number. parameter used for discounting rewards far in the future vs. rewards sooner. optional.
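For example, a fully-configured agent might look like this. The option values are arbitrary illustrations, not the package's defaults:

```js
// arbitrary example values; not the package defaults
let agent = DQN({
  model: model,             // required
  numActions: NUM_ACTIONS,  // required
  epsilon: 1.0,             // start fully random...
  finalEpsilon: 0.05,       // ...and settle at 5% random actions
  epsilonDecaySteps: 10000, // after 10k steps of training
  memorySize: 5000,         // keep the 5000 most recent transitions
  learnBatchSize: 32,       // sample 32 transitions per learn() call
  maxError: 1,              // clamp td-error to [-1, 1]
  gamma: 0.99               // discount factor for future rewards
})
```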
### `let action = agent.step(observation, reward, done)`

returns a number `action` (integer specifying the index of the action to take).

- `observation`: ndarray. some representation of the state of your environment. required.
- `reward`: number. this is what the agent will try to maximize. required.
- `done`: boolean. is this state the last state of an episode? optional.
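For example, when an episode ends, pass `done = true` so the terminal transition gets recorded. `finalObservation`, `finalReward`, and `env` here are hypothetical placeholders:

```js
// last step of an episode: pass done = true so the agent
// records this transition as terminal
agent.step(finalObservation, finalReward, true)

// first step of the next episode (env is a hypothetical environment)
let action = agent.step(env.reset(), 0, false)
```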
### `let loss = agent.learn()`

makes the agent do some learning. this can take a long time. returns the loss from the learn step. the loss from a single learn step will be pretty noisy since experiences are sampled from memory at random, but if you average over multiple `.learn()` calls, that might be useful.
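For example, one way to get a smoother loss signal (purely illustrative, not part of the API):

```js
// average the loss over a window of learn steps to get a less noisy signal
let losses = []
for (let i = 0; i < 100; i++) {
  losses.push(agent.learn())
}
const meanLoss = losses.reduce((sum, l) => sum + l, 0) / losses.length
console.log('average loss over 100 learn steps:', meanLoss)
```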
🤖