webnav

Commits

List of commits on branch master.
2eafdece03a647c5568208de16f4c4225fc12f49

Update readme

hhans committed 8 years ago
92870859b70b4e817b262397cd76051011a9722f

Stage simple README

hhans committed 8 years ago
d255465f94f6c5cc74df22e0bbfa5e9eebf66317

Get greedy rollout script working with latest code

hhans committed 8 years ago
9ff5f80918511d1ecaa4e0abdfc1b5209efae185

Remove very outdated rllab code

hhans committed 8 years ago
2c032faeea9220dafb444679bbb5954eb70258a7

Fix crashes on non-cycle communicative training runs.. derp

hhans committed 8 years ago
585091b5d5867d5e82ee168ec0349b8f7b5667f7

Fix debug columns again

hhans committed 8 years ago

README

This is the first instantiation of a paradigm in simulated language acquisition that I have been developing during my time at OpenAI.

The paradigm

Here's the gist, for those interested:

  • A child exists in some physical world. The child has limited observations and a limited action space (i.e., some observations or actions needed to achieve the goal are unavailable to the child).
  • There is also a parent in the same environment, which has full observation of the environment and a full action space over it.
  • The parent speaks some fixed language, and takes actions or provides information only when requested by the child.
  • The child can send messages to the parent and observe the parent's responses.

This paradigm is designed such that the child must learn the language of the parent in order to accomplish the goal. Crucially, language here is a side-effect of the quest to achieve some other non-linguistic goal.

This application

This repository contains a concrete application of the idea above. Here our "child" is spawned on a random page in the Wikipedia graph, and must navigate to some target page only by clicking on page links. Of course, our goal is to make the traversal in the minimum number of hops. This is a pretty difficult task requiring a substantial amount of world knowledge. It becomes easier, though, when you have a "parent" around to help!

There are many possible ways to define a "parent" in this context. At the simplest level, a parent might point out the best link to follow (according to, e.g., an A* search or a heuristic). At the most complex level, a parent might simply Google the child's questions or send them to a knowledge base, and then forward the responses to the child.
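As a rough illustration of the simplest kind of parent, the sketch below answers with the index of a link that lies on a shortest path to the target. It is a minimal sketch under stated assumptions, not the repository's code: it swaps in a plain breadth-first search instead of A*, and `TOY_GRAPH`, `hops_to_target`, `best_link_index`, and `parent_response` are hypothetical names over a made-up link graph.

```python
from collections import deque

# Hypothetical toy link graph; the real environment uses the Wikipedia graph.
TOY_GRAPH = {
    "Boeing_747": ["Airline", "Jet_engine", "Seattle"],
    "Airline": ["Boeing_747"],
    "Jet_engine": ["Boeing_747", "Engineering"],
    "Seattle": ["Jazz", "Boeing_747"],
    "Engineering": ["Jet_engine"],
    "Jazz": ["Seattle", "Piano"],
    "Piano": ["Jazz", "Frédéric_Chopin"],
    "Frédéric_Chopin": ["Piano"],
}


def hops_to_target(graph, target):
    """Breadth-first search over reversed links: hops from each page to target."""
    reverse = {page: [] for page in graph}
    for page, links in graph.items():
        for link in links:
            reverse.setdefault(link, []).append(page)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        page = queue.popleft()
        for prev in reverse.get(page, []):
            if prev not in dist:
                dist[prev] = dist[page] + 1
                queue.append(prev)
    return dist


def best_link_index(graph, page, target):
    """Index, in the page's ordered link list, of a link closest to target."""
    dist = hops_to_target(graph, target)
    links = graph[page]
    reachable = [i for i, link in enumerate(links) if link in dist]
    if not reachable:
        return None  # no link on this page leads to the target
    return min(reachable, key=lambda i: dist[links[i]])


def parent_response(graph, page, target):
    """Assumed 'fixed language': the parent replies with the index as a string."""
    index = best_link_index(graph, page, target)
    return "" if index is None else str(index)
```

For example, `parent_response(TOY_GRAPH, "Boeing_747", "Frédéric_Chopin")` picks the `Seattle` link in this toy graph. Replying with a bare index keeps the parent's language fixed and leaves the child to learn what the reply means.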

Example trajectory

Here's a trajectory from a Q-learning model in this environment. The child always has the choice to 1) take actions in its environment or 2) communicate with the parent. Here we can see the learned policy alternates between the two.

Notes:

  • The child can utter a single token at each timestep (shown as "string" actions below), visit a new page, or SEND the tokens uttered so far to the parent. It only receives a response after executing SEND.
  • Rollouts have a fixed length, so after reaching the target page, this child learns to make short "cycles" that leave the page and return to it.
  • The numbers in parentheses indicate the index of the action in the ordered list of possible actions. The parent responds with a number string, which the child learns to correctly map onto its own action space (as demonstrated by these numbers in parentheses).

Trajectory (target: Frédéric_Chopin):

        Action                                          Reward
        -------------------------------------------------------
        Boeing_747 (0)                                  0.00000
        "which"                                        -0.25000
        SEND                                            0.72850
                --> Response: "6"
        Piano (6)                                       0.77280
        Frédéric_Chopin (16)                            8.28645
        Warsaw (6)                                      0.17489
        "which"                                        -0.25000
        SEND                                            0.72850
                --> Response: "15"
        Jazz (15)                                       0.07623
        "which"                                        -0.25000
        SEND                                            0.72850
                --> Response: "12"
        Piano (12)                                      0.77280
        Frédéric_Chopin (6)                             8.28645
        Niccolò_Paganini (25)                           0.84642
        Frédéric_Chopin (3)                             8.28645
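
The three kinds of action described in the notes (utter a token, SEND, visit a page) can be sketched as a small interface. This is a hypothetical illustration rather than the repository's environment: `ToyChildEnv`, `lookup_parent`, `GRAPH`, and `BEST` are invented names, the graph is a toy, and rewards are not modelled.

```python
# Hypothetical toy graph and a hard-coded "best link" table for the parent.
GRAPH = {
    "Seattle": ["Jazz"],
    "Jazz": ["Seattle", "Piano"],
    "Piano": ["Jazz", "Frédéric_Chopin"],
    "Frédéric_Chopin": ["Piano"],
}
BEST = {"Seattle": 0, "Jazz": 1, "Piano": 1, "Frédéric_Chopin": 0}


def lookup_parent(page, tokens):
    """Assumed parent: answers the one-token query "which" with a link index."""
    if tokens == ("which",):
        return str(BEST[page])
    return ""  # any other message gets an empty reply


class ToyChildEnv:
    """Sketch of the child's action interface (illustrative only)."""

    def __init__(self, graph, start, parent_fn):
        self.graph = graph
        self.page = start
        self.parent_fn = parent_fn
        self.buffer = []  # tokens uttered since the last SEND

    def actions(self):
        """Ordered list of links the child can follow from the current page."""
        return list(self.graph[self.page])

    def utter(self, token):
        """Queue one token; the parent sees nothing until send() is called."""
        self.buffer.append(token)
        return None

    def send(self):
        """Deliver the queued tokens to the parent and return its response."""
        response = self.parent_fn(self.page, tuple(self.buffer))
        self.buffer = []
        return response

    def visit(self, index):
        """Follow the link at the given index in the ordered action list."""
        self.page = self.actions()[index]
        return self.page


env = ToyChildEnv(GRAPH, "Jazz", lookup_parent)
env.utter("which")
reply = env.send()          # the parent's number string
env.visit(int(reply))       # a trained child would learn this mapping itself
```

Here the mapping from reply to action is hard-coded with `int(reply)`, whereas in the trajectory above the child learns it.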