DreamCoder - 0 Probabilistic Foundation

The start:

Consider a set of lambda expressions $D$, and a weight vector $\theta$ that assigns a weight to each program in $D$, plus a bias term (which assigns the probability of picking a variable, as we will see later). So $\theta \in \mathbb{R}^{|D|+1}$. Given a set of tasks $X$, we want to find the likelihood of solving the set of tasks $X$ with the pair $(D, \theta)$. Call this joint likelihood $J(D, \theta)$.
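
To fix ideas, here is a minimal sketch of how the pair $(D, \theta)$ might be represented. The `Library` class and its field names are illustrative assumptions of mine, not DreamCoder's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Library:
    """A library D of lambda-expression primitives with weights theta.

    theta has |D| + 1 entries: one weight per primitive in D, plus a
    final bias term for the probability of picking a variable.
    """
    primitives: List[str]                          # e.g. ["+", "map", "fold"]
    theta: List[float] = field(default_factory=list)

    def __post_init__(self):
        if not self.theta:
            # default to uniform weights: |D| primitives + 1 variable bias
            self.theta = [1.0] * (len(self.primitives) + 1)
        assert len(self.theta) == len(self.primitives) + 1
```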

Step 1: Given $(D, \theta)$, how likely are we to solve the tasks in $X$?
First, pick a task $x \in X$. For a program $p$, $P[x \mid p]$ denotes the probability that $x$ is solved given $p$, and $P[p \mid D, \theta]$ denotes how likely program $p$ is to be picked, given $D$ and $\theta$.
Then, summing over all possible programs $p$ (there are infinitely many of them) given $(D, \theta)$, the likelihood of solving a particular task $x$ is

$$P_{\text{solve}}(x \mid D, \theta) = \sum_{p} P[p \mid D, \theta] \, P[x \mid p]$$
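
Any actual computation must truncate this infinite sum. A minimal sketch, where `prior` and `likelihood` are assumed stand-ins for $P[p \mid D, \theta]$ and $P[x \mid p]$; enumerating only finitely many programs makes the result a lower bound on the true value:

```python
from typing import Callable, Iterable

def p_solve(x, programs: Iterable, prior: Callable, likelihood: Callable) -> float:
    """Truncated estimate of P_solve(x | D, theta).

    The true sum is over all (infinitely many) programs; enumerating a
    finite subset yields a lower bound. `prior(p)` stands in for
    P[p | D, theta] and `likelihood(x, p)` for P[x | p].
    """
    return sum(prior(p) * likelihood(x, p) for p in programs)

# toy usage: only p1 solves the task, and p1 carries prior mass 0.3
print(p_solve("task", ["p1", "p2"],
              prior={"p1": 0.3, "p2": 0.1}.get,
              likelihood=lambda x, p: 1.0 if p == "p1" else 0.0))  # 0.3
```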

Step 2: Tasks are solved independently of each other. Hence:

$$P_{\text{solve}}(X \mid D, \theta) = \prod_{x \in X} P_{\text{solve}}(x \mid D, \theta)$$
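
Each factor lies in $[0, 1]$, so this product underflows quickly in floating point when $X$ is large; accumulating in log space (an implementation choice of this sketch, not something the paper prescribes) avoids that:

```python
import math
from typing import Callable, Iterable

def p_solve_all(tasks: Iterable, p_solve_one: Callable) -> float:
    """P_solve(X | D, theta) as the product of per-task probabilities,
    accumulated in log space to avoid floating-point underflow.
    `p_solve_one(x)` is assumed to return P_solve(x | D, theta)."""
    log_total = 0.0
    for x in tasks:
        p = p_solve_one(x)
        if p == 0.0:
            return 0.0  # a single unsolvable task zeroes the whole product
        log_total += math.log(p)
    return math.exp(log_total)
```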

Step 3: How likely is the pair $(D, \theta)$ itself? Call this prior $P[D, \theta]$.

Step 4: Combine. The probability of solving all of $X$ with some $(D, \theta)$ is the likelihood of getting $(D, \theta)$ and then solving all of $X$ given this $(D, \theta)$:

$$J(D, \theta) = P[D, \theta] \, P_{\text{solve}}(X \mid D, \theta)$$

Hence, finally:

$$J(D, \theta) = P[D, \theta] \prod_{x \in X} \sum_{p} P[p \mid D, \theta] \, P[x \mid p]$$

Remember that for any $(D, \theta)$ and a set of tasks $X$, $J(D, \theta)$ gives us the probability of solving all these tasks using $(D, \theta)$.
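
Putting the pieces together, reusing `p_solve_all` from the sketch above, with `p_D_theta` as an assumed stand-in for the prior $P[D, \theta]$:

```python
def joint_likelihood(p_D_theta: float, tasks, p_solve_one) -> float:
    """J(D, theta) = P[D, theta] * prod_x P_solve(x | D, theta),
    with each per-task sum truncated as in p_solve above."""
    return p_D_theta * p_solve_all(tasks, p_solve_one)
```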

The idea is to first get $D^* = \arg\max_D \int J(D, \theta) \, d\theta$. The integral over $\theta$ gives us the marginal distribution of $J$ with respect to $D$: we sum (or integrate) out the contributions of the other variable ($\theta$ in this case), leaving a distribution that depends only on $D$, with $\theta$'s contribution averaged out. We can then take the $D^*$ that maximizes this marginal distribution.

Finally, we get $\theta^* = \arg\max_\theta J(D^*, \theta)$. This finds the best $\theta$ for that $D^*$.

The car analogy

Think of $D$ as a car model (we could have many of them), $\theta$ as a vector of parameters such as fuel, tires, type of road, etc., and $J(D, \theta)$ as a performance rating (0 to 1) of model $D$ under parameters $\theta$. We want to find a car model $D^*$ that does well on average across all parameters, so we marginalize $\theta$ out by integrating/summing $J$ with respect to $\theta$, then find the car $D^*$ that maximizes this integral.

Then, for this best "averaged-out" car model $D^*$, we want to run it on the parameters $\theta^*$ that maximize the performance $J(D^*, \theta)$.

The main equation
$$
\begin{aligned}
J(D, \theta) &= P[D, \theta] \prod_{x \in X} \sum_{p} P[p \mid D, \theta] \, P[x \mid p] \\
D^* &= \arg\max_D \int J(D, \theta) \, d\theta \\
\theta^* &= \arg\max_\theta \, J(D^*, \theta)
\end{aligned}
\tag{1}
$$

Introducing Beams:

In the set of equations (1), for a particular $D$, the sum in the likelihood ranges over all possible programs $p$, which is an infinite set even for a finite $D$. Remember that the last two equations (the argmaxes) are supposed to be computed algorithmically, so we need to do better. Enter the beam: a beam is a fixed set of promising programs $B_x$ for each task $x \in X$. We maintain a separate beam for each task, $\{B_x\} = \{B_x : x \in X\}$, and denote by $M$ the maximum beam width: that is, $|B_x| \le M$ for all $x \in X$. The paper suggests $M = 5$ (a sketch of the beam bookkeeping follows the bound below).

$$
\begin{aligned}
\mathcal{L}(D, \theta, \{B_x\}) &= P[D, \theta] \prod_{x \in X} \sum_{p \in B_x} P[p \mid D, \theta] \, P[x \mid p] \\
D^* &= \arg\max_D \int \mathcal{L}(D, \theta, \{B_x\}) \, d\theta \\
\theta^* &= \arg\max_\theta \, \mathcal{L}(D^*, \theta, \{B_x\})
\end{aligned}
\tag{2}
$$

In the first of equations (2), we now sum over only a finite set of programs per task, whereas in the first of (1) we summed over infinitely many. Since every summand is nonnegative, dropping terms can only shrink the sum, so we get the obvious lower bound:

$$\mathcal{L}(D, \theta, \{B_x\}) \le J(D, \theta)$$
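
To make the beam bookkeeping concrete, here is a minimal sketch (the data layout and scoring are my assumptions for illustration, not the paper's implementation): one beam per task, each capped at $M$ entries, stored as a min-heap so the weakest program is evicted first.

```python
import heapq
from typing import Dict, List, Tuple

M = 5  # maximum beam width |B_x| <= M, as suggested by the paper

def update_beam(beam: List[Tuple[float, str]], score: float, program: str,
                max_width: int = M) -> None:
    """Insert `program` with `score` (meant to be P[p|D,theta] * P[x|p]),
    keeping only the top `max_width` programs. Min-heap: beam[0] is weakest."""
    if len(beam) < max_width:
        heapq.heappush(beam, (score, program))
    elif score > beam[0][0]:
        heapq.heapreplace(beam, (score, program))

# one beam per task: {B_x : x in X}
beams: Dict[str, List[Tuple[float, str]]] = {}
```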

Bayes' Rule

Bayes' rule expresses the posterior probability of a hypothesis $p$ given data $x$ as a combination of the likelihood, prior, and evidence:

$$P[p \mid x, D, \theta] = \frac{P[x \mid p] \, P[p \mid D, \theta]}{P[x \mid D, \theta]}$$
Explanation of Terms
  1. Posterior Probability:
    $$P[p \mid x, D, \theta]$$

    • The probability of program $p$ being the correct explanation for task $x$, given the library $D$ and parameters $\theta$.
    • This tells us how much we believe in $p$ after observing $x$.
  2. Likelihood:
    $$P[x \mid p]$$

    • The probability of observing $x$, assuming $p$ is the program being executed.
    • This measures how well $p$ explains $x$.
  3. Prior Probability:
    $$P[p \mid D, \theta]$$

    • The probability of program $p$, based on the library $D$ and parameters $\theta$.
    • This reflects the plausibility of $p$ before observing any specific task.
  4. Evidence:
    $$P[x \mid D, \theta]$$

    • The normalization constant, which ensures the posterior is a valid probability distribution.
    • It can be written as (see the sketch below):
      $$P[x \mid D, \theta] = \sum_{p} P[x \mid p] \, P[p \mid D, \theta]$$
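
In this setting the evidence is just as intractable as the sums above, but once attention is restricted to a finite beam $B_x$ it becomes a cheap normalization. A minimal sketch, with `prior` and `likelihood` again as assumed stand-ins for $P[p \mid D, \theta]$ and $P[x \mid p]$:

```python
def posterior_over_beam(x, beam, prior, likelihood) -> dict:
    """Approximate P[p | x, D, theta] by restricting the evidence sum
    to the finite beam B_x and normalizing prior * likelihood."""
    scores = {p: prior(p) * likelihood(x, p) for p in beam}
    z = sum(scores.values())  # beam-restricted evidence P[x | D, theta]
    return {p: s / z for p, s in scores.items()} if z > 0 else scores
```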
The phases of the algorithm:
  1. Wake: The $(D, \theta)$ are fixed, and we want to maximize $\mathcal{L}$ by selecting "good" beams $B_x$ for each $x$. We essentially want to select programs $p$ that maximize $P[p \mid D, \theta] \, P[x \mid p] \propto P[p \mid x, D, \theta]$. The proportionality comes from Bayes' rule, with the evidence acting as the proportionality constant (which doesn't depend on $p$, so we can ignore it). Hence, during the wake phase, we are trying to maximize the posterior $P[p \mid x, D, \theta]$ by selecting good programs $p$ for each $x \in X$ (see the sketch after this list).

  2. Sleep (Abstraction): Here, we maximize $\mathcal{L}$ by means of the library. In particular, we find $D^*$ and then $\theta^*$ from equations (2).

  3. Sleep (Dream): Here, we train a neural recognition model $Q(p \mid x, D, \theta)$ which approximates the posterior $P[p \mid x, D, \theta] \propto P[p \mid D, \theta] \, P[x \mid p]$. This makes the massive search over the combinatorial space of programs generable from $D$ easier. Hence $Q(p \mid x, D, \theta)$ is trained to assign higher probabilities to programs $p$ that have a larger $P[p \mid D, \theta] \, P[x \mid p]$, and the programs with the largest outputs from $Q$ are incorporated into the beam, as these will better maximize $\mathcal{L}$.
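
Tying the phases back to the earlier sketches, here is roughly what the wake-phase beam refill could look like. It reuses `update_beam` and `M` from the beam sketch above; `candidates_for` is an assumed enumerator of candidate programs (in the actual system this search is guided by the recognition model $Q$), not a DreamCoder API.

```python
def wake_phase(tasks, candidates_for, prior, likelihood, beams, max_width=M):
    """Wake: with (D, theta) fixed, score each candidate program by
    P[p | D, theta] * P[x | p] -- proportional to the posterior -- and
    keep the top max_width programs per task in its beam B_x."""
    for x in tasks:
        beam = beams.setdefault(x, [])
        for p in candidates_for(x):
            update_beam(beam, prior(p) * likelihood(x, p), p, max_width)
    return beams
```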