Consider a set of lambda expressions $D$, and a weight vector $\theta$ that assigns a weight to each program in $D$, plus a bias term (this assigns the probability of picking a variable, which we will see later). So $|\theta| = |D| + 1$. Given a set of tasks $X$, we want to find the likelihood/probability of solving the set of tasks with the pair $(D, \theta)$. Call this joint likelihood $J(D, \theta)$.
Step 1: Given $(D, \theta)$, how likely is it to solve all the tasks in $X$?
First, pick a task $x \in X$. For a program $p$, $P(x \mid p)$ denotes the probability that $x$ is solved given $p$, and $P(p \mid D, \theta)$ denotes how likely program $p$ is to be picked, given $D$ and $\theta$.
Then, summing over all possible programs $p$ (there are infinitely many of them) given $(D, \theta)$, the likelihood of solving a particular task $x$ is

$$P(x \mid D, \theta) = \sum_{p} P(x \mid p)\, P(p \mid D, \theta).$$
Step 2: Solving each task is independent of solving the others. Hence:

$$P(X \mid D, \theta) = \prod_{x \in X} P(x \mid D, \theta) = \prod_{x \in X} \sum_{p} P(x \mid p)\, P(p \mid D, \theta).$$
Step 3: How likely is the pair $(D, \theta)$ itself? $P(D, \theta) = P(D)\, P(\theta \mid D)$.
Step 4: Combine. The probability of solving all of $X$ with some $(D, \theta)$ is the likelihood of getting $(D, \theta)$ and then solving all of $X$ given this $(D, \theta)$.
Hence, finally:

$$J(D, \theta) \triangleq P(D)\, P(\theta \mid D) \prod_{x \in X} \sum_{p} P(x \mid p)\, P(p \mid D, \theta).$$
Remember that for any $(D, \theta)$ and a set of tasks $X$, $J(D, \theta)$ gives us the probability of solving all these tasks using $(D, \theta)$.
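To make this concrete, here is a minimal Python sketch of computing $\log J(D, \theta)$. All helper names (`enumerate_programs`, `p_task_given_program`, `p_program_given_library`, `p_library`, `p_theta_given_library`) are hypothetical stand-ins, and the infinite sum over programs is truncated to a finite enumeration purely for illustration.

```python
import math

def joint_log_likelihood(D, theta, tasks,
                         enumerate_programs,       # (D, theta) -> finite list of candidate programs (truncation!)
                         p_task_given_program,     # P(x | p)
                         p_program_given_library,  # P(p | D, theta)
                         p_library,                # P(D)
                         p_theta_given_library):   # P(theta | D)
    """Truncated log J(D, theta) = log P(D) + log P(theta | D)
                                   + sum_x log sum_p P(x | p) P(p | D, theta)."""
    log_j = math.log(p_library(D)) + math.log(p_theta_given_library(theta, D))
    programs = enumerate_programs(D, theta)
    for x in tasks:
        # Inner sum of the joint likelihood, restricted to the enumerated programs.
        marginal = sum(p_task_given_program(x, p) * p_program_given_library(p, D, theta)
                       for p in programs)
        log_j += math.log(marginal)
    return log_j
```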
The idea is to first get $D^* = \arg\max_{D} \int J(D, \theta)\, d\theta$. The integral over $\theta$ allows us to get the marginal distribution of $J$ with respect to $D$, which basically means we are summing (or integrating) the contribution of the other variable ($\theta$ in this case) to get a distribution that is only affected by $D$, with $\theta$'s contribution averaged out. We can now get the $D^*$ which maximizes this marginal distribution.
Finally, we get $\theta^* = \arg\max_{\theta} J(D^*, \theta)$. This allows us to find the best $\theta$ for the chosen $D^*$.
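This two-stage optimization can be sketched as follows, assuming (purely for illustration) a finite set of candidate libraries and a finite grid of $\theta$ values over which the integral is approximated by an average; `joint_likelihood(D, theta)` is a hypothetical helper returning $J(D, \theta)$.

```python
def choose_library_and_weights(candidate_libraries, theta_grid, joint_likelihood):
    # Stage 1: "average theta out" (a crude grid average stands in for the integral),
    # then pick the library with the largest marginal score.
    def marginal(D):
        return sum(joint_likelihood(D, theta) for theta in theta_grid) / len(theta_grid)
    D_star = max(candidate_libraries, key=marginal)

    # Stage 2: for the chosen library, pick the weights that maximize J(D*, theta).
    theta_star = max(theta_grid, key=lambda theta: joint_likelihood(D_star, theta))
    return D_star, theta_star
```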
The car analogy
Think about $D$ as a car model (we could have many of them), and $\theta$ as a vector of parameters such as fuel, tires, type of road, etc., and think of $J(D, \theta)$ as a performance rating (0 to 1) of the model $D$ under parameters $\theta$. We want to find a car model that does well on average under all parameters, so we marginalize $\theta$ out by integrating/summing with respect to $\theta$, and then find the car $D^*$ that maximizes this integral.
Then, for this best "averaged-out" car model $D^*$, we want it to run on the parameters $\theta^*$ that maximize the performance $J(D^*, \theta)$.
The main equation

$$J(D, \theta) \triangleq P(D)\, P(\theta \mid D) \prod_{x \in X} \sum_{p} P(x \mid p)\, P(p \mid D, \theta)$$
$$D^* = \arg\max_{D} \int J(D, \theta)\, d\theta$$
$$\theta^* = \arg\max_{\theta} J(D^*, \theta) \tag{1}$$
Introducing Beams:
In the set of equations (1), for a particular $x$, we sum over all possible programs $p$ in the probability $\sum_{p} P(x \mid p)\, P(p \mid D, \theta)$, and there are infinitely many such programs, even for a finite $D$. Remember that the last two equations (where we're doing the argmax) are supposed to be computed algorithmically. So we need to do better. Enter the beam: a beam is a fixed set of promising programs for each task $x$. We maintain a separate beam $\mathcal{B}_x$ for each task $x$, and denote by $b$ the maximum beam width, that is, $|\mathcal{B}_x| \le b$. The paper suggests a specific value for $b$. Restricting the inner sum to the beams gives the beam-restricted objective:

$$\mathcal{L}(D, \theta) \triangleq P(D)\, P(\theta \mid D) \prod_{x \in X} \sum_{p \in \mathcal{B}_x} P(x \mid p)\, P(p \mid D, \theta)$$
$$D^* = \arg\max_{D} \int \mathcal{L}(D, \theta)\, d\theta$$
$$\theta^* = \arg\max_{\theta} \mathcal{L}(D^*, \theta) \tag{2}$$
In the first of equations (2), we are now summing the probabilities of a finite number of programs for each task, whereas in the first of equations (1) we were summing the probabilities of an infinite number of programs for each task. Hence, we get the obvious lower bound:

$$\mathcal{L}(D, \theta) \le J(D, \theta).$$
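As a companion to the earlier sketch, here is a minimal Python sketch of the beam-restricted lower bound: it is identical to the truncated $J(D, \theta)$ computation except that the inner sum runs only over the beam $\mathcal{B}_x$ of each task. The helper names are the same hypothetical stand-ins as before.

```python
import math

def beam_log_lower_bound(D, theta, beams,          # beams: dict mapping task x -> list of programs B_x
                         p_task_given_program,     # P(x | p)
                         p_program_given_library,  # P(p | D, theta)
                         p_library,                # P(D)
                         p_theta_given_library):   # P(theta | D)
    """log L(D, theta), which lower-bounds log J(D, theta)."""
    log_l = math.log(p_library(D)) + math.log(p_theta_given_library(theta, D))
    for x, beam in beams.items():
        inner = sum(p_task_given_program(x, p) * p_program_given_library(p, D, theta)
                    for p in beam)
        log_l += math.log(inner)
    return log_l
```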
Bayes' Rule
Bayes' rule expresses the posterior probability of a hypothesis given data as a combination of the likelihood, prior, and evidence:

$$P(p \mid x, D, \theta) = \frac{P(x \mid p)\, P(p \mid D, \theta)}{P(x \mid D, \theta)}$$
Explanation of Terms
Posterior Probability: $P(p \mid x, D, \theta)$
The probability of program $p$ being the correct explanation for task $x$, given the library $D$ and parameters $\theta$.
This tells us how much we believe in $p$ after observing $x$.
Likelihood: $P(x \mid p)$
The probability of observing $x$, assuming $p$ is the program being executed.
This measures how well $p$ explains $x$.
Prior Probability: $P(p \mid D, \theta)$
The probability of program $p$, based on the library $D$ and parameters $\theta$.
This reflects the plausibility of $p$ before observing any specific task.
Evidence: $P(x \mid D, \theta)$
The normalization constant, which ensures the posterior is a valid probability distribution.
It can be written as:

$$P(x \mid D, \theta) = \sum_{p} P(x \mid p)\, P(p \mid D, \theta)$$
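In practice the evidence can only be approximated, since the sum over all programs is infinite. Here is a minimal Python sketch that restricts Bayes' rule to a beam, so the returned weights form a proper distribution over the programs we actually have (helper names are hypothetical, as before).

```python
def posterior_over_beam(x, beam, D, theta,
                        p_task_given_program, p_program_given_library):
    # Unnormalized posterior weights P(x | p) * P(p | D, theta) for each program in the beam.
    unnormalized = {p: p_task_given_program(x, p) * p_program_given_library(p, D, theta)
                    for p in beam}
    evidence = sum(unnormalized.values())   # beam-restricted approximation of P(x | D, theta)
    return {p: w / evidence for p, w in unnormalized.items()}
```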
The phases of the algorithm:
Wake: The library $D$ and weights $\theta$ are fixed, and we want to maximize $\mathcal{L}(D, \theta)$ by selecting "good" beams $\mathcal{B}_x$ for each $x$. We essentially want to select programs $p$ that maximize the probability $P(p \mid x, D, \theta) \propto P(x \mid p)\, P(p \mid D, \theta)$. The proportionality comes from Bayes' rule, with the evidence acting as the proportionality constant (which doesn't depend on $p$, hence we can ignore it). So during the wake phase, we are trying to maximize the posterior by selecting good programs for each $x$ (a sketch of this beam update appears after this list).
Sleep (Abstraction): Here, we maximize $\mathcal{L}$ by means of the library. In particular, we find $D^* = \arg\max_{D} \int \mathcal{L}(D, \theta)\, d\theta$ and then $\theta^* = \arg\max_{\theta} \mathcal{L}(D^*, \theta)$ from equations (2).
Sleep (Dream): Here, we train a neural recognition model $Q(p \mid x)$ which approximates the posterior $P(p \mid x, D, \theta)$. This makes the massive search over the combinatorial space of programs generable by $D$ easier. Hence $Q$ is trained to output higher probabilities for programs that have a larger $P(p \mid x, D, \theta)$, and the programs with larger outputs from $Q$ are incorporated into the beams, as these will better maximize $\mathcal{L}(D, \theta)$.
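A minimal Python sketch of the wake-phase beam update described above: with $(D, \theta)$ fixed, propose candidate programs for each task (e.g. guided by the recognition model $Q$), score them by the unnormalized posterior $P(x \mid p)\, P(p \mid D, \theta)$, and keep the $b$ best as the new beam $\mathcal{B}_x$. The `propose_programs` helper and the other names are hypothetical stand-ins.

```python
def wake_update_beams(tasks, D, theta, b,
                      propose_programs,         # x -> candidate programs (e.g. sampled/enumerated under Q)
                      p_task_given_program,     # P(x | p)
                      p_program_given_library): # P(p | D, theta)
    beams = {}
    for x in tasks:
        candidates = propose_programs(x)
        # Rank candidates by the unnormalized posterior P(x | p) * P(p | D, theta).
        scored = sorted(candidates,
                        key=lambda p: p_task_given_program(x, p)
                                      * p_program_given_library(p, D, theta),
                        reverse=True)
        beams[x] = scored[:b]   # keep at most b promising programs for task x
    return beams
```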