The temporal difference learning (TDL) model makes two distinctive predictions: the existence of a continuous state value signal, V(t), that encodes the sum of predicted future rewards, and the existence of circuitry that computes the momentary reward prediction (RP) through temporal differentiation of V(t). Reward prediction errors (RPEs) are in turn computed as the difference between the momentary reward and RP.
We showed that the results of two recent influential studies, which described the activity of two types of VTA GABAergic neurons and their effects on dopaminergic RPEs, provide evidence for both predictions of the TDL theory.
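To make the relation between V(t), RP, and RPE concrete, the following is a minimal discrete-time sketch, not the model used in the studies discussed here. It assumes no temporal discounting, a rectangular value signal that is 1 from CS to UCS, and a single unit reward at the UCS; the step counts and the helper name td_errors are hypothetical. The discrete difference V[t+1] - V[t] stands in for the time derivative of V(t).

```python
import numpy as np

def td_errors(value, reward):
    """RPE[t] = reward[t] - RP[t], where RP[t] = -(value[t+1] - value[t]).

    Assumes no discounting; value[t] is the sum of predicted rewards
    from step t onward (including step t).
    """
    rp = -np.diff(value)        # momentary reward prediction
    return reward[:-1] - rp     # reward prediction error

n_steps = 10                    # hypothetical trial length
cs, ucs = 2, 7                  # hypothetical CS and UCS time steps

reward = np.zeros(n_steps)
reward[ucs] = 1.0               # unit reward delivered at the UCS

# After learning: rectangular value signal spanning CS to UCS.
value = np.zeros(n_steps)
value[cs:ucs + 1] = 1.0

rpe_trained = td_errors(value, reward)            # reward is predicted
rpe_naive = td_errors(np.zeros(n_steps), reward)  # reward is unpredicted

print("trained:", rpe_trained)
print("naive:  ", rpe_naive)
```

Under these assumptions the trained case shows a single positive error on the transition into the CS (index cs - 1) and no error at the reward, whereas the naive case shows the error at the reward itself.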
Type-2 and Type-3 GABAergic VTA neurons.
Type-2 and Type-3 neurons exhibit sustained elevated and reduced activity, respectively, between the stimulus (CS) and the reward (UCS) in a delay conditioning setting. These responses are driven by a sustained input extending between the CS and UCS. The responses of the cells can be thought of as convolutions of the input with exponential kernels with different time constants. The appropriately weighted sum of the two signals is the time derivative of the low-pass-filtered rectangular pulse input.
Because the summation of the two cells' activities estimates the time derivative of their input, and because the effect of Type-2 cells on dopaminergic RPEs has been directly shown, we conclude that the responses of the cells are the hallmarks of the temporal differentiation predicted by the TDL model.
Moreover, the rectangular pulse input corresponds to the predicted V(t) signal. Together, these observations provide direct evidence for the implementation of the computation of RP predicted by the TDL model.
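The filtering argument above can be checked numerically. The sketch below is an illustration under stated assumptions, not the circuit model itself: each response is taken to be a unit-gain first-order low-pass filter of a rectangular pulse, the time constants (0.1 s and 0.5 s) and the CS and UCS times are hypothetical, and no assignment of the fast or slow kernel to Type-2 or Type-3 cells is implied. With these choices, the difference of the two filtered signals divided by the difference of the time constants equals the time derivative of the pulse after it has passed through both filters in series.

```python
import numpy as np

def low_pass(u, tau, dt):
    """Unit-gain first-order low-pass filter, tau * dx/dt = u - x (forward Euler)."""
    x = np.zeros_like(u)
    for k in range(1, len(u)):
        x[k] = x[k - 1] + dt * (u[k - 1] - x[k - 1]) / tau
    return x

dt = 1e-3                              # integration step (s)
t = np.arange(0.0, 6.0, dt)
cs_time, ucs_time = 1.0, 3.0           # hypothetical CS and UCS times (s)
v = ((t >= cs_time) & (t < ucs_time)).astype(float)  # rectangular V(t)

tau_fast, tau_slow = 0.1, 0.5          # hypothetical time constants (s)
x_fast = low_pass(v, tau_fast, dt)     # input convolved with the fast kernel
x_slow = low_pass(v, tau_slow, dt)     # input convolved with the slow kernel

# Weighted sum of the two signals (weights of opposite sign).
weighted_sum = (x_fast - x_slow) / (tau_slow - tau_fast)

# Reference: derivative of the pulse low-pass filtered by both kernels in series.
cascade = low_pass(x_fast, tau_slow, dt)
derivative = np.diff(cascade) / dt     # forward difference

max_err = np.max(np.abs(weighted_sum[:-1] - derivative))
print("max deviation from the derivative:", max_err)
```

The weighted sum is a positive transient after pulse onset and a negative transient after pulse offset. Flipping the sign of the weights gives the opposite polarity, so the sign of the net signal depends on which cell type carries the faster kernel.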
The proposed circuitry of the Type-2 and Type-3 cells
a. Phase portrait of the response of the GABA neurons to a rectangular V(t) input
b. Simulated responses of the cells in a continuous model and the net inhibitory input to DA neurons (top)
Reward prediction with the proposed model in a biophysically realistic model of VTA DA neurons
For the model to work, the purely inhibitory RP signal needs to be compatible with the intrinsic dynamics of DA cells and to generate both the increased activity of the conditioned response (CR) and the suppression of the UCR.