Temporal difference learning in VTA

The temporal difference learning (TDL) model makes two distinctive predictions: the existence of a continuous state value signal, V(t), that encodes the sum of predicted future rewards, and circuitry that computes the momentary reward prediction (RP) through temporal differentiation of V(t). Reward prediction errors (RPEs) are in turn computed as the difference between the momentary reward and the RP.
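The relation between V(t), RP and the RPE can be illustrated with a minimal numerical sketch. It assumes an undiscounted, discretized form in which RP(t) = -dV/dt, so that RPE = r(t) - RP(t) = r(t) + dV/dt. This sign convention, the unit-height rectangular V(t), and all names and numbers (rpe, i_cs, i_ucs, dt) are illustrative assumptions, not values taken from the studies discussed.

```python
import numpy as np

def rpe(V, r, dt):
    """RPE as momentary reward minus RP, with RP assumed to be -dV/dt."""
    dVdt = np.diff(V, prepend=V[0]) / dt   # backward difference of V(t)
    rp = -dVdt                             # assumed sign convention for RP
    return r - rp

dt = 0.01                  # time step in seconds (illustrative)
n = 500                    # number of time bins
i_cs, i_ucs = 100, 300     # hypothetical CS and UCS bins

# State value: one unit of expected reward between CS and UCS
V = np.zeros(n)
V[i_cs:i_ucs] = 1.0

# Momentary reward: one unit of reward delivered in a single bin at the UCS
r = np.zeros(n)
r[i_ucs] = 1.0 / dt

delta_predicted = rpe(V, r, dt)               # reward predicted and delivered
delta_omitted = rpe(V, np.zeros(n), dt)       # reward predicted but omitted
delta_unpredicted = rpe(np.zeros(n), r, dt)   # reward delivered without prediction

# RPE integrated over the event bin, in units of reward
for name, d in [("predicted", delta_predicted),
                ("omitted", delta_omitted),
                ("unpredicted", delta_unpredicted)]:
    print(name, "CS:", d[i_cs] * dt, "UCS:", d[i_ucs] * dt)
```

Under these assumptions, a fully predicted reward produces a positive RPE at the CS and none at the UCS, an omitted reward produces a negative RPE at the UCS, and an unpredicted reward produces a positive RPE at the UCS only.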

We showed that the results of two recent influential studies, which described the activity of two types of VTA GABAergic neurons and their effects on dopaminergic RPEs, provide evidence for both predictions of the TDL theory.

Type-2 and Type-3 GABAergic VTA neurons

Type-2 and Type-3 neurons exhibit sustained elevated and sustained reduced activity, respectively, between the stimulus (CS) and the reward (UCS) in a delay conditioning setting. These responses are driven by a sustained input extending from the CS to the UCS. The responses of the cells can be thought of as convolutions of this input with exponential kernels of different time constants. The appropriately weighted sum of the two signals is the time derivative of the low-pass filtered rectangular pulse input.
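The last statement can be checked numerically. The sketch below assumes first-order low-pass filters (exponential kernels) with hypothetical time constants tau_fast and tau_slow. It verifies that the difference of the two filtered signals, divided by (tau_slow - tau_fast), equals the time derivative of the input passed through both filters in series. Which filter corresponds to which cell type, and the actual time constants and weights, are not specified here; all names and values are illustrative.

```python
import numpy as np

def lowpass(x, tau, dt):
    """First-order low-pass filter, tau * dy/dt = x - y, integrated by forward Euler."""
    y = np.zeros_like(x)
    for i in range(1, len(x)):
        y[i] = y[i - 1] + dt * (x[i - 1] - y[i - 1]) / tau
    return y

dt = 0.001                     # time step in seconds (illustrative)
n = 6000                       # number of time bins
i_on, i_off = 1000, 3000       # hypothetical CS and UCS bins

# Rectangular pulse input extending from CS to UCS
u = np.zeros(n)
u[i_on:i_off] = 1.0

tau_fast, tau_slow = 0.1, 0.4  # hypothetical time constants (s)
y_fast = lowpass(u, tau_fast, dt)
y_slow = lowpass(u, tau_slow, dt)

# Weighted combination of the two filtered signals (weights +w and -w)
w = 1.0 / (tau_slow - tau_fast)
diff = w * (y_fast - y_slow)

# Time derivative (forward difference) of the doubly low-pass filtered input
smoothed = lowpass(lowpass(u, tau_fast, dt), tau_slow, dt)
deriv = np.diff(smoothed) / dt

max_err = np.max(np.abs(diff[:-1] - deriv))
print("largest deviation between the two signals:", max_err)
```

With this discretization the two signals agree up to floating-point rounding. The combined signal is positive after pulse onset and negative after pulse offset; an inhibitory RP signal with the sign convention RP = -dV/dt would correspond to the same combination with the weights reversed.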

Because the sum of the two cells' activities estimates the time derivative of their input, and because the effect of Type-2 cells on dopaminergic RPEs has been shown directly, we conclude that the responses of these cells are the hallmarks of the temporal differentiation predicted by the TDL model.

Moreover, the rectangular pulse input corresponds to the predicted V(t) signal. Together, these observations provide direct evidence for an implementation of the RP computation predicted by the TDL model.

The proposed circuitry of the Type-2 and Type-3 cells

a. Phase portrait of the response of the GABA neurons to a rectangular V(t) input

b. Simulated responses of the cells in a continuous model and the net inhibitory input to DA neurons (top)

Reward prediction with the proposed model in a biophysically realistic model of VTA DA neurons

For the model to work, the purely inhibitory RP signal needs to be compatible with the intrinsic dynamics of DA cells and to generate both the increased activity of the conditioned response (CR) and the suppression of the unconditioned response (UCR).
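This requirement can be illustrated with a deliberately simplified rate model, which stands in for the biophysical model and is much cruder than it. The sketch assumes that the DA firing rate is a rectified sum of a pacemaker drive, reward excitation, and GABAergic inhibition, and that the inhibition is a tonic level modulated by RP while never changing sign. All names and numbers (da_rate, pacemaker, tonic_inhibition, the RP amplitudes) are hypothetical.

```python
import numpy as np

def da_rate(reward_drive, inhibition, pacemaker=5.0):
    """Toy DA rate: pacemaker drive plus excitation minus inhibition, floored at zero."""
    return np.maximum(0.0, pacemaker + reward_drive - inhibition)

n = 500                      # number of time bins
i_cs, i_ucs = 100, 300       # hypothetical CS and UCS bins
tonic_inhibition = 2.0       # assumed tonic GABAergic input (arbitrary units)

# Signed RP modulation: a dip at the CS, an increase at the UCS
rp = np.zeros(n)
rp[i_cs] = -1.5
rp[i_ucs] = 3.0

# Net GABAergic input stays non-negative, i.e. purely inhibitory
inhibition = tonic_inhibition + rp

# Excitatory reward input at the UCS
reward = np.zeros(n)
reward[i_ucs] = 3.0

baseline = da_rate(0.0, tonic_inhibition)               # rate with tonic inhibition only
predicted = da_rate(reward, inhibition)                  # reward predicted and delivered
omitted = da_rate(np.zeros(n), inhibition)               # reward predicted but omitted
unpredicted = da_rate(reward, np.full(n, tonic_inhibition))  # reward without prediction

print("baseline:", baseline)
print("CS, predicted:", predicted[i_cs])
print("UCS, predicted / omitted / unpredicted:",
      predicted[i_ucs], omitted[i_ucs], unpredicted[i_ucs])
```

In this toy model the CR arises from disinhibition at the CS, the predicted reward response is cancelled by increased inhibition at the UCS, and reward omission leaves only the inhibition, producing a pause. Because the CR is produced by removing inhibition, its size is bounded by the tonic inhibition available to remove.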

a. Model neuron properties. Top panel: pacemaker firing and voltage responses to injected current pulses of the in vitro model of VTA DA neurons. Pulses: 100 pA, -30 pA, -60 pA. Note the pronounced slow hyperpolarizing sag (arrow), the slowly ramping repolarization (bottom left, arrow, -200 pA), and the increased pacemaking rate after block of the low voltage calcium conductance (bottom right).

b. Stochastic GABAergic synaptic inputs to DA neurons from T2 (top traces) and T3 (bottom traces) neurons. Mean firing rates of the T2 and T3 neurons with 100% prediction of high reward, calculated in the mean-field model (orange traces, ‘rate’), and the corresponding stochastic spike trains (raster plots, orange). IPSC trains from single neurons (teal) and the sum from all inputs (purple).

c, d. Membrane potential (c) and PSTHs of spiking (d) of the DA neuron under different predictive conditions. Reward input (purple traces, ‘R(t)’, c). Continuous (green) and synaptic RP input (orange, bottom, c). Note the UCR (red arrows, top in c; top in d). Full prediction of the reward (middle black in c; middle in d). Reward omission results in a firing pause at the UCS (bottom black in c; bottom in d). Note the CR elicited by disinhibition (orange asterisks, c) and the absence of a baseline shift (d, middle, arrow). Note that the GABA input is the same in the simulated prediction and omission conditions. Horizontal arrows in c indicate -60 mV.

e. Simulation of DA neuron responses to an exponentially increasing V(t) signal (37) (bottom traces). Teleportation (‘TP’) creates step increases in reward expectation in V(t) (orange asterisks; see text). Top: PSTHs of DA neuron responses. The top trace is the membrane potential of the model neuron in one trial.

f. Inhibition by the GABA neurons approximates a purely subtractive effect. Mean firing rate of the DA cell during a 0.5 s period after reward onset, plotted against the conductance of the reward EPSP (range: 0-190), in unpredicted (PSTH, teal) and approximately 10% predicted (PSTH, purple) conditions. Markers are data points; lines are linear regressions (p<0.05). Insets show the predicted and unpredicted responses to the highest reward.