### Participants

The participants were recruited by the Center for Neuroeconomics at the University of Zurich, Switzerland. The participants were instructed about all aspects of the experiment and gave written informed consent. None of the participants suffered from any neurological or psychological disorder or took medication that interfered with participation in our study. The participants received fixed monetary compensation for their participation in the experiment, in addition to a variable monetary payoff that depended on task performance (see below). The experiments conformed to the Declaration of Helsinki, and the experimental protocol was approved by the Ethics Committee of the Canton of Zurich.

Participants who failed to follow the eye fixation instructions on more than 25% of trials were excluded from the data analysis (*n* = 12). We measured the performance of the participants in the training tasks and excluded participants who were unable to perform the task at the easiest difficulty level (*n* = 11). Additionally, we had to exclude three participants due to technical problems with the data collection. The final sample thus comprised *n* = 86 participants (*n* = 25 in experiment 1 and *n* = 61 in experiment 2, of whom 30 were in the *K*_{rew} context).

### Experimental design and stimuli

The stimuli were generated with MATLAB (version 9.7)^{62} using the Psychtoolbox and displayed on a screen one metre away from the participants. The angle of the head was kept stable with a chin rest, whose height was adjusted to position the centre of the screen at eye level. As stimuli, we used oriented Gabor patches presented on a grey background. Each patch was composed of a high-contrast three-cycles-per-degree sinusoidal grating convolved with a circular Gaussian of width 0.41° and subtended 2.98° both vertically and horizontally. In experiment 1, all Gabor patches were presented so that their centres fell 5.7° to the left or right of the centre of the monitor, on the horizontal midline. In experiment 2, the Gabor centres fell 4.7° to the left and right of the vertical midline and 4.7° above or below the horizontal midline.
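For illustration, a Gabor patch with the properties described above can be reconstructed numerically. The following is a minimal Python/NumPy sketch (the study itself used MATLAB and Psychtoolbox); the function name, pixel resolution and contrast handling are our own assumptions:

```python
import numpy as np

def gabor_patch(size_px, px_per_deg, cycles_per_deg=3.0, sigma_deg=0.41,
                orientation_deg=45.0, contrast=1.0):
    """Sinusoidal grating windowed by a circular Gaussian (values in [-1, 1])."""
    half = size_px // 2
    coords = (np.arange(size_px) - half) / px_per_deg  # degrees of visual angle
    x, y = np.meshgrid(coords, coords)
    theta = np.deg2rad(orientation_deg)
    # Coordinate along the grating's modulation axis
    u = x * np.cos(theta) + y * np.sin(theta)
    grating = np.sin(2 * np.pi * cycles_per_deg * u)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma_deg**2))
    return contrast * grating * envelope
```

For a 2.98° stimulus rendered at, say, 128 pixels, `px_per_deg` would be roughly 128/2.98 ≈ 43.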

#### Eye tracking

Eye-tracking data were acquired using an SR Research EyeLink 1000 eye-tracking system. Gaze position was sampled at 500 Hz. Eye movements away from fixation were computed for the window corresponding to the stimulus presentation. For every saved position, the absolute distance to the fixation cross was computed; if this distance exceeded 4° of visual angle, the trial was marked as containing an eye movement. For most participants, fewer than 5% of trials contained eye movements. Participants (*n* = 12) who made eye movements exceeding 4° of visual angle on more than 25% of trials were excluded from all analyses.

#### Experiment 1

The participants performed the experiment in multiple sessions to allow for training within the two contexts on different days. The order of the accuracy (*K*_{acc}) and reward (*K*_{rew}) context training was counterbalanced across participants. In total, every participant completed 240 trials in the estimation task and 400 trials in the decision task.

#### Experiment 2

In experiment 2, each participant trained in only one stimulus–reward association context (either *K*_{acc} or *K*_{rew}). Training in the binary judgement decision task was performed either in the two upper locations or in the two lower locations. The participants were randomly allocated to one of the two training locations. In the estimation tasks before and after the training task, the trial locations were evenly distributed between all four possible locations. In total, every participant completed 400 trials in the estimation task and 360 trials in the decision task.

#### Orientation estimation task

Before the start of every trial, the participants had to fixate on a cross in the middle of the screen. At the beginning of the trial, an arrow appeared for 0.5 seconds to indicate on which side the stimulus would be shown. Afterwards, the stimulus appeared on the indicated side for 0.6 seconds. The orientation of the stimulus was drawn at random from the range 0–179°. During stimulus presentation, the participant had to continue fixating on the cross. After the stimulus disappeared, a Gabor patch appeared in the middle of the screen. By pressing and holding the left mouse button, the participant then rotated the new Gabor patch until its perceived orientation matched the orientation of the previously observed target stimulus. The participant could end the trial by pressing the space bar. After five seconds, the trial ended automatically. The trials were separated by a random intertrial interval of 1.5–2 seconds. The estimation task took place before and after the decision task (see below and Fig. 2). To prevent the participants from developing contextual strategies, they were not informed in advance that a second estimation task would take place after the decision task.

#### Decision task

The fixation cross turned black to indicate the start of a trial. After 0.5 seconds, two Gabor patch stimuli appeared. The orientation of one of the stimuli was drawn from the approximate distribution of edges in the real world^{42}. The orientation of the second stimulus was adjusted by a participant-specific difficulty score to keep performance at approximately 75% accuracy for all participants. The median accuracy across participants in *K*_{rew} was 77 ± 2.9% and in *K*_{acc} was 77 ± 2.8%. Additionally, on the basis of (1) calibration to 75% accuracy, (2) the linear mapping between the degree of diagonality and reward (that is, from 1 Swiss franc (CHF) for 0° to CHF 46 for 45° in the diagonality space), and (3) pilot data, we adjusted the payoff of correct trials in *K*_{acc} to match the expected payoff in *K*_{rew}. We calculated that setting the payoff for each correct response in *K*_{acc} to 15 CHF would fulfil these conditions. Our experimental data were in line with these calculations: the median payoff in *K*_{acc} was 15.00 ± 0 CHF, and in *K*_{rew} it was 14.70 ± 0.62 CHF.

On average, the stimulus orientation followed a prior distribution *f*(*s*) described by equation (7) and shown in Fig. 3a:

$$f(s)=\frac{1}{1.85-\cos (4s)}.$$

(7)
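Orientations following this prior can be generated by numerically inverting its CDF. A minimal Python sketch, assuming the constant *a* = 1.85 from equation (7) and a fine grid over the orientation space (the grid size and function name are our own choices):

```python
import numpy as np

def sample_orientations(n, a=1.85, rng=None):
    """Draw orientations (degrees in [0, 180)) from f(s) proportional to
    1/(a - cos 4s) by numerically inverting the CDF on a fine grid."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.linspace(0.0, 180.0, 100_000)            # orientation grid (degrees)
    density = 1.0 / (a - np.cos(np.deg2rad(4 * s)))  # unnormalized prior
    cdf = np.cumsum(density)
    cdf /= cdf[-1]                                   # normalized CDF
    return np.interp(rng.random(n), cdf, s)          # inverse-CDF sampling
```

With *a* = 1.85 the density peaks at the cardinal orientations (0°, 90°) and is lowest at the obliques (45°, 135°), as in the natural statistics it approximates.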

The stimuli were displayed for 0.6 seconds. During stimulus presentation, the participants had to fixate on the cross in the middle of the screen. When the stimuli disappeared, the participants had 2.5 seconds to decide which stimulus was more oblique. Regardless of response time, the full 2.5-second window always elapsed before the trial continued. Afterwards, the two stimuli were shown again in their positions, and the result of the choice and the orientations of the stimuli were displayed for 3 seconds until the trial ended. The trials were separated by a 1.5-to-2-second intertrial interval.

### Blowfly retinal LMC experiment

Here we provide a brief description of the data collected in Laughlin’s seminal work^{36}, which we re-analyse in this work. To derive the prior for the sensory stimulus of interest *f*(*s*), the researcher measured the distribution of contrasts that occur in woodland settings of the blowfly environment. In brief, photographs were taken in the natural habitat of the blowfly such as sclerophyll woodland and lakeside lands. Relative intensities were measured across these scenes using a detector that scanned horizontally, like the ommatidium of a turning fly. The scans were digitized at intervals of 0.07° and convolved with a Gaussian point spread function of half-width 1.4°, corresponding to the angular sensitivity of a fly photoreceptor. Contrast values were obtained by dividing each scan into intervals of 10, 25 or 50°. Within each interval, the mean intensity (\(\bar{I}\)) was found and subtracted from every data point to give the fluctuation about the mean (Δ*I*). This difference value was divided by the mean to give the contrast (\(\Delta I/\bar{I}\)).

These data were used to construct a histogram, which was later transformed to a CDF (Fig. 1a and Supplementary Fig. 1). Here we used this CDF to reconstruct the probability density function *f*(*s*) (Supplementary Fig. 1). Once the prior distribution was obtained, the fly was placed in front of a screen with a light-emitting diode (LED). At the beginning of each trial, the LED luminance was set to the screen luminance and then changed to a new luminance drawn from the prior distribution *f*(*s*) for 100 ms. The stimulus *s* was defined as the proportional change of the difference between the background and LED luminances. We emphasize that the CDF of the contrast statistic comes directly from the contrast measurement methodology described in the preceding paragraph and reported by Laughlin. We thus did not perform the original calculations of the prior *f*(*s*) ourselves, nor was that prior influenced by the fitness-maximizing sensory-coding theory.

### Fitness-maximizing neural codes

In this section, we provide a detailed description of the connection between the *L*_{p} reconstruction error, the efficient code that maximizes reward expectation and the power-law efficient codes briefly described in the main text.

Suppose that the stimulus distribution is given by *s* ~ *f*(*s*). The function that transforms the input *s* to neural responses *r* is given by *r* = *h*(*s*). While the mapping *h*(*s*) is deterministic, here we assume that errors in the neural response *r* follow a distribution *P*[*r*|*h*(*s*)]. We apply a general approach that considers optimality criteria accounting for how well stimulus *s* can be reconstructed (\(\hat{s}\)) from the neural representations *r*. Wang and colleagues introduced a general formulation of the efficient coding problem in terms of minimizing the error in such reconstructions \(\hat{s}(r)\) according to the *L*_{p} norm as a function of the norm parameter *p* (ref. ^{63}). In brief, the reconstruction is assumed to be based on the maximum likelihood estimate of the decoder in the low-noise regime, where *P*[*r*|*h*(*s*)] is assumed to be Gaussian distributed.

The goal is to find the optimal mapping function *h**(*s*) to achieve a minimal *L*_{p} reconstruction error for any given prior stimulus distribution *f*(*s*). More formally, the problem is defined as: find *h**(*s*) such that

$$\min {\left\langle {\left\vert \hat{s}(r)-s\right\vert }^{p}\right\rangle }_{s,r}\quad \,{{\mbox{s.t.}}}\,\,0\le h(s)\le 1,$$

(8)

where, without loss of generality, we assume that the operation range of the neuron is bounded between 0 and 1. It is possible to show that the optimal mapping *h**(*s*) is given by equation (9)^{63}:

$${h}^{* }(s)=\frac{\int\nolimits_{-\infty }^{s}f{\left(\tilde{s}\right)}^{1/(1+p)}{\mathrm{d}}\tilde{s}}{\int\nolimits_{-\infty }^{\infty }f{\left(\tilde{s}\right)}^{1/(1+p)}{\mathrm{d}}\tilde{s}}.$$

(9)
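Equation (9) can be evaluated numerically for any prior on a grid. A small Python sketch (the Gaussian example prior and all names are our own illustrative choices, not from the study):

```python
import numpy as np

def optimal_mapping(prior_pdf, s_grid, p):
    """h*(s) on a grid: normalized cumulative integral of f(s)^(1/(1+p))
    (equation (9))."""
    gamma = 1.0 / (1.0 + p)
    w = prior_pdf(s_grid) ** gamma
    h = np.cumsum(w)
    return h / h[-1]

# Example: standard normal prior
s = np.linspace(-5.0, 5.0, 10_001)
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
h_infomax = optimal_mapping(f, s, p=1e-9)   # p -> 0 recovers the prior CDF
h_reward = optimal_mapping(f, s, p=0.5)     # gamma = 2/3
```

As expected from the text, the *p* → 0 limit reproduces the CDF of the prior, while *p* = 0.5 yields the escort distribution with *γ* = 2/3.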

If we define

$$\gamma \equiv 1/(1+p),$$

(10)

we observe that the normalized power function of the stimulus distribution *f* in equation (9) is the escort distribution with parameter *γ* (ref. ^{64}). Note that under this framework, infomax coding is given by the norm parameter *p* → 0, and therefore *γ* = 1, thus leading to the result that *h*(*s*) is the CDF of the prior distribution.

#### Efficient *L*_{p} error-minimizing codes and behavioural goals

Economics has a long tradition of studying the following problem: for a given distribution *f*(*s*) in the environment, what is the optimal shape of the internal representation *h*(*s*) (known in economics as the utility function) if that function can take only a large but limited set of *n* discrete subjective values (that is, the internal readings, *r*) coding for any given stimulus *s* (refs. ^{3,39})? The utility function is thus restricted to a set of step functions with *n* jumps, each corresponding to a utility increment of size 1/*n*. In this case, discrimination errors originate from the fact that the organism cannot distinguish two alternatives located on the same step of the utility function. Under this formulation, the following variant of the problem was studied: find the optimal utility function (*h**) under two evolutionary optimization criteria, (1) minimization of the probability of mistakes and (2) minimization of the expected reward loss.

To solve this problem, we assume that the organism repeatedly makes choices between two alternatives drawn from the stimulus distribution *f*(*s*), where we may suppose that stimuli are linearly mapped to a reward value. The organism is endowed with a utility function that assigns a level of reward to each possible stimulus *s* from *f*(*s*). The alternative that promises more utility to the organism is chosen^{39}.

If the goal of the organism is to minimize the number of erroneous responses (that is, maximize discrimination accuracy), the optimal utility function \({h}_{{{{\rm{accuracy}}}}}^{* }\) is given by

$${h}_{{{{\rm{accuracy}}}}}^{* }(s)=\int\nolimits_{-\infty }^{s}f(\tilde{s}){\mathrm{d}}\tilde{s}.$$

(11)

According to this solution, the power parameter of the escort distribution in equation (9) is given by *γ* = 1, which corresponds to the infomax strategy.

However, if the goal of the organism is to minimize the expected reward loss (that is, maximize the amount of reward received after many decisions) and stimuli are linearly mapped to reward value, the optimal utility function \({h}_{{{{\rm{reward}}}}}^{* }\) is given by

$${h}_{{{{\rm{reward}}}}}^{* }(s)=\frac{\int\nolimits_{-\infty }^{s}f{\left(\tilde{s}\right)}^{2/3}{\mathrm{d}}\tilde{s}}{\int\nolimits_{-\infty }^{\infty }f{\left(\tilde{s}\right)}^{2/3}{\mathrm{d}}\tilde{s}}.$$

(12)

According to this solution, the power parameter of the escort distribution in equation (9) is given by *γ* = 2/3, which corresponds to optimizing the *L*_{p} minimization problem with parameter *p* given by

$$\gamma =2/3=\frac{1}{1+p}\quad \Rightarrow \quad p=0.5.$$

(13)

We found that this normative fitness-maximizing solution is the error penalty that best describes the LMC data^{40} (these results are reported in the main text and Fig. 1). Additionally, please note that the solutions provided in equations (11) and (12) are derived on the basis of maximizing the accurate choices and reward expectation, respectively, without any assumptions about maximizing information efficiency as a goal in itself.

#### Connection to power-law efficient codes

We employed a general method for defining efficient codes by investigating optimal allocation of Fisher information *J* given (1) a bound of the organism’s capacity *c* to process information, (2) the frequency of occurrence *f*(*s*) and (3) the organism’s goal (for example, maximize perceptual accuracy or expected reward) according to

$$\mathop{{\mathrm{arg}}\,{\mathrm{max}}}\limits_{J(s)}-\int\,{\mathrm{d}}s\,f(s)J{\left(s\right)}^{-\alpha }$$

(14)

subject to a capacity bound

$$C(s)=\int\,{\mathrm{d}}s\,J{\left(s\right)}^{\beta }\le c,$$

(15)

with parameters *α* defining the coding objective and *β* > 0 specifying the capacity constraint^{43}. The solution of this optimization problem reveals that Fisher information should be proportional to the prior distribution *f*(*s*) raised to a power *q*, which is therefore referred to as the power-law efficient code

$${J}_{{{{\rm{opt}}}}}(s)={c}^{1/\beta }{\left(\frac{f{\left(s\right)}^{\gamma }}{\int\,{\mathrm{d}}sf{\left(s\right)}^{\gamma }}\right)}^{1/\beta }\triangleq kf{\left(s\right)}^{q},$$

(16)

where *q* = 1/(*β* + *α*) and *γ* = *β*/(*β* + *α*). Note that the power-law parameter *q* is not uniquely determined, since different (*α*, *β*) pairs yield the same *q*; to identify it, we need to make further assumptions. Here we opted for setting *β* = 0.5, as previously proposed in the standard infomax framework^{41}; however, our conclusions are not affected by the specific value of *β*. This means that *α* determines how Fisher information is allocated relative to the prior, influencing the values of both *q* and *γ*. It can be shown that the infomax coding rule implies *γ* = 1 and therefore an efficient power-law code with *q* = 2, and the reward expectation rule implies *γ* = 2/3 and therefore an efficient power-law code with *q* = 4/3 (Supplementary Note 1). The power-law efficient codes thus allow us to establish a connection between the behavioural goals in the contexts studied in this work (*K*_{acc} and *K*_{rew}) and the parameter *γ*, which incorporates the goals of the organism under the resource-constrained framework that we study here.
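The algebra connecting *α*, *β*, *γ* and *q* can be summarized in a few lines of Python (an illustrative sketch; the function name is ours):

```python
def power_law_exponents(alpha, beta=0.5):
    """Return (gamma, q) for the power-law efficient code:
    gamma = beta / (beta + alpha), q = 1 / (beta + alpha)."""
    gamma = beta / (beta + alpha)
    q = 1.0 / (beta + alpha)
    return gamma, q

# With beta = 0.5:
# infomax:            gamma = 1   requires alpha = 0,    giving q = 2
# reward expectation: gamma = 2/3 requires alpha = 0.25, giving q = 4/3
```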

#### Optimal inference

When specifying an inference problem using such an encoding–decoding framework, a key aspect for generating predictions of decision behaviour is to obtain expressions of the expected value and variance of the noisy estimations \(\hat{s}\) for a given value input *s*_{0}. However, we first need to specify the encoding and decoding rules. We adopted an encoding function *P*(*r*|*s*) associated with the power-law efficient code that is parameterized as Gaussian^{43}

$$\begin{array}{rcl}P(r| s)&=&{{{\mathcal{N}}}}\left(s,\frac{1}{kf{\left(s\right)}^{q}}\right)\\ &=&\sqrt{\frac{kf{\left(s\right)}^{q}}{2\uppi }}\exp \left(-\frac{kf{\left(s\right)}^{q}}{2}{\left(r-s\right)}^{2}\right),\end{array}$$

(17)

and therefore Fisher information is allocated using an *s*-dependent variance \({\sigma }^{2}=1/kf{\left(s\right)}^{q}\). While we are aware that in our study the stimulus space is circular, given that discriminability thresholds are relatively low for orientation discrimination tasks in humans, it is safe to assume that the likelihood function can be locally approximated as a Gaussian distribution.

At the decoding stage, the observer computes the posterior using Bayes’s rule:

$$P(s| r)=\frac{P(r| s)f(s)}{P(r)}.$$

(18)
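A grid-based Python sketch of this encoding–decoding step, combining the Gaussian likelihood of equation (17) with Bayes's rule and a posterior-mean readout (the grid and the values of *k* and *q* are illustrative assumptions):

```python
import numpy as np

def posterior_mean(r, s_grid, prior, k=50.0, q=2.0):
    """Posterior-mean estimate of s given a noisy response r, using the
    prior-dependent encoding variance 1/(k f(s)^q) of equation (17)."""
    var = 1.0 / (k * prior ** q)          # stimulus-dependent encoding variance
    like = np.exp(-(r - s_grid) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    post = like * prior                   # Bayes's rule (equation (18))
    post = post / post.sum()              # normalize on the grid
    return float((s_grid * post).sum())   # expected value of the posterior
```

With a flat prior the posterior mean simply tracks the response *r*; a non-uniform prior pulls or pushes the estimate according to the bias expressions derived below.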

Theoretical and empirical evidence suggests that for orientation estimation tasks, estimates are typically biased away from the prior. This suggests that humans employ an expected value estimator of the posterior, at least for the infomax case^{41}.

The expected value of the estimator can be defined as the input stimulus *s*_{0} plus some average bias *b*(*s*_{0}). Using analytical approximations under the high-signal-to-noise regime, it is possible to show that the bias for the posterior expected value estimator can be approximated by^{65}

$$b\left({s}_{0}\right)\approx \left(1-\frac{1}{q}\right)\frac{1}{k}{\left(\frac{1}{f{\left(s\right)}^{q}}\right)}_{{s}_{0}}^{{\prime} }.$$

(19)

In a previous study, using model simulations and exploring parsimonious functional forms, it was shown that the proportionality constant of the bias term can be approximated by^{43}

$$\frac{\log (q)}{k\sqrt{q}}.$$

(20)

The analytical solution and the simulation-based solution of the proportionality constant are approximately equivalent for a range of *q* values relevant to our work (for example, *q* ∈ [0.5, 2]); that is

$$\frac{\log (q)}{k\sqrt{q}}\approxeq \left(1-\frac{1}{q}\right)\frac{1}{k},$$

(21)

thus validating the results derived in the analytical approximations that we used in the current work. In any case, using either function does not affect the qualitative or quantitative results of our study.

Using this result, the expected value of the estimators is given by

$$E[\hat{s}| {s}_{0}]\approx {s}_{0}+\left(1-\frac{1}{q}\right)\frac{1}{k}{\left(\frac{1}{f{\left(s\right)}^{q}}\right)}_{{s}_{0}}^{{\prime} }.$$

(22)

As already defined in the description of the behavioural task, in this study, we used a parametric form of the prior that closely resembles the shape of the natural distribution of orientations in the environment^{42}

$$f(s)=\omega \times \frac{1}{a-\cos (4s)},$$

(23)

with *a* > 1 determining the elevation (steepness) of the prior, and *ω* a normalizing constant. Using this parameterization of the prior, we can obtain an explicit analytical approximation of the bias:

$$\begin{array}{rcl}b({s}_{0})&\approx &\left(1-\frac{1}{q}\right)\frac{1}{k}\frac{\partial }{\partial s}{\left({\left(\frac{\omega }{a-\cos (4s)}\right)}^{-q}\right)}_{{s}_{0}}\\ &\approx &\left(1-\frac{1}{q}\right)\frac{1}{k}{\left(\frac{4q\sin (4s){\left(\frac{\omega }{a-\cos (4s)}\right)}^{1-q}}{\omega }\right)}_{{s}_{0}}.\end{array}$$

(24)

We can also obtain an analytical approximation of the variance under the high-signal-to-noise regime using the Cramér–Rao bound formulation:

$$\begin{array}{rcl}\,{{\mbox{Var}}}\,[\hat{s}| {s}_{0}]&\propto &{\left(\frac{1}{J(s)}\right)}_{{s}_{0}}\\ &\approx &\frac{1}{k}{\left(\frac{1}{f{\left(s\right)}^{q}}\right)}_{{s}_{0}}\\ &\approx &\frac{1}{k}{\left(\frac{(a-\cos (4s))}{\omega }\right)}_{{s}_{0}}^{q}.\end{array}$$

(25)

We can thus use equations (24) and (25) to derive the predictions presented in Fig. 3.
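These two closed-form expressions can be evaluated directly for the cosine prior. A Python sketch (the parameter values *a*, *ω*, *k* and *q* are illustrative; the variance is computed only up to the proportionality in equation (25)):

```python
import numpy as np

def bias_variance(s0_deg, a=1.85, omega=1.0, k=50.0, q=4/3):
    """Closed-form bias (equation (24)) and variance (equation (25), up to a
    proportionality constant) at orientation s0, given in degrees."""
    s = np.deg2rad(s0_deg)
    dens = omega / (a - np.cos(4 * s))        # prior density f(s0)
    bias = (1 - 1/q) * (1/k) * (4 * q * np.sin(4 * s) * dens ** (1 - q) / omega)
    var = (1/k) * ((a - np.cos(4 * s)) / omega) ** q
    return bias, var
```

Consistent with the model, the bias vanishes at the cardinal and oblique orientations (where sin 4*s* = 0), and the variance is larger at the obliques, where the prior density is lowest.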

Finally, assuming that the estimators are normally distributed with the expected value and variance derived above, the probability that an agent chooses an alternative with orientation value *s*_{1} over a second alternative with orientation value *s*_{2} (recall that in our experiment the participants' objective is to choose the orientation perceived as closer to the diagonal) is given by

$$P\left({\hat{s}}_{1} > {\hat{s}}_{2}| {s}_{1},{s}_{2}\right)=\varPhi \left(\frac{{{{\rm{E}}}}\left[{\hat{s}}_{1}| {s}_{1}\right]-{{{\rm{E}}}}\left[{\hat{s}}_{2}| {s}_{2}\right]}{\sqrt{{{{\rm{Var}}}}\left[{\hat{s}}_{1}| {s}_{1}\right]+{{{\rm{Var}}}}\left[{\hat{s}}_{2}| {s}_{2}\right]}}\right),$$

(26)

where *Φ*() is the CDF of the standard normal distribution. When fitting the choice data to the model, we accounted for potential side (left/right) biases *β*_{0} and lapse rates *λ* in the decision task using

$$P\left({\hat{s}}_{1} > {\hat{s}}_{2}| {s}_{1},{s}_{2}\right)=\frac{\lambda }{2}+\varPhi \left(\frac{{{{\rm{E}}}}\left[{\hat{s}}_{1}| {s}_{1}\right]-{{{\rm{E}}}}\left[{\hat{s}}_{2}| {s}_{2}\right]}{\sqrt{{{{\rm{Var}}}}\left[{\hat{s}}_{1}| {s}_{1}\right]+{{{\rm{Var}}}}\left[{\hat{s}}_{2}| {s}_{2}\right]}}+{\beta }_{0}\right)(1-\lambda ).$$

(27)
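Equation (27) can be implemented directly. A minimal Python sketch (the function names are ours):

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def p_choose_first(e1, e2, v1, v2, beta0=0.0, lam=0.0):
    """Probability of choosing stimulus 1 (equation (27)), given the expected
    values e1, e2 and variances v1, v2 of the two estimators, a side bias
    beta0 and a lapse rate lam."""
    d = (e1 - e2) / np.sqrt(v1 + v2)
    return lam / 2 + phi(d + beta0) * (1 - lam)
```

With *λ* = 0 and *β*_{0} = 0 this reduces to equation (26).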

#### Fitting the power-law efficient model to human data

To fit the power-law efficient coding model to the choice data from the decision task, we used a hierarchical Bayesian model. We fit the early (1–200) and late (>200) training trials in each reward context separately. Posterior inference of the parameters in the hierarchical models was performed via the Gibbs sampler using the Markov chain Monte Carlo technique implemented in JAGS^{66}, assuming flat priors for both the mean and the noise of the estimates. For each model, we drew a total of 20,000 burn-in samples and subsequently took 5,000 new samples from three independent chains. We applied a thinning of 5 to this final sample, thus resulting in a final set of 3,000 samples for each parameter. We conducted Gelman–Rubin tests for each parameter to assess convergence of the chains. All latent variables in our Bayesian models had \(\hat{R} < 1.05\), which suggests that all three chains converged to a target posterior distribution. We checked convergence of the group-level parameter estimates via visual inspection.

### Behavioural and statistical analyses

In the estimation task, the observers’ behavioural error on a given trial was computed as the difference between the reported orientation and the presented orientation. The direction of the error was defined as positive if the reported orientation was more oblique than the presented orientation, and negative otherwise. Trials on which the error exceeded 25% of the maximum possible error (90°) were discarded. To make full use of the data, we pooled all participants from both experiments for the analysis of the impact of the reward training context. Comparisons between trained and untrained locations used only the data from the location-specific training in experiment 2.

We computed the average bias and variance in five bins of 9° before and after the training phases. Next, we computed the average change in the variance in each bin for each participant. We used the changes in variance within the most cardinal and most oblique bins to test for the predicted interactions between diagonality and training type (*K*_{acc} or *K*_{rew}) or location (trained or untrained) using Bayesian hierarchical linear regressions implemented with the brms package (version 2.13.5)^{67} in the statistical computing software R (version 3.6.3)^{68}. For each model, we used four chains with 2,000 samples per chain after burn-in. The *P*_{MCMC} values reported for these regressions represent one minus the probability of the reported effect being greater (less) than zero given the posterior distributions of the fitted model parameters.

We also compared the performance of participants in the binary judgement decision task using Bayesian hierarchical regressions implemented with brms in R. In this task, the participants had to decide which of two stimuli was more diagonal (closer to 45°). We compared the accuracy of these decisions as a function of diagonality, training phase (early or late) and training type (*K*_{acc} or *K*_{rew}). We used four chains with 1,000 samples per chain after burn-in for a total of 4,000 posterior samples for each regression parameter. The *P*_{MCMC} values were computed in the same fashion as described above for the estimation task.

### ANNs

Suppose that we have a dataset of samples **x** drawn from a distribution of images represented by the retina, where each image indicates an angular orientation *s* with an angular prior distribution *p*(*s*). Note that a key feature of our analyses is that knowledge about this angular prior is not explicitly given to the neural network; the prior is embedded in the statistics of image occurrences over space and time. Also note that different images **x**_{s} can map to the same angle *s*_{0} (for example, Gabor patches with identical angle but different angular phases). Each stimulus is encoded by a set of latent codes (or a latent neural distributional code) **z** with a prior distribution *p*(**z**), which yields a posterior distribution *p*(**z**|**x**) after observing image **x**. The neural coding system should thus learn a good representation of the environment (the distribution of physical sensory inputs) that might also need to be optimized for a particular downstream task (for example, maximizing the reward obtained from decision *y*).

More specifically, we propose a VIB-like objective function (equation (4) in the main text). In our ANN, the VIB-like objective trades off (an approximation of) the amount of ‘visual’ information *I* that the encoder can process against the expected reward loss, via the regularization parameter *β*. Higher values of *β* thus increase the pressure on the network to encode only the information about the input image that yields the largest improvement in the downstream objective function. The neural network received two retinal images corresponding to the screen locations where the two Gabor patches were presented in our task. We note that when training the ANN, the parameters of the encoder *ϕ* are shared across the two retinal locations where the stimuli **x**_{1,2} are presented.

The decision rule that the neural network has to learn is to indicate which of the two input stimuli (left or right) is more diagonal, while maximizing the reward received across many trials. As in the human experiments, we trained networks in two contexts: *K*_{acc} and *K*_{rew}. For all VIB-like objectives studied here, we define the regularized ‘information transmission’ *I* as

$$\begin{array}{rcl}I&\equiv &{{\mathbb{E}}}_{{{{\bf{X}}}}}\left[{D}_{{\mathrm{KL}}}\left({p}_{\phi }({{{{\bf{z}}}}}_{1}| {{{{\bf{x}}}}}_{1})\parallel p({{{{\bf{z}}}}}_{1})\right)\right.\\ &&\left.+{D}_{{\mathrm{KL}}}\left({p}_{\phi }({{{{\bf{z}}}}}_{2}| {{{{\bf{x}}}}}_{2})\parallel p({{{{\bf{z}}}}}_{2})\right)\right],\end{array}$$

(28)

where *D*_{KL} is the Kullback–Leibler divergence. In context *K*_{acc}, the reward loss in the VIB-like objective is defined as

$$\begin{array}{rcl}E[\,{{\mbox{reward loss}}}\,]&\equiv &{{\mathbb{E}}}_{{p}_{\phi }({{{\bf{z}}}}| {{{\bf{x}}}})}\left[y({{{{\bf{x}}}}}_{1},{{{{\bf{x}}}}}_{2})(1-{p}_{\theta }(y=1| {{{{\bf{z}}}}}_{1},{{{{\bf{z}}}}}_{2}))\right.\\ &&\left.+(1-y({{{{\bf{x}}}}}_{1},{{{{\bf{x}}}}}_{2})){p}_{\theta }(y=1| {{{{\bf{z}}}}}_{1},{{{{\bf{z}}}}}_{2})\right],\end{array}$$

(29)

with *y* = 1 when the correct response is given by stimulus input **x**_{1}, and *y* = 0 otherwise. *p*(*y* = 1|**z**_{1}, **z**_{2}) is the probability that the network chooses **x**_{1} given the encoding vectors **z**_{1,2}.
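When the encoder posterior and the latent prior are both Gaussian, the KL terms in equation (28) have a closed form. A Python sketch under the additional assumptions of a diagonal posterior covariance and a standard normal latent prior (this specific choice of prior is our illustrative assumption):

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, sigma2):
    """KL( N(mu, diag(sigma2)) || N(0, I) ): one per-stimulus term of the
    regularized information transmission I in equation (28)."""
    mu, sigma2 = np.asarray(mu), np.asarray(sigma2)
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - np.log(sigma2))
```

The divergence is zero only when the posterior equals the prior, so a larger *β* drives the latent codes towards the prior and thereby limits how much stimulus information they carry.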

In context *K*_{rew}, the reward loss in the VIB-like objective is defined as

$$\begin{array}{rcl}E[\,{{\mbox{reward loss}}}\,]&\equiv &{{\mathbb{E}}}_{{p}_{\phi }({{{\bf{z}}}}| {{{\bf{x}}}})}\left[| s({{{{\bf{x}}}}}_{1})-s({{{{\bf{x}}}}}_{2})| \right.\\ &&\times \left\{y({{{{\bf{x}}}}}_{1},{{{{\bf{x}}}}}_{2})(1-{p}_{\theta }(y=1| {{{{\bf{z}}}}}_{1},{{{{\bf{z}}}}}_{2}))\right.\\ &&+\left.\left.(1-y({{{{\bf{x}}}}}_{1},{{{{\bf{x}}}}}_{2})){p}_{\theta }(y=1| {{{{\bf{z}}}}}_{1},{{{{\bf{z}}}}}_{2})\right\}\right],\end{array}$$

(30)

which is identical to the reward loss in the *K*_{acc} VIB-like objective function, except that the probability of an erroneous ANN decision is weighted by the absolute value of the difference in the cardinality values *s*(**x**_{1}) and *s*(**x**_{2}). The ANNs trained with VIB-like objective functions thus penalize reward loss following the *K*_{acc} and *K*_{rew} objectives employed in the analytical solutions (see equations (1) and (2) in the main text).
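The two per-trial reward losses (equations (29) and (30)) can be written compactly. A Python sketch with scalar inputs (the function signature is ours; in the actual ANN these quantities are expectations over the stochastic encoder):

```python
def reward_loss(p1, y, s1=None, s2=None, context="acc"):
    """Per-trial reward loss: equation (29) for K_acc, equation (30) for K_rew.
    p1: network probability of choosing stimulus 1; y = 1 if stimulus 1 is
    the correct choice, y = 0 otherwise."""
    err = y * (1 - p1) + (1 - y) * p1      # probability of an erroneous choice
    if context == "rew":
        err *= abs(s1 - s2)                 # weight by cardinality difference
    return err
```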

All networks tested here used layers that are standard in the machine learning literature. Each retinal input network consisted of a convolutional layer with 4 × 4 kernels and a stride of two. In the results presented in this work, we used four filters, but we found that our results are largely insensitive to the number of filters used. We also investigated a fully connected input layer of different sizes (50–200 neurons), which led to nearly identical results and conclusions. The stochastic encoder has the form

$$p(z| x)={{{\mathcal{N}}}}\left(z| {g}_{e}^{\mu }(x),{g}_{e}^{\varSigma }(x)\right),$$

(31)

where *g*_{e} is a fully connected layer that receives the output of the retinal layer as input and outputs the *K*-dimensional mean vector *μ* of *z* as well as the *K* × *K* covariance matrix *Σ*. In the results presented here, we use *K* = 4, but our results are similar for a range of *K* values from 2 to 16. We used the reparameterization trick to write *p*(*z*|*x*)d*z* = *p*(*ϵ*)d*ϵ*, where *z* = *g*(*x*, *ϵ*) is a deterministic function of *x* and the Gaussian random variable *ϵ*. The noise is thus independent of the parameters of the network, and it is possible to take gradients that optimize the objective function in equation (4). The downstream integration network was a fully connected network that receives as input the values of the noisy encoder *z* for each retinal input. The size of this layer for the results presented here is 20, but the main conclusions of our analyses are insensitive to the size of this layer. Finally, the decision module was a single sigmoidal unit indicating the selection of the left or right stimulus. All hidden units used rectified-linear activations. The networks were trained with Adam optimization with a learning rate of 0.0001.
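The reparameterization step can be sketched in a few lines of NumPy (a stand-in for the deep learning framework actually used; parameterizing the variance by its logarithm is our illustrative choice):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so that gradients can
    flow through mu and log_var while the noise source stays fixed."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * np.asarray(log_var)) * eps
```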

To compute the Fisher information of the encoder, we first generated 500 inputs for each orientation stimulus *s* in the cardinality space from 0° to 45° in steps of 0.5°. We computed the empirical expected value vector

$$\bar{{{{\bf{z}}}}}(s)={{\mathbb{E}}}_{{{{\bf{z}}}} \sim p({{{\bf{z}}}}| s)}[{{{\bf{z}}}}]={\mathbb{E}}[{{{\bf{z}}}}| s].$$

(32)

After rescaling the responses *z*_{i}(*s*) such that the noise has unit variance (without loss of generality), the Fisher information *J* can be expressed as

$$J(s)=\mathop{\sum }\limits_{i=1}^{n}{\bar{z}}_{i}^{{\prime} }{\left(s\right)}^{2}={\left\Vert \frac{\partial \bar{{{{\bf{z}}}}}(s)}{\partial s}\right\Vert }_{2}^{2}.$$

(33)
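This procedure can be sketched numerically with a stand-in encoder (the two-unit sinusoidal tuning below is hypothetical, not the trained network): simulate noisy responses on the orientation grid, average them per stimulus (equation (32)), then take the squared norm of the finite-difference derivative of the mean response (equation (33)).

```python
import numpy as np

rng = np.random.default_rng(1)
s_grid = np.arange(0.0, 45.0 + 0.5, 0.5)   # orientations 0-45 deg, steps of 0.5 deg

def encoder(s, n_samples=500):
    """Hypothetical 2-unit encoder with unit-variance Gaussian noise."""
    tuning = np.array([np.sin(np.deg2rad(4 * s)), np.cos(np.deg2rad(4 * s))])
    return tuning[:, None] + rng.standard_normal((2, n_samples))

# Empirical mean response for each stimulus (equation (32))
z_bar = np.stack([encoder(s).mean(axis=1) for s in s_grid])   # (n_s, n_units)

# Fisher information as the squared norm of d z_bar / d s (equation (33))
dz = np.gradient(z_bar, s_grid, axis=0)
J = np.sum(dz ** 2, axis=1)
```

With 500 samples per orientation, the finite-difference estimate of the derivative is still noisy; in practice a larger sample count or smoothing of `z_bar` over `s` tightens the estimate.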

### Resource allocation with RT costs

We used simulations to study the scenario in which agents are rewarded for short RTs in both the *K*_{acc} and *K*_{rew} contexts. Examining this scenario requires assumptions about a process model that jointly generates decisions and RTs. We assumed that decisions and RTs *T* are generated by a simple DDM with a constant decision bound *b*, decision evidence *z* and diffusion noise *σ* that is independent of the choice set inputs, which can be thought of as downstream decision noise. In the DDM, the data-generation process is unchanged by fixing *σ* to a constant; here we set *σ* = 1. Following the notation in our work, we define the decision evidence *z*(*s*_{1}, *s*_{2}) for the choice set *s*_{1,2} as

$$\begin{array}{rcl}z({s}_{1},{s}_{2})&=&\frac{| {{{\rm{E}}}}\left[{\hat{s}}_{1}| {s}_{1}\right]-{{{\rm{E}}}}\left[{\hat{s}}_{2}| {s}_{2}\right]| }{\sqrt{{{{\rm{Var}}}}\left[{\hat{s}}_{1}| {s}_{1}\right]+{{{\rm{Var}}}}\left[{\hat{s}}_{2}| {s}_{2}\right]}}\\ &=&\frac{| {{{\rm{E}}}}\left[{\hat{s}}_{1}| {s}_{1}\right]-{{{\rm{E}}}}\left[{\hat{s}}_{2}| {s}_{2}\right]| }{\sqrt{\frac{1}{J({s}_{1})}+\frac{1}{J({s}_{2})}}},\end{array}$$

(34)

where *J*(*s*) is Fisher information, which determines resource allocation. To find the optimal resource allocation, we define

$$J(s)\equiv k\times \tilde{f}(s),$$

(35)

with the property

$$\int\,\tilde{f}(s){\mathrm{d}}s=1,\quad \frac{{\mathrm{d}}\tilde{F}}{{\mathrm{d}}s} > 0,$$

(36)

where \(\tilde{F}\) is defined as the CDF of \(\tilde{f}.\) Here we set *k* sufficiently high such that the low-noise limit property holds, and we numerically find \(\tilde{f}\) (ref. ^{69}).
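As a minimal illustration of the constraint in equation (36) (not the numerical optimization of ref. ^{69} itself), any positive function on the stimulus range can be normalized to serve as \(\tilde{f}\); the exponential shape below is hypothetical:

```python
import numpy as np

s = np.linspace(0.0, 45.0, 901)          # stimulus grid
ds = np.diff(s)

def trapz(y):
    """Trapezoidal integral of y over the grid s."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * ds))

f_raw = np.exp(-s / 20.0)                # hypothetical unnormalized allocation
f_tilde = f_raw / trapz(f_raw)           # enforce integral of f_tilde = 1

# CDF of f_tilde; strictly increasing because f_tilde > 0 everywhere
F_tilde = np.concatenate(
    [[0.0], np.cumsum((f_tilde[1:] + f_tilde[:-1]) / 2 * ds)])

k = 100.0                                # k set high for the low-noise limit
J = k * f_tilde                          # equation (35)
```

Normalizing \(\tilde{f}\) separates the total coding budget (the scalar *k*) from its allocation across stimuli, which is what the optimization over \(\tilde{f}\) exploits.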

In the standard DDM, the probability of an erroneous response is given by (for simplicity, we approximate the normal CDF of equation (26) with the logit function corresponding to the analytical solution of the DDM; this approximation does not change the qualitative conclusions of our results)

$$P(\,{{\mbox{error}}}\,| {s}_{1},{s}_{2})=\frac{1}{1+{e}^{2b\times z({s}_{1},{s}_{2})}},$$

(37)

and the expected RT is given by^{70}

$${{{\rm{E}}}}[RT| {s}_{1},{s}_{2}]=\frac{b}{z({s}_{1},{s}_{2})}\tanh (b\times z({s}_{1},{s}_{2})).$$

(38)
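Equations (37) and (38) translate directly into code; the sketch below is a minimal NumPy version, with the example values of *b* and *z* chosen purely for illustration:

```python
import numpy as np

def p_error(z, b):
    """Probability of an erroneous response in the DDM (equation (37))."""
    return 1.0 / (1.0 + np.exp(2.0 * b * z))

def expected_rt(z, b):
    """Expected decision time in the DDM (equation (38))."""
    return b / z * np.tanh(b * z)

b = 1.5          # hypothetical decision bound
z = 0.8          # hypothetical decision evidence for one choice set
err, rt = p_error(z, b), expected_rt(z, b)
```

In the limit *z* → 0, tanh(*bz*) ≈ *bz*, so the expected RT approaches *b*²; increasing the evidence *z* simultaneously lowers the error probability and shortens the expected RT, which is the trade-off governed by the RT cost *η*.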

In this scenario, the loss function for the *K*_{acc} context is given by equation (5) in the main text, and the loss function for the *K*_{rew} context is given by equation (6) in the main text. Note that as *η* → 0, the optimal decision bound would be *b* → *∞*. The goal is thus to find the optimal balance between resource allocation *J*(*s*) and decision bound *b* that minimizes the loss functions for a given RT cost *η* and for the prior distribution of sensory stimuli in the environment.

### Representational similarity analyses of human fMRI data

We conducted additional conjunction analyses on the whole-brain maps of representational similarity for identity and usefulness that were originally computed by Castegnetti and colleagues^{57}. We obtained the thresholded (FWE *P* < 0.05) whole-brain maps from Castegnetti and colleagues and computed conjunctions between the identity and usefulness contrasts, as well as between the usefulness contrast and independently defined masks of the LOC and primary visual areas V1–V3, to create the figure in Supplementary Note 2. The LOC mask was obtained from the fMRI meta-analysis tool Neurosynth (neurosynth.org) with the keyword ‘Lateral Occipital Cortex’ and thresholded at the Neurosynth default of *P* < 0.01 (FDR-corrected). The V1–V3 masks were extracted from the Julich-Brain Cytoarchitectonic Atlas and thresholded at 50% probability. The LOC and V1–V3 masks were then conjoined with the cluster-corrected statistical map of usefulness representations. For the full details about the fMRI data analyses, see Supplementary Note 2.
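On binarized maps defined on a common voxel grid, a conjunction of this kind reduces to a voxelwise logical AND; the sketch below uses small random arrays as stand-ins for the thresholded usefulness map and the LOC mask (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins for voxelwise maps on a common grid:
# a thresholded (binary) statistical map and a binarized anatomical mask.
usefulness_sig = rng.random((4, 4, 4)) > 0.7   # e.g. FWE-thresholded map
loc_mask = rng.random((4, 4, 4)) > 0.5         # e.g. Neurosynth LOC mask

# A conjunction keeps only voxels that survive in both maps
conjunction = usefulness_sig & loc_mask
```

By construction, every voxel in the conjunction is significant in the statistical map and lies inside the mask, so the conjunction can only be as large as the smaller of the two inputs.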

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.