Comparing Reinforcement Learning and Human Learning With the Game of Hidden Rules

Human-machine systems, especially those involving reinforcement learning (RL), are becoming increasingly common across application domains. Designing these systems to be effective and reliable requires a task-oriented understanding of both human learning (HL) and RL. In particular, how does the structure of a learning task affect the learning performance of humans and RL algorithms? Games and other learning environments can serve as important tools in this line of research. While a trend toward increasingly complex environments has led to improved RL capabilities, such environments are difficult to use for the dedicated study of task structure for humans and algorithms. To address this gap we present a novel learning environment called the Game of Hidden Rules (GOHR), built to support rigorous study of the impact of task structure on HL and RL. We demonstrate the GOHR’s utility for such study through example experiments where humans and learning algorithms display opposite responses in performance to tested variations in task structure.


Introduction
Reinforcement learning (RL) [38] benchmarks often come directly from human benchmarks (e.g., games) or are designed to mimic complex reasoning tasks a human might encounter. These increasingly complex benchmark environments have been used to improve the capability of RL algorithms, leading to impressive achievements in domains such as board games [31,39,34,35,36,33] and video games [25,18,43,27]. Unfortunately, while the difficulty of such environments motivates greater RL capabilities, their complexity often makes them ill-suited for the rigorous study of task structure; that is, how the logical structure of a learning task affects learning performance. A task-oriented understanding of RL methods' strengths and weaknesses remains an important gap in the informed deployment of RL tools. In particular, such an understanding would allow decision-makers to better relate findings from benchmark environments to the specific attributes of their own problem settings.
Increasingly, practitioners must also decide how to integrate RL tools alongside humans. A granular, task-oriented understanding of both human learning (HL) and RL is necessary to design systems where humans and algorithms best complement one another. Changes in task attributes that make learning easier for either HL or RL but harder for the other may suggest cases where human-machine learning pairs can be effective [14,4,5]. Direct HL-RL comparisons can also identify helpful human priors or heuristics for future algorithm development. An important step in this line of research is the development of learning environments that support the study of task structure for both HL and RL.
In this paper we study learning tasks within the classical RL setting, where agents learn through sequential interaction with an uncertain environment. Specifically, we consider a new benchmark environment that can be used to systematically assess how the logical structure of learning tasks affects the performance of HL and RL (accounting for hyperparameter selections, feature representations, etc.). Our environment has several advantages over existing environments for the dedicated study of task structure. Existing environments vary in multiple ways, making direct performance comparisons challenging to interpret. Likewise, due to their complexity, it is difficult to use variations of existing environments to create generalizable findings about the impact of task structure. For example, while the objective of chess is clear, the underlying learning task is a complex composition of the board structure, game mechanics, and adversarial dynamics. An experiment might explore variations to this learning task by modifying the movement patterns of different pieces [40]. However, without a clear understanding of how the existing elements of chess contribute to learning performance, it is unclear how to incorporate these results into a generalized understanding of the impact of task structure.
Contributions We present a novel RL environment, called the Game of Hidden Rules (GOHR), that allows researchers to rigorously investigate the impact of task structure on learning performance. This extends preliminary work on this environment [30], as discussed in Appendix A.1. The GOHR complements existing learning environments and distinguishes itself as a useful tool for the study of task structure in three substantive ways. First, each hidden rule encodes a clearly defined logical pattern as the learning objective, allowing researchers to draw systematic distinctions between learning tasks. Second, GOHR's rule syntax allows for fine variations in task definition, enabling experiments that study controlled differences in learning tasks. Third, GOHR's rule syntax introduces a vast space of hidden rules for study, ranging from trivial to complex, providing an appropriate starting point for the study of task structure. We demonstrate the use of the GOHR through two example experiments in task structure that compare human learners to sample RL algorithms.

Related work
RL environments As noted, recent improvements in RL algorithms can be credited to an explosion in simulation-based benchmarking environments. Tools such as the Arcade Learning Environment [6], OpenAI Gym [7], modern video games [20,42,13,45], and procedurally generated environments [11,19,22,32] have spurred RL development through increasingly complex and realistic problem settings. For instance, procedurally generated environments motivate more robust learning algorithms that can better handle variations to the environment. Other environments highlight cooperative or adversarial challenges unique to multi-agent settings [37], further expanding the breadth of tasks RL algorithms face. Collectively, these environments raise expectations for state-of-the-art RL algorithm performance. However, their emphasis on challenging high-end capabilities of algorithms often makes them difficult starting points for fundamental studies into the impact of task structure. The GOHR is intended to address this unmet need in the space of RL environments by allowing researchers to design precise experiments investigating the impact of task structure on learning.
Analysis of RL performance Islam et al. [17] and Henderson et al. [15] initiated important efforts to assess the reproducibility of RL performance and look more deeply at the effects of different internal design choices (e.g., network architectures) on performance [44,2,3]. Such studies offer a great deal of practical insight related to algorithm design choices, but do not generally clarify task-oriented differences in tested benchmark environments. Among broad efforts to compare algorithm performance across benchmarks, the bsuite, introduced by Osband et al. [28], is most closely aligned with our line of research. The bsuite identifies high-level desired characteristics of effective learning agents (generalization, exploration, and handling of long-term consequences) and gathers a set of benchmark environments to assess the performance of different algorithms against these characteristics. While the bsuite concentrates on higher-level performance characteristics, the GOHR provides a complementary testbed focused specifically on the logical structure of the task to be learned and its impact on learning performance. Both approaches mark important steps toward a more nuanced understanding of RL algorithms.

Comparisons of humans to machine learning
There is a growing number of studies comparing the performance of machine learning (ML) to humans on particular real-world tasks, such as medical imaging [24]. Similarly, the RL benchmarking literature often measures algorithm performance against human-level performance. These analyses provide valuable reference points for the performance of humans and state-of-the-art ML algorithms on particular tasks, but they do not clarify fundamental questions regarding the impact of task structure on learning performance. To our knowledge, there is little research that addresses rigorous ML/HL comparisons with respect to task structure. We believe a primary reason for the gap in literature investigating task structure, particularly within RL, is the present lack of environments capable of supporting small, precise, and interpretable changes to tasks. As noted by Hernández-Orallo [16] and Burnell et al. [9], more granular evaluation metrics are needed to properly interpret ML capabilities; this need for granularity holds when ML capabilities are compared rigorously to human performance [12]. Deeper investigations into HL/RL responses to task structure may give important insight into algorithm design for more ambitious benchmarks like the Abstraction and Reasoning Corpus (ARC) [10]. With respect to human-ML comparisons, most similar to our work is that of Kühl et al. [21], which examines a range of pattern recognition tasks in a supervised learning setting. As in our work, they present a curated set of tasks and demonstrate differences between the performance of human players and various algorithms.

Game of Hidden Rules
This overview of the GOHR closely follows [30]. Additional details can be found in Appendix A.2. Comprehensive documentation and the complete toolset are available at our public site.

Game board
The GOHR is played on a 6 × 6 grid-style board using game pieces of varying shapes and colors. At the beginning of an episode, the game engine populates the board with game pieces; the player's goal is to clear the board of game pieces by placing them, one at a time, into any of the four buckets located at the corners of the board. Figure 1 provides a diagram of the board where all individual board cells, rows, columns, and buckets are given numeric labels, along with a sample board as a human player would see it. Note that the collection of shapes and colors used for a given experiment is entirely configurable by the researcher, with a default set of four shapes and four colors. If desired, researchers can construct exact board layouts in advance of play (to be seen randomly or in a specified order); otherwise, pieces are generated randomly per a set of input parameters. This flexibility allows the experimenter to design experiments addressing the learning curriculum itself (e.g., to determine if seeing particular game pieces affects the performance of the learner for a given rule). Additional details are provided in Appendix A.2.1.

Hidden rules A hidden rule, known to the researcher but not to the player, determines which pieces may be placed into which buckets. For example, a rule might allow pieces into certain buckets based on their shape, color, or position on the board. When the player makes a move (i.e., tries placing a particular game piece into a bucket), they receive immediate feedback on whether the move is allowed; if the move is allowed, the piece is removed from play, otherwise the piece returns to its original place on the board. Hidden rules are constructed from one or more rule lines, each of which is built from one or more atoms. For instance, a two-line rule with five atoms might look like:

(atom 1) (atom 2) (atom 3)
(atom 4) (atom 5)

Only one rule line is active at a time; this active line determines the current rule state (how game pieces may be placed into buckets for the player's current move). In the example above, the rule state is formed by the contents of either atoms 1, 2, and 3 or atoms 4 and 5, depending on which line is active. Each atom maps a set of game pieces to a set of permitted buckets and is defined as follows:

(count, shapes, colors, positions, buckets)

Any game pieces matching the shapes, colors, and positions specified in the atom are accepted in any element of its set of buckets. The count parameter defines the number of successful moves for which the atom remains valid and is used in rules where the rule state changes during play. Multiple values can be listed for each non-count field and are grouped in brackets. A simple example where stars and triangles always go in the top-left bucket (0) while circles and squares always go in the bottom-right bucket (2), regardless of their color or position, can be expressed with two atoms on one rule line; Appendix A.2.2 gives the corresponding syntax.
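The atom semantics above can be sketched in code. The following is a hypothetical model, not the GOHR engine's implementation: field names mirror the (count, shapes, colors, positions, buckets) tuple, and an empty feature set is treated as "match any value".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    # Illustrative model of a GOHR atom; not the engine's actual code.
    shapes: frozenset = frozenset()
    colors: frozenset = frozenset()
    positions: frozenset = frozenset()
    buckets: frozenset = frozenset()
    count: int = -1  # successful moves the atom stays valid; -1 = unlimited

    def permits(self, piece, bucket):
        """True if this atom accepts placing `piece` into `bucket`.
        An empty feature set matches any value of that feature."""
        shape, color, position = piece
        return ((not self.shapes or shape in self.shapes)
                and (not self.colors or color in self.colors)
                and (not self.positions or position in self.positions)
                and bucket in self.buckets)

def move_allowed(rule_line, piece, bucket):
    # A move is allowed if any atom on the active rule line permits it.
    return any(atom.permits(piece, bucket) for atom in rule_line)

# The example rule: stars/triangles -> top-left bucket (0),
# circles/squares -> bottom-right bucket (2), for any color or position.
example_line = [
    Atom(shapes=frozenset({"star", "triangle"}), buckets=frozenset({0})),
    Atom(shapes=frozenset({"circle", "square"}), buckets=frozenset({2})),
]
```

Under this model, `move_allowed(example_line, ("star", "red", (1, 1)), 0)` holds, while the same piece aimed at bucket 2 is rejected.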

Example rules
In this section, we introduce the rules used in our experiments. Our first experiment explores a few stationary and non-stationary learning task structures expressible within the GOHR. We consider the following rules (see Appendix A.3 for related rule syntax and our public site to play these rules):

- Shape Match (SM) - Each of the four shapes is mapped uniquely to a bucket (i.e., stars go in bucket 0, triangles in bucket 1, squares in bucket 2, circles in bucket 3).
- Quadrant Nearby (QN) - Pieces in each board quadrant are mapped uniquely to the bucket nearest to that quadrant.
- Bottom-left then Top-right (BLTR) - The rule state alternates between allowing a piece in the bottom-left bucket (3) and allowing a piece in the top-right bucket (1).
- Clockwise (CW) - The first piece must be placed in the top-left bucket (0) and each subsequent piece must be placed in the next clockwise bucket (i.e., the pattern follows buckets 0-1-2-3).
Note that these rules are equally difficult under random play; in each, a random policy always has a 1/4 chance of making a correct move, regardless of the board state or past moves. The rules, however, task the player with learning patterns with different logical structures. Shape Match and Quadrant Nearby are stationary and use a single game piece feature (shape or position). Bottom-left then Top-right and Clockwise are non-stationary, encoding sequences of length 2 and 4, respectively. Our aim with this experiment is to demonstrate how HL and RL may respond differently to specific variations in task structure. Identifying how specific logical structures within learning tasks affect performance for RL or HL players will better inform the analysis of more complex environments, where learning tasks may be compositions of fundamental logical structures expressible in the GOHR.
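Three of these base rules can be sketched as simple move checkers (Quadrant Nearby is omitted here since it depends on Figure 1's cell labeling). Each checker takes the piece, the target bucket, and the number of correct moves made so far; these are our own illustrative renderings of the rules' logic, not GOHR engine code.

```python
# Bucket numbering follows the paper: 0 top-left, 1 top-right,
# 2 bottom-right, 3 bottom-left.
SHAPE_TO_BUCKET = {"star": 0, "triangle": 1, "square": 2, "circle": 3}

def shape_match(piece, bucket, n_correct):
    # Stationary: each shape is mapped uniquely to one bucket.
    shape, _color, _position = piece
    return bucket == SHAPE_TO_BUCKET[shape]

def bltr(piece, bucket, n_correct):
    # Non-stationary, 2-long sequence: bottom-left (3), then top-right (1).
    return bucket == (3, 1)[n_correct % 2]

def clockwise(piece, bucket, n_correct):
    # Non-stationary, 4-long sequence: repeating bucket pattern 0-1-2-3.
    return bucket == n_correct % 4
```

In every state, each rule accepts exactly one of the four buckets, which is why a random policy succeeds with probability 1/4 regardless of the rule.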
Our second experiment addresses a characteristic of learning tasks that we call rule generality. Broadly, rule generality reflects that multiple policies may be effective for a particular rule. More rigorously, let a player's move policy be the policy by which they generate their moves. Such a policy is sufficient if applying it to any possible board state and sequence of past actions yields error-free play. A given move policy may be sufficient for many rules, and a rule may permit many sufficient policies. For example, consider the stationary rule where red and blue game pieces are permitted in buckets 0 and 1 while green and yellow game pieces are permitted in buckets 2 and 3. A sufficient policy could be to select game pieces and associated actions in the following order:

red → bucket 0, blue → bucket 1, green → bucket 2, yellow → bucket 3

Any policy relying on these same color-to-bucket mappings would also be sufficient, regardless of the order in which it selects pieces. Further consider arbitrary rules A and B. With respect to generality, A properly dominates B if any sufficient policy for B is sufficient for A and there exists a sufficient policy for A that is not sufficient for B. We refer to A as more general than B (denoted A ≻ B). To study the response of players to increasing rule generality, we consider the following variations of the rules given in our first experiment (see Appendix A.3 for expression in GOHR rule syntax):

- Shape Match 1 Free (SM1F) - Shape Match modified so one shape can be placed in any bucket.
- Shape Match 2 Options (SM2O) - Similar to Shape Match, except each of the four shapes is mapped to two buckets rather than one.
- Quadrant Nearby 2 Free (QN2F) - Identical to Quadrant Nearby, except that pieces in two of the quadrants can be placed in any bucket.
- Bottom then Top (BT) - The rule state alternates between allowing a piece in either of the bottom two buckets (2, 3) and allowing a piece in either of the top two buckets (0, 1).
- Clockwise Alternating Free (CWAF) - Same as Clockwise, but every other move is a free move (i.e., any piece will be accepted in any bucket), per the repeating pattern 0-*-2-*.
- Clockwise 2 Free (CW2F) - Same as Clockwise, except the last two moves in the 0-1-2-3 bucket pattern are free moves, i.e., following the repeating pattern 0-1-*-*.
Each rule is constructed to properly dominate a corresponding 'base rule' from the first experiment (e.g., CWAF ≻ CW, CW2F ≻ CW). When we provide two rule variations of a base rule (e.g., CWAF, CW2F), neither is more general than the other. Our aim with this experiment is to study the responses of human players and RL algorithms to increasing generality by comparing performance on more general rule variations to their respective base rules.
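For the Clockwise family, the rule state depends only on the number of correct moves made so far, so each rule can be summarized as a function from that count to the set of permitted buckets. This simplified sketch (our own, restricted to that setting) checks the proper-dominance relation: if A permits a superset of B's buckets at every step, strictly at some step, then every sufficient policy for B is sufficient for A but not conversely.

```python
def cw(n):    return {n % 4}                                       # Clockwise: 0-1-2-3
def cw2f(n):  return {n % 4} if n % 4 < 2 else {0, 1, 2, 3}        # 0-1-*-*
def cwaf(n):  return {n % 4} if n % 4 in (0, 2) else {0, 1, 2, 3}  # 0-*-2-*

def properly_dominates(a, b, horizon=100):
    # Per-step containment of permitted-bucket sets, checked over a
    # finite horizon (sufficient here since the patterns repeat mod 4).
    weakly = all(b(n) <= a(n) for n in range(horizon))
    strictly = any(b(n) < a(n) for n in range(horizon))
    return weakly and strictly
```

Under this check, CW2F ≻ CW and CWAF ≻ CW hold, while neither CWAF nor CW2F dominates the other, matching the relationships stated above.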

Experimental setup
We describe the human and RL participants in our experiments, experimental procedures, performance metrics, and statistical comparison of our results. Portions of this section closely follow [30].
Human participants Human participants in our GOHR experiments came from the Amazon Mechanical Turk platform [8], a popular tool for crowdsourcing tasks and research. Each player received a brief set of instructions about the mechanics of the GOHR and subsequently played 3-7 episodes of the same rule as part of their participation in the experiment. Players were selected such that they had no prior exposure to the GOHR. Approximately 25 participants were assigned to each rule listed in Section 4. In each episode, players received boards randomly populated with 8 or 9 pieces, depending on the rule. Additional information regarding experimental flow, subject counts, payments, and board generation parameters can be found in Appendix A.4.

RL algorithms
To describe our two sample algorithms, we model the GOHR as a Markov Decision Process (MDP). The state observed by the player at time t, S_t ∈ S, is described by the sequence of board arrangements (B_i) and associated actions (A_i) leading to the current board, (B_0, A_0, ..., B_t).
The game engine generator, g, randomly generates the initial board arrangement B_0 per input parameters provided for the experiment, β, i.e., S_0 = (B_0) ∼ g(β). The action space A is the set of 144 action tuples (r, c, b) given by placing the piece in row r ∈ {1, ..., 6} and column c ∈ {1, ..., 6} into bucket b ∈ {0, ..., 3}. If the game engine evaluates that action A_t is permitted by the rule, the corresponding piece is removed in board arrangement B_{t+1}; otherwise B_{t+1} = B_t. Note that state transitions are deterministic given the player's action, according to the logic of the hidden rule. A player receives reward R_t(S_t, A_t) = 0 if action A_t ∈ A from state S_t is permitted by the hidden rule and −1 otherwise. A terminal state is reached when the board is cleared, denoted by t = T. The player's objective is to find a policy π(s) that maximizes the value function V^π(s) = E_π[ Σ_{t=0}^{T} γ^t R_t(S_t, A_t) | S_0 = s ], where γ ∈ (0, 1) is a discount factor.
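The 144-action space admits a natural flattening into a single index, which is convenient for tabular bookkeeping or network output layers. The index ordering below is our own illustrative choice; the paper only specifies the tuple form (r, c, b).

```python
def action_to_index(r, c, b):
    """Map an action tuple (row 1-6, column 1-6, bucket 0-3) to 0..143:
    6 rows x 6 columns x 4 buckets = 144 discrete actions."""
    return ((r - 1) * 6 + (c - 1)) * 4 + b

def index_to_action(i):
    """Inverse of action_to_index."""
    b = i % 4
    cell = i // 4
    return cell // 6 + 1, cell % 6 + 1, b

def reward(move_permitted):
    # R_t(S_t, A_t) = 0 for a permitted move and -1 otherwise.
    return 0 if move_permitted else -1
```

The round trip `index_to_action(action_to_index(r, c, b)) == (r, c, b)` holds for all 144 tuples.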
We used two sample RL algorithms for comparison to human players. As an example of a policy-gradient based method, we employed a variant of the canonical REINFORCE algorithm [46]. For a sample value-based method, we used a variant of epsilon-greedy DQN with experience replay [25]. S is too large to deal with directly, and thus we constructed feature maps ϕ_i(s) to make the problem tractable. Our goal in constructing ϕ_i was to faithfully represent S without artificially biasing algorithm performance for any particular subset of rules. As part of our experiments, we explored numerous feature maps and found that none universally outperformed the others across our tested rules (see Section 6). For REINFORCE, we used a neural network, parameterized by θ, to represent the learner's policy. Similarly, for DQN, we used such a neural network to approximate the state-action value function Q(S, A). Complete descriptions of these algorithms, corresponding hyperparameter selections, feature maps, and experimental flow can be found in Appendix A.5. For a fair comparison to humans encountering the GOHR for the first time, we measured the performance of these RL algorithms with no pre-training. Each learning run for a given algorithm and rule consisted of initialization of the model with random network weights followed by serial play of a set number of episodes of the same rule. Learning runs used separate random seeds for the algorithm and the game-engine board generator. For each algorithm, we performed 50 independent learning runs on each rule provided in Section 4.
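The action-selection step of an epsilon-greedy method can be sketched in a few lines. This is a generic stand-in for the selection step only, with `q_values` representing the network output Q(S, ·) over the 144 actions; replay, target networks, and updates (detailed in Appendix A.5) are omitted.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore uniformly over all actions;
    otherwise exploit by taking an action with maximal estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With `epsilon=0` this is pure exploitation; with `epsilon=1` it is the uniform random policy whose 1/4 per-move success rate on the base rules anchors the metrics below.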
Performance metrics and statistical comparison Rules may permit many sufficient policies; we measured performance based on a learner's ability to exhibit any sufficient policy for the rule. Due to the limited participation time of human players, we measured a human player's understanding of a rule using streaks of consecutive correct moves that are sufficiently unlikely to occur at random. The base rules give players a 1/4 chance of making a correct move at random; we chose a threshold of 10 correct moves in a row as it corresponds to a random probability of roughly 1 × 10^−6. We define the point metric m* to be the index of the first move in the first streak of 10 or more correct moves that a player demonstrates. If a player never achieves such a streak, they are assigned an arbitrary placeholder value for m* that is higher than all measured m* values across the population of human players. Larger values of m* are interpreted to mean greater difficulty, as they correspond to the player needing more moves to obtain an understanding of the rule.
In contrast to humans, we can prompt RL algorithms to play a fixed number of episodes large enough to exhaustively evaluate policy sufficiency. We define the point metric of terminal cumulative error (TCE) to be the cumulative error count made across all episodes in a learning run. If the algorithm has reached an understanding of the hidden rule, this error count is expected to converge to a constant, with some allowance made for algorithm stochasticity (see Appendix A.5.6 for chosen convergence criteria). If a learning run does not meet our convergence criteria, we set the TCE for that learning run to be a placeholder value larger than all convergent TCE values. We chose fixed episode horizons of 4,000 and 60,000 for learning runs of DQN and REINFORCE, respectively; learning runs typically converged well before these horizons, but use of a common horizon allowed for fair comparison of learning runs that needed more episodes to converge. As with m*, larger values of TCE are interpreted as the result of more difficult learning tasks. The metrics m* and TCE summarize the performance of each learning run, as shown in Figure 2. For each type of player (humans, DQN, and REINFORCE), we gather their associated m* or TCE values for all learning runs on all rules, and we compare their distributions using the non-parametric Mann-Whitney U-Test [26]. Statistical tests are performed at the α = 0.05 significance level.

Results
Comparison of feature representations As noted, our RL experiments considered different feature representations of the observed state. To study the impact of memory on performance, we tested input feature maps that included 2, 4, 6, or 8 previous board states and actions. This testing further included feature maps with different representations of both the past boards and actions themselves. For example, a past action could be represented as a 144-long one-hot vector or by three one-hot vectors (6-long for row, 6-long for column, and 4-long for bucket). We refer to the former as a sparse representation and the latter as a dense representation; we extend similar notions to representations of past boards. We found that no feature map universally outperformed the others across all tested rules; different choices of memory and board/action representation yielded different performance tradeoffs. In general, additional memory improved performance on non-stationary rules but worsened performance on stationary rules. With some exceptions, dense representations tended to outperform sparse representations. Additionally, while REINFORCE performed poorly in comparison to DQN, we noted that both algorithms showed similar trends in performance with respect to different feature representations. We provide a complete discussion of performance using different feature maps in Appendix A.7. In order to provide a single point of comparison for each method to humans in the following experiments, we select a feature map that showed a good balance of performance across our tested rules for both algorithms: dense board and action representations with 6 steps of memory.
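The two action representations can be made concrete as follows; the flattening order in the sparse case is our own illustrative choice. Note the size difference: 144 entries for the sparse one-hot versus 6 + 6 + 4 = 16 for the dense concatenation.

```python
def one_hot(i, n):
    """Length-n vector with a single 1.0 at index i."""
    v = [0.0] * n
    v[i] = 1.0
    return v

def sparse_action(r, c, b):
    # "Sparse": one one-hot vector over all 144 (row, col, bucket) tuples.
    return one_hot(((r - 1) * 6 + (c - 1)) * 4 + b, 144)

def dense_action(r, c, b):
    # "Dense": three concatenated one-hots (6 rows + 6 cols + 4 buckets).
    return one_hot(r - 1, 6) + one_hot(c - 1, 6) + one_hot(b, 4)
```

The dense form exposes row, column, and bucket as separate factors, which plausibly helps a network share structure across actions; the sparse form treats each of the 144 actions as unrelated.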
Comparison of base rules Our first experiment introduced a set of 'base rules' for study (SM, QN, CW, BLTR). Figure 3 summarizes performance on these rules using empirical cumulative distribution function (ECDF) plots of m* for humans and strip plots of TCE for DQN and REINFORCE. Note that lower ECDF curves indicate greater difficulty, as they imply that fewer players achieve an m* streak at a given move index. Higher values of TCE indicate greater difficulty, as more mistakes occur before reaching a sufficient policy. See Appendix A.6.3 for a complete tabulation of p-value results from two-sided Mann-Whitney U-tests for all base rule pairs and learners. First, we note the similarity of human performance across the base rules, despite differences in learning task structure. Only one rule pairing, QN-BLTR, showed a statistically significant difference in human performance, with BLTR appearing more difficult. In contrast, both DQN and REINFORCE showed statistically significant performance differences for all rule pairs, even after applying a conservative Bonferroni correction. Although DQN outperformed REINFORCE (measured by TCE), both algorithms exhibited the same rule difficulty ordering: QN, BLTR, CW, SM (easiest to hardest, with SM noticeably more difficult). The non-stationary rules followed an expected ordering based on underlying sequence length: BLTR, a 2-long sequence, was easier than CW, a 4-long sequence.
Regarding stationary rules, we did not expect SM to be so difficult, especially compared to QN. We believe the difficulty of SM reflects a subtle interaction between our feature representations and the details of the learning task. While our boolean board representations are an intuitive way to create a static-size characterization of the board, they favor the learning of position-based patterns over shape- or color-based patterns. In particular, the one-hot encoding fails to provide a notion of similarity for identical shapes or colors in different board positions. This type of finding might be easily overlooked in settings outside the GOHR, where the relevant learning tasks are more opaque and systematic investigation of task structure is not the primary focus.
These results suggest meaningful differences between HL and RL in their response to varying task structures. In particular, human performance likely depends on both the structure of the task and its relation to human priors for patterns. The plausible closeness of these rules to common priors (e.g., clockwise) would explain the similar human performance we observed across rules despite their structural differences. RL players, on the other hand, responded strongly to differences in the logical structure of learning tasks and performed identically for logically equivalent rules, such as SM and its color-based equivalent CM (see Appendix A.6.1). The GOHR serves as an ideal testbed for deeper investigations into such differences in task structure. For example, CW represents one possible instance of a 4-long repeating pattern; a dedicated experiment might explore other 4-long patterns to measure how human priors affect performance across rules of equivalent logical structure. Likewise, such an experiment could also include 2-, 3-, or 5-long patterns to precisely measure how human and RL player performance varies with respect to incremental changes in the logical structure itself. Similar approaches could be used to explore performance differences within families of stationary rules and, more generally, to identify the strengths and weaknesses of both learners with respect to fundamental elements of task structure. Further, even from this example experiment, we see that the difficulty ordering of a common set of rules is not shared for human and RL players, suggesting that human-machine learning pairs might be constructed to exploit the differing strengths of each.
Impact of rule generality Our second experiment introduced more general variants of our base rules. Families with three rules (e.g., base rule A and more general rule variants B, C) offer two generality comparisons (B ≻ A and C ≻ A), while families with two rules offer one comparison (B ≻ A). Figure 4 summarizes human performance within each rule family with ECDFs of the m* distributions. DQN and REINFORCE showed uniformly better performance on more general rule variants (see Appendix A.6.2 for related plots). Table 1 shows the results of the Mann-Whitney U-Tests associated with each generality comparison. We used one-sided tests, with the null hypothesis that the more general rule is no harder than the base rule, as more general rules offer a larger number of sufficient policies. Per the U-tests, we note that RL players uniformly found more general rule variants easier than their base counterparts (the p-values greater than 0.999 would be significant under the opposite direction null hypothesis). In contrast, the response of human players to increasing generality depended on the structure of the base rule. In particular, human players appeared to find the more general forms of our non-stationary rules more difficult than their base rule counterparts (i.e., CWAF and CW2F appear more difficult than CW, BT appears more difficult than BLTR). This is surprising, as more general rules offer a higher probability of achieving a streak of 10 correct moves at random. We posit that this difference between humans and our RL players reflects important differences in their respective learning strategies. While greater rule generality might plausibly assist learning by offering a larger number of sufficient policies for the player to learn, it also could hinder learning by decreasing the amount of useful, negative feedback. For primarily inductive learners, such as our sample RL algorithms, it appears that the availability of a larger number of sufficient policies dominates, making these more general rules easier. Humans, however, likely employ some combination of induction and deduction; the additional positive feedback from more general rules may complicate deduction, as feedback could agree with many candidate classes of hidden rules. Future studies, with a broader set of base rules, could explore such an effect in greater detail. As in the first experiment, HL and RL did not respond identically to changes in task structure, and our results show that the parallel study of task structure for HL and RL may provide important insight into the strengths and weaknesses of each learner. Although the GOHR deals with relatively abstract task structures, we believe a systematic understanding of performance within the GOHR can provide important perspective in complex environments, where tasks are compositions of many such fundamental elements.

Conclusion
We have shown that the GOHR provides a capability for studying the performance of HL and RL in a novel and principled way. Using the GOHR's expressive rule syntax, researchers can make precise changes to learning tasks in order to study their effects on human and RL algorithm performance.
The GOHR complements existing environments by empowering researchers to perform rigorous experiments into different learning task structures. Beyond the kind of experiments presented here, the GOHR could also be used for related studies such as teaching curricula, transfer learning, or human-machine learning pairs. Task-oriented experiments augment efforts to improve the overall capabilities of RL algorithms by furthering our understanding of these methods' strengths and weaknesses. Most importantly, this type of study provides a step toward task-oriented understandings of RL and HL, both of which are needed to better inform the real-world use of RL. With this goal in mind, we are sharing the complete suite of tools with all interested researchers. We hope that researchers using the GOHR will share their findings to help this inquiry.

A.1 Preliminary work
Preliminary work on the GOHR [30] introduced the environment and preliminary machine learning experiments. As noted in the body of the text, some sections of this manuscript (particularly descriptions of the environment and machine learning performance metrics) follow closely from corresponding sections in [30]. This manuscript presents a broader set of experiments, includes new reinforcement learning models, adds featurization studies, and introduces comparisons to human learners in order to demonstrate the full capability of the GOHR as a research tool.

A.2 GOHR Mechanics
In this section we provide additional details, as discussed in the body of the manuscript, regarding the structure and mechanics of the GOHR. Further details can be found in the external documentation at our public site.

A.2.1 Board generation
The board generation process (whether using pre-defined boards or done randomly) is controlled through the configuration file defined for each experiment. If using pre-defined boards, the experiment designer must specify a collection of JSON board representations as part of this file. When generating boards randomly, the experimenter must specify the minimum and maximum numbers of game pieces, colors, and shapes to appear on each new board; the game engine randomly selects values in the range [minimum, maximum] for each quantity and generates boards accordingly.
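The random generation scheme above can be sketched as follows. The configuration keys, piece-record fields, and default shape/color names here are illustrative assumptions, not the GOHR's actual schema:

```python
import random

def generate_board(config, rng=random):
    """Sketch of random board generation: draw counts in [minimum, maximum]
    for pieces, colors, and shapes, then place pieces on the 6x6 board.
    Config keys and field names are our own illustrative choices."""
    n_pieces = rng.randint(config["min_pieces"], config["max_pieces"])
    n_colors = rng.randint(config["min_colors"], config["max_colors"])
    n_shapes = rng.randint(config["min_shapes"], config["max_shapes"])
    colors = rng.sample(["red", "blue", "yellow", "black"], n_colors)
    shapes = rng.sample(["star", "triangle", "circle", "square"], n_shapes)
    cells = rng.sample(range(36), n_pieces)  # 6x6 board; one piece per cell
    return [{"cell": c, "shape": rng.choice(shapes), "color": rng.choice(colors)}
            for c in cells]
```

A board generated this way contains between the minimum and maximum number of pieces, with no two pieces sharing a cell.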

A.2.2 Rule specification and examples
This overview of GOHR rule specification closely follows [30]. In Section 3, we considered the example where stars and triangles always go in the top-left bucket (0) while circles and squares always go in the bottom-right bucket (2), regardless of their color or position. As noted, this can be expressed with two atoms on a single line. For brevity, we use shorthand to refer to each row of the game board (R#) rather than to individual board cells. Using this approach, an experimenter can design stationary hidden rules which are complex combinations of game piece features (shape, color, or position).
As mentioned, each atom contains a count field dictating the number of corresponding correct moves that the atom permits. When a player makes a move satisfying one (or more) of the atoms on the active rule line, the counts associated with the satisfied atoms are decremented by one. An atom with a count of 0 is considered exhausted and no longer accepts moves. At each move, the game engine evaluates the active rule line and the current board; if there are no valid moves available to the player, the engine makes the next rule line active and resets all counts in the new line. If there are no lines below the current rule line, the first rule line is made active again. This functionality can be used to encode sequences into the hidden rules.

For example, the following two rules require the player to alternate between placing game pieces in the top (0, 1) and bottom (2, 3) buckets, with subtly different mechanics. In the strict case, the player's first move is allowed into only the top buckets, and every correct move exhausts the current rule line. As a result, the active rule line will alternate with each correct move until the board is cleared. In the ambiguous case, a similar alternation occurs, but the order depends on the player's first correct move. Since both atoms are on the same rule line, the player's first move may go in any of the four buckets. After one successful move, the player may only make a move satisfying the remaining, non-exhausted atom. After two successful moves, all atoms in the rule line are exhausted and both are reset. This process repeats until the board is cleared.
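The atom-count mechanics above can be illustrated with a minimal sketch. This is not the GOHR engine itself; position handling is omitted, and the class and field names are our own:

```python
class Atom:
    """One atom of a rule line, sketching the count mechanics:
    '*' is a wildcard for shape or color; count None means unmetered."""
    def __init__(self, count, shape, color, buckets):
        self.count = count
        self.shape, self.color, self.buckets = shape, color, buckets

    def exhausted(self):
        return self.count == 0

    def accepts(self, piece, bucket):
        return (not self.exhausted()
                and self.shape in ("*", piece["shape"])
                and self.color in ("*", piece["color"])
                and bucket in self.buckets)

def try_move(active_line, piece, bucket):
    """Decrement the counts of all satisfied atoms; return move success."""
    satisfied = [a for a in active_line if a.accepts(piece, bucket)]
    for a in satisfied:
        if a.count is not None:
            a.count -= 1
    return bool(satisfied)
```

With both atoms on one line, a first correct move into a bottom bucket exhausts that atom, so only the top-bucket atom accepts the next move, mirroring the ambiguous alternation described above.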

GOHR also permits the experimenter to attach a count to each rule line. When the line is active, this count decrements each time any atom in the line is satisfied. An example is a rule that alternates between uniquely assigning shapes and colors to buckets. If the rule-line count is exhausted, the game engine makes the next line active, even if there are non-exhausted atoms on that line. For this example, the active rule line will alternate after each successful move, regardless of which atom in the active line is satisfied. If no count is provided for a given rule line, the game engine assumes that the rule line is not metered and can be used until all atoms on that line are exhausted or no valid move exists for the game pieces currently on the board.
The GOHR rule syntax allows the experimenter to write expressions in an atom's bucket field that define buckets based on previous successful moves. The game engine stores values for the bucket that most recently accepted any game piece (p) as well as the bucket that most recently accepted an item of each color (pc) and shape (ps). A simple rule expressible using these values is "objects must be placed in buckets in a clockwise pattern, beginning with any bucket":

(1, *, *, *, [0,1,2,3]) (*, *, *, *, p+1)

The expressions used in the bucket field are evaluated modulo 4 to ensure that the resultant expression gives values in the range 0-3. The game engine also supports the terms "nearby" and "remotest" as bucket definitions, which allow the experimenter to require that game pieces be put into the closest or furthest bucket, respectively, evaluated by Euclidean distance.
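The modulo-4 evaluation of bucket expressions can be sketched as below. Only the simple p and p±k forms are handled; the engine's actual parser (and the pc/ps, "nearby", and "remotest" terms) is richer than this illustration:

```python
def resolve_bucket(expr, p):
    """Evaluate a bucket expression modulo 4.
    `p` is the bucket that most recently accepted any game piece.
    A sketch: only 'p', 'p+k', 'p-k', and literal buckets are supported."""
    if expr == "p":
        return p % 4
    if expr.startswith("p+"):
        return (p + int(expr[2:])) % 4
    if expr.startswith("p-"):
        return (p - int(expr[2:])) % 4
    return int(expr) % 4
```

Repeatedly applying "p+1" produces the clockwise cycle 1, 2, 3, 0, 1, ... regardless of the starting bucket.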
The arrangement of atoms allows the experimenter to encode a specified order of the component tasks within a rule. For instance, the rule that "all red pieces go into bucket 1, then all blue pieces go into bucket 2" would be expressed as follows:

(*, *, red, *, 1) (*, *, blue, *, 2)

Since both the first rule line and its associated atom are not metered, they can never become exhausted, even if there are no red game pieces left on the board. However, as noted above, the engine evaluates whether there are any valid moves available to the player given the active rule line; if there are no valid moves available, the engine makes the next line active. In this example, once the player has cleared all red game pieces from the board, the engine will make the second rule line active.

A.3 Sample experiment rules in GOHR syntax
In this section we provide the GOHR rule syntax associated with each of the rules included in our experiments. Note that the default set of four colors is red, blue, yellow, and black. Our human experiments substituted green for black to give more visual contrast between different colors. The syntax below is provided for the human experiments; substituting black for green (where applicable) gives the syntax for the versions played by RL players. Since our RL players do not process a visual representation of the board, the specific choice of the four colors used for RL play has no impact on performance.
Shape Match (SM): Note that SM2O could structurally be formed using many different 2-option bucket arrangements. The particular arrangement used here technically does not properly dominate SM, although it does dominate a version of SM with a slightly different bucket arrangement. This choice stems from experiments not discussed in this paper, which considered the performance of players playing episodes of these rules in particular orders. For players who might see both SM and SM2O as part of an experiment, we wanted to ensure that players could not simply carry their policy over from a prior rule and obtain error-free play (as would be the case when playing SM and then SM2O with a version of SM2O that dominates SM). Given no compelling reason for a different arrangement of buckets to impact performance for human or RL players, we still consider SM2O to be a more general version of SM for the purposes of the sample experiments included in this paper.
Color Match (CM): Note that the \ denotes that these two atoms are on the same rule line and are separated here only due to formatting constraints.

A.4 Human experimental procedures
In this section we provide additional information regarding our experiments with human participants.

A.4.1 Overview
Our Amazon Mechanical Turk experiments were organized using the intermediary service CloudResearch [23], which allowed us to restrict participation to individuals in the United States. A small batch of participants (15) was recruited and played the game on December 12, 2022. The remainder of our participants were recruited and played the game on December 14, 2022. During play, human players do not have access to a log of past moves, but when the player makes a correct move a check mark is overlaid on the cleared piece (and the player can no longer click on it). From this, a player can see the attributes of game pieces they have successfully cleared. Note that during processing of players' results, we discard instances where the player selects a piece but does not move it to a bucket; we call this situation a 'finger-slip' and attribute it to a physical error made as the player tries to click and drag an object to a bucket. These finger-slips are thus not counted as mistakes when we calculate m*.
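The finger-slip filtering above amounts to a simple pass over the move log. The record fields used here ('bucket', 'accepted') are our own illustrative names, not the actual log schema:

```python
def count_mistakes(move_log):
    """Count a player's mistakes, discarding 'finger-slips': records where a
    piece was selected but never moved to a bucket. Field names are
    illustrative assumptions about the log format."""
    return sum(1 for m in move_log
               if m["bucket"] is not None and not m["accepted"])
```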

A.4.2 Experimental flow
Our experiments tasked players with playing 3 separate rules. As mentioned in the body of the manuscript, however, we only include results for the episodes of the first rule the player encounters (data from the second and third rules are not included in this manuscript). We plan to use experiments with multiple rules in later studies exploring the effects of transfer learning. Participants experienced the following order of events: 1. Player is recruited via the Amazon Mechanical Turk platform.
2. Player navigates to the link provided as part of the experiment and signs our participation consent form (see Appendix A.4.7).
3. Player receives instructions regarding the GOHR and the structure of the experiment.
4. Player plays 3-5 episodes of rule 1. Note that after 3 episodes, the player has a persistent option to proceed to the next rule. After 3 episodes they also receive a persistent option to play 2 additional episodes (called 'bonus rounds') as a chance to earn more points and a larger reward. As a result, players may play 3-7 episodes of this rule in total, depending on their selections.
5. Player proceeds to rule 2 and similarly plays 3-7 episodes of that rule (data not included in this manuscript).
6. Player proceeds to rule 3 and similarly plays 3-7 episodes of that rule (data not included in this manuscript).

A.4.3 Subject counts
Subject counts for each rule can be found in Table 2. Note that our online interface requests that players provide a guess for the rule at the end of each episode. We found these guesses to be unreliable as measures of understanding, but we did use these guesses in tandem with performance as a filter for determining if a player made an honest effort during their time with the GOHR. We collected results from 24-26 participants for each starting rule (these players completed all 3 rules they were given).
From this group we excluded 5 participants from our analysis. A player was only excluded if, for all 3 rules they played, they showed no effort to learn the rule (i.e., their error rates were consistent with guessing at random for all episodes and all rules) and their inputted guesses did not relate to the game (e.g., if they provided only a random number so the button to proceed to the next episode would appear). Of the 5 excluded players, 1 was assigned to each of the following initial rules: SM1F, QN, QN2F, CWAF, BLTR.

A.4.4 Payment
Players were paid $2.50-$3.50, based on their performance, for approximately 20 minutes of time spent playing the GOHR. This amounts to an hourly wage of $7.50-$10.50.

A.4.5 Board generation parameters
In order to reduce the likelihood that players receive drastically different boards, we enforced that each board must have at least one piece of each shape and color. Boards for stationary rules were generated with 9 pieces. Boards for non-stationary rules were generated with 8 pieces so that sequences would carry over smoothly from board to board (e.g., the Clockwise cycle would not be interrupted by episode transitions). We plan future investigations into boards with larger numbers of pieces as part of dedicated curricula studies.

A.4.6 IRB
Our experiments with human players were reviewed and approved under the University of Wisconsin-Madison Minimal Risk Research IRB 2020-0781.

A.4.7 Participation consent
Prior to receiving instructions for the GOHR and beginning the experiment, players consented to participate in the study by clicking a checkbox at the conclusion of the following text: "The task you are about to do is sponsored by University of Wisconsin-Madison. It is part of a protocol titled "Human and Machine Learning: The Search for Anomalies". The purpose of this work is to compare reasoning biases in human and machine learners by testing what reasoning problems are relatively easier or more difficult for people, and for machines. More detailed instructions for this specific task will be provided on the next screen.
This task has no direct benefits. We do not anticipate any psychosocial risks. There is a risk of a confidentiality breach. Participants may become fatigued or frustrated due to the length of the study. The responses you submit as part of this task will be stored on a secure server and accessible only to researchers who have been approved by UW-Madison. Processed data with all identifiers removed could be used for future research studies or distributed to another investigator for future research studies without additional informed consent from the subject or the legally authorized representative. You are free to decline to participate, to end participation at any time for any reason, or to refuse to answer any individual question without penalty or loss of earned compensation. We will not retain data from partial responses. If you would like to withdraw your data after participating, you may send an email to lupyan@wisc.edu or complete this form, which will allow you to make a request anonymously. By clicking this box, I consent to participate in this task."

A.5 RL algorithms
In this section we describe the details of our two sample RL players: DQN and REINFORCE. These models come largely from canonical examples [25, 46], but are modified to include action masks tailored to fit the GOHR's mechanics.

A.5.1 Experimental flow
For all rules in our experiments, RL players received boards with 9 randomly generated game pieces.
Given that RL players played many more episodes than humans, we did not similarly enforce that boards must have one game piece of each shape and color. Separate random seeds were used for each learning run (for both the game engine and the learning algorithm). As discussed in the body of the manuscript, each learning run consisted of random initialization of the model followed by serial play of a fixed number of episodes of the same rule.

A.5.2 Feature maps
As noted previously, to make the problem tractable we used feature mappings ϕ_i to represent the observed state S ∈ S, i.e., ϕ_i(S), and to maintain a static-sized input layer for the neural networks used in each algorithm. Given the plausible impact of different feature representations on task-based performance, we tested feature mappings that varied in the amount of memory included (i.e., the number of past boards and associated actions) as well as in the representations of the past boards and actions themselves. The observed state may contain repeated board representations (i.e., B_{t+1} = B_t if move A_t is unsuccessful). Since the GOHR mechanics only update the rule state on successful moves, we only include distinct past board states, and the associated successful action made from each board state, in our feature maps. Repeated board states and the associated unsuccessful actions are not included.
Each featurization is the concatenation of a representation of the current board and a representation of some number of distinct past board states and associated actions. All featurizations used the same representation of the current board, defined as the concatenation of the following (assuming the use of the default set of four shapes and colors):

• 4 36-long vectors, one corresponding to each shape. Each entry in the vector corresponds to one of the 36 cells on the board. An entry in the vector is 1 if a game piece of the corresponding shape is present in that cell and 0 otherwise.
• 4 36-long vectors, one corresponding to each color. Each entry in the vector corresponds to one of the 36 cells on the board. An entry in the vector is 1 if a game piece of the corresponding color is present in that cell and 0 otherwise.

In total, this is a 288-long boolean vector. The remainder of the featurization represents a chosen number of past board states and actions (2, 4, 6, or 8), each encoded 'sparsely' or 'densely' as follows.
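The current-board representation above can be sketched as follows. The cell numbering and piece-record fields are our own assumptions:

```python
def encode_current_board(board, shapes, colors):
    """288-long boolean encoding of the current board: one 36-long indicator
    vector per shape, then one per color. A sketch; cell numbering and
    field names are illustrative assumptions."""
    vec = []
    for s in shapes:
        occupied = {p["cell"] for p in board if p["shape"] == s}
        vec += [1 if c in occupied else 0 for c in range(36)]
    for col in colors:
        occupied = {p["cell"] for p in board if p["color"] == col}
        vec += [1 if c in occupied else 0 for c in range(36)]
    return vec
```

Each piece on the board sets exactly two entries: one in its shape block and one in its color block.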
A sparse action encoding is given by:

• 1 144-long one-hot vector. Each entry corresponds to one of the 144 actions in the GOHR (i.e., placing a piece from one of the 36 cells into one of the 4 buckets). The entry for the successful past action taken in that time step is 1 and all others are 0 (i.e., only one entry will be 1).
A dense action encoding is given by the concatenation of the following:

• 1 6-long one-hot vector. Each entry corresponds to one of the game board's 6 rows. The entry for the row associated with the past successful action taken from this time step is 1 and all others are 0.
• 1 6-long one-hot vector. Each entry corresponds to one of the game board's 6 columns. The entry for the column associated with the past successful action taken from this time step is 1 and all others are 0.
• 1 4-long one-hot vector. Each entry corresponds to one of the game board's 4 buckets. The entry for the bucket associated with the past successful action taken from this time step is 1 and all others are 0.
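A dense action encoding can be sketched as below; the row-major cell numbering is our assumption:

```python
def encode_action_dense(cell, bucket):
    """16-long dense action encoding: one-hot row (6 entries), one-hot
    column (6 entries), one-hot bucket (4 entries). Assumes row-major
    numbering of the 36 cells."""
    row, col = divmod(cell, 6)
    return ([1 if r == row else 0 for r in range(6)]
            + [1 if c == col else 0 for c in range(6)]
            + [1 if b == bucket else 0 for b in range(4)])
```

Exactly three entries are set: the action's row, column, and bucket.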
A sparse board encoding is the 288-long representation of the entire past board, following the same encoding procedure as the current board. Given that board states necessarily change by only one game piece at a time, we define a dense board encoding to be the concatenation of:

• 1 4-long vector, corresponding to the length of the list of shapes. The vector is one-hot encoded at the index of the shape removed from the board in that time step.
• 1 4-long vector, corresponding to the length of the list of colors. The vector is one-hot encoded at the index of the color removed from the board in that time step.

In total, this is an 8-long boolean vector (with exactly two 1's and the remainder 0). Given the board information in the present time step and the action information, a model with a dense board encoding still retains the information needed to reconstruct complete board states.
Thus, we can summarize the five types of feature maps used in our testing, where n is the number of steps of memory (the full definitions are listed at the end of this appendix). We tested each of these feature maps with n = 2, 4, 6, 8 steps of memory, corresponding to 20 distinct featurizations. Featurizations range from 336 boolean inputs (BD-AD, n = 2) to 3936 boolean inputs (BSD-ASD, n = 8).
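The input-vector lengths quoted above follow directly from the per-step term sizes given in this appendix (8/288 boolean entries for dense/sparse board encodings, 16/144 for dense/sparse action encodings), which can be checked with a small helper:

```python
def featurization_length(board_repr, action_repr, n):
    """Total input length for a feature map with n steps of memory:
    288 entries for the current board plus per-step board and action terms.
    Term sizes follow the encodings described in this appendix."""
    board = {"dense": 8, "sparse": 288, "sparse-dense": 288 + 8}[board_repr]
    action = {"dense": 16, "sparse": 144, "sparse-dense": 144 + 16}[action_repr]
    return 288 + (board + action) * n
```

For example, BD-AD with n = 2 gives 288 + (8 + 16)·2 = 336 inputs, and BSD-ASD with n = 8 gives 288 + (296 + 160)·8 = 3936 inputs, matching the range stated above.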

A.5.3 DQN
Our DQN player, with pseudocode shown in Algorithm 1, follows the canonical DQN structure as presented in Mnih et al. [25], modified to include an action mask. First, as described in [41], the algorithm reduces value-to-go overestimation by maintaining a policy network (parameterized by θ_p) and a target network (parameterized by θ_t), which are periodically synchronized (every T iterations). The next action a′ used for calculation of the TD-0 target, y, is chosen as the argmax over the policy network, but the Q value used to calculate the TD-0 estimate uses the target network. Second, we employ an action mask that prevents the algorithm from selecting actions that are not valid for the current board.

Algorithm 2 REINFORCE
Input: featurization method ϕ, episode count M, episode move limit L, learning rate α.
  Initialize the network parameters θ with random weights
  for episode e = 0, ..., M do
    Draw initial state s_0 (i.e., board state b_0) from the game engine, map to ϕ(s_0)
    Initialize record of episode trajectory T, reset L to inputted value
    for time step t = 0, ..., L do
      Sample valid action a_t per policy π(ϕ(s_t))
      Take action a_t and observe reward r_t and board state b_{t+1}
      Append (ϕ(s_t), a_t, r_t) to T
      if board cleared then break
    end for
  end for

Our experiments provide a granular study of performance with respect to featurization methods; the GOHR enables a similarly granular study with respect to hyperparameter selections, but this was not our focus in this paper. Hyperparameters for our DQN and REINFORCE models can be found in Table 3.
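The masked sampling step ("Sample valid action a_t per policy") can be sketched as follows. This is an illustration of the masking idea only; the paper's models compute logits with neural networks:

```python
import math
import random

def sample_valid_action(logits, valid_mask, rng=random):
    """Sample an action from a softmax policy restricted to valid actions:
    invalid actions get zero weight before normalization. A sketch of the
    action-mask mechanism, not the paper's implementation."""
    weights = [math.exp(l) if ok else 0.0 for l, ok in zip(logits, valid_mask)]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0 and w > 0:
            return i
    # numerical fallback: return the last valid action
    return max(i for i, ok in enumerate(valid_mask) if ok)
```

With only one valid action in the mask, the sampler returns it deterministically, which is the degenerate case of the masking behavior.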

A.5.6 Convergence criteria
To account for algorithm stochasticity and give a fair comparison to humans (for whom a background rate of errors was expected and observed), we allow a deviation from error-free play for our RL players. We define a convergent learning run to be one with a move-based error rate of less than 0.0025 over the final 150 episodes of play. Given that boards have 9 pieces each, this amounts to the algorithm making 3 or fewer errors in the final 150 episodes of play. We note that DQN converged much faster than REINFORCE. This was likely due to a number of factors, such as the stochasticity in REINFORCE's action selection and the fact that DQN performs training updates after each move rather than after each episode (and benefits from past experiences through its experience replay buffer). While shape- and color-based rules showed the slowest convergence (with some runs failing to meet our criteria), play on most rules converged well before our chosen episode limits. Ultimately, any such threshold is inherently arbitrary; if the algorithms were permitted additional episodes of play, we could have enforced stricter convergence criteria. However, for the purposes of our sample experiments, we believe these conditions characterize the learning performance of our algorithms relative to humans well, and little insight would be gained from additional playtime (and associated stricter convergence conditions).
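The criterion above reduces to a simple check over per-episode error counts. This sketch treats the move denominator as window × pieces, which reproduces the "3 or fewer errors" figure (0.0025 × 150 × 9 = 3.375):

```python
def is_convergent(episode_errors, pieces=9, window=150, threshold=0.0025):
    """Check the convergence criterion: move-based error rate below
    `threshold` over the final `window` episodes. A sketch; the denominator
    convention (window * pieces) is our reading of the stated criterion."""
    if len(episode_errors) < window:
        return False
    errors = sum(episode_errors[-window:])
    return errors / (window * pieces) < threshold
```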

A.5.7 Compute resources
All experiments were performed on the University of Wisconsin-Madison's Center for High Throughput Computing cluster. Jobs were batched such that each learning run received 1 compute core (non-standardized hardware), 0.25-1.25 GB of RAM, and 100 MB of disk space. Learning run durations differed based on algorithm and featurization choice (along with variations in cluster resource availability), but each run generally completed in 0.5-3 hours. No GPU resources were used in our experiments.
A.6 Supplementary results

A.6.1 Color-based rule results
Human results for the Color Match (CM) rule, in context with the other base rules from our first experiment, can be found in Figure 5. Human results for the complete color-based generality family (CM, CM1F, CM2O) can be found in Figure 6. As noted in the body of the manuscript, these resemble the results from the shape family.

A.6.2 Rule generality RL results
Strip plots of TCE distributions for all rules involved in the generality comparisons included in our second experiment can be found in Figure 7 (DQN) and Figure 8 (REINFORCE). As noted in the body of the manuscript, we see uniformly better performance for more general rule variants (though we note that even longer runs would be needed for all shape/color rule learning runs to meet our convergence criteria, particularly for REINFORCE).

A.6.3 Base rule comparisons (all players)
As noted in the body of the manuscript, for our first experiment we gathered the m* and TCE distributions for all base rules and performed all possible pairwise rule comparisons (separately for each learner). Table 4 shows the results of two-sided Mann-Whitney U-tests for all rule pairs and learners (human, DQN, and REINFORCE), including Color Match (CM). Note that the DQN and REINFORCE results are grouped together for visual compactness (no comparisons were done between the TCE distributions for DQN and REINFORCE). We see one rule pair with significantly different performance for humans (QN-BLTR). All rule pairings, except SM-CM, show significantly different performance for both RL algorithms. Given the equivalent logical structure of SM and CM and their equivalent treatment in our featurizations, we would expect p-values for this comparison of approximately 0.5; we see values of 0.624 and 0.908 for the DQN and REINFORCE U-tests, respectively. This deviation is due to chance differences in the number of learning runs which meet our convergence criteria. Since non-convergent runs are assigned placeholder TCE values higher than those of all convergent runs, comparisons of these distributions are affected by the number of convergent runs for each. REINFORCE, in particular, would have needed an impractical number of additional episodes of play to meet convergence criteria for all runs. As a result, the strip plots in Figure 7 and Figure 8 provide a better view of the effectively identical performance of the RL algorithms on SM and CM, as we would expect.

A.6.4 Featurization results
As noted in Appendix A.5, our experiments included testing with 20 distinct input featurization methods, resulting from all combinations of our five feature maps (BD-AD, BS-AD, BD-AS, BS-AS, BSD-ASD) and four levels of memory (2 steps, 4 steps, 6 steps, 8 steps). We found that no featurization outperformed all others across all tested rules; different choices of feature map and memory led to tradeoffs in performance. We found two types of plots helpful in interpreting the effects of different featurizations: line plots of episodic median performance and strip plots of TCE. Plots showing episodic median performance gather the median cumulative error value at each episode across all learning runs and are useful for comparing convergence differences between rule-featurization pairs. Strip plots of TCE capture end-of-run behavior more succinctly, but lose resolution in describing convergence behavior (in particular among runs which do not meet our convergence criteria). To observe performance trends, we create both types of plots using a grid-style layout. Each row of plots corresponds to a family of rules (e.g., shape rules SM, SM1F, and SM2O) and each column corresponds to a number of steps of memory (2, 4, 6, or 8). Each plot within the grid shows performance (either line plots of median cumulative error or TCE strip plots) for all rules in that row's rule family under all feature maps using that column's amount of memory. Figure 9 shows DQN episodic median performance and Figure 10 shows strip plots of DQN TCE. Figures 11 and 12 similarly show REINFORCE episodic median performance and strip plots of TCE, respectively. These plots gather data from 20 learning runs per featurization-rule pair (for each algorithm). Note that we use abbreviated episode horizons (2000 for DQN and 20000 for REINFORCE) to reduce the computational burden of these experiments; the abbreviated horizons are sufficient to show relevant performance trends. We note that although we see large differences in performance between DQN and REINFORCE (measured by TCE), both algorithms show similar performance trends with respect to the tested featurizations. As such, we discuss these broad trends jointly rather than for DQN and REINFORCE individually.
As noted in the body of the manuscript, increasing the amount of memory in a featurization generally improves performance on non-stationary rules. The rule CW2F highlights how an algorithm must have sufficient memory to distinguish between meaningfully different observations (i.e., the rule must be identifiable); in CW2F two moves of the 0-1-*-* sequence are free, so the model must have more than 2 steps of memory in order to correctly learn from observed feedback. Gains from increasing memory appear to be diminishing, however, as CW2F illustrates. While we note a large improvement across feature maps from n = 2 to n = 4 steps of memory on CW2F, moving to n = 6 and n = 8 shows decreasing marginal performance gains. We see even earlier diminishing returns to increased memory, as expected, for rules which are identifiable with fewer steps of memory (such as the other clockwise rules or the alternating rules). Unfortunately, the performance gains on non-stationary rules from additional memory generally correspond to worsened performance on stationary rules. In particular, the shape and color families show progressively worse performance as memory is added to the input featurization. Interestingly, performance on QN and QN2F remains essentially static with respect to the choice of memory. This may suggest that, in contrast to the shape or color rules, the quadrant rules are so easily learned from the representation of the current board that the models are unaffected by the addition of largely useless historical board/action information.
In general, if the structure of the learning tasks an algorithm might encounter is not known a priori, the algorithm designer may need to choose an amount of memory based on their tolerance for the potential performance tradeoffs on tasks of different structure.
We also note interesting behavior among the five choices of feature maps (BD-AD, BS-AD, BD-AS, BS-AS, BSD-ASD) across our tested rules. In general, sparse board representations perform more poorly than dense representations (i.e., BD-AD tends to outperform BS-AD and BD-AS tends to outperform BS-AS). This follows our intuition, as the large number of additional features included in sparse board representations are generally not useful for our tested rules; we might see different behavior for rules with a stronger dependence on complete past board arrangements. Action representations, however, appear to bring more explicit performance tradeoffs for our tested rules. Dense action representations make the specific row, column, and bucket selection of a past move more readily available than sparse action representations. As might be expected, this improves performance on non-stationary tasks that explicitly depend on past bucket information, such as the clockwise and alternating families. For the color- and shape-based rules, though, we see some benefits to using sparse action representations over dense representations; BD-AS, for example, appears to show the most consistent convergence behavior across all featurizations for the shape/color rules, even as memory is increased. Performance on the stationary quadrant-based rules appears invariant to the chosen feature map, as was also the case with respect to the choice of memory. Again, we believe this is because the quadrant rules are so easily learned from the representation of the current board that performance is unaffected by changes to the representation of historical information. We expect future work investigating a broader set of task structures to elucidate the situational value of different feature maps. Importantly, we note that the combined feature map BSD-ASD, which contains both sparse and dense representations, does not always resemble the best-performing feature map (among BD-AD, BS-AD, BD-AS, BS-AS). While BSD-ASD performs well (though not best) for clockwise, quadrant, and alternating rules, it shows poor performance on the shape and color rules, similar to other feature maps with sparse board representations. While these results do not suggest a simple answer regarding featurization methods, they do highlight the importance of considering the structure of tasks that an algorithm might encounter when designing an algorithm's input featurization.

Figure 12: REINFORCE TCE strip plots, where each dot summarizes the TCE of a particular learning run (TCE given on the y-axis and corresponding rule given on the x-axis). Each row of plots has the same y-axis scaling and considers the same set of rules. Each column of plots considers featurizations with a different selection of memory. With memory fixed within each plot, dot color denotes the feature map used in that learning run. Each plot contains 20 learning runs for each rule-feature map pair. 'C.N.M.' denotes that the convergence criteria were not met for that learning run.

Figure 1: Game board diagram (left) and a sample board with four shapes and colors (right).

Figure 2: Sample learning runs for a human (left) and DQN (right), plotting cumulative error count against the move and episode indices, respectively. The human's learning run is summarized by an m* of 34, the first move of the first streak of 10+ correct moves. The DQN's learning run is summarized by a TCE of 305, the error count after 4000 episodes.
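The m* statistic in the caption above can be computed directly from a player's sequence of move outcomes. A minimal sketch, assuming 1-based move indexing:

```python
def m_star(correct, streak_len=10):
    """Return the 1-based index of the first move of the first streak of
    `streak_len` or more consecutive correct moves, or None if the
    player never achieves such a streak."""
    run_start, run = None, 0
    for i, ok in enumerate(correct):
        if ok:
            if run == 0:
                run_start = i  # candidate start of a new streak
            run += 1
            if run >= streak_len:
                return run_start + 1  # convert to 1-based move index
        else:
            run = 0
    return None
```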

Figure 3: Base rule performance of humans (left) and RL players (right). ECDF curves denote the fraction of human players achieving an m* streak by a given move index on each rule ('Never' indicates the player did not achieve such a streak). Strip plots of the TCE distributions for each rule are provided for DQN and REINFORCE, separated due to different TCE magnitudes ('C.N.M.' indicates convergence criteria were not met for that learning run). Each dot corresponds to a learning run.
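Each ECDF value plotted for the human players reduces to a simple fraction; a sketch, with `None` standing in for the 'Never' category:

```python
def ecdf_fraction(m_stars, move_index):
    """Fraction of players whose m* is at most `move_index`. Players who
    never achieve a streak (m* is None) count in the denominator but
    never in the numerator, so curves need not reach 1."""
    hits = sum(1 for m in m_stars if m is not None and m <= move_index)
    return hits / len(m_stars)
```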

Figure 4: ECDFs of m* distributions for each rule family. Stationary rules (shape/quadrant) are shown on the left; non-stationary rules (clockwise/alternating) are shown on the right. Base rules are shown in blue.

1. Board dense, action dense (BD-AD): The standard representation of the current board and dense representations of the past n board states and successful actions. Total of 288 + (8 + 16)n boolean inputs.
2. Board dense, action sparse (BD-AS): The standard representation of the current board, dense representations of the past n board states, and sparse representations of the past n successful actions. Total of 288 + (8 + 144)n boolean inputs.
3. Board sparse, action dense (BS-AD): The standard representation of the current board, sparse representations of the past n board states, and dense representations of the past n successful actions. Total of 288 + (288 + 16)n boolean inputs.
4. Board sparse, action sparse (BS-AS): The standard representation of the current board and sparse representations of the past n board states and successful actions. Total of 288 + (288 + 144)n boolean inputs.
5. Board sparse-dense, action sparse-dense (BSD-ASD): The standard representation of the current board and concatenations of both sparse and dense representations of the past n board states and actions. Total of 288 + (288 + 8 + 144 + 16)n boolean inputs.
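The per-step dimensions above can be tabulated to recover the total input length for any feature map and memory setting. A minimal sketch (constant names are ours, for illustration):

```python
# Dimensions from the featurization descriptions above.
BOARD = 288          # standard current-board representation
BD, BS = 8, 288      # dense / sparse past-board representation, per step
AD, AS = 16, 144     # dense / sparse past-action representation, per step

# Per-step history size contributed by each feature map.
FEATURE_MAPS = {
    "BD-AD": BD + AD,
    "BD-AS": BD + AS,
    "BS-AD": BS + AD,
    "BS-AS": BS + AS,
    "BSD-ASD": BS + BD + AS + AD,
}

def input_size(feature_map, memory):
    """Total boolean input length for a feature map with memory n."""
    return BOARD + FEATURE_MAPS[feature_map] * memory
```

For example, BD-AD with memory 2 yields 288 + 24 × 2 = 336 inputs, while BSD-ASD with memory 1 already yields 744.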

Figure 5: Empirical cumulative distribution plots denoting the fraction of players who achieved an m * streak by a given move index for all base rules (including CM).

Figure 6: Empirical cumulative distribution plots denoting the fraction of players who achieved an m * streak by a given move index for rules in the color family.

Figure 7: Strip plots for DQN TCE distributions across all rules. Each dot represents a learning run for the corresponding rule. 'C.N.M.' denotes that convergence criteria were not met for that learning run.

Figure 11: REINFORCE median learning curves, plotting median cumulative error versus episode. Each line summarizes the median performance of 20 learning runs for a specific rule (noted by the line color) using a specific feature map (noted by the line style). Each row of plots has the same y-axis scaling and considers the same set of rules. Each column of plots considers featurizations with a different selection of memory. Shaded regions denote 95% confidence intervals obtained from 500 bootstraps.
The * character is a wildcard. For the count field, * means the atom is always valid; for shape, color, or position, it means any value for that feature is permissible. A set of helpful rule examples, illustrating the expressiveness of the rule syntax, can be found in Appendix A.2.2. Broadly, rules within the GOHR can be divided into two categories: stationary and non-stationary. Stationary rules are those in which the rule state does not change during game play. In such rules, whether a move is permitted does not depend on the state of the board or past actions made by the player. These rules can be used to evaluate a player's ability to learn strictly feature-based patterns. The example noted above is stationary; the board state and move history do not impact where game pieces are permitted. In contrast, non-stationary rules are those in which the rule state changes during play, meaning that permitted moves will depend on the state of the board or past successful moves the player has made. Non-stationary rules embed temporal components into the logical pattern the player must learn. The rules presented in Appendix A.2.2 further describe the mechanics available to researchers in creating non-stationary rules. Importantly, we note that non-stationary rules update the rule state only after successful moves. The rule state is unchanged after an incorrect move.
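A minimal sketch of the wildcard semantics, checking a candidate move against a single rule atom. The dictionary representation and field handling here are illustrative only, not the GOHR's actual rule engine; in particular, we treat * uniformly across fields, whereas a * count has its own "always valid" semantics in the GOHR.

```python
FIELDS = ("count", "shape", "color", "position")

def atom_matches(atom, move):
    """Return True if `move` satisfies `atom`, where '*' in any field
    accepts every value for that feature (illustrative sketch)."""
    return all(atom.get(f, "*") == "*" or atom[f] == move[f]
               for f in FIELDS)
```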

Table 1: P-value results of one-sided U-tests comparing each more general rule to its base rule counterpart. Tests use m* and TCE as point metrics for humans and algorithms, respectively. The null hypothesis is always that the more general rule is no harder, with the alternative hypothesis that the more general rule is harder. Significant results at the α = 0.05 level are highlighted with a † and columns with contrasting HL/RL behavior are shown in bold.
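Tests of this form can be run with any standard Mann-Whitney U implementation (e.g., `scipy.stats.mannwhitneyu` with `alternative='greater'`). As a self-contained sketch, the following uses the normal approximation without tie correction, testing whether values in the second sample (the more general rule, where larger m* or TCE means harder) tend to be larger:

```python
import math

def mann_whitney_one_sided(x, y):
    """One-sided Mann-Whitney U test via the normal approximation
    (no tie correction). Small p-values support the alternative that
    values in `y` tend to be larger than values in `x`."""
    n1, n2 = len(x), len(y)
    # U counts (x_i, y_j) pairs with x_i < y_j; ties count 0.5.
    u = sum(1.0 if xi < yj else 0.5 if xi == yj else 0.0
            for xi in x for yj in y)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail p-value
```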

Table 2: Final count of human players for each rule in our experiments.
If you have any questions or concerns about this task, please contact the principal investigator: Prof. Vicki Bier at vicki.bier@wisc.edu. If you are not satisfied with the response of the research team, have more questions, or want to talk with someone about your rights as a research participant, you should contact the University of Wisconsin's Education Research and Social & Behavioral Science IRB Office at 608-263-2320.

Table 3: DQN and REINFORCE hyperparameters. Hyperparameters that are irrelevant for a given model are marked '-'.

Table 4: P-value results of two-sided U-tests comparing performance on the base rules presented in our first experiment. Each entry of the table compares the rule on the left to the rule given above (i.e., the top-left entry compares SM to CM). Results are separated for humans (left) and RL algorithms (right). Tests use m* and TCE as point metrics for HL and RL, respectively. Values <0.001 are denoted by * and significant results at the α = 0.05 level are highlighted with boldface. For RL players, significance results are given in the form (p-value for DQN / p-value for REINFORCE).