agentlab.experiments.study

Functions

`get_most_recent_study`([root_dir, ...])	Return the most recent directory based on the date in the folder name.
`make_study`(agent_args, benchmark[, ...])	Run a list of agents on a benchmark.
`set_demo_mode`(env_args_list)	Set the demo mode for the experiments.

Classes

`AbstractStudy`()	Abstract class for a study.
`ParallelStudies`(studies[, parallel_servers])
`ParallelStudies_alt`(studies[, parallel_servers])
`SequentialStudies`(studies)	Sequential execution of multiple studies.
`Study`([agent_args, benchmark, dir, suffix, ...])	A study corresponds to one or multiple agents evaluated on a benchmark.

class agentlab.experiments.study.AbstractStudy

Bases: ABC

Abstract class for a study.

dir: Path = None

abstract find_incomplete(include_errors=True): Prepare the study for relaunching by finding incomplete experiments

get_results(suffix='', also_save=True): Recursively load all results from the study directory and summarize them.

make_dir(exp_root=PosixPath('/home/docs/agentlab_results')): Create a directory for the study

abstract run(n_jobs=1, parallel_backend='ray', strict_reproducibility=False, n_relaunch=3): Run the study

save(exp_root=PosixPath('/home/docs/agentlab_results')): Pickle the study to the directory

shuffle_exps(): Shuffle the experiments in the study.

suffix: str = ''

class agentlab.experiments.study.ParallelStudies(studies: list[agentlab.experiments.study.Study], parallel_servers: list[agentlab.experiments.multi_server.BaseServer] | int = None)

Bases: SequentialStudies

parallel_servers: list[BaseServer] | int = None

class agentlab.experiments.study.ParallelStudies_alt(studies: list[agentlab.experiments.study.Study], parallel_servers: list[agentlab.experiments.multi_server.BaseServer] | int = None)

Bases: SequentialStudies

parallel_servers: list[BaseServer] | int = None

class agentlab.experiments.study.SequentialStudies(studies: list[Study])

Bases: AbstractStudy

Sequential execution of multiple studies.

This is needed, for example, for WebArena, where the server must be reset between evaluations of each agent.

append_to_journal(strict_reproducibility=True)

find_incomplete(include_errors=True): Prepare the study for relaunching by finding incomplete experiments

property name: The name of the study.

override_max_steps(max_steps)

run(n_jobs=1, parallel_backend='ray', strict_reproducibility=False, n_relaunch=3, exp_root=PosixPath('/home/docs/agentlab_results')): Run the study

studies: list[Study]

class agentlab.experiments.study.Study(agent_args: list[AgentArgs] = None, benchmark: Benchmark | str = None, dir: Path = None, suffix: str = '', uuid: str = None, reproducibility_info: dict = None, logging_level: int = 10, logging_level_stdout: int = 30, comment: str = None, ignore_dependencies: bool = False, avg_step_timeout: int = 60, demo_mode: bool = False)

Bases: AbstractStudy

A study corresponds to one or multiple agents evaluated on a benchmark.

This is part of the high level API to help keep experiments organized and reproducible.

agent_args

list[AgentArgs] The agent configuration(s) to run. IMPORTANT: these objects will be pickled and unpickled. Make sure they are imported from a package that is accessible from PYTHONPATH. Otherwise, they won’t load in agentlab-xray.

Type:: list[agentlab.agents.agent_args.AgentArgs]

benchmark

bgym.Benchmark | str The benchmark to run the agents on. See bgym.DEFAULT_BENCHMARKS for the main ones. You can also make your own by modifying an existing one.

Type:: browsergym.experiments.benchmark.base.Benchmark | str

dir

Path The directory where the study will be saved. If None, a directory will be created in RESULTS_DIR.

Type:: pathlib.Path

suffix

str A suffix to add to the study name. This can be useful to keep track of your experiments. By default the study name contains agent name, benchmark name and date.

Type:: str

uuid

str A unique identifier for the study. Will be generated automatically.

Type:: str

reproducibility_info

dict Information about the study that may affect the reproducibility of the experiment. e.g.: versions of BrowserGym, benchmark, AgentLab…

Type:: dict

logging_level

int The logging level for individual jobs.

Type:: int

logging_level_stdout

int The logging level for the stdout of the main script. Each job will have its own logging level that will save into file and can be seen in agentlab-xray.

Type:: int

comment

str Extra comments from the authors of this study to be stored in the reproducibility information. Leave any extra information that can explain why results could be different than expected.

Type:: str

ignore_dependencies

bool If True, ignore the dependencies of the tasks in the benchmark. Use with caution. So far, only WebArena and VisualWebArena have dependencies between tasks, which minimize the influence of solving one task before another. This dependency graph allows experiments to run in parallel while respecting task dependencies. However, it still can’t run more than 4 at once and, in practice, it speeds up evaluation by a factor of only about 3x compared to sequential execution. To accelerate execution, you can ignore dependencies and run fully in parallel. This leads to a decrease in performance of about 1%-2%, and could be more. Note: ignore_dependencies doesn’t work on VisualWebArena.

Type:: bool
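
The trade-off described for ignore_dependencies can be illustrated with a small standard-library sketch. It is not AgentLab's scheduler: the task names and the `parallel_batches` helper are hypothetical, and the dependency graph is made up for the example. It only shows how a dependency graph limits how many tasks can run at once, and how dropping the dependencies puts every task in a single parallel batch.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks within one batch can run in parallel.

    `dependencies` maps each task to the set of tasks that must finish first.
    (Hypothetical helper, not part of AgentLab.)
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose prerequisites are done
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


# Made-up dependency graph: task_d needs task_b and task_c, which need task_a.
deps = {
    "task_b": {"task_a"},
    "task_c": {"task_a"},
    "task_d": {"task_b", "task_c"},
    "task_e": set(),
}
batches = parallel_batches(deps)
print(batches)  # [['task_a', 'task_e'], ['task_b', 'task_c'], ['task_d']]

# Ignoring dependencies: every task is ready immediately, so one batch.
all_tasks = sorted({"task_a", *deps})
print(parallel_batches({task: set() for task in all_tasks}))
```

With dependencies, at most two tasks of this toy graph run at once; without them, all five can start together, at the cost of tasks possibly influencing one another.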

avg_step_timeout

int The average step timeout in seconds. This is used to stop the experiments if they are taking too long. The default is 60 seconds.

Type:: int

demo_mode

bool If True, the experiments will be run in demo mode, which will record videos, and enable visual effects for actions.

Type:: bool

agent_args: list[AgentArgs] = None

append_to_journal(strict_reproducibility=True)

Append the study to the journal.

Parameters:: strict_reproducibility – bool If True, incomplete experiments will raise an error.

avg_step_timeout: int = 60

benchmark: Benchmark | str = None

comment: str = None

demo_mode: bool = False

dir: Path = None

find_incomplete(include_errors=True)

Find incomplete or errored experiments in the study directory for relaunching.

Parameters:

include_errors – bool If True, include errored experiments in the list.

Returns:

The list of all experiments, with completed ones replaced by a dummy exp_args to keep the task dependencies.

Return type:

list[ExpArgs]

ignore_dependencies: bool = False

static load(dir: Path) → Study

load_exp_args_list()

static load_most_recent(root_dir: Path = None, contains=None) → Study

logging_level: int = 10

logging_level_stdout: int = 30

make_exp_args_list(): Generate the exp_args_list from the agent_args and the benchmark.

property name

override_max_steps(max_steps)

reproducibility_info: dict = None

run(n_jobs=1, parallel_backend='ray', strict_reproducibility=False, n_relaunch=3, relaunch_errors=True, exp_root=PosixPath('/home/docs/agentlab_results')): Run the study
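
How n_relaunch and find_incomplete might interact can be sketched as a simple retry loop. This is an assumption about the general pattern, not AgentLab's implementation: `run_with_relaunch`, `flaky_run`, and the experiment ids are hypothetical, and the sketch treats n_relaunch as the maximum number of passes over the pending experiments.

```python
from typing import Callable


def run_with_relaunch(
    exp_ids: list[str],
    run_one: Callable[[str], bool],
    n_relaunch: int = 3,
) -> list[str]:
    """Run experiments, relaunching incomplete ones for up to n_relaunch passes.

    Returns the ids that are still incomplete at the end.
    (Hypothetical helper, not part of AgentLab.)
    """
    pending = list(exp_ids)
    for _ in range(n_relaunch):
        if not pending:
            break  # everything completed, nothing to relaunch
        # Keep only the experiments that did not complete in this pass.
        pending = [exp_id for exp_id in pending if not run_one(exp_id)]
    return pending


# Hypothetical runner: "exp_2" fails on its first attempt, then succeeds.
attempts: dict[str, int] = {}


def flaky_run(exp_id: str) -> bool:
    attempts[exp_id] = attempts.get(exp_id, 0) + 1
    return exp_id != "exp_2" or attempts[exp_id] >= 2


remaining = run_with_relaunch(["exp_1", "exp_2", "exp_3"], flaky_run)
print(remaining)  # []
print(attempts)   # {'exp_1': 1, 'exp_2': 2, 'exp_3': 1}
```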

set_reproducibility_info(strict_reproducibility=False, comment=None)

Gather relevant information that may affect the reproducibility of the experiment

e.g.: versions of BrowserGym, benchmark, AgentLab…

Parameters:

strict_reproducibility – bool If True, all modifications have to be committed before running the experiments. Also, if relaunching a study, it will not be possible if the code has changed.
comment – str Extra comment to add to the reproducibility information.

suffix: str = ''

uuid: str = None

agentlab.experiments.study.get_most_recent_study(root_dir: Path = None, date_format: str = '%Y-%m-%d_%H-%M-%S', contains=None)

Return the most recent directory based on the date in the folder name.

Parameters:

root_dir – The directory to search in
date_format – The format of the date in the folder name
contains – If not None, only consider folders whose name contains this string

Returns:

The most recent folder satisfying the conditions

Return type:

Path
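
The selection rule described above can be sketched with the standard library alone. This is an illustration, not the function's actual code: `most_recent_dir` is a hypothetical name, the folder names are made up, and the sketch assumes the timestamp is the leading part of the folder name (the first two underscore-separated tokens under the default date_format).

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def most_recent_dir(
    root_dir: Path,
    date_format: str = "%Y-%m-%d_%H-%M-%S",
    contains: Optional[str] = None,
) -> Optional[Path]:
    """Return the sub-directory whose name carries the latest timestamp.

    (Hypothetical helper, not part of AgentLab.)
    """
    best_dir, best_date = None, None
    for item in root_dir.iterdir():
        if not item.is_dir():
            continue
        if contains is not None and contains not in item.name:
            continue
        # Assumption: the name starts with "<date>_<time>_...".
        stamp = "_".join(item.name.split("_")[:2])
        try:
            date = datetime.strptime(stamp, date_format)
        except ValueError:
            continue  # skip folders that do not start with a timestamp
        if best_date is None or date > best_date:
            best_dir, best_date = item, date
    return best_dir


# Usage with made-up study folders in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in (
        "2024-05-01_12-00-00_agentA_on_bench",
        "2024-05-02_09-30-00_agentB_on_bench",
        "notes",
    ):
        (root / name).mkdir()
    print(most_recent_dir(root).name)
    print(most_recent_dir(root, contains="agentA").name)
```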

agentlab.experiments.study.make_study(agent_args: list[AgentArgs] | AgentArgs, benchmark: Benchmark | str, logging_level_stdout=30, suffix='', comment=None, ignore_dependencies=False, parallel_servers=None)

Run a list of agents on a benchmark.

Parameters:

agent_args – list[AgentArgs] | AgentArgs The agent configuration(s) to run. IMPORTANT: these objects will be pickled and unpickled. Make sure they are imported from a package that is accessible from PYTHONPATH. Otherwise, they won’t load in agentlab-xray.
benchmark – bgym.Benchmark | str The benchmark to run the agents on. See bgym.DEFAULT_BENCHMARKS for the main ones. You can also make your own by modifying an existing one.
logging_level_stdout – int The logging level for the stdout of the main script. Each job will have its own logging level that will save into file and can be seen in agentlab-xray.
suffix – str A suffix to add to the study name. This can be useful to keep track of your experiments. By default the study name contains agent name, benchmark name and date.
comment – str Extra comments from the authors of this study to be stored in the reproducibility information. Leave any extra information that can explain why results could be different than expected.
ignore_dependencies – bool If True, ignore the dependencies of the tasks in the benchmark. Use with caution. So far, only WebArena and VisualWebArena have dependencies between tasks, which minimize the influence of solving one task before another. This dependency graph allows experiments to run in parallel while respecting task dependencies. However, it still can’t run more than 4 at once and, in practice, it speeds up evaluation by a factor of only about 3x compared to sequential execution. To accelerate execution, you can ignore dependencies and run fully in parallel. This leads to a decrease in performance of about 1%-2%, and could be more. Note: ignore_dependencies doesn’t work on VisualWebArena.
parallel_servers – list[WebArenaInstanceVars] The parallel servers to use if “webarena” in benchmark.name. Use this to dispatch agent_args on a pool of servers in parallel. If len(agent_args) > len(parallel_servers), each server is reused for the next evaluation (with a reset) as soon as its current evaluation is done.

Returns:

Study | SequentialStudies | ParallelStudies object.

SequentialStudies: if the benchmark requires a manual reset after each evaluation, such as WebArena and VisualWebArena.

ParallelStudies: if the benchmark has multiple servers to run in parallel.

Study: otherwise.
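
The server-reuse behaviour described for parallel_servers can be sketched as a worker pool drawing from a queue of free servers. This is a generic illustration under stated assumptions, not AgentLab's dispatch code: `dispatch_on_servers`, the agent names, and the server names are all hypothetical, and the reset and evaluation steps are left as a comment.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def dispatch_on_servers(agent_names: list[str], servers: list[str]) -> list[tuple[str, str]]:
    """Evaluate each agent on whichever server becomes free first.

    Returns (agent_name, server) pairs in the order of agent_names.
    (Hypothetical helper, not part of AgentLab.)
    """
    free_servers: queue.Queue = queue.Queue()
    for server in servers:
        free_servers.put(server)

    def evaluate(agent_name: str) -> tuple[str, str]:
        server = free_servers.get()  # blocks until a server is free
        try:
            # A real run would reset the server and evaluate the agent here.
            return agent_name, server
        finally:
            free_servers.put(server)  # hand the server back for reuse

    # One worker per server, so no more evaluations run than servers exist.
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        return list(pool.map(evaluate, agent_names))


assignments = dispatch_on_servers(
    ["agent_1", "agent_2", "agent_3"], ["server_a", "server_b"]
)
for agent_name, server in assignments:
    print(f"{agent_name} ran on {server}")
```

With three agents and two servers, the third evaluation waits until one of the two servers is returned to the queue; which server it gets depends on timing.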

agentlab.experiments.study.set_demo_mode(env_args_list: list[EnvArgs]): Set the demo mode for the experiments. This can be useful for generating videos for demos.