agentlab.experiments.study

Functions

get_most_recent_study([root_dir, ...])

Return the most recent directory based on the date in the folder name.

make_study(agent_args, benchmark[, ...])

Run a list of agents on a benchmark.

set_demo_mode(env_args_list)

Set the demo mode for the experiments.

Classes

AbstractStudy()

Abstract class for a study.

ParallelStudies(studies[, parallel_servers])

ParallelStudies_alt(studies[, parallel_servers])

SequentialStudies(studies)

Sequential execution of multiple studies.

Study([agent_args, benchmark, dir, suffix, ...])

A study corresponds to one or multiple agents evaluated on a benchmark.

class agentlab.experiments.study.AbstractStudy

Bases: ABC

Abstract class for a study.

dir: Path = None
abstract find_incomplete(include_errors=True)

Prepare the study for relaunching by finding incomplete experiments

get_results(suffix='', also_save=True)

Recursively load all results from the study directory and summarize them.

make_dir(exp_root=PosixPath('/home/docs/agentlab_results'))

Create a directory for the study

abstract run(n_jobs=1, parallel_backend='ray', strict_reproducibility=False, n_relaunch=3)

Run the study

save(exp_root=PosixPath('/home/docs/agentlab_results'))

Pickle the study to the directory

shuffle_exps()

Shuffle the experiments in the study.

suffix: str = ''
class agentlab.experiments.study.ParallelStudies(studies: list[agentlab.experiments.study.Study], parallel_servers: list[agentlab.experiments.multi_server.BaseServer] | int = None)

Bases: SequentialStudies

parallel_servers: list[BaseServer] | int = None
class agentlab.experiments.study.ParallelStudies_alt(studies: list[agentlab.experiments.study.Study], parallel_servers: list[agentlab.experiments.multi_server.BaseServer] | int = None)

Bases: SequentialStudies

parallel_servers: list[BaseServer] | int = None
class agentlab.experiments.study.SequentialStudies(studies: list[Study])

Bases: AbstractStudy

Sequential execution of multiple studies.

This is required for e.g. WebArena, where a server reset is required between evaluations of each agent.

append_to_journal(strict_reproducibility=True)
find_incomplete(include_errors=True)

Prepare the study for relaunching by finding incomplete experiments

property name

The name of the study.

override_max_steps(max_steps)
run(n_jobs=1, parallel_backend='ray', strict_reproducibility=False, n_relaunch=3, exp_root=PosixPath('/home/docs/agentlab_results'))

Run the study

studies: list[Study]
class agentlab.experiments.study.Study(agent_args: list[AgentArgs] = None, benchmark: Benchmark | str = None, dir: Path = None, suffix: str = '', uuid: str = None, reproducibility_info: dict = None, logging_level: int = 10, logging_level_stdout: int = 30, comment: str = None, ignore_dependencies: bool = False, avg_step_timeout: int = 60, demo_mode: bool = False)

Bases: AbstractStudy

A study corresponds to one or multiple agents evaluated on a benchmark.

This is part of the high level API to help keep experiments organized and reproducible.

agent_args

list[AgentArgs] The agent configuration(s) to run. IMPORTANT: these objects will be pickled and unpickled. Make sure they are imported from a package that is accessible from PYTHONPATH. Otherwise, it won’t load in agentlab-xray.

Type:

list[agentlab.agents.agent_args.AgentArgs]
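
The pickling caveat for agent_args follows from how Python's pickle module stores objects: it records the import path of the class rather than the class body. The following is a minimal standard-library sketch of that behavior. It uses fractions.Fraction as a stand-in for an agent configuration object, and the helper name class_reference_in_pickle is hypothetical, not part of AgentLab.

```python
import pickle
from fractions import Fraction


def class_reference_in_pickle(obj) -> bool:
    """Return True if the pickle payload names the object's module and class."""
    payload = pickle.dumps(obj)
    cls = type(obj)
    # pickle stores the module name and qualified class name so that the
    # class can be re-imported when the payload is loaded.
    return (
        cls.__module__.encode() in payload
        and cls.__qualname__.encode() in payload
    )


stand_in = Fraction(3, 4)  # stand-in for an agent configuration object
print(class_reference_in_pickle(stand_in))

# Unpickling re-imports fractions.Fraction by name; this only works because
# the fractions module is importable in the loading process.
restored = pickle.loads(pickle.dumps(stand_in))
print(restored == stand_in)
```

A class defined in a notebook cell or a throwaway script lives in __main__, so a separate process that loads the pickle may be unable to resolve that name.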

benchmark

bgym.Benchmark | str The benchmark to run the agents on. See bgym.DEFAULT_BENCHMARKS for the main ones. You can also make your own by modifying an existing one.

Type:

browsergym.experiments.benchmark.base.Benchmark | str

dir

Path The directory where the study will be saved. If None, a directory will be created in RESULTS_DIR.

Type:

pathlib.Path

suffix

str A suffix to add to the study name. This can be useful to keep track of your experiments. By default the study name contains agent name, benchmark name and date.

Type:

str

uuid

str A unique identifier for the study. Will be generated automatically.

Type:

str

reproducibility_info

dict Information about the study that may affect the reproducibility of the experiment. e.g.: versions of BrowserGym, benchmark, AgentLab…

Type:

dict

logging_level

int The logging level for individual jobs.

Type:

int

logging_level_stdout

int The logging level for the stdout of the main script. Each job will have its own logging level that will save into file and can be seen in agentlab-xray.

Type:

int
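
Assuming the integer defaults are standard Python logging levels (the page does not say so explicitly), the defaults 10 and 30 can be translated to their names with the standard library. The helper name describe_level is hypothetical.

```python
import logging


def describe_level(level: int) -> str:
    """Map a numeric logging level to its standard name."""
    return logging.getLevelName(level)


# Default values taken from the Study signature on this page.
for name, level in [("logging_level", 10), ("logging_level_stdout", 30)]:
    print(f"{name}={level} -> {describe_level(level)}")
```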

comment

str Extra comments from the authors of this study to be stored in the reproducibility information. Leave any extra information that can explain why results could be different than expected.

Type:

str

ignore_dependencies

bool If True, ignore the dependencies of the tasks in the benchmark. Use with caution. So far, only WebArena and VisualWebArena have dependencies between tasks, which minimize the influence of solving one task before another one. This dependency graph allows experiments to run in parallel while respecting task dependencies. However, it still can’t run more than 4 experiments at once and, in practice, it speeds up evaluation by a factor of only 3x compared to sequential execution. To accelerate execution, you can ignore dependencies and run fully in parallel. This leads to a decrease in performance of about 1%-2%, and possibly more. Note: ignore_dependencies doesn’t work on VisualWebArena.

Type:

bool
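
To illustrate how a dependency graph limits parallelism, here is a minimal sketch using the standard library's graphlib. It is not AgentLab's scheduler: the task names, the graph, and the helper parallel_batches are hypothetical, and tasks are simply grouped into batches whose members could run at the same time.

```python
from graphlib import TopologicalSorter


def parallel_batches(depends_on: dict) -> list:
    """Group tasks into batches; every task in a batch has all of its
    dependencies satisfied by earlier batches."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that can run right now
        batches.append(ready)
        sorter.done(*ready)  # mark them finished to unlock their dependents
    return batches


# Hypothetical graph: each key maps to the set of tasks it depends on.
with_deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(parallel_batches(with_deps))

# Ignoring dependencies amounts to an empty dependency set for every task,
# so everything lands in a single batch.
no_deps = {task: set() for task in with_deps}
print(parallel_batches(no_deps))
```

With dependencies, the width of each batch caps how many experiments can run concurrently; without them, the only cap is the number of workers.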

avg_step_timeout

int The average step timeout in seconds. This is used to stop the experiments if they are taking too long. The default is 60 seconds.

Type:

int

demo_mode

bool If True, the experiments will be run in demo mode, which will record videos, and enable visual effects for actions.

Type:

bool

agent_args: list[AgentArgs] = None
append_to_journal(strict_reproducibility=True)

Append the study to the journal.

Parameters:

strict_reproducibility – bool If True, incomplete experiments will raise an error.

avg_step_timeout: int = 60
benchmark: Benchmark | str = None
comment: str = None
demo_mode: bool = False
dir: Path = None
find_incomplete(include_errors=True)

Find incomplete or errored experiments in the study directory for relaunching.

Parameters:

include_errors – bool If True, include errored experiments in the list.

Returns:

The list of all experiments, with completed ones replaced by a dummy exp_args to keep the task dependencies.

Return type:

list[ExpArgs]
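
The return value described above can be pictured with a small sketch. It is an illustration under stated assumptions, not the library's implementation: ToyExp is a hypothetical stand-in for ExpArgs, the status strings are invented, and errored experiments are assumed to count as finished when include_errors is False.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyExp:
    """Hypothetical stand-in for ExpArgs; not the AgentLab class."""
    task: str
    status: str  # "done", "error", or "pending" (invented labels)
    is_dummy: bool = False


def find_incomplete_sketch(exps, include_errors=True):
    """Return one entry per experiment, keeping list length and order."""
    to_relaunch = {"pending", "error"} if include_errors else {"pending"}
    result = []
    for exp in exps:
        if exp.status in to_relaunch:
            result.append(exp)
        else:
            # Keep a placeholder so tasks depending on this one still see it.
            result.append(replace(exp, is_dummy=True))
    return result


exps = [ToyExp("t1", "done"), ToyExp("t2", "error"), ToyExp("t3", "pending")]
for exp in find_incomplete_sketch(exps):
    print(exp)
```

Keeping a placeholder for each finished experiment preserves the shape of the list, so any dependency between tasks can still be resolved on relaunch.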

ignore_dependencies: bool = False
static load(dir: Path) → Study
load_exp_args_list()
static load_most_recent(root_dir: Path = None, contains=None) → Study
logging_level: int = 10
logging_level_stdout: int = 30
make_exp_args_list()

Generate the exp_args_list from the agent_args and the benchmark.

property name
override_max_steps(max_steps)
reproducibility_info: dict = None
run(n_jobs=1, parallel_backend='ray', strict_reproducibility=False, n_relaunch=3, relaunch_errors=True, exp_root=PosixPath('/home/docs/agentlab_results'))

Run the study

set_reproducibility_info(strict_reproducibility=False, comment=None)

Gather relevant information that may affect the reproducibility of the experiment

e.g.: versions of BrowserGym, benchmark, AgentLab…

Parameters:
  • strict_reproducibility – bool If True, all modifications have to be committed before running the experiments. Relaunching a study is also refused if the code has changed.

  • comment – str Extra comment to add to the reproducibility information.

suffix: str = ''
uuid: str = None
agentlab.experiments.study.get_most_recent_study(root_dir: Path = None, date_format: str = '%Y-%m-%d_%H-%M-%S', contains=None)

Return the most recent directory based on the date in the folder name.

Parameters:
  • root_dir – The directory to search in

  • date_format – The format of the date in the folder name

  • contains – If not None, only consider folders that contain this string

Returns:

The most recent folder satisfying the conditions

Return type:

Path
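
A date-based lookup like the one described above could be sketched as follows. This is not the library's code: the helper most_recent_dir is hypothetical, and it assumes the date is a fixed-width prefix of the folder name, which holds for the default date_format but not for formats with variable-width fields.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def most_recent_dir(
    root_dir: Path,
    date_format: str = "%Y-%m-%d_%H-%M-%S",
    contains: Optional[str] = None,
) -> Optional[Path]:
    """Return the sub-directory whose name starts with the latest date."""
    # Width of a formatted date, e.g. 19 characters for the default format.
    prefix_len = len(datetime(2000, 1, 1).strftime(date_format))
    best_date, best_dir = None, None
    for item in root_dir.iterdir():
        if not item.is_dir():
            continue
        if contains is not None and contains not in item.name:
            continue
        try:
            folder_date = datetime.strptime(item.name[:prefix_len], date_format)
        except ValueError:
            continue  # the folder name does not start with a valid date
        if best_date is None or folder_date > best_date:
            best_date, best_dir = folder_date, item
    return best_dir
```

Folders whose names do not parse as dates are skipped rather than treated as errors, so unrelated directories can sit next to study folders.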

agentlab.experiments.study.make_study(agent_args: list[AgentArgs] | AgentArgs, benchmark: Benchmark | str, logging_level_stdout=30, suffix='', comment=None, ignore_dependencies=False, parallel_servers=None)

Run a list of agents on a benchmark.

Parameters:
  • agent_args – list[AgentArgs] | AgentArgs The agent configuration(s) to run. IMPORTANT: these objects will be pickled and unpickled. Make sure they are imported from a package that is accessible from PYTHONPATH. Otherwise, it won’t load in agentlab-xray.

  • benchmark – bgym.Benchmark | str The benchmark to run the agents on. See bgym.DEFAULT_BENCHMARKS for the main ones. You can also make your own by modifying an existing one.

  • logging_level_stdout – int The logging level for the stdout of the main script. Each job will have its own logging level that will save into file and can be seen in agentlab-xray.

  • suffix – str A suffix to add to the study name. This can be useful to keep track of your experiments. By default the study name contains agent name, benchmark name and date.

  • comment – str Extra comments from the authors of this study to be stored in the reproducibility information. Leave any extra information that can explain why results could be different than expected.

  • ignore_dependencies – bool If True, ignore the dependencies of the tasks in the benchmark. Use with caution. So far, only WebArena and VisualWebArena have dependencies between tasks, which minimize the influence of solving one task before another one. This dependency graph allows experiments to run in parallel while respecting task dependencies. However, it still can’t run more than 4 experiments at once and, in practice, it speeds up evaluation by a factor of only 3x compared to sequential execution. To accelerate execution, you can ignore dependencies and run fully in parallel. This leads to a decrease in performance of about 1%-2%, and possibly more. Note: ignore_dependencies doesn’t work on VisualWebArena.

  • parallel_servers – list[WebArenaInstanceVars] The parallel servers to use if “webarena” in benchmark.name. Use this to dispatch agent_args on a pool of servers in parallel. If len(agent_args) > len(parallel_servers), each server is reused for the next evaluation (after a reset) as soon as its current evaluation is done.

Returns:

A Study, SequentialStudies, or ParallelStudies object:

  • SequentialStudies – if the benchmark requires a manual reset after each evaluation, such as WebArena and VisualWebArena.

  • ParallelStudies – if the benchmark has multiple servers to run in parallel.

  • Study – otherwise.
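
The server reuse described for parallel_servers can be pictured as a pool of free servers that each evaluation borrows and then returns. The sketch below uses only the standard library and is not how AgentLab implements it: run_on_server_pool, the server names, and the run_one callback are all hypothetical.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_on_server_pool(studies, servers, run_one):
    """Run every study on some server, never two studies on one server at once."""
    free_servers = queue.Queue()
    for server in servers:
        free_servers.put(server)

    def worker(study):
        server = free_servers.get()  # blocks until a server is free
        try:
            # A real setup would reset the server here before evaluating.
            return study, server, run_one(study, server)
        finally:
            free_servers.put(server)  # hand the server to the next study

    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        # pool.map preserves the order of the input studies.
        return list(pool.map(worker, studies))


results = run_on_server_pool(
    ["agent_a", "agent_b", "agent_c"],
    ["server_1", "server_2"],
    lambda study, server: f"{study} ran on {server}",
)
for study, server, outcome in results:
    print(outcome)
```

With three studies and two servers, the third study waits until one of the first two finishes and releases its server.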

agentlab.experiments.study.set_demo_mode(env_args_list: list[EnvArgs])

Set the demo mode for the experiments. This can be useful for generating videos for demos.