agentlab.experiments.study
Functions
- get_most_recent_study: Return the most recent directory based on the date in the folder name.
- make_study: Run a list of agents on a benchmark.
- Set the demo mode for the experiments.
Classes
- AbstractStudy: Abstract class for a study.
- ParallelStudies
- ParallelStudies_alt
- SequentialStudies: Sequential execution of multiple studies.
- Study: A study corresponds to one or multiple agents evaluated on a benchmark.
- class agentlab.experiments.study.AbstractStudy
Bases: ABC
Abstract class for a study.
- abstract find_incomplete(include_errors=True)
Prepare the study for relaunching by finding incomplete experiments
- get_results(suffix='', also_save=True)
Recursively load all results from the study directory and summarize them.
- make_dir(exp_root=PosixPath('/home/docs/agentlab_results'))
Create a directory for the study
- abstract run(n_jobs=1, parallel_backend='ray', strict_reproducibility=False, n_relaunch=3)
Run the study
- save(exp_root=PosixPath('/home/docs/agentlab_results'))
Pickle the study to the directory
- shuffle_exps()
Shuffle the experiments in the study.
- class agentlab.experiments.study.ParallelStudies(studies: list[agentlab.experiments.study.Study], parallel_servers: list[agentlab.experiments.multi_server.BaseServer] | int = None)
Bases: SequentialStudies
- parallel_servers: list[BaseServer] | int = None
- class agentlab.experiments.study.ParallelStudies_alt(studies: list[agentlab.experiments.study.Study], parallel_servers: list[agentlab.experiments.multi_server.BaseServer] | int = None)
Bases: SequentialStudies
- parallel_servers: list[BaseServer] | int = None
- class agentlab.experiments.study.SequentialStudies(studies: list[Study])
Bases: AbstractStudy
Sequential execution of multiple studies.
This is required for e.g. WebArena, where a server reset is required between evaluations of each agent.
- append_to_journal(strict_reproducibility=True)
- find_incomplete(include_errors=True)
Prepare the study for relaunching by finding incomplete experiments
- property name
The name of the study.
- override_max_steps(max_steps)
- run(n_jobs=1, parallel_backend='ray', strict_reproducibility=False, n_relaunch=3, exp_root=PosixPath('/home/docs/agentlab_results'))
Run the study
- class agentlab.experiments.study.Study(agent_args: list[AgentArgs] = None, benchmark: Benchmark | str = None, dir: Path = None, suffix: str = '', uuid: str = None, reproducibility_info: dict = None, logging_level: int = 10, logging_level_stdout: int = 30, comment: str = None, ignore_dependencies: bool = False, avg_step_timeout: int = 60, demo_mode: bool = False)
Bases: AbstractStudy
A study corresponds to one or multiple agents evaluated on a benchmark.
This is part of the high level API to help keep experiments organized and reproducible.
- agent_args
The agent configuration(s) to run. IMPORTANT: these objects will be pickled and unpickled. Make sure they are imported from a package that is accessible from PYTHONPATH. Otherwise, they won’t load in agentlab-xray.
- Type:
list[AgentArgs]
- benchmark
The benchmark to run the agents on. See bgym.DEFAULT_BENCHMARKS for the main ones. You can also make your own by modifying an existing one.
- Type:
browsergym.experiments.benchmark.base.Benchmark | str
- dir
The directory where the study will be saved. If None, a directory will be created in RESULTS_DIR.
- Type:
Path
- suffix
A suffix to add to the study name. This can be useful to keep track of your experiments. By default, the study name contains the agent name, the benchmark name and the date.
- Type:
str
- reproducibility_info
Information about the study that may affect the reproducibility of the experiment, e.g. versions of BrowserGym, benchmark, AgentLab…
- Type:
dict
- logging_level_stdout
The logging level for the stdout of the main script. Each job has its own logging level, which is saved to a file and can be seen in agentlab-xray.
- Type:
int
- comment
Extra comments from the authors of this study, stored in the reproducibility information. Leave any extra information that can explain why results could be different than expected.
- Type:
str
- ignore_dependencies
If True, ignore the dependencies of the tasks in the benchmark. Use with caution. So far, only WebArena and VisualWebArena have dependencies between tasks, which minimize the influence of solving one task before another one. This dependency graph allows experiments to run in parallel while respecting task dependencies. However, it still can’t run more than 4 in parallel and, in practice, it speeds up evaluation by a factor of only 3x compared to sequential execution. To accelerate execution, you can ignore dependencies and run in full parallel. This leads to a decrease in performance of about 1%-2%, and could be more. Note: ignore_dependencies on VisualWebArena doesn’t work.
- Type:
bool
- avg_step_timeout
The average step timeout in seconds. This is used to stop the experiments if they are taking too long. The default is 60 seconds.
- Type:
int
- demo_mode
If True, the experiments will be run in demo mode, which records videos and enables visual effects for actions.
- Type:
bool
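The trade-off described for ignore_dependencies can be illustrated with a small scheduling sketch. It is not AgentLab's scheduler: `schedule_in_waves`, the task names and the default cap of 4 are hypothetical, chosen only to show how a dependency graph limits parallelism and what changes when the edges are dropped.

```python
from graphlib import TopologicalSorter


def schedule_in_waves(dependencies, max_parallel=4, ignore_dependencies=False):
    """Group tasks into waves of at most `max_parallel` tasks.

    `dependencies` maps a task name to the set of tasks that must finish first.
    Hypothetical helper for illustration; not part of AgentLab.
    """
    if ignore_dependencies:
        # Keep every task but drop all edges, so everything is ready at once.
        all_tasks = set(dependencies)
        for deps in dependencies.values():
            all_tasks |= set(deps)
        dependencies = {task: set() for task in all_tasks}

    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    waves, pending = [], []
    while sorter.is_active():
        # Tasks whose prerequisites are all done join the queue.
        pending.extend(sorted(sorter.get_ready()))
        wave, pending = pending[:max_parallel], pending[max_parallel:]
        waves.append(wave)
        sorter.done(*wave)  # pretend the whole wave finished
    return waves


deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}, "e": set()}
with_deps = schedule_in_waves(deps)
without_deps = schedule_in_waves(deps, max_parallel=5, ignore_dependencies=True)
```

With the dependency edges in place, the five tasks need three waves; once the edges are ignored, they fit in a single wave, at the cost of no longer controlling which task runs before another.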
- append_to_journal(strict_reproducibility=True)
Append the study to the journal.
- Parameters:
strict_reproducibility (bool) – If True, incomplete experiments will raise an error.
- find_incomplete(include_errors=True)
Find incomplete or errored experiments in the study directory for relaunching.
- Parameters:
include_errors (bool) – If True, include errored experiments in the list.
- Returns:
The list of all experiments, with completed ones replaced by a dummy exp_args to keep the task dependencies.
- Return type:
list[ExpArgs]
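The return value described above can be pictured with a minimal sketch. `ExpStub` and `mark_for_relaunch` are hypothetical stand-ins, not AgentLab's ExpArgs or the actual implementation; the status strings are also assumptions. The point is only that the list keeps its length and order, so task dependencies still resolve.

```python
from dataclasses import dataclass, field


@dataclass
class ExpStub:
    """Hypothetical stand-in for an experiment record (not AgentLab's ExpArgs)."""
    task: str
    status: str  # assumed values: "done", "error" or "incomplete"
    depends_on: list = field(default_factory=list)
    is_dummy: bool = False


def mark_for_relaunch(exps, include_errors=True):
    """Return a list of the same length where finished experiments become dummies."""
    to_relaunch = {"incomplete", "error"} if include_errors else {"incomplete"}
    result = []
    for exp in exps:
        if exp.status in to_relaunch:
            result.append(exp)  # will be run again
        else:
            # Placeholder that keeps the task's slot and its dependency links.
            result.append(ExpStub(exp.task, exp.status, exp.depends_on, is_dummy=True))
    return result


exps = [
    ExpStub("t1", "done"),
    ExpStub("t2", "error", depends_on=["t1"]),
    ExpStub("t3", "incomplete", depends_on=["t2"]),
]
with_errors = [e.is_dummy for e in mark_for_relaunch(exps)]
without_errors = [e.is_dummy for e in mark_for_relaunch(exps, include_errors=False)]
```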
- load_exp_args_list()
- make_exp_args_list()
Generate the exp_args_list from the agent_args and the benchmark.
- property name
- override_max_steps(max_steps)
- run(n_jobs=1, parallel_backend='ray', strict_reproducibility=False, n_relaunch=3, relaunch_errors=True, exp_root=PosixPath('/home/docs/agentlab_results'))
Run the study
- set_reproducibility_info(strict_reproducibility=False, comment=None)
Gather relevant information that may affect the reproducibility of the experiment, e.g. versions of BrowserGym, benchmark, AgentLab…
- Parameters:
strict_reproducibility (bool) – If True, all modifications have to be committed before running the experiments. Also, relaunching a study will not be possible if the code has changed.
comment (str) – Extra comment to add to the reproducibility information.
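As a rough illustration of the kind of information involved, the sketch below collects installed package versions with the standard library. `collect_versions` is a hypothetical helper and the distribution names passed to it are assumptions; the actual contents of reproducibility_info are not specified here, and the check for uncommitted changes is not covered.

```python
import platform
from importlib import metadata


def collect_versions(packages, comment=None):
    """Map each distribution name to its installed version (None if missing).

    Hypothetical helper for illustration; not AgentLab's implementation.
    """
    info = {"python": platform.python_version(), "comment": comment}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package is not installed in this environment
    return info


# The distribution names below are assumed for the example.
info = collect_versions(["agentlab", "browsergym"], comment="baseline run")
```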
- agentlab.experiments.study.get_most_recent_study(root_dir: Path = None, date_format: str = '%Y-%m-%d_%H-%M-%S', contains=None)
Return the most recent directory based on the date in the folder name.
- Parameters:
root_dir – The directory to search in
date_format – The format of the date in the folder name
contains – If not None, only consider folders whose name contains this string
- Returns:
The most recent folder satisfying the conditions
- Return type:
Path
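The selection logic can be sketched with the standard library alone. `most_recent_dir` below is a hypothetical re-implementation, not the function's source, and it assumes that the timestamp is a zero-padded prefix of the folder name; folders that do not parse are skipped.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def most_recent_dir(root_dir, date_format="%Y-%m-%d_%H-%M-%S", contains=None):
    """Return the sub-directory of `root_dir` with the latest timestamp prefix."""
    # Assumption: the formatted date has a fixed width and starts the folder name.
    stamp_len = len(datetime(2000, 1, 1).strftime(date_format))
    best, best_date = None, None
    for path in Path(root_dir).iterdir():
        if not path.is_dir():
            continue
        if contains is not None and contains not in path.name:
            continue
        try:
            date = datetime.strptime(path.name[:stamp_len], date_format)
        except ValueError:
            continue  # folder name does not start with a valid date
        if best_date is None or date > best_date:
            best, best_date = path, date
    return best


# Usage on a throwaway directory tree with made-up study names.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "2024-01-05_10-00-00_agent_on_miniwob",
        "2024-03-01_09-30-00_agent_on_webarena",
        "notes",
    ):
        (Path(tmp) / name).mkdir()
    latest = most_recent_dir(tmp).name
    latest_miniwob = most_recent_dir(tmp, contains="miniwob").name
```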
- agentlab.experiments.study.make_study(agent_args: list[AgentArgs] | AgentArgs, benchmark: Benchmark | str, logging_level_stdout=30, suffix='', comment=None, ignore_dependencies=False, parallel_servers=None)
Run a list of agents on a benchmark.
- Parameters:
agent_args (list[AgentArgs] | AgentArgs) – The agent configuration(s) to run. IMPORTANT: these objects will be pickled and unpickled. Make sure they are imported from a package that is accessible from PYTHONPATH. Otherwise, they won’t load in agentlab-xray.
benchmark (bgym.Benchmark | str) – The benchmark to run the agents on. See bgym.DEFAULT_BENCHMARKS for the main ones. You can also make your own by modifying an existing one.
logging_level_stdout (int) – The logging level for the stdout of the main script. Each job has its own logging level, which is saved to a file and can be seen in agentlab-xray.
suffix (str) – A suffix to add to the study name. This can be useful to keep track of your experiments. By default, the study name contains the agent name, the benchmark name and the date.
comment (str) – Extra comments from the authors of this study, stored in the reproducibility information. Leave any extra information that can explain why results could be different than expected.
ignore_dependencies (bool) – If True, ignore the dependencies of the tasks in the benchmark. Use with caution. So far, only WebArena and VisualWebArena have dependencies between tasks, which minimize the influence of solving one task before another one. This dependency graph allows experiments to run in parallel while respecting task dependencies. However, it still can’t run more than 4 in parallel and, in practice, it speeds up evaluation by a factor of only 3x compared to sequential execution. To accelerate execution, you can ignore dependencies and run in full parallel. This leads to a decrease in performance of about 1%-2%, and could be more. Note: ignore_dependencies on VisualWebArena doesn’t work.
parallel_servers (list[WebArenaInstanceVars]) – The number of parallel servers to use if “webarena” in benchmark.name. Use this to dispatch agent_args on a pool of servers in parallel. If len(agent_args) > len(parallel_servers), each server will be reused for the next evaluation (with a reset) as soon as its current one is done.
- Returns:
Study | SequentialStudies | ParallelStudies object.
- SequentialStudies: if the benchmark requires a manual reset after each evaluation, such as WebArena and VisualWebArena.
- ParallelStudies: if the benchmark has multiple servers to run in parallel.
- Study: otherwise.
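A typical call sequence, using only the parameters listed in the signatures on this page, might look like the sketch below. `launch_study` is a hypothetical wrapper, and the "miniwob" benchmark name, the suffix and the comment text are assumptions; check bgym.DEFAULT_BENCHMARKS for the names available in your installation.

```python
def launch_study(agent_args, benchmark="miniwob", n_jobs=4):
    """Create a study with make_study and run it (hypothetical wrapper)."""
    # Imported lazily so the sketch can be read without AgentLab installed.
    from agentlab.experiments.study import make_study

    study = make_study(
        agent_args=agent_args,  # must be importable from PYTHONPATH to unpickle
        benchmark=benchmark,    # assumed benchmark name
        suffix="baseline",      # illustrative suffix
        comment="Illustrative run for the documentation.",
    )
    # The same keyword arguments are accepted by run() on all three
    # possible return types according to the signatures on this page.
    study.run(
        n_jobs=n_jobs,
        parallel_backend="ray",
        strict_reproducibility=False,
        n_relaunch=3,
    )
    return study
```

The returned object is whichever of Study, SequentialStudies or ParallelStudies make_study selected, so the caller does not need to branch on the benchmark.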