ciberlabreport.preprocesing.cape module
Module for condensing CAPE sandbox reports before LLM consumption.
This module defines the PreprocessLimits dataclass and the CapeReducer class, which work together to shrink large CAPE JSON outputs to a size suitable for prompting OpenAI models. The helpers filter out noise, cap list sizes, and replace large collections with concise summaries such as counts, top values, and representative examples.
- The workflow includes:
Validating inputs and enforcing configurable truncation thresholds.
Building compact sections (meta, statistics, behavior, etc.).
Returning a dictionary with only the data required by the report engine.
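The idea of replacing a large collection with counts, top values, and representative examples can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name `summarize_values`, its parameters, and the output keys are all hypothetical.

```python
from collections import Counter
from typing import Any, Sequence


def summarize_values(
    values: Sequence[str], top_n: int = 3, max_examples: int = 2
) -> dict[str, Any]:
    """Replace a long list with a count, its most frequent values, and a few examples.

    Hypothetical helper: the key names below are illustrative only.
    """
    counts = Counter(values)
    return {
        "count": len(values),                      # total number of entries
        "unique": len(counts),                     # number of distinct entries
        "top": counts.most_common(top_n),          # (value, frequency) pairs
        # dict.fromkeys removes duplicates while preserving first-seen order
        "examples": list(dict.fromkeys(values))[:max_examples],
    }


domains = ["a.example", "b.example", "a.example", "c.example", "a.example", "b.example"]
summary = summarize_values(domains, top_n=2)
```

Here `summary` keeps the size of the original list visible while carrying only a handful of values into the prompt.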
- Classes:
PreprocessLimits: Container with all tunable limits for the reduction pass.
CapeReducer: Applies the trimming logic section by section.
ProcessTree: Generates the process tree image and a readable representation for the PDF schema.
ImageProcessor: Processes the CAPE screenshots to keep only the unique ones.
Example
>>> from ciberlabreport.preprocesing.cape import CapeReducer
>>> reducer = CapeReducer()
>>> reduced = reducer.reduce_report(raw_report)
- class ciberlabreport.preprocesing.cape.CapeReducer(limits: PreprocessLimits | None = None)
Bases: object
Transforms verbose CAPE JSON documents into compact LLM-friendly payloads.
The reducer visits all relevant sections (meta, statistics, behavior, network, etc.), truncates oversized collections according to PreprocessLimits, and removes empty values so the downstream prompt stays small and deterministic.
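Removing empty values, as described above, could look roughly like the sketch below. It assumes "empty" means `None`, empty strings, and empty lists or dicts; the source does not state the exact rule, and the function name `drop_empty` is hypothetical.

```python
from typing import Any

# Assumption: these are the values considered "empty"; 0 and False are kept.
_EMPTY = (None, "", [], {})


def drop_empty(value: Any) -> Any:
    """Recursively remove empty values from nested dicts and lists."""
    if isinstance(value, dict):
        cleaned = {key: drop_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in _EMPTY}
    if isinstance(value, list):
        cleaned_items = [drop_empty(item) for item in value]
        return [item for item in cleaned_items if item not in _EMPTY]
    return value


section = {"a": None, "b": {"c": []}, "d": [0, "", {"e": ""}], "f": "x"}
compact = drop_empty(section)
```

Children are cleaned before the parent is checked, so a container that becomes empty after cleaning is dropped as well.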
- reduce_report(raw: Mapping[str, Any]) → dict[str, Any]
Builds a compact representation of a CAPE analysis JSON.
- Parameters:
raw (Mapping[str, Any]) – Parsed CAPE JSON object.
- Returns:
Aggregated dictionary with the most relevant data, drastically smaller than the original payload.
- Return type:
dict[str, Any]
- Raises:
TypeError – If raw is not a mapping/dict structure.
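A minimal sketch of the kind of input check that would produce this `TypeError`, assuming the validation is a plain `isinstance` test against `collections.abc.Mapping`. The helper name `ensure_mapping` and the error message are hypothetical.

```python
from collections.abc import Mapping
from typing import Any


def ensure_mapping(raw: Any) -> Mapping:
    """Return raw unchanged if it is a mapping, otherwise raise TypeError."""
    if not isinstance(raw, Mapping):
        # Hypothetical message; the real wording is not given in the source.
        raise TypeError(f"expected a mapping, got {type(raw).__name__}")
    return raw


checked = ensure_mapping({"info": {}})
```

Under this assumption, a report parsed into a list or a string (for example, JSON that was never decoded) is rejected before any section is processed.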
- class ciberlabreport.preprocesing.cape.ImageProcessor(tmp_path: Path, filename: str, hash_threshold: int = 9)
Bases: object
- process(raw: dict) → list
Main entry point of the class. Stores only the unique screenshots from the full CAPE report to minimize LLM processing.
- Parameters:
raw (dict) – Whole CAPE report.
- Raises:
RuntimeError – If the cleaning process fails.
OSError – If the images pass validation but cannot be stored.
- Returns:
List of tuples, each containing the pathlib.Path of a stored screenshot and its base64 representation.
- Return type:
list
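The `hash_threshold` parameter suggests that screenshots are compared by some image hash. The sketch below shows one common way such a threshold is used: two hashes count as duplicates when their Hamming distance is at or below the threshold. This is an assumption about the technique, not a description of `ImageProcessor`; the hashes are plain integers and the function names are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


def unique_by_hash(hashes: list[int], threshold: int = 9) -> list[int]:
    """Return the indices of hashes that are not near-duplicates of an earlier kept one.

    Assumption: a candidate is a duplicate when its distance to any kept
    hash is less than or equal to the threshold.
    """
    kept: list[int] = []
    for index, candidate in enumerate(hashes):
        if all(hamming(candidate, hashes[k]) > threshold for k in kept):
            kept.append(index)
    return kept


# Two pairs of near-identical 16-bit hashes
kept_indices = unique_by_hash([0x0000, 0x0001, 0xFFFF, 0xFFFE], threshold=9)
```

A lower threshold keeps more screenshots; a higher one merges more of them into a single representative.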
- class ciberlabreport.preprocesing.cape.PreprocessLimits(max_statistics_list: int = -1, max_signatures: int = 12, max_signature_description: int = 280, max_signature_examples: int = 3, max_ttps: int = -1, max_processes: int = -1, max_command_line: int = 200, max_module_path: int = 180, max_api_results: int = 10, max_category_results: int = 5, max_summary_examples: int = 5, max_anomalies: int = 8, max_enhanced_events: int = 6, max_network_entries: int = 10, max_domains: int = 10, max_dns: int = 10, max_http: int = 8, max_hosts: int = 10, max_dropped_files: int = 5, max_configs: int = 3, max_payloads: int = 3, max_yara: int = 5, max_mutexes: int = 5, max_services: int = 4)
Bases: object
Configuration for each trimming threshold used during CAPE reduction.
- max_statistics_list
Number of processing/signature/reporting stats to retain.
- Type:
int
- max_signatures
Maximum signature entries to keep.
- Type:
int
- max_signature_description
Character budget for signature descriptions.
- Type:
int
- max_signature_examples
Maximum families/references/examples reported per signature.
- Type:
int
- max_ttps
Maximum MITRE ATT&CK techniques to return.
- Type:
int
- max_processes
Maximum behavior processes included in summaries.
- Type:
int
- max_command_line
Character budget for process command lines.
- Type:
int
- max_module_path
Character budget for module/file paths.
- Type:
int
- max_api_results
Maximum API counts surfaced per process.
- Type:
int
- max_category_results
Maximum call category counts surfaced per process.
- Type:
int
- max_summary_examples
Maximum examples retained in high-level summaries.
- Type:
int
- max_anomalies
Maximum anomaly entries in the behavior section.
- Type:
int
- max_enhanced_events
Maximum enhanced event stats retained.
- Type:
int
- max_network_entries
Maximum URL analysis samples gathered.
- Type:
int
- max_domains
Maximum domain examples in the network section.
- Type:
int
- max_dns
Maximum DNS entries kept.
- Type:
int
- max_http
Maximum HTTP entries kept.
- Type:
int
- max_hosts
Maximum host entries kept.
- Type:
int
- max_dropped_files
Maximum dropped-file summaries produced.
- Type:
int
- max_configs
Maximum CAPE config blobs summarized.
- Type:
int
- max_payloads
Maximum payload summaries kept.
- Type:
int
- max_yara
Maximum YARA hits reported within any subsection.
- Type:
int
- max_mutexes
Maximum mutex entries listed.
- Type:
int
- max_services
Maximum service entries listed.
- Type:
int
- Special values:
Any attribute set to -1 disables reduction for that specific dimension.
- static is_unlimited(limit: int | None) → bool
Checks whether a limit disables truncation.
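The following sketch shows how a sentinel check like this might be combined with list and string truncation. The -1 rule comes from the text above; treating `None` as unlimited is an assumption based only on the `int | None` signature, and the `cap` and `cap_text` helpers are hypothetical.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def is_unlimited(limit: Optional[int]) -> bool:
    """True when the limit disables truncation (-1, or None by assumption)."""
    return limit is None or limit == -1


def cap(items: Sequence[T], limit: Optional[int]) -> list[T]:
    """Keep at most `limit` items, or all of them when the limit is unlimited."""
    return list(items) if is_unlimited(limit) else list(items[:limit])


def cap_text(text: str, limit: Optional[int]) -> str:
    """Apply a character budget to a string, such as a command line."""
    return text if is_unlimited(limit) else text[:limit]


signatures = ["sig_a", "sig_b", "sig_c", "sig_d"]
first_two = cap(signatures, 2)
everything = cap(signatures, -1)
```

Keeping the sentinel check in one place means every per-section limit behaves the same way when set to -1.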
- max_anomalies: int
- max_api_results: int
- max_category_results: int
- max_command_line: int
- max_configs: int
- max_dns: int
- max_domains: int
- max_dropped_files: int
- max_enhanced_events: int
- max_hosts: int
- max_http: int
- max_module_path: int
- max_mutexes: int
- max_network_entries: int
- max_payloads: int
- max_processes: int
- max_services: int
- max_signature_description: int
- max_signature_examples: int
- max_signatures: int
- max_statistics_list: int
- max_summary_examples: int
- max_ttps: int
- max_yara: int
- class ciberlabreport.preprocesing.cape.ProcessTree(config_path: Path, tmp_path: Path, filename: str, max_tree_depth: int = 3)
Bases: object
- get_process_tree(raw: dict) → tuple[list, Path]
Extracts, normalizes, and renders the process tree from raw analysis data.
- Parameters:
raw (dict) – Full analysis report containing a behavior.processtree field.
- Returns:
The normalized, depth-limited process tree and the path where the rendered process tree is stored.
- Return type:
tuple[list, Path]
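Depth limiting, as implied by `max_tree_depth`, can be illustrated with a small recursive pruning function. The node layout used here (`name`, `pid`, `children`) is an assumption about the `behavior.processtree` entries, and `limit_depth` is a hypothetical helper rather than part of `ProcessTree`.

```python
from typing import Any


def limit_depth(
    nodes: list[dict[str, Any]], max_depth: int = 3, depth: int = 1
) -> list[dict[str, Any]]:
    """Copy a process tree, dropping every node deeper than max_depth.

    Assumption: each node is a dict with "name", "pid", and a "children" list.
    """
    pruned = []
    for node in nodes:
        entry = {"name": node.get("name"), "pid": node.get("pid"), "children": []}
        if depth < max_depth:
            # Recurse only while there are levels left to include
            entry["children"] = limit_depth(node.get("children", []), max_depth, depth + 1)
        pruned.append(entry)
    return pruned


tree = [
    {"name": "a.exe", "pid": 1, "children": [
        {"name": "b.exe", "pid": 2, "children": [
            {"name": "c.exe", "pid": 3, "children": []},
        ]},
    ]},
]
shallow = limit_depth(tree, max_depth=2)
```

The input is left unmodified, so the full tree remains available if another consumer needs it.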
- class ciberlabreport.preprocesing.cape.SampleSignatures
Bases: object
- obtain_signatures(raw: dict) → tuple[list, dict]
Normalizes CAPE signatures.
- Parameters:
raw (dict) – Raw report data.
- Returns:
Pair consisting of a list of dict objects with basic signature data and a dict holding the signature table printed in the PDF.
- Return type:
tuple[list, dict]