ciberlabreport.preprocesing.cape module

Module for condensing CAPE sandbox reports before LLM consumption.

This module defines the PreprocessLimits dataclass and the CapeReducer class, which work together to shrink large CAPE JSON outputs to a size suitable for prompting OpenAI models. The helpers filter out noise, cap list sizes, and replace large collections with concise summaries such as counts, top values, and representative examples.

The workflow includes:

Validating inputs and enforcing configurable truncation thresholds.
Building compact sections (meta, statistics, behavior, etc.).
Returning a dictionary with only the data required by the report engine.
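
The idea of replacing a large collection with counts, top values, and representative examples can be sketched as follows. This is a minimal illustration, not the module's implementation: the helper name summarize_collection and the output keys are hypothetical, and only the -1 convention comes from PreprocessLimits.

```python
from collections import Counter
from typing import Any, Hashable, Sequence


def summarize_collection(values: Sequence[Hashable], limit: int) -> dict[str, Any]:
    """Replace a long list with its size, most common values, and a few examples.

    Hypothetical helper: the name and the output keys are illustrative only.
    """
    if limit < 0:
        # -1 disables the reduction, mirroring the PreprocessLimits convention.
        return {"total": len(values), "values": list(values)}
    counts = Counter(values)
    return {
        "total": len(values),
        "unique": len(counts),
        "top": counts.most_common(limit),
        # dict.fromkeys keeps first-seen order while dropping repeats.
        "examples": list(dict.fromkeys(values))[:limit],
    }


dns_queries = ["a.example", "b.example", "a.example", "c.example", "a.example"]
summary = summarize_collection(dns_queries, limit=2)
print(summary)
```

The summary keeps the original size in "total", so the reader of the reduced payload can still tell how much was trimmed.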

Classes:

PreprocessLimits: Container with all tunable limits for the reduction pass.
CapeReducer: Applies the trimming logic section by section.
ProcessTree: Generates the process tree image and a legible representation of it for the PDF schema.
ImageProcessor: Processes the CAPE screenshots and keeps only the unique ones.
SampleSignatures: Normalizes CAPE signatures.

Example

>>> from ciberlabreport.preprocesing.cape import CapeReducer
>>> reducer = CapeReducer()
>>> reduced = reducer.reduce_report(raw_report)

class ciberlabreport.preprocesing.cape.CapeReducer(limits: PreprocessLimits | None = None)

Bases: object

Transforms verbose CAPE JSON documents into compact LLM-friendly payloads.

The reducer visits all relevant sections (meta, statistics, behavior, network, etc.), truncates oversized collections according to PreprocessLimits, and removes empty values so the downstream prompt stays small and deterministic.

reduce_report(raw: Mapping[str, Any]) → dict[str, Any]

Builds a compact representation of a CAPE analysis JSON.

Parameters:

raw (Mapping[str, Any]) – Parsed CAPE JSON object.

Returns:

Aggregated dictionary with the most relevant data, drastically smaller than the original payload.

Return type:

dict[str, Any]

Raises:

TypeError – If raw is not a mapping/dict structure.
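
The documented contract (reject non-mapping input with TypeError, drop empty values) can be approximated with the sketch below. The function names reduce_sketch and drop_empty are hypothetical, and the definition of "empty" used here (None, empty strings, empty lists and dicts) is an assumption; the real reducer also truncates sections, which this sketch omits.

```python
from collections.abc import Mapping
from typing import Any

# Assumption: these are the values treated as "empty"; 0 and False are kept.
_EMPTY = (None, "", [], {})


def drop_empty(value: Any) -> Any:
    """Recursively remove empty values from nested dicts and lists."""
    if isinstance(value, Mapping):
        cleaned = {key: drop_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in _EMPTY}
    if isinstance(value, list):
        cleaned_items = [drop_empty(item) for item in value]
        return [item for item in cleaned_items if item not in _EMPTY]
    return value


def reduce_sketch(raw: Any) -> dict[str, Any]:
    """Validate the input type, then prune empty values (hypothetical stand-in)."""
    if not isinstance(raw, Mapping):
        raise TypeError(f"expected a mapping, got {type(raw).__name__}")
    return drop_empty(raw)


report = {"info": {"id": 7, "note": ""}, "network": {"dns": []}, "score": 0}
reduced = reduce_sketch(report)
print(reduced)
```

Children are cleaned before their parent is checked, so a section that only contained empty values disappears entirely.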

class ciberlabreport.preprocesing.cape.ImageProcessor(tmp_path: Path, filename: str, hash_threshold: int = 9)

Bases: object

process(raw: dict) → list

Main entry point of the class. Extracts the screenshots from the full CAPE report and stores only the unique ones to minimize LLM processing.

Parameters:

raw (dict) – Whole CAPE report.

Raises:

RuntimeError – If the cleaning process fails.
OSError – If the images pass validation but cannot be stored.

Returns:

List of tuples, each containing the pathlib.Path of a stored screenshot and its base64 representation.

Return type:

list
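
The hash_threshold parameter suggests that near-duplicate screenshots are detected by comparing image hashes. The sketch below assumes the threshold is a Hamming distance between integer perceptual hashes and that a screenshot is kept only when it differs from every kept screenshot by more than that distance; neither detail is confirmed by this reference. The hashes are toy values and the helper names are hypothetical.

```python
import base64


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def keep_unique(hashes: list[int], threshold: int = 9) -> list[int]:
    """Return the indexes of hashes that are not near-duplicates of a kept one."""
    kept: list[int] = []
    for index, candidate in enumerate(hashes):
        # Assumption: a distance of at most `threshold` bits marks a duplicate.
        if all(hamming_distance(candidate, hashes[k]) > threshold for k in kept):
            kept.append(index)
    return kept


# Toy 64-bit values standing in for perceptual hashes of three screenshots.
shot_hashes = [0xFFFF0000FFFF0000, 0xFFFF0000FFFF0001, 0x0000FFFF0000FFFF]
unique_indexes = keep_unique(shot_hashes)

# Each kept image would then be paired with its base64 text form.
encoded = base64.b64encode(b"fake image bytes").decode("ascii")
print(unique_indexes, encoded)
```

The second hash differs from the first by a single bit and is dropped, while the third differs in all 64 bits and is kept.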

class ciberlabreport.preprocesing.cape.PreprocessLimits(max_statistics_list: int = -1, max_signatures: int = 12, max_signature_description: int = 280, max_signature_examples: int = 3, max_ttps: int = -1, max_processes: int = -1, max_command_line: int = 200, max_module_path: int = 180, max_api_results: int = 10, max_category_results: int = 5, max_summary_examples: int = 5, max_anomalies: int = 8, max_enhanced_events: int = 6, max_network_entries: int = 10, max_domains: int = 10, max_dns: int = 10, max_http: int = 8, max_hosts: int = 10, max_dropped_files: int = 5, max_configs: int = 3, max_payloads: int = 3, max_yara: int = 5, max_mutexes: int = 5, max_services: int = 4)

Bases: object

Configuration for each trimming threshold used during CAPE reduction.

max_statistics_list

Number of processing/signature/reporting stats to retain.

Type: int

max_signatures

Maximum signature entries to keep.

Type: int

max_signature_description

Character budget for signature descriptions.

Type: int

max_signature_examples

Maximum families/references/examples reported per signature.

Type: int

max_ttps

Maximum MITRE ATT&CK techniques to return.

Type: int

max_processes

Maximum behavior processes included in summaries.

Type: int

max_command_line

Character budget for process command lines.

Type: int

max_module_path

Character budget for module/file paths.

Type: int

max_api_results

Maximum API counts surfaced per process.

Type: int

max_category_results

Maximum call category counts surfaced per process.

Type: int

max_summary_examples

Maximum examples retained in high-level summaries.

Type: int

max_anomalies

Maximum anomaly entries in the behavior section.

Type: int

max_enhanced_events

Maximum enhanced event stats retained.

Type: int

max_network_entries

Maximum URL analysis samples gathered.

Type: int

max_domains

Maximum domain examples in the network section.

Type: int

max_dns

Maximum DNS entries kept.

Type: int

max_http

Maximum HTTP entries kept.

Type: int

max_hosts

Maximum host entries kept.

Type: int

max_dropped_files

Maximum dropped-file summaries produced.

Type: int

max_configs

Maximum CAPE config blobs summarized.

Type: int

max_payloads

Maximum payload summaries kept.

Type: int

max_yara

Maximum YARA hits reported within any subsection.

Type: int

max_mutexes

Maximum mutex entries listed.

Type: int

max_services

Maximum service entries listed.

Type: int

Special values: Any attribute set to -1 disables reduction for that specific dimension.

static is_unlimited(limit: int | None) → bool

Checks whether a limit disables truncation.
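
A minimal sketch of how the -1 sentinel and is_unlimited could interact with a truncation step is shown below. LimitsSketch and cap are hypothetical names, only two of the documented fields are reproduced, and treating None and every negative value as "unlimited" is an assumption (the reference only states that -1 disables the reduction).

```python
from dataclasses import dataclass
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class LimitsSketch:
    """Hypothetical stand-in with two of the documented defaults."""

    max_signatures: int = 12
    max_ttps: int = -1

    @staticmethod
    def is_unlimited(limit: Optional[int]) -> bool:
        # Assumption: None and any negative number both mean "no cap".
        return limit is None or limit < 0


def cap(items: Sequence[T], limit: Optional[int]) -> list[T]:
    """Return at most `limit` items, or all of them when the limit is disabled."""
    if LimitsSketch.is_unlimited(limit):
        return list(items)
    return list(items[:limit])


limits = LimitsSketch()
signatures = cap(list(range(20)), limits.max_signatures)
ttps = cap(list(range(20)), limits.max_ttps)
print(len(signatures), len(ttps))
```

Checking the sentinel before slicing matters because a plain items[:-1] slice would silently drop the last element instead of keeping everything.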

max_anomalies: int

max_api_results: int

max_category_results: int

max_command_line: int

max_configs: int

max_dns: int

max_domains: int

max_dropped_files: int

max_enhanced_events: int

max_hosts: int

max_http: int

max_module_path: int

max_mutexes: int

max_network_entries: int

max_payloads: int

max_processes: int

max_services: int

max_signature_description: int

max_signature_examples: int

max_signatures: int

max_statistics_list: int

max_summary_examples: int

max_ttps: int

max_yara: int

class ciberlabreport.preprocesing.cape.ProcessTree(config_path: Path, tmp_path: Path, filename: str, max_tree_depth: int = 3)

Bases: object

get_process_tree(raw: dict) → tuple[list, Path]

Extracts, normalizes, and renders the process tree from raw analysis data.

Parameters:

raw (dict) – Full analysis report containing a behavior.processtree field.

Returns:

The normalized, depth-limited process tree and the path where the rendered process tree image is stored.

Return type:

tuple[list, Path]
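
Depth limiting of a process tree can be sketched as below. This assumes each node is a dict with pid, name, and children keys and that the root level counts as depth 1. The helper names and the hidden_descendants field are hypothetical, and the rendering step is omitted.

```python
from typing import Any


def count_nodes(nodes: list[dict[str, Any]]) -> int:
    """Count every node in a list of subtrees."""
    return sum(1 + count_nodes(node.get("children", [])) for node in nodes)


def prune_tree(
    nodes: list[dict[str, Any]], max_depth: int = 3, depth: int = 1
) -> list[dict[str, Any]]:
    """Copy the tree, cutting off everything below max_depth (root level is 1)."""
    pruned = []
    for node in nodes:
        copy: dict[str, Any] = {"pid": node.get("pid"), "name": node.get("name")}
        children = node.get("children", [])
        if depth < max_depth:
            copy["children"] = prune_tree(children, max_depth, depth + 1)
        else:
            # Record how much was cut so the summary stays honest.
            copy["children"] = []
            copy["hidden_descendants"] = count_nodes(children)
        pruned.append(copy)
    return pruned


tree = [{"pid": 1, "name": "a.exe", "children": [
    {"pid": 2, "name": "b.exe", "children": [
        {"pid": 3, "name": "c.exe", "children": [
            {"pid": 4, "name": "d.exe", "children": []}]}]}]}]
pruned = prune_tree(tree, max_depth=3)
print(pruned)
```

The function builds new dicts rather than editing the input, so the raw report is left unchanged.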

class ciberlabreport.preprocesing.cape.SampleSignatures

Bases: object

obtain_signatures(raw: dict) → tuple[list, dict]

Normalizes CAPE signatures.

Parameters:

raw (dict) – Report raw data.

Returns:

Pair consisting of a list of dict objects with the basic data of each signature, and a dict with the signature table used for printing in the PDF.

Return type:

tuple[list, dict]
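
The shape of the returned pair can be illustrated with the sketch below. It assumes the report exposes a top-level signatures list whose entries carry name, severity, and description fields. The table layout (headers and rows) and the function name are purely hypothetical, since this reference does not describe them.

```python
from typing import Any


def normalize_signatures_sketch(
    raw: dict, max_description: int = 280
) -> tuple[list, dict]:
    """Build a list of basic signature dicts and a printable table (illustrative)."""
    basic: list[dict[str, Any]] = []
    # Hypothetical table layout; the real PDF table structure is not documented here.
    table: dict[str, list] = {"headers": ["Name", "Severity", "Description"], "rows": []}
    for sig in raw.get("signatures", []):
        entry = {
            "name": sig.get("name", "unknown"),
            "severity": sig.get("severity", 0),
            # Enforce a character budget similar to max_signature_description.
            "description": str(sig.get("description", ""))[:max_description],
        }
        basic.append(entry)
        table["rows"].append([entry["name"], entry["severity"], entry["description"]])
    return basic, table


sample = {"signatures": [
    {"name": "injection_runpe", "severity": 3, "description": "x" * 400},
    {"name": "antivm_generic", "severity": 2, "description": "Checks for VM artifacts"},
]}
basic, table = normalize_signatures_sketch(sample)
print(len(basic), table["headers"])
```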