ciberlabreport.preprocesing.cape module

Module for condensing CAPE sandbox reports before LLM consumption.

This module defines the PreprocessLimits dataclass and the CapeReducer class, which work together to shrink large CAPE JSON outputs to a size suitable for prompting OpenAI models. The helpers filter out noise, cap list sizes, and replace large collections with concise summaries such as counts, top values, and representative examples.

The workflow includes:
  1. Validating inputs and enforcing configurable truncation thresholds.

  2. Building compact sections (meta, statistics, behavior, etc.).

  3. Returning a dictionary with only the data required by the report engine.
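
The idea of replacing a large collection with counts, top values, and representative examples can be illustrated with a minimal sketch. The function name summarize_values, its parameters, and the output keys are hypothetical and are not taken from this module; the actual reducer may structure its summaries differently.

```python
from collections import Counter
from typing import Any, Iterable


def summarize_values(values: Iterable[str], top_n: int = 3, examples: int = 2) -> dict[str, Any]:
    """Collapse a long list into a count, its most common values, and a few examples."""
    items = list(values)
    counts = Counter(items)
    return {
        "count": len(items),                 # total number of entries
        "unique": len(counts),               # number of distinct entries
        "top": counts.most_common(top_n),    # (value, occurrences) pairs
        "examples": items[:examples],        # first entries, kept verbatim
    }


# Hypothetical input: domains contacted by a sample.
domains = ["a.example", "b.example", "a.example", "c.example", "a.example"]
summary = summarize_values(domains, top_n=2)
```

A summary of this shape has a bounded size regardless of how many entries the original list holds, which is what keeps the prompt small.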

Classes:

PreprocessLimits: Container with all tunable limits for the reduction pass.

CapeReducer: Applies the trimming logic section by section.

ProcessTree: Generates the process tree image and a readable representation for the PDF schema.

ImageProcessor: Processes the CAPE shots and keeps only the unique ones.

SampleSignatures: Normalizes CAPE signatures.

Example

>>> from ciberlabreport.preprocesing.cape import CapeReducer
>>> reducer = CapeReducer()
>>> reduced = reducer.reduce_report(raw_report)

class ciberlabreport.preprocesing.cape.CapeReducer(limits: PreprocessLimits | None = None)

Bases: object

Transforms verbose CAPE JSON documents into compact LLM-friendly payloads.

The reducer visits all relevant sections (meta, statistics, behavior, network, etc.), truncates oversized collections according to PreprocessLimits, and removes empty values so the downstream prompt stays small and deterministic.
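
Removing empty values could work along the lines of the recursive sketch below. The helpers drop_empty and _is_empty are hypothetical names, and the rule that 0 and False are kept is an assumption; the reducer's own definition of "empty" may differ.

```python
from typing import Any


def _is_empty(value: Any) -> bool:
    # Assumption: only None and zero-length strings/containers are "empty";
    # 0 and False are treated as meaningful values and kept.
    return value is None or (
        isinstance(value, (str, list, dict, tuple)) and len(value) == 0
    )


def drop_empty(value: Any) -> Any:
    """Recursively remove empty values from nested dicts and lists."""
    if isinstance(value, dict):
        cleaned = {key: drop_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if not _is_empty(item)}
    if isinstance(value, list):
        cleaned_list = [drop_empty(item) for item in value]
        return [item for item in cleaned_list if not _is_empty(item)]
    return value


pruned = drop_empty({"a": None, "b": {"c": []}, "d": [1, "", {}], "e": 0})
```

Children are cleaned before the parent is checked, so a container that only held empty values is removed as well.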

reduce_report(raw: Mapping[str, Any]) dict[str, Any]

Builds a compact representation of a CAPE analysis JSON.

Parameters:

raw (Mapping[str, Any]) – Parsed CAPE JSON object.

Returns:

Aggregated dictionary with the most relevant data, drastically smaller than the original payload.

Return type:

dict[str, Any]

Raises:

TypeError – If raw is not a mapping/dict structure.
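
A check of this kind can be sketched as follows. The helper name ensure_mapping and the error message are hypothetical; only the TypeError behavior for non-mapping input comes from the description above.

```python
from collections.abc import Mapping
from typing import Any


def ensure_mapping(raw: Any) -> Mapping:
    """Return raw unchanged if it is a mapping, otherwise raise TypeError."""
    if not isinstance(raw, Mapping):
        raise TypeError(f"expected a mapping, got {type(raw).__name__}")
    return raw
```

Under this sketch, a report that is still a JSON string would be rejected, so it would need to be parsed (for example with json.loads) before being passed in.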

class ciberlabreport.preprocesing.cape.ImageProcessor(tmp_path: Path, filename: str, hash_threshold: int = 9)

Bases: object

process(raw: dict) list

Entry point of the class. Extracts the shots from the full CAPE report and stores only the unique ones, which minimizes the LLM processing load.

Parameters:

raw (dict) – Whole CAPE report.

Raises:
  • RuntimeError – If the cleaning process fails.

  • OSError – If the images pass validation but cannot be stored.

Returns:

List of tuples, each holding the pathlib.Path of a stored shot and its base64 representation.

Return type:

list
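
The hash_threshold parameter suggests that shots are compared by a hash distance, although this page does not say which hash is used or how the threshold is applied. The sketch below assumes each shot already has an integer hash and treats two shots as duplicates when their Hamming distance is at most the threshold. The function names and the inclusive comparison are assumptions.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


def unique_by_hash(hashes: list[tuple[str, int]], threshold: int = 9) -> list[str]:
    """Keep a shot only if it is farther than threshold from every kept shot."""
    kept: list[tuple[str, int]] = []
    for name, value in hashes:
        if all(hamming_distance(value, other) > threshold for _, other in kept):
            kept.append((name, value))
    return [name for name, _ in kept]


# Hypothetical hashes: the second shot differs from the first in 3 bits,
# the third in 16 bits.
shots = [("0001.jpg", 0b0), ("0002.jpg", 0b111), ("0003.jpg", 0xFFFF)]
unique = unique_by_hash(shots, threshold=9)
```

With this rule, a larger threshold discards more shots as near-duplicates and a smaller one keeps more of them.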

class ciberlabreport.preprocesing.cape.PreprocessLimits(max_statistics_list: int = -1, max_signatures: int = 12, max_signature_description: int = 280, max_signature_examples: int = 3, max_ttps: int = -1, max_processes: int = -1, max_command_line: int = 200, max_module_path: int = 180, max_api_results: int = 10, max_category_results: int = 5, max_summary_examples: int = 5, max_anomalies: int = 8, max_enhanced_events: int = 6, max_network_entries: int = 10, max_domains: int = 10, max_dns: int = 10, max_http: int = 8, max_hosts: int = 10, max_dropped_files: int = 5, max_configs: int = 3, max_payloads: int = 3, max_yara: int = 5, max_mutexes: int = 5, max_services: int = 4)

Bases: object

Configuration for each trimming threshold used during CAPE reduction.

max_statistics_list

Number of processing/signature/reporting stats to retain.

Type:

int

max_signatures

Maximum signature entries to keep.

Type:

int

max_signature_description

Character budget for signature descriptions.

Type:

int

max_signature_examples

Maximum families/references/examples reported per signature.

Type:

int

max_ttps

Maximum MITRE ATT&CK techniques to return.

Type:

int

max_processes

Maximum behavior processes included in summaries.

Type:

int

max_command_line

Character budget for process command lines.

Type:

int

max_module_path

Character budget for module/file paths.

Type:

int

max_api_results

Maximum API counts surfaced per process.

Type:

int

max_category_results

Maximum call category counts surfaced per process.

Type:

int

max_summary_examples

Maximum examples retained in high-level summaries.

Type:

int

max_anomalies

Maximum anomaly entries in the behavior section.

Type:

int

max_enhanced_events

Maximum enhanced event stats retained.

Type:

int

max_network_entries

Maximum URL analysis samples gathered.

Type:

int

max_domains

Maximum domain examples in the network section.

Type:

int

max_dns

Maximum DNS entries kept.

Type:

int

max_http

Maximum HTTP entries kept.

Type:

int

max_hosts

Maximum host entries kept.

Type:

int

max_dropped_files

Maximum dropped-file summaries produced.

Type:

int

max_configs

Maximum CAPE config blobs summarized.

Type:

int

max_payloads

Maximum payload summaries kept.

Type:

int

max_yara

Maximum YARA hits reported within any subsection.

Type:

int

max_mutexes

Maximum mutex entries listed.

Type:

int

max_services

Maximum service entries listed.

Type:

int

Special values:

Any attribute set to -1 disables reductions for that specific dimension.

static is_unlimited(limit: int | None) bool

Checks whether a limit disables truncation.
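
A minimal sketch of how such a check might be combined with the list and character limits is shown below. Treating None and every negative value as unlimited is an assumption (this page only documents -1), and the helpers cap and clip_text are hypothetical names that do not appear in this module.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def is_unlimited(limit: Optional[int]) -> bool:
    # Assumption: None and any negative value both mean "no limit".
    return limit is None or limit < 0


def cap(items: Sequence[T], limit: Optional[int]) -> list[T]:
    """Return at most limit items, or all of them when the limit is disabled."""
    if is_unlimited(limit):
        return list(items)
    return list(items[:limit])


def clip_text(text: str, limit: Optional[int]) -> str:
    """Cut text to a character budget, or return it whole when unlimited."""
    if is_unlimited(limit):
        return text
    return text[:limit]
```

Routing every truncation through a single predicate keeps the meaning of the sentinel value in one place, so list caps and character budgets stay consistent.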

class ciberlabreport.preprocesing.cape.ProcessTree(config_path: Path, tmp_path: Path, filename: str, max_tree_depth: int = 3)

Bases: object

get_process_tree(raw: dict) tuple[list, Path]

Extracts, normalizes, and renders the process tree from raw analysis data.

Parameters:

raw (dict) – Full analysis report containing a behavior.processtree field.

Returns:

The normalized, depth-limited process tree and the path where the rendered process tree image is stored.

Return type:

tuple[list, Path]
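
Depth limiting, as controlled by max_tree_depth, could be implemented roughly as in the sketch below. The node keys name, pid, and children, the omitted_children counter, and the function name limit_depth are assumptions made for illustration; the normalized structure this class actually returns is not described here.

```python
from typing import Any


def limit_depth(nodes: list[dict[str, Any]], max_depth: int = 3, depth: int = 1) -> list[dict[str, Any]]:
    """Copy a process tree, cutting off everything below max_depth levels."""
    trimmed = []
    for node in nodes:
        children = node.get("children", [])
        entry: dict[str, Any] = {"name": node.get("name"), "pid": node.get("pid")}
        if depth < max_depth:
            # Still above the cut-off: recurse into the next level.
            entry["children"] = limit_depth(children, max_depth, depth + 1)
        else:
            # At the cut-off: drop descendants but record how many direct
            # children were removed.
            entry["children"] = []
            entry["omitted_children"] = len(children)
        trimmed.append(entry)
    return trimmed


# Hypothetical three-level tree.
tree = [{"name": "a.exe", "pid": 1, "children": [
    {"name": "b.exe", "pid": 2, "children": [
        {"name": "c.exe", "pid": 3, "children": []}]}]}]
shallow = limit_depth(tree, max_depth=2)
```

Recording the number of omitted children is one way to signal that the tree was cut rather than naturally ending at that level.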

class ciberlabreport.preprocesing.cape.SampleSignatures

Bases: object

obtain_signatures(raw: dict) tuple[list, dict]

Normalizes CAPE signatures.

Parameters:

raw (dict) – Report raw data.

Returns:

Pair consisting of a list of dict objects with the basic data of each signature and a dict with the signature table printed in the PDF.

Return type:

tuple[list, dict]