Video Tagging with AI Gen

Project Information

Category: AI for Business
Client: Pernod Ricard
Project date: 2025
Project URL: github.com/albiche/ai_video_tagging

Project Overview

The Video Tagging with AI Gen project delivers a sophisticated multi-modal analysis pipeline to automatically tag creative elements in video ads. Designed for Pernod Ricard, it addresses the challenge of subjective, inconsistent, and costly manual video annotation by using large language models and advanced frame/audio extraction strategies. This approach ensures consistent, objective tagging even for complex emotional or brand signals in 30-second advertising spots, achieving a 10x cost reduction compared to outsourcing.

Key Features & Functionalities

Advanced Frame Extraction

Supports multiple strategies including regular intervals, scene-change detection, face/person detection, and grouped sampling for richer context.

Audio Transcription & Analysis

Extracts and transcribes audio tracks for inclusion in multi-modal prompts, ensuring dialogue and sound cues are accurately tagged.

Brand Knowledge Injection

Enriches prompts with brand-specific colors, elements, and semantic cues for superior tagging accuracy and alignment with marketing strategy.

Intelligent Batching & Chunking

Handles LLM token/image/audio limits by smartly grouping tags and inputs into optimized batches for efficient processing.

Template-Driven Flexibility

Fully configurable via JSON/YAML templates, enabling easy adaptation to new tags, accepted values, and merging logic without code changes.

Scalable Automation

Supports batch processing from CSVs of video URLs, brand knowledge management, and standardized CSV output for analytics.

Architecture & Workflow

The pipeline's modular architecture supports advanced frame and audio extraction, brand-specific context injection, intelligent prompt construction, and multi-modal LLM tagging using GPT-4o. Frame extraction strategies include regular intervals, MIF (most informative frames), and person detection via YOLOv8. Audio tracks are extracted and transcribed to provide context. Batches are constructed respecting token/image/audio limits, and prompts are enriched with brand knowledge for consistent, marketing-aligned results.

The system also supports automated brand knowledge file generation and extensible logic for merging and validating results across batches. All outputs are normalized and exported in CSV format, ready for downstream marketing analytics or dataset creation.

Results & Business Impact

This solution enables Pernod Ricard to replace manual outsourcing with a highly scalable, accurate, and objective AI-based video tagging process. It ensures consistent brand and creative analysis across markets, reduces subjective interpretation, and achieves cost savings of 10x. Close collaboration with the marketing team ensures precise tag definitions and robust prompt design, delivering a strong foundation for strategic creative analysis and benchmarking.