Three-tier architecture: Multimodal Reasoning (text) + Vision Understanding (images) + Audio/Video (planned)
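
As a rough illustration of how the three tiers could be told apart at dispatch time, here is a minimal sketch. All names in it (`Tier`, `ROUTES`, `AVAILABLE`, `route`) are hypothetical and not taken from the project. It assumes that each input arrives tagged with a kind string, and that the planned Audio/Video tier should fail loudly until it exists.

```python
from enum import Enum


class Tier(Enum):
    """The three tiers named in the architecture line above."""
    MULTIMODAL_REASONING = "text"
    VISION_UNDERSTANDING = "images"
    AUDIO_VIDEO = "audio/video"


# Hypothetical: only the first two tiers are treated as available,
# because the source marks Audio/Video as planned.
AVAILABLE = {Tier.MULTIMODAL_REASONING, Tier.VISION_UNDERSTANDING}

# Hypothetical mapping from an input kind to the tier that handles it.
ROUTES = {
    "text": Tier.MULTIMODAL_REASONING,
    "image": Tier.VISION_UNDERSTANDING,
    "audio": Tier.AUDIO_VIDEO,
    "video": Tier.AUDIO_VIDEO,
}


def route(kind: str) -> Tier:
    """Return the tier for an input kind; raise if unknown or not yet available."""
    try:
        tier = ROUTES[kind]
    except KeyError:
        raise ValueError(f"unknown input kind: {kind!r}") from None
    if tier not in AVAILABLE:
        raise NotImplementedError(f"{tier.name} tier is planned, not available")
    return tier
```

In this sketch, `route("image")` returns `Tier.VISION_UNDERSTANDING`, while `route("audio")` raises `NotImplementedError`. Keeping the planned tier in the routing table means that enabling it later only requires adding it to `AVAILABLE`.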