Analyze images and generate descriptions with bounding boxes
Generate speech from text using a reference audio sample