Gemini Live Agent Challenge
Hey builders! Stop typing, and start interacting! We are moving beyond the text box. The future isn't just about chatting with AI; it's about immersive, real-time experiences. To celebrate the power of multimodal AI, we're challenging you to build the next generation of agents that can see, hear, speak, and create in the Gemini Live Agent Challenge.
About
Entrants must develop a NEW next-generation AI agent that uses multimodal inputs and outputs and moves beyond simple text-in/text-out interactions. Projects should pair Google's Live API with the creative power of video/image generation to solve complex problems or create entirely new user experiences within one of these three categories:

Live Agents
Focus: Real-time interaction (audio/vision). Build an agent that users can talk to naturally and that can be interrupted mid-response. This could be a real-time translator, a vision-enabled tutor that "sees" your homework, or a customer support voice agent that handles interruptions gracefully.
Mandatory Tech: Must use the Gemini Live API or ADK. Agents must be hosted on Google Cloud. (Starter sketch below.)

Creative Storyteller
Focus: Multimodal storytelling with interleaved output. Build an agent that thinks and creates like a creative director, seamlessly weaving together text, images, audio, and video in a single, fluid output stream. Leverage Gemini's native interleaved output to generate rich, mixed-media responses that combine narration with visuals, explanations with generated imagery, or storyboards with voiceover, all in one cohesive flow. Examples include interactive storybooks (text + generated illustrations inline), a marketing asset generator (copy + visuals + video in one go), educational explainers (narration woven with diagrams), and a social content creator (caption + image + hashtags together).
Mandatory Tech: Must use Gemini's interleaved/mixed output capabilities. Agents must be hosted on Google Cloud. (Starter sketch below.)

UI Navigator
Focus: Visual UI understanding and interaction. Build an agent that becomes the user's hands on screen. The agent observes the browser or device display, interprets visual elements (with or without relying on APIs or DOM access), and performs actions based on user intent. Examples include a universal web navigator, a cross-application workflow automator, or a visual QA testing agent.
Mandatory Tech: Must use Gemini's multimodal capabilities to interpret screenshots/screen recordings and output executable actions. Agents must be hosted on Google Cloud. (Starter sketch below.)

All projects MUST:
- Leverage a Gemini model
- Be built using either the Google GenAI SDK or ADK (Agent Development Kit)
- Use at least one Google Cloud service

What to Submit

Text Description: A summary of the project's features and functionality, the technologies used, information about any other data sources used, and your findings and learnings as you worked through the project.

URL to Your Public Code Repository: Let us see how you built it! Include spin-up instructions in your README so the judges can confirm your project is reproducible.

Proof of Google Cloud Deployment: You must demonstrate that the backend is running on Google Cloud with a short recording (separate from your demo). Proof can be either (1) a quick screen recording that shows the behind-the-scenes of your app running on GCP (e.g., console logs or a console view of a deployment) or (2) a link to a code file in your repo that demonstrates use of Google Cloud services and APIs (e.g., API calls to Vertex AI endpoints; see the Vertex AI sketch below).

Architecture Diagram: A clear visual representation of your system (e.g., how Gemini connects to your backend, database, and frontend). Pro tip: add this to the file upload or image carousel so it's easy for judges to find!
Demonstration Video: A video under 4 minutes that demonstrates your multimodal/agentic features working in real time (no mockups) and pitches your project: what problem did you solve, and what value does your solution bring?
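The sketches below are illustrative starting points, not official starter code. For the Live Agents category, this is a minimal sketch of a persistent real-time session using the Google GenAI SDK's Live API. Text turns are used for brevity, and the model name is an assumption that may need updating; a real entry would stream microphone audio and handle barge-in.

```python
# Minimal Live API sketch (text turns for brevity; a real agent would stream
# audio). Assumes the google-genai package and a GEMINI_API_KEY in the
# environment; the model name below is an assumption.
import asyncio
from google import genai

client = genai.Client()  # picks up the API key from the environment
MODEL = "gemini-2.0-flash-live-001"  # assumed Live-capable model
CONFIG = {"response_modalities": ["TEXT"]}

async def main() -> None:
    # A Live session is a persistent, bidirectional connection, unlike
    # one-shot request/response generation.
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        while True:
            message = input("User> ")
            if message.lower() == "exit":
                break
            # Send one user turn and mark it complete.
            await session.send_client_content(
                turns={"role": "user", "parts": [{"text": message}]},
                turn_complete=True,
            )
            # Responses arrive incrementally, which is what makes natural
            # interruption handling possible in an audio agent.
            async for response in session.receive():
                if response.text is not None:
                    print(response.text, end="")
            print()

if __name__ == "__main__":
    asyncio.run(main())
```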
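For Creative Storyteller, a sketch of requesting interleaved text-plus-image output in a single response. The image-capable model name is an assumption; check the current docs for whichever model supports response_modalities=["TEXT", "IMAGE"].

```python
# Interleaved output sketch: one request returns a single stream of parts that
# mixes text and inline images. Model name is an assumption.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash-preview-image-generation",  # assumed model
    contents=(
        "Tell a three-panel bedtime story about a robot gardener, "
        "with one illustration per panel."
    ),
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# Walk the parts in order: narration and illustrations arrive interleaved.
image_count = 0
for part in response.candidates[0].content.parts:
    if part.text is not None:
        print(part.text)
    elif part.inline_data is not None:  # generated image bytes
        image_count += 1
        with open(f"panel_{image_count}.png", "wb") as f:
            f.write(part.inline_data.data)
```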
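For UI Navigator, a sketch of one perception-to-action step: send a screenshot plus the user's goal, and ask for a structured action back. The JSON action schema, goal string, and screenshot file name are all hypothetical; executing the returned action (via Playwright, ADB, etc.) is up to your agent.

```python
# Screenshot-in, action-out sketch. The action schema is hypothetical; wire
# the parsed result into your own executor.
import json
from google import genai
from google.genai import types

client = genai.Client()

with open("screenshot.png", "rb") as f:  # hypothetical capture from your app
    screenshot = f.read()

prompt = (
    "You are controlling a browser. The user's goal: 'open the settings menu'. "
    "Look at the screenshot and reply with JSON only: "
    '{"action": "click" | "type" | "scroll", "x": <int>, "y": <int>, '
    '"text": <string or null>}'
)

response = client.models.generate_content(
    model="gemini-2.0-flash",  # any current multimodal Gemini model
    contents=[
        types.Part.from_bytes(data=screenshot, mime_type="image/png"),
        prompt,
    ],
    # Constrain the reply to JSON so it parses reliably.
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)

action = json.loads(response.text)
print(action)  # e.g. {'action': 'click', 'x': 1204, 'y': 32, 'text': None}
```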
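On the deployment-proof point: if you route your Gemini calls through Vertex AI, the code itself doubles as evidence of Google Cloud usage, since requests appear in your project's Vertex AI console and Cloud Logging. A minimal sketch, assuming the google-genai SDK with placeholder project and region values:

```python
# Vertex AI routing sketch: the same SDK, but requests go through your GCP
# project. Project ID and region below are placeholders.
from google import genai

client = genai.Client(
    vertexai=True,
    project="your-gcp-project-id",
    location="us-central1",
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Health check: reply with 'ok'.",
)
print(response.text)
```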
This hackathon has ended