We all know the story of the first YouTube video, a great 19-second clip of co-founder Jawed Karim at the zoo, commenting on the elephants behind him. That video was a foundational moment for digital media, and in a way it finds its reflection, or at least an inverted mirror image, today as we witness the arrival of Veo 3.
Part of Google Gemini, Veo 3 was introduced at Google I/O 2025 and is the first generative video platform that, from a single prompt, can produce a video with synchronized dialogue, sound effects, and background noise. Most of these 8-second clips arrive less than 5 minutes after you enter the prompt.
I’ve been playing with Veo 3 for a couple of days, and for my latest challenge, I tried to go back to the beginning of social video and that YouTube “Me at the zoo” moment. Specifically, I wondered whether Veo 3 could recreate that video.
As I’ve written before, the key to a good Veo 3 result is the prompt. Without detail and structure, Veo 3 tends to make decisions for you, and you generally don’t end up with what you want. For this experiment, I asked myself how I could capture every detail I wanted from that brief video and deliver it all to Veo 3 as a prompt. So, naturally, I turned to another AI.
Google Gemini 2.5 Pro can’t currently analyze a URL, but Google AI Mode, the new form of search that is rolling out rapidly across the US, can.
Here’s the prompt I dropped into Google AI Mode:
Google AI Mode returned almost instantly with a detailed description, which I took and dropped straight into Gemini’s Veo 3 prompt field.
I made some edits, mostly removing phrases like “The video appears…” and the analysis at the very end, but otherwise I left most of it intact and added this at the top of the prompt:
“Let’s make a video based on these details. The output should be in a 4:3 ratio and look like it was filmed on 8mm videotape.”
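An aside for tinkerers: if you’d rather script this AI-to-AI handoff than click through the Gemini app the way I did, something like the sketch below is possible with Google’s google-genai Python SDK. To be clear, this is only a sketch of the idea, not what I ran: Google AI Mode has no public API (a Gemini model stands in for it here), and the Veo model ID and polling details are assumptions to verify against the current docs.

```python
# Hedged sketch: ask a Gemini model for a detailed description, prepend the
# framing instruction, then hand the whole thing to a Veo model.
import time

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # hypothetical placeholder key

# Step 1: a Gemini model stands in for Google AI Mode, which has no API.
description = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=(
        "Describe, shot by shot, YouTube's first video, 'Me at the zoo': "
        "framing, setting, outfit, lighting, and dialogue."
    ),
).text

# Step 2: the same framing line added at the top of the prompt above.
prompt = (
    "Let's make a video based on these details. The output should be in a "
    "4:3 ratio and look like it was filmed on 8mm videotape.\n\n" + description
)

# Step 3: video generation is a long-running job, so the SDK returns an
# operation that you poll rather than a finished file.
operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",  # assumed model ID; check current docs
    prompt=prompt,
)
while not operation.done:
    time.sleep(20)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("me_at_the_zoo_remake.mp4")
```

The polling loop is the part worth noticing: generation takes minutes, so the API hands back a job to check on rather than blocking until the clip exists.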
Veo 3 took a while to generate the video (I think the service is being hammered right now) and, because it only creates 8 seconds at a time, the result was incomplete, cutting off the dialogue mid-sentence.
Even so, the result is impressive. I wouldn’t say the main character looks like Karim. To be fair, the prompt doesn’t describe, for example, Karim’s haircut, the shape of his face, or his deep-set eyes. Google AI Mode’s description of his outfit was probably insufficient, too. I’m sure it would have done a better job if I had fed it a screenshot of the original video.
Note to self: you can never offer enough detail in a generative prompt.
8 seconds at a time
Veo 3’s zoo is more pleasant than the one Karim visited, and the elephants are much farther away, though they are moving around back there.
Veo 3 got the film quality right, giving it a good 2005-era look, but not the 4:3 ratio. I realize now that I should have removed the “title” bit from my prompt.
The audio is particularly good. The dialogue is well synchronized with my main character, and if you listen carefully, you’ll hear the background noises, too.
The biggest problem is that this was only half of the brief YouTube video. I wanted a complete recreation, so I decided to go back with a much shorter prompt:
Continue with the same video, and add him looking back at the elephants and then looking at the camera while saying this dialogue:
“...trunks, and that’s cool.” “And that’s pretty much all there is to say.”
Veo 3 kept the setting and the main character, but lost part of the plot, dropping the grainy, old-school video look of the first generated clip. That means that when I string the two together (as I do above), we lose considerable continuity. It’s like a film-crew time jump, where they suddenly got a much better camera.
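One practical note: joining the two 8-second clips is a step you handle yourself, outside Veo 3. If you want to splice your own pairs, ffmpeg’s concat demuxer is one common lossless option. A minimal Python sketch with placeholder filenames (an illustration, not necessarily the exact tool I used):

```python
# Losslessly join two clips with ffmpeg's concat demuxer.
# Filenames are placeholders; ffmpeg must be installed and on your PATH.
import pathlib
import subprocess

clips = ["zoo_part1.mp4", "zoo_part2.mp4"]  # hypothetical Veo output files

# The concat demuxer reads a text file listing inputs in playback order.
# Absolute paths keep the list valid regardless of where it is written.
list_file = pathlib.Path("clips.txt")
list_file.write_text(
    "".join(f"file '{pathlib.Path(c).resolve()}'\n" for c in clips)
)

# -c copy skips re-encoding, which works cleanly when both clips share a
# codec, resolution, and frame rate, as clips from the same model should.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0",
     "-i", str(list_file), "-c", "copy", "zoo_combined.mp4"],
    check=True,
)
```

Copying the streams rather than re-encoding keeps the join instant and avoids any further quality loss, though it can’t fix the visual mismatch between the two generations.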
I’m also a little frustrated that all my Veo 3 videos come with nonsensical subtitles. I need to remember to tell Veo 3 to remove them, hide them, or place them outside the video frame.
I think about how hard it probably was for Karim to film, edit, and upload that first short video, and how I essentially made the same clip without needing people, lighting, microphones, cameras, or elephants. I didn’t have to transfer footage from a tape or even from an iPhone. I simply conjured it with an algorithm. We have truly stepped through the looking glass, my friends.
I learned something else from this project. As a Google AI Pro subscriber, I get two Veo 3 video generations per day. That means I can do this again tomorrow. Let me know in the comments what you’d like to see me create.