GPT-4o as a promise of autonomous AI agents?
Martin Průcha, 15. 05. 2024
OpenAI has recently introduced GPT-4o, its latest model designed for more natural human-computer interaction. It can handle text, audio, and image inputs simultaneously and deliver outputs in any combination of these formats. The model, dubbed "omni" for its all-encompassing capabilities, significantly improves responsiveness: its audio responses average around 320 milliseconds, matching human reaction times in conversation. It also performs better in non-English languages and in multimedia understanding, while being faster and 50% cheaper in the API than its predecessors.
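For readers who want to try the model, a minimal sketch of a request through OpenAI's Python SDK, mixing text and image input in a single message, might look like the following (the image URL is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One request combining text and image input; GPT-4o handles both natively.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is in this picture."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```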
Notably, GPT-4o can understand tone, background noises, and multiple speakers, capabilities that earlier pipelines lacked because they relied on separate models for transcription and audio output. In the live preview, it conversed smoothly with the presenters and even helped solve a math problem in real time. The full potential and limitations of this end-to-end model are only beginning to be explored, in applications ranging from real-time translation to interactive entertainment.
GPT-4o resembles what the tech sector calls autonomous AI agents, or at least it moves in that direction. An autonomous AI agent is a system designed to operate independently, making decisions and executing actions without human intervention. Projects such as Flitig AI or AutoGPT point to their rise across multiple domains, and Musk's ambitious plans for humanoid robots may fall into this category as well. A minimal sketch of such a decide-and-act loop follows below.
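To make the idea concrete, here is an illustrative sketch of an agent loop in which the model decides the next action and the surrounding program executes it. The `search_web` stub and the plain-text CALL/DONE protocol are assumptions for demonstration only, not how AutoGPT or any particular agent is actually implemented:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_web(query: str) -> str:
    """Hypothetical tool; a real agent would wire in an actual search API."""
    return f"(stub) top results for: {query}"

def run_agent(goal: str, max_steps: int = 5) -> str:
    """Let the model decide the next action, execute it, feed the result back."""
    history = [
        {"role": "system",
         "content": "You are an autonomous agent. Reply with exactly one line: "
                    "either 'CALL search_web: <query>' or 'DONE: <answer>'."},
        {"role": "user", "content": goal},
    ]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4o", messages=history,
        ).choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply.removeprefix("DONE:").strip()
        if reply.startswith("CALL search_web:"):
            result = search_web(reply.split(":", 1)[1].strip())
            history.append({"role": "user", "content": f"Tool result: {result}"})
    return "Step limit reached without a final answer."

print(run_agent("Find out when GPT-4o was announced."))
```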
However, we should be cautious about how revolutionary the new GPT-4o really is. As the Czech computer scientist and neural language modeling pioneer Tomáš Mikolov once put it, OpenAI's progress comes more from clever work-arounds than from solving problems directly, i.e. by developing a qualitatively new model. For example, on Dan Tržil's podcast he explained that when OpenAI ran into ChatGPT's inability to provide factual weather data, its engineers made a quick patch, connecting the model to an external weather-data API. This approach may work, but it will not produce anything genuinely revolutionary.
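As an illustration of this work-around pattern (not OpenAI's internal implementation, which is not public), here is a sketch using the public function-calling mechanism of the chat API; `get_weather` is a hypothetical stub standing in for a real weather service:

```python
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    """Hypothetical stub; in practice this would call a real weather API."""
    return json.dumps({"city": city, "temp_c": 18, "condition": "cloudy"})

# Describe the external function so the model can decide to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the weather in Prague right now?"}]
first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = first.choices[0].message

if msg.tool_calls:  # the model chose to call the external API
    call = msg.tool_calls[0]
    messages.append(msg)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": get_weather(json.loads(call.function.arguments)["city"]),
    })
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```

The model never learns anything about weather here; it simply delegates the question to an external service, which is precisely the kind of patch Mikolov describes.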
The question is whether engineers took the same path with the new GPT-4o. Did they gradually refine GPT-4, adding more add-ons and functionality so that it merely looks more intelligent and human, or did they actually create a state-of-the-art system?
Author: Oldřich Příklenk
Image: openchat