Why Can’t ChatGPT Handle This Simple Image Editing Task Despite Its AI Brilliance?

Despite advancements in AI, models like ChatGPT struggle with simple image editing tasks. The problem stems from limitations in their multimodal capabilities, which are crucial for tasks involving both text and images.

Short Summary:

  • ChatGPT excels at generating images from text prompts but fails to modify existing images that it did not create.
  • Attempts to simplify or alter images using ChatGPT often result in unmodified or incorrect outputs.
  • Other AI models like Google’s Gemini and Microsoft’s Copilot also struggle with image editing tasks, highlighting a broader industry challenge.

Artificial intelligence has made significant strides in many areas, including generating text and images. OpenAI’s ChatGPT has proven capable of creating astonishingly detailed images from mere text prompts. Yet, when it comes to editing existing images, ChatGPT and other similar models falter. This gap in capability has raised questions about the limitations of current AI technologies and the complexities involved in multimodal models, which need to integrate text, image, and even audio inputs to perform effectively.

Multimodal AI systems, like OpenAI’s ChatGPT, aim to operate across different types of input data, such as text and images. These models have shown promise; for instance, ChatGPT can generate an image of a “napkin in love with a spoon” simply from a text prompt. When asked, it can even simplify this image by creating a version with fewer details. However, when challenged to modify an image not originally created by the tool, ChatGPT, operating on the GPT-4o model, consistently fails to deliver.

Other advanced AI models from leading tech companies share similar shortcomings. Google’s Gemini, Anthropic’s Claude, and Microsoft’s Copilot (which, ironically, also runs on GPT-4) either struggled with these tasks or declined them. When asked to edit an existing image, Copilot avoided completing the task altogether; instead it removed the heads from a drawing I uploaded, citing privacy reasons.

“It doesn’t look any different,” I wrote back to ChatGPT, commenting on the unmodified image it produced.

I started with a simple line drawing of two people on the subway, made with the Concepts drawing app on an iPad, to see whether ChatGPT could simplify it. After I uploaded the drawing to the ChatGPT app on iOS and typed the prompt, “Is there a way to simplify this drawing?”, ChatGPT responded with text-based suggestions, like “Remove Minor Details” or “Eliminate small details like intricate hair strands or minor creases in clothing.” These were reasonable text responses, but they did not translate into a simplified image when the tool tried to apply them.

When I confirmed that I wanted ChatGPT to simplify the drawing, the resulting image was an exact, unmodified copy. “It doesn’t look any different,” I pointed out via text. ChatGPT, however, continued, saying, “To further simplify the drawing, I’ll focus on reducing the line density and enhancing the primary shapes and contours more effectively. Let me refine the process.” Yet, the subsequent results were identical to the original, suggesting no actual modifications were made.

“It seems the automatic methods are not effectively simplifying the image as intended,” ChatGPT admitted. “I’ll manually remove the less significant details and retain only the essential elements. Let me refine this approach further.”

This so-called ‘manual method’ resulted in nothing more than a black-and-white inversion of the drawing, flipping black lines to white and vice versa. With this rudimentary change, ChatGPT touted its achievement and stated, “This version retains only the key contours and essential lines.” When I protested that the output was merely an inverted image, not a simplification, the AI responded with another nonsensical iteration: a completely black square without any lines at all, perhaps a form of AI humor.
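To see why an inversion is not a simplification, consider a minimal Python sketch. It is an illustration only, not a reconstruction of whatever ChatGPT ran internally. It assumes a drawing stored as a small grid of 0 (white) and 1 (black) pixels; the grid and the helper names `invert` and `edge_count` are hypothetical. Swapping the colors leaves every boundary between black and white exactly where it was, so the amount of detail does not change.

```python
# Hypothetical 5x5 bitmap: 1 = black ink, 0 = white paper.
drawing = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]


def invert(image):
    """Swap black and white for every pixel."""
    return [[1 - px for px in row] for row in image]


def edge_count(image):
    """Count neighboring pixel pairs whose colors differ (a rough proxy for detail)."""
    rows, cols = len(image), len(image[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            # Compare with the pixel to the right.
            if c + 1 < cols and image[r][c] != image[r][c + 1]:
                count += 1
            # Compare with the pixel below.
            if r + 1 < rows and image[r][c] != image[r + 1][c]:
                count += 1
    return count


inverted = invert(drawing)

# Inversion changes every pixel, yet the amount of detail is identical.
print(edge_count(drawing) == edge_count(inverted))
```

By this rough measure, a genuine simplification would lower the edge count, while an inversion never can.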

Other models offered scant improvement. Google’s Gemini, after struggling, produced a vaguely simplified picture and a canned apology for not generating realistic images of people. Perplexity and Claude, when asked to perform the simplification, said they currently cannot generate images at all. Most bizarrely, Microsoft’s Copilot censored the drawing’s characters by removing their heads, claiming privacy reasons.

This experimentation exposes a profound limitation. Despite their prowess in generating new content from scratch or analyzing images, today’s AI models fail to modify specific elements of existing artwork. The root cause lies in their inability to assemble pictures based on high-level visual and semantic concepts gleaned from a text prompt.

This shortcoming becomes evident when considering that most AI models, including ChatGPT, lack the capability to act on the individual pieces of a given image. Even when we adjusted our prompts, providing semantic cues and additional context for simplification, the models still could not alter the specific parts of the image as intended.

“ChatGPT cannot act on individual picture elements, such as lines,” I concluded. “That explains why it’s a poor editor for both images and text: it doesn’t know what to consider essential or what to discard.”
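Acting on individual picture elements can be sketched in a few lines of Python. This is a sketch under stated assumptions, not a description of how any of the models named here work: it assumes the drawing is available as a list of vector strokes, and the names `strokes`, `stroke_length`, `simplify`, and the `min_length` threshold are all hypothetical. The threshold is only a crude stand-in for the judgment about what is essential and what can be discarded.

```python
from math import hypot

# Hypothetical vector drawing: each stroke is a list of (x, y) points.
strokes = [
    [(0, 0), (100, 0), (100, 80)],   # long contour
    [(10, 10), (12, 11)],            # tiny detail
    [(20, 20), (60, 70)],            # medium line
    [(50, 50), (51, 52), (52, 51)],  # tiny detail
]


def stroke_length(stroke):
    """Total length of a stroke, summed over its consecutive segments."""
    return sum(
        hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(stroke, stroke[1:])
    )


def simplify(all_strokes, min_length):
    """Keep only strokes at least min_length long; drop the rest as minor detail."""
    return [s for s in all_strokes if stroke_length(s) >= min_length]


simplified = simplify(strokes, min_length=10.0)
print(len(strokes), "strokes reduced to", len(simplified))
```

Length alone is a poor measure of importance (a short stroke may be an eye), which is exactly the editorial discernment the models tested here seemed to lack.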

AI models predominantly generate content that matches target “probability distributions” derived from training examples, but they cannot selectively distill an original work down to its essence. Steve Jobs once highlighted that the most critical function of software is “editing”: knowing what to retain and what to delete. At present, AI resembles an eager but unsophisticated assistant, keen on producing something but lacking the discernment to refine it meaningfully.

As AI technologies advance, understanding the shortcomings observed in ChatGPT’s handling of image simplification is crucial. Determining how AI can more precisely interpret and act on intricate tasks could bridge these gaps in capability. However, until these complex challenges are resolved, AI models will continue to excel where they are trained deeply but struggle in areas that require nuanced human judgment and selective creativity.

Understanding the Gap Between AI Potential and Practicality

While ChatGPT has impressively demonstrated the potential of large language models to generate text and images, its current capabilities clearly have limits. Its success, like that of most AI models, hinges on pre-trained data and the application of statistical probabilities rather than on a fundamental understanding of the task.

The limitations of some AI models in carrying out specific tasks invite further debate and research on AI ethics and the future of AI writing. The challenges observed in image modification tasks underline the need for advances beyond statistical modeling, toward models that handle context, conceptual understanding, and nuanced refinement.

For now, tools like ChatGPT will continue to transform content generation. But understanding their limitations helps pave the way for developing more advanced models and technologies.