Why AI fails at creative writing and how to deal with it

If you want to use a tool effectively, it is important to first understand its capabilities and limitations. AI is no different. I even find it exciting to learn more about these systems that are able to do so many things but also fail at other tasks completely. It is often less than obvious what an AI can and cannot do well.

This becomes much clearer when you know more about their inner workings.

In this article, I’ll explain to you the basics of how a modern AI model is brought into existence. You’ll see which tools and processes are used and which steps are taken.

Additionally, I will point out which problems are introduced at each step, how you can mitigate them, and what other concepts and ideas are being developed to improve future AI models.

At the end, you will know much better why an AI behaves the way it does. And that in turn will help you to decide when and how to use it and which quirks and problems you have to look out for.

Basics: guessing the next word

At their core, modern AI models like Gemini, Claude, or ChatGPT work like highly advanced autocomplete systems. To understand them, you have to look at the underlying design that makes them work, known as the Transformer.

Before the Transformer was invented in 2017, text-generating software read sentences word by word. By the time it reached the end of a long paragraph, it had often forgotten how the paragraph started. The Transformer solved this with a mechanism called Self-Attention.

What is the difference? Instead of reading linearly, the system looks at all words in a text simultaneously. It assigns mathematical weights to figure out which words refer to each other, regardless of how far apart they are. For example, if the word “bank” appears, the mathematical weights of the other words present shape its meaning, making it hopefully clear whether the context implies a river feature or a financial institution. It weaves a dense mathematical web connecting every piece of a sentence to every other piece.

When you send a prompt, the AI uses this massive web to calculate exactly which word most likely follows next to generate its answer.

Actually, the AI doesn’t see or use words, but tokens. More about that in a minute.
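The core of the attention calculation can be sketched in plain Python. This is a toy version with made-up two-dimensional “word” vectors and no learned weights, so it illustrates only the weighting-and-mixing step, not a real Transformer:

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Every position looks at every other position: score each pair
    by dot product, softmax the scores into weights, then mix the
    vectors according to those weights."""
    output = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        weights = softmax(scores)
        mixed = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(len(query))
        ]
        output.append(mixed)
    return output

# Three toy 2-dimensional "word" vectors:
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(words)
```

Each output vector is a blend of all input vectors, weighted by similarity. That blending is how the meaning of “bank” gets shaped by its neighbors.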

The advantage: coherent text

This mechanism enables fluent text generation, because the Transformer design tracks context over long passages. If you mention a specific detail at the beginning of a document, the system’s attention mechanism assigns a high probability to related words paragraphs later. This is the scientific magic behind an AI-generated text that is coherent and reads naturally.

Problem 1: hallucinations

At the same time, the probability web is also the root cause of a major problem that is often called a hallucination: The AI works with statistical likelihood and not factual truth. If a false statement is somehow mathematically the most probable next string of words, the system will generate it. The software does not consult a database of verified facts for its reply unless it was explicitly instructed to do so.

If you want to dive deeper into this topic: I have written another article about how to spot and avoid AI hallucinations in your writing.

Problem 2: missing comprehension

Another problem originates from here: The software can simulate human comprehension, but doesn’t truly possess it. Because it has no real understanding of the world, it cannot apply common sense or evaluate the actual meaning of the text it produces. Because of this, you have to provide the logic, the strategy, and the critical thinking. You must guide the AI step by step and provide context, rather than assuming it grasps your underlying intent.

Problem 3: limited attention span

The Transformer’s ability to connect every word with every other word also comes at a computational cost: The calculations grow quadratically as the text gets longer. This requires copious amounts of computer memory. To prevent the system from slowing down or crashing, engineers impose a limit on how much text the AI can process at once. This limit is known as the context window. You can see it as the AI’s short-term memory or its attention span.
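The quadratic growth is easy to see with a back-of-the-envelope calculation (an illustration that ignores real-world optimizations):

```python
def attention_scores(n_tokens: int) -> int:
    """Every token attends to every token: n * n score calculations."""
    return n_tokens * n_tokens

# Doubling the input length quadruples the computation:
short_doc = attention_scores(1_000)  # 1,000,000 scores
long_doc = attention_scores(2_000)   # 4,000,000 scores
```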

When a conversation or a document exceeds this limit, the AI starts to forget the earliest parts of the prompt or “compresses” them, which can make the system seem erratic or less capable. I have written another article explaining why an AI sometimes seems suddenly “dumber”, and managing this context window is part of the solution. You can learn more about how to spot this phenomenon and ways to deal with it in that article.

Alternative approaches

Engineers are testing other designs to overcome the limitations of the Transformer model.

  • Diffusion-based text models: Some image generators use a technique called diffusion to turn random noise into a picture. The image is refined step by step, but as a whole. Researchers are applying this concept to text generation as well: Such a model plans the whole text at once and refines it iteratively. This approach can improve the overall structure of a document. But training these models for text is currently very difficult.
  • State Space Models (SSMs): The Transformer needs vast amounts of memory when reading very long texts, because, as we discussed above, it connects every word with every other word. Models based on the Mamba design process data linearly instead. They require far less memory for extensive documents. This approach could solve the context window bottleneck.

Basics: there are no words or letters

When you type a word, you see individual letters. But an AI system doesn’t. Before the software can process your prompt or read a text, a translator mechanism chops all words into smaller chunks. Engineers call these chunks tokens, and the tool that does the chopping is a tokenizer.

The process relies on a method called Byte Pair Encoding: This groups frequent letter combinations together. Short words often become a single token. Longer words are split into syllables or arbitrary fragments. In the end, the AI only calculates and understands these numerical tokens.
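A toy version of the Byte Pair Encoding idea fits in a few lines: start from single characters and repeatedly merge the most frequent adjacent pair. This is a simplified sketch; real tokenizers learn tens of thousands of merges from huge text corpora:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """The adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of the pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters; frequent combinations fuse into tokens.
tokens = list("low lower lowest")
for _ in range(2):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
# After two merges, the frequent chunk "low" is a single token.
```

The frequent fragment “low” becomes one token, while rarer endings like “est” stay split up, which is exactly the pattern you see in real tokenizers.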

The advantage: speed and efficiency

This chunking process gives the software a significant performance boost: Instead of calculating the mathematical relationships for thousands of individual letters, the system only has to process a few hundred tokens.

You might wonder why the AI does not just use whole words instead to be even faster. The reason is the “infinite dictionary problem”: If the software used whole words, it would need a separate entry for “run”, “runs”, “running”, for every single typo, every piece of modern slang, and every word in every language. To stay fast, an AI model can only handle a limited dictionary of around 50,000 to 100,000 chunks.

Tokens are the perfect middle ground. The AI keeps its dictionary small and highly efficient, but because it has versatile building blocks, it can still construct or read any word in existence. This efficiency enables the software to handle multiple languages, complex programming code, and comprehensive vocabularies.

Problem: spelling blindness

The token system creates a surprising weakness: The AI is blind to spelling. This leads to the famous “strawberry problem.” If you ask a standard AI model to tell you how often the letter “r” comes up in the word “strawberry,” it might give you the wrong answer.

Knowing about tokens makes this surprising mistake obvious: The software does not see the individual letters. It sees two or three separate tokens. This is also the reason why AI tools struggle with precise rhyming, creating poetry, or formatting text to an exact character count.

Interestingly, if you instruct the AI to go step by step, it can count the letters in “strawberry” just fine. It’s not dumb. It just experiences and sees the world differently. This shows why it is so useful to have some basic knowledge of how AI works: You can understand, prevent, or circumvent these kinds of limitations.
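A small sketch shows both sides of the strawberry problem: the chunked view the model roughly works with, and the letter-by-letter view that the step-by-step trick restores. The token split shown here is illustrative; the real boundaries depend on the tokenizer:

```python
word = "strawberry"

# Roughly what the model "sees": a few opaque chunks, not letters.
# (Illustrative split -- real boundaries depend on the tokenizer.)
chunks = ["str", "aw", "berry"]
assert "".join(chunks) == word

# Spelling the word out letter by letter -- the "go step by step"
# trick -- makes every character visible again:
letters = list(word)           # ['s', 't', 'r', 'a', 'w', ...]
r_count = letters.count("r")
```

Counted character by character, the three r’s are trivial to find; counted chunk by chunk, they are invisible.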

Alternative approaches

Researchers are exploring ways to process text without tokens to overcome such blind spots.

  • Character-level models: Some experimental designs read raw text strictly letter by letter. This completely solves spelling issues and improves tasks like arithmetic that depend on individual characters. But as mentioned above: Reading letter by letter requires far more computing power. This makes the approach too slow and expensive for large models today.

Making an AI, step 1: pre-training and ingesting the web

Before an AI can answer your prompts, it needs an understanding of language. Engineers build this during a phase called pre-training.

During this step, the software ingests large amounts of text, hundreds of millions of content pieces or more. This training material includes digital books, Wikipedia articles, news websites, and a general snapshot of the public internet like Common Crawl. Supercomputers process this data for several months. The software analyzes the mathematical relationships between the tokens as discussed above.
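A toy version of this statistical learning, assuming whole words as tokens for simplicity, is just a table of which token most often follows which:

```python
from collections import Counter, defaultdict

def train_next_token_table(text):
    """Toy 'pre-training': for every token, count what follows it."""
    tokens = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def most_likely_next(follows, token):
    """The statistically most probable continuation."""
    return follows[token].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat chased the mouse"
table = train_next_token_table(corpus)
```

Real pre-training replaces this counting table with billions of learned parameters and looks at far more than the previous token, but the principle is the same: probability, not understanding.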

The advantage: broad world knowledge

This data consumption gives the model its baseline capabilities. The AI learns the rules of grammar and syntax. It acquires multilingual abilities and basic reasoning skills. The software essentially learns how human language works and how different concepts connect. This creates the mathematical web of probabilities that allows the AI to generate responses on a vast range of topics.

Problem 1: absorbing human flaws

The internet contains high-quality information alongside spam, hate speech, and deep-seated human biases. The AI absorbs all of it. Because the software only looks at statistical probabilities, it does not distinguish between a peer-reviewed scientific paper and a toxic forum post. If biased language appears frequently in the training data, the AI will mirror that bias. Engineers spend a lot of time filtering the training material to mitigate this, but completely cleaning the entire internet is impossible. Because of this, you need to pay attention to biases, stereotypes or common misconceptions in the AI’s output.

Problem 2: frozen knowledge and cut-off dates

This pre-training requires thousands of specialized computer chips running at full capacity. This process costs millions of dollars and consumes large amounts of electricity. Because of this, the training phase has a strict end date.

Once the pre-training is complete, engineers lock the mathematical web of probabilities. This also turns the model into a stable software product that behaves consistently for every user. Early experimental chatbots learned continuously from users and were quickly manipulated into producing toxic output. A “frozen” model prevents this.

But this stability creates a limitation: Such an AI cannot learn on the fly. If an event happens after the training stops, the AI remains unaware of it. This creates the cut-off date. Because of this, you must be careful when asking an AI about current events, recent software updates, or new cultural trends. The software might confidently give you an outdated answer or hallucinate a fabricated one, simply because the correct information does not exist in its outdated mathematical web.

Alternative approaches and workarounds

Since retraining a large language model every week is too expensive, developers use workarounds to give the AI access to current information or look for more efficient ways to build these foundations.

  • Retrieval-Augmented Generation (RAG): This is the most common solution to bypass the cut-off date. The system connects the AI to a search engine or a private database. When you ask a question about a recent event, the system might first search the internet for relevant articles. Or you can instruct it to only use internal company data for its research.
  • Continual learning: Researchers are experimenting with designs that give the AI a persistent external memory. These systems use a separate database to store important facts from your past conversations. The AI can check this database to remember your preferences or ongoing projects without needing to retrain the model.
  • Small Language Models (SLMs): Instead of scraping the entire internet, some engineers train much smaller models on strictly curated data. They might use high-quality textbooks as well as synthetic data generated by larger AI systems. These smaller models require less computing power to train and run. They cannot answer questions about every topic, but they perform well on specific, focused tasks.
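The Retrieval-Augmented Generation idea from the first bullet can be sketched with a toy word-overlap retriever. Real systems use semantic search rather than keyword overlap, and the documents and names here are invented for illustration:

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(question, documents):
    """Stuff the best-matching document into the prompt as context."""
    context = "\n".join(retrieve(question, documents))
    return (
        "Answer using only this context:\n"
        + context
        + "\n\nQuestion: " + question
    )

docs = [
    "The office is closed on public holidays.",
    "Version 4.2 of the editor shipped in March.",
]
prompt = build_prompt("When did version 4.2 ship?", docs)
```

The frozen model never changes; fresh facts simply ride along inside the prompt.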

Making an AI, step 2: supervised fine-tuning

After the pre-training phase, the AI is knowledgeable, but it is practically useless for you and me. At this stage, it is a sophisticated document completer: If you type a question, it might generate five more related questions, because it learned that questions often appear in lists on the internet.

To turn this into a helpful assistant, engineers use a process called supervised fine-tuning: They feed the software thousands of curated examples. Human experts write specific prompts and pair them with the perfect responses. The AI analyzes these pairs to learn the mathematical patterns of a helpful conversation.
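The training material for this step is conceptually simple. Here is an illustrative sketch with two invented prompt-response records, stored in the JSON-lines format commonly used for fine-tuning data:

```python
import json

# Invented examples: each record pairs an expert-written prompt
# with the response the model should learn to imitate.
training_pairs = [
    {
        "prompt": "Summarize the following text in one sentence: ...",
        "response": "The article argues that ...",
    },
    {
        "prompt": "Turn these notes into a polite status update: ...",
        "response": "Hi team, here is a quick update: ...",
    },
]

# Typically stored one JSON record per line ("JSONL"):
jsonl = "\n".join(json.dumps(pair) for pair in training_pairs)
```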

The advantage: following instructions

This step unlocks the true utility of the AI: The software learns the classic question-and-answer format and it gains the ability to (mostly) follow your instructions. Because of the human examples, it understands how to format an output as a bulleted list or a data table. It transforms from a passive text predictor into an active conversational partner.

Problem 1: catastrophic forgetting

This focused training creates a new issue: Engineers call it catastrophic forgetting. The software has to adjust its web of probabilities to prioritize the new conversational rules. By forcing the model to learn a specific helpful behavior, it might overwrite some of its broad foundational knowledge.

Problem 2: less creativity

This process also narrows the creative portfolio of the AI. The human examples teach the software to answer in a very specific, predictable way. For a content professional, this means the AI becomes very good at following strict formatting rules, but it loses a significant amount of its raw, unpredictable creativity.

This is just one example of many that shows: Gemini, Claude, ChatGPT, and others are built as general-purpose AI assistants first. That they can also be used for text creation is just a byproduct. And that shows.

Alternative approaches

Curating thousands of perfect prompt-response pairs requires a large amount of expensive human labor. Tech companies are trying to automate this.

  • Synthetic data and self-play: Instead of relying on human writers, engineers use larger, already finished AI models to generate the training examples for new models. In some experimental setups, the AI generates its own training pairs and learns through automated trial and error.

Making an AI, step 3: alignment and safety

Even after fine-tuning, the AI can still be unpredictable. It might generate dangerous instructions or use offensive language. To prevent this, tech companies put the model through yet another phase: alignment.

The most common method is called Reinforcement Learning from Human Feedback (RLHF): Teams of human testers interact with the AI and grade its answers. They rank multiple responses to the same prompt, rewarding the software for being polite, helpful, and safe, while penalizing it for harmful content. Through this feedback system, the AI adjusts its internal probabilities to favor the answers that humans seem to prefer.
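The graders’ feedback boils down to preference records like this invented example: two candidate answers to the same prompt, with the humans’ choice marked. The training process then nudges the model toward the “chosen” style:

```python
# An invented preference record of the kind RLHF training consumes:
preference = {
    "prompt": "My draft was rejected. What should I do?",
    "chosen": "Sorry to hear that. Here are three constructive next steps ...",
    "rejected": "That means your draft was bad.",
}

def is_valid_preference(record):
    """A usable record needs a prompt plus a ranked answer pair."""
    return {"prompt", "chosen", "rejected"} <= set(record)
```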

The advantage: safety and usability

This process makes the model safer for public use. It learns to refuse illegal or dangerous requests. It becomes more polite and prioritizes helpfulness above all else. This reliability is important, for example, for businesses that want to integrate the software into their customer service or daily workflows without risking damage to their reputation.

Problem 1: the generic AI tone

The alignment process is also the birthplace of the generic “AI tone.” When you average out the preferences of thousands of human graders, the result is the ultimate middle ground. The software sheds much of its stylistic edge, strong opinions, and minority voices. It defaults to a diplomatic, often boring, corporate style. For writers, this means you have to actively prompt against this default behavior if you want engaging, opinionated, or creative text. You probably know this struggle very well.

Problem 2: over-refusal and sycophancy

To avoid any risk, the AI might develop a tendency for over-refusal. It then blocks completely harmless prompts because a single word triggered a safety rule. Another side effect is sycophancy: The software is trained to be helpful and please the user. Therefore, it might happily agree with you, even if you state something factually incorrect. This makes it a flawed tool if you want to use it as a critical sounding board without explicitly instructing it to disagree with you.

Problem 3: unintended consequences

A behavior marked as favorable in one instance can be misguided in another. One curious example from OpenAI: Its models started to obsess over ogres, goblins, and other creatures. Human graders had liked these references under some specific circumstances, but they then started to show up in many other places as well. Another example is the sycophancy crisis OpenAI found itself in a few months ago after it over-emphasized positive user feedback. The problem: Many people like it better when their beliefs and viewpoints are confirmed. This made the AI so uncritical that it could lead users to ill-informed or even dangerous decisions.

Alternative approaches

Because human grading is slow and expensive, researchers are developing other ways to align models.

  • Direct Preference Optimization (DPO): This is a mathematical shortcut that aligns the AI directly, without training a separate reward model to judge its answers. It achieves similar safety results but is faster and cheaper to execute.
  • Constitutional AI: Pioneered by companies like Anthropic, this method replaces human graders with a set of written rules (a constitution). The AI is instructed to review and correct its own outputs based on these rules, automating the alignment process.

Making an AI, step 4: inference-time reasoning

For a long time, the text generation process was considered complete after alignment: The software received a prompt and immediately started predicting the next tokens of its response. But engineers have since introduced an additional phase that runs while the answer is being generated. They call this “inference-time compute” or “reasoning.” You will also see the term “thinking.”

These AI models use a hidden digital scratchpad to plan their next moves. The software is trained to break down complex prompts into smaller logical steps. It essentially talks to itself and checks its own work, catches logical errors, tries different approaches, and only then outputs the final answer.

The advantage: logic and self-correction

This hidden step improves the performance in complex tasks. Standard models often fail at mathematics, coding, or deep logic puzzles because they cannot plan ahead. By verifying its own steps, a reasoning model is able to correct a wrong path before it provides an answer. This evolves the AI from a fast talker into a more deliberate problem solver.
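The loop behind this self-correction can be sketched abstractly: draft an answer, check it, and retry on failure. The toy functions below stand in for the model, and all names are invented for illustration:

```python
def solve_with_checking(task, generate, verify, max_attempts=3):
    """Sketch of inference-time self-correction: draft an answer,
    check it, and try a different path if the check fails."""
    draft = None
    for attempt in range(max_attempts):
        draft = generate(task, attempt)
        if verify(task, draft):
            return draft
    return draft  # best effort after all attempts

# Toy stand-ins: attempt 0 simulates a hasty wrong guess,
# later attempts simulate more careful "thinking".
def toy_generate(task, attempt):
    return 42 if attempt == 0 else sum(task)

def toy_verify(task, answer):
    return answer == sum(task)

answer = solve_with_checking([2, 3, 5], toy_generate, toy_verify)
```

The first hasty draft fails the check, the second attempt passes, and only the verified answer is shown; the failed draft stays on the hidden scratchpad.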

Problem 1: slow and expensive

The hidden scratchpad requires more computational power. Before the software shows you the first word, it might have already generated thousands of tokens in the background. This makes these models slower. It also increases the cost per query.

Problem 2: overthinking simple tasks

Reasoning models are built for logic, not for creative writing. If you ask a reasoning model to draft a simple email or write a creative paragraph, it tends to overthink the task. This can result in text that feels even more robotic and stiff than usual. This is why I recommend using the quicker models for creative tasks. They are often called something like “Flash”, “Fast”, or “Instant.”

Problem 3: getting confused

Even reasoning or thinking doesn’t guarantee a correct or even logical answer. It can still overlook details, get caught up in thinking loops, or come to the wrong conclusion after almost arriving at the correct one. Therefore, you still need to apply your own knowledge, skills, and common sense to check an AI’s reply. Keep your own brain engaged and in general: take it slow with your AI.

Alternative approaches

To balance speed, cost, and logic, developers are working on intelligent distribution systems.

  • Dynamic routing: Instead of sending every prompt through a reasoning process, the system uses an automated router. If you ask for a simple text translation, the router sends the prompt to a fast, standard model. If you upload a big spreadsheet for analysis, the router directs the task to the reasoning model. This is often called “Auto” mode.
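A router like this can be sketched with a few rules. Real routers use a trained classifier rather than keywords, and the model names below are invented:

```python
def route(prompt: str) -> str:
    """Toy router: send analytical or very long prompts to the slow
    reasoning model, everything else to the fast model."""
    analytical_hints = ("analyze", "prove", "debug", "calculate", "spreadsheet")
    text = prompt.lower()
    if len(prompt) > 2000 or any(hint in text for hint in analytical_hints):
        return "reasoning-model"
    return "fast-model"
```

A quick translation goes to the cheap model, while a spreadsheet analysis triggers the expensive reasoning path.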

Final word

So this is how the AI sausage is made. These are the main tools, processes, and steps.

I hope you found this article interesting, illuminating, and helpful. You and I don’t need to become AI researchers or get a Computer Science degree. But I find it useful to have at least a general understanding of how these tools that I use every day work.

That way, I know more clearly how, when, and for what I want to use them.

As you’ve also seen: The current generation of AI models is not made for writers. You probably already guessed that from some frustrating interactions trying to use them for your work.

To improve this situation, we would need an AI built from the ground up for our needs. It would start with highly curated content for its pre-training, followed by fine-tuning and alignment tailored for creative work, so that the model develops a deep understanding of the different formats, styles, and quirks of well-written text.

We’ll see if any of the AI vendors will come out with such an AI model. Until then, we have to wrangle with what we have.
