Why DIY AI for IBM i Application Modernization Looks Easy Until It Doesn’t

A pattern is emerging across IBM i shops in nearly every industry right now. A developer or architect decides to test generative AI on a small RPG-to-Java or .NET conversion task. They grab some RPG source code, feed it into Claude or another general-purpose LLM, and within a few hours they have something that compiles, renders a screen, and appears to work. They schedule time with their manager, deliver an impressive demo, and the results look promising.

Then the hard part begins.

What happens in that demo represents the best-case conditions for generative AI: a contained program, a motivated developer who can guide and correct the output, and an audience that doesn’t yet know what to look for.

What it doesn’t represent is production-ready, enterprise-scale modernization. The gap between those two things is where organizations lose months, compromise code quality, and accumulate technical debt they won’t fully understand until they are deep into the problem.

This article reflects the collective experience my team has built across large-scale IBM i application modernization projects, including converting over 10 million lines of RPG code using our agentic AI approach. It focuses on the operational realities that tend to surface after the demo stage and is intended to help technical leaders make clear-eyed decisions, not to discourage AI adoption.

The goal is to help you understand exactly where the risks are and what it takes to manage them effectively.

Why DIY Feels Like the Right Call

The impulse to handle this in-house is completely understandable. Developers and architects want to be the ones solving the problem. It’s technically interesting work, and there’s genuine professional satisfaction in pulling it off.

Beyond individual motivation, there’s organizational pressure. CTOs and IT leaders are being asked to show AI-driven progress. Procurement friction for external solutions is real. Running an LLM against your own source code has an extremely low barrier to entry, and the early results can be genuinely impressive.

There’s also a financial logic that seems sound on the surface: API costs for a general-purpose LLM are relatively modest, and if you have internal developers willing to take this on, the initial investment looks much smaller than a structured engagement with a specialized toolset or team.

What this calculation often misses is the downstream work: verification, remediation, consistency enforcement, coexistence planning, and the cost of uncovering issues late in the process.

What IBM i Application Modernization at Scale Actually Means

Before diagnosing the problems, it’s worth being specific about what “scale” means in this context. We’re not talking about converting a handful of self-contained RPG programs. We’re talking about hundreds or thousands of programs, including large ones with tens of thousands of lines, deeply integrated with other programs through CL, data areas, database interactions, and direct program calls. Copybooks are nested. Database files are shared across systems. Business logic is distributed across layers in ways that aren’t always obvious from looking at a single source file.

When you modernize one RPG program in isolation, you’re improving one part of the application, not modernizing the system as a whole.

The rest of your system still exists, still runs the way it always has, and the newly converted programs have to coexist with it seamlessly. That coexistence problem alone is one of the most underestimated challenges in any IBM i modernization effort, but we’ll come back to that.

The Core Pitfalls of DIY AI for IBM i Modernization

General-Purpose LLMs Skip What They Don’t Understand

Generative AI is not programmed to fail loudly. When it encounters code or context it cannot confidently handle, it will often proceed anyway, sometimes silently dropping logic or substituting an approximation. The output compiles, the screen renders, and nothing obviously signals that something is missing.

In direct comparisons our team conducted, a general-purpose LLM was given the same RPG program as a specialized IBM i modernization tool. The source contained 11 conditional logic paths that needed to be preserved in the converted output. The LLM identified and converted 3 of them. The output was plausible. It ran. It looked like a reasonable conversion. The other 8 conditions were simply gone. If you weren’t looking for them, you wouldn’t know.

This is not an edge case. General-purpose models are built to produce output that appears correct and satisfies the immediate request. They are not built to flag what they missed, because from their perspective, they didn’t miss anything.

The burden of verification falls entirely on you.
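One way to make that verification less manual is a crude structural check that compares how many branch points exist before and after conversion. The Python sketch below is an illustration only: the keyword lists, function names, and sample sources are assumptions, and it presumes free-format RPG on the input side and Java on the output side. A regex count is a smoke alarm, not proof of equivalence, since comments and string literals can skew it.

```python
import re

# Assumed keyword sets for illustration; a real tool would parse the source.
RPG_BRANCH = re.compile(r"^\s*(if|elseif|when|dow|dou)\b", re.IGNORECASE | re.MULTILINE)
JAVA_BRANCH = re.compile(r"\b(if|case|while)\b")


def count_rpg_branches(source: str) -> int:
    """Count lines of free-format RPG that open a conditional or loop test."""
    return len(RPG_BRANCH.findall(source))


def count_java_branches(source: str) -> int:
    """Count branch keywords in the converted Java ('else if' counts once)."""
    return len(JAVA_BRANCH.findall(source))


def missing_branches(rpg_source: str, java_source: str) -> int:
    """Return how many more branch points the original has than the output."""
    return max(0, count_rpg_branches(rpg_source) - count_java_branches(java_source))


# Hypothetical sources: four branch points in, one branch point out.
rpg_example = """\
if qty > 0;
  exsr shipOrder;
elseif qty = 0;
  exsr holdOrder;
endif;
select;
  when status = 'A';
    exsr approve;
  when status = 'B';
    exsr backorder;
endsl;
"""
java_example = "if (qty > 0) { shipOrder(); }"

print(missing_branches(rpg_example, java_example))  # → 3
```

A nonzero result does not say which logic was dropped, only that a reviewer should look before the conversion is accepted.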

Hallucinations and Silent Omissions Are Hard to Catch

The subtler problem is that poor output doesn’t look like poor output. Hallucinated business logic looks like business logic:

Incorrect date handling compiles cleanly.
A missing validation doesn’t announce itself.

You find it when something breaks in production, or when an auditor looks at the behavior of a converted program against its original specification.

Generated test suites compound this problem. When you ask a general-purpose LLM to write tests for code it just generated, it will often write tests that pass. But those tests are validating the generated code’s behavior, not the original program’s intended business behavior. It’s a closed loop that confirms the AI’s own work, not correctness against requirements. Building end-to-end functional tests is not something AI can do for you on its own.

Having 50 passing tests tells you very little if the tests themselves were written to match a flawed conversion.
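Breaking that closed loop means pinning tests to behavior captured from the original program rather than to the generated code. The Python sketch below assumes input and output pairs have already been recorded from the running RPG program; the function names, the record layout, and the sample data are all hypothetical and exist only to show the shape of the check.

```python
from typing import Callable, Iterable, List, Tuple


def check_against_recorded(
    converted: Callable[[dict], dict],
    recorded_cases: Iterable[Tuple[dict, dict]],
) -> List[str]:
    """Replay recorded inputs through the converted code and list mismatches."""
    failures = []
    for inputs, expected in recorded_cases:
        actual = converted(inputs)
        if actual != expected:
            failures.append(f"inputs={inputs}: expected {expected}, got {actual}")
    return failures


# Hypothetical converted routine that silently dropped the "on hold" rule.
def converted_discount(order: dict) -> dict:
    return {"discount": 10 if order["qty"] >= 100 else 0}


# Made-up stand-ins for pairs captured from the original program.
recorded = [
    ({"qty": 100, "status": "A"}, {"discount": 10}),
    ({"qty": 100, "status": "H"}, {"discount": 0}),  # held orders get no discount
    ({"qty": 5, "status": "A"}, {"discount": 0}),
]

for failure in check_against_recorded(converted_discount, recorded):
    print(failure)
```

The expected values here come from the original system, so a test generated alongside the conversion cannot quietly agree with a missing rule.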

Inconsistency Across Programs Creates Hidden Technical Debt

Run a general-purpose LLM against 20 programs and you will likely get 20 different approaches to the same recurring problems.

Date conversions handled differently in each one. UI validation logic reimplemented three ways. CSS that looks fine until someone has to maintain it, at which point they discover that what appears to be a small arrow icon is actually 40 lines of dynamically generated SVG coordinate code, because the AI decided to solve it that way in that program, on that run.

This is not the kind of technical debt you’re used to seeing. It doesn’t look messy. The code can be elegant in isolation. The problem is that you have re-implemented the same business logic six times across your application, each time slightly differently. When that logic needs to change, which it will, there is no single place to change it.

Maintainability in this context doesn’t mean whether you can read the code. It means whether your organization can evolve it. Inconsistent generation undermines that at a structural level.
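The usual countermeasure is a single shared implementation that every converted program calls. As a minimal sketch in Python, assuming the common IBM i CYYMMDD numeric date layout (century digit 0 for 19xx, 1 for 20xx), one helper might look like this. The function name is hypothetical, and a real target codebase would express the same idea in its own language and conventions.

```python
from datetime import date


def cyymmdd_to_date(value: int) -> date:
    """Convert a CYYMMDD number (for example 1240315) to a calendar date.

    Assumes century digit 0 means 19xx and 1 means 20xx.
    Raises ValueError for out-of-range or impossible dates.
    """
    if not 0 <= value <= 9_999_999:
        raise ValueError(f"not a 7-digit CYYMMDD value: {value}")
    century, rest = divmod(value, 1_000_000)
    yy, rest = divmod(rest, 10_000)
    mm, dd = divmod(rest, 100)
    # date() itself rejects month 0, day 32, and similar invalid parts.
    return date(1900 + century * 100 + yy, mm, dd)


print(cyymmdd_to_date(1240315))  # → 2024-03-15
print(cyymmdd_to_date(991231))   # → 1999-12-31
```

Legacy files often hold sentinel values such as 0 for "no date". How to treat them is a business decision, which this sketch leaves as an error so that the decision is made once, in one place.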

Context Size Is Not Context Management

There is a common misconception that giving an AI agent access to a large context window solves the information problem. It doesn’t. Effective context management for enterprise code modernization means giving the agent the right information, structured appropriately, with clear instructions on where to find what it needs, and without polluting the prompt with everything that has come before.

Copybooks are a clear example.

How does your agent handle nested copybooks?
Does it pull them in appropriately so the agent can reason about them accurately, or does it ask the agent to figure out the nesting structure itself?
Does your prompt carry forward every edge case and correction from prior programs, gradually degrading quality as the context grows larger and noisier?

Bigger prompts don’t automatically produce better results. Beyond a certain point, they produce worse ones.
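To make the copybook questions concrete, the Python sketch below resolves nested copy directives before anything reaches the model, so the agent sees flattened source instead of having to infer the nesting. It is a simplification under stated assumptions: members live in an in-memory dictionary, only the member name of a `/COPY` or `/INCLUDE` directive is used, and the marker comments and function name are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Simplified directive pattern: optional "file," or "lib/file," then the member.
COPY_DIRECTIVE = re.compile(r"^\s*/(?:copy|include)\s+(?:[\w/]+,)?(\w+)", re.IGNORECASE)


def expand_copies(member: str, members: Dict[str, str],
                  _stack: Tuple[str, ...] = ()) -> List[str]:
    """Return the member's lines with nested copy directives expanded inline."""
    if member in _stack:
        chain = " -> ".join(_stack + (member,))
        raise ValueError(f"circular copy: {chain}")
    lines = []
    for line in members[member].splitlines():
        match = COPY_DIRECTIVE.match(line)
        if match:
            name = match.group(1).upper()
            # Markers keep provenance visible to whoever reads the payload.
            lines.append(f"// begin copy {name}")
            lines.extend(expand_copies(name, members, _stack + (member,)))
            lines.append(f"// end copy {name}")
        else:
            lines.append(line)
    return lines


# Hypothetical members: ORDERS copies CUSTDS, which copies ADDRDS.
members = {
    "ORDERS": "dcl-s total packed(9:2);\n/copy QRPGLESRC,CUSTDS",
    "CUSTDS": "/copy ADDRDS\ndcl-s custNo packed(7:0);",
    "ADDRDS": "dcl-s zip char(10);",
}

print("\n".join(expand_copies("ORDERS", members)))
```

Resolving the structure deterministically, outside the prompt, is one way to give the agent the right information without asking it to rediscover that structure on every run.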

Getting context management right for IBM i at scale is a genuine engineering problem. It requires prompt engineering, structured payloads, and purpose-built retrieval and instruction mechanisms. It’s not something you configure in an afternoon.

Coexistence Is Often the Last Thing Anyone Plans For

Here’s a scenario worth thinking through: you convert 20 programs. They work. Now how do you deploy them?

Those programs don’t exist in isolation on your IBM i system. They call other programs that haven’t been converted. They interact with database files that are shared with unconverted applications. They depend on data areas and service programs that are still running in their original form.

Incremental modernization is not just a technical strategy; it’s a deployment and integration challenge that requires deliberate planning. Without a coexistence strategy, you aren’t modernizing your application. You’re creating an integration problem on top of a modernization effort.
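A first planning step is simply knowing where the boundary runs. The Python sketch below assumes a program-to-program call graph is already available from some analysis step and lists every call that crosses between converted and unconverted code, in either direction. The program names and the function name are made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def boundary_calls(call_graph: Dict[str, Set[str]],
                   converted: Set[str]) -> List[Tuple[str, str]]:
    """List (caller, callee) pairs where exactly one side has been converted."""
    crossings = []
    for caller, callees in sorted(call_graph.items()):
        for callee in sorted(callees):
            if (caller in converted) != (callee in converted):
                crossings.append((caller, callee))
    return crossings


# Hypothetical call graph and conversion wave.
call_graph = {
    "ORD100": {"ORD200", "CUS300"},
    "ORD200": {"CUS300"},
    "INV900": {"ORD100"},
}
converted = {"ORD100", "ORD200"}

for caller, callee in boundary_calls(call_graph, converted):
    print(f"{caller} -> {callee}")
```

Each pair in the result is an interface that needs a deliberate bridging decision before deployment. Shared database files and data areas would need the same treatment, which this sketch does not cover.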

When DIY Stops Being DIY

There’s a moment that tends to hit around the six-to-eight month mark of a DIY modernization effort. Someone on your team has spent serious time building prompt templates, context management logic, validation layers, and remediation workflows. They’ve knocked out dozens of edge cases.

At that point, they haven’t done DIY modernization. They’ve built a product. And they’ve built it from scratch, without having processed millions of lines of IBM i code, without the accumulated knowledge from prior modernization efforts, and without the surrounding tooling that analyzes existing application structure to make conversion more accurate.

Any one of those gaps is solvable given enough time and the right people. The problem is they don’t exist in isolation. Fix the prompt for one category of issue and it affects another. Handle a new edge case and it changes how the agent behaves with something it was already getting right. It really is death by a thousand cuts, and the cuts don’t stop coming. Teams who’ve been through this don’t count edge cases in dozens. They count them in thousands.

The question isn’t whether your team is capable; it’s whether this is the best use of that capability.

A Checklist for Evaluating Any IBM i AI Modernization Approach

Whether you are assessing a DIY approach, a general-purpose AI tool, or a purpose-built solution, ask these questions before committing:

How are omissions detected? What mechanism exists to identify logic that was present in the original source but absent in the converted output? Is this automatic, or does it rely on manual code review?
How is output validated against original business behavior? Are tests written against the original program’s intent, or against the generated code’s behavior? Are real test cases used to verify functional equivalence?
How are copybooks and complex dependencies handled? Does the approach account for nesting levels, shared files, and cross-program dependencies, or is each program treated as self-contained?
How is consistency enforced across programs? Is there a framework that ensures common patterns, date handling, UI conventions, and validation logic are implemented the same way every time?
What is the coexistence strategy? Before any converted code reaches production, how will it interact with unconverted programs, shared database files, and existing system infrastructure?
How is context managed at scale? When processing large programs or large volumes of programs, how does the approach ensure the AI agent receives the right information without prompt degradation?
What does the verification and quality assurance process look like end to end? Beyond compilation, what confirms that converted code is production-ready and maintainable?

Getting honest, specific answers to these questions will tell you more about an approach’s readiness for enterprise use than any demo will.

What a Purpose-Built Approach Looks Like

The limitations described throughout this article are not arguments against AI in modernization. They are arguments for using the right AI. Fresche’s X-Modernize AI was built specifically to address these gaps, handling the dependency mapping, logic preservation, and context management that general-purpose LLMs consistently miss. It’s the engine behind the Fresche proof of concept and the tool our team has used to convert over 10 million lines of RPG code.

For teams evaluating where AI fits in their IBM i modernization strategy, X-Analysis AI and IBM Bob play complementary roles alongside X-Modernize AI: X-Analysis AI helps teams understand dependencies, business rules, and downstream impact before changes are made, while IBM Bob supports coding and refactoring tasks.

If your team is evaluating how to use AI in IBM i modernization, the most productive next step is usually not a platform decision. It’s a structured assessment of your own dependencies, validation requirements, and coexistence constraints. The goal is a clearer, more honest picture of what modernization at scale actually requires.

For a candid look at these tradeoffs, including the limits of DIY AI and the differences between DIY, vendor-led, and hybrid approaches, watch the Fresche Talks session on “Strategic IBM i Modernization: Using AI at Scale.”

And if you want to see what this looks like in your own environment without committing to a large initiative, Fresche offers a free IBM i modernization proof of concept (POC) using X-Modernize AI to help teams validate productivity, integration, and modernization opportunities in a more concrete way.
