Posts, page 2 of 7

Friday, 26th September 2025

I built a simple CLI tool in one evening without writing a single line of code—not because the world needs another command-line utility, but as a learning exercise to explore LangChain whilst getting hands-on experience with AI-assisted development workflows.

The tool itself is deliberately basic: it sends prompts to Google's Gemini API via LangChain. No one will actually use it. But the development process revealed crucial insights about when AI excels and when human oversight becomes non-negotiable.

Context: I'm currently working through the IBM RAG and Agentic AI Professional Certificate course, which sparked my interest in putting LangChain into practice. This seemed like a reasonable opportunity to combine that learning with experimenting with Tessl and Claude Code workflows.

Time investment: 2 hours total versus an estimated 6-8 hours for hand-coding the equivalent functionality and test coverage.

The Setup: Spec-Driven Development with AI

I've been using Claude Code paired with a tool called Tessl, which enforces spec-driven development workflows. My approach was deliberately hands-off: write a README setting the vision, capabilities, tech stack choices, and API design, then let Claude Code handle everything else.

The project: a simple CLI tool that uses LangChain to send prompts to Google's Gemini API. I wrote a README, asked Claude Code to generate specs for the first basic feature, then requested implementation. Tessl keeps Claude Code focused by preventing feature creep—when the initial spec tried to do too much, I simplified it before proceeding.
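For a sense of scale, the core of such a tool is tiny. Here's a minimal sketch of what the central call might look like (an assumption on my part: the function name and the langchain-google-genai package are my stand-ins, not necessarily what ended up in the repo):

```python
# Hypothetical sketch: 'send_prompt_to_gemini' and the use of the
# langchain-google-genai package are my assumptions, not the repo's code.
def send_prompt_to_gemini(prompt: str, api_key: str, model: str) -> str:
    # Lazy import so the rest of the module works without the package installed.
    from langchain_google_genai import ChatGoogleGenerativeAI

    llm = ChatGoogleGenerativeAI(model=model, google_api_key=api_key)
    # invoke() returns a message object; .content holds the text reply
    return llm.invoke(prompt).content
```

Everything else in the tool is argument handling, configuration and error reporting, which is exactly where the interesting problems showed up.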

💡 Worth Considering: Spec-driven development with AI requires upfront architectural thinking but prevents the common trap of letting AI wander into over-complex solutions. Tools like Tessl provide valuable guardrails.

First Reality Check: When AI Debugging Falls Short

After Claude Code reported all tests passing, I tried the tool: llm-prompt "What are the benefits of renewable energy?"

It didn't work.

Claude Code couldn't figure out why. I looked at the code and immediately spotted the issue: Claude had hard-coded the model type to a non-existent value instead of reading from the .env file. This was particularly frustrating because my original README—which formed the basis of the entire spec—explicitly described the .env file and its purpose. Claude Code had somehow ignored this clear configuration guidance and hardcoded the value anyway.
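For illustration, the fix is essentially a one-liner. A sketch of the pattern, with GEMINI_MODEL as a hypothetical key (the real name in my .env may differ):

```python
import os

# Sketch of the fix: read the model name from the environment (which
# python-dotenv populates from .env) instead of hard-coding it.
# GEMINI_MODEL is a hypothetical key; the repo's .env may use another name.
def load_model(default: str = "gemini-1.5-pro") -> str:
    return os.environ.get("GEMINI_MODEL", default)

os.environ["GEMINI_MODEL"] = "gemini-1.5-flash"  # simulate a .env entry
print(load_model())
```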

Once I pointed this out, it fixed the problem instantly.

⚠️ Something I Noticed: AI excels at following patterns but struggles with environment-specific configuration issues. It's worth checking that AI implementations actually use your configuration files rather than hard-coded values, even when you've explicitly documented the configuration requirements.

You can see the code here: https://github.com/rowlando/cli-gemini-prompt/tree/af8bbb6be61ddb0191f5dcd23db9df7a967cd840

The Shocking Discovery: Meaningless Test Syndrome

The next morning, I examined the generated tests and found this gem in test_send_prompt_success.py:

import llm_prompt
from unittest.mock import patch

def test_send_prompt_success():
    # Mock send_prompt_to_gemini to return the expected response
    with patch.object(llm_prompt, 'send_prompt_to_gemini', return_value="mocked response") as mock_send:
        result = llm_prompt.send_prompt_to_gemini("Hello world", "fake_key", "gemini-1.5-pro")
        assert result == "mocked response"
        mock_send.assert_called_once_with("Hello world", "fake_key", "gemini-1.5-pro")

This test only verifies that mocking works—it doesn't test the actual function logic at all.

🚨 Worth Noting: AI commonly generates circular mock tests that prove nothing about real functionality. It's worth checking that AI-generated tests mock dependencies (rather than the function under test) and actually verify business logic.
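To show the contrast, here's a hedged sketch (stand-in functions, not the repo's real code) of what a meaningful version looks like: patch the dependency rather than the function under test, so the real logic actually runs.

```python
from unittest.mock import patch

# Stand-in code, not the repo's: the point is which thing gets patched.
def call_gemini(prompt):              # the external dependency
    raise RuntimeError("would hit the network")

def send_prompt(prompt):              # the function under test
    reply = call_gemini(prompt)
    return reply.strip()              # real logic worth verifying

# Patch the dependency, NOT send_prompt itself, so send_prompt's logic runs.
with patch(f"{__name__}.call_gemini", return_value="  mocked reply  ") as mock_call:
    result = send_prompt("Hello world")

assert result == "mocked reply"       # proves the strip() logic actually executed
mock_call.assert_called_once_with("Hello world")
```

If you patch `send_prompt` itself, as the generated test did, the assertions can only ever confirm that the mock returns what you told it to return.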

Teaching AI to Write Meaningful Tests

I challenged Claude directly: "Review the usefulness of all tests. test_send_prompt_success.py only checks the mock works—it doesn't test the real function."

Claude's analysis was surprisingly thorough:

Tests with Limited Value:

  • test_send_prompt_success.py - Circular mock test proving nothing
  • test_llm_prompt_main_success.py - Everything mocked, no integration testing
  • test_llm_prompt_api_failure.py - Tests mock exceptions, not real failures

Missing Coverage:

  • Real API integration tests
  • load_model() function completely untested
  • LangChain integration verification
  • Response parsing logic
  • Edge cases: empty prompts, network timeouts

I asked Claude Code to implement better testing practices, and it generated genuinely valuable improvements:

  • Proper integration tests that mock at the HTTP level while testing business logic
  • Edge case coverage for empty inputs and error conditions
  • Real dependency testing that verifies LangChain integration parameters
  • Comprehensive error handling for network failures and API issues
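The failure-path tests follow the same idea. A sketch with stand-in names (not the repo's actual code): make the mocked dependency raise, then check the caller turns that into a sensible error.

```python
from unittest.mock import patch

# Stand-in names again; the shape is what matters, not the exact code.
def call_gemini(prompt):
    raise NotImplementedError("real network call lives here")

def main(prompt):
    if not prompt.strip():                    # edge case: empty input
        return "Error: prompt must not be empty"
    try:
        return call_gemini(prompt)
    except Exception as exc:                  # network/API failure path
        return f"Error: {exc}"

# Simulate a timeout in the dependency and check it surfaces cleanly.
with patch(f"{__name__}.call_gemini", side_effect=TimeoutError("request timed out")):
    failure = main("hello")

assert failure == "Error: request timed out"
assert main("   ") == "Error: prompt must not be empty"
```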

See the improvements: https://github.com/rowlando/cli-gemini-prompt/commit/97ab2768f8114c3d15b3876b606eb8de9aa64837

✅ Useful Approach: After generating tests, asking AI to review them for circular mocking and missing coverage can lead to self-correction when prompted correctly.

Is 17 Tests for 61 Lines Overkill?

My final challenge: "17 tests for 61 lines of code—is this overkill?"

Claude's response was data-driven:

Test-to-code ratio of 1:3.6 is appropriate for CLI tools because:

  • External API calls require extensive failure scenario testing
  • Environment variable configuration creates multiple code paths
  • Argument validation needs comprehensive edge case coverage
  • Error propagation from API to main function requires verification

Coverage breakdown:

  • Argument validation: 4 tests
  • Environment configuration: 4 tests
  • API integration: 5 tests
  • Error handling: 4 tests
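As an illustration of the argument-validation slice, here's roughly what such tests exercise (a sketch with assumed names; the real CLI's interface may differ):

```python
import argparse

# A sketch of the validation those tests cover; the real CLI may differ.
def build_parser():
    parser = argparse.ArgumentParser(prog="llm-prompt")
    parser.add_argument("prompt", help="prompt text to send to the model")
    return parser

def validated_prompt(argv):
    args = build_parser().parse_args(argv)
    if not args.prompt.strip():
        raise SystemExit("prompt must not be empty")
    return args.prompt

print(validated_prompt(["What are the benefits of renewable energy?"]))
```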

💡 Interesting Point: Don't judge test coverage by lines of code alone. CLI tools with external dependencies legitimately need extensive testing if you want them to be reliable.

What I Learned

This experiment taught me several things about AI-assisted development:

Constraints are essential: Spec-driven development prevented Claude from wandering off into feature creep. Whether that's Tessl, SpecKit, or just well-written specifications, some form of constraint seems necessary to keep AI focused.

Quality review is essential: The AI happily generated tests that only verified mocks worked properly. It took explicit prompting to get it to self-critique and generate meaningful tests. The concerning bit is how these meaningless tests still passed and gave a false sense of confidence. If I were to develop this further, I'd definitely add acceptance tests to verify actual end-to-end functionality.

Configuration debugging remains human work: The hardcoded model issue was obvious to me but completely invisible to Claude Code. Despite the README that formed the basis of the spec explicitly describing the .env file, Claude Code still hardcoded the value in the application code. AI appears excellent at following patterns but struggles with environment-specific nuances, even when given clear documentation about configuration requirements.

Testing coverage can be genuinely useful: Once corrected, the AI produced 17 tests that covered edge cases I probably wouldn't have bothered with manually. The 1:3.6 test-to-code ratio initially seemed excessive, but Claude's analysis of why CLI tools need extensive testing was quite convincing.

The process saved considerable time—about 70% compared to hand-coding everything. But it definitely wasn't hands-off development. More like having a very capable but literal-minded junior developer who needs clear instructions and careful review.

The Bottom Line

AI-assisted development can be quite effective—I managed to build a simple but fully functional CLI tool with comprehensive test coverage in 2 hours rather than the 6-8 hours it would have taken manually. Whilst the tool itself is just a learning exercise, the process required disciplined constraints (spec-driven development), careful quality review (catching circular mock tests), and recognising where human oversight remains irreplaceable (environment configuration debugging).

It's an interesting glimpse into a workflow where developers might spend more time on architecture and quality review whilst AI handles the mechanical aspects of implementation and testing. Whether that's actually better remains to be seen, but it's certainly faster.


Tools mentioned: Claude Code (command-line AI coding assistant), Tessl (spec-driven development), LangChain (AI framework), Google Gemini API


Tuesday, 9th September 2025

I came across a research paper called Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence.

To make the research paper a bit more digestible, I turned to one of my favourite tools, Google's NotebookLM. It created a mind map, a podcast (you can hear the first minute above) and a video overview (a male voice talking over slides).

I can't rate this tool highly enough for learning / exploring a topic. See Tiago Forte's video for more info.

---

Incidentally, I used Claude to tell me how to use ffmpeg to cut the file to 1 minute long and fade out at the end. Then I needed to store the audio file somewhere and I chose Git LFS. I asked Claude Code to help and it:

  • updated the template with an HTML5 audio player with native browser controls
  • created a directory to store the audio files within my site's folder structure
  • added an entry to .gitattributes to ensure .m4a files were tracked by Git LFS for efficient storage
  • updated the DecapCMS configuration so I could reference audio files from DecapCMS
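For anyone wanting to do the same, here's roughly what those two pieces look like (filenames are placeholders; check your own paths before running):

```shell
# Trim to the first 60 seconds and fade out over the last 3 (placeholder filenames).
ffmpeg -i podcast.m4a -t 60 -af "afade=t=out:st=57:d=3" -c:a aac podcast-1min.m4a

# Track .m4a files with Git LFS; this writes the following line to .gitattributes:
#   *.m4a filter=lfs diff=lfs merge=lfs -text
git lfs track "*.m4a"
```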

Monday, 8th September 2025

Martin Keen uses his light board to explain 7 AI Terms You Need to Know: Agents, RAG, ASI & More:

  1. Agentic AI: perceive → reason → act → observe
  2. Large Reasoning Model: thinking...
  3. Vector Database
  4. Retrieval Augmented Generation
  5. Model Context Protocol
  6. Mixture of Experts
  7. Artificial Super Intelligence

Mixture of Experts is a new term to me. Using a restaurant analogy, here's Claude's response:

Think of a high-end restaurant with multiple specialist chefs: one excels at pastries, another at grilled meats, another at seafood. When an order comes in, the head chef (the "gating network") decides which specialists should handle different parts of the meal based on what's needed.

A seafood pasta order might primarily go to the seafood and pasta experts, with minimal input from the pastry chef. This is much more efficient than having every chef work on every dish, and the quality is higher because specialists handle what they do best.
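The analogy maps onto code quite directly. A toy sketch of the routing idea only (real MoE layers live inside neural networks and route per token, not per order):

```python
import math

# Toy illustration: a gating function scores every "chef", and only the
# top-scoring experts do any work at all.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

experts = {
    "seafood": lambda order: f"seafood chef works on {order}",
    "pasta":   lambda order: f"pasta chef works on {order}",
    "pastry":  lambda order: f"pastry chef works on {order}",
}

def route(order, scores, top_k=2):
    # The gating network weighs every expert, but only top_k are activated.
    weights = softmax(scores)
    ranked = sorted(zip(experts, weights), key=lambda kv: -kv[1])[:top_k]
    return [(name, round(w, 2), experts[name](order)) for name, w in ranked]

# "seafood pasta": high scores for seafood and pasta, low for pastry
print(route("seafood pasta", scores=[2.0, 1.5, -1.0]))
```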


Monday, 11th August 2025

Blogmark: Legacy Systems Aren't Clocks (via)

Read this once a week if you want to build organisational capacity for continuous adaptation. And why would you not want to do that?

The irony is profound: by giving up the illusion of control—the detailed plans, the predetermined architectures, the rigid timelines—you gain real influence over complex systems. You stop trying to predict the unpredictable and start building the capabilities to adapt to whatever emerges.


Tuesday, 29th July 2025


Friday, 28th February 2025

Blogmark: On the team as a system (via)

A great reminder that teams are systems too. The author sketches what an ideal system might look like:

An ideal system might look something like: having a group of developers who take perfectly-written tasks from a perfectly-formulated backlog, the task is instantly achievable, the person understands the task, writes the code without mistakes, and as soon as it’s written, they press a button and deploy, and the user gets the feature immediately.

In reality, this never happens because parts are taken away from the ideal system, sometimes intentionally but many times because of the constraints of real life. In real life, other humans or code assistants carry out code reviews, so speed of the system is traded off for maintainable code and knowledge sharing, super important attributes for a team.

Then, he asks the podcast co-hosts to play a game: we know that we can take away elements from that ideal system, and they will make that system less ideal, but in exchange, we will get something that we want. So, what might make a system produce 22x less output? One example is that you can add code review to the system. You lose something, speed, but you get numerous other elements in return, for example you get knowledge sharing, code that is more ergonomically correct, and a number of other, unseen positives.


Friday, 14th February 2025

Duplicated code bloats a code base and is a breeding ground for defects. It's costly! Code that's hard to understand is hard to modify and extend. It too is costly. There are many more examples of costly software design.

I think we can all agree that poor design incurs costs.

So why do we call poor designs code smells? I've gotten to a point where I find that "cost" is a better term than "smell."

Code Costs. Duplicated code costs. Obscure code costs, etc.

Everyone, no matter if they are technical or non-technical, a maker or a manager, can understand the term, cost. The same is not true for smell. Cost is a ubiquitous word, an idea that a whole team can discuss and manage together.

Joshua Kerievsky


Friday, 7th February 2025

I had two exciting days at State of Open Con 2025. I had the honour of volunteering an afternoon shift on day 1 and a morning shift on day 2. I was lucky enough to help out in rooms both days, so as well as running around with microphones, counting people and making sure things ran smoothly and to time, I got to listen to more talks than I'd hoped for. Thank you, volunteering scheduling gods!

Here's a few things that stood out to me:

  • There's a growing sovereignty risk in European countries' heavy reliance on US cloud providers. European cloud providers' market share continues to fall, and the EU wants to reverse this trend with a focus on open source solutions. Interestingly, the UK Government has confirmed multi-region cloud is fine.
  • Open source suffers from toxic behaviour and drama (see some examples). Some recommendations: have a strong code of conduct in place, be consistent in applying it and transparent in its use.
  • Great security (particularly supply-chain) resources and things to get involved with at the CD.Foundation and the Cloud Native Computing Foundation (the Cloud Native Landscape is a fun way to realise how complicated software is these days!). OpenSSF's projects provide security tooling and best practices galore (I particularly like the Best Practices project), all of which help in securing your software supply chain; SLSA is about verifying provenance. And not forgetting OWASP's projects.
  • People like Lord Nat Wei are pushing for open government "finish what the internet and open source started by open sourcing politics and government"
  • In the global south, understanding of the open-source development model is limited: organisations are accustomed to traditional vendor relationships for software, and cloud deployments are rare in production (partly because the well-known cloud vendors don't have data centres in many global-south countries)

Here are some recent laws I learnt about:


Sunday, 26th January 2025

Job hunting

  • On Monday, I applied for a principal developer role at a non-ministerial government department. I wonder how there not being a minister changes things.
  • I applied for 4 jobs on Tuesday, all engineering manager roles.
  • I also emailed Cabinet Office's Test, Learn and Grow programme, showing interest in getting involved.

Wednesday

  • I met up with my sisters in Brighton. We had breakfast and talked about growing old and its associated ailments, and how the wait for a diagnosis of a possibly major health issue (the not-knowing stage) can be a horribly anxious time.
  • Read the opening pages of The Three-Body Problem

Thursday

  • Met my fellow State of Open Con 2025 volunteers, picked up a t-shirt and had a walk around the venue. Went for a pint with a few of the volunteers afterwards.

Listens

  • Been listening to Mano Le Tough this week in the gym. Spotify told me he's playing in Brixton soon. Starts at 10pm! I'm going to bed by then.

Notable reads

  • I flicked through the UNIX-HATERS Handbook. I looked up a few of the authors to see what they're doing now. Steve Strassmann has a collection of essays called Artificial Trust on AI and understanding complex systems, all very readable. Simson Garfinkel writes a newsletter called Database Nation on data, ethics and AI.
  • There's been a lot of chatter about the UK Government's Test, Learn and Grow programme on the interweb. I've registered my interest in getting involved, but I think it might be quite a while before they get to making any software. I also read the State of Digital Government report. Lots of positives, and it recognises many of the systemic delivery challenges I remember from my time at MoJ. Here are two highlights of many that caught my eye:

Mandate the publication of a standard set of APIs and events by public sector organisations. Starting with an expectation that every new service in central government departments will have an open API.

and:

Expand use of performance-based, outcomes-focused funding models that tie funding to metrics and accelerate the shift from ‘boom and bust’ transformation programmes to continuous funding of persistent, multidisciplinary product teams.

I hope some of it becomes true.


Thursday, 23rd January 2025

I've set up garden.rowlando.dev to publish some of the notes I make in Obsidian. Thank you to Wanderloots for making a concise, informative video to guide me through the Obsidian Digital Garden plug-in.

Behind the scenes the plug-in uses Eleventy, the static site generator I use for this website. The plug-in publishes to GitHub, Netlify kicks off a build process and then deploys the website.

This is all for free. The only thing I pay for is my domain name, at £10 a year.