Striking Out with LLMs: Or how I tried to use AI and why it failed.


This is a post I've been kicking around for a while, and it's part of why I decided to start up a newsletter. I think it's good to read about how people actually try to do things, and in that spirit, here's my experience trying to use GenAI tools for a task.

I will preface this by acknowledging I'm kind of an AI hater. I share Ed Zitron's skepticism about the AI Revolution, and I recommend The AI Con to everybody I can. One of my biggest frustrations with the AI hype is that very little of the discourse explains why anybody would actually need to use generative AI, other than to avoid other humans. (And other humans are friction.) This is one of the reasons AI is part of the fabric of techno-fascism.

It's frustrating and disappointing to hear people brag about offloading basic things to a GenAI tool in ways that reflect a lack of creativity. I've had people brag about using an LLM or an image generator to make birthday cards or write their personal evaluations, and it just makes me sad: it misses the point of doing those things and reflects a shift toward transactional relationships focused on productivity regardless of substance. There also seems to be a considerable amount of make-work, where AI is foisted into tasks not for exploration or experimentation, but to seem modern and more innovative. Which is why workslop is taking over many organizations, and sadly too many people in power think it's a necessary part of business. (It's not.)

With all of that said, I did make a good-faith effort to offload some work to ChatGPT, Claude, and Gemini: making a schedule for youth baseball.

I thought this would be easy! It's basically a constraint-satisfaction problem. I set out to randomly generate a schedule for a division in a youth baseball league. This seemed like a task these LLMs could help with, and I liked the idea of outsourcing the target of coaches' complaints to an LLM. "Oh, you don't like the schedule? Blame the GPTs." To do this I fed in these parameters: game dates and times (for our assigned field times), known blackout dates for specific teams, a limit of one game per team per day, and a limit of two games in any three-day span (to balance pitching load).
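For what it's worth, the constraints above are small enough to brute-force with a few dozen lines of ordinary code. Here's a minimal backtracking sketch; the team names, dates, and the double round-robin assumption are mine for illustration, not the league's actual slot list:

```python
from datetime import timedelta
from itertools import combinations

def make_schedule(teams, slots, blackouts):
    """Backtracking assignment of matchups to field slots.

    teams:     list of team names
    slots:     list of dates, one entry per available game slot
               (repeat a date when the field has two slots that day)
    blackouts: dict mapping team -> set of dates it cannot play
    Assumes each pair of teams plays twice (double round robin).
    """
    matchups = [p for p in combinations(teams, 2) for _ in range(2)]
    assigned = {}  # slot index -> (team_a, team_b)

    def ok(team, day):
        if day in blackouts.get(team, set()):
            return False  # team is blacked out that date
        days = {slots[i] for i, m in assigned.items() if team in m}
        if day in days:
            return False  # at most one game per team per day
        days.add(day)
        # at most two games in any three-day window (pitching load)
        for shift in range(-2, 1):
            lo = day + timedelta(days=shift)
            hi = lo + timedelta(days=2)
            if sum(1 for d in days if lo <= d <= hi) > 2:
                return False
        return True

    def solve(i):
        if i == len(matchups):
            return True
        a, b = matchups[i]
        for s, day in enumerate(slots):
            if s not in assigned and ok(a, day) and ok(b, day):
                assigned[s] = (a, b)
                if solve(i + 1):
                    return True
                del assigned[s]
        return False

    if not solve(0):
        return None  # constraints are unsatisfiable with these slots
    return sorted((slots[s], m) for s, m in assigned.items())
```

Backtracking is overkill for a small division, but it either finds a schedule that satisfies every rule or tells you outright that the slots can't support one, which is exactly the guarantee the LLMs never provided.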

I started with ChatGPT, and the results seemed fine at first glance, but then I realized the blackout dates weren't accounted for. Upon further inspection of the output, it was clear none of the parameters were really followed. So I would refine the prompt, and ChatGPT would repeat it back with a frustrating "OK, I've got it this time," and then generate a schedule breaking another parameter. So I jumped to Claude, thinking it might handle it better, but ultimately the results were the same (just a different flavor of wrong). Then I tried Gemini, and it also didn't make a balanced schedule. Was it user error or the tool? Probably both.

I documented some of this on Bluesky and got some helpful suggestions: my approach was wrong, and I should have asked the LLM to help craft the algorithm instead. So I tried that, introducing one parameter at a time to refine the algorithm, but all three quickly reverted to serving up the answer for me and giving me the same busted results. Then I tried taking the output of one (say ChatGPT) and feeding it to another (say Claude), as if that would somehow help. It didn't. The issues all three shared included: adding extra games to the schedule, scheduling doubleheaders, having teams play on their blackout dates, and not having every pair of teams play each other an equal number of times. The available dates and times were enough for each team to play every other team twice, and despite being told to stick to those slots, all three struggled.

I shared the schedule with my coaches and they all pointed out the errors, frustrated with how long the schedule was taking. And they were right. I cursed the GPTs. Eventually I just built a very complex table of COUNTIF formulas to verify that each team had the right number of games and played every other team twice, and manually made sure blackout dates were accounted for. I messed around with the LLMs for a week before brute-forcing it in a couple of hours. It was a bit easier because I had a (faulty) baseline to start from, but I could have saved a lot of time if I had just moved to refining it myself after the first afternoon of failure. The schedule was made.
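Those spreadsheet checks translate directly into a small audit function. This is a hypothetical sketch (the schedule shape and names are my assumptions), mirroring the COUNTIF logic: count games per matchup, flag same-day doubleheaders, and flag blackout violations.

```python
from collections import Counter
from itertools import combinations

def audit(schedule, teams, blackouts):
    """Check a finished schedule against the league rules.

    schedule:  list of (game_date, (team_a, team_b)) tuples
    blackouts: dict mapping team -> set of blackout dates
    Returns a list of human-readable violations (empty means clean).
    """
    problems = []
    # every pair should meet exactly twice (the COUNTIF-per-matchup check)
    meetings = Counter(tuple(sorted(m)) for _, m in schedule)
    for pair in combinations(sorted(teams), 2):
        if meetings[pair] != 2:
            problems.append(f"{pair[0]} vs {pair[1]}: {meetings[pair]} games, want 2")
    # no team plays twice on the same date
    per_day = Counter((d, t) for d, m in schedule for t in m)
    for (d, t), n in per_day.items():
        if n > 1:
            problems.append(f"{t} has {n} games on {d}")
    # blackout dates respected
    for d, m in schedule:
        for t in m:
            if d in blackouts.get(t, set()):
                problems.append(f"{t} is scheduled on blackout date {d}")
    return problems
```

Running something like this against each LLM draft would have surfaced the broken parameters in seconds instead of a coaching-staff review cycle.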

So while this exercise didn't yield a usable schedule, it did give me a better understanding of the limitations of these tools. When people told me, "that's not the right tool for the job," they were right! But that's also not the point when everybody is being told GenAI can do everything you want it to. I tried in good faith to roll with the hype and was burned. Did I like any of the tools more than the others? Not really. Will I use them to help write scheduling algorithms? No. My skepticism just has new dimensions, and I am even more wary of vibe coding.

If you have suggestions for other things I should try, let me know! I am definitely doing it wrong.