jonny (nonvenomous)

@jonny@neuromatch.social

known or reasonably foreseeable hazard

Digital infrastructure 4 a cooperative internet. social/technological systems & systems neuro as a side gig. writin bout the surveillance state n makin some p2p.

information is political, science is labor.

science/work-oriented alt of @jonny
This is a public account, quotes/boosts/links are always ok <3.

Content warning: re: how'd this get so long it was supposed to be a quick thought, waiting for tests to run

@sun i think it would be a real shame if the outcome of the alienation tech was to make us all self-alienate and hate each other. i find most binaries exhausting and unenlightening. like even with monster trucks, i can acknowledge how they're unbelievably wasteful and often enmeshed in some pretty hateful shit, while still being able to go like "whoa holy shit" when they are doing wheelies and flips and whatnot.

Content warning: how'd this get so long it was supposed to be a quick thought, waiting for tests to run

@sun i figure the next stage and longer tail of the commercial end after the outrageous debt and speculation in the consumer market pops is latching onto workflow data to provide company- or domain-specific finetunes or strapon autoencoders. i'm sort of surprised on a longer term scale that isn't already the case, that the foundation models can be 'good enough', the economics restrictive enough where the ai giants are not licensing weights, the ui problem where people gravitate towards their favorite box, etc. haven't made that more of a thing yet. it's all still selling prefilled context windows at best, and all those businesses are doing terribly. i do wonder if the consumer market has been too poisoned by the chatbot modality to be able to adopt the 'whole life surveillance/whole life product surface' shift that google and microsoft have been trying to push, and these kinds of economically viable domain-specific products will always just look like Hated Work SaaS App.

if you stretch the boundaries of what you consider "AI" to include an assemblage of purpose-built models and algorithms, then at the outer limits there you start to get back to "normal computing." the language | program interface is still, as far as i can see, an intractable one in the general sense, where the leap from text generation to tool calling can land you somewhere but it can also yeet you off into space. i would say on the latest anthropic and openai models i get about a 50/50 chance of whether the thing is capable of spawning a subagent or whether it fails and makes excuses about it (and sometimes even simulates the output of a subagent but in the debug logs you can see it failed the tool call). That's a non-negotiable barrier, with no easy technological solution in sight, that prevents the wildest subdomain takeover scenarios, but in the meantime a lot of stuff sort of works and for most people that seems to be alright.

but yeah if one gets past all the sentient god-machine occultism, it's not hard to imagine this generation of tech giants going the same way as the prior (current?) gen, spinning off their magic product into a thousand derivatives that capture a lot of cash, displace a lot of labor, provide some useful services as a byproduct, etc.

most of my serious reservations come from the economic and epistemic violence of it all, and i definitely do active research into the failure modes, but of course they do some things, and as you say it's important to stay apprised of the real capabilities both to calibrate criticism and appreciate what can be done.

@sun @david_chisnall definitely agreed on that, i am a filthy monolingual and so am the last person who can speak on translation, but as david is saying above it definitely can get to the point where it does the trick most of the time, up to a point when the communicative context escapes the space covered by the training data. that probably includes most of "business-relevant communication" that makes up most of the moneymaking end, but also i can imagine easily leaves people in e.g. complex legal situations navigating asylum or immigration proceedings in a really bad place, particularly when the very non-neutral language surface of the LLM starts to rear its head. like at once it is a miracle that i can get "pretty good" translation on demand any time, and not to be underrated, but as you say some things will be worse not better, and unfortunately most of that "worse" is going to fall on people who already have it bad.

So, look. One shot rewriting the whole test suite in another language is probably not great to do, but what happened here is so much worse than you are expecting.

https://github.com/RsyncProject/rsync/pull/903/

This does not "translate tests into pytest" or a unit testing framework, it writes its own testing framework where tests are whole python scripts that redefine basic test functions in every script. Surely there would be a single way to "run rsync and get the results" - nope, well, there is, but then every test file will randomly redefine its own _run_and_capture function. So like now rsync needs a test suite for its test suite.

If instead of telling an LLM to "rewrite the tests in python" you just searched "python testing" you would find the pytest docs. And then you would find examples. And then you could write fixtures to deduplicate all the prior shell script setup and teardown stuff, and so on. But since it was just "rewrite the tests in python" it's now worse than before, and the odds of the rewrite actually being a 100% faithful translation are close to 0.

edit: more on the social element of "what if you just asked people how to rewrite in python" vs. being trapped with the Orb, which I incompletely nodded to here: https://neuromatch.social/@jonny/116671260017373441

@elebertus
I've read so much LLM code at this point, there are still patterns that are present but elude my understanding, but one thing that's clear is that there are foundational flaw categories that are not improved upon by model version and appear in wildly different projects using wildly different models and harnesses. Testing is a big nexus of those flaws. I am not close to what would be a satisfying explanation of the dynamics, but every project suffers fucked testing problems.

@feld
"How the code is written" is more than aesthetics, unfortunately, and even though it feels good to get high on massive token counts, needing to spend a week churning your agents on a problem that is a byproduct of previous bullshit decisions that in turn generates a dozen more bullshit decisions that need to be undone later is not more productive than "just doing it right"

When I'm up to look for work again, I am just going to have to counter pitch every "AI accelerated agent flow" position with "show me your AI code and if I can tell you 2 things that are structurally preposterous about it that have made you sink more time than you thought you saved, then you hire me and I don't have to do that."

I have a range of expertise and diagnosing the many ways stirring a vat of hogwater can feel like jet fueled rocket skates is increasingly one of them.