Hacker News — vinext + Cloudflare Workers

new
past
show
ask
show
jobs
submit

▲Which one is more important: more parameters or more computation? (2021) (self.__VINEXT_RSC_CHUNKS__=self.__VINEXT_RSC_CHUNKS__||[];self.__VINEXT_RSC_CHUNKS__.push("2:I[\"aadde9aaef29\",[],\"default\",1]\n3:I[\"6e873226e03b\",[],\"Children\",1]\n5:I[\"bc2946a341c8\",[],\"LayoutSegmentProvider\",1]\n6:I[\"6e873226e03b\",[],\"Slot\",1]\n7:I[\"3506b3d116f7\",[],\"ErrorBoundary\",1]\n8:I[\"a9bbde40cf2d\",[],\"default\",1]\n9:I[\"3506b3d116f7\",[],\"NotFoundBoundary\",1]\na:\"$Sreact.suspense\"\n:HL[\"/assets/index-BLEkI_5r.css\",\"style\"]\n")"_blank">parl.ai)

59 points by jxmorris12 4 days ago | 13 comments

vorticalbox 3 days ago [-]

This reminds me of https://dnhkng.github.io/posts/rys/

David looks into the LLM finds the thinking layers and cut duplicates then and put them back to back.

This increases the LLM scores with basically no over head.

Very interesting read.

renticulous 2 days ago [-]

Jeff Dean says models hallucinate because their training data is "squishy."

But what's in the context window is sharp, the exact text or video frame right in front of them.

The goal is to bring more of the world into that context.

Compression gives it intuition. Context gives it precision.

Imagine if we could extract the model's reasoning core and plug it anywhere we want.

2ndorderthought 2 days ago [-]

LLMs "hallucinate" because they are stochastic processes predicting the next word without any guarantees at being correct or truthful. It's literally an unavoidable fact unless we change the modelling approach. Which very few people are bothering to attempt right now.

Training data quality does matter but even with "perfect" data and a prompt in the training data it can still happen. LLMs don't actually know anything and they also don't know what they don't know.

https://arxiv.org/abs/2401.11817

self.__VINEXT_RSC_CHUNKS__=self.__VINEXT_RSC_CHUNKS__||[];self.__VINEXT_RSC_CHUNKS__.push("e:I[\"6ed23a8ff0cc\",[],\"default\",1]\n")>electroglyph 2 days ago [-]

> they also don't know what they don't know

they sort of do tho:

https://transformer-circuits.pub/2025/introspection/index.ht...

2ndorderthought 2 days ago [-]

I won't quibble even though I likely should. Have to remember this is HN and companies need to shill their work otherwise ... Yes.

I will play along and assume this is sound. 10-40% +/- 10% is along the lines of "sort of" in a completely unreliable, unguaranteed and unproven way sure.

chongli 2 days ago [-]

That’s not the only issue. They also have the problem that they’re built to always give an affirmative answer and to use authoritative wording, even when confidence is low. If they were trained to answer “I don’t know” instead of guessing, they’d hallucinate a lot less, but nobody seems to want that.

It calls to mind the issue of search engines that refuse to return “0 results found” anymore. Now they all try to give you related but ultimately incorrect results.

To me, that feels like gaslighting. It’s like if you ask someone to buy cheddar cheese at the store and they come back with mozzarella, and instead of admitting that the store was out of cheddar, they try to convince you that you actually really want mozzarella.

parineum 2 days ago [-]

> If they were trained to answer “I don’t know”

If they were trained that an answe of "I don't know" was an acceptable answer, the model would be prone to always say "I don't know" because it's a universally acceptable answer.

It's a better answer even if it does "know".

chongli 2 days ago [-]

That could be fixed with the right scoring scheme in training. The SAT exam (for college-bound high school students in the US) used a scheme like this for multiple choice questions. Correct answers are awarded 3 points (with choices a,b,c,d), incorrect answers are penalized with -1 point, and leaving the answer blank (equivalent to "I don't know") is worth 0 points. This way, the expected value of guessing a random answer when the student doesn't know is 0 points so you might as well leave it blank if your confidence in the answer is no better than a random guess.

majormajor 2 days ago [-]

That just sounds like a very fancy/marketing way of saying "models will hallucinate because you cannot compress all the facts in the world into the model size." (Without even getting into any other things that could cause plausible-but-incorrect output.)

>Imagine if we could extract the model's reasoning core and plug it anywhere we want.

Aren't a lot of the latest model variants doing something very similar? Stuff more domain-relevant knowledge into the model itself on top of a core generally-good reasoning piece, to reduce need to perfectly handle giant context?

vessenes 2 days ago [-]

Interesting little bit of history; this pre-Chinchilla paper proposed MoE training longer would improve performance. Good idea. They also proposed using a hash function to choose experts rather than training a routing layer and showed it marginally better at the time than existing routing techniques.

I’d guess that the hash function worked better because by definition it does not collapse; a modern training run of an MoE model will include careful attention to usage of experts, and expect some to be more ‘hot’ than others — e.g. totally flat percentage choice is a bad sign, and also look for unused or radically underutilized experts as well.

kang 2 days ago [-]

The answer should be obvious that its both.

Zurada was one of our AI textbook that makes it visual that right from a simple classifier to a large language model, we are mathematically creating a shape(, that the signal interacts with). More parameters would mean shape can be curved in more ways and more data means the curve is getting hi-definition.

They reach something with data, treating neural network as blackbox, which could be derived mathematically using the information we know.

anon373839 2 days ago [-]

Well both aren’t “more important”, since that’s illogical. I think recent strides in high performance small LLMs have shown that the tasks LLMs are useful for may not require the level of representational capacity that trillion-parameter models offer.

However: the labs releasing these high-intelligence-density models are getting them by first training much larger models and then distilling down. So the most interesting question to me is, how can we accelerate learning in small networks to avoid the necessity of training huge teacher networks?

mskogly 2 days ago [-]

Selective training data, lora fine tuning or MOE are other solutionsZ Sure, creating a model with 100 billion parameters will yield good results, but it’s sort of like employing a million random people to play darts. Or shooting sparrows with A nuclear bomb.

l4tq3 3 days ago [-]

[dead]

34ylsh 3 days ago [-]

[flagged]

Rendered at 08:15:32 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.