My Anti-SOP

I've been underwater, focused myopically on my research, over the last few months. Now I want to catch my breath and reflect a bit on the lay of the land in deep learning. I'm currently in the process of applying to graduate school in computer science, and most of these musings I can't put into an Statement of Purpose (SOP) since they will make me look too scatterbrained (which I am). So let's call this my anti-SOP.

LM = language model; FM = foundation model.

The role of knowledge vs reasoning in LMs

I used to think the storage of knowledge in model weights (subject-relation-fact) was a bug, not a feature
We have databases for keys and values, surely we want FMs to be big-brain reasoning machines instead? My naïve dream was a ~1b model that can reason and plan like GPT4 and just use the internet/RAG for knowledge
Then I started paying attention to how I use LMs myself on a day to day basis and realized that I mostly use them as a soft interpolants of internet content, ie. an even softer version of search than keyword/semantic web search
Also, MMLU (knowledge) seems to be the major axis that determines general capabilities, including reasoning. Knowledge and reasoning seem intertwined in FMs, unlike in humans where for instance people who can memorize a lot may not be the best at hard math problems.
The long tail of the data distribution is interesting. The Allen-Zhu line of work shows exposure to a fact O(10) times isn't enough for a LM to be able to reason about it. One way to get around this is to synthetically amplify documents with long-tail facts during pretraining time so models see those facts thousands of times. But this feels too bashy to me. How can we do better?

Architecture vs data
- Used to think architecture was the first order concern, now I think it's the data. In some sense this is obvious in hindsight because in an estimation problem, the target function usually affects the learned function more than choice of estimator
- Have started some work on synthetic data; mixtures of real-synthetic usually outperform real on most downstream evals. This means there are some statistical properties the models "want" in their data that web text doesn't fully have
- What is the platonic data distribution models "want" to learn world representations on? Is there any definition for "quality" and "compute" such that the statement "the quality per token of synthetic data improves predictably with compute used to generate it" is true?
I really want to learn about HCI in the few weeks after grad school apps.
- How models are post and thus pretrained -- and thus what is most important scientifically for me to work on in the next few years -- is determined by how consumers and enterprises find it most intuitive to interact with them. The "GUI moment" for FMs still hasn't happened and I really wonder what it will be.
- Maybe it's agents, maybe that's the wrong mental model? I'm sure they will play a role but I don't think it's the full story, or at least specific enough for what I'm looking for.
- I think the design space for FM interfaces is extremely underexplored. But I also know nothing about HCI, so who am I to say. Some things that might be extremal points of this space:
- I should go to an elementary school and see how kids interact with these models. Alan Kay wanted the GUI so kids could interact with computation. Everyone is so obsessed with automating the enterprise. What about automating the elementary school? (this is satire)
I also really want learn about distributed systems

The fact I don't know what this NCCL bullshit means or how it works under the hood annoys me.
Despite the fact I've done multi-node pretraining many times by now, I still don't think I fully understand all the details of FSDP. I really should read the source code/try implementing a simple version from scratch, it seems important to master.

Hardware vs math for mental models of what neural networks actually are

Over the last year my mental model of what a neural network *is* (like, in its bones) has oscillated between "parametric function class" and "spicy matmuls on tensor cores"
Learning about GPUs, performance engineering, CUDA, and more from an awesome collaborator (shoutout BFS) as a reformed theory person has been revelatory
"Deep learning theory" sometimes feels like a mirror where you get out (theorems) a slightly transformed version of what you put in (assumptions you deliberately made to be able to prove things). I'm more than happy to use my own work as an example:

In No Free Prune, we used arguments from high-dimensional probability to reason about why pruning networks at initialization is doomed. I think it contains clean and cool theorems, but -- since it relies on overparameterization and isoperimetric data distributions -- what it says about FMs is unclear.

More confused about optimization than ever before - how do we get anything done here?

Does anybody alive understand high-dimensional non-convex optimization? How do optimizers for pretraining keep improving if we don't? (Adam -> Shampoo -> SOAP, etc)
Do adaptive/preconditioned methods really even share the texture of vanilla gradient descent? Are they even phenomenologically the same class of object? Are second-order methods inevitable? I am a naïve tourist in the wonderland of optimization

One thing that is amazing about deep learning is how unusually tight the link is between new research techniques and downstream products they enable

Often, scientific breakthroughs are touted to the general public as "potentially enabling world-changing technologies in just a few decades"
With deep learning, as soon as a new technique is invented in a preprint, startups start build products around it if it's important enough
After the TRAK paper came out, some folks I know started to work on a data attribution company
The literal day after the ReFT finetuning paper came out, friends at agentic startups were trying to see if it could be useful as a method to finetune agentic pipelines
When super long context started to finally work, audio companies started quickly cropping up using state space techniques to power voice apps
Quantum mechanics was cool because it shook the foundations of a scientific field, then over a span of decades enabled a ton of cool products to be built using our newfound understanding (electronics, lasers, MRI, etc).
- Deep learning is the same, except the products built around it crystallize on the timescale of weeks to months after a discovery! It is as if the double slit experiment and invention of the iPhone happened within the same year!

Will SSMs win? From no chance to maybe
- Inference really matters, and constant factor improvements/just making ASICs cannot be the end of the story
- Even if traditional SSMs (LTI) don't win, I think hybrid models have a real chance. The notion of clamping n-gram heads into an SSM in the ICLL paper was hilarious and brilliant.
- Inference is the speed of thought for agentic workflows and reasoning with LMs. If inference-time compute gains are loglinear and inference speed goes up overnight when we switch to SSMs, eval performance within a fixed wall clock inference time will shoot up out of nowhere.
- Table 8 in this paper is amazing, hybrid models beat Transformer++ by 3-5 MMLU accuracy points at large scale. It's hard to overstate how big a gap that is!
I'm starting to believe some of the cool phenomena theorists reason about in toy models may actually sometimes be similar what is driving similar phenomena in frontier models

I wrote my lazy to rich grokking paper last year in part to try show mech interp people it wasn't that complicated
In particular, the use of the phrase "phase transition" to describe grokking felt misguided to me, because it has a very specific meaning in math/physics (the free energy of a system, or its derivatives, being non-analytic) and I felt it was being abused for ideological reasons to make these toy mathematical models look like sentient beings or what-not.
But over time, some work has accumulated showing that grokking may on some level be a genuine phase transition (Noa Rubin's excellent paper, etc). This, coupled with my own recent work showing evidence for similar phenomena in cortex raises the extremely tantalizing prospect that "eureka moments" in humans may be driven by similar dynamics. I still don't take this literally, but I'm more open to it than before.

This is a less sciency comment, but the fact that different cultures use deep learning very differently is exciting to me

For instance, China has a huge on-demand delivery market, and robust last-mile robotics and mobile app infrastructure built around this. A key use case of AI there is powering this flourishing ecosystem of food delivery and shopping apps.
I think generative video models will be huge in India, maybe even uniquely so; the culture is enormously reliant on video and film and TV as a focal point of communication and there's a lot of latent creative energy I suspect can be released. I can't wait for kids messing around after school in a dhaba to be making viral clips and short movies that genuinely compete with Bollywood when diffusion models become open and cheap for all.
Japan has really bought into AI as an emotional companion. My Japanese friends tell me people there swear by their pet chatbots on Character.ai and wax lyrical about their intimate bonds. I know Japan historically has a culture of framing AI as by default helpful and human-like in contrast to the typical depiction in Western sci-fi, maybe this made adoption easier.
This makes me optimistic every culture will make AI its own. The idea that AI can be useful to small companies aiming to build "culture-aware" products and win market share this way is amazing to me. It means AI is accelerating heterogeneity of preferences and products instead of the opposite. I think this is a very good thing.