Coming up for air
I've been underwater, focused myopically on my research, over the last few months. Now I want to catch my breath and reflect a bit on the lay of the land in deep learning.
Here, I reconsider some opinions I had earlier this year, and generally think about what feels important to work on in the coming months and years. Most of these I can't put into an SOP since they will make me look too scatterbrained (which I am).
And yes, the title is a reference to one of my favorite novels.
LM = language model; FM = foundation model.
- The role of knowledge vs reasoning in LMs
- I used to think the storage of knowledge in model weights (subject-relation-fact) was a bug, not a feature
- We have databases for keys and values; surely we want FMs to be big-brain reasoning machines instead? My naïve dream was a ~1B model that can reason and plan like GPT-4 and just use the internet/RAG for knowledge.
- Then I started paying attention to how I use LMs myself on a day-to-day basis and realized that I mostly use them as soft interpolants of internet content, i.e. an even softer version of search than keyword/semantic web search
- Also, MMLU (knowledge) seems to be the major axis that determines general capabilities, including reasoning. Knowledge and reasoning seem intertwined in FMs, unlike in humans where for instance people who can memorize a lot may not be the best at hard math problems.
- The long tail of the data distribution is interesting. The Allen-Zhu line of work shows exposure to a fact O(10) times isn't enough for an LM to be able to reason about it. One way to get around this is to synthetically amplify documents with long-tail facts during pretraining time so models see those facts thousands of times. But this feels too bashy to me. How can we do better?
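To make the "bashy" baseline concrete, here is a minimal sketch of amplification by plain repetition. It assumes each document comes pre-tagged with the set of facts it mentions, which is itself a big assumption. The function name `repeat_factors`, the fact tags, and the `target` exposure count are all hypothetical, and a real pipeline would presumably rephrase documents rather than duplicate them verbatim.

```python
import math
from collections import Counter
from typing import List, Set


def repeat_factors(doc_facts: List[Set[str]], target: int) -> List[int]:
    """Return how many times to repeat each document so every fact is seen >= target times.

    doc_facts[i] is the (hypothetical) set of fact IDs mentioned in document i.
    Each document is repeated according to its rarest fact.
    """
    # Count in how many documents each fact appears.
    counts = Counter(fact for facts in doc_facts for fact in facts)
    factors = []
    for facts in doc_facts:
        if not facts:
            factors.append(1)  # nothing to amplify, keep a single copy
            continue
        rarest = min(counts[fact] for fact in facts)
        factors.append(max(1, math.ceil(target / rarest)))
    return factors


def exposures(doc_facts: List[Set[str]], factors: List[int]) -> Counter:
    """Total number of times each fact is seen after repetition."""
    seen = Counter()
    for facts, k in zip(doc_facts, factors):
        for fact in facts:
            seen[fact] += k
    return seen
```

Every fact ends up at or above the target exposure, but common facts that co-occur with rare ones get over-amplified as a side effect, which is part of why this feels crude.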
- Architecture vs data
- Used to think architecture was the first-order concern, now I think it's the data. In some sense this is obvious in hindsight because in an estimation problem, the target function usually affects the learned function more than the choice of estimator
- Have started some work on synthetic data; mixtures of real-synthetic usually outperform real on most downstream evals. This means there are some statistical properties the models "want" in their data that web text doesn't fully have
- What is the platonic data distribution models "want" to learn world representations on? Is there any definition for "quality" and "compute" such that the statement "the quality per token of synthetic data improves predictably with compute used to generate it" is true?
- In some sense, model distillation is a trivial example where bigger generators are better, but this is less interesting to me because you can't distill your way to a frontier model. I think the most interesting synthetic corpora are those that are "rephrases" or augmentations of real seed text (see eg. the WRAP paper).
- I really want to learn about HCI in the few weeks after grad school apps.
- How models are post-trained, and thus pretrained -- and thus what is most important scientifically for me to work on in the next few years -- is determined by how consumers and enterprises find it most intuitive to interact with them. The "GUI moment" for FMs still hasn't happened and I really wonder what it will be.
- Maybe it's agents, maybe that's the wrong mental model? I'm sure they will play a role but I don't think it's the full story, or at least it isn't specific enough for what I'm looking for.
- I think the design space for FM interfaces is extremely underexplored. But I also know nothing about HCI, so who am I to say. Some things that might be extremal points of this space:
- Will websites reduce to being just endpoints that can be hit by agentic API calls, and will humans stop interfacing directly with the web via URLs as a result?
- Will voice/language models become the primary medium through which we interface with computation of any sort (e.g., like the desktop of a home computer is for many now)? Many revolutions in interface design came from one thing doing many things (eg. iPhone).
- Will we ever interact with single language model forward passes like we do now, or will each response be a sort of ensembled output from the deliberation of a committee of models? If inference-time compute truly works, I see no reason most outputs shown to a user will not be the result of lots of low-latency MCTS/repeated sampling/multi-agent debate.
- Will we have dashboards on the models we interact with, telling us real-time information about how honest or sycophantic they are being by monitoring their internals, like we have for cars or stoves?
- I should go to an elementary school and see how kids interact with these models. Alan Kay wanted the GUI so kids could interact with computation. Everyone is so obsessed with automating the enterprise. What about automating the elementary school? (this is satire)
- I also really want to learn about distributed systems
- The fact I don't know what this NCCL bullshit means or how it works under the hood annoys me.
- Despite the fact I've done multi-node pretraining many times by now, I still don't think I fully understand all the details of FSDP. I really should read the source code/try implementing a simple version from scratch, it seems important to master.
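As a starting point for the "simple version from scratch", here is a toy, single-process sketch of the core FSDP data flow: each rank stores only a shard of the parameters, all-gathers the full vector to compute, then reduce-scatters gradients so it only updates its own shard. This is not PyTorch's FSDP API and makes no NCCL calls. The function names are made up, the collectives are simulated with Python lists, and it assumes a linear model with squared loss, plain SGD, equal-size local batches, and a parameter count divisible by the number of ranks.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Batch = Sequence[Tuple[Vector, float]]  # (features, target) pairs


def shard(flat: Vector, world_size: int) -> List[Vector]:
    """Split a flat parameter vector into equal contiguous shards, one per rank."""
    assert len(flat) % world_size == 0, "sketch assumes even divisibility"
    size = len(flat) // world_size
    return [flat[r * size:(r + 1) * size] for r in range(world_size)]


def all_gather(shards: List[Vector]) -> Vector:
    """Simulated all-gather: concatenate every rank's shard into the full vector."""
    return [x for s in shards for x in s]


def reduce_scatter(full_grads: List[Vector]) -> List[Vector]:
    """Simulated reduce-scatter: average gradients over ranks, hand each rank its slice."""
    world_size = len(full_grads)
    mean = [sum(g[i] for g in full_grads) / world_size
            for i in range(len(full_grads[0]))]
    return shard(mean, world_size)


def local_grad(w: Vector, batch: Batch) -> Vector:
    """Gradient of the mean squared error of the linear model y = w . x on one batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(batch)
    return grad


def fsdp_step(shards: List[Vector], batches: List[Batch], lr: float) -> List[Vector]:
    """One SGD step where every rank owns, and updates, only its parameter shard."""
    full_grads = []
    for rank in range(len(shards)):
        full_w = all_gather(shards)  # full params exist only temporarily on each rank
        full_grads.append(local_grad(full_w, batches[rank]))
    grad_shards = reduce_scatter(full_grads)
    return [[p - lr * g for p, g in zip(s, gs)]
            for s, gs in zip(shards, grad_shards)]
```

With equal-size local batches, gathering the updated shards should match an ordinary SGD step on the concatenated batch, which makes a handy correctness check. Everything that makes the real thing hard (per-layer gather/free, overlapping communication with compute, sharded optimizer state, mixed precision) is deliberately left out.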
- Hardware vs math for mental models of what neural networks actually are
- Over the last year my mental model of what a neural network *is* (like, in its bones) has oscillated between "parametric function class" and "spicy matmuls on tensor cores"
- Learning about GPUs, performance engineering, CUDA, and more from an awesome collaborator (shoutout BFS) as a reformed theory person has been revelatory
- "Deep learning theory" sometimes feels like a mirror where you get out (theorems) a slightly transformed version of what you put in (assumptions you deliberately made to be able to prove things). I'm more than happy to use my own work as an example:
- In No Free Prune, we used arguments from high-dimensional probability to reason about why pruning networks at initialization is doomed. I think it contains clean and cool theorems, but -- since it relies on overparameterization and isoperimetric data distributions -- what it says about FMs is unclear.
- More confused about optimization than ever before - how do we get anything done here?
- Does anybody alive understand high-dimensional non-convex optimization? How do optimizers for pretraining keep improving if we don't? (Adam -> Shampoo -> SOAP, etc)
- Do adaptive/preconditioned methods really even share the texture of vanilla gradient descent? Are they even phenomenologically the same class of object? Are second-order methods inevitable? I am a naïve tourist in the wonderland of optimization
- One thing that is amazing about deep learning is how unusually tight the link is between new research techniques and downstream products they enable
- Often, scientific breakthroughs are touted to the general public as "potentially enabling world-changing technologies in just a few decades"
- With deep learning, as soon as a new technique is invented in a preprint, startups start building products around it if it's important enough
- After the TRAK paper came out, some folks I know started to work on a data attribution company
- The literal day after the ReFT finetuning paper came out, friends at agentic startups were trying to see if it could be useful as a method to finetune agentic pipelines
- When super long context started to finally work, audio companies started quickly cropping up using state space techniques to power voice apps
- Quantum mechanics was cool because it shook the foundations of a scientific field, then over a span of decades enabled a ton of cool products to be built using our newfound understanding (electronics, lasers, MRI, etc).
- Deep learning is the same, except the products built around it crystallize on the timescale of weeks to months after a discovery! It is as if the double slit experiment and invention of the iPhone happened within the same year!
- Will SSMs win? From no chance to maybe
- Inference really matters, and constant factor improvements/just making ASICs cannot be the end of the story
- Even if traditional SSMs (LTI) don't win, I think hybrid models have a real chance. The notion of clamping n-gram heads into an SSM in the ICLL paper was hilarious and brilliant.
- Inference is the speed of thought for agentic workflows and reasoning with LMs. If inference-time compute gains are loglinear and inference speed goes up overnight when we switch to SSMs, eval performance within a fixed wall clock inference time will shoot up out of nowhere.
- Table 8 in this paper is amazing, hybrid models beat Transformer++ by 3-5 MMLU accuracy points at large scale. It's hard to overstate how big a gap that is!
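The fixed-wall-clock argument above can be made concrete with some back-of-the-envelope arithmetic. This is a sketch under the assumption that eval score grows log-linearly in the number of samples drawn at inference time. The function names and every constant (base score, points per doubling, per-sample latency, the 4x speedup) are invented for illustration and not taken from any paper.

```python
import math


def samples_within_budget(wall_clock_s: float, seconds_per_sample: float) -> int:
    """How many full samples fit in a fixed wall-clock budget (at least one)."""
    return max(1, int(wall_clock_s // seconds_per_sample))


def loglinear_score(n_samples: int, base: float, gain_per_doubling: float) -> float:
    """Assumed scaling law: score = base + gain_per_doubling * log2(n_samples)."""
    return base + gain_per_doubling * math.log2(n_samples)


def score_at_budget(wall_clock_s: float, seconds_per_sample: float,
                    base: float, gain_per_doubling: float) -> float:
    """Score reachable within the budget at a given per-sample latency."""
    n = samples_within_budget(wall_clock_s, seconds_per_sample)
    return loglinear_score(n, base, gain_per_doubling)


if __name__ == "__main__":
    budget = 64.0  # seconds of wall clock per query (hypothetical)
    slow = score_at_budget(budget, 8.0, base=50.0, gain_per_doubling=2.0)
    fast = score_at_budget(budget, 2.0, base=50.0, gain_per_doubling=2.0)  # 4x faster
    print(slow, fast, fast - slow)
```

Under this assumed law, a k-times faster model buys roughly `gain_per_doubling * log2(k)` points at the same wall clock, with no change to the weights or the training data.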
- I'm starting to believe some of the cool phenomena theorists reason about in toy models may actually sometimes be similar to what is driving similar phenomena in frontier models
- I wrote my lazy to rich grokking paper last year in part to try to show mech interp people it wasn't that complicated
- In particular, the use of the phrase "phase transition" to describe grokking felt misguided to me, because it has a very specific meaning in math/physics (the free energy of a system, or its derivatives, being non-analytic) and I felt it was being abused for ideological reasons to make these toy mathematical models look like sentient beings or what-not.
- But over time, some work has accumulated showing that grokking may on some level be a genuine phase transition (Noa Rubin's excellent paper, etc). This, coupled with my own recent work showing evidence for similar phenomena in cortex, raises the extremely tantalizing prospect that "eureka moments" in humans may be driven by similar dynamics. I still don't take this literally, but I'm more open to it than before.
- This is a less sciency comment, but the fact that different cultures use deep learning very differently is exciting to me
- For instance, China has a huge on-demand delivery market, and robust last-mile robotics and mobile app infrastructure built around this. A key use case of AI there is powering this flourishing ecosystem of food delivery and shopping apps.
- I think generative video models will be huge in India, maybe even uniquely so; the culture is enormously reliant on video and film and TV as a focal point of communication and there's a lot of latent creative energy I suspect can be released. I can't wait for kids messing around after school in a dhaba to be making viral clips and short movies that genuinely compete with Bollywood when diffusion models become open and cheap for all.
- Japan has really bought into AI as an emotional companion. My Japanese friends tell me people there swear by their pet chatbots on Character.ai and wax lyrical about their intimate bonds. I know Japan historically has a culture of framing AI as by default helpful and human-like, in contrast to the typical depiction in Western sci-fi; maybe this made adoption easier.
- This makes me optimistic every culture will make AI its own. The idea that AI can be useful to small companies aiming to build "culture-aware" products and win market share this way is amazing to me. It means AI is accelerating heterogeneity of preferences and products instead of the opposite. I think this is a very good thing.