Welcome to Once a Maintainer, where we interview open source maintainers and tell their story.
This week we’re talking to Ralf Gommers, Co-Director of Quansight Labs and a leading contributor to NumPy, the fundamental package for scientific computing in Python, as well as SciPy, meson-python, and the Array API Standard. NumPy published the first public beta of its upcoming 2.0 release this week, the first new major version of NumPy in 16 years.
Once a Maintainer is written by the team at Infield, a platform for managing open source dependency upgrades. Ralf spoke to us from Norway.
How did you get into software engineering?
To begin, I trained as an experimental physicist. During my degree I had one course in Pascal and that was about it. So really I'm self-taught from doing research: learning Python for data analysis, MATLAB first, and then LabVIEW and C for writing control code for experiments, things like that. When I started my PhD I decided, you know, I have three or four years, I’m going to do it right and go straight for everything open source.
So I started with Linux and Vim and Python all at the same time, and I had a very rough month. This was before NumPy existed. I had to join the mailing lists to figure out what was going on, because nothing worked together. It was very immature at that point, zero documentation. So I just stayed subscribed and I kind of gradually figured out how this whole open source thing worked. And then after a few years, when I was actually a decent programmer and had a few things of my own to share, I had a break after my PhD and I thought, why not try and get started on this? I started with smaller contributions: some documentation, some features that I contributed to scikit-image. And at that point the release manager of NumPy quit. He wrote to the mailing list like, I'm sorry, I'm too busy, I quit, does anyone want the job? And for five days nobody wanted it.
This was early 2010. NumPy was also very infrequently released at that point. So after five days I'm like, well, I don't really know what I'm doing, but you know, if nobody else wants to do it, I'll give it a try. And then I had to do things like, you know, build Windows binaries for releases on Linux via Wine with undocumented scripts. The whole thing was a big learning experience. But, you know, it remained interesting and it's a very friendly community. So I stayed. And then after six months the guy came back and he's like, well, what about SciPy? I did SciPy too. Are you going to do SciPy? And I said ok, I'll do SciPy too. And I've been one of the leads for NumPy and SciPy since then.
In my experience, people who become an open source maintainer, especially of a large, widely used project, they have a certain mindset and they don't mind doing the dirty work. They like helping others and hopefully they learn something and have some fun in the process. But they tend to be the people that don't like saying no and they like to be helpful. It works like that for me too.
How much time per week or per month did you devote to these projects?
I'd say I spent 10 years doing it as a volunteer, and probably spent, I don't know, 10 to 15 hours a week on average outside of a pretty busy job. And then in 2019 it got to the point where AI had really taken off. SciPy was big when I started, but that was hundreds of thousands of users, and now it's 20 million. It got to the point that it was really not doable as a volunteer in the evenings: I was either going to make it my job or at some point I was going to quit.
So at that point I went to the SciPy conference for the first time and I met Travis Oliphant, the original author of NumPy. I knew him reasonably well. But when we talked in person, he said, I'm just starting a new company, what do you think about joining me? It's a consulting company where three quarters is consulting around the PyData space and one quarter is the labs department, which is directly contributing back and employing maintainers. So I started leading Quansight Labs, and half of my time is basically management, getting funding, being on some projects; the other half I still get to contribute, but now it's part of my job.
Can you speak to the differences in the way that the academic or government world uses open source software versus the commercial side? This seems to come up a lot in the Python ecosystem especially, curious about your take on it.
That's a good question. So, industry is a very broad term. I think one thing you have in academia is that everything is custom. It wasn’t: take a thing in pandas, run a scikit-learn thing over it, and then I'm done. It was really thinking about data structures and what you want to do, and building your own code, usually from the ground up. But academia is usually something you do by yourself or in a small group.
There’s also not just how people use the project but also how they interact with open source. I often hear that there's an overrepresentation of academia and smaller individual users, hobbyists. What we find is that people in industry who have large deployments and things like that never really show up. They come and maybe talk to me now because I'm working at a consulting company. But they have the type of request that you never see on an open source project's issue tracker. It's more like, this whole module is wrong, or, you know, we've already rewritten it and we found that everything here in your project is suboptimal. And you can end up with months or even years worth of work.
How do you think about that for such a long-running project? Stability versus, you know, maybe wanting to go back and change some things about how it works?
Yeah. I think that the lower you go in the stack, the more stable it has to be, by necessity. You have more users and every change has more impact. For NumPy specifically, there's an extra constraint: NumPy isn’t only a Python package, it also has a very large C API, so everything is built on top of it and all the binaries depend on each other.
So actually right now, over the coming weeks, we’re splitting off NumPy 2.0 and doing the first release candidates, and that's the first time in 16 years that we’re breaking API compatibility. Keeping that compatibility for 16 years, including carrying all the things we don't really want and exposing too many internals that we don't want people to use, is just the cost we've had to pay. I've been a co-author on some of the proposal documents, which we call NumPy Enhancement Proposals. And there's also now a SciPy version of that, where we define support windows for a set of Python versions, NumPy versions, and some of the other key libraries.
How does that work on the research side? If someone wrote a paper 15 years ago with some NumPy code, should someone else be able to run that code now?
I would say no. It's really hard to get people to create environments where they know what versions they used to begin with, but I think that's the only correct way of doing it. We try to be careful: if something used to work and now gives an error, that's a lot better than if it used to work and now still works but gives you a different answer. But you can't keep stability at the individual API level for that long.
Other than that, I think what does happen a lot in industry is that they deploy applications with a certain Python version and a certain NumPy version, the environment gets frozen, and then it has to run for five years or something like that. In extreme cases, maybe even ten years. But that's not really relevant to the development of the project, because they know how to lock their environment and it's not actually changing, so it doesn't matter if you release new versions.
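As an aside, not from the interview itself: the first step toward the kind of frozen, reproducible environment Ralf describes is simply recording the exact versions an analysis ran against. A minimal sketch in Python, assuming NumPy and SciPy are installed (a lock file, e.g. generated with `pip freeze`, is the more complete approach):

```python
# Record the exact versions of the key dependencies so the environment
# can be reconstructed (or locked) later. A full lock file, e.g. from
# `pip freeze` or a tool like conda, is the more complete approach.
import importlib.metadata
import sys

print(f"python=={sys.version.split()[0]}")
for pkg in ("numpy", "scipy"):
    print(f"{pkg}=={importlib.metadata.version(pkg)}")
```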
With the release of 2.0, what do you see as the focus? What’s the direction you’re taking the package?
Roadmaps in open source are really challenging, especially in such a diverse project. You have the people who really care about static typing, or people who care about performance, usability, etc.
From my perspective, it was an opportunity to really rethink what the Python API looks like, because that's still what 99% of our users use. And I think the big change we're landing there is that, first of all, we made it smaller and easier to understand, introducing a very clear split between what's public and what's private.
The other big change that I've been working on for a few years is to introduce the Python Array API standard, which is basically the core 150 or so functions that make up an array library. All the things that made it hard for the NumPy API to run on GPU, for example, have now been fixed, and they were fixed by the design of that standard. So with my SciPy maintainer hat on, NumPy is never going to run on GPU, right? But there's also PyTorch, there's JAX, there's all these libraries that are newer and way faster, and I want SciPy and higher level libraries to make it as easy as possible to run on GPU, or to use PyTorch or JAX's automatic differentiation and things like that. For me that’s probably going to continue to be the main theme for the next few years.
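To make that concrete, here's a minimal sketch of what array-library-agnostic code looks like under the standard. It assumes NumPy >= 2.0, where the main namespace conforms to the standard and arrays implement the `__array_namespace__` protocol; arrays from other libraries such as PyTorch can be used through the array-api-compat compatibility layer.

```python
import numpy as np

def standardize(x):
    # Ask the array itself which library it belongs to, via the
    # __array_namespace__ protocol from the Array API standard. The
    # same function then works for any conforming array library.
    xp = x.__array_namespace__()
    return (x - xp.mean(x)) / xp.std(x)

print(standardize(np.asarray([1.0, 2.0, 3.0])))  # [-1.2247...  0.  1.2247...]
```

The point of the design is that the function never hard-codes NumPy in its body, so SciPy-style code written this way can run on whatever array library the caller hands it.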
To suggest a maintainer, write to Allison at allison@infield.ai.
To learn more about keeping your open source software up to date using Infield, write to hello@infield.ai.