Welcome to Once a Maintainer, where we interview open source maintainers and tell their story.
This week we’re talking to Sofie Van Landeghem, core maintainer of the advanced NLP library spaCy. Sofie has been working in machine learning and NLP since 2006, and has worked on practical use cases in the pharmaceutical and food industries. She spoke with us from her home in Belgium.
Once a Maintainer is written by the team at Infield, a platform for managing open source dependency upgrades.
How did you get into software development?
It was a long time ago. I had always loved mathematics in high school, and I studied computer science at university. I guess I have quite a logical brain. So I loved doing mathematics, but I didn’t want to just stay in the theoretical. When I learned about computer science, I thought this is something I can apply maths to. After my degree, I did a masters in software engineering and my masters thesis was around data integration and text mining. So I went into data science from there and have been in the field basically ever since.
What was it about data science in particular that led you to want to pursue graduate studies?
It was this feeling of, well, I know all this theory and algorithms and stuff, but now I can actually apply it to a domain. At the time of my masters, this was the biomedical domain. This was years before big data was even invented as a term. So that was very compelling to me.
My thesis was about developing novel machine learning algorithms to process biomedical text. And throughout my PhD and then a little bit in my postdoc work as well, I collaborated with a research team in Finland where we ran machine learning algorithms on biomedical texts, millions of them, and stored the results in a database so that users could query it. All of this is outdated now. At the time it was technically challenging to run text mining on such a large scale, and not many people were doing this yet.
What have you observed over the last 10 years or so as NLP and AI have exploded, coming from an academic background?
I mean, we've been on a crazy ride in the NLP field, right? So back in the day when I was doing my masters and then my PhD, I was using support vector machines and manual feature engineering. You know, this was even before we had word2vec, before we had transformers. The progress has been crazy. And I have been in an interesting position because I left academia around the time that transformers started coming up.
I joined Johnson & Johnson, the pharmaceutical company. And we really had to work on very basic, almost low hanging fruit at the time, introducing text mining to business processes that hadn't been using it at all. So there was quite a bit of a disconnect between all of the new research that was coming out and you know, the practical things that we could or could not do in this sort of setting of a large company. Having to think about privacy, and you know, transformers often needed GPUs - sometimes you couldn’t afford this if you need to have the results quickly. And we still have that today with LLMs. You know, they are awesome. And I think everybody has been amazed by the progress that we've been able to make with LLMs. But at the same time, I'm still very much thinking about everyday use cases and how people can use these in production and whether this will solve an actual business case. There's often still a disconnect between the two.
How did you first get involved in open source?
It started pretty small. Back in the day when I was at university we would try and open source the code behind the research papers that we would publish. This was in Java back then. So that tells you a little bit about how old I am. And then when I was working in industry, in the years after that, I wasn't able to do a lot of open source work, but then as the Python ecosystem grew I started using a lot of these Python libraries like spaCy. It was one of the tools we were using in one of my previous jobs when I was working at a startup. That’s how I first got into the field. And then I got to know Matt Honnibal and Ines Montani, the founders of Explosion, the company behind spaCy, and started collaborating with them, and that's how I got more involved as a maintainer.
How do you think about the roadmap for spaCy? For example we spoke with a maintainer from the NumPy team, which is a huge project and quite formally run, I would say. We spoke to Ralf Gommers about NumPy 2.0, which was just released a few days ago. Whereas we’ve spoken with other projects where the roadmap is quite individual driven, like what does this person feel like working on this year? So I’m curious where you think spaCy falls on that spectrum.
I think for us it is more individually driven. The founders of Explosion, Matt and Ines, have a vision of what the library should be. And I think we've stuck to that vision, which has been good because it helps us to keep the library more stable and users know what to expect. Other than that, we do have a never-ending task list internally where, you know, whenever somebody thinks of a feature, small or big, they add it to the board. So the question is what to prioritize, right? Sometimes we have a consulting project that might require building out some functionality, or somebody wants to maybe do a blog post or a tutorial.
I wouldn't say that there's a grand plan, like we'll do exactly this in three months’ time, but we often have work going on on different branches. So we always have master that we're using for quick fixes and small things that can just go in the next bug fix release. And we have development where the larger things, the things that may break other people’s code are kept. And then we have a v4 branch as well, where the really major features are being added so that we know if we release from that branch, then we have to bump to v4.
What are the features you’re working on implementing right now?
We’re working on the NumPy fix right now (to support NumPy 2.0). That should be out relatively soon. We also made a plugin which is called spacy-llm. This was something that we couldn't really plan for last year. We wanted to make sure that you can also work with LLMs in a spaCy pipeline, so we created spacy-llm as a sort of optional additional plugin. We're working on a major refactor and corresponding documentation and getting that polished up so that we can release the v1 version. And then the other one is actually getting v4 out - I think we started working on v4 two or three years ago already, but it required an update of thinc as well, our underlying machine learning library.
I think sometimes we all just want to keep on pushing new features, but we have to make sure that at some point we wrap up, publish what we have, and continue. This is the main reason why v4 hasn't been released more quickly, because we always think of something new to add. And at some point you just have to say, you know, it doesn't have to be a huge release every time. Let's just make sure that the community has what we've already created and continue from there.
Why do you think that Python gets such a bad rap in terms of its dependency management? It’s interesting to me how often people say dependency management in Python is just absolutely hellish. What’s your take on that? Why is it?
I'm pretty sure I've made the statement myself in the past few years. It’s always difficult, right? In theory, there should be a proper way to do this and you know, minor versions shouldn't really break things. So you should be able to pin it to the next version that shouldn't break things. I think in reality, though, it’s just messy. We've definitely had cases where we would publish a release that we would assume was not breaking at all and we wouldn't document any breaking changes. And then it turns out that some sort of usage by a few users is in fact broken by something that we published. Often you don't know all of the different ways your code is being used or its interdependence with other libraries.
If every maintainer could promise every time that when they do a small bug fix release it wouldn't actually break anything, then we would all be able to pin it correctly. But sometimes when this does happen, when an external library for instance publishes something that is breaking that you didn't expect, then you might become more careful with this external dependency and pin it more strictly. Which then means that in a month's time users will be complaining that they're locked into this version. So there's no ideal way of dealing with that, right? Either you're too strict and you're locking people in or you're too lenient and your software might break tomorrow if they publish a breaking change. This is true for all open source libraries or all Python packages really.
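The trade-off between strict and lenient pinning can be illustrated with a small sketch. This is a simplified illustration only: the helper names and the release numbers are hypothetical, and real installers such as pip apply much richer version-specifier rules than this tuple comparison.

```python
# Simplified sketch: compares plain dotted versions such as "1.4.1".
# It does not handle pre-releases or the other cases real installers support.

def parse(version: str) -> tuple:
    """Turn a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def allowed_by_strict_pin(candidate: str, pinned: str) -> bool:
    """Strict pin: only the exact pinned version is accepted."""
    return parse(candidate) == parse(pinned)


def allowed_by_range(candidate: str, lower: str, upper: str) -> bool:
    """Lenient pin: accept any version with lower <= candidate < upper."""
    return parse(lower) <= parse(candidate) < parse(upper)


# Hypothetical releases of an external dependency.
releases = ["1.4.0", "1.4.1", "1.5.0", "2.0.0"]

# Strict: users are locked to one version, even when a safe bug fix ships.
strict = [r for r in releases if allowed_by_strict_pin(r, "1.4.0")]

# Lenient: users get new releases, including any that break unexpectedly.
lenient = [r for r in releases if allowed_by_range(r, "1.4.0", "2.0.0")]

print("strict pin allows:", strict)
print("range pin allows:", lenient)
```

With the strict pin, a fix published as 1.4.1 never reaches users until the pin is updated. With the range, 1.5.0 is picked up automatically, which is only safe if that release really is non-breaking.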
So tell me about Typer.
So Typer is being maintained by Sebastián Ramírez, or as people know him, tiangolo on GitHub. He's the creator of FastAPI as well. And basically Sebastián used to work at Explosion as well some years ago. That's where I got to know him. I used to send him cat pictures. He also lives in Berlin. He's just an all around awesome guy basically. When he was sort of leaving the company to make more time for developing FastAPI and friends, as he calls it, he made me promise to keep sending him cat pictures, and that's mostly how we stayed in touch.
Typer is just a small little library that makes your Python functions into CLI commands using type hints that you add there with your functions. I've always enjoyed adding type hints to Python. You know, me coming from a Java background, a heavily typed compiled language, I sort of enjoy having the types in Python again and being able to run type checkers. So just seeing that he couldn't give the love and attention to Typer that it needed, Sebastián asked me whether I could get involved a little bit with the maintenance there as well. I’ve been just cleaning up some of the user contributions, getting them in good shape and making them up to date with master, making sure the tests pass, that they're all adhering to the standards that I know Sebastián wants so that he can review them more easily and get them over the line more quickly. That’s my role.
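As a rough illustration of what turning a function into a CLI command with type hints looks like, here is a minimal sketch. The `greet` function and its parameters are made up for this example, and it assumes the `typer` package is installed. It is invoked here through Typer's test runner so that it runs without a terminal.

```python
import typer
from typer.testing import CliRunner

app = typer.Typer()


@app.command()
def greet(name: str, count: int = 1) -> None:
    """Greet NAME, repeating the greeting COUNT times."""
    # The type hints tell Typer that `name` is a required string argument
    # and `count` is an optional integer option (--count) defaulting to 1.
    for _ in range(count):
        typer.echo(f"Hello {name}")


# In a real script you would usually call app() under an
# `if __name__ == "__main__":` guard and run the file from the shell.
# Here the test runner simulates the command-line call instead.
runner = CliRunner()
result = runner.invoke(app, ["Sofie", "--count", "2"])
print(result.output)
```

Because `count` is annotated as `int`, a non-numeric value passed to `--count` is rejected with a usage error before the function body runs.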
To suggest a maintainer, send a note to Allison at allison@infield.ai.