Open Source AI Definition Erodes the Meaning of “Open Source”
on October 31, 2024

This week, the Open Source Initiative (OSI) made its new Open Source Artificial Intelligence Definition (OSAID) official with its 1.0 release. With this announcement, we have reached the moment that software freedom advocates have feared for decades: the definition of “open source” — with which OSI was entrusted — now differs in significant ways from the views of most software freedom advocates.
There has been substantial acrimony during the drafting process of the OSAID, and this blog post does not summarize all the community complaints about the OSAID and its drafting process. Other bloggers and the press have covered those. The TL;DR here, IMO, is simply stated: the OSAID fails to require reproducibility by the public of the scientific process of building these systems, because the OSAID fails to place sufficient requirements on the licensing and public disclosure of training sets for so-called “Open Source” systems. The OSI refused to add this requirement because of a fundamental flaw in their process; they decided that “there was no point in publishing a definition that no existing AI system could currently meet”. This fundamental compromise undermined the community process, and amplified the role of stakeholders who would financially benefit from OSI's retroactive declaration that their systems are “open source”. The OSI should not have published a definition yet, and instead should have labeled this document as “recommendations” for now.
As the publication date of the OSAID approached, I could not help but remember a fascinating statement that Donald E. Knuth, one of the founders of the field of computer science, once said: “[M]y role is to be on the bottom of things. … I try to digest … knowledge into a form that is accessible to people who don't have time for such study.” If we wish to engage in the highly philosophical (and easily politically corruptible) task of defining what terms like “software freedom” and “open source” mean, we must learn to be on the “bottom of things”. OSI made an unforced error in this regard. While they could have humbly announced this as “recommendations” or “guidelines”, they instead formalized it as a “definition” — with equivalent authority to their OSD.
Yet, OSI itself only recently turned its attention to AI, when they announced their “deep dive” — for which Microsoft's GitHub was OSI's “Thought Leader”. OSI responded too rapidly to this industry ballyhoo. Their celerity of response made OSI an easy target for regulatory capture.
By comparison, the original OSD was first published in February 1999. That was at least twelve years after the widespread industry adoption of various FOSS programs (such as the GNU C Compiler and BSD). The concept was explored and discussed publicly (under the moniker “Free Software”) for decades before it was officially “defined”. The OSI announced itself as the “marketing department for Free Software” and based the OSD in large part on the independently developed Debian Free Software Guidelines (DFSG). The OSD was thus the culmination of decades of thought and consideration, and was primarily developed by a third party (Debian) — which provided a check on OSI's authority. (Interestingly, some folks from Debian are attempting to check OSI's authority again due to the premature publication of the OSAID.)
OSI claims that they must move quickly so that they can prevent software companies from coopting the term “open source” for their own aims. But OSI failed to pursue trademark protection for “open source” in the early days, so the OSI can't stop Mark Zuckerberg and his cronies in any event from using the “open source” moniker for his Facebook and Instagram products — let alone his new Llama product. Furthermore, OSI's insistence that the definition was urgently needed, and that the definition be engineered as a retrofit to apply to an existing, available system, has yielded troublesome results. Simply put, OSI has a tiny sample set to examine, in 2024, of what LLM-backed generative AI systems look like. Making a final decision about the software freedom and rights implications of such a nascent field led to an automatic bias to accept the actions of first movers as legitimate. By making this definition official too soon, OSI has endorsed demonstrably bad LLM-backed generative AI systems as “open source” by definition!
OSI also disenfranchised the users and content creators in this process. FOSS activists should be engaging in the larger discussions with impacted communities of content creators about what “open source” means to them, and how they feel about the incorporation of their data into the training sets of these third-party systems. The line between data and code is so easily crossed with these systems that we cannot rely on old, rote conclusions that the “data is separate and can be proprietary (or even unavailable), and yet the system remains ‘open source’”. That adage fails us when analyzing this technology, and we must take careful steps — free from the for-profit corporate interest of AI fervor — as we decide how our well-established philosophies apply to these changes.
FOSS activists err when we unilaterally dictate and define what is ethical, moral, open and Free in areas outside of software. Software rights theorists can (and should) make meaningful contributions in these other areas, but not without substantial collaboration with those creative individuals who produce the source material. Where were the painters, the novelists, the actors, the playwrights, the musicians, and the poets in the OSAID drafting process? The OSD was (of course) easier because our community is mostly programmers and developers (or folks adjacent to those fields); software creators knew best how to consider philosophical implications of pure software products. The OSI, and the folks in its leadership, definitely know software well, but I wouldn't name any of them (or myself) as great thinkers in these many areas outside software that are noticeably impacted by the promulgation of LLMs that are trained on those creative works. The Open Source community remains consistently in danger of excessive insularity, and the OSAID is an unfortunate example of how insular we can be.
Meanwhile, I have spent literally months of time over the last 30 years trying to make sure the coalition of software freedom & rights activists remained in basic congruence (at least publicly) with those (like OSI) who are oriented towards a more for-profit and corporate open source approach. Until today, I was always able to say: “I believe that anything the OSI calls ‘open source’ gives you all the rights and freedoms that you deserve”. I cannot say that again unless/until the OSI revokes the OSAID. Unfortunately, that Rubicon may have now been permanently crossed! OSI has purposely made it politically unviable for them to revoke the OSAID. Instead, they plan only incremental updates to the OSAID. Once entities begin to rely on this definition as written, OSI will find it nearly impossible to later declare systems that were “open source” under 1.0 as no longer so (under later versions). So, we are likely stuck with OSAID's key problems forever. OSI undermines its position as a philosophical leader in Open Source as long as OSAID 1.0 stands as a formal definition.
I truly don't know for sure (yet) if the only way to respect user rights in an LLM-backed generative AI system is to use only training sets that are publicly available and licensed under Free Software licenses. I do believe that's the ideal and preferred form for modification of those systems. Nevertheless, a generally useful technical system that is built by collapsing data (in essence, via highly lossy compression) into a table of floating point numbers is philosophically much more complicated than binary software and its Corresponding Source. So, having studied the issue myself, I believe the Socratic Epiphany currently applies. Perhaps there is an acceptable spot for compromise regarding the issues of training set licensing, availability and similar reproducibility issues. My instincts, after 25 years as a software rights philosopher, lead me to believe that it will take at least a decade for our best minds to find a reasonable answer on where the bright line of acceptable behavior lies with regard to these AI systems. While OSI claims their OSAID is humble, I beg to differ. The humble act now is to admit that it was just too soon to publish a “definition” and to rebrand the OSAID 1.0 as “current recommendations”. That might not grab as many headlines or raise as much money as the OSAID did, but it's the moral and ethical way out of this bad situation.
Finally, rather than merely be a pundit on this matter, I am instead today putting myself forward to try to be part of the solution. I plan to run for the OSI Board of Directors at the next elections on a single-issue platform: I will work arduously for my entire term to see the OSAID repealed, and republished not as a definition but merely as recommendations, and also to see OSI issue a statement that it published the definition sooner than was appropriate. I'll write further about the matter as the next OSI Board election approaches. I also call on other software rights activists to run with me on a similar platform; the OSI has myriad seats that are elected by different constituents, so there is opportunity to run as a ticket on this issue. (Please contact me privately if you'd like to be involved with this ticket at the next OSI Board election. Note, though, that election results are not actually binding, as OSI's by-laws allow the current Board to reject the results of the elections.)
Please email any comments on this entry to info@sfconservancy.org.