Bonum Certa Men Certa

Microsoft GNU-Hub (Part 3: Methodology)

Article by figosdev


Summary: "Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!"

IN part 1 and part 2 we showed the dangers of outsourcing development to a Microsoft platform. Today, figosdev explains the methods. There will be at least a couple more parts after this one. This is a very important if not critical subject and it is completely overlooked by the media, which is too busy promoting GitHub as if it's a champion of "Open Source" (Wired has just done that again) when it is in fact proprietary software controlled by a company that attacks Software Freedom in a lot of ways. Another new article perpetuates the myth that Microsoft contributes the most to "Open Source", based on Microsoft's very own site, which is proprietary software by the way. Facts don't seem to matter these days (the leading "Linux" story this week is about a Microsoft bounty, associating "Linux" with security issues; was a $100,000 bounty granted to the press to amplify Microsoft in the context of "Linux"?).



Without further ado, figosdev:




I didn't start the DeleteGithub meme, though when I found out about the upcoming purchase I quickly worked to migrate -- within about a week.

I had already been locked out of my first GitHub account (which an associate had created for me, and I lost the password to both the account and the associated email) and created a new GitHub repo, which I deleted after migrating.

Obarun also worked very quickly to migrate from GitHub, which I know because I was following the distro at the time.

"When this started, I was hoping to avoid visiting GitHub as much as possible, getting information from other sources. I now use various sources, including GitHub itself."After the first few weeks of Microsoft owning GitHub, I didn't pay very much attention to it at all. #DeleteGithub was created without me.

All the while, I've expected more comments like this one, which Part 2 received:

"Perhaps you should actually research where projects host their code a bit more and you would find that most of them only use github for a mirror.

As a case in point check this link and you will see that the canonical source for perl is not github."

First of all, this comment is ABSOLUTELY wrong. Most of them do NOT only use GitHub for a mirror. I know, because I check that quite often. In fact most of this article was already written the other day to assist a colleague, and I've addressed the issue of mirrors several times already. But I suppose now it will be better to share that information here.

A Free software license means anybody can create a mirror; that also means that the existence of a mirror means nothing whatsoever. It was suggested once that I consider making mirrors their own category -- I do my utmost to ignore them entirely, because they're a meaningless metric that would only be a waste of time to count. They signify nothing; maybe you could argue they imply that someone cares enough about the project to mirror it. That's still not very interesting to me.

"While more people read articles, I was less interested in reporting and (am still) more interested in doing the research, so other people have a good idea of how much of this stuff is being developed and gradually coming to rely on GitHub."When this started, I was looking to find alternatives to certain applications that remained on GitHub, as a sort of boycott / protest / public awareness campaign. After about 20 minutes of looking stuff up, I was shocked at some of the examples I found. I figured that some new applications here and there were on GitHub, no big deal. We've got lots of alternatives.

The first thing I was wrong about (GLADLY) was OpenBSD. When I started, this was a more casual endeavour. It was just a sort of poke around out of curiosity. Wikipedia was a primary source of information, and having done more to verify and audit listings since then, I can say that it's pretty good in practice, even as more than a starting point. But I wasn't paying enough attention to mirrors at first.

If you go to the GitHub page for OpenBSD it says right at the top:

"Public git conversion mirror of OpenBSD's official cvs www repository"

That's plenty clear. When this started, I was hoping to avoid visiting GitHub as much as possible, getting information from other sources. I now use various sources, including GitHub itself.

I thought the first effort was best suited to a wiki, vs. an article. While more people read articles, I was less interested in reporting and (am still) more interested in doing the research, so other people have a good idea of how much of this stuff is being developed and gradually coming to rely on GitHub.

The takeover of Free software has indeed been very slow, planned in the late 90s as a defense of monopolies, brought into public awareness by Eric S. Raymond, and for some time the plans were hosted on the Open Source Initiative website. The GNU website still mirrors these plans, and GitHub looks a lot like them:

"The takeover of Free software has indeed been very slow, planned in the late 90s as a defense of monopolies, brought into public awareness by Eric S. Raymond, and for some time the plans were hosted on the Open Source Initiative website.""One 'blue sky' avenue that should be investigated is if there is any way to turn Linux into an opportunity for Microsoft." https://www.gnu.org/software/fsfe/projects/ms-vs-eu/halloween2.html

This quote is more than 20 years old, and there are countless others from the same group of documents that read a lot like "old news" from 2010 to the present. So my first concern is that GitHub not be used to help Microsoft take "ownership" (aka control of) Free software. That control can be taken gradually with or without non-GitHub mirrors, though non-GitHub mirrors would probably help.

For the person who says "Perhaps you should actually research where projects host their code a bit more", here is a glimpse into what I actually do, written just 2 days ago and not planned for the article:

"It starts simple -- because at first this was all I did. I don't do every part of this process for every single package. I sort of follow my nose, going deeper when I feel it's important. The end result is that nothing is undeniable proof, but across the board you get a good fine-resolution picture with some occasional faulty pixels. I like to think this process would at least be more accurate than turning machine learning on it."

"So first, I go to the project's Wikipedia page. If it doesn't have one, it's probably not important enough to be on delete_github in the first place. But if Wikipedia lists GitHub as the repo, it's probably on GitHub."

"I went further and started counting all the apps in F-droid. About 4 out of 5 (out of several thousand) are based on GitHub."This was how it started, and for certain I could have paid more attention to details, but I was only dipping my toes in. I am very happy that I was wrong about OpenBSD, and eventually did an audit of the hundred or so projects I'd listed, finding that at least FOUR of those hundred or so were actually mirrors. I happily listed the items that were cleared: FFmpeg, OpenBSD, QEMU, Kali (GNU/) Linux

Apache Server was cleared much later. ASF was infiltrated by Microsoft, and they are using GitHub for some things, but Apache Server was taken off the list because it is in fact a mirror. Slitaz was also taken off the list -- not because it was an error, but because they moved off GitHub again.

I went further and started counting all the apps in F-droid. About 4 out of 5 (out of several thousand) are based on GitHub. Did I check the thousands of apps? Of course not. For a ballpark figure, I checked how many listed GitHub as the source or website for the app. If they say it's GitHub, that's good enough for a ballpark figure. But I didn't stop there.
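The ballpark check described above could be sketched roughly as follows. This is only an illustration in Python, not the script actually used: the record layout (the `source` and `website` keys) and the sample entries are assumptions made up for the example, not real F-droid metadata.

```python
from urllib.parse import urlparse

def is_github_url(url):
    """True when the URL's host is github.com or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == "github.com" or host.endswith(".github.com")

def github_share(apps):
    """Fraction of apps whose listed source or website points at GitHub."""
    if not apps:
        return 0.0
    hits = sum(
        1
        for app in apps
        if is_github_url(app.get("source", ""))
        or is_github_url(app.get("website", ""))
    )
    return hits / len(apps)

# Hypothetical records for illustration only.
sample_apps = [
    {"name": "app-a", "source": "https://github.com/example/app-a", "website": ""},
    {"name": "app-b", "source": "https://gitlab.com/example/app-b",
     "website": "https://github.com/example/app-b"},
    {"name": "app-c", "source": "https://codeberg.org/example/app-c",
     "website": "https://example.org"},
    {"name": "app-d", "source": "https://github.com/example/app-d",
     "website": "https://example.org"},
]

print(f"{github_share(sample_apps):.0%} of the sample lists GitHub")
```

A count like this cannot tell a mirror from a real development repo, which is why it only serves as a ballpark figure before any hand-checking.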

I used GitHub data to figure out which of those thousands of apps were the most popular, and then went through that list and hand-picked the most familiar apps. I don't use Android much, though I've always used F-droid when I need Android apps. I haven't done much to verify the F-droid list, and nobody has complained.

Tom pointed me to a couple of LISP library collections, and just to find out if the 70-80% figure came up again, I checked all of those -- and came up with a similar ratio. I'm happy to say that since then, this 4:5 ratio doesn't come up all the time.

"Even if it is partly developed on GitHub, I consider this a problem. It means something is happening that Microsoft controls, that Microsoft has them hostage with."I keep looking for large data sets to play with, having the data for all packages in the Tiny Core repo (I've also figured out how to recursively parse dependencies in the package information for Debian and Trisquel) but these are not sets that I'm always checking and verifying by hand. The purpose is to figure out what to look at next, and that's how I found myself examining the GNU project itself -- carefully.

Most, though not all, of the things I've checked on have an initial phase and sometimes a deeper check involved. I've tended to document what I know, and it's usually pretty obvious from context (from the description) when the level of detail is superficial. A statistic without examples? It's just a cursory check. Examples and explanations? I spent more time. As I said 2 days ago:

This has yielded very few false positives, but Apache httpd is still a mirror. Mirrors don't count, not to me at least, because per the license, Microsoft is allowed to mirror every Free software package that exists... A GitHub mirror really means that the "real repo" is somewhere else, that's what we want. We are looking at people using GitHub for development:

* Bug tracking
* Pull requests (even worse than bug tracking)
* "Official repo" (not mirror)

In order from least bad to worst, those are the things we are looking for. Even if it is partly developed on GitHub, I consider this a problem. It means something is happening that Microsoft controls, that Microsoft has them hostage with. [People actually say things like] "We can't lose our bug tracker! There's so much valuable data there!" Michele says Git is distributed so it's a non-issue. "But Gitea is migrating anyway" -- that's good news then. But I think [that's been said] for a year or two, so take it with a grain of salt until there's more evidence.

"For Tiny Core, as I mentioned previously, I downloaded not only every package (looking for package data which isn't in the package) but every .info file, when I figured out that's where packages dependencies were listed."Anyway, the most important criteria is:

* Wikipedia or better yet the project's own homepage links to GitHub
* It's not a mirror

At some point I started checking project websites too. Seems obvious, but you have to realise that when this thing started I was being very casual. Just poking around, not being serious about it.

For Tiny Core, as I mentioned previously, I downloaded not only every package (looking for package data which isn't in the package) but every .info file, when I figured out that's where package dependencies were listed.

Dependency lists, whether we are talking about Tiny Core or Debian, typically only cover immediate deps -- not deps of deps or deps of deps of deps. So I literally wrote a recursive routine to turn the dep data for Tiny Core into a FULL list of packages that require each item. It takes 45 minutes or more (to write and work the bugs out) though after it's done you kick yourself for the time wasted doing it manually. I probably spent 2 hours trying to do a fraction of the work that way.
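For readers who want to try something similar, here is a minimal sketch of such a recursive routine in Python. It is not the original code, and the package names and the `direct_deps` mapping are hypothetical stand-ins for whatever gets parsed out of the dependency files.

```python
def reverse_index(direct_deps):
    """Map each package to the set of packages that list it as a direct dep."""
    needed_by = {}
    for pkg, deps in direct_deps.items():
        needed_by.setdefault(pkg, set())
        for dep in deps:
            needed_by.setdefault(dep, set()).add(pkg)
    return needed_by

def all_dependents(target, needed_by, seen=None):
    """Recursively collect every package that pulls in `target`, directly or not."""
    if seen is None:
        seen = set()
    for pkg in needed_by.get(target, ()):
        if pkg not in seen:  # the `seen` set also guards against dependency cycles
            seen.add(pkg)
            all_dependents(pkg, needed_by, seen)
    return seen

# Hypothetical immediate-dependency data, for illustration only.
direct_deps = {
    "app-x": ["glib2"],
    "app-y": ["app-x"],
    "app-z": ["libpng"],
    "glib2": ["libffi"],
    "libffi": [],
    "libpng": [],
}

needed_by = reverse_index(direct_deps)
print(sorted(all_dependents("glib2", needed_by)))
print(sorted(all_dependents("libffi", needed_by)))
```

Here app-y never mentions libffi itself, but it still ends up on the libffi list because it needs app-x, which needs glib2, which needs libffi.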

"Behold, every official GNU project. Once again, I started doing this manually. Spent at least 10 hours doing that, got 1/4 the way up the list, from Xnee up to Metahtml. The article itself took 45 minutes or so."Now I can say "which packages pull in glib2" and get a full list.

I can do that FOR EACH PACKAGE then run wc -l on each list, getting a count of how many packages need each thing -- like how many need glib, how many need libffi, etc., run the list through sort -n and you know which deps are the most needed. Libffi is right at the top. In fact Roy tells me people were complaining about libffi a day or two prior to this discovery, but I pulled the fact right from the data I had cached. So now it's telling me things other people won't -- aka verifying things other people know.
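The counting and sorting step can be approximated in a few lines of Python, as a stand-in for the wc -l and sort -n pipeline. The `dependents` mapping below is hypothetical sample data; in practice it would come from the per-package lists described above.

```python
def rank_by_dependents(dependents):
    """Return (count, package) pairs, most-needed package first."""
    counts = [(len(pkgs), name) for name, pkgs in dependents.items()]
    # Sort by descending count, then by name so ties come out in a stable order.
    return sorted(counts, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical data: package -> every package that pulls it in.
dependents = {
    "glib2": {"app-x", "app-y"},
    "libffi": {"glib2", "app-x", "app-y"},
    "libpng": {"app-z"},
}

for count, name in rank_by_dependents(dependents):
    print(count, name)
```

This sketch sorts in descending order so the most-needed dependency is printed first; plain sort -n would put it last.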

All well and good, so I know the usual suspects when I download Trisquel. But I got bored with Trisquel (fig spent about a week processing all the source code so it's easier to search) and started looking at GNU.

I start here http://savannah.gnu.org/search/?Search=Search&words=*&type_of_search=soft&exact=0&max_rows=500&type=1#options

"It's a mix of manually checking websites, manually checking Wikipedia, manually and programmatically checking package data and even setting up a dedicated machine to spend several days processing the tens of gigs of source code to Trisquel."Behold, every official GNU project. Once again, I started doing this manually. Spent at least 10 hours doing that, got 1/4 the way up the list, from Xnee up to Metahtml. The article itself took 45 minutes or so.

For the GNU stuff, I finally had the "server" search all the code for things like bffi, perl, .pl, .py, ython, ithub, png, flex. This is all [output to] a single text file, and it shows the path/name of the file (project name/path/actual file/line of text found), so if I grep this file for, say, "gperf", I get stuff like this:

89/gcc-9.3.0/libsanitizer/sanitizer_common/sanitizer_procmaps_mac.cc:// Google Perftools, https://github.com/gperftools/gperftools.
108/global-6.6.4/reconf.sh:prog='autoconf automake bison flex gperf libtool m4 perl'^I# required programs
123/gperf-3.1/ChangeLog: when the -n option is used. Previously, it didn't
123/gperf-3.1/ChangeLog: I'm too busy to fix it , right now. The problem
123/gperf-3.1/ChangeLog: they weren't being entered into the hash table .
123/gperf-3.1/ChangeLog: * Added the -D option that handles keyword sets that
123/gperf-3.1/ChangeLog: * Modified Key_List::print_hash_function so that it
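Output in this shape is easy to tally per project. The following Python sketch assumes each result line looks like "folder number/project directory/file path:matched text", as in the sample above. The function names are made up for the example, and a file path that itself contains a colon would confuse this simple split.

```python
from collections import Counter

def parse_hit(line):
    """Split one result line into (folder number, project dir, file path, text)."""
    location, _, text = line.partition(":")
    number, project, *rest = location.split("/")
    return int(number), project, "/".join(rest), text

def hits_per_project(lines, term):
    """Count result lines mentioning `term` (case-insensitive), grouped by project."""
    counts = Counter()
    for line in lines:
        # Like grep on the results file, this matches the path as well as the text.
        if term.lower() in line.lower():
            _, project, _, _ = parse_hit(line)
            counts[project] += 1
    return counts

results = [
    "89/gcc-9.3.0/libsanitizer/sanitizer_common/sanitizer_procmaps_mac.cc:"
    "// Google Perftools, https://github.com/gperftools/gperftools.",
    "108/global-6.6.4/reconf.sh:prog='autoconf automake bison flex gperf "
    "libtool m4 perl'^I# required programs",
    "123/gperf-3.1/ChangeLog: when the -n option is used. Previously, it didn't",
]

for project, count in sorted(hits_per_project(results, "gperf").items()):
    print(project, count)
```

Note that the gcc line only matches because "gperftools" contains "gperf", which is the kind of false lead that still needs a human to look at it.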

"I told bash to make a numbered folder for each project, that way I can cd 123/ TAB TAB instead of spelling out the project folder name and so I can iterate/grep using seq instead of folder names. That's a convenience, so if I'm babbling its not important."

"I hand-checked each one more than once for mirrors, and there is no way to do 100% of this programmatically. Not every project follows the same rules.""I look for png files, perl, python, libffi, glib. I write it down. Then I tally it up later."

"That's how I do it. All in all, there's logic but I try to do whats logical / convenient / efficient. And other than that I just play it by ear."

TL;DR: It's a mix of manually checking websites, manually checking Wikipedia, manually and programmatically checking package data and even setting up a dedicated machine to spend several days processing the tens of gigs of source code to Trisquel.

Sometimes I get into the includes in the C code and note that include libpng is in an #ifdef. I'm not really a C coder but at least I get the concept of "you can configure this to compile with optional dependencies."
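A rough way to spot includes that sit inside a conditional block is sketched below in Python. It is a line-based approximation and not a real C preprocessor: it only tracks #if / #ifdef / #ifndef nesting, so a header guard would make every include in that file look optional. The macro name HAVE_LIBPNG is a hypothetical example.

```python
import re

INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]')
OPEN = re.compile(r'^\s*#\s*(if|ifdef|ifndef)\b')
CLOSE = re.compile(r'^\s*#\s*endif\b')

def conditional_includes(source):
    """Return (header, is_conditional) for each #include line in the source text."""
    depth = 0
    found = []
    for line in source.splitlines():
        if OPEN.match(line):
            depth += 1
        elif CLOSE.match(line):
            depth = max(depth - 1, 0)
        else:
            match = INCLUDE.match(line)
            if match:
                found.append((match.group(1), depth > 0))
    return found

# Hypothetical C snippet for illustration.
sample = """\
#include <stdio.h>
#ifdef HAVE_LIBPNG
#include <png.h>
#endif
"""

for header, optional in conditional_includes(sample):
    print(header, "optional" if optional else "always included")
```

An include flagged as conditional here is only a hint that the dependency may be optional at build time; the project's configure options are the place to confirm it.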

But even when I wasn't trying, the first hundred or so entries were mostly accurate. I hand-checked each one more than once for mirrors, and there is no way to do 100% of this programmatically. Not every project follows the same rules.

And the data changes, too. But someone had asked me about Perl specifically: "I took a quick look at the perl site. They do not cite GitHub..."

"They're developing Perl 6 on GitHub, so it fits the methodology."That's not entirely true. So I told them what I knew so far:

"Some [of the most important] things I check over and over and over. Perl is one. Let's do it again, it's useful exercise..."

https://en.wikipedia.org/wiki/Perl nothing.

Let's take a detour towards "Perl 6": https://en.wikipedia.org/wiki/Raku_(programming_language)

https://raku.org/

"Language Design"

"Either way, Perl 6 is absolutely, for the intent and purpose of this study, being developed on GitHub.""Specification - Official Raku language specification test suite" https://github.com/Raku/roast

[Roy is likely to turn these into links. It's very possible the link he creates will not match the exact link from the website -- the url will be the same, the text will be the same, but the actual overlap of text and underlined link may vary. In the original writing, I kept the text and urls separate.]

* Issues 78
* Pull requests 25

"Raku is GitHub."

See what I did there? They're developing Perl 6 on GitHub, so it fits the methodology.

You may not agree with this methodology, for various reasons. You might have a good reason -- or you could be a shill or marketing person. You could just be an ordinary fanboy. BUT, you might also have a point! Either way, Perl 6 is absolutely, for the intent and purpose of this study, being developed on GitHub.

By all means, I expect other people to make a case for/against considering it "captured or controlled by Microsoft." My concern is Early Warning. You may have "better" criteria to offer that you consider more useful.

"Puppy Linux, to cite a different example, is developed all over the place."The whole idea of this project is to get the conversation going. But I've gone to great lengths to provide useful data overall, and I do try harder when there is a key project like Perl involved. It continues:

"Back to perl.org:"

"Contribute" https://www.perl.org/contribute.html

"Contribute to Perl Core" http://dev.perl.org/perl5/

"Perl" (these things are quoted as headings to look for, if they're listed more than once there's a reason)

"Perl" "Production-ready, under active development"

This part says:

"Some people have expressed an interest in getting Puppy away from GitHub.""Perl 5.30.2 is the current stable version of Perl. Perl is actively maintained and developed (git repository) by a large group of dedicated volunteers."

That links to this url:

https://github.com/Perl/perl5

So we go to that url, and it says:

* Issues 1,865
* Pull requests 33

Whatever the "Canonical url" is, what matters to me and what I'm actually looking for is that

* The official website tells people to go to GitHub.
* The GitHub repo is being used for Issues and Pull requests

If there aren't Issues and Pull requests, I make the decision based on other factors.

If the README.md says "please don't use this for issues or pull requests" then that certainly counts for something.
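Taken together, the criteria above amount to a small decision procedure. The Python sketch below is one possible reading of it, not a tool that was actually used. The field names and result labels are invented for the example, and the "example-mirror" and "example-quiet" entries are placeholders. The Perl and Raku counts are the ones quoted in this article.

```python
def classify(project):
    """Apply the listing criteria to one project record (hypothetical field names)."""
    if not project["site_links_to_github"]:
        return "not listed"
    if project["is_mirror"]:
        return "cleared: mirror"
    if project["readme_discourages_github"]:
        # The README asking people not to file issues or PRs counts for something.
        return "manual review"
    if project["issues"] > 0 or project["pull_requests"] > 0:
        return "listed"
    # No issues or pull requests: the decision rests on other factors.
    return "manual review"

projects = [
    {"name": "Perl/perl5", "site_links_to_github": True, "is_mirror": False,
     "readme_discourages_github": False, "issues": 1865, "pull_requests": 33},
    {"name": "Raku/roast", "site_links_to_github": True, "is_mirror": False,
     "readme_discourages_github": False, "issues": 78, "pull_requests": 25},
    # Placeholder entries with made-up values.
    {"name": "example-mirror", "site_links_to_github": True, "is_mirror": True,
     "readme_discourages_github": False, "issues": 0, "pull_requests": 0},
    {"name": "example-quiet", "site_links_to_github": True, "is_mirror": False,
     "readme_discourages_github": False, "issues": 0, "pull_requests": 0},
]

for project in projects:
    print(project["name"], "->", classify(project))
```

The "manual review" outcomes are the cases that cannot be settled programmatically and still need someone to read the project's own pages.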

Puppy Linux, to cite a different example, is developed all over the place. Packages are strewn across countless hobbyists' websites. There are too many derivatives to even count -- literally hundreds of fan-based ISOs exist online that people have made over the years. Several active derivatives exist -- many are based on Woof-CE, which is developed on GitHub. Some are not.

"Folks, canonical urls don't tell the whole story. Whether Microsoft has you in its clutches ultimately comes down to details and down to the reality of the project."Some people have expressed an interest in getting Puppy away from GitHub. Puppy, DSL (a predecessor to Tiny Core) and Xubuntu were the three distros that helped me finally delete my last copy of Windows more than 10 years ago, and I learned as much about GNU/Linux from Puppy as the others.

But the Woof team has made it clear they are unlikely to migrate. The work they've done that keeps them "locked in" to GitHub isn't as easy to "fork" as the codebase, and this is by design.

Folks, canonical urls don't tell the whole story. Whether Microsoft has you in its clutches ultimately comes down to details and down to the reality of the project.

We can debate that, but the purpose of what I'm doing is to find the projects to worry about, so something can be done. By all means, help me get some of the items off the list. But you probably won't find anybody who has gone to more effort than I have, unless it's Roy. And Roy is interested. And I've heard from fewer critics than people who are interested as well.

"Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!"If you have are a fan, user or developer of a vital project like Perl, I am certainly interested in (and continually looking for) criteria that would allow it to be taken off this list. The whole point of putting it on the list is the hope of getting it back off the list again.

And if you missed them, here are some disclaimers that I've already made:

"I don’t always trust Debian dependencies, but they’re certainly illustrative" (part 1)

"If there are obvious mistakes or less obvious misconceptions I’m presenting when I talk about some of the details, I hope you’ll mention it in the comments. I’m sure there will be a few differences of opinion as well." (part 1)

"Python is worth watching for, but only proves to be a GitHub hostage sometimes." (part 1)

"This isn’t just about where the code is, but where the development takes place and who controls access." (part 2)

"This isn’t to admonish the author for not following a rule that doesn’t exist, but to highlight the more-than-hypothetical threat that the GNU project faces" (part 2)

"But do, if you're interested, please help get these projects off this endangered species list."At each new chapter of this research, I tend not to rely exclusively on previous research. Certainly I learn more as I do this, but I lean towards using each new focus as an opportunity to redundantly check things I've checked already -- that's how I discovered that Slitaz had moved. So at each pivot, I often get fresh data to confirm or update previous data.

Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!

But do, if you're interested, please help get these projects off this endangered species list. I'm generally quite interested in the evidence that other people can bring to the table, a good deal of the feedback received has proven useful. I still consider Perl to be endangered, and I think the criteria are relevant.

Long live rms, and happy hacking.

Licence: Creative Commons CC0 1.0 (public domain)
