Bonum Certa Men Certa

Microsoft GNU-Hub (Part 3: Methodology)

Article by figosdev


Summary: "Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!"

In part 1 and part 2 we showed the dangers of outsourcing development to a Microsoft platform. Today, figosdev explains the methods. There will be at least a couple more parts after this one. This is a very important if not critical subject and it is completely overlooked by the media, which is too busy promoting GitHub as if it's a champion of "Open Source" (Wired has just done that again) when it is in fact proprietary software controlled by a company that attacks Software Freedom in a lot of ways. Another new article perpetuates the myth that Microsoft contributes the most to "Open Source", based on Microsoft's very own site, which is proprietary software by the way. Facts don't seem to matter these days (the leading "Linux" story this week is about a Microsoft bounty, associating "Linux" with security issues; was a $100,000 bounty granted to the press to amplify Microsoft in the context of "Linux"?).



Without further ado, figosdev:




I didn't start the DeleteGithub meme, though when I found out about the upcoming purchase I quickly worked to migrate -- within about a week.

I had already been locked out of my first GitHub account (which an associate had created for me, and I lost the password to both the account and the associated email) and created a new GitHub repo, which I deleted after migrating.

Obarun also worked very quickly to migrate from GitHub, which I know because I was following the distro at the time.

After the first few weeks of Microsoft owning GitHub, I didn't pay very much attention to it at all. #DeleteGithub was created without me.

All the while, I've expected more comments like this one, which Part 2 received:

"Perhaps you should actually research where projects host their code a bit more and you would find that most of them only use github for a mirror.

As a case in point check this link and you will see that the canonical source for perl is not github."

First of all, this comment is ABSOLUTELY wrong. Most of them do NOT only use GitHub for a mirror. I know, because I check that quite often. In fact most of this article was already written the other day to assist a colleague, and I've addressed the issue of mirrors several times already. But I suppose now it will be better to share that information here.

A Free software license means anybody can create a mirror; that also means that the existence of a mirror means nothing whatsoever. It was suggested once that I consider making mirrors their own category -- I do my utmost to ignore them entirely, because they're a meaningless metric that would only be a waste of time to count. They signify nothing; maybe you could argue they imply that someone cares enough about the project to mirror it. That's still not very interesting to me.

When this started, I was looking to find alternatives to certain applications that remained on GitHub, as a sort of boycott / protest / public awareness campaign. After about 20 minutes of looking stuff up, I was shocked at some of the examples I found. I figured that some new applications here and there were on GitHub, no big deal. We've got lots of alternatives.

The first thing I was wrong about (GLADLY) was OpenBSD. When I started, this was a more casual endeavour. It was just a sort of poke around out of curiosity. Wikipedia was a primary source of information, and having done more to verify and audit listings since then, I can say that it's pretty good in practice, even as more than a starting point. But I wasn't paying enough attention to mirrors at first.

If you go to the GitHub page for OpenBSD it says right at the top:

"Public git conversion mirror of OpenBSD's official cvs www repository"

That's plenty clear. When this started, I was hoping to avoid visiting GitHub as much as possible, getting information from other sources. I now use various sources, including GitHub itself.

I thought the first effort was best suited to a wiki, vs. an article. While more people read articles, I was less interested in reporting and (am still) more interested in doing the research, so other people have a good idea of how much of this stuff is being developed and gradually coming to rely on GitHub.

The takeover of Free software has indeed been very slow, planned in the late 90s as a defense of monopolies, brought into public awareness by Eric S. Raymond, and for some time the plans were hosted on the Open Source Initiative website. The GNU website still mirrors these plans, and GitHub looks a lot like them:

"One 'blue sky' avenue that should be investigated is if there is any way to turn Linux into an opportunity for Microsoft." https://www.gnu.org/software/fsfe/projects/ms-vs-eu/halloween2.html

This quote is more than 20 years old, and there are countless others from the same group of documents that read a lot like "old news" from 2010 to the present. So my first concern is that GitHub not be used to help Microsoft take "ownership" (aka control of) Free software. That control can be taken gradually with or without non-GitHub mirrors, though non-GitHub mirrors would probably help.

For the person who says "Perhaps you should actually research where projects host their code a bit more", here is a glimpse into what I actually do, written just 2 days ago and not planned for the article:

"It starts simple -- because at first this was all I did. I don't do every part of this process for every single package. I sort of follow my nose, going deeper when I feel it's important. The end result is that nothing is undeniable proof, but across the board you get a good fine-resolution picture with some occasional faulty pixels. I like to think this process would at least be more accurate than turning machine learning on it."

"So first, I go to the project's Wikipedia page. If it doesn't have one, it's probably not important enough to be on delete_github in the first place. But if Wikipedia lists GitHub as the repo, it's probably on GitHub."

This was how it started, and for certain I could have paid more attention to details, but I was only dipping my toes in. I am very happy that I was wrong about OpenBSD, and eventually did an audit of the hundred or so projects I'd listed, finding that at least FOUR of those hundred or so were actually mirrors. I happily listed the items that were cleared: FFmpeg, OpenBSD, QEMU, Kali (GNU/) Linux.

Apache Server was cleared much later. ASF was infiltrated by Microsoft, and they are using GitHub for some things, but Apache Server was taken off the list because it is in fact a mirror. Slitaz was also taken off the list -- not because it was an error, but because they moved off GitHub again.

I went further and started counting all the apps in F-droid. About 4 out of 5 (out of several thousand) are based on GitHub. Did I check the thousands of apps? Of course not. For a ballpark figure, I checked how many listed GitHub as the source or website for the app. If they say it's GitHub, that's good enough for a ballpark figure. But I didn't stop there.
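As a rough illustration, the ballpark check described above could look like the following Python sketch. It is not the script used for this research. The helper names `is_github` and `github_share`, and every app name and URL in `sample`, are made up for the example; real F-droid metadata would have to be loaded separately.

```python
from urllib.parse import urlparse

def is_github(url):
    """True if the URL's host is github.com or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == "github.com" or host.endswith(".github.com")

def github_share(apps):
    """Fraction of apps with at least one listed URL on GitHub.

    `apps` maps an app name to the URLs it lists as source or website.
    """
    if not apps:
        return 0.0
    hits = sum(1 for urls in apps.values() if any(is_github(u) for u in urls))
    return hits / len(apps)

# Hypothetical listing data, for illustration only.
sample = {
    "app_a": ["https://github.com/example/app_a"],
    "app_b": ["https://example.org/app_b", "https://github.com/example/app_b"],
    "app_c": ["https://gitlab.com/example/app_c"],
    "app_d": ["https://notgithub.com.example.net/app_d"],
    "app_e": ["https://gist.github.com/example/app_e"],
}

print(f"{github_share(sample):.0%} of the sample lists GitHub")
```

Comparing the parsed hostname, rather than searching the whole URL for the string "github", avoids counting lookalike domains. A count like this says nothing about mirrors, which is why it is only good for a ballpark figure.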

I used GitHub data to figure out which of those thousands of apps were the most popular, and then went through that list and hand-picked the most familiar apps. I don't use Android much, though I've always used F-droid when I need Android apps. I haven't done much to verify the F-droid list, and nobody has complained.

Tom pointed me to a couple of LISP library collections, and just to find out if the 70-80% figure came up again, I checked all of those -- and came up with a similar ratio. I'm happy to say that since then, this 4:5 ratio doesn't come up all the time.

I keep looking for large data sets to play with, having the data for all packages in the Tiny Core repo (I've also figured out how to recursively parse dependencies in the package information for Debian and Trisquel) but these are not sets that I'm always checking and verifying by hand. The purpose is to figure out what to look at next, and that's how I found myself examining the GNU project itself -- carefully.

Most, though not all, of the things I've checked on have an initial phase and sometimes a deeper check involved. I've tended to document what I know, and it's usually pretty obvious from context (from the description) when the level of detail is superficial. A statistic without examples? It's just a cursory check. Examples and explanations? I spent more time. As I said 2 days ago:

This has yielded very few false positives, but Apache httpd is still a mirror. Mirrors don't count, not to me at least, because per the license, Microsoft is allowed to mirror every Free software package that exists... A GitHub mirror really means that the "real repo" is somewhere else, that's what we want. We are looking at people using GitHub for development:

* Bug tracking
* Pull requests (even worse than bug tracking)
* "Official repo" (not mirror)

In order from least bad to worst, those are the things we are looking for. Even if it is partly developed on GitHub, I consider this a problem. It means something is happening that Microsoft controls, that Microsoft has them hostage with. [People actually say things like] "We can't lose our bug tracker! There's so much valuable data there!" Michele says Git is distributed so it's a non-issue. "But Gitea is migrating anyway": that's good news then. But I think [that's been said] for a year or two, so take it with a grain of salt until there's more evidence.

Anyway, the most important criteria are:

* Wikipedia or better yet the project's own homepage links to GitHub
* It's not a mirror

At some point I started checking project websites too. Seems obvious, but you have to realise that when this thing started I was being very casual. Just poking around, not being serious about it.

For Tiny Core, as I mentioned previously, I downloaded not only every package (looking for package data which isn't in the package) but every .info file, when I figured out that's where package dependencies were listed.

Dependency lists, whether we are talking about Tiny Core or Debian, typically only cover immediate deps -- not deps of deps or deps of deps of deps. So I literally write a recursive routine to turn the dep data for Tiny Core into a FULL list of packages that require each item. It takes 45 minutes or more (to write and work the bugs out) though after it's done you kick yourself for the time wasted doing it manually. I probably spent 2 hours trying to do a fraction of the work that way.
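For readers who want to try something similar, here is a minimal Python sketch of the idea: start from the immediate dependencies and work out, for every item, the full set of packages that need it directly or indirectly. It is an illustration under assumptions, not the routine described above. The function name `reverse_closure` and the tiny `direct` table are invented for the example, and real Tiny Core or Debian data would have to be parsed into this shape first.

```python
def reverse_closure(direct_deps):
    """Map each item to every package that needs it, directly or indirectly.

    `direct_deps` maps a package name to its immediate dependencies only.
    """
    # Invert the direct edges: dependency -> packages that list it.
    needed_by = {}
    for pkg, deps in direct_deps.items():
        needed_by.setdefault(pkg, set())
        for dep in deps:
            needed_by.setdefault(dep, set()).add(pkg)

    def collect(item, seen):
        # Walk upwards recursively; `seen` stops loops on circular deps.
        for parent in needed_by.get(item, ()):
            if parent not in seen:
                seen.add(parent)
                collect(parent, seen)
        return seen

    result = {}
    for item in needed_by:
        found = collect(item, set())
        found.discard(item)  # an item in a cycle does not "need itself"
        result[item] = found
    return result

# Made-up dependency table, for illustration only.
direct = {
    "app": ["gtk2"],
    "gtk2": ["glib2"],
    "glib2": ["libffi"],
    "tool": ["libffi"],
    "libffi": [],
}

for name, users in sorted(reverse_closure(direct).items()):
    print(name, sorted(users))
```

Plain recursion is enough for a sketch like this; a very deep dependency chain could hit Python's recursion limit, in which case an explicit stack would be the safer choice.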

Now I can say "which packages pull in glib2" and get a full list.

I can do that FOR EACH PACKAGE then run wc -l on each list, getting a count of how many packages need each thing -- like how many need glib, how many need libffi, etc., run the list through sort -n and you know which deps are the most needed. Libffi is right at the top. In fact Roy tells me people were complaining about libffi a day or two prior to this discovery, but I pulled the fact right from the data I had cached. So now it's telling me things other people won't -- aka verifying things other people know.
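The counting step (one list per dependency, a line count for each, then a numeric sort) can be sketched in Python as well. This is only an illustration of the tallying, with a made-up `dependents` table and a hypothetical helper name `rank_by_dependents`; the numbers below are not results from Tiny Core.

```python
def rank_by_dependents(dependents):
    """Return (count, name) pairs, most-needed item first.

    `dependents` maps an item to the set of packages that require it.
    Ties are broken alphabetically so the output is stable.
    """
    counts = [(len(pkgs), name) for name, pkgs in dependents.items()]
    return sorted(counts, key=lambda pair: (-pair[0], pair[1]))

# Made-up data, for illustration only.
dependents = {
    "glib2": {"gtk2", "app"},
    "libffi": {"glib2", "gtk2", "app", "tool"},
    "gtk2": {"app"},
}

for count, name in rank_by_dependents(dependents):
    print(count, name)
```

Note that `sort -n` on its own sorts ascending, so the most-needed item ends up on the last line; the sketch sorts descending instead so that it comes first.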

All well and good, so I know the usual suspects when I download Trisquel. But I get bored with Trisquel (fig spent about a week processing all the source code so it's easier to search) and started looking at GNU.

I start here http://savannah.gnu.org/search/?Search=Search&words=*&type_of_search=soft&exact=0&max_rows=500&type=1#options

Behold, every official GNU project. Once again, I started doing this manually. Spent at least 10 hours doing that, got 1/4 the way up the list, from Xnee up to Metahtml. The article itself took 45 minutes or so.

For the GNU stuff, I finally had the "server" search all the code for things like bffi, perl, .pl, .py, ython, ithub, png, flex. This is all [output to] a single text file, and it shows the path/name of the file (project name/path/actual file/ line of text found), so if I grep this file for "gperf", for example, I get stuff like this:

89/gcc-9.3.0/libsanitizer/sanitizer_common/sanitizer_procmaps_mac.cc:// Google Perftools, https://github.com/gperftools/gperftools.
108/global-6.6.4/reconf.sh:prog='autoconf automake bison flex gperf libtool m4 perl'^I# required programs
123/gperf-3.1/ChangeLog: when the -n option is used. Previously, it didn't
123/gperf-3.1/ChangeLog: I'm too busy to fix it , right now. The problem
123/gperf-3.1/ChangeLog: they weren't being entered into the hash table .
123/gperf-3.1/ChangeLog: * Added the -D option that handles keyword sets that
123/gperf-3.1/ChangeLog: * Modified Key_List::print_hash_function so that it

"I told bash to make a numbered folder for each project, that way I can cd 123/ TAB TAB instead of spelling out the project folder name and so I can iterate/grep using seq instead of folder names. That's a convenience, so if I'm babbling it's not important."
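A search of this kind can be approximated with a short Python sketch. To be clear, this is a stand-in written for illustration, not the bash and grep setup described above: the function name `search_tree` is invented, and it assumes the sources have already been unpacked under one root directory (numbered project folders or otherwise).

```python
import os

def search_tree(root, needles):
    """Yield 'relative/path:line' for every line containing any needle.

    Matching is plain, case-sensitive substring matching, so a needle
    like "ithub" catches both "GitHub" and "github".
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            try:
                # errors="replace" keeps binary or oddly encoded files
                # from aborting the whole run.
                with open(full, "r", errors="replace") as handle:
                    for line in handle:
                        if any(n in line for n in needles):
                            yield f"{rel}:{line.rstrip()}"
            except OSError:
                continue  # unreadable file: skip it
```

Writing the results of something like `search_tree("sources", ["ithub", "ython", "bffi"])` to one text file (where `sources` is a placeholder directory name) gives a single index that can then be filtered again for a specific project or keyword.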

"I look for png files, perl, python, libffi, glib. I write it down. Then I tally it up later."

"That's how I do it. All in all, there's logic but I try to do what's logical / convenient / efficient. And other than that I just play it by ear."

TL;DR: It's a mix of manually checking websites, manually checking Wikipedia, manually and programmatically checking package data and even setting up a dedicated machine to spend several days processing the tens of gigs of source code to Trisquel.

Sometimes I get into the includes in the C code and note that include libpng is in an #ifdef. I'm not really a C coder but at least I get the concept of "you can configure this to compile with optional dependencies."
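The same distinction can be roughed out mechanically. The Python sketch below separates includes that sit inside an `#if`/`#ifdef`/`#ifndef` block from those that do not. It is a crude illustration, not a C preprocessor: the function name `split_includes` and the macro `HAVE_LIBPNG` in the sample are assumptions for the example, and a header guard would make every include in a header look conditional.

```python
import re

_OPEN = re.compile(r"^\s*#\s*(if|ifdef|ifndef)\b")
_CLOSE = re.compile(r"^\s*#\s*endif\b")
_INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]')

def split_includes(source):
    """Return (unconditional, conditional) lists of included headers."""
    depth = 0  # how many #if-style blocks are currently open
    unconditional, conditional = [], []
    for line in source.splitlines():
        if _OPEN.match(line):
            depth += 1
        elif _CLOSE.match(line):
            depth = max(depth - 1, 0)
        else:
            found = _INCLUDE.match(line)
            if found:
                target = conditional if depth else unconditional
                target.append(found.group(1))
    return unconditional, conditional

# Made-up C fragment, for illustration only.
sample = """\
#include <stdio.h>
#ifdef HAVE_LIBPNG
#include <png.h>
#endif
#include "config.h"
"""

always, optional = split_includes(sample)
print("always:", always)
print("optional:", optional)
```

A conditional include is only a hint that the dependency may be optional; the build configuration still has to be read to confirm it.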

But even when I wasn't trying, the first hundred or so entries were mostly accurate. I hand-checked each one more than once for mirrors, and there is no way to do 100% of this programmatically. Not every project follows the same rules.

And the data changes, too. But someone had asked me about Perl specifically: "I took a quick look at the perl site. They do not cite GitHub..."

That's not entirely true. So I told them what I knew so far:

"Some [of the most important] things I check over and over and over. Perl is one. Let's do it again, it's useful exercise..."

https://en.wikipedia.org/wiki/Perl nothing.

Let's take a detour towards "Perl 6": https://en.wikipedia.org/wiki/Raku_(programming_language)

https://raku.org/

"Language Design"

"Specification - Official Raku language specification test suite" https://github.com/Raku/roast

[Roy is likely to turn these into links. It's very possible the link he creates will not match the exact link from the website -- the url will be the same, the text will be the same, but the actual overlap of text and underlined link may vary. In the original writing, I kept the text and urls separate.]

* Issues 78
* Pull requests 25

"Raku is GitHub."

See what I did there? They're developing Perl 6 on GitHub, so it fits the methodology.

You may not agree with this methodology, for various reasons. You might have a good reason -- or you could be a shill or marketing person. You could just be an ordinary fanboy. BUT, you might also have a point! Either way, Perl 6 is absolutely, for the intent and purpose of this study, being developed on GitHub.

By all means, I expect other people to make a case for/against considering it "captured or controlled by Microsoft." My concern is Early Warning. You may have "better" criteria to offer that you consider more useful.

The whole idea of this project is to get the conversation going. But I've gone to great lengths to provide useful data overall, and I do try harder when there is a key project like Perl involved. It continues:

"Back to perl.org:"

"Contribute" https://www.perl.org/contribute.html

"Contribute to Perl Core" http://dev.perl.org/perl5/

"Perl" (these things are quoted as headings to look for, if they're listed more than once there's a reason)

"Perl" "Production-ready, under active development"

This part says:

"Perl 5.30.2 is the current stable version of Perl. Perl is actively maintained and developed (git repository) by a large group of dedicated volunteers."

That links to this url:

https://github.com/Perl/perl5

So we go to that url, and it says:

* Issues 1,865
* Pull requests 33

Whatever the "Canonical url" is, what matters to me and what I'm actually looking for is that

* The official website tells people to go to GitHub.
* The GitHub repo is being used for Issues and Pull requests

If there aren't Issues and Pull requests, I make the decision based on other factors.

If the README.md says "please don't use this for issues or pull requests" then that certainly counts for something.
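Written out as code, the criteria above might look like the following Python sketch. It is a deliberate simplification for illustration only: the `Project` fields, the `classify` function and its labels are invented here, and the real decisions involve hand-checking that no rule set captures. The Perl 5 numbers are the ones quoted above; the other two entries are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    site_links_to_github: bool
    described_as_mirror: bool
    issues: int
    pull_requests: int
    readme_declines_issues_and_prs: bool = False

def classify(project):
    """Apply the criteria in order and return a rough label."""
    if project.described_as_mirror:
        return "mirror"
    if not project.site_links_to_github:
        return "not listed"
    if project.readme_declines_issues_and_prs:
        # This "counts for something", so it goes to a human.
        return "check by hand"
    if project.issues > 0 or project.pull_requests > 0:
        return "developed on GitHub"
    # No Issues or Pull requests: decide based on other factors.
    return "check by hand"

examples = [
    Project("perl5", True, False, issues=1865, pull_requests=33),
    Project("example-mirror", True, True, issues=0, pull_requests=0),
    Project("example-quiet", True, False, issues=0, pull_requests=0),
]

for project in examples:
    print(project.name, "->", classify(project))
```

The point of returning "check by hand" rather than guessing is that the borderline cases are exactly the ones where a person has to read the project's own pages.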

Puppy Linux, to cite a different example, is developed all over the place. Packages are strewn across countless hobbyists' websites. There are too many derivatives to even count -- literally hundreds of fan-based ISOs exist online that people have made over the years. Several active derivatives exist -- many are based on Woof-CE, which is developed on GitHub. Some are not.

Some people have expressed an interest in getting Puppy away from GitHub. Puppy, DSL (a predecessor to Tiny Core) and Xubuntu were the three distros that helped me finally delete my last copy of Windows more than 10 years ago, and I learned as much about GNU/Linux from Puppy as the others.

But the Woof team has made it clear they are unlikely to migrate. The work they've done that keeps them "locked in" to GitHub isn't as easy to "fork" as the codebase, and this is by design.

Folks, canonical urls don't tell the whole story. Whether Microsoft has you in its clutches ultimately comes down to details and down to the reality of the project.

We can debate that, but the purpose of what I'm doing is to find the projects to worry about, so something can be done. By all means, help me get some of the items off the list. But you probably won't find anybody who has gone to more effort than I have, unless it's Roy. And Roy is interested. And I've heard from fewer critics than people who are interested as well.

If you are a fan, user or developer of a vital project like Perl, I am certainly interested in (and continually looking for) criteria that would allow it to be taken off this list. The whole point of putting it on the list is the hope of getting it back off the list again.

And if you missed them, here are some disclaimers that I've already made:

"I don’t always trust Debian dependencies, but they’re certainly illustrative" (part 1)

"If there are obvious mistakes or less obvious misconceptions I’m presenting when I talk about some of the details, I hope you’ll mention it in the comments. I’m sure there will be a few differences of opinion as well." (part 1)

"Python is worth watching for, but only proves to be a GitHub hostage sometimes." (part 1)

"This isn’t just about where the code is, but where the development takes place and who controls access." (part 2)

"This isn’t to admonish the author for not following a rule that doesn’t exist, but to highlight the more-than-hypothetical threat that the GNU project faces" (part 2)

At each new chapter of this research, I tend not to rely exclusively on previous research. Certainly I learn more as I do this, but I lean towards using each new focus as an opportunity to redundantly check things I've checked already -- that's how I discovered that Slitaz had moved. So at each pivot, I often get fresh data to confirm or update previous data.

Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!

But do, if you're interested, please help get these projects off this endangered species list. I'm generally quite interested in the evidence that other people can bring to the table, a good deal of the feedback received has proven useful. I still consider Perl to be endangered, and I think the criteria are relevant.

Long live rms, and happy hacking.

Licence: Creative Commons CC0 1.0 (public domain)
