Bonum Certa Men Certa

Microsoft GNU-Hub (Part 3: Methodology)

Article by figosdev

GNUHub

Summary: "Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!"

IN part 1 and part 2 we showed the dangers of outsourcing development to a Microsoft platform. Today, figosdev explains his methods. There will be at least a couple more parts after this one. This is a very important, if not critical, subject, and it is completely overlooked by the media, which is too busy promoting GitHub as if it were a champion of "Open Source" (Wired has just done that again) when it is in fact proprietary software controlled by a company that attacks Software Freedom in many ways. Another new article perpetuates the myth that Microsoft contributes the most to "Open Source", based on Microsoft's very own site -- which is proprietary software, by the way. Facts don't seem to matter these days (the leading "Linux" story this week is about a Microsoft bounty, associating "Linux" with security issues; was a $100,000 bounty granted to the press to amplify Microsoft in the context of "Linux"?).



Without further ado, figosdev:




I didn't start the DeleteGithub meme, though when I found out about the upcoming purchase I quickly worked to migrate -- within about a week.

I had already been locked out of my first GitHub account (which an associate had created for me, and I lost the password to both the account and the associated email) and created a new GitHub repo, which I deleted after migrating.

Obarun also worked very quickly to migrate from GitHub, which I know because I was following the distro at the time.

"When this started, I was hoping to avoid visiting GitHub as much as possible, getting information from other sources. I now use various sources, including GitHub itself."After the first few weeks of Microsoft owning GitHub, I didn't pay very much attention to it at all. #DeleteGithub was created without me.

All the while, I've expected more comments like this one, which Part 2 received:

"Perhaps you should actually research where projects host their code a bit more and you would find that most of them only use github for a mirror.

As a case in point check this link and you will see that the canonical source for perl is not github."

First of all, this comment is ABSOLUTELY wrong. Most of them do NOT only use GitHub for a mirror. I know, because I check that quite often. In fact most of this article was already written the other day to assist a colleague, and I've addressed the issue of mirrors several times already. But I suppose now it will be better to share that information here.

A Free software license means anybody can create a mirror; that also means that the existence of a mirror means nothing whatsoever. It was suggested once that I consider making mirrors their own category -- I do my utmost to ignore them entirely, because they're a meaningless metric that would only be a waste of time to count. They signify nothing; maybe you could argue they imply that someone cares enough about the project to mirror it. That's still not very interesting to me.

"While more people read articles, I was less interested in reporting and (am still) more interested in doing the research, so other people have a good idea of how much of this stuff is being developed and gradually coming to rely on GitHub."When this started, I was looking to find alternatives to certain applications that remained on GitHub, as a sort of boycott / protest / public awareness campaign. After about 20 minutes of looking stuff up, I was shocked at some of the examples I found. I figured that some new applications here and there were on GitHub, no big deal. We've got lots of alternatives.

The first thing I was wrong about (GLADLY) was OpenBSD. When I started, this was a more casual endeavour. It was just a sort of poke around out of curiosity. Wikipedia was a primary source of information, and having done more to verify and audit listings since then, I can say that it's pretty good in practice, even as more than a starting point. But I wasn't paying enough attention to mirrors at first.

If you go to the GitHub page for OpenBSD, it says right at the top:

"Public git conversion mirror of OpenBSD's official cvs www repository"

That's plenty clear. When this started, I was hoping to avoid visiting GitHub as much as possible, getting information from other sources. I now use various sources, including GitHub itself.

I thought the first effort was best suited to a wiki, vs. an article. While more people read articles, I was less interested in reporting and (am still) more interested in doing the research, so other people have a good idea of how much of this stuff is being developed and gradually coming to rely on GitHub.

The takeover of Free software has indeed been very slow, planned in the late 90s as a defense of monopolies, brought into public awareness by Eric S. Raymond, and for some time the plans were hosted on the Open Source Initiative website. The GNU website still mirrors these plans, and GitHub looks a lot like them:

"The takeover of Free software has indeed been very slow, planned in the late 90s as a defense of monopolies, brought into public awareness by Eric S. Raymond, and for some time the plans were hosted on the Open Source Initiative website.""One 'blue sky' avenue that should be investigated is if there is any way to turn Linux into an opportunity for Microsoft." https://www.gnu.org/software/fsfe/projects/ms-vs-eu/halloween2.html

This quote is more than 20 years old, and there are countless others from the same group of documents that read a lot like "old news" from 2010 to the present. So my first concern is that GitHub not be used to help Microsoft take "ownership" (aka control of) Free software. That control can be taken gradually with or without non-GitHub mirrors, though non-GitHub mirrors would probably help.

For the person who says "Perhaps you should actually research where projects host their code a bit more", here is a glimpse into what I actually do, written just 2 days ago and not planned for the article:

"It starts simple -- because at first this was all I did. I don't do every part of this process for every single package. I sort of follow my nose, going deeper when I feel it's important. The end result is that nothing is undeniable proof, but across the board you get a good fine-resolution picture with some occasional faulty pixels. I like to think this process would at least be more accurate than turning machine learning on it."

"So first, I go to the project's Wikipedia page. If it doesn't have one, it's probably not important enough to be on delete_github in the first place. But if Wikipedia lists GitHub as the repo, it's probably on GitHub."

"I went further and started counting all the apps in F-droid. About 4 out of 5 (out of several thousand) are based on GitHub."This was how it started, and for certain I could have paid more attention to details, but I was only dipping my toes in. I am very happy that I was wrong about OpenBSD, and eventually did an audit of the hundred or so projects I'd listed, finding that at least FOUR of those hundred or so were actually mirrors. I happily listed the items that were cleared: FFmpeg, OpenBSD, QEMU, Kali (GNU/) Linux

Apache Server was cleared much later. ASF was infiltrated by Microsoft, and they are using GitHub for some things, but Apache Server was taken off the list because it is in fact a mirror. Slitaz was also taken off the list -- not because it was an error, but because they moved off GitHub again.

I went further and started counting all the apps in F-droid. About 4 out of 5 (out of several thousand) are based on GitHub. Did I check the thousands of apps? Of course not. For a ballpark figure, I checked how many listed GitHub as the source or website for the app. If they say it's GitHub, that's good enough for a ballpark figure. But I didn't stop there.
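A minimal sketch of that kind of ballpark count, in Python, assuming F-droid's index-v1.json format with per-app "sourceCode" and "webSite" fields (the URL and the field names here are assumptions for illustration, not part of the original tally):

# Ballpark count: how many F-droid apps list GitHub as their source or website?
# Assumes the index-v1.json layout (an "apps" list with "sourceCode"/"webSite" keys);
# adjust the field names if the index format differs.
import json
import urllib.request

INDEX_URL = "https://f-droid.org/repo/index-v1.json"   # assumed location of the index

with urllib.request.urlopen(INDEX_URL) as response:
    index = json.load(response)

apps = index.get("apps", [])
on_github = [
    app for app in apps
    if "github.com" in (app.get("sourceCode", "") + " " + app.get("webSite", "")).lower()
]

print(len(on_github), "of", len(apps), "apps point at GitHub")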

I used GitHub data to figure out which of those thousands of apps were the most popular, and then went through that list and hand-picked the most familiar apps. I don't use Android much, though I've always used F-droid when I need Android apps. I haven't done much to verify the F-droid list, and nobody has complained.

Tom pointed me to a couple of LISP library collections, and just to find out if the 70-80% figure came up again, I checked all of those -- and came up with a similar ratio. I'm happy to say that since then, this 4:5 ratio doesn't come up all the time.

"Even if it is partly developed on GitHub, I consider this a problem. It means something is happening that Microsoft controls, that Microsoft has them hostage with."I keep looking for large data sets to play with, having the data for all packages in the Tiny Core repo (I've also figured out how to recursively parse dependencies in the package information for Debian and Trisquel) but these are not sets that I'm always checking and verifying by hand. The purpose is to figure out what to look at next, and that's how I found myself examining the GNU project itself -- carefully.

Most, though not all, of the things I've checked have an initial phase and sometimes a deeper check involved. I've tended to document what I know, and it's usually pretty obvious from context (from the description) when the level of detail is superficial. A statistic without examples? It's just a cursory check. Examples and explanations? I spent more time. As I said 2 days ago:

This has yielded very few false positives, but Apache httpd is still a mirror. Mirrors don't count, not to me at least, because per the license, Microsoft is allowed to mirror every Free software package that exists... A GitHub mirror really means that the "real repo" is somewhere else, and that's what we want. We are looking at people using GitHub for development:

* Bug tracking
* Pull requests (even worse than bug tracking)
* "Official repo" (not mirror)

In order from least bad to worst, those are the things we are looking for. Even if it is partly developed on GitHub, I consider this a problem. It means something is happening that Microsoft controls, that Microsoft has them hostage with. [People actually say things like] "We can't lose our bug tracker! There's so much valuable data there!" Michele says Git is distributed so it's a non-issue. "But Gitea is migrating anyway" -- that's good news, then. But I think [that's been said] for a year or two, so take it with a grain of salt until there's more evidence.

"For Tiny Core, as I mentioned previously, I downloaded not only every package (looking for package data which isn't in the package) but every .info file, when I figured out that's where packages dependencies were listed."Anyway, the most important criteria is:

* Wikipedia or better yet the project's own homepage links to GitHub
* It's not a mirror

At some point I started checking project websites too. Seems obvious, but you have to realise that when this thing started I was being very casual. Just poking around, not being serious about it.
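Here is a rough sketch of that kind of website check, in Python: fetch a project's homepage, pull out any github.com repository links, and see whether the GitHub page mentions being a mirror. The "mirror" keyword test is only a heuristic I'm using for illustration; it is not proof either way, and borderline cases still get checked by hand:

# Rough website check: does the project's homepage point people at GitHub,
# and does the GitHub page describe itself as a mirror?
import re
import urllib.request

def fetch(url):
    request = urllib.request.Request(url, headers={"User-Agent": "delete-github-audit"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")

def github_repos(homepage_url):
    # collect unique github.com/owner/project links found in the page
    return sorted(set(re.findall(r"https://github\.com/[\w.-]+/[\w.-]+", fetch(homepage_url))))

def mentions_mirror(repo_url):
    return "mirror" in fetch(repo_url).lower()

for repo in github_repos("https://www.perl.org/"):
    print(repo, "(mentions 'mirror')" if mentions_mirror(repo) else "(no mention of 'mirror')")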

For Tiny Core, as I mentioned previously, I downloaded not only every package (looking for package data which isn't in the package) but every .info file, when I figured out that's where package dependencies were listed.

Dependency lists, whether we are talking about Tiny Core or Debian, typically only cover immediate deps -- not deps of deps or deps of deps of deps. So I literally wrote a recursive routine to turn the dep data for Tiny Core into a FULL list of packages that require each item. It takes 45 minutes or more (to write and work the bugs out), though after it's done you kick yourself for the time wasted doing it manually. I probably spent 2 hours trying to do a fraction of the work that way.
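The recursive routine itself is short; here is a minimal sketch in Python, assuming the immediate dependencies have already been pulled out of the .info files into a plain dictionary (the package names below are made up for illustration):

# Turn immediate dependency lists into FULL (transitive) dependency lists.
# direct_deps would be built by parsing the per-package metadata files;
# the entries below are only an example.
def full_deps(package, direct_deps, seen=None):
    if seen is None:
        seen = set()
    for dep in direct_deps.get(package, []):
        if dep not in seen:            # guards against cycles and duplicates
            seen.add(dep)
            full_deps(dep, direct_deps, seen)
    return seen

direct_deps = {
    "gimp2.tcz": ["gtk2.tcz", "libpng.tcz"],
    "gtk2.tcz": ["glib2.tcz", "libffi.tcz"],
    "glib2.tcz": ["libffi.tcz"],
}

print(sorted(full_deps("gimp2.tcz", direct_deps)))
# ['glib2.tcz', 'gtk2.tcz', 'libffi.tcz', 'libpng.tcz']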

"Behold, every official GNU project. Once again, I started doing this manually. Spent at least 10 hours doing that, got 1/4 the way up the list, from Xnee up to Metahtml. The article itself took 45 minutes or so."Now I can say "which packages pull in glib2" and get a full list.

I can do that FOR EACH PACKAGE, then run wc -l on each list, getting a count of how many packages need each thing -- like how many need glib, how many need libffi, etc. Run the counts through sort -n and you know which deps are the most needed. Libffi is right at the top. In fact Roy tells me people were complaining about libffi a day or two prior to this discovery, but I pulled the fact right from the data I had cached. So now it's telling me things other people won't -- aka verifying things other people know.
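In Python, that counting-and-sorting step looks roughly like this, given each package's full dependency set (produced by a routine like the one above); it is only a sketch of the wc -l plus sort -n pass, with example data standing in for the real repo:

# Count how many packages need each dependency, most-needed first --
# the same idea as running wc -l on each per-dep list and piping through sort -n.
from collections import Counter

full_dependency_sets = {                     # example data; in practice, thousands of packages
    "gimp2.tcz": {"gtk2.tcz", "glib2.tcz", "libffi.tcz", "libpng.tcz"},
    "gtk2.tcz": {"glib2.tcz", "libffi.tcz"},
    "glib2.tcz": {"libffi.tcz"},
}

needed_by = Counter(dep for deps in full_dependency_sets.values() for dep in deps)
for dep, count in needed_by.most_common():
    print(count, dep)                        # libffi.tcz comes out on top, as with the real data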

All well and good, so I know the usual suspects when I download Trisquel. But I got bored with Trisquel (fig spent about a week processing all the source code so it's easier to search) and started looking at GNU.

I started here: http://savannah.gnu.org/search/?Search=Search&words=*&type_of_search=soft&exact=0&max_rows=500&type=1#options

"It's a mix of manually checking websites, manually checking Wikipedia, manually and programmatically checking package data and even setting up a dedicated machine to spend several days processing the tens of gigs of source code to Trisquel."Behold, every official GNU project. Once again, I started doing this manually. Spent at least 10 hours doing that, got 1/4 the way up the list, from Xnee up to Metahtml. The article itself took 45 minutes or so.

For the GNU stuff, I finally had the "server" search all the code for things like bffi, perl, .pl, .py, ython, ithub, png, flex. This is all [output to] a single text file, and it shows the path/name of the file (project name/path/actual file/line of text found), so if I grep this file for, say, "gperf", I get stuff like this:

89/gcc-9.3.0/libsanitizer/sanitizer_common/sanitizer_procmaps_mac.cc:// Google Perftools, https://github.com/gperftools/gperftools.
108/global-6.6.4/reconf.sh:prog='autoconf automake bison flex gperf libtool m4 perl'^I# required programs
123/gperf-3.1/ChangeLog: when the -n option is used. Previously, it didn't
123/gperf-3.1/ChangeLog: I'm too busy to fix it , right now. The problem
123/gperf-3.1/ChangeLog: they weren't being entered into the hash table .
123/gperf-3.1/ChangeLog: * Added the -D option that handles keyword sets that
123/gperf-3.1/ChangeLog: * Modified Key_List::print_hash_function so that it
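For anyone curious, the scan itself can be something as simple as the following Python sketch (the real run was on a dedicated machine over tens of gigabytes; the folder layout and the output format here merely mimic the description above):

# Crude source-tree scan: walk the numbered project folders, look for a few
# telltale substrings, and write every hit to one flat text file as
#   project/path/file: line of text
# so it can later be grepped per project or per token.
import os

TOKENS = ("bffi", "perl", ".pl", ".py", "ython", "ithub", "png", "flex")
ROOT = "source"          # e.g. source/89/gcc-9.3.0/..., source/123/gperf-3.1/...

with open("scan-results.txt", "w", errors="replace") as results:
    for dirpath, _dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="replace") as source_file:
                    for line in source_file:
                        if any(token in line for token in TOKENS):
                            results.write(os.path.relpath(path, ROOT) + ":" + line)
            except OSError:
                pass     # unreadable file; skip it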

"I told bash to make a numbered folder for each project, that way I can cd 123/ TAB TAB instead of spelling out the project folder name and so I can iterate/grep using seq instead of folder names. That's a convenience, so if I'm babbling its not important."

"I hand-checked each one more than once for mirrors, and there is no way to do 100% of this programmatically. Not every project follows the same rules.""I look for png files, perl, python, libffi, glib. I write it down. Then I tally it up later."

"That's how I do it. All in all, there's logic but I try to do whats logical / convenient / efficient. And other than that I just play it by ear."

TL;DR: It's a mix of manually checking websites, manually checking Wikipedia, manually and programmatically checking package data, and even setting up a dedicated machine to spend several days processing the tens of gigs of source code to Trisquel.

Sometimes I get into the includes in the C code and note that the libpng include is inside an #ifdef. I'm not really a C coder, but at least I get the concept of "you can configure this to compile with optional dependencies."

But even when I wasn't trying, the first hundred or so entries were mostly accurate. I hand-checked each one more than once for mirrors, and there is no way to do 100% of this programmatically. Not every project follows the same rules.

And the data changes, too. But someone had asked me about Perl specifically: "I took a quick look at the perl site. They do not cite GitHub..."

"They're developing Perl 6 on GitHub, so it fits the methodology."That's not entirely true. So I told them what I knew so far:

"Some [of the most important] things I check over and over and over. Perl is one. Let's do it again, it's useful exercise..."

https://en.wikipedia.org/wiki/Perl -- nothing.

Let's take a detour towards "Perl 6": https://en.wikipedia.org/wiki/Raku_(programming_language)

https://raku.org/

"Language Design"

"Either way, Perl 6 is absolutely, for the intent and purpose of this study, being developed on GitHub.""Specification - Official Raku language specification test suite" https://github.com/Raku/roast

[Roy is likely to turn these into links. It's very possible the link he creates will not match the exact link from the website -- the url will be the same, the text will be the same, but the actual overlap of text and underlined link may vary. In the original writing, I kept the text and urls separate.]

* Issues 78
* Pull requests 25

"Raku is GitHub."

See what I did there? They're developing Perl 6 on GitHub, so it fits the methodology.

You may not agree with this methodology, for various reasons. You might have a good reason -- or you could be a shill or marketing person. You could just be an ordinary fanboy. BUT, you might also have a point! Either way, Perl 6 is absolutely, for the intent and purpose of this study, being developed on GitHub.

By all means, I expect other people to make a case for/against considering it "captured or controlled by Microsoft." My concern is Early Warning. You may have "better" criteria to offer that you consider more useful.

"Puppy Linux, to cite a different example, is developed all over the place."The whole idea of this project is to get the conversation going. But I've gone to great lengths to provide useful data overall, and I do try harder when there is a key project like Perl involved. It continues:

"Back to perl.org:"

"Contribute" https://www.perl.org/contribute.html

"Contribute to Perl Core" http://dev.perl.org/perl5/

"Perl" (these things are quoted as headings to look for, if they're listed more than once there's a reason)

"Perl" "Production-ready, under active development"

This part says:

"Some people have expressed an interest in getting Puppy away from GitHub.""Perl 5.30.2 is the current stable version of Perl. Perl is actively maintained and developed (git repository) by a large group of dedicated volunteers."

That links to this url:

https://github.com/Perl/perl5

So we go to that url, and it says:

* Issues 1,865
* Pull requests 33

Whatever the "Canonical url" is, what matters to me and what I'm actually looking for is that

* The official website tells people to go to GitHub.
* The GitHub repo is being used for Issues and Pull requests.

If there aren't Issues and Pull requests, I make the decision based on other factors.

If the README.md says "please don't use this for issues or pull requests" then that certainly counts for something.

Puppy Linux, to cite a different example, is developed all over the place. Packages are strewn across countless hobbyists' websites. There are too many derivatives to even count -- literally hundreds of fan-based ISOs exist online that people have made over the years. Several active derivatives exist -- many are based on Woof-CE, which is developed on GitHub. Some are not.

"Folks, canonical urls don't tell the whole story. Whether Microsoft has you in its clutches ultimately comes down to details and down to the reality of the project."Some people have expressed an interest in getting Puppy away from GitHub. Puppy, DSL (a predecessor to Tiny Core) and Xubuntu were the three distros that helped me finally delete my last copy of Windows more than 10 years ago, and I learned as much about GNU/Linux from Puppy as the others.

But the Woof team has made it clear they are unlikely to migrate. The work they've done that keeps them "locked in" to GitHub isn't as easy to "fork" as the codebase, and this is by design.

Folks, canonical URLs don't tell the whole story. Whether Microsoft has you in its clutches ultimately comes down to details and down to the reality of the project.

We can debate that, but the purpose of what I'm doing is to find the projects to worry about, so something can be done. By all means, help me get some of the items off the list. But you probably won't find anybody who has gone to more effort than I have, unless it's Roy. And Roy is interested. And I've heard from more people who are interested than from critics.

"Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!"If you have are a fan, user or developer of a vital project like Perl, I am certainly interested in (and continually looking for) criteria that would allow it to be taken off this list. The whole point of putting it on the list is the hope of getting it back off the list again.

And if you missed them, here are some disclaimers that I've already made:

"I don’t always trust Debian dependencies, but they’re certainly illustrative" (part 1)

"If there are obvious mistakes or less obvious misconceptions I’m presenting when I talk about some of the details, I hope you’ll mention it in the comments. I’m sure there will be a few differences of opinion as well." (part 1)

"Python is worth watching for, but only proves to be a GitHub hostage sometimes." (part 1)

"This isn’t just about where the code is, but where the development takes place and who controls access." (part 2)

"This isn’t to admonish the author for not following a rule that doesn’t exist, but to highlight the more-than-hypothetical threat that the GNU project faces" (part 2)

"But do, if you're interested, please help get these projects off this endangered species list."At each new chapter of this research, I tend not to rely exclusively on previous research. Certainly I learn more as I do this, but I lean towards using each new focus as an opportunity to redundantly check things I've checked already -- that's how I discovered that Slitaz had moved. So at each pivot, I often get fresh data to confirm or update previous data.

Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!

But do, if you're interested, please help get these projects off this endangered species list. I'm generally quite interested in the evidence that other people can bring to the table; a good deal of the feedback received has proven useful. I still consider Perl to be endangered, and I think the criteria are relevant.

Long live rms, and happy hacking.

Licence: Creative Commons CC0 1.0 (public domain)
