Bonum Certa Men Certa

Microsoft GNU-Hub (Part 3: Methodology)

Article by figosdev

GNUHub

Summary: "Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!"

IN part 1 and part 2 we showed the dangers of outsourcing development to a Microsoft platform. Today, figosdev explains the methods. There will be at least a couple more parts after this one. This is a very important if not critical subject and it is completely overlooked by the media, which is too busy promoting GitHub as if it's a champion of "Open Source" (Wired has just done that again) when it is in fact proprietary software controlled by a company that attacks Software Freedom in a lot of ways. Another new article perpetuates the myth that Microsoft contributes the most to "Open Source", based on Microsoft's very own site, which is proprietary software by the way. Facts don't seem to matter these days (the leading "Linux" story this week is about a Microsoft bounty, associating "Linux" with security issues; was a $100,000 bounty granted to the press to amplify Microsoft in the context of "Linux"?).



Without further ado, figosdev:




I didn't start the DeleteGithub meme, though when I found out about the upcoming purchase I quickly worked to migrate -- within about a week.

I had already been locked out of my first GitHub account (which an associate had created for me, and I lost the password to both the account and the associated email) and created a new GitHub repo, which I deleted after migrating.

Obarun also worked very quickly to migrate from GitHub, which I know because I was following the distro at the time.

"When this started, I was hoping to avoid visiting GitHub as much as possible, getting information from other sources. I now use various sources, including GitHub itself."After the first few weeks of Microsoft owning GitHub, I didn't pay very much attention to it at all. #DeleteGithub was created without me.

All the while, I've expected more comments like this one, which Part 2 received:

"Perhaps you should actually research where projects host their code a bit more and you would find that most of them only use github for a mirror.

As a case in point check this link and you will see that the canonical source for perl is not github."

First of all, this comment is ABSOLUTELY wrong. Most of them do NOT only use GitHub for a mirror. I know, because I check that quite often. In fact most of this article was already written the other day to assist a colleague, and I've addressed the issue of mirrors several times already. But I suppose now it will be better to share that information here.

A Free software license means anybody can create a mirror; that also means that the existence of a mirror means nothing whatsoever. It was suggested once that I consider making mirrors their own category -- I do my utmost to ignore them entirely, because they're a meaningless metric that would only be a waste of time to count. They signify nothing; maybe you could argue they imply that someone cares enough about the project to mirror it. That's still not very interesting to me.

"While more people read articles, I was less interested in reporting and (am still) more interested in doing the research, so other people have a good idea of how much of this stuff is being developed and gradually coming to rely on GitHub."When this started, I was looking to find alternatives to certain applications that remained on GitHub, as a sort of boycott / protest / public awareness campaign. After about 20 minutes of looking stuff up, I was shocked at some of the examples I found. I figured that some new applications here and there were on GitHub, no big deal. We've got lots of alternatives.

The first thing I was wrong about (GLADLY) was OpenBSD. When I started, this was a more casual endeavour. It was just a sort of poke around out of curiosity. Wikipedia was a primary source of information, and having done more to verify and audit listings since then, I can say that it's pretty good in practice, even as more than a starting point. But I wasn't paying enough attention to mirrors at first.

If you go to the GitHub page for OpenBSD it says right at the top:

"Public git conversion mirror of OpenBSD's official cvs www repository"

That's plenty clear. When this started, I was hoping to avoid visiting GitHub as much as possible, getting information from other sources. I now use various sources, including GitHub itself.

I thought the first effort was best suited to a wiki, vs. an article. While more people read articles, I was less interested in reporting and (am still) more interested in doing the research, so other people have a good idea of how much of this stuff is being developed and gradually coming to rely on GitHub.

The takeover of Free software has indeed been very slow, planned in the late 90s as a defense of monopolies, brought into public awareness by Eric S. Raymond, and for some time the plans were hosted on the Open Source Initiative website. The GNU website still mirrors these plans, and GitHub looks a lot like them:

"The takeover of Free software has indeed been very slow, planned in the late 90s as a defense of monopolies, brought into public awareness by Eric S. Raymond, and for some time the plans were hosted on the Open Source Initiative website.""One 'blue sky' avenue that should be investigated is if there is any way to turn Linux into an opportunity for Microsoft." https://www.gnu.org/software/fsfe/projects/ms-vs-eu/halloween2.html

This quote is more than 20 years old, and there are countless others from the same group of documents that read a lot like "old news" from 2010 to the present. So my first concern is that GitHub not be used to help Microsoft take "ownership" (aka control of) Free software. That control can be taken gradually with or without non-GitHub mirrors, though non-GitHub mirrors would probably help.

For the person says "Perhaps you should actually research where projects host their code a bit more", here is a glimpse into what I actually do, written just 2 days ago and not planned for the article:

"It starts simple -- because at first this was all I did. I don't do every part of this process for every single package. I sort of follow my nose, going deeper when I feel it's important. The end result is that nothing is undeniable proof, but across the board you get a good fine-resolution picture with some occasional faulty pixels. I like to think this process would at least be more accurate than turning machine learning on it."

"So first, I go to the project's Wikipedia page. If it doesn't have one, it's probably not important enough to be on delete_github in the first place. But if Wikipedia lists GitHub as the repo, it's probably on GitHub."

"I went further and started counting all the apps in F-droid. About 4 out of 5 (out of several thousand) are based on GitHub."This was how it started, and for certain I could have paid more attention to details, but I was only dipping my toes in. I am very happy that I was wrong about OpenBSD, and eventually did an audit of the hundred or so projects I'd listed, finding that at least FOUR of those hundred or so were actually mirrors. I happily listed the items that were cleared: FFmpeg, OpenBSD, QEMU, Kali (GNU/) Linux

Apache Server was cleared much later. ASF was infiltrated by Microsoft, and they are using GitHub for some things, but Apache Server was taken off the list because it is in fact a mirror. Slitaz was also taken off the list -- not because it was an error, but because they moved off GitHub again.

I went further and started counting all the apps in F-droid. About 4 out of 5 (out of several thousand) are based on GitHub. Did I check the thousands of apps? Of course not. For a ballpark figure, I checked how many listed GitHub as the source or website for the app. If they say it's GitHub, that's good enough for a ballpark figure. But I didn't stop there.

I used GitHub data to figure out which of those thousands of apps were the most popular, and then went through that list and hand-picked the most familiar apps. I don't use Android much, though I've always used F-droid when I need Android apps. I haven't done much to verify the F-droid list, and nobody has complained.

Tom pointed me to a couple of LISP library collections, and just to find out if the 70-80% figure came up again, I checked all of those -- and came up with a similar ratio. I'm happy to say that since then, this 4:5 ratio doesn't come up all the time.

"Even if it is partly developed on GitHub, I consider this a problem. It means something is happening that Microsoft controls, that Microsoft has them hostage with."I keep looking for large data sets to play with, having the data for all packages in the Tiny Core repo (I've also figured out how to recursively parse dependencies in the package information for Debian and Trisquel) but these are not sets that I'm always checking and verifying by hand. The purpose is to figure out what to look at next, and that's how I found myself examining the GNU project itself -- carefully.

Most, not all of the things I've checked on, have an initial phase and sometimes a deeper check involved. I've tended to document what I know, and it's usually pretty obvious from context (from the description) when the level of detail is superficial. A statistic without examples? It's just a cursory check. Examples and explanations? I spent more time. As I said 2 days ago:

This has yielded very few false positives, but Apache httpd is still a mirror. Mirrors don't count, not to me at least, because per the license, Microsoft is allowed to mirror every Free software package that exists... A GitHub mirror really means that the "real repo" is somewhere else, that's what we want. We are looking at people using GitHub for development:

* Bug tracking * Pull requests (even worse than bug tracking) * "Official repo" (not mirror)

In order from least to worst, those are the things we are looking for. Even if it is partly developed on GitHub, I consider this a problem. It means something is happening that Microsoft controls, that Microsoft has them hostage with. [People actually say things like] "We can't lose our bug tracker! There's so much valuable data there!" Michele says Git is distributed so it's a non-issue. "But Gitea is migrating anyway" that's good news then. But I think [that's been said] for a year or two, so take it with a grain of salt until there's more evidence.

"For Tiny Core, as I mentioned previously, I downloaded not only every package (looking for package data which isn't in the package) but every .info file, when I figured out that's where packages dependencies were listed."Anyway, the most important criteria is:

* Wikipedia or better yet the project's own homepage links to GitHub * It's not a mirror

At some point I started checking project websites too. Seems obvious, but you have to realise that when this thing started I was being very casual. Just poking around, not being serious about it.

For Tiny Core, as I mentioned previously, I downloaded not only every package (looking for package data which isn't in the package) but every .info file, when I figured out that's where packages dependencies were listed.

Dependency lists, whether we are talking about Tiny Core or Debian, typically only cover immediate deps -- not deps of deps or deps of deps of deps. So I literally write a recursive routine to turn the dep data for Tiny Core into a FULL list of packages that require each item. It takes 45 minutes or more (to write and work the bugs out) though after it's done you kick yourself for the time wasted doing it manually. I probably spent 2 hours trying to do a fraction of the work that way.

"Behold, every official GNU project. Once again, I started doing this manually. Spent at least 10 hours doing that, got 1/4 the way up the list, from Xnee up to Metahtml. The article itself took 45 minutes or so."Now I can say "which packages pull in glib2" and get a full list.

I can do that FOR EACH PACKAGE then run wc -l on each list, getting a count of how many packages need each thing -- like how many need glib, how many need libffi, etc., run the list through sort -n and you know which deps are the most needed. Libffi is right at the top. In fact Roy tells me people were complaining about libffi a day or two prior to this discovery, but I pulled the fact right from the data I had cached. So now it's telling me things other people won't -- aka verifying things other people know.

All well and good, so I know the usual suspects when I download Trisquel. But I get bored with Trisquel (fig spent about a week processing all the source code so its easier to search) and started looking at GNU.

I start here http://savannah.gnu.org/search/?Search=Search&words=*&type_of_search=soft&exact=0&max_rows=500&type=1#options

"It's a mix of manually checking websites, manually checking Wikipedia, manually and programmatically checking package data and even setting up a dedicated machine to spend several days processing the tens of gigs of source code to Trisquel."Behold, every official GNU project. Once again, I started doing this manually. Spent at least 10 hours doing that, got 1/4 the way up the list, from Xnee up to Metahtml. The article itself took 45 minutes or so.

For the GNU stuff, I finally had the "server" search all the code for things like bffi, perl, .pl, .py, ython, ithub, png, flex. This is all [output to] a single text file, and it shows the path/name of the file (project name/path/actual file/ line of text found) so if I grep this file for example, "gperf" i get stuff like this:

89/gcc-9.3.0/libsanitizer/sanitizer_common/sanitizer_procmaps_mac.cc:// Google Perftools, https://github.com/gperftools/gperftools. 108/global-6.6.4/reconf.sh:prog='autoconf automake bison flex gperf libtool m4 perl'^I# required programs 123/gperf-3.1/ChangeLog: when the -n option is used. Previously, it didn't 123/gperf-3.1/ChangeLog: I'm too busy to fix it , right now. The problem 123/gperf-3.1/ChangeLog: they weren't being entered into the hash table . 123/gperf-3.1/ChangeLog: * Added the -D option that handles keyword sets that 123/gperf-3.1/ChangeLog: * Modified Key_List::print_hash_function so that it

"I told bash to make a numbered folder for each project, that way I can cd 123/ TAB TAB instead of spelling out the project folder name and so I can iterate/grep using seq instead of folder names. That's a convenience, so if I'm babbling its not important."

"I hand-checked each one more than once for mirrors, and there is no way to do 100% of this programmatically. Not every project follows the same rules.""I look for png files, perl, python, libffi, glib. I write it down. Then I tally it up later."

"That's how I do it. All in all, there's logic but I try to do whats logical / convenient / efficient. And other than that I just play it by ear."

TL,DR; It's a mix of manually checking websites, manually checking Wikipedia, manually and programmatically checking package data and even setting up a dedicated machine to spend several days processing the tens of gigs of source code to Trisquel.

Sometimes I get into the includes in the C code and note that include libpng is in an #ifdef. I'm not really a C coder but at least I get the concept of "you can configure this to compile with optional dependencies."

But even when I wasn't trying, the first hundred or so entries were mostly accurate. I hand-checked each one more than once for mirrors, and there is no way to do 100% of this programmatically. Not every project follows the same rules.

And the data changes, too. But someone had asked me about Perl specifically: "I took a quick look at the perl site. They do not cite GitHub..."

"They're developing Perl 6 on GitHub, so it fits the methodology."That's not entirely true. So I told them what I knew so far:

"Some [of the most important] things I check over and over and over. Perl is one. Let's do it again, it's useful exercise..."

https://en.wikipedia.org/wiki/Perl nothing.

Let's take a detour towards "Perl 6": https://en.wikipedia.org/wiki/Raku_(programming_language)

https://raku.org/

"Language Design"

"Either way, Perl 6 is absolutely, for the intent and purpose of this study, being developed on GitHub.""Specification - Official Raku language specification test suite" https://github.com/Raku/roast

[Roy is likely to turn these into links. It's very possible the link he creates will not match the exact link from the website -- the url will be the same, the text will be the same, but the actual overlap of text and underlined link may vary. In the original writing, I kept the text and urls separate.]

* Issues 78 * Pull requests 25

"Raku is GitHub."

See what I did there? They're developing Perl 6 on GitHub, so it fits the methodology.

You may not agree with this methodology, for various reasons. You might have a good reason -- or you could be a shill or marketing person. You could just be an ordinary fanboy. BUT, you might also have a point! Either way, Perl 6 is absolutely, for the intent and purpose of this study, being developed on GitHub.

By all means, I expect other people to make a case for/against considering it "captured or controlled by Microsoft." My concern is Early Warning. You may have a "better" criteria to offer that you consider more useful.

"Puppy Linux, to cite a different example, is developed all over the place."The whole idea of this project is to get the conversation going. But I've gone to great lengths to provide useful data overall, and I do try harder when there is a key project like Perl involved. It continues:

"Back to perl.org:"

"Contribute" https://www.perl.org/contribute.html

"Contribute to Perl Core" http://dev.perl.org/perl5/

"Perl" (these things are quoted as headings to look for, if they're listed more than once there's a reason)

"Perl" "Production-ready, under active development"

This part says:

"Some people have expressed an interest in getting Puppy away from GitHub.""Perl 5.30.2 is the current stable version of Perl. Perl is actively maintained and developed (git repository) by a large group of dedicated volunteers."

That links to this url:

https://github.com/Perl/perl5

So we go to that url, and it says:

* Issues 1,865 * Pull requests 33

Whatever the "Canonical url" is, what matters to me and what I'm actually looking for is that

* The official website tells people to go to GitHub. * The GitHub repo is being used for Issues and Pull requests

If there aren't Issues and Pull requests, I make the decision based on other factors.

If the README.md says "please don't use this for issues or pull requests" then that certainly counts for something.

Puppy Linux, to cite a different example, is developed all over the place. Packages are strewn across countless hobbyists websites. There are too many derivatives to even count -- literally hundreds of fan-based ISOs exist online that people have made over the years. Several active derivatives exist -- Many are based on Woof-CE, which is developed on GitHub. Some are not.

"Folks, canonical urls don't tell the whole story. Whether Microsoft has you in its clutches ultimately comes down to details and down to the reality of the project."Some people have expressed an interest in getting Puppy away from GitHub. Puppy, DSL (a predecessor to Tiny Core) and Xubuntu were the three distros that helped me finally delete my last copy of Windows more than 10 years ago, and I learned as much about GNU/Linux from Puppy as the others.

But the Woof team has made it clear they are unlikely to migrate. The work they've done that keeps them "locked in" to GitHub isn't as easy to "fork" as the codebase, and this is by design.

Folks, canonical urls don't tell the whole story. Whether Microsoft has you in its clutches ultimately comes down to details and down to the reality of the project.

We can debate that, but the purpose of what I'm doing is to find the projects to worry about, so something can be done. By all means, help me get some of the items off the list. But you probably won't find anybody who has gone to more effort than I have, unless it's Roy. And Roy is interested. And I've heard from fewer critics than people who are interested as well.

"Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!"If you have are a fan, user or developer of a vital project like Perl, I am certainly interested in (and continually looking for) criteria that would allow it to be taken off this list. The whole point of putting it on the list is the hope of getting it back off the list again.

And if you missed them, here are some disclaimers that I've already made:

"I don’t always trust Debian dependencies, but they’re certainly illustrative" (part 1)

"If there are obvious mistakes or less obvious misconceptions I’m presenting when I talk about some of the details, I hope you’ll mention it in the comments. I’m sure there will be a few differences of opinion as well." (part 1)

"Python is worth watching for, but only proves to be a GitHub hostage sometimes." (part 1)

"This isn’t just about where the code is, but where the development takes place and who controls access." (part 2)

"This isn’t to admonish the author for not following a rule that doesn’t exist, but to highlight the more-than-hypothetical threat that the GNU project faces" (part 2)

"But do, if you're interested, please help get these projects off this endangered species list."At each new chapter of this research, I tend not to rely exclusively on previous research. Certainly I learn more as I do this, but I lean towards using each new focus as an opportunity to redundantly check things I've checked already -- that's how I discovered that Slitaz had moved. So at each pivot, I often get fresh data to confirm or update previous data.

Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!

But do, if you're interested, please help get these projects off this endangered species list. I'm generally quite interested in the evidence that other people can bring to the table, a good deal of the feedback received has proven useful. I still consider Perl to be endangered, and I think the criteria are relevant.

Long live rms, and happy hacking.

Licence: Creative Commons CC0 1.0 (public domain)

Comments

Recent Techrights' Posts

Bing Might Shut Down - Just Like Skype Did - Some Time in the Coming Months/Years (Parts of It Already Shut Down)
they try to bring the losses under control
Microsoft Rumours: This Week's Scale of Layoffs "Higher Than Reported" and More Coming Soon ("A Lot More Severe" Than May's)
The "3%" figure is false
Slopwatch: Sloppy Brian, Brittany Slop, and General Observations
Creative people don't need slop; there's just nothing good about it, slop appeals to lazy people careless about quality
 
Microsoft WARN Notices Proliferate in the United States
From what we've seen, this wave was more than 3% (a lot more) and the next wave/s will be even bigger (possible as imminent as weeks from now), based on insider leaks
Links 15/05/2025: Google Betrays Publishers Again, Openwashing by Sysdig
Links for the day
Richard Stallman Still Respected by Many in the Libre Graphics Community
Richard Stallman and Professor Moglen never harmed anyone
If You Read Techrights, Then You Probably Want to Read Tux Machines as Well
That site is more active than this one
Gemini Links 15/05/2025: Forced Music in Publicly Accessible Space and ~silv is Online
Links for the day
Links 15/05/2025: KOSA Censorship (USA Becomes More Like KSA) and More National Cuts
Links for the day
Your Real Ally Would Not Defend the Company of SLAPP and Strangling of Women
who's left to tell us what's true?
Breakdown of Microsoft Layoffs Shows It's About Cost, Not Performance or Hype (Like "AI")
MSN (Microsoft) reposted this with some unnecessary spin
The Lawyers Working for the Serial Strangler From Microsoft on SLAPPing Techrights Have Apparently Lost Their Voice
the moment we mentioned that their media lawyer is leaving they went all quiet in social control media
At IBM, Relocation Can be a Trick or a Trap (IBM Gets Rid of Staff Under the Guise of "Relo")
IBM is not being honest with employees
Over at Tux Machines...
GNU/Linux news for the past day
Beyond Mass Layoffs at Microsoft: Entire Units Shut Down for Good
And it's far from over
Links 15/05/2025: Crikvenica, Analog Computer, and Slop 'Hallucinations'
Links for the day
IRC Proceedings: Wednesday, May 14, 2025
IRC logs for Wednesday, May 14, 2025
Links 14/05/2025: Fentanylware (TikTok) Harms Kids, Russia Refuses to Defuse
Links for the day
Gemini Links 15/05/2025: Poseur Nerds and Mennonites
Links for the day
VS Code Is Not FOSS, And Neither Is the Site "It's FOSS"
VS Code is proprietary spyware of Microsoft, yet this site keeps promoting it like it's FOSS
No, Microsoft Didn't Lay Off So Many People Because of "AI" "Innovation" or "Efficiency" or "Era" or "Revolution" Etc.
Debunking one very common lie
What We Do When We Say "GNU/Linux" to People
It talks about "Linux", "GNU", and what it means to say "GNU/Linux"
Links 14/05/2025: Facebook And Instagram Risk Nationwide Bans, Microsoft Subsidiaries Have Mass Layoffs Too
Links for the day
Canonical Will Give You Money Only If You Work for Microsoft!
Only if you are servicing (being a slave to) proprietary forges that Microsoft and the NSA control while violating the GPL will Canonical give you money
If Microsoft Staff That Strangles Woman Pays You to Write Lies, It Will Not End Well
The past couple of years were our most productive ever
Gemini Links 14/05/2025: "Writing My Story with Inspiration from Notable Lives" and People Start Shovelling Up LLM Slop Onto Geminispace,
Links for the day
Microsoft is Very Highly Stressed About Adoption of GNU/Linux at Windows' Expense (on Former "Vista 10" PCs)
What does this tell us?
Slopwatch: BetaNoise (BetaNews), LinuxSecurity, and Slopfarms Still Promoted by Google News
The primary goal is to demonstrate the problem persists
Links 14/05/2025: Google Agrees to $1.3 Billion Settlement After Spying, China Tariffs Don't Work
Links for the day
There Are Also Loads of Microsoft LinkedIn Layoffs Today (Keep Track of the Subsidiaries They Keep Out of Headlines)
Perhaps lost in the smokescreen
There Are Bigger Rounds of Microsoft Layoffs Coming, a Cull of 10% Implemented in Waves (the "3%" Figure is Misleading, Face-Saving)
Last night we said they might do the layoffs in three or at least two waves
Over at Tux Machines...
GNU/Linux news for the past day
IRC Proceedings: Tuesday, May 13, 2025
IRC logs for Tuesday, May 13, 2025
Gemini Links 13/05/2025: Apocalyptic Future and More
Links for the day
Unless a Third of All Microsoft Layoffs Worldwide Are in Redmond (Washington) Alone, Microsoft Has Just Lied to Everyone Via Jordan Novet in CNBC (i.e. the Usual Any Time There's Mass Layoffs and Novet Weighs in With False Numbers)
Maybe when Microsoft said 3% it meant ~6,000 or more in the US alone
McKinsey (McK) is Killing IBM, It's All About Killing This Goose, "National Sales Team 80% on PIP Now" (Preceding Layoffs Without Severance)
PIPs are not based on performance
Links 13/05/2025: Microsoft Breaks Windows Very Badly Again, Mass Layoffs Reported (But False Figures, It's a Lot Higher)
Links for the day
As Expected, Microsoft Uses Media Operative (Jordan Novet) to Downplay the Scale of Mass Layoffs
here we go
2025 Will be a Big Year For GNU/Linux on Desktops/Laptops
with an economy like this, people who don't live in rich countries won't turn to Apple
Signs of Trouble: Microsoft Job Openings for Jobs That Do Not Exist!
Keeping up appearances?
"Special Place in Hell" for Women Who Help Violent Microsofters From Another Continent Attack Local Women Who Did Nothing Wrong, They Just Got Bullied and Deserve Sympathy or Compensation
Nothing says "Brat" like men who attack women, right?
The Numbers Game: 50,000-60,000 Microsoft Workers Laid Off in 2.5 Years? And Debt Still Tripled Under Nadella.
under Nadella Microsoft's debt trebled
The Slow Death of Windows Will Mean the Inevitable Demise of Microsoft
Once people stop using Windows, it'll be hard for Microsoft to sell anything to them
Last Week's Public Talk by Richard Stallman Well Attended and Covered in Technical News Sites
and we're looking at about 60,000 Microsoft layoffs in 3 years
Gemini Links 13/05/2025: Shopping is an Exasperating Nightmare and Making Phones Minimal
Links for the day
23,000 More Microsoft Layoffs by the End of June If the Estimates Are Correct (In Addition to About 6,000 Layoffs So Far This Year)
There's no questions about many layoffs happening this month. It got leaked already. The only question is when (and also how many).
Over at Tux Machines...
GNU/Linux news for the past day
IRC Proceedings: Monday, May 12, 2025
IRC logs for Monday, May 12, 2025