Bonum Certa Men Certa

Microsoft GNU-Hub (Part 3: Methodology)

Article by figosdev

GNUHub

Summary: "Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!"

IN part 1 and part 2 we showed the dangers of outsourcing development to a Microsoft platform. Today, figosdev explains his methods. There will be at least a couple more parts after this one. This is a very important, if not critical, subject, and it is completely overlooked by the media, which is too busy promoting GitHub as if it were a champion of "Open Source" (Wired has just done that again) when it is in fact proprietary software controlled by a company that attacks Software Freedom in many ways. Another new article perpetuates the myth that Microsoft contributes the most to "Open Source", based on Microsoft's very own site -- which is proprietary software, by the way. Facts don't seem to matter these days (the leading "Linux" story this week is about a Microsoft bounty, associating "Linux" with security issues; was a $100,000 bounty granted to the press to amplify Microsoft in the context of "Linux"?).



Without further ado, figosdev:




I didn't start the DeleteGithub meme, though when I found out about the upcoming purchase I quickly worked to migrate -- within about a week.

I had already been locked out of my first GitHub account (which an associate had created for me, and I lost the password to both the account and the associated email) and created a new GitHub repo, which I deleted after migrating.

Obarun also worked very quickly to migrate from GitHub, which I know because I was following the distro at the time.

"When this started, I was hoping to avoid visiting GitHub as much as possible, getting information from other sources. I now use various sources, including GitHub itself."After the first few weeks of Microsoft owning GitHub, I didn't pay very much attention to it at all. #DeleteGithub was created without me.

All the while, I've expected more comments like this one, which Part 2 received:

"Perhaps you should actually research where projects host their code a bit more and you would find that most of them only use github for a mirror.

As a case in point check this link and you will see that the canonical source for perl is not github."

First of all, this comment is ABSOLUTELY wrong. Most of them do NOT only use GitHub for a mirror. I know, because I check that quite often. In fact most of this article was already written the other day to assist a colleague, and I've addressed the issue of mirrors several times already. But I suppose now it will be better to share that information here.

A Free software license means anybody can create a mirror; that also means that the existence of a mirror means nothing whatsoever. It was suggested once that I consider making mirrors their own category -- I do my utmost to ignore them entirely, because they're a meaningless metric that would only be a waste of time to count. They signify nothing; maybe you could argue they imply that someone cares enough about the project to mirror it. That's still not very interesting to me.

"While more people read articles, I was less interested in reporting and (am still) more interested in doing the research, so other people have a good idea of how much of this stuff is being developed and gradually coming to rely on GitHub."When this started, I was looking to find alternatives to certain applications that remained on GitHub, as a sort of boycott / protest / public awareness campaign. After about 20 minutes of looking stuff up, I was shocked at some of the examples I found. I figured that some new applications here and there were on GitHub, no big deal. We've got lots of alternatives.

The first thing I was wrong about (GLADLY) was OpenBSD. When I started, this was a more casual endeavour. It was just a sort of poke around out of curiosity. Wikipedia was a primary source of information, and having done more to verify and audit listings since then, I can say that it's pretty good in practice, even as more than a starting point. But I wasn't paying enough attention to mirrors at first.

If you go to the GitHub page for OpenBSD, it says right at the top:

"Public git conversion mirror of OpenBSD's official cvs www repository"

That's plenty clear. When this started, I was hoping to avoid visiting GitHub as much as possible, getting information from other sources. I now use various sources, including GitHub itself.

I thought the first effort was best suited to a wiki, vs. an article. While more people read articles, I was less interested in reporting and (am still) more interested in doing the research, so other people have a good idea of how much of this stuff is being developed and gradually coming to rely on GitHub.

The takeover of Free software has indeed been very slow, planned in the late 90s as a defense of monopolies, brought into public awareness by Eric S. Raymond, and for some time the plans were hosted on the Open Source Initiative website. The GNU website still mirrors these plans, and GitHub looks a lot like them:

"The takeover of Free software has indeed been very slow, planned in the late 90s as a defense of monopolies, brought into public awareness by Eric S. Raymond, and for some time the plans were hosted on the Open Source Initiative website.""One 'blue sky' avenue that should be investigated is if there is any way to turn Linux into an opportunity for Microsoft." https://www.gnu.org/software/fsfe/projects/ms-vs-eu/halloween2.html

This quote is more than 20 years old, and there are countless others from the same group of documents that read a lot like "old news" from 2010 to the present. So my first concern is that GitHub not be used to help Microsoft take "ownership" (aka control of) Free software. That control can be taken gradually with or without non-GitHub mirrors, though non-GitHub mirrors would probably help.

For the person who says "Perhaps you should actually research where projects host their code a bit more", here is a glimpse into what I actually do, written just 2 days ago and not planned for the article:

"It starts simple -- because at first this was all I did. I don't do every part of this process for every single package. I sort of follow my nose, going deeper when I feel it's important. The end result is that nothing is undeniable proof, but across the board you get a good fine-resolution picture with some occasional faulty pixels. I like to think this process would at least be more accurate than turning machine learning on it."

"So first, I go to the project's Wikipedia page. If it doesn't have one, it's probably not important enough to be on delete_github in the first place. But if Wikipedia lists GitHub as the repo, it's probably on GitHub."

"I went further and started counting all the apps in F-droid. About 4 out of 5 (out of several thousand) are based on GitHub."This was how it started, and for certain I could have paid more attention to details, but I was only dipping my toes in. I am very happy that I was wrong about OpenBSD, and eventually did an audit of the hundred or so projects I'd listed, finding that at least FOUR of those hundred or so were actually mirrors. I happily listed the items that were cleared: FFmpeg, OpenBSD, QEMU, Kali (GNU/) Linux

Apache Server was cleared much later. ASF was infiltrated by Microsoft, and they are using GitHub for some things, but Apache Server was taken off the list because it is in fact a mirror. Slitaz was also taken off the list -- not because it was an error, but because they moved off GitHub again.

I went further and started counting all the apps in F-droid. About 4 out of 5 (out of several thousand) are based on GitHub. Did I check the thousands of apps? Of course not. For a ballpark figure, I checked how many listed GitHub as the source or website for the app. If they say it's GitHub, that's good enough for a ballpark figure. But I didn't stop there.
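A minimal sketch of that kind of ballpark count, in Python, assuming F-droid's index-v1.json format with per-app "sourceCode" and "webSite" fields (the URL and the field names here are assumptions for illustration, not part of the original tally):

# Ballpark count: how many F-droid apps list GitHub as their source or website?
# Assumes the index-v1.json layout (an "apps" list with "sourceCode"/"webSite" keys);
# adjust the field names if the index format differs.
import json
import urllib.request

INDEX_URL = "https://f-droid.org/repo/index-v1.json"   # assumed location of the index

with urllib.request.urlopen(INDEX_URL) as response:
    index = json.load(response)

apps = index.get("apps", [])
on_github = [
    app for app in apps
    if "github.com" in (app.get("sourceCode", "") + " " + app.get("webSite", "")).lower()
]

print(len(on_github), "of", len(apps), "apps point at GitHub")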

I used GitHub data to figure out which of those thousands of apps were the most popular, and then went through that list and hand-picked the most familiar apps. I don't use Android much, though I've always used F-droid when I need Android apps. I haven't done much to verify the F-droid list, and nobody has complained.

Tom pointed me to a couple of LISP library collections, and just to find out if the 70-80% figure came up again, I checked all of those -- and came up with a similar ratio. I'm happy to say that since then, this 4:5 ratio doesn't come up all the time.

"Even if it is partly developed on GitHub, I consider this a problem. It means something is happening that Microsoft controls, that Microsoft has them hostage with."I keep looking for large data sets to play with, having the data for all packages in the Tiny Core repo (I've also figured out how to recursively parse dependencies in the package information for Debian and Trisquel) but these are not sets that I'm always checking and verifying by hand. The purpose is to figure out what to look at next, and that's how I found myself examining the GNU project itself -- carefully.

Most, though not all, of the things I've checked have an initial phase and sometimes a deeper check involved. I've tended to document what I know, and it's usually pretty obvious from context (from the description) when the level of detail is superficial. A statistic without examples? It's just a cursory check. Examples and explanations? I spent more time. As I said 2 days ago:

This has yielded very few false positives, but Apache httpd is still a mirror. Mirrors don't count, not to me at least, because per the license, Microsoft is allowed to mirror every Free software package that exists... A GitHub mirror really means that the "real repo" is somewhere else, and that's what we want. We are looking at people using GitHub for development:

* Bug tracking
* Pull requests (even worse than bug tracking)
* "Official repo" (not mirror)

In order from least bad to worst, those are the things we are looking for. Even if it is partly developed on GitHub, I consider this a problem. It means something is happening that Microsoft controls, that Microsoft has them hostage with. [People actually say things like] "We can't lose our bug tracker! There's so much valuable data there!" Michele says Git is distributed so it's a non-issue. "But Gitea is migrating anyway" -- that's good news, then. But I think [that's been said] for a year or two, so take it with a grain of salt until there's more evidence.

"For Tiny Core, as I mentioned previously, I downloaded not only every package (looking for package data which isn't in the package) but every .info file, when I figured out that's where packages dependencies were listed."Anyway, the most important criteria is:

* Wikipedia or better yet the project's own homepage links to GitHub
* It's not a mirror

At some point I started checking project websites too. Seems obvious, but you have to realise that when this thing started I was being very casual. Just poking around, not being serious about it.
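Here is a rough sketch of that kind of website check, in Python: fetch a project's homepage, pull out any github.com repository links, and see whether the GitHub page mentions being a mirror. The "mirror" keyword test is only a heuristic I'm using for illustration; it is not proof either way, and borderline cases still get checked by hand:

# Rough website check: does the project's homepage point people at GitHub,
# and does the GitHub page describe itself as a mirror?
import re
import urllib.request

def fetch(url):
    request = urllib.request.Request(url, headers={"User-Agent": "delete-github-audit"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")

def github_repos(homepage_url):
    # collect unique github.com/owner/project links found in the page
    return sorted(set(re.findall(r"https://github\.com/[\w.-]+/[\w.-]+", fetch(homepage_url))))

def mentions_mirror(repo_url):
    return "mirror" in fetch(repo_url).lower()

for repo in github_repos("https://www.perl.org/"):
    print(repo, "(mentions 'mirror')" if mentions_mirror(repo) else "(no mention of 'mirror')")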

For Tiny Core, as I mentioned previously, I downloaded not only every package (looking for package data which isn't in the package) but every .info file, when I figured out that's where package dependencies were listed.

Dependency lists, whether we are talking about Tiny Core or Debian, typically only cover immediate deps -- not deps of deps or deps of deps of deps. So I literally wrote a recursive routine to turn the dep data for Tiny Core into a FULL list of packages that require each item. It takes 45 minutes or more (to write and work the bugs out), though after it's done you kick yourself for the time wasted doing it manually. I probably spent 2 hours trying to do a fraction of the work that way.
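The recursive routine itself is short; here is a minimal sketch in Python, assuming the immediate dependencies have already been pulled out of the .info files into a plain dictionary (the package names below are made up for illustration):

# Turn immediate dependency lists into FULL (transitive) dependency lists.
# direct_deps would be built by parsing the per-package metadata files;
# the entries below are only an example.
def full_deps(package, direct_deps, seen=None):
    if seen is None:
        seen = set()
    for dep in direct_deps.get(package, []):
        if dep not in seen:            # guards against cycles and duplicates
            seen.add(dep)
            full_deps(dep, direct_deps, seen)
    return seen

direct_deps = {
    "gimp2.tcz": ["gtk2.tcz", "libpng.tcz"],
    "gtk2.tcz": ["glib2.tcz", "libffi.tcz"],
    "glib2.tcz": ["libffi.tcz"],
}

print(sorted(full_deps("gimp2.tcz", direct_deps)))
# ['glib2.tcz', 'gtk2.tcz', 'libffi.tcz', 'libpng.tcz']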

"Behold, every official GNU project. Once again, I started doing this manually. Spent at least 10 hours doing that, got 1/4 the way up the list, from Xnee up to Metahtml. The article itself took 45 minutes or so."Now I can say "which packages pull in glib2" and get a full list.

I can do that FOR EACH PACKAGE, then run wc -l on each list, getting a count of how many packages need each thing -- like how many need glib, how many need libffi, etc. Run the counts through sort -n and you know which deps are the most needed. Libffi is right at the top. In fact Roy tells me people were complaining about libffi a day or two prior to this discovery, but I pulled the fact right from the data I had cached. So now it's telling me things other people won't -- aka verifying things other people know.
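In Python, that counting-and-sorting step looks roughly like this, given each package's full dependency set (produced by a routine like the one above); it is only a sketch of the wc -l plus sort -n pass, with example data standing in for the real repo:

# Count how many packages need each dependency, most-needed first --
# the same idea as running wc -l on each per-dep list and piping through sort -n.
from collections import Counter

full_dependency_sets = {                     # example data; in practice, thousands of packages
    "gimp2.tcz": {"gtk2.tcz", "glib2.tcz", "libffi.tcz", "libpng.tcz"},
    "gtk2.tcz": {"glib2.tcz", "libffi.tcz"},
    "glib2.tcz": {"libffi.tcz"},
}

needed_by = Counter(dep for deps in full_dependency_sets.values() for dep in deps)
for dep, count in needed_by.most_common():
    print(count, dep)                        # libffi.tcz comes out on top, as with the real data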

All well and good, so I know the usual suspects when I download Trisquel. But I got bored with Trisquel (fig spent about a week processing all the source code so it's easier to search) and started looking at GNU.

I started here: http://savannah.gnu.org/search/?Search=Search&words=*&type_of_search=soft&exact=0&max_rows=500&type=1#options

"It's a mix of manually checking websites, manually checking Wikipedia, manually and programmatically checking package data and even setting up a dedicated machine to spend several days processing the tens of gigs of source code to Trisquel."Behold, every official GNU project. Once again, I started doing this manually. Spent at least 10 hours doing that, got 1/4 the way up the list, from Xnee up to Metahtml. The article itself took 45 minutes or so.

For the GNU stuff, I finally had the "server" search all the code for things like bffi, perl, .pl, .py, ython, ithub, png, flex. This is all [output to] a single text file, and it shows the path/name of the file (project name/path/actual file/line of text found), so if I grep this file for, say, "gperf", I get stuff like this:

89/gcc-9.3.0/libsanitizer/sanitizer_common/sanitizer_procmaps_mac.cc:// Google Perftools, https://github.com/gperftools/gperftools.
108/global-6.6.4/reconf.sh:prog='autoconf automake bison flex gperf libtool m4 perl'^I# required programs
123/gperf-3.1/ChangeLog: when the -n option is used. Previously, it didn't
123/gperf-3.1/ChangeLog: I'm too busy to fix it , right now. The problem
123/gperf-3.1/ChangeLog: they weren't being entered into the hash table .
123/gperf-3.1/ChangeLog: * Added the -D option that handles keyword sets that
123/gperf-3.1/ChangeLog: * Modified Key_List::print_hash_function so that it
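For anyone curious, the scan itself can be something as simple as the following Python sketch (the real run was on a dedicated machine over tens of gigabytes; the folder layout and the output format here merely mimic the description above):

# Crude source-tree scan: walk the numbered project folders, look for a few
# telltale substrings, and write every hit to one flat text file as
#   project/path/file: line of text
# so it can later be grepped per project or per token.
import os

TOKENS = ("bffi", "perl", ".pl", ".py", "ython", "ithub", "png", "flex")
ROOT = "source"          # e.g. source/89/gcc-9.3.0/..., source/123/gperf-3.1/...

with open("scan-results.txt", "w", errors="replace") as results:
    for dirpath, _dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="replace") as source_file:
                    for line in source_file:
                        if any(token in line for token in TOKENS):
                            results.write(os.path.relpath(path, ROOT) + ":" + line)
            except OSError:
                pass     # unreadable file; skip it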

"I told bash to make a numbered folder for each project, that way I can cd 123/ TAB TAB instead of spelling out the project folder name and so I can iterate/grep using seq instead of folder names. That's a convenience, so if I'm babbling its not important."

"I hand-checked each one more than once for mirrors, and there is no way to do 100% of this programmatically. Not every project follows the same rules.""I look for png files, perl, python, libffi, glib. I write it down. Then I tally it up later."

"That's how I do it. All in all, there's logic but I try to do whats logical / convenient / efficient. And other than that I just play it by ear."

TL;DR: It's a mix of manually checking websites, manually checking Wikipedia, manually and programmatically checking package data, and even setting up a dedicated machine to spend several days processing the tens of gigs of source code to Trisquel.

Sometimes I get into the includes in the C code and note that the libpng include is inside an #ifdef. I'm not really a C coder, but at least I get the concept of "you can configure this to compile with optional dependencies."

But even when I wasn't trying, the first hundred or so entries were mostly accurate. I hand-checked each one more than once for mirrors, and there is no way to do 100% of this programmatically. Not every project follows the same rules.

And the data changes, too. But someone had asked me about Perl specifically: "I took a quick look at the perl site. They do not cite GitHub..."

"They're developing Perl 6 on GitHub, so it fits the methodology."That's not entirely true. So I told them what I knew so far:

"Some [of the most important] things I check over and over and over. Perl is one. Let's do it again, it's useful exercise..."

https://en.wikipedia.org/wiki/Perl -- nothing.

Let's take a detour towards "Perl 6": https://en.wikipedia.org/wiki/Raku_(programming_language)

https://raku.org/

"Language Design"

"Either way, Perl 6 is absolutely, for the intent and purpose of this study, being developed on GitHub.""Specification - Official Raku language specification test suite" https://github.com/Raku/roast

[Roy is likely to turn these into links. It's very possible the link he creates will not match the exact link from the website -- the url will be the same, the text will be the same, but the actual overlap of text and underlined link may vary. In the original writing, I kept the text and urls separate.]

* Issues 78
* Pull requests 25

"Raku is GitHub."

See what I did there? They're developing Perl 6 on GitHub, so it fits the methodology.

You may not agree with this methodology, for various reasons. You might have a good reason -- or you could be a shill or marketing person. You could just be an ordinary fanboy. BUT, you might also have a point! Either way, Perl 6 is absolutely, for the intent and purpose of this study, being developed on GitHub.

By all means, I expect other people to make a case for/against considering it "captured or controlled by Microsoft." My concern is Early Warning. You may have "better" criteria to offer that you consider more useful.

"Puppy Linux, to cite a different example, is developed all over the place."The whole idea of this project is to get the conversation going. But I've gone to great lengths to provide useful data overall, and I do try harder when there is a key project like Perl involved. It continues:

"Back to perl.org:"

"Contribute" https://www.perl.org/contribute.html

"Contribute to Perl Core" http://dev.perl.org/perl5/

"Perl" (these things are quoted as headings to look for, if they're listed more than once there's a reason)

"Perl" "Production-ready, under active development"

This part says:

"Some people have expressed an interest in getting Puppy away from GitHub.""Perl 5.30.2 is the current stable version of Perl. Perl is actively maintained and developed (git repository) by a large group of dedicated volunteers."

That links to this url:

https://github.com/Perl/perl5

So we go to that url, and it says:

* Issues 1,865
* Pull requests 33

Whatever the "Canonical url" is, what matters to me and what I'm actually looking for is that

* The official website tells people to go to GitHub.
* The GitHub repo is being used for Issues and Pull requests.

If there aren't Issues and Pull requests, I make the decision based on other factors.

If the README.md says "please don't use this for issues or pull requests" then that certainly counts for something.

Puppy Linux, to cite a different example, is developed all over the place. Packages are strewn across countless hobbyists' websites. There are too many derivatives to even count -- literally hundreds of fan-based ISOs exist online that people have made over the years. Several active derivatives exist -- many are based on Woof-CE, which is developed on GitHub. Some are not.

"Folks, canonical urls don't tell the whole story. Whether Microsoft has you in its clutches ultimately comes down to details and down to the reality of the project."Some people have expressed an interest in getting Puppy away from GitHub. Puppy, DSL (a predecessor to Tiny Core) and Xubuntu were the three distros that helped me finally delete my last copy of Windows more than 10 years ago, and I learned as much about GNU/Linux from Puppy as the others.

But the Woof team has made it clear they are unlikely to migrate. The work they've done that keeps them "locked in" to GitHub isn't as easy to "fork" as the codebase, and this is by design.

Folks, canonical URLs don't tell the whole story. Whether Microsoft has you in its clutches ultimately comes down to details and down to the reality of the project.

We can debate that, but the purpose of what I'm doing is to find the projects to worry about, so something can be done. By all means, help me get some of the items off the list. But you probably won't find anybody who has gone to more effort than I have, unless it's Roy. And Roy is interested. And I've heard from more people who are interested than from critics.

"Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!"If you have are a fan, user or developer of a vital project like Perl, I am certainly interested in (and continually looking for) criteria that would allow it to be taken off this list. The whole point of putting it on the list is the hope of getting it back off the list again.

And if you missed them, here are some disclaimers that I've already made:

"I don’t always trust Debian dependencies, but they’re certainly illustrative" (part 1)

"If there are obvious mistakes or less obvious misconceptions I’m presenting when I talk about some of the details, I hope you’ll mention it in the comments. I’m sure there will be a few differences of opinion as well." (part 1)

"Python is worth watching for, but only proves to be a GitHub hostage sometimes." (part 1)

"This isn’t just about where the code is, but where the development takes place and who controls access." (part 2)

"This isn’t to admonish the author for not following a rule that doesn’t exist, but to highlight the more-than-hypothetical threat that the GNU project faces" (part 2)

"But do, if you're interested, please help get these projects off this endangered species list."At each new chapter of this research, I tend not to rely exclusively on previous research. Certainly I learn more as I do this, but I lean towards using each new focus as an opportunity to redundantly check things I've checked already -- that's how I discovered that Slitaz had moved. So at each pivot, I often get fresh data to confirm or update previous data.

Having gone to all this trouble, I can absolutely tell you: "Most of them only use github for a mirror" is completely untrue!

But do, if you're interested, please help get these projects off this endangered species list. I'm generally quite interested in the evidence that other people can bring to the table; a good deal of the feedback received has proven useful. I still consider Perl to be endangered, and I think the criteria are relevant.

Long live rms, and happy hacking.

Licence: Creative Commons CC0 1.0 (public domain)
