
08.25.19

Thirteen Years of Techrights This Year

Posted in Site News at 8:16 am by Dr. Roy Schestowitz

Mark Webbink
Photo credit: Mark Webbink’s image by Luca Lucarini, CC BY-SA 3.0

Summary: We’re a survivor of a dying breed of sites largely dedicated to FOSS-centric news

EARLIER this year Debian celebrated 26 years. That’s pretty impressive considering that the grandfather of GNU/Linux, Slackware, has had some issues in recent years and its founder sought to raise funds through Patreon some weeks ago. That distribution was created by Patrick Volkerding in 1993, whereas Debian came about a month later; Ian Murdock founded the Debian Project on August 16, 1993.

So far in 2019 at least 3 noteworthy GNU/Linux distributions ‘called it a day’. News sites covering GNU/Linux also suffered heavy casualties; these included some of the biggest sites, notably Linux Journal and Linux.com, and a few others became stagnant. It’s part of the decline of media in general, not a problem with GNU/Linux in particular.

The journey of Techrights began back in the days of Digg. Remember Digg.com? I certainly do. I was a Ph.D. student at the time and “social media” had just begun to catch on (before that I spent a lot of time in USENET newsgroups). In 2006 I met Shane on Digg, where we shared our concerns about the Novell deal with Microsoft. That’s how a blog (back then dedicated to a Novell boycott) was born. Digg.com is still around, but it’s in no way related to the original Digg, which stagnated and died within a few years. By 2009 or 2010 it was already quite irrelevant, partly (depending on one’s interpretation) due to Facebook and Twitter, maybe even Reddit. Those three sites are still around.

Back in 2006 we also shared concerns and views with Groklaw and Technocrat, the site of Bruce Perens (famous for Debian and OSI). Perens made a bit of a comeback, even in his own domain name, but that didn’t quite replace his original project, the “Slashdot for grown-ups”, which suffered an epic demise just like Slashdot itself. As for Groklaw, it too made a sort of comeback attempt, first with Mark Webbink, a former Red Hat employee (he’s retired now; photo above), and then with Pamela Jones (PJ) again. I spent years mailing her every day and her decision to ‘disappear’ from the Web was rather disappointing. Snowden’s leaks did not reveal much that wasn’t already known; they just provided hard proof for what many of us had speculated about or cited other whistleblowers about (they didn’t have the documentary evidence at hand, so NSA denials were simpler to sustain). At the same time Andy Updegrove’s blog became less active (he’s with the Linux Foundation now) and the Web as we knew it was transforming into Social Control Media, which is largely hearsay.

The media as a whole is being battered; and no, tabloids aren’t media, and channels like Fox News and CNN are mostly a partisan feeding frenzy. They lack credibility and accuracy on a lot of topics — typically those that get them many viewers, drawing them in largely on emotion, not substance.

In a sense, we view ourselves as survivors of much turbulence. We don’t rely on ads and we don’t pay salaries; I work full time in a technical job, so I can afford to keep the site going in my spare time. No rich sponsors, no sellouts, no “affiliate” posts.

It seems pretty certain we’ll reach 15 years. 20 years might be a challenge, but at the moment it seems doable because we’re growing. Our European Patent Office (EPO) coverage helped make a positive impact and this year we’re gradually revisiting more and more aspects of GNU/Linux and Software Freedom. Some of the topics we covered, nobody else dared cover. We have several important stories in the pipeline. Hopefully we won’t have to see any more FOSS-focused publishers (what’s left of such publications) perishing and closing down. Each closure creates an information vacuum that gives leeway to Microsoft’s PR department and prevents introspection or self-assessment — something sorely needed in today’s tough terrain of GAFAM and Microsoft entryism.

08.21.19

Some Patent Attorneys Dislike Techrights Not Because It’s Wrong But Because Software Patents Are Wrong (and Sometimes Illegal)

Posted in Europe, Patents, Site News at 5:57 pm by Dr. Roy Schestowitz

Actually it is the case, as SUEPO made very clear (based on internal data)

EPC rants

Summary: Odd rants which misuse common law, ignore alleged Fair Use, and misinterpret copyright law (for censorship purposes) would have people believe that we’re wrong; but it’s more likely that the person in question is jealous, insecure, or offended by our stance on patent scope, which is very much rooted in the law itself (and in the views widely held by software developers globally)

OUR in-depth coverage of European Patent Office (EPO) affairs is over 5 years old. We had covered the EPO prior to that, but not as frequently. We nowadays have pretty deep insights, valuable contacts and a good understanding of the issues. We’re fine with attorneys/lawyers not liking us. Some of them accept what we’re arguing, whereas others find it “offensive”. We can’t please everyone, but we can at least stay honest. We’re sincere, sometimes brutally so (for some).

We don’t typically write this kind of post, but SUEPO currently links to a Kluwer article from Team UPC, where the majority of comments mention Techrights in one form or another. There’s one person there who always claims to sort of agree while perpetually bashing us. We’ve noticed the same in IP Kat and another site. It’s usually the same person and it often boils down to our view/s on patent scope.

In short, arguing that the EPC is OK with software patents is intellectually dishonest; it’s simply untrue. And no, calling it “HEY HI” (AI) won’t change that; I’ve done “AI” since my early 20s and I know how it works. I wrote code to that effect.

The law is pretty clear about software patents. So are the courts. So is the European Parliament. But we suppose those who make a living from such patents are in denial about it (for the same reason Team UPC is in denial about the collapse of UPC/A).

We don’t try to discourage dissent against us; we’re all for free speech. But free speech also means the right to defend oneself — something IP Kat urgently needs to teach itself about. It’s also deleting comments critical of the EPO and its management. Not cool…

Earlier today we saw a post about PPH (akin to PACE and other programmes that speed things up; another such programme was mentioned here yesterday). Speed isn’t indicative of quality and it’s usually detrimental to accuracy, especially when multiple people need to assess a case/application. PPH generally works in favour of software patents in Europe — patents that are legal neither in Europe nor in Australia. They’re looking “to fast-track patent applications,” as Paul Whenman and Andrew Gregory have just put it. Their article is about sloppy patent examination designed to just help aggressive patent trolls and equip those looking for sanctions/embargoes (profit by harm and extortion), not innovation. Campinos and Battistelli don’t know what innovation is; they’re not scientists. In the words of Whenman and Gregory: [via Lexology]

IP Australia became an early participant in the PPH process. Following a successful pilot program with the USPTO, which commenced on 14 April 2008, Australia joined the Global PPH (GPPH). The GPPH initially covered Canada, US, Japan, South Korea, Denmark, Finland, Great Britain, Iceland, Norway, Portugal, Spain and Russia. Subsequently, the New Zealand jurisdiction was added, along with a raft of other participants.

Although the European Patent Office (EPO) is notably absent from the list of GPPH participants, fortuitously, IP Australia entered into a bilateral agreement with the EPO on 1 July 2016 in order to fast-track patent applications. This agreement provided for a trial period of three years. Given the global significance of the EPO, this was a very welcome and positive development.

[...]

On 1 July 2019 it was announced that the PPH trial between IP Australia and the EPO would continue for a further three years. Additionally, the original GPPH program with the other participant IP offices continues with no indication of curtailment.

This is indeed very good news as applicants will continue to be able to access and gain the benefits of the generous PPH programs operated by IP Australia.

Techrights has long expressed concerns about the EPO putting litigation first; it seems to have forgotten its core values and goals. If it exists to promote science and knowledge, it will give the benefit of the doubt to defendants/alleged infringers. Instead, today’s EPO gives many bogus patents to serial plaintiffs/claimants, who may in turn leverage these bogus patents to make bogus (invalid) claims of infringement. Patent trolls absolutely love that.

“On 1 July 2019 it was announced that the PPH trial between IP Australia and the EPO would continue for a further three years.”
      –Paul Whenman and Andrew Gregory
PPH is obviously biased or tilted in favour of plaintiffs, not defendants. Judging by who (or whose groups) today’s EPO management likes to associate and hang out with (in the media it has liaised with Watchtroll), it’s crystal clear whose side they’re on. How many of today’s EPO managers even have a background in science? One is alleged to have faked his diploma, but that’s another matter. If a few people have an issue with our EPO coverage, not because they disagree about the EPO but because of our stance on patent scope, maybe it’s because they don’t do actual coding and can’t quite see things through developers’ scopes/optics.

08.20.19

EPO Cannot Handle Patent Justice With a Backlog of About 10,000 Cases at the Boards of Appeal

Posted in Site News at 12:50 pm by Dr. Roy Schestowitz

Recent: Index: G 2/19 (Enlarged Board of Appeal, EPO)

EPO toons

Summary: The EPO’s long war on judges and on the law has proven to be costly; it’s difficult to pretend that the EPO functions like a first-world legal framework

ABOUT one year ago the European Patent Office (EPO) had about 9,000 pending cases after Battistelli had attacked and understaffed the appeal boards for a number of years. António Campinos obviously did nothing to tackle this issue. Some of these cases, including an imminent one regarding computer simulation, concern software patents in Europe. Several months ago a blogger from Kluwer Patent Blog took note of that staggering number. The EPO management’s attack on its judges has resulted in an unbelievable backlog in the ‘justice’ faculties/departments. What good is justice that can take a decade to arrive? It may be irrelevant by the time it’s ‘reached’. Similar issues exist at ILO-AT.

Just promoted via Lexology was last week’s article from a law firm, revealing that ‘acceleration’ is possible in particular cases (like PPH or PACE). To quote:

Appeal proceedings at the European Patent Office (EPO) typically last in excess of three years, but can last significantly longer (according to the 2017 Annual Report of the Boards of Appeal, technical appeal proceedings lasted 38 months on average, but some cases had been pending for eight years). With this long duration of proceedings, it is no surprise that there is a substantial backlog of pending cases (over 9,000 at the end of 2018, according to the 2018 Annual Report of the Boards of Appeal).

[...]

Requests for accelerated processing of an appeal should be filed with the competent Board of Appeal, and may be filed at the beginning of or during appeal proceedings. Such requests should specify the reasons for urgency, and be submitted with documents that support this reasoning. There is no official form for requesting accelerated processing of an appeal.

Preparing for such a request takes time and money. Given what we saw in the past (EPO leaks), this may discriminate based on size, connections, and money.

This is the kind of thing Germany’s FCC must look into; justice in today’s EPO is mostly an illusion. It’s infeasible. It used to more or less work; the Office used to more or less function. But now? Total chaos. Does one want to extend this system to courts all across Europe?

Thankfully, the UPC is failing. SUEPO linked to an article to that effect earlier today. People in the comments in that article (composed by Team UPC) mostly focus on Techrights, still upset because of our opposition to software patents (we assume these comments come from one single patent attorney).

In the summer of 2019 the famous complaint against the UPC turns two. Each year that passes is another nail in the UPC’s coffin. Almost exactly a year ago a site called Down to Earth boiled everything down to misinformation, slanting it all in favour of the litigation ‘industry’ (as if that’s the sole thing that matters). The writer ended up putting a copyright sign/symbol as the head image in an article about patents, showing that these people have no clue what they’re writing about (or are intentionally lying). They referred to patents as “IP” (not Invalid Patent but something meaningless and misleading).

Ironically in some sense, the person who pushed the hardest for the UPC is also the person who doomed it. Battistelli’s attacks on the judges aren’t forgotten and aren’t forgiven. Brexit isn’t even the prime barrier; the extreme lack of justice at the EPO is.

08.19.19

Speaking Truth to Monopolies (or How to Write Guest Posts in Techrights)

Posted in Site News at 4:16 am by Dr. Roy Schestowitz

“The jaws of power are always open to devour, and her arm is always stretched out, if possible, to destroy the freedom of thinking, speaking, and writing.”

John Adams

Summary: We need to have more articles tackling the passage of all power — especially when it comes to software — to a few large monopolies that disregard human rights or actively participate in their abolishment in the digital realm

I HAVE spent much of my adult life writing about (and against) software monopolies. I had done that before this site even existed. Seeing that the reach of Techrights is growing and more people are getting involved in various capacities, we openly — and freely — welcome more articles from more people.

The topics we cover aren’t hard to see; we do not, for example, publish HowTos; instead we just link to many. We ‘specialise’ in tackling attacks on Software Freedom, be these attacks technical or legal (e.g. acquisition or patents). We’re at the point now where both Free software and GNU/Linux are in a peculiar and precarious position. Their “livery” — so to speak — is being swapped by companies like Microsoft. It’s an attack on the very identity of one’s ideological opposition. It’s designed to confuse, to obfuscate, to disorientate. We need to fight back as narratives are being distorted, not least in the media. The demise of several big publishers contributes to this.

We invite readers to contribute posts. We’re very liberal when it comes to format and substance. Articles can be sent to bytesmedia@bytesmedia.co.uk which our core people read on a daily basis.

There’s a good chance Techrights will have posted its 26,000th post before the site turns 13 (in the middle of November), i.e. an average of over 2,000 posts per year. We’re becoming more productive this year because more people have become involved.

08.17.19

Caturdays and Sundays at Techrights Will Get Busier

Posted in Site News at 4:12 am by Dr. Roy Schestowitz

Not cat photos but analysis of issues pertaining (or puuurtaining) to Software Freedom

Cat

Summary: Our plan to spend the weekends writing more articles about Software Freedom; it seems like a high-priority issue

THE growth of openwashing has recently necessitated more and more responses (some of which are too long for editorial comments in daily links). So last weekend we started the “Openwashing Report” — a series we intend to continue this weekend. This does not mean that we will focus any less on the EPO, Campinos, Battistelli and so on.

Having recently (about 3 weeks ago) 'quit' Twitter I now have more time to spend writing articles. Weekends in particular will be spent writing about Free/libre software, GNU/Linux and technology rights (as per this site’s name). We will try to publish more articles per day (at our peak about a decade ago we averaged more than 10 articles a day). As I’m working full time (a job unrelated to this site), it’s more likely that weekends, as opposed to weekdays, will see more articles produced (counterintuitive, as paid writers publish throughout the week and barely during weekends). There’s a certain urgency as windows seem to be closing on digital freedom; we now have listening devices out there; Microsoft admits recording and retaining it all (it’s a GAFAM thing). This is considered almost ‘normal’ now. How did we get to this point? How do we get out of it? Expect a bit of a focus shift. Let’s hope that each Caturday can make a positive difference by reporting the ills. We’re watching Free software adversaries.

Cat

Why Techrights Doesn’t Do Social Control Media

Posted in Site News at 3:36 am by Dr. Roy Schestowitz

More about Social Control than about Media. Standing on one’s own means more freedom of speech and no self-censorship.

One-tree hill

Summary: Being managed and censored by platform owners (sometimes their shareholders) isn’t an alluring proposition when a site challenges conformist norms and the status quo; Techrights belongs in a platform of its own

AFTER posting more than 670,000 tweets and having coined the term "Social Control Media" (which even Julian Assange and Wikileaks adopted) I decided to no longer use Twitter except to check replies once a day. It had become a massive productivity drain and usually an utter waste of time. I still have stuff posted there, albeit only as copies exported from decentralised and Free software platforms such as Diaspora and Pleroma (Mastodon-compatible).

A decade ago Techrights, Boycott Novell and TechBytes had active accounts on Identi.ca (and two of these on Twitter as well). At some stage it seemed clear that this kind of activity was detrimental to — not contributory towards — actual journalism. Techrights never had a Twitter account, and character length is still a major limitation. Over the years surveillance and bloat got a lot worse; almost exactly a year ago Twitter also killed third-party tools by deprecating key APIs. It seems pretty clear where Twitter is going with this; it wants to eventually become another Facebook. We probably don’t have to explain why Facebook is so bad (there are many aspects to that).

There’s nothing to regret here overall; we didn’t participate in these sites and we probably lost nothing by staying out of them. I have personal accounts there, but those express my personal views (on politics) rather than the site’s.

Social Control Media (so-called ‘social’ ‘media’) is neither social nor media; when people socialise they don’t get managed by the billions by one single company/shareholders and media has generally (historically) checked claims/facts. Twitter lacks that.

Social Control Media has sadly ‘replaced’ the “long form” writings in a lot of blogs. That’s a shame really. Quality is being compromised for the sake of speed and concision. When we’re trying to actually find/syndicate reliable blogs we nowadays come to realise that many are inactive/dormant. Instead of sites they become “accounts” (on someone else’s platform, complete with throttling, censorship and ads); what used to be a site/blog is just some Twitter account that posts and reposts unverified nonsense. Techrights doesn’t wish to ever become something so shallow.

08.15.19

Links Are Not Endorsements

Posted in Site News at 2:10 am by Dr. Roy Schestowitz

No rollerblades

Summary: If the only alternative is to say nothing and link to nothing, then we have a problem; a lot of people still assume that because someone links to something it therefore implies agreement and consent

WE recently wrote about how Twitter cheapens if not ruins fact-finding/fact-checking. Weeks later it turned out that someone who had decided to declare a person dead was in fact wrong. That person is still alive. The Web is a fascinating maze of fabrications, hearsay and bad reporting, but also good investigative journalism. It’s not always easy to tell one from the other, and it’s something we work on. Sometimes, including last night, we get feedback; a reader contacted us to tell us that something we had linked to (in daily links) was wrong. It doesn’t happen very often, maybe a couple of times per year, and usually when it’s about more sensitive and divisive subjects like Kashmir.

Our daily links are an effort to make sense of a lot of information. (Mis)information overload is a big problem and the ‘cheapening’ of it, especially because of social control media, means that any random person can utter false words. These things get rated and shared (passed on) based on emotion rather than adherence to facts, so that’s another problem. The scoring is all wrong. Nowadays a lot of so-called ‘articles’ are based on nothing but a “tweet”; that’s another problem.

Sometimes a Web site or an article we link to may offend someone. Sometimes the article is wrong. We don’t link to social control media; we only link to blogs and news sites. That’s still prone to mistakes/issues. But most importantly, when we link to something, that does not imply we endorse the message. Sometimes we link to sites we do not agree with simply because we need to highlight something that was said, done, or happened. In social control media it’s common to specify in profiles/disclaimers that “links (or other) are not endorsements”; the same applies here.

What we argue in our own (original) articles is another matter altogether. At least those can be judged based on more than a link alone.

08.10.19

Techrights Coding Projects: Making the Web Light Again

Posted in Site News at 11:05 am by Dr. Roy Schestowitz

A boatload of bytes that serve no purpose at all (99% of all the traffic sent from some Web sites)

Very bloated boat

Summary: Ongoing technical projects that improve access to information and better organise credible information preceded by a depressing overview regarding the health of the Web (it’s unbelievably bloated)

OVER the past few months (since spring) we’ve been working hard on coding automation and improving the back end in various ways. More than 100 hours were spent on this and it puts us in a better position to grow in the long run and also improve uptime. Last year we left behind most US/USPTO coverage to better focus on the European Patent Office (EPO) and GNU/Linux — a subject neglected here for nearly half a decade (more so after we had begun coverage of EPO scandals).

As readers may have noticed, in recent months we were able to produce more daily links (and more per day). About a month ago we reduced the volume of political coverage in these links. Journalism is waning and the quality of reporting — not to mention sites — is rapidly declining.

To quote one of our guys, “looking at the insides of today’s web sites has been one of the most depressing things I have experienced in recent decades. I underestimated the cruft in an earlier message. Probably 95% of the bytes transmitted between client and server have nothing to do with content. That’s a truly rotten infrastructure upon which society is tottering.”

We typically gather and curate news using RSS feed readers. These keep sites light and tidy. They help us survey the news without wrestling with clickbait, ads, and spam. It’s the only way to keep up with quality while leaving out cruft and FUD (and Microsoft's googlebombing). A huge amount of effort goes into this and it takes a lot of time. It’s all done manually.
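
For the curious, here is roughly what that workflow looks like from the command line; a minimal sketch with standard tools (the feed URL is merely an example, and a real feed reader parses the XML properly instead of grepping it):

$ curl --silent --location "http://techrights.org/feed/" \
	| grep --only-matching '<title>[^<]*</title>' \
	| sed -e 's/<title>//' -e 's/<\/title>//'

This pulls a site’s RSS feed and prints only the entry titles: no images, no scripts, no ads.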

“I’ve been letting wget below run while I am mostly outside painting part of the house,” said that guy, having chosen to survey/assess the above-stated problem. “It turns out that the idea that 95% of what web servers send is crap was too optimistic. I spidered the latest URL from each one of the unique sites sent in the links from January through July and measured the raw size of the individual pages and their prerequisites. Each article, including any duds and 404 messages, averaged 42 objects [3]. The median, however, was 22 objects. Many had hundreds of objects, not counting cookies or scripts that call in scripts.

“I measured disk space for each article, then I ran lynx over the same URLs to get the approximate size of the content. If one counts everything as content then the lynx output is on average 1% the size of the raw material. If I estimate that only 75% or 50% of the text rendered is actual content then that number obviously goes down proportionally.

“I suppose that means that 99% of the electricity used to push those bits around is wasted as well. By extension, it could also mean that 99% of the greenhouse gases produced by that electricity is produced for no reason.

“The results are not scientifically sound but satisfy my curiosity on the topic, for now.

“Eliminating the dud URLs will produce a much higher object count.

“The results are not scientifically sound but satisfy my curiosity on the topic, for now.”
      –Anonymous
“Using more mainstream sites and fewer tech blogs will drive up the article sizes greatly.

“The work is not peer reviewed or even properly planned. I just tried some spur-of-the-moment checks on article sizes in the first way I could think of,” said the guy. We covered this subject before in relation to JavaScript bloat and sites’ simplicity, but here we have actual numbers to present.

“The numbers depend on the quality of the data,” the guy added, “that is to say the selection of links and the culling of 404s, paywall messages, cookie warnings and so on.

“As mentioned I just took the latest link from each of the sites I have bookmarked this year. That skews it towards lean tech blogs. Though some publishers which should know very much better are real pigs:


$ wget --continue --page-requisites --timeout=30 \
    --directory-prefix=./test.a/ \
    https://www.technologyreview.com/s/614079/what-is-geoengineering-and-why-should-you-care-climate-change-harvard/

. . .

$ lynx --dump \
    https://www.technologyreview.com/s/614079/what-is-geoengineering-and-why-should-you-care-climate-change-harvard/ \
    > test.b

$ du -bs ./test.?
2485779	./test.a
35109	./test.b

“Trimming some of the lines of cruft from the text version for that article, I get close to two orders of magnitude difference between the original edition and the trimmed text edition:

$ du -bs ./test.?
2485779	./test.a
35109	./test.b
27147	./test.c

“Also, the trimmed text edition is close to 75% the size of the automated text edition. So, at least for that article, the guess of 75% content may be about right. However, given the quick and dirty approach of this survey, not much can be said conclusively except that 1) there is a lot of waste, and 2) there is an opportunity for someone to do an easy piece of research.”
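
Those percentages can be cross-checked directly against the du figures quoted above; a quick arithmetic sketch (awk only does the division here, and the byte counts are the ones printed by du -bs for test.a, test.b and test.c):

$ awk 'BEGIN{
	raw=2485779; text=35109; trimmed=27147;
	printf "lynx dump is %.1f%% of the raw page\n", 100*text/raw;
	printf "trimmed text is %.1f%% of the lynx dump\n", 100*trimmed/text;
}'
lynx dump is 1.4% of the raw page
trimmed text is 77.3% of the lynx dump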

Based on links from 2019-08-08 and 2019-08-09, we get one set of results (we extracted all URLs saved from January 2019 through July 2019, kept http and https only, and eliminated PDF and other links to obviously non-HTML material). Technical appendices and footnotes are below for those wishing to explore further and reproduce the results.
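
That pre-filtering step can be approximated with standard tools; a rough sketch (the input file name is hypothetical, while ‘l’ is the URL list consumed by the loops in the footnotes below):

$ grep -E '^https?://' saved-links.txt \
	| grep -v -i -E '\.pdf([?#]|$)' \
	| sort -u > l

Duds, paywall pages and cookie warnings still have to be culled after the fetch, as noted earlier.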



+ this only retrieves the first layer of javascript, far from all of it
+ some sites gave wget trouble, should have fiddled with the agent string,
	--user-agent=""
+ too many sites respond without proper HTTP response headers,
	slows collection down intolerably
+ the pages themselves often contain many dead links
+ serial fetching is slow and because the sites are unique

$ find . -mindepth 1 -maxdepth 1 -type d -print | wc -l
91
$ find . -mindepth 1 -type f -print | wc -l
4171
which is an average of 78 objects per "article"

+ some sites were tech blogs with lean, hand-crafted HTML,
	mainstream sites are much heavier,
	so the above average is skewed towards being too light

Quantity and size of objects associated with articles,
does not count cookies nor secondary scripts:

$ find . -mindepth 1 -type f -printf '%s\t%p\n' \
| sort -k1,1n -k2,2 \
| awk '$1>10{
		sum+=$1;
		c++;
		s[c]=$1;
		n[c]=$2
	}
	END{
		printf "%10s\t%10s\n","Bytes","Measurement";
		printf "%10d\tSMALLEST\n",s[1];
		for (i in s){
			if(i==int(c/2)){
				printf "%10d\tMEDIAN SIZE\n",s[i];
			}
		};
		printf "%10d\tLARGEST\n",s[c];
		printf "%10d\tAVG SIZE\n",sum/c;
		printf "%10d\tCOUNT\n",c;
	}'

     Bytes      File Size
        13      SMALLEST
     10056      MEDIAN SIZE
  32035328      LARGEST
     53643      AVG SIZE
     38164      COUNT

     


Overall article size [1] including only the first layer of scripts,

     Bytes      Article Size
      8442      SMALLEST
    995476      MEDIAN
  61097209      LARGEST
   2319854      AVG
       921      COUNT

Estimated content [2] size including links, headers, navigation text, etc:

+ deleted files with errors or warnings,
	probably a mistake as that skews the results for lynx higher

     Bytes      Article Size
       929      SMALLEST
     18782      MEDIAN
    244311      LARGEST
     23997      AVG
       889      COUNT

+ lynx returns all text within the document not just the main content,
	at 75% content the figures are more realistic for some sites:

     Bytes      Measurement
       697	SMALLEST
     14087	MEDIAN
    183233	LARGEST
     17998	AVG
       889	COUNT

	at 50% content the figures are more realistic for other sites:

       465	SMALLEST
      9391	MEDIAN
    122156	LARGEST
     11999	AVG
       889	COUNT


       
       
$ du -bs * \
| sort -k1,1n -k2,2 \
| awk '$2!="l" && $1 {
		c++;
		s[c]=$1;
		n[c]=$2;
		sum+=$1
	}
	END {
		for (i in s){
			if(i==int(c/2)){
				m=i
			};
			printf "% 10d\t%s\n", s[i],n[i]
		};
		printf "% 10s\tArticle Size\n","Bytes";
		printf "% 10d\tSMALLEST %s\n",s[1],n[1];
		printf "% 10d\tMEDIAN %s\n",s[m],n[m];
		printf "% 10d\tLARGEST  %s\n",s[c],n[c];
		printf "% 10d\tAVG\n", sum/c;
		printf "% 10d\tCOUNT\n",c;
	}' OFS=$'\t'



[1]

$ time bash -c 'count=0;
shuf l \
| while read u; do
	echo $u;
	wget --continue --page-requisites --timeout=30 "$u" &
	echo $((count++));
	if ((count % 5 == 0)); then
		wait;
	fi;
	done;'
	


[2]

$ count=0;
time for i in $(cat l); do
	echo;echo $i;
	lynx -dump "$i" > $count;
	echo $((count++));
	done;


[3]

$ find . -mindepth 1 -maxdepth 1 -type d -print | wc -l
921

$ find . -mindepth 1 -type f -print | wc -l
38249
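
Dividing those two counts reproduces the figure of roughly 42 objects per article cited in the text (integer shell arithmetic, so the result rounds down):

$ echo $(( 38249 / 921 ))
41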



[4]

$ find . -mindepth 1 -type f -print \
| awk '{sub("\./","");sub("/.*","");print;}' \
| uniq -c \
| sort -k1,1n -k2,2 \
| awk '$1{c++;s[c]=$1;sum+=$1;}
	END{for(i in s){if(i==int(c/2)){m=s[i];}};
	print "MEDIAN: ",m;
	print "AVG", sum/c;
	print "Quantity",c;}'



[5] 

$ find . -mindepth 1 -type f -name '*.js' -exec du -sh {} \; \
| sort -k1,1rh | head
16M     ./www.icij.org/app/themes/icij/dist/scripts/main_8707d181.js
3.4M    ./europeanconservative.com/wp-content/themes/Generations/assets/scripts/fontawesome-all.min.js
1.8M    ./www.9news.com.au/assets/main.f7ba1448.js
1.8M    ./www.technologyreview.com/_next/static/chunks/commons.7eed6fd0fd49f117e780.js
1.8M    ./www.thetimes.co.uk/d/js/app-7a9b7f4da3.js
1.5M    ./www.crossfit.com/main.997a9d1e71cdc5056c64.js
1.4M    ./www.icann.org/assets/application-4366ce9f0552171ee2c82c9421d286b7ae8141d4c034a005c1ac3d7409eb118b.js
1.3M    ./www.digitalhealth.net/wp-content/plugins/event-espresso-core-reg/assets/dist/ee-vendor.e12aca2f149e71e409e8.dist.js
1.2M    ./www.fresnobee.com/wps/build/webpack/videoStory.bundle-69dae9d5d577db8a7bb4.js
1.2M    ./www.ft.lk/assets/libs/angular/angular/angular.js


[6] About page bloat, one can pick just about any page and find from one to close to two orders of magnitude difference between the lynx dump and the full web page. For example,


$ wget --continue --page-requisites --timeout=30 \
    --directory-prefix=./test.a/ \
    https://www.newsweek.com/saudi-uae-war-themselves-yemen-1453371

. . .

$ lynx --dump \
    https://www.newsweek.com/saudi-uae-war-themselves-yemen-1453371 \
    > test.b

$ du -bs ./test.?
250793	./test.a
 15385	./test.b
