Pushing Bad Data- Google’s Latest Black Eye


Google stopped counting, or at least publicly displaying, the number of pages it indexed in September 2005, after a schoolyard “measuring contest” with rival Yahoo. That count topped out around 8 billion pages before it was removed from the homepage. News broke recently through various SEO forums that Google had suddenly added another few billion pages to the index over the past few weeks. This might sound like a reason for celebration, but this “accomplishment” does not reflect well on the search engine that achieved it.

What had the SEO community buzzing was the nature of those fresh few billion pages. They were blatant spam, containing Pay-Per-Click (PPC) ads and scraped content, and in many cases they were showing up well in the search results. In doing so, they pushed out far older, more established sites. A Google representative responded to the issue on the forums by calling it a “bad data push,” an explanation that was met with groans throughout the SEO community.

How did someone manage to dupe Google into indexing so many pages of spam in such a short period of time? I’ll provide a high-level overview of the process, but don’t get too excited. Just as a diagram of a nuclear explosive isn’t going to teach you how to build the real thing, you’re not going to be able to run off and do it yourself after reading this article. Yet it makes for an interesting tale, one that illustrates the ugly problems cropping up with ever-increasing frequency in the world’s most popular search engine.


A Dark and Stormy Night

Our story begins deep in the heart of Moldova, sandwiched scenically between Romania and Ukraine. In between fending off local vampire attacks, an enterprising local had a brilliant idea and ran with it, presumably away from the vampires… His idea was to exploit how Google handled subdomains, and not just a little bit, but in a big way.


The heart of the issue is that, currently, Google treats subdomains much the same way it treats full domains: as unique entities. This means it will add a subdomain’s homepage to the index and return at some later point to do a “deep crawl.” A deep crawl is simply the spider following links from the domain’s homepage deeper into the site until it finds everything, or gives up and comes back later for more.
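To make the “deep crawl” idea concrete, here is a minimal sketch in Python. It is not Google’s crawler; it only illustrates following links outward from a homepage until everything is found or a page budget runs out. The `SITE` link graph, the `deep_crawl` function, and the `max_pages` budget are all hypothetical names invented for this illustration.

```python
from collections import deque

def deep_crawl(links, start, max_pages):
    """Breadth-first walk of an in-memory link graph (illustrative only).

    links: dict mapping a URL to the URLs it links to (hypothetical data).
    Returns (pages visited in order, pages left over for a later visit).
    """
    seen = {start}
    queue = deque([start])
    visited = []
    # Stop when nothing is left to fetch or the page budget is spent.
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited, list(queue)

# A made-up four-page site hosted on one subdomain.
HOME = "http://sub.example.com/"
SITE = {
    HOME: [HOME + "a", HOME + "b"],
    HOME + "a": [HOME + "c"],
    HOME + "b": [HOME],
    HOME + "c": [],
}

if __name__ == "__main__":
    print(deep_crawl(SITE, HOME, max_pages=10))
    print(deep_crawl(SITE, HOME, max_pages=2))
```

With a generous budget the crawl reaches every page; with a small one it stops early and returns the unvisited pages, which corresponds to the spider “coming back later for more.”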

Briefly, a subdomain is a “third-level domain.” You’ve probably seen them before; they look something like this: subdomain.domain.com. Wikipedia, for instance, uses them for languages; the English version is “en.wikipedia.org,” and the Dutch version is “nl.wikipedia.org.” Subdomains are one way to organize large sites, as opposed to multiple directories or even separate domain names altogether.
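As a rough illustration of that naming scheme, the Python sketch below splits a hostname into its subdomain part and its two-label parent domain. The `split_host` helper is hypothetical, and it assumes the parent domain is always the last two labels; that holds for names like wikipedia.org but not for suffixes such as co.uk, which would need a public suffix list.

```python
def split_host(host):
    """Split a hostname into (subdomain, domain).

    Simplifying assumption: the registered domain is the last two labels.
    """
    labels = host.lower().rstrip(".").split(".")
    domain = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])
    return subdomain, domain

if __name__ == "__main__":
    for name in ("en.wikipedia.org", "nl.wikipedia.org", "wikipedia.org"):
        print(name, "->", split_host(name))
```

Both language editions share the same parent domain, yet a crawler that treats each hostname as its own entity sees them as separate sites.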

So, we have a kind of page Google will index virtually “no questions asked.” It’s a wonder no one exploited this situation sooner. Some commentators believe the reason may be that this “quirk” was introduced after the recent “Big Daddy” update. Our Eastern European friend got together some servers, content scrapers, spambots, PPC accounts, and some all-important, very inspired scripts, and mixed them all together thusly…

Five Billion Served- And Counting…

First, our hero crafted scripts for his servers that would, when GoogleBot dropped by, start generating an essentially endless number of subdomains, each with a single page containing keyword-rich scraped content, keyworded links, and PPC ads for those keywords. Spambots were then sent out to put GoogleBot on the scent via referral and comment spam on tens of thousands of blogs around the world. The spambots provide the broad setup, and it doesn’t take much to get the dominoes to fall.

GoogleBot finds the spammed links and, as is its purpose in life, follows them into the network. Once GoogleBot is inside the web, the scripts running the servers simply keep generating pages, page after page, all with a unique subdomain, all with keywords, scraped content, and PPC ads. These pages get indexed, and suddenly you’ve got yourself a Google index 3-5 billion pages heavier in under 3 weeks.

Reports indicate that, at first, the PPC ads on these pages were from AdSense, Google’s own PPC service. The ultimate irony, then, is that Google benefits financially from all the impressions charged to AdSense advertisers as their ads appear across these billions of spam pages. The AdSense revenue from this endeavor was the point, after all: cram in so many pages that, by sheer force of numbers, people would find and click on the ads in those pages, making the spammer a nice profit in a very short amount of time.

Billions or Millions? What is Broken?

Word of this achievement spread like wildfire from the DigitalPoint forums. It spread like wildfire in the SEO community, to be specific. As of yet, the “general public” is out of the loop and will probably remain so. A response by a Google engineer appeared on a Threadwatch thread about the topic, calling it a “bad data push.” Basically, the company line was that they have not, in fact, added 5 billion pages. Later claims include assurances that the issue will be fixed algorithmically. Those following the situation (by tracking the known domains the spammer was using) see only that Google is removing them from the index manually.

The tracking is accomplished using the “site:” command, which, theoretically, displays the total number of indexed pages from the site you specify after the colon. Google has already admitted there are problems with this command, and “5 billion pages,” they seem to be claiming, is merely another symptom of it. These problems extend beyond the site: command to the displayed number of results for many queries, which some feel is highly inaccurate and in some cases fluctuates wildly. Google admits it has indexed some of these spammy subdomains, but so far hasn’t provided any alternate numbers to dispute the 3-5 billion shown initially by the site: command.
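For readers who haven’t used it, a site: query is just an ordinary search whose text begins with “site:” followed by the domain. The Python sketch below builds the search URL for such a query; the `site_query_url` helper and the example.com domain are hypothetical, and the sketch only constructs the link rather than fetching or parsing any result count.

```python
from urllib.parse import urlencode

def site_query_url(domain):
    """Build a Google search URL for the query 'site:<domain>'."""
    # urlencode percent-encodes the colon, so the query becomes site%3A<domain>.
    return "https://www.google.com/search?" + urlencode({"q": "site:" + domain})

if __name__ == "__main__":
    print(site_query_url("example.com"))
```

Opening the resulting link in a browser shows the estimated result count that people were using to track the spammer’s domains.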

Bad Data

Over the past week, the number of spammy domains and subdomains indexed has steadily dwindled as Google personnel remove the listings manually. There’s been no official statement that the “loophole” is closed. This poses the obvious problem that, since the way has been shown, a number of copycats may rush to cash in before the algorithm is changed to deal with it.


There are, at minimum, two things broken here: the site: command, and the obscure, tiny bit of the algorithm that allowed billions (or at least millions) of spam subdomains into the index. Google’s current priority should probably be to close the loophole before it is buried in copycat spammers. The issues surrounding the use or misuse of AdSense are just as troubling for those who might be seeing little return on their advertising budget this month.

Do we “keep the faith” in Google in the face of these events? Most likely, yes. It isn’t so much a question of whether they deserve that faith as that most people will never know this happened. Days after the story broke, there is still very little mention of it in the “mainstream” press. Some tech sites have mentioned it, but this isn’t the kind of story that will end up on the evening news, mostly because the background knowledge required to understand it goes beyond what the average citizen can muster. The story will probably end up as an interesting footnote in that most esoteric and neoteric of worlds, “SEO History.”

Mr. Lester has served for 5 years as the webmaster for ApolloHosting.com and previously worked in the IT industry for an additional 5 years, acquiring knowledge of hosting, design, etc. Apollo Hosting provides website hosting, e-commerce hosting, VPS hosting, and web design services to a wide range of customers. Established in 1999, Apollo prides itself on the highest levels of customer support.