Ask The Search Engineer Your Questions!

Ever wanted to ask a Google Search Engineer something but knew they wouldn’t respond?

Well I will!

There are no stupid questions. I’ll keep your information anonymous in my responses, unless you specifically want attribution.

How much duplicate content is acceptable?

I think people get confused when they hear duplicate content isn’t bad, especially from engineers at Google. What they mean here is that a certain amount isn’t bad, and that amount typically is around 30% to 40%. This accounts for shared components like headers and footers. So when we are writing algorithms to create duplicate content penalties, typically they won’t start until they get above 50% correlation. Of course this correlation algorithm is a whole topic on its own, especially with word embedding.
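As a rough illustration of how a duplicate-content threshold could be checked, here is a minimal sketch that compares two pages by their overlapping three-word shingles. Jaccard overlap stands in for the "correlation" mentioned above, which is an assumption rather than any search engine's actual method. The function names and the `DUPLICATE_THRESHOLD` constant are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def overlap(a, b, k=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Assumption: mirrors the "above 50%" figure from the answer above.
DUPLICATE_THRESHOLD = 0.5


def is_duplicate(a, b):
    """True only when the overlap is strictly above the threshold."""
    return overlap(a, b) > DUPLICATE_THRESHOLD
```

Shared headers and footers would raise the overlap of every page pair on a site, which is one reason a threshold like this would sit well above zero.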

Do I have a chance to outrank blog pages or comparison sites ranking for queries with a product page if the intent is mixed, or only the competitor product pages ranking?

I get why you ask this question: comparison sites are a big issue with Google’s algorithm. It’s having a hard time distinguishing who the real authority should be. I think if you have a chance, you really have to explore the statistical gaps between your landing page and these comparison site pages. If you can clearly see that the comparison sites are dominating a SERP and there is original OEM content that’s not ranking above it, that’s probably due to the EAT algorithms which are looking for completeness of the word graph. Given a set of words, are there other words that should be alongside those (this is one of the ways we determine what expertise means)? How much of a gap is that? These are the questions you need to be answering to give yourself a chance.

Can I really hide content in tabs for mobile or will that be valued less like it was when desktop mattered?

I wouldn’t bet on it. To a search engine the tabs are all expanded / enabled.

How many Google sub-algorithms are there really?

Thousands. And a lot of it is the chicken and the egg. For instance how do you determine the meaning of a page? You may first need to crawl all the links that link to that page, and based off of that citation structure somehow adjust the meaning using the existing content. But part of link scoring and determining how powerful a link is may also look at the relevancy of that link, how relevant the anchor text is to the target page. But to do that you have to have already calculated the original meaning of the page. But to do that you need to know the link scoring… This causes a recursive nature to many algorithms where certain algorithms’ inputs depend on other algorithms’ outputs. So just to calculate one area may involve hundreds of sub-scores to get there, even if it’s just rotating between a given set of algorithms.

How many SEO factors actually matter vs the hundreds of small tweaks that barely make a difference?

It really depends on the competitiveness of the environment. At the more competitive levels, instead of a kitchen sink approach, it’s best to prioritize based off of statistical gaps between you and the outperformer in that part of the search engine. It’s why you’re seeing all these AI SEO tools pop up. They’re all trying to solve that problem.

It’s likely that the majority of algorithms will do nothing to move the needle and it’s only a small subset that will actually make a difference. There are also different ROIs that you’ll get with different optimizations, so that has to be taken into account as well. There’s a perfect mixture that can be calculated with things like search engine models.

Are there interaction effects with factors that only work when combined together?

Yes. Many times it’s trying to solve a problem with multiple levers, and some of those levers can be affected by the other levers. Case in point: let’s say that you’re pushing more internal linking to a competitive landing page, but you haven’t finished optimizing the content for that landing page. You can inadvertently remove overall traffic to a site because what you’ve done is sacrificed a potential good landing page’s link flow distribution for a currently bad landing page’s link flow distribution. So it’s the link flow distribution plus the content optimization that needs to occur in parallel for it to work. It’s always best to work on the content algorithms first and get that correct before you start messing with the internal link structure. There are countless other symbiotic relationships between groups of algorithms that work in a similar way.

How often do you see the query intent shift from a transactional term to an informational one (and vice versa)?

The search engine that I’ve been building at Market Brew for the past 15 years isn’t fielding millions of different unrelated queries, so it doesn’t have the same problem to solve as Google. Market Brew’s search engine is designed to emulate another search engine, so typically the keywords are strategically picked to evaluate a specific SERP, and the intent is already known.

That being said, NLP is a key technology here. Being able to complete the user’s query typically will give us the answer of intent. We use things like skip-gram neural nets to predict what the rest of the query might be, which can internally affect the way that the search engine perceives the intent.
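To make the skip-gram idea a little more concrete, here is a minimal sketch of the (center word, context word) training pairs that a skip-gram model learns from. It covers only the data preparation step, not the neural net itself. The function name and the `window` parameter are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Example: pairs drawn from a short query, looking one word either side.
query_pairs = skipgram_pairs(["best", "running", "shoes"], window=1)
```

A model trained on many such pairs learns which words tend to surround a given word, which is what lets it guess at the rest of a partial query.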

What’s an optimization often missed by SEOs that plays a part we should optimize for?

Link Flow Distribution. The larger, more successful sites are the biggest culprits: they mask the issue with sheer backlink power. One thing to remember is that only a certain number of landing pages can be supported given a specific amount of backlinks. Sites like Home Depot and Walmart can get away with flat link flow distributions, whereas the smaller site can’t.

The good thing is that this leaves the door open for smaller backlink profile sites to capture market share by simply optimizing their internal linking structure, which is an order of magnitude easier to do versus obtaining backlinks. You can often achieve 3x to 4x equivalent backlink power by simply readjusting and promoting landing pages up the distribution. These two facts combine to produce a very high ROI.

What changes should I make for the Google core updates?

I think everybody knows the guidelines with the core web vitals, but the more important question is does it make a difference? You need tools that can calculate these metrics not only on your site but all of your competitors, at scale. Then use modeling techniques to understand if there are any significant gaps between you and the outperformers in the model.

When you find the outperformer, copy them. When I say copy I mean UI implementation, UI frameworks, JavaScript libraries et al. All these contribute to the end user experience which the core web vitals are trying to optimize.

Are there good tools that closely approximate parts of the Google algo (e.g. Clearscope grade, Ahrefs links)?

I won’t get into promoting specific tools because the industry really doesn’t like that and everybody is super competitive with each other, but I think it’s important to have an organic approach: you’re trying to approximate a search engine. Google turned to machine learning, so why shouldn’t you, to learn about how that same machine learning search engine works? The data providers are necessary for this to work, but not sufficient to solve the puzzle.

How does the EAT algorithm work?

As Google’s own engineers have explained, it’s made up of a bunch of mini algorithms. How do you define expertise? Elon Musk is always talking about the best way to interview an employee. If somebody says that they led a project, ask them about the details. If they really ran the project they’ll know every single detail. If they were just part of the project then they won’t. The same happens with the approach to this algorithm, with the help of tools like CBOW (continuous bag of words) where you can take a group of words and determine how complete the word graph is for any given page. The more complete the graph, the more expertise.
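A very simplified sketch of the "completeness" idea: given a set of terms that should appear alongside a topic, measure what fraction of them a page actually uses. In practice the expected terms would come from a trained embedding model such as CBOW. Here they are a hand-written, hypothetical set, and the function name is an assumption.

```python
def completeness(page_text, expected_terms):
    """Fraction of expected_terms found in page_text (0.0 to 1.0)."""
    if not expected_terms:
        return 0.0
    words = set(page_text.lower().split())
    found = {term for term in expected_terms if term in words}
    return len(found) / len(expected_terms)


# Hypothetical related terms for a page about making espresso.
EXPECTED = {"espresso", "grinder", "crema", "tamper"}

score = completeness("Pull the espresso after you tamper the grounds", EXPECTED)
```

A page that covers only half of the expected terms scores 0.5 here. The gap to a competitor's score is the kind of statistical gap described in the answers above.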

How does a search engine incorporate new algorithms like Core Web Vitals?

Like any other new algorithm, it becomes a feature in its neural net. The same supervised learning goes on, with humans labeling the outcome and the neural net using these new algorithms as additional features on which to train.

How does a search engine handle similar words in a query?

Technologies like word2vec and transformer based machine learning like GPT-3 give us the ability to embed words anywhere within the search engine. Whether that’s determining the meaning of a particular page, or substituting or appending a user’s query with related words, we’re able to use keywords as vectors in everything we do.

Does Google really see JavaScript links?

Most of them. And it doesn’t even require simulating a user click on every one of these links to decode them. Oftentimes we only need to click on one link, and then a transformer-based approach can apply what we learned from that one click to every other link on the site that looks like it.

Will GPT-3 kill off content writing?

GPT-3 will be a challenge for Google and the way it evaluates content. After all these technologies are not inventing content, they are simply taking what was already written and spinning the appropriate content around whatever keywords that you want.

My guess is that there will be an algorithm update that detects GPT-3 and related transformer based generated content and discounts it. But until then, it will wipe out a lot of manual content services. You just can’t get a better ROI on AI generated content.

We know G classifies certain things, such as Local, YMYL, Adult – is it the content, the query or both that get classified?

Both, but they are different in nature. Query classification is really just a “blacklisting” process of certain queries. On the other hand, the YMYL stuff is more of a “raised threshold” for the EAT algorithms (see my answer about how EAT algorithms work).

Google has hundreds of ranking factors. Are they all manually “discovered”, or are some found via algos?

Most of the features of their models are manually created. So the algorithm structures are defined manually, but neural networks fill in the blanks: they figure out how that new feature fits into their existing algorithms through machine learning.

Without semantically evading … does G use implicit feedback from users to alter what position query results have in a SERP – for non-personal results?

I don’t see why they wouldn’t. In fact this area is one of the competitive advantages they have over any other search engine: the database of intent. For instance, if they know that users typically click back to the results after finding a particular result, they know that result isn’t as good as one where they spend more time on site.

How much influence does search-volume for navigational/brand queries play in a site’s ranking (direct/indirect)?

Not as much as people think. Search Engineers try to stay away from things that can be easily manipulated, and search volume is one of those. Anyone could easily write a crawler that generates brand queries from every proxy in the world and artificially pump those queries.

How many individual factors/signals does an inbound link provide?

The citation structure around content has always been critical to Google’s existence. It’s how they took over the search industry from AltaVista 🙂

A link carries with it a specific value (at Market Brew we call this link flow share), which in turn is calculated by hundreds of smaller link family algorithms such as size, placement, editorial-ness, reciprocal-ness, related neighborhood, relevance (which also takes into account the content calculation) and so forth. And this value plays a role in how much sway a link has over the content calculation / classification it links to.

What’s the biggest waste-of-time optimisation?

This changes depending on the SERP. Some optimizations are useless in certain contexts, but critical in others. There are no “fixed” bias/weight settings anymore in the search engine, and these can change whenever they re-run their neural network to train that model again.

Does G still utilise the number of occurrences a word/phrase/string appears (for topicality/relevance, or spam detection etc.)?

I would suspect so. Although it’s not just counting the same words, but counting related words (using word2vec and other technologies). But the penalties have changed: I think there is less weight on this now that they can calculate other things like “expertise” of an article (see my EAT algorithm explanation).

Each year, G seems better at handling synonyms and variants. Does it utilise older approaches (such as lemmatisation or stemming), or things such as collocates, or contextual framing, or is it now based on position in things like Word2Vec/Phrase2Vec etc.?

The query parser / indexer still uses things like stemming, which is important to understand root forms of words, which they can then use to feed into their word2vec or phrase2vec algorithms to develop a vectorized word. For the user / SEO, don’t think that you need to list every variant of a word to get better results, it doesn’t work that way anymore. Once they have the vectorized list of words, there is no need to employ older techniques that overlap.

G says schema usage doesn’t improve rankings, but people show improvements with implementation. Is this semantic evasion by G (different interpretation of “ranking” etc.), or coincidence, or something else?

Schema helps Google format their SERPs easier, but they avoid using easily manipulated things for ranking algorithms. So you’d do schema for things like featured snippets, but it’s not going to trigger a higher ranking score in some content algorithms.

Algo’s are sometimes imperfect – it’s a matter of max-gains/best fit. What threshold does G have for effectivity for approval, and what measures are taken for handling anomalies/false positives?

What you are talking about are hyperparameters in their machine learning process. For instance, they use L2 regularization to keep any specific algorithm from overwhelming the process. False positives are where supervised learning comes into play: there is a human element that they depend on (whether it comes from passive user feedback or their search quality rater guidelines and the labeling of their models) to make sure the SERPs aren’t missing a specific type of landing page / site.
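For readers unfamiliar with L2 regularization, here is the simplest possible case: one feature, no intercept, fitted by least squares with a penalty on the size of the weight. It is a textbook sketch, not a description of Google's training setup. The function name and the `lam` parameter are illustrative.

```python
def ridge_weight(xs, ys, lam):
    """Weight w that minimizes sum((y - w*x)**2) + lam * w**2.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    Assumes the denominator is non-zero.
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + lam
    return numerator / denominator


xs, ys = [1, 2, 3], [2, 4, 6]
unpenalized = ridge_weight(xs, ys, lam=0)   # plain least squares
penalized = ridge_weight(xs, ys, lam=14)    # same data, weight pulled toward zero
```

The penalty shrinks the weight even though the data alone would support a larger one. That is the mechanism that stops a single feature from dominating the model.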

Word count is not a factor. But it is darn hard to get short content to rank (even if 100% matching the query), unless you have a ton of links. What should content producers do:

a) generate links
b) build up internal topicality/links
c) combine pages
d) other?

What you are seeing are the results of many different algorithms coming together. For instance, you can mask certain penalties by overwhelming dominance with other algorithms.

Given a specific backlink structure, a site can only rank a certain number of pages. A site like Walmart or Home Depot, for instance, can get away with weak content because of their huge backlink structure; whereas a site without all of those backlinks needs to focus more on their internal structure and content. We call this “death by a thousand pages” which means you don’t want to spread your site’s content too “thin” for its given backlink structure. Consolidating landing pages is an option (canonicalization can get you there sometimes).

Aside from factors such as Titles and internal links/value flow, what do many legitimate/good quality sites do wrong that allows them to be outranked by spammers/low-quality content?

This is probably the most pressing issue for search engineers. We call this a “gap in the model” that allows such things. There is some feature that isn’t defined well enough to distinguish between these sites in this circumstance.

Even though you said “aside from”, I suspect most of the separation comes from what we call link flow distribution, which is defined by where that link flow goes once it enters the local graph (subdomain). Most of why spamming works these days is because the site has such a good linking structure (both internal and external).

Spamming in the near future will mostly entail finding the “blind spot” in technologies like GPT-3, confusing Google’s algorithms which determine which site is actually the OG content / authority.

Google “remembers” things. In some cases, a new site will suffer the consequences of an older site’s actions due to the domain name.

a) Is there an easy way to discern this for business owners?

b) Is there any way to correct it for those that couldn’t spot the issue?

I think the days of “sandboxes” are long gone. Crawling / indexing systems are very smart about throttling / resource utilization and can throttle up or down very easily. So don’t worry about getting stuck in some supplemental or sandbox situation. Google’s crawlers will crawl the new content, and while it may be initially slow, it will throttle back up once it determines the site is different.

What’s your favourite part of all the algo’s, and why?

I love how the PageRank calculation has evolved over the years from a simple pass-through link value to what it is now: a huge list of link algorithms that sometimes have a recursive nature to them (link relevancy depends on target content, but target content can depend on that same link value).

The fact that the general eigenvector matrix concept has not changed is remarkable. Of course there are better link algorithms today due to technologies that have developed over the years, but the brilliance of the PageRank idea still stands in some form today.

I’m also very excited about vectorized words and cosine similarities and how they’ve eliminated much of the hard-coded NLP graph.
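Cosine similarity is simple enough to show in a few lines. This sketch uses plain Python lists as stand-ins for word vectors. The example vectors are made up and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "word vectors" for illustration only.
sneaker = [0.9, 0.1, 0.3]
trainer = [0.8, 0.2, 0.4]
mortgage = [0.1, 0.9, 0.0]

close = cosine_similarity(sneaker, trainer)
far = cosine_similarity(sneaker, mortgage)
```

Words used in similar contexts end up pointing in similar directions, so `close` comes out higher than `far` without any hand-written synonym list.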

What’s your least favourite approach that you know of in the system, and why? If there was one part of the entire system you could ditch and burn, what would it be, and what would you replace it with?

Adding tags like “nofollow” was so shortsighted. Anything that can be easily and directly manipulated is an enemy of the state for a search engineer. Google already burned that house down though, and hopefully they’ve learned that lesson. AuthorRank is another one: thinking that the industry wouldn’t manipulate this was naive.

If Google is attempting to automatically identify content as originating from an “expert” – is it relying on lexical cues … and if so, how does it compensate for people emulating/copying/simply using specific terms/phrases/syntactic structures?

Expertise is really about identifying the vectorized word graph of a grouping of content (page) and determining how complete that graph is. Technologies like CBOW (continuous bag of words) make this pretty easy.

The hard part is what you allude to: how do search engineers distinguish the expert if things like GPT-3 can easily fill in the graph?

There’s no “Author Rank”, and G doesn’t alter rankings of content based on author. If so – why does G identify/associate authors with content – what’s the purpose/point?

This goes back to schema. It’s really a schema thing at the moment. So it helps them with markup on their suggestion system, but they won’t use it for ranking purposes due to the extreme manipulation that can occur.

G may return results for a query that are “personalised”, based on various factors, including historic. Does G utilise personalised results and potentially alter non-personal results based on things like volume/consistency etc. of personalised results?

This can be highly manipulated, so leaking over to non-personal results is highly unlikely (and unnecessary).

What is the most common “innocent” spam that sites tend to accidentally trip on/commit, (Not real spam, just a stupid mistake that often leads to false positives)?

Things like advertorial or paid link classifications. It’s easy to do if you are a content writer and don’t understand link structures.

Canonical Links are suggestions – G gets to decide whether to heed them. This suggests there are conditions to be met, such as degree of similarity.

a) Is the consolidation binary or graduated?

b) Is there a set degree of similarity required?

c) Is the value 100%?

Yes there is a set degree of similarity required. This prevents artificial manipulation of this feature. No, it is below 100%, most likely above 90% though. We use 95% in our search engine.

How much “learning” does G systems do based on web data from normal/business sites (rather than things like wikipedia etc.)?

All content is used for the training process. The more the merrier, it allows the model to avoid over optimization on a particular subset of content. That being said, the labeling process can be swayed towards one particular group of sites as examples of “good” or “bad”.

We know G associates “terms” with “pages” (urls), does it do the same on a wider level (terms and domain, subdomain etc.)?

Depends on the need for the granularity of site-wide algorithms. A DomainRank type of algorithm that makes a more granular eigenvector matrix could call for such a thing.

Thank you for giving SEO and search enthusiasts like me a platform to ask a search engineer questions directly. I’ve always been curious about how Google works and have always wanted to learn more.

My question for you is:

How come links from very obvious spam pages help rankings, especially in foreign-language SERPs? I’ve seen time and again that very obvious spam links can still help with ranking, even though Google has moved to a better model of PageRank such as topic-sensitive PageRank. With the advancement in machine learning, shouldn’t these obvious spam pages pass no link equity? I suppose it shouldn’t be difficult at all for the machine to get trained on spam samples and easily identify these types of pages. I also assume language shouldn’t be much of a problem in understanding the topic of the linking and target pages, as this should easily be handled by vectors. Also, some of the spam features are so obvious that I don’t see why language would even be needed to identify them.

What’s your thought on this?

Thank you.

Regards,
Jackie

Jackie,

Thanks for the enthusiasm! We need more people like you to ask questions.

It’s a very good question why Google, still today, is having problems with spammy sites / links. The quick answer lies in the fact that Google decided, some time ago, to spite the SEO product and force everyone to AdWords. They obfuscated access to business, and left the deciphering to MIT / Stanford / Carnegie Mellon folk like me. In turn, the only sites that could take advantage of the knowledge of how a search engine really works, were the spammers themselves. The long answer has so many link graph considerations that many of these are performance / scalability trade-offs.

The link graph has nodes (pages) and edges (links). The original PageRank calculation simply considered how possible it was for a random web-surfer to reach a given node during their surfing, given any entry point. This was good for a while since there were few people at that time that could even catch on to the fact that they could take advantage of this complex Eigenvector-based calculation. I was one of them, so I know.
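The random-surfer idea can be sketched as a short power iteration over a toy link graph. This is the textbook form of the calculation with the commonly cited damping factor of 0.85, not whatever Google runs today. The graph and function name are illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to; returns page -> rank."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a baseline share from the surfer jumping at random.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph page "c" ends up with the highest rank because both other pages link to it, and the ranks always sum to 1.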

As time marched forward, search engines started adding additional parameters to the calculation. Things like reciprocity, anchor text relevancy, link neighborhood effects and more. All of these factors influence links today, including even mini-PageRank calculations that are done on topic clusters and re-integrated back into the parent calculation.

All of this to say that people like me are always going to be a step ahead since we can decipher what the hell is even going on. For this very reason, I created Market Brew to provide this same access to the SEO teams out there that wanted it. Market Brew is a search engine model that shows website owners how the search engine sees the site from individual link calculations, to knowledge based graph calculations, to semantic entity graphs, to link graphs, all the way up to the sub-domain link graph (often referred to as “domain rank”), to duplicate content correlation, to … well you get the point. The Market Brew models use an optimization algorithm called Particle Swarm Optimization to allow us to inject any algorithm into the mix and see how well it correlates with the SERP.
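For readers who have not seen Particle Swarm Optimization before, here is a minimal textbook version that searches for feature weights matching a set of observed scores. It is a generic sketch and not Market Brew's implementation. The page data, the two features, and every parameter value are hypothetical.

```python
import random


def pso(objective, dim, bounds=(0.0, 1.0), particles=30, steps=200,
        inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """Minimize objective(list_of_floats) with a basic particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]               # each particle's best position
    best_val = [objective(p) for p in pos]
    g = min(range(particles), key=lambda i: best_val[i])
    swarm_pos, swarm_val = best_pos[g][:], best_val[g]  # best seen by the swarm
    for _ in range(steps):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the particle's own best and the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + cognitive * r1 * (best_pos[i][d] - pos[i][d])
                             + social * r2 * (swarm_pos[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < swarm_val:
                    swarm_pos, swarm_val = pos[i][:], val
    return swarm_pos, swarm_val


# Hypothetical data: two feature scores per page plus an observed score.
PAGES = [([0.9, 0.2], 0.69), ([0.4, 0.8], 0.52),
         ([0.1, 0.5], 0.22), ([0.7, 0.7], 0.70)]


def model_error(weights):
    """Squared error between weighted feature scores and observed scores."""
    return sum((sum(w * f for w, f in zip(weights, feats)) - observed) ** 2
               for feats, observed in PAGES)


weights, error = pso(model_error, dim=2)
```

The observed scores were generated with weights of 0.7 and 0.3, so the swarm should settle close to those values. Swapping in a new feature means adding a dimension and re-running the search.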

The point of this being that Google likely took similar steps to address the loopholes that continuously are being probed. In doing so, what gets created is a very complex graph of factors that determine the final ranking position. By viewing them in such a search engine model, you might find that the site that you had thought was winning because of its spamming link graph is actually winning because of an entirely different part of the search engine model that it dominates in.

Currently, it IS EASY to train what the SPAM is and what it isn’t. The training isn’t the problem. It’s the sample data. It keeps changing. Back in 2006 when my software began making millions of dollars for hotel brands, we were changing our algorithms every month. We had to: my college friends that went to Google were having fun shutting us down every month only to find that a new exploit of their algorithm was up and running already. This cat and mouse game went on for many years and still goes on today, to a much lesser extent.

For instance, right now, since Google switched some of its algorithms over to detect spammy link graphs, it is reliant on characterizing pages correctly via their knowledge graph / semantic entity graph. But some languages just aren’t built out yet. Much of Google’s knowledge graph depends on how much data we have in Wikipedia / WikiData as a collective Internet. Because some languages are not as built out yet, there can be an exploit whereby the spammy link graph that is built in that language may not get the same punishing blow as other link graphs that are in another language.

It’s me again, Jackie 🙂 Hope you don’t mind answering my other question

Anchor text ratio – something which many SEOs are obsessed with. Do you think search engines look at the anchor text ratio distribution for each keyword and is it something we should keep in mind to avoid certain link-based penalties? My hunch is it doesn’t matter, as long as you’re getting links from high-quality sites and that there are other factors that help justify the occurrence of the links (for instance, if your content goes viral, you might end up with thousands of links and keyword-rich anchor but that you’re also receiving traffic, social shares, etc.) (sort of like, multiple algorithms running together to check the validity of the links)

But I also believe it’s something we should take a look at our competitors to know what anchor texts are required to go against them.

What’s your thought on this? And what does Market Brew look at in this situation?

Thank you in advance. Really appreciate your help!

Regards,
Jackie

Hey Jackie!

Yes, anchor text ratio is a signal in a number of areas in the search engine. At the lower levels, each link is scored based on link characteristics, including anchor text. For instance, in Market Brew’s models, we dampen the Link Flow Share (think of this metric as how powerful a link is) of a link whenever the anchor text is replicated (either on the same page or somewhere else linking to the same page that the link is linking to).

A more complex use-case is when search engines try to determine the meaning of a page. Search engines will often use the link structure around the content (specifically the Incoming Link Flow Share by keyword / entity) to help improve the signal-to-noise ratio in their content algorithms. If you have a wider dispersion of anchor text, you can (as an SEO) control the “meaning” of the page more precisely than if you just had the same anchor text coming in from every page.

Now, having tons of anchor text that is the same isn’t necessarily a bad thing: link penalties are zero-sum, so that Link Flow Share just goes to other links on the page.
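The dampening and the zero-sum behavior can be pictured with a toy calculation. This sketch only looks at repeated anchor text among the links on a single page, whereas the answer above also counts repeats elsewhere that point to the same target. The `penalty` factor of 0.5 and the function name are made-up assumptions, not Market Brew's actual values.

```python
from collections import Counter


def link_flow_shares(anchors, penalty=0.5):
    """Return each link's share of the page's link flow; shares sum to 1.

    anchors is the list of anchor texts for the links on one page.
    Links whose anchor text is repeated get their raw weight multiplied
    by `penalty` before the weights are normalized.
    """
    counts = Counter(anchor.lower() for anchor in anchors)
    raw = [penalty if counts[anchor.lower()] > 1 else 1.0 for anchor in anchors]
    total = sum(raw)
    return [weight / total for weight in raw]


shares = link_flow_shares(["buy shoes", "buy shoes", "about us"])
```

The two repeated links end up with a quarter of the flow each, and the half they lost goes to the remaining link rather than disappearing.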

Market Brew has a product called the Anchor Text Finder, which allows users to traverse an entire Anchor Text Graph between the target sites and their competitor sites, to see the dispersion and distribution of anchor text for all links. This can be found on each Analysis Group screen under the data dropdown for the Websites. You can, as you mentioned, inspect what the topic clusters are by sorting by “Associated Link Flow Share” for each anchor text, and then you will see how your industry goes after particular keywords at the 10,000 ft level.

Hi Scott,

It’s me again with another question 🙂

I’m wondering if Google is able to track online mentions of every entity (brand, author name, etc.).

I know they have a knowledge graph with information on well-known or notable entities, but for smaller brands or individuals, do they also have some kind of sub knowledge graph for every website?

Say my brand is ABC, and I’m mentioned in some local newspaper, have my profiles set up, etc., and am associated with other properties like phone number, topics, etc. Does Google know my various data points and construct a sort of mini knowledge graph about my brand/website?

Or even author profiles, for instance. Do they keep track of every author on the web and their information, excluding the well-known entities already in the knowledge graph/knowledge panel?

Or do they, but not as a knowledge graph, rather as a sort of vector representation? What could be the technology or technical implementation behind this?

Thanks a lot,
Jackie

Hey there!

We don’t know for sure, but all signs point to Google using Wikidata, the structured version of Wikipedia.

So, in order for it to recognize the entity, it needs to have an entry in there.

I watched some of Market Brew’s videos, and you mentioned that in Market Brew you measure the E factor in EAT by looking at the word graph coverage on the page.

Does Market Brew also look at the author/brand behind the content and try to determine the author’s/brand’s authority as well? If so, how do you keep track of this?

What you are talking about is author rank, which was introduced by a number of SEO tools 10 years ago, when it was suspected that Google was also doing something along these lines. They have since discontinued any such thing.

It has been and continues to be Market Brew’s opinion that there is no such thing as author rank; however, the topic cluster is definitely shaped by the overall citation structure around each Wikipedia entry. So some entities exhibit a gravitational force with other entities. One way that we can determine a topic cluster is to determine which of the entities share the strongest bonds in each page or paragraph. Those entities that are not related get demoted and those that are closely related get promoted.
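One simple way to picture "bonds" between entities is to count how often two entities appear in the same paragraph. Co-occurrence counting is only a stand-in here, since the answer above ties bond strength to the citation structure around each Wikipedia entry. The entity lists and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def entity_bonds(paragraph_entities):
    """Count how many paragraphs each pair of entities shares."""
    bonds = Counter()
    for entities in paragraph_entities:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(entities)), 2):
            bonds[pair] += 1
    return bonds


# Hypothetical entities extracted from three paragraphs of one page.
paragraphs = [
    ["espresso", "grinder"],
    ["espresso", "grinder", "tamper"],
    ["espresso", "weather"],
]

bonds = entity_bonds(paragraphs)
strongest_pair, strength = bonds.most_common(1)[0]
```

Pairs with high counts would be promoted into the topic cluster, while an entity like "weather" that bonds weakly with everything else would be demoted.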