Ever wanted to ask a Google Search Engineer something but knew they wouldn’t respond?
Well I will!
There are no stupid questions. I’ll keep your information anonymous in my responses, unless you specifically want attribution.
How much duplicate content is acceptable?
I think people get confused when they hear duplicate content isn’t bad, especially from engineers at Google. What they mean is that a certain amount isn’t bad, and that amount is typically around 30% to 40%. This accounts for shared components like headers and footers. So when we write algorithms to create duplicate content penalties, they typically won’t kick in until pages exceed roughly 50% correlation. Of course, this correlation algorithm is a whole topic on its own, especially with word embeddings.
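To make that concrete, here’s a minimal sketch in Python of a shingle-based correlation check. The 5-word shingles and the 50% cutoff are illustrative assumptions, not the exact algorithm:

```python
def shingles(text, k=5):
    """Break text into overlapping k-word shingles (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def duplicate_ratio(page_a, page_b, k=5):
    """Jaccard similarity of the two pages' shingle sets (0.0 - 1.0)."""
    a, b = shingles(page_a, k), shingles(page_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical penalty gate: shared headers/footers keep most page pairs
# in the 30-40% range; only flag pairs above roughly 50% correlation.
DUPLICATE_THRESHOLD = 0.5

def is_duplicate(page_a, page_b):
    return duplicate_ratio(page_a, page_b) > DUPLICATE_THRESHOLD
```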
Do I have a chance to outrank blog pages or comparison sites ranking for queries with a product page if the intent is mixed, or only the competitor product pages ranking?
I get why you ask this question: comparison sites are a big issue with Google’s algorithm, which is having a hard time distinguishing who the real authority should be. If you have a chance, you really have to explore the statistical gaps between your landing page and these comparison site pages. If you can clearly see that comparison sites are dominating a SERP and there is original OEM content that isn’t ranking above them, that’s probably due to the EAT algorithms, which look for completeness of the word graph. Given a set of words, are there other words that should appear alongside them (this is one of the ways we determine what expertise means)? How big is that gap? These are the questions you need to answer to give yourself a chance.
Can I really hide content in tabs for mobile or will that be valued less like it was when desktop mattered?
I wouldn’t bet on it. To a search engine the tabs are all expanded / enabled.
How many Google sub-algorithms are there really?
Thousands. And a lot of it is a chicken-and-egg problem. For instance, how do you determine the meaning of a page? You may first need to crawl all the links that point to that page, and based on that citation structure adjust the meaning derived from the existing content. But part of link scoring, determining how powerful a link is, may also look at the relevancy of that link: how relevant the anchor text is to the target page. To do that, you have to have already calculated the original meaning of the page. But to do that, you need the link scoring… This gives many algorithms a recursive nature, where one algorithm’s inputs depend on another algorithm’s outputs. So calculating just one area may involve hundreds of sub-scores, even if it’s just rotating between a given set of algorithms.
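Here’s a toy sketch of how that recursion can be resolved in practice: iterate between the two calculations until they stabilize. The graph, scores, and update rules below are made up for illustration:

```python
# Toy graph: each link is (source, target, anchor_relevance_guess).
links = [("A", "B", 0.8), ("B", "C", 0.6), ("C", "A", 0.4)]
pages = {"A", "B", "C"}

# Start with a neutral relevance score for every page.
relevance = {p: 1.0 for p in pages}

for _ in range(20):  # iterate until the scores settle
    # 1) Link scores depend on the current relevance of the target page.
    link_score = {
        (src, dst, anchor): anchor * relevance[dst]
        for (src, dst, anchor) in links
    }
    # 2) Page relevance depends on the scores of links pointing at it.
    new_relevance = {p: 0.1 for p in pages}  # small base score
    for (src, dst, anchor), score in link_score.items():
        new_relevance[dst] += score
    # Normalize so the scores do not blow up across iterations.
    total = sum(new_relevance.values())
    relevance = {p: v / total for p, v in new_relevance.items()}

print(relevance)  # converges to a stable relative scoring of A, B, C
```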
How many SEO factors actually matter vs the hundreds of small tweaks that barely make a difference?
It really depends on the competitiveness of the environment. At the more competitive levels, instead of a kitchen-sink approach, it’s best to prioritize based on statistical gaps between you and the outperformer in that part of the search engine. It’s why you’re seeing all these AI SEO tools pop up: they’re all trying to solve that problem.
It’s likely that the majority of algorithms will do nothing to move the needle and only a small subset will actually make a difference. There are also different ROIs for different optimizations, so that has to be taken into account as well. There’s a perfect mixture that can be calculated with things like search engine models.
Are there interaction effects with factors that only work when combined together?
Yes. Many times you’re trying to solve a problem with multiple levers, and some of those levers are affected by the others. Case in point: let’s say you’re pushing more internal linking to a competitive landing page, but you haven’t finished optimizing the content for that landing page. You can inadvertently reduce overall traffic to the site, because what you’ve done is sacrifice a potentially good landing page’s link flow distribution for a currently bad landing page’s link flow distribution. It’s the link flow distribution plus the content optimization that needs to happen in parallel for it to work. It’s always best to work on the content algorithms first and get those correct before you start adjusting the internal link structure. There are countless other symbiotic relationships between groups of algorithms that work in a similar way.
How often do you see the query intent shift from a transactional term to an informational one (and vice versa)?
The search engine that I’ve been building at Market Brew for the past 15 years isn’t fielding millions of different unrelated queries, and therefore it doesn’t have the same problem to solve as Google. Market Brew’s search engine is designed to emulate another search engine, so typically the keywords are strategically picked to evaluate a specific SERP, and the intent is already known.
That being said, NLP is a key technology here. Being able to complete the user’s query typically gives us the answer on intent. We use things like skip-gram neural nets to predict what the rest of the query might be, which can internally affect the way the search engine perceives the intent.
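As a rough illustration (assuming you have a log of past queries to train on), a skip-gram model can suggest the likely rest of a partial query. The corpus and parameters below are made up:

```python
from gensim.models import Word2Vec

# Hypothetical mini-corpus of past queries, tokenized.
query_log = [
    ["best", "running", "shoes", "for", "flat", "feet"],
    ["running", "shoes", "reviews"],
    ["buy", "running", "shoes", "online"],
    ["trail", "running", "shoes", "for", "beginners"],
]

# sg=1 -> skip-gram; negative sampling is required for predict_output_word.
model = Word2Vec(query_log, vector_size=32, window=3,
                 sg=1, negative=5, min_count=1, epochs=50)

# Given a partial query, predict the most likely missing terms.
partial = ["running", "shoes"]
print(model.predict_output_word(partial, topn=3))
```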
What’s an optimization often missed by SEOs that plays a part we should optimize for?
Link Flow Distribution. The larger, more successful sites are the biggest culprits: they mask the issue with sheer backlink power. One thing to remember is that only a certain number of landing pages can be supported by a given amount of backlink power. Sites like Home Depot and Walmart can get away with flat link flow distributions, whereas a smaller site can’t.
The good thing is that this leaves the door open for sites with smaller backlink profiles to capture market share by simply optimizing their internal linking structure, which is an order of magnitude easier than obtaining backlinks. You can often achieve 3x to 4x equivalent backlink power simply by readjusting and promoting landing pages up the distribution. These two facts combine to produce a very high ROI.
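A hedged sketch of the effect, using networkx’s PageRank as a stand-in for link flow (not our exact calculation): adding internal links from supporting pages shifts flow toward the priority landing page without a single new backlink.

```python
import networkx as nx

def link_flow(edges):
    """PageRank over an internal link graph as a proxy for link flow."""
    g = nx.DiGraph(edges)
    return nx.pagerank(g, alpha=0.85)

# Before: the money page only gets one internal link (from the homepage).
before = [("home", "blog1"), ("home", "blog2"), ("home", "money-page"),
          ("blog1", "blog2"), ("blog2", "blog1")]

# After: the blog posts now also link to the money page.
after = before + [("blog1", "money-page"), ("blog2", "money-page")]

print("before:", round(link_flow(before)["money-page"], 3))
print("after: ", round(link_flow(after)["money-page"], 3))
```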
What changes should I make for the Google core updates?
I think everybody knows the guidelines for the Core Web Vitals, but the more important question is: does it make a difference? You need tools that can calculate these metrics not only on your site but on all of your competitors’ sites, at scale. Then use modeling techniques to understand whether there are any significant gaps between you and the outperformers in the model.
When you find the outperformer, copy them. When I say copy I mean UI implementation, UI frameworks, JavaScript libraries et al. All these contribute to the end user experience which the core web vitals are trying to optimize.
Are there good tools that closely approximate parts of Google’s algo (i.e. Clearscope grade, Ahrefs links)?
I won’t get into promoting specific tools because the industry really doesn’t like that and everybody is super competitive with each other, but I think it’s important to have an organic approach: you’re trying to approximate a search engine. Google turned to machine learning, so why shouldn’t you, to learn how that same machine-learning search engine works? The data providers are necessary for this to work, but not sufficient to solve the puzzle.
How does the EAT algorithm work?
As Google’s own engineers have explained, it’s made up of a bunch of mini algorithms. How do you define expertise? Elon Musk is always talking about the best way to interview an employee: if somebody says they led a project, ask them about the details. If they really ran the project, they’ll know every single detail; if they were just part of the project, they won’t. The same idea applies to this algorithm, with the help of tools like CBOW (continuous bag of words), where you can take a group of words and determine how complete the word graph is for any given page. The more complete the graph, the more expertise.
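A back-of-the-napkin sketch of the completeness idea; the topic neighborhood below is hand-made for illustration, where a real system would pull it from a CBOW / word2vec model trained on a large corpus:

```python
import re

def expertise_coverage(page_text, expected_terms):
    """Fraction of topic-neighborhood terms the page actually covers."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    covered = {t for t in expected_terms if t in words}
    return len(covered) / len(expected_terms)

# Hypothetical neighborhood for the topic "mortgage refinancing".
expected = {"rate", "apr", "escrow", "closing", "equity",
            "appraisal", "lender", "amortization"}

thin_page = "refinance your mortgage today for a great rate"
deep_page = ("refinancing swaps your lender and rate; expect an appraisal, "
             "closing costs, escrow changes and a new amortization schedule "
             "based on your equity and apr")

print(expertise_coverage(thin_page, expected))  # low coverage
print(expertise_coverage(deep_page, expected))  # high coverage
```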
How does a search engine incorporate new algorithms like Core Web Vitals?
Like any other new algorithm, it becomes a feature in its neural net. The same supervised learning goes on, with humans labeling the outcome and the neural net using these new algorithms as additional features on which to train.
How does a search engine handle similar words in a query?
Technologies like word2vec and transformer-based machine learning like GPT-3 give us the ability to embed words anywhere within the search engine. Whether that’s determining the meaning of a particular page, or substituting related words into a user’s query or appending them to it, we’re able to use keywords as vectors in everything we do.
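For illustration, here’s what query expansion via cosine similarity looks like; the toy 4-dimensional vectors are made up, where real embeddings have hundreds of dimensions:

```python
import numpy as np

# Hypothetical embeddings; a real system would load trained word2vec vectors.
vectors = {
    "cheap":      np.array([0.9, 0.1, 0.0, 0.2]),
    "affordable": np.array([0.8, 0.2, 0.1, 0.3]),
    "luxury":     np.array([0.1, 0.9, 0.7, 0.0]),
    "budget":     np.array([0.85, 0.15, 0.05, 0.25]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_query(term, top_n=2):
    """Return the closest vocabulary terms to use as query expansions."""
    scores = {w: cosine(vectors[term], v)
              for w, v in vectors.items() if w != term}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(expand_query("cheap"))  # -> ['budget', 'affordable']
```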
Does Google really see JavaScript links?
Most of them. And it doesn’t even require simulating a user click on every link to decode them. Often we only need to click on one link, and then a transformer-based approach lets us apply what we learned from that one click to every other link on the site that looks like it.
Will GPT-3 kill off content writing?
GPT-3 will be a challenge for Google and the way it evaluates content. After all, these technologies are not inventing content; they are simply taking what was already written and spinning the appropriate content around whatever keywords you want.
My guess is that there will be an algorithm update that detects GPT-3 and related transformer-based generated content and discounts it. But until then, it will wipe out a lot of manual content services. You just can’t beat the ROI of AI-generated content.
We know G classifies certain things, such as Local, YMYL, Adult – is it the content, the query or both that get classified?
Both, but they are different in nature. Query classification is really just a “blacklisting” process of certain queries. On the other hand, the YMYL stuff is more of a “raised threshold” for the EAT algorithms (see my answer about how EAT algorithms work).
Google has hundreds of ranking factors. Are they all manually “discovered”, or are some found via algos?
Most of the features of their models are manually created. So the algorithm structures are defined manually, but neural networks fill in the blanks: they figure out how that new feature fits into their existing algorithms through machine learning.
Without semantically evading … does G use implicit feedback from users to alter what position query results have in a SERP – for non-personal results?
I don’t see why they wouldn’t. In fact, this area is one of the competitive advantages they have over any other search engine: the database of intent. For instance, if they know that users typically click back to the results after visiting a particular result, they know that result isn’t as good as one where users spend more time on site.
How much influence does search-volume for navigational/brand queries play in a sites ranking (direct/indirect)?
Not as much as people think. Search engineers try to stay away from things that can be easily manipulated, and search volume is one of those. Anyone could write a crawler that generates brand queries from every proxy in the world and artificially inflates that volume.
How many individual factors/signals does an inbound link provide?
The citation structure around content has always been critical to Google’s existence. It’s how they took over the search industry from AltaVista 🙂
A link carries with it a specific value (at Market Brew we call this link flow share), which in turn is calculated by hundreds of smaller link family algorithms such as size, placement, editorial-ness, reciprocal-ness, related neighborhood, relevance (which also takes into account the content calculation) and so forth. And this value plays a role in how much sway a link has over the content calculation / classification it links to.
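To sketch the composition (the sub-score names and weights below are illustrative placeholders, not our actual link family algorithms), each link’s flow share is essentially a weighted blend of its sub-scores:

```python
# Illustrative weights for a handful of link sub-scores (each 0.0 - 1.0).
WEIGHTS = {
    "size": 0.10,           # rendered size / prominence of the link
    "placement": 0.20,      # main content vs. footer/sidebar
    "editorialness": 0.25,  # does it look editorially given?
    "reciprocal": 0.10,     # penalize obvious link swaps
    "neighborhood": 0.15,   # quality of the linking neighborhood
    "relevance": 0.20,      # anchor/target content relevance
}

def link_flow_share(sub_scores):
    """Blend the sub-scores into one value that scales the link's flow."""
    return sum(WEIGHTS[name] * sub_scores.get(name, 0.0) for name in WEIGHTS)

example_link = {"size": 0.7, "placement": 0.9, "editorialness": 0.8,
                "reciprocal": 1.0, "neighborhood": 0.6, "relevance": 0.75}
print(round(link_flow_share(example_link), 3))  # -> 0.79
```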
What’s the biggest waste-of-time optimisation?
This changes depending on the SERP. Some optimizations are useless in certain contexts, but critical in others. There are no “fixed” bias/weight settings anymore in the search engine, and these can change whenever they re-run their neural network to train that model again.
Does G still utilise the number of occurrences a word/phrase/string appears (for topicality/relevance, or spam detection etc.)?
I would suspect so, although it’s not just counting the same words but also counting related words (using word2vec and other technologies). But the penalties have changed: I think there is less weight on this now that they can calculate other things, like the “expertise” of an article (see my EAT algorithm explanation).
Each year, G seems better at handling synonyms and variants. Does it utilise older approaches (such as lemmatisation or stemming), or things such as collocates, or contextual framing, or is it now based on position in things like Word2Vec/Phrase2Vec etc.?
The query parser / indexer still uses things like stemming, which is important for understanding the root forms of words; those roots can then feed into the word2vec or phrase2vec algorithms to develop a vectorized word. For the user / SEO: don’t think you need to list every variant of a word to get better results; it doesn’t work that way anymore. Once they have the vectorized list of words, there is no need to employ older techniques that overlap.
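For example (using NLTK’s Porter stemmer as one common choice, purely for illustration), surface variants collapse to a root form before any vectorization happens, so listing every variant on a page buys you nothing:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
variants = ["connected", "connecting", "connection", "connections"]

# Surface forms collapse to one root before any vectorization happens,
# so the index treats the whole family as a single term.
print({w: stemmer.stem(w) for w in variants})
# -> {'connected': 'connect', 'connecting': 'connect',
#     'connection': 'connect', 'connections': 'connect'}
```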
G says schema usage doesn’t improve rankings, but people show improvements with implementation. Is this semantic evasion by G (different interpretation of “ranking” etc.), or coincidence, or something else?
Schema helps Google format their SERPs easier, but they avoid using easily manipulated things for ranking algorithms. So you’d do schema for things like featured snippets, but it’s not going to trigger a higher ranking score in some content algorithms.
Algo’s are sometimes imperfect – it’s a matter of max-gains/best fit. What threshold does G have for effectivity for approval, and what measures are taken for handling anomalies/false positives?
What you are talking about are hyperparameters in their machine learning process. For instance, they use L2 regularization to prevent any specific algorithm from overwhelming the process. False positives are where supervised learning comes into play: there is a human element that they depend on (whether it comes from passive user feedback or their search quality guidelines and the labeling of their models) to make sure the SERPs aren’t missing a specific type of landing page / site.
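As a generic illustration of the regularization point (scikit-learn here purely as a stand-in; Google’s internal tooling is obviously different), an L2 penalty keeps any single feature’s weight from overwhelming the model:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# 200 hypothetical pages, 5 ranking features; one feature dominates the label.
X = rng.normal(size=(200, 5))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

weak_l2 = Ridge(alpha=0.01).fit(X, y)     # barely regularized
strong_l2 = Ridge(alpha=100.0).fit(X, y)  # heavily regularized

# The dominant feature's coefficient shrinks as the L2 penalty grows,
# which is the "no single algorithm overwhelms the process" effect.
print(weak_l2.coef_[0], strong_l2.coef_[0])
```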
Word count is not a factor. But it is darn hard to get short content to rank (even if it matches the query 100%), unless you have a ton of links. What should content producers do:
- a) generate links
- b) build up internal topicality/links
- c) combine pages
- d) other?
What you are seeing are the results of many different algorithms coming together. For instance, you can mask certain penalties by overwhelming dominance with other algorithms.
Given a specific backlink structure, a site can only rank a certain number of pages. A site like Walmart or Home Depot, for instance, can get away with weak content because of its huge backlink structure, whereas a site without all of those backlinks needs to focus more on its internal structure and content. We call this “death by a thousand pages”: you don’t want to spread your site’s content too “thin” for its given backlink structure. Consolidating landing pages is an option (canonicalization can get you there sometimes).
Aside from factors such as Titles and internal links/value flow, what do many legitimate/good-quality sites do wrong that allows them to be outranked by spammers/low-quality content?
This is probably the most pressing issue for search engineers. We call this a “gap in the model” that allows such things. There is some feature that isn’t defined well enough to distinguish between these sites in this circumstance.
Even though you said “aside from”, I suspect most of the separation comes from what we call link flow distribution, which is defined by where that link flow goes once it enters the local graph (subdomain). Most of the reason spamming works these days is that the spamming site has such a good linking structure (both internal and external).
Spamming in the near future will mostly entail finding the “blind spot” in technologies like GPT-3, confusing Google’s algorithms which determine which site is actually the OG content / authority.
Google “remembers” things. In some cases, a new site will suffer the consequences of an older site’s actions because of the domain name.
a) Is there an easy way to discern this for business owners?
b) Is there any way to correct it for those that couldn’t spot the issue?
I think the days of “sandboxes” are long gone. Crawling / indexing systems are very smart about throttling / resource utilization and can throttle up or down very easily. So don’t worry about getting stuck in some supplemental or sandbox situation. Google’s crawlers will crawl the new content, and while it may be initially slow, it will throttle back up once it determines the site is different.
What’s your favourite part of all the algos, and why?
I love how the PageRank calculation has evolved over the years from a simple pass-through link value to what it is now: a huge list of link algorithms that sometimes have a recursive nature to them (link relevancy depends on target content, but target content can depend on that same link value).
The fact that the general eigenvector matrix concept has not changed is remarkable. Of course there are better link algorithms today due to technologies that have developed over the years, but the brilliance of the PageRank idea still stands in some form today.
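For the curious, the eigenvector idea reduces to a few lines of power iteration; this is a textbook sketch, not any production PageRank:

```python
import numpy as np

# Column-stochastic matrix: entry [i, j] is the share of page j's
# link value that flows to page i.
M = np.array([
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

d = 0.85                 # damping factor from the original paper
n = M.shape[0]
r = np.full(n, 1.0 / n)  # start with uniform rank

for _ in range(100):     # power iteration converges quickly
    r = (1 - d) / n + d * M @ r

print(r)  # the dominant eigenvector, i.e. the PageRank scores
```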
I’m also very excited about vectorized words and cosine similarities and how they’ve eliminated much of the hard-coded NLP graph.
What’s your least favourite approach that you know of in the system, and why? If there was one part of the entire system you could ditch and burn, what would it be, and what would you replace it with?
Adding tags like “nofollow” was so shortsighted. Anything that can be easily and directly manipulated is an enemy of the state for a search engineer. Google already burned that house down though, and hopefully they’ve learned that lesson. AuthorRank is another one: thinking that the industry wouldn’t manipulate this was naive.
If Google is attempting to automatically identify content as originating from an “expert” – is it relying on lexical cues … and if so, how does it compensate for people emulating/copying/simply using specific terms/phrases/syntactic structures?
Expertise is really about identifying the vectorized word graph of a grouping of content (a page) and determining how complete that graph is. Technologies like CBOW (continuous bag of words) make this pretty easy.
The hard part is what you allude to: how do search engineers distinguish the expert if things like GPT-3 can easily fill in the graph?
There’s no “Author Rank”, and G doesn’t alter rankings of content based on author. If so – why does G identify/associate authors with content – what’s the purpose/point?
This goes back to schema. It’s really a schema thing at the moment. So it helps them with markup on their suggestion system, but they won’t use it for ranking purposes due to the extreme manipulation that can occur.
G may return results for a query that are “personalised”, based on various factors, including historic. Does G utilise personalised results and potentially alter non-personal results based on things like volume/consistency etc. of personalised results?
This can be highly manipulated, so leaking over to non-personal results is highly unlikely (and unnecessary).
What is the most common “innocent” spam that sites tend to accidentally trip on/commit, (Not real spam, just a stupid mistake that often leads to false positives)?
Things like advertorial or paid link classifications. It’s easy to do if you are a content writer and don’t understand link structures.
Canonical Links are suggestions – G gets to decide whether to heed them. This suggests there are conditions to be met, such as degree of similarity.
a) Is the consolidation binary or graduated?
b) Is there a set degree of similarity required?
c) Is the value 100%?
Yes, there is a set degree of similarity required; this prevents artificial manipulation of the feature. No, the value is below 100%, though most likely above 90%. We use 95% in our search engine.
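A hedged sketch of such a gate, using TF-IDF cosine similarity purely for illustration, with the 95% cutoff mentioned above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

CANONICAL_THRESHOLD = 0.95  # the 95% figure we use in our search engine

def honor_canonical(page_text, canonical_text):
    """Only consolidate if the two documents are nearly identical."""
    tfidf = TfidfVectorizer().fit_transform([page_text, canonical_text])
    similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    return similarity >= CANONICAL_THRESHOLD

page = "acme red widget, size chart, price and shipping details"
print(honor_canonical(page, page + " utm tracking footer"))  # near-duplicate, below the bar
print(honor_canonical(page, page))                           # identical -> True
```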
How much “learning” does G systems do based on web data from normal/business sites (rather than things like wikipedia etc.)?
All content is used for the training process. The more the merrier; it allows the model to avoid over-optimizing on a particular subset of content. That being said, the labeling process can be swayed towards one particular group of sites as examples of “good” or “bad”.
We know G associates “terms” with “pages” (urls), does it do the same on a wider level (terms and domain, subdomain etc.)?
It depends on how granular the site-wide algorithms need to be. A DomainRank type of algorithm that builds a more granular eigenvector matrix could call for such a thing.
Thank you.
Regards,
Jackie
It’s me again, Jackie 🙂 Hope you don’t mind answering my other question
Anchor text ratio – something many SEOs are obsessed with. Do you think search engines look at the anchor text ratio distribution for each keyword, and is it something we should keep in mind to avoid certain link-based penalties? My hunch is it doesn’t matter, as long as you’re getting links from high-quality sites and there are other factors that help justify the occurrence of the links (for instance, if your content goes viral, you might end up with thousands of links and keyword-rich anchors, but you’re also receiving traffic, social shares, etc.) – sort of like multiple algorithms running together to check the validity of the links.
But I also believe it’s something we should look at in our competitors, to learn what anchor texts are required to compete against them.
What’s your thought on this? And what does Market Brew look at in this situation?
Thank you in advance. Really appreciate your help!
Regards,
Jackie