29 Oct 2025

Google Announces a New Era for Voice Search


Google has unveiled a major AI-powered overhaul of voice search, billed as "a new era," designed to make it faster and more accurate. The update changes how voice queries are processed and ranked: the new AI model takes audio directly as input to the search and ranking process, skipping the step in which speech is transcribed to text.

The prior approach, known as cascade ASR (automatic speech recognition), converted a spoken query into text before running the standard ranking procedure. The problem with this approach is that it is prone to errors: the audio-to-text conversion can lose contextual cues, leading to mistaken results.

The new system is called Speech-to-Retrieval (S2R). It is a neural-network-based machine learning model trained on massive datasets of paired audio queries and documents. This training allows it to process spoken queries directly, without converting them to text, and match them to relevant documents.

Dual-Encoder Model: Two Neural Networks

The system uses two neural networks:

One of the neural networks, called the audio encoder, converts spoken queries into a vector-space representation of their meaning.

The second network, the document encoder, represents written information in the same kind of vector format.

The two encoders learn to map spoken queries and text documents into a single semantic space, so that related audio and text cluster together by semantic similarity.
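The shared-space idea can be sketched in a few lines: once both encoders emit vectors in the same space, relatedness reduces to vector similarity. The embeddings below are made-up toy values, not real model outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: the spoken query and its matching document
# should land close together in the shared semantic space.
audio_query_vec = [0.9, 0.1, 0.3]      # audio encoder: "the scream painting"
matching_doc_vec = [0.85, 0.15, 0.35]  # document encoder: page on Munch's The Scream
unrelated_doc_vec = [0.1, 0.9, -0.4]   # document encoder: unrelated page

print(cosine_similarity(audio_query_vec, matching_doc_vec))   # high
print(cosine_similarity(audio_query_vec, unrelated_doc_vec))  # low
```

Because both vectors live in one space, no text transcript is needed to compare a spoken query against a written document.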

Audio Encoder

Speech-to-Retrieval (S2R) converts the audio of a person's spoken query into a vector (a list of numbers) that conveys the semantic meaning of what they are asking for.

The announcement uses Edvard Munch's famous painting The Scream as an example. The spoken phrase "the scream painting" produces a point in the vector space close to information about Edvard Munch's The Scream (such as which museum it hangs in).

Document Encoder

The document encoder performs a similar process on text documents, such as web pages, converting them into vectors that reflect what they are about.

During model training, both encoders work together to ensure that vectors matching audio queries and documents are close together, while unrelated ones are far apart in the vector space.
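The training objective described above is commonly implemented as a contrastive loss; the sketch below is a generic in-batch version under that assumption, not Google's actual training code.

```python
import math

def contrastive_loss(query_vecs, doc_vecs):
    """In-batch contrastive loss: the document at the same index as a
    query is its positive match; all other documents in the batch act
    as negatives. Lower loss means matching pairs score higher."""
    loss = 0.0
    for i, q in enumerate(query_vecs):
        scores = [sum(x * y for x, y in zip(q, d)) for d in doc_vecs]
        log_norm = math.log(sum(math.exp(s) for s in scores))
        loss += log_norm - scores[i]  # negative log-softmax of the true pair
    return loss / len(query_vecs)

# Toy batch: well-aligned encoders yield a lower loss than misaligned ones.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
shuffled_docs = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(queries, aligned_docs))
print(contrastive_loss(queries, shuffled_docs))
```

Minimizing this loss pulls matching query/document vectors together and pushes unrelated ones apart, which is exactly the geometry the article describes.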

Rich Vector Representation

According to Google’s statement, the encoders turn audio and text into “rich vector representations.”  A rich vector representation is an embedding that captures meaning and context from audio and text.  It’s considered “rich” because it includes both intent and context.

For S2R, this means the system does not rely on keyword matching; instead, it "understands" conceptually what the user is requesting. So even if someone asks, "Show me Munch's screaming face painting," the vector version of that query will still land near documents about The Scream.

Ranking Layer

S2R uses a ranking process similar to ordinary text-based search. When someone speaks a query, the audio is first processed by a pre-trained audio encoder, which converts it into a vector that captures the speaker's intent. That vector is then compared against Google's index to find the pages whose meanings most closely match the spoken request.

For example, if someone says "the scream painting," the model converts the phrase into a vector representing its meaning. The system then searches its document index for pages whose vectors closely match, such as pages about Edvard Munch's The Scream.
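This retrieval step can be sketched as a nearest-neighbour lookup over a toy index; the document ids and embeddings below are invented for illustration, and real systems use approximate search at far larger scale.

```python
def retrieve(query_vec, index, top_k=2):
    """Return the top_k document ids whose vectors score highest against
    the query vector by dot-product similarity."""
    scored = sorted(index.items(),
                    key=lambda kv: -sum(x * y for x, y in zip(query_vec, kv[1])))
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy index: document id -> hypothetical embedding from the document encoder.
index = {
    "munch-the-scream": [0.9, 0.1],
    "screaming-goats":  [0.2, 0.8],
    "munch-biography":  [0.7, 0.3],
}
query = [0.95, 0.05]  # embedding of the spoken query "the scream painting"
print(retrieve(query, index))  # the Scream page ranks first
```

Note that the goat page, despite sharing the word "scream," scores poorly: the match is semantic, not lexical.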

Once those candidate matches are found, a separate ranking stage takes over. It combines the first stage's similarity scores with hundreds of other relevance and quality signals to decide which pages should rank first.
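A second-stage blend of signals might look like the sketch below. The signal names and weights are purely illustrative; Google's actual signals and weighting are not public.

```python
def final_score(similarity, quality_signals, weights):
    """Blend the retrieval similarity score with other ranking signals
    via a weighted sum (a simple stand-in for a real ranking model)."""
    score = weights["similarity"] * similarity
    for name, value in quality_signals.items():
        score += weights.get(name, 0.0) * value
    return score

# Candidates: doc id -> (retrieval similarity, hypothetical quality signals).
candidates = {
    "munch-the-scream": (0.95, {"page_quality": 0.9, "freshness": 0.4}),
    "screaming-goats":  (0.60, {"page_quality": 0.8, "freshness": 0.9}),
}
weights = {"similarity": 0.6, "page_quality": 0.3, "freshness": 0.1}
ranked = sorted(candidates,
                key=lambda d: -final_score(*candidates[d], weights))
print(ranked)  # most relevant, highest-quality page first
```

The key design point is that retrieval similarity is just one input: a page can be semantically close yet still rank below a higher-quality competitor.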

In conclusion, Google's new Speech-to-Retrieval (S2R) system marks a major leap forward in voice search technology by eliminating the need to convert speech into text. Through its dual-encoder AI model (an audio encoder and a document encoder), it directly interprets spoken queries and matches them to relevant content using rich vector representations that capture meaning and context.

This allows for more accurate, intuitive, and context-aware results, overcoming the limitations of the older Cascade ASR method. By processing audio natively and ranking results based on semantic understanding rather than keywords, Google is ushering in a new era of faster, smarter, and more natural voice search experiences.
