Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: nos...@needed.invalid (Paul)
Newsgroups: alt.comp.hardware.pc-homebuilt
Subject: Re: Voice recognition, directly on mobile phone ? or do it on
cloud/server ?
Date: Thu, 1 Sep 2022 05:29:27 -0400
Organization: A noiseless patient Spider
Lines: 105
Message-ID: <tepu1o$24tu7$1@dont-email.me>
References: <2deb530e-cae7-40bd-b38b-1d490d961f4an@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 1 Sep 2022 09:29:28 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="9be67779ef8f66098729e55a1604f95d";
logging-data="2258887"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19nSYKmhdCz0/PZSIp8YKUcPH5We4+whVQ="
User-Agent: Ratcatcher/2.0.0.25 (Windows/20130802)
Cancel-Lock: sha1:8eeUnhxE4XzmWHyGU3PamZhjXaM=
In-Reply-To: <2deb530e-cae7-40bd-b38b-1d490d961f4an@googlegroups.com>
Content-Language: en-US

On 9/1/2022 1:43 AM, Skybuck Flying wrote:
> Question for you to look into:
>
> 1. Would it be more power-efficient to collect the voice/speech as a waveform and transmit it to a server, have the server do the voice/AI processing to recognize the spoken words/commands, and then transmit the result back to the mobile phone?
>
> or
>
> 2. Would it be more power-efficient to perform the voice/AI processing/recognition directly on the mobile phone's CPU/processing units?
>
> Bye,
> Skybuck.
>

The recognition takes roughly the same amount of power wherever it is done.

By doing it on the server, the intellectual property of the algorithm
is protected, and it's harder to steal the secret of how it works.

For some devices, it's pretty obvious that only centralized processing
will work. An Alexa device does not have a lot of electrical power, and
it's unlikely to have many square millimeters of silicon for this job.
It could do some preliminary processing locally, such as silence
suppression (a sketch follows below).
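
As a rough illustration, preliminary silence suppression can be as
simple as an energy gate. A minimal sketch in Python/NumPy; the frame
size, threshold, and sample rate are my own illustrative assumptions,
not anything Amazon publishes:

import numpy as np

def drop_silent_frames(samples, rate=16000, frame_ms=20, threshold_db=-40.0):
    """Keep only frames whose RMS energy exceeds a relative-dB threshold."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1)) + 1e-12
    db = 20.0 * np.log10(rms / rms.max())         # energy vs. loudest frame
    return frames[db > threshold_db].reshape(-1)  # concatenated voiced audio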

Alexa devices use microphone arrays, and they can extract positional
information (phased array) to help collect sound only from the
human emitter and not from the whistling kettle nearby. What leaves
the Alexa device might be a single .wav stream, consisting of the
"best" material extracted from the microphone array. That makes Alexa
a directional recording device. Its job is to get the clearest sample,
with the noise suppressed.
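
The array trick itself is conceptually simple: delay each channel so
sound from the chosen direction lines up, then average. A minimal
delay-and-sum sketch, assuming a linear array and a plane wave (the
geometry and numbers here are illustrative):

import numpy as np

def delay_and_sum(channels, mic_x_m, angle_deg, rate=16000, c=343.0):
    """channels: (n_mics, n_samples); mic_x_m: mic positions on a line, in m."""
    # Plane-wave arrival delay at each mic, for a source angle from broadside.
    delays_s = np.asarray(mic_x_m) * np.sin(np.deg2rad(angle_deg)) / c
    shifts = np.round((delays_s - delays_s.min()) * rate).astype(int)
    n = channels.shape[1] - shifts.max()
    aligned = np.stack([ch[s:s + n] for ch, s in zip(channels, shifts)])
    return aligned.mean(axis=0)  # in-phase sum favors the chosen direction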

With a smartphone, it's less clear. The big core on the smartphone
may be fast enough to do recognition. It could also encode the
waveform without converting it all the way to final words (picking
out the fricatives and plosives), leaving the natural-language
processing to the server.

https://en.wikipedia.org/wiki/Fricative

https://en.wikipedia.org/wiki/Plosive
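
One reason to split the work that way: sending compact features
instead of raw audio keeps the radio quiet, and the radio is one of
the bigger power drains on a phone. A back-of-envelope comparison,
with illustrative numbers:

rate_hz, sample_bytes = 16000, 2                 # 16 kHz, 16-bit PCM
raw_Bps = rate_hz * sample_bytes                 # 32,000 bytes/s raw waveform
frames_per_s, n_coeffs, coeff_bytes = 100, 13, 4 # 10 ms hop, float32 features
feat_Bps = frames_per_s * n_coeffs * coeff_bytes # 5,200 bytes/s of features
print(f"{raw_Bps} vs {feat_Bps} B/s: {raw_Bps / feat_Bps:.1f}x less to send")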

Generally, you don't want to run down, or overheat, the phone.

*******

https://www.scientificamerican.com/article/voice-analysis-should-be-used-with-caution-in-court/

In the 1990s a new system that minimized human judgment started to
gain popularity: automatic speaker recognition.

Most algorithms work by dividing the signal into brief time windows and
extracting the corresponding spectra of frequencies. The spectra then undergo
mathematical transformations that extract parameters, called cepstral coefficients,
related to the geometric shape of the vocal tract. Cepstral coefficients
provide a model of the speaker’s vocal tract shape.

That's an example of a kind of pre-processing. It may not be the whole job
you want done, but the portable device can do some of the work for you.
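
In code form, the pipeline the article describes is short. A minimal
real-cepstrum sketch; the window, hop, and coefficient count are
illustrative, not taken from any production recognizer:

import numpy as np

def cepstral_coefficients(samples, rate=16000, win_ms=25, hop_ms=10, n_keep=13):
    win, hop = int(rate * win_ms / 1000), int(rate * hop_ms / 1000)
    window = np.hamming(win)
    out = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win] * window       # brief time window
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10     # its spectrum
        cepstrum = np.fft.irfft(np.log(spectrum))         # real cepstrum
        out.append(cepstrum[:n_keep])  # low coefficients ~ vocal-tract shape
    return np.array(out)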

I don't know all the details of how it works, but I would expect
it's a partitionable problem, and parts of it can be done in
one place or another. I would think the server would still want the
voice sample, in case some secondary processing is needed: say, if
the recognition makes no sense and the server needs to try again.
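
The division of labor might then look like this sketch, where
recognize_features, recognize_waveform, and the confidence threshold
are hypothetical names and values of mine, not any vendor's API:

def recognize(features, waveform, server):
    # Cheap first pass on the compact features the phone computed.
    text, confidence = server.recognize_features(features)
    if confidence < 0.5:
        # Result "makes no sense": fall back to the raw voice sample
        # and let the server run the heavier path again.
        text, confidence = server.recognize_waveform(waveform)
    return text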

With a device like Alexa, Amazon would want to keep the voice samples
for AI training purposes. They can run the samples again and again,
and improve how their end of the processing works.

*******

Tesla does something similar with their cars. A lot of data is
collected from the car while you drive (cameras on a Tesla, no LIDAR),
and that data is used to provide self-driving information for the
future (how to navigate a street, what fixed objects are on the
street, what is around the next bend in the road). Where full
road-network data is available, robo-taxis are already running today
in small, fully mapped areas of cities. Tesla would be building maps
of everywhere its cars have driven.

https://en.wikipedia.org/wiki/Waymo

"Waymo operates a commercial self-driving taxi service in the greater
Phoenix, Arizona [area]. In October 2020, the company expanded the
service to the public, and it was the only self-driving commercial
service that operates without safety backup drivers in the vehicle
at that time. Waymo also develops driving technology for use in other
vehicles, including delivery vans and Class 8 tractor-trailers for
delivery and logistics."

The Waymo vehicles likely use a different sensor array than a Tesla
(perhaps LIDAR and radar, as well as vision). You can see what is
likely to be LIDAR on the roof of the vehicle.

https://www.kbb.com/wp-content/uploads/make/chrysler/pacifica/20171/waymo-pacifica/01-waymo-autonomous-chrysler-pacifica-hybrid.jpg

Therefore, the raw data from every interaction with the servers is
likely kept for future training purposes. Call it AI if you want, but
it can be used at any level of the processing task.

Doing all the processing centrally is not scalable. If a hundred
million people say "Hey Cortana" at the same time, the central server
is likely to be overloaded and unable to respond. That's a
disadvantage of the server doing every bit of it.
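
A back-of-envelope count shows the size of the problem; every figure
here is an illustrative assumption:

requests = 100_000_000   # simultaneous "Hey Cortana" utterances
cpu_s_each = 0.5         # assumed server compute per utterance
cores_per_server = 64
deadline_s = 1.0         # answer within a second
servers = requests * cpu_s_each / (cores_per_server * deadline_s)
print(f"~{servers:,.0f} servers busy at the peak")  # ~781,250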

Paul
