Re: US speech "accents"

<tibkfj$23n7j$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=107926&group=sci.electronics.design#107926


Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: US speech "accents"
Date: Fri, 14 Oct 2022 05:25:14 -0700
Organization: A noiseless patient Spider
Lines: 32
Message-ID: <tibkfj$23n7j$1@dont-email.me>
References: <ti5ut0$1ds4b$1@dont-email.me>
<qiidkh1e18jq5f4qpdvapt094552ll0355@4ax.com> <ti6u01$1i15u$1@dont-email.me>
<f48e85ca-36ec-495c-097e-c74c652ae013@electrooptical.net>
<ti9hk6$1raug$1@dont-email.me>
<17632b75-89f5-4344-ac88-b1d06aba7e49n@googlegroups.com>
<ti9uuq$1se1s$1@dont-email.me>
<92f37582-68bf-4a4a-b46c-6cfb14f93960n@googlegroups.com>
<tia441$1sre3$1@dont-email.me> <tiaa4u$1t7qh$1@dont-email.me>
<tibf4m$22lje$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 14 Oct 2022 12:25:23 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="cbfe910ae7d0b63cd756ddb04f966219";
logging-data="2219251"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+r/Rh5RG0ma4w7Xg6bd0z9"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:VDFG4DATczIQRXmm7hiSyhh1HKo=
In-Reply-To: <tibf4m$22lje$1@dont-email.me>
Content-Language: en-US

by: Don Y - Fri, 14 Oct 2022 12:25 UTC

On 10/14/2022 3:54 AM, Dimiter_Popoff wrote:
> On 10/14/2022 3:22, Don Y wrote:
>> On 10/13/2022 3:39 PM, Dimiter_Popoff wrote:
>>
>>> Your filters must be really good and adaptive. I do understand parts of
>>> what they say at times, especially when I know the context, but
>>> typically I miss words and if they are key to the sentence I am just
>>> lost.
>>
>> That's the problem with all "accents" (speech patterns foreign to YOUR
>> norm). One can learn to understand damn near anything. But, the effort
>> required to -- esp when it is an infrequent activity *or* where you
>> can't just ask for someone to repeat what they've said -- can make
>> listening "tedious" or stressful.
>>
>> Most synthetic speech suffers from this problem; it's OK in very small
>> doses *if* you've become accustomed to it AND have a pretty good idea
>> of what is being said. (try listening to someone reading nonsense
>> syllogisms in a "foreign accent" and make note of your overall comprehension!)
>> But, it's not the sort of experience that you "seek out".
>>
>> Users of The Reading Machine often complained of "listener fatigue",
>> despite the fact that they were THRILLED to have access to the materials
>> that were possible via the machine.
>>
>> [I always used to say the voice could penetrate CONCRETE!]
>
> Penetrate concrete with Don having trained the machine what to
> say in test mode... well well. :D

Shhhh! No one's supposed to know! :>

Re: US speech "accents"

<9938c304-d5aa-4818-98f0-dac3648f470cn@googlegroups.com>

https://www.novabbs.com/tech/article-flat.php?id=107966&group=sci.electronics.design#107966

X-Received: by 2002:a37:4454:0:b0:6e7:9bd0:bf53 with SMTP id r81-20020a374454000000b006e79bd0bf53mr4951474qka.616.1665774522210;
Fri, 14 Oct 2022 12:08:42 -0700 (PDT)
X-Received: by 2002:a05:6214:f2b:b0:4b1:7b01:6de2 with SMTP id
iw11-20020a0562140f2b00b004b17b016de2mr5210717qvb.122.1665774522048; Fri, 14
Oct 2022 12:08:42 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: sci.electronics.design
Date: Fri, 14 Oct 2022 12:08:41 -0700 (PDT)
In-Reply-To: <tibkfj$23n7j$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=74.101.19.38; posting-account=rEo47AoAAAAz23oFFYoL4aHQauGkT8Lw
NNTP-Posting-Host: 74.101.19.38
References: <ti5ut0$1ds4b$1@dont-email.me> <qiidkh1e18jq5f4qpdvapt094552ll0355@4ax.com>
<ti6u01$1i15u$1@dont-email.me> <f48e85ca-36ec-495c-097e-c74c652ae013@electrooptical.net>
<ti9hk6$1raug$1@dont-email.me> <17632b75-89f5-4344-ac88-b1d06aba7e49n@googlegroups.com>
<ti9uuq$1se1s$1@dont-email.me> <92f37582-68bf-4a4a-b46c-6cfb14f93960n@googlegroups.com>
<tia441$1sre3$1@dont-email.me> <tiaa4u$1t7qh$1@dont-email.me>
<tibf4m$22lje$1@dont-email.me> <tibkfj$23n7j$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9938c304-d5aa-4818-98f0-dac3648f470cn@googlegroups.com>
Subject: Re: US speech "accents"
From: richsuli...@gmail.com (Rich S)
Injection-Date: Fri, 14 Oct 2022 19:08:42 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 1977

by: Rich S - Fri, 14 Oct 2022 19:08 UTC

On Friday, October 14, 2022 at 12:25:30 PM UTC, Don Y wrote:
[snip]

The "standard" American English, AIUI, is
standard Midwestern speech, and is taught
in news journalism and acting schools.
Siri, Alexa, et al. are based on that.
The equivalent in the UK is "BBC English."

So that's where I would start if I had to develop
an AI voice library. Or recruit an actor to
record a new phoneme library. Of course,
there are companies that do this already :-)

Re: US speech "accents"

<ticv0h$2b96p$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=107981&group=sci.electronics.design#107981

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: US speech "accents"
Date: Fri, 14 Oct 2022 17:31:03 -0700
Organization: A noiseless patient Spider
Lines: 75
Message-ID: <ticv0h$2b96p$1@dont-email.me>
References: <ti5ut0$1ds4b$1@dont-email.me>
<qiidkh1e18jq5f4qpdvapt094552ll0355@4ax.com> <ti6u01$1i15u$1@dont-email.me>
<f48e85ca-36ec-495c-097e-c74c652ae013@electrooptical.net>
<ti9hk6$1raug$1@dont-email.me>
<17632b75-89f5-4344-ac88-b1d06aba7e49n@googlegroups.com>
<ti9uuq$1se1s$1@dont-email.me>
<92f37582-68bf-4a4a-b46c-6cfb14f93960n@googlegroups.com>
<tia441$1sre3$1@dont-email.me> <tiaa4u$1t7qh$1@dont-email.me>
<tibf4m$22lje$1@dont-email.me> <tibkfj$23n7j$1@dont-email.me>
<9938c304-d5aa-4818-98f0-dac3648f470cn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 15 Oct 2022 00:31:13 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="b1a3435414c4a83dff88ffc618b8718f";
logging-data="2467033"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/qahXyDJT+0D5M3qGIIxMn"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:4qFBTJN3OGP3GhA+n5n3FCXlUOw=
In-Reply-To: <9938c304-d5aa-4818-98f0-dac3648f470cn@googlegroups.com>
Content-Language: en-US

by: Don Y - Sat, 15 Oct 2022 00:31 UTC

On 10/14/2022 12:08 PM, Rich S wrote:
> On Friday, October 14, 2022 at 12:25:30 PM UTC, Don Y wrote:
> [snip]
>
> The "standard" American English, AIUI, is
> standard Midwestern speech, and is taught
> in news journalism and acting schools.
> Siri, Alexa, et al. are based on that.
> The equivalent in the UK is "BBC English."

But that's simply because someone *thought* you needed "one
true speaking style". Would a Texan prefer to hear a generic
"midwestern" speaker or someone with a local twang? He's not
given a choice. Ditto for the New Yorker or Mainer.

Similarly for Siri, Alexa, etc.

You'd imagine, if comprehension were an issue, all teachers,
pastors, local tv/radio personalities, doctors, etc. would
similarly be "taught how to speak". Pilots would abandon
Yeager-ese, stewardesses would be taught to switch to their
"understandable voice" before giving directions in the event
of an emergency, phone banks would be staffed by persons
fluent in that speaking style, etc.

And, why would we let our kids grow up with the "burden"
of speaking a local dialect (that they would later have to
"outgrow" to be understood)?

> So that's where I would start if I had to develop
> an AI voice library. Or recruit an actor to
> record a new phoneme library. Of course,
> there are companies that do this already :-)

You only "record" a voice if you are using limited
vocabulary synthesis (record the messages of interest)
or diphone synthesis (decompose spoken word into
phoneme *transitions* that can be reassembled to form
complete sound sequences).

[I use this form of synthesizer for people who would like to
feel they are interacting with a departed spouse or a
"responsible older child" (e.g., for a listener suffering
from dementia or Alzheimer's). Or, an ALS patient who would
be comforted by the sound of their (old) voice]
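
As a rough illustration of the diphone idea described above (units that span the transition from one phoneme to the next), here is a minimal Python sketch. The phoneme labels, the `_` silence marker and the function name `to_diphones` are assumptions made for the example; they are not taken from any particular synthesizer.

```python
def to_diphones(phonemes, boundary="_"):
    """Split a phoneme sequence into diphone names (one per transition).

    `boundary` is a hypothetical marker for the silence before and
    after the utterance, so the first and last sounds get units too.
    """
    padded = [boundary, *phonemes, boundary]
    # Each diphone covers the transition from one phoneme to the next.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Illustrative phoneme labels for the word "cat".
units = to_diphones(["k", "ae", "t"])
print(units)
```

A recorded diphone inventory would then be indexed by these names and the matching snippets concatenated; that lookup and the smoothing at the joins are left out here.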

My question goes to what folks would *want* to listen to;
I doubt they'd all pick "Midwestern US English" if given
a choice. (if that was the case, there would be no such
thing as "synthetic voiceS, plural")

With a formant synthesizer, you (typically) create the sounds from
a set of cascaded filters shaping a variable excitation (voicing)
source. So, you have control over *which* sounds you make (from
pronouncing rules) as well as how you make them.

[This is a slight oversimplification]
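
To make the "cascaded filters shaping an excitation source" picture concrete, here is a minimal Python sketch. It assumes a two-pole digital resonator of the kind commonly used in Klatt-style formant synthesizers and a bare impulse train as the voicing source. The formant frequencies and bandwidths are illustrative numbers only, and every function name is made up for the example; this is not the synthesizer discussed in the post.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Run the signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(pitch_hz, duration_s, sample_rate):
    """Crude voicing source: one unit impulse per pitch period."""
    period = round(sample_rate / pitch_hz)
    n = round(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, pitch_hz=110.0, duration_s=0.3, sample_rate=16000):
    """Cascade one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal


# Roughly "ah"-like formants (assumed values, for illustration only).
samples = synthesize_vowel([(700, 130), (1220, 70), (2600, 160)])
```

Changing `pitch_hz` moves the register of the voice, while changing the formant list changes which sound is made. That is the separation between "which sounds" and "how you make them" that the paragraph above describes.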

I can, already, give the user (listener) control over various
aspects of the voice to suit his needs (some folks have hearing
and/or comprehension issues). Would you prefer a deeper voice
or one in a higher register? Fast speaker or slow? Breathy
or crisp? Monotone or prosodic?

But, I've not (yet) tackled the idea of altering the pronunciation
rules to favor *your* preferences for the manner in which you
expect to encounter certain words. So, those folks who like to
MAYSH their potatoes won't be burdened with having to map a
"mesh" pronunciation to their notion of "soft, white starch piles".

[It's a different sort of problem to tackle as it requires being
able to identify the things that characterize a specific "accent"
and goes beyond just "changing the sound of the phoneme" as it
also ties into prosodic computations]
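
The simplest piece of this, swapping the pronunciation of individual words per listener, could be sketched as an override table layered over a base lexicon. Everything here is hypothetical: the table names, the accent label and the ARPAbet-style phoneme strings are assumptions for illustration. As noted above, a real accent model would also have to touch the prosody, which this sketch ignores.

```python
# Base pronunciations (illustrative ARPAbet-style strings, not from a real lexicon).
BASE_LEXICON = {
    "mash": "M AE SH",
    "potatoes": "P AH T EY T OW Z",
}

# Per-accent exceptions; the accent label is a made-up name.
ACCENT_OVERRIDES = {
    "maysh_speaker": {"mash": "M EY SH"},
}


def pronounce(word, accent=None):
    """Return the phoneme string for `word`, preferring accent overrides.

    Returns None when the word is unknown, at which point a real system
    would fall back to letter-to-sound rules (not shown).
    """
    key = word.lower()
    overrides = ACCENT_OVERRIDES.get(accent, {})
    return overrides.get(key, BASE_LEXICON.get(key))


print(pronounce("mash"))                   # base form
print(pronounce("mash", "maysh_speaker"))  # listener's preferred form
```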

Re: US speech "accents"

<d8f8d6b4-a8bc-4dff-a421-8b68c6323ba8n@googlegroups.com>

https://www.novabbs.com/tech/article-flat.php?id=107982&group=sci.electronics.design#107982

X-Received: by 2002:a37:9785:0:b0:6cf:55d:e554 with SMTP id z127-20020a379785000000b006cf055de554mr444838qkd.459.1665796409391;
Fri, 14 Oct 2022 18:13:29 -0700 (PDT)
X-Received: by 2002:a37:a848:0:b0:6ed:2436:ad0 with SMTP id
r69-20020a37a848000000b006ed24360ad0mr447832qke.29.1665796409212; Fri, 14 Oct
2022 18:13:29 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: sci.electronics.design
Date: Fri, 14 Oct 2022 18:13:28 -0700 (PDT)
In-Reply-To: <ticv0h$2b96p$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=74.101.19.38; posting-account=rEo47AoAAAAz23oFFYoL4aHQauGkT8Lw
NNTP-Posting-Host: 74.101.19.38
References: <ti5ut0$1ds4b$1@dont-email.me> <qiidkh1e18jq5f4qpdvapt094552ll0355@4ax.com>
<ti6u01$1i15u$1@dont-email.me> <f48e85ca-36ec-495c-097e-c74c652ae013@electrooptical.net>
<ti9hk6$1raug$1@dont-email.me> <17632b75-89f5-4344-ac88-b1d06aba7e49n@googlegroups.com>
<ti9uuq$1se1s$1@dont-email.me> <92f37582-68bf-4a4a-b46c-6cfb14f93960n@googlegroups.com>
<tia441$1sre3$1@dont-email.me> <tiaa4u$1t7qh$1@dont-email.me>
<tibf4m$22lje$1@dont-email.me> <tibkfj$23n7j$1@dont-email.me>
<9938c304-d5aa-4818-98f0-dac3648f470cn@googlegroups.com> <ticv0h$2b96p$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d8f8d6b4-a8bc-4dff-a421-8b68c6323ba8n@googlegroups.com>
Subject: Re: US speech "accents"
From: richsuli...@gmail.com (Rich S)
Injection-Date: Sat, 15 Oct 2022 01:13:29 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2232

by: Rich S - Sat, 15 Oct 2022 01:13 UTC

On Saturday, October 15, 2022 at 12:31:20 AM UTC, Don Y wrote:
> On 10/14/2022 12:08 PM, Rich S wrote:
> > On Friday, October 14, 2022 at 12:25:30 PM UTC, Don Y wrote:
> > [snip]
> >
[snip]

oh Don, Up top, you did not clearly express what
your goal is. Why you're asking. Do you want
(1) one library of speech that
most everyone will understand, to convey basic
info in short bursts (a la Siri, Alexa) [What I thought].
(2) many selectable regional dialects, for long spans
of human-like conversation?
Is that what's needed to prevent listener fatigue?
(i.e. minimizing the person's mental energy demand)

Re: US speech "accents"

<tidaeo$2gmf8$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=107983&group=sci.electronics.design#107983

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: US speech "accents"
Date: Fri, 14 Oct 2022 20:46:22 -0700
Organization: A noiseless patient Spider
Lines: 164
Message-ID: <tidaeo$2gmf8$1@dont-email.me>
References: <ti5ut0$1ds4b$1@dont-email.me>
<qiidkh1e18jq5f4qpdvapt094552ll0355@4ax.com> <ti6u01$1i15u$1@dont-email.me>
<f48e85ca-36ec-495c-097e-c74c652ae013@electrooptical.net>
<ti9hk6$1raug$1@dont-email.me>
<17632b75-89f5-4344-ac88-b1d06aba7e49n@googlegroups.com>
<ti9uuq$1se1s$1@dont-email.me>
<92f37582-68bf-4a4a-b46c-6cfb14f93960n@googlegroups.com>
<tia441$1sre3$1@dont-email.me> <tiaa4u$1t7qh$1@dont-email.me>
<tibf4m$22lje$1@dont-email.me> <tibkfj$23n7j$1@dont-email.me>
<9938c304-d5aa-4818-98f0-dac3648f470cn@googlegroups.com>
<ticv0h$2b96p$1@dont-email.me>
<d8f8d6b4-a8bc-4dff-a421-8b68c6323ba8n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 15 Oct 2022 03:46:32 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="b1a3435414c4a83dff88ffc618b8718f";
logging-data="2644456"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+nIzbYMgWcje+hzKw/0pjx"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:3VqGuiLqG4ipkFM9SXs7ZvQmCNU=
Content-Language: en-US
In-Reply-To: <d8f8d6b4-a8bc-4dff-a421-8b68c6323ba8n@googlegroups.com>

by: Don Y - Sat, 15 Oct 2022 03:46 UTC

On 10/14/2022 6:13 PM, Rich S wrote:
> oh Don, Up top, you did not clearly express what
> your goal is. Why you're asking.

I hypothesized that people have biases that enhance their
comprehension of spoken word. I *assumed* that these biases
would be related to their "dialect preferences".

I can't control WHAT is said, so I can't use idioms that
a Mainer might use in place of phraseology that a New Yorker
might adopt (or, that a "purist" might express in text).

So, I am assuming that the *way* people are accustomed to
hearing things pronounced ("accent") is what drives this
comprehension enhancement.

> Do you want
> (1) one library of speech that
> most everyone will understand, to convey basic
> info in short bursts (a la Siri, Alexa) [What I thought].

I have different synthesizers that I deploy in different
situations.

I have a formant-based synthesizer that is best suited
for stand-alone, low resource applications. It doesn't
take much code or horsepower to make a formant synthesizer
that isn't restricted by vocabulary. (a diphone synthesizer
has a larger resource footprint -- think small, battery
powered applications like putting speech *in* an earbud)

When the appliance (earbud, in this example) can't connect to
anything, it still needs to be able to interact with the
wearer -- if only to tell him that there is no active connection!

Many of these messages are canned, but parameterized. E.g.,
"Volume level 27 (of 32)"
"Battery level 35%"
"Speaking rate 250 words per minute"
"Battery life remaining, 3:47"
etc.

A user will likely recognize most of these, given enough exposure
(because they are canned and, thus, predictable)
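
A minimal Python sketch of "canned, but parameterized" messages, using the example phrases above as templates. The dictionary keys, parameter names and the `render` helper are assumptions for illustration; the resulting string would be handed to whatever synthesizer is in use.

```python
# Templates taken from the example messages; keys and field names are made up.
CANNED = {
    "volume": "Volume level {level} (of {maximum})",
    "battery": "Battery level {percent}%",
    "rate": "Speaking rate {wpm} words per minute",
    "battery_life": "Battery life remaining, {hours}:{minutes:02d}",
}


def render(name, **params):
    """Fill in a canned message; raises KeyError for an unknown message name."""
    return CANNED[name].format(**params)


print(render("volume", level=27, maximum=32))
print(render("battery_life", hours=3, minutes=47))
```

Because the wording never changes and only the numbers do, the listener learns the frame of each message and only has to pick out the parameter.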

Others are sourced externally, but pronounced locally:
"Available access points include:
   greenhornet
   wifi234
   guestaccess
   ..."
"System unavailable. Contact Dr Betina at 234-5000 x234"
"Scheduled maintenance. Try again after 5:00pm"
"Account disabled. Contact Jesse at jesse@voyager.com"

A user will likely NOT expect to encounter these and will already
be "unhappy" because he's not getting what he planned on. So,
not particularly patient with a voice he can't understand
speaking something that he can't *predict*. Having to ask for
something to be repeated (possibly many times -- what's the
chance that the synthesizer pronounces Betina's name properly?)
just adds to the frustration -- with YOUR product (even though
YOU aren't the problem, in this case!)

> (2) many selectable regional dialects, for long spans
> of human-like conversation?

For prolonged use, the user interacts with a more "resource-ful"
synthesizer -- at the far end of the radio link.

There aren't really any "long conversations" (paraphrasing your
comment). You don't typically discuss politics with Siri. Or,
ask Alexa how she spent her day... ("Oh, Siri and I went
shopping! We used your GOLD CARD and bought all new wardrobes for
ourselves...")

So, the interactions are typically short which means they
lack a lot of context or continuity. If it is a response to
a query (or directive) on your part, then you have *some*
context in which to evaluate the REPLY.

OTOH, an unprompted utterance can be just about anything!
"You forgot to turn off the stovetop"
"Your wash is ready to be transferred to the dryer"
"The mailman has delivered a package"
"It's getting dark. Are you sure you want to leave the garage door open?"
"Penny is at the front door"
"The freezer has warmed to a level that threatens its contents"
"Someone is wandering around the (fenced) back yard"
"The evening news broadcast is starting, soon"
"Time to wake up! You've a doctor appointment at 9:00am"
"Marijane is on the phone. Would you like to speak to her?"

These messages want to be short "announcements". So, you
have to make the adjustment from "not expecting any message"
to "realizing a message is being issued" and "understanding
that message" in short order.

And, some of them are (hopefully) very low frequency of occurrence
(the comment re: the freezer failing, for example) so it's not
as if you have a past experience with it to remind you of the likely
content.

> Is that what's needed to prevent listener fatigue?
> (i.e. minimizing the person's mental energy demand)

I think there are two primary reasons for the fatigue.

When you listen to synthetic speech, you feel apprehensive.
There's a lot of anxiety as you're always "on guard" for
something that you won't understand (because it was
mispronounced, etc.)

"I was scared" might make no sense in a context where the
correct text was "I was scarred". So, you have to mentally
pause and think about what the correct word might have been,
instead of what you clearly heard!

Ages ago, synthesizers had all sorts of problems normalizing speech.
"Dr. Smith lives on Smith Dr." "Watch the polish maid polish the
silver." Or, coping with "input errors" (misspellings? grammatical
errors?). "How would you pronounce" (rest of sentence missing).
"Dog fly airplane book" "1,345" "1,2345,456" "1.02.3" "555-12121"
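
The "Dr. Smith lives on Smith Dr." case can be sketched with two context rules in Python. This is a deliberately naive heuristic that keys off capitalization alone, and the function name `expand_dr` is made up for the example. Real text normalizers use far richer context, and this one is easy to fool (e.g., a sentence that ends in "Dr." and is followed by a capitalized word).

```python
import re


def expand_dr(text):
    """Expand "Dr" as a title or a street suffix based on its neighbors."""
    # "Dr" or "Dr." directly before a capitalized word reads as a title.
    text = re.sub(r"\bDr\.?(?=\s+[A-Z])", "Doctor", text)
    # A remaining "Dr" after a capitalized word reads as a street suffix.
    # The period is kept when it also ends the text.
    text = re.sub(r"\b([A-Z][a-z]+)\s+Dr\b(?:\.(?!\s*$))?", r"\1 Drive", text)
    return text


print(expand_dr("Dr. Smith lives on Smith Dr."))
print(expand_dr("Contact Dr Betina at 234-5000 x234"))
```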

A lot of that has improved with better understanding of context.
But, even that has limitations:

(I am at the rehearsal for the play) "I read my lines." (is that past
or present tense?) "Then, the director gives me hints as to how he'd
like me to act the part."

When I watch movies, I turn on the subtitles as it improves comprehension
(I'm not having to fight sound effects, background music, etc.) -- I really
only have ONE chance to figure out what they are saying (rewinding to
replay something really impacts the presentation). Having to tell
a machine to repeat something more than RARELY quickly becomes noticeable.
Like someone who is hard-of-hearing constantly asking folks around
them to repeat what they said; it makes their situation more noticeable.

Most synthetic voices also have qualities that make them tiring. When you watch
the evening newscast, there are likely several talking heads involved.
Often a male and female newscaster who alternate presentations -- to
give you a break from the monotony of a single presenter. Turn on
"Narrator" if you're on a Windows machine and see how long it takes before
you turn it back OFF! (close your eyes so you don't have visual cues
to help you sort out what it is saying!)

It's hard to "program" naturalness into speech. You can use some
general rules for prosody, breath pauses, etc. But, it still feels
like you are hearing "Hello, my name is Bob" repeated as if an audio
*recording* each time you encounter it.

And, you're listening to a disembodied voice. There are no visual
cues to pick up on (gestures), body language, etc.

I'm just trying to make the experience the "least unusual" that it
can be, for "average joes". My design philosophy has always been to do
work so the user doesn't have to!

[E.g., for a user with dementia, having the voice of her eldest
child tell her to turn off the stovetop -- or go out and grab the
DELIVERED mail -- is probably more reassuring than someone like Siri
"dictating" to them!]

Re: US speech "accents"

<tiead5$2nvo9$2@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=107996&group=sci.electronics.design#107996

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: US speech "accents"
Date: Sat, 15 Oct 2022 05:51:38 -0700
Organization: A noiseless patient Spider
Lines: 78
Message-ID: <tiead5$2nvo9$2@dont-email.me>
References: <ti5ut0$1ds4b$1@dont-email.me>
<qiidkh1e18jq5f4qpdvapt094552ll0355@4ax.com> <ti6u01$1i15u$1@dont-email.me>
<f48e85ca-36ec-495c-097e-c74c652ae013@electrooptical.net>
<ti9hk6$1raug$1@dont-email.me>
<17632b75-89f5-4344-ac88-b1d06aba7e49n@googlegroups.com>
<ti9uuq$1se1s$1@dont-email.me>
<92f37582-68bf-4a4a-b46c-6cfb14f93960n@googlegroups.com>
<tia441$1sre3$1@dont-email.me> <tiaa4u$1t7qh$1@dont-email.me>
<tibf4m$22lje$1@dont-email.me> <tibkfj$23n7j$1@dont-email.me>
<9938c304-d5aa-4818-98f0-dac3648f470cn@googlegroups.com>
<ticv0h$2b96p$1@dont-email.me>
<d8f8d6b4-a8bc-4dff-a421-8b68c6323ba8n@googlegroups.com>
<tidaeo$2gmf8$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 15 Oct 2022 12:51:50 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="b1a3435414c4a83dff88ffc618b8718f";
logging-data="2883337"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18CIMMkFs2o/03HcKisp4rq"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:xNgQU4s3PUwMTByIfQrPJq8/Gcc=
In-Reply-To: <tidaeo$2gmf8$1@dont-email.me>
Content-Language: en-US

by: Don Y - Sat, 15 Oct 2022 12:51 UTC

On 10/14/2022 8:46 PM, Don Y wrote:

> Many of these messages are canned, but parameterized. E.g.,
> "Volume level 27 (of 32)"
> "Battery level 35%"
> "Speaking rate 250 words per minute"
> "Battery life remaining, 3:47"
> etc.

> Others are sourced externally, but pronounced locally:
> "Available access points include:
>    greenhornet
>    wifi234
>    guestaccess
>    ...
> "System unavailable. Contact Dr Betina at 234-5000 x234"
> "Scheduled maintenance. Try again after 5:00pm"
> "Account disabled. Contact Jesse at jesse@voyager.com"

> OTOH, an unprompted utterance can be just about anything!
> "You forgot to turn off the stovetop"
> "Your wash is ready to be transferred to the dryer"
> "The mailman has delivered a package"
> "It's getting dark. Are you sure you want to leave the garage door open?"
> "Penny is at the front door"
> "The freezer has warmed to a level that threatens its contents"
> "Someone is wandering around the (fenced) back yard"
> "The evening news broadcast is starting, soon"
> "Time to wake up! You've a doctor appointment at 9:00am"
> "Marijane is on the phone. Would you like to speak to her?"

> I'm just trying to make the experience the "least unusual" that it
> can be, for "average joes". My design philosophy has always been to do
> work so the user doesn't have to!

This is an online emulation of DECtalk -- probably the poster child for
formant-based synthesizers.
<https://archive.org/details/dectalk>
Click the "power" button, then type text at the green ">" prompt,
followed by a period and newline.

Try "You forgot to turn off the stovetop" (without the quotes).
Yes, you can try it again -- as I suspect you weren't clear
as to what it actually *said*! <grin>

Ditto "The mailman has delivered a package" Or, any of the above
examples. Ask yourself if you would have been able to sort out
what was being said had you NOT typed the text into the emulator!

Try words like "police", "Berlin", "Boston", "Wednesday", "Row, row,
row your boat" (note that it doesn't play out the way you'd expect!).

There are (persistent) control codes that you can intersperse in the
input text that tweak the presentation. E.g., "[:ra 100]" (without
quotes but WITH brackets) will slow the speech rate to ~100 words
per minute. (unsighted users will up this to 300 or more and still
expect reasonable comprehension)

There are some predefined "voices" which can be selected. Use
"[:nX]" where X is {b,f,h,k,r,u,v,w} (the characteristics of the "v"
voice can be further adjusted and stored for later recall).
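
For anyone scripting input to the emulator, here is a small Python helper that prepends the two control codes described above. It uses only the "[:ra N]" and "[:nX]" forms quoted in this post. The helper name, its parameters and the choice to separate codes with spaces are assumptions, and the output has not been checked against the emulator.

```python
# Voice letters as listed in the post.
VOICES = set("bfhkruvw")


def with_controls(text, rate_wpm=None, voice=None):
    """Prefix `text` with a voice-select and/or speaking-rate control code."""
    parts = []
    if voice is not None:
        if voice not in VOICES:
            raise ValueError(f"unknown voice letter: {voice!r}")
        parts.append(f"[:n{voice}]")
    if rate_wpm is not None:
        parts.append(f"[:ra {int(rate_wpm)}]")
    parts.append(text)
    return " ".join(parts)


print(with_controls("You forgot to turn off the stovetop.", rate_wpm=100, voice="b"))
```

Since the codes are persistent, they only need to be sent once; later lines can be passed through without a prefix.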

But, these just alter the characteristics of the "waveform synthesis",
not *what* is being synthesized. E.g., "boat" is always pronounced the
same, just "rendered" into a specific voice. I.e., "Insurance" is
always pronounced as "inSURance", never "INsurance".

Note that it is smart enough to alter the pronunciation of the definite
article ("the") based on the noun targeted: "the airplane" vs. "the boat".
So, emulating a speaker that doesn't make this alteration (e.g., B Obama)
isn't easily possible.
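
The article rule can be approximated in a few lines of Python. This sketch keys off the spelling of the following word, which is only a stand-in for its first *sound* (it would get "the hour" and "the university" wrong). The function name and the "thee"/"thuh" respellings are assumptions for illustration; this is not how DECtalk implements it.

```python
def article_form(next_word):
    """Pick a respelling of "the" from the first letter of the next word."""
    first = next_word[:1].lower()
    # Vowel-initial spelling -> "thee"; anything else (or empty) -> "thuh".
    return "thee" if first and first in "aeiou" else "thuh"


print(article_form("airplane"), "airplane")
print(article_form("boat"), "boat")
```

Emulating a speaker who never makes the alternation would then amount to bypassing this rule, which is the kind of per-speaker switch a fixed rule set does not expose.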

[@Dimiter, try the phrases I mentioned in my anecdote -- repeated
many times, on a single line, to simulate a continuous loop. Then,
imagine encountering them "by surprise"...]
