Monday, May 13, 2024

AI: Are we there yet?

Artificial Intelligence or Artificial Idiocy?

As a pioneer of statistical and neural learning technologies for natural language processing and (embodied) conversational agents, it has been great to see the advances that larger and larger language models, and clever use of embeddings, attention and filtering, have made in the last couple of years.

It is important to understand that large language models (LLMs) are themselves statistical models that predict what words and phrases are likely to come next. A model like GPT-4 is trained in a very expensive run over a very large fixed body of text (the corpus) - so the model itself doesn't learn any more after that, and doesn't actually understand anything about the world or what it is saying. So no real intelligence there yet...
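
To make the "predict what comes next" point concrete, here is a minimal sketch in Python using the small open GPT-2 model via the Hugging Face transformers library - purely illustrative, since the commercial models are vastly larger, but the principle of scoring candidate next tokens is the same:

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The koala climbed the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token only

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r:>12}  {p.item():.3f}")
```

The model just assigns probabilities over its vocabulary; everything else (chat behaviour, tools, "personality") is layered on top.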

However, the same kinds of models can be trained on speech, where the units are phonemes rather than characters, and phrasing is conveyed by intonation rather than punctuation. Moreover, similar models are being trained with images and videos, and this does start to give us information about the world.

The AI/LLM systems that you are playing with may have a fixed LLM at the heart, other models trained to help compose images or speech, and prompts and filters that guide and censor them to produce answers of an acceptable form in an appropriate format, and so on. These additional layers can retain information within and between sessions, can look at images, and can search the web. However, sessions currently tend to be limited, with no direct memory of previous sessions and no actual learning across sessions, and the results of searches tend not to be retained fully even within a session - and the "robots.txt" limits on searches may impact the ability to refine the answer to an ongoing question (so the system will want to start a new session on a new topic).

These conversations themselves may be used by a combination of human and automatic processing to improve the overall AI experience even though the underlying LLM hasn't changed. And of course, such experience can feed into future LLMs with greater quality control - although those Large Language Models take months of training on thousands of GPUs.

Writing reports and papers

One of the opportunities (from the point of view of employees and students) for these models is to research topics and write summaries and reports (and of course, they can also be used to try to identify and distinguish real human/student work from artificial/faked work).

The models are by their nature inclined to make up stories and facts, and are limited in their access to real facts (both those in the original corpus, due to the compilation into embeddings, and those in the searchable web). The report-writing wrappers around the LLMs may thus push them to write in a formal, dot-pointed way with references to the sources - although these sources need to be checked, as they do not always contain the "fact" asserted. A good way to test out the models is in an area where you are expert (and for me that's me, my research areas and my writings).

From the perspective of a teacher, we have several problems. One is that students who use these tools are not learning things themselves, and don't know the area well enough even to see what is right and wrong. Longer term, however, it is appropriate for students to learn how to use AI tools to be more efficient and effective - but we have big problems ensuring the accuracy of the AI's results, which requires a separate fact-checking step, and ideally would involve grounding in the real world and actual understanding of what it is talking about.

From the point of view of a user, whether academic or personal, there is a real problem with us believing what the system tells us, even though it is often wrong and can be persuaded to change its mind and tell you something different. There are ethical issues as well with certain uses, including as a "friend" or "adviser" or "counsellor", that were already considered by Joseph Weizenbaum in his 1976 book, "Computer Power and Human Reason: From Judgment to Calculation" — which was written after the "success" of his famous 1960s Eliza/Doctor program.

However, this is a track we are going down with automated systems providing help and advice, and we are currently addressing the dangers of bad advice or even just lack of empathy.

There are even fake research papers being written with the help of these LLMs.

Writing stories/books

But while these models may find it difficult to stick to the facts, surely that must mean that story-telling is a natural opportunity. Indeed, there are now many AI-generated stories and books being published, and detecting these is a major headache for the publishing industry. Authors are being asked to disclose whether "AI" has been used in the creation of the story, or the images, or the narration (technically we should also say yes if we used Word, since it uses AI for spelling and grammar corrections/suggestions - but I don't use those, as they are generally wrong once my initial typos are eliminated).

Conversely, marketers are using "AI" to produce blurbs and teasers, and to select keywords and categories, for human-written books - so even with these attempted protections, the books may be authentic human stories, but what you see when you purchase may be computer-generated.

Human authors may also use "AI" in brainstorming for ideas. But that may lead to inadvertent plagiarism, as the LLM can generate phrases, sentences and larger sequences from its training and search data. Also, a book needs a consistent world and history that gives rise to its own set of fictional "facts" - and a lot of the filtering on top of an LLM is there to ensure that it remains consistent within a conversation.

So to write a longer story or hold a longer conversation, it is important to use the LLM itself to summarize facts in a way that can be included in later prompts, as sketched below. Indeed, bigger AI systems may use multiple LLMs to produce, manage and optimize prompts, and to filter and check the results - currently more for internal consistency than external accuracy.
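
Here is a rough sketch of that summarize-and-carry-forward idea, with a hypothetical call_llm(prompt) function standing in for whichever chat API is being used (the function name and prompts are illustrative, not any particular product's API):

```python
def write_long_story(outline_beats, call_llm):
    """Draft a story beat by beat, carrying forward an LLM-written fact
    sheet so later prompts stay consistent with earlier scenes.
    call_llm(prompt) -> str is a hypothetical stand-in for a chat API."""
    story = ""
    fact_sheet = "No established facts yet."
    for beat in outline_beats:
        scene = call_llm(
            "You are drafting one scene of a novel.\n"
            f"Established facts (do not contradict these): {fact_sheet}\n"
            f"Write the next scene: {beat}"
        )
        story += "\n\n" + scene
        # Re-summarize so the prompt stays short even as the story grows.
        fact_sheet = call_llm(
            "Update this fact sheet with any new characters, places or events, "
            "keeping it under 200 words.\n"
            f"Current facts: {fact_sheet}\nNew scene: {scene}"
        )
    return story
```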

We've also explored using LLMs to write a story in the style of a particular author, and/or target it to an appropriate audience. Generally, they can do pretty well at these stylistic things. But of course in a novel you have to give each character their own personality, and would somehow have to capture and maintain that in a sequence of prompts.

So far, I'm not finding them much competition for me! Or much help...


Narrating stories/books

Natural "AI" voices are getting pretty good, and for me the big opportunity is to do audiobooks with authentic character voices. So I've recently produced audio versions of some stories in five different ways (some stories/poems/extracts are airing on radio, and I have some audiobook versions of my novels in the works).  

I've now produced my first audiobook (of Time for PsyQ) using Google Play AI technology, with the earlier chapters narrated in five different ways using three different toolchains (not all of which involve AI) as I experimented. For Apple Books, I've used their single female AI voice to autonarrate a second version. At this point the technology is new, and there are significant restrictions about which outlets will accept what.

In fact, I think this technology is going to change the whole nature of audiobooks, making them more like the radio plays our parents or grandparents listened to. Down the track, we can expect to see multicharacter autonarrated audiobooks and even autoacted videobooks that compete with telemovie adaptations.

Unfortunately, reading is a dying art: both reading out loud, with appropriate expression; and reading to oneself, with good comprehension. Reading books ourselves requires us to interpret the author's descriptions of scenes and characters and emotions constructively, imagining them. Thus books impose the most cognitive load, audiobooks less, and videos/movies/plays the least. This suggests why people now watch movies and teleseries, including adaptations from books, more than they actually read books, with audiobooks now starting to overtake ebooks in market share and looking likely to occupy an intermediate position.

The rise of audiobooks is somewhat controversial in relation to their effect on literacy. Listening still requires interpretation of scene and character details, but a good narrator will convey emotions and distinguish the characters with slightly different pitch, accent and/or mannerisms. A radio play or dramatized audiobook goes further by adding sound effects (and I have experimented with character voices and sound effects for some of my stories/chapters voiced for radio) - but these are not permitted by the Google and Apple AI narration tools, and Amazon says they cause problems for its AI-mediated Whispersync.

Educators, including teacher-librarians, face a variety of so-called scholarly evidence and other inputs and recommendations on this subject, much of it only partially accurate.

It is partly true that the same brain areas are involved in "reading" audiobooks and electronic or print books, in the sense that our language areas are active, and the cognitive areas responsible for understanding and interpreting the text are active. To an extent even some speech/hearing areas are active, as nearby areas are involved in phonological, lexical and grammatical processing, and there are also "mirror" neurons that fire across multiple modalities (which Time for PsyQ's 11-year-old heroine mentions in the book, which has a lot of brain science and technology in it). Actually, in my PhD (late '70s, early '80s) I predicted and modeled mirror neurons as being necessary for language learning, around the same time they were being discovered elsewhere (although unfortunately, the paper about their discovery was rejected by Science, so publication was delayed till after my PhD thesis was complete).

The use of parallel hearing and reading of texts is also useful for comprehension - one of the reasons we use multimodal methods in teaching. I used this approach in learning Chinese (where the characters are more semantic than phonetic, and have different pronunciations in different languages/dialects). The phonics approach to teaching people to read has the disadvantage of focussing on letters rather than words and sentences, and hearing an audiobook as they read along in the text — or having it read to them by a parent or teacher as they read along — helps trigger those mirror neurons, helps them learn the pronunciation of less phonetic words and names, and models and encourages good fluent reading (if the reader is good - many cheap audiobooks of classics have rather poor readers who mispronounce the less common words).

In fact, the main reason I chose to adapt Time for PsyQ to an audiobook format was that a considerable number of parents and teachers had mentioned that they had enjoyed reading the book out loud with their children.

On the other hand, I still have reservations about audiobooks, and note that the western world is seeing an increasing loss of literacy, with most Americans reading at primary school level or below. I expect the increasing prevalence of audiobooks, particularly in schools and libraries, to drive literacy to even lower levels.

The typical adult in an English-speaking country spends over 5 hours a day watching video of one form or another, and around 2 hours a day listening to audio of one form or another, with audiobooks approaching 1 hour a day on average, while reading physical or electronic books has fallen to 16 minutes a day on average in the US, with a similar amount of time spent reading traditional news sources.

There is also a corresponding transition from face-to-face and voice-telephony interaction to social media, and smartphones' improved speech interfaces are impacting use of the reading/writing/typing modalities still further.

Nonetheless, I chose to go ahead and produce an audiobook of Time for PsyQ, which has just been published through Google Play and Findaway Voices, and is already available for Kobo (although it will not be available on Amazon or Apple in the near future due to their rules regarding AI-mediated narration).


Single Narrator

Generally, an audiobook or radio narration will use a single narrator - in my case, the author as narrator. This has a number of technical issues associated with it, relating to the equipment and the postprocessing to appropriate standards. Professional narrators will often work with a professional audio engineer to deal with this.

For me, there are also the limitations of my voice (although of the dozen or so audiobooks I've sampled, only the best professional voice actors do better). Furthermore, I am used to lecturing with a lapel mike, and speaking into a close mike for a recording is a little different. Also, what is acceptable "live" is not acceptable for audiobooks (e.g. p-pops, s-sibilance and background noise).

In addition, I'm not the greatest at accents — particularly playing the parts of half a dozen different eleven-year-olds of both sexes. For a short story, with few characters, I can make them different enough by getting into a persona. But my YA novel Time for PsyQ has 42 different people to voice (not to mention some animals).

So in my single narrator version, I gave people accents of different nationalities, and tried to adjust my tone, pitch, formants, speaking rate and voice quality to suggest their age and sex.

Signal Processing

For the signal processing (audio engineering) of my audio I use Apple's GarageBand. This allows for the normal compression and limiting needed to meet audiobook standards, but also allows some other possibilities. I pulled characters out of my narration onto separate tracks, and adjusted fundamental and formant frequencies to suggest the age and sex of my characters.

This works surprisingly well, and I found it works best if I don't try to distinguish age/sex with frequency/formant shifts in my own narration — concentrating on prosody and national/ethnic accent.

Note that no AIs were harmed in the making of this version — all is standard signal processing using techniques that go back a century. The fundamental frequency of a voice is its tone (for tone languages) or intonation or pitch (for singing). The formants are the resonance frequencies that relate to the shape of your vocal tract (including oral and nasal cavities) as altered by manipulation of articulators (like the tongue and lips) — so they change dynamically in a way defined by the language/dialect/accent, as well as systematically due to individual differences (including the smaller dimensions of females/children giving rise to higher frequencies/resonances than adult males).

I call this signal processing approach voice munging.
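
For anyone who would rather script this kind of munging than do it in a DAW, here is a rough sketch of the same idea using the WORLD vocoder via the pyworld and soundfile packages (an assumption about tooling on my part, and the filenames are hypothetical; it is not what I used in GarageBand, but it shows the separation of fundamental frequency from the formant-carrying spectral envelope):

```python
# pip install pyworld soundfile numpy
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("narration.wav")            # hypothetical mono recording
x = np.ascontiguousarray(x, dtype=np.float64)

# WORLD analysis: fundamental frequency (F0), spectral envelope, aperiodicity
f0, sp, ap = pw.wav2world(x, fs)

pitch_ratio = 1.3      # raise F0 by ~30% (towards a younger-sounding voice)
formant_ratio = 1.15   # shift formants up ~15% (suggesting a smaller vocal tract)

f0_new = f0 * pitch_ratio

# Warp the spectral envelope along the frequency axis to move the formants
bins = np.arange(sp.shape[1])
sp_new = np.array([np.interp(bins / formant_ratio, bins, frame) for frame in sp])

y = pw.synthesize(f0_new, np.ascontiguousarray(sp_new), ap, fs)
sf.write("narration_shifted.wav", y, fs)
```

Raising the fundamental and the formants independently is the kind of control that pitch-shift plugins expose as a couple of sliders.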

Voice Changing/Voice to Voice (V2V)

There are now many open-source packages out there for doing voice changing, although they are tricky to install and use. There are also companies offering these packages on a subscription basis, allowing you to change your voice to a celebrity voice (not for commercial use), a synthetic voice, or in some cases a commercial voice (where a voice actor has been paid and the voice is licensed for performance in commercial works).

None of them are particularly good at this point — those I have tried all have their limitations, and are generally designed for karaoke/covers in someone else's singing voice. Nonetheless, I managed to produce a version of Time for PsyQ Chapter 1 with good characterizations by using pitch adjustment in combination with the target voice's formant characteristics.

The target voice characteristics are extracted by training on recordings of the target speaker, and it is relatively easy to get the characteristic relaxed baseline formants for a particular speaker (just get them to say errrrr). It doesn't even require a neural network (generally on the order of 20 parameters suffice, and this vocoder-type technology goes back close to a century), although these systems do use "AI" in the form of an ANN (Artificial Neural Network) to derive theirs.
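
As a concrete illustration of how few parameters are needed, here is a sketch of classic LPC formant estimation in Python with librosa and numpy (the filename is hypothetical, and this is the textbook approach rather than what any particular V2V package does internally):

```python
# pip install librosa numpy
import numpy as np
import librosa

# A short sustained "errrr" from the target speaker (hypothetical file)
y, sr = librosa.load("target_err.wav", sr=16000)

# Pre-emphasize and fit an LPC model; an order around 16-20 is plenty at
# 16 kHz, echoing the "around 20 parameters" point above.
y = librosa.effects.preemphasis(y)
a = librosa.lpc(y, order=18)

# The roots of the LPC polynomial give the candidate resonances (formants)
roots = np.roots(a)
roots = roots[np.imag(roots) > 0]            # keep one of each conjugate pair
freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
formants = freqs[freqs > 90]                 # discard near-DC roots
print(formants[:4])                          # roughly F1..F4 for the relaxed vowel
```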

Because the commercial systems were so inconvenient to use, this took a huge amount of time. I may write my own wrapper around one of the open-source voice-changers to make this a bit more automatic (I had to identify all the quotations, pull them out individually, change frequencies and regenerate in the target voice, then paste each back into an appropriate character track).
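
The first step of such a wrapper would be something like the following sketch (a hypothetical helper that only handles straight or curly double quotes), which splits a chapter into quoted dialogue and narration so each span can be routed to a character voice or left in the narrator track:

```python
import re

QUOTE_RE = re.compile(r'[\u201c"]([^"\u201d]+)[\u201d"]')

def split_dialogue(text):
    """Split text into (kind, span) pieces: 'quote' for material inside
    double quotes, 'narration' for everything else. A simplification -
    nested or unterminated quotes would need extra handling."""
    pieces, last = [], 0
    for m in QUOTE_RE.finditer(text):
        if m.start() > last:
            pieces.append(("narration", text[last:m.start()]))
        pieces.append(("quote", m.group(1)))
        last = m.end()
    if last < len(text):
        pieces.append(("narration", text[last:]))
    return pieces

sample = 'Airlie frowned. "That can\'t be right," she said.'
for kind, span in split_dialogue(sample):
    print(kind, "->", span.strip())
```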

This gives really good results because more parameters are controllable than I could manage manually in GarageBand (the V2V systems usually use a Python toolchain, although I often use Matlab). I also added sound effects at the various breaks in this version (which will be used on radio).


Voice Cloning/Text to Speech (TTS)

The final option I wish to present is text to speech (TTS); I tested multiple systems and commercial products — all of which had considerable pros and cons. Some of the commercial systems had different sets of voices trained using different toolchains, and these different technologies allowed different degrees of control in terms of adjusting pitch, emphasis, naturalness, speech rate, etc. (and one charged 5 times as much per character for the more realistic voices, including your own cloned voices). But I won't mention any names, as this is not intended to be a formal review of such systems and I wouldn't want to plug any of them.

So what is a cloned voice? One feature of the commercial systems is that they allow you to clone your own voices (licensed by the month), which means that you can use your own voice, family/friend voices, or borrowed/stolen voices. I guess this is useful if you have difficulty speaking clearly for long periods or don't have the appropriate equipment to do commercial-quality recording. However, you need to provide samples of your voice, and the quality of these samples determines the quality of the cloned voice - it reproduces the imperfections, the background/room noise, etc. And even some of the voices provided had such issues. Also, the cloning may really be a merging of your voice into an existing model — which means an American accent can be heard under the clone of my Australian or British voicing, even after training with a couple of hours of my speech.

But the big problem here is that the TTS systems don't read the way I do or speak the way my characters should. While V2V can track my prosody (pitch, timing, intonation, loudness, accent), these systems do their own thing - badly. Some systems/voices did allow some repairs by virtue of their pitch/rate/emphasis/naturalness controls. The other issue is that the repertoire of available voices and accents is small. Also, voices that are more 'intelligent' tend to offer less opportunity to affect their prosody manually.
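
To give a feel for what those controls look like under the hood, many systems expose them through SSML markup. Here is a sketch using Amazon Polly (mentioned further below) via the boto3 library and its standard Australian voice Nicole - an illustration of the style of control, not the toolchain I settled on, and it assumes AWS credentials are already configured:

```python
# pip install boto3   (assumes AWS credentials are configured)
import boto3

polly = boto3.client("polly")

# SSML prosody lets you nudge rate and pitch per phrase; note that pitch
# adjustment is only honoured by the standard (non-neural) voices.
ssml = """
<speak>
  <prosody rate="95%" pitch="+10%">
    That can't be right, Airlie said.
  </prosody>
</speak>
"""

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",
    VoiceId="Nicole",        # a standard Australian English voice
    OutputFormat="mp3",
)
with open("line.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```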

But the main reason I abandoned these commercial systems was their per-character cost — particularly given the number of tweaking attempts I needed to get an acceptable result.

Google Play's AI voices/autonarration

After several weeks of playing around with various alternatives, and noting that Findaway Voices and Google Play will only put up autonarrated audiobooks done by Google Play, I decided to try that for Time for PsyQ (ACX doesn't allow me to submit anything, as I don't reside in the US).

With the Google Play autonarration, there is a limited repertoire of English-speaking voices: 12 US, 5 IN, 4 UK and 3 AU, split unevenly across male and female, with no voice actors younger than 30ish (there are one female and two male US voices that sound like late 20s). For comparison, Amazon Polly also has 12 US voices, plus 6 UK voices (including 1 Welsh and 1 Irish), all adult; 3 IN voices, all female; 2 Australian voices, male and female; and 1 NZ and 1 ZA voice (both female). Quite tough if you have an international cast with few (if any) Americans!
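
For what it's worth, Polly's line-up can at least be queried programmatically; a quick boto3 sketch, again assuming configured AWS credentials:

```python
import boto3

polly = boto3.client("polly")

# List the Australian English voices Polly currently offers
for v in polly.describe_voices(LanguageCode="en-AU")["Voices"]:
    print(v["Id"], v["Gender"], v.get("SupportedEngines"))
```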

I quickly found that the Google Play interface was limited (I could only change the target voice and the speech rate), flawed (question and command/imperative/exclamation intonation is not reliably produced, and postposed speech tags, like "Airlie said", are pronounced with loud sentence-start levels and intonation), and overall quite buggy (e.g. hanging periodically; and although there is the ability to correct mispronounced words, it doesn't work reliably because it gets a couple of characters out in the segmentation — that is, it replaces part of the word, or parts of two words, with the new phonetically described pronunciation).

So I did a lot of rewriting: moving speech tags from the end of quotations to the start, and replacing mispronounced words with phonetically spelled versions in the text, or with a different word. You'd think it would know that "he'd read" is pronounced like "red" not "reed", but it tended to get that wrong. Overall, the result sounds much better than the half dozen AI-voice audiobooks I've listened to on Google Play.
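
Had I scripted these rewrites rather than doing them by hand, the core of it would look something like this sketch (hypothetical helpers; the respelling table is illustrative, and real chapters would need many more patterns):

```python
import re

# Illustrative respellings for words the autonarration tends to mispronounce
RESPELL = {
    r"\bhe'd read\b": "he'd red",
}

def move_speech_tag(sentence):
    """Turn a postposed tag ('"...," Airlie said.') into a preposed one
    ('Airlie said, "..."') so the loud sentence-start intonation lands on
    the tag rather than the quotation."""
    m = re.match(r'^[\u201c"](.+?),[\u201d"]\s+(\w[\w\s]*?\bsaid)\.\s*$', sentence)
    if not m:
        return sentence
    quote, tag = m.groups()
    return f'{tag[0].upper()}{tag[1:]}, "{quote}."'

def respell(text):
    for pattern, replacement in RESPELL.items():
        text = re.sub(pattern, replacement, text)
    return text

print(move_speech_tag('"That can\'t be right," Airlie said.'))
print(respell("Luckily he'd read the brochure twice."))
```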

I had hoped to munge the resulting audio to fix the most egregious prosody problems (questions and speech tags, and excited kids) — at least for broader Findaway Voices distribution - but when the specific AI-voice terms and conditions were presented, they made it clear that I could only use the generated audio in unaltered form. I will note that it is not clear from the various terms of the various audiobook distributors whether my munged versions or V2V versions would be acceptable either, even though I technically narrated it all.

So with lots of workarounds, I completed the Time for PsyQ markup, got what I felt was an acceptable result, and published it on Google Play. The final version actually does sound pretty good - perhaps even better than the preview version I worked with in their autonarration interface (although some voices didn't sound as good as I expected, probably due to compression to audiobook standards).

Findaway Voices has approved it, and it will soon be available for Kobo (Walmart ✓), Spotify, audiobooks.com, etc.

As they become available, all versions/outlets will be linked into