Monday, May 13, 2024

AI: Are we there yet?

Artificial Intelligence or Artificial Idiocy?

As a pioneer of statistical and neural learning technologies for natural language processing and (embodied) conversational agents, it has been great to see the advances that larger and larger language models, and clever use of embeddings, attention and filtering, have made in the last couple of years.

It is important to understand that large language models (LLMs) are themselves statistical models that predict which words and phrases are likely to come next. A model like GPT-4 is trained in a very expensive run through a very large fixed body of text (the corpus) - so the model itself doesn't learn any more after that, and doesn't actually understand anything about the world or what it is saying. So no real intelligence there yet...
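As a concrete illustration of what "predicting the next word" means, here is a minimal sketch using a small open model (GPT-2 via the Hugging Face transformers library) - not what GPT-4 runs on, just the same idea at toy scale:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]       # scores for every possible next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, 5)
    for p, i in zip(top.values, top.indices):
        print(f"{tok.decode(int(i))!r}: {float(p):.3f}")   # the five most likely continuations

The model assigns a probability to every possible next token; generation is just repeatedly sampling from that distribution.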

However, the same kinds of models can be trained on speech, where the units are phonemes rather than characters, and phrasing is conveyed by intonation rather than punctuation. Moreover, similar models are being trained with images and videos, and this does start to give us information about the world.

The AI/LLM systems that you are playing with may have a fixed LLM at the heart, other models trained to help with composing images or speech, and prompts and filters that guide and censor them to produce answers of an acceptable form in an appropriate format, and so on. These additional layers can retain information within and between sessions, can look at images, and can search the web. However, sessions currently tend to be limited, with no direct memory of previous sessions and no actual learning across sessions, and the results of searches tend not to be retained fully even within a session - and the "robots.txt" limits on searches may impact the ability to refine the answer to an ongoing question (so it will want to start a new session on a new topic).

These conversations themselves may be used by a combination of human and automatic processing to improve the overall AI experience even though the underlying LLM hasn't changed. And of course, such experience can feed into future LLMs with greater quality control - although those Large Language Models take months of training on thousands of GPUs.

Writing reports and papers

One of the opportunities these models offer (from the point of view of employees and students) is to research topics and write summaries and reports (and of course, they can also be used to try to identify and distinguish real human/student work from artificial/faked work).

The models are by their nature inclined to make up stories and facts, and are limited in their access to real facts (both those in the original corpus, which has been compiled into embeddings, and those on the searchable web). The report-writing wrappers around the LLMs may thus push them to write in a formal, dot-point style with references to the sources - although these sources need to be checked, as they do not always contain the "fact" asserted. A good way to test out the models is on an area where you are an expert (and for me that's me, my research areas and my writings).

From the perspective of a teacher, we have several problems. One is that students who use these tools are not learning things themselves, and don't know the area well enough even to see what is right and wrong. Longer term, however, it is appropriate for students to learn how to use AI tools to be more efficient and effective - but we have big problems ensuring the accuracy of the AI's results, which requires a separate fact-checking step, and ideally would involve grounding in the real world and actual understanding of what it is talking about.

From the point of view of a user, whether academic or personal, there is a real problem with us believing what the system tells us, even though it is often wrong and can be persuaded to change its mind and tell you something different. There are ethical issues as well with certain uses, including as a "friend" or "adviser" or "counsellor", that were already considered by Joseph Weizenbaum in his 1976 book, "Computer Power and Human Reason: From Judgment to Calculation" — which was written after the "success" of his famous 1960s Eliza/Doctor program.

However, this is a track we are going down, with automated systems providing help and advice, and we are still grappling with the dangers of bad advice or even just a lack of empathy.

There are even fake research papers being written with the help of these LLMs.

Writing stories/books

But while these models may find it difficult to stick to the facts, surely that must mean that story-telling is a natural opportunity. Indeed, there are now many AI-generated stories and books being published, and detecting these is a major headache for the publishing industry. Authors are being asked to disclose if "AI" has been used in the creation of the story, or the images, or the narration (technically we should also say yes if we used Word, since it uses AI for spelling and grammar corrections/suggestions - but I don't use those as they are generally wrong once my initial typos are eliminated).

Conversely, marketers are using "AI" to produce blurbs and teasers, and to select keywords and categories, for human-written books - so even with these attempted protections, the book may be an authentic human story, but what you see when you purchase may be computer generated.

Human authors may also use "AI" in brainstorming for ideas. But that may lead to inadvertent plagiarism, as the LLM can generate phrases, sentences and larger sequences from its training and search data. Also, a book needs a consistent world and history that gives rise to its own set of fictional "facts" - and a lot of the filtering on top of an LLM is there to ensure that it remains consistent within a conversation.

So to write a longer story or hold a longer conversation, it is important to use the LLM itself to summarize facts in a way that can be included in later prompts. Indeed, bigger AI systems may use multiple LLMs to produce, manage and optimize prompts, and to filter and check the results - currently more for internal consistency than external accuracy.
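A minimal sketch of that idea - keeping a running summary of established "facts" that gets fed back into every prompt - might look like the following. The complete() function is a placeholder for whatever chat API you happen to be using, not a real library call:

    def complete(prompt):
        """Placeholder for a call to whatever LLM/chat API is in use."""
        raise NotImplementedError

    story_so_far = ""        # running summary of the fictional "facts" established so far
    for beat in ["outline of scene 1", "outline of scene 2"]:
        draft = complete(
            f"Established facts so far:\n{story_so_far}\n\n"
            f"Write the next scene: {beat}\n"
            "Do not contradict the established facts."
        )
        # fold the new scene back into the running summary for the next prompt
        story_so_far = complete(
            f"Update this list of established facts:\n{story_so_far}\n\n"
            f"with any new facts introduced in this scene:\n{draft}"
        )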

We've also explored using LLMs to write a story in the style of a particular author, and/or target it to an appropriate audience. Generally, they can do pretty well at these stylistic things. But of course in a novel you have to give each character their own personality, and you would somehow have to capture and maintain that across a sequence of prompts.

So far, I'm not finding them much competition for me! Or much help...


Narrating stories/books

Natural "AI" voices are getting pretty good, and for me the big opportunity is to do audiobooks with authentic character voices. So I've recently produced audio versions of some stories in four different ways (some stories/poems/extracts are airing on radio, and I have some audiobook versions of my novels in the works). 

In fact, I think this technology is going to change the whole nature of audiobooks, making them more like the radio plays our parents or grandparents listened to.


Single Narrator

Generally, an audiobook or radio narration will use a single narrator - in my case, the author as narrator. This has a number of technical issues associated with it, relating to the recording equipment and the postprocessing to the appropriate standards. Professional narrators will often work with a professional audio engineer to deal with this.

For me, there are also the limitations of my voice. Although I'm used to lecturing, speaking into a close mike for a recording is a little different. And what is acceptable "live" is not acceptable for audiobooks (e.g. p-pops and s-sibilance and background noise).

Furthermore, I'm not the greatest at accents - particularly playing the parts of half a dozen different eleven-year-olds of both sexes. For a short story with few characters, I can make them different enough by getting into a persona. But my YA novel Time for PsyQ has 42 different people to voice (not to mention some animals).

So in my single narrator version, I gave people accents of different nationalities, and tried to adjust my pitch and voice quality to suggest their age and sex.

Signal Processing

For the signal processing (audio engineering) of my audio I use Apple's GarageBand. This allows for the normal compression and limiting needed to meet audiobook standards, but also allows some other possibilities. I pulled characters out of my narration onto separate tracks, and adjusted fundamental and formant frequencies to adjust the age and sex of my characters.

This works surprisingly well, and I found it works best if I don't try to distinguish age/sex in my own narration - concentrating on prosody and national/ethnic accent.

Note that no AIs were harmed in the making of this version - it is all standard signal processing, using techniques that go back a century. The fundamental frequency of a voice is what carries tone (in tone languages), intonation (in speech) and pitch (in singing). The formants are the resonance frequencies that relate to the shape of your vocal tract (including the oral and nasal cavities) as altered by the manipulation of articulators (like the tongue and lips) - so they change dynamically in a way defined by the language/dialect/accent, as well as systematically due to individual differences (including the smaller dimensions of females and children, giving rise to higher frequencies/resonances than adult males).
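I did all of this inside GarageBand, but for anyone who wants to experiment in code, here is a rough Python sketch of the same idea using the classic WORLD vocoder (assuming the pyworld and soundfile packages and a mono input file): scale the fundamental frequency, and warp the spectral envelope along the frequency axis to shift the formants. This is not what GarageBand does internally, just an illustration of the principle:

    import numpy as np
    import soundfile as sf
    import pyworld as pw

    def shift_voice(in_wav, out_wav, f0_ratio=1.5, formant_ratio=1.15):
        """Raise pitch and formants to suggest a younger or female character.
        Pure signal processing (vocoder analysis/resynthesis), no neural networks."""
        x, fs = sf.read(in_wav)                      # assumes a mono file
        x = np.ascontiguousarray(x, dtype=np.float64)
        f0, sp, ap = pw.wav2world(x, fs)             # pitch track, spectral envelope, aperiodicity
        f0 = f0 * f0_ratio                           # scale the fundamental frequency
        bins = np.arange(sp.shape[1])
        # stretch each frame's envelope upward in frequency to raise the formants
        warped = np.array([np.interp(bins / formant_ratio, bins, frame) for frame in sp])
        y = pw.synthesize(f0, np.ascontiguousarray(warped), ap, fs)
        sf.write(out_wav, y, fs)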

I call this signal processing approach voice munging.

Voice Changing/Voice to Voice (V2V)

There are now many open-source packages out there for doing voice-changing, although they are tricky to use and install. There are also companies that are using these packages on a subscription basis to allow you to change your voice to a celebrity voice (do not use for commercial purposes), or a synthetic voice, or in some cases a commercial voice (where a voice actor has been paid and the voice is licensed for performance in commercial works). 

None of them are particularly good at this point - those I have tried all have their limitations, and they are generally designed for karaoke/covers in someone else's singing voice. Nonetheless, I managed to produce a version of Time for PsyQ Chapter 1 with good characterizations, by using pitch adjustment in combination with the target voice's formant characteristics.

The target voice characteristics are extracted by training on recorded speech samples, and it is relatively easy to get the characteristic relaxed baseline formants for a particular speaker (just get them to say "errrrr"). It doesn't even require a neural network (on the order of 20 parameters generally suffices, and this vocoder-type technology goes back close to a century), although these systems do use "AI" in the form of an ANN (Artificial Neural Network) to derive theirs.
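For instance, here is a rough sketch of how those relaxed baseline formants can be pulled out of a steady "errrrr" recording with classic linear predictive coding (LPC) - a dozen or so coefficients and no neural network (assuming the librosa package; the file name err.wav is just a placeholder):

    import numpy as np
    import librosa

    def estimate_formants(wav_path, order=12, n_formants=4):
        """Rough formant frequencies from a steady vowel, via LPC root-finding."""
        y, sr = librosa.load(wav_path, sr=16000)
        y = np.append(y[0], y[1:] - 0.97 * y[:-1])      # pre-emphasis
        mid = len(y) // 2
        frame = y[mid:mid + 1024] * np.hamming(min(1024, len(y) - mid))
        a = librosa.lpc(frame, order=order)             # all-pole vocal tract model
        roots = [r for r in np.roots(a) if np.imag(r) > 0]
        freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
        return [f for f in freqs if f > 90][:n_formants]   # drop near-DC roots

    print(estimate_formants("err.wav"))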

Because the commercial systems were so inconvenient to use, this took a huge amount of time. I may write my own wrapper around one of the open-source voice-changers to make this a bit more automatic (I had to identify all the quotations, pull them out individually, change the frequencies and regenerate in the target voice, then paste each one back into an appropriate character track).
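The first step of such a wrapper is easy enough to sketch - finding every quotation in the chapter text, with its position, so the matching stretch of audio can be pulled out and later pasted back in order. A toy illustration, assuming straight or curly double quotes and a placeholder file called chapter1.txt:

    import re

    QUOTE = re.compile(r'["\u201c](.+?)["\u201d]', re.S)   # straight or curly double quotes

    def list_quotations(text):
        """Return (start, end, speech) for every quotation, in reading order."""
        return [(m.start(), m.end(), m.group(1)) for m in QUOTE.finditer(text)]

    with open("chapter1.txt") as f:
        for start, end, speech in list_quotations(f.read()):
            print(f"{start}-{end}: {speech[:40]}")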

This gives really good results because more parameters are controllable than I could manage manually in GarageBand (the V2V systems usually use a Python toolchain, although I often use Matlab). I also added sound effects at the various breaks in this version (which will be used on radio).


Voice Cloning/Text to Speech (TTS)

The final option I wish to present is text to speech. I tested multiple systems and commercial products, all of which had considerable pros and cons. Some of the commercial systems clearly had voices trained using different toolchains, and these different voices thus had different degrees of control available, in terms of adjusting pitch, emphasis, naturalness, speech rate, etc. (and one charged five times as much per character for the more realistic voices, including your own cloned voices). But I won't mention any names, as this is not intended to be a formal review of such systems and I wouldn't want to plug any of them.

So what is a cloned voice? One feature of the commercial systems is that they allow you to clone your own voices (licensed by the month), which means that you can use your own voice, family/friend voices, or borrowed/stolen voices. I guess this is useful if you have difficulty speaking clearly for long periods or don't have the appropriate equipment to do commercial-quality recording. However, you need to provide samples of your voice, and the quality of these samples determines the quality of the cloned voice - it reproduces the imperfections, the background/room noise, etc. And even some of the voices provided had such issues. Also, the cloning may really be a merging of your voice into an existing model - which means an American accent can be heard under the clone of my Australian voice even after training with a couple of hours of my speech.

But the big problem here is that the TTS systems don't read the way I do or speak the way my characters should. While V2V can track my prosody (pitch, timing, intonation, loudness, accent), these systems do their own thing - badly. The better systems/voices allowed some repairs by virtue of their pitch/rate/emphasis/naturalness controls. The other issue is that the repertoire of available voices and accents is small.

But the main reason I abandoned these commercial systems was their per-character cost - particularly given the number of attempts at tweaking I needed to get an acceptable result.

Google Play's AI voices/autonarration

After several weeks of playing around with various alternatives, and noting that Findaway Voices and Google Play will only put up autonarrated audiobooks done by Google Play, I decided to try that for Time for PsyQ.

I quickly found that the Google Play interface was limited (it could only change the voice and the speech rate), flawed (question and command/imperative/exclamation intonation is not produced, and postposed speech tags, like "Airlie said", are pronounced with loud sentence-start levels and intonation) and buggy (there is the ability to correct mispronounced words, but it doesn't work reliably because it gets a couple of characters out in the segmentation - that is, it replaces part of the word, or parts of two words, with the new phonetically described pronunciation).

So I did a lot of rewriting: moving speech tags from the end of quotations to the start, and replacing mispronounced words with phonetically spelled versions in the text, or with a different word. You'd think it would know that "he'd read" is pronounced like "red", not "reed", but it tended to get that wrong.
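Some of that rewriting can be automated. Here is a toy sketch of the simplest case - moving a postposed speech tag to the front of the quotation so the autonarrator doesn't hit it with sentence-start intonation (questions, exclamations and less common tag verbs would still need hand editing):

    import re

    # '"We can go now," Airlie said.'  ->  'Airlie said, "We can go now."'
    TAG = re.compile(r'"([^"]+),"\s+([A-Z][a-z]+ (?:said|asked|replied|whispered|called))\.')

    def front_load_tag(line):
        return TAG.sub(r'\2, "\1."', line)

    print(front_load_tag('"We can go now," Airlie said.'))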

I had hoped to munge the resulting audio to fix the most egregious prosody problems (questions and speech tags, and excited kids) — at least for broader Findaway Voices distribution. But when the specific AI-voice terms and conditions were finally presented, they made clear that I could only use the generated audio in unaltered form. I will note that it is not clear from the terms of the various audiobook distributors whether my munged versions or V2V versions would be acceptable either, even though I technically narrated it all.

So with lots of workarounds, I completed the Time for PsyQ markup, got what I felt was an acceptable version, and published it on Google Play. The final version actually does sound pretty good - even better than the preview version I worked with in their autonarration interface (although some voices didn't sound as good, due I expect to compression to audiobook standards).

I am currently waiting for Findaway Voices to approve it and make it available for Kobo, Spotify, audiobooks.com, etc.

As they become available, all versions/outlets will be linked into