Monday, May 13, 2024

AI: Are we there yet?

Artificial Intelligence or Artificial Idiocy?

As a pioneer of statistical and neural learning technologies for natural language processing and (embodied) conversational agents, it has been great to see the advances that larger and larger language models, and clever use of embeddings, attention and filtering, have made in the last couple of years.

It is important to understand that large language models (LLMs) are themselves statistical models that predict which words and phrases are likely to come next. A model like GPT-4 is trained in a very expensive run over a very large fixed body of text (the corpus), so the model itself doesn't learn any further, and doesn't actually understand anything about the world or what it is saying. So no real intelligence there yet...
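For the programmers among you, here is a toy illustration of what "predicting the next word" means - a bigram counter over a tiny made-up corpus rather than a neural network over trillions of tokens, but the task being learned is the same idea (the corpus and code are purely illustrative):

```python
from collections import Counter, defaultdict
import random

# Toy next-word predictor: count word bigrams in a tiny corpus and sample a
# likely continuation. Real LLMs use deep neural networks over subword tokens
# and vastly more data, but the training objective is the same idea.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    counts = bigrams[word]
    words, freqs = zip(*counts.items())
    return random.choices(words, weights=freqs)[0]   # sample by observed frequency

print(predict_next("the"))   # e.g. "cat", "dog", "mat" or "rug"
```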

However, the same kinds of models can be trained on speech, where the units are phonemes rather than characters, and phrasing is conveyed by intonation rather than punctuation. Moreover, similar models are being trained with images and videos, and this does start to give us information about the world.

There are also risks that come from the social and legal pressures that are being brought to bear on the development of these systems. 

When I started working with embeddings and LLMs in the 1970s (for speech and text), there were three big problems beyond the actual language and learning domain: the ability to find large enough amounts of text and/or speech, the size and cost of primary and secondary storage (memory and disk/tape), and the speed of the computers and their storage systems. I was working initially with (multiple) 8- and 16-bit computers where memory sizes were just a few KB (64KB was the limit) and disk sizes were just a few MB (my first hard disk was 10MB), so a large corpus was of the order of a couple of million characters.

Now large language models are trained on millions of books, billions of webpages, trillions of characters of text (i.e. terabytes). They also have access to movies/photos/images and programs/code. A major risk thus relates to availability, quality, privacy and copyright. Have these materials been used and copied illegally? And should copyright holders get redress (demands include not just financial recompense but deletion of any LLMs that include their work or are tainted by illegal use of it as training data)? Or should copyright legislation and its fair use exceptions be modernized to allow and control such use (as happened with fair use copying for print, audio and video)?

I am an interested party on both sides of this: my books and other publications have been indexed and analysed by Google and others, but I see this as different from pirating. It seems to me to be fair use, as it enables me to search the web and find books and papers - and see enough context to have a fair idea whether a work is worth obtaining and reading (my university has subscriptions to most things I want, and can get others on interlibrary loan or as fair use copies). There are provisions in law for this, including payment of usage fees to copyright agencies on behalf of university and school users. It is also possible for this to be paid by the government or the benefiting industry sectors (and covered by appropriate taxes and tariffs, as happened, for example, with audiovisual recording media). This is the path I hope and expect governments will follow, but as usual technology moves faster than government, the courts do the best they can, and the legislation that results is not always technologically sound or pragmatically useful (and indeed new lobocracy laws may contradict a user's fair use rights).

From the AI side of the question, the LLM does not actually include a copy of any particular copyright work. Rather, information from many works is "embedded" in statistical and/or neural frameworks that find the commonalities and interrelations. So what is generated is unlikely to be a full or unfair (more than 10%) use of a work, although it is quite likely to generate phrases and statements (linguistic or programmatic) that appear in similar forms in multiple works. Most of the language (and code) we use is commonplace and idiomatic, and only the names (or variables) change. What is not commonplace but unusual or novel is what actually constitutes intellectual property or literary or technical invention. If I use an LLM-based system as a sounding board to flesh out my ideas and bring them together, that is very useful, whether that is coded in natural language or a programming language, or the results are encoded as an image.

Over the years before ChatGPT, Copilot, Gemini and the like made their appearance, I was using similar techniques to provide a hands-free interface (based on eye-tracking and EEG, for people with disabilities) to allow searching the web, collating the results into a report, and providing/suggesting appropriate quotations and citations. As a teacher, I teach students how to do this properly, quoting and attributing things correctly and avoiding academic dishonesty, plagiarism and the like. This is much like what these LLM systems try to do, although at the moment they don't do it very well - much like my undergraduate students. But because of outcries about copyright and plagiarism, or getting the LLM to do students' assignments, or not reflecting today's politically correct prescriptions and proscriptions, the systems are being hamstrung, downgraded and restricted to the point where they are not as useful as they could be (and indeed they are thus getting worse rather than better in terms of utility).

The AI/LLM systems that you are playing with may have a fixed LLM at the heart, other models trained to help with composing images or speech, prompts and filters that guide and censor them to produce answers of an acceptable form in an appropriate format, and so on. These additional layers can retain information within and between sessions, can look at images and can search the web. However, currently sessions tend to be limited, with no direct memory of previous sessions and no actual learning across sessions, and the results of searches tend not to be retained fully even within a session - and the "robots.txt" limits on searches may impact the ability to refine the answer to an ongoing question (so the system will want to start a new session on a new topic).
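To make that layering concrete, here is a minimal sketch of such a wrapper: a fixed, frozen model behind an API, with the session memory, search and output filtering living outside it. The function names (call_llm, web_search) and the blocklist are hypothetical placeholders, not any vendor's actual API:

```python
# Minimal sketch of the wrapper layering: a fixed, frozen model behind an API,
# with session memory, search and output filtering bolted on outside it.
# call_llm() and web_search() are hypothetical stand-ins, not a vendor API.

BLOCKLIST = {"some banned phrase"}

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a fixed, pretrained model")

def web_search(query: str) -> str:
    raise NotImplementedError("stand-in for a search tool (subject to robots.txt)")

class Session:
    def __init__(self, system_prompt: str):
        self.history = [system_prompt]            # memory lives out here, not in the LLM

    def ask(self, user_text: str, search: bool = False) -> str:
        if search:
            self.history.append("Search results: " + web_search(user_text))
        self.history.append("User: " + user_text)
        answer = call_llm("\n".join(self.history))  # the model itself never updates
        if any(bad in answer.lower() for bad in BLOCKLIST):
            answer = "[response withheld by filter]"
        self.history.append("Assistant: " + answer)
        return answer
```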

These conversations themselves may be used by a combination of human and automatic processing to improve the overall AI experience even though the underlying LLM hasn't changed. And of course, such experience can feed into future LLMs with greater quality control - although those Large Language Models take months of training on thousands of GPUs.

Writing reports and papers

One of the opportunities (from the point of view of employees and students) for these models is to research topics and write summaries and reports (and of course, they can also be used to try to identify and distinguish real human/student work from artificial/faked work).

The models are by their nature inclined to make up stories and facts, and are limited in their access to real facts (both those in the original corpus, due to the compilation into embeddings, and those in the searchable web). The report-writing wrappers around the LLMs may thus push them to write in a formal, dot-point way with references to the sources - although these sources need to be checked, as they do not always contain the "fact" asserted. A good way to test out the models is on an area where you are expert (and for me that's me, my research areas and my writings).

From the perspective of a teacher, we have several problems. One is that students who use them are not learning things themselves, and don't know the area well enough even to see what is right and wrong. Longer term, however, it is appropriate for students to learn how to use AI tools to be more efficient and effective - but we have big problems ensuring the accuracy of the AI's results, which requires a separate fact-checking step, and ideally would involve grounding in the real world and actual understanding of what it is talking about.

From the point of view of a user, whether academic or personal, there is a real problem with us believing what the system tells us, even though it is often wrong and can be persuaded to change its mind and tell you something different. There are ethical issues as well with certain uses, including as a "friend" or "adviser" or "counsellor", that were already considered by Joseph Weizenbaum in his 1976 book, "Computer Power and Human Reason: From Judgment to Calculation" — which was written after the "success" of his famous 1960s Eliza/Doctor program.

However, this is a track we are going down with automated systems providing help and advice, and we are currently addressing the dangers of bad advice or even just lack of empathy.

There are even fake research papers being written with the help of these LLMs.

But there are also some missed opportunities. Currently most of the citations (hyperlinked/footnote references) are spurious. They will mention some relevant words but will not in general contain the actual fact or argument that is attributed to them. I have never yet seen an LLM chatbot give me properly quoted and attributed references that link directly to the source (and give precise page numbers for work that appears in printed or print-ready form). Every single response I've ever received (of thousands, across many companies' models) would receive a fail in terms of scholarly presentation, academic integrity and critical acumen. And generally you can lead them to give you "facts" that agree with what you've proposed.

We don't need a committee of yes-AIs, but critical analysis, synthesis and appraisal. And yes, multiple AI models with different training and different purposes can be combined to help refine the question (prompts), collate and present the facts (search), and argue the pros and cons on any issue or proposal (SWOT analysis).

But as it stands, whether you select creative, balanced or precise (in models like Copilot that offer this), you are likely to get faction rather than facts.

Writing stories/books

So while these models may find it difficult to stick to the facts, surely that must mean that story-telling is a natural opportunity. Indeed there are now many AI-generated stories and books being published, and detecting these is a major headache for the publishing industry, and authors are being asked to disclose if "AI" has been used in the creation of the story, or the images, or the narration (technically we should also say yes if we used Word, since it uses AI for spelling and grammar corrections/suggestions - but I don't use those as they are generally wrong once my initial typos are eliminated).

Conversely, marketers are using "AI" to produce blurbs and teasers, and to select keywords and categories, for human-written books - so even with these attempted protections, the books may be authentic human stories, but what you see when you purchase may be computer generated.

Human authors may also use "AI" in brainstorming for ideas. But that may lead to inadvertent plagiarism, as the LLM can generate phrases, sentences and larger sequences from its training and search data. Also, a book needs a consistent world and history that gives rise to its own set of fictional "facts" - and a lot of the filtering on top of an LLM is there to ensure that it remains consistent within a conversation.

So to write a longer story or hold a longer conversation, it is important to use the LLM itself to summarize facts in a way that can be included in later prompts. Indeed bigger AI systems may use multiple LLMs to produce, manage and optimize prompts, and to filter and check the results - currently more for internal consistency than external accuracy.
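A minimal sketch of that rolling-summary idea, with call_llm again standing in for whatever model is being used (the prompt wording is illustrative only):

```python
# Sketch of carrying a fact summary forward between prompts so later chapters
# stay consistent with earlier ones. call_llm() is a hypothetical stand-in.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for the underlying language model")

def write_story(chapter_briefs):
    story_facts = ""                                  # the carried-forward world/history
    chapters = []
    for brief in chapter_briefs:
        chapter = call_llm(
            f"Established facts so far:\n{story_facts}\n\n"
            f"Write the next chapter: {brief}\n"
            "Do not contradict the established facts."
        )
        chapters.append(chapter)
        # ask the model to fold the new chapter back into a compact fact sheet
        story_facts = call_llm(
            f"Update this fact sheet:\n{story_facts}\n\nwith this chapter:\n{chapter}"
        )
    return chapters
```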

We've also explored using LLMs to write a story in the style of a particular author, and/or target it to an appropriate audience. Generally, they can do pretty well at these stylistic things. But of course in a novel, you have to give each character their own personality, and would somehow have to capture and maintain that in a sequence of prompts.

So far, I'm not finding them much competition for me! or much help...


Narrating stories/books

Natural "AI" voices are getting pretty good, and for me the big opportunity is to do audiobooks with authentic character voices. So I've recently produced audio versions of some stories in five different ways (some stories/poems/extracts are airing on radio, and I have some audiobook versions of my novels in the works).  

I've now produced my first audiobook (of Time for PsyQ) using Google Play AI technology, with the earlier chapters narrated in five different ways using three different toolchains (not all of which involve AI) as I experimented. For Apple Books, I've used their single female AI voice to autonarrate a second version. At this point the technology is new, and comes with significant restrictions on which outlets will accept what.

In fact, I think this technology is going to change the whole nature of audiobooks, making them more like the radio plays our parents or grandparents listened to. Down the track, we can expect to see multicharacter autonarrated audiobooks and even autoacted videobooks that compete with telemovie adaptations.

Unfortunately, reading is a dying art: both reading out loud, with appropriate expression, and reading to oneself, with good comprehension. Reading books ourselves requires us to interpret the author's descriptions of scenes and characters and emotions constructively, imagining them. Thus books impose the most cognitive load, audiobooks less, and videos/movies/plays the least. This suggests why people now watch movies and teleseries, including adaptations from books, more than they actually read books, with audiobooks now starting to overtake ebooks in market share, so that they look likely to occupy an intermediate position.

The rise of audiobooks is somewhat controversial in relation to their effect on literacy. Listening still requires interpretation of scene and character details, but a good narrator will convey emotions and distinguish the characters with slightly different pitch, accent and/or mannerisms. A radio play or dramatized audiobook goes further by adding sound effects (and I have experimented with character voices and sound effects for some of my stories/chapters voiced for radio) - but these are not permitted by the Google and Apple AI narration, and Amazon says they cause problems for its AI-mediated Whispersync.

Educators, including teacher/librarians, face a flood of so-called scholarly evidence, inputs and recommendations on this subject, a great deal of it only partially accurate.

It is partly true that the same brain areas are involved in "reading" audiobooks and electronic or print books, in the sense that our language areas are active, and the cognitive areas responsible for understanding and interpreting the text are active. To an extent even some speech/hearing areas are active, as nearby areas are involved in phonological, lexical and grammatical processing, and there are also "mirror" neurons that fire across multiple modalities (which Time for PsyQ's 11-year-old heroine mentions in the book, a story with a lot of brain science and technology in it). Actually in my PhD (late '70s/early '80s) I predicted and modelled mirror neurons as being necessary for language learning around the same time they were being discovered elsewhere (although unfortunately, the paper about their discovery was rejected by Science, so publication was delayed till after my PhD thesis was complete).

The use of parallel hearing and reading of texts is also useful for comprehension - one of the reasons we use multimodal methods in teaching. I used this approach in learning Chinese (where the characters are more semantic than phonetic, and have different pronunciations in different languages/dialects). The phonics approach to teaching people to read has the disadvantage of focussing on letters rather than words and sentences, and hearing an audiobook as they read along in the text — or having it read to them by a parent or teacher as they read along — helps trigger those mirror neurons, helps them learn the pronunciation of less phonetic words and names, and models and encourages good fluent reading (if the reader is good - many cheap audiobooks of classics have rather poor readers who mispronounce the less common words).

In fact, the main reason I chose to adapt Time for PsyQ to an audiobook format was that a considerable number of parents and teachers had mentioned that they had enjoyed reading the book out loud with their children.

On the other hand, I still have reservations about audiobooks, and note that the western world is seeing an increasing loss of literacy, with most Americans reading at primary school level or below. I expect the increasing prevalence of audiobooks, particularly in schools and libraries, to drive literacy to even lower levels.

The typical adult in an English-speaking country spends over 5 hours a day watching video of one form or another, and around 2 hours a day listening to audio of one form or another, with audiobooks approaching 1 hour a day on average, while reading physical or electronic books has fallen to 16 minutes a day on average in the US, with a similar amount of time spent reading traditional news sources.

There is also a corresponding transition from face-to-face and voice-telephony interaction to social media, and smartphones' improved speech interfaces are impacting use of the reading/writing/typing modalities still further.

Nonetheless, I chose to go ahead and produce an audiobook of Time for PsyQ, which has just been published through Google Play and Findaway Voices, and is already available for Kobo (although it will not be available on Amazon or Apple in the near future due to their rules regarding AI-mediated narration).


Single Narrator

Generally, an audiobook or radio narration will use a single narrator, in my case the author as narrator. This has a number of technical issues associated with it, relating to the equipment and the postprocessing to appropriate standards. Professional narrators will often work with a professional audio engineer to deal with these.

For me, there are also the limitations of my voice (although of the dozen or so audiobooks I've sampled, only the best professional voice actors do better). Furthermore, I am used to lecturing with a lapel mike, and speaking into a close mike for a recording is a little different. Also, what is acceptable "live" is not acceptable for audiobooks (e.g. p-pops, s-sibilance and background noise).

In addition, I'm not the greatest at accents — particularly playing the parts of half a dozen different eleven-year-olds of both sexes. For a short story, with few characters, I can make them different enough by getting into a persona. But my YA novel Time for PsyQ has 42 different people to voice (not to mention some animals).

So in my single narrator version, I gave people accents of different nationalities, and tried to adjust my tone, pitch, formants, speaking rate and voice quality to suggest their age and sex.

Signal Processing

For the signal processing (audio engineering) of my audio I use Apple's GarageBand. This allows for the normal compression and limiting needed to meet audiobook standards, but also allows some other possibilities. I pulled characters out of my narration onto separate tracks, and adjusted fundamental and formant frequency to adjust the age and sex of my characters.

This works surprisingly well, and I found it works best if I don't try to distinguish age/sex with frequency/formant shifts in my own narration — concentrating on prosody and national/ethnic accent.

Note that no AIs were harmed in the making of this version — all is standard signal processing using techniques that go back a century. The fundamental frequency of a voice is its tone (for tone languages) or intonation or pitch (for singing). The formants are the resonance frequencies that relate to the shape of your vocal tract (including oral and nasal cavities) as altered by manipulation of articulators (like the tongue and lips) — so they change dynamically in a way defined by the language/dialect/accent, as well as systematically due to individual differences (including smaller dimensions for females/children giving rise to higher frequencies/resonances than for adult males).
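For those who prefer code to GarageBand sliders, here is a sketch of the same kind of manipulation using the WORLD vocoder (the pyworld Python package): shift the fundamental frequency for pitch, and stretch the spectral envelope along the frequency axis for formants. The filenames and shift amounts are illustrative, and it assumes a mono WAV of one character's extracted lines:

```python
import numpy as np
import pyworld as pw
import soundfile as sf

# Sketch of pitch/formant adjustment using the WORLD vocoder (pyworld) rather
# than GarageBand. Assumes a mono WAV of one character's lines; filenames and
# shift amounts are purely illustrative.
x, fs = sf.read("character_lines.wav")      # float64 mono signal
f0, sp, ap = pw.wav2world(x, fs)            # pitch contour, spectral envelope, aperiodicity

f0_child = f0 * 2 ** (4 / 12)               # raise pitch by ~4 semitones

# Stretch the spectral envelope along the frequency axis to raise the formants
# ~15%, suggesting a smaller vocal tract (a younger or female character).
ratio = 1.15
bins = np.arange(sp.shape[1])
sp_child = np.array([np.interp(bins / ratio, bins, frame) for frame in sp])

y = pw.synthesize(f0_child, sp_child, ap, fs)
sf.write("character_lines_child.wav", y, fs)
```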

I call this signal processing approach voice munging.

Voice Changing/Voice to Voice (V2V)

There are now many open-source packages out there for doing voice-changing, although they are tricky to use and install. There are also companies that are using these packages on a subscription basis to allow you to change your voice to a celebrity voice (do not use for commercial purposes), or a synthetic voice, or in some cases a commercial voice (where a voice actor has been paid and the voice is licensed for performance in commercial works). 

None of them are particularly good at this point — those I have tried all have their limitations, and are generally designed for karaoke/covers in someone else's singing voice. Nonetheless I managed to produce a version of Time for PsyQ Chapter 1 with good characterizations by using pitch adjustment in combination with the target voice's formant characteristics.

The target voice characteristics are extracted by training on a text, and it is relatively easy to get the characteristic relaxed baseline formants for a particular speaker (just get them to say errrrr). It doesn't even require a neural network (generally on the order of 20 parameters suffice, and this vocoder-type technology goes back close to a century), although these systems do use "AI" in the form of an ANN (Artificial Neural Network) to derive theirs.
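Here is a sketch of that classic non-neural estimate: fit an LPC (all-pole) model to a short, steady "errrrr" and read the resonances off the roots of the polynomial. The filename, sample rate and LPC order are illustrative assumptions, not a recommendation:

```python
import numpy as np
import librosa

# Sketch of the classic non-neural formant estimate: fit an LPC (all-pole)
# model to a short, steady "errrrr" and read the resonances off the roots.
# Filename, sample rate and order (roughly 2 + fs/1000) are illustrative.
y, sr = librosa.load("errrrr.wav", sr=16000, mono=True)
a = librosa.lpc(y, order=18)

roots = np.array([r for r in np.roots(a) if np.imag(r) > 0])   # keep upper half-plane poles
freqs = np.angle(roots) * sr / (2 * np.pi)                     # pole angle  -> frequency (Hz)
bands = -np.log(np.abs(roots)) * sr / np.pi                    # pole radius -> bandwidth (Hz)

# keep sharp resonances in the speech range and report the lowest few
formants = sorted(f for f, b in zip(freqs, bands) if f > 90 and b < 400)
print("Estimated formants (Hz):", [round(f) for f in formants[:4]])
```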

Because the commercial systems were so inconvenient to use, this took a huge amount of time. I may write my own wrapper around one of the open-source voice-changers to make this a bit more automatic (I had to identify all quotations, pull them out individually, change frequencies and regenerate them in the target voice, then paste each back into an appropriate character track).

This gives really good results because more parameters are controllable than I could manage manually in GarageBand (the V2V systems usually use a Python toolchain, although I often use Matlab). I also added sound effects at the various breaks in this version (which will be used on radio).
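The wrapper I have in mind would look something like this sketch: find the quotations, decide whose voice each belongs to from a simple speech-tag pattern, and hand each one to the voice-changer before stitching the audio back together. Here convert_voice and the speech-tag heuristic are hypothetical placeholders rather than any particular V2V package:

```python
import re

# Sketch of the planned wrapper: pull quoted dialogue out of a chapter, work
# out who is speaking from a simple speech-tag pattern, and hand each quote
# to a voice-conversion step. convert_voice() is a hypothetical placeholder.

QUOTE = re.compile(r'"([^"]+)"(?:\s*,?\s*(\w+) said)?')

def convert_voice(audio_or_text, target_voice):
    raise NotImplementedError("placeholder for the open-source voice-changer call")

def plan_dialogue(chapter_text, default_speaker="narrator"):
    plan = []
    for m in QUOTE.finditer(chapter_text):
        speaker = m.group(2) or default_speaker
        plan.append((speaker, m.group(1)))
    return plan

print(plan_dialogue('"Hurry up, we\'re late," Airlie said.'))
# [('Airlie', "Hurry up, we're late,")]
```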


Voice Cloning/Text to Speech (TTS)

The final option I wish to present is text-to-speech, and I tested multiple systems and commercial products — all of which had considerable pros and cons. Some of the commercial systems had different sets of voices trained using different toolchains, and these different technologies allowed different degrees of control, in terms of adjusting pitch, emphasis, naturalness, speech rate, etc. (and one charged 5 times as much per character for the more realistic voices, including your own cloned voices). But I won't mention any names, as this is not intended to be a formal review of such systems and I wouldn't want to plug any of them.

So what is a cloned voice? One feature of the commercial systems is that they allow you to clone your own voices (licensed by the month), which means that you can use your own voice, family/friend voices or borrowed/stolen voices. I guess this is useful if you have difficulty speaking clearly for long periods or don't have the appropriate equipment to do commercial-quality recording. However, you need to provide samples of your voice, and the quality of these samples determines the quality of the cloned voice - it reproduces the imperfections, the background/room noise, etc. And even some of the voices provided had such issues. Also, the cloning may really be a merging of your voice into an existing model — which means an American accent can be heard under the clone of my Australian or British voicing even after training with a couple of hours of my speech.

But the big problem here is that the TTS systems don't read the way I do or speak the way my characters should. While V2V can track my prosody (pitch, timing, intonation, loudness, accent), these systems do their own thing - badly. Though some systems/voices allowed some repairs by virtue of their pitch/rate/emphasis/naturalness controls. The other issue is that the repertoire of available voices and accents is small. Also, voices that are more 'intelligent' tend to offer less opportunity to affect their prosody manually.

But the main reason I abandoned these commercial systems was their per-character cost — particularly given the number of attempts at tweaking I needed to get an acceptable result.

Google Play's AI voices/autonarration

After several weeks of playing around with various alternatives, and noting that Findaway Voices and Google Play will only put up autonarrated audiobooks done by Google Play, I decided to try that for Time for PsyQ (and ACX doesn't allow me to submit anything as I don't reside in the US).

With the Google Play autonarration, there is a limited repertoire of English-speaking voices: 12 US, 5 IN, 4 UK, 3 AU, split unevenly across male and female, with no voice actors younger than 30ish (there are one female and two male US voices that sound like late 20s). For comparison, Amazon Polly also has 12 US voices, with 6 UK voices (including 1 Welsh and 1 Irish), all adult; plus 3 IN voices, all female; 2 Australian voices, male and female; and 1 NZ and 1 ZA voice (both female). Quite tough if you have an international cast with few (if any) Americans!

I quickly found that the Google Play interface was limited (I could only change the target voice and the speech rate), flawed (question and command/imperative/exclamation intonation is not reliably produced, and postposed speech tags, like "Airlie said", are pronounced with loud sentence-start levels and intonation) and overall quite buggy (e.g. hanging periodically; and although there is the ability to correct mispronounced words, it doesn't work reliably because it gets a couple of characters out in the segmentation — that is, it replaces part of the word, or parts of two words, with the new phonetically described pronunciation).

So I did a lot of rewriting: moving speech tags from the end of quotations to the start, and replacing mispronounced words with phonetically spelled versions in the text, or with a different word. You'd think it would know that "he'd read" is pronounced like "red" not "reed", but it tended to get that wrong. Overall, the result sounds much better than the half dozen AI-voice audiobooks I've listened to on Google Play.
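That rewriting could itself be partly automated. A sketch of the idea, with an illustrative (not exhaustive) respelling list - move postposed speech tags in front of the quotation and substitute respellings for words the autonarrator gets wrong:

```python
import re

# Sketch of automating the rewrite: move postposed speech tags in front of the
# quotation, and apply phonetic respellings for words the autonarrator gets
# wrong. The respelling entries are illustrative examples only.

RESPELL = {r"\bhe'd read\b": "he'd red", r"\bPsyQ\b": "Sigh-Q"}

TAGGED = re.compile(r'"([^"]+)"\s+(\w+ said)([.!?])')

def preprocess(text):
    def move_tag(m):
        quote = m.group(1).rstrip(",")                 # drop the comma before the closing quote
        return f'{m.group(2)}, "{quote}{m.group(3)}"'  # '"Hi," Airlie said.' -> 'Airlie said, "Hi."'
    text = TAGGED.sub(move_tag, text)
    for pattern, respelling in RESPELL.items():
        text = re.sub(pattern, respelling, text, flags=re.IGNORECASE)
    return text

print(preprocess('"Hurry up," Airlie said. Yesterday he\'d read the whole report.'))
# Airlie said, "Hurry up." Yesterday he'd red the whole report.
```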

I had hoped to munge the resulting audio to fix the most egregious prosody problems (questions and speech tags, and excited kids) — at least for broader Findaway Voices distribution, but when the specific AI-voice terms and conditions were presented, they made clear that I could only use the generated audio in unaltered form. I will note that it is not clear from the terms of the various audiobook distributors that my munged versions or V2V versions would be acceptable either, even though I technically narrated it all.

So with lots of workarounds, I completed the Time for PsyQ markup, got what I felt was an acceptable result, and published it on Google Play. The final version actually does sound pretty good - perhaps even better than the preview version I worked with in their autonarration interface (although some voices didn't sound as good as I expected, probably due to compression to audiobook standards).

Findaway Voices has approved it, and it will soon be available for Kobo (Walmart✓), Spotify, audiobooks.com, etc.

As they become available, all versions/outlets will be linked into
