ChatGPT is undoubtedly the best-known generative artificial intelligence (AI) tool, but it’s far from the only one… The number of applications powered by this form of AI and accessible to the general public continues to multiply. Using algorithms and big data, this type of AI can generate original content – be it text, image, audio or video – in response to user requests. As these applications evolve rapidly, they tend to become increasingly multimodal, in other words, able to accept or generate more than one type of content. To better understand them, it’s helpful to know the different categories under which the technologies underpinning them can be classified. Although terminology varies from one source to another, we’d like to suggest ten categories.
Note: Fascinating and useful as they may be, these tools raise a number of issues, including copyright, their use for deepfake, and all the more general ethical issues surrounding the development and use of generative AI. We won’t discuss these important and complex issues here. However, by gaining a better understanding of the potential of generative AI, we can collectively decide how we want this technology to be shaped.
Tools that convert…
1. text to text (TTT)*
These tools generate text responses to text queries (also known as prompts). They can answer questions, write or help to write texts, translate between languages or even design quizzes. The best-known and most successful applications of this type are ChatGPT and Bard. With plug-ins, however, ChatGPT can handle non-text content, while Bard integrates voice recognition and lets you listen to the answer via audio. The TTT category can also include code generation, i.e. producing source code in a programming language from a description written in natural language. Generative AI programming assistants, such as OpenAI Codex or GitHub Copilot, also offer autocompletion suggestions as the code is typed.
2. text to speech (TTS)
These tools generate audible speech responses to text requests. They can, therefore, respond orally to written requests, as a GPS does, for example, or read text aloud – a particularly useful function for editing videos and making written documents accessible to the visually impaired – or even translate written text and deliver this translation verbally. This technology works by breaking down text into letters and groups of letters, mapping these to smaller units of sound (phonemes), and rendering them by imitating the human voice. A choice of languages, accents, voices and emotions conveyed in the tone is usually offered. ElevenLabs’ Eleven Multilingual v2 and Meta AI’s SeamlessM4T are among the most recent and advanced applications of their kind. We should also mention the unveiling by Microsoft of VALL-E, a TTS application whose distinctive feature is its ability to reproduce the voice of a real person from an audio sample of just three seconds. Given the risks this technology poses for identity theft, it is not yet available to the general public.
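As a rough illustration of the first step described above (splitting text into letters and groups of letters, then mapping them to units of sound), here is a minimal Python sketch. The lookup table, its ARPAbet-style labels and the function name `to_phonemes` are hypothetical and deliberately tiny; real TTS systems rely on much larger pronunciation dictionaries or learned models, and on a separate stage that turns phonemes into audio.

```python
# Toy letter-to-sound table (illustrative assumption, not a real lexicon).
# Two-letter groups are listed alongside single letters.
LETTER_TO_SOUND = {
    "sh": "SH",
    "ch": "CH",
    "th": "TH",
    "ee": "IY",
    "oo": "UW",
    "i": "IH",
    "p": "P",
    "t": "T",
    "m": "M",
}


def to_phonemes(word):
    """Greedily map a word to phoneme labels, preferring two-letter groups."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in LETTER_TO_SOUND:
            # A group of letters such as "sh" maps to a single sound.
            phonemes.append(LETTER_TO_SOUND[pair])
            i += 2
        elif word[i] in LETTER_TO_SOUND:
            phonemes.append(LETTER_TO_SOUND[word[i]])
            i += 1
        else:
            # Letters missing from the toy table are simply skipped.
            i += 1
    return phonemes


print(to_phonemes("sheep"))
print(to_phonemes("chip"))
```

In this sketch, "sheep" has five letters but only three sounds, which is why the letter groups "sh" and "ee" are looked up before single letters.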
3. text to image (TTI)
From text descriptions, these tools can generate original 2D images, some even in 3D, or modify existing images. Images can be photorealistic or artistic. Since the internet abounds with an impressive bank of images, the possibilities are virtually infinite, and it’s up to the user to formulate a query precise enough to ensure that the result is as close as possible to the one desired. From a creative point of view, these applications offer a wide range of possibilities, from simply improving the quality of a photo without the need for knowledge of retouching software to creating artistic images on any subject and in any style, with sometimes impressive renderings.
MidJourney, DALL-E, Stable Diffusion and Jasper Art are just a few of the best-known applications. DALL-E’s 3rd generation AI is capable of adding text to the image, a first for this type of application. On a more practical note, some TTI applications can specifically generate two-dimensional barcodes (often referred to as QR codes) – 2D codes made up of black square modules within a white background square, designed to quickly represent and transmit information via a barcode reader or smartphone.
4. text to video (TTV)
From text to image to text to video, there’s just one step… and a few considerable technological differences. Nevertheless, for the user, it’s once again a matter of formulating a textual request, with or without one or more existing images or videos, to obtain a brand new video – when an existing video is supplied, we can speak of VTV (see below). The integration of ChatGPT into certain applications even makes it possible to create scenarios from a simple description of ideas. TTV is thus firmly multimodal.
Among the applications, some are designed to generate videos for informational or “formative” purposes, others for marketing or artistic purposes. Meta’s Make-A-Video and Runway’s Gen-2 are making a name for themselves as particularly innovative video-generating models. In the category of video editing software, Descript, for example, is an all-in-one tool that lets you edit video scenes by rewriting them (TTV) or cloning your own voice to insert a missing narration passage.
5. speech to text (STT)
Generative AI has also considerably improved the reverse of TTS, speech-to-text conversion, which saves a lot of time when you need a written version of, for example, discussions that took place during a meeting or interview, the information conveyed during a conference or course, and so on. There are many STT applications – Whisper, SpeakAI, Otter.ai, AudioPen, etc. – which can be used with audio or video sources, translate multiple languages, allow editing and sharing, identify important themes and terms, and more.
6. text to audio (TTA)
Unlike the previous category, this one generates not speech but music of all kinds or high-quality sound effects, realistic or otherwise, from a simple text description. Two TTA generative AIs are making the headlines: AudioCraft from Meta and MusicLM from Google. The former is open-source and accessible to the public so that it can be improved with new data from researchers and users. It includes three models: MusicGen, which generates music; AudioGen, which generates sound effects; and EnCodec, a high-performance neural audio codec.
Both AudioCraft and MusicLM can also be used to produce new audio from existing audio (in other words, audio to audio). MusicLM, for its part, is not yet available to the general public, but those taking part in trials (simply sign up for the waiting list) get two pieces of music of 10 seconds each from their text request and are invited to choose the one they prefer to help improve the model. While the maximum length of audio generated by this Google TTA is five minutes (Agostinelli et al., 2023), Meta’s is just 12 seconds. However, Meta’s sampling frequency is higher than Google’s, at 32 kHz compared with 24 kHz, resulting in better quality sound, albeit still in mono.
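To make these figures concrete, the short Python sketch below works out what the sampling frequencies and maximum durations quoted above imply. The function names are our own; the only outside fact used is the standard Nyquist rule, under which a signal sampled at a given rate can represent frequencies up to half that rate. This is one reason a higher sampling frequency can mean better sound, although the overall quality of a model’s output depends on much more than this.

```python
def sample_count(rate_hz, seconds):
    """Number of audio samples in a mono clip of the given length."""
    return rate_hz * seconds


def nyquist_hz(rate_hz):
    """Highest frequency that a given sampling rate can represent."""
    return rate_hz / 2


# Figures quoted in the text: 32 kHz and 12 s versus 24 kHz and 5 minutes.
meta_rate, meta_seconds = 32_000, 12
google_rate, google_seconds = 24_000, 5 * 60

print(sample_count(meta_rate, meta_seconds))      # samples in a 12-second clip
print(sample_count(google_rate, google_seconds))  # samples in a 5-minute clip
print(nyquist_hz(meta_rate), nyquist_hz(google_rate))
```

With these numbers, a 12-second clip at 32 kHz holds 384,000 samples and can represent frequencies up to 16 kHz, whereas a 24 kHz signal tops out at 12 kHz.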
7. image to text (ITT)
For several years now, optical character recognition (OCR) technology has been used to convert the pixels of typed, handwritten or printed text into machine-readable text. It can convert physical documents (invoices, paper books, license plates, etc.) into images (digitizing) and then decipher the text (letters and numbers) in these images. The integration of deep learning into this technology makes the applications that use it more efficient, particularly in the case of non-standard layouts and fonts.
Generative AI, however, extends the possibilities of ITT by enabling the analysis of images that contain no text. A multimodal application such as ChatGPT is able, depending on the user’s query, not only to describe an image – a particularly useful function for improving accessibility for the visually impaired, since it provides an oral description of images when text-to-speech is integrated – but also to answer questions about the image or create text from it. There are also generative AI applications specifically designed for image description, such as “caption generators” that integrate GPT-4 (GPT is the system, and ChatGPT is the conversational interface) and can create image descriptions tailored to each social network (accompanied by keywords and emoticons).
8. image to image (ITI)
Obtaining a new image from an existing one is possible, as described above, using TTI applications. However, a generative AI tool specialized in ITI conversion, as is the case with Adobe’s Firefly (integrated into Photoshop), enables a range of advanced functions specific to photo editing to be explored. Among the most common uses are: style transfer (transforming an ordinary photograph into a painting in the style of a well-known painter, for example), correction (adding or removing elements using digital inpainting), colorization (turning old black-and-white photographs into colour) and super-resolution (increasing resolution to improve display and print quality). Like text-to-image applications, image-to-image applications respond to textual instructions formulated by the user. Some applications, such as Good AI Art Generator, offer two image generation modes, ITI and TTI.
9. image to video (ITV)
Since a video is made up of a series of images, there are tools available that generate video from a single image, as some TTV applications can. From a photo, they can produce a clip lasting a few seconds that matches the characteristics specified in the prompt or the animation styles and effects proposed by the tool. The applications try to distinguish themselves from one another. With Pika Labs, for example, the user must succinctly describe the scene and the type of movement required to animate the image. This application can also generate video without an image from a description, i.e. in TTV. The Animated Drawings application is designed to animate children’s drawings – although it can animate any type of image – while InstaVerse specializes in the creation of immersive 3D scenes for the metaverse, the virtual world accessible via virtual reality. These are just a few examples to illustrate the diversity of applications.
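The idea that a video is a series of images can be shown with a deliberately simple Python sketch. It is not how any of the tools named above work (they use generative models). It only produces a sequence of frames from one tiny image by sliding it sideways, and computes how long the resulting clip would last at a given frame rate. The function names and the 24 frames-per-second figure are illustrative assumptions.

```python
def pan_frames(image, n_frames):
    """Build n_frames from one image by shifting each row left, wrapping around.

    `image` is a list of equal-length, non-empty rows of pixel values.
    """
    frames = []
    for k in range(n_frames):
        frame = []
        for row in image:
            shift = k % len(row)
            # Pixels pushed off the left edge reappear on the right.
            frame.append(row[shift:] + row[:shift])
        frames.append(frame)
    return frames


def clip_seconds(n_frames, fps):
    """Duration of a clip made of n_frames shown at fps frames per second."""
    return n_frames / fps


image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
]
frames = pan_frames(image, 3)
print(frames[1])
print(clip_seconds(72, 24))
```

At 24 frames per second, a three-second clip already requires 72 images, which gives a sense of how much more content an ITV tool has to generate than a TTI tool.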
10. video to video (VTV)
Like ITI tools, VTV tools can be used to generate a new video from an existing one and to explore functions specific to video editing and animation. Wonder Dynamics’ Wonder Studio and Runway’s Gen-2 applications, for example, let you transform a scene filmed with a simple smartphone camera into a movie scene, incorporating characters, animation, lighting and composition of your choice. Here, too, multimodality has made its entry, as these applications integrate other generative AI tools such as TTV or ITI.