
How to Make Any Image Speak to the Visually Impaired

Imagine a world where anyone, regardless of their ability to see, can hear a detailed description of an image. This Large Language Model (LLM)-based solution does just that: it transforms images into spoken descriptions, making them accessible to people who are visually impaired. Let’s explore how it works and how you can use it.

The Problem

For many people who are visually impaired, images on websites, in apps, or in documents can be difficult or impossible to understand. While the surrounding text may explain things, there is often no description of the image itself. This can make navigating the web a frustrating experience. What if we could describe images aloud for those who can’t see them?

The Solution

This solution uses two powerful technologies to solve this problem:

  • It converts images into detailed descriptions using Meta's Llama 3.2 90B Vision model.
  • Then, it reads the description aloud using a text-to-speech engine.

Here’s a quick breakdown of the main tools used:

  • Groq: We use Groq's fast inference API to run an advanced vision model that interprets images and describes them in words.
  • Edge TTS: This tool takes the text description and converts it into spoken words.
  • pygame: This library plays the spoken audio in real-time so the user can hear the description.
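
Before running the code, install the dependencies with pip install groq edge-tts pygame python-dotenv, and create a .env file containing your GROQ_API_KEY.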

Implementation

Here’s the Python code that turns images into spoken descriptions:

import asyncio
import base64
import io
import os

import edge_tts
import pygame
from dotenv import load_dotenv
from groq import Groq

# Load environment variables
load_dotenv()

# Initialize Groq client
client = Groq(api_key=os.getenv("GROQ_API_KEY"))

# Initialize pygame mixer
pygame.mixer.init()

def encode_image(image_path):
    # Read the image file and return its contents as a base64-encoded string
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def image_to_text(image_path):
    # Encode the image and ask the vision model for a description
    # written for a visually impaired listener
    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model="llama-3.2-90b-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail for a visually impaired person."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    }
                ]
            }
        ]
    )
    return response.choices[0].message.content

async def text_to_speech_and_play(text):
    # Stream synthesized speech from Edge TTS into an in-memory buffer
    communicate = edge_tts.Communicate(text, "en-US-ChristopherNeural")
    audio_stream = io.BytesIO()
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            audio_stream.write(chunk["data"])

    # Rewind the buffer and play the audio through pygame
    audio_stream.seek(0)
    pygame.mixer.music.load(audio_stream)
    pygame.mixer.music.play()

    # Wait for the audio to finish playing
    while pygame.mixer.music.get_busy():
        await asyncio.sleep(0.1)

async def main():
    image_path = "path/to/your/image.jpg"
    description = image_to_text(image_path)
    print("Description:", description)
    print("Playing audio description...")
    await text_to_speech_and_play(description)
    print("Audio playback finished.")

if __name__ == "__main__":
    asyncio.run(main())
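
One caveat: image_to_text makes a blocking network call from inside an async program, so the event loop stalls while the model responds. That is fine for this script, but if you embed it in a larger application, one option (assuming Python 3.9 or newer for asyncio.to_thread) is to hand the call to a worker thread in main():

    description = await asyncio.to_thread(image_to_text, image_path)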

Output

Description: The image features a kingfisher bird perched on a branch, showcasing its vibrant plumage. The bird's head is turned to the left, with its long, black beak pointing slightly upwards. Its feathers display a striking combination of bright blue, orange, and white hues, with the blue feathers covering its back and wings, while the orange feathers are visible on its chest and belly. The white feathers are seen around its neck and underbelly.

The bird appears to be gazing into the distance, with its eyes fixed on something outside the frame. The background of the image is softly blurred, but it seems to depict a natural setting, possibly a forest or meadow, with shades of green and yellow visible behind the bird. Overall, the image presents a serene and peaceful atmosphere, capturing the beauty of the kingfisher in its natural habitat.

Playing audio description...
Audio playback finished.

How It Works

  1. Encoding the Image: The encode_image function takes an image file and converts it into a base64 string, which lets the raw image bytes travel inside the JSON request to the API. (The script labels every image as JPEG; a sketch after this list shows one way to handle other formats.)

  2. Generating the Description: The image_to_text function sends the base64-encoded image to the Groq API, which serves the llama-3.2-90b-vision-preview model. The model analyzes the image and generates a detailed text description tailored for visually impaired listeners.

  3. Converting Text to Speech: Once the image description is ready, the text_to_speech_and_play function uses Edge TTS to turn the text into speech. The speech is streamed and played in real-time using pygame.

  4. Playing the Audio: Finally, pygame plays the generated audio, allowing the user to listen to the image description.
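
One thing worth noting about steps 1 and 2: the data URL in image_to_text labels every image as image/jpeg, so a PNG would be sent with the wrong MIME type. Here is a minimal sketch of a drop-in replacement (the helper name encode_image_as_data_url is my own) that guesses the type from the file extension using Python's standard mimetypes module:

import base64
import mimetypes

# Hypothetical helper: build a complete data URL with the right MIME type
def encode_image_as_data_url(image_path):
    # Guess the MIME type from the file extension; fall back to JPEG
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "image/jpeg"
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode('utf-8')
    return f"data:{mime_type};base64,{encoded}"

With this helper, image_to_text would pass the returned string directly as the image_url value instead of building the data URL inline.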

Conclusion

This implementation demonstrates a practical way to make images more accessible to visually impaired individuals. By combining a vision language model for image understanding with Edge TTS for text-to-speech conversion, this solution creates an inclusive experience where images are described and spoken aloud.

If you want to try it out, you can run the code with any image, and it will generate a spoken description. Feel free to customize and improve the script to suit your needs, for example by swapping in a different voice, as sketched below.
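
Edge TTS ships with many voices besides en-US-ChristopherNeural. Here is a quick sketch, assuming a recent version of the edge-tts package, that prints the available English voices so you can pick one:

import asyncio
import edge_tts

async def list_english_voices():
    # Fetch the full voice catalog from the Edge TTS service
    voices = await edge_tts.list_voices()
    for voice in voices:
        if voice["Locale"].startswith("en-"):
            print(voice["ShortName"], "-", voice["Gender"])

asyncio.run(list_english_voices())

And if you would rather save the narration to an MP3 file than play it live, edge-tts also offers an async save() method on Communicate. Together, we can make digital content more accessible for everyone.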