[PYTHON] I implemented Google's Speech to text in Django

In AWS, transcription can be done from the web console, but in GCP it can only be done through the API. Since Google's speech recognition accuracy is quite high, I studied Django and built a simple web front end for it. The flow is to upload the audio to Google Storage and then transcribe it. Google Storage is used because local files are subject to restrictions such as a file size of less than 10MB.
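The size rule above can be sketched as a small check. This is only an illustration: `LOCAL_LIMIT_BYTES` and `needs_cloud_storage` are hypothetical names, and the 10MB figure is the one quoted in this article, not a value read from the API.

```python
import os
import tempfile

# Assumption: the 10MB limit for local audio quoted in this article.
LOCAL_LIMIT_BYTES = 10 * 1024 * 1024


def needs_cloud_storage(path, limit=LOCAL_LIMIT_BYTES):
    """Return True when the file is too large to send as local audio."""
    return os.path.getsize(path) >= limit


if __name__ == "__main__":
    # Create a 1KB dummy file just to exercise the check.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(b"\0" * 1024)
    print(needs_cloud_storage(f.name))
    os.remove(f.name)
```

The application in this article skips this check and always goes through Google Storage, which keeps the code path the same for every file.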

Finished screen

Development environment

MacBook, Python (3.7.7), Django (3.1.5), google-cloud-storage (1.35.0), google-cloud-speech (2.0.1), pydub (0.24.1)

Get json for Google authentication

When creating the service account, grant it administrator privileges for Google Storage, then download its JSON key.
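The Google client libraries find the JSON key through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. A minimal sketch of setting it with an existence check follows; `use_service_account` is a hypothetical helper name, and the path you pass in is your own.

```python
import os


def use_service_account(json_path):
    """Point the Google client libraries at a service account key file."""
    if not os.path.isfile(json_path):
        raise FileNotFoundError("service account key not found: " + json_path)
    # The client libraries read this variable when a client is created.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = json_path
    return json_path
```

Calling it once before creating `storage.Client()` or `speech.SpeechClient()` should be enough, and a wrong path then fails early with a clear message.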

Speech to Text API activation

Enable the API from the GCP library.

Environment setup

pip3 install django==3.1.5

pip3 install google-cloud-storage==1.35.0

pip3 install google-cloud-speech==2.0.1

pip3 install pydub==0.24.1

Django settings

Project creation

A project folder will be created.

#Project name(project)
django-admin startproject project

Creating an application

Go to the project folder and create an application.

#Application creation
python3 manage.py startapp mozi

Django (WEB server) basic settings

Edit settings.py in the project folder inside the project folder.

#So that anyone can access it
ALLOWED_HOSTS = ['*']

#To use html files
INSTALLED_APPS = [
    #(the default apps stay as they are)
    'mozi',  #Add application(Now search for templates in mozi)
]


from django.contrib import admin
from django.urls import path,include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('mozi/', include('mozi.urls')), #So that routes can be set in urls.py of the mozi app
]

Application (mozi) basic settings

Edit the files in the mozi folder inside the project folder. Create a new urls.py there so that screen transitions can be configured on the application side.


from django.urls import path
from . import views

urlpatterns = [
    path('', views.index, name='index'),
]

Create main function

Add the upload destination to settings.py in the project folder. BASE_DIR is where manage.py is located, so create an upload folder there.


import os
MEDIA_ROOT = os.path.join(BASE_DIR, 'upload')
MEDIA_URL = '/upload/'

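Add the following to urls.py in the project folder so that uploaded files are served while DEBUG is enabled.


from django.conf import settings
from django.conf.urls.static import static

if settings.DEBUG:
    urlpatterns += static(settings.MEDIA_URL, document_root=settings.MEDIA_ROOT)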

Add the fields required for the form to models.py under the mozi folder. The 'media' given to upload_to is created under the upload folder, and the uploaded files are saved in it.


from django.db import models

class Upload(models.Model):
    document = models.FileField(upload_to='media')
    uploaded_at = models.DateTimeField(auto_now_add=True)
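To make the save location concrete, here is a small sketch of how the MEDIA_ROOT setting and upload_to combine. `saved_file_path` is a hypothetical helper written for illustration, not part of Django, and it ignores the renaming Django may apply when a file name collides or contains unusual characters.

```python
from pathlib import Path


def saved_file_path(base_dir, filename, media_dir="upload", upload_to="media"):
    """Return where a FileField(upload_to='media') file lands under MEDIA_ROOT."""
    media_root = Path(base_dir) / media_dir      # MEDIA_ROOT in settings.py
    return media_root / upload_to / filename     # MEDIA_ROOT/<upload_to>/<name>


if __name__ == "__main__":
    # "/srv/project" is a hypothetical BASE_DIR used only for illustration.
    print(saved_file_path("/srv/project", "sample.wav"))
```

Under these assumptions, the directory that views.py reads uploaded files from would be the upload/media folder next to manage.py.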

Create a new forms.py under the mozi folder so that files can be uploaded from a form.


from django import forms
from .models import Upload
class UploadForm(forms.ModelForm):
    class Meta:
        model = Upload
        fields = ('document',)

Create index.html under mozi/templates/mozi/ for the web screen.


<!DOCTYPE html>
<html lang="ja-JP">
    <meta charset="UTF-8">
    <h1>Google Speech To Text</h1>
    <form method="post" enctype="multipart/form-data">
        {% csrf_token %}
        {{ form.as_p }}
        <button type="submit">start</button>
    </form>

    <h2>Transcription result</h2>
    <p>{{ transcribe_result }}</p>
</html>

Write views.py, which handles the upload and is the core of the screen display.


from django.http import HttpResponse
from django.shortcuts import render,redirect
from .forms import UploadForm
from .models import Upload

def index(request):
    import os
    import subprocess

    #Save PATH
    source = "Path where the file is uploaded" 
    GCS_BASE = "gs://Bucket name/"    

    #Save results
    speech_result = ""

    if request.method == 'POST':
        #Google Storage environment preparation
        from google.cloud import storage
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='json PATH'
        client = storage.Client()
        bucket = client.get_bucket('Google Storage bucket name')
        #Save upload file
        form = UploadForm(request.POST,request.FILES)
        if form.is_valid():
            form.save()

        #Get the uploaded file name
        #Separate file name and extension(ext->extension(.py))
        transcribe_file = request.FILES['document'].name
        name, ext = os.path.splitext(transcribe_file)

        if ext==".wav": 
            #Upload to Google Storage
            blob = bucket.blob( transcribe_file )
            blob.upload_from_filename(filename= source + transcribe_file )

            #Get play time
            from pydub import AudioSegment
            sound = AudioSegment.from_file( source + transcribe_file )
            length = sound.duration_seconds
            length += 1

            #Delete working files (no shell, so unusual file names are safe)
            os.remove(source + transcribe_file)

            from google.cloud import speech

            client = speech.SpeechClient()

            gcs_uri = GCS_BASE + transcribe_file

            audio = speech.RecognitionAudio(uri=gcs_uri)
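            #Recognition settings (language_code "ja-JP" is an assumption; for WAV the encoding and sample rate are read from the file header)
            config = speech.RecognitionConfig(
                language_code="ja-JP",
            )
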
            operation = client.long_running_recognize(config=config, audio=audio)

            response = operation.result(timeout=round(length))

            for result in response.results:
                speech_result += result.alternatives[0].transcript

            #Google Storage file deletion
            blob.delete()
        else:
            #File conversion process
            f_input = source + transcribe_file
            f_output = source + name + ".wav"
            upload_file_name = name + ".wav"
            #Convert to 16 kHz mono (argument list, so no shell is involved)
            cmd = ['ffmpeg', '-i', f_input, '-ar', '16000', '-ac', '1', f_output]
            subprocess.call(cmd)

            #Upload to Google Storage
            blob = bucket.blob( upload_file_name )
            blob.upload_from_filename(filename= f_output )

            #Get play time
            from pydub import AudioSegment
            sound = AudioSegment.from_file( source + transcribe_file )
            length = sound.duration_seconds
            length += 1

            #Delete working files (no shell, so unusual file names are safe)
            os.remove(f_input)
            os.remove(f_output)
            from google.cloud import speech

            client = speech.SpeechClient()

            gcs_uri = GCS_BASE + upload_file_name

            audio = speech.RecognitionAudio(uri=gcs_uri)
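            #Recognition settings (language_code "ja-JP" is an assumption; for WAV the encoding and sample rate are read from the file header)
            config = speech.RecognitionConfig(
                language_code="ja-JP",
            )
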
            operation = client.long_running_recognize(config=config, audio=audio)

            response = operation.result(timeout=round(length))

            for result in response.results:
                speech_result += result.alternatives[0].transcript

            #Google Storage file deletion
            blob.delete()
    else:
        form = UploadForm()
    return render(request, 'mozi/index.html', {
        'form': form,
        'transcribe_result': speech_result,
    })
Finally, create and apply the migrations for the application.

python3 manage.py makemigrations mozi
python3 manage.py migrate

Now that you're ready, start your web server.

python3 manage.py runserver server IP:8000

It was easy to build because everything, from the web server setup to the internal processing, could be written in Python. I am leaving this as a record and memo of what I tried.

Reference site

https://noumenon-th.net/programming/2019/10/28/django-forms/
https://qiita.com/peijipe/items/009fc487505dfdb03a8d
https://cloud.google.com/speech-to-text/docs/async-recognize?hl=ja

