Preface

Online drinking parties are popular these days, the world being what it is.

However, once you have held a few of them, there always comes a moment when nobody speaks. Everyone has been staying at home the whole time, so there is not much new to talk about in the first place. I'm sure some people would rather not join such a party at all, but it is hard to decline because there is no plausible excuse.

No matter how close you are, the party falls silent once you run out of things to talk about.

The party itself drags on regardless, which makes things unbearably awkward. Whatever the cause, that situation is the worst.

So this time I will write a Python program that detects silence and plays audio, in order to avoid **the awkwardness caused by** silence at an online drinking party. (What a relief.)

Target

I usually use Zoom for online meetings, so I am targeting **Zoom**.

In practice, though, the program monitors the audio of the entire system, so it should probably work with any software.

The goal is to monitor Zoom's audio for 10 seconds and, if the program judges that there was no input (that is, silence), play a music file chosen at random from a specified folder.

Environment

This started as a passing idea, so I use **Python** as the programming language, but there is no particular reason for that choice. The operating environment is as follows.

■ Windows 10
■ Python 3.7

The libraries used are as follows.


import pyaudio
import numpy as np
import wave
import math

from mutagen.mp3 import MP3 as mp3
import pygame
import time

import glob
import random
import sys

I personally wanted to use Zoom with the best possible sound quality, so I use a separate audio interface and microphone. (Zoom probably strips out much of the audio anyway, so it does not make much difference; this is purely for my own satisfaction.)

■ marantz / AUDIO SCOPE SG-5BC
■ CREATIVE / SB X-Fi Surround 5.1

Now let's write the program.

Source code

audio = pyaudio.PyAudio()

def system(FORMAT, CHANNELS, RATE, CHUNK):

    stream = audio.open(format=FORMAT,
                        channels=CHANNELS,
                        rate=RATE,
                        input=True,
                        output=True,
                        input_device_index=1,#← Please change to a suitable index.
                        output_device_index=7,#← Please change to a suitable index.
                        frames_per_buffer=CHUNK)

    return stream

First, we instantiate pyaudio.PyAudio() and use it to monitor the audio input.

The input to be monitored is selected by the value of input_device_index. If you don't know the device index, you can look it up with the following code.

for index in range(audio.get_device_count()):
    print(audio.get_device_info_by_index(index))

Taking my environment as an example, the output is as follows.

{'index': 0, 'structVersion': 2, 'name': 'Microsoft Sound Mapper- Input', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 1, 'structVersion': 2, 'name': 'Playback redirect(SB X-Fi Surround 5.1)', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 2, 'structVersion': 2, 'name': 'line(USB2.0 High-Speed True HD ', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 3, 'structVersion': 2, 'name': 'line/Microphone input(SB X-Fi Surround 5.1', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 4, 'structVersion': 2, 'name': 'SPDIF In (USB2.0 High-Speed Tru', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 5, 'structVersion': 2, 'name': 'Microphone(USB2.0 High-Speed True HD ', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 6, 'structVersion': 2, 'name': 'Microsoft Sound Mapper- Output', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 2, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 7, 'structVersion': 2, 'name': 'speaker(SB X-Fi Surround 5.1)', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 6, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 8, 'structVersion': 2, 'name': 'SPDIF output(SB X-Fi Surround 5.1)', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 6, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 9, 'structVersion': 2, 'name': 'SPDIF Out (USB2.0 High-Speed Tr', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 2, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 10, 'structVersion': 2, 'name': 'speaker(USB2.0 High-Speed True H', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 8, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}

In my environment, several audio interfaces are connected, so there are many indexes like this.

This time, it is necessary to capture not only the microphone input but also the other participants' voices and the audio of any screen shared in Zoom.

Therefore, the index used here is 1, "Playback redirect", which monitors the entire system. Check this value yourself and substitute an appropriate one.
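Because the index differs from machine to machine, it may be easier to look a device up by name than to hard-code a number. The following is only a sketch: find_device_index and its parameters are hypothetical helpers, not part of PyAudio, and the device list is a trimmed copy of three entries from the listing above.

```python
def find_device_index(devices, keyword, want_input=True):
    """Return the index of the first device whose name contains keyword.

    devices is a list of dicts shaped like the output of
    get_device_info_by_index(); returns None when nothing matches.
    """
    channel_key = "maxInputChannels" if want_input else "maxOutputChannels"
    for info in devices:
        if keyword.lower() in info["name"].lower() and info[channel_key] > 0:
            return info["index"]
    return None


# Trimmed copies of three entries from the listing above
devices = [
    {"index": 0, "name": "Microsoft Sound Mapper- Input",
     "maxInputChannels": 2, "maxOutputChannels": 0},
    {"index": 1, "name": "Playback redirect(SB X-Fi Surround 5.1)",
     "maxInputChannels": 2, "maxOutputChannels": 0},
    {"index": 7, "name": "speaker(SB X-Fi Surround 5.1)",
     "maxInputChannels": 0, "maxOutputChannels": 6},
]

print(find_device_index(devices, "playback redirect"))          # 1
print(find_device_index(devices, "speaker", want_input=False))  # 7
```

With real hardware, the list could be built from the same two calls used in the lookup loop above, get_device_count() and get_device_info_by_index().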

There is also an `output_device_index` parameter; it is specified because I want to play audio files in `wav` format. It is not needed if you do not plan to play `wav` files. The stream setup is wrapped in a function for the same reason, so if you have no such plan you can specify `FORMAT, CHANNELS, RATE, CHUNK` directly without making it a function.

frames = []

def surveillance():

    print("Under surveillance...")

    FORMAT = pyaudio.paInt16
    CHANNELS = 1  # Mono
    RATE = 44100  # Sample rate in Hz
    CHUNK = 2 ** 11  # Number of samples read per call
    RECORD_SECONDS = 10  # Length of time to record

    stream = system(FORMAT, CHANNELS, RATE, CHUNK)

    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        buf = stream.read(CHUNK)
        data = np.frombuffer(buf, dtype="int16")
        frames.append(data.max())  # Largest positive sample in this chunk

    stream.stop_stream()
    stream.close()  # Release the stream; a new one is opened on the next pass

    calculation()

The function is named surveillance.

Here, audio is read into Python for 10 seconds, and the **maximum positive value** of the waveform in each chunk of CHUNK samples is extracted and appended to frames.
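The loop reads the stream in blocks of CHUNK samples, so the 10 seconds are covered by a whole number of reads. The arithmetic below is a small worked example using the constants from the function above; the names reads, chunk_ms, and covered are introduced here only for illustration.

```python
RATE = 44100          # samples per second
CHUNK = 2 ** 11       # 2048 samples per read
RECORD_SECONDS = 10

reads = int(RATE / CHUNK * RECORD_SECONDS)   # number of stream.read() calls
chunk_ms = CHUNK / RATE * 1000               # duration of one chunk in milliseconds
covered = reads * CHUNK / RATE               # seconds actually listened to

print(reads)                 # 215
print(round(chunk_ms, 1))    # 46.4
print(round(covered, 2))     # 9.98
```

So frames ends up holding 215 values, one per chunk of roughly 46 ms.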

To explain a little: RATE = 44100, so the sampling frequency is 44.1 kHz. This means we get 44,100 amplitude samples per second. Sound is a wave, so it naturally includes negative values. If you wanted the exact level of every sample (one every 1/44,100 of a second), you would need to take the absolute value, but here only the maximum is kept because we only need to judge whether there was any sound within the 10 seconds.

np.frombuffer interprets the buffer as 16-bit integers, which gives 2 ** 16 levels. Because the values are signed, as mentioned above, the maximum value is 32767.
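The difference between the positive maximum and the absolute maximum can be seen on a tiny buffer. This sketch uses the standard-library array module in place of NumPy so that it runs without extra packages; the four sample values are made up for illustration.

```python
import struct
from array import array

# Four hypothetical 16-bit samples packed as raw bytes, like a tiny stream.read() result
buf = struct.pack("4h", 120, -30000, 500, -7)

samples = array("h", buf)   # "h" = signed 16-bit integers, the same layout as dtype="int16"

positive_peak = max(samples)                  # what the code above stores per chunk
absolute_peak = max(abs(s) for s in samples)  # also catches loud negative swings

print(list(samples))    # [120, -30000, 500, -7]
print(positive_peak)    # 500
print(absolute_peak)    # 30000
print(2 ** 15 - 1)      # 32767, the largest value a signed 16-bit sample can hold
```

For real audio the waveform swings both ways within a chunk, so the positive maximum is usually close to the absolute one, which is why it is enough for a silence check.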

def calculation():

    print("Calculation")

    peak = max(frames)  # Largest sample seen in the 10-second window (a peak, not a true RMS)

    # Level in dB relative to an amplitude of 1; 16-bit full scale is about 90 dB
    db = 20 * math.log10(peak) if peak > 0 else -math.inf
    print(f"Peak: {db:3.1f}[dB]")

    if db <= 65:  #← Please adjust the number according to the environment
        random_music()
        #disc_jockey()

    frames.clear()

Next is the function that judges silence.

The acquired value is converted to a logarithmic (decibel) scale to make level changes easier to read. A threshold is chosen and the code branches with `if`. In my environment, about 65 dB seems to be a good value. Change it to suit your own environment.
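To get a feel for what the 65 dB threshold means in terms of raw sample values, the conversion can be tried on a few numbers. This is a sketch under the same formula as the function above; level_db, is_silent, and FULL_SCALE are names introduced here for illustration.

```python
import math

FULL_SCALE = 32767   # largest positive 16-bit sample


def level_db(peak):
    """Convert a peak sample value to dB relative to an amplitude of 1."""
    return 20 * math.log10(peak) if peak > 0 else -math.inf


def is_silent(peak, threshold_db=65):
    """True when the level is at or below the threshold, as in the if branch above."""
    return level_db(peak) <= threshold_db


print(round(level_db(FULL_SCALE), 1))    # 90.3
print(round(10 ** (65 / 20)))            # 1778, the amplitude that corresponds to 65 dB
print(is_silent(300), is_silent(5000))   # True False
```

In other words, a 65 dB threshold treats any window whose loudest sample stays below roughly 1778 out of 32767 as silence.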

def random_music():

    print("Random music")

    files = glob.glob('./data/*.mp3')  # Keep the folder in the path so load() can find the file
    if not files:
        print("No mp3 files found in ./data")
        return
    filename = random.choice(files)  # The mp3 file to play
    print(filename)

    pygame.mixer.init()
    pygame.mixer.music.load(filename)  # Load the sound source
    mp3_length = mp3(filename).info.length  # Length of the sound source in seconds
    pygame.mixer.music.play(1)  # Start playback. If you change the repeat count, scale the wait on the next line to match
    time.sleep(mp3_length + 0.25)  # Wait for the length of the sound source (plus 0.25 s of margin)
    pygame.mixer.music.stop()  # Stop playback once the wait is over

Finally, this function picks an mp3 file at random from a folder of your choice and plays it. In my case, the files are in a data folder directly under the source code directory.
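The file selection step can be tried on its own, without any audio playback. The sketch below uses pathlib instead of glob and returns None for an empty folder; pick_random_mp3 and the placeholder file names are hypothetical and only serve the demonstration.

```python
import random
import tempfile
from pathlib import Path


def pick_random_mp3(folder):
    """Return the path of a randomly chosen .mp3 in folder, or None if there is none."""
    files = sorted(Path(folder).glob("*.mp3"))
    return str(random.choice(files)) if files else None


# Demonstration with empty placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.mp3", "b.mp3", "notes.txt"):
        (Path(tmp) / name).touch()
    empty_dir = Path(tmp) / "empty"
    empty_dir.mkdir()

    picked_name = Path(pick_random_mp3(tmp)).name
    empty_result = pick_random_mp3(empty_dir)

print(picked_name)    # a.mp3 or b.mp3, never notes.txt
print(empty_result)   # None
```

Returning the full path rather than the bare file name means the result can be handed to a loader regardless of the current working directory.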

try:
    while True:
        surveillance()

except KeyboardInterrupt:
    print("Emergency stop")
    sys.exit(0)

finally:
    audio.terminate()  # Release PortAudio resources on exit

After that, the program loops and keeps monitoring.

If the silence gets broken and you need to stop in a hurry, press Ctrl + C to exit.

def disc_jockey():

    print("Play...")

    filename = "./disc_jockey.wav"
    wf = wave.open(filename, "rb")

    FORMAT = audio.get_format_from_width(wf.getsampwidth())
    CHANNELS = wf.getnchannels()
    RATE = wf.getframerate()
    CHUNK = 1024  # Frames written per call

    stream = system(FORMAT, CHANNELS, RATE, CHUNK)

    # Write the file to the output device chunk by chunk until it is exhausted
    data = wf.readframes(CHUNK)
    while data:
        stream.write(data)
        data = wf.readframes(CHUNK)

    stream.stop_stream()
    stream.close()
    wf.close()

    random_music()

By the way, I mentioned earlier that I also want to play wav files, so here is how to do it. In the actual script, this function has to be defined before the monitoring loop starts.

It is basically the same as the recording side, but here the values have to match the file, so the format, channel count, and sample rate are read from the file with wave and passed in.
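How the wave module reports these values can be checked without any audio hardware. The sketch below first writes a short silent file so that it is self-contained; the file name example.wav and its contents are made up for the demonstration.

```python
import os
import tempfile
import wave

# Write a short mono test file (0.1 s of silence) so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "example.wav")
with wave.open(path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)                 # 2 bytes = 16-bit samples
    out.setframerate(44100)
    out.writeframes(b"\x00\x00" * 4410)

# Read the parameters back, as disc_jockey() does before opening the stream
with wave.open(path, "rb") as wf:
    channels = wf.getnchannels()
    sample_width = wf.getsampwidth()
    rate = wf.getframerate()
    n_frames = wf.getnframes()

print(channels, sample_width, rate, n_frames)   # 1 2 44100 4410
print(n_frames / rate)                          # 0.1 (duration in seconds)
```

The sample width in bytes is what get_format_from_width() receives in the function above, and dividing the frame count by the rate gives the playing time.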

The function is called disc_jockey because **the other participants would be baffled if music suddenly started playing**, so I wrote it with the intention of inserting a spoken song introduction first.

And the data I happened to have on hand was a wav file of **Ioin Hikaru** imitating **Chris 〇 Puller** saying "Let's go to the song here." Why this, of all things?

... I have no idea.

Summary

I can't go outside because of a certain virus, and the silent online drinking parties are still going on, but let's all hang in there.
