Generate thumbnail for arbitrary audio file - python

I want to represent an audio file in an image with a maximum size of 180×180 pixels.
I want to generate this image so that it gives some representation of the audio file, something like SoundCloud's waveform (an amplitude graph).
I wonder if any of you have something for this. I have been searching around for a bit, mainly "audio visualization" and "audio thumbnailing", but I have not found anything useful.
I first posted this to, this is my attempt to reach any programmers working on this.

You could also break the audio up into chunks and measure the RMS (a measure of loudness) of each one. Let's say you want an image that is 180 pixels wide.
I'll use pydub, a lightweight wrapper I wrote around the standard library wave module:
from pydub import AudioSegment
# first I'll open the audio file
sound = AudioSegment.from_mp3("some_song.mp3")
# break the sound into 180 even chunks (or however
# many pixels wide the image should be)
chunk_length = len(sound) / 180
loudness_of_chunks = []
for i in range(180):
    start = i * chunk_length
    end = start + chunk_length
    chunk = sound[start:end]
    loudness_of_chunks.append(chunk.rms)
The for loop can be written as the following list comprehension; I just wanted it to be clear:
loudness_of_chunks = [
    sound[i * chunk_length : (i + 1) * chunk_length].rms
    for i in range(180)
]
Now the only thing left to do is scale the RMS values down to a 0-180 scale (since you want the image to be 180px tall):
max_rms = max(loudness_of_chunks)
scaled_loudness = [ (loudness / max_rms) * 180 for loudness in loudness_of_chunks]
I'll leave the drawing of the actual pixels to you; I'm not very experienced with PIL or ImageMagick :/
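To make the chunk-and-scale idea concrete without depending on any audio library, here is a minimal sketch in plain Python. It assumes the samples are already available as a list of numbers; `chunk_rms` and `scale_to_height` are hypothetical helper names for this illustration, not part of pydub.

```python
import math

def chunk_rms(samples, n_chunks):
    """Split samples into n_chunks roughly equal slices and return the RMS of each."""
    chunk_len = len(samples) / n_chunks
    result = []
    for i in range(n_chunks):
        chunk = samples[int(i * chunk_len):int((i + 1) * chunk_len)]
        if chunk:
            result.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
        else:
            result.append(0.0)  # empty slice: treat as silence
    return result

def scale_to_height(values, height):
    """Scale values so that the largest one maps to `height` pixels."""
    peak = max(values)
    if peak == 0:
        return [0 for _ in values]
    return [round(v / peak * height) for v in values]

# A quiet half followed by a loud half
samples = [1, -1] * 4 + [4, -4] * 4
bars = scale_to_height(chunk_rms(samples, 4), 180)
print(bars)  # [45, 45, 180, 180]
```

Each entry in `bars` would then be the height of one vertical line in the thumbnail.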

Based on Jiaaro's answer (thanks for writing pydub!), and built for web2py, here's my two cents:
import io
import os
import pydub
from PIL import Image, ImageDraw, ImageFilter
def generate_waveform():
    # web2py controller action: `request` and `response` are web2py globals
    img_width = 1170
    img_height = 140
    line_color = 180
    filename = os.path.join(request.folder, 'static', 'sounds', 'adg3.mp3')
    # first I'll open the audio file
    sound = pydub.AudioSegment.from_mp3(filename)
    # break the sound into img_width even chunks
    # (one chunk per pixel column of the image)
    chunk_length = len(sound) / img_width
    loudness_of_chunks = [
        sound[i * chunk_length : (i + 1) * chunk_length].rms
        for i in range(img_width)
    ]
    max_rms = float(max(loudness_of_chunks))
    scaled_loudness = [round(loudness * img_height / max_rms) for loudness in loudness_of_chunks]
    # now convert the scaled_loudness to an image
    im = Image.new('L', (img_width, img_height), color=255)
    draw = ImageDraw.Draw(im)
    for x, rms in enumerate(scaled_loudness):
        y0 = img_height - rms
        y1 = img_height
        draw.line((x, y0, x, y1), fill=line_color, width=1)
    buffer = io.BytesIO()
    del draw
    im = im.filter(ImageFilter.SMOOTH).filter(ImageFilter.DETAIL)
    im.save(buffer, 'PNG')
    buffer.seek(0)
    return response.stream(buffer, filename=filename + '.png')


Object Spilling in Ray

I have a script using ray like this:
import numpy as np
import ray
from PIL import Image
ray.init(
    object_store_memory=1000 * 1024 * 1024 * 100,
)
img_paths = np.array([200k image paths])  # placeholder for ~200k image paths
@ray.remote
def read_img(path):
    img = np.asarray(Image.open(path))
    return img
images = ray.get([read_img.remote(path) for path in img_paths[:10000]])
When I process ~5000 images via img_paths[:5000], this program executes in about 5 seconds. When I bump this up to ~10000, the program takes 4 minutes to execute and gives me messages like:
(raylet) Spilled 132187 MiB, 12533 objects, write throughput 1052 MiB/s.
This is my first time using ray, so I'm not sure how to prevent this from happening.
I ended up solving this by processing my images in batches. The system wasn't allocating the requested 100 GB of RAM, so objects were still spilling. It looks something like this:
refs = [read_img.remote(path) for path in paths]
images = np.empty((1920, 1920))
for i in range(len(refs) // 5000 + 1):
    images = np.vstack(
        (images, np.array(ray.get(refs[i * 5000 : (i + 1) * 5000])))
    )
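The batching pattern itself can be sketched independently of Ray. In the sketch below, `batched` and `process_in_batches` are hypothetical helper names, and `fetch` is a stand-in for whatever resolves a batch (for example `ray.get`); this only illustrates pulling a bounded number of results at a time, and makes no claim about how Ray schedules or spills objects.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def process_in_batches(refs, fetch, batch_size=5000):
    """Resolve refs one batch at a time; `fetch` stands in for ray.get."""
    results = []
    for batch in batched(refs, batch_size):
        results.extend(fetch(batch))
    return results

# Stand-in demo: the "refs" are plain integers and fetch just doubles them.
out = process_in_batches(list(range(12)), lambda b: [x * 2 for x in b], batch_size=5)
print(len(out), out[:3])  # 12 [0, 2, 4]
```

Slicing with a step of `batch_size` also avoids the trailing empty batch that `len(refs) // 5000 + 1` produces when the length is an exact multiple of 5000.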

How to convert a numpy array to a mp3 file

I am using the soundcard library to record my microphone input. It records into a NumPy array, and I want to take that audio and save it as an mp3 file.
import soundcard as sc
import numpy
import threading
speakers = sc.all_speakers()  # Gets a list of the system's speakers
default_speaker = sc.default_speaker()  # Gets the default speaker
mics = sc.all_microphones()  # Gets a list of all the microphones
default_mic = sc.get_microphone('Headset Microphone (Arctis 7 Chat)')  # Gets the microphone by name
# Records the selected microphone
def record_mic():
    with default_mic.recorder(samplerate=48000) as mic, default_speaker.player(samplerate=48000) as sp:
        for _ in range(1000000000000):
            data = mic.record(numframes=None)  # 'None' creates zero latency
            # Save the mp3 file here
recordThread = threading.Thread(target=record_mic)
With Scipy (to wav file)
You can easily convert to wav and then separately convert wav to mp3. More details here.
import numpy as np
from scipy.io.wavfile import write
samplerate = 44100
fs = 100  # tone frequency in Hz
t = np.linspace(0., 1., samplerate)
amplitude = np.iinfo(np.int16).max
data = amplitude * np.sin(2. * np.pi * fs * t)
write("example.wav", samplerate, data.astype(np.int16))
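If SciPy is not available, the same 16-bit WAV step can be sketched with only the standard library's `wave` module. This is an illustrative alternative rather than what the answer above uses; `sine_wav_bytes` is a hypothetical helper name, and it writes to an in-memory buffer instead of a file on disk.

```python
import io
import math
import struct
import wave

def sine_wav_bytes(freq_hz=100, seconds=1.0, samplerate=44100):
    """Return a mono 16-bit PCM WAV file (as bytes) containing a sine tone."""
    amplitude = 32767  # largest value of a signed 16-bit integer
    n_frames = int(samplerate * seconds)
    frames = bytearray()
    for n in range(n_frames):
        sample = int(amplitude * math.sin(2 * math.pi * freq_hz * n / samplerate))
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wf.setframerate(samplerate)
        wf.writeframes(bytes(frames))
    return buf.getvalue()

data = sine_wav_bytes(seconds=0.5)
with wave.open(io.BytesIO(data), "rb") as wf:
    print(wf.getnchannels(), wf.getsampwidth(), wf.getframerate(), wf.getnframes())
```

The resulting bytes can be saved as a .wav file and then converted to MP3 in a separate step, as described above.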
With pydub (to mp3)
Try this function from this excellent thread:
import pydub
import numpy as np
def write(f, sr, x, normalized=False):
    """numpy array to MP3"""
    channels = 2 if (x.ndim == 2 and x.shape[1] == 2) else 1
    if normalized:  # normalized array - each item should be a float in [-1, 1)
        y = np.int16(x * 2 ** 15)
    else:
        y = np.int16(x)
    song = pydub.AudioSegment(y.tobytes(), frame_rate=sr, sample_width=2, channels=channels)
    song.export(f, format="mp3", bitrate="320k")
#[[-225 707]
# [-234 782]
# [-205 755]
# ...,
# [ 303 89]
# [ 337 69]
# [ 274 89]]
write('out2.mp3', sr, x)
Note: The output MP3 will of course be 16-bit, because MP3s are always 16-bit. However, you can set sample_width=3, as suggested by @Arty, for 24-bit input.
As of now, the accepted answer produces extremely distorted sound, at least in my case, so here is an improved version:
#librosa read
#pydub read
channel_sounds = sound.split_to_mono()
samples = [s.get_array_of_samples() for s in channel_sounds]
fp_arr = np.array(samples).T.astype(np.float32)
fp_arr /= np.iinfo(samples[0].typecode).max
fp_arr=np.array([x[0] for x in fp_arr])
#i normalize the pydub waveform with librosa for comparison purposes
So you can read the audio file with any library to get a waveform, and then export it to any pydub-supported codec with the code below. I also tried it with a waveform read by librosa and it works perfectly.
wav_io = io.BytesIO()
scipy.io.wavfile.write(wav_io, sample_rate, waveform)
wav_io.seek(0)
sound = AudioSegment.from_wav(wav_io)
with open("file_exported_by_pydub.mp3", 'wb') as af:
    sound.export(af, format='mp3')

Why converting npy files (containing video frames) to tfrecords consumes too much disk space?

I am working on a violence detection service. I am trying to develop software based on the code in this repo. My dataset consists of videos residing in two directories, "Violence" and "Non-Violence".
I used this code to generate npy files out of RGB channels and optical flow features. The output of this step is 2 folders containing npy arrays of shape 244x244x5 (dtype np.float32). So I have the RGB video frames in the first 3 channels (npy[..., :3]) and the optical flow features in the next two channels (npy[..., 3:]).
Now I am trying to convert them to tfrecords and use them to speed up the training process. Since my model input has to be a cube tensor, each training element has to be 64 frames of a video. This means each data point has shape 64x244x244x5.
So I used this code to convert the npy files to tfrecords.
from pathlib import Path
from os.path import join
import tensorflow as tf
import numpy as np
import cv2
from tqdm import tqdm
def normalize(data):
    mean = np.mean(data)
    std = np.std(data)
    return (data - mean) / std
def random_flip(video, prob):
    s = np.random.rand()
    if s < prob:
        video = np.flip(m=video, axis=2)
    return video
def color_jitter(video):
    # range of s-component: 0-1
    # range of v component: 0-255
    s_jitter = np.random.uniform(-0.2, 0.2)
    v_jitter = np.random.uniform(-30, 30)
    for i in range(len(video)):
        hsv = cv2.cvtColor(video[i], cv2.COLOR_RGB2HSV)
        s = hsv[..., 1] + s_jitter
        v = hsv[..., 2] + v_jitter
        s[s < 0] = 0
        s[s > 1] = 1
        v[v < 0] = 0
        v[v > 255] = 255
        hsv[..., 1] = s
        hsv[..., 2] = v
        video[i] = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
    return video
def uniform_sample(video: np.ndarray, target_frames: int = 64) -> np.ndarray:
    """Gets a video and outputs target_frames frames sampled uniformly from it."""
    len_frames = int(len(video))
    interval = int(np.ceil(len_frames / target_frames))
    # init empty list for sampled video
    sampled_video = []
    for i in range(0, len_frames, interval):
        sampled_video.append(video[i])
    # calculate number of padded frames and fix it
    num_pad = target_frames - len(sampled_video)
    if num_pad > 0:
        padding = [video[i] for i in range(-num_pad, 0)]
        sampled_video += padding
    return np.array(sampled_video, dtype=np.float32)
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
if __name__ == '__main__':
    path = Path('transformed/')
    npy_files = list(path.rglob('*.npy'))[:100]
    aug = True
    # one_hots = to_categorical(range(2), dtype=np.int8)
    path_to_save = 'data_tfrecords'
    tfrecord_path = join(path_to_save, 'all_data.tfrecord')
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for file in tqdm(npy_files, desc='files converted'):
            # load npy files
            npy = np.load(file.as_posix(), mmap_mode='r')
            data = np.float32(npy)
            del npy
            # Uniform sampling
            data = uniform_sample(data, target_frames=64)
            # Add augmentation
            if aug:
                data[..., :3] = color_jitter(data[..., :3])
                data = random_flip(data, prob=0.5)
            # Normalization
            data[..., :3] = normalize(data[..., :3])
            data[..., 3:] = normalize(data[..., 3:])
            # Label one hot encoding
            label = 1 if file.parent.stem.startswith('F') else 0
            # label = one_hots[label]
            feature = {'image': _bytes_feature(tf.compat.as_bytes(data.tobytes())),
                       'label': _int64_feature(int(label))}
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
The code works fine, but the real problem is that it consumes too much disk space. My whole dataset of 2000 videos takes 12 GB. When I converted the videos to npy files, that grew to around 80 GB, and now with tfrecords it is over 120 GB. How can I convert them more efficiently to reduce the space required to store them?
This answer might be too late, but I see you are still saving the full video frames in your tfrecords file, which is why the file is taking so much space.
Try removing the "image" feature from your features list, and save each frame separately along with its height, width, channels, and so forth.
feature = {'label': _int64_feature(int(label))}
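A quick back-of-the-envelope calculation shows where the space goes. The sketch below assumes every one of the 2000 videos becomes exactly one uncompressed 64x244x244x5 example, as described in the question; `example_bytes` is a hypothetical helper name.

```python
def example_bytes(frames=64, height=244, width=244, channels=5, bytes_per_value=4):
    """Raw size in bytes of one training example stored without compression."""
    return frames * height * width * channels * bytes_per_value

n_videos = 2000
per_example_f32 = example_bytes()                  # float32: 4 bytes per value
per_example_u8 = example_bytes(bytes_per_value=1)  # uint8: 1 byte per value

print(per_example_f32)                                # 76206080 bytes, about 76 MB
print(f"{per_example_f32 * n_videos / 1e9:.1f} GB")   # 152.4 GB
print(f"{per_example_u8 * n_videos / 1e9:.1f} GB")    # 38.1 GB
```

So raw float32 tensors of this shape add up to well over 100 GB regardless of the container format. Two options that may help, both untested here: store the frames as uint8 and move normalization into the input pipeline, or write the file with `tf.io.TFRecordOptions(compression_type="GZIP")`.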

Using a Data Converter to Display 3D Volume as Images

I would like to write a data converter tool. I need to analyze the bitstream in a file in order to display the 2D cross-sections of a 3D volume.
The dataset I am trying to view can be found here:
It's the file titled: burned_wood_with_tape_1664x512x256_12bit.raw (832 MB)
I would greatly appreciate some direction. I am willing to award a bounty for software that displays the dataset as images using a data conversion.
As I'm totally new to this concept, I don't have code to show for this problem. However, here's a little something I tried using inspiration from other questions on SO:
import rawpy
import imageio
path = "Datasets/burned_wood_with_tape_1664x512x256_12bit.raw"
for item in path:
    item_path = path + item
    raw = rawpy.imread(item_path)
    rgb = raw.postprocess()
Below I implemented the following visualization.
The example RAW file burned_wood_with_tape_1664x512x256_12bit.raw consists of 1664 samples per A-scan, 512 A-scans per B-scan, 16 B-scans per buffer, 16 buffers per volume, and 2 volumes per file. Each sample is encoded as a 2-byte unsigned integer in little-endian order; only the 12 higher bits are used, and the lower 4 bits contain zeros. Samples are centered approximately around 2^15; to be precise, the data has these stats: min 0, max 47648, mean 32757, standard deviation 454.5.
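The layout described above can be checked with a few lines of standard-library Python. This is only a sketch under that stated layout; the constant and function names (`FRAME_BYTES`, `frame_offset`, `decode_samples`) are made up for illustration.

```python
import struct

SAMPLES_PER_ASCAN = 1664  # width of one frame
ASCANS_PER_BSCAN = 512    # height of one frame
BYTES_PER_SAMPLE = 2      # 16-bit container holding a 12-bit value

FRAME_BYTES = SAMPLES_PER_ASCAN * ASCANS_PER_BSCAN * BYTES_PER_SAMPLE

def frame_offset(index):
    """Byte offset in the file at which frame `index` starts."""
    return index * FRAME_BYTES

def decode_samples(raw):
    """Decode little-endian uint16 samples and drop the 4 unused low bits."""
    count = len(raw) // BYTES_PER_SAMPLE
    values = struct.unpack("<%dH" % count, raw[:count * BYTES_PER_SAMPLE])
    return [v >> 4 for v in values]

print(FRAME_BYTES)        # 1703936 bytes per frame
print(frame_offset(512))  # 872415232 bytes, i.e. 832 MiB for all 512 frames
print(decode_samples(bytes([0x00, 0x80, 0xF0, 0xFF])))  # [2048, 4095]
```

The total of 512 frames matches the 832 MB file size given in the question, which supports reading the file as plain 16-bit samples rather than as a camera RAW format.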
I draw gray images of size 1664 x 512; there are 16 * 16 * 2 = 512 such images (frames) in the file in total. I draw the animated frames on screen using the matplotlib library, and also render the animation into a GIF file. One example of a rendered GIF at reduced quality is shown after the code.
To render/draw images at a different resolution, change the line of code with plt.rcParams['figure.figsize']. This figure size contains (width_in_inches, height_in_inches), and the default DPI (dots per inch) is 100, so if you want the resulting GIF to have a resolution of 720x265, set the figure size to (7.2, 2.65). Note that the animation inside the resulting GIF has a slightly smaller resolution, because the axes and padding are included in the figure size.
The code below needs these pip modules to be installed once with the command python -m pip install numpy matplotlib.
# Needs: python -m pip install numpy matplotlib
def oct_show(file, *, begin = 0, end = None):
    import os, numpy as np, matplotlib, matplotlib.pyplot as plt, matplotlib.animation
    plt.rcParams['figure.figsize'] = (7.2, 2.65) # (4.8, 1.75) (7.2, 2.65) (9.6, 3.5)
    sizeX, sizeY, cnt, bits = 1664, 512, 16 * 16 * 2, 12
    stepX, stepY = 16, 8
    fps = 5
    fsize, opened_here = None, False
    if type(file) is str:
        fsize = os.path.getsize(file)
        file, opened_here = open(file, 'rb'), True
    by = (bits + 7) // 8  # bytes per sample
    if end is None and fsize is not None:
        end = fsize // (sizeX * sizeY * by)
    imgs = []
    # read frames [begin, end) as raw bytes
    file.seek(begin * sizeY * sizeX * by)
    a = file.read((end - begin) * sizeY * sizeX * by)
    a = np.frombuffer(a, dtype = np.uint16)
    a = a.reshape(end - begin, sizeY, sizeX)
    amin, amax, amean, stdd = np.amin(a), np.amax(a), np.mean(a), np.std(a)
    print('min', amin, 'max', amax, 'mean', round(amean, 1), 'std_dev', round(stdd, 3))
    # normalize around the mean and map to 8-bit gray, replicated to 3 channels
    a = (a.astype(np.float32) - amean) / stdd
    a = np.maximum(0.1, np.minimum(a * 128 + 128.5, 255.1)).astype(np.uint8)
    a = a[:, :, :, None].repeat(3, axis = -1)
    fig, ax = plt.subplots()
    plt.subplots_adjust(left = 0.08, right = 0.99, bottom = 0.06, top = 0.97)
    for i in range(a.shape[0]):
        title = ax.text(
            0.5, 1.02, f'Frame {i}',
            size = plt.rcParams['axes.titlesize'],
            ha = 'center', transform = ax.transAxes,
        )
        imgs.append([ax.imshow(a[i], interpolation = 'antialiased'), title])
    ani = matplotlib.animation.ArtistAnimation(plt.gcf(), imgs, interval = 1000 // fps)
    print('Saving animated frames to GIF...', flush = True)
    ani.save(file.name + '.gif', writer = 'imagemagick', fps = fps)
    print('Showing animated frames on screen...', flush = True)
    plt.show()
    if opened_here:
        file.close()
Example output GIF:
I don't think it's a valid RAW file at all.
If you try this code:
import rawpy
import imageio
path = 'Datasets/burned_wood_with_tape_1664x512x256_12bit.raw'
raw = rawpy.imread(path)
rgb = raw.postprocess()
You will get the following error:
----> 5 raw = rawpy.imread(path)
6 rgb = raw.postprocess()
~\Anaconda3\envs\py37tf2gpu\lib\site-packages\rawpy\ in imread(pathOrFile)
18 d.open_buffer(pathOrFile)
19 else:
---> 20 d.open_file(pathOrFile)
21 return d
rawpy\_rawpy.pyx in rawpy._rawpy.RawPy.open_file()
rawpy\_rawpy.pyx in rawpy._rawpy.RawPy.handle_error()
LibRawFileUnsupportedError: b'Unsupported file format or not RAW file'

How Could I increase the speed

I am using the code below for an image-processing study. It works correctly, but it is so slow that a single step takes up to 10 seconds.
I need it to run faster to reach my goal.
import numpy
import glob, os
import cv2
input = cv2.imread(path)
def nothing(x):  # for trackbar
    pass
windowName = "Image"
cv2.namedWindow(windowName)
cv2.createTrackbar("coef", windowName, 0, 25000, nothing)
condition = True
while (condition):
    coef = cv2.getTrackbarPos("coef", windowName)
    temp_img = input
    row = temp_img.shape[0]
    col = temp_img.shape[1]
    red = []
    green = []
    for i in range(row):
        for y in range(col):
            # temp_img[i][y][0] = 0
            temp_img[i][y][1] = temp_img[i][y][1] * (coef / 100)
            temp_img[i][y][2] = temp_img[i][y][2] * (1 - (coef / 100))
            # relative_diff = value_g - value_r
    # temp =cv2.resize(temp,(1000,800))
    cv2.imshow(windowName, temp_img)
    # cv2.imwrite("output2.jpg", temp)
    # cv2.waitKey(0)
    if cv2.waitKey(30) >= 0:
        condition = False
Does anybody have an idea for how to get faster results?
It's not entirely clear to me what kind of object temp_img is, but if it behaves like a numpy array, you could replace your loop with
temp_img[:,:,1] = temp_img[:,:,1]*(coef/100)
temp_img[:,:,2] = temp_img[:,:,2]*(1-coef/100)
which should result in a significant speed-up if your array is large. The implementation of such operations on arrays is optimised very well, whereas Python loops are generally quite slow.
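To illustrate the vectorized form on its own, here is a small self-contained sketch. The function name `scale_channels` is made up for this example, and it adds one step that is not in the snippet above: it computes in floating point and clips to 0-255 before converting back to uint8, on the assumption that wrap-around is unwanted when coef exceeds 100.

```python
import numpy as np

def scale_channels(img, coef):
    """Scale channel 1 by coef/100 and channel 2 by (1 - coef/100) without Python loops."""
    out = img.astype(np.float64)       # work on a float copy; the input is left untouched
    out[:, :, 1] *= coef / 100
    out[:, :, 2] *= 1 - coef / 100
    return np.clip(out, 0, 255).astype(np.uint8)

# Tiny 2x3 test image where every pixel is (10, 100, 200)
img = np.zeros((2, 3, 3), dtype=np.uint8)
img[:] = (10, 100, 200)

result = scale_channels(img, 50)
print(result[0, 0])  # [ 10  50 100]
```

Because the function returns a new array, it can be called on every trackbar update without gradually modifying the original image.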
Edit based on comments:
Since you're working with large images and have some expensive operations that need an unscaled version but only need to be executed once, your code could have the following kind of structure:
import cv2
import numpy as np  # plus whatever else you need
def nothing(x):  # trackbar callback
    pass
def expensive_operations(image, *args, **kwargs):
    # do all your expensive operations like object detection
    pass
def scale_image(image, scale):
    # create a scaled version of image, e.g. with cv2.resize
    return cv2.resize(image, None, fx=scale, fy=scale)
def cheap_operations(scaled_image, windowName):
    # perform cheap operations, e.g.
    coef = cv2.getTrackbarPos("coef", windowName)
    temp_img = np.copy(scaled_image)
    temp_img[:,:,1] = temp_img[:,:,1] * (coef / 100)
    temp_img[:,:,2] = temp_img[:,:,2] * (1 - (coef / 100))
    cv2.imshow(windowName, temp_img)
input = cv2.imread(path)
windowName = "Image"
cv2.namedWindow(windowName)
cv2.createTrackbar("coef", windowName, 0, 25000, nothing)
condition = True
expensive_results = expensive_operations(input)  # possibly with some more args and keyword args
scaled_image = scale_image(input, 0.25)  # example scale; pick whatever suits your display
while condition:
    cheap_operations(scaled_image, windowName)
    if cv2.waitKey(30) >= 0:
        condition = False
I do this kind of thing in nip2. It's an image processing spreadsheet that can manipulate huge images quickly. It has no problems doing this kind of operation on any size image at 60fps.
I made you an example workspace:
Here's what it looks like working on a 1 GB starfield image:
You can drag the slider to change coeff. The processed image updates instantly as you drag. You can zoom and pan around the processed image to check details and adjust coeff.
The underlying image processing library is libvips, which has a Python binding, pyvips. In pyvips, your program would be:
import pyvips
def adjust(image, coeff):
    return image * [1, coeff / 100, 1 - coeff / 100]
Though that's without the GUI elements, of course.
