What format does the module Struct require? - python

I faced the module Struct for the first time and my code gives me an error: "unpack requires a buffer of 1486080 bytes"
Here is my code:
def speed_up(n):
source = wave.open('sound.wav', mode='rb')
dest = wave.open('out.wav', mode='wb')
dest.setparams(source.getparams())
frames_count = source.getnframes()
data = struct.unpack("<" + str(frames_count) + "h", source.readframes(frames_count))
new_data = []
for i in range(0, len(data), n):
new_data.append(data[i])
newframes = struct.pack('<' + str(len(new_data)) + 'h', new_data)
dest.writeframes(newframes)
source.close()
dest.close()
How to figure out which format should I use?

The issue in your code is that you're providing struct.unpack with the wrong number of bytes. This is because of your usage of the wave module: Each frame in a wave file has getnchannels() samples, so when calling readframes(n) you will get back n * getnchannels() samples and this is the number you'd have to pass to struct.unpack.
To make your code more robust, you'd also need to look at getsampwidth() and use an appropriate format character, but the vast majority of wave files are 16-bit.
In the comments you also mentioned that the code didn't work after adding print(len(source.readframes(frames_count))). You didn't show the full code but I assume this is because you called readframes twice without calling rewind, so the second call didn't have any more data to return. It would be best to store the result in a variable if you want to use it in multiple lines.

Related

Astropy time conversion very slow

I have a script that reads some data from a binary stream of "packets" containing "parameters". The parameters read are stored in a dictionary for each packet, which is appended to an array representing the packet stream.
At the end this array of dict is written to an output CSV file.
Among the read data is a CUC7 datetime, stored as coarse/fine integer parts of a GPS time, which I also want to convert to a UTC ISO string.
from astropy.time import Time
def cuc2gps_time(coarse, fine):
return Time(coarse + fine / (2**24), format='gps')
def gps2utc_time(gps):
return Time(gps, format='isot', scale='utc')
The issue is that I realized that these two time conversions make up 90% of the total run time of my script, most of the work of my script being done in the 10% remaining (read binary file, decode 15 other parameters, write to CSV).
I somehow improved the situation by making the conversions in batches on Numpy arrays, instead of on each packet. This reduces the total runtime by about half.
import numpy as np
while end_not_reached:
# Read 1 packet
# (...)
nb_packets += 1
end_not_reached = ... # boolean
# Process in batches for better performance
if not nb_packets%1000 or not end_not_reached:
# Convert CUC7 time to GPS and UTC times
all_coarse = np.array([packet['lobt_coarse'] for packet in packets])
all_fine = np.array([packet['lobt_fine'] for packet in packets])
all_gps = cuc2gps_time(all_coarse, all_fine)
all_utc = gps2utc_time(all_gps)
# Add times to each packet
for packet, gps_time, utc_time in zip(packets, all_gps, all_utc):
packet.update({'gps_time': gps_time, 'utc_time': utc_time})
But my script is still absurdly slow. Reading 60000 packets from a 1.2GB file and writing it as a CSV takes 12s, against only 2.5s if I remove the time conversion.
So:
Is it expected that Astropy time conversions are so slow? Am I using it wrong? Is there a better library?
Is there a way to improve my current implementation? I suspect that the remaining "for" loop in there is very costly, but could not find a good way to replace it.
I think the problem is that you're doing multiple loops over your sequence of packets. In Python I would recommend having arrays representing each parameter, instead of having a list of objects, each with a bunch of scalar parameters.
If you can read all the packets in at once, I would recommend something like:
num_bytes = ...
num_bytes_per_packet = ...
num_packets = num_bytes / num_bytes_per_packet
param1 = np.empty(num_packets)
param2 = np.empty(num_packets)
...
time_coarse = np.empty(num_packets)
time_fine = np.empty(num_packets)
...
param_N = np.empty(num_packets)
for i in range(num_packets):
param_1[i], param_2[i], ..., time_coarse[i], time_fine[i], ... param_N[i] = decode_packet(...)
time_gps = Time(time_coarse + time_fine / (2**24), format='gps')
time_utc = time_gps.utc.isot

I am trying to convert a CSV to a WAV file, and my inexperience is causing a problem

To start, I know very little about Python. I am trying to convert a CSV to A wav file using a script I found in another post. Best I can tell it was written for an older version of python than the one I am using. One error I am getting is because of the version difference, I am just not sure how to correct it. The other error may be because of my ignorance with Python, but I am not sure of that.
The first error is:
Python\CVS-WAV2.py:44: DeprecationWarning: 'U' mode is deprecated
for time, value in csv.reader(open(fname, 'U'), delimiter=','):
I know in Python 3 the 'U' has been replaced with newline= with either "None, '\n', '\r', or '\n\r'. After reading up on the newline function I think that "None" is the option I want.
Once I change 'U' to newlinw=None, My first error goes away but I still get the following when I run the script:
File "\Python\CVS-WAV2.py", line 43, in
for time, value in csv.reader(open(fname, newline='\n'), delimiter=','):
ValueError: not enough values to unpack (expected 2, got 0)
I am not sure how to resolve this error though.
#!/usr/bin/python
import wave
import struct
import sys
import csv
import numpy
from scipy.io import wavfile
from scipy.signal import resample
def write_wav(data, filename, framerate, amplitude):
wavfile = wave.open(filename,'w')
nchannels = 1
sampwidth = 2
framerate = framerate
nframes = len(data)
comptype = "NONE"
compname = "not compressed"
wavfile.setparams((nchannels,
sampwidth,
framerate,
nframes,
comptype,
compname))
frames = []
for s in data:
mul = int(s * amplitude)
frames.append(struct.pack('h', mul))
frames = ''.join(frames)
wavfile.writeframes(frames)
wavfile.close()
print("%s written" %(filename))
if __name__ == "__main__":
if len(sys.argv) <= 1:
print ("You must supply a filename to generate")
exit(-1)
for fname in sys.argv[1:]:
data = []
for time, value in csv.reader(open(fname, newline=None), delimiter=','):
try:
data.append(float(value))#Here you can see that the time column is skipped
except ValueError:
pass # Just skip it
arr = numpy.array(data)#Just organize all your samples into an array
# Normalize data
arr /= numpy.max(numpy.abs(data)) #Divide all your samples by the max sample value
filename_head, extension = fname.rsplit(".", 1)
data_resampled = resample( arr, len(data) )
wavfile.write('rec.wav', 2000, data_resampled) #resampling at 2khz
print ("File written succesfully !")
Any help would be greatly appreciated!!
There are probably empty lines in your CSV file. One way to skip over those is to use the following:
for record in csv.reader(open(fname, newline=None), delimiter=','):
if not record:
continue
time, value = record

Python - Efficient way to flip bytes in a file?

I've got a folder full of very large files that need to be byte flipped by a power of 4. So essentially, I need to read the files as a binary, adjust the sequence of bits, and then write a new binary file with the bits adjusted.
In essence, what I'm trying to do is read a hex string hexString that looks like this:
"00112233AABBCCDD"
And write a file that looks like this:
"33221100DDCCBBAA"
(i.e. every two characters is a byte, and I need to flip the bytes by a power of 4)
I am very new to python and coding in general, and the way I am currently accomplishing this task is extremely inefficient. My code currently looks like this:
import binascii
with open(myFile, 'rb') as f:
content = f.read()
hexString = str(binascii.hexlify(content))
flippedBytes = ""
inc = 0
while inc < len(hexString):
flippedBytes += file[inc + 6:inc + 8]
flippedBytes += file[inc + 4:inc + 6]
flippedBytes += file[inc + 2:inc + 4]
flippedBytes += file[inc:inc + 2]
inc += 8
..... write the flippedBytes to file, etc
The code I pasted above accurately accomplishes what I need (note, my actual code has a few extra lines of: "hexString.replace()" to remove unnecessary hex characters - but I've left those out to make the above easier to read). My ultimate problem is that it takes EXTREMELY long to run my code with larger files. Some of my files I need to flip are almost 2gb in size, and the code was going to take almost half a day to complete one single file. I've got dozens of files I need to run this on, so that timeframe simply isn't practical.
Is there a more efficient way to flip the HEX values in a file by a power of 4?
.... for what it's worth, there is a tool called WinHEX that can do this manually, and only takes a minute max to flip the whole file.... I was just hoping to automate this with python so we didn't have to manually use WinHEX each time
You want to convert your 4-byte integers from little-endian to big-endian, or vice-versa. You can use the struct module for that:
import struct
with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
while True:
d = infile.read(4)
if not d:
break
le = struct.unpack('<I', d)
be = struct.pack('>I', *le)
of.write(be)
Here is a little struct awesomeness to get you started:
>>> import struct
>>> s = b'\x00\x11\x22\x33\xAA\xBB\xCC\xDD'
>>> a, b = struct.unpack('<II', s)
>>> s = struct.pack('>II', a, b)
>>> ''.join([format(x, '02x') for x in s])
'33221100ddccbbaa'
To do this at full speed for a large input, use struct.iter_unpack

Implement realtime signal processing in Python - how to capture audio continuously?

I'm planning to implement a "DSP-like" signal processor in Python. It should capture small fragments of audio via ALSA, process them, then play them back via ALSA.
To get things started, I wrote the following (very simple) code.
import alsaaudio
inp = alsaaudio.PCM(alsaaudio.PCM_CAPTURE, alsaaudio.PCM_NORMAL)
inp.setchannels(1)
inp.setrate(96000)
inp.setformat(alsaaudio.PCM_FORMAT_U32_LE)
inp.setperiodsize(1920)
outp = alsaaudio.PCM(alsaaudio.PCM_PLAYBACK, alsaaudio.PCM_NORMAL)
outp.setchannels(1)
outp.setrate(96000)
outp.setformat(alsaaudio.PCM_FORMAT_U32_LE)
outp.setperiodsize(1920)
while True:
l, data = inp.read()
# TODO: Perform some processing.
outp.write(data)
The problem is, that the audio "stutters" and is not gapless. I tried experimenting with the PCM mode, setting it to either PCM_ASYNC or PCM_NONBLOCK, but the problem remains. I think the problem is that samples "between" two subsequent calls to "inp.read()" are lost.
Is there a way to capture audio "continuously" in Python (preferably without the need for too "specific"/"non-standard" libraries)? I'd like the signal to always get captured "in the background" into some buffer, from which I can read some "momentary state", while audio is further being captured into the buffer even during the time, when I perform my read operations. How can I achieve this?
Even if I use a dedicated process/thread to capture the audio, this process/thread will always at least have to (1) read audio from the source, (2) then put it into some buffer (from which the "signal processing" process/thread then reads). These two operations will therefore still be sequential in time and thus samples will get lost. How do I avoid this?
Thanks a lot for your advice!
EDIT 2: Now I have it running.
import alsaaudio
from multiprocessing import Process, Queue
import numpy as np
import struct
"""
A class implementing buffered audio I/O.
"""
class Audio:
"""
Initialize the audio buffer.
"""
def __init__(self):
#self.__rate = 96000
self.__rate = 8000
self.__stride = 4
self.__pre_post = 4
self.__read_queue = Queue()
self.__write_queue = Queue()
"""
Reads audio from an ALSA audio device into the read queue.
Supposed to run in its own process.
"""
def __read(self):
inp = alsaaudio.PCM(alsaaudio.PCM_CAPTURE, alsaaudio.PCM_NORMAL)
inp.setchannels(1)
inp.setrate(self.__rate)
inp.setformat(alsaaudio.PCM_FORMAT_U32_BE)
inp.setperiodsize(self.__rate / 50)
while True:
_, data = inp.read()
self.__read_queue.put(data)
"""
Writes audio to an ALSA audio device from the write queue.
Supposed to run in its own process.
"""
def __write(self):
outp = alsaaudio.PCM(alsaaudio.PCM_PLAYBACK, alsaaudio.PCM_NORMAL)
outp.setchannels(1)
outp.setrate(self.__rate)
outp.setformat(alsaaudio.PCM_FORMAT_U32_BE)
outp.setperiodsize(self.__rate / 50)
while True:
data = self.__write_queue.get()
outp.write(data)
"""
Pre-post data into the output buffer to avoid buffer underrun.
"""
def __pre_post_data(self):
zeros = np.zeros(self.__rate / 50, dtype = np.uint32)
for i in range(0, self.__pre_post):
self.__write_queue.put(zeros)
"""
Runs the read and write processes.
"""
def run(self):
self.__pre_post_data()
read_process = Process(target = self.__read)
write_process = Process(target = self.__write)
read_process.start()
write_process.start()
"""
Reads audio samples from the queue captured from the reading thread.
"""
def read(self):
return self.__read_queue.get()
"""
Writes audio samples to the queue to be played by the writing thread.
"""
def write(self, data):
self.__write_queue.put(data)
"""
Pseudonymize the audio samples from a binary string into an array of integers.
"""
def pseudonymize(self, s):
return struct.unpack(">" + ("I" * (len(s) / self.__stride)), s)
"""
Depseudonymize the audio samples from an array of integers into a binary string.
"""
def depseudonymize(self, a):
s = ""
for elem in a:
s += struct.pack(">I", elem)
return s
"""
Normalize the audio samples from an array of integers into an array of floats with unity level.
"""
def normalize(self, data, max_val):
data = np.array(data)
bias = int(0.5 * max_val)
fac = 1.0 / (0.5 * max_val)
data = fac * (data - bias)
return data
"""
Denormalize the data from an array of floats with unity level into an array of integers.
"""
def denormalize(self, data, max_val):
bias = int(0.5 * max_val)
fac = 0.5 * max_val
data = np.array(data)
data = (fac * data).astype(np.int64) + bias
return data
debug = True
audio = Audio()
audio.run()
while True:
data = audio.read()
pdata = audio.pseudonymize(data)
if debug:
print "[PRE-PSEUDONYMIZED] Min: " + str(np.min(pdata)) + ", Max: " + str(np.max(pdata))
ndata = audio.normalize(pdata, 0xffffffff)
if debug:
print "[PRE-NORMALIZED] Min: " + str(np.min(ndata)) + ", Max: " + str(np.max(ndata))
print "[PRE-NORMALIZED] Level: " + str(int(10.0 * np.log10(np.max(np.absolute(ndata)))))
#ndata += 0.01 # When I comment in this line, it wreaks complete havoc!
if debug:
print "[POST-NORMALIZED] Level: " + str(int(10.0 * np.log10(np.max(np.absolute(ndata)))))
print "[POST-NORMALIZED] Min: " + str(np.min(ndata)) + ", Max: " + str(np.max(ndata))
pdata = audio.denormalize(ndata, 0xffffffff)
if debug:
print "[POST-PSEUDONYMIZED] Min: " + str(np.min(pdata)) + ", Max: " + str(np.max(pdata))
print ""
data = audio.depseudonymize(pdata)
audio.write(data)
However, when I even perform the slightest modification to the audio data (e. g. comment that line in), I get a lot of noise and extreme distortion at the output. It seems like I don't handle the PCM data correctly. The strange thing is that the output of the "level meter", etc. all appears to make sense. However, the output is completely distorted (but continuous) when I offset it just slightly.
EDIT 3: I just found out that my algorithms (not included here) work when I apply them to wave files. So the problem really appears to actually boil down to the ALSA API.
EDIT 4: I finally found the problems. They were the following.
1st - ALSA quietly "fell back" to PCM_FORMAT_U8_LE upon requesting PCM_FORMAT_U32_LE, thus I interpreted the data incorrectly by assuming that each sample was 4 bytes wide. It works when I request PCM_FORMAT_S32_LE.
2nd - The ALSA output seems to expect period size in bytes, even though they explicitely state that it is expected in frames in the specification. So you have to set the period size four times as high for output if you use 32 bit sample depth.
3rd - Even in Python (where there is a "global interpreter lock"), processes are slow compared to Threads. You can get latency down a lot by changing to threads, since the I/O threads basically don't do anything that's computationally intensive.
When you
read one chunk of data,
write one chunk of data,
then wait for the second chunk of data to be read,
then the buffer of the output device will become empty if the second chunk is not shorter than the first chunk.
You should fill up the output device's buffer with silence before starting the actual processing. Then small delays in either the input or output processing will not matter.
You can do that all manually, as #CL recommend in his/her answer, but I'd recommend just using
GNU Radio instead:
It's a framework that takes care of doing all the "getting small chunks of samples in and out your algorithm"; it scales very well, and you can write your signal processing either in Python or C++.
In fact, it comes with an Audio Source and an Audio Sink that directly talk to ALSA and just give/take continuous samples. I'd recommend reading through GNU Radio's Guided Tutorials; they explain exactly what is necessary to do your signal processing for an audio application.
A really minimal flow graph would look like:
You can substitute the high pass filter for your own signal processing block, or use any combination of the existing blocks.
There's helpful things like file and wav file sinks and sources, filters, resamplers, amplifiers (ok, multipliers), …
I finally found the problems. They were the following.
1st - ALSA quietly "fell back" to PCM_FORMAT_U8_LE upon requesting PCM_FORMAT_U32_LE, thus I interpreted the data incorrectly by assuming that each sample was 4 bytes wide. It works when I request PCM_FORMAT_S32_LE.
2nd - The ALSA output seems to expect period size in bytes, even though they explicitely state that it is expected in frames in the specification. So you have to set the period size four times as high for output if you use 32 bit sample depth.
3rd - Even in Python (where there is a "global interpreter lock"), processes are slow compared to Threads. You can get latency down a lot by changing to threads, since the I/O threads basically don't do anything that's computationally intensive.
Audio is gapless and undistorted now, but latency is far too high.

Fastest way to process a large file?

I have multiple 3 GB tab delimited files. There are 20 million rows in each file. All the rows have to be independently processed, no relation between any two rows. My question is, what will be faster?
Reading line-by-line?
with open() as infile:
for line in infile:
Reading the file into memory in chunks and processing it, say 250 MB at a time?
The processing is not very complicated, I am just grabbing value in column1 to List1, column2 to List2 etc. Might need to add some column values together.
I am using python 2.7 on a linux box that has 30GB of memory. ASCII Text.
Any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Is using any CSVReader module going to help?
I don't have to do it in python, any other language or database use ideas are welcome.
It sounds like your code is I/O bound. This means that multiprocessing isn't going to help—if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.
And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
Still, it's worth checking that you really are I/O bound, instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% or a core. Do what Amadan suggested in a comment, and run your program with just pass for the processing and see whether that cuts off 5% of the time or 70%. You may even want to try comparing with a loop over os.open and os.read(1024*1024) or something and see if that's any faster.
Since your using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure them to see where the peak is. In my experience, usually anything from 64K-8MB is about the same, but depending on your system that may be different—especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)
So, for example:
bufsize = 65536
with open(path) as infile:
while True:
lines = infile.readlines(bufsize)
if not lines:
break
for line in lines:
process(line)
Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:
with open(path) as infile:
m = mmap.mmap(infile, 0, access=mmap.ACCESS_READ)
A Python mmap is sort of a weird object—it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python… although maybe you can get around that with re, or with a simple Cython extension?)… but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages)—but you can actually ctypes the function out of libc.
I know this question is old; but I wanted to do a similar thing, I created a simple framework which helps you read and process a large file in parallel. Leaving what I tried as an answer.
This is the code, I give an example in the end
def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
"""
function to divide a large text file into chunks each having size ~= size so that the chunks are line aligned
Params :
fname : path to the file to be chunked
size : size of each chink is ~> this
skiplines : number of lines in the begining to skip, -1 means don't skip any lines
Returns :
start and end position of chunks in Bytes
"""
chunks = []
fileEnd = os.path.getsize(fname)
with open(fname, "rb") as f:
if(skiplines > 0):
for i in range(skiplines):
f.readline()
chunkEnd = f.tell()
count = 0
while True:
chunkStart = chunkEnd
f.seek(f.tell() + size, os.SEEK_SET)
f.readline() # make this chunk line aligned
chunkEnd = f.tell()
chunks.append((chunkStart, chunkEnd - chunkStart, fname))
count+=1
if chunkEnd > fileEnd:
break
return chunks
def parallel_apply_line_by_line_chunk(chunk_data):
"""
function to apply a function to each line in a chunk
Params :
chunk_data : the data for this chunk
Returns :
list of the non-None results for this chunk
"""
chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
func_args = chunk_data[4:]
t1 = time.time()
chunk_res = []
with open(file_path, "rb") as f:
f.seek(chunk_start)
cont = f.read(chunk_size).decode(encoding='utf-8')
lines = cont.splitlines()
for i,line in enumerate(lines):
ret = func_apply(line, *func_args)
if(ret != None):
chunk_res.append(ret)
return chunk_res
def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
"""
function to apply a supplied function line by line in parallel
Params :
input_file_path : path to input file
chunk_size_factor : size of 1 chunk in MB
num_procs : number of parallel processes to spawn, max used is num of available cores - 1
skiplines : number of top lines to skip while processing
func_apply : a function which expects a line and outputs None for lines we don't want processed
func_args : arguments to function func_apply
fout : do we want to output the processed lines to a file
Returns :
list of the non-None results obtained be processing each line
"""
num_parallel = min(num_procs, psutil.cpu_count()) - 1
jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)
jobs = [list(x) + [func_apply] + func_args for x in jobs]
print("Starting the parallel pool for {} jobs ".format(len(jobs)))
lines_counter = 0
pool = mp.Pool(num_parallel, maxtasksperchild=1000) # maxtaskperchild - if not supplied some weird happend and memory blows as the processes keep on lingering
outputs = []
for i in range(0, len(jobs), num_parallel):
print("Chunk start = ", i)
t1 = time.time()
chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i : i + num_parallel])
for i, subl in enumerate(chunk_outputs):
for x in subl:
if(fout != None):
print(x, file=fout)
else:
outputs.append(x)
lines_counter += 1
del(chunk_outputs)
gc.collect()
print("All Done in time ", time.time() - t1)
print("Total lines we have = {}".format(lines_counter))
pool.close()
pool.terminate()
return outputs
Say for example, I have a file in which I want to count the number of words in each line, then the processing of each line would look like
def count_words_line(line):
return len(line.strip().split())
and then call the function like:
parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)
Using this, I get a speed up of ~8 times as compared to vanilla line by line reading on a sample file of size ~20GB in which I do some moderately complicated processing on each line.

Categories