pydub MemoryError when loading large mp3 files in Python

I am trying to split large podcast mp3 files into smaller 5-minute chunks using Python and the pydub library. This is my code:
from pydub import AudioSegment

folder = r"C:\temp"
filename = r"p967.mp3"
sound = AudioSegment.from_mp3(folder + "\\" + filename)
This works fine for small files, but the large podcasts I am interested in are 100 MB+, and for those it returns the following error.
Traceback (most recent call last):
File "C:\temp\mp3split.py", line 6, in <module>
sound = AudioSegment.from_mp3(folder + "\\" + filename)
File "C:\Python27\lib\site-packages\pydub\audio_segment.py", line 522, in from_mp3
return cls.from_file(file, 'mp3', parameters)
File "C:\Python27\lib\site-packages\pydub\audio_segment.py", line 511, in from_file
obj = cls._from_safe_wav(output)
File "C:\Python27\lib\site-packages\pydub\audio_segment.py", line 544, in _from_safe_wav
return cls(data=file)
File "C:\Python27\lib\site-packages\pydub\audio_segment.py", line 146, in __init__
data = data if isinstance(data, (basestring, bytes)) else data.read()
MemoryError
Is this a limitation of the library? Should I be using an alternative approach to achieve this?
If I add the following code to check the memory status at the point of running:
import psutil
print psutil.virtual_memory()
This prints:
svmem(total=8476975104L, available=5342715904L, percent=37.0, used=3134259200L, free=5342715904L)
This suggests to me that there is plenty of memory at the start of the operation, though I am happy to be proven wrong.

Yes, the most likely cause is that you've simply run out of available memory. Do you know how much memory you have available before you execute that statement? Consider inserting a system call (see the os package) just before the failing statement.
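Separately, for the splitting itself once the file does load: pydub AudioSegment objects can be sliced by milliseconds, so a chunking loop can look roughly like the sketch below. This is a minimal sketch reusing the folder, filename, and 5-minute chunk length from the question; note that the whole file is still decoded into memory first, so the MemoryError above has to be dealt with before this helps.
from pydub import AudioSegment

folder = r"C:\temp"
filename = r"p967.mp3"
sound = AudioSegment.from_mp3(folder + "\\" + filename)

chunk_ms = 5 * 60 * 1000  # 5 minutes in milliseconds
for i, start in enumerate(range(0, len(sound), chunk_ms)):
    chunk = sound[start:start + chunk_ms]  # AudioSegment slicing is by milliseconds
    chunk.export(folder + "\\chunk_%03d.mp3" % i, format="mp3")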


python joblib returning `TypeError: cannot pickle 'weakref' object` with the same code, but different input data

I'm trying to parallelise code which converts strings to a third party package object using a function defined in that library. However, joblib is failing depending on the input data that I provide. Are the return types of the function important when using joblib?
To reproduce the issue:
First install third party libraries with:
pip install joblib music21
and download the data file test_input.abc (it's 4kb of text).
This code will run fine as a script:
from typing import List
import music21
from joblib import Parallel, delayed
def convert_string(string: str, format: str = "abc") -> music21.stream.Score:
    return music21.converter.parse(string, format=format)

def convert_list_of_strings(
    string_list,
    n_jobs=-1,
    prefer=None
) -> List[music21.stream.Score]:
    return Parallel(n_jobs=n_jobs, prefer=prefer)(
        delayed(convert_string)(string) for string in string_list
    )

if __name__ == "__main__":
    string_list = ['T:tune\nM:3/4\nL:1/8\nK:C\nab cd ef|GA BC DE' for _ in range(1000)]
    output = convert_list_of_strings(string_list)
    print(output)
i.e. it returns a list of music21.stream.Score objects.
However, if you change the main call to read the attached file, i.e.:
if __name__ == "__main__":
    filepath = "test_input.abc"
    tune_sep = "\n\n"
    with open(filepath, "r") as file_object:
        string_list = file_object.read().strip().split(tune_sep)
    output = convert_list_of_strings(string_list)
    print(output)
This will return the following error:
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/path/to/venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 356, in _sendback_result
result_queue.put(_ResultItem(work_id, result=result,
File "/path/to/venv/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 241, in put
obj = dumps(obj, reducers=self._reducers)
File "/path/to/venv/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "/path/to/venv/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "/path/to/venv/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'weakref' object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "joblib_test.py", line 50, in <module>
output = convert_list_of_strings(string_list)
File "joblib_test.py", line 39, in convert_list_of_strings
return Parallel(n_jobs=n_jobs, prefer=prefer)(
File "/path/to/venv/lib/python3.8/site-packages/joblib/parallel.py", line 1054, in __call__
self.retrieve()
File "/path/to/venv/lib/python3.8/site-packages/joblib/parallel.py", line 933, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/path/to/venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/path/to/venv/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/path/to/venv/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
TypeError: cannot pickle 'weakref' object
What I've tried to fix
Checked that the code runs fine when run without parallelism i.e. set n_jobs=1
Changed joblib's backend to use threads i.e. set prefer="threads" - this fixes things, i.e. it will not error - but I don't want to use threads!
Tried serializing the output of the function, i.e. changing convert_string to:
def convert_string(string: str, format: str = "abc") -> str:
    return music21.converter.freezeStr(music21.converter.parse(string, format=format))
this also means the code will run...but now I have thousands of objects I need to deserialize!
Checked that the input data is the same type when reading from the file as in the first method (i.e. it's a list of strings)
Checked that the output data is the same type when reading from the file as in the first method (i.e. it's a list of music21.stream.Score)
Faffed about with #wrap_non_pickleable_objects
So I'm guessing that the content of the music21.stream.Score objects in the output is causing the issue?
Change music21.sites.WEAKREF_ACTIVE = False (you might need to edit music21/sites.py directly) and music21 won't use any weak references. They'll probably disappear in v8 anyhow (or maybe even sooner, since they're mostly an implementation detail). They were needed for running music21 in the pre-Python 2.6 circular-reference-counting era, but they're not really necessary anymore.
However, you're not going to get much of a speedup in your code, because serializing and deserializing the Stream to pass it across the multiprocessing worker-core -> controller-core boundary will usually take as long as parsing the file itself, if not more. I can't find where I wrote it, but there's a guide to running music21 in parallel that suggests doing all your stream parsing in the worker core and only passing back small data structures (counts of notes, etc.), not the whole score.
Oh, for some of these things, music21's common.parallel library (which uses joblib) will help make common tasks easier:
https://web.mit.edu/music21/doc/moduleReference/moduleCommonParallel.html
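A rough sketch of that "small data back from the worker" pattern, assuming the goal were something like a per-tune note count (convert_and_count, count_notes, and the note-counting expression are illustrative, not part of the question's code):
import music21
from joblib import Parallel, delayed

def convert_and_count(string, format="abc"):
    # parse inside the worker and return only a small, easily pickled value
    score = music21.converter.parse(string, format=format)
    return sum(1 for _ in score.recurse().notes)

def count_notes(string_list, n_jobs=-1):
    return Parallel(n_jobs=n_jobs)(
        delayed(convert_and_count)(s) for s in string_list
    )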

OSError: file not found

I'm trying to write a script that needs to rename (in the script itself, not in the folder) some .txt files to be able to use them in a loop, enumerating them.
I decided to use a dictionary, something like this:
import os
import fnmatch

dsc = {}
for filename in os.listdir('./texto'):
    if fnmatch.fnmatch(filename, 'dsc_hydra*.txt'):
        dsc[filename[:6]] = filename

print(dsc)
print(dsc['dsc_hydra1'])
The print(...) calls are just there to check that everything is going well.
I need to rename them because I'm using them in future functions and I don't want to address them using all that path stuff, something like:
IFOV = gi.IFOV_generic(gmatOUTsat1, matrixINPUTsat1, dsc['dsc_hydra1'], 'ifovfileMST.json', k_lim, height, width)
Using dsc['dsc_hydra1'], I get this error:
Traceback (most recent call last):
File "mainSMART_MST.py", line 429, in <module>
IFOV1= gi.IFOV_generic(gmatOUTsat1,matrixINPUTsat1,dsc['dsc_hydra1'],'ifovfileMST.jso',k_lim, height, width)
File "/home/alumno/Escritorio/HDD_Nuevo/HO(PY)/src/generateIFOV.py", line 49, in IFOV_generic
DCM11,DCM12,DCM13,DCM21,DCM22,DCM23,DCM31,DCM32,DCM33 = np.loadtxt(gmatDCM,unpack=True,skiprows = 2,dtype = float)
File "/home/alumno/.local/lib/python3.5/site-packages/numpy/lib/npyio.py", line 962, in loadtxt
fh = np.lib._datasource.open(fname, 'rt', encoding=encoding)
File "/home/alumno/.local/lib/python3.5/site-packages/numpy/lib/_datasource.py", line 266, in open
return ds.open(path, mode, encoding=encoding, newline=newline)
File "/home/alumno/.local/lib/python3.5/site-packages/numpy/lib/_datasource.py", line 624, in open
raise IOError("%s not found." % path)
OSError: dsc_hydra1.txt not found.
I've already checked the folder and the file is there, so why do I keep getting this error?
I had this same issue. It cannot locate the .txt file because you're running the script from the wrong directory. Make sure you execute the code from the directory the relative paths expect, or store the full path to each file. Hope this helps.
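One way to sidestep the working-directory problem is to store full paths in the dictionary instead of bare filenames. A minimal sketch built on the question's code ('./texto' is the folder used there; the key is derived with os.path.splitext so the dsc['dsc_hydra1'] lookup still works):
import os
import fnmatch

folder = os.path.abspath('./texto')
dsc = {}
for filename in os.listdir(folder):
    if fnmatch.fnmatch(filename, 'dsc_hydra*.txt'):
        # key 'dsc_hydra1', value '/full/path/to/dsc_hydra1.txt'
        dsc[os.path.splitext(filename)[0]] = os.path.join(folder, filename)

print(dsc['dsc_hydra1'])  # an absolute path that np.loadtxt can open from any directory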
I had the same problem. In my case there was a space at the end of a string inside the file. You should check for stray spaces! For example, inside the file (space shown as -):
-365-
string1-
string2
-string3
If you remove all the spaces (-), it should work!

How do I turn my .tar.gz file into a file-like object for shutil.copyfileobj?

My goal is to extract a file from a .tar.gz archive without also extracting the subdirectories that precede the desired file. I am trying to model my method on this question. I already asked a question of my own, but the answer I thought would work didn't work fully.
In short, shutil.copyfileobj isn't copying the contents of my file.
My code is now:
import os
import shutil
import tarfile
import gzip

with tarfile.open('RTLog_20150425T152948.gz', 'r:*') as tar:
    for member in tar.getmembers():
        filename = os.path.basename(member.name)
        if not filename:
            continue
        source = tar.fileobj
        target = open('out', "wb")
        shutil.copyfileobj(source, target)
Upon running this code the file out was successfully created; however, the file was empty. I know that the file I wanted to extract does, in fact, have lots of information (approximately 450 kb). A print(member.size) returns 1564197.
My attempts to solve this were unsuccessful. A print(type(tar.fileobj)) told me that tar.fileobj is a <gzip _io.BufferedReader name='RTLog_20150425T152948.gz' 0x3669710>.
Therefore I tried changing source to: source = gzip.open(tar.fileobj) but this raised the following error:
Traceback (most recent call last):
File "C:\Users\dzhao\Desktop\123456\444444\blah.py", line 15, in <module>
shutil.copyfileobj(source, target)
File "C:\Python34\lib\shutil.py", line 67, in copyfileobj
buf = fsrc.read(length)
File "C:\Python34\lib\gzip.py", line 365, in read
if not self._read(readsize):
File "C:\Python34\lib\gzip.py", line 433, in _read
if not self._read_gzip_header():
File "C:\Python34\lib\gzip.py", line 297, in _read_gzip_header
raise OSError('Not a gzipped file')
OSError: Not a gzipped file
Why isn't shutil.copyfileobj actually copying the contents of the file in the .tar.gz?
fileobj isn't a documented property of TarFile. It's probably an internal object used to represent the whole tar file, not something specific to the current file.
Use TarFile.extractfile() to get a file-like object for a specific member:
…
source = tar.extractfile(member)
target = open("out", "wb")
shutil.copyfileobj(source, target)
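Putting that together with the question's loop, the full extraction might look roughly like this (a sketch that writes each member's basename into the current directory; the archive name is the one from the question):
import os
import shutil
import tarfile

with tarfile.open('RTLog_20150425T152948.gz', 'r:*') as tar:
    for member in tar.getmembers():
        filename = os.path.basename(member.name)
        if not filename:
            continue  # skip directory entries
        source = tar.extractfile(member)  # file-like object for just this member
        if source is None:
            continue  # skip special members (e.g. links)
        with open(filename, 'wb') as target:
            shutil.copyfileobj(source, target)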

unable to read some .wav files using scipy.io.wavfile.read()

I am trying to read .wav files using scipy.io.wavfile. It reads some files properly.
For some files it gives the following error:
Warning (from warnings module):
File "D:\project\cardiocare-1.0\src\scipy\io\wavfile.py", line 121
warnings.warn("chunk not understood", WavFileWarning)
WavFileWarning: chunk not understood
Traceback (most recent call last):
File "D:\project\cardiocare-1.0\src\ccare\plot.py", line 37, in plot
input_data = read(p.bitfile)
File "D:\project\cardiocare-1.0\src\scipy\io\wavfile.py", line 119, in read
data = _read_data_chunk(fid, noc, bits)
File "D:\project\cardiocare-1.0\src\scipy\io\wavfile.py", line 56, in _read_data_chunk
data = data.reshape(-1,noc)
ValueError: total size of new array must be unchanged
Can anyone suggest a solution?
I use the code below to read wav files. I know it does not solve your problem, but maybe you could read your wav file with this code and figure out what is wrong.
My experience is that wav files sometimes contain "strange" things that must be handled or removed.
Hope it helps you out
Rgds
Cyrex
import wave
import struct

import numpy as np

def wavRead(fileN):
    waveFile = wave.open(fileN, 'r')
    NbChanels = waveFile.getnchannels()
    data = []
    for x in range(NbChanels):
        data.append([])
    for i in range(0, waveFile.getnframes()):
        # one frame holds one 16-bit sample per channel
        waveData = waveFile.readframes(1)
        samples = struct.unpack("<%dh" % NbChanels, waveData)
        for ch in range(NbChanels):
            data[ch].append(int(samples[ch]))
    RetAR = []
    BitDebth = waveFile.getsampwidth() * 8
    for x in range(NbChanels):
        RetAR.append(np.array(data[x]))
        RetAR[-1] = RetAR[-1] / float(pow(2, (BitDebth - 1)))
    fs = waveFile.getframerate()
    return RetAR, fs
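Usage would be along these lines (the filename is just a placeholder):
channels, fs = wavRead('some_recording.wav')  # placeholder filename
left = channels[0]  # first channel as a numpy array normalized to roughly [-1, 1)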

python errno 24 on cgi script using subprocess

I have a python cgi script that runs an application via subprocess over and over again (several thousand times). I keep getting the same error...
Traceback (most recent call last):
File "/home/linuser/Webpages/cgi/SnpEdit.py", line 413, in <module>
webpage()
File "/home/linuser/Webpages/cgi/SnpEdit.py", line 406, in main
displayOmpResult(form['odfFile'].value)
File "/home/linuser/Webpages/cgi/SnpEdit.py", line 342, in displayContainerDiv
makeSection(position,sAoiInput)
File "/home/linuser/Webpages/cgi/SnpEdit.py", line 360, in displayData
displayTable(i,j,lAmpAndVars,dOligoSet[key],position)
File "/home/linuser/Webpages/cgi/SnpEdit.py", line 247, in displayTable
p = subprocess.Popen(['/usr/bin/pDat',sInputFileLoc,sOutputFileLoc],stdout=fh, stderr=fh)
File "/usr/lib/python2.6/subprocess.py", line 633, in __init__
errread, errwrite)
File "/usr/lib/python2.6/subprocess.py", line 1039, in _execute_child
errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
The function causing it is below.
def displayTable(sData):
    # convert the data to the proper format
    sFormattedData = convertToFormat(sData)

    # write the formatted data to file
    sInputFile = tempfile.mkstemp(prefix='In_')[1]
    fOpen = open(sInputFile, 'w')
    fOpen.write(sFormattedData)
    fOpen.close()

    sOutputFileLoc = sInputFile.replace('In_', 'Out_')

    # run app; it requires two files, an input and an output
    # temp file to hold stdout and stderr of the subprocess
    fh = tempfile.TemporaryFile(mode='w', dir=tempfile.gettempdir())
    p = subprocess.Popen(['/usr/bin/pDat', sInputFileLoc, sOutputFileLoc], stdout=fh, stderr=fh)
    p.communicate()
    fh.close()

    # open output file and parse the data into a list of dictionaries
    sOutput = open(sOutputFileLoc).read()
    lOutputData = parseOutput(sOutput)

    displayTableHeader(lOutputData)
    displaySimpleTable(lOutputData)
As far as I can tell, I'm closing the files properly. When I run...
import resource
print resource.getrlimit(resource.RLIMIT_NOFILE)
I get...
(1024, 1024)
Do I have to increase this value? I read that subprocess opens several file descriptors. I tried adding close_fds=True and I tried using the with statement when creating my file, but the result was the same. I suspect the problem may be with the application that I'm subprocessing, pDat, but this program was made by someone else. It requires two inputs: an input file and the location where you want the output file written. I suspect it may not be closing the output file that it creates. Aside from this, I can't see what I might be doing wrong. Any suggestions? Thanks.
EDIT:
I'm on Ubuntu 10.04 running Python 2.6.5 and Apache 2.2.14.
Instead of this...
sInputFile = tempfile.mkstemp(prefix='In_')[1]
fOpen = open(sInputFile,'w')
fOpen.write(sFormattedData)
fOpen.close()
I should have done this...
iFileHandle,sInputFile = tempfile.mkstemp(prefix='In_')
fOpen = open(sInputFile,'w')
fOpen.write(sFormattedData)
fOpen.close()
os.close(iFileHandle)
The mkstemp function returns an OS-level handle to the file, and I wasn't closing it. The solution is described in more detail here...
http://www.logilab.org/blogentry/17873
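An alternative sketch that avoids opening the file a second time by name is to wrap the descriptor returned by mkstemp with os.fdopen (sFormattedData here is just a stand-in for the formatted data in the question's function):
import os
import tempfile

sFormattedData = "example contents\n"  # stand-in for the question's formatted data

iFileHandle, sInputFile = tempfile.mkstemp(prefix='In_')
with os.fdopen(iFileHandle, 'w') as fOpen:
    # closing fOpen also closes the OS-level handle returned by mkstemp
    fOpen.write(sFormattedData)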
You want to add close_fds=True to the Popen call (just in case).
Then, here:
# open output file and print parsed data into a list of dictionaries
sOutput = open(sOutputFileLoc).read()
lOutputData = parseOutput(sOutput)
...I might be remembering wrong, but unless you use the with syntax, I do not think the output file descriptor gets closed.
UPDATE: the main problem is that you need to know which files are open. On Windows this would require something like Process Explorer. On Linux it's a bit simpler; you just have to invoke the CGI from the command line, or be sure that there is only one instance of the CGI running, and fetch its pid with the ps command.
Once you have the pid, run ls -la on the contents of the /proc/<PID>/fd directory. All open file descriptors will be listed there, with the names of the files they point to. Knowing that file so-and-so is open 377 times goes a long way toward finding out where exactly that file is opened (but not closed).
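As a sketch of both suggestions applied to the question's function (same variable names as in the question; close_fds=True added, and the output file managed by a with block so its descriptor is always closed):
# temp file holding stdout/stderr of the subprocess
fh = tempfile.TemporaryFile(mode='w', dir=tempfile.gettempdir())
p = subprocess.Popen(['/usr/bin/pDat', sInputFileLoc, sOutputFileLoc],
                     stdout=fh, stderr=fh, close_fds=True)
p.communicate()
fh.close()

# the with block guarantees the output descriptor is closed after reading
with open(sOutputFileLoc) as fOutput:
    sOutput = fOutput.read()
lOutputData = parseOutput(sOutput)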
