Large Wave File not being read in Python


I am trying to do sound analysis in Python on a sound file taken from a high-definition show. The file is very large (2.39 GB). However, whenever I try to open it using the wave module, I get the following error:
wave.Error: unknown format: 65534
I got this file by converting a .ts file into a .wav file. I used the same method on standard definition shows and it worked just fine. I am able to do some analysis using
data = np.memmap(audioclip,dtype='h',mode='r')
but this does not give accurate results: it treats the audio clip as three hours long when it is only one hour long. Any help would be appreciated. I have found similar issues with different error codes, but those have not been much help here. Thank you so much!

Disclaimer: I don't really know that much about Python.
I googled wave.py and found the following link: http://www.opensource.apple.com/source/python/python-3/python/Lib/wave.py
If you look for the function named _read_fmt_chunk you'll see the source of the error message. In short, the wave module in that source only supports WAVE_FORMAT_PCM. Format 65534 (0xFFFE) is WAVE_FORMAT_EXTENSIBLE, a format defined by Microsoft that is used for wave files with more than two channels or more than 16 bits per sample. It's pretty uncommon.
I think you have a few options:
1. Find a new method of converting the file that doesn't produce WAVE_FORMAT_EXTENSIBLE.
2. Modify the source for wave.py to support WAVE_FORMAT_EXTENSIBLE. Assuming the SubFormat field is PCM or IEEE_FLOAT, that wouldn't be a big deal: from that perspective it just increases the size of the header. If it is another SubFormat, you'll need to run an appropriate decoder before you can even get to PCM.
3. Use another tool to convert the WAVE_FORMAT_EXTENSIBLE .wav file to one which is not. sox may be able to handle this.
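Regarding the second part of your question: it's not clear how you are determining the duration of the file, but incorrect assumptions about the number of channels could be throwing you off. For example, np.memmap with dtype='h' treats the whole file as one flat stream of 16-bit values, so dividing the array length by the sample rate overstates the duration by a factor equal to the number of channels (and the header bytes are counted as samples too).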
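As a rough sketch of what reading the larger header could look like (this is not the wave module's own code, and the names read_wav_info and duration_seconds are made up for illustration), the fmt chunk can be unpacked with struct. It assumes the SubFormat is plain PCM or IEEE float and that the file is under 4 GB, so the 32-bit chunk sizes can be trusted:

```python
import struct

WAVE_FORMAT_EXTENSIBLE = 0xFFFE  # 65534, the value from the error message


def read_wav_info(fileobj):
    """Walk the RIFF chunks; return the format fields and the data chunk location."""
    riff, _size, wave_id = struct.unpack("<4sI4s", fileobj.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    info = {}
    while True:
        chunk_header = fileobj.read(8)
        if len(chunk_header) < 8:
            break  # ran out of chunks
        chunk_id, size = struct.unpack("<4sI", chunk_header)
        if chunk_id == b"fmt ":
            fmt = fileobj.read(size + (size % 2))  # chunks are padded to even length
            tag, channels, rate, _byte_rate, block_align, bits = struct.unpack(
                "<HHIIHH", fmt[:16])
            if tag == WAVE_FORMAT_EXTENSIBLE and size >= 40:
                # SubFormat GUID starts at byte 24; its first two bytes
                # carry the underlying format code (1 = PCM, 3 = IEEE float)
                tag = struct.unpack("<H", fmt[24:26])[0]
            info.update(format_tag=tag, channels=channels, sample_rate=rate,
                        block_align=block_align, bits_per_sample=bits)
        elif chunk_id == b"data":
            info["data_offset"] = fileobj.tell()
            info["data_size"] = size
            break
        else:
            fileobj.seek(size + (size % 2), 1)  # skip chunks we don't need
    return info


def duration_seconds(info):
    """block_align is the size of one frame, i.e. one sample for every channel."""
    return info["data_size"] / (info["block_align"] * info["sample_rate"])
```

With those fields, the np.memmap call from the question could be given offset=info["data_offset"] and shape=(frames, channels), where frames is data_size // block_align. That layout only holds for 16-bit samples; wider samples would need a different dtype or manual unpacking.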

Related

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them but only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)

I'm not good at Python, but it seems to be able to dynamically reload code from a changed module; see "How to re import an updated package while in Python Interpreter?"
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up with any large amount of data? If it shows up only with particular files, then once again it is most probably related to some feature of these files and will also show up with a smaller file that has the same feature. If the main cause is just the large amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the files? Is it just a hard drive performance issue, or do you do some heavy processing of the data in your script before actually reaching the part that causes problems? In the latter case, you might be able to do that processing once and write the results to a new file, and then modify your script to load this processed data instead of doing the processing anew each time.
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
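The "process once, then load the result" idea could be sketched with pickle as below. The helper name load_cached and its arguments are hypothetical, and it assumes the parsed result can be pickled and that a cache file newer than the source file is still valid:

```python
import os
import pickle


def load_cached(source_path, cache_path, parse):
    """Return parse(source_path), reusing a pickled copy when it is still fresh."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as cache_file:
            return pickle.load(cache_file)
    # No usable cache: do the slow parse once and remember the result
    result = parse(source_path)
    with open(cache_path, "wb") as cache_file:
        pickle.dump(result, cache_file, protocol=pickle.HIGHEST_PROTOCOL)
    return result
```

The first run pays the full parsing cost; later runs only pay for unpickling, which is usually much cheaper than re-parsing text, though that depends on the data.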

Python WAV audio play Without External Libraries

I have been having some issues for the past few days to get this to work. All I need is to learn it once to get it later. So what I need is a good example of working source code to play a simple wav file. I do not want to use an external library only to get this to work. I honestly don't see the point in getting a huge library to substitute for one problem :/. So if I can get a (Once again, NON-EXTERNAL) example, that would be great. (I'm using windows, so winsound should work, but I can't get the winsound.PlaySound('Example'.wav, SND_FILENAME) thing to work.) Thanks!
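For what it's worth, the call quoted above has two likely problems: the quotes close before .wav, so Python sees an attribute lookup on the string 'Example', and SND_FILENAME is not qualified with the module name (unless it was imported with from winsound import ...). A minimal sketch of a working call, wrapped in a hypothetical helper named play_wav so that it fails gracefully off Windows:

```python
import sys


def play_wav(path):
    """Play a WAV file with the standard library on Windows; return False elsewhere."""
    if sys.platform != "win32":
        return False  # winsound only exists on Windows
    import winsound
    # The whole file name is one string, and the flag lives in the winsound namespace
    winsound.PlaySound(path, winsound.SND_FILENAME)
    return True
```

On Windows, play_wav('Example.wav') blocks until the sound has finished, because SND_FILENAME on its own plays synchronously.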

create pdf from python

I'm looking to generate PDFs from a Python application.
They start relatively simple, but some may become more complex (essentially letter-like documents, but they will later include watermarks, for example).
I've worked in raw PostScript before, and provided I can generate the correct headers etc. and end up with a valid file, I want to avoid complex libs that may not do entirely what I want. Some seem to have suffered bitrot and are no longer supported (pypdf and pypdf2). This is especially so when I know PDF/PostScript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But inspecting PDFs, there is a little binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this, as the production environment is a Windows 2008 server (dev is Ubuntu 12.04), and making something only to convert it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?

borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
But really, you should probably just use a library.
Thanks to @LukasGraf for providing this link, which shows how to create a simple hello world PDF from scratch: http://www.gnupdf.org/Introduction_to_PDF
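To make the offset bookkeeping concrete, here is a minimal sketch in Python that writes a one-page, text-only PDF and fills in the xref table and the startxref number as it goes. The function name build_pdf is made up for illustration, and the page size (US Letter), font (Helvetica) and text position are arbitrary assumptions; it handles only Latin-1 text and no binary streams:

```python
def build_pdf(text):
    """Return the bytes of a one-page PDF showing `text` in 24 pt Helvetica."""
    # Parentheses and backslashes must be escaped inside a PDF string
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)  # this is the number that follows "startxref"
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing build_pdf("Hello, world") to a file opened in binary mode should give something a PDF viewer can open. Building the file as bytes rather than text matters here, because the offsets are byte positions.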

As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.

I recommend you to use a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library check the docs here.

How to parse a .shp file?

I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that can be opened in Excel and contains the information from the feature class's table.
However, when I try to open a .shp file in any program (Excel, TextPad, etc.), all I get is a bunch of gibberish and unusual characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any modules or non built-in tools, as I am genuinely interested in how this process would work and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!

Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use python to read such a file?"
The answer is, this link contains a description of the format of the file. Write a dissector based on the technical specification.
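As an illustration of what such a dissector might look like for polylines, here is a sketch using only the struct module. It follows the published shapefile layout as I understand it (a 100-byte file header, then records whose headers are big-endian and whose contents are little-endian); the function name read_polylines is made up, and Z/M variants and other shape types are deliberately not handled:

```python
import struct

POLYLINE = 3  # shape type code for PolyLine


def read_polylines(fileobj):
    """Yield each polyline as a list of parts, each part a list of (x, y) vertices."""
    header = fileobj.read(100)
    file_code, = struct.unpack(">i", header[0:4])    # big-endian
    shape_type, = struct.unpack("<i", header[32:36])  # little-endian
    if file_code != 9994:
        raise ValueError("not a shapefile")
    if shape_type != POLYLINE:
        raise ValueError("this sketch only handles polyline shapefiles")
    while True:
        record_header = fileobj.read(8)
        if len(record_header) < 8:
            break  # end of file
        _number, length_words = struct.unpack(">ii", record_header)
        content = fileobj.read(length_words * 2)  # length is in 16-bit words
        record_type, = struct.unpack("<i", content[0:4])
        if record_type == 0:
            yield []  # null shape
            continue
        if record_type != POLYLINE:
            raise ValueError("unexpected shape type %d" % record_type)
        # Bytes 4..36 are the bounding box; counts follow it
        num_parts, num_points = struct.unpack("<ii", content[36:44])
        parts_end = 44 + 4 * num_parts
        parts = struct.unpack("<%di" % num_parts, content[44:parts_end])
        coords = struct.unpack("<%dd" % (2 * num_points),
                               content[parts_end:parts_end + 16 * num_points])
        points = list(zip(coords[0::2], coords[1::2]))
        # parts holds the index of the first point of each part
        bounds = list(parts) + [num_points]
        yield [points[a:b] for a, b in zip(bounds, bounds[1:])]
```

The file has to be opened in binary mode ("rb") before being passed in; the "gibberish" seen in a text editor is just these packed integers and doubles.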

If you don't want to go to all the trouble of writing a parser, you should take a look at pyshp, a pure Python shapefile library. I've been using it for a couple of months now, and have found it quite easy to use.
There's also a python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.

This might be a long shot, but you should check out ctypes, and maybe use the .dll file that came with a program that can read that type of file (if such a .dll even exists). In my experience, things get weird when you start digging around in .dlls.

python with .pdb files

I am working on bio project.
I have .pdb (protein data bank) file which contains information about the molecule.
I want to find out the following of a molecule in the .pdb file:
Molecular Mass.
H bond donor.
H bond acceptor.
LogP.
Refractivity.
Is there any module in python which can deal with .pdb file in finding this?
If not then can anyone please let me know how can I do the same?
I found some modules like sequtils and protienparam but they don't do such things.
I have researched first and then posted, so, please don't down-vote.
Please comment, if you still down-vote as to why you did so.
Thanks in advance.

I don't know if it fits your needs, but Biopython looks like it might help.

The Protein Data Bank also provides entries in an XML format, PDBML, which can be easily parsed using an XML parsing library.
http://pdbml.pdb.org/

A pdb file can contain pretty much anything.
A lot of projects allow you to parse them. Some are specific to biology and pdb files; others are less specific but allow you to do more (set up calculations, measure distances, angles, etc.).
I think you got downvoted because these projects are numerous: you are not the only one wanting to do that so the chances that something perfectly fitting your needs exists are really high.
That said, if you just want to parse pdb files for this specific need, just do it yourself:
1. Open the files with a text editor.
2. Identify where the relevant data are (keywords, etc.).
3. Write a Python function that opens the file and looks for the keywords.
4. Extract the figures from the file.
Done.
This can be done with a short script written in less than 10 minutes (another likely reason for the downvotes).
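The steps above could be sketched as follows for the molecular mass only. This assumes the file follows the standard fixed-column PDB layout, where ATOM and HETATM records carry the element symbol in columns 77-78; the function names and the small mass table are illustrative, and many PDB files omit hydrogens, so the sum can undercount. The other properties (H-bond donors and acceptors, LogP, refractivity) depend on bonding information and would need a cheminformatics toolkit rather than simple text parsing:

```python
from collections import Counter

# Standard atomic masses for a few common elements; extend as needed
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007,
               "O": 15.999, "P": 30.974, "S": 32.06}


def count_elements(lines):
    """Count atoms per element from the ATOM/HETATM records of a PDB file."""
    counts = Counter()
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            # Columns 77-78 (0-based slice 76:78) hold the element symbol
            element = line[76:78].strip().capitalize()
            if element:
                counts[element] += 1
    return counts


def molecular_mass(counts):
    """Sum the atomic masses; raises KeyError for elements missing from the table."""
    return sum(ATOMIC_MASS[element] * n for element, n in counts.items())
```

A typical call would be molecular_mass(count_elements(open("molecule.pdb"))), where molecule.pdb is a placeholder file name.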
