I've been trying to parse TIFF and JPEG image EXIF data using the BitStream module and just can't get it working. Has anyone had better luck with it? If so, could you point me towards, or share, any example snippets?
I've spent a long time looking at and testing existing Python EXIF modules and so far all of them seem to be incomplete and unable to fully parse current EXIF image data.
I also looked at the Perl-based EXIFTool, which is a very complete tool, but invoking it through a shell call from Python is about 10x slower than native Python code doing the same work.
The BitStream module really seems like the right Python tool for parsing binary data.
Yet so far the learning curve has been steep.
The biggest problem I am running into is that the EXIF image spec documents do not match what I see when I parse an image using BitStream.
It feels like I am getting closer, but I could use some advice or code snippets from others who have spent time parsing image file EXIF headers using BitStream and Python to get this back on track.
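One common reason for a spec not matching the bytes on disk is byte order: a TIFF header starts with "II" (little-endian) or "MM" (big-endian), and every later field has to be read in that order. The following is a minimal sketch using the standard struct module rather than BitStream; the function name read_tiff_ifd0 is made up for illustration, and it only lists the raw entries of the first IFD. For a JPEG, the same structure sits inside the APP1 segment after the "Exif\0\0" marker, and offsets are relative to the start of the TIFF header, which this sketch does not locate for you.

```python
import struct


def read_tiff_ifd0(data):
    """Return (endian, entries) for the first IFD of a TIFF byte string.

    Each entry is (tag, field_type, count, raw_value_bytes). The 4 raw
    bytes hold either the value itself or an offset to it, depending on
    the field type and count; decoding them is left to the caller.
    """
    order = data[:2]
    if order == b"II":
        endian = "<"  # little-endian ("Intel")
    elif order == b"MM":
        endian = ">"  # big-endian ("Motorola")
    else:
        raise ValueError("not a TIFF header")

    # Bytes 2-3: magic number 42; bytes 4-7: offset of the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")

    # The IFD starts with a 2-byte entry count, then 12 bytes per entry.
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    entries = []
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        tag, field_type, n = struct.unpack(endian + "HHI", data[start:start + 8])
        entries.append((tag, field_type, n, data[start + 8:start + 12]))
    return endian, entries
```

Comparing the tags this returns against the tag numbers in the spec is a quick way to check whether the byte order and offsets are being interpreted correctly.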
Related
I am trying to extract text from a PDF using PyPDF2, but instead of the actual text the output shows something else. What could be the reason?
Here is my code:
import PyPDF2

# Open the PDF in binary mode and read the first page
xfile = open('filename', 'rb')
pdfReader = PyPDF2.PdfFileReader(xfile)
num = pdfReader.numPages
pageobj = pdfReader.getPage(0)
print(pageobj.extractText())
When I run the above program I get this output:
!"#$%#&'(%!#)
(((((((((((((((((((((((((((((((((((((((((((((((((!"#$%#&'(%!#)*+,-./0!$1(230
4444444444445674+8,8,9:+*8
4&*)+!,$-.
4,*7;44444444444444444444444444
4$/012/($/3414546(78(,69:/7;7<=(>"#)?#(A2B2/231
(444<(4=&2#4$>4?&#!0$24A>/$>&&#$>/B4?CDEF4+(;8
4,*7,444*B62C;2/0(#B(%69(%9:77;#("1;23D5B
((((?C<GA47,H#B48:(,*I
4,*7*444E2F2:2B(.2G702=2(A10=2;2=2#("1;23D5B
((((?<GA47*H#B4?CDEF46(8
44%'$HH%(!.*($.,&I&%,%
PDF is a file format oriented around page layout, so the text in a PDF can be stored in various ways. There is no guarantee that your PDF stores its text in a form PyPDF2 can read.
Moving forward: try extracting text from other PDFs before concluding that your PyPDF2 setup is at fault.
You can also try OCR with pytesseract and see whether the result improves.
From PyPDF2's documentation:
This works well for some PDF files, but poorly for others, depending on the generator used.
Your PDF might be of the latter category and you are SOL...
With PyPDF2 not being actively developed anymore (no updates to the PyPI package since 2016), maybe try a more up-to-date package like pdftotext.
I am trying to do sound analysis on a file in Python, and I have a sound file from a show that is high definition and it is very large (2.39 GB). However, whenever I try to open this using the wave module, I get the following error:
wave.Error: unknown format: 65534
I got this file by converting a .ts file into a .wav file. I used the same method on standard definition shows and it worked just fine. I am able to do some analysis using
data = np.memmap(audioclip,dtype='h',mode='r')
However, this does not give accurate results: it thinks the audio clip is 3 hours long when it is only one hour long. Any help would be appreciated. I have found similar questions with different error codes, but they have not helped with this issue. Thank you so much!
Disclaimer: I don't really know that much about python.
I googled wave.py and found the following link: http://www.opensource.apple.com/source/python/python-3/python/Lib/wave.py
If you look for the function named _read_fmt_chunk you'll see the source of the error message. In short, the wave module only supports WAVE_FORMAT_PCM. Format 65534 is a format called WAVE_FORMAT_EXTENSIBLE defined by Microsoft and is used for multi-channel wave files. It's pretty uncommon.
I think you have a few options:
Find a new method of converting the file that doesn't produce WAVE_FORMAT_EXTENSIBLE
Modify the source for wave.py to support WAVE_FORMAT_EXTENSIBLE - assuming the SubFormat field is PCM or IEEE_FLOAT that wouldn't be a big deal. From that perspective it just increases the size of the header. If it is another SubFormat then you'll need to run an appropriate decoder before you can even get to PCM.
Use another tool to convert the WAVE_FORMAT_EXTENSIBLE .wav file to one which is not. sox may be able to handle this.
Regarding the second part of your question: it's not clear how you are determining the duration of the file, but if you make incorrect assumptions about the number of channels, that could be throwing you off.
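To check what the file actually contains without patching wave.py, a rough sketch like the one below walks the RIFF chunks with struct and reads the fmt chunk directly. The function name read_wav_info and the dictionary keys are made up for illustration. It assumes a well-formed file with a single data chunk whose size field is valid, and it only inspects the first two bytes of the SubFormat GUID, which hold the underlying format code (1 for PCM, 3 for IEEE float).

```python
import struct

WAVE_FORMAT_PCM = 0x0001
WAVE_FORMAT_EXTENSIBLE = 0xFFFE  # 65534, the value in the error message


def read_wav_info(f):
    """Read format details and duration from a binary file object."""
    riff, _size, wave_id = struct.unpack("<4sI4s", f.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")

    info = {}
    while True:
        header = f.read(8)
        if len(header) < 8:
            break  # end of file
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fmt = f.read(chunk_size)
            (tag, channels, rate, _byte_rate,
             block_align, bits) = struct.unpack("<HHIIHH", fmt[:16])
            info.update(format_tag=tag, channels=channels, sample_rate=rate,
                        block_align=block_align, bits_per_sample=bits)
            if tag == WAVE_FORMAT_EXTENSIBLE and len(fmt) >= 40:
                # cbSize (2), valid bits (2), channel mask (4), then the
                # 16-byte SubFormat GUID starting at offset 24.
                (info["sub_format"],) = struct.unpack("<H", fmt[24:26])
        else:
            if chunk_id == b"data":
                info["data_bytes"] = chunk_size
            f.seek(chunk_size, 1)  # skip the chunk body without reading it
        if chunk_size % 2:
            f.seek(1, 1)  # chunks are padded to an even length

    if "data_bytes" in info and info.get("block_align"):
        # block_align is the size of one frame (all channels together).
        frames = info["data_bytes"] / float(info["block_align"])
        info["duration_seconds"] = frames / info["sample_rate"]
    return info
```

Usage would be along the lines of opening the file with open(audioclip, 'rb') and passing the file object in. Dividing by block_align rather than by a fixed 2 bytes is what keeps the duration correct when the file has more than one channel or a sample width other than 16 bits.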
I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that opens in Excel and contains the information from the feature class's table.
However, when I try to open a .shp file in any program (Excel, TextPad, etc.), all I get is a bunch of gibberish and unusual ASCII characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any modules or non built-in tools, as I am genuinely interested in how this process would work and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!
Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use python to read such a file?"
The answer is, this link contains a description of the format of the file. Write a dissector based on the technical specification.
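As a rough illustration of what such a dissector looks like, here is a sketch based on the published ESRI shapefile layout, using only the built-in struct module. The function names read_shp_header and read_polylines are made up for illustration. It assumes a plain PolyLine file (shape type 3) and skips everything else, including the Z and M variants.

```python
import struct


def read_shp_header(f):
    """Parse the 100-byte main header at the start of a .shp file."""
    header = f.read(100)
    # File code and file length are big-endian, with 20 unused bytes between.
    file_code, length_words = struct.unpack(">i20xi", header[:28])
    if file_code != 9994:
        raise ValueError("not a shapefile")
    # Version, shape type and bounding box are little-endian.
    version, shape_type = struct.unpack("<ii", header[28:36])
    bbox = struct.unpack("<4d", header[36:68])  # xmin, ymin, xmax, ymax
    return {"length_bytes": length_words * 2, "version": version,
            "shape_type": shape_type, "bbox": bbox}


def read_polylines(f):
    """Yield (record_number, parts) for each PolyLine record.

    f must be positioned just after the 100-byte header. Each part is a
    list of (x, y) vertex tuples.
    """
    while True:
        rec_header = f.read(8)
        if len(rec_header) < 8:
            break
        # Record header is big-endian; content length is in 16-bit words.
        rec_num, content_words = struct.unpack(">ii", rec_header)
        content = f.read(content_words * 2)
        (shape_type,) = struct.unpack("<i", content[:4])
        if shape_type != 3:  # 3 = PolyLine; skip null shapes and others
            continue
        # Bytes 4-35 hold the record's bounding box, then the two counts.
        num_parts, num_points = struct.unpack("<ii", content[36:44])
        pts_off = 44 + 4 * num_parts
        starts = struct.unpack("<%di" % num_parts, content[44:pts_off])
        flat = struct.unpack("<%dd" % (2 * num_points),
                             content[pts_off:pts_off + 16 * num_points])
        points = list(zip(flat[0::2], flat[1::2]))
        bounds = list(starts) + [num_points]
        yield rec_num, [points[bounds[i]:bounds[i + 1]]
                        for i in range(num_parts)]
```

The file would be opened in binary mode ('rb'), passed to read_shp_header first, and then to read_polylines. The mix of big-endian and little-endian fields is part of the format and is the main thing that makes the file look like gibberish in a text editor.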
If you don't want to go to all the trouble of writing a parser, you should take a look at pyshp, a pure Python shapefile library. I've been using it for a couple of months now, and have found it quite easy to use.
There's also a Python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.
This might be a long shot, but you could check out ctypes and use the .dll file that ships with a program that can read that type of file (if one even exists). In my experience, things get weird when you start digging around in .dlls.
I'm gathering basic metadata for images - mainly their dimensions, although it'd be nice to get any other available metadata as well. The image formats I'm interested in are png, jpg, and gif.
I'm using PIL at the moment, but it occurred to me there may be a simpler way that doesn't involve external dependencies or binary libraries. Is there one?
I don't think there is anything built in, but if you look up those file formats, you will find that the size is encoded near the beginning of the file.
You can use the struct module to parse just enough of the header to work out the size.
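For PNG and GIF this takes only a few lines, because both formats store the dimensions at fixed offsets. The sketch below is one way to do it; the function name image_size is made up for illustration. JPEG is deliberately left out, since its dimensions live in a start-of-frame segment that has to be found by scanning the markers.

```python
import struct


def image_size(header):
    """Return (width, height) from the first 24 bytes of a PNG or GIF."""
    # PNG: 8-byte signature, then the IHDR chunk (4-byte length, 4-byte
    # type), whose first two fields are big-endian width and height.
    if header[:8] == b"\x89PNG\r\n\x1a\n" and header[12:16] == b"IHDR":
        return struct.unpack(">II", header[16:24])
    # GIF: 6-byte signature, then little-endian 16-bit width and height.
    if header[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", header[6:10])
    raise ValueError("unsupported or unrecognised image format")
```

Reading just the start of the file is enough, for example image_size(open(path, 'rb').read(24)).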
Answer: No, there is no simpler way than using an external library.
If you only care about a single file format, then yes, it's easy to implement something specific for that. But if you want to be generic, you need to support a lot of file formats, and then you don't want to do all that work yourself.
To simplify install of PIL, you might look at Pillow, a friendly fork that makes PIL easy_installable.
See ImageMagick, a fantastic library for dealing with bitmap images. The identify tool from the command line suite will do what you want. There are also a few Python interfaces.
I'm trying to use a Python script called deepzoom.py to convert large overhead renders (often over 1GP) to the Deep Zoom image format (i.e., a Google Maps-esque tile format), but unfortunately it's powered by PIL, which usually ends up crashing due to memory limitations. The creator has said he's delving into VIPS, but even nip2 (the GUI frontend for VIPS) fails to open the image. In another question on the same topic, someone suggested OpenImageIO, which looks like it has the ability and has Python wrappers, but there aren't any proper binaries provided, and trying to compile it on Windows is a nightmare.
Are there any alternative libraries for Python I can use? I've tried PythonMagickWand (wrapper for ImageMagick) and PythonMagick (wrapper for GraphicsMagick), but both of those also run into memory problems.
I had a very similar problem and ended up solving it with netpbm, which works fine on Windows. Netpbm had no problem converting huge .png files, then slicing, cropping, and re-combining them (using pamcrop, pamdice, and pamundice), and converting back to .png without using much memory at all. I just included the necessary netpbm binaries and DLLs with my application and called them from Python.
It sounds like you're trying to use georeferenced imagery or something similar, for which a GIS solution sounds more appropriate. I'd use GDAL -- it's an excellent library and comes with easy-to-use Python bindings via Swig.
On Windows, the easiest way to install it is via Frank Warmerdam's FWTools package.
I'm able to use pyvips to read images with size (50000, 50000, 3):
import numpy as np
import pyvips

img = pyvips.Image.new_from_file('xxx.jpg')
arr = np.ndarray(buffer=img.write_to_memory(),
                 dtype=np.uint8,
                 shape=[img.height, img.width, img.bands])
Is a partial load useful? If you use PIL and the image format is .BMP, you can open() an image file (which doesn't load it), then do a crop(), and then load, which will only actually load the part of the image you selected with crop. This will probably also work with TGA, maybe even with JPG, and less efficiently with PNG and other formats.
libvips comes with a very fast DeepZoom creator that can work with images of any size. Try:
$ vips dzsave huge.tif mydz
This will write the tiles to mydz_files and also write a mydz.dzi info file for you. It's typically 10x faster than deepzoom.py and has no size limit.
See this chapter in the manual for an introduction to dzsave.
You can do the same thing from Python using pyvips like this:
import pyvips
my_image = pyvips.Image.new_from_file("huge.tif", access="sequential")
my_image.dzsave("mydz")
The access="sequential" tells pyvips it can stream the image rather than having to read the whole thing into memory.