I recently acquired a ton of data stored in Visual FoxPro 9.0 databases. The text I need is in Cyrillic (Russian), but of the 1000 .dbf files (complete with .fpt and .cdx files), only 4 or 5 return readable text. The rest (usually the memo fields) return something like this:
??9Y?u?
yL??x??itZ?????zv?|7?g?̚?繠X6?~u?ꢴe}
?aL1? Ş6U?|wL(Wz???8???7?#R?
.FAc?TY?H???#f U???K???F&?w3A??hEڅԦX?MiOK?,?AZ&GtT??u??r:?q???%,NCGo0??H?5d??]?????O{??
z|??\??pq?ݑ?,??om???K*???lb?5?D?J+z!??
?G>j=???N ?H?jѺAs`c?HK\i
??9a*q??
For the life of me, I can't figure out how this is encoded. I have tried all kinds of online decoders, opened the .dbfs in many database programs, and used Python to open and manipulate them. All of them return the same kind of mess as above, but never readable Russian.
Note: I know that these databases are not corrupt, because they came accompanied by enterprise software that can open, query and read them successfully. However, that software will not export the data, so I am left working directly with the .dbfs.
I'm happy to share an example .dbf if it would help get to the bottom of this.
If it is a FoxPro database, I would expect the Russian text to be stored in one of the pre-Unicode encodings that were used for most Eastern European languages in the old days.
For example: Windows-1251 or ISO 8859-5.
'?' characters don't convey much. Try looking at the contents of the memo fields as hex, and see whether what you're seeing looks anything like text in any encodings. (Apologies if you've tried this using Python already). Of course if it is actually encrypted you may be out of luck unless you can find out the key and method.
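To make that hex-plus-candidate-encodings check concrete, here is a minimal Python sketch. The codec list is just the usual suspects for Russian, not anything confirmed about these files; it's also worth noting that the DBF header stores a "language driver" byte at offset 29 that hints at the intended codepage. The sample bytes below are a stand-in, since we don't have the real memo contents:

```python
import binascii

# Candidate single-byte Russian encodings from the pre-Unicode era.
CANDIDATES = ["cp1251", "cp866", "koi8-r", "iso8859_5", "mac_cyrillic"]

def decode_candidates(raw):
    """Hex-dump a chunk of memo bytes, then return how each candidate
    Russian encoding would render it, so you can eyeball which one
    produces readable text."""
    print(binascii.hexlify(raw[:32]).decode())
    return {enc: raw.decode(enc, errors="replace") for enc in CANDIDATES}

# Stand-in for real memo bytes, e.g. open("sample.fpt", "rb").read()
sample = "Привет, мир".encode("cp866")
for enc, text in decode_candidates(sample).items():
    print(enc, "->", text)
```

If none of the candidates produces anything readable at any offset, that strengthens the encryption hypothesis from the answer above.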
There are two possibilities:
the encoding has not been correctly stored in the dbf file
the dbf file has been encrypted
If it's been encrypted I can't help you. If it's a matter of finding the correct encoding, my dbf package may be of use. Feel free to send me a sample dbf file if you get stuck.
Related
I have an old Delphi program exporting electrical measurement data from a device as .log files, as a database dump. I need to convert them to .csv or any other easily accessible data format.
As I have no access to the source code, my options for finding the actual database format are limited. I do know the author was using UniDac, and that the password for database access is "admin".
When I open the .log files in Python's binary-read mode, they don't make much sense, as I don't know the correct encoding. Here is an example line:
b'\xa0=\xb1\xfc\xaa<\xb0\xd6\x8e=\xbbr\xd7<\xde\x92\x1a=\x82\xab"=|5%>o\xa4\xbe=\xce:\xe3>\xdf\xb9\xe2=~\x14\x8b=\xbb\xac(=-\x8bE=g\r\x03=\xeb\xc9-=\xcb\x81\xbf</(\x86=\xb8\x07\xad<\xaf\xd0)=\x94\x18\xa6<j#\x05=N\xd0\xa7<\xfe\x92\xbd=\xf0I\xfa<\x9dl\xec<\x08t\xba<9(\xf5<R}\xc8<E\xecI=?\x17\xaa=\xd9\xf4\x01=\xeet\x92=}\xb8\x82<>D\xe1<\x83w\x99<\xab\xd7\xe9<\x92\n'
As you can see, it is a mixture of hex escape codes and ASCII characters, sometimes even containing comprehensible text headers.
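Before assuming a database container at all, it may be worth testing a simpler hypothesis: the example bytes look a lot like packed binary numbers rather than encoded text. Notice how `<` (0x3c), `=` (0x3d), and `>` (0x3e) recur roughly every four bytes; that is exactly what the high byte of small positive little-endian float32 values looks like, which would fit "electrical measurement data". A quick sketch to test that guess (the offsets and the float interpretation are assumptions, not a documented UniDac format):

```python
import struct

def try_float32(raw, max_offset=4):
    """Interpret a byte string as packed little-endian float32 at each
    small alignment offset; real measurement data tends to show up as a
    run of consistent, small-magnitude values at the right offset."""
    results = {}
    for off in range(max_offset):
        n = (len(raw) - off) // 4
        results[off] = struct.unpack_from("<%df" % n, raw, off)
    return results

# First 30 bytes of the example line from the question.
line = (b'\xa0=\xb1\xfc\xaa<\xb0\xd6\x8e=\xbbr\xd7<\xde\x92\x1a='
        b'\x82\xab"=|5%>o\xa4\xbe=')
for off, vals in try_float32(line).items():
    print(off, [round(v, 4) for v in vals[:6]])
```

If one offset yields a plausible, consistent run of values while the others give wild magnitudes, the file is likely raw sensor records with a small header, and you won't need a database driver at all.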
I have also tried opening the database with pyodbc:
import pyodbc as db
conn = db.connect('DRIVER={SQL Server};DBQ={link_to_.log};UID=admin;PWD=admin')
But as I am not sure of the UniDac database format, this doesn't work.
I am not used to working with databases, so even basic information is welcome. Which formats do I need to try with pyodbc when I want to open UniDac databases? Do I need to find the correct encoding first? Are there other ways to make the .log files accessible?
I can upload an example .log if someone is interested in this problem.
I tried to use the escpos package, but it doesn't print my language's special characters; even with the encodings ISO-8859-2 and Windows-1250, the same behavior appears with the Python version.
I have already printed a PDF, and the printer is able to print those special characters, so it seems to be a package problem. Since I cannot solve the problem with escpos, I decided to generate a PDF and then print it, but I am not sure what tools to use. I saw a PDF kit, but some people advise using XML.
I am looking for some advice, or maybe you know how to deal with escpos.
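One thing worth checking before abandoning escpos: thermal printers generally don't understand ISO-8859-2 at all. They work with legacy DOS code pages selected via the `ESC t n` command, and for Central European scripts that is typically CP852. A minimal raw-bytes sketch follows; the code page number 18 is the common Epson mapping for PC852, but it is printer-specific, so check your printer's manual:

```python
# Select a printer code page with ESC t n, then send text encoded with the
# matching Python codec. 18 -> PC852 on many Epson-compatible printers
# (assumption: the exact number varies by model).
CODEPAGE_PC852 = 18

def escpos_bytes(text, codepage_num=CODEPAGE_PC852, codec="cp852"):
    """Build raw ESC/POS bytes for text containing Latin-2 diacritics."""
    return b"\x1bt" + bytes([codepage_num]) + text.encode(codec) + b"\n"

data = escpos_bytes("zażółć gęślą jaźń")  # Polish sample with diacritics
# write `data` straight to the printer device, USB endpoint, or socket
```

The key point is that the code page selected on the printer and the codec used in Python must match; mixing ISO-8859-2 text with a CP437 printer state is what produces the wrong glyphs.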
Good afternoon,
I am looking for a bit of insight into working with KeyNote files (~2017 ver 8.x).
My objective is fairly basic. I just want to extract the text and images from about 3000 KeyNote files. I am working in Python 2.7 due to the age of many of the tools, but I would like to upgrade to 3.x eventually. Despite a lot of reading and experimenting, I seem to have hit a wall extracting messages from the IWA objects.
I have been experimenting with various approaches and have also been trying to deconstruct the IWA files by hand using the protobuf encoding documentation. However, something just does not add up. Testing with messages created using the protobuf sample code, I can deconstruct 100% of them, but .IWA blocks from KeyNote files end up with invalid wire types, repeated field numbers, or field sizes that don't make sense (e.g. larger than the size of the IWA block).
What I think I know:
1/ The .key files are a grouping of objects that are zipped and can be unzipped using a generic module like zipfile.
Once unzipped, the .key file can be separated, giving access to the /Index branch and constituent IWA objects.
2/ The IWA files have a 4 byte little endian header, and the rest should follow the google protobuf encoding.
3/ The protobuf encoding does hold for some aspects of the IWA files, e.g. recognized blocks of text have the correct tags. However, other parts of the IWA do not seem to follow the rules, resulting in either invalid wire-type codes (e.g. wire-type = 6) or field numbers that are zero or reused.
What I would appreciate is if:
A/ Someone could confirm that the KeyNote encoding does comply with the Google protobuf encoding, or point me at a valid encoding scheme that I can use.
B/ Someone could clarify whether the IWA objects are individually compressed in addition to the compression applied to the whole .key file. The documentation is unclear, but my attempts to further decompress the IWA objects were not successful.
C/ Someone could direct me to a functional Python library that can extract data from KeyNote files.
As much as I am having fun playing with file deconstruction at the byte and bit level, I still have an objective to achieve :-)
Thank you.
Rusty
Any insights gratefully accepted
I know this is a relatively old question, but I came across it and would offer up some information.
The page
https://github.com/obriensp/iWorkFileFormat/blob/master/Docs/index.md#iwa
seems to have a lot of info on the format. In particular, it seems (from what I gather from that page) that the IWA does not follow the ProtoBuf encoding exactly, which is probably the cause of your problems with invalid wire types and nonsensical field lengths.
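To add one concrete detail from that write-up: the payload inside each IWA is Snappy-compressed, and each segment is prefixed with a 4-byte header (one type byte, usually 0x00, plus a 3-byte little-endian payload length), without the standard Snappy stream identifier. That extra compression layer would explain why raw protobuf parsing of the file bytes yields impossible wire types. A sketch of the chunk walk, assuming that header layout (decompressing the payload would additionally need a Snappy implementation, e.g. the third-party python-snappy package):

```python
import struct

def iter_iwa_chunks(data):
    """Walk the segments of an .iwa file: each has a 4-byte header whose
    low byte is a type (usually 0x00 = Snappy-compressed payload) and
    whose high three bytes are the little-endian payload length."""
    pos = 0
    while pos + 4 <= len(data):
        header, = struct.unpack_from("<I", data, pos)
        chunk_type = header & 0xFF
        length = header >> 8
        yield chunk_type, data[pos + 4: pos + 4 + length]
        pos += 4 + length

# Synthetic example: one chunk of type 0 carrying a 5-byte payload.
blob = struct.pack("<I", (5 << 8) | 0) + b"hello"
print(list(iter_iwa_chunks(blob)))  # -> [(0, b'hello')]
```

Only after Snappy-decompressing each chunk's payload should the bytes parse cleanly as protobuf messages.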
I'm looking to generate PDFs from a Python application.
They start relatively simple, but some may become more complex (essentially letter-like documents, but they will include watermarks, for example, later on).
I've worked in raw PostScript before, and provided I can generate the correct headers etc. and a valid file at the end of it, I want to avoid complex libraries that may not do entirely what I want. Some seem to have suffered bitrot and are no longer supported (pypdf and pypdf2), especially when I know PDF/PostScript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But inspecting PDFs, there is a little binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this, as the production environment is a Windows 2008 server (dev is Ubuntu 12.04), and generating something only to convert it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?
borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file, to a line that says "%%EOF". That's always the last thing in the file; and if it's there, you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number is the byte offset of the cross-reference ("xref") table, a list describing where each object sits in the file. Those objects can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
but really you should probably just use a library
Thanks to @LukasGraf for providing this link http://www.gnupdf.org/Introduction_to_PDF, which shows how to create a simple hello-world PDF from scratch.
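For completeness, here is a minimal from-scratch generator in Python along the lines of that link. It demonstrates the part described above as the hardest: recording each object's byte offset as it is written, then emitting the xref table and the `startxref` pointer. The layout is a plain, uncompressed PDF 1.4; the particular objects and names are just one reasonable choice, not the only valid structure:

```python
def minimal_pdf(text="Hello, world!"):
    """Write a one-page PDF by hand: emit each object, remember its byte
    offset, then emit the xref table and trailer pointing back at them.
    (text must not contain unescaped parentheses or backslashes.)"""
    stream = ("BT /F1 24 Tf 72 720 Td (%s) Tj ET" % text).encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(stream), stream),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))               # byte offset of "num 0 obj"
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
    for off in offsets:
        out += b"%010d 00000 n \n" % off       # each xref entry is exactly 20 bytes
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)

with open("hello.pdf", "wb") as fh:
    fh.write(minimal_pdf())
```

The fixed-width xref entries are the reason this is feasible by hand: every offset is zero-padded to ten digits, so the table can be emitted in one pass once the objects are written.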
As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.
I recommend you use a library. I spent a lot of time creating pdfme and learned a lot along the way, but it's not something you would do for a single project. If you want to use my library, check the docs here.
I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that opens in Excel and contains the information from the feature class's table.
However, when I try to open a .shp file in any program (Excel, TextPad, etc.), all I get is gibberish and unusual ASCII characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any modules or non built-in tools, as I am genuinely interested in how this process would work and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!
Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use python to read such a file?"
The answer is, this link contains a description of the format of the file. Write a dissector based on the technical specification.
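As a starting point for such a dissector, here is a stdlib-only sketch for the polyline case, following the public shapefile spec. The main gotcha is mixed endianness: the file header and record headers are big-endian, while the shape data itself is little-endian. This sketch handles only shape type 3 (polyline) and skips other record types:

```python
import struct

def read_polyline_shp(path):
    """Parse an ESRI .shp containing polylines (shape type 3).
    Returns a list of shapes; each shape is a list of parts; each part
    is a list of (x, y) tuples."""
    with open(path, "rb") as f:
        header = f.read(100)                                # fixed 100-byte header
        file_code, = struct.unpack(">i", header[0:4])       # big-endian, always 9994
        file_length_words, = struct.unpack(">i", header[24:28])  # length in 16-bit words
        shape_type, = struct.unpack("<i", header[32:36])    # little-endian
        shapes = []
        end = file_length_words * 2                         # total file size in bytes
        while f.tell() < end:
            rec_num, content_words = struct.unpack(">ii", f.read(8))
            content = f.read(content_words * 2)
            rtype, = struct.unpack("<i", content[0:4])
            if rtype != 3:                                  # skip null shapes etc.
                continue
            # bytes 4..36 are the bounding box (4 doubles), then the counts
            num_parts, num_points = struct.unpack("<ii", content[36:44])
            parts = struct.unpack("<%di" % num_parts, content[44:44 + 4 * num_parts])
            pts_off = 44 + 4 * num_parts
            points = [struct.unpack("<dd", content[pts_off + 16 * i: pts_off + 16 * i + 16])
                      for i in range(num_points)]
            # split the flat point list into parts using the part-start indices
            shapes.append([points[s:e] for s, e in zip(parts, parts[1:] + (num_points,))])
    return shapes
```

Everything here comes straight from the specification linked above; for production use you would also want to validate `file_code == 9994` and handle the other shape types.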
If you don't want to go to all the trouble of writing a parser, you should take a look at pyshp, a pure-Python shapefile library. I've been using it for a couple of months now and have found it quite easy to use.
There's also a python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.
Might be a long shot, but you should check out ctypes and maybe use the .dll file that came with a program (if one even exists) that can read that type of file. In my experience, things get weird when you start digging around in .dlls.