I am trying to create my own Python package, and hopefully one day other people will be able to contribute their own content to it. For the package to work, it must be possible to ship small data files that get installed as part of the package. It turns out that loading data files that are part of a Python package is not easy. I managed to load very basic ASCII files using something like this:
import pkgutil
import re
import numpy as np

# Read the raw bytes that ship with the package, then pull the numbers out of them.
data = pkgutil.get_data(__name__, "data/test.txt")
data = data.decode("utf-8")
rr = re.findall(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?", data)
datanp = np.zeros(len(rr))
for k in range(len(rr)):
    datanp[k] = float(rr[k])
I found many comments online saying that you should not use commands like np.load(path_to_package), because on some systems packages might actually be stored in zip files or something similar. That is why I am using pkgutil.get_data(); it is apparently the more robust approach. Here, for example, they talk at great length about different ways to safely load data, but not so much about how you would actually load different data types.
My question: Are there ways to load .npy files from inside a Python package?
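Something along the following lines is what I am hoping is possible. This is only a sketch that relies on np.load accepting a file-like object; the path data/test.npy is a placeholder:

import io
import pkgutil
import numpy as np

# pkgutil.get_data returns raw bytes even when the package is installed
# inside a zip archive; wrapping them in BytesIO gives np.load the
# file-like object it expects.
raw = pkgutil.get_data(__name__, "data/test.npy")  # placeholder path
arr = np.load(io.BytesIO(raw))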
Related
I've written a web scraper in Python and I have a ton (thousands) of files that are extremely similar, but not quite identical. The disk space currently used by the files is 1.8 GB, but if I compress them into a tar.xz, they compress down to 14.4 MB. I want to be closer to that 14.4 MB than to the 1.8 GB.
Here are some things I've considered:
I could just use tarfile from Python's standard library and store the files there. The problem with that is I wouldn't be able to modify files within the tar without recompressing all of the files, which would take a while.
I could use difflib from Python's standard library, but I've found that it doesn't offer any way of applying "patches" to recreate the original files.
I could use Google's diff-match-patch Python library, but its documentation says "Attempting to feed HTML, XML or some other structured content through a fuzzy match or patch may result in problems." Since I want to use this library precisely to store HTML files more efficiently, that doesn't sound like it will help me.
So is there a way of saving disk space when storing a large amount of similar HTML files?
You can use a dictionary.
Python's zlib interface supports dictionaries. The compressobj and decompressobj functions both take an optional zdict argument, which is a "dictionary". A dictionary in this case is nothing more than up to 32K of data containing sequences of bytes that you expect to appear in the data you are compressing.
Since your files are about 30K each, this works out quite well for your application. If your files really are "extremely similar", then you can take one of them and use it as the dictionary for compressing all of the others.
Try it, and measure the improvement in compression over not using a dictionary.
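For example, a minimal sketch (Python 3.3+ for the zdict argument; the file names are placeholders, and one representative scraped page stands in for the dictionary):

import zlib

# zlib uses at most the last 32 KB of the dictionary, so one typical page is plenty.
with open("representative.html", "rb") as f:
    dictionary = f.read()[-32768:]

def compress(data):
    c = zlib.compressobj(level=9, zdict=dictionary)
    return c.compress(data) + c.flush()

def decompress(blob):
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()

with open("page0001.html", "rb") as f:
    raw = f.read()
packed = compress(raw)
assert decompress(packed) == raw
print(len(raw), "->", len(packed), "bytes")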
Is there any way of doing parallel I/O for NetCDF files in Python?
I understand that there is a project called PyPnetcdf, but apparently it's old, unmaintained, and doesn't seem to work at all. Has anyone had any success with parallel I/O for NetCDF in Python at all?
Any help is greatly appreciated
It's too bad PyPnetcdf is not a bit more mature. I see hard-coded paths and abandoned domain names. It doesn't look like it will take a lot to get something compiled, but then there's the issue of getting it to actually work...
In setup.py you should change library_dirs_list and include_dirs_list to point to the places on your system where Northwestern/Argonne Parallel-NetCDF is installed and where your MPI distribution is installed.
Then one will have to go through and update the way PyPnetcdf calls pnetcdf. A few years back (quite a few, actually) we promoted a lot of types to larger versions.
I haven't seen good examples from either of the two Python NetCDF modules; see https://github.com/Unidata/netcdf4-python/issues/345
However, if you only need to read files and they are in NetCDF4 format, you should be able to use HDF5 directly -- http://docs.h5py.org/en/latest/mpi.html --
because NetCDF4 is basically HDF5 with a restricted data model. This probably won't work with NetCDF3.
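If it helps, here is a minimal sketch of reading a NetCDF4 variable through h5py's MPI driver. It assumes h5py was built against a parallel HDF5, and the file name data.nc and variable name temperature are placeholders:

# Run with something like: mpiexec -n 4 python read_parallel.py
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD

# A NetCDF4 file is an HDF5 file underneath, so h5py can open it directly.
with h5py.File("data.nc", "r", driver="mpio", comm=comm) as f:
    var = f["temperature"]                 # placeholder variable name
    chunk = var.shape[0] // comm.size      # split along the first dimension
    start = comm.rank * chunk
    local = var[start:start + chunk]       # each rank reads its own slice
    print(comm.rank, local.shape)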
I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that I can open in Excel and that contains the information from the feature class's attribute table.
However, when I try to open a .shp file in any program (Excel, TextPad, etc.), all I get is a bunch of gibberish and unusual ASCII characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any third-party modules or non-built-in tools, as I am genuinely interested in how this process works and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!
Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use python to read such a file?"
The answer is: this link contains a description of the file format. Write a dissector based on that technical specification.
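To give a flavour of what such a dissector looks like, here is a sketch that reads just the 100-byte file header with struct. The field offsets follow my reading of the ESRI "Shapefile Technical Description", and the file name is a placeholder:

import struct

with open("roads.shp", "rb") as f:   # placeholder file name
    header = f.read(100)

# The header mixes big-endian and little-endian fields.
file_code, = struct.unpack(">i", header[0:4])      # should be 9994
file_length, = struct.unpack(">i", header[24:28])  # total length in 16-bit words
version, shape_type = struct.unpack("<ii", header[28:36])
xmin, ymin, xmax, ymax = struct.unpack("<4d", header[36:68])

print("file code %d, version %d, shape type %d" % (file_code, version, shape_type))
print("bounding box: %f %f %f %f" % (xmin, ymin, xmax, ymax))

The records that follow the header (for a polyline: the parts and points arrays) are unpacked the same way, offset by offset, according to the spec.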
If you don't want to go to all the trouble of writing a parser, you should take a look at pyshp, a pure Python shapefile library. I've been using it for a couple of months now, and have found it quite easy to use.
There's also a Python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.
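For example, pulling the vertices of the first polyline with pyshp looks roughly like this (the base name roads is a placeholder; pyshp picks up the matching .shp/.shx/.dbf files):

import shapefile  # the pyshp module

sf = shapefile.Reader("roads")   # placeholder base name of the shapefile
first = sf.shape(0)              # first record in the feature class
print(first.shapeType)           # 3 means polyline
print(first.points)              # list of (x, y) vertex pairs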
Might be a long shot, but you could check out ctypes and use a .dll that ships with a program that can read this type of file (if such a .dll even exists). In my experience, though, things get weird when you start digging around in .dlls.
We have a project in python with django.
We need to generate complex word, excel and pdf files.
For our other projects, which were done in PHP, we used PHPExcel,
PHPWord, and TCPDF for PDF.
What Python libraries would you recommend for creating these kinds of files? (For Excel and Word it's important to use the Open XML file formats, xlsx and docx.)
Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly developed tools for manipulating Word documents. I've found the Java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is efficiently done via FOP, I also use docx4j.
To integrate this with my Python code, I use the Spark framework to wrap it up as a simple web service, and use requests on the Python side to talk to the service.
For Excel, there's openpyxl, which, as far as I know, started as a Python port of PHPExcel. I haven't used it yet, but it sounds OK to me.
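A minimal sketch of what that would look like (untested on my side, so treat it as an illustration only; the sheet contents and output path are made up):

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Report"
ws.append(["name", "amount"])   # header row
ws.append(["Alice", 42])
ws.append(["Bob", 17])
wb.save("report.xlsx")          # placeholder output path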
I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output formats. Included in the package are HTML, LaTeX and .odf writers, but the sandbox contains a whole load of other writers for other formats; see, for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc, a Haskell library which supports a much wider range of output and input formats than Docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. You can try Pandoc here.
I have never used any libraries for this, but you can change the extension of any docx or xlsx file to .zip and see the magic!
Generating Open XML files is as simple as generating a couple of XML files (you can use templates) and zipping them up.
The simplest way to generate a PDF is to generate HTML (with CSS and images) and convert it using the wkhtmltopdf tool.
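As a concrete illustration of the template idea, here is a sketch that fills in a marker in an existing hand-made template; template.docx, report.docx and the {{CUSTOMER}} marker are all assumptions for the example:

import zipfile

# A .docx is just a zip archive; the main body lives in word/document.xml.
with zipfile.ZipFile("template.docx") as src, \
     zipfile.ZipFile("report.docx", "w", zipfile.ZIP_DEFLATED) as dst:
    for item in src.infolist():
        data = src.read(item.filename)
        if item.filename == "word/document.xml":
            data = data.replace(b"{{CUSTOMER}}", b"ACME Corp.")
        dst.writestr(item, data)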
What would be a neat way to share configuration parameters/settings/constants between various projects in Python?
Using a DB seems like overkill. Using a file raises the question of which project should host the file in its source control...
I'm open for suggestions :)
UPDATE:
For clarification: assume the various projects are deployed differently on different systems. In some cases they live in different directories; in other cases some of the projects are present and some are not.
I find that in many cases, using a configuration file is really worth the (minor) hassle. The built-in ConfigParser module is very handy, especially because it's really easy to parse multiple files and let the module merge them together, with values from files parsed later overriding values from files parsed earlier. This allows for easy use of a global file (e.g. /etc/yoursoftware/main.ini) and a per-user file (~/.yoursoftware/main.ini).
Each of your projects would then open the config file and read values from it.
Here's a small code example:
basefile.ini:
[sect1]
value1=1
value2=2
overridingfile.ini:
[sect1]
value1=3
configread.py:
#!/usr/bin/env python
from ConfigParser import ConfigParser
config = ConfigParser()
config.read(["basefile.ini", "overridingfile.ini"])
print config.get("sect1", "value1")
print config.get("sect1", "value2")
Running this would print out:
3
2
Why don't you just have a file named constants.py that simply contains CONSTANT = value?
Create a Python package and import it in the various projects...
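A minimal sketch of that idea, with placeholder names (sharedconfig is assumed to be installed once and imported by every project):

# sharedconfig/__init__.py  (installed once, imported everywhere)
API_URL = "https://example.invalid/api"   # placeholder values
TIMEOUT_SECONDS = 30

# in any of the projects:
from sharedconfig import API_URL, TIMEOUT_SECONDS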
Why is a database overkill? You're describing sharing data across different projects located on different physical systems, with different paths to each project's directory. Oh, and sometimes the projects just aren't there. I can't imagine a better means of communicating the data. It only has to be a single table; that's hardly overkill if it provides the consistent access you need across platforms, computers, and even networks.