I am writing a function that is supposed to store a text representation of a custom class object, cl.
I have some code that writes to a file and takes the necessary information out of cl.
Now I need to go backwards: read the file and return a new instance of cl. The problem is that the file doesn't keep all of the important parts of cl, because some of them are unnecessary for the purposes of the text document.
A .jpg file allows you to store meta data like shutter speed and location. I would like to store the parts of cl that are not supposed to be in the text portion in the meta data of a .txt or .csv file. Is there a way to explicitly write something to the metadata of a text file in Python?
Additionally, would it be possible to write the pickled (.pkl) byte representation of the entire object in the metadata?
Text files don't have metadata in the same way that a .jpg file does. A JPEG file is specifically designed to include metadata as extra structured information alongside the image. Text files aren't: every character in a text file is generally displayed to the user.
Similarly, everything in a CSV file is part of one cell in the table represented by the file.
That said, there are some things similar to text-file metadata that have existed over the years and might give you some ideas. I don't think any of these is ideal, but the examples should give you an idea of how complex the area of metadata is and what people have done in similar situations.
Some filesystems have extensible metadata associated with each file. As an example, NTFS has alternate data streams; HFS and HFS+ have resource forks and other attributes; most Linux filesystems have extended attributes. You could potentially store your pickle information in that filesystem metadata. There are disadvantages: some filesystems don't have this metadata, and some tools for copying and manipulating files will not preserve it (or will intentionally strip it).
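For example, on Linux you could attach the pickle to the text file as an extended attribute. This is only a sketch: it assumes a filesystem and tooling that preserve xattrs, and the attribute name user.cl_pickle is made up.

```python
import os
import pickle

def save_with_xattr(path, text, obj):
    # Write the visible text normally...
    with open(path, 'w') as f:
        f.write(text)
    # ...then attach the pickled object as an extended attribute
    # (Linux-only; "user.cl_pickle" is an arbitrary attribute name).
    os.setxattr(path, 'user.cl_pickle', pickle.dumps(obj))

def load_with_xattr(path):
    with open(path) as f:
        text = f.read()
    obj = pickle.loads(os.getxattr(path, 'user.cl_pickle'))
    return text, obj
```

Note that extended attribute values are size-limited on many filesystems, so this only works for reasonably small pickles.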
You could have a .txt file and a .pkl file, where the .txt file contains your text representation and the .pkl file contains the other information.
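A minimal sketch of that two-file approach (the base file name is just an example):

```python
import pickle

def save(basename, text, obj):
    # Human-readable text goes in the .txt file...
    with open(basename + '.txt', 'w') as f:
        f.write(text)
    # ...and everything needed to rebuild the object goes in the .pkl file.
    with open(basename + '.pkl', 'wb') as f:
        pickle.dump(obj, f)

def load(basename):
    with open(basename + '.txt') as f:
        text = f.read()
    with open(basename + '.pkl', 'rb') as f:
        obj = pickle.load(f)
    return text, obj
```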
Back in the day, some DOS programs would stop reading a text file at a DOS EOF character (decimal 26). I don't think anything still behaves like that, but it's an example of a convention that let you end the visible file and still have extra data afterwards that programs could use.
With a format like HTML, or an actual spreadsheet format instead of CSV, there are ways you could include things as metadata easily.
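As a rough sketch of the HTML idea, you could keep the visible text in the body and stash a base64-encoded pickle of the object in a <meta> tag (the tag name cl-pickle is made up):

```python
import base64
import pickle

def save_html(path, text, obj):
    # Base64 keeps the pickle safe inside an HTML attribute value.
    payload = base64.b64encode(pickle.dumps(obj)).decode('ascii')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<html><head>\n')
        f.write('<meta name="cl-pickle" content="{}">\n'.format(payload))
        f.write('</head><body><pre>\n')
        f.write(text)
        f.write('\n</pre></body></html>\n')
```

Reading it back means parsing the <meta> tag (for example with html.parser) and reversing the base64 and pickle steps.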
Related
Essentially, I want to be able to go through a folder containing text files, jpg files, csv files, png files, any kind of file, and load each one into memory as some kind of object. When necessary, I would then like to be able to save it and create an instance of it on disk. This would need to work for any file type.
I would create a class that would contain the file data itself as well as metadata, but that is not necessary for my question.
Is this possible, and if so, how can I do this?
I have a compressed file in a Google Cloud Storage bucket. This file contains a big CSV file and a small XML-based metadata file. I would like to extract both files, read the metadata, and process the CSV file. I am using the Python SDK, and the pipeline will run on Google Dataflow at some point.
The current solution uses Google Cloud Functions to extract both files and start the pipeline with the parameters parsed from the XML file.
I would like to eliminate the Cloud Function and process the compressed file in Apache Beam itself. The pipeline should process the XML file first and then process the CSV file.
However, I am stuck at extracting the two files into separate collections. I would like to understand whether my approach is flawed, or, if not, to see an example of how to deal with different files inside a single compressed file.
In my understanding, this is not achievable with any existing text IO in Beam.
The problem with your design is that you are enforcing a dependency on the file reading order (the metadata XML must be read before the CSV file is processed) and a logic for interpreting the CSV. Neither is supported by any concrete text IO.
If you do want this flexibility, I would suggest that you take a look at vcfio, and perhaps write your own reader that inherits from filebasedsource.FileBasedSource. The implementation of vcfio is similar to your case in that a VCF-formatted file always has a header that explains how to interpret the CSV-like part that follows.
Actually, if you can somehow rewrite your XML metadata and add it as a header to the CSV file, you can probably use vcfio instead.
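A rough sketch of the custom-reader route, assuming the compressed file is a zip archive whose members are named metadata.xml and data.csv (ArchiveSource and the member names are made up; adapt them to your archive):

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

from apache_beam.io import filebasedsource
from apache_beam.io.filesystem import CompressionTypes


class ArchiveSource(filebasedsource.FileBasedSource):
    def __init__(self, file_pattern):
        # UNCOMPRESSED: hand back the raw bytes and let zipfile do the work.
        # splittable=False: the whole archive must be read so the XML
        # metadata is available before any CSV row is interpreted.
        super().__init__(file_pattern,
                         compression_type=CompressionTypes.UNCOMPRESSED,
                         splittable=False)

    def read_records(self, file_name, range_tracker):
        with self.open_file(file_name) as f:
            archive = zipfile.ZipFile(io.BytesIO(f.read()))
            metadata = ET.fromstring(archive.read('metadata.xml'))
            with archive.open('data.csv') as member:
                reader = csv.reader(io.TextIOWrapper(member, 'utf-8'))
                for row in reader:
                    # Pair each row with the metadata so downstream
                    # transforms know how to interpret the columns.
                    yield metadata.attrib, row
```

The source would then be consumed with beam.io.Read(ArchiveSource('gs://bucket/archive.zip')).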
I am trying to add a cover page to a CSV file in Python, which would display general information such as the date and a name. My program currently exports MySQL data to a CSV file in Python. When I open the CSV, it opens as an Excel spreadsheet. I am trying to add a cover page to this Excel file, which is in the CSV format. Could you give me some ideas as to how I could go about doing this?
You can't add a cover page to a CSV file.
CSV is short for "Comma-separated values". It is defined to just be values separated by commas and nothing else. Wikipedia states that:
RFC 4180 proposes a specification for the CSV format, and this is the definition commonly used. However, in popular usage "CSV" is not a single, well-defined format. As a result, in practice the term "CSV" might refer to any file that:
is plain text using a character set such as ASCII, various Unicode character sets (e.g. UTF-8), EBCDIC, or Shift JIS,
consists of records (typically one record per line),
with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
where every record has the same sequence of fields.
This last assertion is important for any application that wants to read the file. How would an application deal with weird, unexpected data in some proprietary format?
You can, however, invent your own proprietary format, which only you know how to read. This could include data for a cover (as an image, PDF, LaTeX or something else) and the data for your CSV. But this would be quite an undertaking, and there are a million ways to approach it. How to implement such a thing is beyond the scope of Stack Overflow. Try breaking down your question.
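If you do go down that road, here is one minimal sketch of the idea (a zip archive holding a cover file and the CSV; the file names and layout are entirely made up):

```python
import zipfile

def write_bundle(path, cover_text, csv_text):
    # Only code that knows this layout can read the file back;
    # Excel will no longer open it as a plain CSV.
    with zipfile.ZipFile(path, 'w') as z:
        z.writestr('cover.txt', cover_text)  # cover page (could be a PDF or image instead)
        z.writestr('data.csv', csv_text)     # the actual CSV payload

def read_bundle(path):
    with zipfile.ZipFile(path) as z:
        return z.read('cover.txt').decode(), z.read('data.csv').decode()

write_bundle('report.bundle', 'Sales report\nDate: 2020-01-01\n', 'a,b\n1,2\n')
cover, data = read_bundle('report.bundle')
```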
Can TensorFlow read a file containing normal images, for example in JPG, ..., or does TensorFlow only read .bin files containing images?
What is the difference between a .mat file and a .bin file?
Also, when I rename a .bin file to .mat, does the data in the file change?
Sorry, maybe my language is not clear because I cannot speak English very well.
A file-name suffix is just a suffix (which sometimes helps to get information about that file; e.g. Windows decides which tool is called when the file is double-clicked). A suffix does not need to be correct. And of course, changing the suffix will not change the content.
Every format needs its own decoder: JPG, PNG, MAT and co.
To some extent, these are chosen automatically by reading out metadata (under some assumptions!). Many image tools have some imread function which works for jpg and png even if there is no suffix, because they check for common and supported image formats.
I'm not sure what TensorFlow does automatically, but (see the short sketch after this list):
jpg, png, bmp should be no problem
worst-case: use scipy to read and convert
mat is usually a matrix (with countless different encodings) and often matlab-based
scipy can read many matlab-based formats
bin can be anything (usually stands for binary; no clear mapping like the above)
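A small sketch of the decoders mentioned in this list (the API names are from TensorFlow 2.x, SciPy and NumPy; the file names are placeholders):

```python
import numpy as np
import tensorflow as tf
from scipy.io import loadmat

# jpg/png/bmp: TensorFlow decodes these directly from the raw bytes.
raw = tf.io.read_file('image.jpg')
image = tf.io.decode_image(raw)    # picks the right decoder from the file header

# mat: usually MATLAB data; SciPy reads many MATLAB formats.
mat = loadmat('data.mat')          # dict mapping variable names to arrays

# bin: no standard layout; you have to know the shape and dtype yourself.
array = np.fromfile('data.bin', dtype=np.uint8)
```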
Don't get me wrong, but I expect someone trying to use TensorFlow (not a small, not a simple tool) to know that changing a suffix will never magically transform the content into the new format (especially in the lossless/lossy case of png vs. jpg). I hope you evaluated this decision and are not running blindly into using a popular tool.
A '.mat' file contains Matlab-formatted data (not Matlab code, like you would expect from a '.m' file). I'm not sure if you're even using Matlab, since you didn't include the tag in your question. '.mat' files are associated with the Matlab workspace; if you wanted to save your current workspace in Matlab, you would save it as a '.mat' file.
A '.bin' file is a binary file read by the computer. In general, executable (ready-to-run) programs are often identified as binary files. I think this is what you would want to use. I am unsure what you really want, though, because the wording of the question is difficult to understand and it seems like you have two questions here.
Changing the suffix of a file just changes which program will open the file by default. For example, if I were to rename test.txt to test.py, the data inside the text file remains the same, but the way the file is opened changes. In this case, the file was a text file usually opened with Notepad (or some variation); after the rename it is opened by Python. If you were to change a .jpg file to a .txt file, you wouldn't be able to view it as a picture any more; instead, you would open a text file with a bunch of seemingly random characters which describe the picture. The picture data never changed, but the way you see it and are able to use it did.
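A tiny demonstration of that point (image.jpg is just an example file):

```python
import hashlib
import os
import shutil

shutil.copy('image.jpg', 'copy.jpg')
before = hashlib.md5(open('copy.jpg', 'rb').read()).hexdigest()

os.rename('copy.jpg', 'copy.txt')   # change only the suffix
after = hashlib.md5(open('copy.txt', 'rb').read()).hexdigest()

print(before == after)              # True: the bytes did not change
```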
Take a look at this website, which describes the .bin extension pretty well. Also, a quick Google search goes a long way, especially with questions like this.
Given a file from GridFS, I'd like to be able to display it on a webpage.
The files in my database can be of any common type, including jpgs, pngs, xml, txt, csv, etc.
A user would like to be able to click on the name of a file and have it displayed in a new tab, whether it is an image or a text file, or click download and get the file with its original extension.
The application is in Python. I have seen some solutions on here, but they require reading the bytes into a buffer, concatenating them, and formatting some markup for an image with the bytes as a base64 string; they also require the programmer to know the file's extension and to handle and format each extension case separately.
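For reference, this is roughly what those solutions look like (fs, file_id and the hard-coded MIME type are placeholders from my application); I would like to avoid having to know the type up front:

```python
import base64

def image_markup(fs, file_id, mime='image/jpeg'):
    # fs is a gridfs.GridFS instance; the MIME type must be known in advance.
    data = fs.get(file_id).read()
    encoded = base64.b64encode(data).decode('ascii')
    return '<img src="data:{};base64,{}">'.format(mime, encoded)
```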