Adding a cover page to a CSV/Excel file in Python

I am trying to add a cover page to a CSV file in Python that would display general information such as the date and a name. My program currently exports MySQL data to a CSV file. When I open the CSV, it opens in Excel. I am trying to add a cover page to this Excel file, which is in the CSV format. Could you give me some ideas as to how I could go about doing this?

You can't add a cover page to a CSV file.
CSV is short for "Comma-separated values". It is defined to just be values separated by commas and nothing else. Wikipedia states that:
RFC 4180 proposes a specification for the CSV format, and this is the
definition commonly used. However, in popular usage "CSV" is not a
single, well-defined format. As a result, in practice the term "CSV"
might refer to any file that:
- is plain text using a character set such as ASCII, various Unicode character sets (e.g. UTF-8), EBCDIC, or Shift JIS,
- consists of records (typically one record per line),
- with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
- where every record has the same sequence of fields.
This assertion is important for any application which wants to read the file. How would an application deal with weird unexpected data in some proprietary format?
You can, however, invent your own proprietary format, which only you know how to read. This could include the data for a cover (as an image, PDF, LaTeX or something else) alongside the data for your CSV. But this would be quite an undertaking, there are a million ways to approach it, and how to implement such a thing is beyond the scope of Stack Overflow. Try breaking down your question.
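To make the trade-off concrete, here is a minimal, purely hypothetical sketch of such a home-grown format: cover-page lines get a marker prefix before the CSV body, and a matching reader strips them back out. The marker and function names are invented for illustration, and standard CSV tools will not understand files written this way.

```python
import csv
import io

COVER_PREFIX = "# "  # hypothetical marker for cover-page lines

def dump_with_cover(cover_lines, rows):
    """Render cover lines (prefixed) followed by an ordinary CSV body."""
    buf = io.StringIO()
    for line in cover_lines:
        buf.write(COVER_PREFIX + line + "\n")
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def load_with_cover(text):
    """Split the text back into (cover_lines, rows). Plain CSV readers
    will choke on the prefixed lines -- that is the trade-off."""
    cover, body = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith(COVER_PREFIX):
            cover.append(line[len(COVER_PREFIX):].rstrip("\r\n"))
        else:
            body.append(line)
    return cover, list(csv.reader(body))
```

Only your own code can read the result, which is exactly the problem described above.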

Related

CSV Standard - Multiple Tables

I am working on a Python project that does some analysis on CSV files. I know there is no well-defined standard for CSV files, but as far as I understand the definition (https://www.rfc-editor.org/rfc/rfc4180#page-2), a CSV file should not contain more than one table. Is this thinking correct, or did I misunderstand the definition?
How often do you see more than one table in CSVs?
You are correct. There is no universally accepted standard. The definition is written to suggest that each file contains one table, and this is by far the most common practice.
There's technically nothing stopping you from having more than one table, using a format you decide on and implement and keep consistent. For instance, you could parse the file yourself and use a line with 5 hyphens to designate a separate table.
However I wouldn't recommend this. It goes against the common practice, and you will eliminate the possibility of using existing CSV libraries to help you.
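For illustration only, a hypothetical parser for the five-hyphen convention described above might look like this (the separator string and function name are arbitrary choices, which is precisely why no existing library would understand the file):

```python
import csv

TABLE_SEPARATOR = "-----"  # ad-hoc marker from the answer above; any string would do

def read_multi_table(text):
    """Split a multi-table 'CSV' on separator lines; returns a list of tables,
    each table being a list of rows."""
    tables, current = [], []
    for line in text.splitlines():
        if line.strip() == TABLE_SEPARATOR:
            if current:
                tables.append(list(csv.reader(current)))
            current = []
        else:
            current.append(line)
    if current:
        tables.append(list(csv.reader(current)))
    return tables
```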

What are the recommended alternatives to the comma separator [duplicate]

This question already has answers here:
CSV writing strings of text that need a unique delimiter
(3 answers)
How to give 2 characters for "delimiter" using csv module?
(2 answers)
Closed 3 years ago.
I need to write my program's output to a file. However, some of the fields already contain spaces, commas, semicolons, and tabs, so I do not want to use spaces, tabs, or commas as the field separator. The data come from the web, so there is a possibility of server admins using the wrong format.
I thought of using some made-up string, like my name, but that could look unprofessional if the output is ever used by other researchers.
Are there any recommendations in this matter? What should I use if I am afraid to use commas, semicolons, tabs, spaces as separator?
EDIT:
For those answers suggesting the json or csv modules, please note that I want to load the file into a MySQL database. I can just specify to MySQL that fields are separated by [some separator]. I also need a simple solution.
Use commas (or tabs), but use a proper serializer that knows how to escape characters on write, and unescape them on read. The csv module knows how to do this and seems to match your likely requirements.
Yes, you could try to find some random character that never appears in your data, but that just means you'll die horribly if that character ever does appear, and it means producing output that no existing parser knows how to handle. CSV is a well-known format (if surprisingly complex, with varied dialects), and can likely be parsed by existing libraries in whatever language needs to consume it.
JSON (handled in Python by the json module) is often useful as a language-agnostic format, as is pickle (though it's only readable in Python), but from what you described, CSV is probably the go-to solution to start with.
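As a quick demonstration that the csv module handles the "dangerous" characters for you (the sample data here is made up), a write followed by a read recovers the fields exactly, commas and quotes included:

```python
import csv
import io

rows = [
    ["name", "note"],
    ["Smith, John", 'said "hello"'],   # comma and quotes inside fields
    ["tab\there", "semi;colon"],       # tab and semicolon inside fields
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)        # writer quotes/escapes as needed

buf.seek(0)
back = list(csv.reader(buf))           # reader undoes the quoting
assert back == rows                    # lossless round trip
```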
Generally, good separators can be any kind of normal, keyboard-typable symbol that isn't used anywhere else in the data. My suggestion would be either '|' or '/'.
CSV files typically use quotes around fields that contain field separator characters, and use a backslash or another quote to escape a literal quote.
CSV is not a well-defined format, however, and there are many variants implemented by different vendors. If you want a better-rounded text format that can store structured data, you should look into one of the better-defined serialization formats, such as JSON or YAML, instead.
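A minimal sketch of the JSON route (the sample records are invented): the serializer takes care of every awkward character, so no delimiter choice is needed at all:

```python
import json

records = [
    {"name": "Smith, John", "note": 'tabs\tand "quotes" are fine'},
    {"name": "semi;colon", "note": "non-ASCII too: Zürich"},
]

# Serialize to one text blob; commas, tabs, and quotes need no special handling.
text = json.dumps(records, ensure_ascii=False)

# The round trip is lossless.
assert json.loads(text) == records
```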

Store pkl / binary in MetaData

I am writing a function that is supposed to store a text representation of a custom class object, cl.
I have some code that writes to a file and takes the necessary information out of cl.
Now I need to go backwards, read the file and return a new instance of cl. The problem is, the file doesn't keep all of the important parts of cl because for the purpose of this text document parts of it are unnecessary.
A .jpg file allows you to store meta data like shutter speed and location. I would like to store the parts of cl that are not supposed to be in the text portion in the meta data of a .txt or .csv file. Is there a way to explicitly write something to the metadata of a text file in Python?
Additionally, would it be possible to write the byte-code .pkl representation of the entire object in the metadata?
Text files don't have metadata in the same way that a .jpg file does. A JPEG file is specifically designed to have ways of including metadata as extra structured information in the image. Text files aren't: every character in the text file is generally displayed to the user.
Similarly, everything in a CSV file is part of one cell in the table represented by the file.
That said, there are some things similar to text-file metadata that have existed over the years and might give you some ideas. I don't think any of these is ideal, but they should give you an idea of how complex the area of metadata is and what people have done in similar situations.
Some filesystems have extensible metadata associated with each file. As examples, NTFS has alternate data streams; HFS and HFS+ have resource forks and other attributes; Linux has extended attributes on most of its filesystems. You could potentially store your pickle data in that filesystem metadata. There are disadvantages: some filesystems don't have this metadata, and some tools for copying and manipulating files will not recognize it (or will intentionally strip it).
You could have a .txt file and a .pkl file, where the .txt file contains your text representation and the .pkl file contains the other information.
Back in the day, some DOS programs would stop reading a text file at a DOS EOF (decimal character 26). I don't think anything behaves like that, but it's an example that there are file formats that allowed you to end the file and then still have extra data that programs could use.
With a format like HTML or an actual spreadsheet instead of CSV, there are ways you could include things in meta data easily.
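The .txt-plus-.pkl pairing can be sketched in a few lines (the function names and the sample object are invented for illustration; note the usual caveat that unpickling data from untrusted sources is unsafe):

```python
import pickle
import tempfile
from pathlib import Path

def save_pair(obj, text, stem):
    """Human-readable part goes in <stem>.txt; the full object is
    pickled alongside it in <stem>.pkl."""
    Path(str(stem) + ".txt").write_text(text, encoding="utf-8")
    Path(str(stem) + ".pkl").write_bytes(pickle.dumps(obj))

def load_pair(stem):
    """Read both halves back and return (object, text)."""
    text = Path(str(stem) + ".txt").read_text(encoding="utf-8")
    obj = pickle.loads(Path(str(stem) + ".pkl").read_bytes())
    return obj, text

# Demo in a temporary directory:
stem = Path(tempfile.mkdtemp()) / "cl"
save_pair({"hidden": [1, 2, 3]}, "visible text only", stem)
obj, text = load_pair(stem)
```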

New line with invisible character

I'm sure this has been answered before but after attempting to search for others who had the problem I didn't have much luck.
I am using csv.reader to parse a CSV file. The file is in the correct format, but on one of the lines I get a "list index out of range" error, indicating that the formatting is wrong. When I look at the line, I don't see anything wrong. However, when I go back to the website where I got the text, I see a square/rectangle symbol where there is a space. This symbol must be causing csv.reader to treat it as a newline.
A few questions: 1) What is this symbol and why can't I see it in my text files? 2) How do I avoid having these treated as new lines? I wonder if the best way is to find and replace them given that I will be processing the file multiple times in different ways.
Here is the symbol:
Update: When I copy and paste the symbol into Google, it searches for "Â" (a-circumflex). However, when I copy and paste "Â" into my documents, it shows up correctly. That leads me to believe that the symbol is not actually Â.
This looks like a charset problem. "Â" is what you see when a UTF-8 non-breaking space (bytes 0xC2 0xA0) is decoded as latin-1. Assuming you are running Windows, you are using one of the latin encodings as your character set, while UTF-8 is the default encoding on OS X and Linux-based OSs. Most text editors use the OS locale as their default, so files created with those programs end up encoded as latin-1. A lot of programmers on OS X have problems with non-breaking spaces because it is very easy to type one by mistake (it is Option+Spacebar) and impossible to see.
In Python >= 3.1, the csv reader supports dialects for solving those kinds of problems. If you know what program was used to create the CSV file, you can manually specify a dialect, like 'excel'. Otherwise, you can use csv.Sniffer to deduce the dialect automatically by peeking into the file.
Life management advice: if you happen to see weird characters anywhere, always assume charset problems. There are good charset-problem debugging tables available online.
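Both points can be demonstrated in a few lines (the sample data is invented): the first assertion reproduces the "Â" mojibake described above, and the rest shows explicit decoding followed by csv.Sniffer deducing the delimiter:

```python
import csv
import io

# A UTF-8 non-breaking space mis-decoded as latin-1 is exactly the "Â" mojibake:
nbsp = "\u00a0"
assert nbsp.encode("utf-8").decode("latin-1") == "\u00c2\u00a0"  # "Â" + NBSP

# Decode bytes explicitly with the right charset, then let the Sniffer
# deduce the dialect (here it should find the ";" delimiter):
raw = "name;city\r\nJane;Z\u00fcrich\r\n".encode("utf-8")
text = raw.decode("utf-8")
dialect = csv.Sniffer().sniff(text)
rows = list(csv.reader(io.StringIO(text), dialect))
```

Note that the Sniffer is a heuristic; on very small or unusual samples it can guess wrong, so specifying a known dialect is always the safer option.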

Is there a way to automate specific data extraction from a number of pdf files and add them to an excel sheet?

Regularly I have to go through a list of PDF files, search for specific data, and add it to an Excel sheet for later review. As the number of PDF files is around 50 per month, doing this manually is both time-consuming and frustrating.
Can the process be automated in windows by python or any other scripting language? I require to have all the pdf files in a folder and run the script which will generate an excel sheet with all the data added. The pdf files with which I work are tabular and have similar structures.
Yes. And no. And maybe.
The problem here is not extracting something from a PDF document. Extracting something is almost always possible and there are plenty of tools available to extract content from a PDF document. Text, images, whatever you need.
The major problem (and the reason for the "no" or "maybe") is that PDF in general is not a structured file format. It doesn't care about columns, paragraphs, tables, sentences or even words. In the general case it cares only about characters on a page in a specific location.
This means that in the general case you cannot query a PDF document and ask it for every paragraph, or for the third sentence in the fifth paragraph. You can ask a library for all of the text, or for all of the text in a specific location, and then you have to hope the library is able to extract the text you need in a legible format. It isn't even guaranteed that you can copy and paste, or otherwise extract, understandable characters from a PDF file; many PDF files don't contain enough information for that.
So... If you have a certain type of document and you can test that it predictably behaves a certain way with a certain extraction engine, then yes, you can extract information from a PDF file.
If the PDF files you receive are different all the time, or the layout on the page is totally different every time, then the answer is probably that you cannot reliably extract the information you want.
As a side note:
There are certain types of PDF documents that are easier to handle than others so if you're lucky that might make your life easier. Two examples:
Many PDF files will in fact contain textual information in such a way that it can be extracted in a legible way. PDF files that follow certain standards (such as PDF/A-1a, PDF/A-2a or PDF/A-2u etc...) are even required to be created this way.
Some PDF files are "tagged" which means they contain additional structural information that allows you to extract information in an easier and more meaningful way. This structure would in fact identify paragraphs, images, tables etc and if the tagging was done in a good way it could make the job of content extraction much easier.
You could use pdf2text2 in Python to extract data from your PDF.
Alternatively you can use pdftotext that is part of the Xpdf suite
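Once the text is out of the PDF (e.g. via pdftotext), pulling out the fields is ordinary string work. A hypothetical sketch follows, assuming an invented invoice-style layout; real PDFs will need their own patterns, and, as the answer above warns, this only works when the documents behave predictably:

```python
import csv
import io
import re

# Hypothetical: text already extracted from one PDF, e.g. with
#   pdftotext -layout invoice.pdf -
page_text = """\
Invoice No: 1042
Customer:   ACME Ltd
Total:      199.50
"""

# Field name -> regex; the layout these patterns assume is invented.
FIELDS = {
    "invoice": r"Invoice No:\s*(\S+)",
    "customer": r"Customer:\s*(.+)",
    "total": r"Total:\s*([\d.]+)",
}

def extract_fields(text):
    """Pull each field out of one page of extracted text; missing fields
    become empty strings rather than raising."""
    row = {}
    for name, pattern in FIELDS.items():
        m = re.search(pattern, text)
        row[name] = m.group(1).strip() if m else ""
    return row

row = extract_fields(page_text)

# Append the row to a CSV that Excel can open directly.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(FIELDS))
writer.writeheader()
writer.writerow(row)
```

Looping this over every PDF in a folder and writing one row per file gives the monthly sheet described in the question.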
