I need to write my program's output to a file. However, some of the fields already contain spaces, commas, semicolons, and tabs, so I do not want to use any of those as a field separator. The data come from the web, so there is a chance that server admins used the wrong format.
I thought of using a made-up string, like my name. But that could look unprofessional if the output is ever used by other researchers.
Are there any recommendations on this? What should I use if I am afraid to use commas, semicolons, tabs, or spaces as the separator?
EDIT:
For those answers suggesting the json or csv module: please note that I want to load the file into a MySQL database. All I can do is tell MySQL that the fields are separated by [some separator]. I also need a simple solution.
Use commas (or tabs), but use a proper serializer that knows how to escape characters on write, and unescape them on read. The csv module knows how to do this and seems to match your likely requirements.
Yes, you could try to find some random character that never appears in your data, but that just means you'll die horribly if that character ever does appear, and it means producing output that no existing parser knows how to handle. CSV is a well-known format (if surprisingly complex, with varied dialects), and can likely be parsed by existing libraries in whatever language needs to consume it.
JSON (handled in Python by the json module) is often useful as a language-agnostic format, and pickle is another option (though it is only readable in Python), but from what you described, CSV is probably the go-to solution to start with.
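As a minimal sketch of the round trip this answer recommends: the rows below are made-up sample data, and an in-memory buffer stands in for a real file. The point is only that the csv module quotes fields on write and restores them unchanged on read, even when they contain commas, semicolons, tabs, quotes, or line breaks.

```python
import csv
import io

# Made-up sample rows whose fields contain every "dangerous" character.
rows = [
    ["id", "title", "note"],
    ["1", "Hello, world", 'He said "hi"; then left'],
    ["2", "tab\there", "line one\nline two"],
]

# newline="" lets the csv module control line endings itself
# (use open(path, "w", newline="") for a real file).
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)

# Read the text back and check that nothing was lost or split.
buffer.seek(0)
round_tripped = list(csv.reader(buffer))
print(round_tripped == rows)
```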
Generally, good separators can be any kind of normal, keyboard-typable symbol that isn't used anywhere else in the data. My suggestion would be either '|' or '/'.
CSV files typically use quotes around fields that contain field separator characters, and use a backslash or another quote to escape a literal quote.
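To make the two escaping conventions concrete, here is a small sketch using Python's csv module. The helper name render and the sample row are illustrative assumptions, not part of any library.

```python
import csv
import io

def render(row, **fmt):
    """Hypothetical helper: return one row as CSV text, without the line ending."""
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n", **fmt).writerow(row)
    return out.getvalue().rstrip("\n")

row = ["1", 'say "hi", ok']

# Default convention: a literal quote is doubled inside a quoted field.
doubled = render(row)
print(doubled)      # 1,"say ""hi"", ok"

# Alternative convention: a literal quote is preceded by a backslash.
backslashed = render(row, doublequote=False, escapechar="\\")
print(backslashed)  # 1,"say \"hi\", ok"
```

Whichever convention is used to write the file, the reader has to be given the same settings.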
CSV is not a well-defined format, however, and different vendors implement many variants. If you want a more robust text format that can store structured data, look into one of the better-defined serialization formats, such as JSON or YAML, instead.
Related
I am working on a Python project that does some analysis on CSV files. I know there is no well-defined standard for CSV files, but as far as I understand the definition (https://www.rfc-editor.org/rfc/rfc4180#page-2), a CSV file should not contain more than one table. Is this thinking correct, or did I misunderstand the definition?
How often do you see more than one table in CSV files?
You are correct. There is no universally accepted standard. The definition is written to suggest that each file contains one table, and this is by far the most common practice.
There's technically nothing stopping you from having more than one table, using a format you decide on and implement and keep consistent. For instance, you could parse the file yourself and use a line with 5 hyphens to designate a separate table.
However, I wouldn't recommend this. It goes against common practice, and you lose the ability to use existing CSV libraries on the file as a whole.
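Purely for illustration, the home-grown separator idea might look like the sketch below. The function name split_tables and the sample text are assumptions. It would also break if a quoted field ever contained a line consisting of five hyphens, which is one more reason to avoid the approach.

```python
import csv

def split_tables(lines, separator="-----"):
    """Split lines into tables at separator lines, parsing each table as CSV."""
    tables, current = [], []
    for line in lines:
        if line.strip() == separator:
            # A separator line closes the current table.
            tables.append(list(csv.reader(current)))
            current = []
        else:
            current.append(line)
    # Whatever follows the last separator is the final table.
    tables.append(list(csv.reader(current)))
    return tables

text = "a,b\n1,2\n-----\nx,y\n3,4\n"
tables = split_tables(text.splitlines())
print(tables)
```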
I am trying to add a cover page to a CSV file in Python that would display general information such as the date and a name. My program currently exports MySQL data to a CSV file. When I open the CSV, it opens as an Excel file. I want to add a cover page to this file while keeping it in the CSV format. Could you give me some ideas on how to go about this?
You can't add a cover page to a CSV file.
CSV is short for "Comma-separated values". It is defined to just be values separated by commas and nothing else. Wikipedia states that:
RFC 4180 proposes a specification for the CSV format, and this is the
definition commonly used. However, in popular usage "CSV" is not a
single, well-defined format. As a result, in practice the term "CSV"
might refer to any file that:
is plain text using a character set such as ASCII, various Unicode character sets (e.g. UTF-8), EBCDIC, or Shift JIS,
consists of records (typically one record per line),
with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
where every record has the same sequence of fields.
This assertion is important for any application which wants to read the file. How would an application deal with weird unexpected data in some proprietary format?
You can, however, invent your own proprietary format that only you know how to read. It could include data for a cover (as an image, PDF, LaTeX, or something else) along with the data for your CSV. But this would be quite an undertaking, and there are a million ways to approach it. How to implement such a thing is beyond the scope of Stack Overflow. Try breaking down your question.
I'm sure this has been answered before but after attempting to search for others who had the problem I didn't have much luck.
I am using csv.reader to parse a CSV file. The file is in the correct format, but on one of the lines of the CSV file I get the notification "list index out of range" indicating that the formatting is wrong. When I look at the line, I don't see anything wrong. However, when I go back to the website where I got the text, I see a square/rectangle symbol where there is a space. This symbol must be leading csv.reader to treat that as a new line symbol.
A few questions: 1) What is this symbol and why can't I see it in my text files? 2) How do I avoid having these treated as new lines? I wonder if the best way is to find and replace them given that I will be processing the file multiple times in different ways.
Here is the symbol:
Update: When I copy and paste the symbol into Google it searches for Â (a-circumflex). However, when I copy and paste Â into my documents, it shows up correctly. That leads me to believe that the symbol is not actually Â.
This looks like a charset problem. The "Â" is what you get when a UTF-8 non-breaking space (bytes 0xC2 0xA0) is read as Latin-1: the first byte shows up as "Â" and the second as a non-breaking space. Assuming you are running Windows, your default character set is probably one of the Latin code pages, whereas UTF-8 is the default encoding on OS X and Linux-based OSs. Most text editors use the OS locale as their default, so on such a system they read and write files as Latin-1. A lot of programmers on OS X have problems with non-breaking spaces because they are very easy to type by mistake (Option+Spacebar) and impossible to see.
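The mismatch can be reproduced in a few lines. This is only a sketch of the mechanism; whether it matches the asker's actual file is an assumption, and the sample strings are invented.

```python
# A non-breaking space (U+00A0) encoded as UTF-8 is two bytes.
nbsp = "\u00a0"
utf8_bytes = nbsp.encode("utf-8")

# Reading those bytes as Latin-1 yields an a-circumflex plus a non-breaking space.
misread = utf8_bytes.decode("latin-1")
print(ascii(misread))

# Reversing the wrong decode recovers the original character.
repaired = misread.encode("latin-1").decode("utf-8")

# Once the text is decoded correctly, non-breaking spaces can be replaced
# with ordinary spaces before further processing.
cleaned = "first\u00a0second".replace("\u00a0", " ")
print(cleaned)
```

In practice the simplest fix is usually to open the file with the encoding it was actually written in, for example open(path, encoding="utf-8", newline="").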
The csv reader also supports dialects, which describe the delimiter and quoting conventions of a file. If you know what program was used to create the CSV file, you can specify a dialect manually, such as 'excel'. You can also use csv.Sniffer to deduce it automatically by peeking into the file. A dialect does not cover the character encoding, though; that is set when you open the file.
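A minimal sketch of dialect sniffing, assuming a small made-up sample with semicolon delimiters. The list of candidate delimiters is an assumption; narrowing it tends to make the guess more reliable.

```python
import csv
import io

# Made-up sample; in practice this would be the first few KB of the file.
sample = "id;title;body\n1;Hello;first post\n2;World;second post\n"

# Let the sniffer guess the dialect from a restricted set of candidates.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")
print(repr(dialect.delimiter))

# Parse the data with the detected dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows)
```

The sniffer is a heuristic and can guess wrong on unusual data, so it is worth checking its result on a known file first.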
Life Management Advice: If you happen to see weird characters anywhere, always assume charset problems. There is an awesome charset problem debug table HERE.
I am trying to convert a list of JSON objects into a list of pipe-delimited strings. One problem is that some JSON tags can be missing in some lines, and the final list has to include ""|""|"" empty strings in place of the missing tags. There is also no guarantee that every JSON object has the same tags in the same sequence. The number of pipe-delimited strings has to remain the same. I switched from the json module to simplejson and use multiprocessing on a strong machine (32 CPUs), but the result is still poor. Using pyinstaller doesn't improve anything. I definitely need community help.
I tested different JSON parsers (simplejson, ujson, json, yajl) and nothing helped. Then I came across cjson, and it cut the time from 28 minutes to 1.5 minutes. I have no idea why, but it works like a rocket, even though the other solutions are also implemented in C.
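Independent of which parser is fastest, the conversion described in the question can be sketched as follows. The tag names in FIELDS and the function name to_pipe_line are hypothetical, and the standard json module is used only to keep the example self-contained.

```python
import csv
import io
import json

# Hypothetical tag names; the fixed order defines the output columns.
FIELDS = ["id", "name", "city"]

def to_pipe_line(raw_json, fields=FIELDS):
    """Turn one JSON object into a pipe-delimited line with every field quoted."""
    record = json.loads(raw_json)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", quoting=csv.QUOTE_ALL, lineterminator="\n")
    # Missing tags become empty strings, so every line has the same field count.
    writer.writerow([record.get(name, "") for name in fields])
    return out.getvalue().rstrip("\n")

raw_records = ['{"name": "Ann", "id": 1}', '{"city": "Oslo"}', "{}"]
lines = [to_pipe_line(raw) for raw in raw_records]
for line in lines:
    print(line)
```

Looking tags up by name also makes the output independent of the order in which tags appear in each JSON object.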
I need to import content from WordPress into Plone, a Python-based CMS, and I have a dump of the posts table as a huge CSV vanilla file using ";" as a delimiter.
The problem is the standard CSV reader from the csv module is not smart enough to parse the HTML content inside a row (the post_content field).
For instance, when the parser encounters something like <p>&nbsp;</p>, it interprets the semicolon as a field delimiter, and I end up with more items than fields and with fields containing the wrong content.
Is there any other option for solving this kind of issue? Processing each row with a regex seems pretty scary to me.
After some additional research, I discovered the excel-tab dialect by reading the text of PEP 305 (which proposed adding the csv module to Python); it is mentioned in the module documentation, but I hadn't noticed it at first.
I then re-exported the posts using a tab as a delimiter (\t).
I made a test reading a batch of 1,000 rows and found no errors at all.
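Reading such a tab-delimited export might look like the sketch below. The sample text and its column names other than post_content are invented, and an in-memory buffer stands in for the real dump file.

```python
import csv
import io

# Invented two-line sample of a tab-delimited export.
sample = "ID\tpost_title\tpost_content\r\n1\tHello\t<p>&nbsp;</p>\r\n"

# The excel-tab dialect splits on tabs, so semicolons in the HTML are harmless.
reader = csv.reader(io.StringIO(sample, newline=""), dialect="excel-tab")
rows = list(reader)
print(rows)
```

This only helps as long as the post content itself contains no unquoted tab characters.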
The csv module provides the escapechar format parameter, which lets you escape the delimiter (which you have set to a semicolon). If you can pass escapechar='\\' in the call to csv.reader(), you could then replace the character \ in your CSV file with \\, and replace &nbsp; with &nbsp\; (using a text editor's find/replace option).
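A small sketch of the escapechar approach, assuming the find/replace step has already been applied. The sample line is invented for illustration.

```python
import csv
import io

# Invented line after the find/replace step: the semicolon inside the
# HTML entity is escaped with a backslash, the field delimiters are not.
escaped_line = "1;<p>&nbsp\\;</p>;My first post\r\n"

reader = csv.reader(
    io.StringIO(escaped_line, newline=""),
    delimiter=";",
    escapechar="\\",
)
fields = next(reader)
print(fields)
```

The reader drops the backslash and keeps the semicolon as part of the field, so the row comes back with three fields.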
Another option, for smaller sites, could be using pywordpress, a pythonic interface to WordPress XML-RPC API.