I need to import content from WordPress into Plone, a Python-based CMS, and I have a dump of the posts table as a huge vanilla CSV file that uses ";" as the delimiter.
The problem is that the standard reader from the csv module is not smart enough to parse the HTML content inside a row (the post_content field). For instance, when the parser encounters something like <p>&nbsp;</p>, it interprets the semicolon that terminates the entity as a field delimiter, and I end up with more items than fields, and with fields holding the wrong content.
Is there any other option to solve this kind of issue? Processing the rows with a regex seems pretty scary to me.
After some additional research, I discovered the excel-tab dialect by reading the text of PEP 305 (which proposed the addition of the csv module to Python); it is mentioned in the module documentation, but I hadn't noticed it at first.
I then re-exported the posts using a tab as a delimiter (\t).
As a test, I read a batch of 1,000 rows and found no errors at all.
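For reference, a minimal sketch of the reading side (the filename is illustrative):

    import csv

    # Read the tab-delimited export with the built-in "excel-tab" dialect.
    with open('posts.tsv', newline='') as f:
        for row in csv.reader(f, dialect='excel-tab'):
            print(len(row))  # sanity-check the field count per row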
The csv module provides the escapechar format parameter, which allows you to escape your delimiter (the semicolon). If you pass escapechar='\\' to csv.reader(), you can first use a text editor's find/replace to turn every \ in your CSV file into \\ and every literal ; inside field content into \;.
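A minimal sketch of the reading side, assuming the file has been pre-processed as described (filename illustrative):

    import csv

    # Semicolon-delimited file in which literal semicolons inside
    # fields have been pre-escaped as "\;" (and "\" as "\\").
    with open('posts.csv', newline='') as f:
        for row in csv.reader(f, delimiter=';', escapechar='\\'):
            print(row)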
Another option, for smaller sites, could be pywordpress, a Pythonic interface to the WordPress XML-RPC API.
I have a settings.json file full of useful comments (some C-style, some Python-style), and I'm modifying it programmatically with e.g. the json library; but when I save the modified version I lose all the comments explaining the fields. Another inconvenience is losing the original indentation and spacing.
Is there a 'neat' way of modifying the file programmatically?
Standard JSON files cannot contain comments and still be compliant JSON.
There is another format that was designed to overcome this problem: JSON5. It has libraries designed to keep JSON5 properties like comments intact; you can find a Python library for it here.
Another approach is to keep using standard JSON but add a "doc" field to each JSON block in question. The doc fields then become part of the data payload and survive any transformation. For example, Apache Avro uses doc fields to document Avro schemas.
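Here is a minimal sketch of the doc-field idea using only the standard json module (field names are illustrative):

    import json

    # Comments live in ordinary "doc" fields, so they are data and
    # survive a load/modify/dump cycle, unlike real comments.
    text = '''{
        "timeout_doc": "Timeout in seconds; raise on slow networks.",
        "timeout": 30
    }'''

    settings = json.loads(text)
    settings["timeout"] = 60
    print(json.dumps(settings, indent=4))  # the doc field is still there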
I want to use postgres' full-text search on the largest public natural-language corpus available. I downloaded this wikimedia dump stub of a few megabytes for the example; the target is to work with dumps of around 70 GB uncompressed. Here is the xsd.
I know there are other open parallel corpora easier to work with, I want to focus on wikimedia here.
This might seem like a duplicate, but I would like to investigate a simpler approach than the other proposals I found: postgres mailing list and lo, postgres mailing list and js, here with pg_read_file, here with nodejs, here with splitting, here with splitting + csv...
I would like to preprocess the XML before it enters postgres, and stream it in with the COPY command. BaseX allows serializing XML to CSV/text from the command line with XPath. I already have a stub XPath command within postgres.
The text tag in the XML contains huge text blobs, the Wikipedia article contents in wikitext, and those are tricky to put into CSV (quotes, double quotes, newlines, plus all of wikitext's weird syntax), so I wonder about the format. I would ideally like a stream, currently thinking of:

    basex [-xpath command] | psql -c 'COPY foo FROM stdin (format ??)'
Here is my question: can BaseX process the XML input and emit the transformed output as a stream rather than as a batch? If yes, what output format could I use to load into postgres?
I intend to eventually store the data in the mediawiki postgresql schema (at the bottom of the link), but I will first fiddle with a toy schema with no indexes, no triggers, etc.
The problem of wikitext remains, but that's another story.
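For concreteness, here is a rough, untested sketch of the kind of pipeline I have in mind; the element names, table and database names are illustrative, the dump's export namespace is ignored for brevity, and it relies on BaseX printing one sequence item per line:

    basex -q '
      (: one line per page: title and body, tab-separated; backslash,
         tab and newline escaped for the default COPY text format :)
      for $p in doc("dump.xml")//page
      let $fields := ($p/title/string(), $p/revision/text/string())
      return string-join(
        (for $f in $fields
         return replace(replace(replace($f, "\\", "\\\\"),
                                "&#9;",  "\\t"),
                        "&#10;", "\\n")),
        "&#9;")
    ' | psql -d wiki -c 'COPY pages (title, body) FROM STDIN'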
I need to write my program's output to a file. However, some of the fields already contain spaces, commas, semicolons, or tabs, so I do not want to use spaces, tabs, or commas as the field separator. The data come from the web, so there is also the possibility of server admins using the wrong format.
I thought of using some made-up string, like my name, but that could look unprofessional if the output is ever used by other researchers.
Are there any recommendations on this matter? What should I use if I am afraid to use commas, semicolons, tabs, or spaces as the separator?
EDIT:
For those answers suggesting the json or csv module: please note that I want to load the file into a MySQL database, where I can just specify that fields are separated by [some separator]. I also need a simple solution.
Use commas (or tabs), but use a proper serializer that knows how to escape characters on write, and unescape them on read. The csv module knows how to do this and seems to match your likely requirements.
Yes, you could try to find some random character that never appears in your data, but that just means you'll die horribly if that character ever does appear, and it means producing output that no existing parser knows how to handle. CSV is a well-known format (if surprisingly complex, with varied dialects), and can likely be parsed by existing libraries in whatever language needs to consume it.
JSON (handled in Python by the json module) is often useful as well, as a language-agnostic format, and so is pickle (though it's only readable from Python); but from what you describe, CSV is probably the go-to solution to start with.
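A quick sketch of the round trip (filename illustrative):

    import csv

    # Fields containing commas, tabs and quotes round-trip cleanly,
    # because the writer quotes/escapes them and the reader undoes it.
    rows = [["plain", "comma, inside", "tab\there", 'quote " inside']]

    with open('out.csv', 'w', newline='') as f:
        csv.writer(f).writerows(rows)

    with open('out.csv', newline='') as f:
        assert list(csv.reader(f)) == rows

For the MySQL side, LOAD DATA INFILE can consume such a file by specifying FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'.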
Generally, a good separator can be any normal, keyboard-typable symbol that isn't used anywhere else in the data. My suggestion would be either '|' or '/'.
CSV files typically use quotes around fields that contain field separator characters, and use a backslash or another quote to escape a literal quote.
CSV is not a well-defined format, however, and there are many variants implemented by different vendors. If you want a more rigorously specified text format that can store structured data, you should look into one of the better-defined serialization formats, such as JSON or YAML, instead.
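To illustrate the quoting rules, here is what Python's default dialect (which doubles quotes rather than backslash-escaping them) produces:

    import csv, io

    # A field with the delimiter and a literal quote: the writer wraps
    # it in quotes and doubles the embedded quote character.
    buf = io.StringIO()
    csv.writer(buf).writerow(['a,b', 'say "hi"'])
    print(buf.getvalue())  # -> "a,b","say ""hi"""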
I am trying to add a cover page to a CSV file in Python that would display general information, such as the date and a name. My program currently exports MySQL data to a CSV file. When I open the CSV, it opens as an Excel spreadsheet. I am trying to add a cover page to this spreadsheet, which is in the CSV format. Could you give me some ideas as to how I could go about doing this?
You can't add a cover page to a CSV file.
CSV is short for "comma-separated values". It is defined to be just values separated by commas and nothing else. Wikipedia states that:
RFC 4180 proposes a specification for the CSV format, and this is the definition commonly used. However, in popular usage "CSV" is not a single, well-defined format. As a result, in practice the term "CSV" might refer to any file that:
- is plain text using a character set such as ASCII, various Unicode character sets (e.g. UTF-8), EBCDIC, or Shift JIS,
- consists of records (typically one record per line),
- with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
- where every record has the same sequence of fields.
This assertion is important for any application that wants to read the file: how would an application deal with weird, unexpected data in some proprietary format?
You can, however, invent your own proprietary format, which only you know how to read. This could include the data for a cover (as an image, PDF, LaTeX, or something else) plus the data for your CSV. But this would be quite an undertaking, and there are a million ways to approach the problem; how to implement such a thing is beyond the scope of Stack Overflow. Try breaking down your question.
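As a toy illustration of such an ad-hoc format (entirely made up, and unreadable by standard CSV tools unless they know to skip the cover lines):

    import csv

    # Cover lines prefixed with "#", followed by ordinary CSV.
    with open('report.csv', 'w', newline='') as f:
        f.write("# Monthly export\n")
        f.write("# Generated by: exporter.py\n")
        csv.writer(f).writerows([["id", "name"], [1, "alice"]])

    # A cooperating reader must skip the "#" cover lines itself.
    with open('report.csv', newline='') as f:
        rows = [r for r in csv.reader(f) if r and not r[0].startswith("#")]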
I'm sure this has been answered before but after attempting to search for others who had the problem I didn't have much luck.
I am using csv.reader to parse a CSV file. The file is in the correct format, but on one of its lines I get the error "list index out of range", indicating that the formatting is wrong. When I look at the line, I don't see anything wrong. However, when I go back to the website where I got the text, I see a square/rectangle symbol where there is a space. This symbol must be leading csv.reader to treat it as a newline symbol.
A few questions: 1) What is this symbol, and why can't I see it in my text files? 2) How do I avoid having these treated as new lines? I wonder whether the best way is to find and replace them, given that I will be processing the file multiple times in different ways.
Here is the symbol:
Update: When I copy and paste the symbol into Google, it searches for Â (a-circumflex). However, when I copy and paste Â into my documents, it shows up correctly. That leads me to believe that the symbol is not actually Â.
This looks like a charset problem. The "Â" is what you see when the UTF-8 encoding of a non-breaking space (the bytes 0xC2 0xA0) is displayed as latin-1: 0xC2 shows up as "Â" and 0xA0 is invisible. Assuming you are running Windows, you are probably using one of the latin-1 family of character sets; UTF-8 is the default encoding on OS X and Linux-based OSs. The OS locale is used as the default in most text editors, so files created with those programs end up encoded as latin-1. A lot of programmers on OS X have trouble with non-breaking spaces because the character is very easy to type by mistake (it is Option+Spacebar) and impossible to see.
The csv module's reader supports dialects for solving this kind of problem. If you know which program was used to create the CSV file, you can specify a dialect manually, like 'excel'. You can also use the csv sniffer to deduce the dialect automatically by peeking into the file.
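A sketch combining both fixes, assuming the file is actually UTF-8 (filename illustrative):

    import csv

    # Open with an explicit encoding so the UTF-8 bytes of a
    # non-breaking space (0xC2 0xA0) are not misread as latin-1.
    with open('data.csv', encoding='utf-8', newline='') as f:
        dialect = csv.Sniffer().sniff(f.read(4096))  # peek at a sample
        f.seek(0)
        for row in csv.reader(f, dialect):
            print(row)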
Life Management Advice: If you happen to see weird characters anywhere, always assume charset problems. There is an awesome charset problem debug table HERE.