Processing a CSV with colon-separated pairs in fields - Python

I have a CSV in the format of
Fruit:Apple,Seeds:Yes,Colour:Red or Green
Fruit:Orange,Seeds:No,Colour:Orange
Fruit:Pear,Seeds:Yes,Colour:Green,Shape:Odd
Fruit:Banana,Seeds:No,Colour:Yellow,Shape:Also Odd
and I want to create a JSON object for these values that looks something like
{"requestdata":{
"testdata":"example",
"testcategory":"category",
"fruits":{
"Fruit":{
"value":"Apple"
"type":"string"},
"Seeds":{
"value":"Yes"
"type":"bool"}
}
etc
I know I can load the CSV with a delimiter of my choosing, but how would I specify the second delimiter? Or should I try and build a dictionary instead for each cell of data and treat it as a string to split?

You should just split on the comma, then use a string split on the colon to process the remaining elements, building a dictionary, and have the json module produce the JSON from that dictionary. It is fairly easy to create malformed JSON when trying to be clever with text processing, such as:
Forgetting to quote keys.
Quoting values you didn't mean to.
Not escaping JSON special characters.
Building the dictionary and then having the module do its thing will make your code much more maintainable and less error-prone.
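A minimal sketch of that approach, assuming the rows live in a file named fruits.csv and that Yes/No values are the only ones you want typed as bool (both the file name and that rule are placeholders, not anything from your setup):

import json

def parse_line(line):
    # Turn "Fruit:Apple,Seeds:Yes,..." into the nested dict shape above.
    fields = {}
    for pair in line.strip().split(","):
        key, value = pair.split(":", 1)  # split on the first colon only
        kind = "bool" if value in ("Yes", "No") else "string"
        fields[key] = {"value": value, "type": kind}
    return fields

with open("fruits.csv") as f:
    rows = [parse_line(line) for line in f if line.strip()]

payload = {"requestdata": {"testdata": "example",
                           "testcategory": "category",
                           "fruits": rows[0]}}
print(json.dumps(payload, indent=2))

json.dumps takes care of all the quoting and escaping, which is exactly the part that goes wrong with hand-rolled text processing.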

Related

Python 3: how to parse a csv file where the text fields can contain embedded new line characters

When exporting Excel/LibreOffice sheets as CSV where the cells can contain newlines, the resulting file will have those newlines preserved as literal newline characters, not as something like the escaped string "\n".
The standard csv module in Python 3 apparently does not handle this as would be necessary. The documentation says: "Note: The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future." Well, duh.
Is there some other way to read in such CSV files properly? What csv really should do is ignore any newlines within quoted text fields and only recognise newline characters outside a field, but since it does not, is there a different way to solve this short of implementing my own CSV parser?
Try using pandas with something like df = pandas.read_csv('my_data.csv'); you'll have more granular control over how the data is read in. If you're worried about formatting, you can also set the delimiter for the CSV export from LibreOffice to something that doesn't occur in nature, like ";;".
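A sketch of the pandas route ('my_data.csv' is just the placeholder name from above):

import pandas as pd

# pandas honours quoted fields, so newlines embedded inside quotes
# survive the read.
df = pd.read_csv('my_data.csv')
# or, for a file exported from LibreOffice with a ';;' separator:
# df = pd.read_csv('my_data.csv', sep=';;', engine='python')
print(df.head())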

How to give double quotes to a column with strings that have commas in CSV

I have a CSV file that has a column of strings containing commas inside the string. If I want to read the CSV using pandas, it sees the extra commas as extra columns, which gives me an error about having more fields than expected. I thought of using double quotes around the strings as a solution to the problem.
This is how the CSV currently looks:
lead,Chat.Event,Role,Data,chatid
lead,x,Lead,Hello, how are you,1
This is how it should look:
lead,Chat.Event,Role,Data,chatid
lead,x,Lead,"Hello, how are you",1
Is using double quotes around the strings the best solution? If yes, how do I do that? And if not, what other solution can you recommend?
If you still have the original file/database from which you generated the CSV, you should export it again using a different kind of separator (the default is the comma), one which would not occur within your strings, such as "|" (vertical bar).
Then, when reading the CSV with pandas, you can just pass the argument:
pd.read_csv(file_path, sep="your separator symbol here")
Hope that helps.
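Alternatively, if you regenerate the file with Python itself, the csv module produces the quoted form from your example automatically; a sketch, with a hypothetical file name:

import csv
import pandas as pd

# QUOTE_MINIMAL is the default for csv.writer: fields containing the
# delimiter are quoted on the way out, everything else is left alone.
with open('chats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['lead', 'Chat.Event', 'Role', 'Data', 'chatid'])
    writer.writerow(['lead', 'x', 'Lead', 'Hello, how are you', 1])
# the file now contains: lead,x,Lead,"Hello, how are you",1

df = pd.read_csv('chats.csv')  # parses into the expected 5 columns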

Tuple and CSV Reader in Python

Attempting something relatively simple.
First, I have dictionary with tuples as keys as follows:
(0,1,1,0): "Index 1"
I'm reading in a CSV file which has a corresponding set of fields with various combinations of those zeroes and ones. So, for example, a row in the CSV may read 0,1,1,0 without any quoting. I'm trying to match the combination of zeroes and ones in the file to the keys of the dictionary, using the standard csv module for this.
However, the issue is that the zeroes and ones are being read in as strings with single quotes rather than integers. In other words, the tuple created from each row is structured as ('0','1','1','0'), which does not match (0,1,1,0).
Can anyone shed some light on how to bring the CSV in and remove the single quotes? Tuple matching and CSV reading seem to work -- just need to straighten out the data format.
Thanks!
tuple(int(x) for x in ('0','1','1','0')) # returns (0,1,1,0)
So, if your CSV reader object is called csv_reader, you just need a loop like this:
for row in csv_reader:
    tup = tuple(int(x) for x in row)  # ('0', '1', '1', '0') -> (0, 1, 1, 0)
    # ... tup now matches keys like (0, 1, 1, 0)
When you read in the CSV file, depending on which library you're using, you can specify the delimiter.
Typically the comma is interpreted as the delimiter, but you can specify it to be something else, e.g. '-', so that the set of digits is read together as a single string, which you can then convert to a tuple using a variety of methods, such as ast.literal_eval, mentioned in "converting string to tuple".
Hope that helps!
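A sketch of that route, assuming a hypothetical rows.csv and a '|' delimiter that never appears in the data:

import ast
import csv

with open('rows.csv', newline='') as f:
    for row in csv.reader(f, delimiter='|'):
        # the whole line arrives as one string, e.g. "0,1,1,0",
        # and literal_eval parses it as a tuple of ints
        tup = ast.literal_eval(row[0])  # -> (0, 1, 1, 0)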

Python CSV module - quotes go missing

I have a CSV file that has data like this
15,"I",2,41301888,"BYRNESS RAW","","BYRNESS VILLAGE","NORTHUMBERLAND","ENG"
11,"I",3,41350101,2,2935,2,2008-01-09,1,8,0,2003-02-01,,2009-12-22,2003-02-11,377016.00,601912.00,377105.00,602354.00,10
I am reading this and then writing different rows to different CSV files.
However, in the original data there are quotes around the non-numeric fields, as some of them contain commas within the field.
I am not able to keep the quotes.
I have researched a lot and discovered quoting=csv.QUOTE_NONNUMERIC; however, this now results in a quote mark around every field, and I don't know why.
If I try one of the other quoting options, like MINIMAL, I end up with an error message regarding the date value, 2008-01-09, not being a float.
I have tried to create a dialect and add the quoting on the csv reader and writer, but nothing I have tried results in an exact match to the original data.
Has anyone had this same problem and found a solution?
When writing, quoting=csv.QUOTE_NONNUMERIC keeps values unquoted as long as they're numbers, i.e. if their type is int or float (for example), which means it will write what you expect.
Your problem could be that, when reading, a csv.reader will turn every row it reads into a list of strings (if you read the documentation carefully enough, you'll see that a reader does not perform automatic data type conversion!).
If you don't perform any kind of conversion after reading, then when you write you'll end up with everything in quotes... because everything you write is a string.
Edit: of course, date fields will be quoted too, because they are not numbers, meaning you cannot get the exact expected behaviour using the standard csv.writer.
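A sketch of that conversion step (file names are hypothetical): coerce numeric-looking fields back to int or float before writing, and QUOTE_NONNUMERIC will then quote only the genuine strings, dates included, as noted above.

import csv

def convert(field):
    # try int first, then float; anything else stays a string
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field

with open('in.csv', newline='') as src, \
     open('out.csv', 'w', newline='') as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_NONNUMERIC)
    for row in csv.reader(src):
        writer.writerow([convert(f) for f in row])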
Are you sure you have a problem? The behavior you're describing is correct: The csv module will enclose strings in quotes only if it's necessary for parsing them correctly. So you should expect to see quotes only around strings containing a comma, newlines, etc. Unless you're getting errors reading your output back in, there is no problem.
Trying to get an "exact match" of the original data is a difficult and potentially fruitless endeavor. quoting=csv.QUOTE_NONNUMERIC put quotes around everything because every field was a string when you read it in.
Your concern that some of the "quoted" input fields could have commas is usually not that big a deal. If you added a comma to one of your quoted fields and used the default writer, the field with the comma would be automatically quoted in the output.
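A quick check of that claim with the default writer:

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['BYRNESS, RAW', 15, 'ENG'])
print(buf.getvalue())  # "BYRNESS, RAW",15,ENG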

Wildcards in a Search and Replace Dictionary

I am fairly new to Python, so if my terminology is wrong I apologize. I am running Python 2.6.5; I am not sure about updating to 3.0, since Python was initially installed with my spatial analysis software.
I am writing a program to search and replace column headers in multiple comma-delimited text files. Since there are over a hundred headers and they are the same in all the files, I decided to create a dictionary and 'pickle' it to save all the replacements (I got the idea from reading other posts). My issue comes in when I noticed there are tabs and spaces within the text file column headings, for example:
..."Prev Roll #: ","Prev Prime/Sub","Frontage : ","Depth : ","Area : ","Unit of Measure : ",...
So I thought, why not just stick a wildcard at the end of my key term so the search will match it no matter how many spaces divide the name and the colon? I was trying the * wildcard, but it doesn't work; when I run it, no matches/replacements are made. Am I using the wildcard correctly? Is what I'm trying to do even possible? Or should I do away with the dictionary pickle?
Below is a sample of what I'm trying to do:
import cPickle as pickle
general_D = { ....
    "Prev Prime/Sub" : "PrvPrimeSub",
    "Frontage*" : "Frontage",
    "Depth*" : "Depth",
    "Area*" : "Area",
    "Unit of Measure*" : "UnitMeasure",
Thanks for the input!
Use the csv module to parse and write your comma-separated data.
Use the string strip() method to remove unwanted spaces and tabs (see the sketch after this list).
Do not include * in your dict key names. They will not glob as you hope; they just represent literal *s there.
It is probably better to use json instead of pickle. JSON is human-readable and independent of programming language; pickle may have problems even across different versions of Python.
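A minimal sketch of the strip() approach; the file name and the dictionary entries are hypothetical stand-ins for your own:

import csv

general_D = {
    "Prev Prime/Sub": "PrvPrimeSub",
    "Frontage": "Frontage",
    "Unit of Measure": "UnitMeasure",
}

def normalize(heading):
    # "Unit of Measure : " -> "Unit of Measure"
    return heading.strip().rstrip(":").strip(" \t")

with open("parcels.txt") as f:
    rows = list(csv.reader(f))
rows[0] = [general_D.get(normalize(h), h) for h in rows[0]]

With the headings normalized before lookup, the dictionary keys can stay plain strings with no wildcards at all.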
