Malformed CSV quoting - python

I pass data from SAS to Python in CSV format and have a problem with the quoting format SAS uses. Strings like "480 КЖИ" ОАО aren't quoted, but the Python csv module treats them as if they were.
import csv

dat = ['18CA4,"480 КЖИ" ОАО', '1142F,"""Росдорлизинг"" Российская дор,лизинг,компания"" ОАО"']
for i in csv.reader(dat):
    print(i)
>>['18CA4', '480 КЖИ ОАО']
>>['1142F', '"Росдорлизинг" Российская дор,лизинг,компания" ОАО']
The second string is fine, but I need the 480 КЖИ ОАО string to come out as "480 КЖИ" ОАО. I can't find an option for this in the csv module. Maybe it's possible to force proc export to quote all " characters?
UPD: Here's a similar problem Python CSV : field containing quotation mark at the beginning
UPD2: @Quentin asked for details. Here they are: I have SAS 8.2 connected to a 9.1 server. I download custom format data from the server side with proc format cntlout=..; proc download..., so I get a dictionary-like dataset of <key>, <value> pairs. Then I pass this dataset in CSV format to Python using proc export via the DDE interface. But as I understand it, proc export quotes only the strings that contain the delimiter (comma). So I think I either need SAS to quote the quotation marks too, or Python to unquote only those strings that contain commas.
UPDATE: switching from proc export via DDE to reading the dataset directly with a modified SAS7BDAT Python module hugely improved performance, and it also got rid of the problem above.

SAS will add extra quotes if the value has quotes in it already.
data _null_;
  file log dsd;
  string='"480 КЖИ" ОАО';
  put string;
run;
Generates this result:
"""480 КЖИ"" ОАО"
Perhaps the quotes are being removed at some other point in the flow from SAS to Python? Try saving the CSV file to disk and having Python read from the disk file.
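For reference, a quick check (just a sketch) that Python's csv module parses the doubled-quote style SAS produces back into the original value:

import csv

# the string below uses the quoting style shown in the data _null_ output above
row = next(csv.reader(['18CA4,"""480 КЖИ"" ОАО"']))
print(row)   # ['18CA4', '"480 КЖИ" ОАО']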

Related

Convert Powershell data clean up code to Python Code

I would like to convert the PowerShell script below to Python code.
Here is the objective of this code:
The code takes in the name of a comma-delimited file and its file extension.
The code below exports the file as a pipe-delimited file.
Then it removes commas that exist within the data.
Finally, it also removes the double quotes used to qualify the data.
This results in the final file being pipe-delimited, with no double quotes or commas in the data. I used this order because if you try to replace the double quotes and commas before establishing the pipes, the columns and data break.
Param([string]$RootString, [string]$Ext)
$OrgFile = $RootString
$NewFile = $RootString.replace($Ext,"out")
Import-Csv $OrgFile -Encoding UTF8 | Export-Csv tempfile.csv -Delimiter "|" -NoTypeInformation
(Get-Content tempfile.csv).Replace(",","").Replace('"',"") | Out-File $NewFile -Encoding UTF8
Copy-Item -Force $NewFile $OrgFile
Remove-Item -Path $NewFile -Force
I got dinged a point for this, but I didn't see the point in posting bad code that doesn't work. Here is my version of the non-working code anyway:
import re
from datetime import datetime

# dfcsv is a pandas DataFrame with a 'csvpath' column (defined elsewhere)
i = 0
for index in range(len(dfcsv)):
    filename = dfcsv['csvpath'].iloc[index]
    print(filename)
    print(i)
    with open(filename, 'r+') as f:
        text = f.read()
        print(datetime.now())
        # naive find/replace: turn the "," separators into pipes
        text = re.sub('","', '|', text)
        print(datetime.now())
        f.seek(0)
        f.write(text)
        f.truncate()
    i = i + 1
The issue with this code is the find-and-replace approach. It created an extra column at the beginning because of the leading double quote, and sometimes an extra column at the end when there was a trailing double quote, which caused data from different rows to merge together. I didn't post this part at first because I didn't think it was necessary for my objective; posting the working PowerShell code seemed more relevant for giving a better idea of the goal.
It seems no one here wanted to answer the question, so I found a solution elsewhere. Here is the link for anyone needing to convert a comma-delimited file to pipe-delimited:
https://www.experts-exchange.com/questions/29188372/How-can-I-Convert-Powershell-data-clean-up-code-to-Python-Code.html
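For completeness, here is a minimal sketch of the same clean-up in pure Python using the csv module. The file names are hypothetical placeholders, and it assumes no field legitimately contains a pipe character:

import csv

src, dst = "input.csv", "output.out"   # hypothetical names; substitute your own paths

with open(src, newline="", encoding="utf-8") as fin, \
        open(dst, "w", newline="", encoding="utf-8") as fout:
    reader = csv.reader(fin)   # parses the quoted, comma-containing fields correctly
    writer = csv.writer(fout, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        # strip embedded commas and double quotes from each field, then write pipe-delimited
        writer.writerow([field.replace(",", "").replace('"', "") for field in row])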

Which newline character is in my CSV?

We receive a .tar.gz file from a client every day and I am rewriting our import process using SSIS. One of the first steps in my process is to unzip the .tar.gz file which I achieve via a Python script.
After unzipping we are left with a number of CSV files which I then import into SQL Server. As an aside, I am loading using the CozyRoc DataFlow Task Plus.
Most of my CSV files load without issue, but I have five files which fail. By reading the log I can see that the process is reading the header and first line as though there is no HeaderRow Delimiter (i.e. it is trying to import the column header and first value as ColumnHeader1ColumnValue1).
I took one of these CSVs, copied the top 5 rows into Excel, used Text-To-Columns to delimit the data, then saved that as a new CSV file.
This version imported successfully.
That makes me think that somehow the original CSV isn't using {CR}{LF} as the row delimiter, but I don't know how to check. Any suggestions?
I ended up using the suggestion commented by @vahdet because I already had Notepad++ installed. I can't find the same option in EmEditor, but it may exist.
For those who are curious, the files are using {LF}, which is consistent with the other files. My investigation continues...
Since you have EmEditor, you can use it to find the EOL character in two ways:
Use View > Character Code Value... at the end of a line to display a dialog box showing information about the character at the current position.
Go to View > Marks and turn on Newline Characters and CR and LF with Different Marks to show the EOL while editing. LF is displayed with a down arrow, while CRLF is shown as a right angle.
Some other things you could try checking for are the file encoding, the wrong type of data for a field, and an inconsistent number of columns.
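If you'd rather check programmatically (the import process already uses a Python script for the unzip step), a minimal sketch along these lines will also reveal the row terminator; the file name is a placeholder:

with open("suspect_file.csv", "rb") as f:
    sample = f.read(1024 * 1024)          # the first 1 MB is usually enough

crlf = sample.count(b"\r\n")
lf = sample.count(b"\n") - crlf           # bare LF, not part of a CRLF pair
cr = sample.count(b"\r") - crlf           # bare CR, not part of a CRLF pair
print("CRLF:", crlf, "LF:", lf, "CR:", cr)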

Writing out text with double double quotes - Python on Linux

I'm trying to take the text output of a query to an SSD (pulling a log page, similar to pulling SMART data). I'm then trying to write this text data out to a log file that I update periodically.
My problem happens when the log data for some drives has double double quotes ("") as a placeholder for a blank field. Here is a snippet of the input:
VER 0x10200
VID 0x15b7
BoardRev 0x0
BootLoadRev ""
When this gets written out (appended) to my own log file, the text gets replaced with several null characters, and when I try to open the file, all the text editors tell me it's corrupted.
The "" characters are replaced by something like this on my Linux system:
BootLoadRev "\00\00\00\00"
Some fields are even longer with the \00 characters. If the "" is not there, things write out OK.
The code is similar to this:
f=open(fileName, 'w')
test_bench.send_command('get_log_page')
identify_data = test_bench.get_data_in()
f.write(identify_data)
f.close()
Is there a way to send this text to a file w/o these nulls causing problems?
Assuming that this is Python 2 (and that your content is thus what Python 3 would call a bytestring), and that your intended data format is raw ASCII, the trivial solution is simply to remove the NULs from your content before you write to disk:
f.write(identify_data.replace('\0', ''))
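If the same code ever runs under Python 3 and get_data_in() returns bytes, the same idea applies with byte literals; a sketch, assuming you want to append to the existing log as described:

# assumes identify_data is a bytes object under Python 3
cleaned = identify_data.replace(b"\x00", b"").decode("ascii", errors="replace")
with open(fileName, "a") as f:   # 'a' appends instead of overwriting
    f.write(cleaned)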

Cannot read in files

I have a small problem with reading in my file. My code:
import csv as csv
import numpy
with open("train_data.csv","rb") as training:
csv_file_object = csv.reader(training)
header = csv_file_object.next()
data = []
for row in csv_file_object:
data.append(row)
data = numpy.array(data)
I get the error "no such file train_data.csv", so I know the problem lies with the location. But whenever I specify the path like this: open("C:\Desktop...etc), it doesn't work either. What am I doing wrong?
If you give the full file path, your script should work. Since it is not working, it must be that you have escape characters in your path. To fix this, use a raw string to specify the file path:
# Put an 'r' at the start of the string to make it a raw-string.
with open(r"C:\path\to\file\train_data.csv","rb") as training:
Raw strings do not process escape characters.
Also, just a technical fact, not giving the full file path causes Python to look for the file in the directory that the script is launched from. If it is not there, an error is thrown.
When you use open() on Windows, you need to deal with the backslashes properly.
Option 1.) Use a raw string; this is the string prefixed with an r.
open(r'C:\Users\Me\Desktop\train_data.csv')
Option 2.) Escape the backslashes
open('C:\\Users\\Me\\Desktop\\train_data.csv')
Option 3.) Use forward slashes
open('C:/Users/Me/Desktop/train_data.csv')
As for finding the file you are using: if you just do open('train_data.csv'), it looks in the directory you are running the Python script from. So, if you are running it from C:\Users\Me\Desktop\, your train_data.csv needs to be on the Desktop as well.
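To see which directory that actually is at run time, a quick two-line check (a sketch) can help:

import os

print(os.getcwd())                           # the directory open('train_data.csv') is resolved against
print(os.path.exists("train_data.csv"))      # True only if the file is in that directory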

Convert CSV to mongoimport-friendly JSON using Python

I have a 300 MB CSV with 3 million rows of city information from Geonames.org. I am trying to convert this CSV into JSON to import into MongoDB with mongoimport. The reason I want JSON is that it allows me to specify the "loc" field as an array rather than a string, for use with the geospatial index. The CSV is encoded in UTF-8.
A snippet of my CSV looks like this:
"geonameid","name","asciiname","alternatenames","loc","feature_class","feature_code","country_code","cc2","admin1_code","admin2_code","admin3_code","admin4_code"
3,"Zamīn Sūkhteh","Zamin Sukhteh","Zamin Sukhteh,Zamīn Sūkhteh","[48.91667,32.48333]","P","PPL","IR",,"15",,,
5,"Yekāhī","Yekahi","Yekahi,Yekāhī","[48.9,32.5]","P","PPL","IR",,"15",,,
7,"Tarvīḩ ‘Adāī","Tarvih `Adai","Tarvih `Adai,Tarvīḩ ‘Adāī","[48.2,32.1]","P","PPL","IR",,"15",,,
The desired JSON output (except the charset) that works with mongoimport is below:
{"geonameid":3,"name":"Zamin Sukhteh","asciiname":"Zamin Sukhteh","alternatenames":"Zamin Sukhteh,Zamin Sukhteh","loc":[48.91667,32.48333] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":5,"name":"Yekahi","asciiname":"Yekahi","alternatenames":"Yekahi,Yekahi","loc":[48.9,32.5] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":7,"name":"Tarvi? ‘Adai","asciiname":"Tarvih `Adai","alternatenames":"Tarvih `Adai,Tarvi? ‘Adai","loc":[48.2,32.1] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
I have tried all the available online CSV-to-JSON converters and they do not work because of the file size. The closest I got was with Mr Data Converter, whose output would import into MongoDB after removing the start and end brackets and the commas between documents. Unfortunately, that tool doesn't work with a 300 MB file.
The JSON above is set to be encoded in UTF-8 but still has charset problems, most likely due to a conversion error?
I spent the last three days learning Python, trying to use Python csvkit, trying all the CSV-to-JSON scripts on Stack Overflow, importing the CSV into MongoDB and changing the "loc" string to an array (this unfortunately retains the quotation marks), and even trying to manually copy and paste 30,000 records at a time. A lot of reverse engineering, trial and error, and so forth.
Does anyone have a clue how to achieve the JSON above while keeping the encoding proper like in the CSV above? I am at a complete standstill.
Python standard library (plus simplejson for decimal encoding support) has all you need:
import csv, simplejson, decimal, codecs

data = open("in.csv")
reader = csv.DictReader(data, delimiter=",", quotechar='"')

with codecs.open("out.json", "w", encoding="utf-8") as out:
    for r in reader:
        for k, v in r.items():
            # make sure nulls are generated
            if not v:
                r[k] = None
            # parse and generate decimal arrays
            elif k == "loc":
                r[k] = [decimal.Decimal(n) for n in v.strip("[]").split(",")]
            # generate a number
            elif k == "geonameid":
                r[k] = int(v)
        out.write(simplejson.dumps(r, ensure_ascii=False, use_decimal=True)+"\n")
Where "in.csv" contains your big csv file. The above code is tested as working on Python 2.6 & 2.7, with about 100MB csv file, producing a properly encoded UTF-8 file. Without surrounding brackets, array quoting nor comma delimiters, as requested.
It is also worth noting that passing both the ensure_ascii and use_decimal parameters is required for the encoding to work properly (in this case).
Finally, since it is based on simplejson, the Python stdlib json package will also gain decimal encoding support sooner or later, so eventually only the stdlib will be needed.
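For what it's worth, on Python 3 the stdlib is already enough if float precision is acceptable for the loc values; a minimal sketch along the same lines as the code above (file names reused from the example, not tested against the full Geonames dump):

import csv, json

with open("in.csv", newline="", encoding="utf-8") as data, \
        open("out.json", "w", encoding="utf-8") as out:
    reader = csv.DictReader(data, delimiter=",", quotechar='"')
    for r in reader:
        for k, v in r.items():
            if not v:
                r[k] = None       # empty fields become null
            elif k == "loc":
                # float() loses the exact decimal representation that decimal.Decimal keeps
                r[k] = [float(n) for n in v.strip("[]").split(",")]
            elif k == "geonameid":
                r[k] = int(v)
        out.write(json.dumps(r, ensure_ascii=False) + "\n")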
Maybe you could try importing the CSV directly into MongoDB using:
mongoimport -d <dB> -c <collection> --type csv --file location.csv --headerline
