Inconsistent quotes on .csv file - python

I have a comma-delimited file which also contains commas in the actual field values, something like this:
foo,bar,"foo, bar"
This file is very large so I am wondering if there is a way in Python to either put double quotes around every field:
eg: "foo","bar","foo, bar"
or just change the delimiter overall?
eg: foo|bar|foo, bar
End goal:
The goal is to ultimately load this file into SQL Server. Given the size of the file, bulk insert is the only feasible approach for loading, but I cannot specify a text qualifier/field quote due to the version of SSMS I have.
This leads me to believe the only remaining approach is to do some preprocessing on the source file.

Changing the delimiter just requires parsing and re-encoding the data.
with open("data.csv") as input, open("new_data.csv", "w") as output:
r = csv.reader(input, delimiter=",", quotechar='"')
w = csv.writer(output, delimiter="|")
w.writerows(r)
Given that your input file is a fairly standard version of CSV, you don't even need to specify the delimiter and quote arguments to reader; the defaults will suffice.
r = csv.reader(input_file)
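If you would rather keep the comma delimiter and instead quote every field (your first option), the writer's quoting argument handles that. A minimal sketch:
import csv

with open("data.csv", newline="") as input_file, open("quoted_data.csv", "w", newline="") as output_file:
    r = csv.reader(input_file)
    # QUOTE_ALL wraps every field in double quotes on output
    w = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    w.writerows(r)
This turns foo,bar,"foo, bar" into "foo","bar","foo, bar".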

That is not inconsistent quoting. If a value in a CSV file contains a comma or a newline, quotes are added around it. It shouldn't be a problem, because all standard CSV readers can read it properly.
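For example, Python's own csv module parses the sample line correctly as-is:
import csv
from io import StringIO

print(next(csv.reader(StringIO('foo,bar,"foo, bar"'))))
# ['foo', 'bar', 'foo, bar']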

Python 3.8.5 alternative to .replace with csv.reader and UTF-8 mystery encodings

I have spent 5 hours scouring the dark recesses of SO, so I am posting this question as a last resort, and I am genuinely hoping someone can point me in the right direction here:
Scenario:
I have some .csv files (UTF-8 CSVs: verified with the file -I command) from Google surveys that are in multiple languages. Output:
download.csv: application/csv; charset=utf-8
I have a "dictionary" file that has the translations for the questions and answers (one column is the $language and the other is English).
There are LOTS of special characters (umlauts, French accented letters, etc.) in the data from Google, because the responses are in French, German, Dutch, and so on.
The dictionary file I built reads fine as UTF-8 including special characters and creates the find/replace keys accurately (verified with print commands)
The issue is that the Google files only read correctly (maintain the proper characters) using csv.reader in Python. However, the rows it yields have no .replace method, so I can do one or the other:
read in the source file, make no replacements, and get a perfect copy (not what I need)
convert the csv files/rows to a fileinput/string (UTF-8 still, mind) and get an utterly thrashed output file with missing replacements, because the data "loses" the encoding between the csv read and the string somehow?
The code (here) comes closest to working, except there is no .replace method on csv.reader:
import csv

# set source, output
source = 'fr_to_trans.csv'
output = 'fr_translated.csv'
dictionary = 'frtrans.csv'

find = []
replace = []

# build the dictionary itself:
with open(dictionary, encoding='utf-8') as dict_file:
    for line in dict_file:
        #print(line)
        temp_split = line.split(',')
        if "!!" in temp_split[0]:
            temp_split[0] = temp_split[0].replace("!!", ",")
        find.append(temp_split[0])
        if "!!" in temp_split[1]:
            temp_split[1] = temp_split[1].replace("!!", ",")
        replace.append(temp_split[1])
#print(len(find))
#print(len(replace))

# set loop counters
check_each = len(find)

# Read in the file to parse
with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    output_writer = csv.writer(t_file)
    for row in csv.reader(s_file):
        the_row = row
        print(the_row)  # THIS RETURNS THE CORRECT, FORMATTED, UTF-8 DATA
        i = 0
        # find and replace everything in the find array with its value in the replace array
        while i < check_each:
            print(find[i])
            print(replace[i])
            # THIS LINE DOES NOT WORK:
            the_row = the_row.replace(find[i], replace[i])
            i = i + 1
        output_writer.writerow(the_row)
I have to assume that even though the Google files say they are UTF-8, they are a special "Google branded UTF-8" or some such nonsense. The fact that the file opens correctly with csv.reader, but then you can do nothing to it is infuriating beyond measure.
Just to clarify what I have tried:
Treat files as text and let Python sort out the encoding (fails)
Treat files as UTF-8 text (fails)
Open file as UTF-8, replace strings, and write out using the csv.writer (fails)
Convert the_row to a string, then replace, then write out with csv.writer (fails)
Quick edit - tried utf-8-sig with strings - better, but the output is still totally mangled, because it isn't being read as a csv, but as strings
I have not tried:
"cell by cell" comparison instead of the whole row (working on that while this percolates on SO)
Different encoding of the file (I can only get UTF-8 CSVs so would need some sort of utility?)
If these were ASCII text I would have been done ages ago, but this whole "UTF-8 that isn't but is" thing is driving me mad. Anyone got any ideas on this?
Each row yielded by csv.reader is a list of cell values like
['42', 'spam', 'eggs']
Thus the line
# THIS LINE DOES NOT WORK:
the_row = the_row.replace(find[i], replace[i])
cannot possibly work, because lists don't have a replace method.
What might work is to iterate over the row list and find/replace on each cell value (I'm assuming they are all strings):
the_row = [cell.replace(find[i], replace[i]) for cell in the_row]
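Folded back into the original loop, the cell-by-cell approach might look like this (a sketch, assuming source, output, find, and replace are set up as in the question):
with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8', newline='') as t_file:
    output_writer = csv.writer(t_file)
    for row in csv.reader(s_file):
        # apply every find/replace pair to every cell in the row
        for old, new in zip(find, replace):
            row = [cell.replace(old, new) for cell in row]
        output_writer.writerow(row)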
However, if all you want to do is replace all instances of some characters in the file with some other characters then it's simpler to open the file as a text file and replace without invoking any csv machinery:
with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    text = s_file.read()
    for old, new in zip(find, replace):
        text = text.replace(old, new)
    t_file.write(text)
If the find/replace mapping is the same for all files and every find string is a single character, you can use str.translate to avoid the for loop (note that str.maketrans only accepts single-character keys).
# Make a reusable translation table (the keys must be single characters)
trans_table = str.maketrans(dict(zip(find, replace)))

with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    text = s_file.read()
    text = text.translate(trans_table)
    t_file.write(text)
For clarity: CSVs are text files, only formatted so that their contents can be interpreted as rows and columns. If you want to manipulate their contents as pure text, it's fine to edit them as normal text files: as long as you don't change any of the characters used as delimiters or quote marks, they will still be usable as CSVs when you want to use them as such.

Read a csv into pandas that has commas *within* first/index cells of the csv rows without changing value

Ok, I get this error...:
"pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 12, saw 7"
...when trying to import a csv into a python script with pandas.read_csv():
path,Drawing_but_no_F5,Paralell_F5,Fixed,Needs_Attention,Errors
R:\13xx Original Ranch Buildings\1301 Stonehouse\1301-015\F - Bid Documents and Contract Award,Yes,No,No,No,No
R:\13xx Original Ranch Buildings\1302 Carriage House\1302-026A Carriage House, Redo North Side Landscape\F - Bid Document and Contract Award,Yes,No,No,No,No
R:\13xx Original Ranch Buildings\1302 Carriage House\1302-028\F - Bid Documents and Contract Award,Yes,No,No,No,No
R:\13xx Original Ranch Buildings\1302 Carriage House\1302-029\F - Bid Documents and Contract Award,Yes,No,No,No,No
Obviously, in the above entries, it is the third line that throws the error. Caveats include that I have to use that column as a path to process files there so changing the entry is not allowed. CSV is created elsewhere; I get it as-is.
I do want to preserve the column header.
This filepath column is used later as an index, so I would like to preserve that.
Many, many similar issues, but solutions seem very specific and I cannot get them to cooperate for my use case:
Pandas, read CSV ignoring extra commas
Solutions seem to change entry values or rely on the cells being in the last column
Commas within CSV Data
Solution involves sql tools methinks. I don't want to read the csv into sql tables...
The csv file is already delimited by commas, so I don't think changing the sep value will work.. (I cannot get it to work -- yet)
Problems reading CSV file with commas and characters in pandas
Solution throws an error at "for line in reader:": "_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)"
Not too optimistic since op had the cell value in quotes whereas I do not.
Here is a solution which is a minor modification of the accepted answer by @DSM in the last thread to which you linked (Problems reading CSV file with commas and characters in pandas).
import csv

with open('original.csv', 'r') as infile, open('fixed.csv', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for line in reader:
        # rejoin everything left of the five good rightmost columns into one field
        newline = [','.join(line[:-5])] + line[-5:]
        writer.writerow(newline)
After running the above preprocessing code, you should be able to read fixed.csv using pd.read_csv().
This solution depends on knowing how many of the rightmost columns are always formatted correctly. In your example data, the rightmost five columns are always good, so we treat everything to the left of these columns as a single field, which csv.writer() wraps in double quotes.
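After that, reading the fixed file back with the filepath column as the index might look like this (assuming the header row from the sample data, where the first column is named path):
import pandas as pd

# 'path' is the header of the filepath column in the sample data
df = pd.read_csv('fixed.csv', index_col='path')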

How can I use python to change the delimiter of a csv file while also stripping the fields of the new delimiter?

I receive a well-formatted csv file, with double quotes around text fields that contain commas.
Alas, I need to load it into SQL Server, which, as far as I have learned (please tell me how I am wrong here), cannot handle quote-enclosed fields that contain the delimiter.
So, I would like to write a python script which will a) convert the file to pipe-delimited, and b) strip whatever pipes exist in the fields (my sense is that commas are more common, so I'd like to save them, plus I also have some numeric fields that might, at least in the future, contain commas).
Here is the code that I have to do a):
import csv
import sys

source_file = sys.argv[1]
good_file = sys.argv[2]
bad_file = sys.argv[3]

with open(source_file, 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    with open(good_file, 'w') as new_file:
        csv_writer = csv.DictWriter(new_file, csv_reader.fieldnames, delimiter='|')
        headers = dict((n, n) for n in csv_reader.fieldnames)
        csv_writer.writerow(headers)
        for line in csv_reader:
            csv_writer.writerow(str.replace(line, '|', ' '))
How can I augment it to do b?
ps--I am using python 2.6, IIRC.
SQL Server can load the type of file you describe. The file can most certainly be loaded with an SSIS package and can also be loaded with the SQL Server bcp utility. Writing a Python script would not be the way to go (introducing another technology into the mix when not needed... just imho). SQL Server is equipped to handle exactly what you are wanting to do.
SSIS is pretty straightforward.
For BCP, you'll need to not use the -t option (which specifies a single field terminator for the entire file) and instead use a format file. With a format file, you can customize each field's terminator. For the fields that are quoted, you'll want to use a custom delimiter. See this post, or many others like it, that detail how to use BCP with delimited files and quoted fields to hide delimiters that might appear in the data.
SQL Server BCP Export where comma in SQL field
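That said, if the pure-Python preprocessing route is still wanted, part b) of the question - stripping pipes that already exist in the field values - can be handled per-field before writing. A minimal sketch (Python 3 syntax, so it would need adapting for the 2.6 environment mentioned in the question):
import csv
import sys

source_file, good_file = sys.argv[1], sys.argv[2]

with open(source_file, 'r', newline='') as csv_file, open(good_file, 'w', newline='') as new_file:
    csv_reader = csv.DictReader(csv_file)
    csv_writer = csv.DictWriter(new_file, csv_reader.fieldnames, delimiter='|')
    csv_writer.writeheader()
    for row in csv_reader:
        # replace any pipes inside the values so they cannot be mistaken for delimiters
        csv_writer.writerow({k: v.replace('|', ' ') for k, v in row.items()})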

Importing file format similar to csv file with | delimiters into Python

I have a data format that appears similar to a csv file, but has vertical bars between character strings and not between Boolean fields. For example:
|2000|,|code_no|,|first name, last name|,,,0,|word string|,0
|2000|,|code_no|,|full name|,,,0,|word string|,0
I'm not sure what format this is (it is saved as a txt file). What format is this, and how would I import it into Python?
For reference, I had been trying to use:
import unicodecsv

with open(csv_file, 'rb') as f:
    r = unicodecsv.reader(f)
And then stripping out the | from the start and end of the fields. This works OK, with the exception of fields which have a comma in them (e.g. |first name, last name|), where the field gets split because of the comma.
It looks like the pipes are being used as quote characters, not delimiters. Have you tried initializing the reader to use pipe ('|') as the quote character, and perhaps to use csv.QUOTE_NONNUMERIC as the quoting rules?
csv.reader(f, quotechar='|', quoting=csv.QUOTE_NONNUMERIC)
Have you tried .reader(f, delimiter=',', quotechar='|') ?
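With the sample data, quotechar='|' alone parses cleanly. Note that adding quoting=csv.QUOTE_NONNUMERIC to the reader would actually fail here, since it tries to convert the unquoted empty fields to float. A quick check:
import csv
from io import StringIO

data = '|2000|,|code_no|,|first name, last name|,,,0,|word string|,0\n'

for row in csv.reader(StringIO(data), delimiter=',', quotechar='|'):
    print(row)
# ['2000', 'code_no', 'first name, last name', '', '', '0', 'word string', '0']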

CSV writing strings of text that need a unique delimiter

I wrote an HTML parser in Python that extracts data to look like this in a csv file:
itemA, itemB, itemC, Sentence that might contain commas, or colons: like this,\n
so I used the delimiter ":::::", thinking that it would never appear in the data:
itemA, itemB, itemC, ::::: Sentence that might contain commas, or colons: like this,::::\n
This works for most of the thousands of lines; however, apparently a colon : threw this off when I imported the csv into Calc.
My question is, what is the best or a unique delimiter to use when creating a csv with many variations of sentences that need to be separated with some delimiter? Am I understanding delimiters correctly in that they separate the values within a CSV?
As I suggested informally in a comment, unique just means you need to use some character that won't be in the data — chr(255) might be a good choice. For example:
Note: The code shown is for Python 2.x; a Python 3 adaptation is sketched at the end of this answer.
import csv

DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]

with open('data.csv', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)

with open('data.csv', 'rb') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print row
Output:
['itemA', 'itemB', 'itemC', 'Sentence that might contain commas, colons: or even "quotes".']
If you're not using the csv module and instead are writing and/or reading the data manually, then it would go something like this:
with open('data.csv', 'wb') as outfile:
    outfile.write(DELIMITER.join(data) + '\n')

with open('data.csv', 'rb') as infile:
    row = infile.readline().rstrip().split(DELIMITER)
    print row
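For what it's worth, a Python 3 version of the csv-module example might look like this (the binary-mode opens become text mode with newline=''):
import csv

DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]

with open('data.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)

with open('data.csv', newline='') as infile:
    for row in csv.reader(infile, delimiter=DELIMITER):
        print(row)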
Yes, delimiters separate values within each line of a CSV file. There are two strategies to delimiting text that has a lot of punctuation marks. First, you can quote the values, e.g.:
Value 1, Value 2, "This value has a comma, <- right there", Value 4
The second strategy is to use tabs (i.e., '\t').
Python's built-in CSV module can both read and write CSV files that use quotes. Check out the example code under the csv.reader function. The built-in csv module will handle quotes correctly, e.g. it will escape quotes that are in the value itself.
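For instance, the default writer quotes a field only when it needs to; a small check of the first strategy:
import csv
import sys

row = ['Value 1', 'Value 2', 'This value has a comma, <- right there', 'Value 4']
csv.writer(sys.stdout).writerow(row)
# Value 1,Value 2,"This value has a comma, <- right there",Value 4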
CSV files usually use double quotes " to wrap long fields that might contain a field separator like a comma. If the field itself contains a double quote, it is normally escaped by doubling it (""), not with a backslash; backslash escaping only appears in non-standard dialects (Python's csv module controls this with the doublequote and escapechar options).
