Is it possible to specify the delimiting character the csv module will use when writing a file?
For example, instead of ',' use ';' or something?
I know you can switch to tab-delimited output by setting dialect='excel-tab', but I'm not sure if there is an option for freely choosing the delimiter.
Thanks
I believe you can just set the delimiter:
writer = csv.writer(csvfile, delimiter=';')
There's also an example of this in the documentation for csv.writer.
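To round it out, a minimal runnable sketch (the file name and sample data here are made up):

```python
import csv

# Sample rows; the field containing ';' will be quoted automatically.
rows = [["name", "city"],
        ["Alice", "Berlin"],
        ["Bob", "Paris; France"]]

with open("out.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter=";")
    writer.writerows(rows)

with open("out.csv", newline="") as csvfile:
    print(list(csv.reader(csvfile, delimiter=";")))
```

The writer quotes any field that itself contains the ';' delimiter, so the data round-trips cleanly.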
Related
I have a comma delimited file which also contains commas in the actual field values, something like this:
foo,bar,"foo, bar"
This file is very large, so I am wondering if there is a way in Python to either put double quotes around every field:
eg: "foo","bar","foo, bar"
or just change the delimiter overall?
eg: foo|bar|foo, bar
End goal:
The goal is to ultimately load this file into SQL Server. Given the size of the file, bulk insert is the only feasible approach for loading, but I cannot specify a text qualifier/field quote due to the version of SSMS I have.
This leads me to believe the only remaining approach is to do some preprocessing on the source file.
Changing the delimiter just requires parsing and re-encoding the data.
import csv

with open("data.csv") as input, open("new_data.csv", "w") as output:
    r = csv.reader(input, delimiter=",", quotechar='"')
    w = csv.writer(output, delimiter="|")
    w.writerows(r)
Given that your input file is a fairly standard version of CSV, you don't even need to specify the delimiter and quote arguments to reader; the defaults will suffice.
r = csv.reader(input)
Those are not inconsistent quotes. If a value in a CSV file contains a comma or a newline, quotes are added around it. It shouldn't be a problem, because all standard CSV readers can read it properly.
I receive a well formated csv file, with double-quotes around text fields that contain commas.
Alas, I need to load it into SQL Server, which, as far as I have learned (please tell me how I am wrong here), cannot handle quote-enclosed fields that contain the delimiter.
So, I would like to write a Python script which will a) convert the file to pipe-delimited, and b) strip whatever pipes exist in the fields (my sense is that commas are more common, so I'd like to keep them; plus I also have some numeric fields that might, at least in the future, contain commas).
Here is the code that I have to do a):
import csv
import sys
source_file=sys.argv[1]
good_file=sys.argv[2]
bad_file=sys.argv[3]
with open(source_file, 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    with open(good_file, 'w') as new_file:
        csv_writer = csv.DictWriter(new_file, csv_reader.fieldnames, delimiter='|')
        headers = dict((n, n) for n in csv_reader.fieldnames)
        csv_writer.writerow(headers)
        for line in csv_reader:
            csv_writer.writerow(line)
How can I augment it to do b?
PS: I am using Python 2.6, IIRC.
SQL Server can load the type of file you describe. The file can certainly be loaded with an SSIS package, and it can also be loaded with the SQL Server bcp utility. Writing a Python script would not be the way to go (it introduces another technology into the mix when it isn't needed... just IMHO). SQL Server is equipped to handle exactly what you are wanting to do.
SSIS is pretty straightforward.
For bcp, you'll need to skip the -t option (which specifies a single field terminator for the entire file) and instead use a format file. With a format file, you can customize each field's terminator. For the fields that are quoted, you'll want to use a custom delimiter. See this post, or many others like it, that detail how to use bcp and format files with quoted fields that hide delimiters appearing in the data.
SQL Server BCP Export where comma in SQL field
In my CSV file the data is separated by a special character. When I view it in Notepad++, it shows 'SOH'.
ATT_AI16601A.PV01-Apr-2014 05:02:192.94752310FalseFalseFalse
ATT_AI16601A.PV[]01-Apr-2014 05:02:19[]2.947523[]1[]0[]False[]False[]False[]
It is present in the data but not visible. I have put markers in the second string where those characters are.
My point is that I need to read that data in Python delimited by these markers. How can I use these special characters as delimiters while reading data?
You can use Python's csv module by specifying the delimiter, like this:
import csv
reader = csv.reader(file, delimiter='<your delimiter>')
In your case
reader = csv.reader(file, delimiter='\x01')
This is because SOH is an ASCII control character with a code point of 1.
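For instance, a runnable sketch with made-up sample data, writing and then reading back a line using '\x01' as the delimiter:

```python
import csv

fields = ["ATT_AI16601A.PV", "01-Apr-2014 05:02:19", "2.947523", "False"]

# Write one SOH-delimited line, then read it back.
with open("soh.csv", "w", newline="") as f:
    csv.writer(f, delimiter="\x01").writerow(fields)

with open("soh.csv", newline="") as f:
    for row in csv.reader(f, delimiter="\x01"):
        print(row)
```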
I'm using Python, and with the gdata library I can upload a .csv file, but the delimiter stays as the default, i.e. a comma (','). How can I change the delimiter to, for example, ';'?
What I want is to change, from Python, the delimiter of the file being uploaded. I don't want to replace each ',' with ';' in the data; I want to change the delimiter itself.
You could open the .csv using Excel; it knows it's a CSV (comma-delimited file), but you can set other characters to delimit the file by, such as spaces, etc.
Edit: Sorry, I should've mentioned: don't open the file using the 'open with' method; open Excel first, then open the file from within Excel. This should launch the 'Text Import Wizard', where you can choose what to delimit the file with, such as tab, semicolon, comma, space, etc.
I am assuming you really need to select this delimiter through gdata, right?
Otherwise you can easily change the delimiter in a shell with something like:
cat my_csv.csv | tr ',' ';' > my_csv_other_delimiter.csv
You can also easily replace these symbols in your Python code. It could be a problem if you receive your CSV files from somewhere else and cannot control which symbol is used as the delimiter, but if there is no other choice, that could be an option.
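If the files may contain quoted fields, a plain character replacement like tr can corrupt data, so here is a sketch that re-encodes through the csv module instead (the file names are made up, and a sample input is created first so the snippet runs on its own):

```python
import csv

# Create a small sample input; the quoted field contains a comma.
with open("my_csv.csv", "w", newline="") as f:
    f.write('a,b,"c,d"\n')

# Parse with the default comma delimiter, re-write with ';'.
with open("my_csv.csv", newline="") as src, \
        open("my_csv_other_delimiter.csv", "w", newline="") as dst:
    csv.writer(dst, delimiter=";").writerows(csv.reader(src))
```

Because the parser understands the quoting, the comma inside "c,d" survives the delimiter change, which tr cannot guarantee.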
I wrote an HTML parser in Python used to extract data that ends up looking like this in a CSV file:
itemA, itemB, itemC, Sentence that might contain commas, or colons: like this,\n
so I used the delimiter ":::::", thinking that it wouldn't appear in the data:
itemA, itemB, itemC, ::::: Sentence that might contain commas, or colons: like this,::::\n
This works for most of the thousands of lines; however, a stray colon ':' apparently threw this off when I imported the CSV in Calc.
My question is, what is the best or a unique delimiter to use when creating a csv with many variations of sentences that need to be separated with some delimiter? Am I understanding delimiters correctly in that they separate the values within a CSV?
As I suggested informally in a comment, unique just means you need to use some character that won't be in the data — chr(255) might be a good choice. For example:
Note: The code shown is for Python 2.x — see comments for a Python 3 version.
import csv

DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]

with open('data.csv', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)

with open('data.csv', 'rb') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print row
Output:
['itemA', 'itemB', 'itemC', 'Sentence that might contain commas, colons: or even "quotes".']
If you're not using the csv module and instead are writing and/or reading the data manually, then it would go something like this:
with open('data.csv', 'wb') as outfile:
    outfile.write(DELIMITER.join(data) + '\n')

with open('data.csv', 'rb') as infile:
    row = infile.readline().rstrip().split(DELIMITER)
    print row
Yes, delimiters separate values within each line of a CSV file. There are two strategies to delimiting text that has a lot of punctuation marks. First, you can quote the values, e.g.:
Value 1, Value 2, "This value has a comma, <- right there", Value 4
The second strategy is to use tabs (i.e., '\t').
Python's built-in CSV module can both read and write CSV files that use quotes. Check out the example code under the csv.reader function. The built-in csv module will handle quotes correctly, e.g. it will escape quotes that are in the value itself.
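As a quick illustration of the first strategy (the data below is invented), the csv module quotes on write and unquotes on read automatically:

```python
import csv

row = ["Value 1", "Value 2", "This value has a comma, <- right there", "Value 4"]

with open("quoted.csv", "w", newline="") as f:
    csv.writer(f).writerow(row)  # the comma-containing field gets quoted

with open("quoted.csv", newline="") as f:
    print(next(csv.reader(f)))  # quotes are stripped on read
```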
CSV files usually use double quotes " to wrap fields that might contain a field separator such as a comma. If a field contains a double quote itself, the standard way (per RFC 4180) to escape it is to double it: "".