I am working on a Python 3 script to download some data from an API and save it as a CSV file, but I'm having trouble figuring out the right way to do this while accounting for commas inside quoted text.
>>> export = get_lead_export_file('9833672b')
>>> print(type(export))
<class 'bytes'>
>>> print(export)
b'company,billingStreet,website,email\n-"Acme, Inc","123 Main St, Suite 23",acme.com,joe#acme.com\n-"Acme, Inc","123 Main St, Suite 23",acme.com,joe#acme.com\n-
My function is as follows:
def store_leads(lead_export_job_id):
    export = get_lead_export_file(lead_export_job_id)
    export_decoded = export.decode()
    row_list = export_decoded.rsplit('\n-')
    with open(lead_export_job_id, 'w') as csvfile:
        writer = csv.writer(csvfile)
        for row in row_list:
            items = row.rsplit(',')
            writer.writerows(items)
But the data gets corrupted, because commas appear within the quoted text (e.g. "123 Main St, Suite 23").
Is there a better way to write the byte data to the CSV without first decoding it? And then a better way to split only on unquoted commas?
You can try using the quotechar parameter of csv.writer; example syntax is given below:
csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
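For the splitting problem itself, it may be simpler to let csv.reader do the parsing instead of rsplit, since the reader already understands quoted fields. A minimal sketch, using made-up bytes shaped like the export in the question (with the stray leading markers removed and a plain @ in the email):

```python
import csv
import io

# Hypothetical bytes shaped like the export in the question
export = b'company,billingStreet,website,email\n"Acme, Inc","123 Main St, Suite 23",acme.com,joe@acme.com\n'

# csv.reader understands quoted fields, so commas inside quotes survive intact
rows = list(csv.reader(io.StringIO(export.decode())))
print(rows[1])  # ['Acme, Inc', '123 Main St, Suite 23', 'acme.com', 'joe@acme.com']
```

The parsed rows can then be handed straight to csv.writer.writerows without any manual splitting.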
I am incredibly new to python, so I might not have the right terminology...
I've extracted text from a pdf using pdfplumber. That's been saved as an object. The code I used for that is:
with pdfplumber.open('Bell_2014.pdf') as pdf:
    page = pdf.pages[0]
    bell = page.extract_text()
print(bell)
So "bell" is all of the text from the first page of the imported PDF.
I need to write all of that text as a string to a csv. I tried using:
with open('Bell_2014_ex.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(bell)
and
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(bell)
All I keep finding when I search is how to create a csv with specific characters or numbers, but nothing about writing the output of already-executed code. For instance, I can get the code below:
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(['bell'])
to create a csv that has "bell" in one cell of the csv, but that's as close as I can get.
I feel like this should be super easy, but I just can't seem to get it to work.
Any thoughts?
Please and thank you for helping my inexperienced self.
page.extract_text() is defined as: "Collates all of the page's character objects into a single string." That makes bell just one very long string.
csv.writer's writerow() expects a list of strings, with each item in the list corresponding to a single column.
Your main issue is a type mismatch: you're trying to write a single string where a list of strings is expected. You will need to operate further on your bell object to convert it into a format acceptable for writing to a CSV.
Without knowing what bell contains or what you intend to write, I can't get more specific, but the documentation for Python's csv module is very comprehensive on delimiters, dialects, column definitions, etc. Once you have converted bell into a proper iterable of lists of strings, you can write it to a CSV.
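For instance, one plausible conversion (assuming each line of the extracted page should become a single-column CSV row; bell here is a stand-in string, not the real PDF text):

```python
import csv

# Stand-in for page.extract_text(): one long string with embedded newlines
bell = "First line of the page\nSecond line, with a comma\nThird line"

# One CSV row per line of text; each row is a one-item list (a single column)
with open('Bell_2014_ex.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows([line] for line in bell.splitlines())
```

Splitting each line further into multiple columns would depend entirely on the structure of the PDF text.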
Some similar code I wrote recently converts a tab-separated file to CSV for insertion into an sqlite3 database. Maybe this is helpful:
import os
import csv

# Convert tab-delimited listfile.txt to a comma-separated values (.csv) file
out_file = os.path.join('input', 'listfile.csv')

with open('listfile.txt', 'r') as in_text, \
        open(out_file, 'w', newline='') as out_csv:
    in_reader = csv.reader(in_text, delimiter='\t')
    out_writer = csv.writer(out_csv, dialect=csv.excel)
    for _line in in_reader:
        out_writer.writerow(_line)
... and that's it, not too tough
So my problem was that I was missing encoding='utf-8' for special characters, and my delimiter needed to be a space instead of a comma. What ended up working was:
import csv
from pdfminer.high_level import extract_text

text = extract_text('filepath.pdf')
print(text)
new_csv = 'filename.csv'
with open(new_csv, 'w', newline='', encoding='utf-8') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=' ')
    file_writer.writerow(text)
However, since a lot of my pdfs weren't true pdfs but scans, the csv ended up having a lot of weird symbols. This worked for about half of the pdfs I have. If you have true pdfs, this will be great. If not, I'm currently trying to figure out how to extract all the text into a pandas dataframe separated by headers within the pdfs since pdfminer extracted all text perfectly.
Thank you for everyone that helped!
I am trying to open a CSV file after creating it with Python. My goal is to read the file back without editing it, and my problem is that I cannot get the delimiter to work. The file is created with Python's csv writer, and then I attempt to use the reader to read the data back; that is where I get stuck. The CSV file is saved in the same location as my Python program, so I know it is not an access issue.
The file is created with a special delimiter: I am using semicolons (;) because the raw data already contains commas (,), colons (:), plus signs (+), ampersands (&), periods (.), and possibly underscores (_) and/or dashes (-). This is the code I am using to read the CSV file:
with open('Cool.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';', dialect=csv.excel_tab)
    for row in csv_reader:
        print row[0]
csv_file.close()
Now this is my csv file (Cool.csv):
"Sat, 20 Apr 2019 00:17:05 +0000;Need to go to store;Eggs & Milk are needed ;Store: Grocery;Full Name: Safeway;Email: safewayiscool#gmail.com;Safeway <safewayiscool#gmail.com>, ;"
"Tue, 5 Mar 2019 05:54:24 +0000;Need to buy ham;Green eggs and Ham are needed for dinner ;Username: Dr.Seuss;Full Name: Theodor Seuss Geisel;Email: greeneggs+ham#seuss.com;"
So I would expect my output to be the following when I run the code:
Sat, 20 Apr 2019 00:17:05 +0000
Tue, 5 Mar 2019 05:54:24 +0000
I either get a null error of some kind, or it prints out the entire line. How can I get it to separate the data into columns delimited by the ;?
I am not sure if the issue is that I am trying to use the semicolon or if it is something else. If it is just the semicolon I could change it if necessary but many other characters are already taken in the incoming data.
Also please do not suggest I simply just read it in from the original file. It is a massive file that has a lot of other data and I want to trim it before then executing with this second program.
UPDATE:
This is the code that builds the file:
with open('Cool.csv', 'w') as csvFile:
    writer = csv.writer(csvFile, delimiter=';')
    for m in file:
        message = m['msg']
        message2 = message.replace('\r\n\r\n', ';')
        message3 = message2.replace('\r\n', ';')
        entry = m['date'] + ";" + m['subject'] + ";" + message3
        list = []
        list.append(entry)
        writer.writerow(list)
csvFile.close()
It looks like the file was created incorrectly. The sample data provided shows the whole line double-quoted, which treats it as one long single column. Here's correct code to write and read a semicolon-delimited file:
import csv

with open('Cool.csv', 'w', newline='', encoding='utf-8-sig') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=';')
    csv_writer.writerow(['data,data', 'data;data', 'data+-":_'])

with open('Cool.csv', 'r', newline='', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    for row in csv_reader:
        print(row)
Output (matches data written):
['data,data', 'data;data', 'data+-":_']
Cool.csv:
data,data;"data;data";"data+-"":_"
Notes:
utf-8-sig is the most compatible encoding with Excel. Any Unicode character you put in the file will work and look correct when the CSV is opened in Excel.
newline='' is required per the csv documentation. The csv module handles its own newlines per the dialect used (default 'excel').
The ; delimiter is not needed; the default , would work too. Note how the second entry contains a semicolon, so that field was quoted. With a comma delimiter, the first field (which contains a comma) would have been quoted instead, and it would still work.
csv_writer.writerow takes a sequence containing the column data.
csv_reader returns each row as a list of the column data.
A column in the .CSV is double-quoted if it contains the delimiter, and quotes inside the data are doubled to escape them. Note the third field contains a double quote.
csv_writer.close() and csv_reader.close() are not needed if using with.
RTFM.
From help(csv):
DIALECT REGISTRATION:

Readers and writers support a dialect argument, which is a convenient
handle on a group of settings. When the dialect argument is a string,
it identifies one of the dialects previously registered with the module.
If it is a class or instance, the attributes of the argument are used as
the settings for the reader or writer:

class excel:
    delimiter = ','
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = QUOTE_MINIMAL
And you use dialect=csv.excel_tab, which effectively overwrites your delimiter. Just don't use the dialect option.
Sidenote: with handles closing of the file handle for you. Read here
Second sidenote: The whole line of your CSV file is in double quotes. Either get rid of them, or disable the quoting. i.e.
with open('b.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        print(row[0])
I'm trying to remove the " sign from the CSV file I'm creating, but it either gives me an error (_csv.Error: need to escape, but no escapechar set) or it writes the quote character as the first and last character of every line.
I'm running Python 3.7 and I tried changing my code with the variations below:
passphrase_writer = csv.writer(file, lineterminator='\n' ,quoting=csv.QUOTE_NONE)
#passphrase_writer = csv.writer(file, delimiter=',', lineterminator='\n',quoting=csv.QUOTE_NONE,)
#passphrase_writer = csv.writer(file, delimiter=',' ,lineterminator='\n', quoting=csv.QUOTE_NONE)
def print_dict(d, site_id):
    with open('passphrases.csv', mode='w', newline='') as file:
        passphrase_writer = csv.writer(file, lineterminator='\n', quoting=csv.QUOTE_NONE)
        #passphrase_writer = csv.writer(file, delimiter=',', lineterminator='\n', quoting=csv.QUOTE_NONE)
        #passphrase_writer = csv.writer(file, delimiter=',', lineterminator='\n', quotechar='|')
        for idx, val in enumerate(d['data']):
            x = (u'{},{},{},{},{},{},{},{}'.format(val['id'],
                                                   val['Name'],
                                                   val['domain'],
                                                   val['Version'],
                                                   val['lastLoggedIn'],
                                                   val['networkInterfaces'][0]['inet'][0],
                                                   val['id2'],
                                                   passphrase(val['id'], site_id)))
            print(x)
            passphrase_writer.writerow([x])
The results in the print are good:
54356,tomer-a36,WORKGROUP,2.,tom,192.168.30.133,eafa2eb,DREAM
However, the csv file will have:
"54356,tomer-a36,WORKGROUP,2.,tom,192.168.30.133,eafa2eb,DREAM"
I want to remove the extra quote characters.
Note: when changing to quotechar='|', I get:
|54356,tomer-a36,WORKGROUP,2.,tom,192.168.30.133,eafa2eb,DREAM|
Trying to set quotechar='' gives an error.
You're overcomplicating it and mixing custom formatting with the built-in csv module.
You're passing a list containing a single, already-formatted string, with , as the separator. The csv module has to quote it to keep it a single "cell", because the default csv separator is already ,. This is a built-in safety to avoid losing or corrupting data: without quoting, one cell containing commas would be indistinguishable from several cells separated by commas.
Instead, write your data as a tuple and stop building x with str.format:
for val in d['data']:
    passphrase_writer.writerow((val['id'],
                                val['Name'],
                                val['domain'],
                                val['Version'],
                                val['lastLoggedIn'],
                                val['networkInterfaces'][0]['inet'][0],
                                val['id2'],
                                passphrase(val['id'], site_id)))
Since your data never contains , (are you sure of that? passphrase seems like it could, actually), you don't have to worry about quotes being inserted.
And if it did (passphrase is a good candidate), csv.reader would be able to parse the quotes and remove them to deserialize your data (as Excel would, too). Don't try to read the file back by manually splitting on ,; use the csv module again.
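A small round-trip sketch (with made-up values; the last field stands in for a passphrase that happens to contain a comma) showing the writer quoting and the reader removing the quotes:

```python
import csv

# Made-up row; the last field contains a comma, as a passphrase might
with open('passphrases.csv', 'w', newline='') as f:
    csv.writer(f).writerow(['54356', 'tomer-a36', 'pass,word'])

# csv.reader strips the quoting that the writer added around 'pass,word'
with open('passphrases.csv', newline='') as f:
    row = next(csv.reader(f))
print(row)  # ['54356', 'tomer-a36', 'pass,word']
```

The quotes exist only in the on-disk serialization; the parsed values come back clean.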
Python: v3.6
Update:
I'm trying code where EVERYTHING is quoted, i.e. quoting=csv.QUOTE_ALL. For some reason even that is not working: the file is written, but WITHOUT quotes.
If this can be resolved, it may help with the remaining question.
Code
import csv

in_path = "eateries.csv"
with open(in_path, "r") as infile, open("out.csv", "w", newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, delimiter=",", quoting=csv.QUOTE_ALL)
    writer.writerows(reader)
Original Question:
I am trying to write a Python script that reads a CSV file and outputs a CSV file. In the output, cells containing a comma (",") should be wrapped in quotes.
Input:
Expected Output:
Actual Output:
Below is my code; please assist:
import csv

in_path = "eateries.csv"
with open(in_path, "r") as infile, open("out.csv", "w", newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, delimiter=",", quotechar=",", quoting=csv.QUOTE_MINIMAL)
    writer.writerows(reader)
quotechar doesn't mean "quote this character". It means "this is the character you use to quote things".
You do not want to use commas to quote things. Remove quotechar=",".
With quotechar corrected, your CSV will quote field values that contain commas, but importing the CSV into Excel or another spreadsheet application may not show quotation marks in the cell values. (Also, eateries.csv probably had quoting already.) You quite likely don't actually need the quotes in your spreadsheet: the fact that the value sits in a single cell instead of being spread across several is the spreadsheet's version of quoting.
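A quick way to confirm this (with made-up data): write with QUOTE_ALL, then inspect the raw text of the file instead of opening it in a spreadsheet. The quotes really are there:

```python
import csv

# Made-up row; QUOTE_ALL forces quotes around every field
with open('out.csv', 'w', newline='') as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerow(['cafe', 'burgers, fries'])

# Read the file back as plain text: the quoting is present on disk
print(open('out.csv').read())  # "cafe","burgers, fries"
```

Spreadsheet applications consume the quotes while parsing, which is why they never appear in cells.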
I have an interesting situation with Python's csv module. I have a function that takes specific lines from a text file and writes them to csv file:
import os
import csv
def csv_save_use(textfile, csvfile):
    with open(textfile, "rb") as text:
        for line in text:
            line = line.strip()
            with open(csvfile, "ab") as f:
                if line.startswith("# Online_Resource"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
                if line.startswith("##"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
Here is a sample of some strings from the original text file:
# Online_Resource: https://www.ncdc.noaa.gov/
## Corg% percent organic carbon,,,%,,paleoceanography,,,N
What is really bizarre is that the final csv file looks good, except the characters in the first column only (those with the # originally) partially "overwrite" each other when I try to manually delete some characters from the cell.
Oddly enough, there seems to be no pattern to how the characters get jumbled each time I delete some after running the script. I tried encoding the csv file as Unicode, to no avail.
Thanks.
You've selected the excel dialect but overrode it with weird parameters:
You're using TAB as both separator and line terminator, which creates a 1-line CSV file. Close enough to "truncated" to me.
Also, quotechar shouldn't be a space.
This had a nice side-effect, as you noted: the csv module actually splits the lines on commas!
The code is also inefficient and error-prone: you're opening the file in append mode inside the loop and creating a new csv writer each time. Better to do that outside the loop.
Also, comma splitting must now be done by hand. So, even better: use the csv module to read the file as well. My proposed fix for your routine:
import os
import csv

def csv_save_use(textfile, csvfile):
    with open(textfile, "rU") as text, open(csvfile, "wb") as f:
        write = csv.writer(f, dialect='excel',
                           delimiter='\t')
        reader = csv.reader(text, delimiter=",")
        for row in reader:
            if not row:
                continue  # skip possible empty rows
            if row[0].startswith("# Online_Resource"):
                write.writerow([row[0].lstrip("# ")])
            elif row[0].startswith("##"):
                # write row, stripping the first item of hashes
                write.writerow([row[0].lstrip("# ")] + row[1:])
Note that the file isn't properly displayed in Excel unless you remove delimiter='\t' (reverting to the default comma).
Also note that for Python 3 you need to replace open(csvfile, "wb") as f with open(csvfile, "w", newline='') as f.
Here's how the output looks now (the empty cells are because there are several commas in a row):
More problems:
line = line.strip(" ") removes only leading and trailing spaces; it doesn't remove \r or \n. Try line = line.strip(), which removes all leading and trailing whitespace.
You get your whole line, commas included, in one cell because you haven't split it up, e.g. by using a csv.reader instance. See here:
https://docs.python.org/2/library/csv.html#csv.reader
str.lstrip's non-default argument is treated as a set of characters to be removed, so '## ' has the same effect as '# '. If guff.startswith('## '), then do guff = guff[3:] to get rid of the unwanted prefix.
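To illustrate the difference (guff here is a made-up sample line, not from the asker's file):

```python
guff = '## Corg% percent organic carbon'

# lstrip removes ANY leading characters from the set {'#', ' '}, not a literal prefix
print(guff.lstrip('# '))  # Corg% percent organic carbon

# slicing removes exactly the known 3-character prefix
if guff.startswith('## '):
    print(guff[3:])       # Corg% percent organic carbon
```

The two agree here, but lstrip would also strip extra leading hashes or spaces that a fixed slice would leave in place.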
It is not at all clear what the sentence containing "bizarre" means. We need to see exactly what is in the output csv file. Create a small test file with 3 records: (1) one with '# Online_Resource', (2) one with '## ', (3) one with neither. Run your code and show the output, like this:
print repr(open('testout.csv', 'rb').read())