I want to pull a csv from the URL below. There is a column where some of the values contain text with commas in them, which is causing issues. For example, in the row below the last 2 items should be a single column but are being split:
"""SL""","""2019-09-29""","""88.6""","""-0.6986""","""5.8034""","""Josh Phegley""",572033,542914,"""field_out""","""hit_into_play_score""",,,,,14,"""Josh Phegley grounds out"," second baseman Donnie Walton to first baseman Austin Nola. Sean Murphy scores. """
My code is as follows
import requests
import csv
file_name = 'test.csv'
url = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=2019%7C&hfSit=&player_type=&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&team=OAK&position=&hfRO=&home_road=&hfFlag=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_abs=0&type=details&'
raw_data = requests.get(url)
with open(file_name, 'w') as f:
    writer = csv.writer(f, quotechar='"')
    for line in raw_data.iter_lines():
        writer.writerow(line.decode('utf-8').split(','))
I've tried removing split(','), but this just results in each character being separated by a comma. I've tried various combinations of quotechar, quoting, and escapechar for the writer but no luck. Is there a way of ignoring commas if they appear within quotes?
Your incoming data is already CSV; you shouldn't be using the csv module to write it (unless you need to change the dialect for some reason, but even then, you'd need to read it with the csv module in the original dialect, then write it in the new dialect).
Just do:
# iter_lines() strips the line endings, so re-add them; newline=''
# stops the text layer from translating them again
with open(file_name, 'w', newline='') as f:
    f.writelines(line.decode('utf-8') + '\n' for line in raw_data.iter_lines())
to perform the minimal decode to UTF-8 and otherwise dump the data raw. If your locale encoding is UTF-8 anyway (or you want to write as UTF-8 regardless of locale), you can simplify further by dumping the raw bytes:
# binary mode doesn't translate line endings, so the response bytes
# land on disk untouched
with open(file_name, 'wb') as f:
    f.write(raw_data.content)
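Once the file is written verbatim, the csv module can parse it back correctly, since csv.reader honors the quoting. A minimal sketch with a made-up row mirroring the problematic Savant line:

```python
import csv
import io

# Hypothetical sample: the description field contains a comma but is
# quoted, so csv.reader keeps it as a single column instead of splitting.
raw = 'SL,2019-09-29,"Josh Phegley grounds out, second baseman Donnie Walton to first."\n'
row = next(csv.reader(io.StringIO(raw)))
print(row)
```

The quoted field survives as one value; splitting on ',' by hand is exactly what broke it apart in the original code.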
Python: v 3.6
Update:
I'm trying code where EVERYTHING is quoted, i.e. quoting=csv.QUOTE_ALL. For some reason even that is not working: the file is written, but WITHOUT quotes.
If this can be resolved, it may help with the remaining question.
Code
import csv
in_path = "eateries.csv"
with open(in_path, "r") as infile, open("out.csv", "w", newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, delimiter=",", quoting=csv.QUOTE_ALL)
    writer.writerows(reader)
Original Question:
I am trying to write a Python script that reads a csv file and outputs a csv file. In the output, cells containing a comma (",") should be wrapped in quotes.
Input:
Expected Output:
Actual Output:
Below is my code, please assist:
import csv
in_path = "eateries.csv"
with open(in_path, "r") as infile, open("out.csv", "w", newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, delimiter=",", quotechar=",", quoting=csv.QUOTE_MINIMAL)
    writer.writerows(reader)
quotechar doesn't mean "quote this character". It means "this is the character you use to quote things".
You do not want to use commas to quote things. Remove quotechar=",".
With quotechar corrected, your CSV will quote field values that contain commas, but importing the CSV into Excel or another spreadsheet application may not show quotation marks in the cell values. (Also, eateries.csv probably had quoting already.) It is quite likely that you don't actually need literal quotes in the spreadsheet: the fact that the value lands in a single cell, instead of being spread across several, is the spreadsheet equivalent of quoting.
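A quick check (a minimal sketch with made-up values) that QUOTE_ALL really does emit quotes; if they seem to vanish, it is the spreadsheet app stripping them at display time:

```python
import csv
import io

# Write one row with QUOTE_ALL and inspect the raw text that would land on disk.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["Joe's Diner", "123 Main St, Suite 4"])
print(buf.getvalue())  # every field is wrapped in double quotes
```

Opening the output in a plain text editor, rather than Excel, is the reliable way to confirm the quotes are there.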
This question already has answers here:
Process escape sequences in a string in Python
Hi and many thanks in advance!
I'm working on a Python script handling utf-8 strings and replacing specific characters. Therefore I use msgText.replace(thePair[0], thePair[1]) while looping through a list which defines unicode characters and their desired replacement, as shown below.
theList = [
    ('\U0001F601', '1f601.png'),
    ('\U0001F602', '1f602.png'),
    ...
]
Up to here everything works fine. But now consider a csv file which contains the characters to be replaced, as shown below.
\U0001F601;1f601.png
\U0001F602;1f602.png
...
I miserably failed in reading the csv data into the list due to the escape characters. I read the data using the csv module like this:
with open('Data.csv', newline='', encoding='utf-8-sig') as theCSV:
    theList = [tuple(line) for line in csv.reader(theCSV, delimiter=';')]
This results in pairs like ('\\U0001F601', '1f601.png') in which the escape sequences are left uninterpreted (note the double backslash). I tried several ways of modifying the string and other ways of reading the csv data, but I was not able to solve the problem.
How could I accomplish my goal to read csv data into pairs which contain escape characters?
I'm adding the solution for reading csv data containing escape characters for the sake of completeness. Consider a file Data.csv defining the replacement pattern:
\U0001F601;1f601.png
\U0001F602;1f602.png
Short version (using list comprehensions):
import csv
# define replacement list (short version)
with open('Data.csv', newline='', encoding='utf-8-sig') as csvFile:
    replList = [(line[0].encode().decode('unicode-escape'), line[1])
                for line in csv.reader(csvFile, delimiter=';') if line]
Prolonged version (probably easier to understand):
import csv
# define replacement list (step by step)
replList = []
with open('Data.csv', newline='', encoding='utf-8-sig') as csvFile:
    for line in csv.reader(csvFile, delimiter=';'):
        if line:  # skip blank lines
            replList.append((line[0].encode().decode('unicode-escape'), line[1]))
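The key step in isolation, with the file I/O stripped away:

```python
# csv.reader hands back the ten literal characters \U0001F601; decoding
# with 'unicode-escape' interprets them as a single code point (the emoji).
raw = '\\U0001F601'
emoji = raw.encode().decode('unicode-escape')
print(len(raw), len(emoji))  # 10 1
```

The encode() round trip is needed because 'unicode-escape' is a bytes-to-str codec; the intermediate bytes are plain ASCII here, so the default utf-8 encode is safe.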
I have an interesting situation with Python's csv module. I have a function that takes specific lines from a text file and writes them to csv file:
import os
import csv
def csv_save_use(textfile, csvfile):
    with open(textfile, "rb") as text:
        for line in text:
            line = line.strip()
            with open(csvfile, "ab") as f:
                if line.startswith("# Online_Resource"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
                if line.startswith("##"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
Here is a sample of some strings from the original text file:
# Online_Resource: https://www.ncdc.noaa.gov/
## Corg% percent organic carbon,,,%,,paleoceanography,,,N
What is really bizarre is the final csv file looks good, except the characters in the first column only (those with the # originally) partially "overwrite" each other when I try to manually delete some characters from the cell:
Oddly enough, too, there seems to be no formula to how the characters get jumbled each time I try to delete some after running the script. I tried encoding the csv file as unicode to no avail.
Thanks.
You've selected excel dialect but you overrode it with weird parameters:
You're using TAB as separator and line terminator, which creates a 1-line CSV file. Close enough to "truncated" to me
Also quotechar shouldn't be a space.
This conveyed a nice side-effect as you noted: the csv module actually splits the lines according to commas!
The code is inefficient and error-prone: you're opening the file in append mode in the loop and create a new csv writer each time. Better done outside the loop.
Also, comma split must be done by hand now. So even better: use csv module to read the file as well. My fix proposal for your routine:
import os
import csv
def csv_save_use(textfile, csvfile):
    with open(textfile, "rU") as text, open(csvfile, "wb") as f:
        write = csv.writer(f, dialect='excel', delimiter='\t')
        reader = csv.reader(text, delimiter=",")
        for row in reader:
            if not row:
                continue  # skip possible empty rows
            if row[0].startswith("# Online_Resource"):
                write.writerow([row[0].lstrip("# ")])
            elif row[0].startswith("##"):
                # write the row, stripping the hashes from the first item
                write.writerow([row[0].lstrip("# ")] + row[1:])
Note that the file isn't displayed properly in Excel unless you remove delimiter='\t' (reverting to the default comma).
Also note that you need to replace open(csvfile, "wb") as f by open(csvfile, "w",newline='') as f for Python 3.
here's how the output looks now (note that the empty cells are because there are several commas in a row)
more problems:
line=line.strip(" ") removes leading and trailing spaces. It doesn't remove \r or \n ... try line=line.strip() which removes leading and trailing whitespace
you get all your line including commas in one cell because you haven't split it up somehow ... like using a csv.reader instance. See here:
https://docs.python.org/2/library/csv.html#csv.reader
str.lstrip's non-default argument is treated as a set of characters to be removed, so '## ' has the same effect as '# '. If guff.startswith('## '), then do guff = guff[3:] to remove exactly the unwanted prefix.
It is not very clear at all what the sentence containing "bizarre" means. We need to see exactly what is in the output csv file. Create a small test file with 3 records (1) with '# Online_Resource' (2) with "## " (3) none of the above, run your code, and show the output, like this:
print repr(open('testout.csv', 'rb').read())
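Both string-method points above are easy to verify in isolation:

```python
# strip(" ") only removes spaces; bare strip() removes all whitespace,
# including the \r\n that survives from the raw text line.
line = "  # Online_Resource: https://www.ncdc.noaa.gov/\r\n"
print(repr(line.strip(" ")))   # still ends with \r\n
print(repr(line.strip()))      # fully cleaned

# lstrip treats its argument as a SET of characters, not a prefix, so it
# can eat more than you expect; slicing removes an exact-length prefix.
guff = "## #1 ranked sample"
print(guff.lstrip("# "))   # the '#' belonging to '#1' is stripped too
print(guff[3:])            # only the leading '## ' is removed
```

The hypothetical '#1' value shows where the two approaches diverge: lstrip keeps consuming any leading '#' or space, while the slice stops after three characters.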
I have seen several similar posts on this but nothing has solved my problem.
I am reading a list of numbers with backslashes and writing them to a .csv. Obviously the backslashes are causing problems.
from csv import writer

addr = "6253\342\200\2236387"
with open("output.csv", 'a') as w:
    write = writer(w)
    write.writerow([addr])
I found that using r"6253\342\200\2236387" gave me exactly what I want for the output, but since I am reading my input from a file I can't use a raw string. I tried .encode('string-escape'), but that gave me 6253\xe2\x80\x936387 as output, which is definitely not what I want, and unicode-escape gave me an error. Any thoughts?
The r in front of a string is only for defining a string. If you're reading data from a file, it's already 'raw'. You shouldn't have to do anything special when reading in your data.
Note that if your data is not plain ascii, you may need to decode it or read it in binary. For example, if the data is utf-8, you can open the file like this before reading:
import codecs
f = codecs.open("test", "r", "utf-8")
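To see that the r prefix only matters for literals typed in source code, a small sketch using a temporary file (the filename is made up for the demo):

```python
import os
import tempfile

# Write the backslash sequence as literal text, then read it back.
path = os.path.join(tempfile.mkdtemp(), "nums.txt")
with open(path, "w") as f:
    f.write("6253\\342\\200\\2236387\n")  # literal backslashes on disk

with open(path) as f:
    line = f.readline().strip()
print(line)  # already 'raw': the backslashes are ordinary characters
```

The string read from the file compares equal to the raw literal, with no decoding step required.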
Text file contains...
1234\4567\7890
41\5432\345\6789
Code:
with open('c:/tmp/numbers.csv', 'ab') as w:
    f = open(textfilepath)
    wr = csv.writer(w)
    for line in f:
        line = line.strip()
        wr.writerow([line])
    f.close()
This produced a csv with whole lines in a column. Maybe use 'ab' rather than 'a' as your file open type. I was getting extra blank records in my csv when using just 'a'.
I created this awhile back. This helps you write to a csv file.
import csv

def write2csv(fileName, theData):
    # a with-block ensures the file is closed once the row is written
    with open(fileName + '.csv', 'a') as theFile:
        wr = csv.writer(theFile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        wr.writerow(theData)
I wrote an HTML parser in python used to extract data to look like this in a csv file:
itemA, itemB, itemC, Sentence that might contain commas, or colons: like this,\n
so I used a delimiter ":::::", thinking that it wouldn't appear in the data:
itemA, itemB, itemC, ::::: Sentence that might contain commas, or colons: like this,::::\n
This works for most of the thousands of lines, however, apparently a colon : offset this when I imported the csv in Calc.
My question is, what is the best or a unique delimiter to use when creating a csv with many variations of sentences that need to be separated with some delimiter? Am I understanding delimiters correctly in that they separate the values within a CSV?
As I suggested informally in a comment, unique just means you need to use some character that won't be in the data — chr(255) might be a good choice. For example:
Note: The code shown is for Python 2.x — see comments for a Python 3 version.
import csv

DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]

with open('data.csv', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)

with open('data.csv', 'rb') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print row
Output:
['itemA', 'itemB', 'itemC', 'Sentence that might contain commas, colons: or even "quotes".']
If you're not using the csv module and instead are writing and/or reading the data manually, then it would go something like this:
with open('data.csv', 'wb') as outfile:
    outfile.write(DELIMITER.join(data) + '\n')

with open('data.csv', 'rb') as infile:
    row = infile.readline().rstrip().split(DELIMITER)
    print row
Yes, delimiters separate values within each line of a CSV file. There are two strategies to delimiting text that has a lot of punctuation marks. First, you can quote the values, e.g.:
Value 1, Value 2, "This value has a comma, <- right there", Value 4
The second strategy is to use tabs (i.e., '\t').
Python's built-in CSV module can both read and write CSV files that use quotes. Check out the example code under the csv.reader function. The built-in csv module will handle quotes correctly, e.g. it will escape quotes that are in the value itself.
CSV files usually use double quotes " to wrap long fields that might contain a field separator like a comma. If the field itself contains a double quote, it is normally escaped by doubling it ("") rather than with a backslash; this is what RFC 4180 specifies and what Python's default dialect does.
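Python's csv writer demonstrates the doubled-quote escaping directly:

```python
import csv
import io

# A field containing both a comma and a double quote: the writer wraps it
# in quotes and doubles the embedded quote character.
buf = io.StringIO()
csv.writer(buf).writerow(['He said "hi", then left'])
print(buf.getvalue())  # "He said ""hi"", then left"
```

Reading the same text back with csv.reader restores the original field, quotes and all.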