Hi and many thanks in advance!
I'm working on a Python script that handles UTF-8 strings and replaces specific characters. To do this I call msgText.replace(thePair[0], thePair[1]) while looping through a list that defines the Unicode characters and their desired replacements, as shown below.
theList = [
    ('\U0001F601', '1f601.png'),
    ('\U0001F602', '1f602.png'), ...
]
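For context, the replacement loop itself is roughly this (just a sketch; msgText is assumed to hold the message being processed):
# replace each defined character with its mapped value
for thePair in theList:
    msgText = msgText.replace(thePair[0], thePair[1])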
Up to here everything works fine. But now consider a csv file which contains the characters to be replaced, as shown below.
\U0001F601;1f601.png
\U0001F602;1f602.png
...
I failed miserably at reading the CSV data into the list because of the escape sequences. I read the data using the csv module like this:
import csv

with open('Data.csv', newline='', encoding='utf-8-sig') as theCSV:
    theList = [tuple(line) for line in csv.reader(theCSV, delimiter=';')]
This results in pairs like ('\\U0001F601', '1f601.png'), where the backslash itself is escaped instead of being interpreted as an escape sequence (note the double backslash). I tried several ways of modifying the string and other ways of reading the CSV data, but I was not able to solve the problem.
How can I read the CSV data into pairs in which the escape sequences are actually interpreted?
I'm adding the solution for reading csv data containing escape characters for the sake of completeness. Consider a file Data.csv defining the replacement pattern:
\U0001F601;1f601.png
\U0001F602;1f602.png
Short version (using list comprehensions):
import csv

# define replacement list (short version)
with open('Data.csv', newline='', encoding='utf-8-sig') as csvFile:
    replList = [(line[0].encode().decode('unicode-escape'), line[1])
                for line in csv.reader(csvFile, delimiter=';') if line]
Longer version (probably easier to understand):
import csv

# define replacement list (step by step)
replList = []
with open('Data.csv', newline='', encoding='utf-8-sig') as csvFile:
    for line in csv.reader(csvFile, delimiter=';'):
        if line:  # skip blank lines
            replList.append((line[0].encode().decode('unicode-escape'), line[1]))
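A quick check (assuming Data.csv contains the two lines shown above) that the decoding worked:
# the first element of each pair is now the actual character, not a literal backslash sequence
print(replList[0][0] == '\U0001F601')  # True
print(len(replList[0][0]))             # 1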
Related
I want to pull a CSV from the URL below. There is a column where some of the values contain text with commas in them, which is causing issues. For example, in the row below the last two items should be a single column but are being split:
"""SL""","""2019-09-29""","""88.6""","""-0.6986""","""5.8034""","""Josh Phegley""",572033,542914,"""field_out""","""hit_into_play_score""",,,,,14,"""Josh Phegley grounds out"," second baseman Donnie Walton to first baseman Austin Nola. Sean Murphy scores. """
My code is as follows:
import requests
import csv
file_name = 'test.csv'
url = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=2019%7C&hfSit=&player_type=&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&team=OAK&position=&hfRO=&home_road=&hfFlag=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_abs=0&type=details&'
raw_data = requests.get(url)
with open(file_name, 'w') as f:
    writer = csv.writer(f, quotechar='"')
    for line in raw_data.iter_lines():
        writer.writerow(line.decode('utf-8').split(','))
I've tried removing split(','), but this just results in each character being separated by a comma. I've tried various combinations of quotechar, quoting, and escapechar for the writer, but no luck. Is there a way of ignoring commas if they appear within quotes?
Your incoming data is already CSV; you shouldn't be using the csv module to write it (unless you need to change the dialect for some reason, but even then, you'd need to read it with the csv module in the original dialect, then write it in the new dialect).
Just do:
# iter_lines() strips the line terminators, so add '\n' back;
# newline='' keeps Python from translating it on the way out
with open(file_name, 'w', newline='') as f:
    f.writelines(line.decode('utf-8') + '\n' for line in raw_data.iter_lines())
to perform the minimal decode to UTF-8 and otherwise dump the data raw. If your locale encoding is UTF-8 anyway (or you want to write as UTF-8 regardless of locale), you can simplify further by dumping the raw bytes:
# newline='' isn't needed for binary mode, which doesn't translate line endings anyway;
# iter_lines() strips the terminators, so add b'\n' back
with open(file_name, 'wb') as f:
    f.writelines(line + b'\n' for line in raw_data.iter_lines())
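For completeness, if you did need to change the dialect as mentioned above, the read-then-write round trip might look roughly like this (a sketch; the semicolon output delimiter is just an example):
import csv
import io

# parse the downloaded text in its original comma dialect, then re-write it in a new dialect
reader = csv.reader(io.StringIO(raw_data.content.decode('utf-8')))
with open(file_name, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerows(reader)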
I have an interesting situation with Python's csv module. I have a function that takes specific lines from a text file and writes them to csv file:
import os
import csv

def csv_save_use(textfile, csvfile):
    with open(textfile, "rb") as text:
        for line in text:
            line = line.strip()
            with open(csvfile, "ab") as f:
                if line.startswith("# Online_Resource"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
                if line.startswith("##"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
Here is a sample of some strings from the original text file:
# Online_Resource: https://www.ncdc.noaa.gov/
## Corg% percent organic carbon,,,%,,paleoceanography,,,N
What is really bizarre is that the final CSV file looks good, except that the characters in the first column only (those with the # originally) partially "overwrite" each other when I try to manually delete some characters from a cell.
Oddly enough, there seems to be no pattern to how the characters get jumbled each time I try to delete some after running the script. I tried encoding the csv file as unicode, to no avail.
Thanks.
You've selected the excel dialect but then overridden it with odd parameters:
You're using TAB as both the separator and the line terminator, which creates a one-line CSV file. Close enough to "truncated" for me.
Also, quotechar shouldn't be a space.
This had a nice side effect, as you noted: the csv module actually splits the lines on commas!
The code is also inefficient and error-prone: you're opening the file in append mode inside the loop and creating a new csv writer each time. Better to do both outside the loop.
Also, the comma splitting must now be done by hand. So even better: use the csv module to read the file as well. My proposed fix for your routine:
import os
import csv

def csv_save_use(textfile, csvfile):
    with open(textfile, "rU") as text, open(csvfile, "wb") as f:
        write = csv.writer(f, dialect='excel',
                           delimiter='\t')
        reader = csv.reader(text, delimiter=",")
        for row in reader:
            if not row:
                continue  # skip possible empty rows
            if row[0].startswith("# Online_Resource"):
                write.writerow([row[0].lstrip("# ")])
            elif row[0].startswith("##"):
                write.writerow([row[0].lstrip("# ")] + row[1:])  # write row, stripping the first item of hashes
Note that the file isn't displayed properly in Excel unless you remove delimiter='\t' (which reverts to the default comma).
Also note that for Python 3 you need to replace open(csvfile, "wb") as f with open(csvfile, "w", newline='') as f.
Here's how the output looks now (note that the empty cells are there because there are several commas in a row):
More problems:
line=line.strip(" ") removes leading and trailing spaces, but it doesn't remove \r or \n; try line=line.strip(), which removes leading and trailing whitespace.
You get your whole line, commas included, in one cell because you haven't split it up, e.g. by using a csv.reader instance. See here:
https://docs.python.org/2/library/csv.html#csv.reader
str.lstrip's non-default arg is treated as a set of characters to be removed, so '## ' has the same effect as '# '. If guff.startswith('## '), do guff = guff[3:] to get rid of the unwanted prefix.
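A quick illustration of that set behaviour, using the second sample line from above (guff is just a placeholder name):
guff = "## Corg% percent organic carbon,,,%,,paleoceanography,,,N"
print(guff.lstrip("# "))  # both hashes and the following space are stripped, same as lstrip('## ')
print(guff[3:] if guff.startswith("## ") else guff)  # same result here, but only strips that exact prefix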
It is not very clear at all what the sentence containing "bizarre" means. We need to see exactly what is in the output csv file. Create a small test file with 3 records (1) with '# Online_Resource' (2) with "## " (3) none of the above, run your code, and show the output, like this:
print repr(open('testout.csv', 'rb').read())
I have a data format that appears similar to a CSV file, but it has vertical bars around the character strings and not around the Boolean fields. For example:
|2000|,|code_no|,|first name, last name|,,,0,|word string|,0
|2000|,|code_no|,|full name|,,,0,|word string|,0
I'm not sure what format this is (it is saved as a txt file). What format is this, and how would I import it into Python?
For reference, I had been trying to use:
import unicodecsv

with open(csv_file, 'rb') as f:
    r = unicodecsv.reader(f)
And then stripping out the | from the start and end of the fields. This works OK, except for fields which have a comma in them (e.g. |first name, last name|), where the field gets split because of the comma.
It looks like the pipes are being used as quote characters, not delimiters. Have you tried initializing the reader to use pipe ('|') as the quote character, and perhaps to use csv.QUOTE_NONNUMERIC as the quoting rules?
csv.reader(f, quotechar='|', quoting=csv.QUOTE_NONNUMERIC)
Have you tried .reader(f, delimiter=',', quotechar='|') ?
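For example, treating the pipe as the quote character keeps the comma inside |first name, last name| from splitting the field (a sketch; 'data.txt' is a hypothetical file holding the two sample lines above):
import csv

with open('data.txt', newline='') as f:
    for row in csv.reader(f, delimiter=',', quotechar='|'):
        print(row)
# first line -> ['2000', 'code_no', 'first name, last name', '', '', '0', 'word string', '0']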
I wrote an HTML parser in Python to extract data that looks like this in a CSV file:
itemA, itemB, itemC, Sentence that might contain commas, or colons: like this,\n
so I used the delimiter ":::::", thinking that it wouldn't appear in the data:
itemA, itemB, itemC, ::::: Sentence that might contain commas, or colons: like this,::::\n
This works for most of the thousands of lines; however, a stray colon apparently threw this off when I imported the CSV into Calc.
My question is, what is the best or a unique delimiter to use when creating a csv with many variations of sentences that need to be separated with some delimiter? Am I understanding delimiters correctly in that they separate the values within a CSV?
As I suggested informally in a comment, unique just means you need to use some character that won't be in the data — chr(255) might be a good choice. For example:
Note: The code shown is for Python 2.x — see comments for a Python 3 version.
import csv

DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]

with open('data.csv', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)

with open('data.csv', 'rb') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print row
Output:
['itemA', 'itemB', 'itemC', 'Sentence that might contain commas, colons: or even "quotes".']
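For reference, a Python 3 equivalent might look roughly like this (text mode with newline='' instead of binary mode; latin-1 is chosen here so chr(255) stays a single byte on disk, though utf-8 would also work):
import csv

DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]

with open('data.csv', 'w', newline='', encoding='latin-1') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)

with open('data.csv', newline='', encoding='latin-1') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print(row)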
If you're not using the csv module and instead are writing and/or reading the data manually, then it would go something like this:
with open('data.csv', 'wb') as outfile:
    outfile.write(DELIMITER.join(data) + '\n')

with open('data.csv', 'rb') as infile:
    row = infile.readline().rstrip().split(DELIMITER)
    print row
Yes, delimiters separate values within each line of a CSV file. There are two strategies to delimiting text that has a lot of punctuation marks. First, you can quote the values, e.g.:
Value 1, Value 2, "This value has a comma, <- right there", Value 4
The second strategy is to use tabs (i.e., '\t').
Python's built-in CSV module can both read and write CSV files that use quotes. Check out the example code under the csv.reader function. The built-in csv module will handle quotes correctly, e.g. it will escape quotes that are in the value itself.
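For instance, a minimal round trip (a sketch; 'quoted.csv' is just an example filename) shows the quoting being applied and undone automatically:
import csv

row = ['Value 1', 'Value 2', 'This value has a comma, <- right there', 'It even has a "quote"']

with open('quoted.csv', 'w', newline='') as f:
    csv.writer(f).writerow(row)   # fields containing commas or quotes get quoted automatically

with open('quoted.csv', newline='') as f:
    print(next(csv.reader(f)))    # the four original values come back intact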
CSV files usually use double quotes " to wrap long fields that might contain a field separator like a comma. If the field itself contains a double quote, it is typically escaped by doubling it (""), although some dialects escape it with a backslash (\") instead.
What is the easiest method for writing multiple lines into a single cell within Excel using Python? I've been trying the csv module without success.
import csv

with open('xyz.csv', 'wb') as outfile:
    w = csv.writer(outfile)
    w.writerow(['stringa', 'string_multiline'])
Also, each of the multiline strings has a number of characters which are typically used in CSVs, i.e. commas.
Any help would be really appreciated.
To figure this out, I created a file in Excel with a single multiline cell.
Then I saved it as CSV and opened it up in a text editor:
"a^Mb"
It looks like Excel interprets Ctrl-M characters as newlines.
Let’s try that with Python:
#!/usr/bin/env python2.7
import csv

with open('xyz.csv', 'wb') as outfile:
    w = csv.writer(outfile)
    w.writerow(['stringa', 'multiline\015string'])
Yup, that worked!
You need to add extra double quotes (") at the start and end of the string, and separate the different lines of the cell using the newline character (\n),
e.g. "line1\nline2\nline3"
f = open("filename.csv", "w")
f.write("\"line1\nline2\nline3\"")
f.close()
The code creates this CSV: