I wrote an HTML parser in Python that extracts data into a csv file, with rows that look like this:
itemA, itemB, itemC, Sentence that might contain commas, or colons: like this,\n
so I used the delimiter ":::::", thinking it would never occur in the data:
itemA, itemB, itemC, ::::: Sentence that might contain commas, or colons: like this,::::\n
This works for most of the thousands of lines; however, a colon apparently offset the columns when I imported the csv into Calc.
My question is: what is the best, or a truly unique, delimiter to use when creating a csv whose fields contain many variations of sentences? And am I understanding delimiters correctly, in that they separate the values within a CSV?
As I suggested informally in a comment, unique just means you need to use some character that won't be in the data — chr(255) might be a good choice. For example:
Note: the code shown below is for Python 2.x; a sketch of a Python 3 version follows the output.
import csv

DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]

with open('data.csv', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)

with open('data.csv', 'rb') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print row
Output:
['itemA', 'itemB', 'itemC', 'Sentence that might contain commas, colons: or even "quotes".']
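For reference, a rough Python 3 equivalent (the main changes are opening the files in text mode with newline='' and using print() as a function):

import csv

DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]

# In Python 3, csv files should be opened in text mode with newline=''.
with open('data.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)

with open('data.csv', newline='') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print(row)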
If you're not using the csv module and instead are writing and/or reading the data manually, then it would go something like this:
with open('data.csv', 'wb') as outfile:
    outfile.write(DELIMITER.join(data) + '\n')

with open('data.csv', 'rb') as infile:
    row = infile.readline().rstrip().split(DELIMITER)
    print row
Yes, delimiters separate the values within each line of a CSV file. There are two common strategies for delimiting text that has a lot of punctuation. First, you can quote the values, e.g.:
Value 1, Value 2, "This value has a comma, <- right there", Value 4
The second strategy is to use tabs (i.e., '\t').
Python's built-in csv module can both read and write CSV files that use quotes; check out the example code under the csv.reader function. It handles quotes correctly, e.g. it escapes quotes that appear in the value itself.
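For illustration, here's a minimal sketch with the standard csv module (Python 3 shown; the filename is arbitrary):

import csv

row = ['Value 1', 'Value 2', 'This value has a comma, <- right there', 'Value 4']

# The writer quotes any field containing the delimiter automatically.
with open('quoted.csv', 'w', newline='') as f:
    csv.writer(f).writerow(row)

# The reader honors the quotes, so the embedded comma survives intact.
with open('quoted.csv', newline='') as f:
    print(next(csv.reader(f)))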
CSV files usually use double quotes " to wrap fields that might contain a field separator like a comma. If the field itself contains a double quote, it's escaped by doubling it: "" (this is the RFC 4180 convention and the csv module's default; some dialects use a backslash instead).
I have a text file with lots of lines. From it I need to find two patterns of strings and save them to a csv.
Example:
Text file contains:
NA: 2.0
slit uniformity at power: 3.6
integrated slit uniformity at power: 4.7
slit uniformity: 8.6
and the output I want in the csv:
[NA] [2.0]
[slit uniformity] [8.6]
In short, I want to save an exact string in one column and the number next to it in the next column.
If this data format happens to match a well-known format perfectly, you can parse it as that format.
In your sample data, the text field never includes any colons, or quotes, or backslash escapes, or anything "weird". Is that guaranteed to always be true?
If so, this is a valid CSV file, with colons for delimiters and optional whitespace heading the fields, so you can parse it that way. (Your output format is a little weird for CSV; normally you can't use separate "open" and "close" quoting characters like that. But you're not asking about the output part here, so I'll cheat a bit.)
import csv

with open(inpath) as fin, open(outpath, 'w', newline='') as fout:
    w = csv.writer(fout, delimiter=' ')
    for text, number in csv.reader(fin, delimiter=':', skipinitialspace=True):
        w.writerow((f'[{text}]', f'[{number.strip()}]'))
On the other hand, this may be simpler to do without treating either file as a weird CSV dialect, just parsing and generating the lines manually:
with open(inpath) as fin, open(outpath, 'w') as fout:
    for line in fin:
        text, _, number = line.rstrip().partition(': ')
        fout.write(f'[{text}] [{number}]\n')
Of course the error handling won't be as nice if you have lines that break the format, since you're spreading the format specification implicitly over a few lines rather than defining it explicitly as a CSV dialect, but that may not be a problem.
Another option: if you only care about a fixed set of prefixes, match them directly:

prefixes = ['NA:', 'slit uniformity:']

with open('file.txt') as input, open('file.csv', 'w') as output:
    for line in input:
        for prefix in prefixes:
            if line.startswith(prefix):
                output.write('[%s] [%s]\n' % (prefix[:-1], line[len(prefix)+1:-1]))
In my csv file the data is separated by a special character. When I view it in Notepad++ it shows as 'SOH'.
ATT_AI16601A.PV01-Apr-2014 05:02:192.94752310FalseFalseFalse
ATT_AI16601A.PV[]01-Apr-2014 05:02:19[]2.947523[]1[]0[]False[]False[]False[]
It is present in the data but not visible. I have put markers in the second string where those characters are.
The point is that I need to read that data in Python, split on those markers. How can I use such special characters as delimiters when reading the data?
You can use Python's csv module by specifying the delimiter explicitly, like this:

import csv
reader = csv.reader(file, delimiter='whatever your delimiter is')
In your case
reader = csv.reader(file, delimiter='\x01')
This is because SOH is an ASCII control character with code point 1.
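For example, here's a minimal sketch putting it together ('data.txt' is a placeholder filename):

import csv

# '\x01' is SOH (Start of Heading), ASCII code point 1
with open('data.txt', newline='') as f:
    for row in csv.reader(f, delimiter='\x01'):
        print(row)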
Hi and many thanks in advance!
I'm working on a Python script that handles utf-8 strings and replaces specific characters. For that I use msgText.replace(thePair[0], thePair[1]) while looping through a list which defines unicode characters and their desired replacements, as shown below.
theList = [
    ('\U0001F601', '1f601.png'),
    ('\U0001F602', '1f602.png'),
    ...
]
Up to here everything works fine. But now consider a csv file which contains the characters to be replaced, as shown below.
\U0001F601;1f601.png
\U0001F602;1f602.png
...
I failed miserably at reading the csv data into the list because of the escape characters. I read the data using the csv module like this:
with open('Data.csv', newline='', encoding='utf-8-sig') as theCSV:
    theList = [tuple(line) for line in csv.reader(theCSV, delimiter=';')]
This results in pairs like ('\\U0001F601', '1f601.png'), where the backslash has itself been escaped (note the double backslash), so the escape sequences are never interpreted. I tried several ways of modifying the string and other methods of reading the csv data, but I was not able to solve the problem.
How could I accomplish my goal to read csv data into pairs which contain escape characters?
I'm adding the solution for reading csv data containing escape characters for the sake of completeness. Consider a file Data.csv defining the replacement pattern:
\U0001F601;1f601.png
\U0001F602;1f602.png
Short version (using a list comprehension):

import csv

# define replacement list (short version)
with open('Data.csv', newline='', encoding='utf-8-sig') as csvFile:
    replList = [(line[0].encode().decode('unicode-escape'), line[1])
                for line in csv.reader(csvFile, delimiter=';') if line]
# no explicit close() needed; the with statement closes the file
Longer version (probably easier to understand):

import csv

# define replacement list (step by step)
replList = []
with open('Data.csv', newline='', encoding='utf-8-sig') as csvFile:
    for line in csv.reader(csvFile, delimiter=';'):
        if line:  # skip blank lines
            replList.append((line[0].encode().decode('unicode-escape'), line[1]))
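To see what the decode step does in isolation, here's a quick illustration (values taken from the sample file above):

raw = '\\U0001F601'                           # the ten literal characters read from the csv
fixed = raw.encode().decode('unicode-escape')
print(fixed == '\U0001F601')                  # True: now a single emoji character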
I have an interesting situation with Python's csv module. I have a function that takes specific lines from a text file and writes them to a csv file:
import os
import csv

def csv_save_use(textfile, csvfile):
    with open(textfile, "rb") as text:
        for line in text:
            line = line.strip(" ")
            with open(csvfile, "ab") as f:
                if line.startswith("# Online_Resource"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       quotechar=" ")
                    write.writerow([line.lstrip("# ")])
                if line.startswith("##"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       quotechar=" ")
                    write.writerow([line.lstrip("# ")])
Here is a sample of some strings from the original text file:
# Online_Resource: https://www.ncdc.noaa.gov/
## Corg% percent organic carbon,,,%,,paleoceanography,,,N
What is really bizarre is that the final csv file looks good, except that the characters in the first column only (those with the # originally) partially "overwrite" each other when I try to manually delete some characters from the cell.
Oddly enough, too, there seems to be no formula to how the characters get jumbled each time I try to delete some after running the script. I tried encoding the csv file as unicode to no avail.
Thanks.
You've selected the excel dialect, but you overrode it with weird parameters:
You're using TAB as both separator and line terminator, which creates a one-line CSV file. Close enough to "truncated" for me.
Also, quotechar shouldn't be a space.
This had a nice side-effect, as you noted: the csv module actually splits the lines according to commas!
The code is also inefficient and error-prone: you're opening the file in append mode inside the loop and creating a new csv writer each time. Better done outside the loop.
Also, the comma split must now be done by hand. So, even better: use the csv module to read the file as well. My proposed fix for your routine:
import os
import csv

def csv_save_use(textfile, csvfile):
    with open(textfile, "rU") as text, open(csvfile, "wb") as f:
        write = csv.writer(f, dialect='excel',
                           delimiter='\t')
        reader = csv.reader(text, delimiter=",")
        for row in reader:
            if not row:
                continue  # skip possible empty rows
            if row[0].startswith("# Online_Resource"):
                write.writerow([row[0].lstrip("# ")])
            elif row[0].startswith("##"):
                # write the row, stripping the hashes from the first item
                write.writerow([row[0].lstrip("# ")] + row[1:])
Note that the file isn't properly displayed in Excel unless you remove delimiter='\t' (reverting to the default comma).
Also note that for Python 3 you need to replace open(csvfile, "wb") as f with open(csvfile, "w", newline='') as f.
Here's how the output looks now (the empty cells appear because there are several commas in a row).
More problems:
line=line.strip(" ") removes only leading and trailing spaces; it doesn't remove \r or \n. Try line=line.strip(), which removes leading and trailing whitespace (see the sketch after this list).
you get all your line including commas in one cell because you haven't split it up somehow ... like using a csv.reader instance. See here:
https://docs.python.org/2/library/csv.html#csv.reader
str.lstrip's non-default arg is treated as a set of characters to be removed, not a prefix, so '## ' has the same effect as '# '. If guff.startswith('## '), then do guff = guff[3:] to get rid of the unwanted text.
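A quick sketch illustrating both pitfalls (the sample lines are hypothetical, just for demonstration):

line = "## Corg% percent organic carbon,,,N\r\n"

# strip(" ") removes only spaces; the trailing \r\n survives
print(repr(line.strip(" ")))   # '## Corg% percent organic carbon,,,N\r\n'
print(repr(line.strip()))      # '## Corg% percent organic carbon,,,N'

# lstrip treats its argument as a set of characters, not a prefix
tricky = "# #1 ranked resource"
print(tricky.lstrip("# "))     # '1 ranked resource': the '#' of '#1' is gone too
if tricky.startswith("# "):
    print(tricky[2:])          # '#1 ranked resource': exact-prefix removal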
It is not very clear at all what the sentence containing "bizarre" means. We need to see exactly what is in the output csv file. Create a small test file with 3 records (1) with '# Online_Resource' (2) with "## " (3) none of the above, run your code, and show the output, like this:
print repr(open('testout.csv', 'rb').read())
I have a data format that appears similar to a csv file, but it has vertical bars surrounding the character strings and not the Boolean/numeric fields. For example:
|2000|,|code_no|,|first name, last name|,,,0,|word string|,0
|2000|,|code_no|,|full name|,,,0,|word string|,0
I'm not sure what format this is (it is saved as a txt file). What format is this, and how would I import it into Python?
For reference, I had been trying to use:
import unicodecsv

with open(csv_file, 'rb') as f:
    r = unicodecsv.reader(f)
and then stripping out the | from the start and end of the fields. This works OK, with the exception of fields which have a comma in them (e.g. |first name, last name|), where the field gets split because of the comma.
It looks like the pipes are being used as quote characters, not delimiters. Have you tried initializing the reader to use pipe ('|') as the quote character, and perhaps to use csv.QUOTE_NONNUMERIC as the quoting rules?
csv.reader(f, quotechar='|', quoting=csv.QUOTE_NONNUMERIC)
Have you tried .reader(f, delimiter=',', quotechar='|') ?
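For instance, a minimal sketch of the second suggestion against the sample data above (assuming it's saved as 'data.txt'; note that csv.QUOTE_NONNUMERIC would raise a ValueError on this particular data, because the empty unquoted fields can't be converted to float):

import csv

with open('data.txt', newline='') as f:
    for row in csv.reader(f, delimiter=',', quotechar='|'):
        print(row)
# ['2000', 'code_no', 'first name, last name', '', '', '0', 'word string', '0']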