I have a large CSV file with one column, and some of its rows contain line breaks. I want to read the content of each cell and write it to its own text file, but the CSV reader splits cells that contain line breaks into multiple rows and writes each piece to a separate text file.
Using Python 3.6.2 on macOS Sierra.
Here is an example:
"content of row 1"
"content of row 2
continues here"
"content of row 3"
And here is how I am reading it:
with open(csvFileName, 'r') as csvfile:
    lines = csv.reader(csvfile)
    i = 0
    for row in lines:
        i += 1
        content = row
        outFile = open("output" + str(i) + ".txt", 'w')
        outFile.write(content)
        outFile.close()
This creates 4 files instead of 3, one per row. Any suggestions on how to ignore the line break in the second row?
You could define a regular expression pattern to help you iterate over the rows.
Read the entire file contents - if possible.
s = '''"content of row 1"
"content of row 2
continues here"
"content of row 3"'''
Pattern: a double-quote, followed by anything that isn't a double-quote, followed by a double-quote:
import re

row_pattern = '''"[^"]*"'''
row = re.compile(row_pattern, flags=re.DOTALL | re.MULTILINE)
Iterate the rows:
for r in row.finditer(s):
    print(r.group())
    print('******')
>>>
"content of row 1"
******
"content of row 2
continues here"
******
"content of row 3"
******
>>>
The file you describe is NOT a CSV (comma separated values) file. A CSV file is a list of records, one per line, where the fields within each record are separated by commas. There are various "flavors" of CSV which support various features for quoting fields (in case fields have embedded commas in them, for example).
I think your best bet would be to create an adapter class/instance which would pre-process the raw file, find and merge the continuation lines into records, and then pass those to your instance of csv.reader. You could model your class after StringIO from the Python standard library.
The point is that you create something which processes data but behaves enough like a file object that it can be used, transparently, as the input source for something like csv.reader().
(Done properly, you can even implement the Python context manager protocol. io.StringIO supports this protocol and could be used as a reference. This would allow you to use instances of this hypothetical "line merging" adapter class in a Python with statement, just as you're doing with your open file object in your example code.)
from io import StringIO
import csv

data = u'1,"a,b",2\n2,ab,2.1\n'
with StringIO(data) as infile:
    reader = csv.reader(infile, quotechar='"')
    for rec in reader:
        print(rec[0], rec[2], rec[1])
That's just a simple example of using io.StringIO in a with statement. Note that io.StringIO requires Unicode data, while io.BytesIO requires "bytes" or string data (at least in 2.7.x). Your adapter class can do whatever you like.
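As a rough sketch of the adapter idea (the class name and details are hypothetical, not from the original answer), a minimal line-merging adapter could buffer physical lines until the double quotes balance, then yield one logical record per iteration. Note that in Python 3, csv.reader already handles embedded newlines itself when the file is opened with newline='', so this is mainly an illustration of the pattern:

```python
import csv
from io import StringIO

class QuoteMergingReader:
    """Hypothetical adapter: merges physical lines until double quotes
    balance, then yields one logical record per iteration."""
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def __iter__(self):
        buffer = ''
        for line in self.fileobj:
            buffer += line
            # A record is complete once it contains an even number of quotes.
            if buffer.count('"') % 2 == 0:
                yield buffer
                buffer = ''
        if buffer:
            yield buffer  # trailing partial record, if any

# Usage: csv.reader accepts any iterable of strings, not just files.
raw = '"content of row 1"\n"content of row 2\ncontinues here"\n"content of row 3"\n'
rows = list(csv.reader(QuoteMergingReader(StringIO(raw))))
# rows == [['content of row 1'],
#          ['content of row 2\ncontinues here'],
#          ['content of row 3']]
```

The point is exactly as described above: the adapter processes data but behaves enough like a line iterator that csv.reader can consume it transparently.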
Related
I am trying to open a CSV file after I create it with Python. My goal is to read the file back without editing it, and my problem has been that I cannot get the delimiter to work.

The file is created with Python's csv writer, and then I attempt to use the reader to read the data back. This is where I am stuck. The CSV file is saved in the same location as my Python program, so I know it is not an access issue. The file is created with a special-character delimiter: I am using semicolons (;) because the raw data already contains commas (,), colons (:), plus signs (+), ampersands (&), periods (.), and possibly underscores (_) and/or dashes (-).

This is the code that I am using to read my CSV file:
with open('Cool.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';', dialect=csv.excel_tab)
    for row in csv_reader:
        print row[0]
csv_file.close()
Now this is my csv file (Cool.csv):
"Sat, 20 Apr 2019 00:17:05 +0000;Need to go to store;Eggs & Milk are needed ;Store: Grocery;Full Name: Safeway;Email: safewayiscool#gmail.com;Safeway <safewayiscool#gmail.com>, ;"
"Tue, 5 Mar 2019 05:54:24 +0000;Need to buy ham;Green eggs and Ham are needed for dinner ;Username: Dr.Seuss;Full Name: Theodor Seuss Geisel;Email: greeneggs+ham#seuss.com;"
So I would expect my output to be the following when I run the code:
Sat, 20 Apr 2019 00:17:05 +0000
Tue, 5 Mar 2019 05:54:24 +0000
I either get a null error of some kind or it prints out the entire line. How can I get it to separate the data into columns delimited by the ;?
I am not sure if the issue is that I am trying to use the semicolon or if it is something else. If it is just the semicolon, I could change it, but many other characters are already taken in the incoming data.
Also, please don't suggest that I simply read everything from the original file. It is a massive file with a lot of other data, and I want to trim it before executing this second program.
UPDATE:
This is the code that builds the file:
with open('Cool.csv', 'w') as csvFile:
    writer = csv.writer(csvFile, delimiter=';')
    for m in file:
        message = m['msg']
        message2 = message.replace('\r\n\r\n', ';')
        message3 = message2.replace('\r\n', ';')
        entry = m['date'] + ";" + m['subject'] + ";" + message3
        list = []
        list.append(entry)
        writer.writerow(list)
csvFile.close()
It looks like the file was created incorrectly. The sample data provided shows each whole line double-quoted, which makes it one long single column. Here's correct code to write and read a semicolon-delimited file:
import csv

with open('Cool.csv', 'w', newline='', encoding='utf-8-sig') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=';')
    csv_writer.writerow(['data,data', 'data;data', 'data+-":_'])

with open('Cool.csv', 'r', newline='', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    for row in csv_reader:
        print(row)
Output (matches data written):
['data,data', 'data;data', 'data+-":_']
Cool.csv:
data,data;"data;data";"data+-"":_"
Notes:
utf-8-sig is the most compatible encoding with Excel. Any Unicode character you put in the file will work and look correct when the CSV is opened in Excel.
newline='' is required per the csv documentation. The csv module handles its own newlines per the dialect used (default 'excel').
; delimiter is not needed. The default , will work. Note how the second entry has a semicolon, so the field was quoted. The first field with comma would have been quoted instead if the delimiter was a comma and it would still work.
csv_writer.writerow takes a sequence containing the column data.
csv_reader returns each row as a list of the column data.
A column in the .CSV is double-quoted if it contains the delimiter, and quotes are doubled if present in the data to escape them as well. Note the third field has a double quote.
csv_writer.close() and csv_reader.close() are not needed if using with.
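To illustrate the note above about the delimiter (a sketch, not from the original answer; StringIO is used here in place of a file): with the default comma delimiter, it is the comma-containing field that gets quoted instead, and the data still round-trips cleanly:

```python
import csv
from io import StringIO

buf = StringIO()
writer = csv.writer(buf)  # default delimiter=','
writer.writerow(['data,data', 'data;data', 'data+-":_'])
# On disk this becomes: "data,data",data;data,"data+-"":_"
# The first field is quoted (contains the comma delimiter),
# the second is not (a semicolon needs no quoting now),
# and the embedded quote in the third field is doubled.

buf.seek(0)
row = next(csv.reader(buf))
# row == ['data,data', 'data;data', 'data+-":_']
```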
RTFM.
From help(csv):
DIALECT REGISTRATION:
Readers and writers support a dialect argument, which is a convenient
handle on a group of settings. When the dialect argument is a string,
it identifies one of the dialects previously registered with the module.
If it is a class or instance, the attributes of the argument are used as
the settings for the reader or writer:
class excel:
    delimiter = ','
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = QUOTE_MINIMAL
And you pass dialect=csv.excel_tab on top of an explicit delimiter=';', which is confusing at best. Just don't use the dialect option.
Sidenote: with handles closing of the file handle for you. Read here
Second sidenote: The whole line of your CSV file is in double quotes. Either get rid of them, or disable the quoting. i.e.
with open('b.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        print(row[0])
I have an interesting situation with Python's csv module. I have a function that takes specific lines from a text file and writes them to csv file:
import os
import csv

def csv_save_use(textfile, csvfile):
    with open(textfile, "rb") as text:
        for line in text:
            line = line.strip()
            with open(csvfile, "ab") as f:
                if line.startswith("# Online_Resource"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
                if line.startswith("##"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
Here is a sample of some strings from the original text file:
# Online_Resource: https://www.ncdc.noaa.gov/
## Corg% percent organic carbon,,,%,,paleoceanography,,,N
What is really bizarre is that the final csv file looks good, except the characters in the first column only (those with the # originally) partially "overwrite" each other when I try to manually delete some characters from the cell.
Oddly enough, there seems to be no pattern to how the characters get jumbled each time I try to delete some after running the script. I tried encoding the csv file as Unicode, to no avail.
Thanks.
You've selected excel dialect but you overrode it with weird parameters:
You're using TAB as separator and line terminator, which creates a 1-line CSV file. Close enough to "truncated" to me
Also quotechar shouldn't be a space.
This had a nice side effect, as you noted: the csv module actually splits the lines on commas!
The code is also inefficient and error-prone: you're opening the file in append mode inside the loop and creating a new csv writer each time. Better done outside the loop.
Also, the comma splitting would now have to be done by hand. So even better: use the csv module to read the file as well. My proposed fix for your routine:
import os
import csv

def csv_save_use(textfile, csvfile):
    with open(textfile, "rU") as text, open(csvfile, "wb") as f:
        write = csv.writer(f, dialect='excel', delimiter='\t')
        reader = csv.reader(text, delimiter=",")
        for row in reader:
            if not row:
                continue  # skip possible empty rows
            if row[0].startswith("# Online_Resource"):
                write.writerow([row[0].lstrip("# ")])
            elif row[0].startswith("##"):
                # write row, stripping the first item of hashes
                write.writerow([row[0].lstrip("# ")] + row[1:])
Note that the file isn't properly displayed in Excel unless you remove delimiter='\t' (reverting to the default comma).
Also note that for Python 3 you need to replace open(csvfile, "wb") as f with open(csvfile, "w", newline='') as f.
Here's how the output looks now (note that the empty cells appear because there are several commas in a row).
more problems:
line=line.strip(" ") removes leading and trailing spaces. It doesn't remove \r or \n ... try line=line.strip() which removes leading and trailing whitespace
you get all your line including commas in one cell because you haven't split it up somehow ... like using a csv.reader instance. See here:
https://docs.python.org/2/library/csv.html#csv.reader
str.lstrip's non-default argument is treated as a set of characters to be removed, so '## ' has the same effect as '# '. If guff.startswith('## '), then do guff = guff[3:] to get rid of the unwanted prefix.
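A quick illustration of that character-set behaviour, using one of the sample lines from the question:

```python
s = "## Corg% percent organic carbon"
# lstrip treats its argument as a SET of characters, not a literal prefix:
# every leading character in {'#', ' '} is removed.
assert s.lstrip("# ") == "Corg% percent organic carbon"
assert s.lstrip("## ") == s.lstrip("# ")  # identical: same character set
# To remove an exact prefix instead, test and slice it off:
if s.startswith("## "):
    s = s[3:]
```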
It is not very clear what the sentence containing "bizarre" means. We need to see exactly what is in the output csv file. Create a small test file with 3 records: (1) with '# Online_Resource', (2) with '## ', (3) none of the above; run your code, and show the output, like this:
print repr(open('testout.csv', 'rb').read())
I wrote an HTML parser in python used to extract data to look like this in a csv file:
itemA, itemB, itemC, Sentence that might contain commas, or colons: like this,\n
so I used the delimiter ":::::", thinking that it wouldn't appear in the data:
itemA, itemB, itemC, ::::: Sentence that might contain commas, or colons: like this,::::\n
This works for most of the thousands of lines; however, a stray colon (:) apparently threw this off when I imported the CSV in Calc.
My question is, what is the best or a unique delimiter to use when creating a csv with many variations of sentences that need to be separated with some delimiter? Am I understanding delimiters correctly in that they separate the values within a CSV?
As I suggested informally in a comment, unique just means you need to use some character that won't be in the data — chr(255) might be a good choice. For example:
Note: The code shown is for Python 2.x — see comments for a Python 3 version.
import csv

DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]

with open('data.csv', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)

with open('data.csv', 'rb') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print row
Output:
['itemA', 'itemB', 'itemC', 'Sentence that might contain commas, colons: or even "quotes".']
If you're not using the csv module and instead are writing and/or reading the data manually, then it would go something like this:
with open('data.csv', 'wb') as outfile:
    outfile.write(DELIMITER.join(data) + '\n')

with open('data.csv', 'rb') as infile:
    row = infile.readline().rstrip().split(DELIMITER)
    print row
Yes, delimiters separate values within each line of a CSV file. There are two strategies to delimiting text that has a lot of punctuation marks. First, you can quote the values, e.g.:
Value 1, Value 2, "This value has a comma, <- right there", Value 4
The second strategy is to use tabs (i.e., '\t').
Python's built-in CSV module can both read and write CSV files that use quotes. Check out the example code under the csv.reader function. The built-in csv module will handle quotes correctly, e.g. it will escape quotes that are in the value itself.
CSV files usually use double quotes " to wrap long fields that might contain a field separator like a comma. If the field itself contains a double quote, it is normally escaped by doubling it ("") rather than with a backslash; this is what Python's csv module does by default (doublequote=True).
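The quoting rules described above can be seen directly with the csv module (a small sketch using StringIO in place of a file):

```python
import csv
from io import StringIO

buf = StringIO()
csv.writer(buf).writerow(['plain', 'has,comma', 'has "quote"'])
line = buf.getvalue()
# Fields containing the delimiter are wrapped in quotes,
# and embedded quotes are escaped by doubling:
# line == 'plain,"has,comma","has ""quote"""\r\n'

buf.seek(0)
row = next(csv.reader(buf))
# Reading it back round-trips to the original values:
# row == ['plain', 'has,comma', 'has "quote"']
```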
import csv, Tkinter

with open('most_common_words.csv') as csv_file:  # open the file in a context manager so it's closed automatically
    csv_reader = csv.reader(csv_file)  # create a csv reader instance
    for row in csv_reader:  # read each line in the csv file into 'row' as a list
        print row[0]  # print the first item in the list
I'm trying to import this list of most common words using csv. It continues to give me the same error
for row in csv_reader: # Read each line in the csv file into 'row' as a list
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
I've tried a couple different ways to do it as well, but they didn't work either. Any suggestions?
Also, where does this file need to be saved? Is it okay just being in the same folder as the program?
You should open a CSV file in binary mode ('rb') on Python 2, or with newline='' on Python 3. Also, make sure that the delimiter and quote characters are , and ", or you'll need to specify otherwise:
with open('most_common_words.csv', 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';', quotechar='"')  # for EU CSV
You can save the file in the same folder as your program. If you don't, you can provide the correct path to open() as well. Be sure to use raw strings if you're on Windows, otherwise the backslashes may trick you: open(r"C:\Python27\data\table.csv")
It seems you have a file with one column as you say here:
It is a simple list of words. When I open it up, it opens into Excel
with one column and 500 rows of 500 different words.
If so, you don't need the csv module at all:
with open('most_common_words.csv') as f:
    rows = list(f)
Note in this case, each item of the list will have the newline appended to it, so if your file is:
apple
dog
cat
rows will be ['apple\n', 'dog\n', 'cat\n']
If you want to strip the end of line, then you can do this:
with open('most_common_words.csv') as f:
    rows = [i.rstrip() for i in f]
I have two very large files:
File1 is formatted as such:
thisismy#email.com:20110708
thisisnotmy#email.com:20110908
thisisyour#email.com:20090807
...
File2 is a csv file that has the same email addresses in the row[0] field, and I need to put the date into the row[5] field.
I understand how to properly read and parse the CSV, and I know how to read File1 and split it properly.
What I need assistance with is how to properly search the CSV file for ANY instances of the email address and update the csv with the corresponding date.
Thanks for your assistance.
You may want to check out the re module:
import re

with open('filename.csv') as f:
    # re.MULTILINE makes ^ match at the start of every line, not just the string
    emails = re.findall(r'^(.*\#.*?):', f.read(), flags=re.MULTILINE)
That will get you all the emails.
If the data you have to replace has a fixed size, which seems to be the case in your example, you can use seek(). While reading the file looking for your value, note the cursor position and write your replacement data at the desired position.
Cf: Writing in file's actual position in Python
However, if you are dealing with extra huge files, using command line tools such as sed could save a lot of processing time.
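A minimal sketch of the seek() idea (the file name and contents are hypothetical; this only works safely when the replacement has exactly the same length as the text it overwrites, as with the fixed-width dates here):

```python
import os

# Create a sample file in File1's format, with a fixed-width (8-char) date.
with open('sample.txt', 'w') as f:
    f.write('thisismy#email.com:20110708\n')

# Overwrite the date in place, without rewriting the rest of the file.
# Binary mode is used because seek() to arbitrary offsets is only
# well-defined for binary files.
with open('sample.txt', 'rb+') as f:
    line = f.readline()
    offset = line.index(b':') + 1  # byte position of the date field
    f.seek(offset)
    f.write(b'20120101')           # same length: 8 bytes

with open('sample.txt') as f:
    result = f.read()
# result == 'thisismy#email.com:20120101\n'
os.remove('sample.txt')
```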
Below example tested on Python 2.7:
import csv

# 'b' flag for binary is necessary if on Windows, otherwise crlf hilarity ensues
with open('/path/to/file1.txt', 'rb') as fin:
    csv_reader = csv.reader(fin, delimiter=":")
    # Header in line 1? Skip over. Otherwise no need for next line.
    csv_reader.next()
    # Populate dict with email address as key and date as value.
    # Dictionary comprehensions are supported in 2.7+;
    # on a lower version use: d = dict((line[0], line[1]) for line in csv_reader)
    email_address_dict = {line[0]: line[1] for line in csv_reader}

# There are ways to modify a file in-place,
# but it's easier to write to a new file.
with open('/path/to/file2.txt', 'rb') as fin, \
     open('/path/to/file3.txt', 'wb') as fou:
    csv_reader = csv.reader(fin, delimiter=":")
    csv_writer = csv.writer(fou, delimiter=":")
    # Header in line 1? Skip over. Otherwise no need for next line.
    csv_writer.writerow(csv_reader.next())
    for line in csv_reader:
        # Construct a new line, looking up the date value in the just-created
        # dict; the new date value is inserted at position 5 (zero-based).
        newline = line[0:5]
        newline.append(email_address_dict[line[0]])
        newline.extend(line[6:])
        csv_writer.writerow(newline)