Remove line break inside line row from CSV with regular expression - python

Hello, I have this text:
1,0.00,,2.00,10,"Block. CertNot Valid.
Query with me",2013-06-20,0,0.00
These are two lines in the CSV file, but they are really one row of data. I want to remove the line break and put the row on a single line using regular expressions.
I've tried (\")(.*)(\n)(.*)(\"), but it doesn't work.

Don't. There is no need to remove the line break.
Use the csv module to read the CSV file; it will handle the line break correctly:
import csv
# Python 2: the csv module expects the file to be opened in binary mode
with open(csvfilename, 'rb') as infile:
    reader = csv.reader(infile)
    for row in reader:
        print repr(row[5])
will print:
'Block. CertNot Valid.\nQuery with me'
for that row.
This works because that column is correctly quoted.
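If the goal really is one physical line per record, a minimal sketch is to let the csv module parse the row and then replace the line break inside each field. This assumes Python 3 and keeps the sample row in a string rather than a file; SAMPLE and flatten_embedded_newlines are names made up for this illustration.

```python
import csv
import io

# Assumed sample: the row from the question, held in a string.
SAMPLE = '1,0.00,,2.00,10,"Block. CertNot Valid.\nQuery with me",2013-06-20,0,0.00\n'


def flatten_embedded_newlines(text):
    """Parse CSV text and replace line breaks inside fields with a space."""
    reader = csv.reader(io.StringIO(text, newline=""))
    return [
        [field.replace("\r\n", " ").replace("\n", " ") for field in row]
        for row in reader
    ]


rows = flatten_embedded_newlines(SAMPLE)
print(rows[0][5])  # Block. CertNot Valid. Query with me
```

Writing the cleaned rows back out with csv.writer would then produce one line per record.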

You can check the result here: https://www.debuggex.com/r/2_X5N-wTLZ2laJKh
Console output:
>>> regex = re.compile("\"(.+?)\"",re.MULTILINE|re.DOTALL|re.VERBOSE)
>>> regex.findall(string)
[u'Block. CertNot Valid.\nQuery with me', u'test\naaa', u'bbb\nvvvv']
And the value of 'string' is:
1,0.00,,2.00,10,"Block. CertNot Valid.
Query with me",2013-06-20,0,0.00
1,0.00,,2.00,10,"test
aaa",2013-06-20,0,0.00
1,0.00,,2.00,10,"bbb
vvvv",2013-06-20,0,0.00
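The pattern above only finds the quoted fields. One possible way to actually remove the line breaks with it is re.sub with a replacement function. This is a sketch assuming Python 3, and assuming fields never contain escaped quotes (""), which this simple pattern would pair up incorrectly; join_quoted_lines is a hypothetical helper name.

```python
import re

# Same idea as the pattern above: a non-greedy match between two double quotes,
# with DOTALL so that "." also matches the embedded newline.
QUOTED = re.compile(r'"(.+?)"', re.DOTALL)


def join_quoted_lines(text):
    """Replace line breaks inside double-quoted fields with a single space."""
    return QUOTED.sub(lambda m: m.group(0).replace("\n", " "), text)


sample = '1,"a\nb",2\n3,"c\nd",4\n'
print(join_quoted_lines(sample))
```

Line breaks between records are left alone because they fall outside the quoted matches.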

Related

Python CSV writer keeps adding unnecessary quotes

I'm trying to write to a CSV file with output that looks like this:
14897,40.50891,-81.03926,168.19999
but the CSV writer keeps writing the output with quotes at beginning and end
'14897,40.50891,-81.03926,168.19999'
When I print the line normally, the output is correct but I need to do line.split() or else the csv writer puts output as 1,4,8,9,7 etc...
But when I do line.split() the output is then
['14897,40.50891,-81.03926,168.19999']
Which is written as '14897,40.50891,-81.03926,168.19999'
How do I make the quotes go away? I already tried csv.QUOTE_NONE, but it doesn't work.
with open(results_csv, 'wb') as out_file:
    writer = csv.writer(out_file, delimiter=',')
    writer.writerow(["time", "lat", "lon", "alt"])
    for f in file_directory:
        for line in open(f):
            print line
            line = line.split()
            writer.writerow(line)
With line.split() you are not splitting on commas but on whitespace (spaces, linefeeds, tabs). Since the line contains no whitespace apart from its trailing newline, you end up with only one item per row.
Since this item contains commas, the csv module has to quote it to distinguish those commas from the actual separator (which is also a comma). You would need line.strip().split(",") for it to work, but using csv to read your data is a better fix.
Replace this:
    for line in open(some_file):
        print line
        line = line.split()
        writer.writerow(line)
with:
    with open(some_file) as f:
        cr = csv.reader(f)  # default separator is comma already
        writer.writerows(cr)
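The difference between the two splits can be seen in a small sketch. It assumes Python 3 and writes to an in-memory buffer instead of a file; write_row is a helper invented for this illustration.

```python
import csv
import io

line = "14897,40.50891,-81.03926,168.19999\n"


def write_row(fields):
    """Return what csv.writer emits for one row (default dialect)."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow(fields)
    return buf.getvalue()


whitespace_split = line.split()          # one item that still contains commas
comma_split = line.strip().split(",")    # four separate items

print(repr(write_row(whitespace_split)))  # the single item gets quoted
print(repr(write_row(comma_split)))       # no quoting needed
```

With the default dialect the writer quotes a field only when it has to, which is why the single comma-containing item comes out wrapped in quotes.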
You don't need to read the file manually; you can simply use a csv reader.
Replace the inner for loop with:
# with ensures that the file handle is closed after the code inside the block has run
with open(some_file) as file:
    rows = csv.reader(file)  # read rows
    writer.writerows(rows)  # write multiple rows at once

Python: How to capitalize the first column of a .txt file.

I have a .csv formatted .txt file. I am deliberating over the best manner in which to .capitalize the text in the first column.
.capitalize() is a string method, so I considered the following: I would need to open the file, convert the data to a list of strings, capitalize the required word, and finally write the data back to the file.
To achieve this, I did the following:
import csv
newGuestList = []
with open("guestList.txt", "r+") as guestFile:
    guestList = csv.reader(guestFile)
    for guest in guestList:
        for guestInfo in guest:
            capitalisedName = guestInfo.capitalize()
            newGuestList.append(capitalisedName)
Which gives the output:
['Peter', '35', ' spain', 'Caroline', '37', 'france', 'Claire', '32', ' sweden']
The problem:
Firstly, in order to write this new list back to the file, I will need to convert it to a string. I can achieve this using the .join method. However, how can I introduce a newline, \n, after every third word (the country) so that each guest has their own line in the text file?
Secondly, this method of nested for loops seems highly convoluted. Is there a cleaner way?
My .txt file:
peter, 35, spain\n
caroline, 37, france\n
claire, 32, sweden\n
You don't need to split the lines, since the first character of the first word is the first character of the line:
with open("lst.txt", "r") as guestFile:
    lines = guestFile.readlines()
newlines = [line.capitalize() for line in lines]
with open("lst.txt", "w") as guestFile:
    guestFile.writelines(newlines)
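One thing to keep in mind with this approach: str.capitalize() also lowercases the rest of the string, so any capital letters later in the line are lost. A possible alternative that touches only the first character is sketched below (Python 3 assumed; capitalize_first_char is a made-up helper name).

```python
def capitalize_first_char(line):
    """Uppercase only the first character and leave the rest of the line alone."""
    return line[:1].upper() + line[1:]


sample = "peter, 35, Spain\n"
print(repr(sample.capitalize()))             # 'Peter, 35, spain\n'
print(repr(capitalize_first_char(sample)))   # 'Peter, 35, Spain\n'
```

For the all-lowercase file in the question both versions give the same result.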
You can just use a CSV reader and writer and access the element you want to capitalize from the list.
import csv
import os
inp = open('a.txt', 'r')
out = open('b.txt', 'w')
reader = csv.reader(inp)
writer = csv.writer(out)
for row in reader:
    row[0] = row[0].capitalize()
    writer.writerow(row)
inp.close()
out.close()
os.rename('b.txt', 'a.txt') # if you want to keep the same name
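The same idea can be sketched with context managers and a temporary file, so the handles are closed even if an error occurs. This assumes Python 3; capitalize_first_column is a hypothetical function name, and os.replace is used here instead of os.rename because it overwrites an existing target on every platform.

```python
import csv
import os
import tempfile


def capitalize_first_column(path):
    """Rewrite the CSV file at `path` with the first column capitalized."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, newline="") as src, tempfile.NamedTemporaryFile(
        "w", newline="", dir=directory, delete=False
    ) as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row:  # skip the edit for blank lines
                row[0] = row[0].capitalize()
            writer.writerow(row)
    # Both files are closed here; swap the new file into place.
    os.replace(dst.name, path)
```

Calling capitalize_first_column("a.txt") would update a.txt in place; note that csv.writer ends rows with \r\n by default.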

Remove a specific row in a csv file with python

I am trying to remove a row from a csv file if the 2nd column matches a string. My csv file has the following information:
Name
15 Dog
I want the row with "Name" in it removed. The code I am using is:
import csv
reader = csv.reader(open("info.csv", "rb"), delimiter=',')
f = csv.writer(open("final.csv", "wb"))
for line in reader:
    if "Name" not in line:
        f.writerow(line)
        print line
But the "Name" row isn't removed. What am I doing wrong?
EDIT: I was using the wrong delimiter. Changing it to \t worked. Below is the code that works now.
import csv
reader = csv.reader(open("info.csv", "rb"), delimiter='\t')
f = csv.writer(open("final.csv", "wb"))
for line in reader:
    if "Name" not in line:
        f.writerow(line)
        print line
It seems that you are specifying the wrong delimiter (comma) in csv.reader.
Each line yielded by reader is a list, split by your delimiter, which you specified as a comma. Are you sure that is the delimiter you want? Your sample is delimited by tabs.
Anyway, you want to check whether 'Name' is in any element of a given line. So this will still work, regardless of whether your delimiter is correct:
for line in reader:
    if not any('Name' in x for x in line):
        f.writerow(line)  # write only the rows that do not mention 'Name'
Notice the difference. This version checks for 'Name' inside each list element; yours checks whether 'Name' is itself an element of the list. They are semantically different because 'Name' in ['blah blah Name'] is False.
I would recommend first fixing the delimiter error. If you still have issues, use if any(...), as it is possible that the exact token 'Name' is not in your list, but something that contains 'Name' is.
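The difference between the two checks can be sketched as follows. This assumes Python 3 and uses an invented tab-delimited sample held in a string; SAMPLE and rows_without are names made up for the illustration.

```python
import csv
import io

# Assumed sample standing in for info.csv (tab-delimited).
SAMPLE = "ID\tPet Name\n15\tDog\n"


def rows_without(text, token, delimiter="\t"):
    """Return the parsed rows in which no field contains `token`."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return [row for row in reader if not any(token in field for field in row)]


header = ["ID", "Pet Name"]
print("Name" in header)                          # False: no element equals "Name"
print(any("Name" in field for field in header))  # True: one element contains it

print(rows_without(SAMPLE, "Name"))                  # [['15', 'Dog']]
print(rows_without(SAMPLE, "Name", delimiter=","))   # [['15\tDog']]
```

With the wrong delimiter the header row is still filtered out, but each remaining row is kept as a single unsplit field.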

Effective way to get part of string until token

I'm parsing a very big csv (big = tens of gigabytes) file in python and I need only the value of the first column of every line. I wrote this code, wondering if there is a better way to do it:
delimiter = ','
f = open('big.csv','r')
for line in f:
    pos = line.find(delimiter)
    id = int(line[0:pos])
Is there a more effective way to get the part of the string before the first delimiter?
Edit: I do know about the CSV module (and I have used it occasionally), but I do not need to load every line of this file into memory; I only need the first column. So let's focus on string parsing.
>>> a = '123456'
>>> print a.split('2', 1)[0]
1
>>> print a.split('4', 1)[0]
123
>>>
But, if you're dealing with a CSV file, then:
import csv
with open('some.csv') as fin:
    for row in csv.reader(fin):
        print int(row[0])
The csv module will also handle quoted columns containing quotes, etc.
If the first field can't contain an escaped delimiter (as in your case, where the first field is an integer) and no field contains embedded newlines (i.e., each row corresponds to exactly one physical line in the file), then the csv module is overkill, and you could use your code from the question or line.split(',', 1) as suggested by @Jon Clements.
To handle occasional lines that have no delimiter in them you could use str.partition:
with open('big.csv', 'rb') as file:
    for line in file:
        first, sep, rest = line.partition(b',')
        if sep:  # the line has ',' in it
            process_id(int(first))  # or `yield int(first)`
Note: s.split(',', 1)[0] silently returns a wrong result (the whole string) if there is no delimiter in the string.
'rb' file mode is used to avoid unnecessary end-of-line manipulation (and implicit decoding to Unicode on Python 3). It is safe to use if the csv file has '\n' at the end of each row, i.e., the newline is either '\n' or '\r\n'.
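The behaviour of partition and split on a line without a delimiter can be compared in a short sketch. It assumes Python 3 and feeds bytes lines from a list instead of a file opened in 'rb' mode; first_column_ids is a hypothetical generator name.

```python
def first_column_ids(lines, delimiter=b","):
    """Yield int(first field) for every line that contains the delimiter."""
    for line in lines:
        first, sep, _rest = line.partition(delimiter)
        if sep:  # skip lines that have no delimiter at all
            yield int(first)


lines = [b"12,a\n", b"no delimiter here\n", b"7,b,c\r\n"]
print(list(first_column_ids(lines)))  # [12, 7]

# split() gives no hint that the delimiter was missing:
print(b"no delimiter here\n".split(b",", 1)[0])
```

Because the generator yields one value at a time, only the current line is held in memory.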
Personally, I would do it with generators:
from itertools import imap  # Python 2 only
import csv

def int_of_0(x):
    return int(x[0])

def obtain(filepath, treat):
    with open(filepath, 'rb') as f:
        for i in imap(treat, csv.reader(f)):
            yield i

for x in obtain('essai.txt', int_of_0):
    pass  # instructions
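itertools.imap exists only in Python 2. A rough Python 3 equivalent of the generator approach might look like the sketch below; first_column_ints and first_column_ints_from_file are names invented here, and the csv reader is fed any iterable of text lines.

```python
import csv


def first_column_ints(lines):
    """Lazily yield int(first field) for each non-empty CSV row."""
    for row in csv.reader(lines):
        if row:  # csv.reader yields [] for blank lines
            yield int(row[0])


def first_column_ints_from_file(path):
    """Same as above, reading lazily from a file on disk."""
    with open(path, newline="") as f:
        yield from first_column_ints(f)


print(list(first_column_ints(['12,"a,b"\n', "\n", "7,x\n"])))  # [12, 7]
```

Both generators stay lazy, so the file is read row by row rather than loaded at once.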

Sorting CSV file with delimiter in Python

How do I read a .csv file with the following content
$C=2$A=3$B=1$
and then create a new .csv file with the same content, but with each $ changed into , and the items sorted alphabetically, like the following:
A=3,B=1,C=2
Thank you!
Edit:
Here's my code so far. It ended up giving an extra comma at the beginning of the output.
import csv
input = csv.reader(open('inputfile.csv', 'r'), delimiter='$')
output = open('outputfile.csv', 'w')
try:
    writer = csv.writer(output)
    for column in input:
        writer.writerow(sorted(column))
        print (sorted(column))
finally:
    output.close()
Right now my input is:
$C=2$A=3$B=1$
and my output is:
,A=3,B=1,C=2
I want it to be:
A=3,B=1,C=2
Thanks!
import csv
with open('test.csv') as in_file, open('new.csv', 'w') as out_file:
    for line in csv.reader(in_file, delimiter='$'):
        out_file.write(','.join(sorted(line)[2:]) + '\n')
Basically what this does is:
- open the input as in_file
- open the output as out_file
- initializes a CSV reader with $ as the delimiter using in_file as the input file
- iterates through each row doing the following:
  - sort all of the elements (after parsing)
  - discard the first 2 (since they'll always be empty strings due to the start/end delimiters on each line)
  - recombine those elements using , as the delimiter
  - write that out to the file with a trailing newline \n
edit: fixed for the start/end $ symbols by removing the empty elements that get parsed out of the CSV (the [2:] bit)
You can use a csv.reader to read the file with the delimiter set to '$'. Then for each row returned, strip out the empty elements and sort the rest:
row = sorted([item for item in row if item])
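Put together, the filter-and-sort approach might look like this sketch. It assumes Python 3 and works on in-memory strings rather than files; sort_dollar_rows is a name made up for the example.

```python
import csv
import io


def sort_dollar_rows(text):
    """Convert $-delimited rows to comma-delimited rows with sorted items."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="$")
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Leading/trailing $ produce empty strings; drop them before sorting.
        writer.writerow(sorted(item for item in row if item))
    return out.getvalue()


print(sort_dollar_rows("$C=2$A=3$B=1$\n"))  # A=3,B=1,C=2
```

Filtering out empty items does not depend on how many stray delimiters a line has, unlike slicing off a fixed number of elements.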
