Python – cleaning CSV file with split records - python

I have a delimited file in which some of the fields contain line termination characters. They can be LF or CR/LF.
The line terminators cause the records to split over multiple lines.
My objective is to read the file, remove the line termination characters, then write out a delimited file with quotes around the fields.
Sample input record:
444,2018-04-06,19:43:47,43762485,"Request processed"CR\LF
555,2018-04-30,19:17:56,43762485,"Added further note:LF
email customer a receipt" CR\LF
The first record is fine but the second has a LF (line feed) causing the record to fold.
import csv
with open(raw_data, 'r', newline='') as inp, open(csv_data, 'w') as out:
    csvreader = csv.reader(inp, delimiter=',', quotechar='"')
    for row in csvreader:
        print(str(row))
        out.write(str(row)[1:-1] + '\n')
My code nearly works but I don’t think it is correct.
The output I get is:
['444', '2020-04-06', '19:43:47', '344376882485', 'Request processed']
['555', '2020-04-30', '19:17:56', '344376882485', 'Added further note:\nemail customer a receipt']
I use the substring to remove the square brackets at the start and end of the line which I think is not the correct way.
Notice on the second record the new line character has been converted to \n. I would like to know how to get rid of that and also incorporate a csv writer to the code to place double quoting around the fields.
To remove the line terminators I tried replace, but it did not work.
(row.replace('\r', '').replace('\n', '') for row in csvreader)
I also tried to incorporate a csv writer but could not get it working with the list.
Any advice would be appreciated.

This snippet does what you want:
import csv
with open('raw_data.csv', 'r', newline='') as inp, open('csv_data.csv', 'w', newline='') as out:
    reader = csv.reader(inp, delimiter=',', quotechar='"')
    writer = csv.writer(out, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    for row in reader:
        # Drop line feeds embedded in each cell.
        fixed = [cell.replace('\n', '') for cell in row]
        writer.writerow(fixed)
Quoting all cells is handled by passing csv.QUOTE_ALL as the writer's "quoting" argument.
The line
fixed = [cell.replace('\n', '') for cell in row]
creates a new list of cells where embedded '\n' characters are replaced by the empty string.
By default, csv.writer ends each row with '\r\n' regardless of platform. If you want to override this you can pass a lineterminator argument to the writer.
To me the original csv seems fine: it's normal to have embedded newlines ("soft line-breaks") inside quoted cells, and csv-aware applications such as spreadsheets will handle them correctly. However, they will look wrong in applications that don't understand csv formatting and so treat the embedded newlines as actual end-of-line characters.
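The snippet above only strips '\n', while the question mentions that the embedded terminators can be LF or CR/LF. A minimal sketch that handles both is shown below. It assumes the files are opened with newline='' so the csv module sees the raw terminators; the function name flatten_records and the in-memory sample are illustrative only.

```python
import csv
import io
import re

# Matches CR/LF pairs as well as lone CR or LF characters.
_LINE_BREAKS = re.compile(r'\r\n|\r|\n')

def flatten_records(inp, out, replacement=''):
    """Copy CSV rows from inp to out, removing line breaks inside fields."""
    reader = csv.reader(inp, delimiter=',', quotechar='"')
    writer = csv.writer(out, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    for row in reader:
        writer.writerow([_LINE_BREAKS.sub(replacement, cell) for cell in row])

# In-memory stand-ins for files opened with newline=''.
raw = ('444,2018-04-06,19:43:47,43762485,"Request processed"\r\n'
       '555,2018-04-30,19:17:56,43762485,"Added further note:\n'
       'email customer a receipt"\r\n')
cleaned = io.StringIO(newline='')
flatten_records(io.StringIO(raw, newline=''), cleaned)
print(cleaned.getvalue())
```

Passing replacement=' ' instead of the empty string would keep the two halves of a folded field separated by a space.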

Related

Is there a work around for using multi-character delimiters for `csv.reader`?

Currently, only one character is allowed
Dialect.delimiter A one-character string used to separate fields. It
defaults to ','.
https://docs.python.org/3.6/library/csv.html#csv.Dialect.delimiter
Is there a work around for multi-character delimiters ?
I am asking because I am working with very messy text data, which contains instances of pretty much every sort of character, so I need a rare combination of characters to separate values effectively.
Here's the first part of my answer to the question "CSV writing strings of text that need a unique delimiter", adapted to work in Python 3.7. It uses a single rarely occurring character as the delimiter:
import csv
DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]
with open('data.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)
with open('data.csv', 'r', newline='') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print(row)
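If the data already arrives with a multi-character delimiter, one possible workaround is to swap that delimiter for a single rare character before the lines reach csv.reader, which accepts any iterable of strings. The sketch below is only an illustration: the delimiter '|~|', the substitute '\x1f' and the helper single_char_lines are assumptions, and it relies on the substitute never occurring in the data.

```python
import csv
import io

MULTI_DELIM = '|~|'   # hypothetical multi-character delimiter
SUBSTITUTE = '\x1f'   # ASCII unit separator; assumed absent from the data

def single_char_lines(lines, multi=MULTI_DELIM, single=SUBSTITUTE):
    """Yield each line with the multi-character delimiter swapped for one character."""
    for line in lines:
        if single in line:
            raise ValueError('substitute delimiter already occurs in the data')
        yield line.replace(multi, single)

sample = io.StringIO('itemA|~|itemB|~|Sentence with commas, colons: and "quotes".\n')
# QUOTE_NONE makes the reader treat quote characters as ordinary text.
reader = csv.reader(single_char_lines(sample), delimiter=SUBSTITUTE,
                    quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows)
```

This does not support fields that contain line breaks, since the substitution is applied line by line.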

python rstrip usage within loop while reading csv file

I have a CSV file where I am looping and matching with my database for getting results according to these matches.
I encountered a problem in the case where there is a space at the end of the text. So I did my research and found that I need to add the rstrip function to remove spaces at the end of the text.
Here is my code:
with open(path, encoding='utf-8') as f:
    data = csv.reader(f, delimiter='|')
    for row in data:
        line = row[0]
        cleanline = line.rstrip()
        lines.append(cleanline)
        query = line
The code is not working. I also tried strings like /s, and the strip and replace functions, but nothing works. What can be the reason? What am I doing wrong?
CSV File with empty space at the end:
Sistem en az 23.8 inç boyutlarında olmalıdır.
1 adet HDMI port olmalıdır.
You could try the following approach:
import csv
path = 'input.csv'
lines = []
with open(path, newline='', encoding='utf-8') as f:
    data = csv.reader(f, delimiter='|', skipinitialspace=True)
    for row in data:
        lines.append([c.strip() for c in row])
print(lines)
This removes all leading and trailing whitespace from each cell in a row using the strip() method. Depending on your data, it might be enough just to add the skipinitialspace=True parameter. That, though, would not remove trailing spaces before the next delimiter. newline='' should also be passed to open() in Python 3.x when the file is used with csv.reader().
The file you have given just contains lines of text, so you could read it as follows:
lines = []
with open('input.csv', encoding='utf-8') as f_input:
    for line in f_input:
        lines.append(line.strip())
print(lines)
This would give you lines containing:
['Sistem en az 23.8 inç boyutlarında olmalıdır.', '1 adet HDMI port olmalıdır.']
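One thing that stands out in the question's code is that the stripped value is stored in cleanline, but query is then assigned from line, so the uncleaned text may be what reaches the database. A minimal sketch of using the cleaned value is shown below; the helper read_stripped and the sample row are made up for illustration.

```python
import csv
import io

def read_stripped(f, delimiter='|'):
    """Return all rows from f with surrounding whitespace removed from every cell."""
    return [[cell.strip() for cell in row]
            for row in csv.reader(f, delimiter=delimiter)]

# Two pipe-delimited cells with stray spaces around them (made-up sample data).
sample = io.StringIO('  23.8 inch display  | 1 HDMI port \n', newline='')
rows = read_stripped(sample)
query = rows[0][0]  # use the cleaned value, not the raw cell
print(repr(query))
```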

Python CSV writer keeps adding unnecessary quotes

I'm trying to write to a CSV file with output that looks like this:
14897,40.50891,-81.03926,168.19999
but the CSV writer keeps writing the output with quotes at beginning and end
'14897,40.50891,-81.03926,168.19999'
When I print the line normally, the output is correct but I need to do line.split() or else the csv writer puts output as 1,4,8,9,7 etc...
But when I do line.split() the output is then
['14897,40.50891,-81.03926,168.19999']
Which is written as '14897,40.50891,-81.03926,168.19999'
How do I make the quotes go away? I already tried csv.QUOTE_NONE but it doesn't work.
with open(results_csv, 'wb') as out_file:
    writer = csv.writer(out_file, delimiter=',')
    writer.writerow(["time", "lat", "lon", "alt"])
    for f in file_directory:
        for line in open(f):
            print line
            line = line.split()
            writer.writerow(line)
With line.split(), you're not splitting on commas but on whitespace (spaces, linefeeds, tabs). Since there is none inside the line, you end up with only 1 item per row.
Since this item contains commas, the csv module has to quote it to distinguish those commas from the actual separator (which is also a comma). You would need line.strip().split(",") for it to work, but...
Using csv to read your data would be a better way to fix this.
Replace this:
for line in open(some_file):
    print line
    line = line.split()
    writer.writerow(line)
by:
with open(some_file) as f:
    cr = csv.reader(f)  # default separator is comma already
    writer.writerows(cr)
You don't need to read the file manually. You can simply use csv reader.
Replace the inner for loop with:
# with ensures that the file handle is closed after the code inside the block has run
with open(some_file) as file:
    row = csv.reader(file)  # read rows
    writer.writerows(row)   # write multiple rows at once
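The difference between the two ways of splitting can be seen in a small, self-contained sketch. It is written for Python 3 and uses an in-memory buffer instead of a file; the helper written is only for illustration. Note that csv.writer quotes with double quotes by default.

```python
import csv
import io

line = '14897,40.50891,-81.03926,168.19999\n'

def written(row):
    """Return the text csv.writer produces for a single row."""
    buf = io.StringIO(newline='')
    csv.writer(buf, delimiter=',').writerow(row)
    return buf.getvalue()

one_field = written(line.split())               # whitespace split: one item
four_fields = written(line.strip().split(','))  # comma split: four items
print(one_field)
print(four_fields)
```

The first call writes the whole line as one quoted field because that field contains the delimiter; the second writes four unquoted fields.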

How to search CSV line for string in certain column, print entire line to file if found

Sorry, very much a beginner with Python and could really use some help.
I have a large CSV file, items separated by commas, that I'm trying to go through with Python. Here is an example of a line in the CSV.
123123,JOHN SMITH,SMITH FARMS,A,N,N,12345 123 AVE,CITY,NE,68355,US,12345 123 AVE,CITY,NE,68355,US,(123) 555-5555,(321) 555-5555,JSMITH#HOTMAIL.COM,15-JUL-16,11111,2013,22-DEC-93,NE,2,1\par
I'd like my code to scan each line and look at only the 9th item (the state). For every line whose 9th item matches my query, I'd like that entire line to be written to a CSV.
The problem I have is that my code will find every occurrence of my query throughout the entire line, instead of just the 9th item. For example, if I scan looking for "NE", it will write the above line in my CSV, but also one that contains the string "NEARY ROAD."
Sorry if my terminology is off, again, I'm a beginner. Any help would be greatly appreciated.
I've listed my coding below:
import csv
with open('Sample.csv', 'rb') as f, open('NE_Sample.csv', 'wb') as outf:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(outf)
    for line in f:
        if "NE" in line:
            print ('Found: []'.format(line))
            writer.writerow([line])
You're not actually using your reader to read the input CSV, you're just reading the raw lines from the file itself.
A fixed version looks like the following (untested):
import csv
with open('Sample.csv', 'rb') as f, open('NE_Sample.csv', 'wb') as outf:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(outf)
    for row in reader:
        if row[8] == 'NE':
            print ('Found: {}'.format(row))
            writer.writerow(row)
The changes are as follows:
Instead of iterating over the input file's lines, we iterate over the rows parsed by the reader (each of which is a list of each of the values in the row).
We check to see if the 9th item in the row (i.e. row[8]) is equal to "NE".
If so, we output that row to the output file by passing it in, as-is, to the writer's writerow method.
I also fixed a typo in your print statement - the format method uses braces (not square brackets) to mark replacement locations.
This snippet should solve your problem:
import csv
with open('Sample.csv', 'rb') as f, open('NE_Sample.csv', 'wb') as outf:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(outf)
    for row in reader:
        if "NE" in row:
            print ('Found: {}'.format(row))
            writer.writerow(row)
In your code, if "NE" in line tests whether "NE" is a substring of the string line, which is not what you intend, because line is a raw line of your input file.
If you use if "NE" in row:, where row is a parsed line of your input file, you are doing exact element matching. Note that this matches "NE" in any column; to restrict the test to the 9th item, compare row[8] == "NE" instead.
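The answers above open the files in 'rb'/'wb' mode, which suits Python 2. A rough Python 3 sketch of the column-specific test follows; it uses in-memory buffers in place of files opened with newline='', and the helper rows_matching and the sample rows are invented for illustration.

```python
import csv
import io

STATE_COLUMN = 8  # zero-based index of the 9th field

def rows_matching(f, value, column=STATE_COLUMN):
    """Yield rows whose field at `column` equals `value` exactly."""
    for row in csv.reader(f, delimiter=','):
        # Skip short rows instead of raising IndexError.
        if len(row) > column and row[column] == value:
            yield row

sample = io.StringIO(
    '1,A SMITH,SMITH FARMS,A,N,N,1 MAIN ST,CITY,NE,68355\n'
    '2,B JONES,JONES FARMS,A,N,N,9 NEARY ROAD,CITY,IA,50001\n'
    '3,C BROWN,BROWN FARMS,A,N,N,4 OAK ST,CITY,IA,50002,NE\n',
    newline='')
out = io.StringIO(newline='')
csv.writer(out).writerows(rows_matching(sample, 'NE'))
print(out.getvalue())
```

Only the first row is written: the second contains "NE" merely as part of "NEARY ROAD", and the third has "NE" in a column other than the 9th.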

Python: writing an entire row to a CSV file. Why does it work this way?

I had exported a csv from Nokia Suite.
"sms","SENT","","+12345678901","","2015.01.07 23:06","","Text"
Reading from the PythonDoc, I tried
import csv
with open(sourcefile, 'r', encoding='utf8') as f:
    reader = csv.reader(f, delimiter=',')
    for line in reader:
        # write entire csv row
        with open(filename, 'a', encoding='utf8', newline='') as t:
            a = csv.writer(t, delimiter=',')
            a.writerows(line)
It didn't work until I put brackets around 'line', i.e. [line].
So at the last part I had
a.writerows([line])
Why is that so?
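The writerows method expects an iterable of rows, where each row is itself an iterable of fields. line is a single row (a list of strings), so writerows(line) treats each string as a row and each of its characters as a field. [line] wraps it in a list containing one row.
What you probably want to use instead is writerow.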
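The three calls can be compared side by side in a short sketch. It writes to in-memory buffers rather than files, and the shortened row and the helper dump are for illustration only.

```python
import csv
import io

line = ['sms', 'SENT', '+12345678901', 'Text']

def dump(write):
    """Run `write` against a fresh csv.writer and return the text it produced."""
    buf = io.StringIO(newline='')
    write(csv.writer(buf, delimiter=','))
    return buf.getvalue()

per_char = dump(lambda w: w.writerows(line))   # each string is treated as a row
wrapped = dump(lambda w: w.writerows([line]))  # one row holding four fields
single = dump(lambda w: w.writerow(line))      # same result, no wrapping needed
print(per_char)
print(wrapped)
```

Here per_char starts with the row s,m,s, because the string 'sms' was iterated character by character, while wrapped and single both contain the intended single row.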
