Strip white spaces from CSV file - python

I need to strip the whitespace from a CSV file that I read:
import csv
aList=[]
with open(self.filename, 'r') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    for row in reader:
        aList.append(row)
    # I need to strip the extra white space from each string in the row
    return(aList)

There's also the built-in formatting parameter skipinitialspace (the default is False). When it is set to True, spaces immediately following the delimiter are ignored; trailing spaces are left untouched.
http://docs.python.org/2/library/csv.html#csv-fmt-params
aList=[]
with open(self.filename, 'r') as f:
    reader = csv.reader(f, skipinitialspace=True, delimiter=',', quoting=csv.QUOTE_NONE)
    for row in reader:
        aList.append(row)
    return(aList)
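To make the limits of skipinitialspace concrete, here is a minimal Python 3 sketch. The sample data is invented for illustration and is read from an in-memory string rather than a file. It suggests that only the spaces after each delimiter go away, so a strip() pass is still needed for trailing spaces.

```python
import csv
import io

# Hypothetical sample data: spaces after the commas and before some of them.
sample = "name, city ,  zip\nAda,  London , 12345\n"

# skipinitialspace=True drops spaces that directly follow a delimiter.
rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))

# Trailing spaces survive, so strip each cell if both sides matter.
stripped = [[cell.strip() for cell in row] for row in rows]

print(rows)
print(stripped)
```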

In my case, I only cared about stripping the whitespace from the field names (aka column headers, aka dictionary keys), when using csv.DictReader.
Create a class based on csv.DictReader, and override the fieldnames property to strip out the whitespace from each field name (aka column header, aka dictionary key).
Do this by getting the regular list of fieldnames, and then iterating over it while creating a new list with the whitespace stripped from each field name, and setting the underlying _fieldnames attribute to this new list.
import csv
class DictReaderStrip(csv.DictReader):
    @property
    def fieldnames(self):
        if self._fieldnames is None:
            # Initialize self._fieldnames
            # Note: DictReader is an old-style class, so can't use super()
            csv.DictReader.fieldnames.fget(self)
            if self._fieldnames is not None:
                self._fieldnames = [name.strip() for name in self._fieldnames]
        return self._fieldnames
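If subclassing feels heavy, a shorter route may work in Python 3: the fieldnames attribute of csv.DictReader can be assigned to, so the stripped names can be written back after the header has been read. This is only a sketch with invented sample data, not a replacement for the class above.

```python
import csv
import io

# Hypothetical sample with padded header names.
sample = " id , name \n1,Ada\n2,Grace\n"

reader = csv.DictReader(io.StringIO(sample))
# Reading .fieldnames consumes the header row; assigning replaces the dict keys.
reader.fieldnames = [name.strip() for name in reader.fieldnames]

rows = [dict(row) for row in reader]
print(rows)
```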

with open(self.filename, 'r') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    return [[x.strip() for x in row] for row in reader]

You can do:
aList.append([element.strip() for element in row])

The most memory-efficient way to format the cells after parsing is with generators. Something like:
with open(self.filename, 'r') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    for row in reader:
        yield (cell.strip() for cell in row)
But it may be worth moving the cleaning into a function, so that further munging can happen in the same pass instead of in extra iterations later. For instance:
nulls = {'NULL', 'null', 'None', ''}
def clean(reader):
    def clean(row):
        for cell in row:
            cell = cell.strip()
            yield None if cell in nulls else cell
    for row in reader:
        yield clean(row)
Or it can be used as a factory that turns each row into a dict keyed by the header fields:
def factory(reader):
    fields = next(reader)
    def clean(row):
        for cell in row:
            cell = cell.strip()
            yield None if cell in nulls else cell
    for row in reader:
        yield dict(zip(fields, clean(row)))
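As a rough end-to-end illustration of the generator approach, the sketch below runs a cleaning generator over an in-memory CSV in Python 3. The names clean_row and records and the sample data are made up for this example, and the header names are stripped as well, which the factory above does not do.

```python
import csv
import io

NULLS = {'NULL', 'null', 'None', ''}

def clean_row(row):
    # Strip each cell lazily and map null-like markers to None.
    for cell in row:
        cell = cell.strip()
        yield None if cell in NULLS else cell

def records(reader):
    # The first row is taken as the header; its names are stripped too.
    fields = [name.strip() for name in next(reader)]
    for row in reader:
        yield dict(zip(fields, clean_row(row)))

# Hypothetical sample data.
sample = "id, name , note\n 1 , Ada , NULL\n2,Grace,  \n"
result = list(records(csv.reader(io.StringIO(sample))))
print(result)
```

Nothing is materialised until list() is called, so the same pipeline could be fed to a loop that handles one record at a time.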

You can create a wrapper object around your file that strips away the spaces before the CSV reader sees them. This way, you can even use the CSV file with csv.DictReader.
import re
class CSVSpaceStripper:
    def __init__(self, filename):
        self.fh = open(filename, "r")
        self.surroundingWhiteSpace = re.compile(r"\s*;\s*")
        self.leadingOrTrailingWhiteSpace = re.compile(r"^\s*|\s*$")
    def close(self):
        self.fh.close()
        self.fh = None
    def __iter__(self):
        return self
    def next(self):
        line = self.fh.next()
        line = self.surroundingWhiteSpace.sub(";", line)
        line = self.leadingOrTrailingWhiteSpace.sub("", line)
        return line
Then use it like this:
o = csv.reader(CSVSpaceStripper(filename), delimiter=";")
o = csv.DictReader(CSVSpaceStripper(filename), delimiter=";")
I hardcoded ";" to be the delimiter. Generalising the code to any delimiter is left as an exercise to the reader.
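One possible take on that exercise, as a Python 3 sketch: a generator that builds the pattern from whatever delimiter it is given, using re.escape. The function name strip_around_delimiter and the sample data are invented here. Like the class above, it also rewrites whitespace inside quoted fields, and it is not suitable when the delimiter is itself whitespace (such as a tab), because \s* would then swallow empty fields.

```python
import csv
import io
import re

def strip_around_delimiter(lines, delimiter=","):
    # Escape the delimiter so characters like "|" or "." are matched literally.
    pattern = re.compile(r"\s*" + re.escape(delimiter) + r"\s*")
    for line in lines:
        # strip() removes leading/trailing whitespace, including the newline.
        yield pattern.sub(lambda _match: delimiter, line.strip())

# Hypothetical sample data with ";" as the delimiter.
sample = io.StringIO("a ; b;c \n 1;2 ;  3\n")
rows = list(csv.reader(strip_around_delimiter(sample, ";"), delimiter=";"))
print(rows)
```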

Read a CSV (or Excel) file using pandas and trim it with this custom function.
import pandas as pd
# Definition for stripping whitespace
# (pandas 2.1+ deprecates applymap in favour of DataFrame.map)
def trim(dataset):
    trim = lambda x: x.strip() if type(x) is str else x
    return dataset.applymap(trim)
You can now apply trim() to the loaded CSV/Excel data like so (as part of a loop, etc.):
dataset = trim(pd.read_csv(dataset))
dataset = trim(pd.read_excel(dataset))

And here is Daniel Kullmann's excellent solution adapted to Python 3:
import re
class CSVSpaceStripper:
    """strip whitespaces around delimiters in the file
    NB has hardcoded delimiter ";"
    """
    def __init__(self, filename):
        self.fh = open(filename, "r")
        self.surroundingWhiteSpace = re.compile(r"\s*;\s*")
        self.leadingOrTrailingWhiteSpace = re.compile(r"^\s*|\s*$")
    def close(self):
        self.fh.close()
        self.fh = None
    def __iter__(self):
        return self
    def __next__(self):
        line = self.fh.readline()
        if not line:
            # readline() returns "" at end of file; stop instead of looping forever
            raise StopIteration
        line = self.surroundingWhiteSpace.sub(";", line)
        line = self.leadingOrTrailingWhiteSpace.sub("", line)
        return line

I figured out a very simple solution:
import csv
with open('filename.csv') as f:
    reader = csv.DictReader(f)
    rows = [ { k.strip(): v.strip() for k,v in row.items() } for row in reader ]

The following code may help you:
import pandas as pd
aList = pd.read_csv(r'filename.csv', sep=r'\s*,\s*', engine='python')

Related

Issue with reading from CSV [duplicate]

Trouble on Unicode encoded data in Python

Hello Stack Overflow community.
I am a fairly new user of Python, so sorry in advance for the silliness of this question! I have tried to fix this for hours but still have not figured it out.
I am trying to import a large text dataset to manipulate it in Python.
The dataset is a .csv file, and I've had trouble reading it because of encoding problems.
I have tried to encode it as UTF-8 text with Notepad++.
I have tried csv.reader from Python's csv module.
Here is an example of my code:
import csv
with open('twitter_test_python.csv') as csvfile:
    #for file5 in csvfile:
    #    file5.readline()
    #csvfile = csvfile.encode('utf-8')
    spamreader = csv.reader(csvfile, delimiter=str(','), quotechar=str('|'))
    for row in spamreader:
        row = " ".join(row)
        row2= str.split(row)
        listsw = []
        for mots in row2:
            if mots not in sw:
                del mots
        print row2
But when I import my data into Python I still have encoding problems (accents, etc.) whichever method I use.
How can I encode my data so that it is read properly by Python?
Thanks!
The csv module documentation provides an example of how to deal with Unicode:
import csv,codecs,cStringIO
class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")
class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self
class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        '''writerow(unicode) -> None
        This function takes a Unicode string and encodes it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)
    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
with open('twitter_test_python.csv','rb') as fin:
    reader = UnicodeReader(fin)
    for line in reader:
        #do stuff
        print line
Alexey Smirnov's answer is elegant but maybe a bit complicated for a beginner. So let me give an example closer to the code in the question.
When you read files with Python 2 you get the content as str, not unicode. You probably want to convert it as soon as possible. However, the documentation of the csv module says "This version of the csv module doesn’t support Unicode input." So you should decode the output of csv.reader, not the input. Inserting that into your code results in:
import csv
with open('twitter_test_python.csv') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=str(','), quotechar=str('|'))
    for row in spamreader:
        row = " ".join(row)
        row = unicode(row, encoding="utf-8")
        row2 = row.split()
However, you might want to consider whether joining the cells just to split them again is really what you want. Without that, the code would look like the following. The result is different if the list elements contain spaces.
import csv
with open('twitter_test_python.csv') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=str(','), quotechar=str('|'))
    for row in spamreader:
        row2 = list(unicode(cell, encoding="utf-8") for cell in row)
If you want to write something back to a file, you should first convert the unicode back to a str, e.g. with unicode.encode("utf-8").
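For readers on Python 3, where the csv module works on text directly, the decoding step moves into open(). The sketch below assumes the file really is UTF-8; the temporary file and its contents are invented so the example runs on its own.

```python
import csv
import os
import tempfile

# Create a small UTF-8 file to read back (stand-in for the real dataset).
path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("café,|déjà vu, encore|\n")

# Passing encoding= makes every cell a str with the accents intact.
with open(path, encoding="utf-8", newline="") as csvfile:
    reader = csv.reader(csvfile, delimiter=",", quotechar="|")
    words = [word for row in reader for cell in row for word in cell.split()]

print(words)
```

If the file was saved by a tool that prepends a byte order mark, encoding="utf-8-sig" may be the better choice.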

Does anyone know a simple function that converts existing csv files to UTF-8 encoding?

I have huge CSV files and they contain '\xc3\x84' style characters instead of German umlauts, because I scraped HTML using BeautifulSoup and wrote it to the CSV files using Python 2.7.8.
I managed to replace all those characters with the help of this:
Python 2.7.1: How to Open, Edit and Close a CSV file
and now my code looks like this:
import csv
new_rows = []
umlaut = {'\\xc3\\x84': 'Ä', '\\xc3\\x96': 'Ö', '\\xc3\\x9c': 'Ü', '\\xc3\\xa4': 'ä', '\\xc3\\xb6': 'ö', '\\xc3\\xbc': 'ü'}
with open('file1.csv', 'r') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        new_row = row
        for key, value in umlaut.items():
            new_row = [ x.replace(key, value) for x in new_row ]
        new_rows.append(new_row)
with open('file2.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)
When I open the CSV I see KÃ¶ln instead of Köln and other "German umlaut" problems.
I can solve this problem manually by opening the CSV file with Notepad and saving it as UTF-8, but I want to automate it with Python.
I do not quite get how to use the UnicodeWriter:
https://docs.python.org/2/library/csv.html#examples
The answers and solutions I found here on Stack Overflow are all a little bit complicated.
My questions are: how would I use, for example, the UnicodeWriter correctly in my case?
Do you know any super easy function that does something like file2.encode('utf-8')?
If no such easy function exists in Python, then why not, given that encoding errors are so common?
Instead of using your own mapping, you can use the string-escape encoding:
>>> print '\\xc3\\x84'.decode('string-escape')
Ä
import csv
def iter_decode(it):
    for line in it:
        yield line.decode('string-escape')
with open('file1.csv') as csvFile, open('file2.csv', 'w') as f:
    reader = csv.reader(iter_decode(csvFile))
    writer = csv.writer(f)
    for row in reader:
        writer.writerow(row)
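The string-escape codec only exists in Python 2. A rough Python 3 counterpart, offered as a sketch rather than a drop-in fix, goes through unicode_escape and Latin-1 to recover the raw bytes and then decodes them as UTF-8. The helper names unescape_utf8 and convert are invented here, and the approach assumes the cells are plain ASCII apart from the literal \xNN sequences; cells that already contain real non-ASCII characters or other backslash sequences would need different handling.

```python
import csv
import io

def unescape_utf8(text):
    """Turn literal '\\xc3\\xb6'-style escapes into the characters they encode."""
    # unicode_escape maps each \xNN to the code point NN; Latin-1 turns those
    # code points back into single bytes, which are then read as UTF-8.
    raw = text.encode("latin-1").decode("unicode_escape").encode("latin-1")
    return raw.decode("utf-8")

def convert(src, dst):
    # src and dst are text-mode file objects.
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([unescape_utf8(cell) for cell in row])

# Hypothetical sample standing in for file1.csv.
src = io.StringIO("city,country\nK\\xc3\\xb6ln,Germany\n")
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue())
```

With real files, the two objects would come from open('file1.csv', newline='') and open('file2.csv', 'w', encoding='utf-8', newline='').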
Given that you have the UnicodeWriter from the docs:
import csv, codecs, cStringIO
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)
    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
use it like so:
from __future__ import unicode_literals
import codecs
f = codecs.open("somefile.csv", mode='w', encoding='utf-8')
writer = UnicodeWriter(f)
for data in some_buffer:
    writer.writerow(data)

Reformat CSV according to certain field using python

http://example.com/item/all-atv-quad.html,David,"Punjab",+123456789123
http://example.com/item/70cc-2014.html,Qubee,"Capital",+987654321987
http://example.com/item/quad-bike-zenith.html,Zenith,"UP",+123456789123
I have this test.csv, containing a few items scraped from a site, but the "number" field (the last column) contains duplicates. I need to remove every row whose number has already appeared in an earlier row. This is just an example file; in the real file some numbers are repeated more than 50 times.
import csv
with open('test.csv', newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    for column in csvreader:
        "Some logic here"
        if (column[3] == "+123456789123"):
            print (column[0])
            "or here"
I need the reformatted CSV to look like this:
http://example.com/item/all-atv-quad.html,David,"Punjab",+123456789123
http://example.com/item/70cc-2014.html,Qubee,"Capital",+987654321987
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
def direct():
    seen = set()
    with open("test.csv") as infile, open("formatted.csv", 'w') as outfile:
        for line in infile:
            parts = line.rstrip().split(',')
            number = parts[-1]
            if number not in seen:
                seen.add(number)
                outfile.write(line)
def using_pandas():
    """Alternatively, use Pandas"""
    df = pd.read_csv("test.csv", header=None)
    df = df.drop_duplicates(subset=[3])
    df.to_csv("formatted_pandas.csv", index=None, header=None)
def main():
    direct()
    using_pandas()
if __name__ == "__main__":
    main()
This would filter out duplicates:
seen = set()
for line in csvreader:
    if line[3] in seen:
        continue
    seen.add(line[3])
    # write line to output file
And the csv read and write logic:
with open('test.csv', newline='') as fobj_in, open('test_clean.csv', 'w', newline='') as fobj_out:
    csv_reader = csv.reader(fobj_in, delimiter=',')
    csv_writer = csv.writer(fobj_out, delimiter=',')
    seen = set()
    for line in csv_reader:
        if line[3] in seen:
            continue
        seen.add(line[3])
        csv_writer.writerow(line)

proper use of class (csv reader example)

I've written the following CSV reader class:
import csv
class CSVread(object):
    filtered = []
    def __init__(self, file):
        self.file = file
    def get_file(self):
        try:
            with open(self.file, "r") as f:
                self.reader = [row for row in csv.reader(f, delimiter = ";")]
            return self.reader
        except IOError as err:
            print("I/O error({0}): {1}".format(err.errno, err.strerror))
            return
    def get_num_rows(self):
        print(sum(1 for row in self.reader))
Which can be used with the following example:
datacsv = CSVread("data.csv") # ; separated file
for row in datacsv.get_file(): # prints all the rows
    print(row)
datacsv.get_num_rows() # number of rows in data.csv
My goal is to filter the content of the CSV file (data.csv) by keeping the rows whose column 12 matches the keyword "00GG". I can get it to work outside the class like this:
with open("data.csv") as csvfile:
    reader = csv.reader(csvfile, delimiter = ";")
    filtered = []
    filtered = filter((lambda row: row[12] in ("00GG")), list(reader))
The code below returns an empty list (filtered) when it's defined inside the class:
def filter_data(csv_file):
    filtered = filter((lambda row: row[12] in ("00GGL")), self.reader)
    return filtered
Feedback for the existing code is also appreciated.
Could it be that in the first filter example you are searching for 00GG whereas in the second one you are searching for 00GGL?
Regardless, if you want to define filter_data() within the class you should write it as a method of the class. That means that it takes a self parameter, not a csv_file:
def filter_data(self):
    filtered = filter((lambda row: row[12] in ("00GGL")), self.reader)
    return filtered
Making it more general:
def filter_data(self, column, values):
    return filter((lambda row: row[column] in values), self.reader)
Now you can call it like this:
datacsv.filter_data(12, ('00GGL',))
which should work if the input data does indeed contain rows with 00GGL in column 12.
Note that filter_data() should only be called after get_file() otherwise there is no self.reader. Unless you have a good reason not to read in the data when the CSVread object is created (e.g. you are aiming for lazy evaluation), you should read it in then. Otherwise, set self.reader = [] which will prevent failure in other methods.
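Putting those suggestions together, a minimal Python 3 sketch of such a class might look like the following. It is an illustration, not the asker's final code: the class name CSVRead, the rows attribute, and the choice to accept an open file object instead of a filename are all assumptions made here so the example can run on an in-memory sample. Note also that ("00GG") is just a string, so `in` performs a substring test there; a one-element tuple needs the trailing comma.

```python
import csv
import io

class CSVRead:
    def __init__(self, fileobj, delimiter=";"):
        # Read everything up front so every method can rely on self.rows.
        self.rows = list(csv.reader(fileobj, delimiter=delimiter))

    def num_rows(self):
        return len(self.rows)

    def filter_data(self, column, values):
        # Skip rows that are too short to have the requested column.
        return [row for row in self.rows
                if len(row) > column and row[column] in values]

# Hypothetical sample data; column 1 plays the role of column 12.
sample = io.StringIO("a;00GG\nb;00GGL\nc;XX\n")
data = CSVRead(sample)
matches = data.filter_data(1, ("00GG",))
print(data.num_rows(), matches)
```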
