How to clean large malformed CSV file using Python

I'm attempting to use Python 2.7.5 to clean up a malformed CSV file. The CSV file is fairly large (over 1GB). The first row of the file correctly lists the column headings, but after that each field is on a new line (unless it is blank) and some fields are multi-line. The multi-line fields are not surrounded by quotes, but need to be surrounded by quotes in the output. The number of columns is static and known. The pattern in the sample input provided is repeated throughout the length of the file.
Input file (sample):
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname
,my_username
,10.0.0.1
192.168.1.1
,2015-02-11 13:41:54 -0600
,,true
,false
my_2nd_hostname
,my_2nd_username
,10.0.0.2
192.168.1.2
,2015-02-11 14:04:41 -0600
,true
,,false
Desired output:
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname,my_username,"10.0.0.1 192.168.1.1",2015-02-11 13:41:54 -0600,,true,false
my_2nd_hostname,my_2nd_username,"10.0.0.2 192.168.1.2",2015-02-11 14:04:41 -0600,true,,false
I've gone down a couple of paths that address one of the issues, only to realize they don't handle another aspect of the malformed data. I would appreciate if anyone could help me identify an efficient way to clean up this file.
Thanks
EDIT
I have several code scraps from going down different paths, but here is the current iteration. It isn't pretty, just a bunch of hacks to try and figure this out.
import csv

inputfile = open('input.csv', 'r')
outputfile_1 = open('output.csv', 'w')
counter = 1
for line in inputfile:
    # Skip header row
    if counter == 1:
        outputfile_1.write(line)
        counter = counter + 1
    else:
        line = line.replace('\r', '').replace('\n', '')
        outputfile_1.write(line)
inputfile.close()
outputfile_1.close()

with open('output.csv', 'r') as f:
    text = f.read()

comma_count = text.count(',')  # comma_count/6 = total number of rows
# need to insert a newline after the field contents after every 6th comma
# unfortunately the last field of the row and the first field of the next row
# are now rammed up together because of the newline replaces above...
# then process as normal CSV

# one path I started to go down... but this isn't even functional
groups = text.split(',')
counter2 = 1
while counter2 <= comma_count / 6:
    line = ','.join(groups[:(6 * counter2)]), ','.join(groups[(6 * counter2):])
    print line
    counter2 = counter2 + 1
EDIT 2
Thanks to @DSM and @Ryan Vincent for getting me on the right track. Using their ideas I made the following code, which seems to correct my malformed CSV. I'm sure there are many places for improvement though, which I would happily accept.
import csv
import re

outputfile_1 = open('output.csv', 'wb')
wr = csv.writer(outputfile_1, quoting=csv.QUOTE_ALL)

with open('input.csv', 'r') as f:
    text = f.read()

comma_indices = [m.start() for m in re.finditer(',', text)]  # Find all the commas - the fields are between them
cursor = 0
field_counter = 1
row_count = 0
csv_row = []
for index in comma_indices:
    newrowflag = False
    if "\r" in text[cursor:index]:
        # This chunk has two fields, the last of one row and the first of the next
        next_field = text[cursor:index].split('\r')
        next_field_trimmed = next_field[0].replace('\n', ' ').strip()
        csv_row.extend([next_field_trimmed])  # Add the last field of this row
        # Reset the cursor to be in the middle of the chunk (after the last field and before the next)
        # and set a flag that we need to start the next csv_row before we move on to the next comma index
        cursor = cursor + text[cursor:index].index('\r') + 1
        newrowflag = True
    else:
        next_field_trimmed = text[cursor:index].replace('\n', ' ').strip()
        csv_row.extend([next_field_trimmed])
    # Advance the cursor to the character after the comma to start the next field
    cursor = index + 1
    # If we've done 7 fields then we've finished the row
    if field_counter % 7 == 0:
        row_count = row_count + 1
        wr.writerow(csv_row)
        # Reset
        csv_row = []
        # If the last chunk had 2 fields in it...
        if newrowflag:
            next_field_trimmed = next_field[1].replace('\n', ' ').strip()
            csv_row.extend([next_field_trimmed])
            field_counter = field_counter + 1
    field_counter = field_counter + 1
# Write the last row
wr.writerow(csv_row)
outputfile_1.close()

# Process output.csv as normal CSV file...
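For that last step, reading the cleaned file back in is ordinary CSV handling. A minimal sketch (Python 2 to match the question, reading the output.csv written above):

import csv

with open('output.csv', 'rb') as f:
    for row in csv.reader(f):
        print row  # each row is now a clean list of 7 fields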

This is a comment about how I would tackle this.
For each line:
I can easily identify the start and end of certain groups:
Hostname - there is only one
usernames - read these until you meet something that does not look like a username (comma delimited)
ip addresses - read these until you meet a timestamp - identified with a pattern match - be aware these are separated by a space rather than a comma. The end of the group is identified by the trailing comma.
timestamp - easy to identify with a pattern match
test1, test2, test3 - certain to be there as comma delimited fields
Notes: I would use the 'pattern' matches to check that I have the correct thing in the correct place. It enables spotting errors sooner (a rough sketch follows below).
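A rough sketch of those pattern matches (the regexes are my assumptions from the sample data, so verify them against the real file):

import re

# Assumed field patterns -- inferred from the sample rows, not from a spec.
TIMESTAMP_RE = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}$')
IP_RE = re.compile(r'^\d{1,3}(?:\.\d{1,3}){3}$')

def check_row(fields):
    """Sanity-check one reassembled row: hostname, username, IPs, timestamp, 3 tests."""
    if len(fields) != 7:
        raise ValueError('expected 7 fields, got %d' % len(fields))
    ips = fields[2].split(' ')
    if not all(IP_RE.match(ip) for ip in ips):
        raise ValueError('expected IP addresses, got %r' % fields[2])
    if not TIMESTAMP_RE.match(fields[3]):
        raise ValueError('expected timestamp, got %r' % fields[3])
    return fields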

From your data excerpt it seems like any line that starts with a comma needs to be joined to the preceding line and any line starting with anything other than a comma marks a new row.
If that's the case then you could use something like the following code to clean up the CSV file so that the standard library csv parser can handle it.
#!/usr/bin/python
raw_data = 'somefilename.raw'
csv_data = 'somefilename.csv'

with open(raw_data, 'Ur') as inp, open(csv_data, 'wb') as out:
    row = list()
    for line in inp:
        line = line.rstrip('\n')
        if line.startswith(','):
            row.append(line)
        else:
            if row:
                out.write(''.join(row) + '\n')
            row = list()
            row.append(line)
    # Don't forget to write the last row!
    out.write(''.join(row) + '\n')
This is a miniature state machine ... accumulating lines into each row until we find a line that doesn't start with a comma, writing the previous row and so on.

Related

CSV file is skipping to next line in the middle of a row

I am having a weird issue that I just can't figure out. I am attempting to create a simple CSV file in a certain format. It mostly works, but for some reason every row gets pushed to the next line about halfway through. Also, there are quotes around the source entries that shouldn't be there.
Example of CSV:
target,source,type,source_followers,source_following
central_insurance_companies,"carolpaul60
",directed,1382,1265
central_insurance_companies,"haif_250
",directed,1382,1265
central_insurance_companies,"hmadvisors
",directed,1382,1265
central_insurance_companies,"speakmanweaver
",directed,1382,1265
I am using the following main code:
targetFollowers = getFollowerCount() #int
targetFollowing = getFollowingCount() #int
setupCSV()
updateCSV(accountName, targetFollowers, targetFollowing)
with the following functions:
def setupCSV():
    csvFile = open('edges.csv', 'w', encoding="utf-8")
    csvWriter = csv.writer(csvFile)
    # create header
    csvWriter.writerow(['target', 'source', 'type', 'source_followers', 'source_following'])
    csvFile.close()

def updateCSV(accountName, targetFollowers, targetFollowing):
    # open followers
    followerFile = open('followerFile.txt', 'r', encoding="utf-8")
    Lines = followerFile.readlines()
    # update rows of csv
    lineCount = 1
    while True:
        sourceName = linecache.getline(r'followerFile.txt', lineCount)
        rowUpdate(accountName, sourceName, targetFollowers, targetFollowing)
        lineCount += 1
        # stop loop
        if not sourceName:
            followerFile.close()
            break
    # close file
    followerFile.close()

def rowUpdate(accountName, sourceName, targetFollowers, targetFollowing):
    # row data
    rowData = [accountName, sourceName, 'directed', targetFollowers, targetFollowing]
    # write data
    with open('edges.csv', 'a', encoding="utf-8") as csvFile:
        csvWriter = csv.writer(csvFile)
        csvWriter.writerow(rowData)
    csvFile.close()
Thank you!
Both of your issues are actually the same issue. Your rows are being pushed to the next line because your source column values all end in a newline character, which when written to the file results in a literal new line. This is also the reason for the quotes: the CSV writer detects the newline character, so it automatically applies quotes to the value.
The root cause of the issue looks like it is coming from linecache.getline(...).
According to the Python docs for that function:
the terminating newline character will be included for lines that are found
So to avoid passing that newline character into your CSV file, probably the simplest way is to call sourceName.strip() before adding the value to your CSV file, or simply to add .strip() to the end of the getline() call.
For example:
sourceName = linecache.getline(r'followerFile.txt', lineCount).strip() # <- added .strip() here
rowUpdate(accountName, sourceName, targetFollowers, targetFollowing)
lineCount += 1

split the data into separate files after encountering a column name

eno,ename,
101,'sam',
102,'bill',
eno,ename,
103,'jack',
eno,ename,
104,'pam',
I have a huge .csv file in which the column names reappear after a certain number of rows. Is there a way in Python to split such data into multiple files as soon as it encounters the repeated column names?
I would like the above data to be in 3 separate .csv files, since the same column names appear 3 times.
Challenging! Here's my solution. There is likely a more straightforward way to do this though.
with open("./file.csv", "r") as readfile:
file_number = 0
current_line_no = 0
tmpline = None
for line in readfile:
# count which file you're on. Also use write mode "W" if the first line. Else append.
with open(f"./writefile{file_number}.csv", ("w" if current_line_no == 0 else "a")) as writefile:
# check if the "headers" are appearing and if the current file has more than 1 line.
# Not sure if the header check is the best for your use case. Maybe regex is best here.
if current_line_no != 0 and ("eno" in line and "ename" in line):
file_number += 1 # increment to next file
current_line_no = 0 # reset file number
tmpline = line # remember the "current line". This needs to be added to next file.
continue # continue to next line in readfile
# if there is a templine from previous, add it to this as header.
if tmpline is not None:
writefile.write(tmpline)
tmpline = None
# write the line and increment to new line
writefile.write(line)
current_line_no += 1
I've tried to comment as best as possible. The code basically opens the files one by one as it loops through the lines of the readfile. When it reads the contents it checks if the current line is a "header". Here I simply checked if "eno" and "ename" are in the line, but there is probably a better approach for your use case. If the current line is a header, then you need to close the current file and open a new one. Hopefully this helps!
I know you asked for Python, but there are some questions that just cry out for the power of AWK :)
awk '/eno,ename/{x="F"++i ".csv";}{print > x;}' input.csv
One way of doing it is to save the headers to a variable, and then when reading the file check if the current row matches the header. If it does, increment a counter that can be used to determine which file to write to.
import csv

HEADERS = next(csv.reader(open('data.csv')))
print(HEADERS)

with open('data.csv') as f:
    reader = csv.reader(f)
    file_name_counter = 0
    for row in reader:
        if row == HEADERS:
            file_name_counter += 1
        with open(f'data{file_name_counter}.csv', ('w' if row == HEADERS else 'a'), newline="") as f:
            writer = csv.writer(f)
            writer.writerow(row)
NOTE: I believe the newline="" argument is necessary on Windows, as otherwise csv.writer() will add an extra new line between each entry.

Remove linebreak in csv

I have a CSV file that has errors. The most common one is a premature line break.
But now I don't know the best way to remove it. If I read the file line by line with
with open("test.csv", "r") as reader:
test = reader.read().splitlines()
the wrong structure is already in my variable. Is this still the right approach? Should I use a for loop over test and create a copy, or can I manipulate the test variable directly while iterating over it?
I can identify the corrupt lines by the semicolon: some rows end with a ; and others start with it. So maybe counting would be an alternative way to solve it?
EDIT:
I replaced reader.read().splitlines() with reader.readlines() so I could handle the rows which end with a ;
for line in lines:
    if "Foobar" in line:
        line = line.replace("Foobar", "")
    if ";\n" in line:
        line = line.replace(";\n", ";")
The only thing that remains are rows that begin with a ;, since for those I need to go back one entry in the list.
Example:
Col_a;Col_b;Col_c;Col_d
2021;Foobar;Bla
;Blub
Blub belongs in the row above.
Here's a simple Python script to merge lines until you have the desired number of fields.
import sys

sep = ';'
fields = 4

collected = []
for line in sys.stdin:
    new = line.rstrip('\n').split(sep)
    if collected:
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) < fields:
        continue
    print(sep.join(collected))
    collected = []
This simply reads from standard input and prints to standard output. If the last line is incomplete, it will be lost.
The separator and the number of fields can be edited in the variables at the top; exposing them as command-line parameters is left as an exercise.
If you wanted to keep the newlines, it would not be too hard to strip the newline from only the last field, and use csv.writer to write the fields back out as properly quoted CSV.
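A hypothetical sketch of that variant, under the same assumptions as the script above (;-separated, four fields, stdin to stdout):

import csv
import sys

sep = ';'
fields = 4

writer = csv.writer(sys.stdout, delimiter=sep)
collected = []
for line in sys.stdin:
    # Do not strip here, so a premature line break survives inside the field.
    new = line.split(sep)
    if collected:
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) < fields:
        continue
    collected[-1] = collected[-1].rstrip('\n')  # strip only the row-terminating newline
    writer.writerow(collected)  # csv.writer quotes any field with an embedded newline
    collected = []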
This is how I deal with this. This function fixes the line if there are more columns than needed or if there is a line break in the middle.
Parameters of the function are:
message - content of the file - reader.read() in your case
columns - number of expected columns
filename - filename (I use it for logging)
def pre_parse(message, columns, filename):
    parsed_message = []
    i = 0
    temp_line = ''
    for line in message.splitlines():
        # print(line)
        split = line.split(',')
        if len(split) == columns:
            parsed_message.append(line)
        elif len(split) > columns:
            print(f'Line {i} has been truncated in file {filename} - too many columns')
            split = split[:columns]
            line = ','.join(split)
            parsed_message.append(line)
        elif len(split) < columns and temp_line == '':
            temp_line = line.replace('\n', '')
            print(temp_line)
        elif temp_line != '':
            line = temp_line + line
            if line.count(',') == columns - 1:
                print(f'Line {i} has been fixed in file {filename} - extra line feed')
                parsed_message.append(line)
                temp_line = ''
            else:
                temp_line = line.replace('\n', '')
        i += 1
    return parsed_message
Make sure you use the proper split character and the proper line feed character.
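For example, a hypothetical call for the semicolon file above (after swapping the ',' in the function for ';', per that note):

with open("test.csv", "r") as reader:
    fixed_lines = pre_parse(reader.read(), 4, "test.csv")
for line in fixed_lines:
    print(line)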

Reading numbers off a list from a txt file, but only up to a comma

This is data from a lab experiment (around 717 lines of data). Rather than trying to do it in Excel, I want to import and graph it in either Python or MATLAB. I'm new here btw... and am a student!
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
more numbers: see screenshot of more data from my file
I just can't figure out how to read the line up until a comma. Specifically, I need the Load numbers for one of my arrays/list, so for example on the first line I only need 62.638 (which would be the first number on my first index on my list/array).
How can I get an array/list of this, something that iterates/reads the list and ignores strings?
Thanks!
NOTE: I use Anaconda + Jupyter Notebooks for Python & Matlab (school provided software).
EDIT: Okay, so I came home today and worked on it again. I hadn't dealt with CSV files before, but after some searching I was able to learn how to read my file, somewhat.
import csv
from itertools import islice

with open('Blue_bar_GroupD.txt', 'r') as BB:
    BB_csv = csv.reader(BB)
    x = 0
    BB_lb = []
    while x < 7:  # to skip the string data
        next(BB_csv)
        x += 1
    for row in islice(BB_csv, 0, 758):
        print(row[0])  # testing if I can read row data
Okay, here is where I am stuck. I want to make an array/list that has the 0th index value of each row. Sorry if I'm a freaking noob!
Thanks again!
You can skip all lines till the first data row and then parse the data into a list for later use - 700+ lines can easily be processed in memory.
Therefore you need to:
read the file line by line
remember the last non-empty line before a number/comma/dot line (== header)
see if the line is only numbers/commas/dots, else increase a skip-counter (== data)
seek to 0
skip enough lines to get to the header or data
read the rest into a data structure
Create test file:
text = """
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
"""
with open ("t.txt","w") as w:
w.write(text)
Some helpers and the skipping/reading logic:
import re
import csv

def convert_row(row):
    """Convert one row of data into a list of mixed floats and others.
    Float is the preferred data type, else string is used - no other tried."""
    d = []
    for v in row:
        try:
            # convert to float && add
            d.append(float(v))
        except ValueError:
            # not a float, append as is
            d.append(v)
    return d

def count_to_first_data(fh):
    """Count lines in fh not consisting of numbers, dots and commas.
    Side effect: will reset position in fh to 0."""
    skiplines = 0
    header_line = 0
    fh.seek(0)
    for line in fh:
        if re.match(r"^[\d.,]+$", line):
            fh.seek(0)
            return skiplines, header_line
        else:
            if line.strip():
                header_line = skiplines
            skiplines += 1
    raise ValueError("File does not contain pure number rows!")
Usage of helpers / data conversion:
data = []
with open("t.txt", "r") as csvfile:
    skip_to_data, skip_to_header = count_to_first_data(csvfile)
    for _ in range(skip_to_header):  # use skip_to_data if you do not want the headers
        next(csvfile)
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        row_data = convert_row(row)
        if row_data:
            data.append(row_data)
print(data)
Output (reformatted):
[['Load (lbf)', 'Time (s)', 'Crosshead (in)', 'Extensometer (in)'],
[62.638, 0.9, 0.0, 8e-05],
[122.998, 1.7, 0.001, 0.00012]]
Docs:
re.match
csv.reader
Methods of file objects (e.g. seek())
With this you now have "clean" data that you can use for further processing - including your headers.
For visualization you can have a look at matplotlib
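For instance, a minimal matplotlib sketch plotting load against time, assuming the data list built above with the header row first:

import matplotlib.pyplot as plt

headers, rows = data[0], data[1:]
time = [row[1] for row in rows]  # 'Time (s)' column
load = [row[0] for row in rows]  # 'Load (lbf)' column
plt.plot(time, load)
plt.xlabel(headers[1])
plt.ylabel(headers[0])
plt.show()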
I would recommend reading your file with Python:
data = []
with open('my_txt.txt', 'r') as fd:
    # Suppress header lines
    for i in range(6):
        fd.readline()
    # Read each data line up to the first comma
    for line in fd:
        index = line.find(',')
        if index >= 0:
            data.append(float(line[0:index]))
which leads to a list containing the data of the first column:
>>> data
[62.638, 122.998]
The MATLAB solution is less nice, since you have to know the number of data lines in your file (which you do not need to know in the Python solution):
n_header = 6
n_lines = 2 % Insert here 717 (as you mentioned)
M = csvread('my_txt.txt', n_header, 0, [n_header 0 n_header+n_lines-1 0])
leads to:
>> M
M =
62.6380
122.9980
For the sake of clarity: you can also use MATLAB's textscan function to achieve what you want without knowing the number of lines, but still, the Python code would be the better choice in my opinion.
Based on your format, you need three steps: one, read all the lines; two, determine which lines to use; and last, get the floats and assign them to a list.
Assuming your file name is name.txt, try:
f = open("name.txt", "r")
all_lines = f.readlines()
grid = []
for line in all_lines:
    if ('"' not in line) and (line != '\n'):
        grid.append(list(map(float, line.strip('\n').split(','))))
f.close()
The grid will then contain a series of lists containing your group of floats.
Explanation for fun:
In the for loop, I searched for the double quote to eliminate any string line, as all strings are enclosed in quotes. The other check is for skipping empty lines.
Based on your needs, you can use the list grid as you please. For example, to fetch the first line's first number, do
grid[0][0]
as Python's lists count from 0 to n-1 for n elements.
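And to collect the whole first column (the Load numbers the question asks for) into one list, a list comprehension over grid works:

load_numbers = [row[0] for row in grid]  # first float of every data row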
This is super simple in Matlab, just 2 lines:
data = dlmread('data.csv', ',', 6,0);
column1 = data(:,1);
Where 6 and 0 should be replaced by the row and column offsets you want. In this case, the data starts at row 7 and you want all the columns; then just copy the data in column 1 into another vector.
As another note, try typing doc dlmread in MATLAB - it brings up the help page for dlmread. This is really useful when you're looking for MATLAB functions, as it has suggestions for similar functions down the bottom.

writing lines group by group in different files

I've got a little script which is not working nicely for me, hope you can help and find the problem.
I have two starting files:
traveltimes: contains the lines I need; it's a column file (every row has just a number). The lines I need are separated by a line which starts with 11 spaces
header lines: contains three header lines
output_file: I want to get 29 files (STA%s). What's inside? Every file will contain the same header lines, after which I want to append the group of lines contained in the traveltimes file (a different group of lines for every file). Every group of lines is made up of 74307 rows (1 column)
So far this script creates 29 files with the same header lines, but then it mixes everything up: it writes something, but it's not what I want.
Any ideas?
def make_station_files(traveltimes, header_lines):
    """Gives the STAxx.tgrid files required by loc3d"""
    sta_counter = 1
    with open(header_lines, 'r') as file_in:
        data = file_in.readlines()
        for i in range(29):
            with open('STA%s' % (sta_counter), 'w') as output_files:
                sta_counter += 1
                for i in data[0:3]:
                    values = i.strip()
                    output_files.write("%s\n\t1\n" % (values))
                with open(traveltimes, 'r') as times_file:
                    # collector = []
                    for line in times_file:
                        if line.startswith(" "):
                            break
                        output_files.write("%s" % (line))
Suggestion:
Read the header rows first. Make sure this works before proceeding. None of the rest of the code needs to be indented under this.
Consider writing a separate function to group the traveltimes file into a list of lists.
Once you have a working traveltimes reader and grouper, only then create a new STA file, print the headers to it, and then write the timegroups to it.
Build your program up step-by-step, making sure it does what you expect at each step. Don't try to do it all at once because then you won't easily be able to track down where the issue lies.
My quick edit of your script uses itertools.groupby() as a grouper. It is a little advanced because the grouping function is stateful and tracks its state in a mutable list:
from itertools import groupby

def make_station_files(traveltimes, header_lines):
    'Gives the STAxx.tgrid files required by loc3d'
    with open(header_lines, 'r') as f:
        headers = f.readlines()

    def station_counter(line, cnt=[1]):
        'Stateful station counter -- keeps the count in a mutable list'
        if line.strip() == '':
            cnt[0] += 1
        return cnt[0]

    with open(traveltimes, 'r') as times_file:
        for station, group in groupby(times_file, station_counter):
            with open('STA%s' % (station), 'w') as output_file:
                for header in headers[:3]:
                    output_file.write('%s\n\t1\n' % (header.strip()))
                for line in group:
                    if not line.startswith(' '):
                        output_file.write('%s' % (line))
This code is untested because I don't have sample data. Hopefully, you'll get the gist of it.
