Python CSV Reader

I have a CSV from a system that has a load of rubbish at the top of the file, so the header row is about row 5 or could even be 14 depending on the gibberish the report puts out.
I used to use:
idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)
to skip past the rows that had 2 or fewer columns; when it hit the column headers (of which there are 12) it would stop, and I could then pass idx to skiprows when reading the CSV file.
The system has had an update, and someone thought it would be good to make the CSV file valid by adding 11 trailing commas after their gibberish so every row matches the header count.
So now I have a CSV like:
sadjfhasdkljfhasd,,,,,,,,,,
dsfasdgasfg,,,,,,,,,,
time,date,code,product
etc..
I tried:
idx = next(idx for idx, row in enumerate(csvreader) if row in (None, "") > 2)
but I think that's a Pandas thing and it just fails.
Any ideas on how I can get to my header row?
CODE:
lmf = askopenfilename(filetypes=(("CSV Files", ".csv"), ("All Files", "*.*")))
# Section gets row number where headers start
with open(lmf, 'r') as fin:
    csvreader = csv.reader(fin)
    print(csvreader)
    input('hold')
    idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)
# Reopens file, parsing the number for the row headers
lmkcsv = pd.read_csv(lmf, skiprows=idx)
lm = lm.append(lmkcsv)
print(lm)

Since your csv is now a valid file and you just want to filter out the rows that don't have a certain number of columns filled in, you can do that in pandas directly.
import pandas as pd

minimum_cols_required = 3
lmkcsv = pd.read_csv(lmf)
lmkcsv = lmkcsv.dropna(thresh=minimum_cols_required)
If your csv data has a lot of empty values that also get caught by this threshold, then just slightly modify your original code:
idx = next(idx for idx, row in enumerate(csvreader) if len(set(row)) > 3)
I'm not sure in what case a None would be returned, so set(row) should do. If your headers for whatever reason contain duplicates as well, do this:
from collections import Counter
# ...
idx = next(idx for idx, row in enumerate(csvreader) if len(row) - Counter(row)[''] > 2)
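Putting that together with the original skiprows workflow, here is a minimal sketch (the helper name header_row_index and the threshold default are mine, not from the question):

```python
import csv
from collections import Counter

def header_row_index(path, min_filled=3):
    """Return the index of the first row with at least min_filled non-empty cells."""
    with open(path, newline='') as fin:
        reader = csv.reader(fin)
        return next(idx for idx, row in enumerate(reader)
                    if len(row) - Counter(row)[''] >= min_filled)

# Then, as in the question:
# lmkcsv = pd.read_csv(lmf, skiprows=header_row_index(lmf))
```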

And how about erasing the starting lines with some logic, like checking how many ',' exist or looking for some word? Something like:
f = open("target.txt", "r+")
d = f.readlines()
f.seek(0)
for i in d:
    if "sadjfhasdkljfhasd" not in i:
        f.write(i)
f.truncate()
f.close()
After that, read the file normally.

Related

Parsing a csv file and making it into a dict

I have a .csv file that I am trying to turn into a dict. I have tried pandas and csv.DictReader mostly, but so far I can only print the data (not in the way I want) with DictReader.
So the main problem is that the file is like
header;data (1 column)
for about 50 rows, and from row 50 onward it changes the schema to
header1;header2;header3;header4
data1;data2;data3;data4
etc..
with open(filename, 'r', encoding='utf-16') as f:
    for line in csv.DictReader(f):
        print(line)
That's the code I have for now.
Thanks for your help.
You can't use DictReader for this, because it requires all the rows to have the same fields.
Use csv.reader and check the length of the row that it returns. When the length changes, treat that as a new header.
Hopefully you don't have adjacent sections of the file that have the same number of fields but different headers. It will be difficult for the script to detect when the section changes.
data = []
with open(filename, 'r', encoding='utf-16') as f:
    r = csv.reader(f, delimiter=';')
    # process first 52 rows in format header;data
    for _ in range(52):
        row = next(r)
        data.append({row[0]: row[1]})
    # rest of file is a header row followed by a variable number of data rows
    header = next(r)
    for row in r:
        if len(row) != len(header):  # new header
            header = row
            continue
        d = dict(zip(header, row))
        data.append(d)

Get the number of empty skipped rows in pandas when parsing

I noticed pandas is smart when using read_excel / read_csv: it skips empty rows, so if my input has a blank row like
Col1, Col2

Value1, Value2
It just works, but is there a way to get the actual # of skipped rows? (In this case 1)
I want to tie the dataframe row numbers back to the raw input file's row numbers.
You could use skip_blank_lines=False and import the entire file, including the empty lines. Then you can detect them, count them, and filter them out:
import pandas as pd

def custom_read(f_name, **kwargs):
    df = pd.read_csv(f_name, skip_blank_lines=False, **kwargs)
    non_empty = df.notnull().all(axis=1)
    print('Skipped {} blank lines'.format(sum(~non_empty)))
    return df.loc[non_empty, :]
You can also use csv.reader to import your file row-by-row and only allow non-empty rows:
import csv
import pandas as pd

def custom_read2(f_name):
    with open(f_name) as f:
        cont = []
        empty_counts = 0
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if len(row) > 0:
                cont.append(row)
            else:
                empty_counts += 1
        print('Skipped {} blank lines'.format(empty_counts))
        return pd.DataFrame(cont)
As far as I can tell, at most one blank line at a time will occupy your memory. This may be useful if you happen to have large files with many blank lines, but I am pretty sure the first option will always be the better one in practice.
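If the goal is to map dataframe rows back to the file's line numbers, here is a sketch of one more option (assuming a comma-separated file; the column name file_line is mine): keep the blank lines at read time so each dataframe index still corresponds to a data line in the file, record the line numbers, and only then drop the blanks.

```python
import io

import pandas as pd

raw = "Col1,Col2\n\nValue1,Value2\n"
df = pd.read_csv(io.StringIO(raw), skip_blank_lines=False)
# index 0 is file line 2 (line 1 is the header), so line number = index + 2
df['file_line'] = df.index + 2
# drop rows where every original column is NaN, i.e. the blank lines
df = df.dropna(how='all', subset=['Col1', 'Col2'])
```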

Replace value of specific column in all non header rows

Below is some python code that runs on a file similar to this (old_file.csv).
A,B,C,D
1,2,XX,3
11,22,XX,33
111,222,XX,333
How can I iterate through all lines in old_file.csv (if I don't know the length of the file) and replace all values in column C, i.e. index 2 (based on cells[row][col]), while ignoring the header row? In new_file.csv, all values containing 'XX' should become 'YY', for example.
import csv
r = csv.reader(open('old_file.csv'))
cells = [l for l in r]
cells[1][2] = 'YY'
cells[2][2] = 'YY'
cells[3][2] = 'YY'
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(cells)
Just a small change to @Soviut's answer; I think this will help you:
import csv

rows = csv.reader(open('old_file.csv'))
newRows = []
for i, row in enumerate(rows):
    # ignore the first row, modify all the rest
    if i > 0:
        row[2] = 'YY'
    newRows.append(row)
# write rows to new CSV file, no header is written unless explicitly told to
w = csv.writer(open('new_file.csv', 'w', newline=''))
w.writerows(newRows)
You can very easily loop over the array of rows and replace values in the target cell.
# get rows from old CSV file, materialized as a list so the edits persist
rows = list(csv.reader(open('old_file.csv')))
# iterate over each row and replace target cell
for i, row in enumerate(rows):
    # ignore the first row, modify all the rest
    if i > 0:
        row[2] = 'YY'
# write rows to new CSV file, no header is written unless explicitly told to
w = csv.writer(open('new_file.csv', 'w', newline=''))
w.writerows(rows)
csv.reader yields lists, so you could just run the replacement on r[1:].
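A sketch of that idea as a small helper (the function name and defaults are mine, not from the answer):

```python
import csv

def replace_column(in_path, out_path, col=2, value='YY'):
    """Copy a CSV, setting column `col` to `value` in every row after the header."""
    with open(in_path, newline='') as f:
        rows = list(csv.reader(f))
    for row in rows[1:]:  # rows[1:] skips the header row
        row[col] = value
    with open(out_path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
```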
len(cells) is the number of rows; iterating from 1 makes it skip the header line. (Also, the variable lines should be cells.)
import csv

r = csv.reader(open('old_file.csv'))
cells = [l for l in r]
for i in range(1, len(cells)):
    cells[i][2] = 'YY'
w = csv.writer(open('new_file.csv', 'w', newline=''))
w.writerows(cells)
read_handle = open('old_file.csv', 'r')
data = read_handle.read().split('\n')
read_handle.close()
new_data = []
new_data.append(data[0])
for line in data[1:]:
    if not line:
        new_data.append(line)
        continue
    line = line.split(',')
    line[2] = 'YY'
    new_data.append(','.join(line))
write_handle = open('new_file.csv', 'w')
write_handle.writelines('\n'.join(new_data))
write_handle.close()

How to find specific row in Python CSV module

I need to find the third row of a CSV file, from column 4 to the end. How would I do that? I know I can find the values from the 4th column on with
row[3]
but how do I get specifically the third row?
You could convert the csv reader object into a list of lists... The rows are stored in a list, which contains lists of the columns.
So:
csvr = csv.reader(file)
csvr = list(csvr)
csvr[2]       # The 3rd row
csvr[2][3]    # The 4th column on the 3rd row
csvr[-4][-3]  # The 3rd column from the right on the 4th row from the end
You could keep a counter for counting the number of rows:
counter = 1
for row in reader:
if counter == 3:
print('Interested in third row')
counter += 1
You could use itertools.islice to extract the row of data you wanted, then index into it.
Note that the rows and columns are numbered from zero, not one.
import csv
from itertools import islice

def get_row_col(csv_filename, row, col):
    with open(csv_filename, newline='') as f:
        return next(islice(csv.reader(f), row, row + 1))[col]
This is very basic code that will do the job, and you can easily make a function out of it.
import csv

target_row = 3
target_col = 4
with open('yourfile.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    n = 1
    for row in reader:
        if n == target_row:
            data = row[target_col - 1]  # columns counted from 1 here
            break
        n += 1
print(data)

Read and Compare 2 CSV files on a row and column basis

I have two CSV files. data.csv and data2.csv.
I would first like to strip the two data files down to the data I am interested in. I have figured this part out for data.csv. I would then like to compare them by row, making sure that if a row is missing it gets added.
Next I want to look at column 2. If there is a value there, then I want to write to column 3; if there is data in column 3, then write to column 4, etc.
My current program looks like so. I need some guidance.
Oh, and I am using Python v3.4.
#!/usr/bin/python
__author__ = 'krisarmstrong'

import csv

searched = ['aircheck', 'linkrunner at', 'onetouch at']

def find_group(row):
    """Return the group index of a row:
    0 if the row contains searched[0],
    1 if the row contains searched[1],
    etc.;
    -1 if not found.
    """
    for col in row:
        col = col.lower()
        for j, s in enumerate(searched):
            if s in col:
                return j
    return -1

inFile = open('data.csv')
reader = csv.reader(inFile)
inFile2 = open('data2.csv')
reader2 = csv.reader(inFile2)
outFile = open('data3.csv', "w")
writer = csv.writer(outFile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
header = next(reader)
header2 = next(reader2)

"""Build a list of items to sort. If row 12 contains 'LinkRunner AT' (group 1),
one stores a triple (1, 12, row).
When the triples are sorted later, all rows in group 0 will come first, then
all rows in group 1, etc.
"""
stored = []
writer.writerow([header[0], header[3]])
for i, row in enumerate(reader):
    g = find_group(row)
    if g >= 0:
        stored.append((g, i, row))
stored.sort()
for g, i, row in stored:
    writer.writerow([row[0], row[3]])
inFile.close()
outFile.close()
Perhaps try:
import csv

col1, col2 = [], []
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        col1.append(row[0])
        col2.append(row[1])
for i in range(len(col1)):
    if col1[i] == '':
        pass  # thing to do if there is nothing for col1
    if col2[i] == '':
        pass  # thing to do if there is nothing for col2
This is a start at "making sure that if a row is missing to add it".
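For the "if a row is missing, add it" part, one possible sketch (assuming whole rows can be compared for equality; the function name is mine, not from the question):

```python
import csv

def merge_missing_rows(path_a, path_b, out_path):
    """Write all rows of path_a, then append any row of path_b not already seen."""
    with open(path_a, newline='') as f:
        rows_a = list(csv.reader(f))
    with open(path_b, newline='') as f:
        rows_b = list(csv.reader(f))
    seen = {tuple(r) for r in rows_a}
    merged = rows_a + [r for r in rows_b if tuple(r) not in seen]
    with open(out_path, 'w', newline='') as f:
        csv.writer(f).writerows(merged)
```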
