Read and Compare 2 CSV files on a row and column basis

I have two CSV files, data.csv and data2.csv.
First, I would like to strip the two data files down to the data I am interested in. I have figured this part out for data.csv. I would then like to compare them row by row, and if a row is missing, add it.
Next, I want to look at column 2. If there is a value there, I want to write to column 3; if there is data in column 3, write to column 4, and so on.
My current program looks like this. I need some guidance.
Oh, and I am using Python 3.4.
#!/usr/bin/python
__author__ = 'krisarmstrong'
import csv
searched = ['aircheck', 'linkrunner at', 'onetouch at']
def find_group(row):
    """Return the group index of a row
    0 if the row contains searched[0]
    1 if the row contains searched[1]
    etc
    -1 if not found
    """
    for col in row:
        col = col.lower()
        for j, s in enumerate(searched):
            if s in col:
                return j
    return -1
inFile = open('data.csv')
reader = csv.reader(inFile)
inFile2 = open('data2.csv')
reader2 = csv.reader(inFile2)
outFile = open('data3.csv', "w")
writer = csv.writer(outFile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
header = next(reader)
header2 = next(reader2)
"""Build a list of items to sort. If row 12 contains 'LinkRunner AT' (group 1),
one stores a triple (1, 12, row)
When the triples are sorted later, all rows in group 0 will come first, then
all rows in group 1, etc.
"""
stored = []
writer.writerow([header[0], header[3]])
for i, row in enumerate(reader):
    g = find_group(row)
    if g >= 0:
        stored.append((g, i, row))
stored.sort()
for g, i, row in stored:
    writer.writerow([row[0], row[3]])
inFile.close()
inFile2.close()
outFile.close()

Perhaps try:
import csv
col1, col2 = [], []
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        col1.append(row[0])
        col2.append(row[1])
for i in range(len(col1)):
    if col1[i] == '':
        pass  # thing to do if there is nothing for col1
    if col2[i] == '':
        pass  # thing to do if there is nothing for col2
This is a start at "making sure that if a row is missing to add it".
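As a rough sketch of the two steps the question describes (add rows that are missing, and put a new value in the next free column), the merge could look something like this. It assumes the first column identifies a row and that each incoming row is a (key, value) pair; merge_rows and the sample data are made up for illustration and are not taken from the asker's files.

```python
from collections import OrderedDict


def merge_rows(base_rows, new_rows):
    """Merge (key, value) pairs into base_rows, keyed on the first column.

    Hypothetical helper: a key that is missing from base_rows gets a new row;
    an existing key receives the value in its first empty column, or in a
    new column at the end if every column is already filled.
    """
    # OrderedDict keeps the original row order even on older Python 3 versions.
    merged = OrderedDict((row[0], list(row)) for row in base_rows)
    for key, value in new_rows:
        if key not in merged:
            merged[key] = [key, value]  # the row was missing, so add it
            continue
        row = merged[key]
        for i in range(1, len(row)):
            if row[i] == '':
                row[i] = value  # first empty column after the key
                break
        else:
            row.append(value)  # no empty column, extend the row
    return list(merged.values())


base = [['aircheck', '1'], ['linkrunner at', '']]
incoming = [('aircheck', '2'), ('linkrunner at', '5'), ('onetouch at', '3')]
print(merge_rows(base, incoming))
```

The rows could come from csv.reader and go to csv.writer unchanged; keeping the merge in a plain function makes it easy to try on small lists first.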

Related

csv skipping appending data skips rows

I have Python code for appending data to the same CSV file, but when I append the data it skips rows and starts at row 15 instead of row 4.
import csv
with open('csvtask.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    ls = []
    for line in csv_reader:
        if len(line['Values'])!= 0:
            ls.append(int(line['Values']))
    new_ls = ['','','']
    for i in range(len(ls)-1):
        new_ls.append(ls[i+1]-ls[i])
    print(new_ls)
    with open('csvtask.csv','a',newline='') as new_file:
        csv_writer = csv.writer(new_file)
        for i in new_ls:
            csv_writer.writerow(('','','','',i))
        new_file.close()

It's not really feasible to update a file at the same time you're reading it, so a common workaround is to create a new file. The following does that while preserving the field names from the original file. The new column will be named Diff.
Since there's no previous value from which to calculate a difference for the first row, the rows are processed using the built-in enumerate() function, which yields the index of each item along with the item itself as the sequence is iterated. You can use the index to tell whether the current row is the first one and handle it specially.
import csv
# Read csv file and calculate values of new column.
with open('csvtask.csv', 'r', newline='') as file:
    reader = csv.DictReader(file)
    fieldnames = reader.fieldnames  # Save for later.
    diffs = []
    prev_value = 0
    for i, row in enumerate(reader):
        row['Values'] = int(row['Values']) if row['Values'] else 0
        diff = row['Values'] - prev_value if i > 0 else ''
        prev_value = row['Values']
        diffs.append(diff)
# Read file again and write an updated file with the column added to it.
fieldnames.append('Diff')  # Name of new field.
with open('csvtask.csv', 'r', newline='') as inp:
    reader = csv.DictReader(inp)
    with open('csvtask_updated.csv', 'w', newline='') as outp:
        writer = csv.DictWriter(outp, fieldnames)
        writer.writeheader()
        for i, row in enumerate(reader):
            row.update({'Diff': diffs[i]})  # Add new column.
            writer.writerow(row)
print('Done')
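If it helps to check the difference logic separately from the file handling, here is a minimal sketch. consecutive_diffs is a name invented for this example; it assumes the values have already been converted to numbers.

```python
def consecutive_diffs(values):
    """Return the difference between each value and the one before it.

    The first entry is an empty string because the first row has no
    previous value to subtract.
    """
    if not values:
        return []
    # zip pairs every value with its successor: (v0, v1), (v1, v2), ...
    return [''] + [b - a for a, b in zip(values, values[1:])]


print(consecutive_diffs([10, 15, 12]))  # ['', 5, -3]
```

The result has one entry per input row, so it can be written next to the original rows as the new column.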

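You can use the DictWriter class like this (it expects each row as a dict keyed by field name):
import csv
header = ["data", "values"]
data = [{"data": 1, "values": 2}, {"data": 4, "values": 6}]
with open('out.csv', 'w', newline='') as file:  # 'out.csv' is a placeholder name
    writer = csv.DictWriter(file, fieldnames=header)
    writer.writeheader()
    writer.writerows(data)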

Python read CSV file, and write to another skipping columns

I have a CSV input file with 18 columns.
I need to create a new CSV file with all columns from the input except columns 4 and 5.
My function now looks like this:
import csv
def modify_csv_report(input_csv, output_csv):
    begin = 0
    end = 3
    with open(input_csv, "r") as file_in:
        with open(output_csv, "w") as file_out:
            writer = csv.writer(file_out)
            for row in csv.reader(file_in):
                writer.writerow(row[begin:end])
    return output_csv
So it reads and writes columns 0-3, but I don't know how to skip columns 4 and 5 and continue from there.

You can add the other part of the row using slicing, like you did with the first part:
writer.writerow(row[:4] + row[6:])
Note that to include column 3, the stop index of the first slice should be 4. Specifying start index 0 is also usually not necessary.
A more general approach would employ a list comprehension and enumerate:
exclude = (4, 5)
writer.writerow([r for i, r in enumerate(row) if i not in exclude])

If your CSV has meaningful headers, an alternative to slicing rows by index is to use the DictReader and DictWriter classes.
#!/usr/bin/env python
from csv import DictReader, DictWriter
data = '''A,B,C
1,2,3
4,5,6
6,7,8'''
reader = DictReader(data.split('\n'))
# You'll need your fieldnames first in a list to ensure order
fieldnames = ['A', 'C']
# We'll also use a set for efficient lookup
fieldnames_set = set(fieldnames)
with open('outfile.csv', 'w', newline='') as outfile:
    writer = DictWriter(outfile, fieldnames)
    writer.writeheader()
    for row in reader:
        # Use a dictionary comprehension to iterate over the key, value pairs
        # discarding those pairs whose key is not in the set
        filtered_row = {k: v for k, v in row.items() if k in fieldnames_set}
        writer.writerow(filtered_row)
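A possible shortcut for the header-based approach: DictWriter accepts extrasaction='ignore', which drops any keys that are not in fieldnames, so the manual filtering step is not needed. The sketch below uses io.StringIO in place of real files so it can be run as-is; keep_columns is a made-up helper name.

```python
import csv
import io


def keep_columns(csv_text, fieldnames):
    """Return csv_text reduced to the named columns, in the given order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    # extrasaction='ignore' silently skips keys that are not in fieldnames.
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    writer.writerows(reader)
    return out.getvalue()


data = 'A,B,C\n1,2,3\n4,5,6\n6,7,8\n'
print(keep_columns(data, ['A', 'C']))
```

With real files, the two StringIO objects would be replaced by files opened with newline=''.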

This is what you want:
import csv
def remove_csv_columns(input_csv, output_csv, exclude_column_indices):
    with open(input_csv) as file_in, open(output_csv, 'w') as file_out:
        reader = csv.reader(file_in)
        writer = csv.writer(file_out)
        writer.writerows(
            [col for idx, col in enumerate(row)
             if idx not in exclude_column_indices]
            for row in reader)
remove_csv_columns('in.csv', 'out.csv', (3, 4))

Selecting rows in csv file in with the variable number of columns

I have a CSV file from which I need to select certain columns. Removing AGE and MEAN WEIGHT is easy for me because those names are the same in every file. The first block below is the input and the second is the output I want:
ID,AGE,HEIGHT,MEAN WEIGHT,20-Nov-2002,05-Mar-2003,09-Apr-2003,23-Jul-2003
1,23,1.80,80,78,78,82,82
2,25,1.60,58,56,60,60,56
3,20,1.90,100,98,102,98,102
ID,HEIGHT,20-Nov-2002,05-Mar-2003,09-Apr-2003,23-Jul-2003
1,1.80,78,78,82,82
2,1.60,56,60,60,56
3,1.90,98,102,98,102
I have this code:
import csv
out= open("C:/Users/Pedro/data.csv")
rdr= csv.reader(out)
result= open('C:/Users/Pedro/datanew.csv','w')
wtr= csv.writer ( result,delimiter=',',lineterminator='\n')
for row in rdr:
    wtr.writerow( (row[0], row[2], row[4],row[5],row[6],row[7]) )
out.close()
result.close()
But my difficulty is selecting all the columns that contain dates, because the number of date columns may vary. One solution could be to detect the character - in row[4].

I'm not 100% sure what you're asking, but here is a script that may do what you want, which is to reproduce the file with all of an unknown number of date columns, plus your columns 0 and 2 (ID and HEIGHT):
import csv
with open('data.csv') as infile:  # Use 'with' to close files automatically
    reader = csv.reader(infile)
    headers = next(reader)  # Read first line
    # Figure out which columns have '-' in them (assume these are dates)
    date_columns = [col for col, header in enumerate(headers) if '-' in header]
    # Add our desired other columns
    all_columns = [0, 2] + date_columns
    with open('new.csv', 'w') as outfile:
        writer = csv.writer(outfile, delimiter=',', lineterminator='\n')
        # print headers
        writer.writerow([headers[i] for i in all_columns])
        # print data
        for row in reader:  # Read remaining data from our input CSV
            writer.writerow([row[i] for i in all_columns])
Does that help?
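If checking for '-' turns out to be too loose (a header such as "PRE-TEST" would also match), one stricter variant is to try parsing each header as a date. This is only a sketch: date_column_indices is a hypothetical helper, the format '%d-%b-%Y' is inferred from the sample headers, and %b depends on the locale using English month abbreviations.

```python
from datetime import datetime


def date_column_indices(headers, fmt='%d-%b-%Y'):
    """Return the indices of headers that parse as dates in the given format."""
    indices = []
    for i, header in enumerate(headers):
        try:
            datetime.strptime(header.strip(), fmt)
        except ValueError:
            continue  # not a date in this format, skip the column
        indices.append(i)
    return indices


headers = ['ID', 'AGE', 'HEIGHT', 'MEAN WEIGHT',
           '20-Nov-2002', '05-Mar-2003', '09-Apr-2003', '23-Jul-2003']
print(date_column_indices(headers))  # [4, 5, 6, 7]
```

The returned list can take the place of date_columns in the script above.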

stripping the zeros in csv with python

Hello, I have a CSV file and I need to remove leading zeros with Python.
Column 6 (index 5 in Python) is always 7 digits wide, like this:
AFI12001,01,C-,201405,P,0000430,2,0.02125000,US,60.0000
AFI12001,01,S-,201404,C,0001550,2,0.03500000,US,30.0000
I need to remove the leading zeros, then left-pad with zeros where necessary so the value has 4 digits in total.
So I would need it to look like this:
AFI12001,01,C-,201405,P,0430,2,0.02125000,US,60.0000
AFI12001,01,S-,201404,C,1550,2,0.03500000,US,30.0000
This code adds the zeros:
import csv
new_rows = []
with open('csvpatpos.csv','r') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        new_row = ""
        col = 0
        print(row)
        for x in row:
            col = col + 1
            if col == 6:
                if len(x) == 3:
                    x = "0" + x
            new_row = new_row + x + ","
        print(new_row)
However, I'm having trouble removing the zeros in front.

Convert the column to an int then back to a string in whatever format you want.
row[5] = "%04d" % int(row[5])
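To see that one-liner applied to whole rows, here is a small self-contained sketch. It uses io.StringIO instead of real files, and pad_sixth_column is a name made up for the example; it assumes column index 5 always holds digits.

```python
import csv
import io


def pad_sixth_column(csv_text, width=4):
    """Return csv_text with column index 5 re-padded to at least `width` digits."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for row in csv.reader(io.StringIO(csv_text)):
        # int() drops every leading zero; '%0*d' pads back up to `width`.
        row[5] = '%0*d' % (width, int(row[5]))
        writer.writerow(row)
    return out.getvalue()


sample = ('AFI12001,01,C-,201405,P,0000430,2,0.02125000,US,60.0000\n'
          'AFI12001,01,S-,201404,C,0001550,2,0.03500000,US,30.0000\n')
print(pad_sixth_column(sample))
```

Values that need more than four digits (for example 0013300) keep all of their significant digits; only the leading zeros are removed.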

You could probably do this in several steps with .lstrip(), then finding the resulting string length, then adding 4-len(s) zeros to the front. However, I think it's easier with a regex.
import csv
import re
with open('infilename', 'r') as infile:
    reader = csv.reader(infile)
    for row in reader:
        stripped_value = re.sub(r'^0{3}', '', row[5])
        print(stripped_value)
Yields
0430
1550
In the regex, we are using the format sub(pattern, substitute, original). The pattern breakdown is:
'^' - match start of string
'0{3}' - match 3 zeros
You said all the strings in the 6th column have 7 digits, and you want 4, so replace the first 3 with an empty string.
Edit: If you want to replace the rows, I would just write it out to a new file:
import csv
import re
with open('infilename', 'r') as infile, open('outfilename', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        row[5] = re.sub(r'^0{3}', '', row[5])
        writer.writerow(row)
Edit2: In light of your newest requests, I would recommend doing the following:
import csv
import re
with open('infilename', 'r') as infile, open('outfilename', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # strip all 0's from the front
        stripped_value = re.sub(r'^0+', '', row[5])
        # pad zeros on the left to smaller numbers to make them 4 digits
        # (an all-zero field strips to '', so fall back to '0')
        row[5] = '%04d' % int(stripped_value or '0')
        writer.writerow(row)
Given the following numbers,
['0000430', '0001550', '0013300', '0012900', '0100000', '0001000']
this yields
['0430', '1550', '13300', '12900', '100000', '1000']

You can use the lstrip() and zfill() methods, like this:
import csv
with open('input') as in_file:
    csv_reader = csv.reader(in_file)
    for row in csv_reader:
        stripped_data = row[5].lstrip('0')
        new_data = stripped_data.zfill(4)
        print(new_data)
This prints:
0430
1550
The line:
stripped_data = row[5].lstrip('0')
gets rid of all the zeros on the left. And the line:
new_data = stripped_data.zfill(4)
fills the front with zeros so that the total number of digits is at least 4.
Hope this helps.

You can keep just the last 4 characters (this assumes the value never needs more than 4 digits):
columns[5] = columns[5][-4:]
Example:
data = '''AFI12001,01,C-,201405,P,0000430,2,0.02125000,US,60.0000
AFI12001,01,S-,201404,C,0001550,2,0.03500000,US,30.0000'''
for row in data.splitlines():
    columns = row.split(',')
    columns[5] = columns[5][-4:]
    print(','.join(columns))
Result:
AFI12001,01,C-,201405,P,0430,2,0.02125000,US,60.0000
AFI12001,01,S-,201404,C,1550,2,0.03500000,US,30.0000
EDIT:
The same thing using the csv module on a real file instead of simulated data:
import csv
with open('csvpatpos.csv','r') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        row[5] = row[5][-4:]
        print(row[5])  # print one element
        #print(','.join(row))  # print full row
        print(row)  # print full row

Replace element in column with previous one in CSV file using python

3rd UPDATE: To describe the problem precisely:
================================================
This is my first post, so I was not able to format it well. Sorry about that.
I have a CSV file called sample.csv. I need to add additional columns to this file, which I can do with the script below. What is missing from my script is this:
If the current value in the column named "row" is different from the previous one, update the column named "value" with the "row" value from the previous line. If not, set "value" to zero.
I hope my question is clear. Thanks a lot for your support.
My script:
#!/usr/local/bin/python3
import csv, os, sys, time
inputfile='sample.csv'
with open(inputfile, 'r') as input, open('input.csv', 'w') as output:
    reader = csv.reader(input, delimiter = ';')
    writer = csv.writer(output, delimiter = ';')
    list1 = []
    header = next(reader)
    header.insert(1,'value')
    header.insert(2,'Id')
    list1.append(header)
    count = 0
    for column in reader:
        count += 1
        list1.append(column)
        myvalue = []
        myvalue.append(column[4])
        if count == 1:
            firstmyvalue = myvalue
        if count > 2 and myvalue != firstmyvalue:
            column.insert(0, myvalue[0])
        else:
            column.insert(0, 0)
        if column[0] != column[8]:
            del column[0]
            column.insert(0,0)
        else:
            del column[0]
            column.insert(0,myvalue[0])
        column.insert(1, count)
        column.insert(0, 1)
    writer.writerows(list1)
sample.csv:-
rate;sec;core;Ser;row;AC;PCI;RP;ne;net
244000;262399;7;5;323;29110;163;-90.38;2;244
244001;262527;6;5;323;29110;163;-89.19;2;244
244002;262531;6;5;323;29110;163;-90.69;2;244
244003;262571;6;5;325;29110;163;-88.75;2;244
244004;262665;7;5;320;29110;163;-90.31;2;244
244005;262686;7;5;326;29110;163;-91.69;2;244
244006;262718;7;5;323;29110;163;-89.5;2;244
244007;262753;7;5;324;29110;163;-90.25;2;244
244008;277482;5;5;325;29110;203;-87.13;2;244
My expected output:-
rate;value;Id;sec;core;Ser;row;AC;PCI;RP;ne;net
1;0;1;244000;262399;7;5;323;29110;163;-90.38;2;244
1;0;2;244001;262527;6;5;323;29110;163;-89.19;2;244
1;0;3;244002;262531;6;5;323;29110;163;-90.69;2;244
1;323;4;244003;262571;6;5;325;29110;163;-88.75;2;244
1;325;5;244004;262665;7;5;320;29110;163;-90.31;2;244
1;320;6;244005;262686;7;5;326;29110;163;-91.69;2;244
1;326;7;244006;262718;7;5;323;29110;163;-89.5;2;244
1;323;8;244007;262753;7;5;324;29110;163;-90.25;2;244
1;324;9;244008;277482;5;5;325;29110;203;-87.13;2;244

This will do the part you were asking for in a generic way; however, your expected output clearly has more changes in it than the question asks for. I added the Id column just to show how you can also control the column order in the output:
import pandas as pd
df = pd.read_csv('sample.csv', sep=';')
df['value'] = None
df['Id'] = df.index + 1
prev_label = None  # index label of the previous line
for label, row in df.iterrows():
    if prev_label is not None:
        if row['row'] == df.loc[prev_label, 'row']:
            # Unchanged: carry the previous "value" forward.
            df.loc[label, 'value'] = df.loc[prev_label, 'value']
        else:
            # Changed: record what "row" was on the previous line.
            df.loc[label, 'value'] = df.loc[prev_label, 'row']
    prev_label = label
df.to_csv('output.csv', index=False, columns=['rate','value','Id','sec','core','Ser','row','AC','PCI','RP','ne','net'], sep=';')

import csv
previous = []
with open('test.csv', newline='') as f:
    for i, entry in enumerate(csv.reader(f)):
        if not i:  # do this on the first entry only
            previous = entry  # initialize here
            print(entry)
        else:  # other entries
            if entry[2] != previous[2]:  # does this entry's row differ from the previous entry's row?
                entry[1] = previous[2]  # store the previous entry's row value in this entry's value
            previous = entry
            print(entry)

import csv
with open('test.csv') as f, open('output.csv','w') as o:
    out = csv.writer(o, delimiter='\t')
    out.writerow(["id", 'value', 'row'])
    reader = csv.DictReader(f, delimiter="\t")  # Assuming file is tab delimited
    prev_row = '100'
    for line in reader:
        if prev_row != line["row"]:
            prev_row = line["row"]
            out.writerow([line["id"],prev_row,line["row"]])
        else:
            out.writerow(line.values())
content of output.csv:
id value row
1 0 100
2 0 100
3 110 110
4 140 140
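For the rule as the question states it ("value" is the previous line's "row" when it changed, otherwise zero), the core logic can be sketched without any file handling. previous_if_changed is a hypothetical helper; the sample values are the "row" column from the asker's sample.csv.

```python
def previous_if_changed(values, default=0):
    """For each item, return the previous item if it differs, else `default`.

    The first item has no predecessor, so it always gets `default`.
    """
    result = []
    prev = None
    for i, current in enumerate(values):
        if i > 0 and current != prev:
            result.append(prev)
        else:
            result.append(default)
        prev = current
    return result


rows = ['323', '323', '323', '325', '320', '326', '323', '324', '325']
print(previous_if_changed(rows))
# [0, 0, 0, '323', '325', '320', '326', '323', '324']
```

This matches the "value" column in the expected output; each result can be inserted into its line before the line is passed to csv.writer.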
