I have a text file with 8 columns. The first one is ID and the 8th one is type. The first column contains many repeated rows per ID, while the 8th column holds several types per ID; one of those types is H, and there is exactly one H per ID.
ID type
E0 B
E0 H
E0 S
B4 B
B4 H
I want to make another file with only one row per ID (the row that has H in the 8th column). For the example above, the result would be:
ID type
E0 H
B4 H
An updated version of inspectorG4dget's solution, for Python 2.7.3:
It considers only two columns in the input CSV file, ID and type, separated by \t.
Code:
import csv

with open('/home/vivek/Desktop/input.csv', 'rb') as infile, open('/home/vivek/Desktop/output.csv', 'wb') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    reader_row = next(reader)
    writer.writerow([reader_row[0], reader_row[1]])
    for row in reader:
        if row[1] == "H":
            writer.writerow(row)
Output:
ID type
E0 H
B4 H
For Python 2.6.6, check the following. I have not tested it, because I have Python 2.7.3 on my machine; 2.6 does not support multiple context managers in a single with statement, so the with blocks are nested.
import csv

with open('/home/vivek/Desktop/input.csv', 'rb') as infile:
    with open('/home/vivek/Desktop/output.csv', 'wb') as outfile:
        reader = csv.reader(infile, delimiter='\t')
        writer = csv.writer(outfile, delimiter='\t')
        reader_row = next(reader)
        writer.writerow([reader_row[0], reader_row[1]])
        for row in reader:
            if row[1] == "H":
                writer.writerow(row)
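On Python 3, the same filter works, but the csv module expects text-mode files opened with newline='' instead of 'rb'/'wb'. A minimal sketch, using io.StringIO stand-ins for the two files (with the sample rows from the question) so it runs as-is:

```python
import csv
import io

# In-memory stand-ins for input.csv / output.csv (tab-separated, as above)
infile = io.StringIO("ID\ttype\nE0\tB\nE0\tH\nE0\tS\nB4\tB\nB4\tH\n")
outfile = io.StringIO()

reader = csv.reader(infile, delimiter='\t')
writer = csv.writer(outfile, delimiter='\t')
writer.writerow(next(reader))                       # copy the header row
writer.writerows(row for row in reader if row[1] == "H")

print(outfile.getvalue())
```

With real files, replace the StringIO objects with open('input.csv', newline='') and open('output.csv', 'w', newline='').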
Assuming your file is simply a text file with spaces/tabs delimiting columns, and the column containing 'type' is right at the end of the row:
with open('input.txt', 'r') as input_file:
    input_lines = input_file.readlines()

# Take the header line, and all the subsequent lines whose last character is 'H'
output_lines = input_lines[:1] + [line for line in input_lines if line[-2] == 'H']
output_string = ''.join(output_lines)

with open('output.txt', 'w') as output_file:
    output_file.write(output_string)
The above code assumes that the 'type' column ends immediately after the single-character type code. If there can be whitespace after the data, or if you can have multi-character type codes that might look like 'AH' etc., then replace the line beneath the comment with:
output_lines = input_lines[:1] + [line for line in input_lines if line.split()[-1] == 'H']
Edit: If your file turns out to be huge and you don't want to load it all into memory and manipulate, you can use a generator expression, which is lazily evaluated:
with open('input.txt', 'r') as input_file:
    output_lines = (line for i, line in enumerate(input_file)
                    if line[-2] == 'H' or i == 0)
    # the generator reads lazily, so write while the input file is still open
    with open('output.txt', 'w') as output_file:
        for line in output_lines:
            output_file.write(line)
I would like to read each row of the csv file and match each word in the row against a list of strings. If any of the strings appears in the row, I want to write that string at the end of the line, separated by a comma.
The code below doesn't give me what I want.
import csv

file = 'test.csv'
read_files = open(file)
lines = read_files.read()
text_lines = lines.split("\n")
name = ''
with open('testnew2.csv', 'a') as f:
    for line in text_lines:
        line = str(line)
        #words = line.split()
        with open('names.csv', 'r') as fd:
            reader = csv.reader(fd, delimiter=',')
            for row in reader:
                if row[0] in line:
                    name = row
                    print(name)
                    f.write(line + "," + name[0] + '\n')
A sample of test.csv would look like this:
A,B,C,D
ABCD,,,
Total,Robert,,
Name,Annie,,
Total,Robert,,
And the names.csv would look:
Robert
Annie
Amanda
The output I want is:
A,B,C,D,
ABCD,,,,
Total,Robert,,,Robert
Name,Annie,,,Annie
Total,Robert,,,Robert
Currently the code will get rid of lines that don't result in a match, so I got:
Total,Robert,,,Robert
Name,Annie,,,Annie
Total,Robert,,,Robert
Process each line by testing row[1], appending a fifth column, then writing it out. The name list isn't really a CSV; if it's long, use a set for lookups, and read it only once for efficiency.
import csv

with open('names.txt') as f:
    names = set(f.read().strip().splitlines())

# newline='' per the Python 3 csv documentation
with open('input.csv', newline='') as inf:
    with open('output.csv', 'w', newline='') as outf:
        r = csv.reader(inf)
        w = csv.writer(outf)
        for row in r:
            row.append(row[1] if row[1] in names else '')
            w.writerow(row)
Output:
A,B,C,D,
ABCD,,,,
Total,Robert,,,Robert
Name,Annie,,,Annie
Total,Robert,,,Robert
I think the problem is that you're only writing when the name is in the row. To fix that, move the write call outside of the if conditional:
if row[0] in line:
    name = row
    print(name)
f.write(line + "," + name[0] + '\n')
I'm guessing that print statement is for testing purposes?
EDIT: On second thought, you may need to move name='' inside the loop as well, so it is reset after each row is written; that way you don't get names from matched rows bleeding into unmatched rows.
EDIT: Decided to show an implementation that should avoid the (possible) problem of two matched names in a row:
EDIT: Changed name=row and the call of name[0] in f.write() to name=row[0] and a call of name in f.write()
import csv

file = 'test.csv'
read_files = open(file)
lines = read_files.read()
text_lines = lines.split("\n")
with open('testnew2.csv', 'a') as f:
    for line in text_lines:
        name = ''
        line = str(line)
        #words = line.split()
        with open('names.csv', 'r') as fd:
            reader = csv.reader(fd, delimiter=',')
            for row in reader:
                if row[0] in line:
                    name = row[0]
                    print(name)
                    break  # stop at the first match
        f.write(line + "," + name + '\n')
Try this as well:
import csv

myFile = open('testnew2.csv', 'wb+')
writer = csv.writer(myFile)
reader2 = open('names.csv').readlines()
with open('test.csv') as File1:
    reader1 = csv.reader(File1)
    for row in reader1:
        for record in reader2:
            record = record.replace("\n", "")
            if record in row:
                row.append(record)
                break
        # write every row, matched or not
        writer.writerow(row)
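A set-based variant of the same matching, sketched on in-memory stand-ins (the sample data from the question) so it runs as-is; the names are loaded once into a set, and each row gets a fifth column holding the match, or an empty string:

```python
import csv
import io

names = {"Robert", "Annie", "Amanda"}  # contents of names.csv

src = io.StringIO("A,B,C,D\nABCD,,,\nTotal,Robert,,\nName,Annie,,\nTotal,Robert,,\n")
out = io.StringIO()
writer = csv.writer(out)
for row in csv.reader(src):
    # append the first matching name, or an empty column if none matches
    match = next((n for n in names if n in row), "")
    writer.writerow(row + [match])

print(out.getvalue())
```

With real files, swap the StringIO objects for open() calls on test.csv and the output file.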
I have two CSV files with 6 columns each and both have one common column EmpID (the primary key for comparison). For Example, File1.csv is:
EmpID1,Name1,Email1,City1,Phone1,Hobby1
120034,Tom Hanks,tom.hanks#gmail.com,Mumbai,8888999,Fishing
And File2.csv is
EmpID2,Name2,Email2,City2,Phone2,Hobby2
120034,Tom Hanks,hanks.tom#gmail.com,Mumbai,8888999,Running
The files need to be compared for differences and only rows and columns that are different should be added into a new output file as
EmpID1,Email1,Email2,Hobby1,Hobby2
120034,tom.hanks#gmail.com,hanks.tom#gmail.com,Fishing,Running
Currently I have written the below piece of code in Python. Now I am wondering how to identify and pick out the differences. Any pointers would be much appreciated.
import csv
import os

os.getcwd()
os.chdir('filepath')

with open('File1.csv', 'r') as csv1, open('File2.csv', 'r') as csv2:
    file1 = csv1.readlines()
    file2 = csv2.readlines()

with open('OutputFile.csv', 'w') as output:
    for line in file1:
        if line not in file2:
            output.write(line)
First read the files into a dict structure, with the EmpID as key pointing to the entire row:
import csv

fieldnames = []  # to store all fieldnames
with open('File1.csv') as f:
    cf = csv.DictReader(f, delimiter=',')
    data1 = {row['EmpID1']: row for row in cf}
    fieldnames.extend(cf.fieldnames)

with open('File2.csv') as f:
    cf = csv.DictReader(f, delimiter=',')
    data2 = {row['EmpID2']: row for row in cf}
    fieldnames.extend(cf.fieldnames)
Then identify all ids that are in both dicts:
ids_to_check = set(data1) & set(data2)
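For example, on two toy dicts (made-up values), the intersection keeps only the shared ids:

```python
data1 = {'120034': 'row from File1', '120035': 'only in File1'}
data2 = {'120034': 'row from File2', '120099': 'only in File2'}
# iterating a dict yields its keys, so set() gives the key sets
ids_to_check = set(data1) & set(data2)
print(ids_to_check)  # {'120034'}
```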
Finally, iterate over the ids and compare the rows themselves
with open('OutputFile.csv', 'w') as f:
    cw = csv.DictWriter(f, fieldnames, delimiter=',')
    cw.writeheader()
    for id in ids_to_check:
        diff = compare_dict(data1[id], data2[id], fieldnames)
        if diff:
            cw.writerow(diff)
Here's the compare_dict function implementation:
def compare_dict(d1, d2, fields_compare):
    fields_compare = set(field.rstrip('12') for field in fields_compare)
    if any(d1[k + '1'] != d2[k + '2'] for k in fields_compare):
        # they differ, return a new dict with all fields
        result = d1.copy()
        result.update(d2)
        return result
    else:
        return {}
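To see compare_dict in action, here is a small standalone check (the function is repeated so the snippet runs on its own, and the field values are a trimmed-down version of the question's sample rows):

```python
def compare_dict(d1, d2, fields_compare):
    fields_compare = set(field.rstrip('12') for field in fields_compare)
    if any(d1[k + '1'] != d2[k + '2'] for k in fields_compare):
        result = d1.copy()
        result.update(d2)
        return result
    return {}

fields = ['EmpID1', 'Email1', 'EmpID2', 'Email2']
d1 = {'EmpID1': '120034', 'Email1': 'tom.hanks#gmail.com'}
d2 = {'EmpID2': '120034', 'Email2': 'hanks.tom#gmail.com'}
diff = compare_dict(d1, d2, fields)
print(sorted(diff))   # all four fields, since the emails differ

same = compare_dict({'EmpID1': '1', 'Email1': 'a'},
                    {'EmpID2': '1', 'Email2': 'a'},
                    ['EmpID1', 'Email1', 'EmpID2', 'Email2'])
print(same)           # {} -- no difference, so nothing is written
```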
I am trying to print out the differences by comparing a column between 2 csv files.
CSV1:
SERVER, FQDN, IP_ADDRESS,
serverA, device1.com, 10.10.10.1
serverA,device2.com,10.11.11.1
serverC,device3.com,10.12.12.1
and so on..
CSV2:
FQDN, IP_ADDRESS, SERVER, LOCATION
device3.com,10.12.12.1,serverC,xx
device679.com,20.3.67.1,serverA,we
device1.com,10.10.10.1,serverA,ac
device345.com,192.168.2.0,serverA,ad
device2.com,192.168.6.0,serverB,af
and so on...
What I am looking to do is to compare the FQDN column and write the differences to a new csv output file. So my output would look something like this:
Output.csv:
FQDN, IP_ADDRESS, SERVER, LOCATION
device679.com,20.3.67.1,serverA,we
device345.com,192.168.2.0,serverA,ad
and so on..
I have tried but am not able to get the output. This is my code; please tell me where I am going wrong:
import csv

data = {}  # dict to store the lookup data
with open('CSV1.csv', 'r') as lookuplist:
    reader1 = csv.reader(lookuplist)
    for col in reader1:
        data[col[0]] = col[1]

with open('CSV2.csv', 'r') as csvinput, open('Output.csv', 'w', newline='') as f_output:
    reader2 = csv.reader(csvinput)
    csv_output = csv.writer(f_output)
    fieldnames = ['FQDN', 'IP_ADDRESS', 'SERVER']
    csv_output.writerow(fieldnames)  # prints header to the output file
    for col in reader1:
        if col[1] not in reader2:
            csv_output.writerow(col)
(EDIT) This is another approach that I have used:
import csv

f1 = open("CSV1.csv")
f2 = open("CSV2.csv")
csv_f1 = csv.reader(f1)
csv_f2 = csv.reader(f2)
for col1, col2 in zip(csv_f1, csv_f2):
    if col2[0] not in col1[1]:
        print(col2[0])
Basically, here I am first only trying to find out whether the unmatched FQDNs get printed. But it prints out the whole CSV1 column instead. A lot of research has gone into this, but no luck yet! :(
This code uses the built-in difflib to spit out the lines from file1.csv that don't appear in file2.csv and vice versa.
I use the Differ object for identifying line changes.
I assumed that you would not regard line swapping as a difference, that's why I added the sorted() function call.
from difflib import Differ

csv_file1 = sorted(open("file1.csv", 'r').readlines())
csv_file2 = sorted(open("file2.csv", 'r').readlines())
with open("diff.csv", 'w') as f:
    for line in Differ().compare(csv_file1, csv_file2):
        dmode, line = line[:2], line[2:]
        if dmode.strip() == "":
            continue
        f.write(line)  # line still carries its trailing newline
Note that if the line differs somehow (not only in the FQDN column) it would appear in diff.csv
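To illustrate what Differ().compare emits, here is a toy pair of line lists (made-up device rows): lines prefixed with two spaces are common to both, '- ' lines appear only in the first list, '+ ' only in the second, and '? ' hint lines may appear for near-matches.

```python
from difflib import Differ

a = sorted(["device1.com,10.10.10.1\n", "device2.com,10.11.11.1\n"])
b = sorted(["device1.com,10.10.10.1\n", "device3.com,10.12.12.1\n"])
diff = list(Differ().compare(a, b))
print(''.join(diff))
```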
import csv

data = {}  # creating list to store the data
with open('CSV1.csv', 'r') as lookuplist, open('CSV2.csv', 'r') as csvinput, open('Output.csv', 'w') as f_output:
    reader1 = csv.reader(lookuplist)
    reader2 = csv.reader(csvinput)
    csv_output = csv.writer(f_output)
    fieldnames = ['FQDN', 'IP_ADDRESS', 'SERVER', 'LOCATION']
    csv_output.writerow(fieldnames)  # prints header to the output file
    _tempFqdn = []
    for i, dt in enumerate(reader1):
        if i == 0:
            continue
        _tempFqdn.append(dt[1].strip())
    for i, col in enumerate(reader2):
        if i == 0:
            continue
        if col[0].strip() not in _tempFqdn:
            csv_output.writerow(col)
import csv

data = {}  # creating dictionary to store the data
with open('CSV1.csv', 'r') as lookuplist:
    reader1 = csv.reader(lookuplist)
    for col in reader1:
        data[col[1]] = col[1]  # stores the FQDN (column 1 of CSV1) as the lookup key

with open('CSV2.csv', 'r') as csvinput, open('Output.csv', 'w', newline='') as f_output:
    reader2 = csv.reader(csvinput)
    csv_output = csv.writer(f_output)
    fieldnames = ['SERVER', 'FQDN', 'AUTOMATION_ADMINISTRATOR', 'IP_ADDRESS', 'PRIMARY_1', 'MHT_1', 'MHT_2', 'MHT_3']
    csv_output.writerow(fieldnames)  # prints header to the output file
    for col in reader2:
        if col[0] not in data:  # if column 0 of CSV2 has no match in CSV1
            col = [col[0]]
            csv_output.writerow(col)  # writes the unmatched FQDN
So basically, I only had to change 'not in' in the for loop, and change which column from CSV1 is stored in the data dict.
I have two files in this format
1.txt
What I want to do is merge both files on the first column and append the columns as follows:
expected output
The script I have written is not working:
file1 = raw_input('Enter the first file name: ')
file2 = raw_input('Enter the second file name: ')
with open(file1, 'r') as f1:
    with open(file2, 'r') as f2:
        mydict = {}
        for row in f1:
            mydict[row[0]] = row[1:]
        for row in f2:
            mydict[row[0]] = mydict[row[0]].extend(row[1:])
        fout = csv.write(open('out.txt', 'w'))
        for k, v in mydict:
            fout.write([k] + v)
Your script doesn't work because of a few inaccuracies.
row is a string, so row[0] is the first character, not the first number.
The method .extend returns nothing, so it doesn't make sense to assign its result with =.
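Both points are easy to verify; a quick sketch (the sample line is made up, since the question's file contents aren't shown):

```python
# Iterating a file yields whole lines as strings, so indexing gives characters
row = "120034,foo,bar"
print(row[0])             # the first character, '1', not the first field
print(row.split(',')[0])  # the first field, '120034'

# list.extend mutates in place and returns None
nums = [1, 2]
result = nums.extend([3, 4])
print(nums)               # [1, 2, 3, 4]
print(result)             # None
```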
I would fix your script in this way:
import csv

mydict = {}
with open('1.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        mydict[row[0]] = row[1:]

with open('2.csv') as f:
    reader = csv.reader(f)
    with open('out.csv', 'w') as fout:
        writer = csv.writer(fout)
        for row in reader:
            new_row = row + mydict[row[0]]
            writer.writerow(new_row)
The following approach should work:
import csv

d_1 = {}
with open('1.csv') as f_1:
    for row in csv.reader(f_1):
        d_1[row[0]] = row[4:]

with open('2.csv') as f_2, open('out.csv', 'wb') as f_out:
    csv_out = csv.writer(f_out)
    for row in csv.reader(f_2):
        if row[0] in d_1:
            row.extend(d_1[row[0]])
        csv_out.writerow(row)
This first reads 1.csv into a dictionary keyed on the first column, dropping the three columns after it. It then reads each entry in 2.csv, and if the first column matches an entry in the dictionary, it appends the stored columns before writing to the output.
Note: Entries present in 1.csv but not in 2.csv will be ignored. Secondly, entries in 2.csv which are not in 1.csv are written unchanged.
This gives you an out.csv file as follows:
223456,233,334,334,45,667,445,6667,77798,881,2234,44556,3333,22334,44555,22233,22334,22222,22334,2234,2233,222,55,666666
333883,445,445,4445,44,556,555,333,44445,5556,5555,223,334,5566,334,445,667,334,556,776,45,2223,3334,4444
For Python 2.6, split the with onto two lines as follows:
import csv

d_1 = {}
with open('1.csv') as f_1:
    for row in csv.reader(f_1):
        d_1[row[0]] = row[4:]

with open('2.csv') as f_2:
    with open('out.csv', 'wb') as f_out:
        csv_out = csv.writer(f_out)
        for row in csv.reader(f_2):
            if row[0] in d_1:
                row.extend(d_1[row[0]])
            csv_out.writerow(row)
file1 = raw_input('Enter the first file name: ')
file2 = raw_input('Enter the second file name: ')
with open(file1, 'r') as f1:
    r1 = f1.read()
with open(file2, 'r') as f2:
    r2 = f2.read()
with open('out.txt', 'w') as o2:
    o2.write('{0},{1}'.format(r1, r2))
I'm trying to write values from file1.csv into file2.csv using keyfile.csv, which contains the mapping between the two files, since they don't have the same column order.
def convert():
    Keyfile = open('keyfile.csv', 'rb')
    file1 = open('file1.csv', 'rb')
    file2 = open('file2.csv', 'w')
    reader_Keyfile = csv.reader(Keyfile, delimiter=",")
    reader_file1 = csv.reader(file1, delimiter=",")
    writer_file2 = csv.writer(file2, delimiter=",")
    for row_file1 in reader_file1:
        for row_Keyfile in reader_Keyfile:
            for index_val in row_Keyfile:
                file2.write(row_file1[int(index_val)-1] + ',')
    # Closing all the files
    file2.close()
    Keyfile.close()
    file1.close()

# keyfile structure: 3,77,65,78,1,10,8...
# so 1st column of file2 is 3rd column of file1;
# col2 of file2 is col77 of file1 and so on
I'm only able to write one row to file2.csv; it should have as many rows as file1.csv. How do I move on to the next row after one row is finished? I assumed the loop would take care of that, but it doesn't. What am I doing wrong?
You have two problems.
You should only read keyfile once and build a dict out of the mapping
You need to write a \n at the end of each line of your output file
I am assuming the KeyFile is just one row, giving the mappings for all rows. Something like the following should work:
import csv

def convert():
    with open('keyfile.csv') as Keyfile, open('file1.csv', 'r') as file1, open('file2.csv', 'wb') as file2:
        # the single keyfile row holds 1-based file1 column numbers, one per file2 column
        mappings = next(csv.reader(Keyfile, delimiter=","))
        mappings = [int(x) - 1 if x else None for x in mappings]
        reader_file1 = csv.reader(file1, delimiter=",")
        writer_file2 = csv.writer(file2, delimiter=",")
        for row_file1 in reader_file1:
            row = [''] * len(mappings)
            # file2 column to_index takes its value from file1 column from_index
            for to_index, from_index in enumerate(mappings):
                if from_index is not None:
                    row[to_index] = row_file1[from_index]
            writer_file2.writerow(row)
It assumes column mappings start from 1.
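A toy run of the re-mapping step, following the keyfile semantics from the question (column j of file2 comes from the file1 column named in position j of the keyfile); the indices here are already converted to 0-based and the row values are made up:

```python
# mappings[j] is the file1 column that feeds file2 column j (None = leave blank)
mappings = [2, 0, None]
row_file1 = ['a', 'b', 'c']

row = [''] * len(mappings)
for to_index, from_index in enumerate(mappings):
    if from_index is not None:
        row[to_index] = row_file1[from_index]
print(row)  # ['c', 'a', '']
```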
Your nested looping is problematic as others mentioned. Instead, create the mapping outside of the row iteration, then write the rows based on the mapping. I use a dict object for this.
import csv

Keyfile = open('keyfile.csv', 'rb')
reader_file1 = csv.reader(open('file1.csv', 'rb'), delimiter=",")
writer_file2 = csv.writer(open('file2.csv', 'w'), delimiter=",")

mapDict = {}
# the first line of Keyfile converted to a dict: file2 column -> file1 column
reader = csv.reader(Keyfile, delimiter=',')
for i, v in enumerate(reader.next()):
    if v.strip():
        mapDict[i] = int(v) - 1  # keyfile columns are 1-based

# re-index each row of file1 based on mapDict
for row in reader_file1:
    writer_file2.writerow([row[mapDict[c]] for c in sorted(mapDict)])