Python - comparing multiple files from different folders and generating diff files - python

I want to automate the scenario below in Python.
Actual:
cc0023-base.txt
cc9038.final.txt
Expected:
base.txt
final.txt
1"Actual" and "Expected" are two different folders under same directory.i want to compare "base" and "final" files of both folders and generate the diff file in another folder.
Diff:
base-diff.txt
final-diff.txt
How do I do it in Python? Below is the sample code I have written, but it generates diff files for all possible combinations. I need base to be compared only with base, and final only with final, across both folders.
expected_files = os.listdir('expected/path')
actual_files = os.listdir('actual/path')
diff_files = os.listdir('diff/path')
cr = ['base.txt', 'final.txt']
i = 0
for files in expected_files:
    tst = os.path.join('expected/path', files)
    with open(tst, 'r') as Expected:
        for actualfile in actual_files:
            actualpath = os.path.join('actual/path', actualfile)
            with open(actualpath, 'r') as actual:
                diff = difflib.unified_diff(Expected.readlines(),
                                            actual.readlines(),
                                            fromfile=Expected,
                                            tofile=actual,)
                diffpath = os.path.join('diff/path', cr[i])
                diff_file = open(diffpath, 'w')
                for line in diff:
                    diff_file.write(line)
                diff_file.close()
                i = i + 1
Please help, as I am new to Python.

The issue in your code is in this section:
i=0
diffpath = os.path.join('diff/path', cr[i])
diff_file = open(diffpath, 'w')
for line in diff:
    diff_file.write(line)
diff_file.close()
i=i+1
Since you are always setting i to 0 before accessing cr[i], it will always be cr[0].
Move the i=0 to before the start of the loop for which you want the value initialized to 0.
I think you want something like this:
import difflib
import os

expected_files = os.listdir('expected/path')
actual_files = os.listdir('actual/path')
diff_files = os.listdir('diff/path')
cr = ['base.txt', 'final.txt']
for files in expected_files:
    tst = os.path.join('expected/path', files)
    with open(tst, 'r') as Expected:
        # i=0
        for i, actualfile in enumerate(actual_files):
            actualpath = os.path.join('actual/path', actualfile)
            with open(actualpath, 'r') as actual:
                diff = difflib.unified_diff(Expected.readlines(),
                                            actual.readlines(),
                                            fromfile=tst,        # label with the path, not the file object
                                            tofile=actualpath)
                diffpath = os.path.join('diff/path', cr[i])
                with open(diffpath, 'w') as diff_file:
                    for line in diff:
                        diff_file.write(line)
                # diff_file.close()
                # i=i+1
Some explanation: enumerate(actual_files) gives you an index i together with each item actualfile from the list, so you don't have to do the incrementing yourself. (Also worth noting that this will break for more than 2 files in your directory!) Also, you can use the with open() as foo: syntax for writes, as shown.
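Since pairing by enumerate relies on the two directory listings lining up, a more robust variant is to match files explicitly by name. The following is a sketch, assuming (from the question's examples) that each folder contains exactly one file whose name contains "base" and one whose name contains "final":

import difflib
import os

# folder names taken from the question's sample code
expected_dir = 'expected/path'
actual_dir = 'actual/path'
diff_dir = 'diff/path'

for stem in ['base', 'final']:
    # pick the single file in each folder whose name contains the stem,
    # e.g. 'base.txt' in Expected and 'cc0023-base.txt' in Actual
    expected_name = next(f for f in os.listdir(expected_dir) if stem in f)
    actual_name = next(f for f in os.listdir(actual_dir) if stem in f)

    expected_path = os.path.join(expected_dir, expected_name)
    actual_path = os.path.join(actual_dir, actual_name)

    with open(expected_path) as exp, open(actual_path) as act:
        diff = difflib.unified_diff(exp.readlines(), act.readlines(),
                                    fromfile=expected_path, tofile=actual_path)
        with open(os.path.join(diff_dir, stem + '-diff.txt'), 'w') as diff_file:
            diff_file.writelines(diff)

This way base is only ever compared with base and final with final, regardless of how many other files the folders contain or how the listings are ordered.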

Related

Find the list of numbers in another list using nested for loops and an if condition

I have two different Excel files, file 1 and file 2, and I have stored the values of their columns in two different lists. I have to search for each number present in file 1 within file 2, and I want the output as in file 3/ExpectedAnswer.
File 1:
File 2:
File 3/ Expected Answer:
I tried the code below for this requirement, but I don't know where I'm going wrong.
for j in range(len(terr_code)):
    g = terr_code[j]
    #print(g)
    for lists in Zip_code:
        Zip_code = lists.split(";")
        while '' in Zip_code:
            Zip_code.remove('')
        for i in range(len(Zip_code)):
            #print(i)
            h = Zip_code[i]
            print(g)
            if g in h:
                print(h)
                territory_code.append(str(terr_code[j]))
                print(territory_code[j])
                final_list.append(Zip_terr_Hem['NAME'][i])
                #print(final_list)
s = ";"
s = s.join(str(v) for v in final_list)
#print(s)
final_file['Territory Code'] = pd.Series(str(terr_code[j]))
final_file['Territory Name'] = pd.Series(s)
final_file = pd.DataFrame(final_file)
final_file.to_csv('test file.csv', index=False)
The first for loop is working fine. But when I try to print the list of numbers from the second for loop, the first number gets printed multiple times. And though both lists are populated, the code never gets inside the if condition. Please tell me what I'm doing wrong here. Thanks
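Two things stand out in the code as posted (an observation, not a verified fix, since the input files aren't shown): the inner loop rebinds Zip_code while iterating over it, and `if g in h` is a substring test between strings rather than an equality check. A minimal sketch of the same matching with those two issues avoided, reusing the question's variable names:

territory_code = []
final_list = []
for g in terr_code:
    for entry in Zip_code:                           # e.g. "123;456;;789"
        codes = [c for c in entry.split(";") if c]   # split and drop empty strings
        for i, h in enumerate(codes):                # don't rebind Zip_code itself
            if str(g) == h.strip():                  # exact match instead of substring test
                territory_code.append(str(g))
                final_list.append(Zip_terr_Hem['NAME'][i])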

How to modify iteration list?

I have the following scenario for traversing a dir structure:
"Build a complete dir tree with files, but if files in a single dir are similar in name, list only a single entity."
Example tree (let's assume they are not sorted):
- rootDir
    - dirA
        fileA_01
        fileA_03
        fileA_05
        fileA_06
        fileA_04
        fileA_02
        fileA_...
        fileAB
        fileAC
    - dirB
        fileBA
        fileBB
        fileBC
Expected output:
- rootDir
    - dirA
        fileA_01 - fileA_06 ...
        fileAB
        fileAC
    - dirB
        fileBA
        fileBB
        fileBC
So I already wrote a simple def findSimilarNames that, for fileA_01 (or any fileA_*), will return the list [fileA_01 ... fileA_06].
Now I'm in os.walk, looping over files, and every file gets checked against similar filenames; e.g. for fileA_03 I get the rest of them [fileA_01 - fileA_06]. Now I want to modify the list I iterate over so it skips the items from findSimilarNames, without another loop or ifs inside.
I searched here and people suggest avoiding modifying the iteration list, but doing so would let me avoid iterating over every file.
Pseudo code:
for root, dirs, files in os.walk(path):
    for file in files:
        similarList = findSimilarNames(file)
        #OVERWRITE ITERATION LIST SOMEHOW
        files = (set(files) - set(similarList))
        #DEAL WITH ELEMENT
What I'm trying to avoid is the version below: checking each file even though it may already have been covered by findSimilarNames.
for root, dirs, files in os.walk(path):
    filteredbysimilar = files[:]
    for file in files:
        similar = findSimilarNames(file)
        filteredbysimilar = list(set(filteredbysimilar) - set(similar))
    #--
    for filteredFile in filteredbysimilar:
        #DEAL WITH ELEMENT
        #OVERWRITE ITERATION LIST SOMEHOW
You can get this effect by using a while-loop style iteration. Since you want to do set subtraction to remove the similar groups anyway, the natural approach is to start with a set of all the filenames, and repeatedly remove groups until nothing is left. Thus:
unprocessed = set(files)
while unprocessed:
    f = unprocessed.pop()   # removes and returns an arbitrary element
    group = findSimilarNames(f)
    unprocessed -= group    # it is not an error that `f` has already been removed.
    doSomethingWith(group)  # i.e., "DEAL WITH ELEMENT" :)
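Dropped into the os.walk loop from the question, that could look like the sketch below. Note the findSimilarNames here is a stand-in guessed from the fileA_01 .. fileA_06 example, since the question doesn't show the real implementation, and it takes the candidate list as a second argument:

import os
import re

path = 'rootDir'  # root directory from the question (placeholder)

def findSimilarNames(name, names):
    # Stand-in: files are "similar" when they share the prefix before a
    # trailing _<digits>, e.g. fileA_01 .. fileA_06. Returns a set, as the
    # answer above assumes.
    m = re.match(r'(.+)_\d+$', name)
    if not m:
        return {name}
    pattern = re.escape(m.group(1)) + r'_\d+$'
    return {n for n in names if re.match(pattern, n)}

for root, dirs, files in os.walk(path):
    unprocessed = set(files)
    while unprocessed:
        f = unprocessed.pop()
        group = findSimilarNames(f, files)
        unprocessed -= group
        # "DEAL WITH ELEMENT": list one entity per similar group
        first = min(group)
        print(os.path.join(root, first), '...' if len(group) > 1 else '')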
How about building up a set of files that aren't similar?
unsimilar = set()
for f in files:
    if len(findSimilarNames(f).intersection(unsimilar)) == 0:
        unsimilar.add(f)
This assumes findSimilarNames yields a set.

Open two files pairwise out of many - python

Hey guys, I'm a rookie in Python and need some help.
My problem is that I have a folder full of text files (with lists in them), where two files belong to each other and need to be read and compared.
Folder with many files: File1_in.xlo, File1_out.xlo, File2_in.xlo, File2_out.xlo, ...
--> so File1_in.xlo and File1_out.xlo belong together and need to be compared.
I can already append the lists of the in-files (or out-files) and then compare them, but since there are many files the lists become really long (thousands and thousands of entries), so the idea is to compare the files, or respectively the lists, pairwise.
My first try looks like:
import os
for filename in sorted(os.listdir('path')):
    if filename.endswith('in.xlo'):
        with open(os.path.join('path', filename)) as inn:
            lines = inn.readlines()
            for x in lines:
                temperatureIn = x.split()[4]
    if filename.endswith('out.xlo'):
        with open(os.path.join('path', filename)) as outt:
            lines = outt.readlines()
            for x in lines:
                temperatureOut = x.split()[4]  # 4. column in list
So the problem is, as you can see, that the temperatureIn values are always overwritten before I can compare them with the temperatureOut values. I think/hope there must be a way to open both files at once and compare the list entries.
I hope you can understand my problem and that someone can help me.
Thanks
Use zip to access in-files and out-files in pairs:
files = sorted(os.listdir('path'))
in_files = [fname for fname in files if fname.endswith('in.xlo')]
out_files = [fname for fname in files if fname.endswith('out.xlo')]
for in_file, out_file in zip(in_files, out_files):
    with open(os.path.join('path', in_file)) as inn, open(os.path.join('path', out_file)) as outt:
        # Do whatever you want
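For example, to compare the temperatures pairwise, the loop body could look like this (a sketch assuming, as in the question's code, that the temperature is at index 4 of each whitespace-split line):

for in_file, out_file in zip(in_files, out_files):
    with open(os.path.join('path', in_file)) as inn, open(os.path.join('path', out_file)) as outt:
        temps_in = [line.split()[4] for line in inn]
        temps_out = [line.split()[4] for line in outt]
        # compare the paired files entry by entry
        for t_in, t_out in zip(temps_in, temps_out):
            if t_in != t_out:
                print(in_file, out_file, t_in, t_out)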
Add them to a list created just before your for loop, as:
temps_in = []
for x in lines:
    temperatureIn = x.split()[4]
    temps_in.append(temperatureIn)
Do the same thing for the temperatures out, then compare your two lists.

Python read and write a file faster

Here is my code for reading a huge file (more than 15 GiB) called interactions.csv, doing some checks on each row and, based on the check, splitting the interactions file into two separate files: test.csv and train.csv.
It takes more than two days on my machine to finish. Is there any way I can make this code faster, maybe using some kind of parallelism?
target_items: a list containing some item IDs
The current program:
with open(interactions) as interactionFile, open("train.csv", "wb") as train, open("test.csv", "wb") as test:
    header = interactionFile.next()
    train.write(header+'\n')
    test.write(header+'\n')
    i = 0
    for row in interactionFile:
        # process each row
        l = row.split('\t')
        if l[1] in target_items:
            test.write(row+'\n')
        else:
            train.write(row+'\n')
        print(i)
        i += 1
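Two cheap wins stand out before reaching for parallelism (a suggestion based on the code shown, not a measured result): `l[1] in target_items` scans the whole list for every row, so making target_items a set turns that into a constant-time lookup, and the per-row print is itself expensive over millions of lines. A sketch in Python 3 (the original uses the Python 2 .next() idiom):

target_set = set(target_items)  # O(1) membership test instead of scanning a list

with open(interactions) as interaction_file, \
     open("train.csv", "w") as train, \
     open("test.csv", "w") as test:
    header = next(interaction_file)   # Python 3 spelling of interactionFile.next()
    train.write(header)
    test.write(header)
    for row in interaction_file:
        # rows keep their trailing '\n', so no extra newline is needed
        if row.split('\t', 2)[1] in target_set:
            test.write(row)
        else:
            train.write(row)

Since the work is a single sequential pass over one input file, it is likely I/O-bound anyway; the set lookup plus removing the per-row print often helps far more than parallelism would here.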

Identify intersection of multiple lists using multiple files from an input

I am trying to write some code that takes a Metadata.txt as input, then identifies the common genes across the different input files whose names are extracted from the Metadata.txt file.
Example of Metadata.txt
SIG1 SIG2
File1 File3
File2 File4
File3 File5
File4
The files in my directory are File1.xls, File2.xls, File3.xls ... File6.xls. For simplicity, I have the same inputs for File1 and File3, as well as for File2 and File4.
File1.xls or File3.xls
TargetID FoldChange p-value Adjusted-p
A 0.543528215 0.000518847 0.000518847
B 0.638469898 0.00204759 0.00204759
C 1.936595724 0.00250229 0.00250229
D 0.657322154 0.012840013 0.012840013
E 1.728842021 0.00251105 0.00251105
F 2.024842641 0.000719261 0.000719261
G 4.049059413 2.25E-05 2.25E-05
H 0.478660942 0.000352179 0.000352179
I 0.449304016 0.000489521 0.000489521
File2.xls or File4.xls
TargetID FoldChange p-value Adjusted-p
JJ 0.453537892 4.22E-06 4.22E-06
A 0.558325503 0.001697851 0.001697851
B 0.637336564 7.64E-05 7.64E-05
D 1.804853034 0.000492439 0.000492439
E 0.378445825 1.72E-05 1.72E-05
JJJJ 1.601997491 0.019618883 0.019618883
File5.xls
TargetID FoldChange p-value Adjusted-p
A 3.140223972 0.013347275 0.013347275
B 1.5205222 0.032318774 0.032318774
C 1.532760451 0.043763101 0.043763101
D 1.522865896 0.001791471 0.001791471
The goal is to output two files, "SIG1.txt" and "SIG2.txt", which contain the common genes between File1/File2 and between File3/File4/File5, respectively. So the metadata provides the platform to iterate over things.
Here is what I had so far:
md_input = pd.read_table("Metadata.txt", sep="\t")  # opens the metadata file
for c in range(0, len(md_input.columns)):
    first_file = md_input.ix[0, c] + ".xls"
    print first_file  # this will print "File1.xls" for column 1 and "File3.xls" for column 2
    first_sig = pd.read_table(first_file, sep="\t", usecols=["TargetID", 'FoldChange'])  # opens the first file
    list1 = list(first_sig.iloc[:, 0])  # takes the first column of the first file and converts it to a list
    # Then, I aim to iterate over the remaining files in each column of the metadata
    # and find the intersection/common with each other. I tried the following:
    for i in range(1, md_input.count()[c]):
        list2 = []
        df = pd.read_table("{}.xls".format(md_input.ix[i, c]), sep="\t", usecols=["TargetID", 'FoldChange'])
        list2 = list(df.iloc[:, 0])  # assign the LIST
        common = list(set(list1).intersection(set(list2)))  # find intersection
        print common
When I print common, I only get the intersection with the LAST file, which is expected given how I wrote the loop/code. I am unable to find a way to iterate over all the files in the column and identify an intersection across all of them.
Please advise if I need to clarify the above further. I know it sounds complicated, but it shouldn't be; I tried to simplify it and I hope that worked.
I was finally able to get it to work. I am not sure if this is the simplest way, but it works. I think the confusing part in the script below is the use of the globals() dictionary to allow opening multiple files and assigning variable names based on the number in the for loop. Anyway, the script works and it also takes the fold changes into consideration. I hope this will be useful to others.
md_input = pd.read_table('Metadata.txt', sep="\t")
list_cols = list(md_input.columns)
df = pd.DataFrame(columns=list_cols)
for c in range(0, len(md_input.columns)):
    sets_up = []
    sets_down = []
    for i in range(0, md_input.count()[c]):
        globals()["file_" + str(i)] = md_input.ix[i, c] + ".xls"
        globals()["sig_" + str(i)] = pd.read_table(globals()["file_" + str(i)], sep="\t", usecols=["TargetID", 'FoldChange'])
        globals()["List_up" + str(i)] = []
        globals()["List_down" + str(i)] = []
        for z in range(0, len(globals()["sig_" + str(i)].index)):
            if globals()["sig_" + str(i)].ix[z, 'FoldChange'] >= 1.5:
                globals()["List_up" + str(i)].append(globals()["sig_" + str(i)].iloc[z, 0])
            elif globals()["sig_" + str(i)].ix[z, 'FoldChange'] <= 1.5:
                globals()["List_down" + str(i)].append(globals()["sig_" + str(i)].iloc[z, 0])
        sets_up.append(set(globals()["List_up" + str(i)]))
        sets_down.append(set(globals()["List_down" + str(i)]))
    common_up = list(set.intersection(*sets_up))
    common_down = list(set.intersection(*sets_down))
    common = common_up + common_down
    for x in range(0, len(common)):
        df.loc[x, md_input.columns[c]] = common[x]
df.to_csv("Output.xls", sep="\t", index=False)
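For what it's worth, the same logic can be written without globals(), since the per-file lists are only ever used to build the two sets. A sketch under the same assumptions (same file layout, 1.5 fold-change threshold, columns of unequal length in Metadata.txt):

import pandas as pd

md_input = pd.read_table('Metadata.txt', sep="\t")
df = pd.DataFrame(columns=list(md_input.columns))

for col in md_input.columns:
    sets_up, sets_down = [], []
    for name in md_input[col].dropna():      # dropna: columns have unequal length
        sig = pd.read_table(name + ".xls", sep="\t",
                            usecols=["TargetID", "FoldChange"])
        sets_up.append(set(sig.loc[sig["FoldChange"] >= 1.5, "TargetID"]))
        sets_down.append(set(sig.loc[sig["FoldChange"] < 1.5, "TargetID"]))
    # genes common to every file in the column, up- and down-regulated separately
    common = sorted(set.intersection(*sets_up)) + sorted(set.intersection(*sets_down))
    for x, gene in enumerate(common):
        df.loc[x, col] = gene

df.to_csv("Output.xls", sep="\t", index=False)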
