Efficiently finding intersecting regions in two huge dictionaries - python

I wrote a piece of code that finds common IDs in line[1] of two different files. My input file is huge (2 million lines). If I split it into many small files, I get more intersecting IDs; if I run the whole file at once, I get far fewer. I cannot figure out why. Can you suggest what is wrong and how to improve this code to avoid the problem?
fileA = open("file1.txt", 'r')
fileB = open("file2.txt", 'r')
output = open("result.txt", 'w')

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in dictB:
    if key in dictA:
        output.write(dictA[key][0]+'\t'+dictA[key][1]+'\t'+dictB[key][4]+'\t'+dictB[key][5]+'\t'+dictB[key][9]+'\t'+dictB[key][10])
My file1 is sorted by line[0] and has columns 0-15:
contig17 GRMZM2G052619_P03 98 109 2 0 15 67 78.8 0 127 5 420 0 304 45
contig33 AT2G41790.1 98 420 2 0 21 23 78.8 1 127 5 420 2 607 67
contig98 GRMZM5G888620_P01 87 470 1 0 17 28 78.8 1 127 7 420 2 522 18
contig102 GRMZM5G886789_P02 73 115 1 0 34 45 78.8 0 134 5 421 0 456 50
contig123 AT3G57470.1 83 201 2 1 12 43 78.8 0 134 9 420 0 305 50
My file2 is not sorted and has columns 0-10:
GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07525 1
GRMZM5G888620 GRMZM5G888620_P01 1 2367 GO:0011551 DNA binding "Any molecular function by which a gene product interacts selectively and non-covalently with DNA" [GOC:jl] molecular_function PF07589 4
GRMZM5G886789 GRMZM5G886789_P02 1 4567 GO:0055516 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07526 0
My desired output:
contig17 GRMZM2G052619_P03 GO:0043531 ADP binding molecular_function PF07525
contig98 GRMZM5G888620_P01 GO:0011551 DNA binding molecular_function PF07589
contig102 GRMZM5G886789_P02 GO:0055516 ADP binding molecular_function PF07526
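For reference, the core of the task is a set intersection over the ID column; here is a minimal plain-Python sketch (assuming tab-separated files as above). Note that if an ID occurs more than once in a file, dict assignment keeps only the last row for it, which is one way a full file can yield fewer matches than its pieces:

# Sketch: build one dict per file keyed on column 1, then intersect the keys.
def load(path):
    d = {}
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip('\n').split('\t')
            d[fields[1]] = fields  # a repeated ID silently overwrites the earlier row
    return d

dictA = load("file1.txt")
dictB = load("file2.txt")

with open("result.txt", 'w') as out:
    for key in dictA.keys() & dictB.keys():  # set intersection of the IDs
        a, b = dictA[key], dictB[key]
        out.write('\t'.join([a[0], a[1], b[4], b[5], b[9], b[10]]) + '\n')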

I really recommend using pandas to cope with this kind of problem.
As proof that this can be done simply with pandas:
import pandas as pd            # install this and read the docs
from StringIO import StringIO  # only needed to simulate files here (Python 2; use io.StringIO on Python 3)
# simulating reading the first file
first_file = """contig17 GRMZM2G052619_P03 x
contig33 AT2G41790.1 x
contig98 GRMZM5G888620_P01 x
contig102 GRMZM5G886789_P02 x
contig123 AT3G57470.1 x"""
#simulating reading the second file
second_file = """y GRMZM2G052619_P03 y
y GRMZM5G888620_P01 y
y GRMZM5G886789_P02 y"""
# Here is how you open the files. Instead of StringIO you will simply
# pass the file path. Give the correct separator: sep="\t" for tab-separated
# data (here I'm using a space).
# In names, put some relevant names for your columns.
f_df = pd.read_table(StringIO(first_file),
                     header=None,
                     sep=" ",
                     names=['a', 'b', 'c'])
s_df = pd.read_table(StringIO(second_file),
                     header=None,
                     sep=" ",
                     names=['d', 'e', 'f'])
# This is the hard bit; here I am using a bit of my experience with pandas.
# Basically, it selects the rows of the second data frame whose value in
# column e "isin" column b of the first data frame.
my_df = s_df[s_df.e.isin(f_df.b)]
Output:
Out[180]:
d e f
0 y GRMZM2G052619_P03 y
1 y GRMZM5G888620_P01 y
2 y GRMZM5G886789_P02 y
# you can save this with:
my_df.to_csv("result.txt", sep="\t")
Cheers!
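Note that the desired output in the question mixes columns from both files; isin only filters one frame, so a merge on the shared ID column is the natural extension (a sketch using the frames defined above):

# Sketch: inner-join the two frames on the shared ID column, keeping
# rows whose ID appears in both and columns from both sides.
merged = f_df.merge(s_df, left_on='b', right_on='e', how='inner')
merged.to_csv("result.txt", sep="\t", index=False)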

This is almost the same but within a function.
import re

# Creates a function to do the reading for each file
def read_store(file_, dictio_):
    """Given a file name and a dictionary, stores each line of the
    file in the dictionary, keyed by its value in the second column."""
    with open(file_, 'r') as file_0:
        for line in file_0:
            # Capture the first run of letters, digits or underscores
            # after the first separator, i.e. the ID in the second column.
            match = re.findall(r"^\S+\s+(\w+)", line)
            if match:
                dictio_[match[0]] = line
To use it, do:
file1 = {}
read_store("file1.txt", file1)
And then compare them normally as you do now, but I would use \s instead of \t to split. Even though it will also split between words, that is easy to rejoin with " ".join(DictA[1:5]).
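For completeness, a sketch of that comparison step, assuming both files are loaded with read_store above:

file1, file2 = {}, {}
read_store("file1.txt", file1)
read_store("file2.txt", file2)

# Write one output line per ID present in both dictionaries.
with open("result.txt", 'w') as out:
    for key in file1:
        if key in file2:
            out.write(file1[key].rstrip('\n') + '\t' + file2[key])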

Related

Remove numbers and user's stop words from pandas data frame

I would like to know how to remove some values from a dataset, specifically numbers and a list of strings. For example:
Test Num
0 bam 132
1 - 65
2 creation 47
3 MAN 32
4 41 831
... ... ...
460 Luchino 21
461 42 4126 7
462 finger 43
463 washing 1
I would like to have something like
Test Num
0 bam 132
2 creation 47
... ... ...
460 Luchino 21
462 finger 43
463 washing 1
where I removed (manually) MAN (it should be included in a list of strings, like a stop word), -, and the numbers.
I have tried with isdigit, but it is not working, so I am sure there are errors in my code:
df['Text'].where(~df['Text'].str.isdigit())
and for my stop words:
my_stop=['MAN','-']
df['Text'].apply(lambda lst: [x for x in lst if x in my_stop])
If you want to filter, you could use .loc:
df = df.loc[~df.Text.str.isdigit() & ~df.Text.isin(['MAN']), :]
.where(cond, other) returns a dataframe or series of the same shape as self, but keeps the original values where cond is true and replaces with other where it is false.
Read more in the docs
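A quick illustration of the difference on a toy frame (hypothetical data):

import pandas as pd

df = pd.DataFrame({'Text': ['bam', '-', '41', 'MAN', 'finger'],
                   'Num': [132, 65, 831, 32, 43]})

mask = ~df['Text'].str.isdigit() & ~df['Text'].isin(['MAN', '-'])

# .loc drops the unwanted rows entirely...
print(df.loc[mask])
# ...while .where keeps the shape and fills rejected rows with NaN.
print(df['Text'].where(mask))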
Hi, you should try this code:
df[df['Text'] != 'MAN']

How to extract a sum from a text file in Python

I have a text file that has 6 columns:
1. sex (M/F) 2. age 3. height 4. weight 5. -/+ 6. zip code
I need to find from this file how many males have the - sign (for example: from the file, 30 M (males) are -).
So I need only the number at the end.
Logically I need to work with column 1 and column 5, but I am struggling to get only one (sum) number at the end.
This is the content of the text:
M 87 66 133 - 33634
M 17 77 119 - 33625
M 63 57 230 - 33603
F 55 50 249 - 33646
M 45 51 204 - 33675
M 58 49 145 - 33629
F 84 70 215 - 33606
M 50 69 184 - 33647
M 83 60 178 - 33611
M 42 66 262 - 33682
M 33 75 176 + 33634
M 27 48 132 - 33607
I am getting a result now, but I want to match both the M and the sign. How can I add that to occurrences?
f = open('corona.txt', 'r')
data = f.read()
occurrences = data.count('M')
print('Number of Males that have been tested positive:', occurrences)
You can split the lines like this:
occurrences = 0
with open('corona.txt') as f:
    for line in f:
        cells = line.split()
        if cells[0] == "M" and cells[4] == "-":
            occurrences += 1
print("Occurrences of M-:", occurrences)
But it is better to use the csv module or pandas for this type of work.
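For instance, with the csv module (a sketch; the file is space-delimited as shown above):

import csv

occurrences = 0
with open('corona.txt') as f:
    # Treat each space-separated line as one record.
    for row in csv.reader(f, delimiter=' '):
        if row and row[0] == 'M' and row[4] == '-':
            occurrences += 1
print("Occurrences of M-:", occurrences)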
If you do any significant amount of work with text and columnar data, I would suggest getting started on learning pandas.
For this task, if your csv is one record per line and is space-delimited:
import pandas as pd
d = pd.read_csv('data.txt',
                names=['Sex', 'Age', 'Height', 'Weight', 'Sign', 'ZIP'],
                sep=' ', index_col=False)
d[(d.Sex=='M') & (d.Sign=='-')].shape[0]  # or
len(d[(d.Sex=='M') & (d.Sign=='-')])      # same result, in this case = 9
Pandas is a very extensive package. What this code does is build a DataFrame from your csv data, giving each column a name. It then selects each row where both of your conditions, Sex == 'M' and Sign == '-', hold, and reports the number of records thus found.
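An equivalent one-liner sums the boolean mask directly, since True counts as 1:

((d.Sex == 'M') & (d.Sign == '-')).sum()  # also 9 on the sample data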
I recommend starting here

pandas group by multiple columns and remove rows based on multiple conditions

I have a dataframe which is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
It's a csv dump. From this I want to group by imagename and brandname; wherever the values in xdiff and ydiff are both less than 10, I want to remove the second line.
For example, from the first two lines I want to delete the second line; similarly, from lines 3 and 4 I want to delete line 4.
I could do this quickly in R using dplyr's group_by, lag and lead functions. However, I am not sure how to combine the equivalent functions in Python to achieve this. This is what I have tried so far:
df[df.groupby(['imagename','brandname']).xdiff.transform() <= 10]
I am not sure what function I should call within transform, or how to include ydiff too.
The expected output is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
You can take the individual groupby frames and apply the condition through the apply function:
# xdiff only:
#df.groupby(['imagename','brandname'], group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if x['xdiff'].lt(10).any() else x)
# xdiff and ydiff:
df.groupby(['imagename','brandname'], group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if (x['xdiff'].lt(10).any() and x['ydiff'].lt(10).any()) else x)
Out:
imagename locationName brandname x y w h xdiff ydiff
2 95-20180407-215120-235505-00050.jpg Shirt DHFL 3 450 94 45 2 -41
5 95-20180407-215120-235505-00050.jpg Shirt DHFL 446 349 99 90 279 30
7 95-20180407-215120-235505-00050.jpg Shirt GOIBIBO 559 212 70 106 104 -130
0 95-20180407-215120-235505-00050.jpg Shirt SAMSUNG 0 490 177 82 0 0
4 95-20180407-215120-235505-00050.jpg DUGOUT VIVO 167 319 36 38 162 -132
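A vectorized alternative, closer in spirit to dplyr's lag, marks near-duplicate rows directly (a sketch: it assumes xdiff/ydiff already hold the offsets from the previous row, and compares their absolute values against the threshold):

# Keep the first row of each (imagename, brandname) group, plus any row
# that moved at least 10 in x or y relative to the previous row.
first = ~df.duplicated(['imagename', 'brandname'])
near_dup = df['xdiff'].abs().lt(10) & df['ydiff'].abs().lt(10)
result = df[first | ~near_dup]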

How to combine rows of strings into one using pandas in a table or how to concatenate different rows of a column in a sentence using python?

Input:
LineNo word_num left top width text
1 1 322 14 14 My
1 2 304 4 41 Name
1 3 322 5 9 is
1 4 316 14 20 Raghav
2 1 420 129 34 Problem
2 2 420 31 27 just
2 3 420 159 27 got
2 4 431 2 38 complicated
1 1 322 14 14 #40
1 2 304 4 41 #gmail.com
2 1 420 129 34 2019
2 2 420 31 27 January
As you can see, there are columns LineNo, word_num, left and top, so I was trying to find some logic using these; maybe with that I can achieve my solution.
I want to tweak the output. This output comes from a PDF after it is converted to an image, so whole lines are being caught and the result does not make sense. What I am thinking of doing now is to group the text in a meaningful way. For example,
let's say I am getting this output by using:
g = df['line_num'].ne(df['line_num'].shift()).cumsum()
out = '\n'.join(df.groupby(g)['text'].agg(' '.join))
print (out)
Output=
"My name is raghav #40 #gmail.com
Problem just got complicated $2019 January"
Expected Output=
"My name is raghav
*40
#gmail.com
Problem just got complicated
2019 January"
Each group goes on its own line, no matter whether the words were on the same line originally; they are logically grouped into different lines.
In my understanding, maybe we can achieve this with these steps:
a) Words on the same line are grouped if x distance < threshold
b) Words on the next line are grouped with the previous line if y distance < threshold
Threshold is width(image)/100; x distance is calculated from left; y distance is calculated from top.
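A rough sketch of step (a), assuming the columns from the table above and a hypothetical image_width for the threshold:

# Sketch: group words into phrases by the x gap between consecutive
# words on the same line; step (b) would compare top the same way.
threshold = image_width / 100  # image_width is assumed to come from the source image

phrases, current = [], []
prev_line, prev_right = None, None
for row in df.itertuples():
    # Start a new phrase on a new line or when the x gap is too large.
    if prev_line != row.LineNo or (row.left - prev_right) > threshold:
        if current:
            phrases.append(' '.join(current))
        current = []
    current.append(row.text)
    prev_line, prev_right = row.LineNo, row.left + row.width
if current:
    phrases.append(' '.join(current))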
Can we do this ?
Let me know if the question is not clear enough!
Thanks!
I have added an image of the output I am trying to get; the data in it is a little complicated, so I have adapted it here.
To answer your second concern, maybe try iterating through the column like so:
phrase = ""
for i in range(len(df)):
    if isinstance(df.loc[i, 'text'], str):
        phrase = phrase + " " + df.loc[i, 'text']
To add the space/..., I agree with jezrael: use the str.cat method.
Use a double join: with agg, and then join the output Series:
out = '.....'.join(df.groupby('LineNo')['text'].agg(' '.join))
print (out)
My Name is Raghav.....Roll No. # 242
Another solution with str.cat:
out = df.groupby('LineNo')['text'].agg(' '.join).str.cat(sep='.....')
EDIT:
g = df['LineNo'].ne(df['LineNo'].shift()).cumsum()
out = '.....'.join(df.groupby(g)['text'].agg(' '.join))
print (out)
My Name is Raghav.....Roll No. # 242.....hello the problem just.....got more complicated !!!!
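The ne/shift/cumsum idiom is worth a word: it starts a new group id every time LineNo changes, so each visual line groups separately even when line numbers repeat. A tiny illustration:

import pandas as pd

s = pd.Series([1, 1, 2, 2, 1, 1, 2, 2])
g = s.ne(s.shift()).cumsum()
print(g.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4]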

Reading values from a text file with different row and column size in python

I have read other similar posts, but they don't seem to work in my case; hence I'm posting here.
I have a text file with varying row and column sizes. I am interested in the rows of values which have a specific parameter. E.g., in the sample text file below, I want the last two values of each line which has the number '1' in the second position. That is, I want the values '1, 101', '101, 2', '2, 102' and '102, 3' from the lines starting with the values 101 to 104, because they have the number '1' in the second position.
$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
425
.
.
$EndNodes
$Elements
630
.
97 15 2 0 193 97
98 15 2 0 195 98
99 15 2 0 197 99
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
103 1 2 0 202 2 102
104 1 2 0 202 102 3
301 2 2 0 303 178 78 250
302 2 2 0 303 250 79 178
303 2 2 0 303 198 98 249
304 2 2 0 303 249 99 198
.
.
.
$EndElements
The problem is, with the code I have come up with below, it starts from '101' but reads the values from the other lines up to '304' or more. What am I doing wrong, or does someone have a better way to tackle this?
# Here, (additional_lines + anz_knoten_gmsh - 2) are additional lines that need
# to be skipped at the beginning of the .txt file. Initially I find out where
# the range of the lines I need lies.
# two_noded_elem_start is the first line having '1' in the second position and
# four_noded_elem_start is the first line having '2' in the second position.
# So, basically, I'm reading between these two parameters.
import os
import csv

input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"))
output_file = open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w")

# Skip ahead to the first two-noded element line.
for i, line in enumerate(input_file):
    if i == (additional_lines + anz_knoten_gmsh + two_noded_elem_start - 2):
        break

for i, line in enumerate(input_file):
    if i == additional_lines + anz_knoten_gmsh + four_noded_elem_start - 2:
        break
    elem_list = line.strip().split()
    del elem_list[:5]
    writer = csv.writer(output_file)
    writer.writerow(elem_list)

input_file.close()
output_file.close()
*EDIT: The piece of code used to find parameters like two_noded_elem_start is as follows:
# anz_elemente_ueberg_gmsh is another parameter found out by a previous piece
# of code, and '$EndElements' is what is at the end of the text file
# "mesh_outer_region.msh".
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"), "r")
for i, line in enumerate(input_file):
    if line.strip() == anz_elemente_ueberg_gmsh:
        break
for i, line in enumerate(input_file):
    if line.strip() == '$EndElements':
        break
    element_list = line.strip().split()
    if element_list[1] == '1':
        two_noded_elem_start = int(element_list[0])
        break
input_file.close()
>>> with open('filename') as fh:                       # Open the file
...     for line in fh:                                # For each line in the file
...         values = line.split()                      # Split the values into a list
...         if len(values) > 1 and values[1] == '1':   # Compare the second value (skip short lines)
...             print(values[-2], values[-1])          # Print the 2nd-from-last and last values
1 101
101 2
2 102
102 3
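To write the selected pairs to a file instead of printing them, the same filter can feed csv.writer (a sketch; file names follow the question):

import csv

with open('mesh_outer_region.msh') as fh, \
     open('mesh_skip_nodes.txt', 'w') as out:
    writer = csv.writer(out)
    for line in fh:
        values = line.split()
        # Only element lines with '1' in the second field qualify.
        if len(values) > 1 and values[1] == '1':
            writer.writerow(values[-2:])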
