Finding Common Elements (Amazon SDE-1) - python

Given two lists V1 and V2 of sizes n and m respectively. Return the list of elements common to both the lists and return the list in sorted order. Duplicates may be there in the output list.
Link to the problem : LINK
Example:
Input:
5
3 4 2 2 4
4
3 2 2 7
Output:
2 2 3
Explanation:
The first list is {3 4 2 2 4}, and the second list is {3 2 2 7}.
The common elements in sorted order are {2 2 3}
Expected Time complexity : O(N)
My code:
class Solution:
def common_element(self,v1,v2):
dict1 = {}
ans = []
for num1 in v1:
dict1[num1] = 0
for num2 in v2:
if num2 in dict1:
ans.append(num2)
return sorted(ans)
Problem with my code:
So the accessing time in a dictionary is constant and hence my time complexity was reduced but one of the hidden test cases is failing and my logic is very simple and straight forward and everything seems to be on point. What's your take? Is the logic wrong or the question desc is missing some vital details?
New Approach
Now I am generating two hashmaps/dictionaries for the two arrays. If a num is present in another array, we check the min frequency and then appending that num into the ans that many times.
class Solution:
def common_element(self,arr1,arr2):
dict1 = {}
dict2 = {}
ans = []
for num1 in arr1:
dict1[num1] = 0
for num1 in arr1:
dict1[num1] += 1
for num2 in arr2:
dict2[num2] = 0
for num2 in arr2:
dict2[num2] += 1
for number in dict1:
if number in dict2:
minFreq = min(dict1[number],dict2[number])
for _ in range(minFreq):
ans.append(number)
return sorted(ans)
The code is outputting nothing for this test case
Input:
64920
83454 38720 96164 26694 34159 26694 51732 64378 41604 13682 82725 82237 41850 26501 29460 57055 10851 58745 22405 37332 68806 65956 24444 97310 72883 33190 88996 42918 56060 73526 33825 8241 37300 46719 45367 1116 79566 75831 14760 95648 49875 66341 39691 56110 83764 67379 83210 31115 10030 90456 33607 62065 41831 65110 34633 81943 45048 92837 54415 29171 63497 10714 37685 68717 58156 51743 64900 85997 24597 73904 10421 41880 41826 40845 31548 14259 11134 16392 58525 3128 85059 29188 13812.................
Its Correct output is:
4 6 9 14 17 19 21 26 28 32 33 42 45 54 61 64 67 72 77 86 93 108 113 115 115 124 129 133 135 137 138 141 142 144 148 151 154 160 167 173 174 192 193 195 198 202 205 209 215 219 220 221 231 231 233 235 236 238 239 241 245 246 246 247 254 255 257 262 277 283 286 290 294 298 305 305 307 309 311 312 316 319 321 323 325 325 326 329 329 335 338 340 341 350 353 355 358 364 367 369 378 385 387 391 401 404 405 406 406 410 413 416 417 421 434 435 443 449 452 455 456 459 460 460 466 467 469 473 482 496 503 .................
And Your Code's output is:

Please find the below solution
def sorted_common_elemen(v1, v2):
res = []
for elem in v2:
res.append(elem)
v1.pop(0)
return sorted(res)

Your code ignores the number of times a given element occurs in the list. I think this is a good way to fix that:
class Solution:
def common_element(self, l0, l1):
li = []
for i in l0:
if i in l1:
l1.remove(i)
li.append(i)
return sorted(li)

Related

Any faster way to search each value of one column through another column in python?

I need to search through each value of a column, do some comparison with all entries of another column and, if certain conditions are met, print. I'm using the python code below and it works, but the drawback is that both columns have tens of thousands of entries, so it's very slow. Is there a more efficient way to do this?
for i in df1.index:
for j in df2['pdb']:
if df1['pdb'][i] == df2['pdb'][j]:
if df1['res1'][i] >= df2['start'][j] and df1['res2'][i] <= df2['end'][j]:
print(df1['pdb'][i], df2['PFAM_ACC'][j])
Example:
df1 =
pdb res1 res2
4xhfA 76 83
4xhfA 126 133
2mx1A 179 186
3s8lA 111 118
4ucmA 115 122
1pigA 119 126
4mavA 263 270
4mavA 289 296
3sbrA 101 108
3sbrA 148 155
3sbrA 158 165
3sbrA 222 229
3sbrA 394 401
5zeaA 83 90
5zeaC 562 569
5zeaD 32 39
5zeaD 89 96
5zeaG 277 284
df2 =
pdb start end PFAM_ACC
4xhfA 140 236 PF04205
1pigA 61 332 PF00128
1pigA 409 493 PF02806
3sbrA 171 241 PF18793
3sbrA 424 494 PF18764
3sbrA 558 635 PF00116
5zeaA 13 75 PF02874
5zeaC 13 75 PF02874
5zeaD 15 81 PF02874
5zeaG 13 75 PF02874
and I want to get as output:
1pigA PF00128
3sbrA PF18793
5zeaD PF02874
I hope it's more clear now.
Please let me know if you have any suggestions
Try:
x = df1.merge(df2, on="pdb")
out = x.loc[
(x["res1"] >= x["start"]) & (x["res2"] <= x["end"]), ["pdb", "PFAM_ACC"]
]
print(out)
Prints:
pdb PFAM_ACC
2 1pigA PF00128
13 3sbrA PF18793
21 5zeaD PF02874

How to use certain rows of a dataframe in a formula

So I have multiple data frames and all need the same kind of formula applied to certain sets within this data frame. I got the locations of the sets inside the df, but I don't know how to access those sets.
This is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt #might used/need it later to check the output
df = pd.read_csv('Dalfsen.csv')
l = []
x = []
y = []
#the formula(trendline)
def rechtzetten(x,y):
a = (len(x)*sum(x*y)- sum(x)*sum(y))/(len(x)*sum(x**2)-sum(x)**2)
b = (sum(y)-a*sum(x))/len(x)
y1 = x*a+b
print(y1)
METING = df.ID.str.contains("<METING>") #locating the sets
indicatie = np.where(METING == False)[0] #and saving them somewhere
if n in df[n] != indicatie & n+1 != indicatie: #attempt to add parts of the set in l
append.l
elif n in df[n] != indicatie & n+1 == indicatie: #attempt defining the end of the set and using the formula for the set
append.l
rechtzetten(l.x, l.y)
else: #emptying the storage for the new set
l = []
indicatie has the following numbers:
0 12 13 26 27 40 41 53 54 66 67 80 81 94 95 108 109 121
122 137 138 149 150 162 163 177 178 190 191 204 205 217 218 229 230 242
243 255 256 268 269 291 292 312 313 340 341 373 374 401 402 410 411 420
421 430 431 449 450 468 469 487 488 504 505 521 522 538 539 558 559 575
576 590 591 604 605 619 620 633 634 647
Because my df looks like this:
ID,NUM,x,y,nap,abs,end
<PROFIEL>not used data
<METING>data</METING>
<METING>data</METING>
...
<METING>data</METING>
<METING>data</METING>
</PROFIEL>,,,,,,
<PROFIEL>not usde data
...
</PROFIEL>,,,,,,
tl;dr I'm trying to use a formula in each profile as shown above. I want to edit the data between 2 numbers of the list indicatie.
For example:
the fucntion rechtzetten(x,y) for the x and y df.x&df.y[1:11](Because [0]&[12] are in the list indicatie.) And then the same for [14:25] etc. etc.
What I try to avoid is typing the following hundreds of times manually:
x_#=df.x[1:11]
y_#=df.y[1:11]
rechtzetten(x_#,y_#)
I cant understand your question clearly, but if you want to replace a specific column of your pandas dataframe with a numpy array, you could simply assign it :
df['Column'] = numpy_array
Can you be more clear ?

split big matrix into multiple smaller ones - difficulties

I have a 32*32 matrix and I want to break it into 4 8x8 matrixes.
Here's how I try to make a smaller matrix for top-left part of the big one (pix is a 32x32 matrix).
A = [[0]*mat_size]*mat_size
for i in range(mat_ size):
for j in range(mat_size):
A[i][j] = pix[i, j]
So, pix has the following values for top-left part:
198 197 194 194 197 192 189 196
199 199 198 198 199 195 195 145
200 200 201 200 200 204 131 18
201 201 199 201 203 192 57 56
201 200 198 200 207 171 41 141
200 200 198 199 208 160 38 146
198 198 198 198 206 157 39 129
198 197 197 199 209 157 38 77
But when I print(A) after the loop, all the rows of A equal to the last row of pix. So it's 8 rows of 198 197 197 199 209 157 38 77 I know I can use A = pix[:8, :8], but I prefer to use loop for some purpose. I wonder why that loop solution doesn't gives me correct result.
A = np.zeros((4, 4, 8, 8))
for i in range(4):
for j in range(4):
A[i, j] = pix[i*8:(i+1)*8, j*8:(j+1)*8]
If I understand your question correctly, this solution should work. What it's doing is iterating through the pix matrix, and selecting a 8*8 matrix each time. Is this what you need?
Consider using numpy in order to avoid multiple references pointing to the same list (the last list in the matrix):
mat_size = 8
A = np.empty((mat_size,mat_size))
pix = np.array(pix)
for i in range(mat_size):
for j in range(mat_size):
A[i][j] = pix[i][j]

Python 3: Getting information from list in list

I have txt file in Python 3 like this (cities are just examples):
Tokyo 0 267 308 211 156 152 216 27 60 70 75
London 267 0 155 314 111 203 101 258 254 199 310
Paris 308 155 0 429 152 315 216 330 295 249 351
Vienna 211 314 429 0 299 116 212 184 271 265 252
Tallinn 156 111 152 299 0 183 129 178 143 97 199
Helsinki 152 203 313 116 183 0 99 126 212 151 193
Stockholm 216 101 216 212 129 99 0 189 252 161 257
Moscow 27 258 330 184 178 126 189 0 87 73 68
Riga 60 254 295 271 143 212 252 87 0 91 71
Melbourne 70 199 249 265 97 151 161 73 91 0 128
Oslo 75 310 351 252 199 193 257 68 71 128 0
I want to get program to work like this with an example:
Please enter starting point: Paris
Now please enter ending point: Riga
Distance between Paris and Riga is 295 km.
I'm fairly new in Python and I don't know how to read distance list in list.
What I managed to do so far:
cities = []
distances = []
file = open("cities.txt")
for city_info in file:
city_info = city_info.strip()
city = city_info.split()
cities.append(city[0])
distances2 = []
for dist in city[1:]:
distances2.append(int(dist))
distances.append(distances2)
# to check, if lists are good to go
print(distances)
print(cities)
file.close()
amount = len(cities)
for x in range(amount):
for y in range(amount):
startpoint = cities[x]
endpoint = cities[y]
dist1 = distances[x][y]
startpoint = input("Enter start point: ").capitalize()
if startpoint not in cities:
print("Start point doesn't exist in our database: ", startpoint)
else:
endpoint = input("Enter end point: ").capitalize()
if endpoint not in cities:
print("Start point doesn't exist in our database: ", endpoint)
else:
print("Distance between", startpoint, "and", endpoint, "is", dist1, "kilometers.")
As I'm not very competent in Python language, I don't know what I'm doing wrong.
For example I want to get distance between cities[1] and cities[4], so it should find distance from distances[1][4].
Try this:
# reading from file:
with open('cities.txt') as f:
lines = f.readlines()
# pre-processing
indices = {line.split()[0]: i for i, line in enumerate(lines)}
distances = [line.split()[1:] for line in lines]
#user input:
start = input("Please enter starting point: ")
end = input("Now please enter ending point: ")
# evaluation:
distance = distances[indices[start]][indices[end]]
# output:
print("Distance between {start} and {end} is {distance} km.".format(**locals()))
Another approach, not too different from the other answer.
cities = {}
with open('cities.txt') as f:
for i, line in enumerate(f.read().splitlines()):
vals = line.split()
cities[vals[0]] = {'index': i, 'distances': [int(i) for i in vals[1:]]}
startpoint = input("Enter start point: ").capitalize()
if startpoint in cities:
endpoint = input("Enter end point: ").capitalize()
if endpoint in cities:
index = cities[startpoint]['index']
distance = cities[endpoint]['distances'][index]
print('The distance from %s to %s is %d' % (startpoint, endpoint, distance))
else:
print('city %s does not exist' % endpoint)
else:
print('city %s does not exist' % startpoint)

Python selecting items by comparing values in a table using dictionary

I have a table with 12 columns and want to select the items in the first column (qseqid) based on the second column (sseqid). Meaning that the second column (sseqid) is repeating with different values in the 11th and 12th columns, which areevalueandbitscore, respectively.
The ones that I would like to get are having the lowestevalueand the highestbitscore(whenevalues are the same, the rest of the columns can be ignored and the data is down below).
So, I have made a short code which uses the second columns as a key for the dictionary. I can get five different items from the second column with lists of qseqid+evalueandqseqid+bitscore.
Here is the code:
#!usr/bin/python
filename = "data.txt"
readfile = open(filename,"r")
d = dict()
for i in readfile.readlines():
i = i.strip()
i = i.split("\t")
d.setdefault(i[1], []).append([i[0],i[10]])
d.setdefault(i[1], []).append([i[0],i[11]])
for x in d:
print(x,d[x])
readfile.close()
But, I am struggling to get the qseqid with the lowest evalue and the highest bitscore for each sseqid.
Is there any good logic to solve the problem?
Thedata.txtfile (including the header row and with»representing tab characters)
qseqid»sseqid»pident»length»mismatch»gapopen»qstart»qend»sstart»send»evalue»bitscore
ACLA_022040»TBB»32.71»431»258»8»39»468»24»423»2.00E-76»240
ACLA_024600»TBB»80»435»87»0»1»435»1»435»0»729
ACLA_031860»TBB»39.74»453»251»3»1»447»1»437»1.00E-121»357
ACLA_046030»TBB»75.81»434»105»0»1»434»1»434»0»704
ACLA_072490»TBB»41.7»446»245»3»4»447»3»435»2.00E-120»353
ACLA_010400»EF1A»27.31»249»127»8»69»286»9»234»3.00E-13»61.6
ACLA_015630»EF1A»22»491»255»17»186»602»3»439»8.00E-19»78.2
ACLA_016510»EF1A»26.23»122»61»4»21»127»9»116»2.00E-08»46.2
ACLA_023300»EF1A»29.31»447»249»12»48»437»3»439»2.00E-45»155
ACLA_028450»EF1A»85.55»443»63»1»1»443»1»442»0»801
ACLA_074730»CALM»23.13»147»101»4»6»143»2»145»7.00E-08»41.2
ACLA_096170»CALM»29.33»150»96»4»34»179»2»145»1.00E-13»55.1
ACLA_016630»CALM»23.9»159»106»5»58»216»4»147»5.00E-12»51.2
ACLA_031930»RPB2»36.87»1226»633»24»121»1237»26»1219»0»734
ACLA_065630»RPB2»65.79»1257»386»14»1»1252»4»1221»0»1691
ACLA_082370»RPB2»27.69»1228»667»37»31»1132»35»1167»7.00E-110»365
ACLA_061960»ACT»28.57»147»95»5»146»284»69»213»3.00E-12»57.4
ACLA_068200»ACT»28.73»463»231»13»16»471»4»374»1.00E-53»176
ACLA_069960»ACT»24.11»141»97»4»581»718»242»375»9.00E-09»46.2
ACLA_095800»ACT»91.73»375»31»0»1»375»1»375»0»732
And here's a little more readable version of the table's contents:
0 1 2 3 4 5 6 7 8 9 10 11
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
ACLA_022040 TBB 32.71 431 258 8 39 468 24 423 2.00E-76 240
ACLA_024600 TBB 80 435 87 0 1 435 1 435 0 729
ACLA_031860 TBB 39.74 453 251 3 1 447 1 437 1.00E-121 357
ACLA_046030 TBB 75.81 434 105 0 1 434 1 434 0 704
ACLA_072490 TBB 41.7 446 245 3 4 447 3 435 2.00E-120 353
ACLA_010400 EF1A 27.31 249 127 8 69 286 9 234 3.00E-13 61.6
ACLA_015630 EF1A 22 491 255 17 186 602 3 439 8.00E-19 78.2
ACLA_016510 EF1A 26.23 122 61 4 21 127 9 116 2.00E-08 46.2
ACLA_023300 EF1A 29.31 447 249 12 48 437 3 439 2.00E-45 155
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0 801
ACLA_074730 CALM 23.13 147 101 4 6 143 2 145 7.00E-08 41.2
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1.00E-13 55.1
ACLA_016630 CALM 23.9 159 106 5 58 216 4 147 5.00E-12 51.2
ACLA_031930 RPB2 36.87 1226 633 24 121 1237 26 1219 0 734
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0 1691
ACLA_082370 RPB2 27.69 1228 667 37 31 1132 35 1167 7.00E-110 365
ACLA_061960 ACT 28.57 147 95 5 146 284 69 213 3.00E-12 57.4
ACLA_068200 ACT 28.73 463 231 13 16 471 4 374 1.00E-53 176
ACLA_069960 ACT 24.11 141 97 4 581 718 242 375 9.00E-09 46.2
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0 732
Since you're a Python newbie I'm glad that there are several examples of how to this manually, but for comparison I'll show how it can be done using the pandas library which makes working with tabular data much simpler.
Since you didn't provide example output, I'm assuming that by "with the lowest evalue and the highest bitscore for each sseqid" you mean "the highest bitscore among the lowest evalues" for a given sseqid; if you want those separately, that's trivial too.
import pandas as pd
df = pd.read_csv("acla1.dat", sep="\t")
df = df.sort(["evalue", "bitscore"],ascending=[True, False])
df_new = df.groupby("sseqid", as_index=False).first()
which produces
>>> df_new
sseqid qseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
0 ACT ACLA_095800 91.73 375 31 0 1 375 1 375 0.000000e+00 732.0
1 CALM ACLA_096170 29.33 150 96 4 34 179 2 145 1.000000e-13 55.1
2 EF1A ACLA_028450 85.55 443 63 1 1 443 1 442 0.000000e+00 801.0
3 RPB2 ACLA_065630 65.79 1257 386 14 1 1252 4 1221 0.000000e+00 1691.0
4 TBB ACLA_024600 80.00 435 87 0 1 435 1 435 0.000000e+00 729.0
Basically, first we read the data file into an object called a DataFrame, which is kind of like an Excel worksheet. Then we sort by evalue ascending (so that lower evalues come first) and by bitscore descending (so that higher bitscores come first). Then we can use groupby to collect the data in groups of equal sseqid, and take the first one in each group, which because of the sorting will be the one we want.
#!usr/bin/python
import csv
DATA = "data.txt"
class Sequence:
def __init__(self, row):
self.qseqid = row[0]
self.sseqid = row[1]
self.pident = float(row[2])
self.length = int(row[3])
self.mismatch = int(row[4])
self.gapopen = int(row[5])
self.qstart = int(row[6])
self.qend = int(row[7])
self.sstart = int(row[8])
self.send = int(row[9])
self.evalue = float(row[10])
self.bitscore = float(row[11])
def __str__(self):
return (
"{qseqid}\t"
"{sseqid}\t"
"{pident}\t"
"{length}\t"
"{mismatch}\t"
"{gapopen}\t"
"{qstart}\t"
"{qend}\t"
"{sstart}\t"
"{send}\t"
"{evalue}\t"
"{bitscore}"
).format(**self.__dict__)
def entries(fname, header_rows=1, dtype=list, **kwargs):
with open(fname) as inf:
incsv = csv.reader(inf, **kwargs)
# skip header rows
for i in range(header_rows):
next(incsv)
for row in incsv:
yield dtype(row)
def main():
bestseq = {}
for seq in entries(DATA, dtype=Sequence, delimiter="\t"):
# see if a sequence with the same sseqid already exists
prev = bestseq.get(seq.sseqid, None)
if (
prev is None
or seq.evalue < prev.evalue
or (seq.evalue == prev.evalue and seq.bitscore > prev.bitscore)
):
bestseq[seq.sseqid] = seq
# display selected sequences
keys = sorted(bestseq)
for key in keys:
print(bestseq[key])
if __name__ == "__main__":
main()
which results in
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0.0 732.0
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1e-13 55.1
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0.0 801.0
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0.0 1691.0
ACLA_024600 TBB 80.0 435 87 0 1 435 1 435 0.0 729.0
While not nearly as elegant and concise as using thepandaslibrary, it's quite possible to do what you want without resorting to third-party modules. The following uses thecollections.defaultdictclass to facilitate creation of dictionaries of variable-length lists of records. The use of theAttrDictclass is optional, but it makes accessing the fields of each dictionary-based records easier and is less awkward-looking than the usualdict['fieldname']syntax otherwise required.
import csv
from collections import defaultdict, namedtuple
from itertools import imap
from operator import itemgetter
data_file_name = 'data.txt'
DELIMITER = '\t'
ssqeid_dict = defaultdict(list)
# from http://stackoverflow.com/a/1144405/355230
def multikeysort(items, columns):
comparers = [((itemgetter(col[1:].strip()), -1) if col.startswith('-') else
(itemgetter(col.strip()), 1)) for col in columns]
def comparer(left, right):
for fn, mult in comparers:
result = cmp(fn(left), fn(right))
if result:
return mult * result
else:
return 0
return sorted(items, cmp=comparer)
# from http://stackoverflow.com/a/15109345/355230
class AttrDict(dict):
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
self.__dict__ = self
with open(data_file_name, 'rb') as data_file:
reader = csv.DictReader(data_file, delimiter=DELIMITER)
format_spec = '\t'.join([('{%s}' % field) for field in reader.fieldnames])
for rec in (AttrDict(r) for r in reader):
# Convert the two sort fields to numeric values for proper ordering.
rec.evalue, rec.bitscore = map(float, (rec.evalue, rec.bitscore))
ssqeid_dict[rec.sseqid].append(rec)
for ssqeid in sorted(ssqeid_dict):
# Sort each group of recs with same ssqeid. The first record after sorting
# will be the one sought that has the lowest evalue and highest bitscore.
selected = multikeysort(ssqeid_dict[ssqeid], ['evalue', '-bitscore'])[0]
print format_spec.format(**selected)
Output (»represents tabs):
ACLA_095800» ACT» 91.73» 375» 31» 0» 1» 375» 1» 375» 0.0» 732.0
ACLA_096170» CALM» 29.33» 150» 96» 4» 34» 179» 2» 145» 1e-13» 55.1
ACLA_028450» EF1A» 85.55» 443» 63» 1» 1» 443» 1» 442» 0.0» 801.0
ACLA_065630» RPB2» 65.79» 1257» 386» 14» 1» 1252» 4» 1221» 0.0» 1691.0
ACLA_024600» TBB» 80» 435» 87» 0» 1» 435» 1» 435» 0.0» 729.0
filename = 'data.txt'
readfile = open(filename,'r')
d = dict()
sseqid=[]
lines=[]
for i in readfile.readlines():
sseqid.append(i.rsplit()[1])
lines.append(i.rsplit())
sorted_sseqid = sorted(set(sseqid))
sdqDict={}
key =None
for sorted_ssqd in sorted_sseqid:
key=sorted_ssqd
evalue=[]
bitscore=[]
qseid=[]
for line in lines:
if key in line:
evalue.append(line[10])
bitscore.append(line[11])
qseid.append(line[0])
sdqDict[key]=[qseid,evalue,bitscore]
print sdqDict
print 'TBB LOWEST EVALUE' + '---->' + min(sdqDict['TBB'][1])
##I think you can do the list manipulation below to find out the qseqid
readfile.close()

Categories