counting the number of common sets in two files

counting the number of common sets in two files - python

I would like to get your help in writing a script to count number of common set of numbers in two files. My files have format as shown below,
File 1
0: 152 145 148
1: 251 280 428
2: 42 281 407
3: 289 292 331
4: 309 212 226
5: 339 336 376
6: 339 376 380
7: 41 406 205
8: 237 418 193
File 2
0: 251 280 428
1: 309 212 226
2: 339 336 376
3: 339 376 380
4: 420 414 199
5: 418 193 237
6: 203 195 200
7: 287 161 257
8: 287 257 158
9: 263 369 15
10: 285 323 327
First column is just the serial numbers and should be ignored while checking the match between two files and set of same numbers with different ordering should be counted as common one (for e.g 237 418 193 = 418 193 237)
In this case, expected outcome will be.....
5 # no. of common sets
251 280 428
309 212 226
339 336 376
339 376 380
237 418 193
I have tried with an awk script -
awk 'FNR==NR{a[$3]=$0;next}{if(b=a[$3]){print b}}' file1 file2
Unfortunately, the set "237 418 193" didn't count since it have different ordering in the second file (418 193 237).
Can any help me to do this task with a awk or Python script.
Any help is appreciated?

Parse the file, creating a set of lines, each element sorted lexicographically.
file1_sets = {tuple(sorted(line.split()[1:])) for line in open(file1, 'r')}
file2_sets = {tuple(sorted(line.split()[1:])) for line in open(file2, 'r')}
Then see how many of one exist in the other
count = sum([f in file2_sets for f in file1_sets])
(Edited per comments)

Try this python code:
data1, data2 = [], []
for fname, data in [('file1.txt', data1), ('file2.txt', data2)]:
for line in open(fname):
data.append(set(line.strip().split()[1:]))
common = [s for s in data1 if s in data2]
for c in common:
print c
print len(common)
Output:
set(['280', '251', '428'])
set(['309', '212', '226'])
set(['336', '339', '376'])
set(['380', '339', '376'])
set(['237', '418', '193'])
5

Using sets and .intersection:
with open("21132195_1.txt") as fh1, open("21132195_2.txt") as fh2:
number_sets1 = set(frozenset(line.split()[1:]) for line in fh1)
number_sets2 = set(frozenset(line.split()[1:]) for line in fh2)
common_number_sets = number_sets1.intersection(number_sets2)
print "%i # no. of common sets" % len(common_number_sets)
print "\n".join([" ".join(s) for s in common_number_sets])
Will give as output:
5 # no. of common sets
339 376 380
251 280 428
212 226 309
193 237 418
336 339 376

In Gnu Awk version 4.1, you can use PROCINFO["sorted_in"] like:
gawk -f c.awk file1 file2
where c.awk is:
BEGIN {PROCINFO["sorted_in"]="#val_num_asc"}
NR==FNR {
b[getRow()]++
next
}
{
c=getRow()
if (c in b)
a[++j]=c
}
END{
print j" # no. of common sets"
for (i in a)
print a[i]
}
function getRow(b,d,i) {
for (i=2; i<=NF; i++)
d[i]=$i
for (i in d)
b=(b=="")?d[i]:(b" "d[i])
return b
}
Output:
5 # no. of common sets
193 237 418
212 226 309
251 280 428
336 339 376
339 376 380

Another way using shell.
cat file1 file2 |while read x line
do
arr2=($(for val in $line;do echo "$val";done|sort -n))
echo "${arr2[#]}"
done|awk '++a[$1,$2,$3]>1'
251 280 428
212 226 309
336 339 376
339 376 380
193 237 418

Related

Finding Common Elements (Amazon SDE-1)

Given two lists V1 and V2 of sizes n and m respectively. Return the list of elements common to both the lists and return the list in sorted order. Duplicates may be there in the output list.
Link to the problem : LINK
Example:
Input:
5
3 4 2 2 4
4
3 2 2 7
Output:
2 2 3
Explanation:
The first list is {3 4 2 2 4}, and the second list is {3 2 2 7}.
The common elements in sorted order are {2 2 3}
Expected Time complexity : O(N)
My code:
class Solution:
def common_element(self,v1,v2):
dict1 = {}
ans = []
for num1 in v1:
dict1[num1] = 0
for num2 in v2:
if num2 in dict1:
ans.append(num2)
return sorted(ans)
Problem with my code:
So the accessing time in a dictionary is constant and hence my time complexity was reduced but one of the hidden test cases is failing and my logic is very simple and straight forward and everything seems to be on point. What's your take? Is the logic wrong or the question desc is missing some vital details?
New Approach
Now I am generating two hashmaps/dictionaries for the two arrays. If a num is present in another array, we check the min frequency and then appending that num into the ans that many times.
class Solution:
def common_element(self,arr1,arr2):
dict1 = {}
dict2 = {}
ans = []
for num1 in arr1:
dict1[num1] = 0
for num1 in arr1:
dict1[num1] += 1
for num2 in arr2:
dict2[num2] = 0
for num2 in arr2:
dict2[num2] += 1
for number in dict1:
if number in dict2:
minFreq = min(dict1[number],dict2[number])
for _ in range(minFreq):
ans.append(number)
return sorted(ans)
The code is outputting nothing for this test case
Input:
64920
83454 38720 96164 26694 34159 26694 51732 64378 41604 13682 82725 82237 41850 26501 29460 57055 10851 58745 22405 37332 68806 65956 24444 97310 72883 33190 88996 42918 56060 73526 33825 8241 37300 46719 45367 1116 79566 75831 14760 95648 49875 66341 39691 56110 83764 67379 83210 31115 10030 90456 33607 62065 41831 65110 34633 81943 45048 92837 54415 29171 63497 10714 37685 68717 58156 51743 64900 85997 24597 73904 10421 41880 41826 40845 31548 14259 11134 16392 58525 3128 85059 29188 13812.................
Its Correct output is:
4 6 9 14 17 19 21 26 28 32 33 42 45 54 61 64 67 72 77 86 93 108 113 115 115 124 129 133 135 137 138 141 142 144 148 151 154 160 167 173 174 192 193 195 198 202 205 209 215 219 220 221 231 231 233 235 236 238 239 241 245 246 246 247 254 255 257 262 277 283 286 290 294 298 305 305 307 309 311 312 316 319 321 323 325 325 326 329 329 335 338 340 341 350 353 355 358 364 367 369 378 385 387 391 401 404 405 406 406 410 413 416 417 421 434 435 443 449 452 455 456 459 460 460 466 467 469 473 482 496 503 .................
And Your Code's output is:

Please find the below solution
def sorted_common_elemen(v1, v2):
res = []
for elem in v2:
res.append(elem)
v1.pop(0)
return sorted(res)

Your code ignores the number of times a given element occurs in the list. I think this is a good way to fix that:
class Solution:
def common_element(self, l0, l1):
li = []
for i in l0:
if i in l1:
l1.remove(i)
li.append(i)
return sorted(li)

How can I read a file having different column for each rows?

my data looks like this.
0 199 1028 251 1449 847 1483 1314 23 1066 604 398 225 552 1512 1598
1 1214 910 631 422 503 183 887 342 794 590 392 874 1223 314 276 1411
2 1199 700 1717 450 1043 540 552 101 359 219 64 781 953
10 1707 1019 463 827 675 874 470 943 667 237 1440 892 677 631 425
How can I read this file structure in python? I want to extract a specific column from rows. For example, If I want to extract value in the second row, second column, how can I do that? I've tried 'loadtxt' using data type string. But it requires string index slicing, so that I could not proceed because each column has different digits. Moreover, each row has a different number of columns. Can you guys help me?
Thanks in advance.

Use something like this to split it
split2=[]
split1=txt.split("\n")
for item in split1:
split2.append(item.split(" "))

I have stored given data in "data.txt". Try below code once.
res=[]
arr=[]
lines = open('data.txt').read().split("\n")
for i in lines:
arr=i.split(" ")
res.append(arr)
for i in range(len(res)):
for j in range(len(res[i])):
print(res[i][j],sep=' ', end=' ', flush=True)
print()

Python - Print updating counter not working

This is my code. The goal is to print a counter that updates the number of the page that's being checked within the same lane, replacing the old one:
import time
start_page = 500
stop_page = 400
print 'Checking page ',
for n in range(start_page,stop_page,-1):
print str(n),
time.sleep(5) # This to simulate the execution of my code
print '\r',
This doesn't print anything:
$ python test.py
$
I'm using Python 2.7.10, the line that causes problems is probably this print '\r', because if I run this:
import time
start_page = 500
stop_page = 400
print 'Checking page ',
for n in range(start_page,stop_page,-1):
print str(n),
time.sleep(5) # This to simulate the execution of my code
#print '\r',
I have this output:
$ python test.py
Checking page 500 499 498 497 496 495 494 493 492 491 490 489 488 487 486 485 484 483 482 481 480 479 478 477 476 475 474 473 472 471 470 469 468 467 466 465 464 463 462 461 460 459 458 457 456 455 454 453 452 451 450 449 448 447 446 445 444 443 442 441 440 439 438 437 436 435 434 433 432 431 430 429 428 427 426 425 424 423 422 421 420 419 418 417 416 415 414 413 412 411 410 409 408 407 406 405 404 403 402 401
$

Remove the comas after the print expressions:
print 'Checking page ',
print str(n),
print '\r',
PS: Since I got asked, the first thing to notice is that print is not a function it is a statement, hence it is not interpreted in the same way.
In the print case in particular, adding a ',' after the print will make it print the content without a newline.
In the case of your program, in particular, what it was doing is:
printing 'Checking page' -> NO \n here
printing n -> no \n here
printing '\r' -> again no '\n' here
Since you were not sending any new lines to the output, your OS didn't flush the data. You can add a sys.stdout.flush() after the print('\r') and see it changing if you want.
More on the print statement here.
https://docs.python.org/2/reference/simple_stmts.html#grammar-token-print_stmt
Why the hell I got downvoted? o.O

Output formatting based on user parameters

250 255 260 265 270 275 280 285 290 295 300 305 310 315 320
325 330 335 340 345 350 355 360 365 370 375 380 385 390 395
400 405 410 415 420 425 430 435 440 445 450 455 460 465 470
475 480 485 490 495 500 505 510 515 520 525 530 535 540 545
550 555 560 565 570 575 580 585 590 595 600 605 610 615 620
625 630 635 640 645 650 655 660 665 670 675 680 685 690 695
700 705 710 715 720 725 730 735 740 745 750
I am trying to create a program where you enter in a lower bound (250 in this case), and upper bound(750 in this case, a skip number (5 in this case) and also tell Python how many numbers I want per line (15 in this case) before breaking the line, can somehow help me go about with formatting this? I have Python 3.1.7. Thanks!
Basically how do I tell python to create 15 numbers per line before breaking, thanks.

Use range to create a list of your numbers from 250 to 750 in steps of 5. Loop over this using enumerate which gives the value and a counter. Check modulo 15 of the counter to determine if a line break needs to be included in the print. The counter starts at zero so look for cnt%15==14.
import sys
for cnt,v in enumerate(range(250,755,5)):
if cnt%15==14:
sys.stdout.write("%d\n" % v)
else:
sys.stdout.write("%d " % v)

#ljk07
xrange is better - does not create whole list at once
You do not really need sys.stdout - just add comma
New style formatting is better
how about a little more compact and more contemporary
for cnt,v in enumerate(range(250,755,5),1):
print '{}{}'.format(v,' ' if cnt % 15 else '\n'),
EDIT:
Missed python 3 part - print even better here
for cnt,v in enumerate(range(250,755,5),1):
print( '{:3}'.format(v),end=' ' if cnt % 15 else '\n')

Another option to through in the mix is to use the itertools.grouper recipe:
def grouper(n, iterable, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
from itertools import izip_longest
# grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
args = [iter(iterable)] * n
return izip_longest(fillvalue=fillvalue, *args)
Then the loop becomes:
for line in grouper(15, range(250, 755, 5), ''):
print(' '.join(map(str, line)))

You should keep a counter and check on the modulo of the counter if you have to print a new line.
Could look something like this:
if counter % 15 == 0:
sys.stdout.write(str("\n"))

how to print only lines which contain a substring

i have a file with strings:
REALSTEP 12342 {2012-7-20 15:10:39};[{416 369 338};{423 432 349};{383 380 357};{399 401 242};{0 454 285};{457 433 115};{419 455 314};{495 534 498};][{428 377 336};{433 456 345};{386 380 363};{384 411
REALSTEP 7191 {2012-7-20 15:10:41};[{416 370 361};{406 417 376};{377 368 359};{431 387 251};{0 461 366};{438 409 134};{429 411 349};{424 505 364};][{423 372 353};{420 433 374};{379 365 356};{431 387 2
REALSTEP 12123 {2012-7-20 15:10:42};[{375 382 329};{386 402 347};{374 378 357};{382 384 259};{0 397 357};{442 424 188};{398 384 356};{392 420 355};][{404 405 359};{420 432 372};{405 408 383};{413 407
REALSTEP 27237 {2012-7-20 15:10:44};[{431 375 329};{416 453 334};{387 382 349};{397 403 248};{0 451 300};{453 422 131};{433 401 317};{434 505 326};][{443 384 328};{427 467 336};{391 386 344};{394 413
FAKE 32290 {2012-7-20 15:10:48};[{424 399 364};{408 446 366};{397 394 389};{415 409 261};{0 430 374};{445 428 162};{432 416 375};{431 473 380};][{424 398 376};{412 436 372};{401 400 390};{417 409 261}
FAKE 32296 {2012-7-20 15:10:53};[{409 326 394};{445 425 353};{401 402 357};{390 424 250};{0 420 353};{447 423 143};{404 436 351};{421 527 420};][{410 332 400};{450 429 356};{402 403 356};{391 425 250}
FAKE 32296 {2012-7-20 15:10:59};[{381 312 387};{413 405 328};{320 387 376};{388 387 262};{0 402 326};{417 418 177};{407 409 335};{443 502 413};][{412 336 417};{446 437 353};{343 417 403};{417 418 258}
FAKE 32295 {2012-7-20 15:11:4};[{377 314 392};{416 403 329};{322 388 375};{385 391 261};{0 403 329};{425 420 168};{414 393 330};{458 502 397};][{408 338 421};{449 435 355};{345 418 403};{413 420 257};
FAKE 32295 {2012-7-20 15:11:9};[{371 318 411};{422 385 333};{342 379 352};{394 395 258};{0 440 338};{418 414 158};{420 445 346};{442 516 439};][{401 342 441};{456 415 358};{367 407 377};{420 420 255};
FAKE 32296 {2012-7-20 15:11:15};[{373 319 412};{423 386 333};{344 384 358};{402 402 257};{0 447 342};{423 416 151};{422 450 348};{447 520 442};][{403 342 442};{456 416 358};{366 409 379};{422 421 255}
REALSTEP 7191 {2012-7-20 15:10:41};[{416 370 361};{406 417 376};{377 368 359};{431 387 251};{0 461 366};{438 409 134};{429 411 349};{424 505 364};][{423 372 353};{420 433 374};{379 365 356};{431 387 2
REALSTEP 12123 {2012-7-20 15:10:42};[{375 382 329};{386 402 347};{374 378 357};{382 384 259};{0 397 357};{442 424 188};{398 384 356};{392 420 355};][{404 405 359};{420 432 372};{405 408 383};{413 407
REALSTEP 27237 {2012-7-20 15:10:44};[{431 375 329};{416 453 334};{387 382 349};{397 403 248};{0 451 300};{453 422 131};{433 401 317};{434 505 326};][{443 384 328};{427 467 336};{391 386 344};{394 413
I read the file with readlines() and want to then loop over the lines and print only when there is a consecutive block of lines larger than 3, containing the string "REALSTEP". So in the example the expected result is:
REALSTEP 12342 {2012-7-20 15:10:39};[{416 369 338};{423 432 349};{383 380 357};{399 401 242};{0 454 285};{457 433 115};{419 455 314};{495 534 498};][{428 377 336};{433 456 345};{386 380 363};{384 411
REALSTEP 7191 {2012-7-20 15:10:41};[{416 370 361};{406 417 376};{377 368 359};{431 387 251};{0 461 366};{438 409 134};{429 411 349};{424 505 364};][{423 372 353};{420 433 374};{379 365 356};{431 387 2
REALSTEP 12123 {2012-7-20 15:10:42};[{375 382 329};{386 402 347};{374 378 357};{382 384 259};{0 397 357};{442 424 188};{398 384 356};{392 420 355};][{404 405 359};{420 432 372};{405 408 383};{413 407
REALSTEP 27237 {2012-7-20 15:10:44};[{431 375 329};{416 453 334};{387 382 349};{397 403 248};{0 451 300};{453 422 131};{433 401 317};{434 505 326};][{443 384 328};{427 467 336};{391 386 344};{394 413
I tried this:
lines = f.readlines()
idx = -1
#loop trough all lines in the file
for i, line in enumerate(lines):
if idx > i:
continue
else:
if "REALSTEP" in line:
steps = lines[i:i+3]
#check for blokc of realsteps
if is_block(steps, "REALSTEP") == 3:
#prints the block up to the first next "FAKE STEP"
lst = get_block(lines[i:-1])
for l in lst:
print l[:200]
idx = i + len(lst)
print "next block============"
where the function is_block is this:
def is_block(lines, check):
#print len(lines)
bool = 1
for l in lines:
if check in l:
bool = 1
else:
bool = 0
bool = bool + bool
return bool
and the function get_block:
def get_block(lines):
lst = []
for l in lines:
if "REALSTEP" in l:
#print l[:200]
lst.append(l)
else:
break
return lst
While this works, it prints all lines containing the string "REALSTEPS". The print len(lines) in is_block(lines) is always 10 when the function is called so that is not it.
I am confused, please help me out here!

Here's a simple solution containing the logic you need:
to_print = []
count = 0
started = False
for line in f.readlines():
if "REALSTEP" in line:
if not started:
started = True
to_print.append(line)
count += 1
else:
if count > 3: print('\n'.join(to_print))
started = False
count = 0
to_print = []
It counts any line that has the string "REALSTEP" in it as valid. Produces the desired output.

This part:
...
if "REALSTEP" in line:
steps = lines[i:i+3]
for s in steps:
print s[:200] # <- right here
...
Whenever you find "REALSTEP" in a line, you retrieve the following three lines and print them right away. That's probably not what you wanted.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

counting the number of common sets in two files - python

Another way using shell. cat file1 file2 |while read x line do arr2=($(for val in $line;do echo "$val";done|sort -n)) echo "${arr2[#]}" done|awk '++a[$1,$2,$3]>1' 251 280 428 212 226 309 336 339 376 339 376 380 193 237 418

Related

Finding Common Elements (Amazon SDE-1)

How can I read a file having different column for each rows?

Python - Print updating counter not working

Output formatting based on user parameters

how to print only lines which contain a substring

Categories

Resources