A very fast way to find repeating combinations in Python using pandas? - python
I have this "DrawsDB.csv" sample file as input:
Day,Hour,N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14,N15,N16,N17,N18,N19,N20
1996-03-18,15:00,4,9,10,16,21,22,23,26,27,34,35,41,42,48,62,66,68,73,76,78
1996-03-19,15:00,6,12,15,19,28,33,35,39,44,48,49,59,62,63,64,67,69,71,75,77
1996-03-21,15:00,2,4,6,7,15,16,17,19,20,26,28,45,48,52,54,69,72,73,75,77
1996-03-22,15:00,3,8,15,17,19,25,30,33,34,35,36,38,44,49,60,61,64,67,68,75
1996-03-25,15:00,2,10,11,14,18,22,26,27,29,30,42,44,45,55,60,61,66,67,75,79
2022-01-01,15:00,1,9,12,17,33,34,36,37,38,44,45,46,53,56,58,60,62,63,70,72
2022-01-01,22:50,1,3,4,14,19,22,24,27,32,33,35,36,44,48,53,55,69,70,76,78
2022-01-02,15:00,13,15,16,19,22,24,31,37,38,43,47,58,64,66,70,72,73,75,76,78
2022-01-02,22:50,5,10,11,14,16,28,29,36,41,53,54,56,58,59,61,67,68,71,73,77
2022-01-03,15:00,8,9,10,11,15,20,21,22,26,30,35,36,39,42,52,58,63,64,73,80
2022-01-03,22:50,4,9,17,21,22,32,33,34,36,37,38,41,48,49,50,60,64,69,70,75
2022-01-04,15:00,4,5,7,9,11,16,17,21,22,25,30,37,38,39,44,49,52,60,65,78
2022-01-04,22:50,17,18,22,26,27,30,31,40,43,49,55,62,63,64,65,71,72,73,76,80
2022-01-05,15:00,1,5,8,14,15,20,23,25,26,33,34,35,37,47,54,59,67,70,72,76
2022-01-05,22:50,6,7,14,15,16,18,26,37,39,41,45,51,52,54,55,59,61,70,71,80
2022-01-06,15:00,9,10,11,17,28,30,32,41,42,44,45,49,50,51,55,65,67,72,76,78
2022-01-06,22:50,1,2,6,9,11,15,21,26,31,37,40,43,47,51,52,54,67,68,73,75
This is just a sample. The real CSV file has more than 50,000 rows in total.
The columns N1 to N20 contain random values that never repeat within a row, so there are no duplicates. They are sorted from the smallest (N1) to the largest (N20).
I want to get repeating combos (say, of 5 numbers) across all rows of the DataFrame, using columns N1 to N20.
So, for the entire .csv file posted above the output should be:
(6, 15, 26, 52, 54) 3
(17, 33, 34, 36, 38) 3
(17, 33, 34, 36, 60) 3
(17, 33, 34, 38, 60) 3
(17, 33, 36, 38, 60) 3
(17, 34, 36, 38, 60) 3
(33, 34, 36, 38, 60) 3
...
This is the full output, which I'm not posting here because of text size limitations:
https://pastebin.com/4EVXXSn1
Please check it out.
Sorry for such a long output; I tried to create a shorter one but didn't manage to get representative combos for it.
This is the Python code I wrote to accomplish what I need: (please read its commented lines too)
import pandas as pd
from itertools import combinations
from collections import Counter
df = pd.read_csv("DrawsDB.csv")
# looping through db using method found here:
# https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
df = df.reset_index() # make sure indexes pair with number of rows
draws = []
# please read this: https://stackoverflow.com/a/55557758/7710871 (Conclusion:iter is very slow)
for index, row in df.iterrows():
    draws.append(
        [row['N1'], row['N2'], row['N3'], row['N4'], row['N5'], row['N6'], row['N7'], row['N8'], row['N9'], row['N10'],
         row['N11'], row['N12'], row['N13'], row['N14'], row['N15'], row['N16'], row['N17'], row['N18'], row['N19'],
         row['N20']])
# comparing to each other in order to check for repeating combos:
repeating_combos = []
for i in range(len(draws)):
    for j in draws[i + 1:]:
        repeating_combos.append(sorted(list(set(draws[i]).intersection(j))))
# e.g. getting any repeating combo of 5 across all rows:
combos_of_5 = []
for each in repeating_combos:
    if len(each) == 5:
        combos_of_5.append(tuple(each))
        # print(each)
    elif len(each) > 5:
        # e.g. a repeating sequence of 6 numbers means in fact 6 combos taken by 5 numbers in this case.
        # e.g. a repeating sequence of 7 numbers means in fact 21 combos of 5 numbers and so on.
        # Combinations(k, n)
        for cmb in combinations(each, 5):
            combos_of_5.append(tuple(sorted(list(set(cmb)))))
# count how many times each combo appear:
x = Counter(combos_of_5)
sorted_x = dict(sorted(x.items(), key=lambda item: item[1], reverse=True))
for k, v in sorted_x.items():
    print(k, v)
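The counts quoted in the code comments above (6 combos from a 6-number overlap, 21 from a 7-number overlap) follow from the binomial coefficient C(n, 5). A minimal check using only the standard library; the helper name `count_sub_combos` is made up for this illustration and is not part of the original code:

```python
from itertools import combinations
from math import comb


def count_sub_combos(overlap, k=5):
    """Count the k-number combos contained in one overlap (hypothetical helper)."""
    return sum(1 for _ in combinations(overlap, k))


six = [17, 33, 34, 36, 38, 60]   # a 6-number overlap taken from the sample output
seven = [1, 2, 3, 4, 5, 6, 7]    # an arbitrary 7-number overlap

# Enumerating the combos gives the same number as the closed-form C(n, k).
assert count_sub_combos(six) == comb(6, 5) == 6
assert count_sub_combos(seven) == comb(7, 5) == 21
```

This also hints at why larger overlaps get expensive: the number of sub-combos grows quickly with the size of the overlap.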
It works very well, as expected, but there is one problem: for a bigger DataFrame it takes a long time to finish. Worse, if you want repeating combinations of more than 5 numbers (say 6, 7, 8 or 9), it takes forever to run.
How can I do this in pure pandas, in a much faster and smarter way than I did?
Also, please note that the code does not generate every possible combo first and then look for each of those combos in the DataFrame, because that would take even longer.
Thank you very much in advance!
P.S. What if the numbers from N1 to N20 were not sorted? Would that make any difference?
I have already read this topic and many others, but none asks for the same thing, so I think this is not a duplicate, and it could help many others who have the same or a very similar problem.
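On the P.S. about unsorted numbers: as far as the pairwise comparison in the code goes, the order within a row should not matter, because the comparison goes through `set` and the result is sorted before it is counted. A small sketch with truncated draws from the sample data (the `overlap` helper is illustrative only, not part of the original code):

```python
import random

# First ten numbers of two sample draws (truncated for brevity).
draw_a = [4, 9, 10, 16, 21, 22, 23, 26, 27, 34]
draw_b = [2, 4, 6, 7, 15, 16, 17, 19, 20, 26]


def overlap(a, b):
    """Return the numbers shared by two draws as a sorted tuple."""
    return tuple(sorted(set(a) & set(b)))


# Shuffle copies of both draws; the overlap is unchanged.
shuffled_a = random.sample(draw_a, len(draw_a))
shuffled_b = random.sample(draw_b, len(draw_b))
same = overlap(draw_a, draw_b) == overlap(shuffled_a, shuffled_b)
print(overlap(draw_a, draw_b), same)
```

Sorting the overlap before turning it into a tuple is what keeps the `Counter` keys canonical, so the same combo is never counted under two different orderings.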
Proof of work:
Given this part of your dataframe:
index  Day         Hour   N1  N2  N3  N4  N5  N6  N7  N8  N9  N10  N11  N12  N13  N14  N15  N16  N17  N18
0      1996-03-18  15:00   4   9  10  16  21  22  23  26  27   34   35   41   42   48   62   66   68   73
1      1996-03-19  15:00   6  12  15  19  28  33  35  39  44   48   49   59   62   63   64   67   69   71
2      1996-03-21  15:00   2   4   6   7  15  16  17  19  20   26   28   45   48   52   54   69   72   73
3      1996-03-22  15:00   3   8  15  17  19  25  30  33  34   35   36   38   44   49   60   61   64   67
You can update your code with something similar to the one below:
check = [6,15]
df['check'] = df.iloc[:,2:].apply(lambda r: all(s in r.values for s in check), axis=1)
true_count = df.check.sum()
print(f'The following numbers {check} appear {true_count} time(s) in the dataframe.')
Result:
The following numbers [6, 15] appear 2 time(s) in the dataframe.
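The same check can be sketched without pandas, which may make the logic easier to see. This is only an illustration under the assumption that each row is available as a plain list of numbers; `count_rows_containing` is a hypothetical helper name, and the rows are the four partial rows shown above:

```python
def count_rows_containing(rows, combo):
    """Count how many rows contain every number in combo."""
    wanted = set(combo)
    # issubset accepts any iterable, so each row can stay a plain list.
    return sum(1 for row in rows if wanted.issubset(row))


rows = [
    [4, 9, 10, 16, 21, 22, 23, 26, 27, 34, 35, 41, 42, 48, 62, 66, 68, 73],
    [6, 12, 15, 19, 28, 33, 35, 39, 44, 48, 49, 59, 62, 63, 64, 67, 69, 71],
    [2, 4, 6, 7, 15, 16, 17, 19, 20, 26, 28, 45, 48, 52, 54, 69, 72, 73],
    [3, 8, 15, 17, 19, 25, 30, 33, 34, 35, 36, 38, 44, 49, 60, 61, 64, 67],
]

print(count_rows_containing(rows, [6, 15]))  # 2, matching the result above
```

Like the `apply` version, this tests one given combination at a time; it does not by itself discover which combinations repeat.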
Related
Python: looping back in a range when stop exceeds a value?
I have a range like the one below. What I am trying to do is loop back to 0 if the range stop is greater than a certain value (96 in this example). I can simply loop through the range as I did below, but is there a better way to do this with Python's range?
my_range = range(90, 100)
tmp_list = []
for i in range(90, 100):
    if i >= 96:
        tmp_list.append(i - 96)
    else:
        tmp_list.append(i)
print(tmp_list)
[90, 91, 92, 93, 94, 95, 0, 1, 2, 3]
Check out itertools.cycle:
from itertools import cycle

def clipped_cycle(start, end):
    c = cycle(range(0, end))
    # Discard till start
    for _ in range(start):
        next(c)
    return c

c = clipped_cycle(90, 96)
for i in c:
    print(i)
What you get is an infinite output stream that cycles along:
90 91 92 93 94 95 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 . . .
To get a limited number of outputs from a fresh cycle:
n = 7
for _ in range(n):
    print(next(c))
gives
90 91 92 93 94 95 0
First, I did not understand why you defined my_range = range(90, 100) if you are never going to use it. You can use the modulo operator in these cases. Try this, short and effective:
xlist = [i % 96 for i in range(90, 100)]
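The modulo idea can be wrapped in a small function so the wrap-around value is a parameter rather than a literal. A minimal sketch; the name `wrapped_range` is made up for this example:

```python
def wrapped_range(start, stop, modulus):
    """Return start..stop-1 with every value reduced modulo `modulus`."""
    return [i % modulus for i in range(start, stop)]


result = wrapped_range(90, 100, 96)
print(result)  # [90, 91, 92, 93, 94, 95, 0, 1, 2, 3]
```

Unlike the subtraction in the question, the modulo also keeps working if the range runs past the wrap-around value more than once.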
Extract data from line and add to specific numbered boxes
Sorry, but I need your help if possible. I am a complete beginner in Python and I am completely stuck. I would like to know what the first steps to solve my problem would be. I have many pages of lines with the following structure:
pg10_65 * 3.2200 * 22 24 28 30 33 34 36 37
pg10_116 * 3.2420 * 24 28 30 33 34 37
pg10_118 * 3.1500 * 19 24 28 30 33 34 36
pg10_120 * 3.1230 * 24 28 30 33 34 36 37
pg74_32 * 3.0350 * 17 28 30 33 34 36 37 38
On each line, between the * symbols, there is a value (a digit, a dot and four decimals), and after the last * symbol there is a series of numbers from 1 to 68, but not all of them.
I have 68 boxes. In this example, for the first line, I want to add 3.2200 to boxes 22, 24, ..., 36, 37. If a box holds 0, add 3.2200 to 0; if it already holds another value, add 3.2200 to that value.
For the second line, I want to add 3.2420 to boxes 24, 28, ..., 34, 37, again adding to whatever each box already holds. And so on for each of the lines.
In the end I would have 68 boxes, each holding the sum of all the values that belong to it. I am completely stuck on this. Thanks a lot to everyone for your advice. José
I came up with a solution to your problem.
def add_to_boxes(input_string, all_boxes):
    '''
    Takes a list of boxes and your string input and does the additions you requested
    '''
    _, float_value, box_list = input_string.split('*')
    float_value = float(float_value)  # convert from string to float
    box_list = [int(s) for s in box_list.split(' ')[1:]]  # convert from string to list of integers
    print("Adding ", float(float_value), " to ", box_list)
    for box in box_list:
        all_boxes[box-1] += float_value  # boxes are numbered from 0 to 67

boxes = [0 for i in range(68)]  # 68 boxes with the value 0
input_str = "pg10_65 * 3.2200 * 22 24 28 30 33 34 36 37"
add_to_boxes(input_str, boxes)
print(boxes)
Please let me know if you have any questions regarding how it works.
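To run the same idea over several lines, the function can be called in a loop. The sketch below is a variation under stated assumptions, not the answer's exact code: the helper name `add_line_to_boxes` is made up, the lines are held in a list rather than read from a file, and `split()` without arguments is used so repeated spaces do not matter:

```python
def add_line_to_boxes(line, boxes):
    """Add the value between the '*' symbols to every box listed after the last '*'."""
    _, value_text, box_text = line.split('*')
    value = float(value_text)
    for box in box_text.split():        # split() tolerates repeated spaces
        boxes[int(box) - 1] += value    # box 1 lives at index 0


lines = [
    "pg10_65 * 3.2200 * 22 24 28 30 33 34 36 37",
    "pg10_116 * 3.2420 * 24 28 30 33 34 37",
]
boxes = [0.0] * 68  # 68 boxes, all starting at zero
for line in lines:
    add_line_to_boxes(line, boxes)

print(round(boxes[23], 4))  # box 24 received both values
```

With real input, the list of lines would typically be replaced by iterating over an open file object.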
How to merge two columns of a dataframe based on values from a column in another dataframe?
I have a dataframe called df_location:
import pandas as pd
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value': [60,61,62,63,64,65,66,67,68,69]}
df_location = pd.DataFrame(locations)
I have another dataframe called df_islands:
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
Each island_id corresponds to one or more locations. As you can see, the locations are stored in a list.
What I'm trying to do is search list_of_locations for each unique location and merge the result into df_location, so that each location gets its corresponding island_id.
The final dataframe should be the following:
merged = {'location_id': [1,2,3,4,5,6,7,8,9,10],
          'temperature_value': [20,21,22,23,24,25,26,27,28,29],
          'humidity_value': [60,61,62,63,64,65,66,67,68,69],
          'island_id': [10,20,20,30,30,40,40,40,50,60]}
df_merged = pd.DataFrame(merged)
I don't know whether there is a method or function in Python to do this. I would really appreciate it if someone could give me a solution to this problem.
The pandas method you're looking for to expand your df_islands dataframe is .explode(column_name). From there, rename your column to location_id and then join the dataframes using pd.merge(). It'll perform a SQL-like join using location_id as the key.
import pandas as pd
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value': [60,61,62,63,64,65,66,67,68,69]}
df_locations = pd.DataFrame(locations)
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
df_islands = df_islands.explode(column='list_of_locations')
df_islands.columns = ['island_id', 'location_id']
pd.merge(df_locations, df_islands)
Out[]:
   location_id  temperature_value  humidity_value  island_id
0            1                 20              60         10
1            2                 21              61         20
2            3                 22              62         20
3            4                 23              63         30
4            5                 24              64         30
5            6                 25              65         40
6            7                 26              66         40
7            8                 27              67         40
8            9                 28              68         50
9           10                 29              69         60
The df.apply() method works here. It's a bit long-winded but it works:
df_location['island_id'] = df_location['location_id'].apply(
    lambda x: [
        df_islands['island_id'][i]
        for i in df_islands.index
        if x in df_islands['list_of_locations'][i]
        # comment above line and use this instead if list is stored in a string
        # if x in eval(df_islands['list_of_locations'][i])
    ][0]
)
First we select the final value we want if the if statement is True: df_islands['island_id'][i]
Then we loop over each row in df_islands by using df_islands.index
Then we create the if statement, which checks the list in df_islands['list_of_locations'] for that row and is True if the value from df_location['location_id'] is in the list.
Finally, since this long expression is wrapped in square brackets, the result is a list. However, we know that there is only one value in the list, so we can pick it out with [0] at the end.
I hope this helps, and I'm happy for other editors to make the answer more legible!
print(df_location)
   location_id  temperature_value  humidity_value  island_id
0            1                 20              60         10
1            2                 21              61         20
2            3                 22              62         20
3            4                 23              63         30
4            5                 24              64         30
5            6                 25              65         40
6            7                 26              66         40
7            8                 27              67         40
8            9                 28              68         50
9           10                 29              69         60
How to read a file word by word
I have a PPM file that I need to do certain operations on. The file is structured as in the following example. The first line, 'P3', just says what kind of document it is. The second line gives the pixel dimensions of the image, so in this case it's telling us that the image is 480x640. The third line declares the maximum value any color can take. After that come the lines of data. Every group of three integers gives an RGB value for one pixel. So in this example, the first pixel has RGB value 49, 49, 49, the second pixel has RGB value 48, 48, 48, and so on.
P3
480 640
255
49 49 49 48 48 48 47 47 47 46 46 46 45 45 45 42 42 42 38 38 38 35 35 35 23 23 23 8 8 8 7 7 7 17 17 17 21 21 21 29 29 29 41 41 41 47 47 47 49 49 49 42 42 42 33 33 33 24 24 24 18 18 ...
Now, as you may notice, this particular picture is supposed to be 640 pixels wide, which means 640*3 integers make up the first row of pixels. But here the first line is very, very far from containing 640*3 integers. So the line breaks in this file are meaningless, hence my problem.
The main way to read files in Python is line by line. But I need to collect these integers into groups of 640*3 and treat each group like a line. How would one do this? I know I could read the file in line by line and append every line to some list, but that list would be massive and I assume it would place an unacceptable burden on a device's memory. Other than that, I'm out of ideas. Help would be appreciated.
To read three space-separated words at a time from a file:
with open(filename, 'rb') as file:
    kind, dimensions, max_color = map(next, [file]*3)  # read 3 lines
    rgbs = zip(*[(int(word) for line in file for word in line.split())] * 3)
Output
[(49, 49, 49), (48, 48, 48), (47, 47, 47), (46, 46, 46), (45, 45, 45), (42, 42, 42), ...
See What is the most “pythonic” way to iterate over a list in chunks?
To avoid creating the list at once, you could use itertools.izip() in Python 2 (in Python 3 the built-in zip() is already lazy), which lets you read one RGB value at a time.
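The chunking trick above relies on `zip` pulling three items in turn from the same iterator. A minimal Python 3 sketch of that idea, using an in-memory `io.StringIO` as a stand-in for the real file; the tiny sample data and the name `iter_pixels` are assumptions made for this illustration:

```python
import io


def iter_pixels(stream):
    """Lazily yield (r, g, b) tuples from a text PPM body, ignoring line breaks."""
    numbers = (int(word) for line in stream for word in line.split())
    # zip pulls three items in turn from the *same* generator,
    # so consecutive numbers end up grouped into triples.
    return zip(numbers, numbers, numbers)


# Stand-in for an open file: three header lines, then data with odd line breaks.
sample = io.StringIO("P3\n2 2\n255\n49 49 49 48 48\n48 47 47 47\n46 46 46\n")
kind = next(sample).strip()          # 'P3'
dimensions = next(sample).split()    # the two size fields, left uninterpreted
max_color = int(next(sample))        # 255
pixels = list(iter_pixels(sample))
print(pixels)
```

Because `zip` is lazy in Python 3, the pixels could also be consumed one at a time (for example with `itertools.islice`) instead of being collected into a list.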
Probably not the most 'pythonic' way but...
Iterate through the lines containing integers. Keep four counts: color_code_count (counts up to 3), numbers_processed (counts up to 1920), col (0-639), and row (0-479).
For each integer you encounter, add it to a temporary list at index list[color_code_count]. Increment color_code_count and numbers_processed. Once color_code_count is 3, take your temporary list and create a 3-tuple or triplet (not sure what the term is, but your structure will look like (49,49,49) for the first pixel), and add that to a list of 640 columns and 480 rows: insert your (49, 49, 49) into pixels[col][row]. Increment col. Reset color_code_count.
'numbers_processed' will continue to increment until you get to 1920. Once you hit 1920, you've reached the end of the first row. Reset numbers_processed and col to zero, and increment row by 1. By this point, you should have 640 triplets in row zero, starting with (49,49,49), (48, 48, 48), (47, 47, 47), etc. And you're now starting to insert pixel values in row 1, column 0.
Like I said, probably not the most 'pythonic' way. There are probably better ways of doing this using join and map, but I think this might work. This 'solution', if you want to call it that, shouldn't care about the number of integers on any line, since you're keeping count of how many numbers you expect to run through (1920) before you start a new row.
A possible way to go through each word is to iterate through each line and then .split it into words.
the_file = open("file.txt", "r")
for line in the_file:
    for word in line.split():
        #-----Your Code-----
From there you can do whatever you want with your "words". You can add if-statements to check if there are numbers in each line with: (though not very pythonic)
for line in the_file:
    if "1" not in line or "2" not in line ...:
        for word in line.split():
            #-----Your Code-----
Or you can test if there is anything in each line: (much more pythonic)
for line in the_file:
    for word in line.split():
        if len(word) != 0 or word != "\n":
            #-----Your Code-----
I would recommend adding each of your new "lines" to a new document.
I am a C programmer. Sorry if this code looks like C style:
f = open("pixel.ppm", "r")
type = f.readline()
height, width = f.readline().split()
height, width = int(height), int(width)
max_color = int(f.readline());
colors = []
count = 0
col_count = 0
line = []
while(col_count < height):
    count = 0
    i = 0
    row = []
    while(count < width * 3):
        temp = f.readline().strip()
        if(temp == ""):
            col_count = height
            break
        temp = temp.split()
        line.extend(temp)
        i = 0
        while(i + 2 < len(line)):
            row.append({'r':int(line[i]),'g':int(line[i+1]),'b':int(line[i+2])})
            i = i+3
            count = count +3
            if(count >= width *3):
                break
        if(i < len(line)):
            line = line[i:len(line)]
        else:
            line = []
    col_count += 1
    colors.append(row)
for row in colors:
    for rgb in row:
        print(rgb)
    print("\n")
You can tweak this according to your needs. I tested it on this file:
P4 3 4 256 4 5 6 4 7 3 2 7 9 4 2 4 6 8 0 3 4 5 6 7 8 9 0 2 3 5 6 7 9 2 2 4 5 7 2 2
This seems to do the trick:
from re import findall

def _split_list(lst, i):
    return lst[:i], lst[i:]

def iter_ppm_rows(path):
    with open(path) as f:
        ftype = f.readline().strip()
        h, w = (int(s) for s in f.readline().split(' '))
        maxcolor = int(f.readline())
        rlen = w * 3
        row = []
        next_row = []
        for line in f:
            line_ints = [int(i) for i in findall(r'\d+\s+', line)]
            if not row:
                row, next_row = _split_list(line_ints, rlen)
            else:
                rest_of_row, next_row = _split_list(line_ints, rlen - len(row))
                row += rest_of_row
            if len(row) == rlen:
                yield row
                row = next_row
                next_row = []
It isn't very pretty, but it allows for varying whitespace between numbers in the file, as well as varying line lengths.
I tested it on a file that looked like the following:
P3 120 160 255 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 [...] 9993 9994 9995 9996 9997 9998 9999
That file used random line lengths, but printed numbers in order so it was easy to tell at what value the rows began and stopped. Note that its dimensions are different from those in the question's example file.
Using the following test code...
for row in iter_ppm_rows('mock_ppm.txt'):
    print(len(row), row[0], row[-1])
...the result was the following, which does not seem to skip any data and returns rows of the right size.
480 0 479
480 480 959
480 960 1439
480 1440 1919
480 1920 2399
480 2400 2879
480 2880 3359
480 3360 3839
480 3840 4319
480 4320 4799
480 4800 5279
480 5280 5759
480 5760 6239
480 6240 6719
480 6720 7199
480 7200 7679
480 7680 8159
480 8160 8639
480 8640 9119
480 9120 9599
As can be seen, trailing data at the end of the file that can't represent a complete row was not yielded, which was expected, but you'd likely want to account for it somehow.
Printing a rather specific matrix
I have a list consisting of 148 entries. Each entry is a four-digit number. I would like to print out the result like this:
1 14 27 40
2 15 28 41
3 16 29 42
4 17 30 43
5 18 31 44
6 19 32 45
7 20 33 46
8 21 34 47
9 22 35 48
10 23 36 49
11 24 37 50
12 25 38 51
13 26 39 52
53 54 55... and so on
I have some code that works for the first 13 rows and 4 columns:
kort_identifier = [my_list_with_the_entries]
print_val = 0
print_num_1 = 0
print_num_2 = 13
print_num_3 = 26
print_num_4 = 39
while (print_val <= 36):
    print kort_identifier[print_num_1], '%10s' % kort_identifier[print_num_2], '%10s' % kort_identifier[print_num_3], '%10s' % kort_identifier[print_num_4]
    print_val += 1
    print_num_1 += 1
    print_num_2 += 1
    print_num_3 += 1
    print_num_4 += 1
I feel this is an awful solution and there has to be a better and simpler way of doing this. I have searched through here (for printing tables and matrices) and tried those solutions, but none seems to work with the odd table/matrix behaviour that I need. Please point me in the right direction.
A bit tricky, but here you go. I opted to manipulate the list until it had the right shape, instead of messing around with indexes.
lst = range(1, 149)
lst = [lst[i:i+13] for i in xrange(0, len(lst), 13)]
lst = zip(*[lst[i] + lst[i+4] + lst[i+8] for i in xrange(4)])
for row in lst:
    for col in row:
        print col,
    print
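A different way to get a column-by-column layout, sketched in Python 3 rather than the Python 2 used above. It assumes the desired layout is blocks of 13 rows by 4 columns, each filled column by column, with a possibly incomplete last block; that reading of the question, and the name `column_major_blocks`, are assumptions of this sketch:

```python
def column_major_blocks(items, rows=13, cols=4):
    """Yield table rows filled column by column, one block of rows*cols items at a time."""
    block_size = rows * cols
    for start in range(0, len(items), block_size):
        block = items[start:start + block_size]
        for r in range(rows):
            # Every `rows`-th item starting at r belongs to the same printed row.
            line = block[r::rows]
            if line:
                yield line


table = list(column_major_blocks(list(range(1, 149))))
for line in table[:3]:
    print(' '.join(f'{value:>5}' for value in line))
```

Slicing with a step (`block[r::rows]`) replaces the four manually incremented indexes from the question.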
It might be overkill, but you could just make a numpy array.
import numpy as np
x = np.array(kort_identifier).reshape(2, 13, 4)
for subarray in x:
    for row in subarray:
        print row