The most frequent pattern of specific columns in a pandas DataFrame in Python

I know how to get the most frequent element of a list of lists, e.g.
a = [[3,4], [3,4],[3,4], [1,2], [1,2], [1,1],[1,3],[2,2],[3,2]]
print(max(a, key=a.count))
This prints [3, 4], even though the most frequent number is 1 for the first element and 2 for the second element.
My question is how to do the same kind of thing with Pandas.DataFrame.
For example, I'd like to know the implementation of the following method get_max_freq_elem_of_df:
def get_max_freq_elem_of_df(df):
    # do some things
    return freq_list
df = pd.DataFrame([[3,4], [3,4],[3,4], [1,2], [1,2], [1,1],[1,3],[2,2],[4,2]])
x = get_max_freq_elem_of_df(df)
print(x)  # => should print [3,4]
Please note that the DataFrame.mode() method does not work here. For the above example, df.mode() returns [1, 2], not [3,4].
Update: I have explained why DataFrame.mode() doesn't work.

You could use groupby.size and then find the max:
>>> df.groupby([0,1]).size()
0  1
1  1    1
   2    2
   3    1
2  2    1
3  4    3
4  2    1
dtype: int64
>>> df.groupby([0,1]).size().idxmax()
(3, 4)

In pure Python you'd use Counter*:
In [11]: from collections import Counter
In [12]: c = Counter(df.itertuples(index=False))
In [13]: c
Out[13]: Counter({(3, 4): 3, (1, 2): 2, (1, 3): 1, (2, 2): 1, (4, 2): 1, (1, 1): 1})
In [14]: c.most_common(1) # get the top 1 most common items
Out[14]: [((3, 4), 3)]
In [15]: c.most_common(1)[0][0] # get the item (rather than the (item, count) tuple)
Out[15]: (3, 4)
* Note that your solution
max(a, key=a.count)
(although it works) is O(N^2), since on each iteration it needs to iterate through a (to get the count), whereas Counter is O(N).
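For the original list of lists, a Counter-based version might look like the sketch below. It assumes the inner lists can be converted to tuples (lists themselves are unhashable, so Counter cannot count them directly); the names counts, pair, freq and most_frequent are just illustrative.

```python
from collections import Counter

a = [[3, 4], [3, 4], [3, 4], [1, 2], [1, 2], [1, 1], [1, 3], [2, 2], [3, 2]]

# Lists are unhashable, so convert each inner list to a tuple before counting.
counts = Counter(map(tuple, a))

# most_common(1) returns a one-element list of (item, count); unpack it.
pair, freq = counts.most_common(1)[0]
most_frequent = list(pair)

print(most_frequent, freq)  # [3, 4] 3
```

This makes a single pass over a, instead of calling a.count once per element.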

Related

Pythonic way of making combinations in groups

Given I have the following list:
group  code
A      1
A      2
A      3
B      4
B      5
B      6
B      7
How do I create the following list in a Pythonic way?
group  code  code
A      1     2
A      1     3
A      2     3
B      4     5
B      4     6
B      4     7
B      5     6
B      5     7
B      6     7
I saw another question that suggests using combinations from itertools. But how do I handle the grouping restriction? I don't want all matches, just the ones within each group.
You need to use itertools to get all possible code combinations for each group:
from itertools import combinations
A = [1,2,3]
B = [4,5,6,7]
comb_A = combinations(A, 2)
comb_B = combinations(B, 2)
# to see the results, iterate through all the combinations
# For A (the same applies for B)
for comb in comb_A:
    print(comb)
# (1, 2)
# (1, 3)
# (2, 3)
Note: It would be more helpful if you could provide the actual "list" so we can give a more specific answer.
Since you didn't post a MWE of what you tried, I'll show you steps that you can implement yourself.
Build a dictionary of data groups, say group_dict.
Create an empty list for result.
Iterate through the items in group_dict, where each item is a group name and a list of codes for that group.
For each group, use the combinations function to generate all possible combinations of 2 codes.
For each combination, append a tuple of the group name, the first code, and the second code to result.
This should give you a result like this:
[('A', 1, 2), ('A', 1, 3), ('A', 2, 3), ('B', 4, 5), ('B', 4, 6), ('B', 4, 7), ('B', 5, 6), ('B', 5, 7), ('B', 6, 7)]
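The steps above can be sketched roughly as follows. The input format is an assumption: rows is a hypothetical list of (group, code) tuples matching the table in the question, since the actual data structure was not posted.

```python
from itertools import combinations

# Hypothetical input: (group, code) rows matching the table in the question.
rows = [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("B", 6), ("B", 7)]

# Build a dictionary mapping each group name to its list of codes.
group_dict = {}
for group, code in rows:
    group_dict.setdefault(group, []).append(code)

# Pair up codes within each group only, never across groups.
result = []
for group, codes in group_dict.items():
    for first, second in combinations(codes, 2):
        result.append((group, first, second))

print(result)
```

Because combinations is called once per group, no pair ever mixes codes from A and B.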

Return various sum totals of integer list

Is there a way to return various sums of a list of integers? Pythonic or otherwise.
For example, various sum totals from [1, 2, 3, 4] would produce 1+2=3, 1+3=4, 1+4=5, 2+3=5, 2+4=6, 3+4=7. The sums could be limited to two integers by default, or include more, I guess.
Can't seem to wrap my head around how to tackle this and can't seem to find an example or explanation on the internet as they all lead to "Sum even/odd numbers in list" and other different problems.
You can use itertools.combinations and sum:
from itertools import combinations
li = [1, 2, 3, 4]
# assuming we don't need to sum the entire list or single numbers,
# and that x + y is the same as y + x
for sum_size in range(2, len(li)):
    for comb in combinations(li, sum_size):
        print(comb, sum(comb))
outputs
(1, 2) 3
(1, 3) 4
(1, 4) 5
(2, 3) 5
(2, 4) 6
(3, 4) 7
(1, 2, 3) 6
(1, 2, 4) 7
(1, 3, 4) 8
(2, 3, 4) 9
Is this what you are looking for?
A = [1, 2, 3, 4]
for i in range(len(A)):
    for j in range(i + 1, len(A)):  # j > i, so each pair is summed only once
        print(A[i] + A[j])

Looping through Numpy Array elements

Is there a more readable way to code a loop in Python that goes through each element of a Numpy array? I have come up with the following code, but it seems cumbersome & not very readable:
import numpy as np
arr01 = np.random.randint(1,10,(3,3))
for i in range(0,(np.shape(arr01[0])[0]+1)):
    for j in range(0,(np.shape(arr01[1])[0]+1)):
        print (arr01[i,j])
I could make it more explicit such as:
import numpy as np
arr01 = np.random.randint(1,10,(3,3))
rows = np.shape(arr01[0])[0]
cols = np.shape(arr01[1])[0]
for i in range(0, (rows + 1)):
    for j in range(0, (cols + 1)):
        print (arr01[i,j])
However, that still seems more cumbersome than in other languages; e.g., equivalent code in VBA could read (supposing the array had already been populated):
dim i, j as integer
for i = lbound(arr01,1) to ubound(arr01,1)
    for j = lbound(arr01,2) to ubound(arr01,2)
        msgBox arr01(i, j)
    next j
next i
You can use NumPy's nditer function if you don't need the index values.
for elem in np.nditer(arr01):
    print(elem)
EDIT: If you need the indexes (as a tuple for a 2D array), then:
for index, elem in np.ndenumerate(arr01):
    print(index, elem)
Seems like you've skipped over some intro Python chapters. With a list there are several simple ways of iterating:
In [1]: alist = ['a','b','c']
In [2]: for i in alist: print(i) # on the list itself
a
b
c
In [3]: len(alist)
Out[3]: 3
In [4]: for i in range(len(alist)): print(i,alist[i]) # index is ok
0 a
1 b
2 c
In [5]: for i,v in enumerate(alist): print(i,v) # but enumerate is simpler
0 a
1 b
2 c
Note the indexes. range(3) is sufficient. alist[3] produces an error.
In [6]: arr = np.arange(6).reshape(2,3)
In [7]: arr
Out[7]:
array([[0, 1, 2],
       [3, 4, 5]])
In [8]: for row in arr:
   ...:     for col in row:
   ...:         print(row,col)
   ...:
[0 1 2] 0
[0 1 2] 1
[0 1 2] 2
[3 4 5] 3
[3 4 5] 4
[3 4 5] 5
The shape is a tuple. The row count is then arr.shape[0], and columns arr.shape[1]. Or you can 'unpack' both at once:
In [9]: arr.shape
Out[9]: (2, 3)
In [10]: n,m = arr.shape
In [11]: [arr[i,j] for i in range(n) for j in range(m)]
Out[11]: [0, 1, 2, 3, 4, 5]
But we can get the same flat list of values with ravel and optional conversion to list:
In [12]: arr.ravel()
Out[12]: array([0, 1, 2, 3, 4, 5])
In [13]: arr.ravel().tolist()
Out[13]: [0, 1, 2, 3, 4, 5]
But usually with numpy arrays, you shouldn't be iterating at all. Learn enough of the numpy basics so you can work with the whole array, not elements.
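As a rough illustration of working with the whole array, the sketch below does the same arithmetic twice, once with an explicit loop and once as a single array expression. The doubling operation and the names looped and vectorized are arbitrary choices for the example, not something from the question.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# Element-by-element loop: doubles every value one at a time.
looped = np.empty_like(arr)
n, m = arr.shape
for i in range(n):
    for j in range(m):
        looped[i, j] = arr[i, j] * 2

# Whole-array version: the same arithmetic without a Python-level loop.
vectorized = arr * 2

print(np.array_equal(looped, vectorized))  # True
print(arr.sum(axis=0))                     # column sums: [3 5 7]
```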
nditer can be used, as the other answer shows, to iterate through an array in a flat manner, but there are a number of details about it that could easily confuse a beginner. There are a couple of intro pages to nditer, but they should be read in full. Usually I discourage its use.
In [14]: for i in np.nditer(arr):
    ...:     print(i, type(i), i.shape)
    ...:
0 <class 'numpy.ndarray'> () # this element is a 0d array, not a scalar integer
1 <class 'numpy.ndarray'> ()
2 <class 'numpy.ndarray'> ()
...
Iterating with ndenumerate or over the result of tolist produces different types of elements. The type may matter if you try to do more than display the value, so be careful.
In [15]: list(np.ndenumerate(arr))
Out[15]: [((0, 0), 0), ((0, 1), 1), ((0, 2), 2), ((1, 0), 3), ((1, 1), 4), ((1, 2), 5)]
In [16]: for ij, v in np.ndenumerate(arr):
    ...:     print(ij, v, type(v))
    ...:
(0, 0) 0 <class 'numpy.int64'>
(0, 1) 1 <class 'numpy.int64'>
...
In [17]: for i, v in enumerate(arr.ravel().tolist()):
    ...:     print(i, v, type(v))
    ...:
0 0 <class 'int'>
1 1 <class 'int'>
...

Random access over all pair-wise combinations of large list in Python

Background:
I have a list of 44906 items: large = [1, 60, 17, ...]. I also have a personal computer with limited memory (8GB), running Ubuntu 14.04.4 LTS.
The Goal:
I need to find all the pair-wise combinations of large in a memory-efficient manner, without filling a list with all the combinations beforehand.
The Problem & What I've Tried So Far:
When I use itertools.combinations(large, 2), and try to assign it to a list my memory fills up immediately, and I get very slow performance. The reason for this is that the number of pairwise combinations goes like n*(n-1)/2 where n is the number of elements of the list.
The number of combinations for n=44906 comes out to 44906*44905/2 = 1008251965. A list with this many entries is much too large to store in memory. I would like to be able to design a function so that I can plug in a number i to find the ith pair-wise combination of numbers in this list, and a way to somehow dynamically compute this combination, without referring to a 1008251965 element list that's impossible to store in memory.
An Example of What I Am Trying To Do:
Let's say I have an array small = [1,2,3,4,5]
In the configuration in which I have the code, itertools.combinations(small, 2) will return a list of tuples as such:
[(1, 2), # 1st entry
(1, 3), # 2nd entry
(1, 4), # 3rd entry
(1, 5), # 4th entry
(2, 3), # 5th entry
(2, 4), # 6th entry
(2, 5), # 7th entry
(3, 4), # 8th entry
(3, 5), # 9th entry
(4, 5)] # 10th entry
A call to the function like this: find_pair(10) would return:
(4, 5)
giving the 10th entry in the would-be list, but without calculating the entire combinatorial explosion beforehand.
The thing is, I need to be able to drop in to the middle of the combinations, not starting from the beginning every time, which is what it seems like an iterator does:
>>> from itertools import combinations
>>> it = combinations([1, 2, 3, 4, 5], 2)
>>> next(it)
(1, 2)
>>> next(it)
(1, 3)
>>> next(it)
(1, 4)
>>> next(it)
(1, 5)
So, instead of having to execute next() 10 times to get to the 10th combination, I would like to be able to retrieve the tuple returned by the 10th iteration with one call.
The Question
Are there any other combinatorial functions that behave this way designed to deal with huge data sets? If not, is there a good way to implement a memory-saving algorithm that behaves this way?
Except itertools.combinations does not return a list - it returns an iterator. Here:
>>> from itertools import combinations
>>> it = combinations([1, 2, 3, 4, 5], 2)
>>> next(it)
(1, 2)
>>> next(it)
(1, 3)
>>> next(it)
(1, 4)
>>> next(it)
(1, 5)
>>> next(it)
(2, 3)
>>> next(it)
(2, 4)
and so on. It's extremely memory-efficient: only one pair is produced per invocation.
Of course it is possible to write a function that returns the n'th result, but before bothering with that (which will be slower and more involved), are you quite sure you can't just use combinations() the way it was designed to be used (i.e., iterating over it, instead of forcing it to produce a giant list)?
If you want random access to any combination, you can use this function to return the (row, column) index pair for entry k of a lower-triangular representation of the cross-product:
import math
def comb(k):
    row = int((math.sqrt(1+8*k)+1)/2)
    column = int(k-(row-1)*(row)/2)
    return [row,column]
Using your small array, for example:
small = [1,2,3,4,5]
length = len(small)
size = int(length * (length-1)/2)
for i in range(size):
    [n,m] = comb(i)
    print(i,[n,m],"(",small[n],",",small[m],")")
will give
0 [1, 0] ( 2 , 1 )
1 [2, 0] ( 3 , 1 )
2 [2, 1] ( 3 , 2 )
3 [3, 0] ( 4 , 1 )
4 [3, 1] ( 4 , 2 )
5 [3, 2] ( 4 , 3 )
6 [4, 0] ( 5 , 1 )
7 [4, 1] ( 5 , 2 )
8 [4, 2] ( 5 , 3 )
9 [4, 3] ( 5 , 4 )
Obviously, if you access the pairs in order, other methods will be more practical.
Note also that the comb function is independent of the size of the problem.
As suggested by @Blckknght in the comments, to get the same order as the itertools version, change the loop to
for i in range(size):
    [n,m] = comb(size-1-i)
    print(i,[n,m],"(",small[length-1-n],",",small[length-1-m],")")
0 [4, 3] ( 1 , 2 )
1 [4, 2] ( 1 , 3 )
2 [4, 1] ( 1 , 4 )
3 [4, 0] ( 1 , 5 )
4 [3, 2] ( 2 , 3 )
5 [3, 1] ( 2 , 4 )
6 [3, 0] ( 2 , 5 )
7 [2, 1] ( 3 , 4 )
8 [2, 0] ( 3 , 5 )
9 [1, 0] ( 4 , 5 )
I started with that triangular arrangement, finding the subscript k for list members indexed row and col. Then I reversed the process, deriving row and col from k.
For a list large of N items, let
b = 2*N - 1
Now, to get the kth combination in the list (counting from k = 0) ...
row = int((b - math.sqrt(b*b - 8*k)) // 2)
col = k - (2*N - row - 1)*row // 2 + row + 1
kth_pair = (large[row], large[col])
This allows you to access any member of the combinations list without ever generating that list.
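A possible closed-form sketch of this kind of random access is below. It assumes k counts from 0 and that the order should match itertools.combinations(seq, 2). The name find_pair comes from the question; using math.isqrt (Python 3.8+) rather than a floating-point square root is my own substitution to keep everything in exact integer arithmetic.

```python
from itertools import combinations
from math import isqrt

def find_pair(seq, k):
    """Return the k-th (0-based) pair of combinations(seq, 2) without building the list."""
    n = len(seq)
    total = n * (n - 1) // 2
    if not 0 <= k < total:
        raise IndexError("pair index out of range")
    b = 2 * n - 1
    # Pairs whose first index is `row` start at offset row*(2n - row - 1)/2.
    # The largest such row with offset <= k is floor((b - sqrt(b*b - 8k)) / 2).
    d = b * b - 8 * k
    s = isqrt(d)
    if s * s < d:
        s += 1  # use the ceiling of the square root so the floor division is exact
    row = (b - s) // 2
    col = k - row * (2 * n - row - 1) // 2 + row + 1
    return seq[row], seq[col]

small = [1, 2, 3, 4, 5]
print(find_pair(small, 9))  # (4, 5), the 10th entry when counting from 1
```

Each call does a fixed amount of integer arithmetic, so the cost does not grow with the number of pairs.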
So you've got 44906 items. Notice, however, that if you build your combinations the same way you build them in the example then there are 44905 combinations with large[0] as the first number. Furthermore, combination i for i <= 44905 looks like (large[0], large[i]).
For 44905 < i <= 89809, it looks like (large[1],large[i-44904]).
If I'm not mistaken, this pattern should continue with something like (large[j],large[i-(exclusive lower bound for j)+1]). You can check my math on that but I'm pretty sure it's right. Anyways, you could iterate to find these lower bounds (so for j=0, it's 0, for j=1, it's 44905, etc.) Iterating should be easy because you just add the next descending number: 44905, 44905+44904, 44905+44904+44903...
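The iteration over lower bounds described above could be sketched like this. It assumes i counts from 1, as in the question's example; find_pair_iterative, block and lower are hypothetical names chosen for the illustration.

```python
from itertools import combinations

def find_pair_iterative(seq, i):
    """Return the i-th (1-based) pair of combinations(seq, 2) by walking the block sizes."""
    n = len(seq)
    if not 1 <= i <= n * (n - 1) // 2:
        raise IndexError("pair index out of range")
    j = 0          # index of the first element of the pair
    block = n - 1  # number of pairs that start with seq[j]
    lower = 0      # exclusive lower bound: pairs lower+1 .. lower+block start with seq[j]
    while i > lower + block:
        lower += block  # skip every pair that starts with seq[j]
        block -= 1      # the next first element has one fewer partner
        j += 1
    return seq[j], seq[j + (i - lower)]

small = [1, 2, 3, 4, 5]
print(find_pair_iterative(small, 10))  # (4, 5)
```

The loop runs at most len(seq) - 1 times, so for 44906 items this is still cheap, though not constant time.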
For a well-defined order of created pairs, the indices of the first and second elements should be functions of n and the length of the sequence. If you find them, you'll be able to achieve constant-time performance, since indexing a list is an O(1) operation.
Pseudocode would look like this:
def find_nth_pair(seq, n):
    idx1 = f1(n, len(seq)) # some formula of n and len(seq)
    idx2 = f2(n, len(seq)) # some formula of n and len(seq)
    return (seq[idx1], seq[idx2])
You only need to find formulas for idx1 and idx2.

Find maximum length of consecutive repeated numbers in a list

My question is how to find the maximum length of consecutive repeated numbers (or elements in general) in a list. I wrote the following function which works fine but I was wondering if there's a better way to do this or improve my code.
def longest(roll):
    '''Return the maximum length of consecutive repeated elements in a list.'''
    i = 0
    M = 0 # The maximum length
    while 0 <= i < len(roll):
        c = 1 # Temporarily record the length of consecutive elements
        for j in range(i+1, len(roll)):
            if roll[j] != roll[i]:
                i = j
                break
            c += 1
            i += 1
        if c > M:
            M = c
        if i == len(roll) - 1:
            break
    return M
By maximum length I mean the following:
[1, 1, 2, 2, 2, 4] should return 3 (2 repeated 3 times);
[1, 2, 1, 2, 1] should return 1 (1 and 2 only repeated once).
You can use itertools.
In [8]: import itertools
In [9]: z = [(x[0], len(list(x[1]))) for x in itertools.groupby(a)]
In [10]: z
Out[10]: [(1, 2), (2, 3), (3, 1)]
Tuples are in (item, count) format. If there are multiple runs of a given number, this will group them accordingly as well. See below.
In [11]: a = [1,1,1,1,1,2,2,2,2,2,1,1,1,3,3]
In [12]: z = [(x[0], len(list(x[1]))) for x in itertools.groupby(a)]
In [13]: z
Out[13]: [(1, 5), (2, 5), (1, 3), (3, 2)]
Getting the max value isn't that hard from here.
In [15]: max(z, key=lambda x:x[1])[1]
Out[15]: 5
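Wrapped up as a function, the groupby approach might look like the sketch below. The name longest_run is made up for the example, and returning 0 for an empty list is an assumption, since the question does not say what should happen in that case.

```python
from itertools import groupby

def longest_run(seq):
    """Length of the longest run of equal consecutive elements (0 for an empty input)."""
    # groupby yields one (key, group) pair per run; count the items in each group.
    return max((sum(1 for _ in group) for _key, group in groupby(seq)), default=0)

print(longest_run([1, 1, 2, 2, 2, 4]))  # 3
print(longest_run([1, 2, 1, 2, 1]))     # 1
```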
longest_fragment = 0
current_fragment = 0
a = int(input())
last_input = a # why do I assign last_input here?
while a:
    if a == last_input:
        current_fragment += 1
    else: # why is current_fragment assigned 1 in this clause?
        if current_fragment > longest_fragment:
            longest_fragment = current_fragment
        current_fragment = 1
        last_input = a
    a = int(input())
longest_fragment = max(longest_fragment, current_fragment)
# why didn't i use max in the loop?
# why am I checking again down here anyway?
print('The longest fragment was:', longest_fragment)
