Pandas : determine mapping from unique rows to original dataframe - python

Given the following inputs:
In [18]: input
Out[18]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
2 1 5 9 1
3 1 5 9 1
In [26]: df = input.drop_duplicates()
Out[26]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
How would I go about getting an array that has the indices of the rows from the subset that are equivalent, eg:
resultant = [0, 1, 0, 0]
I.e. the '1' here is basically stating that (row[1] in input) == (row[1] in df). Since there will be fewer unique rows than there will be multiple values in 'resultant' that will equate to similar values in df. i.e (row[k] in input == row[k+N] in input) == (row[1] in df) could be a case.
I am looking for actual row number mapping from input:df.
While this example is trivial in my case i have a ton of dropped mappings that might map to one index as an example.
Why do I want this? I am training an autoencoder type system where the target sequence is non-unique.

One way would be to treat it as a groupby on all columns:
>> df.groupby(list(df.columns)).groups
{(1, 5, 9, 1): [0, 2, 3], (2, 6, 10, 2): [1]}
Another would be to sort and then compare, which is less efficient in theory but could very well be faster in some cases and is definitely easier to make more tolerant of error:
>>> ds = df.sort(list(df.columns))
>>> eqs = (ds != ds.shift()).all(axis=1).cumsum()
>>> ds.index.groupby(eqs)
{1: [0, 2, 3], 2: [1]}
This seems the right datastructure to me, but if you really do want an array with the group ids, that's easy too, e.g.
>>> eqs.sort_index() - 1
0 0
1 1
2 0
3 0
dtype: int64

Don't have pandas installed on this computer, but I think you could use df.iterrows() like:
def find_matching_row(row, df_slimmed):
for index, slimmed_row in df_slimmed.iterrows():
if slimmed_row.equals(row[slimmed_row.columns]):
return index
def rows_mappings(df, df_slimmed):
for _, row in df.iterrows():
yield find_matching_row(row, df_slimmed)
list(rows_mappings(df, input))
This is if you are interested in generating the resultant list in your example, I don't quite follow the latter part of your reasoning.

Related

how to count the match number between two dataframe fast?

I'm writing some programs on calculate the match item number between two dataframes.
for example,
A is the dataframe as : A = pd.DataFrame({'pick_num1':[1, 2, 3], 'pick_num2':[2, 3, 4], 'pick_num3':[4, 5, 6]})
B is the answer I want to match, like:
B = pd.DataFrame({'ans_num1':[1, 2, 3], 'ans_num2':[2, 3, 4], 'ans_num3':[4, 5, 6], 'ans_num4':[7, 8, 1], 'ans_num5':[9, 1, 9]})
DataFrame A
pick_num1 pick_num2 pick_num3 match_num
0 1 2 4 2
1 2 3 5 2
2 3 4 6 2
DataFrame B
ans_num1 ans_num2 ans_num3 ans_num4 ans_num5
0 1 2 4 7 9
1 2 3 5 8 1
2 3 4 6 1 9
and I want to append a new column of ['match_num'] at the end of A.
Now I have tried to write a mapping function to compare and calculate, and I found the speed is not that fast while the dataframe is huge, the functions are below:
def win_prb_func(df1, p_name):
df1['match_num'] += np.sum(pd.concat([df1[p_name]]*5, axis=1).values==df1[open_ball_name_ls].values, 1)
return df1
def compute_win_prb(df1):
return list(map(lambda p_name: win_prb_func(df1, p_name), pick_name_ls))
df1 = pd.concat([A, B], axis=1)
df1['win prb.'] = 0
result_df = compute_win_prb(df1)
where pick_name_ls is ['pick_num1', 'pick_num2', 'pick_num3'], and open_ball_name_ls is ['ans_num1', 'ans_num2', 'ans_num3', 'ans_num4', 'ans_num5'].
I'm wondering is it possible to make the computation more fast or smart than I did?
now the performance would is: 0.015626192092895508 seconds
Thank you for helping me!
You can use broadcasting instead of concatenating the columns:
def win_prb_func(df1, p_name):
df1['match_num'] += np.sum(df1[p_name].values[:, np.newaxis] == df1[open_ball_name_ls].values, 1)
return df1
Since df1[p_name].values will return an 1-D array, you have to convert it into the column vector by adding a new axis. It only takes me 0.004 second.

Pandas: eror when checking for a binary flag pattern [duplicate]

This question already has an answer here:
Bitwise operations in Pandas that return numbers rather than bools?
(1 answer)
Closed 1 year ago.
I have a dataframe where one of the columns of type int is storing a binary flag pattern:
import pandas as pd
df = pd.DataFrame({'flag': [1, 2, 4, 5, 7, 3, 9, 11]})
I tried selecting rows with value matching 4 the way it is typically done (with binary and operator):
df[df['flag'] & 4]
But it failed with:
KeyError: "None of [Int64Index([0, 0, 4, 4, 4, 0, 0, 0], dtype='int64')] are in the [columns]"
How to actually select rows matching binary pattern?
The bitwise-flag selection works as you’d expect:
>>> df['flag'] & 4
0 0
1 0
2 4
3 4
4 4
5 0
6 0
7 0
Name: flag, dtype: int64
However if you pass this to df.loc[], you’re asking to get the indexes 0 and 4 repeatedly, or if you use df[] directly you’re asking for the column that has Int64Index[...] as column header.
Instead, you should force the conversion to a boolean indexer:
>>> (df['flag'] & 4) != 0
0 False
1 False
2 True
3 True
4 True
5 False
6 False
7 False
Name: flag, dtype: bool
>>> df[(df['flag'] & 4) != 0]
flag
2 4
3 5
4 7
Even though in Pandas & or | is used as a logical operator to specify conditions but at the same time using a Series as an argument to allegedly logical operator results not in a Series of Boolean values but numbers.
Knowing that you can use any of the following approaches to select rows based on a binary pattern:
Since result of <int> & <FLAG> is always <FLAG> then you can use:
df[df['flag'] & 4 == 4]
which (due to the precedence of operators) evaluates as:
df[(df['flag'] & 4) == 4]
alternatively you can use apply and map the result directly to a bool:
df[df['flag'].apply(lambda v: bool(v & FLAG))]
But this does look very cumbersome and is likely to be much slower.
In either cases, the result is as expected:
flag
2 4
3 5
4 7

Appending seperate dataframes, each as column

I am currently working on the following:
data - with the correct index
for i in range(1, 11):
kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(data_values)
wcss.append(kmeans.inertia_)
kmeans = KMeans(n_clusters=2).fit(data_values)
y = kmeans.fit_predict(data_values) # prediction of k
df= pd.DataFrame(y,index = data.index)
....
#got here multiple dicts
Example of y:
[1 2 3 4 5 2 2 5 1 0 0 1 0 0 1 0 1 4 4 4 3 1 0 0 1 0 0 ...]
f = pd.DataFrame(y, columns = [buster] )
f.to_csv('busters.csv, mode = 'a')
y = clusters after determination
I dont know how did I stuck on this.. I am iterating over 20 dataframes, each one consists of one columns and values from 1-9. The index is irrelevent. I am trying to append all frame together but instead it just prints them one after the other. If I put ".T" to transpose it , I still got rows with irrelevent values as index, which I cant remove them because they are actually headers.
Needed result
If the dicts produced in each iteration look like {'Buster1': [0, 2, 2, 4, 5]}, {'Buster2': [1, 2, 3, 4, 5]} ..., using 5 elements here for illustration purposes, and all the lists, i.e., values in the dicts, have the same number of elements (as it is the case in your example), you could create a single dict and use pd.DataFrame directly. (You may also want to take a look at pandas.DataFrame.from_dict.)
You may have lists with more than 5 elements, more than 3 dicts (and thus columns), and you will be generating the dicts with a loop, but the code below should be sufficient for getting the idea.
>>> import pandas as pd
>>>
>>> d = {}
>>> # update d in every iteration
>>> d.update({'Buster 1': [0, 2, 2, 4, 5]})
>>> d.update({'Buster 2': [1, 2, 3, 4, 5]})
>>> # ...
>>> d.update({'Buster n': [0, 9, 3, 0, 0]})
>>>
>>> pd.DataFrame(d, columns=d.keys())
Buster 1 Buster 2 Buster n
0 0 1 0
1 2 2 9
2 2 3 3
3 4 4 0
4 5 5 0
If you have the keys, e.g., 'Buster 1', and values, e.g., [0, 2, 2, 4, 5], separated, as I believe is the case, you can simplify the above (and make it more efficient) by replacing d.update({'Buster 1': [0, 2, 2, 4, 5]}) with d['Buster 1']=[0, 2, 2, 4, 5].
I included columns=d.keys() because depending on your Python and pandas version the ordering of the columns may not be as you expect it to be. You can specify the ordering of the columns through specifying the order in which you provide the keys. For example:
>>> pd.DataFrame(d, columns=sorted(d.keys(),reverse=True))
Buster n Buster 2 Buster 1
0 0 1 0
1 9 2 2
2 3 3 2
3 0 4 4
4 0 5 5
Although it may not apply to your use case, if you do not want to print the index, you can take a look at How to print pandas DataFrame without index.

Get first row of dataframe in Python Pandas based on criteria

Let's say that I have a dataframe like this one
import pandas as pd
df = pd.DataFrame([[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]], columns=['A', 'B', 'C'])
>> df
A B C
0 1 2 1
1 1 3 2
2 4 6 3
3 4 3 4
4 5 4 5
The original table is more complicated with more columns and rows.
I want to get the first row that fulfil some criteria. Examples:
Get first row where A > 3 (returns row 2)
Get first row where A > 4 AND B > 3 (returns row 4)
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
But, if there isn't any row that fulfil the specific criteria, then I want to get the first one after I just sort it descending by A (or other cases by B, C etc)
Get first row where A > 6 (returns row 4 by ordering it by A desc and get the first one)
I was able to do it by iterating on the dataframe (I know that craps :P). So, I prefer a more pythonic way to solve it.
This tutorial is a very good one for pandas slicing. Make sure you check it out. Onto some snippets... To slice a dataframe with a condition, you use this format:
>>> df[condition]
This will return a slice of your dataframe which you can index using iloc. Here are your examples:
Get first row where A > 3 (returns row 2)
>>> df[df.A > 3].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
If what you actually want is the row number, rather than using iloc, it would be df[df.A > 3].index[0].
Get first row where A > 4 AND B > 3:
>>> df[(df.A > 4) & (df.B > 3)].iloc[0]
A 5
B 4
C 5
Name: 4, dtype: int64
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
>>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
Now, with your last case we can write a function that handles the default case of returning the descending-sorted frame:
>>> def series_or_default(X, condition, default_col, ascending=False):
... sliced = X[condition]
... if sliced.shape[0] == 0:
... return X.sort_values(default_col, ascending=ascending).iloc[0]
... return sliced.iloc[0]
>>>
>>> series_or_default(df, df.A > 6, 'A')
A 5
B 4
C 5
Name: 4, dtype: int64
As expected, it returns row 4.
For existing matches, use query:
df.query(' A > 3' ).head(1)
Out[33]:
A B C
2 4 6 3
df.query(' A > 4 and B > 3' ).head(1)
Out[34]:
A B C
4 5 4 5
df.query(' A > 3 and (B > 3 or C > 2)' ).head(1)
Out[35]:
A B C
2 4 6 3
you can take care of the first 3 items with slicing and head:
df[df.A>=4].head(1)
df[(df.A>=4)&(df.B>=3)].head(1)
df[(df.A>=4)&((df.B>=3) * (df.C>=2))].head(1)
The condition in case nothing comes back you can handle with a try or an if...
try:
output = df[df.A>=6].head(1)
assert len(output) == 1
except:
output = df.sort_values('A',ascending=False).head(1)
For the point that 'returns the value as soon as you find the first row/record that meets the requirements and NOT iterating other rows', the following code would work:
def pd_iter_func(df):
for row in df.itertuples():
# Define your criteria here
if row.A > 4 and row.B > 3:
return row
It is more efficient than Boolean Indexing when it comes to a large dataframe.
To make the function above more applicable, one can implements lambda functions:
def pd_iter_func(df: DataFrame, criteria: Callable[[NamedTuple], bool]) -> Optional[NamedTuple]:
for row in df.itertuples():
if criteria(row):
return row
pd_iter_func(df, lambda row: row.A > 4 and row.B > 3)
As mentioned in the answer to the 'mirror' question, pandas.Series.idxmax would also be a nice choice.
def pd_idxmax_func(df, mask):
return df.loc[mask.idxmax()]
pd_idxmax_func(df, (df.A > 4) & (df.B > 3))

Multidimension array indexing and column-accessing

I have a 3 dimensional array like
[[[ 1 4 4 ..., 952 0 0]
[ 2 4 4 ..., 33 0 0]
[ 3 4 4 ..., 1945 0 0]
...,
[4079 1 1 ..., 0 0 0]
[4080 2 2 ..., 0 0 0]
[4081 1 1 ..., 0 0 0]]
[[ 1 4 4 ..., 952 0 0]
[ 2 4 4 ..., 33 0 0]
[ 3 4 4 ..., 1945 0 0]
...,
[4079 1 1 ..., 0 0 0]
[4080 2 2 ..., 0 0 0]
[4081 1 1 ..., 0 0 0]]
.....
[[ 1 4 4 ..., 952 0 0]
[ 2 4 4 ..., 33 0 0]
[ 3 4 4 ..., 1945 0 0]
...,
[4079 1 1 ..., 0 0 0]
[4080 2 2 ..., 0 0 0]
[4081 1 1 ..., 0 0 0]]]
This array has total 5 data blocks. Each data blocks have 4081 lines and 9 columns.
My question here is about accessing to column, in data-block-wise.
I hope to index data-blocks, lines, and columns, and access to the columns, and do some works with if loops. I know how to access to columns in 2D array, like:
column_1 = [row[0] for row in inputfile]
but how can I access to columns per each data block?
I tried like ( inputfile = 3d array above )
for i in range(len(inputfile)):
AAA[i] = [row[0] for row in inputfile]
print AAA[2]
But it says 'name 'AAA' is not defined. How can I access to the column, for each data blocks? Should I need to make [None] arrays? Are there any other way without using empty arrays?
Also, how can I access to the specific elements of the accessed columns? Like AAA[i][j] = i-th datablock, and j-th line of first column. Shall I use one more for loop for line-wise accessing?
ps) I tried to analyze this 3d array in a way like
for i in range(len(inputfile)): ### number of datablock = 5
for j in range(len(inputfile[i])): ### number of lines per a datablock = 4081
AAA = inputfile[i][j] ### Store first column for each datablocks to AAA
print AAA[0] ### Working as I intended to access 1st column.
print AAA[0][1] ### Not working, invalid index to scalar variable. I can't access to the each elemnt.
But this way, I cannot access to the each elements of 1st column, AAA[0]. How can I access to the each elements in here?
I thought maybe 2 indexes were not enough, so I used 3 for-loops as:
for i in range(len(inputfile)): ### number of datablock = 5
for j in range(len(inputfile[i])): ### number of lines per a datablock = 4081
for k in range(len(inputfile[i][j])): ### number of columns per line = 9
AAA = inputfile[i][j][0]
print AAA[0]
Still, I cannot access to the each elements of 1st column, it says 'invalid index to scalar variable'. Also, AAA contains nine of each elements, just like
>>> print AAA
1
1
1
1
1
1
1
1
1
2
2
...
4080
4080
4080
4081
4081
4081
4081
4081
4081
4081
4081
4081
Like this, each elements repeats 9 times, which is not what I want.
I hope to use indices during my analysis, will use index as element during analysis. I want to access to the columns, and access to the each elements with all indices, in this 3d array. How can I do this?
A good practice in to leverage zip:
For example:
>>> a = [1,2,3]
>>> b = [4,5,6]
>>> for i in a:
... for j in b:
... print i, b
...
1 [4, 5, 6]
1 [4, 5, 6]
1 [4, 5, 6]
2 [4, 5, 6]
2 [4, 5, 6]
2 [4, 5, 6]
3 [4, 5, 6]
3 [4, 5, 6]
3 [4, 5, 6]
>>> for i,j in zip(a,b):
... print i,j
...
1 4
2 5
3 6
Unless you're using something like NumPy, Python doesn't have multi-dimensional arrays as such. Instead, the structure you've shown is a list of lists of lists of integers. (Your choice of inputfile as the variable name is confusing here; such a variable would usually contain a file handle, iterating over which would yield one string per line, but I digress...)
Unfortunately, I'm having trouble understanding exactly what you're trying to accomplish, but at one point, you seem to want a single list consisting of the first column of each row. That's as simple as:
column = [row[0] for block in inputfile for row in block]
Granted, this isn't really a column in the mathematical sense, but it might possibly perhaps be what you want.
Now, as to why your other attempts failed:
for i in range(len(inputfile)):
AAA[i] = [row[0] for row in inputfile]
print AAA[2]
As the error message states, AAA is not defined. Python doesn't let you assign to an index of an undefined variable, because it doesn't know whether that variable is supposed to be a list, a dict, or something more exotic. For lists in particular, it also doesn't let you assign to an index that doesn't yet exist; instead, the append or extend methods are used for that:
AAA = []
for i, block in enumerate(inputfile):
for j, row in enumerate(block):
AAA.append(row[0])
print AAA[2]
(However, that isn't quite as efficient as the list comprehension above.)
for i in range(len(inputfile)): ### number of datablock = 5
for j in range(len(inputfile)): ### number of lines per a datablock = 4081
AAA = inputfile[i][j] ### Store first column for each datablocks to AAA
print AAA[0] ### Working as I intended to access 1st column.
print AAA[0][1] ### Not working, invalid index to scalar variable. I can't access to the each elemnt.
There's an obvious problem with the range in the second line, and an inefficiency in looking up inputfile[i] multiple times, but the real problem is in the last line. At this point, AAA refers to one of the rows of one of the blocks; for example, on the first time through, given your dataset above,
AAA == [ 1 4 4 ..., 952 0 0]
It's a single list, with no references to the data structure as a whole. AAA[0] works to access the number in the first column, 1, because that's how lists operate. The second column of that row will be in AAA[1], and so on. But AAA[0][1] throws an error, because it's equivalent to (AAA[0])[1], which in this case is equal to (1)[1], but numbers can't be indexed. (What's the second element of the number 1?)
for i in range(len(inputfile)): ### number of datablock = 5
for j in range(len(inputfile[i])): ### number of lines per a datablock = 4081
for k in range(len(inputfile[i][j])): ### number of columns per line = 9
AAA = inputfile[i][j][0]
print AAA[0]
This time, your for loops, though still inefficient, are at least correct, if you want to iterate over every number in the whole data structure. At the bottom, you'll find that inputfile[i][j][k] is integer k in row j in block i of the data structure. However, you're throwing out k entirely, and printing the first element of the row, once for each item in the row. (The fact that it's repeated exactly as many times as you have columns should have been a clue.) And once again, you can't index any further once you get to the integers; there is no inputfile[i][j][0][0].
Granted, once you get to an element, you can look at nearby elements by changing the indexes. For example, a three-dimensional cellular automaton might want to look at each of its neighbors. With proper corrections for the edges of the data, and checks to ensure that each block and row are the right length (Python won't do that for you), that might look something like:
for i, block in enumerate(inputfile):
for j, row in enumerate(block):
for k, num in enumerate(row):
neighbors = sum(
inputfile[i][j][k-1],
inputfile[i][j][k+1],
inputfile[i][j-1][k],
inputfile[i][j+1][k],
inputfile[i-1][j][k],
inputfile[i+1][j][k],
)
alive = 3 <= neigbors <= 4

Categories