I am using matrix multiplication on a DataFrame and its transpose with df @ df.T.
So if I have a df that looks like the following (1 indicates that the object has the property, 0 indicates not having it):
Object Property1 Property2 Property3
A 1 1 1
B 0 1 1
C 1 0 0
Using df @ df.T gives me:
A B C
A 3 2 1
B 2 2 0
C 1 0 1
This can be thought of as a matrix showing how many properties each object has in common with another.
I now want to modify the problem: instead of a binary indication of whether an object has a property, the property columns show levels of that property. So the new df looks like this (the values 1, 2, 3 indicate a property's level, while 0 still indicates not having the property):
Object Property1 Property2 Property3
A 3 2 1
B 0 2 3
C 2 0 0
I want to apply matrix multiplication, but with an altered definition of 'common' properties: two objects have a property in common only if their levels for that property are within +-1 of each other.
Below is what the result will look like:
A B C
A 3 1 1
B 1 2 0
C 1 0 1
Note that the number of properties common to A and B has changed from 2 to 1. This is because property 3 for A and B is not within the +-1 range. Also, 0 still means that the object does not have the property, so A and C still have 1 property in common (property 3 for C is 0, so it does not count).
How can I achieve this in Python?
This can be done by modifying matrix multiplication for two DataFrames.
Code
import pandas as pd

# DataFrame matrix multiplication,
# i.e. equivalent to df1 @ df2
def df_multiply(df_a, df_b):
    '''
    Matrix multiplication of the values in two DataFrames.
    Returns a DataFrame whose index and columns are
    taken from df_a.
    '''
    a = df_a.values
    b = df_b.values
    # columns of b, so each dot product pairs a row of a with a column of b
    zip_b = list(zip(*b))
    result = [[sum(ele_a * ele_b for ele_a, ele_b in zip(row_a, col_b))
               for col_b in zip_b] for row_a in a]
    return pd.DataFrame(data=result, index=df_a.index, columns=df_a.index)
# Modify df_multiply for the desired result
def df_multiply_modified(df_a, df_b):
    '''
    Modified matrix multiplication of the values in two DataFrames
    to create the desired result.
    Returns a DataFrame whose index and columns are
    taken from df_a.
    '''
    a = df_a.values
    b = df_b.values
    zip_b = list(zip(*b))
    # sum 1 when the difference is <= 1 and both values are non-zero,
    # i.e. ele_a and ele_b and abs(ele_a - ele_b) <= 1
    result = [[sum(1 if ele_a and ele_b and abs(ele_a - ele_b) <= 1 else 0
                   for ele_a, ele_b in zip(row_a, col_b))
               for col_b in zip_b] for row_a in a]
    return pd.DataFrame(data=result, index=df_a.index, columns=df_a.index)
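For larger frames, the same pairwise rule can be written with NumPy broadcasting instead of Python-level loops. A minimal sketch (the function name df_multiply_modified_np is my own):
import numpy as np
import pandas as pd

def df_multiply_modified_np(df):
    # a[:, None, :] vs a[None, :, :] compares every pair of rows at once
    a = df.values
    both_nonzero = (a[:, None, :] != 0) & (a[None, :, :] != 0)
    within_one = np.abs(a[:, None, :] - a[None, :, :]) <= 1
    result = (both_nonzero & within_one).sum(axis=2)
    return pd.DataFrame(result, index=df.index, columns=df.index)
This should match df_multiply_modified(df, df.T) on the examples below.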
Usage
Original Multiplication
df = pd.DataFrame({'Object': ['A', 'B', 'C'],
                   'Property1': [1, 0, 1],
                   'Property2': [1, 1, 0],
                   'Property3': [1, 1, 0]})
df.set_index('Object', inplace=True)
print(df_multiply(df, df.T))
# Output (same as df @ df.T):
Object A B C
Object
A 3 2 1
B 2 2 0
C 1 0 1
Modified Multiplication
# Use df_multiply_modified
df = pd.DataFrame({'Object': ['A', 'B', 'C'],
                   'Property1': [3, 0, 2],
                   'Property2': [2, 2, 0],
                   'Property3': [1, 3, 0]})
df.set_index('Object', inplace=True)
print(df_multiply_modified(df, df.T))
# Output (same as desired):
Object A B C
Object
A 3 1 1
B 1 2 0
C 1 0 1
I'm a complete newbie at pandas, so a simpler (though maybe not the most efficient or elegant) solution is appreciated. I don't mind a bit of brute force if it helps me understand the answer better.
If I have the following Dataframe:
A B C
0 0 1
0 1 1
I want to loop through columns "A", "B" and "C" in that order, and during each iteration select all the rows for which the current column is 1 and none of the previous columns are, saving the result and also using it in the next iteration.
So when looking at column A, I wouldn't select anything. Then when looking at column B I would select the second row because B==1 and A==0. Then when looking at column C I would select the first row because A==0 and B==0.
Create a boolean mask:
m = (df == 1) & (df.cumsum(axis=1) == 1)
d = {col: df[m[col]].index.tolist() for col in df.columns if m[col].sum()}
Output:
>>> m
A B C
0 False False True
1 False True False
2 False False True
>>> d
{'B': [1], 'C': [0, 2]}
I slightly modified your dataframe:
>>> df
A B C
0 0 0 1
1 0 1 1
2 0 0 1
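To see why the mask works: df.cumsum(axis=1) equals 1 exactly at the first 1 in each row, so combining it with (df == 1) keeps only that first 1. On this sample:
>>> df.cumsum(axis=1)
A B C
0 0 0 1
1 0 1 2
2 0 0 1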
Update
For the expected output on my sample:
for col in df.columns:
    if m[col].sum():
        print(f"\n=== {col} ===")
        print(df[m[col]])
Output:
=== B ===
A B C
1 0 1 1
=== C ===
A B C
0 0 0 1
2 0 0 1
Seems like you need a direct use of idxmax
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
>>> df.idxmax()
A 0
B 1
C 0
dtype: int64
The values above are the indexes for which your constraints are met: 1 for B means that the second row was "selected"; 0 for C means the first row was. The only issue is that, if nothing is found, it will also return 0.
To address that, you can use where:
>>> df.idxmax().where(~df.eq(0).all())
A NaN
B 1.0
C 0.0
dtype: float64
This makes sure that NaNs are returned for all-zero columns (like A here).
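If you then want to drop the all-zero columns and recover integer row indices, a small follow-up (my addition):
>>> df.idxmax().where(~df.eq(0).all()).dropna().astype(int)
B 1
C 0
dtype: int64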
I am trying to get the max value of one column, according to the cumulative sum of a different column. My dataframe looks like this:
df = pd.DataFrame({'constant': ['a', 'b', 'b', 'c', 'c', 'd', 'a'], 'value': [1, 3, 1, 5, 1, 9, 2]})
indx constant value
0 a 1
1 b 3
2 b 1
3 c 5
4 c 1
5 d 9
6 a 2
I am trying to add a new field with the constant that has the highest cumulative sum of value up to that point in the dataframe. The final dataframe would look like this:
indx constant value new_field
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
As you can see, at index 1, a has the highest cumulative sum of value for all prior rows. At index 2, b has the highest cumulative sum of value for all prior rows, and so on.
Anyone have a solution?
As presented, you just need a shift. However, try the following for other scenarios.
Steps
Find the cumulative maximum
Where the cumulative max is equal to df['value'], copy the 'constant'; otherwise make it NaN
The NaNs leave ffill a chance to broadcast the constant corresponding to the max value
Outcome
import numpy as np

df = df.assign(new_field=np.where(df['value'] == df['value'].cummax(), df['constant'], np.nan)).ffill()
df = df.assign(new_field=df['new_field'].shift())
constant value new_field
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
I think you should try to approach this as a pivot table, which allows you to use np.argmax over the column axis.
# this tracks the cumulative sum of `value` over the index for each value of `constant`
X = df.pivot_table(
    index=df.index,
    columns=['constant'],
    values='value'
).fillna(0.0).cumsum(axis=0)
# now you get, per row, the index that maximizes the cumulative value over the column axis, i.e. the "winner"
colix = np.argmax(X.values, axis=1)
# you can fetch the corresponding column names using this argmax index
df['winner'] = np.r_[[np.nan], X.columns[colix].values[:-1]]
# and there you go
df
constant value winner
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
You should be a little more careful (since values can be negative, which decreases the cumsum); here is what you probably need to do:
import numpy as np

df["cumsum"] = df["value"].cumsum()
df["cummax"] = df["cumsum"].cummax()
df["new"] = np.where(df["cumsum"] == df["cummax"], df['constant'], np.nan)
df["new"] = df.ffill()["new"].shift()
df
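On the sample frame, the intermediate columns work out as follows (the helper columns cumsum and cummax can be dropped afterwards):
constant value cumsum cummax new
0 a 1 1 1 NaN
1 b 3 4 4 a
2 b 1 5 5 b
3 c 5 10 10 b
4 c 1 11 11 c
5 d 9 20 20 c
6 a 2 22 22 d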
I have this table1:
A B C D
0 1 2 k l
1 3 4 e r
df.dtypes gets me this:
A int64
B int64
C object
D object
Now, I want to create a table2 which only includes objects (columns C and D) using this command: table2 = df.select_dtypes(include=[object]).
Then, I want to encode table2 using this command: pd.get_dummies(table2).
It gives me this table2:
C D
0 0 1
1 1 0
The last thing I want to do is append both tables together (table 1 + table 2), so that the final table looks like this:
A B C D
0 1 2 0 1
1 3 4 1 0
Can somebody help?
This should do it:
table2 = df.select_dtypes(include=[object])
table1.select_dtypes(include=[int]).join(table2.apply(lambda x: pd.factorize(x, sort=True)[0]))
It first factorizes the object-typed columns of table2 (instead of using the dummies generator) and then joins the result back to the int-typed columns of the original dataframe.
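As a quick check of what the factorize step produces on the question's data (a sketch; note factorize with sort=True assigns codes in sorted order, so here e maps to 0 and k to 1):
import pandas as pd

df = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'C': ['k', 'e'], 'D': ['l', 'r']})
table2 = df.select_dtypes(include=[object])
encoded = table2.apply(lambda x: pd.factorize(x, sort=True)[0])
print(df.select_dtypes(include=[int]).join(encoded))
#    A  B  C  D
# 0  1  2  1  0
# 1  3  4  0  1
If the opposite 0/1 mapping is needed, the codes can be flipped with 1 - encoded.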
Assuming what you're trying to do, from the question, is to have a value of 1 replace e in column C and a value of 1 replace l in column D. Otherwise, as mentioned elsewhere, there will be a column for each response possibility.
df = pd.DataFrame({'A': [1,2], 'B': [2,4], 'C': ['k','e'], 'D': ['l','r']})
df
A B C D
0 1 2 k l
1 2 4 e r
df.dtypes
A int64
B int64
C object
D object
dtype: object
Now, if you want to drop the e and l dummies because you want k-1 columns for k categories, you can use the drop_first argument.
df = pd.get_dummies(df, drop_first = True)
df
A B C_k D_r
0 1 2 1 0
1 2 4 0 1
Note that the dtypes are not int64 like columns A and B.
df.dtypes
A int64
B int64
C_k uint8
D_r uint8
dtype: object
If it's important that they are the same type, you can of course change those as appropriate. In the general case, you may want to keep names like C_k and D_r so you know what the dummies correspond to. If not, you can always rename based on the '_' (the default prefix_sep of get_dummies), building the rename dictionary by splitting out the part of the column name before the separator, as in the sketch after the example below. Or, for a simple case like this:
df.rename({'C_k': 'C', 'D_r': 'D'}, axis = 1, inplace = True)
df
A B C D
0 1 2 1 0
1 2 4 0 1
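For the general case mentioned above, the rename mapping can be built programmatically by splitting on the '_' separator. A sketch, assuming exactly one dummy column per original column:
rename_map = {c: c.split('_')[0] for c in df.columns if '_' in c}
df = df.rename(columns=rename_map)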
Background: I have a matrix which represents the distance between two points. In this matrix both rows and columns are the data points. For example:
A B C
A 0 999 3
B 999 0 999
C 3 999 0
In this toy example let's say I want to drop C for some reason, because it is far away from any other point. So I first aggregate the count:
df["far_count"] = df[df == 999].count()
and then batch remove them:
df = df[df["far_count"] == 2]
In this example this looks a bit redundant but please imagine that I have many data points like this (say in the order of 10Ks)
The problem with the above batch removal is that I would like to remove rows and columns at the same time (instead of just rows), and it is unclear to me how to do so elegantly. A naive way is to get a list of such data points, put it in a loop, and then:
for item in items:
    df = df.drop(item, axis=1).drop(item, axis=0)
But I was wondering if there is a better way. (Bonus if we could skip the intermediate step far_count.)
np.random.seed([3, 14159])
idx = pd.Index(list('ABCDE'))
a = np.random.randint(3, size=(5, 5))
df = pd.DataFrame(
    a.T.dot(a) * (1 - np.eye(5, dtype=int)),
    idx, idx)
df
A B C D E
A 0 4 2 4 2
B 4 0 1 5 2
C 2 1 0 2 6
D 4 5 2 0 3
E 2 2 6 3 0
l = ['A', 'C']
m = df.index.isin(l)
df.loc[~m, ~m]
B D E
B 0 5 2
D 5 0 3
E 2 3 0
For your specific case, because the matrix is symmetric, you only need to check one dimension:
m = (df.values == 999).sum(0) == len(df) - 1
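Putting that together on the 3x3 toy matrix, a sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 999, 3], [999, 0, 999], [3, 999, 0]],
                  index=list('ABC'), columns=list('ABC'))
# a point is "far" if its distance to every other point is 999
m = (df.values == 999).sum(0) == len(df) - 1
print(df.loc[~m, ~m])
#    A  C
# A  0  3
# C  3  0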
In [66]: x = pd.DataFrame(np.triu(df), df.index, df.columns)
In [67]: x
Out[67]:
A B C
A 0 999 3
B 0 0 999
C 0 0 0
In [68]: mask = x.ne(999).all(1) | x.ne(999).all(0)
In [69]: df.loc[mask, mask]
Out[69]:
A C
A 0 3
C 3 0
I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always be in the dataframe, but sometimes there could be column B, columns B and C, or any number of other columns.
I have created code to save the column names (other than A) in a list, as well as the unique permutations of the values in the other columns. For instance, in this example, columns B and C are saved into col:
col = ['B','C']
The permutations in the simple df are 1,7; 2,8; and 3,9. For simplicity, assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those) that equal that permutation?
Right now, I am using:
df[df[col].isin(permutation)]
Unfortunately, I don't get the values in column A.
(I know how to drop the NaN values later.) But how should I do this to keep it dynamic? Sometimes there will be multiple columns. (Ultimately, I'll run through a loop, based upon multiple permutations in the columns other than A, and save the different iterations.)
Use the intersection of boolean series (where both conditions are true). First, some setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
    if not columns or not values:
        raise Exception('must pass columns and values')
    if len(columns) != len(values):
        raise Exception('columns and values must be same length')
    intersection = True
    for c, v in zip(columns, values):
        intersection &= df[c] == v
    return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np

def select_rows(df, columns, values):
    return df[(df[columns] == np.array(values)).all(axis=1)]
But this does not work with your code sample as given.
I figured out a solution. Aaron's answer above works well if I only have two columns, but I need a solution that works regardless of the size of the df (as it will have 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
You can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
Your solution will not always work properly.
Sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
Pay attention to the result:
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2
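The reason isin misbehaves here is that it only checks element-wise membership: any row whose values all appear somewhere in the permutation list passes, regardless of position. A minimal illustration:
row = pd.DataFrame({'a': [2], 'b': [0], 'c': [0]})
row.isin([0, 2, 2])        # all True, even though (2, 0, 0) != (0, 2, 2)
(row == [0, 2, 2]).all(1)  # False: position-aware comparison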