Pandas crosstab matrix dot nansum - python

I'm looking for help creating a sub-dataframe from an existing dataframe using an np.nansum-like function. I want to convert this table into a matrix of non-null column sums:
     dan  ste  bob
t1   NaN    2  NaN
t2     2  NaN    1
t3     2    1  NaN
t4     1  NaN    2
t5   NaN    1    2
t6     2    1  NaN
t7     1  NaN    2
For example, when 'dan' is not null (rows t2, t3, t4, t6, t7) the sum of 'ste' is 2 and the sum of 'bob' is 5. When 'ste' is not null, the sum of 'dan' is 4.
     dan  ste  bob
dan    0    2    5
ste    4    0    2
bob    4    1    0
Any ideas?
Thanks in advance!
I ended up using a modified version of matt's function below:
def nansum_matrix_create(df):
    rows = []
    for col in list(df.columns.values):
        # Sum every column over the rows where this column is non-zero
        col_sums = df[df[col] != 0].sum()
        rows.append(col_sums)
    return pd.DataFrame(rows, columns=df.columns, index=df.columns)

Use pd.DataFrame.notnull to get where non-nulls are.
Then use pd.DataFrame.dot to get the crosstab.
Finally, use np.eye to zero out the diagonal.
df.notnull().T.dot(df.fillna(0)) * (1 - np.eye(df.shape[1]))
     dan  ste  bob
dan  0.0  2.0  5.0
ste  4.0  0.0  2.0
bob  4.0  1.0  0.0
Note:
I used this to ensure my values were numeric.
df = df.apply(pd.to_numeric, errors='coerce')
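To make the intermediate steps visible, here is a minimal expansion of the same one-liner, rebuilding the sample frame from the question:
import numpy as np
import pandas as pd
# Sample frame from the question; NaN marks the missing entries.
df = pd.DataFrame({"dan": [np.nan, 2, 2, 1, np.nan, 2, 1],
                   "ste": [2, np.nan, 1, np.nan, 1, 1, np.nan],
                   "bob": [np.nan, 1, np.nan, 2, 2, np.nan, 2]},
                  index=["t1", "t2", "t3", "t4", "t5", "t6", "t7"])[["dan", "ste", "bob"]]
mask = df.notnull()                  # True where a value is present
# Entry (i, j) of mask.T.dot(values) sums column j over the rows where column i is non-null.
crosstab = mask.T.dot(df.fillna(0))
crosstab *= 1 - np.eye(df.shape[1])  # zero out the diagonal
print(crosstab)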

Assuming your dataframe doesn't have a large number of columns, this function should do what you want and be fairly performant. I have implemented it with a for loop across the columns, so there may be a more performant / elegant solution out there.
import numpy as np
import pandas as pd
# Initialise dataframe (pd.np was removed in recent pandas, so use numpy directly)
data = {"dan": [np.nan, 2, 2, 1, np.nan, 2, 1],
        "ste": [2, np.nan, 1, np.nan, 1, 1, np.nan],
        "bob": [np.nan, 1, np.nan, 2, 2, np.nan, 2]}
df = pd.DataFrame(data)[["dan", "ste", "bob"]]
def matrix_create(df):
    rows = []
    for col in df.columns:
        subvals, index = [], []
        for subcol in df.columns:
            index.append(subcol)
            if subcol == col:
                # Zero on the diagonal
                subvals.append(0)
            else:
                # Sum subcol over the rows where col is non-null
                subvals.append(df[~pd.isnull(df[col])][subcol].sum())
        rows.append(subvals)
    return pd.DataFrame(rows, columns=df.columns, index=index)
matrix_create(df)

Related

Sort Dataframe by Descending Rows AND Columns at the Same Time

I currently have a dataframe that is countries by series, with values ranging from 0 to 25.
I want to sort the df so that the highest values appear in the top left (first), while the lowest appear in the bottom right (last).
FROM
      A  B   C   D  ...
USA   4  0  10  16
CHN   2  3  13  22
UK    2  1   8  14
...
TO
       D   C  A  B  ...
CHN   22  13  2  3
USA   16  10  4  0
UK    14   8  2  1
...
In this, the column with the highest values is now first, and the same is true with the index.
I have considered reindexing, but this loses the 'Countries' index.
    D   C  A  B  ...
0  22  13  2  3
1  16  10  4  0
2  14   8  2  1
...
I have thought about creating a new column and row holding the mean or sum of values for the respective column/row, but is this the most efficient way?
How would I then sort the df once I have the new rows/columns?
Is there a way to reindex using...
df_mv.reindex(df_mv.mean(or sum)().sort_values(ascending=False).index, axis=1)
... that would allow me to keep the country index, and simply sort it accordingly?
Thanks for any and all advice or assistance.
EDIT
Intended result organizes columns AND rows from largest to smallest.
Regarding the first row of the A and B columns in the intended output, these are supposed to be 2 and 3 respectively. This is because the intended result ranks the A column above the B column by both sum and mean (either sum or mean can be taken as the 'value' of a row/column).
By saying the higher numbers would be in the top left and the lower ones in the bottom right, I simply meant this as a general trend for the resulting df. It is the columns and rows as wholes, however, that are the intended focus. I apologize for the confusion.
You could use:
rows_index = df.max(axis=1).sort_values(ascending=False).index
col_index = df.max().sort_values(ascending=False).index
new_df = df.loc[rows_index, col_index]
print(new_df)
      D   C  A  B
CHN  22  13  2  3
USA  16  10  4  0
UK   14   8  2  1
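As an aside on the reindex idea from the question: reindex only relabels along the axis you pass, so a sum-based double reindex keeps the country index intact (a sketch, using sum as the ranking statistic):
# Columns ordered by column sums, then rows by row sums; the country labels survive
# because each reindex only reorders one axis.
df2 = df.reindex(df.sum().sort_values(ascending=False).index, axis=1)
df2 = df2.reindex(df2.sum(axis=1).sort_values(ascending=False).index, axis=0)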
Use .T to transpose rows to columns and vice versa:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.T
df = df.sort_values(df.columns[0], ascending=False).T
Result:
>>> df
      D   C  B  A
CHN  22  13  3  2
USA  16  10  0  4
UK   14   8  1  2
Here's another way, this time without transposing but using axis=1 as an argument:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.sort_values(df.index[0], axis=1, ascending=False)
Using numpy:
import numpy as np

arr = df.to_numpy()
# Order the rows by their row-wise max, descending.
arr = arr[np.max(arr, axis=1).argsort()[::-1], :]
# Sort each row's values, descending.
arr = np.sort(arr, axis=1)[:, ::-1]
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df1)
Output (note that the original row and column labels are kept, so they no longer correspond to the moved values):
      A   B  C  D
USA  22  13  3  2
CHN  16  10  4  0
UK   14   8  2  1

Python: For each unique ID, find its code and its value and calculate the ratio

The actual dataframe consists of more than a million rows. Say, for example, the dataframe is:
UniqueID  Code  Value  OtherData
1         A     5      Z01
1         B     6      Z02
1         C     7      Z03
2         A     10     Z11
2         B     11     Z24
2         C     12     Z23
3         A     10     Z21
4         B     8      Z10
I want to obtain ratio of A/B for each UniqueID and put it in a new dataframe. For example, for UniqueID 1, its ratio of A/B = 5/6.
What is the most efficient way to do this in Python?
Want:
UniqueID  RatioAB
1         5/6
2         10/11
3         Inf
4         0
Thank you.
One approach is using pivot_table, aggregating with sum in case there are multiple occurrences of the same letter (otherwise a simple pivot will do), and evaluating A/B on columns A and B:
df.pivot_table(index='UniqueID', columns='Code', values='Value', aggfunc='sum').eval('A/B')
UniqueID
1    0.833333
2    0.909091
3         NaN
4         NaN
dtype: float64
If there is maximum one occurrence of each letter per group:
df.pivot(index='UniqueID', columns='Code', values='Value').eval('A/B')
UniqueID
1    0.833333
2    0.909091
3         NaN
4         NaN
dtype: float64
If you only care about A/B ratio:
df1 = df[df['Code'].isin(['A', 'B'])][['UniqueID', 'Code', 'Value']]
df1 = df1.pivot(index='UniqueID',
                columns='Code',
                values='Value')
df1['RatioAB'] = df1['A'] / df1['B']
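If you want the literal Inf and 0 from the desired output rather than NaN for the IDs missing a code, one option (a sketch) is to fill the absent pivot cells with 0 before dividing:
# With missing codes filled as 0, A/B yields inf where B is absent and 0 where A is absent,
# matching the Want table above.
wide = df.pivot(index='UniqueID', columns='Code', values='Value').fillna(0)
ratios = (wide['A'] / wide['B']).rename('RatioAB').reset_index()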
The most apparent way is via groupby. Sum over the matching rows rather than selecting with .iloc[0], which would raise for groups that lack an A or a B (UniqueIDs 3 and 4 above):
df.groupby('UniqueID').apply(lambda g: g.loc[g['Code'] == 'A', 'Value'].sum() / g.loc[g['Code'] == 'B', 'Value'].sum())

User-defined function for detecting missing values in Python?

ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS
0 104.0 PUTNAM Y 3.0
1 197.0 LEXINGTON NaN NaN
2 NaN LEXINGTON N 3.0
3 201.0 BERKELEY NaN 1.0
4 203.0 BERKELEY Y NaN
This is my data frame. I want to create a user-defined function which returns a data frame showing the number of missing values per column and the row number of each missing value.
The output df should look like this:
col_name      index
st_num        2
st_num        6
st_name       8
Num_bedrooms  2
Num_bedrooms  5
Num_bedrooms  7
Num_bedrooms  8
...
You can slice the index by each column's isnull mask to get the indices. It is also possible with stacking and groupby.
def summarize_missing(df):
    # Null counts
    s1 = df.isnull().sum().rename('No. Missing')
    # Row indices of the nulls, per column
    s2 = pd.Series(data=[df.index[m].tolist() for m in [df[col].isnull() for col in df.columns]],
                   index=df.columns,
                   name='Index')
    # Other way, probably overkill
    #s2 = (df.isnull().replace(False, np.NaN).stack().reset_index()
    #        .groupby('level_1')['level_0'].agg(list)
    #        .rename('Index'))
    return pd.concat([s1, s2], axis=1, sort=False)
summarize_missing(df)
# No. Missing Index
#ST_NUM 1 [2]
#ST_NAME 0 NaN
#OWN_OCCUPIED 2 [1, 3]
#NUM_BEDROOMS 2 [1, 4]
Here's another way:
m = df.isna().sum().to_frame().rename(columns={0: 'No. Missing'})
m['index'] = m.index.map(lambda x: ','.join(map(str, df.loc[df[x].isna()].index.values)))
print(m)
              No. Missing  index
ST_NUM                  1      2
ST_NAME                 0
OWN_OCCUPIED            2    1,3
NUM_BEDROOMS            2    1,4
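For reference, a minimal stack-based sketch that produces the long (col_name, index) layout shown in the question, assuming the sample frame above:
# Stack the boolean mask; the True entries' MultiIndex holds (row, column) of each NaN.
out = (df.isnull().stack()
         .loc[lambda s: s]
         .reset_index()
         .rename(columns={'level_0': 'index', 'level_1': 'col_name'})
         [['col_name', 'index']])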

Dividing each row by the previous one

I have a pandas dataframe:
df = pd.DataFrame()
df['city'] = ['NY','NY','LA','LA']
df['hour'] = ['0','12','0','12']
df['value'] = [12,24,3,9]
  city hour  value
0   NY    0     12
1   NY   12     24
2   LA    0      3
3   LA   12      9
I want, for each city, to divide each row by the previous one and write the result into a new dataframe. The desired output is:
city  ratio
NY    2
LA    3
What's the most pythonic way to do this?
First, divide by the shifted values per group:
df['ratio'] = df['value'].div(df.groupby('city')['value'].shift(1))
print(df)
  city hour  value  ratio
0   NY    0     12    NaN
1   NY   12     24    2.0
2   LA    0      3    NaN
3   LA   12      9    3.0
Then remove the NaNs and keep only the city and ratio columns:
df = df.dropna(subset=['ratio'])[['city', 'ratio']]
print(df)
  city  ratio
1   NY    2.0
3   LA    3.0
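Since each city here has exactly two rows, an equivalent sketch is to divide each group's last value by its first (using the original frame from the question):
g = df.groupby('city')['value']
res = (g.last() / g.first()).rename('ratio').reset_index()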
You can use pct_change:
In [20]: df[['city']].assign(ratio=df.groupby('city').value.pct_change().add(1)).dropna()
Out[20]:
city ratio
1 NY 2.0
3 LA 3.0
This'll do it (valid here because each city has exactly two rows with the later value larger; note that dict-based renaming in agg was removed in pandas 1.0, so use named aggregation instead):
df.groupby('city')['value'].agg(ratio=lambda x: x.max() / x.min()).reset_index()
#   city  ratio
# 0   LA    3.0
# 1   NY    2.0
This is one way using a custom function. It assumes you want to ignore the NaN rows in the result of dividing one series by a shifted version of itself.
def divider(x):
    return x['value'] / x['value'].shift(1)

res = df.groupby('city').apply(divider)\
        .dropna().reset_index()\
        .rename(columns={'value': 'ratio'})\
        .loc[:, ['city', 'ratio']]
print(res)
  city  ratio
0   LA    3.0
1   NY    2.0
One way is:
df.groupby('city').apply(lambda x: x['value'] / x['value'].shift(1))
For further improvement:
print(df.groupby('city')
        .apply(lambda x: (x['value'] / x['value'].shift(1)).bfill())
        .reset_index()
        .drop_duplicates(subset=['city'])
        .drop('level_1', axis=1))
  city  value
0   LA    3.0
2   NY    2.0

Using `rank` on a pandas DataFrameGroupBy object

I have some simple data in a dataframe consisting of three columns [id, country, volume] where the index is 'id'.
I can perform simple operations like:
df_vol.groupby('country').sum()
and it works as expected. When I attempt to use rank() it does not work as expected and the result is an empty dataframe.
df_vol.groupby('country').rank()
The result is not consistent, and in some cases it works. The following also works as expected:
df_vol.rank()
I want to return something like:
vols = []
for _, df in df_vol.groupby('country'):
    vols.append(df['volume'].rank())
pd.concat(vols)
Any ideas why would be much appreciated!
You can select the column with [] so that the function is called only on the volume column:
df_vol.groupby('country')['volume'].rank()
Sample:
df_vol = pd.DataFrame({'country': ['en', 'us', 'us', 'en', 'en'],
                       'volume': [10, 10, 30, 20, 50],
                       'id': [1, 1, 1, 2, 2]})
print(df_vol)
  country  id  volume
0      en   1      10
1      us   1      10
2      us   1      30
3      en   2      20
4      en   2      50
df_vol['r'] = df_vol.groupby('country')['volume'].rank()
print(df_vol)
  country  id  volume    r
0      en   1      10  1.0
1      us   1      10  1.0
2      us   1      30  2.0
3      en   2      20  2.0
4      en   2      50  3.0
