Pandas: join series based on category index

I have two pd.Series:
A
idx
200 1
300 2
400 3
with length n and
B
idx
200 4
350 5
360 6
370 7
380 8
with length m.
Note that the length of the Series can be different.
I would like to have a category index:
cat
[200, 300)
[300, 400)
[400, 500)
and perform a correlation between the following pd.Series:
A B
cat
[200, 300) 1 4
[300, 400) 2 5+6+7+8
[400, 500) 3 NaN
So how do I slot my data into the category index based on its index values, and sum the entries that fall into the same category?
I tried groupby, but I could not manage to group over the categories.
Thanks!

IIUC:
Data setup:
import pandas as pd

a = pd.Series(data=[1, 2, 3], index=[200, 300, 400])
b = pd.Series(data=[4, 5, 6, 7, 8], index=[200, 350, 360, 370, 380])
Convert each Series to a DataFrame and create a category column using pd.cut:
df_a = a.to_frame()
df_a['cat'] = pd.cut(df_a.index, bins=[0, 100, 200, 300, 400, 500, 600],
                     labels=['0-99', '100-199', '200-299', '300-399', '400-499', '500-599'])
df_b = b.to_frame()
df_b['cat'] = pd.cut(df_b.index, bins=[0, 100, 200, 300, 400, 500, 600],
                     labels=['0-99', '100-199', '200-299', '300-399', '400-499', '500-599'])
Group on cat, collect each group's values into a list, and combine with pd.concat:
group_b = df_b.groupby('cat')[0].apply(list)
group_b = group_b.where(group_b.str.len() > 0)  # mask empty bins as NaN
group_a = df_a.groupby('cat')[0].apply(list)
group_a = group_a.where(group_a.str.len() > 0)
pd.concat([group_a, group_b], axis=1, keys=['A', 'B'])
Output:
           A             B
cat
0-99     NaN           NaN
100-199  [1]           [4]
200-299  [2]           NaN
300-399  [3]  [5, 6, 7, 8]
400-499  NaN           NaN
500-599  NaN           NaN
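The question ultimately asks for sums over left-closed bins like [200, 300), rather than lists. A minimal sketch of that variant (the bin edges are assumed from the question; right=False produces the left-closed intervals):
import pandas as pd

a = pd.Series(data=[1, 2, 3], index=[200, 300, 400])
b = pd.Series(data=[4, 5, 6, 7, 8], index=[200, 350, 360, 370, 380])

edges = [200, 300, 400, 500]
# right=False yields the left-closed bins [200, 300), [300, 400), [400, 500)
cat_a = pd.cut(a.index, bins=edges, right=False)
cat_b = pd.cut(b.index, bins=edges, right=False)

# min_count=1 keeps empty bins as NaN instead of 0
res = pd.concat([a.groupby(cat_a).sum(min_count=1),
                 b.groupby(cat_b).sum(min_count=1)],
                axis=1, keys=['A', 'B'])
print(res)
#             A     B
# [200, 300)  1   4.0
# [300, 400)  2  26.0
# [400, 500)  3   NaN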

Related

How to apply pandas groupby to a dataframe to use both rows and columns when calculating a mean

I have a dataframe df in the format:
Grade Height Speed Value
0 A 13 0.1 500
1 B 25 0.3 100
2 C 54 0.6 200
And I am looking to group it so that Grade becomes the index, Height (split into buckets) becomes the columns, and the individual cells hold the average of Speed * Value for each Grade/Height-bucket combination.
So, the output dataframe would look something like this:
Height
Grade 0-10 10-25 25-50 50-100
A avg(speed*value) x x x
B x x x x
C x x x x
where the x's are the calculated mean speed*value.
I have attempted unsuccessfully with something like:
output = pd.DataFrame(data=df, index=df[df['Grade']], columns=df[df['Height']].groupby(pd.qcut(df['Height'], 3, duplicates='drop'))).groupby(df['Value']).mean()
but I can't quite figure out an approach that works without throwing errors or returning an empty df.
Would you have any ideas I can try out?
Use pd.cut to break your Height column into bins.
Create a new column of Speed * Value.
Pivot your table; mean is the default aggregation function.
dropna=False is used so that even empty bins are shown.
df.Height = pd.cut(df.Height, bins=[0, 10, 25, 50, 100])
df['speed_value'] = df.Speed.mul(df.Value)
out = df.pivot_table(index='Grade', columns='Height', values='speed_value', dropna=False)
print(out)
Output:
Height (0, 10] (10, 25] (25, 50] (50, 100]
Grade
A NaN 50.0 NaN NaN
B NaN 30.0 NaN NaN
C NaN NaN NaN 120.0
Use:
df['agg'] = pd.cut(df['Height'].astype(int), [0, 10, 25, 50, 100])
s = df.pivot_table(index='Grade', columns='agg', values=['Speed', 'Value'], dropna=False)
(s.apply(lambda x: [x[i] * x[i + 4] for i in range(4)], axis=1)
  .apply(pd.Series)
  .rename(columns={i: s.columns.get_level_values(1).categories[i] for i in range(4)}))
Output:
(0, 10] (10, 25] (25, 50] (50, 100]
Grade
A NaN 50.0 NaN NaN
B NaN 30.0 NaN NaN
C NaN NaN NaN 120.0

Creating a new dataframe off of duplicate indexes

I'm working in pandas and I have a dataframe X
idx
0
1
2
3
4
I want to create a new dataframe with the following indexes from this list. There are duplicate indexes because I want some rows to repeat.
idx = [0,0,1,2,3,2,4]
My expected output is
idx
0
0
1
2
3
2
4
I can't use
X.iloc[idx]
because of the duplicated indexes
Code I tried:
d = {'idx': [0,1,3,4]}
df = pd.DataFrame(data=d)
idx = [0,0,1,2,3,2,4]
df.iloc[idx] # errors here with IndexError: indices are out-of-bounds
What you want to do is weird, but here is one way to do it.
import pandas as pd

df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'ONE', 'TWO'])
Output:
A B C
ONE 11 12 13
ONE 21 22 23
TWO 31 32 33
Your current dataframe df:
idx
0 0
1 1
2 3
3 4
Now just use the reindex() method:
idx = [0,0,1,2,3,2,4]
df = df.reindex(idx)
Now if you print df you get:
idx
0 0.0
0 0.0
1 1.0
2 3.0
3 4.0
2 3.0
4 NaN
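For what it's worth, the IndexError in the question comes from position 4 being out of bounds for a 4-row frame, not from the duplicates; iloc happily accepts repeated positions as long as they are in range. A small sketch contrasting positional iloc with label-based reindex:
import pandas as pd

df = pd.DataFrame(data={'idx': [0, 1, 3, 4]})

# iloc selects by *position*; repeats are fine, but every position must be < len(df)
print(df.iloc[[0, 0, 1, 2, 3, 2]])

# reindex matches by *label* and fills labels that do not exist (here: 4) with NaN
print(df.reindex([0, 0, 1, 2, 3, 2, 4]))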

Pandas. Picking a column name based on row data

In my previous question, I was trying to count blanks and build a dataframe with new columns for subsequent analysis. That question grew too large, so I decided to split it up by purpose.
I have my initial dataset:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1000, 2000, 3000, 4000],
                   '201710': [7585, 4110, 4498, np.nan],
                   '201711': [7370, 3877, 4850, 4309],
                   '201712': [6505, np.nan, 4546, 4498],
                   '201801': [7473, np.nan, np.nan, 4850],
                   '201802': [6183, np.nan, np.nan, np.nan],
                   '201803': [6699, 4558, 1429, np.nan],
                   '201804': [118, 4152, 1429, np.nan],
                   '201805': [np.nan, 4271, 1960, np.nan],
                   '201806': [np.nan, np.nan, 1798, np.nan],
                   '201807': [np.nan, np.nan, 1612, 4361],
                   '201808': [np.nan, np.nan, 1612, 4272],
                   '201809': [np.nan, 3900, 1681, 4199],
                   })
I need to obtain the start and end dates of each run of data (each non-blank stretch) for each id.
I managed to get the first start date and the last end date, but not the ones in the middle. I then counted the blanks in each gap (for further analysis).
The code is here (it might look convoluted):
# to obtain the first and last occurrence with data
res = pd.melt(df, id_vars=['id'], value_vars=df.columns[1:])
res.dropna(subset=['value'], inplace=True)
res.sort_values(by=['id', 'variable', 'value'], ascending=[True, True, True], inplace=True)
minimum_date = res.drop_duplicates(subset=['id'], keep='first')
maximum_date = res.drop_duplicates(subset=['id'], keep='last')
minimum_date.rename(columns={'variable': 'start_date'}, inplace=True)
maximum_date.rename(columns={'variable': 'end_date'}, inplace=True)
# To obtain number of gaps (nulls) and their length
res2 = pd.melt(df, id_vars=['id'], value_vars=df.columns[1:])
res2.sort_values(by=['id', 'variable'], ascending=[True, True], inplace=True)
res2=res2.replace(np.nan, 0)
m = res2.value.diff().ne(0).cumsum().rename('gid')
gaps = res2.groupby(['id', m]).value.value_counts().loc[:, :, 0].droplevel(-1).reset_index()
# add columns to main dataset with start- and end dates and gaps
df = pd.merge(df, minimum_date[['id', 'start_date']], on=['id'], how='left')
df = pd.merge(df, maximum_date[['id', 'end_date']], on=['id'], how='left')
I arrived at a dataset like this, where start_date is the first non-null occurrence, end_date is the last non-null occurrence, and the 1-, 2-, 3- blank columns count the blanks in each gap for further analysis:
The output is intended to have additional columns:
Here is a function that may be helpful, IIUC.
import pandas as pd
# create test data
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})
Create a function to find start location, end location, and size of each 'group', where a group is a sequence of repeated values (e.g., NaNs):
def extract_nans(df, field):
    df = df.copy()
    # identify NaNs
    df['is_na'] = df[field].isna()
    # identify groups (a run of identical values is a group): X Y X => 3 groups
    df['group_id'] = (df['is_na'] ^ df['is_na'].shift(1)).cumsum()
    # how many members in this group?
    df['group_size'] = df.groupby('group_id')['group_id'].transform('size')
    # initial, final index of each group
    df['min_index'] = df.reset_index().groupby('group_id')['index'].transform('min')
    df['max_index'] = df.reset_index().groupby('group_id')['index'].transform('max')
    return df
Results:
summary = extract_nans(t, 'x')
print(summary)
x is_na group_id group_size min_index max_index
0 10.0 False 0 2 0 1
1 20.0 False 0 2 0 1
2 NaN True 1 3 2 4
3 NaN True 1 3 2 4
4 NaN True 1 3 2 4
5 30.0 False 2 4 5 8
6 40.0 False 2 4 5 8
7 50.0 False 2 4 5 8
8 60.0 False 2 4 5 8
9 NaN True 3 5 9 13
10 NaN True 3 5 9 13
11 NaN True 3 5 9 13
12 NaN True 3 5 9 13
13 NaN True 3 5 9 13
14 70.0 False 4 1 14 14
Now, you can exclude 'x' from the summary, drop duplicates, filter to keep only NaN values (is_na == True), filter to keep sequences above a certain length (e.g., at least 3 consecutive NaN values), etc. Then, if you drop duplicates, the first row will summarize the first NaN run, second row will summarize the second NaN run, etc.
Finally, you can use this to process the whole data frame, if that is what you need (see the sketch after the output below).
Short version of results, for the test data frame:
print(summary[summary['is_na']].drop(columns='x').drop_duplicates())
is_na group_id group_size min_index max_index
2 True 1 3 2 4
9 True 3 5 9 13
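To connect this back to the original wide df from the question, here is a sketch of running extract_nans per id after melting, via a groupby loop rather than apply() (the long_df/summaries names are hypothetical):
# melt to long form so each id has one row per month, as in the asker's own code
long_df = pd.melt(df, id_vars=['id'], value_vars=df.columns[1:])
long_df = long_df.sort_values(['id', 'variable'])

# run extract_nans separately on each id's monthly series
summaries = {gid: extract_nans(g.reset_index(drop=True), 'value')
             for gid, g in long_df.groupby('id')}
print(summaries[1000])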

omit groups in pandas groupby based on a condition

This is my dataframe:
df = pd.DataFrame({'sym': list('aaaaaabb'), 'key': [1, 1, 1, 1, 2, 2, 3, 3], 'x': [100, 100, 90, 100, 500, 500, 700, 700]})
I group them by key and sym:
groups = df.groupby(['key', 'sym'])
Now I want to check whether all x in each group are equal or not. If they are not, I want to drop that group from df. In this case that means omitting the first group (key=1, sym='a'), whose x values are not all equal.
This is my desired df:
sym key x
4 a 2 500
5 a 2 500
6 b 3 700
7 b 3 700
Use GroupBy.transform with SeriesGroupBy.nunique, compare the result with 1, and filter by boolean indexing:
df1 = df[df.groupby(['key', 'sym'])['x'].transform('nunique').eq(1)]
print(df1)
sym key x
4 a 2 500
5 a 2 500
6 b 3 700
7 b 3 700
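For comparison, the same result can be had with GroupBy.filter, which reads closer to the problem statement but is usually slower when there are many groups (a sketch):
# keep only the groups whose x values are all equal
df1 = df.groupby(['key', 'sym']).filter(lambda g: g['x'].nunique() == 1)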

Filling Pandas columns with lists of unequal lengths

I am having trouble filling Pandas dataframes with values from lists of unequal lengths.
nx_lists_into_df is a list of lists of numpy arrays.
I get the following error:
ValueError: Length of values does not match length of index
The code is below:
import pandas as pd
from numpy import array
from itertools import zip_longest

# Column headers
df_cols = ["f1", "f2"]
# Create one dataframe for each sheet
df1 = pd.DataFrame(columns=df_cols)
df2 = pd.DataFrame(columns=df_cols)
# Create list of dataframes to iterate through
df_list = [df1, df2]
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
# Loop through each sheet (i.e. each round of k folds)
for df, test_index_list in zip_longest(df_list, nx_lists_into_df):
    counter = -1
    # Loop through each column in that sheet (i.e. each fold)
    for col in df_cols:
        print(col)
        counter += 1
        # Add 1 to each index value to start indexing at 1
        df[col] = test_index_list[counter] + 1
Thank you for your help.
Edit: This is how the result should hopefully look:
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
We'll leverage pd.Series to attach an appropriate index to each array, which lets us use the pd.DataFrame constructor without it complaining about unequal lengths.
df1, df2 = (
    pd.DataFrame(dict(zip(df_cols, map(pd.Series, d))))
    for d in nx_lists_into_df
)
print(df1)
f1 f2
0 0 2.0
1 1 5.0
2 3 6.0
3 4 8.0
4 7 NaN
print(df2)
f1 f2
0 0 3.0
1 1 4.0
2 2 5.0
3 6 8.0
4 7 NaN
Setup
import pandas as pd
from numpy import array

nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
# Column headers
df_cols = ["f1", "f2"]
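In case the dict(zip(...)) one-liner above reads as dense, here is the same construction unpacked (a sketch with identical behavior):
frames = []
for d in nx_lists_into_df:
    # pd.Series gives each array its own RangeIndex; the DataFrame constructor
    # aligns the columns on that index, padding the shorter one with NaN
    frames.append(pd.DataFrame({col: pd.Series(arr)
                                for col, arr in zip(df_cols, d)}))
df1, df2 = frames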
You could predefine the size of your DataFrames by setting the index range to the length of the longest column you want to add (or any size bigger than that), like so:
df1 = pd.DataFrame(columns=df_cols, index=range(5))
df2 = pd.DataFrame(columns=df_cols, index=range(5))
print(df1)
f1 f2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
(df2 is the same)
The DataFrame will be filled with NaNs automatically.
Then you use .loc to access each entry separately like so:
for x in range(len(nx_lists_into_df)):
    for col_idx, y in enumerate(nx_lists_into_df[x]):
        df_list[x].loc[range(len(y)), df_cols[col_idx]] = y
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
The first loop iterates over the first dimension of your array (or the number of DataFrames you want to create).
The second loop iterates over the column values for the DataFrame, where y are the values for the current column and df_cols[col_idx] is the respective column (f1 or f2).
Since the row & col indices are the same size as y, you don't get the length mismatch.
Also check out the enumerate(iterable, start=0) function to get around those "counter" variables.
Hope this helps.
If I understand correctly, this is possible via pd.concat.
But see #pir's solution for an extendable version.
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
df1 = pd.concat([pd.DataFrame({'A': nx_lists_into_df[0][0]}),
                 pd.DataFrame({'B': nx_lists_into_df[0][1]})],
                axis=1)
# A B
# 0 0 2.0
# 1 1 5.0
# 2 3 6.0
# 3 4 8.0
# 4 7 NaN
df2 = pd.concat([pd.DataFrame({'C': nx_lists_into_df[1][0]}),
                 pd.DataFrame({'D': nx_lists_into_df[1][1]})],
                axis=1)
# C D
# 0 0 3.0
# 1 1 4.0
# 2 2 5.0
# 3 6 8.0
# 4 7 NaN
