How to shift the values of a certain group by different amounts - python

I have a DataFrame that looks like this:
user data
0 Kevin 1
1 Kevin 3
2 Sara 5
3 Kevin 23
...
And I want to get the historical values (looking let's say 2 entries forward) as rows:
user data data_1 data_2
0 Kevin 1 3 23
1 Sara 5 24 NaN
2 Kim ...
...
Right now I'm able to do this through the following command:
_temp = df.groupby(['user'], as_index=False)['data']
for i in range(1, 3):
    df['data_{0}'.format(i)] = _temp.shift(-i)
I feel like my approach is very inefficient and that there is a much faster way to do this (especially as the number of lookahead/lookback values goes up)!

You can use groupby.cumcount() with set_index() and unstack():
m=df.assign(k=df.groupby('user').cumcount().astype(str)).set_index(['user','k']).unstack()
m.columns=m.columns.map('_'.join)
print(m)
data_0 data_1 data_2
user
Kevin 1.0 3.0 23.0
Sara 5.0 NaN NaN
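For reference, a minimal end-to-end sketch of that approach, using a small frame like the one in the question (nothing here beyond the answer's own calls):
import pandas as pd

df = pd.DataFrame({'user': ['Kevin', 'Kevin', 'Sara', 'Kevin'],
                   'data': [1, 3, 5, 23]})

# number each row within its user group, then pivot those positions into columns
m = (df.assign(k=df.groupby('user').cumcount().astype(str))
       .set_index(['user', 'k'])
       .unstack())

# flatten the MultiIndex columns to data_0, data_1, data_2
m.columns = m.columns.map('_'.join)
print(m.reset_index())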

Related

Why is pd.crosstab not giving the expected output in Python pandas?

I have two dataframes, which I am calling df1 and df2.
df1 has the columns KPI and Context and looks like this.
KPI Context
0 Does the company have a policy in place to man... Anti-Bribery Policy\nBroadridge does not toler...
1 Does the company have a supplier code of conduct? Vendor Code of Conduct Our vendors play an imp...
2 Does the company have a grievance/complaint ha... If you ever have a question or wish to report ...
3 Does the company have a human rights policy ? Human Rights Statement of Commitment Broadridg...
4 Does the company have a policies consistent wi... Anti-Bribery Policy\nBroadridge does not toler...
df2 has a single column 'keyword'
df2:
Keyword
0 1.5 degree
1 1.5°
2 2 degree
3 2°
4 accident
I want to create another dataframe out of these two: if a particular value from the 'Keyword' column of df2 is present in the 'Context' of df1, simply write the count of it.
I used pd.crosstab() for this, but I suspect it's not giving me the expected output.
Here's what I have tried so far.
new_df = df1.explode('Context')
new_df1 = df2.explode('Keyword')
new_df = pd.crosstab(new_df['KPI'], new_df1['Keyword'], values=new_df['Context'], aggfunc='count').reset_index().rename_axis(columns=None)
print(new_df.head())
the new_df looks like this.
KPI 1.5 degree 1.5° \
0 Does the Supplier code of conduct cover one or... NaN NaN
1 Does the companies have sites/operations locat... NaN NaN
2 Does the company have a due diligence process ... NaN NaN
3 Does the company have a grievance/complaint ha... NaN NaN
4 Does the company have a grievance/complaint ha... NaN NaN
2 degree 2° accident
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 1.0 NaN NaN
4 NaN NaN NaN
The expected output which I want is something like this.
0 KPI 1.5 degree 1.5° 2 degree 2° accident
1 Does the company have a policy in place to man 44 2 3 5 9
What exactly am I missing? Please let me know, thanks!
There are multiple problems here. First, explode works with list-like (already split) values, not with whole strings. Then, extracting the Keyword matches from Context needs Series.str.findall, and crosstab should use columns from the same DataFrame, not two different ones:
import re

# regex alternation of all keywords, each matched as a whole word
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
# find every keyword occurrence in each Context, case-insensitively
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
# one row per (KPI, matched keyword), then count the pairs
new_df = df1.explode('new')
out = pd.crosstab(new_df['KPI'], new_df['new'])
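A minimal self-contained sketch of the same idea, with made-up rows (the KPI and Context text below is illustrative, not from the question's data):
import re
import pandas as pd

df1 = pd.DataFrame({'KPI': ['climate policy?', 'supplier code of conduct?'],
                    'Context': ['We target 2 degree scenarios and track accident rates.',
                                'Vendors must follow our code of conduct.']})
df2 = pd.DataFrame({'Keyword': ['2 degree', 'accident']})

pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
new_df = df1.explode('new')
print(pd.crosstab(new_df['KPI'], new_df['new']))
Only KPIs with at least one match appear in the result, because crosstab ignores the NaN keys produced for rows with no hits.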

Dynamic filtering/masking in Pandas

I have a pandas data frame containing employee information like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Name': ['Joe', 'Henry', 'Sam', 'Max'],
    'Salary': [70000, 80000, 60000, 90000],
    'ManagerId': [3, 4, np.nan, np.nan]
})
Id Name Salary ManagerId
0 1 Joe 70000 3.0
1 2 Henry 80000 4.0
2 3 Sam 60000 NaN
3 4 Max 90000 NaN
What I need to do is filter the employees whose salary exceeds their manager's (in this case Joe, since his salary is larger than that of his manager, Sam).
0 1 Joe 70000 3.0
Because of the relation between Id and ManagerId, I think I could use loops as a last resort, but that seems very manual and looks ugly too. I wonder if I can do this with masking. As a beginner, I can only do simple masking where the condition is static, like filtering employees whose salary exceeds 60000. But in this case, the condition is different for each employee.
I have no idea what this technique is called, so I just made up the title.
Thanks for any help.
The idea is to map each employee's ManagerId to that manager's Salary via Id, then compare with greater-than and filter:
df = df[df['Salary'].gt(df['ManagerId'].map(df.set_index('Id')['Salary']))]
print(df)
Id Name Salary ManagerId
0 1 Joe 70000 3.0
Details:
print(df['ManagerId'].map(df.set_index('Id')['Salary']))
0 60000.0
1 90000.0
2 NaN
3 NaN
Name: ManagerId, dtype: float64
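Putting it together as a runnable sketch of that same map-and-compare technique (the intermediate names are just for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4],
                   'Name': ['Joe', 'Henry', 'Sam', 'Max'],
                   'Salary': [70000, 80000, 60000, 90000],
                   'ManagerId': [3, 4, np.nan, np.nan]})

# salary keyed by employee Id
salary_by_id = df.set_index('Id')['Salary']

# each employee's manager's salary (NaN where there is no manager)
manager_salary = df['ManagerId'].map(salary_by_id)

# comparisons against NaN are False, so employees without a manager drop out
print(df[df['Salary'].gt(manager_salary)])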

How to remove rows in a Python data frame with a condition?

I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains a string, that row should be deleted from the data frame.
df[['Leaves','Salary','Performance']].apply(pd.to_numeric, errors='coerce')
but this only converts the bad values to NaN.
Let's start with a note concerning your sample data: it contains Nan strings, which are not among the strings automatically recognized as NaNs. To treat them as NaN, I read the source text with read_fwf, passing na_values=['Nan'].
Now, let's get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values. You also accept a cell if it contains only the word unknown, but you don't accept a cell if that word is enclosed in e.g. quotes.
If you change your mind about what is / is not acceptable, change the above function accordingly.
Then, to leave only rows with all acceptable values in all 3 mentioned
columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
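As a self-contained sketch of that filter (the frame below is made up; the acceptance rules are the ones described above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Leaves': ['20', 'nn', '6'],
                   'Salary': ['3000.0', '1800.1', 'unknown'],
                   'Performance': ['56.6', '58.3', np.nan]})

def isAcceptable(cell):
    # NaN and the bare word 'unknown' are acceptable
    if pd.isna(cell) or cell == 'unknown':
        return True
    # otherwise only digits and a decimal point are allowed
    return all(c.isdigit() or c == '.' for c in cell)

cols = ['Leaves', 'Salary', 'Performance']
print(df[df[cols].applymap(isAcceptable).all(axis=1)])
# keeps rows 0 and 2; row 1 is dropped because of the non-numeric 'nn'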

Pandas - Merge rows on column A, taking first values from each column B, C, etc.

I have a dataframe, with recordings of statistics in multiple columns.
I have a list of the column names: stat_columns = ['Height', 'Speed'].
I want to combine the data to get one row per id.
The data comes sorted with the newest records on the top. I want the most recent data, so I must use the first value of each column, by id.
My dataframe looks like this:
Index id Height Speed
0 100007 8.3
1 100007 54
2 100007 8.6
3 100007 52
4 100035 39
5 100014 44
6 100035 5.6
And I want it to look like this:
Index id Height Speed
0 100007 54 8.3
1 100014 44
2 100035 39 5.6
I have tried a simple groupby myself:
df_stats = df_path.groupby(['id'], as_index=False).first()
But this seems to only give me a row with the first statistic found.
Your solution works for me; it may just be necessary to replace the empty values with NaN first:
df_stats = df_path.replace('',np.nan).groupby('id', as_index=False).first()
print (df_stats)
id Index Height Speed
0 100007 0 54.0 8.3
1 100014 5 44.0 NaN
2 100035 4 39.0 5.6
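A minimal runnable sketch of that (assuming, as the answer does, that the blanks in the frame are empty strings; the Index column from the question is omitted here):
import numpy as np
import pandas as pd

df_path = pd.DataFrame({'id': [100007, 100007, 100007, 100007, 100035, 100014, 100035],
                        'Height': ['', 54, '', 52, 39, 44, ''],
                        'Speed': [8.3, '', 8.6, '', '', '', 5.6]})

# first() takes the first non-NaN value per column within each group,
# so the empty cells have to become NaN before grouping
df_stats = df_path.replace('', np.nan).groupby('id', as_index=False).first()
print(df_stats)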

Is there an "ungroup by" operation opposite to .groupby in pandas?

Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() ...
group_df = df.groupby('family')
group_df = group_df.aggregate({'name': name_join, 'age': 'mean'})
Then do some aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
    return concat.join(list_names)
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples, I don't care for the information I am losing after averaging in this specific example)
The way I thought I could do it does not look too efficient:
create empty dataframe
from every line in group_df, separate the names
return a dataframe with as many rows as there are names in the starting row
append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby().
You are splitting a string into pieces, and maintaining each piece's association with 'family'. This old answer of mine does the job.
Just set 'family' as the index column first, refer to the link above, and then reset_index() at the end to get your desired result.
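A hedged sketch of what that looks like here, with explode standing in for the linked answer's split step (names rebuilt from the joined string, the group-level age repeated per name):
# group_df is the aggregated frame with 'family' as its index
ungrouped = (group_df.assign(name=group_df['name'].str.split('-'))
                     .explode('name')
                     .reset_index()[['name', 'age', 'family']])
print(ungrouped)   # five rows again, mean age repeated within each family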
It turns out that DataFrame.groupby() returns an object with the original data stored in its obj attribute. So ungrouping is just pulling out the original data.
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
import pandas

def name_join(list_names, concat='-'):
    return concat.join(list_names)

print('create dataframe\n')
df = pandas.DataFrame({'name': ['john', 'jason', 'jane', 'jack', 'james'], 'age': [1, 36, 32, 26, 30], 'family': [1, 1, 1, 2, 2]})
df.index.name='indexer'
print(df)
print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)
print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby; one is DataFrame.groupby(...).filter(lambda x: True), which gets you back to the original DataFrame.
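A quick sketch of both of those "ungroup" routes on toy data:
import pandas as pd

df = pd.DataFrame({'name': ['john', 'jason', 'jane', 'jack', 'james'],
                   'age': [1, 36, 32, 26, 30],
                   'family': [1, 1, 1, 2, 2]})
g = df.groupby('family')

# .obj holds the original frame the GroupBy was built from
print(g.obj.equals(df))                     # True

# filter(lambda x: True) keeps every group, returning the original rows
print(g.filter(lambda x: True).equals(df))  # True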
