Replacing and mapping string values in a pandas dataframe - python

Hi, I've been trying to replace string values in a dataframe (the strings are abbreviations of NFL teams). I have something like this:
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 1 Phi Atl Phi Phi Phi
1 2 Bal Bal Bal Buf Bal
2 3 Ind Ind Cin Cin Ind
3 4 NE NE Hou NE NE
4 5 Jax Jax NYG NYG NYG
and a Dataframe with the mapping, something like this:
Index TEAM_YH TeamID
0 ARI 1
1 ATL 2
2 BAL 3
...
31 WAS 32
I want to replace every string with the TeamID so I can compute basic statistics (frequencies). I've tried the following:
## Dataframe with strings and Team ID
dfDicTeams = dfTeams[['TEAM_YH','TeamID']].to_dict('dict')
## Dataframe with selections by users
dfW1.replace(dfDicTeams[['TEAM_YH']],dfDicTeams[['TeamID']]) ## Error: unhashable type: 'list'
dfW1.replace(dfDicTeams) ## Error: Replacement not allowed with overlapping keys and values
What am I doing wrong? Is it possible to do this?
I'm using Python 3, and I want something like this:
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 1 26 2 26 26 26
1 2 3 3 3 4 3
2 3 14 14 7 7 14
3 4 21 21 13 21 21
4 5 15 15 23 23 23
to aggregate the options:
IDMatch ATeam Count HTeam Count
1 26 4 2 1
2 3 4 4 1
3 14 3 7 2
4 21 4 13 1
5 15 2 23 3

Given a main input dataframe df and a mapping dataframe df_map, you can create a mapping Series and then use pd.DataFrame.applymap with a custom function:
s = df_map.set_index('TEAM_YH')['TeamID']  # Series mapping: team abbreviation -> TeamID
df.iloc[:, 2:] = df.iloc[:, 2:].applymap(lambda x: s.get(x.upper(), -1))  # -1 for abbreviations missing from the map
print(df)
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 0 1 7 2 7 7 7
1 1 2 3 3 3 4 3
2 2 3 5 5 -1 -1 5
3 3 4 -1 -1 -1 -1 -1
4 4 5 6 6 -1 -1 -1
The example df_map used to calculate the above result:
Index TEAM_YH TeamID
0 ARI 1
1 ATL 2
2 BAL 3
3 BUF 4
4 IND 5
5 JAX 6
6 PHI 7
32 WAS 32
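To get from the mapped frame to the frequency table sketched at the end of the question, one option is to reshape the user picks to long form and count them. This is only a sketch: it assumes the user columns are everything from the third column onward (as in iloc[:, 2:] above), and it does not try to label teams as ATeam/HTeam, since that distinction is not in the sample data.
picks = df.melt(id_vars=['IDMatch'], value_vars=df.columns[2:], value_name='TeamID')  # one row per user pick
counts = picks.groupby(['IDMatch', 'TeamID']).size().reset_index(name='Count')        # picks per team per match
print(counts.sort_values(['IDMatch', 'Count'], ascending=[True, False]))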

Related

Count consecutive numbers from a column of a dataframe in Python

I have a dataframe that has segments of consecutive values appearing in column a (the value in column b does not matter):
import pandas as pd
import numpy as np
np.random.seed(150)
df = pd.DataFrame(data={'a':[1,2,3,4,5,15,16,17,18,203,204,205],'b':np.random.randint(50000,size=(12))})
>>> df
a b
0 1 27066
1 2 28155
2 3 49177
3 4 496
4 5 2354
5 15 23292
6 16 9358
7 17 19036
8 18 29946
9 203 39785
10 204 15843
11 205 21917
I would like to add a column c whose values count up within each run of consecutive values in column a, as shown below:
a b c
1 27066 1
2 28155 2
3 49177 3
4 496 4
5 2354 5
15 23292 1
16 9358 2
17 19036 3
18 29946 4
203 39785 1
204 15843 2
205 21917 3
How to do this?
One solution:
df["c"] = (s := df["a"] - np.arange(len(df))).groupby(s).cumcount() + 1
print(df)
Output
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
The original idea comes from an old recipe in the Python docs: subtracting a running index leaves a value that is constant within each run of consecutive numbers.
The walrus operator (:=, an assignment expression) requires Python 3.8+; on earlier versions you can do the same in two steps:
s = df["a"] - np.arange(len(df))
df["c"] = s.groupby(s).cumcount() + 1
print(df)
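If the docs recipe referred to above is the classic itertools example for grouping consecutive integers, a minimal standalone sketch of the same trick in plain Python (independent of pandas) looks like this:
from itertools import groupby

data = [1, 2, 3, 4, 5, 15, 16, 17, 18, 203, 204, 205]
# value - position is constant within each run of consecutive numbers
runs = [[v for _, v in grp] for _, grp in groupby(enumerate(data), lambda p: p[1] - p[0])]
print(runs)  # [[1, 2, 3, 4, 5], [15, 16, 17, 18], [203, 204, 205]]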
A simple solution is to flag where a run continues, take a cumulative sum to get a running count, and then subtract the count accumulated before each run started so the counter restarts at 1.
# a is True where the previous value + 1 equals the current value (i.e. the run continues)
a = df['a'].add(1).shift(1).eq(df['a'])
# subtract the cumulative count frozen at each run's start, then add 1
df['c'] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int) + 1
df
Result:
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3

Newly created column in a dataframe needs to be updated with values from another column, based on a condition

The DF has four columns; column 'Id' is unique, and the frame is grouped by column 'idhogar'.
Column 'parentesco1' has status 0 or 1. The 'Target' column has values that differ across rows sharing the same 'idhogar':
INDEX Id parentesco1 idhogar Target
0 ID_fe8c32eba 0 4616164 2
1 ID_ca701e058 1 4616164 2
2 ID_5ad4372cd 0 4983866 3
3 ID_1e320689c 1 4983866 3
4 ID_700e30a8d 0 5905417 2
5 ID_bc99ecfb8 0 5905417 2
6 ID_308a05a16 1 5905417 2
7 ID_00186dde5 1 7.56E+06 4
8 ID_34570a74c 1 20713493 4
9 ID_b13870a19 1 27651991 3
10 ID_74e989389 1 45038655 4
11 ID_726ba7d34 0 60027579 4
12 ID_b75d7c648 0 60027579 4
13 ID_37e7b3aaa 1 60027579 4
14 ID_396da5a70 0 104578907 2
15 ID_4381374bb 1 104578907 2
16 ID_272a9b4d5 0 119024319 4
17 ID_1225f3779 0 119024319 4
18 ID_fc5dfaa2e 0 119024319 4
19 ID_7390a3f99 1 119024319 4
A new column 'Rev_target' needs to hold, for all rows in the same 'idhogar' group, the 'Target' value of the row whose 'parentesco1' is 1.
I tried the following, but it was not successful:
for idhogar in df['idhogar'].unique():
    if len(df[df['idhogar'] == idhogar]['Target'].unique()) != 1:
        rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
        df['Rev_target'] = rev_target_val
# NOT WORKING AS REQUIRED ---- gives output as NaN in all rows of newly created column
I tried the below, but it throws an error:
for idhogar in df['idhogar'].unique():
    rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
    df['Rev_target'] = np.where(len(df[df['idhogar'] == idhogar]['Target'].unique()) != 1,
                                rev_target_val, df['Target'])
ValueError: operands could not be broadcast together with shapes () (0,) (9557,)
I tried the below, but it does not work as intended; it gives the same value 2 in all rows of the new 'Rev_target' column:
for idhogar in df['idhogar'].unique():
    rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
    df['Rev_target'] = df.apply(lambda x: rev_target_val if (len(df[df['idhogar'] == idhogar]['Target'].unique()) != 1)
                                else df['Target'], axis=1)
I would appreciate a solution; thanks in advance.
I would sort the dataframe on parentesco1 in descending order to make sure the parentesco1 == 1 row comes first in each group. Then a transform can easily access that row:
df['Rev_target'] = df.sort_values('parentesco1', ascending=False).groupby(
    'idhogar').transform(lambda x: x.iloc[0])['Target']
It gives:
INDEX Id parentesco1 idhogar Target Rev_target
0 0 ID_fe8c32eba 0 4616164.0 2 2
1 1 ID_ca701e058 1 4616164.0 2 2
2 2 ID_5ad4372cd 0 4983866.0 3 3
3 3 ID_1e320689c 1 4983866.0 3 3
4 4 ID_700e30a8d 0 5905417.0 2 2
5 5 ID_bc99ecfb8 0 5905417.0 2 2
6 6 ID_308a05a16 1 5905417.0 2 2
7 7 ID_00186dde5 1 7560000.0 4 4
8 8 ID_34570a74c 1 20713493.0 4 4
9 9 ID_b13870a19 1 27651991.0 3 3
10 10 ID_74e989389 1 45038655.0 4 4
11 11 ID_726ba7d34 0 60027579.0 4 4
12 12 ID_b75d7c648 0 60027579.0 4 4
13 13 ID_37e7b3aaa 1 60027579.0 4 4
14 14 ID_396da5a70 0 104578907.0 2 2
15 15 ID_4381374bb 1 104578907.0 2 2
16 16 ID_272a9b4d5 0 119024319.0 4 4
17 17 ID_1225f3779 0 119024319.0 4 4
18 18 ID_fc5dfaa2e 0 119024319.0 4 4
19 19 ID_7390a3f99 1 119024319.0 4 4
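An alternative sketch: build a lookup Series from the parentesco1 == 1 rows and map it onto idhogar. This assumes each idhogar group contains exactly one parentesco1 == 1 row, which the sample suggests but the question does not guarantee (groups without such a row would get NaN).
head_target = df.loc[df['parentesco1'] == 1].set_index('idhogar')['Target']  # idhogar -> Target of the head row
df['Rev_target'] = df['idhogar'].map(head_target)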

Random sampling of the data in Python

I have a dataframe with several columns, and I need to re-sample from that data with more weight given to one category. I think np.random.choice should work, but I'm not sure how to implement it. Below is the example data from which I want to sample randomly, with a 70% probability of getting an expensive home (based on the Expensive_home column, value = 1) and a 30% probability for Expensive_home = 0. How can I create the re-sampled data file? Thank you!
ID Lot_Area Year_Built Full_Bath Bedroom Sale_Price Expensive_home
1 31770 1960 1 3 215000 0
2 11622 1961 1 2 105000 0
3 5389 1995 2 2 236500 0
4 8402 1998 2 3 180400 0
5 10176 1990 1 2 171500 0
6 6820 1985 1 1 212000 0
7 53504 2003 3 4 538000 1
8 12134 1988 2 4 164000 0
9 11394 2010 1 1 394432 1
10 19138 1951 1 2 141000 0
11 13175 1978 2 3 210000 0
12 11751 1977 2 3 190000 0
13 10625 1974 2 3 170000 0
14 7500 2000 2 3 216000 0
15 11241 1970 1 2 149000 0
16 2280 1978 2 3 146000 0
17 12858 2009 2 3 376162 1
18 12883 2009 2 3 290941 0
19 12182 2005 2 3 220000 0
20 11520 2005 2 3 275000 0
The desired result is a similar data file, but with more of the randomly picked 1s in the last column.
To create a dataframe of the same length, but giving expensive homes a higher chance of being selected and allowing replacement, use:
weights = df['Expensive_home'].replace({0: 30, 1: 70})
df1 = df.sample(len(df), replace=True, weights=weights)
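To sanity-check the class mix of the resample, you can look at the normalized value counts (a usage note; the exact split varies from run to run because of the random draw):
print(df1['Expensive_home'].value_counts(normalize=True))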
To create a dataframe with all expensive homes plus a 30% sample of the non-expensive ones, you can do:
expensive = df['Expensive_home'].astype(bool)
df2 = pd.concat([df[expensive], df[~expensive].sample(frac=0.3)])
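If you specifically want roughly 70% of the drawn rows to be expensive homes (one reading of the question), a sketch using np.random.choice, which the question mentioned, could look like this; the name df_resampled is just illustrative:
import numpy as np

expensive = df['Expensive_home'].eq(1)
# give the expensive class 0.7 total probability and the rest 0.3, split evenly within each class
p = np.where(expensive, 0.7 / expensive.sum(), 0.3 / (~expensive).sum())
idx = np.random.choice(df.index, size=len(df), replace=True, p=p)
df_resampled = df.loc[idx]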

Get longest streak of consecutive weeks by group in pandas

I'm currently working with weekly data for different subjects, but it may have long gaps without data, so what I want to do is keep only the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got close by marking with a 1 whenever week == week.shift() + 1. The problem is that this approach doesn't mark the first occurrence of a streak, and I also can't filter for the longest one:
df.loc[ (df['id'] == df['id'].shift())&(df['week'] == df['week'].shift()+1),'streak']=1
With my example data, this produces:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id',df['week'].diff(-1).ne(-1).shift().bfill().cumsum()]).transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
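For readability, here is a step-by-step sketch of what the grouping key computes (the intermediate name key is just for illustration):
key = df['week'].diff(-1).ne(-1)  # True where the next week is NOT the current week + 1 (the run ends here)
key = key.shift().bfill()         # shift down so True lands on the row that starts a new run
key = key.cumsum()                # cumulative sum turns those flags into one label per run
Grouping by ['id', key] then gives one group per streak, and transform('count') attaches each streak's length to its rows.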
Not as concise as @ScottBoston's answer, but I like this approach:
def max_streak(s):
    a = s.values  # Let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # Tell the locations of the breaks in streak
    c = np.flatnonzero(b)
    # `diff` again tells me the length of the streaks
    d = np.diff(c)
    # `argmax` will tell me the location of the largest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4

Filtering Pandas Dataframe by mean of last N values

I'm trying to get all records where the mean of the last 3 rows is greater than the overall mean for all rows in a filtered set.
_filtered_d_all = _filtered_d.iloc[:, 0:50].loc[:, _filtered_d.mean()>0.05]
_last_n_records = _filtered_d.tail(3)
Something like this
_filtered_growing = _filtered_d.iloc[:, 0:50].loc[:, _last_n_records.mean() > _filtered_d.mean()]
However, the problem here is that the Series lengths don't match. Any tips?
ValueError: Series lengths must match to compare
Sample Data
This has an index on the year and month, and 2 columns.
Col1 Col2
year month
2005 12 0.533835 0.170679
12 0.494733 0.198347
2006 3 0.440098 0.202240
6 0.410285 0.188421
9 0.502420 0.200188
12 0.522253 0.118680
2007 3 0.378120 0.171192
6 0.431989 0.145158
9 0.612036 0.178097
12 0.519766 0.252196
2008 3 0.547705 0.202163
6 0.560985 0.238591
9 0.617320 0.199537
12 0.343939 0.253855
Why not just boolean index directly on your filtered DataFrame with
df[df.tail(3).mean() > df.mean()]
Demo
>>> df
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
4 9 7 8 9 4
>>> df[df.tail(3).mean() > df.mean()]
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
Update: example for the MultiIndex edit
The same should work fine for your MultiIndex sample; we just have to mask a bit differently, of course:
>>> df
col1 col2
2005 12 -0.340088 -0.574140
12 -0.814014 0.430580
2006 3 0.464008 0.438494
6 0.019508 -0.635128
9 0.622645 -0.824526
12 -1.674920 -1.027275
2007 3 0.397133 0.659467
6 0.026170 -0.052063
9 0.835561 0.608067
12 0.736873 -0.613877
2008 3 0.344781 -0.566392
6 -0.653290 -0.264992
9 0.080592 -0.548189
12 0.585642 1.149779
>>> df.loc[:,df.tail(3).mean() > df.mean()]
col2
2005 12 -0.574140
12 0.430580
2006 3 0.438494
6 -0.635128
9 -0.824526
12 -1.027275
2007 3 0.659467
6 -0.052063
9 0.608067
12 -0.613877
2008 3 -0.566392
6 -0.264992
9 -0.548189
12 1.149779
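Applied to the question's own variables, the column-wise version would look something like this (a sketch; _filtered_d is the already-filtered frame from the question):
_filtered_growing = _filtered_d.loc[:, _filtered_d.tail(3).mean() > _filtered_d.mean()]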
