filling cells in DataFrame - python

I created the DataFrame and faced a problem:
r value
0 0.8 2.5058
1 0.9 -1.9320
2 1.0 -2.6097
3 1.2 -1.6840
4 1.4 -0.8906
5 0.8 2.6955
6 0.9 -1.9552
7 1.0 -2.6641
8 1.2 -1.7169
9 1.4 -0.9056
... ... ...
For each run of r from 0.8 to 1.4, I want to assign to every row the value at r = 1.0.
Therefore the desired DataFrame should look like:
r value
0 0.8 -2.6097
1 0.9 -2.6097
2 1.0 -2.6097
3 1.2 -2.6097
4 1.4 -2.6097
5 0.8 -2.6641
6 0.9 -2.6641
7 1.0 -2.6641
8 1.2 -2.6641
9 1.4 -2.6641
... ... ....
My first idea was to use the condition:
np.where(data['r']==1.0, data['value'], 1.0)
but it does not solve my problem: it only keeps value where r == 1.0 and fills 1.0 everywhere else, instead of spreading the r == 1.0 value across each group.

Try this:
def subr(df):
    isone = df.r == 1.0
    if isone.any():
        atone = df.value[isone].iloc[0]
        # Improvement suggested by #root
        df.loc[df.r.between(0.8, 1.4), 'value'] = atone
        # df.loc[(df.r >= .8) & (df.r <= 1.4), 'value'] = atone
    return df
df.groupby((df.r < df.r.shift()).cumsum()).apply(subr)
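For reference, here is a minimal sketch (reconstructing the question's first ten rows) of how the grouping key identifies each 0.8-1.4 run:

import pandas as pd

df = pd.DataFrame({
    'r':     [0.8, 0.9, 1.0, 1.2, 1.4, 0.8, 0.9, 1.0, 1.2, 1.4],
    'value': [2.5058, -1.9320, -2.6097, -1.6840, -0.8906,
              2.6955, -1.9552, -2.6641, -1.7169, -0.9056],
})

# A new group starts whenever r drops below the previous r,
# so the cumulative sum labels each 0.8-1.4 run.
print((df.r < df.r.shift()).cumsum().tolist())   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]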

Starting with the original data from the question:
r value
0 0.8 2.5058
1 0.9 -1.9320
2 1.0 -2.6097
3 1.2 -1.6840
4 1.4 -0.8906
5 0.8 2.6955
6 0.9 -1.9552
7 1.0 -2.6641
8 1.2 -1.7169
9 1.4 -0.9056
# label each 0.8-1.4 run: a new group starts at every r == 0.8
df3['grp'] = (df3['r'] == .8).cumsum()
# map each group label to the value found at r == 1.0
grpd = dict(df3[['grp', 'value']][df3['r'] == 1].values)
df3["value"] = df3["grp"].map(grpd)
df3 = df3.drop('grp', axis=1)
r value
0 0.8 -2.6097
1 0.9 -2.6097
2 1.0 -2.6097
3 1.2 -2.6097
4 1.4 -2.6097
5 0.8 -2.6641
6 0.9 -2.6641
7 1.0 -2.6641
8 1.2 -2.6641
9 1.4 -2.6641
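For reference, with this data the helper column df3['grp'] comes out as [1, 1, 1, 1, 1, 2, 2, 2, 2, 2] (one label per 0.8-1.4 run) and grpd as {1.0: -2.6097, 2.0: -2.6641}, i.e. the value found at r == 1.0 for each run, which map then broadcasts to every row of that run.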

Related

Number Of Rows Since Positive/Negative in Pandas

I have a DataFrame similar to this:
MACD
0 -2.3
1 -0.3
2 0.8
3 0.1
4 0.6
5 -0.7
6 1.1
7 2.4
How can I add an extra column showing the number of rows since MACD was on the opposite side of the origin (positive/negative)?
Desired Outcome:
MACD RowsSince
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1
3 0.1 2
4 0.6 3
5 -0.7 1
6 1.1 1
7 2.4 2
We can use np.sign with diff to create the subgroups, then groupby + cumcount:
s = np.sign(df['MACD']).diff().ne(0).cumsum()
df['new'] = (df.groupby(s).cumcount()+1).mask(s.eq(1))
df
Out[80]:
MACD new
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1.0
3 0.1 2.0
4 0.6 3.0
5 -0.7 1.0
6 1.1 1.0
7 2.4 2.0
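To see why the mask is needed, here is a minimal sketch rebuilding the sample frame above: s labels each run of same-signed values, and the very first run (s == 1) has no earlier sign change to count from, so those positions are masked to NaN.

import numpy as np
import pandas as pd

df = pd.DataFrame({'MACD': [-2.3, -0.3, 0.8, 0.1, 0.6, -0.7, 1.1, 2.4]})
s = np.sign(df['MACD']).diff().ne(0).cumsum()
print(s.tolist())                                  # [1, 1, 2, 2, 2, 3, 4, 4]
df['new'] = (df.groupby(s).cumcount() + 1).mask(s.eq(1))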

create new rows based on values of one of the columns in the above row with specific condition - pandas or numpy

I have a data frame as shown below
B_ID no_show Session slot_num walkin ns_w c_ns_w c_walkin
1 0.4 S1 1 0.2 0.2 0.2 0.2
2 0.3 S1 2 0.5 -0.2 0.2 0.7
3 0.8 S1 3 0.5 0.3 0.5 1.2
4 0.3 S1 4 0.8 -0.5 0.0 2.0
5 0.6 S1 5 0.4 0.2 0.2 2.4
6 0.8 S1 6 0.2 0.6 0.8 2.6
7 0.9 S1 7 0.1 0.8 1.4 2.7
8 0.4 S1 8 0.5 -0.1 1.3 3.2
9 0.6 S1 9 0.1 0.5 1.8 3.3
12 0.9 S2 1 0.9 0.0 0.0 0.9
13 0.5 S2 2 0.4 0.1 0.1 1.3
14 0.3 S2 3 0.1 0.2 0.3 1.4
15 0.7 S2 4 0.4 0.3 0.6 1.8
20 0.7 S2 5 0.1 0.6 1.2 1.9
16 0.6 S2 6 0.3 0.3 1.5 2.2
17 0.8 S2 7 0.5 0.3 1.8 2.7
19 0.3 S2 8 0.8 -0.5 1.3 3.5
where
df['ns_w'] = df['no_show'] - df['walkin']
c_ns_w is the cumulative sum of ns_w within each Session:
df['c_ns_w'] = df.groupby(['Session'])['ns_w'].cumsum()
and c_walkin is the cumulative sum of walkin within each Session:
df['c_walkin'] = df.groupby(['Session'])['walkin'].cumsum()
From the above I would like to calculate two new columns called u_c_walkin and u_c_ns_w.
Whenever u_c_walkin > 0.9, insert a new row below it with B_ID = walkin1, walkin2, etc., no_show = 0, walkin = 0, all other values copied from the row above, and subtract 1 from u_c_walkin (for the new row and the rows that follow).
Similarly, whenever u_c_ns_w > 0.8, insert a new row with B_ID = overbook1, overbook2, etc., no_show = 0.5, walkin = 0, ns_w = 0.5, all other values copied from the row above, and subtract 0.5 from u_c_ns_w (for the new row and the rows that follow).
Expected output:
B_ID no_show Session slot_num walkin ns_w c_ns_w c_walkin u_c_walkin u_c_ns_w
1 0.4 S1 1 0.2 0.2 0.2 0.2 0.2 0.2
2 0.3 S1 2 0.5 -0.2 0.2 0.7 0.7 0.2
3 0.8 S1 3 0.5 0.3 0.5 1.2 1.2 0.5
walkin1 0.0 S1 3 0.0 0.3 0.5 1.2 0.2 0.5
4 0.3 S1 4 0.8 -0.5 0.0 2.0 1.0 0.0
walkin2 0.0 S1 4 0.0 -0.5 0.0 2.0 0.0 0.0
5 0.6 S1 5 0.4 0.2 0.2 2.4 0.4 0.2
6 0.8 S1 6 0.2 0.6 0.8 2.6 0.6 0.8
7 0.9 S1 7 0.1 0.8 1.4 2.7 0.7 1.4
overbook1 0.5 S1 7 0.0 0.5 1.4 2.7 0.7 0.9
8 0.4 S1 8 0.5 -0.1 1.3 3.2 1.2 0.8
walkin3 0.0 S1 8 0.0 -0.1 1.3 3.2 0.2 0.8
9 0.6 S1 9 0.1 0.5 1.8 3.3 0.1 1.3
overbook2 0.5 S1 9 0.0 0.5 1.8 3.3 0.1 0.8
12 0.9 S2 1 0.9 0.0 0.0 0.9 0.9 0.0
13 0.5 S2 2 0.4 0.1 0.1 1.3 1.3 0.1
walkin1 0.0 S2 2 0.0 0.1 0.1 1.3 0.3 0.1
14 0.3 S2 3 0.1 0.2 0.3 1.4 0.4 0.3
15 0.7 S2 4 0.4 0.3 0.6 1.8 0.8 0.6
20 0.7 S2 5 0.1 0.6 1.2 1.9 0.9 1.2
overbook1 0.5 S2 5 0.0 0.5 1.2 1.9 0.9 0.7
16 0.6 S2 6 0.3 0.3 1.5 2.2 1.2 1.0
walkin2 0.0 S2 6 0.3 0.3 1.5 2.2 0.2 1.0
overbook2 0.5 S2 6 0.0 0.5 1.5 2.2 0.2 0.5
17 0.8 S2 7 0.5 0.3 1.8 2.7 0.7 0.8
19 0.3 S2 8 0.8 -0.5 1.3 3.5 1.5 0.3
walkin3 0.0 S2 8 0.8 -0.5 1.3 3.5 0.5 0.3
I tried the code below to create the walkin rows, but I was not able to create the overbook rows.
def create_u_columns(ser):
    arr_ns = ser.to_numpy()
    # array marking where rows will be inserted later
    arr_idx = np.zeros(len(ser), dtype=int)
    walkin_id = 1
    for i in range(len(arr_ns) - 1):
        if arr_ns[i] > 0.8:
            # subtract 1 from the following cumulative values
            arr_ns[i+1:] -= 1
            # record the id of the walkin row to insert after this position
            arr_idx[i] = walkin_id
            walkin_id += 1
    # return a dataframe with both columns
    return pd.DataFrame({'u_cumulative': arr_ns, 'mask_idx': arr_idx}, index=ser.index)

df[['u_c_walkin', 'mask_idx']] = df.groupby(['Session'])['c_walkin'].apply(create_u_columns)
# select the rows to duplicate
df_toAdd = df.loc[df['mask_idx'].astype(bool), :].copy()
# replace the values as wanted
df_toAdd['no_show'] = 0
df_toAdd['walkin'] = 0
df_toAdd['EpisodeNumber'] = 'walkin' + df_toAdd['mask_idx'].astype(str)
df_toAdd['u_c_walkin'] -= 1
# add 0.5 to the index so the new rows sort right after their source rows
df_toAdd.index += 0.5
new_df = pd.concat([df, df_toAdd]).sort_index()\
    .reset_index(drop=True).drop('mask_idx', axis=1)
You can modify the function this way to do both checks at the same time. Please verify that these are exactly the conditions you want to apply for the walkin and overbook rows.
def create_columns(dfg):
    arr_walkin = dfg['c_walkin'].to_numpy()
    arr_ns = dfg['c_ns_w'].to_numpy()
    # arrays marking where rows will be inserted later
    arr_idx_walkin = np.zeros(len(arr_walkin), dtype=int)
    arr_idx_ns = np.zeros(len(arr_ns), dtype=int)
    walkin_id = 1
    overbook_id = 1
    for i in range(len(arr_ns)):
        # condition on c_walkin
        if arr_walkin[i] > 0.9:
            # subtract 1 from the following cumulative values
            arr_walkin[i+1:] -= 1
            # record the id of the walkin row to insert after this position
            arr_idx_walkin[i] = walkin_id
            walkin_id += 1
        # condition on c_ns_w
        if arr_ns[i] > 0.8:
            # subtract 0.5 from the following cumulative values
            arr_ns[i+1:] -= 0.5
            # record the id of the overbook row to insert after this position
            arr_idx_ns[i] = overbook_id
            overbook_id += 1
    # return a dataframe with all four columns
    return pd.DataFrame({'u_c_walkin': arr_walkin,
                         'u_c_ns_w': arr_ns,
                         'mask_idx_walkin': arr_idx_walkin,
                         'mask_idx_ns': arr_idx_ns}, index=dfg.index)

df[['u_c_walkin', 'u_c_ns_w', 'mask_idx_walkin', 'mask_idx_ns']] = \
    df.groupby(['Session'])[['c_walkin', 'c_ns_w']].apply(create_columns)

# select the rows for walkin
df_walkin = df.loc[df['mask_idx_walkin'].astype(bool), :].copy()
# replace the values as wanted
df_walkin['no_show'] = 0
df_walkin['walkin'] = 0
df_walkin['B_ID'] = 'walkin' + df_walkin['mask_idx_walkin'].astype(str)
df_walkin['u_c_walkin'] -= 1
# add 0.2 to the index so each walkin row sorts right after its source row
df_walkin.index += 0.2

# select the rows for overbook
df_ns = df.loc[df['mask_idx_ns'].astype(bool), :].copy()
# replace the values as wanted
df_ns['no_show'] = 0.5
df_ns['walkin'] = 0
df_ns['ns_w'] = 0.5
df_ns['B_ID'] = 'overbook' + df_ns['mask_idx_ns'].astype(str)
df_ns['u_c_ns_w'] -= 0.5
# add 0.4 so overbook rows sort after their source row (and after any walkin row)
df_ns.index += 0.4

new_df = pd.concat([df, df_walkin, df_ns]).sort_index()\
    .reset_index(drop=True).drop(['mask_idx_walkin', 'mask_idx_ns'], axis=1)
and you get:
print (new_df)
B_ID no_show Session slot_num walkin ns_w c_ns_w c_walkin \
0 1 0.4 S1 1 0.2 0.2 0.2 0.2
1 2 0.3 S1 2 0.5 -0.2 0.2 0.7
2 3 0.8 S1 3 0.5 0.3 0.5 1.2
3 walkin1 0.0 S1 3 0.0 0.3 0.5 1.2
4 4 0.3 S1 4 0.8 -0.5 0.0 2.0
5 walkin2 0.0 S1 4 0.0 -0.5 0.0 2.0
6 5 0.6 S1 5 0.4 0.2 0.2 2.4
7 6 0.8 S1 6 0.2 0.6 0.8 2.6
8 7 0.9 S1 7 0.1 0.8 1.4 2.7
9 overbook1 0.5 S1 7 0.0 0.5 1.4 2.7
10 8 0.4 S1 8 0.5 -0.1 1.3 3.2
11 walkin3 0.0 S1 8 0.0 -0.1 1.3 3.2
12 9 0.6 S1 9 0.1 0.5 1.8 3.3
13 overbook2 0.5 S1 9 0.0 0.5 1.8 3.3
14 12 0.9 S2 1 0.9 0.0 0.0 0.9
15 13 0.5 S2 2 0.4 0.1 0.1 1.3
16 walkin1 0.0 S2 2 0.0 0.1 0.1 1.3
17 14 0.3 S2 3 0.1 0.2 0.3 1.4
18 15 0.7 S2 4 0.4 0.3 0.6 1.8
19 20 0.7 S2 5 0.1 0.6 1.2 1.9
20 overbook1 0.5 S2 5 0.0 0.5 1.2 1.9
21 16 0.6 S2 6 0.3 0.3 1.5 2.2
22 walkin2 0.0 S2 6 0.0 0.3 1.5 2.2
23 overbook2 0.5 S2 6 0.0 0.5 1.5 2.2
24 17 0.8 S2 7 0.5 0.3 1.8 2.7
25 19 0.3 S2 8 0.8 -0.5 1.3 3.5
26 walkin3 0.0 S2 8 0.0 -0.5 1.3 3.5
u_c_walkin u_c_ns_w
0 0.2 0.2
1 0.7 0.2
2 1.2 0.5
3 0.2 0.5
4 1.0 0.0
5 0.0 0.0
6 0.4 0.2
7 0.6 0.8
8 0.7 1.4
9 0.7 0.9
10 1.2 0.8
11 0.2 0.8
12 0.3 1.3
13 0.3 0.8
14 0.9 0.0
15 1.3 0.1
16 0.3 0.1
17 0.4 0.3
18 0.8 0.6
19 0.9 1.2
20 0.9 0.7
21 1.2 1.0
22 0.2 1.0
23 1.2 0.5
24 0.7 0.8
25 1.5 0.3
26 0.5 0.3
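A note on the fractional index offsets (0.2 for walkin rows, 0.4 for overbook rows): they make the inserted rows sort immediately after their source row, with walkin rows before overbook rows, once the frames are concatenated and sorted. A minimal sketch of the trick with hypothetical values:

import pandas as pd

base = pd.DataFrame({'x': [10, 20]})           # integer index 0, 1
extra = base.copy()
extra.index = extra.index + 0.5                # 0.5 and 1.5 fall right after 0 and 1
print(pd.concat([base, extra]).sort_index())   # rows interleave: 0, 0.5, 1, 1.5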

Merging different length dataframe in Python/pandas

I have 2 dataframes:
df1
aa gg pm
1 3.3 0.5
1 0.0 4.7
1 9.3 0.2
2 0.3 0.6
2 14.0 91.0
3 13.0 31.0
4 13.1 64.0
5 1.3 0.5
6 3.3 0.5
7 11.1 3.0
7 11.3 24.0
8 3.2 0.0
8 5.3 0.3
8 3.3 0.3
and df2:
aa gg st
1 3.3 in
2 0.3 in
5 1.3 in
7 11.1 in
8 5.3 in
I would like to merge these two dataframes on columns aa and gg to get results like:
aa gg pm st
1 3.3 0.5 in
1 0.0 4.7
1 9.3 0.2
2 0.3 0.6 in
2 14.0 91.0
3 13.0 31.0
4 13.1 64.0
5 1.3 0.5 in
6 3.3 0.5
7 11.1 3.0 in
7 11.3 24.0
8 3.2 0.0
8 5.3 0.3 in
8 3.3 0.3
I want to map the st column values based on columns aa and gg.
Please let me know how to do this.
You can multiply the float columns by 1000 or 10000, convert them to integers, and then use these new columns for the join:
df1['gg_int'] = df1['gg'].mul(1000).astype(int)
df2['gg_int'] = df2['gg'].mul(1000).astype(int)
df = df1.merge(df2.drop('gg', axis=1), on=['aa','gg_int'], how='left')
df = df.drop('gg_int', axis=1)
print (df)
aa gg pm st
0 1 3.3 0.5 in
1 1 0.0 4.7 NaN
2 1 9.3 0.2 NaN
3 2 0.3 0.6 in
4 2 14.0 91.0 NaN
5 3 13.0 31.0 NaN
6 4 13.1 64.0 NaN
7 5 1.3 0.5 in
8 6 3.3 0.5 NaN
9 7 11.1 3.0 in
10 7 11.3 24.0 NaN
11 8 3.2 0.0 NaN
12 8 5.3 0.3 in
13 8 3.3 0.3 NaN
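The integer helper column sidesteps exact float comparison in the join keys: the same printed value can be stored with slightly different binary representations in the two frames, in which case a merge on the raw float silently misses the match. A minimal sketch with hypothetical data:

import pandas as pd

left = pd.DataFrame({'aa': [1], 'gg': [0.1 + 0.2]})            # gg is 0.30000000000000004
right = pd.DataFrame({'aa': [1], 'gg': [0.3], 'st': ['in']})
print(left.merge(right, on=['aa', 'gg'], how='left'))           # st is NaN: the floats differ
left['gg_int'] = left['gg'].mul(1000).astype(int)
right['gg_int'] = right['gg'].mul(1000).astype(int)
print(left.merge(right.drop('gg', axis=1), on=['aa', 'gg_int'], how='left'))   # st is 'in'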

Remove linearly increasing "count" columns pandas

I have a dataframe with some columns representing counts at every timestep. I would like to drop these automatically, with something like the df.dropna() functionality, say df.dropcounts().
Here is an example dataframe
array = [[0.0,1.6,2.7,12.0],[1.0,3.5,4.5,13.0],[2.0,6.5,8.6,14.0]]
pd.DataFrame(array)
0 1 2 3
0 0.0 1.6 2.7 12.0
1 1.0 3.5 4.5 13.0
2 2.0 6.5 8.6 14.0
I would like to drop the first and last columns.
I believe you need:
val = 1
df = df.loc[:, df.diff().fillna(val).ne(val).any()]
print (df)
1 2
0 1.6 2.7
1 3.5 4.5
2 6.5 8.6
Explanation:
First take consecutive differences with DataFrame.diff:
print (df.diff())
0 1 2 3
0 NaN NaN NaN NaN
1 1.0 1.9 1.8 1.0
2 1.0 3.0 4.1 1.0
Replace NaNs:
print (df.diff().fillna(val))
0 1 2 3
0 1.0 1.0 1.0 1.0
1 1.0 1.9 1.8 1.0
2 1.0 3.0 4.1 1.0
Compare for inequality with ne:
print (df.diff().fillna(val).ne(val))
0 1 2 3
0 False False False False
1 False True True False
2 False True True False
And check for at least one True per column with DataFrame.any:
print (df.diff().fillna(val).ne(val).any())
0 False
1 True
2 True
3 False
dtype: bool
Alternatively, using all:
df.loc[:, ~df.diff().fillna(1).eq(1).all().values]
Out[295]:
1 2
0 1.6 2.7
1 3.5 4.5
2 6.5 8.6
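The two variants are equivalent by De Morgan's law: keeping the columns where at least one consecutive difference is not equal to 1 is the same as keeping the columns where not every consecutive difference equals 1.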

Calculating the accumulated summation of clustered data in data frame in pandas

Given the following data frame:
index value
1 0.8
2 0.9
3 1.0
4 0.9
5 nan
6 nan
7 nan
8 0.4
9 0.9
10 nan
11 0.8
12 2.0
13 1.4
14 1.9
15 nan
16 nan
17 nan
18 8.4
19 9.9
20 10.0
…
in which the 'value' data is separated into a number of clusters by NaN values. Is there any way I can calculate statistics such as the accumulated sum, or the mean, of the clustered data? For example, I want to calculate the accumulated sum and generate the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 0
11 0.8 0.8
12 2.0 2.8
13 1.4 4.2
14 1.9 6.1
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
…
Any suggestions?
Also, as a simple extension of the problem: if two clusters are close enough, say only one NaN separates them, we consider them as one cluster, so that we get the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 1.3
11 0.8 2.1
12 2.0 4.1
13 1.4 5.5
14 1.9 7.4
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
Thank you for the help!
You can do the first part using the compare-cumsum-groupby pattern. Your "simple extension" isn't quite so simple, but we can still pull it off, by finding out the parts of value that we want to treat as zero:
n = df["value"].isnull()
clusters = (n != n.shift()).cumsum()
df["cumsum"] = df["value"].groupby(clusters).cumsum().fillna(0)

# for the extension: treat NaN runs of length exactly 1 as zero so they do not split clusters
to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)
tmp_value = df["value"].where(~to_zero, 0)
n2 = tmp_value.isnull()
new_clusters = (n2 != n2.shift()).cumsum()
df["cumsum_skip1"] = tmp_value.groupby(new_clusters).cumsum().fillna(0)
produces
>>> df
index value cumsum cumsum_skip1
0 1 0.8 0.8 0.8
1 2 0.9 1.7 1.7
2 3 1.0 2.7 2.7
3 4 0.9 3.6 3.6
4 5 NaN 0.0 0.0
5 6 NaN 0.0 0.0
6 7 NaN 0.0 0.0
7 8 0.4 0.4 0.4
8 9 0.9 1.3 1.3
9 10 NaN 0.0 1.3
10 11 0.8 0.8 2.1
11 12 2.0 2.8 4.1
12 13 1.4 4.2 5.5
13 14 1.9 6.1 7.4
14 15 NaN 0.0 0.0
15 16 NaN 0.0 0.0
16 17 NaN 0.0 0.0
17 18 8.4 8.4 8.4
18 19 9.9 18.3 18.3
19 20 10.0 28.3 28.3
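For clarity, a minimal sketch (on a toy series, not the question's data) of why a single-NaN gap gets bridged: a run of exactly one NaN is replaced by 0, so it no longer starts a new cluster when the run labels are recomputed.

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, np.nan, np.nan, 3.0])

n = s.isnull()
clusters = (n != n.shift()).cumsum()                         # [1, 2, 3, 4, 4, 5]
to_zero = n & (s.groupby(clusters).transform('size') == 1)   # True only for the lone NaN
tmp = s.where(~to_zero, 0)                                   # [1.0, 0.0, 2.0, nan, nan, 3.0]
n2 = tmp.isnull()
new_clusters = (n2 != n2.shift()).cumsum()                   # [1, 1, 1, 2, 2, 3]
print(tmp.groupby(new_clusters).cumsum().fillna(0).tolist()) # [1.0, 1.0, 3.0, 0.0, 0.0, 3.0]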
