How to decode column value from rare label by matching column names - python

I have two dataframes, as shown below:
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'grade': rng.choice(list('ACD'), size=5),
                    'dash': rng.choice(list('PQRS'), size=5),
                    'dumeel': rng.choice(list('QWER'), size=5),
                    'dumma': rng.choice(1234, size=5),
                    'target': rng.choice([0, 1], size=5)})
tdf = pd.DataFrame({'Id': [1, 1, 1, 1, 3, 3, 3],
                    'feature': ['grade=Rare', 'dash=Q', 'dumma=rare', 'dumeel=R',
                                'dash=Rare', 'dumma=rare', 'grade=D'],
                    'value': [0.2, 0.45, -0.32, 0.56, 1.3, 1.5, 3.7]})
My objective is to:
a) Replace the Rare or rare values in the feature column of tdf with the original value from cdf.
b) To identify the original value, we can use the string before the = in entries like =Rare, =rare, = rare, etc. That string is a column name in cdf (from which the original value to replace rare can be found).
I was trying something like the below but am not sure how to go from here:
replace_df = cdf.merge(tdf, how='inner', on='Id')
replace_df["replaced_feature"] = np.where((replace_df["feature"].str.contains('rare', case=False)) &
                                          (replace_df["feature"].str.split('=')))  # stuck here
I have to apply this on big data with millions of rows and more than 1000 replacements to be made like this.
I expect my output to be as shown below.

Here is one possible approach using MultiIndex.map to substitute values from cdf into tdf:
# split each feature into [column name, value]
s = tdf['feature'].str.split('=')
# rows whose value part is the 'rare' placeholder
m = s.str[1].isin(['rare', 'Rare'])
# look up each (Id, column name) pair in cdf stacked into a (Id, column) -> value Series
v = tdf[m].set_index(['Id', s[m].str[0]]).index.map(cdf.set_index('Id').stack())
tdf.loc[m, 'feature'] = s[m].str[0] + '=' + v.astype(str)
print(tdf)
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=D 3.70
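For what it's worth, the lookup table that .map() consults above is just cdf reshaped into a Series keyed by (Id, column name) pairs; here is a small sketch to make that intermediate step visible (the name lookup is mine):
# stack() pivots cdf's columns into a second index level, so each
# (Id, column name) pair maps to the original cell value
lookup = cdf.set_index('Id').stack()
lookup[(1, 'grade')]  # the original grade for Id 1, i.e. what replaces 'grade=Rare'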

# list comprehension to find where rare is in the feature col
tdf['feature'] = [x if y.lower() == 'rare' else x + '=' + y
                  for x, y in tdf['feature'].str.split('=')]
# create a mask where feature is in the columns of cdf
mask = tdf['feature'].isin(cdf.columns)
# use loc to filter your frame and use merge to join cdf on the Id and feature columns - after you stack cdf
tdf.loc[mask, 'feature'] = (tdf.loc[mask, 'feature'] + '=' +
                            tdf.loc[mask].merge(cdf.set_index('Id').stack().to_frame(),
                                                right_index=True, left_on=['Id', 'feature'])[0].astype(str))
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=D 3.70

My feeling is there's no need to look for Rare values: extract the column name from tdf to look up in cdf, then flatten cdf to extract the right values:
r = tdf.set_index('Id')['feature'].str.split('=').str[0].str.lower()
tdf['feature'] = r.values + '=' + cdf.set_index('Id').unstack() \
                                     .loc[list(zip(r.values, r.index))] \
                                     .astype(str).values
Output:
>>> tdf
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=A 3.70
>>> r
Id # <- the index is the row of cdf
1 grade # <- the values are the column of cdf
1 dash
1 dumma
1 dumeel
3 dash
3 dumma
3 grade
Name: feature, dtype: object
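Since the question mentions millions of rows, a plain-dict lookup may also be worth trying; this is only a sketch along the same lines as the answers above (the names lookup, parts, cols and is_rare are mine, and I haven't benchmarked it):
# flatten cdf into a {(Id, column name): value} dict for cheap hashed lookups
lookup = cdf.set_index('Id').stack().to_dict()
parts = tdf['feature'].str.split('=')
cols = parts.str[0]
is_rare = parts.str[1].str.lower() == 'rare'
tdf.loc[is_rare, 'feature'] = [
    f'{c}={lookup[(i, c)]}'
    for i, c in zip(tdf.loc[is_rare, 'Id'], cols[is_rare])
]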

Related

Average score per attempt for entries with non fully overlapping attempts

I have a pandas dataframe that has a column that contains a list of attempt numbers and another column that contains the score achieved on those attempts. A simplified example is below:
import pandas as pd

scores = [[0, 1, 0], [0, 0], [0, 6, 2]]
attempt_num = [[1, 2, 3], [2, 4], [2, 3, 4]]
df = pd.DataFrame([attempt_num, scores]).T
df.columns = ['Attempt', 'Score']
Each row represents a different person, who for the purposes of this question we can assume is unique. The data is incomplete, so I have attempt numbers 1, 2 and 3 for the first person, 2 and 4 for the second, and 2, 3 and 4 for the last. What I want is the average score per attempt. For example, attempt 1 only shows up once, so its average would be 0, the score achieved when it did show up. Attempt 2 shows up for all persons, which gives an average of 0.33 ((1 + 0 + 0)/3), and so on. So the expected output would be:
Attempt_Number Average_Score
0 1 0.00
1 2 0.33
2 3 3.00
3 4 1.00
I could loop through every row of the dataframe and then through every element of the lists in that row, append each score to an ordered list and calculate the average for every attempt in that list, but this seems very inefficient. Is there a better way?
Use DataFrame.explode with an aggregated mean (exploding multiple columns requires pandas 1.3+):
df = (df.explode(['Attempt', 'Score'])
        .astype({'Score': int})
        .groupby('Attempt', as_index=False)['Score']
        .mean()
        .rename(columns={'Attempt': 'Attempt_Number', 'Score': 'Average_Score'}))
print(df)
Attempt_Number Average_Score
0 1 0.000000
1 2 0.333333
2 3 3.000000
3 4 1.000000
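For intuition, the explode step gives each (attempt, score) pair its own row before grouping; with the sample data the intermediate frame looks like this:
print(df.explode(['Attempt', 'Score']))
#   Attempt Score
# 0       1     0
# 0       2     1
# 0       3     0
# 1       2     0
# 1       4     0
# 2       2     0
# 2       3     6
# 2       4     2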
For older pandas versions use:
df = (df.apply(pd.Series.explode)
        .astype({'Score': int})
        .groupby('Attempt', as_index=False)['Score']
        .mean()
        .rename(columns={'Attempt': 'Attempt_Number', 'Score': 'Average_Score'}))

Python - delete a row based on condition from a pandas.core.series.Series after groupby

I have this pandas.core.series.Series after grouping by the two columns case and area:
case  area
A     1         2494
      2         2323
B     1        59243
      2        27125
      3           14
I want to keep only the areas that are in case A, which means the result should be:
case  area
A     1         2494
      2         2323
B     1        59243
      2        27125
I tried this code:
a = df['B'][~df['B'].index.isin(df['A'].index)].index
df['B'].drop(a)
And it worked; the output was correct. But it didn't drop anything in the original series, which stayed the same. When I assign the result of dropping, all the values become NaN:
df['B'] = df['B'].drop(a)
What should I do?
It is possible to drop after grouping; here's one way:
import pandas as pd
import numpy as np

np.random.seed(1)
ungroup_df = pd.DataFrame({
    'case': [
        'A', 'A', 'A', 'A', 'A', 'A',
        'A', 'A', 'A', 'A', 'A', 'A',
        'B', 'B', 'B', 'B', 'B', 'B',
        'B', 'B', 'B', 'B', 'B', 'B',
    ],
    'area': [
        1, 2, 1, 2, 1, 2,
        1, 2, 1, 2, 1, 2,
        1, 2, 3, 1, 2, 3,
        1, 2, 3, 1, 2, 3,
    ],
    'value': np.random.random(24),
})
df = ungroup_df.groupby(['case', 'area'])['value'].sum()
print(df)
# index into the multi-index to keep just the 'A' areas;
# the ":" means any value at the first level (A or B),
# then df.loc['A'].index filters the second level (area) to the areas that match A's
filt_df = df.loc[:, df.loc['A'].index]
print(filt_df)
Test df:
case  area
A     1       1.566114
      2       2.684593
B     1       1.983568
      2       1.806948
      3       2.079145
Name: value, dtype: float64
Output after dropping
case  area
A     1       1.566114
      2       2.684593
B     1       1.983568
      2       1.806948

pandas.DataFrame.round() not accepting pd.NA or pd.NAN

pandas version: 1.2
I have a dataframe whose columns are 'float64' with null values represented as pd.NA. Is there a way to round without converting to string then decimal:
df = pd.DataFrame([(.21, .3212), (.01, .61237), (.66123, .03), (.21, .18), (pd.NA, .18)],
                  columns=['dogs', 'cats'])
df
dogs cats
0 0.21 0.32120
1 0.01 0.61237
2 0.66123 0.03000
3 0.21 0.18000
4 <NA> 0.18000
Here is what I wanted to do, but it is erroring:
df['dogs'] = df['dogs'].round(2)
TypeError: float() argument must be a string or a number, not 'NAType'
Here is another way I tried, but it silently fails and no rounding occurs:
df.round({'dogs': 1})
dogs cats
0 0.21 0.32120
1 0.01 0.61237
2 0.66123 0.03000
3 0.21 0.18000
4 <NA> 0.18000
While annoying, pandas.NA is still relatively new and doesn't support ALL numpy ufuncs. Oddly, I'm also encountering errors trying to change the "dogs" column's dtype from object -> float, which seems like a bug to me. There are a couple of alternatives to achieve your desired result, though:
Mask the NA away and round the rest of the column:
na_mask = df["dogs"].notnull()
df.loc[na_mask, "dogs"] = df.loc[na_mask, "dogs"].astype(float).round(1)
print(df)
dogs cats
0 0.2 0.32120
1 0.0 0.61237
2 0.7 0.03000
3 0.2 0.18000
4 <NA> 0.18000
Replace the pd.NA with np.nan and then round
df = df.replace(pd.NA, np.nan).round({"dogs": 1})
print(df)
dogs cats
0 0.2 0.32120
1 0.0 0.61237
2 0.7 0.03000
3 0.2 0.18000
4 NaN 0.18000
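In more recent pandas versions another option (untested on 1.2, where nullable-float support was brand new) is to convert to the nullable Float64 extension dtype, whose round() understands pd.NA:
# Float64 (capital F) is the mask-backed nullable dtype, so <NA> survives the round
df['dogs'] = df['dogs'].astype('Float64').round(1)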
You can also round with plain Python, skipping over the missing values:
df['dogs'] = df['dogs'].apply(lambda x: round(x, 2) if pd.notna(x) else x)
Does the following code work?
import pandas as pd
import numpy as np
df = pd.DataFrame([(.21, .3212), (.01, .61237), (.66123, .03), (.21, .18), (np.nan, .18)],
                  columns=['dogs', 'cats'])
df['dogs'] = df['dogs'].round(2)
print(df)

Producing a "best fit" slope gradient from pandas df and populating a new column

I'm trying to add a slope calculation on individual subsets of two fields in a dataframe and have that value of slope applied to all rows in each subset. (I've used the "slope" function in Excel previously, although I'm not married to that exact algorithm.) The "desired_output" field is what I'm expecting as the output. The subsets are distinguished by the "strike_order" column; subsets start at 1 and don't have a specific highest value.
"IV" is the y value and "strike" is the x value.
Any help would be appreciated, as I don't even know where to begin with this....
import pandas
df = pandas.DataFrame([[1200, 1, .4, 0.005], [1210, 2, .35, 0.005], [1220, 3, .3, 0.005],
                       [1230, 4, .25, 0.005], [1200, 1, .4, 0.003], [1210, 2, .37, .003]],
                      columns=["strike", "strike_order", "IV", "desired_output"])
df
strike strike_order IV desired_output
0 1200 1 0.40 0.005
1 1210 2 0.35 0.005
2 1220 3 0.30 0.005
3 1230 4 0.25 0.005
4 1200 1 0.40 0.003
5 1210 2 0.37 0.003
Let me know if this isn't a well-posed question and I'll try to make it better.
You can use numpy's least squares.
We can rewrite the line equation y = mx + c as y = Ap, where A = [[x 1]] and p = [[m], [c]]. Then use lstsq to solve for p, so we need to create A by adding a column of ones to df:
import numpy as np

df['ones'] = 1
A = df[['strike', 'ones']]
y = df['IV']
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
Alternatively you can use scikit-learn's linear_model.LinearRegression.
You can verify the result by plotting the data as a scatter plot and the fitted line on top of it:
import matplotlib.pyplot as plt

plt.scatter(df['strike'], df['IV'], color='r', marker='d')
x = df['strike']
# plug x into the equation y = mx + c
y_line = c + m * x
plt.plot(x, y_line)
plt.xlabel('Strike')
plt.ylabel('IV')
plt.show()
The resulting plot (not reproduced here) shows the fitted line over the scatter points.
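As a quick cross-check, np.polyfit with degree 1 fits the same least-squares line and returns the slope first (a sketch; the names m2 and c2 are mine):
m2, c2 = np.polyfit(df['strike'], df['IV'], 1)
print(m2, c2)  # should match m and c from lstsq above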
Try this.
First create a subset column by iterating over the dataframe, using the strike_order value transitioning to 1 as the boundary between subsets:
# create subset column
subset_counter = 0
for index, row in df.iterrows():
    if row["strike_order"] == 1:
        df.loc[index, 'subset'] = subset_counter
        subset_counter += 1
    else:
        df.loc[index, 'subset'] = df.loc[index - 1, 'subset']
df['subset'] = df['subset'].astype(int)
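As an aside, the same labeling can be done without the loop, assuming every subset really does begin at strike_order == 1 (a sketch):
# each 1 in strike_order starts a new subset; cumsum numbers them from 1, so subtract 1
df['subset'] = (df['strike_order'] == 1).cumsum() - 1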
Then run a linear regression over each subset using groupby
# run linear regression on subsets of the dataframe using groupby
from sklearn import linear_model

model = linear_model.LinearRegression()
for group, df_gp in df.groupby('subset'):
    X = df_gp[['strike']]
    y = df_gp.IV
    model.fit(X, y)
    df.loc[df.subset == group, 'slope'] = model.coef_[0]
df
strike strike_order IV desired_output subset slope
0 1200 1 0.40 0.005 0 -0.005
1 1210 2 0.35 0.005 0 -0.005
2 1220 3 0.30 0.005 0 -0.005
3 1230 4 0.25 0.005 0 -0.005
4 1200 1 0.40 0.003 1 -0.003
5 1210 2 0.37 0.003 1 -0.003
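If the sklearn dependency is unwanted, the per-group slope can also be computed with the closed-form least-squares formula; this is only a sketch (the helper ols_slope is mine):
def ols_slope(g):
    # slope = cov(x, y) / var(x) for a simple least-squares line
    x, y = g['strike'], g['IV']
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

df['slope'] = df['subset'].map(df.groupby('subset').apply(ols_slope))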
@Scott This worked, except it went subset value 0, then 1, and all subsequent subset values were 2. I added an extra conditional at the beginning and a very clumsy "seed" value to stop it looking for row -1:
seed = df.loc[0, "date_exp"]
# seed = "08/11/200015/06/2001C"
# print(seed)
subset_counter = 0
for index, row in df.iterrows():
    # if index['strike_order'] == 0:
    if row['date_exp'] == seed:
        df.loc[index, 'subset'] = 0
    elif row["strike_order"] == 1:
        df.loc[index, 'subset'] = subset_counter
        subset_counter = 1 + df.loc[index - 1, 'subset']
    else:
        df.loc[index, 'subset'] = df.loc[index - 1, 'subset']
df['subset'] = df['subset'].astype(int)
This now does exactly what I want, although I think using the seed value is clunky; I would have preferred to use if row == 0 etc. But it's Friday and this works.
Cheers

Creating DataFrame with Hierarchical Columns

What is the easiest way to create a DataFrame with hierarchical columns?
I am currently creating a DataFrame from a dict of names -> Series using:
df = pd.DataFrame(data=serieses)
I would like to use the same column names but add an additional level of hierarchy on the columns. For the time being I want the additional level to have the same value for all columns, let's say "Estimates".
I am trying the following but that does not seem to work:
pd.DataFrame(data=serieses,columns=pd.MultiIndex.from_tuples([(x, "Estimates") for x in serieses.keys()]))
All I get is a DataFrame with all NaNs.
For example, what I am looking for is roughly:
l1 Estimates
l2 one two one two one two one two
r1 1 2 3 4 5 6 7 8
r2 1.1 2 3 4 5 6 71 8.2
where l1 and l2 are the labels for the MultiIndex
This appears to work:
import pandas as pd
data = {'a': [1,2,3,4], 'b': [10,20,30,40],'c': [100,200,300,400]}
df = pd.concat({"Estimates": pd.DataFrame(data)}, axis=1, names=["l1", "l2"])
l1 Estimates
l2 a b c
0 1 10 100
1 2 20 200
2 3 30 300
3 4 40 400
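The same pd.concat trick generalizes to several top-level labels; here is a sketch with a hypothetical second dict of columns (the name other is mine):
other = {'a': [5, 6, 7, 8], 'b': [50, 60, 70, 80], 'c': [500, 600, 700, 800]}
df2 = pd.concat({"Estimates": pd.DataFrame(data), "Actuals": pd.DataFrame(other)},
                axis=1, names=["l1", "l2"])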
I know the question is really old, but as of pandas 0.19.1 one can use direct dict initialization:
d = {('a', 'b'): [1, 2, 3, 4], ('a', 'c'): [5, 6, 7, 8]}
df = pd.DataFrame(d, index=['r1', 'r2', 'r3', 'r4'])
df.columns.names = ('l1', 'l2')
print(df)
l1 a
l2 b c
r1 1 5
r2 2 6
r3 3 7
r4 4 8
I'm not sure, but I think a dict as input for your DataFrame and a MultiIndex don't play well together; using an array as input instead makes it work.
I often prefer dicts as input, though; one way is to set the columns after creating the df:
import numpy as np
import pandas as pd

data = {'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40], 'c': [100, 200, 300, 400]}
df = pd.DataFrame(np.array(list(data.values())).T, index=['r1', 'r2', 'r3', 'r4'])
tups = list(zip(['Estimates'] * len(data), data.keys()))
df.columns = pd.MultiIndex.from_tuples(tups, names=['l1', 'l2'])
l1 Estimates
l2 a b c
r1 1 10 100
r2 2 20 200
r3 3 30 300
r4 4 40 400
Or when using an array as input for the df:
data_arr = np.array([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]])
tups = list(zip(['Estimates'] * data_arr.shape[0], ['a', 'b', 'c']))
df = pd.DataFrame(data_arr.T, index=['r1', 'r2', 'r3', 'r4'],
                  columns=pd.MultiIndex.from_tuples(tups, names=['l1', 'l2']))
Which gives the same result.
The solution by Rutger Kassies worked in my case, but I have more than one column in the "upper level" of the column hierarchy. I just want to provide what worked for me as an example, since it is a more general case.
First, I have data that looks like this:
> df
(A, a) (A, b) (B, a) (B, b)
0 0.00 9.75 0.00 0.00
1 8.85 8.86 35.75 35.50
2 8.51 9.60 66.67 50.70
3 0.03 508.99 56.00 8.58
I would like it to look like this:
> df
A B
a b a b
0 0.00 9.75 0.00 0.00
1 8.85 8.86 35.75 35.50
...
The solution is:
tuples = df.transpose().index
new_columns = pd.MultiIndex.from_tuples(tuples, names=['Upper', 'Lower'])
df.columns = new_columns
This is counter-intuitive, because in order to create columns I have to do it through the index.
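For what it's worth, the transpose isn't strictly required: when the existing columns already hold the tuples, the same rename can be applied to them directly (a sketch, equivalent to the above):
# df.columns is already an Index of (upper, lower) tuples here
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['Upper', 'Lower'])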
