I have two dataframes, as shown below:
dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.3,0.3,0.4]], columns=['WA', 'WB','WC'])
    WA   WB   WC
0  0.4  0.2  0.4
1  0.1  0.3  0.6
2  0.3  0.2  0.5
3  0.3  0.3  0.4
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns = ['stv_A', 'stv_B', 'stv_c'])
   stv_A  stv_B  stv_c
0    0.5    0.2    0.4
Is there any way to append dff2, which consists of only one row, to every single row in dff? The resulting dataframe should thus have 6 columns and 4 rows.
You can use:
dff[dff2.columns] = dff2.squeeze()
print(dff)
# Output
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
Pandas does the broadcasting for you when you assign a scalar as a column:
import pandas as pd
dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.3,0.3,0.4]], columns=['WA', 'WB','WC'])
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns = ['stv_A', 'stv_B', 'stv_c'])
for col in dff2.columns:
    dff[col] = dff2[col][0]  # Pass a scalar
print(dff)
Output:
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
You can first repeat the single row of dff2 len(dff) times (several methods work), then concat the repeated dataframe to dff:
df = pd.concat([dff, pd.concat([dff2]*len(dff)).reset_index(drop=True)], axis=1)
print(df)
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
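If you are on pandas 1.2 or newer, a cross join is another option worth knowing: it pairs every row of dff with the single row of dff2, so no manual repetition is needed. A minimal sketch using the question's frames:

```python
import pandas as pd

dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6],
                    [0.3, 0.2, 0.5], [0.3, 0.3, 0.4]],
                   columns=['WA', 'WB', 'WC'])
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns=['stv_A', 'stv_B', 'stv_c'])

# how='cross' builds the cartesian product; with a one-row right frame
# that is exactly "append this row to every row of the left frame"
out = dff.merge(dff2, how='cross')
print(out)
```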
Related
I have two data frames: one equally divided into weeks by the number of weeks in the month (February: 4 weeks, March: 5 weeks). The other one has actual data.
equally divided df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Amaya 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Will 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Francis 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
actual df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0 0 0 0 0
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Kamal 0.3 0 0.5 0.2 0.1 0.5 0.1 0.1 0.2
Amaya 0.5 0 0.3 0.2 0 0 0 0 0
Jacob 0.2 0.4 0 0.4 0.4 0 0.2 0.2 0.2
Preety 0.7 0.1 0.1 0.1 0.2 0.1 0.4 0.3 0
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Tara 0 0.5 0.2 0.3 0.2 0 0.3 0.2 0.3
I want to replace the data in the equally divided df with data from the actual df. The condition for weeks 1 to 4 (Feb) is that the actual df's week 1-4 values sum to 1. For example, in the actual df:
Sunil: 0.2 + 0.4 + 0.3 + 0.1 = 1
So Sunil's 0.25 0.25 0.25 0.25 is replaced with those values.
Weeks 5-9 (Mar) work the same way: if the sum of the values in the actual df equals 1, replace them.
For Sunil it's 0 + 0 + 0 + 0 + 0, which is not equal to 1, so the values for weeks 5 to 9 are not replaced.
So the data frame looks like the below.
equally divided with edit df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0.2 0.2 0.2 0.2 0.2
Amaya 0.5 0 0.3 0.2 0.2 0.2 0.2 0.2 0.2
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
I'm trying to bring all the values from the actual df into the equally divided frame and then edit them, but couldn't find a way:
equally['Feb_1_actual'] = equally['Name'].map(actual.set_index('Name')['Feb_1'])
## then get a sum, and if it equals 1 replace the values, otherwise keep them
Is there another way to do it?
Any help would be appreciated. Thanks in advance!
The idea is to avoid dealing with variable column names, since they can't be used directly with pandas methods. So we unpivot them first with pd.melt, perform the grouping, and then pivot back.
Try this
import pandas as pd
import numpy as np  # np.float64 is used below
from functools import partial
# Read data
text = """Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Amaya 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Will 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Francis 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2"""
column_names = [x.split() for x in text.split("\n")][0]
values = [x.split() for x in text.split("\n")][1:]
equally = pd.DataFrame(
    values, columns=column_names
).assign(**{k: partial(lambda key, df: df[key].astype(np.float64), k) for k in column_names[1:]})
text = """Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0 0 0 0 0
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Kamal 0.3 0 0.5 0.2 0.1 0.5 0.1 0.1 0.2
Amaya 0.5 0 0.3 0.2 0 0 0 0 0
Jacob 0.2 0.4 0 0.4 0.4 0 0.2 0.2 0.2
Preety 0.7 0.1 0.1 0.1 0.2 0.1 0.4 0.3 0
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Tara 0 0.5 0.2 0.3 0.2 0 0.3 0.2 0.3"""
column_names = [x.split() for x in text.split("\n")][0]
values = [x.split() for x in text.split("\n")][1:]
df = pd.DataFrame(
    values, columns=column_names
).assign(**{k: partial(lambda key, df: df[key].astype(np.float64), k) for k in column_names[1:]})
# Processing
equally_unpivot = pd.melt(equally, id_vars='Name', value_vars=column_names[1:], var_name='Month_Week', value_name='Val').assign(**{
    "Month": lambda df: df["Month_Week"].str.split("_").str[0],
    "Week": lambda df: df["Month_Week"].str.split("_").str[1],
})
df_unpivot = pd.melt(df, id_vars='Name', value_vars=column_names[1:], var_name='Month_Week', value_name='Val').assign(**{
    "Month": lambda df: df["Month_Week"].str.split("_").str[0],
    "Week": lambda df: df["Month_Week"].str.split("_").str[1],
})
month_sums = df_unpivot[["Name", "Month", "Val"]].groupby(["Name", "Month"], as_index=False).sum()
# Compare with a tolerance: sums of weekly fractions are rarely exactly 1.0 in floating point
valid_entries = month_sums[month_sums["Val"].sub(1).abs().lt(1e-9)].drop(columns=["Val"])
merged_df = (
    equally[["Name"]].merge(
        pd.concat([
            equally_unpivot.merge(
                valid_entries,
                on=["Name", "Month"],
                how="left",
                indicator=True
            ).query("_merge == 'left_only'").drop(columns=["_merge"]),
            df_unpivot.merge(
                valid_entries,
                on=["Name", "Month"],
                how="inner",
            )
        ])
        .drop(columns=["Month", "Week"])
        .pivot(index="Name", columns="Month_Week", values="Val")
        .rename_axis(None, axis=1)
        .reset_index(drop=False)
    )
)
print(merged_df)
A more concise approach:
split the dataframes into two segments, Feb and Mar
align them on the matched Name column
update each segment with pd.DataFrame.update, and finally concat them
eq_Feb = eq_df.set_index('Name').filter(like='Feb')
eq_Mar = eq_df.set_index('Name').filter(like='Mar')
actual_df_ = actual_df[actual_df.Name.isin(eq_df.Name)]
actual_Feb = actual_df_.set_index('Name').filter(like='Feb')
actual_Mar = actual_df_.set_index('Name').filter(like='Mar')
# Compare the row sums to 1 with a tolerance: sums like 0.2+0.3+0.2+0.2+0.1
# are not exactly 1.0 in floating point, and astype(int) would truncate them to 0
eq_Feb.update(actual_Feb[actual_Feb.sum(axis=1).sub(1).abs().lt(1e-9)])
eq_Mar.update(actual_Mar[actual_Mar.sum(axis=1).sub(1).abs().lt(1e-9)])
res_df = pd.concat([eq_Feb, eq_Mar], axis=1)
        Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Name
Sunil    0.20  0.40  0.30  0.10   0.2   0.2   0.2   0.2   0.2
Amaya    0.50  0.00  0.30  0.20   0.2   0.2   0.2   0.2   0.2
Will     0.80  0.20  0.00  0.00   0.1   0.2   0.3   0.1   0.3
Francis  0.40  0.20  0.30  0.10   0.2   0.4   0.0   0.4   0.0
Kadeep   0.25  0.25  0.25  0.25   0.2   0.2   0.2   0.2   0.2
Hima     0.50  0.20  0.30  0.00   0.2   0.3   0.2   0.2   0.1
Lazy     0.25  0.25  0.25  0.25   0.2   0.2   0.2   0.2   0.2
Joseph   0.25  0.25  0.25  0.25   0.2   0.2   0.2   0.2   0.2
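A side note on the "sum equals 1" check used above: fractional weekly weights rarely add up to exactly 1.0 in binary floating point, so an exact comparison (or integer truncation) can silently skip rows that should be replaced, such as Hima's March row. A minimal demonstration:

```python
import numpy as np

# Hima's March weights from the actual df
total = 0.2 + 0.3 + 0.2 + 0.2 + 0.1
print(total)                   # 0.9999999999999999, not 1.0
print(total == 1.0)            # False: exact comparison misses the match
print(int(total))              # 0: truncating to int makes it worse
print(np.isclose(total, 1.0))  # True: a tolerance-based check succeeds
```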
I have a dataframe df
A B C
0.1 0.3 0.5
0.2 0.4 0.6
0.3 0.5 0.7
0.4 0.6 0.8
0.5 0.7 0.9
For each row I would like to add a value to each element, taken from dataframe df1:
X
0.1
0.2
0.3
0.4
0.5
Such that the final result would be
A B C
0.2 0.4 0.6
0.4 0.6 0.8
0.6 0.8 1.0
0.8 1.0 1.2
1.0 1.2 1.4
I have tried using df_new = df.sum(df1, axis=0), but got the following error: TypeError: stat_func() got multiple values for argument 'axis'. I would be open to numpy solutions as well.
You can use np.add:
df = np.add(df, df1.to_numpy())
print(df)
Prints:
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
import pandas as pd

df = pd.DataFrame([[0.1, 0.3, 0.5],
                   [0.2, 0.4, 0.6],
                   [0.3, 0.5, 0.7],
                   [0.4, 0.6, 0.8],
                   [0.5, 0.7, 0.9]],
                  columns=['A', 'B', 'C'])
df1 = [0.1, 0.2, 0.3, 0.4, 0.5]
# In one pandas instruction
df = df.add(pd.Series(df1), axis=0)
Result:
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
Try concat with .stack() and .sum() (using the question's df and df1 names):
df_new = pd.concat([df.stack(), df1.stack()], axis=1).bfill().sum(axis=1).unstack(1).drop('X', axis=1)
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
df = pd.DataFrame([[0.1, 0.3, 0.5],
                   [0.2, 0.4, 0.6],
                   [0.3, 0.5, 0.7],
                   [0.4, 0.6, 0.8],
                   [0.5, 0.7, 0.9]],
                  columns=['A', 'B', 'C'])
df["X"] = [0.1, 0.2, 0.3, 0.4, 0.5]
columns_to_add = df.columns[:-1]
for col in columns_to_add:
    df[col] += df['X']  # this is where addition or any other operation can be performed
df = df.drop('X', axis=1)  # X is a column, so drop along axis=1
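All of the answers above boil down to aligning df1's single column against df's rows. Assuming the question's df/df1 names, arguably the shortest pandas-native form is to select the X column (a Series) and let DataFrame.add align it row-wise with axis=0; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'B': [0.3, 0.4, 0.5, 0.6, 0.7],
                   'C': [0.5, 0.6, 0.7, 0.8, 0.9]})
df1 = pd.DataFrame({'X': [0.1, 0.2, 0.3, 0.4, 0.5]})

# df1['X'] is a Series; axis=0 aligns it with df's index and
# broadcasts the addition across all three columns
out = df.add(df1['X'], axis=0)
print(out)
```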
I'm starting to learn pandas and I am currently unable to construct the dataframe I would like to and would like some advice.
Let's say I have two DataFrames :
T1df:
   max  min
0  0.2  0.1
1  0.2  0.1
2  0.2  0.1
3  0.2  0.1

T2df:
   max  min
0  0.4  0.3
1  0.4  0.3
2  0.4  0.3
3  0.4  0.3
How could I merge them to end up with this shape of DataFrame?
T1 max 0.2 0.2 0.2 0.2
min 0.1 0.1 0.1 0.1
T2 max 0.4 0.4 0.4 0.4
min 0.3 0.3 0.3 0.3
Use concat with axis=1 and the keys parameter, then transpose with DataFrame.T to get the MultiIndex on the index:
df = pd.concat([T1df, T2df], axis=1, keys=('T1','T2')).T
print (df)
0 1 2 3
T1 max 0.2 0.2 0.2 0.2
min 0.1 0.1 0.1 0.1
T2 max 0.4 0.4 0.4 0.4
min 0.3 0.3 0.3 0.3
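A self-contained sketch of that answer, rebuilding the two example frames, in case you want to verify the resulting shape and MultiIndex yourself:

```python
import pandas as pd

T1df = pd.DataFrame({'max': [0.2] * 4, 'min': [0.1] * 4})
T2df = pd.DataFrame({'max': [0.4] * 4, 'min': [0.3] * 4})

# keys labels each frame's columns, producing ('T1','max'), ('T1','min'), ...
# .T then moves that MultiIndex from the columns onto the index
df = pd.concat([T1df, T2df], axis=1, keys=('T1', 'T2')).T
print(df)
```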
Assume that we have the following pandas Series, which resulted from an apply function applied on a dataframe after a groupby:
<class 'pandas.core.series.Series'>
0 (1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2])
1 (2, 1000, [0.6, 0.7, 0.5], [0.1, 0.3, 0.1])
2 (1, 0, [0.4, 0.4, 0.4], [0.4, 0.4, 0.4])
3 (1, 0, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
4 (3, 14000, [0.8, 0.8, 0.8], [0.6, 0.6, 0.6])
dtype: object
Can we convert this into a dataframe when sigList = ['sig1', 'sig2', 'sig3'] is given?
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
1 0 0.2 0.2 0.2 0.2 0.2 0.2
2 1000 0.6 0.7 0.5 0.1 0.3 0.1
1 0 0.4 0.4 0.4 0.4 0.4 0.4
1 0 0.5 0.5 0.5 0.5 0.5 0.5
3 14000 0.8 0.8 0.8 0.6 0.6 0.6
Thanks in advance
Do it the old-fashioned (and fast) way, using a list comprehension:
columns = ("Length Distance sig1Max sig2Max "
           "sig3Max sig1Min sig2Min sig3Min").split()
df = pd.DataFrame([[a, b, *c, *d] for a, b, c, d in series.values], columns=columns)
print(df)
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1 0 0.2 0.2 0.2 0.2 0.2 0.2
1 2 1000 0.6 0.7 0.5 0.1 0.3 0.1
2 1 0 0.4 0.4 0.4 0.4 0.4 0.4
3 1 0 0.5 0.5 0.5 0.5 0.5 0.5
4 3 14000 0.8 0.8 0.8 0.6 0.6 0.6
Or, perhaps you meant, do it a little more dynamically:
sigList = ['sig1', 'sig2', 'sig3']
columns = ['Length', 'Distance']
columns.extend(f'{s}{lbl}' for lbl in ('Max', 'Min') for s in sigList)
df = pd.DataFrame([[a, b, *c, *d] for a, b, c, d in series.values], columns=columns)
print(df)
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1 0 0.2 0.2 0.2 0.2 0.2 0.2
1 2 1000 0.6 0.7 0.5 0.1 0.3 0.1
2 1 0 0.4 0.4 0.4 0.4 0.4 0.4
3 1 0 0.5 0.5 0.5 0.5 0.5 0.5
4 3 14000 0.8 0.8 0.8 0.6 0.6 0.6
You may check:
newdf = pd.DataFrame(s.tolist())
newdf = pd.concat([newdf[[0, 1]], pd.DataFrame(newdf[2].tolist()), pd.DataFrame(newdf[3].tolist())], axis=1)
newdf.columns = [
    "Length", "Distance", "sig1Max", "sig2Max", "sig3Max", "sig1Min", "sig2Min", "sig3Min"
]
newdf
Out[163]:
Length Distance sig1Max ... sig1Min sig2Min sig3Min
0 1 0 0.2 ... 0.2 0.2 0.2
1 2 1000 0.6 ... 0.1 0.3 0.1
2 1 0 0.4 ... 0.4 0.4 0.4
3 1 0 0.5 ... 0.5 0.5 0.5
4 3 14000 0.8 ... 0.6 0.6 0.6
[5 rows x 8 columns]
You can flatten each element and then convert each to a Series itself. Converting each element to a Series turns the main Series (s in the example below) into a DataFrame. Then just set the column names as you wish.
For example:
import pandas as pd
# load in your data
s = pd.Series([
    (1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2]),
    (2, 1000, [0.6, 0.7, 0.5], [0.1, 0.3, 0.1]),
    (1, 0, [0.4, 0.4, 0.4], [0.4, 0.4, 0.4]),
    (1, 0, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    (3, 14000, [0.8, 0.8, 0.8], [0.6, 0.6, 0.6]),
])

def flatten(x):
    # note this is not very robust, but works for this case
    return [x[0], x[1], *x[2], *x[3]]

df = s.apply(flatten).apply(pd.Series)
df.columns = [
    "Length", "Distance", "sig1Max", "sig2Max", "sig3Max", "sig1Min", "sig2Min", "sig3Min"
]
Then you have df as:
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1.0 0.0 0.2 0.2 0.2 0.2 0.2 0.2
1 2.0 1000.0 0.6 0.7 0.5 0.1 0.3 0.1
2 1.0 0.0 0.4 0.4 0.4 0.4 0.4 0.4
3 1.0 0.0 0.5 0.5 0.5 0.5 0.5 0.5
4 3.0 14000.0 0.8 0.8 0.8 0.6 0.6 0.6
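One caveat with the apply(pd.Series) route: because each tuple mixes ints and floats, everything is upcast to float (hence the 1.0 and 1000.0 in the output above). If the integer dtypes matter, you can cast those columns back afterwards; a small sketch on a two-row subset of the data:

```python
import pandas as pd

s = pd.Series([
    (1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2]),
    (2, 1000, [0.6, 0.7, 0.5], [0.1, 0.3, 0.1]),
])
df = s.apply(lambda x: pd.Series([x[0], x[1], *x[2], *x[3]]))
df.columns = ["Length", "Distance", "sig1Max", "sig2Max", "sig3Max",
              "sig1Min", "sig2Min", "sig3Min"]

# apply(pd.Series) upcast the count columns to float; cast them back to int
df = df.astype({"Length": int, "Distance": int})
print(df.dtypes)
```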
I created a pandas dataframe with MultiIndex columns like this:
import numpy as np
import pandas as pd

A = ['ECFP', 'ECFP', 'ECFP', 'FCFP', 'FCFP', 'FCFP', 'RDK5', 'RDK5', 'RDK5']
B = ['R', 'tau', 'RMSEc', 'R', 'tau', 'RMSEc', 'R', 'tau', 'RMSEc']
C = np.array([[0.1, 0.3, 0.5, np.nan, 0.6, 0.4],
              [0.4, 0.3, 0.3, np.nan, 0.4, 0.3],
              [1.2, 1.3, 1.1, np.nan, 1.5, 1.0],
              [0.4, 0.3, 0.4, 0.8, 0.1, 0.2],
              [0.2, 0.3, 0.3, 0.3, 0.5, 0.6],
              [1.0, 1.2, 1.0, 0.9, 1.2, 1.0],
              [0.4, 0.7, 0.5, 0.4, 0.6, 0.6],
              [0.6, 0.5, 0.3, 0.3, 0.3, 0.5],
              [1.2, 1.5, 1.3, 0.97, 1.5, 1.0]])
df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(zip(A, B)))
df = df.dropna(axis=0, how='any')
The final Dataframe looks like this:
ECFP FCFP RDK5
R tau RMSEc R tau RMSEc R tau RMSEc
0 0.1 0.4 1.2 0.4 0.2 1.0 0.4 0.6 1.2
1 0.3 0.3 1.3 0.3 0.3 1.2 0.7 0.5 1.5
2 0.5 0.3 1.1 0.4 0.3 1.0 0.5 0.3 1.3
4 0.6 0.4 1.5 0.1 0.5 1.2 0.6 0.3 1.5
5 0.4 0.3 1.0 0.2 0.6 1.0 0.6 0.5 1.0
How can I get the correlation matrix only between 'R' values for all types of data ('ECFP', 'FCFP', 'RDK5')?
Use pd.IndexSlice:
In [53]: df.loc[:, pd.IndexSlice[:, 'R']]
Out[53]:
ECFP FCFP RDK5
R R R
0 0.1 0.4 0.4
1 0.3 0.3 0.7
2 0.5 0.4 0.5
4 0.6 0.1 0.6
5 0.4 0.2 0.6
By using slice
df.loc[:,(slice(None),'R')]
Out[375]:
ECFP FCFP RDK5
R R R
0 0.1 0.4 0.4
1 0.3 0.3 0.7
2 0.5 0.4 0.5
4 0.6 0.1 0.6
5 0.4 0.2 0.6
Both answers work, but first I must lexsort, otherwise I get this error:
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'
The solution is:
df = df.sort_index(axis=1)  # sortlevel was removed in newer pandas
print("Correlation matrix of Pearson's R values among all feature vector types:")
df.loc[:, pd.IndexSlice[:, 'R']].corr()
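An alternative that sidesteps the lexsort requirement entirely is DataFrame.xs, which selects a single label from one level of the column MultiIndex rather than slicing. A sketch rebuilding the frame from the question:

```python
import numpy as np
import pandas as pd

A = ['ECFP', 'ECFP', 'ECFP', 'FCFP', 'FCFP', 'FCFP', 'RDK5', 'RDK5', 'RDK5']
B = ['R', 'tau', 'RMSEc', 'R', 'tau', 'RMSEc', 'R', 'tau', 'RMSEc']
C = np.array([[0.1, 0.3, 0.5, np.nan, 0.6, 0.4],
              [0.4, 0.3, 0.3, np.nan, 0.4, 0.3],
              [1.2, 1.3, 1.1, np.nan, 1.5, 1.0],
              [0.4, 0.3, 0.4, 0.8, 0.1, 0.2],
              [0.2, 0.3, 0.3, 0.3, 0.5, 0.6],
              [1.0, 1.2, 1.0, 0.9, 1.2, 1.0],
              [0.4, 0.7, 0.5, 0.4, 0.6, 0.6],
              [0.6, 0.5, 0.3, 0.3, 0.3, 0.5],
              [1.2, 1.5, 1.3, 0.97, 1.5, 1.0]])
df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(zip(A, B)))
df = df.dropna(axis=0, how='any')

# xs picks the 'R' label from the second column level (level=1),
# leaving one column per fingerprint type
r_only = df.xs('R', axis=1, level=1)
print(r_only.corr())
```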