I'm starting to learn pandas and am currently unable to construct the DataFrame I'd like, so I'd appreciate some advice.
Let's say I have two DataFrames:
T1df:
   max  min
0  0.2  0.1
1  0.2  0.1
2  0.2  0.1
3  0.2  0.1
T2df:
   max  min
0  0.4  0.3
1  0.4  0.3
2  0.4  0.3
3  0.4  0.3
How could I merge them to end up with this shape of DataFrame?
T1  max  0.2  0.2  0.2  0.2
    min  0.1  0.1  0.1  0.1
T2  max  0.4  0.4  0.4  0.4
    min  0.3  0.3  0.3  0.3
Use concat with axis=1 and the keys parameter, then transpose with DataFrame.T to get a MultiIndex on the rows:
df = pd.concat([T1df, T2df], axis=1, keys=('T1', 'T2')).T
print(df)
0 1 2 3
T1 max 0.2 0.2 0.2 0.2
min 0.1 0.1 0.1 0.1
T2 max 0.4 0.4 0.4 0.4
min 0.3 0.3 0.3 0.3
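For reference, a self-contained version of the above (the sample frames are reconstructed from the question):

```python
import pandas as pd

# Reconstruct the two sample frames from the question.
T1df = pd.DataFrame({"max": [0.2] * 4, "min": [0.1] * 4})
T2df = pd.DataFrame({"max": [0.4] * 4, "min": [0.3] * 4})

# Concatenate side by side under the keys 'T1'/'T2', then transpose:
# the keys become the outer level of a row MultiIndex.
df = pd.concat([T1df, T2df], axis=1, keys=("T1", "T2")).T
print(df)
```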
I have two data frames: one with values equally divided across the weeks of each month (February has 4 weeks, March has 5), and one with the actual data.
equally divided df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Amaya 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Will 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Francis 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
actual df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0 0 0 0 0
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Kamal 0.3 0 0.5 0.2 0.1 0.5 0.1 0.1 0.2
Amaya 0.5 0 0.3 0.2 0 0 0 0 0
Jacob 0.2 0.4 0 0.4 0.4 0 0.2 0.2 0.2
Preety 0.7 0.1 0.1 0.1 0.2 0.1 0.4 0.3 0
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Tara 0 0.5 0.2 0.3 0.2 0 0.3 0.2 0.3
I want to replace the data in the equally divided df with values from the actual df. The condition for weeks 1 to 4 (Feb) is that the corresponding weeks in the actual df sum to 1. For example, in the actual df:
Sunil: 0.2 + 0.4 + 0.3 + 0.1 = 1
so Sunil's 0.25 0.25 0.25 0.25 in the equally divided df is replaced with those values.
Weeks 5 to 9 (Mar) work the same way: replace only if the actual values sum to 1. For Sunil that sum is 0 + 0 + 0 + 0 + 0, which is not 1, so his values for weeks 5 to 9 are not replaced.
The resulting data frame looks like this:
equally divided with edit df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0.2 0.2 0.2 0.2 0.2
Amaya 0.5 0 0.3 0.2 0.2 0.2 0.2 0.2 0.2
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
I'm trying to bring the values from the actual df into the equally divided frame and then edit them, but I couldn't find a way:
equally['Feb_1_actual'] = equally['Name'].map(actual.set_index('Name')['Feb_1'])
# then take the sum and, if it equals 1, replace the values; otherwise keep them
Is there another way to do it?
Any help would be appreciated. Thanks in advance!
The idea is to avoid dealing with variable column names, since they are awkward to use directly with pandas methods. So we unpivot the frames first with pd.melt, perform the grouping on the long format, and then pivot back.
Try this
import pandas as pd
import numpy as np
from functools import partial
# Read data
text = """Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Amaya 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Will 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Francis 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2"""
column_names = [x.split() for x in text.split("\n")][0]
values = [x.split() for x in text.split("\n")][1:]
equally = pd.DataFrame(
    values, columns=column_names
).assign(**{k: partial(lambda key, df: df[key].astype(np.float64), k) for k in column_names[1:]})
text = """Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0 0 0 0 0
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Kamal 0.3 0 0.5 0.2 0.1 0.5 0.1 0.1 0.2
Amaya 0.5 0 0.3 0.2 0 0 0 0 0
Jacob 0.2 0.4 0 0.4 0.4 0 0.2 0.2 0.2
Preety 0.7 0.1 0.1 0.1 0.2 0.1 0.4 0.3 0
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Tara 0 0.5 0.2 0.3 0.2 0 0.3 0.2 0.3"""
column_names = [x.split() for x in text.split("\n")][0]
values = [x.split() for x in text.split("\n")][1:]
df = pd.DataFrame(
    values, columns=column_names
).assign(**{k: partial(lambda key, df: df[key].astype(np.float64), k) for k in column_names[1:]})
# Processing
equally_unpivot = pd.melt(equally, id_vars='Name', value_vars=column_names[1:], var_name='Month_Week', value_name='Val').assign(**{
    "Month": lambda df: df["Month_Week"].str.split("_").str[0],
    "Week": lambda df: df["Month_Week"].str.split("_").str[1],
})
df_unpivot = pd.melt(df, id_vars='Name', value_vars=column_names[1:], var_name='Month_Week', value_name='Val').assign(**{
    "Month": lambda df: df["Month_Week"].str.split("_").str[0],
    "Week": lambda df: df["Month_Week"].str.split("_").str[1],
})
# round the summed values before comparing to 1, so float-precision drift (e.g. 0.2 + 0.4 + 0.3 + 0.1 != 1.0 exactly) doesn't miss valid rows
valid_entries = df_unpivot[["Name", "Month", "Val"]].groupby(["Name", "Month"], as_index=False).sum().assign(Val=lambda d: d["Val"].round(6)).query("Val == 1").drop(columns=["Val"])
merged_df = (
    equally[["Name"]].merge(
        pd.concat([
            equally_unpivot.merge(
                valid_entries,
                on=["Name", "Month"],
                how="left",
                indicator=True,
            ).query("_merge == 'left_only'").drop(columns=["_merge"]),
            df_unpivot.merge(
                valid_entries,
                on=["Name", "Month"],
                how="inner",
            ),
        ])
        .drop(columns=["Month", "Week"])
        .pivot(index="Name", columns="Month_Week", values="Val")
        .rename_axis(None, axis=1)
        .reset_index(drop=False)
    )
)
print(merged_df)
A more concise approach:
split the dataframes into two segments, Feb and Mar
align them on the matched Name column
update each segment with pd.DataFrame.update and finally concat them
eq_Feb = eq_df.set_index('Name').filter(like='Feb')
eq_Mar = eq_df.set_index('Name').filter(like='Mar')
actual_df_ = actual_df[actual_df.Name.isin(eq_df.Name)]
actual_Feb = actual_df_.set_index('Name').filter(like='Feb')
actual_Mar = actual_df_.set_index('Name').filter(like='Mar')
# round the row sums before comparing to 1; truncating with astype(int) would turn a sum of 0.9999999999999999 into 0 and miss rows that should be replaced
eq_Feb.update(actual_Feb[actual_Feb.sum(axis=1).round(6).eq(1)])
eq_Mar.update(actual_Mar[actual_Mar.sum(axis=1).round(6).eq(1)])
res_df = pd.concat([eq_Feb, eq_Mar], axis=1)
Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Name
Sunil 0.20 0.40 0.30 0.10 0.2 0.2 0.2 0.2 0.2
Amaya 0.50 0.00 0.30 0.20 0.2 0.2 0.2 0.2 0.2
Will 0.80 0.20 0.00 0.00 0.1 0.2 0.3 0.1 0.3
Francis 0.40 0.20 0.30 0.10 0.2 0.4 0.0 0.4 0.0
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.50 0.20 0.30 0.00 0.2 0.3 0.2 0.2 0.1
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
I have 2 dataframes, df1 and df2. df1 is filled with values and df2 is empty.
As shown below, both dataframes always have the same index and columns; the only difference is that df1 contains no duplicate index or column labels, while df2 does.
How can I fill values into df2 from df1, so that the combination of index and column labels is respected?
df1 = pd.DataFrame({'Ind': pd.Series([1, 2, 3, 4]),
                    1: pd.Series([1, 0.2, 0.2, 0.8]),
                    2: pd.Series([0.2, 1, 0.2, 0.8]),
                    3: pd.Series([0.2, 0.2, 1, 0.8]),
                    4: pd.Series([0.8, 0.8, 0.8, 1])})
df1 = df1.set_index(['Ind'])
df2 = pd.DataFrame(columns = [1,1,2,2,3,4], index=[1,1,2,2,3,4])
IIUC, you want to update:
df2.update(df1)
print(df2)
1 1 2 2 3 4
1 1.0 1.0 0.2 0.2 0.2 0.8
1 1.0 1.0 0.2 0.2 0.2 0.8
2 0.2 0.2 1.0 1.0 0.2 0.8
2 0.2 0.2 1.0 1.0 0.2 0.8
3 0.2 0.2 0.2 0.2 1.0 0.8
4 0.8 0.8 0.8 0.8 0.8 1.0
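Since df1's labels are unique, another option (a sketch, not from the original answer) is to skip the empty frame entirely and build the duplicated one with reindex, which repeats each row and column once per occurrence of its label:

```python
import pandas as pd

# df1 as in the question, after set_index('Ind')
df1 = pd.DataFrame({1: [1, 0.2, 0.2, 0.8],
                    2: [0.2, 1, 0.2, 0.8],
                    3: [0.2, 0.2, 1, 0.8],
                    4: [0.8, 0.8, 0.8, 1]},
                   index=[1, 2, 3, 4])

# reindex with duplicated target labels repeats the matching rows/columns
df2 = df1.reindex(index=[1, 1, 2, 2, 3, 4], columns=[1, 1, 2, 2, 3, 4])
print(df2)
```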
##1
M_members = [1000 , 1450, 1900]
M = pd.DataFrame(M_members)
##2
a_h_members = [0.4 , 0.6 , 0.8 ]
a_h = pd.DataFrame(a_h_members)
##3
d_h_members = [0.1 , 0.2 ]
d_h = pd.DataFrame(d_h_members)
The output I want, as a dataframe:
1000 0.4 0.1
1000 0.4 0.2
1000 0.6 0.1
1000 0.6 0.2
1000 0.8 0.1
1000 0.8 0.2
1450 0.4 0.1
1450 0.4 0.2
1450 0.6 0.1
1450 0.6 0.2
1450 0.8 0.1
1450 0.8 0.2
1900 0.4 0.1
1900 0.4 0.2
1900 0.6 0.1
1900 0.6 0.2
1900 0.8 0.1
1900 0.8 0.2
I actually want to do this for more dataframes.
Use itertools.product:
>>> import itertools
>>> pd.DataFrame(itertools.product(*[M_members, a_h_members, d_h_members]))
0 1 2
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
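Note that itertools.product scales to any number of input lists, and you can name the columns at construction time (the column names below are made up for illustration):

```python
import itertools
import pandas as pd

M_members = [1000, 1450, 1900]
a_h_members = [0.4, 0.6, 0.8]
d_h_members = [0.1, 0.2]

# product varies the rightmost list fastest, matching the desired order
out = pd.DataFrame(
    itertools.product(M_members, a_h_members, d_h_members),
    columns=["M", "a_h", "d_h"],
)
print(out)
```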
Depending on your data size, expand_grid from pyjanitor may help with performance:
# pip install pyjanitor
import janitor as jn
import pandas as pd
others = {'a':M, 'b':a_h, 'c':d_h}
jn.expand_grid(others = others)
a b c
0 0 0
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
You can drop a column level, or flatten it:
jn.expand_grid(others = others).droplevel(axis = 1, level = 1)
a b c
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
If you're starting from the DataFrames, you can use a repeated cross merge:
from functools import reduce

dfs = [M, a_h, d_h]
out = (reduce(lambda a, b: a.merge(b, how='cross'), dfs)
       .set_axis(range(len(dfs)), axis=1))
Output:
0 1 2
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
I have 2 dataframes, as shown below.
dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.3,0.3,0.4]], columns=['WA', 'WB','WC'])
WA WB WC
0 0.4 0.2 0.4
1 0.1 0.3 0.6
2 0.3 0.2 0.5
3 0.3 0.3 0.4
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns = ['stv_A', 'stv_B', 'stv_c'])
stv_A  stv_B  stv_c
0 0.5 0.2 0.4
Is there any way to append dff2, which consists of only one row, to every single row in dff? The resulting dataframe should thus have six columns and four rows.
You can use:
dff[dff2.columns] = dff2.squeeze()
print(dff)
# Output
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
Pandas does the broadcasting for you when you assign a scalar as a column:
import pandas as pd
dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.3,0.3,0.4]], columns=['WA', 'WB','WC'])
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns = ['stv_A', 'stv_B', 'stv_c'])
for col in dff2.columns:
    dff[col] = dff2[col][0]  # pass a scalar
print(dff)
Output:
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
You can first repeat the single row of dff2 len(dff) times (there are several ways to do this), then concat the repeated dataframe to dff:
df = pd.concat([dff, pd.concat([dff2]*len(dff)).reset_index(drop=True)], axis=1)
print(df)
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
I have a dataframe that looks like this
a b c d
0 0.6 -0.4 0.2 0.7
1 0.8 0.2 -0.2 0.3
2 -0.1 0.5 0.5 -0.4
3 0.8 -0.6 -0.7 -0.2
And I wish to create column 'e' such that it displays the column number of the first instance in each row where the value is less than 0
So the goal result will look like this
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
I can do this in Excel using a MATCH(True) type function but am struggling to make progress in Pandas.
Thanks for any help
You can use np.argmax:
import numpy as np

# where the values are less than 0
a = df.values < 0
# if a row has no negative value, return 0
df['e'] = np.where(a.any(axis=1), np.argmax(a, axis=1) + 1, 0)
Output:
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
Something like idxmin with np.sign:
import numpy as np
df['e'] = df.columns.get_indexer(np.sign(df).idxmin(axis=1)) + 1
df
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
Get the first True with idxmax, combined with get_indexer_for to get the column numbers:
df["e"] = df.columns.get_indexer_for(df.lt(0).idxmax(axis=1).array) + 1
df
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
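All three answers assume every row contains at least one negative value (true for the sample data; only the first answer guards against the opposite). A pandas-only sketch that also handles all-non-negative rows:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"a": [0.6, 0.8, -0.1, 0.8],
                   "b": [-0.4, 0.2, 0.5, -0.6],
                   "c": [0.2, -0.2, 0.5, -0.7],
                   "d": [0.7, 0.3, -0.4, -0.2]})

mask = df.lt(0)                     # True where a value is negative
# idxmax returns the first True label per row; get_indexer maps labels
# to 0-based positions, so add 1 for a 1-based column number
df["e"] = df.columns.get_indexer(mask.idxmax(axis=1)) + 1
df.loc[~mask.any(axis=1), "e"] = 0  # rows with no negative get 0
print(df)
```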