I have two data frames (dfs). One is divided equally into weeks by the count of weeks in the month (February has 4 weeks, March has 5 weeks). The other one has actual data.
equally divided df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Amaya 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Will 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Francis 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
actual df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0 0 0 0 0
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Kamal 0.3 0 0.5 0.2 0.1 0.5 0.1 0.1 0.2
Amaya 0.5 0 0.3 0.2 0 0 0 0 0
Jacob 0.2 0.4 0 0.4 0.4 0 0.2 0.2 0.2
Preety 0.7 0.1 0.1 0.1 0.2 0.1 0.4 0.3 0
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Tara 0 0.5 0.2 0.3 0.2 0 0.3 0.2 0.3
I want to replace the data in the equally divided df with data from the actual df. The condition for weeks 1 to 4 (Feb) is that the actual df weeks 1 to 4 sum to 1. For example:
actual df
Sunil 0.2 + 0.4 + 0.3 + 0.1 = 1
Then replace the numbers in the equally divided df, so Sunil's 0.25 0.25 0.25 0.25 will be replaced with the above values.
Weeks 5 to 9 (Mar) work the same way: if the sum of the values in the actual df equals 1, then replace.
For Sunil it's 0 + 0 + 0 + 0 + 0, which is not equal to 1, so the values for weeks 5 to 9 are not replaced.
So the data frame looks like the below.
equally divided with edit df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0.2 0.2 0.2 0.2 0.2
Amaya 0.5 0 0.3 0.2 0.2 0.2 0.2 0.2 0.2
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
I'm trying to bring all the values from the actual df into the equally divided frame and then edit them, but couldn't find a way.
equally['Feb_1_actual'] = equally['Name'].map(actual.set_index('Name')['Feb_1'])
# then get a sum and, if it's equal to 1, replace the values; otherwise keep the same values
Is there another way to do it?
Any help would be appreciated. Thanks in advance!
The idea is to avoid dealing with variable column names, since they can't easily be used with pandas methods. So we unpivot them first with pd.melt, perform the grouping, and then pivot back.
Try this:
import numpy as np
import pandas as pd
from functools import partial
# Read data
text = """Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Amaya 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Will 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Francis 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2"""
column_names = [x.split() for x in text.split("\n")][0]
values = [x.split() for x in text.split("\n")][1:]
equally = pd.DataFrame(
    values, columns=column_names
).assign(**{k: partial(lambda key, df: df[key].astype(np.float64), k) for k in column_names[1:]})
text = """Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0 0 0 0 0
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Kamal 0.3 0 0.5 0.2 0.1 0.5 0.1 0.1 0.2
Amaya 0.5 0 0.3 0.2 0 0 0 0 0
Jacob 0.2 0.4 0 0.4 0.4 0 0.2 0.2 0.2
Preety 0.7 0.1 0.1 0.1 0.2 0.1 0.4 0.3 0
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Tara 0 0.5 0.2 0.3 0.2 0 0.3 0.2 0.3"""
column_names = [x.split() for x in text.split("\n")][0]
values = [x.split() for x in text.split("\n")][1:]
df = pd.DataFrame(
    values, columns=column_names
).assign(**{k: partial(lambda key, df: df[key].astype(np.float64), k) for k in column_names[1:]})
# Processing
equally_unpivot = pd.melt(
    equally, id_vars='Name', value_vars=column_names[1:], var_name='Month_Week', value_name='Val'
).assign(**{
    "Month": lambda df: df["Month_Week"].str.split("_").str[0],
    "Week": lambda df: df["Month_Week"].str.split("_").str[1],
})
df_unpivot = pd.melt(
    df, id_vars='Name', value_vars=column_names[1:], var_name='Month_Week', value_name='Val'
).assign(**{
    "Month": lambda df: df["Month_Week"].str.split("_").str[0],
    "Week": lambda df: df["Month_Week"].str.split("_").str[1],
})
# keep (Name, Month) groups whose actual weekly values sum to 1 (float-tolerant)
valid_entries = (
    df_unpivot[["Name", "Month", "Val"]]
    .groupby(["Name", "Month"], as_index=False).sum()
    .loc[lambda d: np.isclose(d["Val"], 1.0)]
    .drop(columns=["Val"])
)
merged_df = (
    equally[["Name"]].merge(
        pd.concat([
            equally_unpivot.merge(
                valid_entries,
                on=["Name", "Month"],
                how="left",
                indicator=True
            ).query("_merge == 'left_only'").drop(columns=["_merge"]),
            df_unpivot.merge(
                valid_entries,
                on=["Name", "Month"],
                how="inner",
            )
        ])
        .drop(columns=["Month", "Week"])
        .pivot(index="Name", columns="Month_Week", values="Val")
        .rename_axis(None, axis=1)
        .reset_index(drop=False)
    )
)
print(merged_df)
A more concise approach:
split both dataframes into 2 segments, Feb and Mar
align them on the matching Name column
update each segment with pd.DataFrame.update, and finally concat them
eq_Feb = eq_df.set_index('Name').filter(like='Feb')
eq_Mar = eq_df.set_index('Name').filter(like='Mar')
actual_df_ = actual_df[actual_df.Name.isin(eq_df.Name)]
actual_Feb = actual_df_.set_index('Name').filter(like='Feb')
actual_Mar = actual_df_.set_index('Name').filter(like='Mar')
eq_Feb.update(actual_Feb[actual_Feb.sum(axis=1).round(6).eq(1.0)])
eq_Mar.update(actual_Mar[actual_Mar.sum(axis=1).round(6).eq(1.0)])
res_df = pd.concat([eq_Feb, eq_Mar], axis=1)
Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Name
Sunil 0.20 0.40 0.30 0.10 0.2 0.2 0.2 0.2 0.2
Amaya 0.50 0.00 0.30 0.20 0.2 0.2 0.2 0.2 0.2
Will 0.80 0.20 0.00 0.00 0.1 0.2 0.3 0.1 0.3
Francis 0.40 0.20 0.30 0.10 0.2 0.4 0.0 0.4 0.0
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.50 0.20 0.30 0.00 0.2 0.3 0.2 0.2 0.1
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
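If the month prefixes aren't fixed at 'Feb' and 'Mar', the same update idea can be written as a loop over whatever prefixes appear in the columns. A sketch under the same eq_df/actual_df names as above (replace_valid_months is a made-up helper name); using np.isclose also sidesteps the float-precision pitfall where e.g. 0.2 + 0.3 + 0.2 + 0.2 + 0.1 is not exactly 1.0:

```python
import numpy as np
import pandas as pd

def replace_valid_months(eq_df, actual_df):
    """For each month prefix, copy rows from actual_df whose weekly
    shares sum to 1 (within float tolerance) into eq_df."""
    eq = eq_df.set_index("Name")
    actual = actual_df.set_index("Name")
    # Derive month prefixes ("Feb", "Mar", ...) preserving column order.
    prefixes = list(dict.fromkeys(c.split("_")[0] for c in eq.columns))
    parts = []
    for p in prefixes:
        eq_m = eq.filter(like=p)          # this month's segment (a copy)
        act_m = actual.filter(like=p)
        # Rows whose actual weekly values sum to 1, float-tolerantly.
        valid = act_m[np.isclose(act_m.sum(axis=1), 1.0)]
        eq_m.update(valid)                # aligns on Name; extra names ignored
        parts.append(eq_m)
    return pd.concat(parts, axis=1)
```

With the sample frames, this matches the expected output in the question, including the Hima row for March.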
##1
M_members = [1000 , 1450, 1900]
M = pd.DataFrame(M_members)
##2
a_h_members = [0.4 , 0.6 , 0.8 ]
a_h = pd.DataFrame(a_h_members)
##3
d_h_members = [0.1 , 0.2 ]
d_h = pd.DataFrame(d_h_members)
The output I want is in dataframe form:
1000 0.4 0.1
1000 0.4 0.2
1000 0.6 0.1
1000 0.6 0.2
1000 0.8 0.1
1000 0.8 0.2
1450 0.4 0.1
1450 0.4 0.2
1450 0.6 0.1
1450 0.6 0.2
1450 0.8 0.1
1450 0.8 0.2
1900 0.4 0.1
1900 0.4 0.2
1900 0.6 0.1
1900 0.6 0.2
1900 0.8 0.1
1900 0.8 0.2
I actually want to do this loop for more dataframes.
Use itertools.product
>>> import itertools
>>> pd.DataFrame(itertools.product(*[M_members, a_h_members, d_h_members]))
0 1 2
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
Depending on your data size, expand_grid from pyjanitor may help with performance:
# pip install pyjanitor
import janitor as jn
import pandas as pd
others = {'a':M, 'b':a_h, 'c':d_h}
jn.expand_grid(others = others)
a b c
0 0 0
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
You can drop a column level, or flatten it:
jn.expand_grid(others = others).droplevel(axis = 1, level = 1)
a b c
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
If you're starting from the DataFrames, you can use a repeated cross merge:
dfs = [M, a_h, d_h]
from functools import reduce
out = (reduce(lambda a,b: a.merge(b, how='cross'), dfs)
.set_axis(range(len(dfs)), axis=1)
)
Output:
0 1 2
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
I have 2 dataframes, as shown below:
dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.3,0.3,0.4]], columns=['WA', 'WB','WC'])
WA WB WC
0 0.4 0.2 0.4
1 0.1 0.3 0.6
2 0.3 0.2 0.5
3 0.3 0.3 0.4
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns = ['stv_A', 'stv_B', 'stv_c'])
stv_A stv_B stv_c
0 0.5 0.2 0.4
Is there any way to append dff2, which consists of only one row, to every single row in dff? The resulting dataframe should thus have 6 columns and 4 rows.
You can use:
dff[dff2.columns] = dff2.squeeze()
print(dff)
# Output
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
Pandas does the broadcasting for you when you assign a scalar as a column:
import pandas as pd
dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.3,0.3,0.4]], columns=['WA', 'WB','WC'])
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns = ['stv_A', 'stv_B', 'stv_c'])
for col in dff2.columns:
dff[col] = dff2[col][0] # Pass a scalar
print(dff)
Output:
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
You can first repeat the row in dff2 len(dff) times (several methods work), then concat the repeated dataframe to dff:
df = pd.concat([dff, pd.concat([dff2]*len(dff)).reset_index(drop=True)], axis=1)
print(df)
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
I have a dataframe that looks like this
a b c d
0 0.6 -0.4 0.2 0.7
1 0.8 0.2 -0.2 0.3
2 -0.1 0.5 0.5 -0.4
3 0.8 -0.6 -0.7 -0.2
And I wish to create a column 'e' that displays the (1-based) column number of the first value in each row that is less than 0.
So the goal result will look like this
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
I can do this in Excel using a MATCH(True) type function but am struggling to make progress in Pandas.
Thanks for any help
You can use np.argmax:
import numpy as np
# where the values are less than 0
a = df.values < 0
# if the row is all non-negative, return 0
df['e'] = np.where(a.any(1), np.argmax(a,axis=1)+1, 0)
Output:
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
Something like idxmin with np.sign:
import numpy as np
df['e']=df.columns.get_indexer(np.sign(df).idxmin(1))+1
df
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
Get the first True with idxmax, combined with get_indexer_for to get the column numbers:
df["e"] = df.columns.get_indexer_for(df.lt(0, axis=1).idxmax(axis=1).array) + 1
df
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
I'm starting to learn pandas, am currently unable to construct the dataframe I would like, and would like some advice.
Let's say I have two DataFrames :
T1df: max min
0 0.2 0.1
1 0.2 0.1
2 0.2 0.1
3 0.2 0.1
T2df: max min
0 0.4 0.3
1 0.4 0.3
2 0.4 0.3
3 0.4 0.3
How could I merge them to end up with this shape of DataFrame?
T1 max 0.2 0.2 0.2 0.2
min 0.1 0.1 0.1 0.1
T2 max 0.4 0.4 0.4 0.4
min 0.3 0.3 0.3 0.3
Use concat with axis=1 and the keys parameter, then transpose with DataFrame.T to get a MultiIndex in the index:
df = pd.concat([T1df, T2df], axis=1, keys=('T1','T2')).T
print (df)
0 1 2 3
T1 max 0.2 0.2 0.2 0.2
min 0.1 0.1 0.1 0.1
T2 max 0.4 0.4 0.4 0.4
min 0.3 0.3 0.3 0.3
I am running Python 3.6 and Pandas 0.19.2 in PyCharm Community Edition 2016.3.2 and am trying to ensure that a set of label columns in each row of my dataframe adds up to 1.
Initially my dataframe looks as follows:
hello world label0 label1 label2
abc def 1.0 0.0 0.0
why not 0.33 0.34 0.33
hello you 0.33 0.38 0.15
I proceed as follows:
# get list of label columns (all column headers that contain the string 'label')
label_list = df.filter(like='label').columns
# ensure every row adds to 1
if (df[label_list].sum(axis=1) != 1).any():
print('ERROR')
Unfortunately this code does not work for me. What seems to be happening is that instead of summing my rows, I just get the value of the first column in my filtered data. In other words, df[label_list].sum(axis=1) returns:
0 1.0
1 0.33
2 0.33
This should be trivial, but I just can't figure out what I'm doing wrong. Thanks up front for the help!
UPDATE:
This is an excerpt from my original data after I have filtered for label columns:
label0 label1 label2 label3 label4 label5 label6 label7 label8
1 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
2 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
3 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
4 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
5 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
6 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
7 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
8 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
9 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
My code from above still does not work, and I still have absolutely no idea why. When I run my code in the Python console everything works perfectly fine, but when I run my code in PyCharm 2016.3.2, label_data.sum(axis=1) just returns the values of the first column.
With your sample data it works for me. Try to reproduce your sample, adding a new column check to verify the sum:
In [3]: df
Out[3]:
hello world label0 label1 label2
0 abc def 1.00 0.00 0.00
1 why not 0.33 0.34 0.33
2 hello you 0.33 0.38 0.15
In [4]: df['check'] = df.sum(axis=1)
In [5]: df
Out[5]:
hello world label0 label1 label2 check
0 abc def 1.00 0.00 0.00 1.00
1 why not 0.33 0.34 0.33 1.00
2 hello you 0.33 0.38 0.15 0.86
In [6]: label_list = df.filter(like='label').columns
In [7]: label_list
Out[7]: Index([u'label0', u'label1', u'label2'], dtype='object')
In [8]: df[label_list].sum(axis=1)
Out[8]:
0 1.00
1 1.00
2 0.86
dtype: float64
In [9]: if (df[label_list].sum(axis=1) != 1).any():
...: print('ERROR')
...:
ERROR
Turns out my data type was not consistent. I used astype(float) and things worked out.
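To make that fix concrete, here is a minimal sketch (the data is made up to mirror the question): if the label columns were read in as strings, cast them to float before summing, and compare against 1 with a tolerance rather than != 1.

```python
import pandas as pd

# Label columns read from a messy source can arrive as strings ("object" dtype);
# casting to float first makes the row-wise sum behave as expected.
df = pd.DataFrame({
    "label0": ["1.0", "0.33", "0.33"],
    "label1": ["0.0", "0.34", "0.38"],
    "label2": ["0.0", "0.33", "0.15"],
})

label_list = df.filter(like="label").columns
df[label_list] = df[label_list].astype(float)  # cast before summing

row_sums = df[label_list].sum(axis=1)
if ((row_sums - 1).abs() > 1e-9).any():  # float-safe check instead of != 1
    print("ERROR")  # the last row sums to 0.86
```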