Summing a Pandas DataFrame

I am running Python 3.6 and Pandas 0.19.2 in PyCharm Community Edition 2016.3.2 and am trying to ensure a set of rows in my dataframe adds up to 1.
Initially my dataframe looks as follows:
hello world label0 label1 label2
abc def 1.0 0.0 0.0
why not 0.33 0.34 0.33
hello you 0.33 0.38 0.15
I proceed as follows:
# get list of label columns (all column headers that contain the string 'label')
label_list = df.filter(like='label').columns
# ensure every row adds to 1
if (df[label_list].sum(axis=1) != 1).any():
print('ERROR')
Unfortunately this code does not work for me. Instead of summing my rows, I just seem to get the value of the first column in my filtered data. In other words, df[label_list].sum(axis=1) returns:
0 1.0
1 0.33
2 0.33
This should be trivial, but I just can't figure out what I'm doing wrong. Thanks in advance for the help!
UPDATE:
This is an excerpt from my original data after I have filtered for label columns:
label0 label1 label2 label3 label4 label5 label6 label7 label8
1 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
2 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
3 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
4 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
5 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
6 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
7 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
8 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
9 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
My code from above still does not work, and I still have absolutely no idea why. When I run my code in the Python console everything works perfectly fine, but when I run it in PyCharm 2016.3.2, label_data.sum(axis=1) just returns the values of the first column.

With your sample data it works for me. Try to reproduce it, adding a new column check to verify the sum:
In [3]: df
Out[3]:
hello world label0 label1 label2
0 abc def 1.00 0.00 0.00
1 why not 0.33 0.34 0.33
2 hello you 0.33 0.38 0.15
In [4]: df['check'] = df.sum(axis=1)
In [5]: df
Out[5]:
hello world label0 label1 label2 check
0 abc def 1.00 0.00 0.00 1.00
1 why not 0.33 0.34 0.33 1.00
2 hello you 0.33 0.38 0.15 0.86
In [6]: label_list = df.filter(like='label').columns
In [7]: label_list
Out[7]: Index([u'label0', u'label1', u'label2'], dtype='object')
In [8]: df[label_list].sum(axis=1)
Out[8]:
0 1.00
1 1.00
2 0.86
dtype: float64
In [9]: if (df[label_list].sum(axis=1) != 1).any():
...: print('ERROR')
...:
ERROR
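Note that an exact != 1 comparison is fragile for floats, since sums like 0.33 + 0.34 + 0.33 may not come out to exactly 1 in binary floating point. A tolerance-based check is safer; a minimal sketch using numpy:
import numpy as np
# flag any row whose label sum deviates from 1 beyond float tolerance
if not np.isclose(df[label_list].sum(axis=1), 1).all():
    print('ERROR')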

It turns out my data types were not consistent. I used astype(float) and things worked out.
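For anyone hitting the same symptom: if the label columns come in as strings (object dtype), for example from a CSV, the row-wise sum will not behave as expected. A minimal sketch of the fix, using hypothetical string data:
import pandas as pd
# hypothetical reproduction: label columns read in as strings (object dtype)
df = pd.DataFrame({'label0': ['1.0', '0.33', '0.33'],
                   'label1': ['0.0', '0.34', '0.38'],
                   'label2': ['0.0', '0.33', '0.15']})
label_list = df.filter(like='label').columns
# cast to float before summing, as in the fix above
label_data = df[label_list].astype(float)
print(label_data.sum(axis=1))  # now a true row-wise numeric sum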

Related

Mapping Name between two data frames and replacing values based on a condition

I have two data frames: one equally divided into weeks by the number of weeks in each month (February has 4 weeks, March has 5). The other one has the actual data.
equally divided df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Amaya 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Will 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Francis 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
actual df
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0 0 0 0 0
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Kamal 0.3 0 0.5 0.2 0.1 0.5 0.1 0.1 0.2
Amaya 0.5 0 0.3 0.2 0 0 0 0 0
Jacob 0.2 0.4 0 0.4 0.4 0 0.2 0.2 0.2
Preety 0.7 0.1 0.1 0.1 0.2 0.1 0.4 0.3 0
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Tara 0 0.5 0.2 0.3 0.2 0 0.3 0.2 0.3
I want to replace the data in the equally divided df with values from the actual df. The condition for weeks 1 to 4 (Feb) is that the actual df's weeks 1 to 4 sum to 1. For example, in the actual df:
Sunil 0.2 + 0.4 + 0.3 + 0.1 = 1
so Sunil's 0.25 0.25 0.25 0.25 in the equally divided df is replaced with the values above.
Weeks 5 to 9 work the same way: if the sum of the values in the actual df equals 1, replace.
For Sunil it's 0 + 0 + 0 + 0 + 0, which is not equal to 1, so the values for weeks 5 to 9 are not replaced.
The resulting data frame looks like the one below.
equally divided df with edits
Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0.2 0.2 0.2 0.2 0.2
Amaya 0.5 0 0.3 0.2 0.2 0.2 0.2 0.2 0.2
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
I'm trying to bring all the values from the actual df into the equally divided frame and then edit them, but couldn't find a way:
equally['Feb_1_actual'] = equally['Name'].map(actual.set_index('Name')['Feb_1'])
# then get a sum, and if it's equal to 1 replace the value, otherwise keep the same value
Is there another way to do it?
Any help would be appreciated. Thanks in advance!
The idea is to avoid dealing with variable column names, since they can't be used directly with pandas methods. So we unpivot them first through pd.melt, perform the grouping, and then pivot back.
Try this:
import pandas as pd
import numpy as np
from functools import partial
# Read data
text = """Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Amaya 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Will 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Francis 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2"""
column_names = [x.split() for x in text.split("\n")][0]
values = [x.split() for x in text.split("\n")][1:]
equally = pd.DataFrame(
    values, columns=column_names
).assign(**{k: partial(lambda key, df: df[key].astype(np.float64), k) for k in column_names[1:]})
text = """Name Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Sunil 0.2 0.4 0.3 0.1 0 0 0 0 0
Hima 0.5 0.2 0.3 0 0.2 0.3 0.2 0.2 0.1
Kamal 0.3 0 0.5 0.2 0.1 0.5 0.1 0.1 0.2
Amaya 0.5 0 0.3 0.2 0 0 0 0 0
Jacob 0.2 0.4 0 0.4 0.4 0 0.2 0.2 0.2
Preety 0.7 0.1 0.1 0.1 0.2 0.1 0.4 0.3 0
Will 0.8 0.2 0 0 0.1 0.2 0.3 0.1 0.3
Francis 0.4 0.2 0.3 0.1 0.2 0.4 0 0.4 0
Tara 0 0.5 0.2 0.3 0.2 0 0.3 0.2 0.3"""
column_names = [x.split() for x in text.split("\n")][0]
values = [x.split() for x in text.split("\n")][1:]
df = pd.DataFrame(
    values, columns=column_names
).assign(**{k: partial(lambda key, df: df[key].astype(np.float64), k) for k in column_names[1:]})
# Processing
equally_unpivot = pd.melt(
    equally, id_vars='Name', value_vars=column_names[1:],
    var_name='Month_Week', value_name='Val'
).assign(**{
    "Month": lambda df: df["Month_Week"].str.split("_").str[0],
    "Week": lambda df: df["Month_Week"].str.split("_").str[1],
})
df_unpivot = pd.melt(
    df, id_vars='Name', value_vars=column_names[1:],
    var_name='Month_Week', value_name='Val'
).assign(**{
    "Month": lambda df: df["Month_Week"].str.split("_").str[0],
    "Week": lambda df: df["Month_Week"].str.split("_").str[1],
})
# keep (Name, Month) pairs whose actual values sum to 1; round to guard against float error
valid_entries = (
    df_unpivot[["Name", "Month", "Val"]]
    .groupby(["Name", "Month"], as_index=False).sum()
    .loc[lambda d: d["Val"].round(6) == 1]
    .drop(columns=["Val"])
)
merged_df = (
    equally[["Name"]].merge(
        pd.concat([
            # keep the equally divided rows for months that are NOT valid
            equally_unpivot.merge(
                valid_entries,
                on=["Name", "Month"],
                how="left",
                indicator=True
            ).query("_merge == 'left_only'").drop(columns=["_merge"]),
            # take the actual rows for months that ARE valid
            df_unpivot.merge(
                valid_entries,
                on=["Name", "Month"],
                how="inner",
            )
        ])
        .drop(columns=["Month", "Week"])
        .pivot(index="Name", columns="Month_Week", values="Val")
        .rename_axis(None, axis=1)
        .reset_index(drop=False)
    )
)
print(merged_df)
A more concise approach:
- split the dataframes into two segments, Feb and Mar
- align them by the matched Name column
- update each segment with pd.DataFrame.update, and finally concat them
eq_Feb = eq_df.set_index('Name').filter(like='Feb')
eq_Mar = eq_df.set_index('Name').filter(like='Mar')
actual_df_ = actual_df[actual_df.Name.isin(eq_df.Name)]
actual_Feb = actual_df_.set_index('Name').filter(like='Feb')
actual_Mar = actual_df_.set_index('Name').filter(like='Mar')
eq_Feb.update(actual_Feb[actual_Feb.sum(1).round(6).eq(1.0)])
eq_Mar.update(actual_Mar[actual_Mar.sum(1).round(6).eq(1.0)])
res_df = pd.concat([eq_Feb, eq_Mar], axis=1)
Feb_1 Feb_2 Feb_3 Feb_4 Mar_5 Mar_6 Mar_7 Mar_8 Mar_9
Name
Sunil 0.20 0.40 0.30 0.10 0.2 0.2 0.2 0.2 0.2
Amaya 0.50 0.00 0.30 0.20 0.2 0.2 0.2 0.2 0.2
Will 0.80 0.20 0.00 0.00 0.1 0.2 0.3 0.1 0.3
Francis 0.40 0.20 0.30 0.10 0.2 0.4 0.0 0.4 0.0
Kadeep 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Hima 0.50 0.20 0.30 0.00 0.2 0.3 0.2 0.2 0.1
Lazy 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
Joseph 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 0.2
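If the round(6) trick feels ad hoc, np.isclose expresses the same tolerance check more explicitly; a sketch reusing the frames above:
import numpy as np
# update only rows whose actual Feb/Mar values sum to 1 within float tolerance
eq_Feb.update(actual_Feb[np.isclose(actual_Feb.sum(axis=1), 1.0)])
eq_Mar.update(actual_Mar[np.isclose(actual_Mar.sum(axis=1), 1.0)])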

Find values that exceed the minimum or maximum

I am attempting to flag values that fall outside lower/upper quantile bounds for each element column. My dataframe looks similar to this:
Name DateTime Na Na Err Mg Mg Err Al Al Err Si Si Err
STD1 2/11/2020 0.3 0.11 1.6 0.08 0.6 0.12 21.5 0.14
STD2 2/11/2020 0.2 0.10 1.6 0.08 0.2 0.12 21.6 0.14
STD3 2/11/2020 0.2 0.10 1.6 0.08 0.5 0.12 21.7 0.14
STD4 2/11/2020 0.1 0.10 1.3 0.08 0.5 0.12 21.4 0.14
Here is what I have:
elements=['Na','Mg', 'Al', 'Si',...]
quant=df[elements].quantile([lower, upper]) #obtain upper/lower limits
outsideBounds=(quant.loc[lower_bound, elements] < df[elements].to_numpy()) \
& (df[elements].to_numpy()<quant.loc[lower_bound, elements])
However, this gives me "ValueError: Lengths must match to compare". Any help would be appreciated.
Here's a solution (I chose 0.3 and 0.7 for lower and upper bounds, respectively, but that can be changed of course):
lower = 0.3
upper = 0.7
elements = ['Na', 'Mg', 'Al', 'Si']
bounds = df[elements].quantile([lower, upper])  # obtain lower/upper limits
out_of_bounds = df[elements].lt(bounds.loc[lower, :]) | df[elements].gt(bounds.loc[upper, :])
df[elements][out_of_bounds]
The resulting bounds are:
Na Mg Al Si
0.3 0.19 1.57 0.47 21.49
0.7 0.21 1.60 0.51 21.61
The result of df[elements][out_of_bounds] is:
Na Mg Al Si
0 0.3 NaN 0.6 NaN
1 NaN NaN 0.2 NaN
2 NaN NaN NaN 21.7
3 0.1 1.3 NaN 21.4
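If you also need the coordinates of the offending values rather than a masked frame, stacking the boolean mask gives the (row, element) pairs; a small sketch building on the answer above:
# list (row, element) pairs that fall outside the bounds
violations = out_of_bounds.stack()
print(violations[violations].index.tolist())
# e.g. [(0, 'Na'), (0, 'Al'), (1, 'Al'), (2, 'Si'), (3, 'Na'), (3, 'Mg'), (3, 'Si')]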

Can't find how to combine DataFrames this way with pandas

I'm starting to learn pandas, and I'm currently unable to construct the DataFrame I'd like, so I would appreciate some advice.
Let's say I have two DataFrames :
T1df: max min
0 0.2 0.1
1 0.2 0.1
2 0.2 0.1
3 0.2 0.1
T2df: max min
0 0.4 0.3
1 0.4 0.3
2 0.4 0.3
3 0.4 0.3
How could I merge them to end up with this shape of DataFrame?
T1 max 0.2 0.2 0.2 0.2
min 0.1 0.1 0.1 0.1
T2 max 0.4 0.4 0.4 0.4
min 0.3 0.3 0.3 0.3
Use concat with axis=1 and the keys parameter, then transpose with DataFrame.T to get a MultiIndex in the index:
df = pd.concat([T1df, T2df], axis=1, keys=('T1','T2')).T
print (df)
0 1 2 3
T1 max 0.2 0.2 0.2 0.2
min 0.1 0.1 0.1 0.1
T2 max 0.4 0.4 0.4 0.4
min 0.3 0.3 0.3 0.3
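Once the keys are in place, the MultiIndex makes row selection straightforward; for example:
print(df.loc['T1'])           # both rows belonging to T1
print(df.loc[('T2', 'max')])  # the single 'max' row of T2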

If a value for a particular ID does not exist for another ID, insert a row for that ID

I would like to update my dataframe by inserting a new row whenever a D1 value that exists for one ID is missing for another, leaving df['Value'] blank (NaN). Your help is appreciated.
Input
D1 ID Value
0.02 1 1.2
0.04 1 1.6
0.06 1 1.9
0.08 1 2.8
0.02 2 4.5
0.04 2 4.1
0.08 2 3.6
0.02 3 2.7
0.04 3 2.9
0.06 3 2.4
0.08 3 2.1
0.1 3 1.9
Expected output:
D1 ID Value
0.02 1 1.2
0.04 1 1.6
0.06 1 1.9
0.08 1 2.8
0.1 1
0.02 2 4.5
0.04 2 4.1
0.06 2
0.08 2 3.6
0.1 2
0.02 3 2.7
0.04 3 2.9
0.06 3 2.4
0.08 3 2.1
0.1 3 1.9
Unfortunately, the code I have written has been way off or simply produces multiple error messages, so unlike my other questions I do not have an example to show.
Use unstack and stack. Chain sort_index and reset_index to achieve the desired order:
df_final = (df.set_index(['D1', 'ID']).unstack().stack(dropna=False)
              .sort_index(level=[1, 0]).reset_index())
Out[952]:
D1 ID Value
0 0.02 1 1.2
1 0.04 1 1.6
2 0.06 1 1.9
3 0.08 1 2.8
4 0.10 1 NaN
5 0.02 2 4.5
6 0.04 2 4.1
7 0.06 2 NaN
8 0.08 2 3.6
9 0.10 2 NaN
10 0.02 3 2.7
11 0.04 3 2.9
12 0.06 3 2.4
13 0.08 3 2.1
14 0.10 3 1.9
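An equivalent route, if you prefer to avoid the unstack/stack round trip, is to reindex against the full D1 x ID grid; a sketch, assuming every desired D1 value occurs for at least one ID:
import pandas as pd
# build the complete (D1, ID) grid and reindex onto it
full_idx = pd.MultiIndex.from_product(
    [sorted(df['D1'].unique()), sorted(df['ID'].unique())],
    names=['D1', 'ID'])
df_final = (df.set_index(['D1', 'ID'])
              .reindex(full_idx)
              .sort_index(level=[1, 0])
              .reset_index())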

Add columns together in pairs

Assume this is my df. I am looking for a way to add columns with each other in pairs.
For example, everything except column a should be added pairwise:
col b + col c, col d + col e, col f + col g, col h + col i
a b c d e f g h i
group
A 0.15 0.1 0.1 0.15 0.15 0.1 0.1 0.10 0.05
B 0.13 NaN NaN NaN 0.40 0.2 NaN 0.13 0.06
desired output:
a b d f h
group
A 0.15 0.2 0.30 0.2 0.15
B 0.13 NaN 0.40 0.2 0.19
I know I can manually add the columns, but I'm looking for an easier way or an apply function to achieve the output.
I couldn't figure out a way. Any help?
Use add with a shifted DataFrame, excluding the first column either by selecting with iloc or by removing it with drop; last, filter by the list of column names:
cols = ['a','b','d','f','h']
df = df.add(df.iloc[:, 1:].shift(-1,axis=1), fill_value=0)[cols]
Alternative:
df = df.add(df.drop('a', axis=1).shift(-1,axis=1), fill_value=0)[cols]
print (df)
a b d f h
group
A 0.15 0.2 0.3 0.2 0.15
B 0.13 NaN 0.4 0.2 0.19
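An alternative sketch that pairs the columns positionally instead of shifting, assuming a pandas version where groupby with axis=1 is still available:
import numpy as np
import pandas as pd
# group columns b..i into positional pairs: (b,c), (d,e), (f,g), (h,i)
pairs = np.arange(df.shape[1] - 1) // 2
summed = df.iloc[:, 1:].groupby(pairs, axis=1).sum(min_count=1)
summed.columns = df.columns[1::2]              # keep the first name of each pair
result = pd.concat([df[['a']], summed], axis=1)
print(result)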
