Create new dataframe fields from row calculations - python

I'd like to create some new columns based on calculation from each row values
For example,
input
data = {"c1": [10], "c2": [20], "c3":[30], "c4":[40], "c5":[50], "c6":[10]}
df = pd.DataFrame(data=data)
Let us say we take the values from series = c2:c6, i.e. [20, 30, 40, 50, 10]:
new_column1 = np.mean(series[0:2])  # np.mean([20, 30]) = 25
new_column2 = np.mean(series[2:4])  # np.mean([40, 50]) = 45
new_column3 = new_column1 + new_column2  # 70
output:
   c1  c2  c3  c4  c5  c6  new_column1  new_column2  new_column3
0  10  20  30  40  50  10           25           45           70
I am looking for an efficient way (list comprehension or apply function?) instead of iterrows

Looks like you want:
df['new_column1'] = df.loc[:, 'c2':'c3'].mean(axis=1)
df['new_column2'] = df.loc[:, 'c4':'c5'].mean(axis=1)
df['new_column3'] = df[['new_column1', 'new_column2']].sum(axis=1)
print(df)
Output:
   c1  c2  c3  c4  c5  c6  new_column1  new_column2  new_column3
0  10  20  30  40  50  10         25.0         45.0         70.0
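For reference, the answer runs end-to-end as a self-contained snippet (same data as the question):

```python
import pandas as pd

data = {"c1": [10], "c2": [20], "c3": [30], "c4": [40], "c5": [50], "c6": [10]}
df = pd.DataFrame(data=data)

# Vectorized row-wise means over label-based column slices; no iterrows needed
df['new_column1'] = df.loc[:, 'c2':'c3'].mean(axis=1)
df['new_column2'] = df.loc[:, 'c4':'c5'].mean(axis=1)
df['new_column3'] = df[['new_column1', 'new_column2']].sum(axis=1)
```

Note that `df.loc[:, 'c2':'c3']` slices by label, so it relies on the columns being in their original order.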


Is there a method to save non-missing values in another data frame?

I have a data frame with 20441 rows and 158 columns.
Each row contains many "NA" values, so I want to convert it to something like this:
if a value is not NA, I save its row name, column name and value in another data frame.
For example, my first data frame is:
row and column name   c1   c2   c3   c4   c5
r1                    NA   NA   NA    5    6
r2                     1    3   NA   NA   NA
and the desired output is:
row name   c1   c2
r1         c4    5
r1         c5    6
r2         c1    1
r2         c2    3
The answer using df.melt works but is slower than the code below.
I used %%timeit to measure each approach:
df.melt takes 2.47 ms ± 93.5 µs, and the code below takes 314 µs ± 10.4 µs.
If performance is not important, then df.melt is better since it is a one-liner.
row = []
cols = []
val = []
for col in data.columns[1:]:
    for i, e in enumerate(data[col]):
        if not pd.isna(e):
            row.append(data['row&col name'][i])
            cols.append(col)
            val.append(e)
new_data = pd.DataFrame(list(zip(row, cols, val)), columns=['row', 'col', 'val'])
Use melt:
out = (df.melt('row and column name', var_name='C1', value_name='C2')
         .dropna().astype({'C2': int}))
Output:
row and column name  C1  C2
r2                   c1   1
r2                   c2   3
r1                   c4   5
r1                   c5   6
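A stack-based alternative (my sketch, not part of the answers above) produces the same triples; the 'row', 'col', 'val' names here are just illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'row and column name': ['r1', 'r2'],
    'c1': [np.nan, 1], 'c2': [np.nan, 3], 'c3': [np.nan, np.nan],
    'c4': [5, np.nan], 'c5': [6, np.nan],
})

out = (df.set_index('row and column name')
         .stack()                       # long Series indexed by (row, column)
         .dropna()                      # keep only the non-missing cells
         .rename_axis(['row', 'col'])
         .reset_index(name='val'))
```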

Filter grouped Pandas data frame by column aggregate, when groups are from a MultiIndex level

How can I drop data from one level of a multi-level indexed data frame, based on aggregated information I get from a column within a groupby on that level?
For example, with data frame dfmi:
midx = pd.MultiIndex.from_product([['A0','A1','A2'], ['B0','B1','B2']],
                                  names=["index_1", "index_2"])
columns = ['foo', 'bar']
dfmi = pd.DataFrame(np.arange(18).reshape((len(midx), len(columns))),
                    index=midx, columns=columns)
dfmi
                 foo  bar
index_1 index_2
A0      B0         0    1
        B1         2    3
        B2         4    5
A1      B0         6    7
        B1         8    9
        B2        10   11
A2      B0        12   13
        B1        14   15
        B2        16   17
Let's say I only want to keep levels of index_1 if the mean for foo exceeds a certain threshold.
Like this:
thresh = 5
for grp, data in dfmi.groupby("index_1"):
    print(data.foo.mean() > thresh)
False <-- drop this level
True
True
Desired output:
                 foo  bar
index_1 index_2
A1      B0         6    7
        B1         8    9
        B2        10   11
A2      B0        12   13
        B1        14   15
        B2        16   17
In this toy example I can get what I want with dfmi.loc[pd.IndexSlice["A1":"A2", :]]. But I can't figure out how to use IndexSlice or loc variants to do aggregations inside a grouped MultiIndex and then slice the full data frame based on the results.
My best solution so far is to just keep track of the level values that qualify as keepers (with grp), and then use the accumulated keepers collection with IndexSlice:
keepers = list()
for grp, data in dfmi.groupby("index_1"):
    if data.foo.mean() > thresh:
        keepers.append(grp)
dfmi.loc[pd.IndexSlice[keepers, :]]
I'm looking for a more efficient and/or more elegant way to accomplish that with native Pandas functionality.
You can use loc once you have created your mask, like so:
mask = dfmi.groupby(level=0)['foo'].mean() > thresh
dfmi.loc[mask.index[mask]]
Yields:
                 foo  bar
index_1 index_2
A1      B0         6    7
        B1         8    9
        B2        10   11
A2      B0        12   13
        B1        14   15
        B2        16   17
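For completeness, GroupBy.filter expresses the same idea in a single call; a minimal sketch on the same dfmi:

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([['A0', 'A1', 'A2'], ['B0', 'B1', 'B2']],
                                  names=['index_1', 'index_2'])
dfmi = pd.DataFrame(np.arange(18).reshape((9, 2)),
                    index=midx, columns=['foo', 'bar'])
thresh = 5

# filter() keeps every row of each group whose aggregate passes the predicate
out = dfmi.groupby(level='index_1').filter(lambda g: g['foo'].mean() > thresh)
```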

Compare groupby output for different dataframes

What could be a way to compare multiple groupby outputs?
I have multiple groupby outputs from different dataframes, like below
>>> tmp1
  account place  balance type
0       A    A1       10   B1
1       A    A1       20   B1
2       A    A1       30   B1
3       A    A1       10   B4
4       A    A1       20   B4
5       A    A1       10   B5
6       A    A1       10   B6
7       B    A2       10   B7
8       B    A2       20   B1
9       B    A2      100   B1
I do
>>> tmp1.groupby(['account','place','type'])['balance'].last().sum(level=0).astype(int)
account
A 70
B 110
Name: balance, dtype: int64
Similarly
>>> tmp2
  account place  balance type
0       A    A1      100   B1
1       A    A1      200   B1
2       A    A1      100   B1
3       A    A1      100   B4
4       A    A1      200   B4
5       A    A1      100   B5
6       A    A1      100   B6
7       B    A2      100   B7
8       B    A2      200   B1
9       B    A2      200   B1
>>> tmp2.groupby(['account','place','type'])['balance'].last().sum(level=0).astype(int)
account
A 500
B 300
Name: balance, dtype: int64
#similarly tmp3 grouped..and so on
Is there a way to find the df with the maximum sum balance? E.g. in this case tmp2 has the greater sum (70+110 < 500+300).
My try:
One of the ways I tried was taking the sum and maintaining a list, like below
mylist=[]
mylist.append(tmp1.groupby(['account','place','type'])['balance'].last().sum().astype(int))
mylist.append(tmp2.groupby(['account','place','type'])['balance'].last().sum().astype(int))
>>> mylist
[180,800]
Now I can take the max from the list, but I lose the account information (800 is the max, but I need the info that account A has 500 and B has 300).
I tried
>>>tmp2.groupby(['account','place','type'])['balance'].last().sum(level=0).to_dict()
{'A': 500, 'B': 300}
So for every df I have a dict, I just need to find maximum of such lists (I think I have come very close to solving it)
I intend to find which dataframe had maximum sum (along with account)
If I understand you correctly, here is an approach in case you have more than 2 dfs:
tmp1 = pd.DataFrame([{'acount':'A', 'balance':100, 'type':'A1'},
                     {'acount':'A', 'balance':200, 'type':'A2'},
                     {'acount':'B', 'balance':200, 'type':'B1'},
                     {'acount':'B', 'balance':300, 'type':'B2'}])
tmp2 = pd.DataFrame([{'acount':'A', 'balance':100, 'type':'A1'},
                     {'acount':'A', 'balance':200, 'type':'A2'},
                     {'acount':'B', 'balance':400, 'type':'B1'},
                     {'acount':'B', 'balance':300, 'type':'B2'}])
tmplist = [tmp1, tmp2]
tmprlist = [tmp.groupby(['acount','type']).last().sum(level=0).astype(int) for tmp in tmplist]
tmpslist = [tmp.groupby(['acount','type'])['balance'].last().sum() for tmp in tmplist]
tmprlist[np.argmax(tmpslist)]
Result:
        balance
acount
A           300
B           700
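On newer pandas, sum(level=0) has been removed in favor of groupby(level=...).sum(); a sketch of the same comparison in that style, keeping the per-account breakdown of the winning frame (data as in the answer above):

```python
import pandas as pd

tmp1 = pd.DataFrame({'acount': ['A', 'A', 'B', 'B'],
                     'type': ['A1', 'A2', 'B1', 'B2'],
                     'balance': [100, 200, 200, 300]})
tmp2 = pd.DataFrame({'acount': ['A', 'A', 'B', 'B'],
                     'type': ['A1', 'A2', 'B1', 'B2'],
                     'balance': [100, 200, 400, 300]})

# Last balance per (acount, type), then totals per account for each frame
per_account = [t.groupby(['acount', 'type'])['balance'].last()
                .groupby(level='acount').sum()
               for t in (tmp1, tmp2)]

# Index of the frame with the largest overall total; account detail preserved
best = max(range(len(per_account)), key=lambda i: per_account[i].sum())
```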

Extract all the following rows in pandas

I have the following pandas DataFrame:
df
A B
1 b0
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
The first row which starts with a is
df[df.B.str.startswith("a")]
A B
2 a0
I would like to extract the first row in column B that starts with a and every row after. My desired result is below
A B
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
How can this be done?
One option is to create a mask, blank out the False entries, and forward-fill it:
mask = df.B.str.startswith("a")
df[mask.where(mask).ffill().notna()]
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.loc[first:]
The latter approach assumes that an "a" is always present.
Using idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]
   A   B
1  2  a0
2  3  c0
3  5  c1
4  6  a1
5  7  b1
6  8  b2
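A further compact variant (my addition, not one of the answers above): cummax on the boolean mask stays True from the first match onward:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 6, 7, 8],
                   'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})

# False before the first "a", True from it onward
out = df[df.B.str.startswith('a').cummax()]
```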
If I understand your question correctly, here is how you do it:
df = pd.DataFrame(data={'A': [1, 2, 3, 5, 6, 7, 8],
                        'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})
# index of the first item beginning with "a"
index = df[df.B.str.startswith("a")].index[0]
desired_df = df.loc[index:]
print(desired_df)
and you get:

Clean way of slicing + stacking pandas dataframe

I have a Pandas DataFrame, say df, which is 1099 rows by 33 columns. I need the original file to be processed by another piece of software, but it is not in the proper format. This is why I'm trying to get the right format with pandas.
The problem is very simple: df consists of identifier columns (7 in the real case, only 3 in the following example), followed by one result column per month. It looks like:
A   B   C   date1result  date2result  date3result
a1  b1  c1  12           15           17
a2  b2  c3   5            8            3
But to be processed, I would need it to have one line per result, adding a column for the date. In the given example, it would be
A B C result date
a1 b1 c1 12 date1
a1 b1 c1 15 date2
a1 b1 c1 17 date3
a2 b2 c3 5 date1
a2 b2 c3 8 date2
a2 b2 c3 3 date3
To be more precise, I manually edited all the date column names (after read_excel they looked like '01/01/2015 0:00:00' or something like that, and I was unable to access them; as a secondary question, does anyone know how to access columns imported from a date field in an .xlsx?). The date column names are now 2015_01, 2015_02, ..., 2015_12, 2016_01, ..., 2016_12, and the first 5 columns are 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep'. So I tried the following code:
core = df.loc[:, ('Account','Customer Name','Postcode','segment','Rep')]
df_final = pd.Series([])
for year in [2015, 2016]:
    for month in range(1, 13):
        label = "%i_%02i" % (year, month)
        date = []
        for i in range(core.shape[0]):
            date.append("01/%02i/%i" % (month, year))
        df_date = pd.Series(date)  # I don't know how to create this 1xn df
        df_final = df_final.append(pd.concat([core, df[label], df_date], axis=1))
That works roughly, but it is very unclean: I get a (26376, 30) shaped df_final, the first column being the dates, then the results, but of course with '2015_01' as the column name, then all of '2015_02' through '2016_12' filled with NaN, and at last my 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep' columns. Does anyone know how I could do such a "slicing+stacking" in a clean way?
Thank you very much.
Edit: it is roughly the reverse of this question: Stacking and shaping slices of DataFrame (pandas) without looping
I think you need melt:
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
print (df)
A B C date result
0 a1 b1 c1 date1result 12
1 a2 b2 c3 date1result 5
2 a1 b1 c1 date2result 15
3 a2 b2 c3 date2result 8
4 a1 b1 c1 date3result 17
5 a2 b2 c3 date3result 3
And then convert with to_datetime:
print (df)
A B C 2015_01 2016_10 2016_12
0 a1 b1 c1 12 15 17
1 a2 b2 c3 5 8 3
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
df.date = pd.to_datetime(df.date, format='%Y_%m')
print (df)
A B C date result
0 a1 b1 c1 2015-01-01 12
1 a2 b2 c3 2015-01-01 5
2 a1 b1 c1 2016-10-01 15
3 a2 b2 c3 2016-10-01 8
4 a1 b1 c1 2016-12-01 17
5 a2 b2 c3 2016-12-01 3
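Applied to the question's own identifier columns (column names taken from the question; the sample values are hypothetical), the same melt pattern would look like:

```python
import pandas as pd

df = pd.DataFrame({'Account': ['a1', 'a2'],
                   'Customer Name': ['n1', 'n2'],
                   'Postcode': ['p1', 'p2'],
                   'segment': ['s1', 's2'],
                   'Rep': ['r1', 'r2'],
                   '2015_01': [12, 5],
                   '2015_02': [15, 8]})

id_cols = ['Account', 'Customer Name', 'Postcode', 'segment', 'Rep']
out = pd.melt(df, id_vars=id_cols, value_name='result', var_name='date')
out['date'] = pd.to_datetime(out['date'], format='%Y_%m')
```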
