index-1 with iterrows makes a new row at the end - python

Based on this table
pernr  plans  mnth  jum_mnth
123    000    1     NaN
123    001    3     NaN
123    001    6     NaN
789    002    10    NaN
789    003    2     NaN
789    003    2     NaN
789    002    2     NaN
I want to set 'jum_mnth' from 'mnth'. 'jum_mnth' should only have a value if:
it is the last row of the same plans
it is the last row of the same pernr
So I tried:
for index, row in que.iterrows():
    if row['pernr'] != nipp:
        que_cop.at[index-1, 'jum_mnth'] = mon
        nipp = row['pernr']
        plan = row['plans']
        mon = row['mnth']
    else:
        if row['plans'] == plan:
            mon = mon + row['mnth']
        else:
            que_cop.at[index-1, 'jum_mnth'] = mon
            print(str(nipp), plan, str(mon))
            plan = row['plans']
            mon = row['mnth']
    if index == que_cop.index[-2]:
        que_cop.at[index, 'jum_mnth'] = mon
But it results in a new row (index -1) at the end, like this:
pernr  plans  mnth  jum_mnth
123    000    1     1.0
123    001    3     NaN
123    001    6     9.0
789    002    10    10.0
789    003    2     NaN
789    003    2     4.0
789    002    2     NaN
NaN    NaN    NaN   0.0
and the last row doesn't have jum_mnth (it should have one).
expected:
pernr  plans  mnth  jum_mnth
123    000    1     1
123    001    3     NaN
123    001    6     9
789    002    10    10
789    003    2     NaN
789    003    2     4
789    002    2     2
So what happened? Any help would be appreciated.

You can use:
grp = (df[['pernr', 'plans']].ne(df[['pernr', 'plans']].shift())
         .any(axis=1).cumsum())
g = df.groupby(grp)['mnth']
df['jum_mnth'] = g.transform('sum').where(g.cumcount(ascending=False).eq(0))
Output:
pernr plans mnth jum_mnth
0 123 000 1 1.0
1 123 001 3 NaN
2 123 001 6 9.0
3 789 002 10 10.0
4 789 003 2 NaN
5 789 003 2 4.0
6 789 002 2 2.0
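As to what actually happened in your loop: on the first iteration index is 0, so que_cop.at[index-1, 'jum_mnth'] writes to label -1. That label does not exist, and .at (like .loc) enlarges the frame when you assign to a missing label, so a new all-NaN row labeled -1 is appended at the bottom - that is your stray row with 0.0 (mon presumably still held its initial value). And once that stray row exists, que_cop.index[-2] no longer points at the row you expect, so the end-of-frame handling misbehaves as well.

For illustration, here is a minimal runnable sketch of the grouping approach above, with your sample data re-typed; the intermediate group key is printed so you can see how consecutive (pernr, plans) runs are numbered:

import pandas as pd

df = pd.DataFrame({
    'pernr': [123, 123, 123, 789, 789, 789, 789],
    'plans': ['000', '001', '001', '002', '003', '003', '002'],
    'mnth':  [1, 3, 6, 10, 2, 2, 2],
})

# consecutive-run key: increments whenever pernr or plans differs from the previous row
grp = (df[['pernr', 'plans']].ne(df[['pernr', 'plans']].shift())
         .any(axis=1).cumsum())
print(grp.tolist())  # [1, 2, 2, 3, 4, 4, 5]

# sum mnth per run, but keep the value only on the last row of each run
g = df.groupby(grp)['mnth']
df['jum_mnth'] = g.transform('sum').where(g.cumcount(ascending=False).eq(0))
print(df)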

Related

How to group, sort and calculate difference in this pandas dataframe?

I created this dataframe and need to group my data into categories with the same number of beds, city, and baths, and sort (descending) the elements in each group by price.
Secondly, I need to find the difference between each price and the one ranked right after it within the same group.
For example, for a group like this:
1 bed, 1 bath, Madrid, 10
1 bed, 1 bath, Madrid, 8
1 bed, 1 bath, Madrid, 5
1 bed, 1 bath, Madrid, 1
I should get 2, 3, 4...
I tried some code but it seems far from what I expect to find...
data=[[1,'paris',1,2,'10'],[2,'madrid',2,2,8],[3,'madrid',2,2,11],[4,'paris',1,2,6],[5,'paris',1,2,5],[6,'madrid',2,1,7],[7,'paris',2,1,7],[8,'paris',2,1,7],[9,'madrid',1,4],[10,'paris',2,1,3],[11,'madrid',2,2,7],[12,'paris',2,3,12],[13,'madrid',2,3,7],[14,'madrid',1,1,3],[15,'paris',1,1,3],[16,'madrid',1,1,4],[17,'paris',1,1,5]]
df=pd.DataFrame(data, columns=['id','city','beds','baths','price'])
df
df['gap'] = df.sort_values('price',ascending=False).groupby(['city','beds','baths'])['price'].diff()
print (df)
Many thanks in advance.
I would use pd.to_numeric with errors='coerce' to get rid of the strings in the price column, then calculate the difference while ignoring the rooms whose price is unknown (using DataFrame.dropna). Then I show the result both in the original order and sorted:
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['city', 'beds', 'baths'])['price'].diff(-1))
or using GroupBy.shift:
df['difference_price'] = df['price'].sub(df.dropna()
                                           .sort_values('price', ascending=False)
                                           .groupby(['city', 'beds', 'baths'])
                                           .price
                                           .shift(-1))
Display the result:
print(df, '\n'*3, 'Sorted DataFrame: ')
print(df.sort_values(['city','beds','baths','price'], ascending=[True,True,True,False]))
Output
id city beds baths price difference_price
0 1 paris 1 2 10.0 4.0
1 2 madrid 2 2 8.0 1.0
2 3 madrid 2 2 11.0 3.0
3 4 paris 1 2 6.0 1.0
4 5 paris 1 2 5.0 NaN
5 6 madrid 2 1 7.0 NaN
6 7 paris 2 1 7.0 0.0
7 8 paris 2 1 7.0 4.0
8 9 madrid 1 4 NaN NaN
9 10 paris 2 1 3.0 NaN
10 11 madrid 2 2 7.0 NaN
11 12 paris 2 3 12.0 NaN
12 13 madrid 2 3 7.0 NaN
13 14 madrid 1 1 3.0 NaN
14 15 paris 1 1 3.0 NaN
15 16 madrid 1 1 4.0 1.0
16 17 paris 1 1 5.0 2.0
Sorted DataFrame:
id city beds baths price difference_price
15 16 madrid 1 1 4.0 1.0
13 14 madrid 1 1 3.0 NaN
8 9 madrid 1 4 NaN NaN
5 6 madrid 2 1 7.0 NaN
2 3 madrid 2 2 11.0 3.0
1 2 madrid 2 2 8.0 1.0
10 11 madrid 2 2 7.0 NaN
12 13 madrid 2 3 7.0 NaN
16 17 paris 1 1 5.0 2.0
14 15 paris 1 1 3.0 NaN
0 1 paris 1 2 10.0 4.0
3 4 paris 1 2 6.0 1.0
4 5 paris 1 2 5.0 NaN
6 7 paris 2 1 7.0 0.0
7 8 paris 2 1 7.0 4.0
9 10 paris 2 1 3.0 NaN
11 12 paris 2 3 12.0 NaN
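As a side note, a tiny sketch (using the made-up 10, 8, 5, 1 prices from the question) showing why the two variants above agree: diff(-1) subtracts the next element from the current one, which is exactly what subtracting shift(-1) does:

import pandas as pd

s = pd.Series([10.0, 8.0, 5.0, 1.0])
print(s.diff(-1).tolist())          # [2.0, 3.0, 4.0, nan]
print(s.sub(s.shift(-1)).tolist())  # [2.0, 3.0, 4.0, nan]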
If I understand correctly, regarding
"group my data into category with the same number of beds, city, baths and sort(descending)"
should all rows that do not fulfill the condition (i.e. where beds and baths differ) be deleted? This is my code for the problem as stated:
import numpy as np
import pandas as pd
data=[[1,'paris',1,2,'10'],[2,'madrid',2,2,8],[3,'madrid',2,2,11],[4,'paris',1,2,6],[5,'paris',1,2,5],[6,'madrid',2,1,7],[7,'paris',2,1,7],[8,'paris',2,1,7],[9,'madrid',1,4],[10,'paris',2,1,3],[11,'madrid',2,2,7],[12,'paris',2,3,12],[13,'madrid',2,3,7],[14,'madrid',1,1,3],[15,'paris',1,1,3],[16,'madrid',1,1,4],[17,'paris',1,1,5]]
df=pd.DataFrame(data, columns=['id','city','beds','baths','price'])
df_new = df[df['beds'] == df['baths']]
df_new = df_new.sort_values(['city','price'],ascending=[False,False]).reset_index(drop=True)
df_new['diff_price'] = df_new.groupby(['city','beds','baths'])['price'].diff(-1)
print(df_new)
Output:
id city beds baths price diff_price
0 17 paris 1 1 5 NaN
1 15 paris 1 1 3 -2
2 3 madrid 2 2 11 NaN
3 2 madrid 2 2 8 -3
4 11 madrid 2 2 7 -1
5 16 madrid 1 1 4 NaN
6 14 madrid 1 1 3 -1

Resolve complementary missing values between rows

I have a df that looks like this
    Day  ID             ID_2     AS   D    E    AS1  D1   E1
29  72   Participant 1  PS 6 42  NaN  NaN  NaN  NaN  NaN  NaN
35  78   Participant 1  NaN      yes  3    no   2    no   2
49  22   Participant 2  PS 1 89  NaN  NaN  NaN  NaN  NaN  NaN
85  18   Participant 2  NaN      yes  3    no   2    no   2
I'm looking for a way to add the ID_2 column value to all rows where ID matches (i.e., for Participant 1, fill in the NaN values with the values from the other row where ID=Participant 1). I've looked into using combine but that doesn't seem to work for this particular case.
Expected output:
    Day  ID             ID_2     AS   D  E   AS1  D1  E1
29  72   Participant 1  PS 6 42  yes  3  no  2    no  2
35  78   Participant 1  PS 6 42  yes  3  no  2    no  2
49  22   Participant 2  PS 1 89  yes  3  no  2    no  2
85  18   Participant 2  PS 1 89  yes  3  no  2    no  2
or
    Day  ID             ID_2     AS   D    E    AS1  D1   E1
29  72   Participant 1  PS 6 42  NaN  NaN  NaN  NaN  NaN  NaN
35  78   Participant 1  PS 6 42  yes  3    no   2    no   2
49  22   Participant 2  PS 1 89  NaN  NaN  NaN  NaN  NaN  NaN
85  18   Participant 2  PS 1 89  yes  3    no   2    no   2
I think you could try
df.ID_2 = df.groupby('ID').ID_2.ffill()
# 29 PS 6 42
# 35 PS 6 42
# 49 PS 1 89
# 85 PS 1 89
Not tested, but something like this should work - can't copy your df into my browser.
print(df)
Day ID ID_2 AS D E AS1 D1 E1
0 72 Participant_1 PS_6_42 NaN NaN NaN NaN NaN NaN
1 78 Participant_1 NaN yes 3.0 no 2.0 no 2.0
2 22 Participant_2 PS_1_89 NaN NaN NaN NaN NaN NaN
3 18 Participant_2 NaN yes 3.0 no 2.0 no 2.0
df2 = df.set_index('ID').groupby('ID').transform('ffill').transform('bfill').reset_index()
print(df2)
ID Day ID_2 AS D E AS1 D1 E1
0 Participant_1 72 PS_6_42 yes 3 no 2 no 2
1 Participant_1 78 PS_6_42 yes 3 no 2 no 2
2 Participant_2 22 PS_1_89 yes 3 no 2 no 2
3 Participant_2 18 PS_1_89 yes 3 no 2 no 2
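For completeness, a small runnable sketch (your data re-typed by hand, with underscores as in the answer above) that keeps the fill strictly inside each participant by forward- and back-filling per group, so values cannot leak across IDs:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Day':  [72, 78, 22, 18],
    'ID':   ['Participant_1', 'Participant_1', 'Participant_2', 'Participant_2'],
    'ID_2': ['PS_6_42', np.nan, 'PS_1_89', np.nan],
    'AS':   [np.nan, 'yes', np.nan, 'yes'],
    'D':    [np.nan, 3, np.nan, 3],
    'E':    [np.nan, 'no', np.nan, 'no'],
    'AS1':  [np.nan, 2, np.nan, 2],
    'D1':   [np.nan, 'no', np.nan, 'no'],
    'E1':   [np.nan, 2, np.nan, 2],
})

# forward- then back-fill within each participant only
filled = (df.groupby('ID', group_keys=False)
            .apply(lambda g: g.ffill().bfill()))
print(filled)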

Calculations between different rows

I am trying to run a loop over a pandas dataframe that takes two arguments from different rows. I tried to use .iloc and shift but did not manage to get the result I need.
Here's a simple example to explain better what I want to do:
dataframe1:
a b c
0 101 1 aaa
1 211 2 dcd
2 351 3 yyy
3 401 5 lol
4 631 6 zzz
For the above df I want to make a new column ('d') that holds the difference between consecutive values in column 'a', but only if the difference between the corresponding values in column 'b' is equal to 1; otherwise the value should be null, like the following dataframe2:
a b c d
0 101 1 aaa nan
1 211 2 dcd 110
2 351 3 yyy 140
3 401 5 lol nan
4 631 6 zzz 230
Is there a built-in function that can handle this kind of calculation?
Try like this, using loc and diff():
df.loc[df.b.diff() == 1, 'd'] = df.a.diff()
>>> df
a b c d
0 101 1 aaa NaN
1 211 2 dcd 110.0
2 351 3 yyy 140.0
3 401 5 lol NaN
4 631 6 zzz 230.0
You can create a group key
df1.groupby(df1.b.diff().ne(1).cumsum()).a.diff()
Out[361]:
0 NaN
1 110.0
2 140.0
3 NaN
4 230.0
Name: a, dtype: float64
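A small usage sketch (with the example data above re-typed) assigning the group-key result back to a new column 'd':

import pandas as pd

df1 = pd.DataFrame({
    'a': [101, 211, 351, 401, 631],
    'b': [1, 2, 3, 5, 6],
    'c': ['aaa', 'dcd', 'yyy', 'lol', 'zzz'],
})

# rows where b increments by 1 stay in one group; a jump in b starts a new group,
# so the within-group diff of column a is NaN exactly where the b-difference is not 1
df1['d'] = df1.groupby(df1.b.diff().ne(1).cumsum()).a.diff()
print(df1)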

Transposing and Aggregating DataFrame

I have a dataframe like this
name tag time val
0 ABC A 1 10
0 ABC A 1 12
1 ABC B 1 12
1 ABC B 1 14
2 ABC A 2 11
3 ABC C 2 12
4 DEF B 3 10
5 DEF C 3 9
6 GHI A 4 14
7 GHI B 4 12
8 GHI C 5 10
Each row is a timestamp and shows the value between the name and tag in that row.
What I want is a dataframe where each row shows the mean value from each tag at each timestamp, like this:
name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
I can achieve this successfully by grouping by name and time and returning a transposed series each time:
def transpose_df(observation_df):
    ser = pd.Series()
    for tag in tags:
        ser[tag] = observation_df[observation_df['tag'] == tag]['val'].mean()
    return ser

tdf = df.groupby(['name', 'time']).apply(transpose_df).reset_index()
But this is slow. I feel like there must be a smarter way using a built-in transpose/reshape tool, but I can't figure it out. Can anyone suggest a better alternative?
In [175]: df.pivot_table(index=['name','time'], columns='tag', values='val').reset_index()
Out[175]:
tag name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
Option 1
Use pivot_table:
df.pivot_table(values='val',index=['name','time'],columns='tag',aggfunc='mean').reset_index()
Output:
tag name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
Option 2:
Use groupby and unstack
df.groupby(['name','time','tag']).agg('mean')['val'].unstack().reset_index()
Output:
tag name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
Option 3
Use set_index and mean and unstack:
df.set_index(['name','time','tag']).mean(level=[0,1,2])['val'].unstack().reset_index()
Output:
tag name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
You can also groupby and then unstack (equivalent to a pivot table).
>>> df.groupby(['name', 'time', 'tag'])['val'].mean().unstack('tag').reset_index()
tag name time A B C
0 ABC 1 11 13 NaN
1 ABC 2 11 NaN 12
2 DEF 3 NaN 10 9
3 GHI 4 14 12 NaN
4 GHI 5 NaN NaN 10
By the way, transform is for when you want to maintain the shape of your original dataframe, e.g.
>>> df.assign(tag_mean=df.groupby(['name', 'time', 'tag'])['val'].transform(np.mean))
name tag time val tag_mean
0 ABC A 1 10 11
0 ABC A 1 12 11
1 ABC B 1 12 13
1 ABC B 1 14 13
2 ABC A 2 11 11
3 ABC C 2 12 12
4 DEF B 3 10 10
5 DEF C 3 9 9
6 GHI A 4 14 14
7 GHI B 4 12 12
8 GHI C 5 10 10
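For reference, a minimal runnable sketch (the sample data above re-typed by hand) that reproduces the pivot_table result:

import pandas as pd

df = pd.DataFrame({
    'name': ['ABC'] * 6 + ['DEF'] * 2 + ['GHI'] * 3,
    'tag':  ['A', 'A', 'B', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C'],
    'time': [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5],
    'val':  [10, 12, 12, 14, 11, 12, 10, 9, 14, 12, 10],
})

# mean of val for every (name, time, tag) combination, with tags spread into columns
out = df.pivot_table(index=['name', 'time'], columns='tag',
                     values='val', aggfunc='mean').reset_index()
print(out)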

Pandas: group some data

I have dataframe
date id
0 12-12-2015 123
1 13-12-2015 123
2 15-12-2015 123
3 16-12-2015 123
4 18-12-2015 123
5 10-12-2015 456
6 13-12-2015 456
7 15-12-2015 456
And I want to get
id date count
0 123 10-12-2015 0
1 123 11-12-2015 0
2 123 12-12-2015 1
3 123 13-12-2015 1
4 123 14-12-2015 0
5 123 15-12-2015 1
6 123 16-12-2015 1
7 123 17-12-2015 0
8 123 18-12-2015 1
9 456 10-12-2015 1
10 456 11-12-2015 0
11 456 12-12-2015 0
12 456 13-12-2015 1
13 456 14-12-2015 0
14 456 15-12-2015 1
I tried:
df = df.groupby('id').resample('D').size().reset_index(name='val')
But it only counts the dates between the existing ones for each id separately. How can I do it over a fixed common period?
You can achieve what you want by reindexing in the aggregation of each group and filling NaNs with 0.
import io
import pandas as pd
data = io.StringIO("""\
date id
0 12-12-2015 123
1 13-12-2015 123
2 15-12-2015 123
3 16-12-2015 123
4 18-12-2015 123
5 10-12-2015 456
6 13-12-2015 456
7 15-12-2015 456""")
df = pd.read_csv(data, delim_whitespace=True)
df['date'] = pd.to_datetime(df['date'], format="%d-%m-%Y")
startdate = df['date'].min()
enddate = df['date'].max()
alldates = pd.date_range(startdate, enddate, freq='D', name='date')
def process_id(g):
    return g.resample('D').size().reindex(alldates).fillna(0)

output = (df.set_index('date')
            .groupby('id')
            .apply(process_id)
            .stack()
            .rename('val')
            .reset_index('id'))
print(output)
# id val
# date
# 2015-12-10 123 0.0
# 2015-12-11 123 0.0
# 2015-12-12 123 1.0
# 2015-12-13 123 1.0
# 2015-12-14 123 0.0
# 2015-12-15 123 1.0
# 2015-12-16 123 1.0
# 2015-12-17 123 0.0
# 2015-12-18 123 1.0
# 2015-12-10 456 1.0
# 2015-12-11 456 0.0
# 2015-12-12 456 0.0
# 2015-12-13 456 1.0
# 2015-12-14 456 0.0
# 2015-12-15 456 1.0
# 2015-12-16 456 0.0
# 2015-12-17 456 0.0
# 2015-12-18 456 0.0
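If you prefer integer counts, as in your expected output, you can cast the column back afterwards; a small follow-up assuming the output frame produced above:

output = output.reset_index()[['id', 'date', 'val']]
output['val'] = output['val'].astype(int)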
