Here's what my data look like:
user_id  prior_elapse_time  timestamp
115      NaN                0
115      10                 1000
115      5                  2000
222212   NaN                0
222212   8                  500
222212   12                 3000
222212   NaN                5000
222212   15                 8000
I found similar posts that teach me how to get the first occurrence of a user:
train_df.groupby('user_id')['prior_elapse_time'].first()
This nicely gets me the first appearance of each user. However, I'm at a loss as to how to assign 0 to the NaN only at the first occurrence of each user. Due to a logging error, NaN also appears elsewhere, but I only want to assign 0 to the NaN in each user's first row (the rows with timestamp 0).
I also tried
train_df['prior_elapse_time'][(train_df['prior_elapse_time'].isna()) & (train_df['timestamp'] == 0)] = 0
But then I get the "copy" vs. "view" assignment problem (which I don't fully understand).
Any help?
If your df is sorted by user_id:
>>> df.loc[df.user_id.diff().ne(0), 'prior_elapse_time'] = 0
>>> df
user_id prior_elapse_time timestamp
0 115 0.0 0
1 115 10.0 1000
2 115 5.0 2000
3 222212 0.0 0
4 222212 8.0 500
5 222212 12.0 3000
6 222212 NaN 5000
7 222212 15.0 8000
Alternatively, use pandas.Series.mask
>>> df['prior_elapse_time'] = df.prior_elapse_time.mask(df.user_id.diff().ne(0), 0)
If not sorted, then get the indices via groupby:
>>> idx = df.reset_index().groupby('user_id')['index'].first()
>>> df.loc[idx, 'prior_elapse_time'] = 0
If you want to set 0 only where the value was previously NaN, combine the condition with a pandas.Series.isnull mask on the column:
>>> df.loc[
...     df.user_id.diff().ne(0) & df.prior_elapse_time.isnull(),
...     'prior_elapse_time'
... ] = 0
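As a minimal sketch of the same idea that does not depend on sort order, the first row of each user can also be flagged with `Series.duplicated`, which is a swapped-in alternative and not part of the answer above. The frame below is rebuilt from the question's sample data. Doing the assignment in a single `.loc` call is also what avoids the "copy" vs. "view" warning, because chained indexing such as `df[col][mask] = 0` may write to a temporary object.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "user_id": [115, 115, 115, 222212, 222212, 222212, 222212, 222212],
    "prior_elapse_time": [np.nan, 10, 5, np.nan, 8, 12, np.nan, 15],
    "timestamp": [0, 1000, 2000, 0, 500, 3000, 5000, 8000],
})

# True on the first row seen for each user_id, regardless of sort order.
first_row = ~df["user_id"].duplicated()

# Single-step assignment through .loc: only first-row NaNs become 0.
df.loc[first_row & df["prior_elapse_time"].isna(), "prior_elapse_time"] = 0
```

The NaN at timestamp 5000 is left untouched because that row is not the user's first occurrence.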
I have a pandas dataframe of points and the corresponding distances to other points. I can get the minimal value of the calculated columns, but I need the column name itself. I can't figure out how to put the column names corresponding to those values in a new column. My dataframe looks like this:
df.head():
0 1 2 ... 6 7 min
9 58.0 94.0 984.003636 ... 696.667367 218.039561 218.039561
71 100.0 381.0 925.324708 ... 647.707783 169.856557 169.856557
61 225.0 69.0 751.353014 ... 515.152768 122.377490 122.377490
Columns 0 and 1 are the data points; the rest are distances to data points #1 to 7. The number of points can differ in some cases, but that does not matter for the question. The code I use to compute the minimum is the following:
new = users.iloc[:,2:].min(axis=1)
users["min"] = new
#could also do the following way
#users.assign(Min=lambda users: users.iloc[:,2:].min(1))
This is quite simple, and there is not much to finding the minimum of multiple columns. However, I need the column name instead of the value. So my desired output would look like this (in the example all are 7, which is not a rule):
0 1 2 ... 6 7 min
9 58.0 94.0 984.003636 ... 696.667367 218.039561 7
71 100.0 381.0 925.324708 ... 647.707783 169.856557 7
61 225.0 69.0 751.353014 ... 515.152768 122.377490 7
Is there a simple way to achieve this?
Use df.idxmin:
In [549]: df['min'] = df.iloc[:,2:].idxmin(axis=1)
In [550]: df
Out[550]:
0 1 2 6 7 min
9 58.0 94.0 984.003636 696.667367 218.039561 7
71 100.0 381.0 925.324708 647.707783 169.856557 7
61 225.0 69.0 751.353014 515.152768 122.377490 7
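A small self-contained sketch of the `idxmin` call, using a hypothetical frame shaped like the one in the question (the distance values are made up for illustration, and the last row is chosen so that the answer is not always 7):

```python
import pandas as pd

# Hypothetical data: columns 0 and 1 are the point, 2/6/7 are distances.
users = pd.DataFrame(
    {
        0: [58.0, 100.0, 225.0],
        1: [94.0, 381.0, 69.0],
        2: [984.0, 925.3, 100.0],
        6: [696.7, 647.7, 515.2],
        7: [218.0, 169.9, 122.4],
    },
    index=[9, 71, 61],
)

# idxmin(axis=1) returns, per row, the label of the column holding the minimum.
users["min"] = users.iloc[:, 2:].idxmin(axis=1)
```

Here `users["min"]` holds the column labels 7, 7 and 2 for the three rows.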
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID','tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   # .drop(columns = 'gestationalAgeInWeeks')  # don't need this
   .groupby(['MotherID', 'PregnancyID','tm'])['abdomCirc']  # change here
   .max()
   .unstack()
   .add_prefix('abdomCirc_')  # prefix after unstack so only the column labels change
   .reset_index()  # and here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.pivot_table(index= ['MotherID', 'PregnancyID'], columns='tm',
values= 'abdomCirc', aggfunc='max')
.add_prefix('abdomCirc_') # remove this if you don't want the prefix
.reset_index()
)
Output:
tm  MotherID  PregnancyID  abdomCirc_1  abdomCirc_2  abdomCirc_3
0          0            0          NaN        200.0          NaN
1          1            1          NaN        315.0        350.0
2          2            2        180.0          NaN          NaN
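To check that the id columns survive as regular columns, here is a minimal end-to-end sketch on the question's sample data. It assumes the prefix is applied after `unstack`, so that only the trimester column labels are renamed and the id values stay numeric.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "PregnancyID": [0, 0, 1, 1, 1, 2, 2, 2],
    "MotherID": [0, 0, 1, 1, 1, 2, 2, 2],
    "gestationalAgeInWeeks": [14, 21, 20, 25, 30, 8, 9, 18],
    "abdomCirc": [150, 200, 294, 315, 350, 170, 180, np.nan],
})

out = (
    df.assign(tm=(df["gestationalAgeInWeeks"] + 13 - 1) // 13)  # trimester 1..3
      .groupby(["MotherID", "PregnancyID", "tm"])["abdomCirc"]
      .max()
      .unstack()                   # one column per trimester
      .add_prefix("abdomCirc_")    # renames the column labels only
      .reset_index()               # ids become ordinary columns again
)
```

After `reset_index`, `MotherID` and `PregnancyID` show up in `out.columns` and are written out by `to_csv`, so the frame can be merged on them.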
I want to assign a value to each item in one list based on values in another list. For each value, I'm trying to find which range from the other list it belongs to.
I tried .merge but it didn't work. I also tried a for loop over the whole list, but I was not able to connect all the pieces.
I have two tables and I want to build the third one:
import numpy as np
import pandas as pd
s = pd.Series([0,1001,2501])
t = pd.Series([1000,2500,4000])
u=pd.Series([6.5,8.5,10])
df = pd.DataFrame(s,columns = ["LRange"])
df["uRange"] =t
df["Cost"]=u
print (df)
p=pd.Series([550,1240,2530,230])
dp=pd.DataFrame(p,columns = ["Power"])
print (dp)
LRange uRange Cost
0 0 1000 6.5
1 1001 2500 8.5
2 2501 4000 10
Power
1 550
2 1240
3 2530
4 230
I want my result to be:
Power Cost p/kW
1 550 6.5
2 1240 8.5
3 2530 10.0
4 230 6.5
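One possible approach, sketched here under the assumption that the ranges never overlap and that both bounds are inclusive, is to turn the two bound columns into a `pandas.IntervalIndex` and look each power value up in it. The column name `Cost p/kW` is taken from the desired output above.

```python
import numpy as np
import pandas as pd

# The two tables from the question.
df = pd.DataFrame({
    "LRange": [0, 1001, 2501],
    "uRange": [1000, 2500, 4000],
    "Cost": [6.5, 8.5, 10.0],
})
dp = pd.DataFrame({"Power": [550, 1240, 2530, 230]})

# One closed interval [LRange, uRange] per cost band.
bins = pd.IntervalIndex.from_arrays(df["LRange"], df["uRange"], closed="both")

# Position of the band containing each power value, or -1 if none matches.
pos = bins.get_indexer(dp["Power"].to_numpy())

# Look up the cost; values that fall outside every band become NaN.
cost = df["Cost"].to_numpy()
dp["Cost p/kW"] = np.where(pos >= 0, cost[pos], np.nan)
```

The row order of `dp` is preserved, which a sort-based approach such as `merge_asof` would not guarantee without extra steps.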
I have a dataframe which looks like this
Date |index_numer
26/08/17|200
27/08/17|300
28/08/17|400
29/08/17|100
30/08/17|150
01/09/17|160
02/09/17|170
03/09/17|280
I am trying to do a division where each row is divided by the next row.
Date |index_numer| Divison by next row
26/08/17|200 | 0.666
27/08/17|300 | 0.75
28/08/17|400 | 4
29/08/17|100 |..
I did this in a for loop, then extracted the quotients and merged them back into the DataFrame. However, I am not sure whether it can be done directly in pandas/numpy.
Does anyone have any idea?
Use shift:
df['divison'] = df['index_numer'] / df['index_numer'].shift(-1)
Output:
Date index_numer divison
0 26/08/17 200 0.666667
1 27/08/17 300 0.750000
2 28/08/17 400 4.000000
3 29/08/17 100 0.666667
4 30/08/17 150 0.937500
5 01/09/17 160 0.941176
6 02/09/17 170 0.607143
7 03/09/17 280 NaN
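To make the alignment explicit, this short sketch shows the shifted column next to the original on the first four rows of the data; the helper column name `next_row` is hypothetical and only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["26/08/17", "27/08/17", "28/08/17", "29/08/17"],
    "index_numer": [200, 300, 400, 100],
})

# shift(-1) moves every value up one row, so each row sees the next row's value.
df["next_row"] = df["index_numer"].shift(-1)

# Element-wise division; the last row has no successor and stays NaN.
df["divison"] = df["index_numer"] / df["next_row"]
```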
I would like to change certain DataFrame values only if a condition is met for n consecutive rows.
Example:
df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12,0]=-40
df.iloc[10:12,1]=-40
Which gives me this DF:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 -1.045209
4 40.000000 0.598657 -1.268399
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 1.744822
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 1.416020
10 -1.337494 -40.000000 -1.195780
11 -0.703669 -40.000000 0.657519
12 -40.000000 -0.288235 -0.840145
13 -1.084869 -0.298030 -1.592004
14 -0.617568 -1.046210 -0.531523
Now, if I do
a=df.copy()
a[ abs(a) > abs(a.std()) ] = float('nan')
I get
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 NaN 0.598657 NaN
5 NaN 0.442297 -0.016363
6 NaN -0.316817 NaN
7 NaN 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
which is fair. However, I would like to replace the values with NaN only where the condition is met by at most 2 consecutive entries (so I can interpolate later). For example, I want the result to be
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 40.000000 0.598657 NaN
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 NaN
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
Apparently there's no ready-to-use method to do this. The solution I found that closest resembles my problem was this one, but I couldn't make it work for me.
Any ideas?
See below - the tricky part is (cond[c] != cond[c].shift(1)).cumsum() which breaks the data into contiguous runs of the same value.
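In [23]: cond = abs(df) > abs(df.std())
In [24]: for c in df.columns:
    ...:     # number each contiguous run; runs that fail the condition become group 0
    ...:     grouper = (cond[c] != cond[c].shift(1)).cumsum() * cond[c]
    ...:     # flag rows that meet the condition and sit in a run of at most 2 rows
    ...:     fill = cond[c] & (df.groupby(grouper)[c].transform('size') <= 2)
    ...:     df.loc[fill, c] = np.nan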
In [25]: df
Out[25]:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 40.000000 0.598657 NaN
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 NaN
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
To explain a bit more, cond[c] is a boolean series indicating whether your condition is true or not.
The cond[c] != cond[c].shift(1) compares the current row's condition to the previous row's. This has the effect of 'marking' where a run of values begins with the value True.
The .cumsum() converts the bools to integers and takes the cumulative sum. It may not be immediately intuitive, but this 'numbers' the groups of contiguous values. Finally, the * cond[c] reassigns all groups that didn't meet the criteria to 0 (using False == 0).
So now you have numbered groups of contiguous rows that meet your condition. The next step performs a groupby to count how many values are in each group (transform('size')).
Finally, a new bool condition is used to assign missing values to those groups with 2 or fewer values meeting the condition.
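The run-numbering steps can be traced on a tiny boolean series. This is only an illustrative sketch; the names `starts`, `run_id`, `run_len` and `short` are hypothetical and do not appear in the code above.

```python
import pandas as pd

cond = pd.Series([False, True, True, False, True, True, True, False])

# True wherever the value differs from the previous row, i.e. a new run starts.
starts = cond != cond.shift(1)

# Cumulative sum of the run starts gives every run its own number.
run_id = starts.cumsum()                 # 1, 2, 2, 3, 4, 4, 4, 5

# Multiplying by the condition collapses all False runs into group 0.
grouper = run_id * cond.astype(int)      # 0, 2, 2, 0, 4, 4, 4, 0

# Length of the run each row belongs to.
run_len = grouper.groupby(grouper).transform("size")

# Rows that meet the condition and sit in a run of at most 2 rows.
short = cond & (run_len <= 2)
```

Only the two-row run at positions 1 and 2 is selected; the three-row run at positions 4 to 6 is left alone.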