I have a dataset where I want to fill in the missing values of columns P and Q in Python, as shown below:
Dataset:
| P | Q |
|678|1420|
|678|---|
|609|---|
|583|1260|
|---|1260|
|---|1261|
|---|1262|
|584|1263|
|---|403|
|---|---|
Expected Result:
| P | Q |
|678|1420|
|678|1420|
|609|---|
|583|1260|
|583|1260|
|583|1261|
|583|1262|
|584|1263|
|584|403|
|584|403|
I have filled column P using fillna(), but I cannot do the same for column Q, since its values need to be filled per key pair.
Can someone please help?
With the samples you have shown, please try the following.
df['P'] = df['P'].ffill()
df['Q'] = df.groupby('P')['Q'].ffill()
Output will be as follows:
P Q
0 678.0 1420.0
1 678.0 1420.0
2 609.0 NaN
3 583.0 1260.0
4 583.0 1260.0
5 583.0 1261.0
6 583.0 1262.0
7 584.0 1263.0
8 584.0 403.0
9 584.0 403.0
ffill documentation
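For reference, a minimal runnable sketch of the same idea (the data is taken from your sample, with --- read as NaN); the final print should reproduce the output above:
import numpy as np
import pandas as pd

# sample data from the question; --- entries become NaN
df = pd.DataFrame({
    'P': [678, 678, 609, 583, np.nan, np.nan, np.nan, 584, np.nan, np.nan],
    'Q': [1420, np.nan, np.nan, 1260, 1260, 1261, 1262, 1263, 403, np.nan],
})

df['P'] = df['P'].ffill()               # fill P downwards
df['Q'] = df.groupby('P')['Q'].ffill()  # fill Q downwards within each P group
print(df)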
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.drop(columns = 'gestationalAgeInWeeks')
.groupby(['MotherID', 'PregnancyID','tm'])
.agg('max')
.unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1, 2 and 3. The id columns only appear when running df.head(), as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
# .drop(columns = 'gestationalAgeInWeeks') # don't need this
.groupby(['MotherID', 'PregnancyID','tm'])['abdomCirc'] # change here
.max()
.unstack()
.add_prefix('abdomCirc_') # prefix the tm columns after unstacking
.reset_index() # and bring the ids back as regular columns
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.pivot_table(index= ['MotherID', 'PregnancyID'], columns='tm',
values= 'abdomCirc', aggfunc='max')
.add_prefix('abdomCirc_') # remove this if you don't want the prefix
.reset_index()
)
Output:
tm MotherID PregnancyID abdomCirc_1 abdomCirc_2 abdomCirc_3
0 0 0 NaN 200.0 NaN
1 1 1 NaN 315.0 350.0
2 2 2 180.0 NaN NaN
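For completeness, here is a minimal construction of the sample frame (values copied from the question), so both chains above can be run as-is:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'PregnancyID': [0, 0, 1, 1, 1, 2, 2, 2],
    'MotherID': [0, 0, 1, 1, 1, 2, 2, 2],
    'gestationalAgeInWeeks': [14, 21, 20, 25, 30, 8, 9, 18],
    'abdomCirc': [150, 200, 294, 315, 350, 170, 180, np.nan],
})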
Imagine there is a dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
here is the create dataframe command:
import pandas as pd
import numpy as np
users=pd.DataFrame(
[
{'id':1,'date':'01/01/2019', 'transaction_total':-1, 'balance_total':102},
{'id':1,'date':'01/02/2019', 'transaction_total':-2, 'balance_total':100},
{'id':1,'date':'01/03/2019', 'transaction_total':np.nan, 'balance_total':100},
{'id':1,'date':'01/04/2019', 'transaction_total':np.nan, 'balance_total':100},
{'id':1,'date':'01/05/2019', 'transaction_total':-4, 'balance_total':np.nan},
{'id':2,'date':'01/01/2019', 'transaction_total':-2, 'balance_total':200},
{'id':2,'date':'01/02/2019', 'transaction_total':-2, 'balance_total':100},
{'id':2,'date':'01/04/2019', 'transaction_total':np.nan, 'balance_total':100},
{'id':2,'date':'01/05/2019', 'transaction_total':-4, 'balance_total':96}
]
)
How could I check whether each id has consecutive dates or not? I used the "shift" idea from here, but it doesn't seem to work:
Calculating time difference between two rows
df['index_col'] = df.index
for id in df['id'].unique():
    # create an empty QA dataframe
    column_names = ["Delta"]
    df_qa = pd.DataFrame(columns = column_names)
    df_qa['Delta'] = (df['index_col'] - df['index_col'].shift(1))
    if (df_qa['Delta'].iloc[1:] != 1).any() is True:
        print('id ' + id + ' might have non-consecutive dates')
        # doesn't print any account => Each Customer's Daily Balance has Consecutive Dates
        break
Ideal output:
it should print id 2 might have non-consecutive dates
Thank you!
Use groupby and diff:
df["date"] = pd.to_datetime(df["date"],format="%m/%d/%Y")
df["difference"] = df.groupby("id")["date"].diff()
print (df.loc[df["difference"]>pd.Timedelta(1, unit="d")])
#
id date transaction_total balance_total difference
7 2 2019-01-04 NaN 100.0 2 days
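If you only need the offending ids for a message like the one you described, a small follow-up sketch (building on the difference column computed above) could be:
bad_ids = df.loc[df["difference"] > pd.Timedelta(1, unit="d"), "id"].unique()
for i in bad_ids:
    print(f"id {i} might have non-consecutive dates")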
Use DataFrameGroupBy.diff with Series.dt.days, compare for values greater than 1, and filter only the id column with DataFrame.loc:
users['date'] = pd.to_datetime(users['date'])
i = users.loc[users.groupby('id')['date'].diff().dt.days.gt(1), 'id'].tolist()
print (i)
[2]
for val in i:
    print(f'id {val} might have non-consecutive dates')
id 2 might have non-consecutive dates
The first step is to parse the date column:
users['date'] = pd.to_datetime(users.date)
Then add shifted copies of the id and date columns:
users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)
The difference between date and date_shifted columns is of interest:
>>> users.date - users.date_shifted
0 NaT
1 1 days
2 1 days
3 1 days
4 1 days
5 -4 days
6 1 days
7 2 days
8 1 days
dtype: timedelta64[ns]
You can now query the DataFrame for what you want:
users[(users.id_shifted == users.id) & (users.date - users.date_shifted != np.timedelta64(1, 'D'))]
That is, consecutive lines of the same user with a date difference != 1 day.
This solution does assume the data is sorted by (id, date).
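If that ordering is not guaranteed in your data, sorting first is a one-liner (a small sketch, using the column names from the question):
users = users.sort_values(['id', 'date']).reset_index(drop=True)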
I have a dataframe which looks like this
Date |index_numer
26/08/17|200
27/08/17|300
28/08/17|400
29/08/17|100
30/08/17|150
01/09/17|160
02/09/17|170
03/09/17|280
I am trying to do a division where each row is divided by the next row.
Date |index_numer| Divison by next row
26/08/17|200 | 0.666
27/08/17|300 | 0.75
28/08/17|400 | 4
29/08/17|100 |..
I did this in a for loop, extracted the division number and merged it back into the DF. However, I am not sure if it can be done directly in pandas/numpy.
Does anyone have any idea?
Use shift:
df['division'] = df['index_numer'] / df['index_numer'].shift(-1)
Output:
Date index_numer division
0 26/08/17 200 0.666667
1 27/08/17 300 0.750000
2 28/08/17 400 4.000000
3 29/08/17 100 0.666667
4 30/08/17 150 0.937500
5 01/09/17 160 0.941176
6 02/09/17 170 0.607143
7 03/09/17 280 NaN
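shift(-1) moves index_numer up by one position, so each value is paired with the value on the row below it before dividing; the last row has no successor, which is why it ends up as NaN.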
I am trying to get the weighted mean for each column (A-F) of a pandas DataFrame, with "Value" as the weight. I can only find solutions for problems with categories, which is not what I need.
The comparable solution for ordinary (unweighted) means would be
df.mean()
Notice the df has NaN both in columns A-F and in "Value".
A B C D E F Value
0 17656 61496 83 80 117 99 2902804
1 75078 61179 14 3 6 14 3761964
2 21316 60648 86 NaN 107 93 127963
3 6422 48468 28855 26838 27319 27011 131354
4 12378 42973 47153 46062 46634 42689 3303909572
5 54292 35896 59 6 3 18 27666367
6 21272 NaN 126 12 3 5 9618047
7 26434 35787 113 17 4 8 309943
8 10508 34314 34197 7100 10 10 NaN
I can use this for a single column.
df1 = df[['A','Value']]
df1 = df1.dropna()
np.average(df1['A'], weights=df1['Value'])
There must be a simple method; it's driving me nuts that I don't see it.
I would appreciate any help.
You could use masked arrays. We could drop rows where the Value column has NaN values.
In [353]: dff = df.dropna(subset=['Value'])
In [354]: dff.apply(lambda x: np.ma.average(
np.ma.MaskedArray(x, mask=np.isnan(x)), weights=dff.Value))
Out[354]:
A 1.282629e+04
B 4.295120e+04
C 4.652817e+04
D 4.545254e+04
E 4.601520e+04
F 4.212276e+04
Value 3.260246e+09
dtype: float64
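Because masked (NaN) cells are excluded together with their corresponding weights, each column's weighted mean is effectively taken only over the rows where that column has a value, which mirrors the per-column dropna approach from the question.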
Forgive any bad wording as I'm rather new to Pandas. I've done a fair amount of Googling but can't quite figure out the keywords I need to get the answer I'm looking for. I have some rather simple data containing counts of a certain flag grouped by IDs and dates, similar to the below:
id date flag count
-------------------------------------
CAZ1 02/03/2012 Y 12
CAZ1 02/03/2012 N 7
CAZ2 03/03/2012 Y 6
CAZ2 03/03/2012 N 2
CRI2 02/03/2012 Y 14
CRI2 02/03/2012 G 5
LMU3 01/12/2013 G 7
LMU4 02/12/2013 G 4
LMU5 01/12/2014 G 3
LMU6 01/12/2014 G 2
LMU7 05/12/2014 G 2
EUR4 01/16/2014 N 3
What I'm looking to do is group the IDs by certain flag combinations, sum their counts, and then get means for these per year. Resulting data should look something like:
2012 2013 2014 Mean Calculations:
--------------------------------------
Y,N | 6.75 NaN NaN (((12+7)/2)+((6+2)/2))/2
--------------------------------------
Y,G | 9.5 NaN NaN (14+5)/2
--------------------------------------
G | NaN 5.5 2.33 (7+4)/2, (3+2+2)/3
--------------------------------------
N | NaN NaN 3 (3)
Not sure if this makes sense. I think I need to perform multiple GroupBys at the same time, with the option to define the different criteria for each of the different groupings.
Happy to clarify further if needed. My initial attempts at coding this have been filled with errors so I don't think there's much benefit in posting progress so far. In fact, I just tried to write something and it seemed more misleading than helpful. Sorry, >_<.
IIUC, you can get what you want by first doing a groupby and then building a pivot_table:
[original version]
df["date"] = pd.to_datetime(df["date"])
grouped = df.groupby(["id","date"], as_index=False)
df_new = grouped.agg({"flag": ",".join, "count": "sum"})
df_new["year"] = df_new["date"].dt.year
df_final = df_new.pivot_table(index="flag", columns="year")
produces
>>> df_final
count
year 2012 2013 2014
flag
G NaN 5.5 2.333333
N NaN NaN 3.000000
Y,G 19.0 NaN NaN
Y,N 13.5 NaN NaN
[updated after the question was edited]
If you want the mean instead of the sum, just write mean instead of sum when doing the aggregation, i.e.
df_new = grouped.agg({"flag": ",".join, "count": "mean"})
which gives
>>> df_final
count
year 2012 2013 2014
flag
G NaN 5.5 2.333333
N NaN NaN 3.000000
Y,G 9.50 NaN NaN
Y,N 6.75 NaN NaN
The only tricky part is passing the dictionary to agg so we can perform two aggregation operations at once (df_new below is shown for the original sum version):
>>> df_new
id date count flag year
0 CAZ1 2012-02-03 19 Y,N 2012
1 CAZ2 2012-03-03 8 Y,N 2012
2 CRI2 2012-02-03 19 Y,G 2012
3 EUR4 2014-01-16 3 N 2014
4 LMU3 2013-01-12 7 G 2013
5 LMU4 2013-02-12 4 G 2013
6 LMU5 2014-01-12 3 G 2014
7 LMU6 2014-01-12 2 G 2014
8 LMU7 2014-05-12 2 G 2014
It's usually easier to work with these flat formats as much as you can and then pivot only at the end.
For example, if your real dataset is more complicated than the one you posted, you might need another groupby -- but that's easy enough using this pattern.
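As a small aside, if you would rather end up with flat 2012/2013/2014 column labels instead of the ('count', year) MultiIndex shown above, one way (the same pivot, just with the values column named explicitly) is:
df_final = df_new.pivot_table(index="flag", columns="year", values="count")
Passing a single values column keeps the output to plain year columns.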