Preserving id columns in dataframe after applying assign and groupby - python

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?

The ids are still there, they have just moved into the index after the groupby. Chain the whole thing with reset_index to turn them back into regular columns:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   # .drop(columns='gestationalAgeInWeeks')                   # don't need this
   .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc']   # change here
   .max()
   .unstack()
   .add_prefix('abdomCirc_')  # prefix the new columns
   .reset_index()             # and here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .pivot_table(index=['MotherID', 'PregnancyID'], columns='tm',
                values='abdomCirc', aggfunc='max')
   .add_prefix('abdomCirc_')  # remove this if you don't want the prefix
   .reset_index()
)
Output:
tm  MotherID  PregnancyID  abdomCirc_1  abdomCirc_2  abdomCirc_3
0          0            0          NaN        200.0          NaN
1          1            1          NaN        315.0        350.0
2          2            2        180.0          NaN          NaN
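With the ids back as ordinary columns, the merge mentioned in the question is then a plain column merge. A minimal sketch, assuming result holds the output above and other_df is a hypothetical second table keyed by the same ids:
merged = result.merge(other_df, on=['MotherID', 'PregnancyID'], how='left')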

Related

Edit row of DataFrame IF row contains specific string

I have multiple dataframes stored in a dictionary.
Each dataframe has 3 columns as shown below
exceldata_1['Sheet1']
0 1 2
0 Sv2.55+Fv2.04R02[2022-01-01T00 16 29.464Z]
1 - SC OK NaN
2 - PC1 Number 1 NaN
3 - PC1 Main Status OK NaN
4 - PC1 PV 4294954868 NaN
... ... ... ...
1046 - C Temperature 17°C NaN
1047 Sv2.55+Fv2.04R02[2022-01-01T23 16 30.782Z]
1048 - Level SS High NaN
1049 Sv2.55+Fv2.04R02[2022-01-01T23 16 34.235Z]
1050 Sv2.55+Fv2.04R02[2022-01-01T23 16 38.657Z]
1051 rows × 3 columns
I want to do the following: search each row of the dataframe for "Sv2." and change matching rows as follows.
Remove the "Sv2.55+Fv2.04R02[" part and combine the remaining data so that the date and time end up correctly formatted in each column. The desired output is shown below. The last column can be deleted, as it will not contain any data after performing this operation.
0 1 2
0 2022-01-01 00:16:29 NaN
1 - SC OK NaN
2 - PC1 Number 1 NaN
3 - PC1 Main Status OK NaN
4 - PC1 PV 4294954868 NaN
... ... ... ...
1046 - C Temperature 17°C NaN
1047 2022-01-01 23:16:30 NaN
1048 - Level SS High NaN
1049 2022-01-01 23:16:34 NaN
1050 2022-01-01 23:16:38 NaN
1051 rows × 3 columns
How can I achieve this?
Using regular expressions should work:
import re

for i in range(len(df)):
    text = df['0'][i]
    if re.search('Sv', text) is not None:
        item_list = re.split(r'\[|T|\s\s|Z', text[:-1])
        df.iloc[i, 0] = item_list[1]
        df.iloc[i, 1] = item_list[2] + ':' + item_list[3] + ':' + item_list[4]
With df being one of your dataframes, you could try the following:
m = df[0].str.contains("Sv2.")
ser = df.loc[m, 0] + " " + df.loc[m, 1] + " " + df.loc[m, 2]
datetime = pd.to_datetime(
    ser.str.extract(r"Sv2\..*?\[(.*?)\]")[0].str.replace(r"\s+", " ", regex=True),
    format="%Y-%m-%dT%H %M %S.%fZ"
)
df.loc[m, 0] = datetime.dt.strftime("%Y-%m-%d")
df.loc[m, 1] = datetime.dt.strftime("%H:%M:%S")
df.loc[m, 2] = np.nan
First build a mask m that selects the rows that contain "Sv2." in the first column.
Based on that, build a series ser with the relevant strings, joined together with a blank in between.
Use .str.extract to fetch the datetime part via the capture group of a regex: look for the "Sv2." part, go forward until the opening bracket "[", and catch everything until the closing bracket "]".
Convert those strings with pd.to_datetime to datetimes (see here for the format codes).
Extract the required parts with .dt.strftime into the respective columns.
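As a quick standalone check of the format string, using one timestamp from the sample data:
pd.to_datetime("2022-01-01T00 16 29.464Z", format="%Y-%m-%dT%H %M %S.%fZ")
# Timestamp('2022-01-01 00:16:29.464000')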
Alternative approach without real datetimes:
m = df[0].str.contains("Sv2.")
ser = df.loc[m, 0] + " " + df.loc[m, 1] + " " + df.loc[m, 2]
datetime = ser.str.extract(
    r"Sv2\..*?\[(\d{4}-\d{2}-\d{2}).*?(\d{2}\s+\d{2}\s+\d{2})\."
)
datetime[1] = datetime[1].str.replace(r"\s+", ":", regex=True)
df.loc[m, [0, 1]] = datetime
df.loc[m, 2] = np.nan
Result for the following sample df (taken from your example)
0 1 2
0 Sv2.55+Fv2.04R02[2022-01-01T00 16 29.464Z]
1 - SC Ok NaN
2 - PC Number 1 NaN
3 - PC MS Ok NaN
4 - PC PValue 8 NaN
5 - Level SS High NaN
6 Sv2.55+Fv2.04R02[2022-01-01T23 16 34.235Z]
7 Sv2.55+Fv2.04R02[2022-01-01T23 16 38.657Z]
is
0 1 2
0 2022-01-01 00:16:29 NaN
1 - SC Ok NaN
2 - PC Number 1 NaN
3 - PC MS Ok NaN
4 - PC PValue 8 NaN
5 - Level SS High NaN
6 2022-01-01 23:16:34 NaN
7 2022-01-01 23:16:38 NaN
Thanks for the idea on how to proceed, @Irsyaduddin. With some modifications to his answer, I was able to achieve it.
Make sure all the data types in your dataframe are strings.
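A minimal way to do that, assuming df1 is one of the dataframes from the dictionary (note that this also turns NaN into the string 'nan', which is what shows up in the result below):
df1 = df1.astype(str)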
import re

for i in range(len(df1)):
    text = df1[0][i] + df1[1][i] + df1[2][i]  # combining data from all cols
    if re.search('Sv', text) is not None:
        item_list = re.split(r'\[|T|Z', text)
        df1.iloc[i, 0] = item_list[1]
        df1.iloc[i, 1] = item_list[2][:2] + ":" + item_list[2][2:4] + ":" + item_list[2][4:6]
        df1.iloc[i, 2] = 'NaN'
df1
Result:
0 1 2
0 2022-01-01 00:16:29 NaN
1 - Server Connection OK nan
2 - PC1 Number 1 nan
3 - PC1 MS OK nan
4 - PC1 PV 4294954868 nan
... ... ... ...
1046 - C Temperature 17°C nan
1047 2022-01-01 23:16:30 NaN
1048 - Level Sensor Status High nan
1049 2022-01-01 23:16:34 NaN
1050 2022-01-01 23:16:38 NaN
1051 rows × 3 columns
Result of Split:
item_list
['Sv2.55+Fv2.04R02', '2022-01-01', '001629.464', '] ']

Find First occurrence of a user and assign values to it

Here's what my data look like:
user_id  prior_elapse_time  timestamp
115      NaN                0
115      10                 1000
115      5                  2000
222212   NaN                0
222212   8                  500
222212   12                 3000
222212   NaN                5000
222212   15                 8000
I found similar posts that teach me how to get the first occurrence of a user:
train_df.groupby('user_id')['prior_elapse_time'].first()
This nicely gets me the first appearance of each user. However, I'm now at a loss as to how to correctly assign 0 to the NaN only at the first occurrence of each user. Due to a logging error, NaN also appears elsewhere, but I only want to assign 0 to the NaN in each user's first row (the rows with timestamp 0 above).
I also tried
train_df['prior_elapse_time'][(train_df['prior_elapse_time'].isna()) & (train_df['timestamp'] == 0)] = 0
But then I get the "copy" vs. "view" assignment problem (which I don't fully understand).
Any help?
If your df is sorted by user_id:
>>> df.loc[df.user_id.diff().ne(0), 'prior_elapse_time'] = 0
>>> df
user_id prior_elapse_time timestamp
0 115 0.0 0
1 115 10.0 1000
2 115 5.0 2000
3 222212 0.0 0
4 222212 8.0 500
5 222212 12.0 3000
6 222212 NaN 5000
7 222212 15.0 8000
Alternatively, use pandas.Series.mask
>>> df['prior_elapse_time'] = df.prior_elapse_time.mask(df.user_id.diff().ne(0), 0)
If not sorted, then get the indices via groupby:
>>> idx = df.reset_index().groupby('user_id')['index'].first()
>>> df.loc[idx, 'prior_elapse_time'] = 0
If you want to set 0 only in those places where the value was previously NaN, add a pandas.Series.isnull mask to the condition.
>>> df.loc[
(df.user_id.diff().ne(0) & df.prior_elapse_time.isnull()),
'prior_elapse_time'
] = 0
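As a side note on the "copy" vs. "view" problem from the original attempt: chained indexing like df['col'][mask] = 0 may write to a temporary copy instead of the original frame. The same condition expressed as a single .loc assignment avoids that (a sketch reusing the question's own logic):
train_df.loc[train_df['prior_elapse_time'].isna() & train_df['timestamp'].eq(0), 'prior_elapse_time'] = 0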

Pandas Dataframes - Search an integer from one data frame in a string column in another dataframe

I have two data frames:
DF1
cid dt tm id distance
2 ed032f716995 2021-01-22 16:42:48 43 21.420561
3 16e2fd96f9ca 2021-01-23 23:19:43 539 198.359355
102 cf092e68fa82 2021-01-22 09:03:14 8 39.599627
104 833ccf05433b 2021-01-24 02:53:08 11 33.168314
DF2
id cluster
0 3
1 6 7,8,43
2 20 1817
3 25
4 10 11,13,14,15,9,539
I want to look up each id from df1 in the cluster column of df2. The desired output is:
cid dt tm id distance cluster
2 ed032f716995 2021-01-22 16:42:48 43 21.420561 7,8,43
3 16e2fd96f9ca 2021-01-23 23:19:43 539 198.359355 11,13,14,15,9,539
102 cf092e68fa82 2021-01-22 09:03:14 8 39.599627 7,8,43
104 833ccf05433b 2021-01-24 02:53:08 11 33.168314 11,13,14,15,9,539
In the above df1 - line 1, since 43 is present in df2, I am including the entire cluster details for df1 - line 1.
I tried the following:
for index, rows in df1.iterrows():
for idx,rws in df2.iterrows():
if (str(rows['id']) in str(rws['cluster'])):
print([rows['id'],rws['cluster']])
This seems to work. However, since df2['cluster'] is a string, even a partial match returns a result. For example, if df1['id'] = 34 and df2['cluster'] contains 344,432, etc., it still matches on 344 and returns a positive result.
I tried another option from SO here:
d = {k: set(v.split(',')) for k, v in df2.set_index('id')['cluster'].items()}
df1['idc'] = [next(iter([k for k, v in d.items() if set(x).issubset(v)]), '') for x in str(df1['id'])]
However, in the above I am getting an error indicating the length of variable is different between the two datasets.
How do I get the cluster mapped based on exact match of the id column in df1?
One way is to split the cluster column, explode it and map:
to_map = (df2.assign(cluster_i=df2.cluster.str.split(','))
             .explode('cluster_i')
             .dropna()
             .set_index('cluster_i')['cluster']
          )
df1['cluster'] = df1['id'].astype(str).map(to_map)
Output:
cid dt tm id distance cluster
2 ed032f716995 2021-01-22 16:42:48 43 21.420561 7,8,43
3 16e2fd96f9ca 2021-01-23 23:19:43 539 198.359355 11,13,14,15,9,539
102 cf092e68fa82 2021-01-22 09:03:14 8 39.599627 7,8,43
104 833ccf05433b 2021-01-24 02:53:08 11 33.168314 11,13,14,15,9,539
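Ids that do not appear in any cluster are mapped to NaN by map; if you prefer an empty string there instead (an assumption, not something stated in the question), chain a fillna:
df1['cluster'] = df1['id'].astype(str).map(to_map).fillna('')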

check if each user has consecutive dates in a python 3 pandas dataframe

Imagine there is a dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
here is the create dataframe command:
import pandas as pd
import numpy as np
users = pd.DataFrame(
    [
        {'id': 1, 'date': '01/01/2019', 'transaction_total': -1, 'balance_total': 102},
        {'id': 1, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 1, 'date': '01/03/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 1, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 1, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': np.nan},
        {'id': 2, 'date': '01/01/2019', 'transaction_total': -2, 'balance_total': 200},
        {'id': 2, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 2, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 2, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': 96}
    ]
)
How could I check if each id has consecutive dates or not? I use the
"shift" idea here but it doesn't seem to work:
Calculating time difference between two rows
df['index_col'] = df.index
for id in df['id'].unique():
    # create an empty QA dataframe
    column_names = ["Delta"]
    df_qa = pd.DataFrame(columns=column_names)
    df_qa['Delta'] = (df['index_col'] - df['index_col'].shift(1))
    if (df_qa['Delta'].iloc[1:] != 1).any() is True:
        print('id ' + id + ' might have non-consecutive dates')
        # doesn't print any account => each customer's daily balance has consecutive dates
        break
Ideal output:
it should print id 2 might have non-consecutive dates
Thank you!
Use groupby and diff:
df["date"] = pd.to_datetime(df["date"],format="%m/%d/%Y")
df["difference"] = df.groupby("id")["date"].diff()
print (df.loc[df["difference"]>pd.Timedelta(1, unit="d")])
#
id date transaction_total balance_total difference
7 2 2019-01-04 NaN 100.0 2 days
Use DataFrameGroupBy.diff with Series.dt.days, compare for values greater than 1 with Series.gt, and filter only the id column with DataFrame.loc:
users['date'] = pd.to_datetime(users['date'])
i = users.loc[users.groupby('id')['date'].diff().dt.days.gt(1), 'id'].tolist()
print (i)
[2]
for val in i:
    print(f'id {val} might have non-consecutive dates')
id 2 might have non-consecutive dates
The first step is to parse the date column:
users['date'] = pd.to_datetime(users.date)
Then add a shifted column on the id and date columns:
users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)
The difference between date and date_shifted columns is of interest:
>>> users.date - users.date_shifted
0 NaT
1 1 days
2 1 days
3 1 days
4 1 days
5 -4 days
6 1 days
7 2 days
8 1 days
dtype: timedelta64[ns]
You can now query the DataFrame for what you want:
users[(users.id_shifted == users.id) & (users.date - users.date_shifted != np.timedelta64(1, 'D'))]
That is, consecutive lines of the same user with a date difference != 1 day.
This solution does assume the data is sorted by (id, date).
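If the frame is not already sorted that way, a minimal sketch to sort it first:
users = users.sort_values(['id', 'date']).reset_index(drop=True)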

How to update a Pandas Panel without duplicates

Currently I'm working on live-timing software for a motorsport application. For this I have to crawl a live-timing webpage and copy the data into a big DataFrame, which is the source of several diagrams I want to make. To keep the DataFrame up to date, I have to crawl the webpage very often.
I can download the data and save it as a pandas DataFrame. My problem is the step from the downloaded DataFrame to the big DataFrame that includes all the data.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6],
                    'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': ['1:30,000', '1:45,000', '1:50,000', '1:25,333', '1:13,366', '1:17,000'],
                    'Laps': ['1', '1', '1', '1', '1', '1']})
df2 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6],
                    'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                    'Laps': ['2', '2', '2', '2', '2', '2']})
df3 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6],
                    'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': ['1:31,000', '1:41,000', '1:51,000', '1:21,333', '1:11,366', '1:11,000'],
                    'Laps': ['2', '2', '2', '2', '2', '2']})

df1.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df2.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df3.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df1 shows a DataFrame from previous laps.
df2 shows a DataFrame during the second lap. The lap is not completed, so I have a NaN.
df3 shows a DataFrame after the second lap is completed.
My target is to have just one row for each lap per car per class.
Either I have the problem that I get duplicates from incomplete laps, or all data gets overwritten.
I hope that someone can help me with this problem.
Thanks so far.
MrCrunsh
If I understand your problem correctly, your issue is that you have overlapping data for the second lap: information while the lap is still in progress and information after it's over. If you want to put all the information for a given lap in one row, I'd suggest using multi-index columns or changing the column names to reflect the difference between measurements during and after laps.
df = pd.concat([df1, df3])
df = pd.concat([df, df2], axis=1, keys=['after', 'during'])
The result will look like this:
after during
Pos Zeit Pos Zeit
CLS Nr. Laps
V4 24 1 5 1:13,366 NaN NaN
2 5 1:11,366 5.0 NaN
55 1 4 1:25,333 NaN NaN
2 4 1:21,333 4.0 NaN
985 1 6 1:17,000 NaN NaN
2 6 1:11,000 6.0 NaN
V5 13 1 1 1:30,000 NaN NaN
2 1 1:31,000 1.0 NaN
30 1 3 1:50,000 NaN NaN
2 3 1:51,000 3.0 NaN
700 1 2 1:45,000 NaN NaN
2 2 1:41,000 2.0 NaN
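If you would rather keep a single set of columns, where values from a completed lap simply replace the in-progress NaNs without creating duplicate rows, a possible sketch (an alternative, not part of the answer above) is to let each newer snapshot fill the big frame via combine_first:
big = pd.concat([df1, df2])   # previous laps plus the lap still in progress
big = df3.combine_first(big)  # non-NaN values from the newest snapshot take precedence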
