I have a df with frame, m_label, and details columns, and all of them can contain duplicates; the same frame may hold different labels with different details. What you need to know is that m_label+details follow a fixed pattern with several options: for example, Findings may be PL or DV, and a "Findings PL start" always has a matching "Findings PL end". The exception is BBPS, which may start with details 3 and end with 2, or with the same number. In the end I need to know when each label starts (for example, Action IR starts in frame 31) and when it ends (Action IR ends in frame 101).
This is my input:
frame m_label details
0 BBPS 3
0 BBPS start
0 Findings DV
0 Findings start
0 Findings DV
0 Findings end
31 Actions IR
31 Actions start
99 BBPS 2
99 Findings PL
99 Findings start
99 BBPS end
99 Findings PL
99 Findings end
101 Action IR
101 Action end
So I want to convert this df to something like this:
frame m_label details
0 Findings.DV start
0 Findings.DV end
0 BBPS.3 start
31 Actions.IR start
99 Findings.PL start
99 Findings.PL end
99 BBPS.2 end
101 Action.IR end
So I need to concatenate only the rows without start/end, and then groupby(?) or transform(?) by frame.
I tried this code, but then I got stuck:
import numpy as np

def concat_func(x):
    # keep m_label.details for the value rows; the start/end marker rows get NaN
    if x['details'] not in ['start', 'end']:
        result = x['m_label'] + '.' + x['details']
    else:
        result = np.nan
    return result

data_cv["concat"] = data_cv[["m_label", "details"]].apply(concat_func, axis=1)
First I find it useful to move the start/end info to a new column, which is done by merging together the rows that have start/end on one side and the ones that don’t on the other:
>>> detail_type = df['details'].isin({'start', 'end'})
>>> df = pd.merge(df[~detail_type], df[detail_type].rename(columns={'details': 'detail_type'}))
>>> df
frame m_label details detail_type
0 0 BBPS 3 start
1 0 Findings DV start
2 0 Findings DV end
3 0 Findings DV start
4 0 Findings DV end
5 31 Actions IR start
6 99 BBPS 2 end
7 99 Findings PL start
8 99 Findings PL end
9 99 Findings PL start
10 99 Findings PL end
11 101 Action IR end
Now we can replace the 2 columns by their concatenated text:
>>> df = df.drop(columns=['m_label', 'details']).join(df['m_label'].str.cat(df['details'], sep='.'))
>>> df.drop_duplicates()
frame detail_type m_label
0 0 start BBPS.3
1 0 start Findings.DV
2 0 end Findings.DV
5 31 start Actions.IR
6 99 end BBPS.2
7 99 start Findings.PL
8 99 end Findings.PL
11 101 end Action.IR
You could even pivot to have a start and an end column:
>>> df.drop_duplicates().pivot(columns='detail_type', index='m_label', values='frame')
detail_type end start
m_label
Action.IR 101.0 NaN
Actions.IR NaN 31.0
BBPS.2 99.0 NaN
BBPS.3 NaN 0.0
Findings.DV 0.0 0.0
Findings.PL 99.0 99.0
But for that to be efficient you’ll first need to define rules that uniquely name your labels, e.g. BBPS regardless of details 2 and 3, Action / Actions always spelled the same way, etc.
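For instance, a minimal sketch of such cleanup rules, applied before the merge (the exact mapping is yours to define):

# hypothetical normalization: one spelling for Action/Actions
df['m_label'] = df['m_label'].replace({'Action': 'Actions'})
# similarly, the BBPS details (2, 3, ...) could be mapped to a single value
# so that BBPS.2 and BBPS.3 become one label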
I don't think groupby would help, as the order inside the group also matters.
Try this (since you didn't post the df in a copiable way, I can't test it myself):
df = df.assign(new_label=None).sort_values(['frame', 'm_label'])
df.loc[~df['details'].isin(['start', 'end']), 'new_label'] = df['m_label'] + '.' + df['details']
# rows whose next row (same frame and m_label) carries the start/end marker
same_group = (df['frame'] == df['frame'].shift(-1)) & (df['m_label'] == df['m_label'].shift(-1))
next_is_marker = df['details'].shift(-1).isin(['start', 'end'])
df.loc[same_group & next_is_marker, 'details'] = df['details'].shift(-1)
df = df.loc[pd.notna(df['new_label']) & df['details'].isin(['start', 'end']), ['frame', 'new_label', 'details']]
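For reference, here is the sample data from the question in copy-pasteable form, so the snippet above can be tested:

import pandas as pd

df = pd.DataFrame({
    'frame': [0, 0, 0, 0, 0, 0, 31, 31, 99, 99, 99, 99, 99, 99, 101, 101],
    'm_label': ['BBPS', 'BBPS', 'Findings', 'Findings', 'Findings', 'Findings',
                'Actions', 'Actions', 'BBPS', 'Findings', 'Findings', 'BBPS',
                'Findings', 'Findings', 'Action', 'Action'],
    'details': ['3', 'start', 'DV', 'start', 'DV', 'end', 'IR', 'start',
                '2', 'PL', 'start', 'end', 'PL', 'end', 'IR', 'end'],
})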
I have some questions about combining the first 2 columns in pandas/Python with N/A values.
Long story: I need to read an Excel file and make changes to it. I cannot change anything in Excel itself, so every change has to be done in Python.
Here is the Excel input
and the expected output will be
I managed to read it in, but when I try to combine the first two columns, I have some problems: since the cells in the first column are merged in Excel, once the sheet is read in only one row has a value and the rest of the rows are all N/A,
such as below:
Year    number  2016
Month           Jan
Month           2016-01
Grade   1       100
NaN     2       99
NaN     3       98
NaN     4       96
NaN     5       92
NaN     Total   485
Is there any function that can easily help me to combine the first two columns and make it as below:
Year 2016
Month Jan
Month 2016-01
Grade 1 100
Grade 2 99
Grade 3 98
Grade 4 96
Grade 5 92
Grade Total 485
Any help will be really appreciated.
I searched and googled keywords for a long time but did not find any answer that fits my situation here.
import pandas as pd
from io import StringIO

d = '''
Year,number,2016
Month,,Jan
Month,,2016-01
Grade,1, 100
NaN,2, 99
NaN,3, 98
NaN,4, 96
NaN,5, 92
NaN,Total,485
'''
df = pd.read_csv(StringIO(d))
df
df['Year'] = df.Year.ffill()  # forward-fill the merged-cell value down the N/A rows
df = df.fillna('') # skip this step if your data from excel does not have nan in col 2.
df['Year'] = df.Year + ' ' + df.number.astype('str')
df = df.drop('number',axis=1)
df
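If you are reading from the actual Excel file rather than a CSV string, the same steps apply after read_excel (a sketch; 'input.xlsx' is a placeholder filename):

# hypothetical file name; the sheet has the merged first-column cells
df = pd.read_excel('input.xlsx')
df['Year'] = df['Year'].ffill()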
I have a data frame which looks like this:
student_id  session_id  reading_level_id  st_week  end_week
1           3334        3                 3        3
1           3335        2                 4        4
2           3335        2                 2        2
2           3336        2                 2        3
2           3337        2                 3        3
2           3339        2                 3        4
...
There are multiple session_ids, st_weeks and end_weeks for every student_id. I'm trying to group the data by 'student_id' and I want to calculate the difference between the maximum (end_week) and the minimum (st_week) for each student.
Aiming for an output that would look something like this:
Student_id  Diff
1           1
2           2
...
I am relatively new to Python as well as Stack Overflow and have been trying to find an appropriate solution - any help is appreciated.
Using the data you shared, a simpler solution is possible:
Group by student_id, passing False to the as_index parameter (this works on a dataframe and returns a dataframe);
Next, use a named aggregation to get the max week for end_week and the min week for st_week for each group;
Get the difference between max_wk and min_wk;
Finally, keep only the required columns.
(
df.groupby("student_id", as_index=False)
.agg(max_wk=("end_week", "max"), min_wk=("st_week", "min"))
.assign(Diff=lambda x: x["max_wk"] - x["min_wk"])
.loc[:, ["student_id", "Diff"]]
)
student_id Diff
0 1 1
1 2 2
There's probably a more efficient way to do this, but I broke this into separate steps for the grouping to get max and min values for each id, and then created a new column representing the difference. I used numpy's randint() function in this example because I didn't have access to a sample dataframe.
import pandas as pd
import numpy as np
# generate dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(1200, 4)), columns=['student_id', 'session_id', 'st_week', 'end_week'])
# use groupby to get max and min for each student_id
max_vals = df.groupby(['student_id'], sort=False)['end_week'].max().to_frame()
min_vals = df.groupby(['student_id'], sort=False)['st_week'].min().to_frame()
# use join to put max and min back together in one dataframe
merged = min_vals.join(max_vals)
# use assign() to calculate difference as new column
merged = merged.assign(difference=lambda x: x.end_week - x.st_week).reset_index()
merged
student_id st_week end_week difference
0 40 2 99 97
1 23 5 74 69
2 78 9 93 84
3 11 1 97 96
4 97 24 88 64
... ... ... ... ...
95 54 0 96 96
96 18 0 99 99
97 8 18 97 79
98 75 21 97 76
99 33 14 93 79
You can create a custom function and apply it to a group-by over students:
def week_diff(g):
    # span from the student's earliest start week to the latest end week
    return g.end_week.max() - g.st_week.min()

df.groupby("student_id").apply(week_diff)
Result:
student_id
1 1
2 2
dtype: int64
I have a table that looks like this:
temp = [['K98R', 'AB',34,'2010-07-27', '2013-08-17', '2008-03-01', '2011-05-02', 44],['S33T','ES',55, '2009-07-23', '2012-03-12', '2010-09-17', '', 76]]
Data = pd.DataFrame(temp,columns=['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
What you see in the table above is that there are entry and exit dates, along with dates for events 1 and 2; the date of event 2 is missing for the second patient because the event didn't happen. Also note that event 1 for the first patient happened before the entry date.
What I am trying to achieve is threefold:
1. Split the time between the entry and exit dates into years
2. Convert the wide format to a long one with one row per year
3. Check whether events 1 and 2 occurred during the time period covered by each row
To explain further, here is the output I am trying to get:
ID Initial Age Entry Exit Event1 Event2 Weight
K98R AB 34 27/07/2010 31/12/2010 1 0 44
K98R AB 35 1/01/2011 31/12/2011 1 1 44
K98R AB 36 1/01/2012 31/12/2012 1 1 44
K98R AB 37 1/01/2013 17/08/2013 1 1 44
S33T ES 55 23/07/2009 31/12/2009 0 0 76
S33T ES 56 1/01/2010 31/12/2010 1 0 76
S33T ES 57 1/01/2011 31/12/2011 1 0 76
S33T ES 58 1/01/2012 12/03/2012 1 0 76
What you notice here is that the entry-to-exit period is split into individual rows per patient, each representing a year. The event columns are now coded as 0 (the event has not yet happened) or 1 (the event happened), and a 1 is carried over to the following years because the event has already happened.
The age increases in every row per patient as time progresses.
The patient ID and initials remain the same, as does the weight.
Could anyone please help with this? Thank you.
Begin by getting the number of years between Entry and Exit:
# Convert to datetime
df.Entry = pd.to_datetime(df.Entry)
df.Exit = pd.to_datetime(df.Exit)
df.Event1 = pd.to_datetime(df.Event1)
df.Event2 = pd.to_datetime(df.Event2)
# Count the calendar years each stay touches, so partial years at both ends get their own row
df['Years_Between'] = df.Exit.dt.year - df.Entry.dt.year + 1
# printing the df will provide the following:
ID Initials Age Entry Exit Event1 Event2 Weight Years_Between
0 K98R AB 34 2010-07-27 2013-08-17 2008-03-01 2011-05-02 44 4
1 S33T ES 55 2009-07-23 2012-03-12 2010-09-17 NaT 76 4
Loop through your data and create a new row for each year:
new_data = []
for idx, row in df.iterrows():
    start_year = row['Entry'].year
    for y in range(row['Years_Between']):
        # build the calendar-year window for this row
        new_entry = pd.to_datetime(start_year + y, format='%Y')
        new_exit = new_entry + pd.DateOffset(years=1) - pd.DateOffset(days=1)
        record = {'Entry': new_entry, 'Exit': new_exit}
        # clip the window to the actual Entry/Exit dates
        if row['Entry'] > new_entry:
            record['Entry'] = row['Entry']
        if row['Exit'] < new_exit:
            record['Exit'] = row['Exit']
        for col in ['ID', 'Initials', 'Age', 'Event1', 'Event2', 'Weight']:
            record[col] = row[col]
        new_data.append(record)
Create a new DataFrame, then compare dates:
df_new = pd.DataFrame(new_data, columns = ['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
df_new['Event1'] = (df_new.Event1 <= df_new.Exit).astype(int)
df_new['Event2'] = (df_new.Event2 <= df_new.Exit).astype(int)
# printing df_new will provide:
ID Initials Age Entry Exit Event1 Event2 Weight
0 K98R AB 34 2010-07-27 2010-12-31 1 0 44
1 K98R AB 34 2011-01-01 2011-12-31 1 1 44
2 K98R AB 34 2012-01-01 2012-12-31 1 1 44
3 K98R AB 34 2013-01-01 2013-08-17 1 1 44
4 S33T ES 55 2009-07-23 2009-12-31 0 0 76
5 S33T ES 55 2010-01-01 2010-12-31 1 0 76
6 S33T ES 55 2011-01-01 2011-12-31 1 0 76
7 S33T ES 55 2012-01-01 2012-03-12 1 0 76
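The desired output also increases Age by one for each extra year per patient; a minimal follow-up sketch (assuming Age is the age at Entry) could be:

# add the number of years elapsed within each patient's block of rows
df_new['Age'] = df_new['Age'] + df_new.groupby('ID').cumcount()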
I have a DataFrame of several trips that looks kind of like this:
TripID Lat Lon time delta_t
0 1 53.55 9.99 74 1
1 1 53.58 9.99 75 1
2 1 53.60 9.98 76 5
3 1 53.60 9.98 81 1
4 1 53.58 9.99 82 1
5 1 53.59 9.97 83 NaN
6 2 52.01 10.04 64 1
7 2 52.34 10.05 65 1
8 2 52.33 10.07 66 NaN
As you can see, I have records of location and time, which all belong to some trip, identified by a trip ID. I have also computed delta_t as the time that passes until the entry that follows in the trip. The last entry of each trip is assigned NaN as its delta_t.
Now I need to make sure that the time step of my records is the same value across all my data. I've gone with one time unit for this example. For the most part the trips do fulfill this condition, but every now and then I have a single record, such as record no. 2, within an otherwise fine trip, that doesn't.
That's why I want to simply split my trip into two trips at this point. That got me stuck though; I can't seem to find a good way of doing this.
To consider each trip by itself, I was thinking of something like this:
for key, grp in df.groupby('TripID'):
# split trip at too long delta_t(s)
However, the actual splitting within the loop is what I don't know how to do. Basically, I need to assign a new trip ID to every entry from one large delta_t to the next (or the end of the trip), or have some sort of grouping operation that can group between those large delta_t.
I know this is quite a specific problem. I hope someone has an idea how to do this.
I think the new NaNs, which would then be needed, can be neglected at first and easily added later with this line (which I know only works for ascending trip IDs):
df.loc[df['TripID'].diff().shift(-1) > 0, 'delta_t'] = np.nan
IIUC, there is no need for a loop. The following creates a new column called new_TripID based on 2 conditions: that the original TripID changes from one row to the next, or that the difference in your time column is greater than one:
df['new_TripID'] = ((df['TripID'] != df['TripID'].shift()) | (df.time.diff() > 1)).cumsum()
>>> df
TripID Lat Lon time delta_t new_TripID
0 1 53.55 9.99 74 1.0 1
1 1 53.58 9.99 75 1.0 1
2 1 53.60 9.98 76 5.0 1
3 1 53.60 9.98 81 1.0 2
4 1 53.58 9.99 82 1.0 2
5 1 53.59 9.97 83 NaN 2
6 2 52.01 10.04 64 1.0 3
7 2 52.34 10.05 65 1.0 3
8 2 52.33 10.07 66 NaN 3
Note that from your description and your data, it looks like you could really use groupby, and you should probably look into it for other manipulations. However, in the particular case you're asking about, it's unnecessary.
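For example, to recompute delta_t per new trip (a sketch building on the new_TripID column above; the last row of each new trip gets NaN automatically):

# time until the next record, computed within each new trip
df['delta_t'] = df.groupby('new_TripID')['time'].shift(-1) - df['time']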
Below is the CSV file that I have:
Record Time Value 1 Value 2 Value 3
Event 1 20 35 40
Event 2 48 43 56
Event 3 45 58 90
FFC 4 12 89 94
FFC 5 30 25 60
Event 6 99 45 13
I would like to use pandas in order to parse through the 'Record' column until I find the first FFC and then print that entire row. Additionally, I would like to print the row that is two above the first found FFC. Any suggestions on how to approach this?
My reasoning for wanting to use Pandas is that I am going to need to call upon specific values within the two printed rows and plot them.
To start I have:
csvfile = pd.read_csv('Test.csv')
print(csvfile)
Thank you very much for your assistance, it is greatly appreciated!
This is one way.
import pandas as pd
from io import StringIO
mystr = StringIO("""Record Time Value1 Value2 Value3
Event 1 20 35 40
Event 2 48 43 56
Event 3 45 58 90
FFC 4 12 89 94
FFC 5 30 25 60
Event 6 99 45 13""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr, delim_whitespace=True)
# get index of condition
idx = df[df['Record'] == 'FFC'].index[0]
# filter for appropriate indices
res1 = df.loc[idx]
res2 = df.loc[idx-2]
To output a dataframe:
print(res1.to_frame().T)
# Record Time Value1 Value2 Value3
# 3 FFC 4 12 89 94
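If you want both rows in a single dataframe instead, a small variation works (note that idx - 2 assumes the first FFC is not within the first two rows):

# select both rows at once by their index labels
print(df.loc[[idx - 2, idx]])
#   Record  Time  Value1  Value2  Value3
# 1  Event     2      48      43      56
# 3    FFC     4      12      89      94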