Python Pandas GroupBy until Value change

Suppose I have a dataset with three columns: Name, Application and Duration.
I am trying to figure out how to group by Name and Application, where a hop to a different application ends the current grouping and starts a new one; if I return to the original application, it counts as a new grouping.
My desired output would be:
1. John, Excel, 5 mins
2. John, Spotify, 1 min
3. John, Excel, 1 min
4. John, Spotify, 2 mins
5. Emily, Excel, 5 mins
6. John, Excel, 3 mins
I have been attempting to do this in Pandas, but I cannot get it to start a new aggregation on each application hop, including when the user comes back to a previous application.

You can use Pandas .shift() to compare each value of a series with the one on the previous row, build up a session id that increments on every "hop", and then group by that session id.
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'John', 'John', 'John', 'John', 'Emily', 'Emily', 'John'],
    'app': ['Excel', 'Excel', 'Spotify', 'Excel', 'Spotify', 'Excel', 'Excel', 'Excel'],
    'duration': [3, 2, 1, 1, 2, 4, 1, 3]})

# A new session starts whenever the name or the app differs from the previous row;
# cumsum() turns those break points into a running session id
session = ((df.name != df.name.shift()) | (df.app != df.app.shift())).cumsum()

# Grouping by the session id as well keeps repeated (name, app) pairs apart
df2 = df.groupby(['name', 'app', session], as_index=False, sort=False)['duration'].sum()
print(df2)
Output:
name app duration
0 John Excel 5
1 John Spotify 1
2 John Excel 1
3 John Spotify 2
4 Emily Excel 5
5 John Excel 3

One solution would be to add a column that labels each hop, then group by that column:
hop_id = 1
for i in range(len(df)):
    df.loc[i, 'hop_id'] = hop_id
    # Compare with the next row only if there is one
    if i + 1 < len(df) and ((df.loc[i, 'Name'] != df.loc[i + 1, 'Name'])
                            or (df.loc[i, 'Application'] != df.loc[i + 1, 'Application'])):
        hop_id += 1
df.groupby('hop_id')['Duration'].sum()
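Since grouping only by hop_id drops the Name and Application columns, one extra step can carry them along (a sketch, assuming the question's column names; each hop_id maps to a single (Name, Application) pair, so the groups do not change):
result = (df.groupby(['hop_id', 'Name', 'Application'], sort=False)['Duration']
            .sum()
            .reset_index()
            .drop(columns='hop_id'))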

Related

How to change values in specific rows/columns to NaN based on condition?

I’ve got some strange values in the date column of my dataset, and I’m trying to change these unexpected values into NaN.
I don’t know in advance what the unexpected values will be, which is why I made df2: I search for months (e.g. Dec, March), drop those rows, and see what’s left. So now I know the weird data is in rows 1 and 3. But how do I change the Birthday column value for rows 1 and 3 to NaN?
My real dataset is much bigger, so it’s awkward to type in the row numbers manually.
# Creating the example df
import pandas as pd

data = {'Age': [20, 21, 19, 18],
        'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Birthday': ["Dec-82", "heidgo", "Mar-84", "ishosdg"]}
df = pd.DataFrame(data)

# Finding out which rows have the weird values
df2 = df[~df["Birthday"].str.contains("Dec|Mar")]
Locate the records that match the condition and set their Birthday column to NaN:
import numpy as np

df.loc[~df["Birthday"].str.contains("Dec|Mar"), 'Birthday'] = np.nan
Age Name Birthday
0 20 Tom Dec-82
1 21 nick NaN
2 19 krish Mar-84
3 18 jack NaN
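If the goal is to keep only values that parse as real month-year dates (an assumption about the intended format), pd.to_datetime with errors='coerce' avoids hand-listing month abbreviations:
import pandas as pd
import numpy as np

data = {'Age': [20, 21, 19, 18],
        'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Birthday': ["Dec-82", "heidgo", "Mar-84", "ishosdg"]}
df = pd.DataFrame(data)

# Anything that does not parse as '%b-%y' (e.g. Dec-82) becomes NaT
parsed = pd.to_datetime(df['Birthday'], format='%b-%y', errors='coerce')
df.loc[parsed.isna(), 'Birthday'] = np.nan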

Search for multiple encounters across rows in pandas

I'm trying to take a dataframe of patient data and create a new df that includes their name and date if they had an encounter with three services on the same date.
First, I have a dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['Bob', 'Charlie', 'Bob', 'Sam', 'Bob', 'Sam', 'Chris'],
                   'date': ['06-02-2023', '01-02-2023', '06-02-2023', '20-12-2022', '06-02-2023', '08-06-2015', '26-08-2020'],
                   'department': ['urology', 'urology', 'oncology', 'primary care', 'radiation', 'primary care', 'oncology']})
I tried a groupby on name and date with an agg function to create a list:
df_group = df.groupby(['name', 'date']).agg({'department': pd.Series.unique})
For Bob, this made department contain [urology, oncology, radiation].
Now when I try to search the lists for the departments in question, to find just the rows that contain them, I get an error.
df_group.loc[df_group['department'].str.contains('primary care')]
for instance results in KeyError: '[nan nan nan nan nan] not in index'
I assume there is a much easier way but ultimately, I want to just get a dataframe of people with the date when they have an encounter for urology, oncology, and radiation. In the above df it would result in:
Name Date
Bob 06-02-2023
Easy solution. (The KeyError happens because .str.contains returns NaN for the non-string array values produced by the aggregation.) Aggregate with a set-based check instead:
# define a set of departments to check for
s = {'urology', 'oncology', 'radiation'}
# groupby and aggregate to identify the combination
# of name and date that has all the required departments
out = df.groupby(['name', 'date'], as_index=False)['department'].agg(s.issubset)
Result
# out
name date department
0 Bob 06-02-2023 True
1 Charlie 01-02-2023 False
2 Chris 26-08-2020 False
3 Sam 08-06-2015 False
4 Sam 20-12-2022 False
# out[out['department'] == True]
name date department
0 Bob 06-02-2023 True
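To get just the name and date frame the question asks for, filter on the boolean column and keep those two columns:
result = out.loc[out['department'], ['name', 'date']]
print(result)
#   name        date
# 0  Bob  06-02-2023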

How to turn header inside rows into columns?

How do I turn the headers inside the rows into columns?
For example, I have the DataFrame below and would like to reshape it into the second layout shown further down.
EDIT:
Code to produce the current df example:
import pandas as pd
df = pd.DataFrame({'Date':[2020,2021,2022], 'James':'', ' Sales': [3,4,5], ' City':'NY', ' DIV':'a', 'KIM':'', ' Sales ': [3,4,5], ' City ':'SF', ' DIV ':'b'}).T.reset_index()
index 0 1 2
0 Date 2020 2021 2022
1 James
2 Sales 3 4 5
3 City NY NY NY
4 DIV a a a
5 KIM
6 Sales 3 4 5
7 City SF SF SF
8 DIV b b b
looking to get
Name City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 3 4 5
I think the best way is to iterate over the first column: a name with no indent (e.g. James) becomes a column value for the rows beneath it, until another unindented value (KIM) shows up. So I need a way to move each unindented header into a new column, stopping whenever the next header comes up.
Edit 2: there are not only two names (KIM or JAMES); there are about 20. Nor are there only the three second levels (Sales, City, DIV): different names have more than 3 second levels, and some have 7. The only consistent thing is that the names are not indented while the second levels are.
Using a slightly simpler example, this works, but it sure ain't pretty:
df = pd.DataFrame({
    'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City'],
    '2020': ['', '3', 'NY', '', '4', 'SF'],
    '2021': ['', '4', 'NY', '', '5', 'SF'],
})

def rows_to_columns(group):
    # Turn every row that is neither the person's name nor 'Sales' into a column
    for value in group.date.values:
        if value != group.person.values[0] and value != 'Sales':
            temp_column = '_' + value
            group.loc[group['date'] == value, temp_column] = group['2020']
            group[value.lower()] = (
                group[temp_column]
                .fillna(method='ffill')
                .fillna(method='bfill')
            )
            group.drop([temp_column], axis=1, inplace=True)
    return group

df.loc[df['2020'] == '', 'person'] = df.date
df.person = df.person.fillna(method='ffill')
new_df = (df
          .groupby('person')
          .apply(lambda x: rows_to_columns(x))
          .drop(['date'], axis=1)
          .loc[df.date == 'Sales']
          )
The basic idea is to:
1. Copy the name into a separate column and fill it downwards with .fillna(method='ffill'). This works if the assumption holds that every person's block begins with the person's name. Otherwise it wreaks havoc.
2. Convert all other values, such as 'City' and 'DIV', with rows_to_columns(group). The function iterates over all rows in a group that are neither the person's name nor 'Sales', copies each row's value into a temp column, creates a new column for that row, and uses ffill and bfill to fill it out. It then deletes the temp column and returns the group.
The resulting data frame is in the intended format once the 'date' column is dropped and only the 'Sales' rows are kept, which the chain above already does.
Note: This solution probably does not work well on larger datasets.
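For this simplified example, a vectorized alternative is also possible (a sketch, assuming the name rows are exactly the rows whose year values are empty):
import pandas as pd

df = pd.DataFrame({
    'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City'],
    '2020': ['', '3', 'NY', '', '4', 'SF'],
    '2021': ['', '4', 'NY', '', '5', 'SF'],
})

# Name rows have empty year columns; fill the name down over each block
is_name = df[['2020', '2021']].eq('').all(axis=1)
df['person'] = df['date'].where(is_name).ffill()
body = df[~is_name]

# One row per person: the Sales values per year plus a City column
sales = body[body['date'] == 'Sales'].set_index('person')[['2020', '2021']]
city = body[body['date'] == 'City'].set_index('person')['2020'].rename('City')
out = sales.join(city).reset_index()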
You gave more details, and I see you are not working with multi-level indexes. In that case, the best way is to create the DataFrame in the format you need from the start. The way you are creating the first DataFrame is not well structured: the information is not indexed by name (James/KIM), since the names are just columns with empty values and no link to the other values, and the stacking relies on blank spaces in strings. Take a look at multi-indexing and generate a DataFrame you can work with, or create the DataFrame directly in the format you need at the end.
-- Answer considering multi-level indexes --
Using the little information provided, I see your DataFrame is stacked, meaning it has multiple index levels. The first level is the person (James/KIM) and the second level is Sales/City/DIV. So your DataFrame should be created like this:
import pandas

multi_index = pandas.MultiIndex.from_tuples([
    ('James', 'Sales'), ('James', 'City'), ('James', 'DIV'),
    ('KIM', 'Sales'), ('KIM', 'City'), ('KIM', 'DIV')])
year_2020 = pandas.Series([3, 'NY', 'a', 4, 'SF', 'b'], index=multi_index)
year_2021 = pandas.Series([4, 'NY', 'a', 5, 'SF', 'b'], index=multi_index)
year_2022 = pandas.Series([5, 'NY', 'a', 6, 'SF', 'b'], index=multi_index)
frame = {'2020': year_2020, '2021': year_2021, '2022': year_2022}
df = pandas.DataFrame(frame)
print(df)
2020 2021 2022
James Sales 3 4 5
City NY NY NY
DIV a a a
KIM Sales 4 5 6
City SF SF SF
DIV b b b
Now that you have the multi-level DataFrame, there are many ways to transform it. This is what we will do to flatten it to one level:
sales_df = df.xs('Sales', axis=0, level=1).copy()
div_df = df.xs('DIV', axis=0, level=1).copy()
city_df = df.xs('City', axis=0, level=1).copy()
The results will be:
print(sales_df)
2020 2021 2022
James 3 4 5
KIM 4 5 6
print(div_df)
2020 2021 2022
James a a a
KIM b b b
print(city_df)
2020 2021 2022
James NY NY NY
KIM SF SF SF
Since any information about DIV or City changing across years is being discarded, we can reduce the City and DIV DataFrames to Series, taking the first year as the reference:
div_series = div_df.iloc[:,0]
city_series = city_df.iloc[:,0]
Take the sales DF as reference, and add the City and DIV series:
sales_df['DIV'] = div_series
sales_df['City'] = city_series
sales_df['Account'] = 'Sales'
Now reorder the columns as you wish:
sales_df = sales_df[['City', 'DIV', 'Account', '2020', '2021', '2022']]
print(sales_df)
City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 4 5 6
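As an aside, the three .xs() slices can be collapsed into a single reshape with .unstack(), which pivots the second index level into columns (a sketch on the df built above):
wide = df.unstack(level=1)
# wide's columns are now (year, attribute) pairs such as ('2020', 'Sales')
sales_only = wide.xs('Sales', axis=1, level=1)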

Update a value based on another dataframe pairing

I have a problem where I need to update a value if people were at the same table.
import pandas as pd

data = {"p1": ['Jen', 'Mark', 'Carrie'],
        "p2": ['John', 'Jason', 'Rob'],
        "value": [10, 20, 40]}
df = pd.DataFrame(data, columns=["p1", 'p2', 'value'])

meeting = {'person': ['Jen', 'Mark', 'Carrie', 'John', 'Jason', 'Rob'],
           'table': [1, 2, 3, 1, 2, 3]}
meeting = pd.DataFrame(meeting, columns=['person', 'table'])
df is a relationship table and value is the field I need to update. If two people were at the same table in the meeting dataframe, the corresponding df row should be updated.
For example: Jen and John were both at table 1, so I need to find the row in df that has Jen and John and set their value to value + 100, i.e. 110.
I thought about maybe doing a self join on meeting to get the format to match that of df but not sure if this is the easiest or fastest approach
IIUC, you could set person as the index of the meeting dataframe and use its table values to replace the names in df. Then, wherever both columns map to the same table, replace the value with df.value + 100:
m = df[['p1','p2']].replace(meeting.set_index('person').table).eval('p1==p2')
df['value'] = df.value.mask(m, df.value+100)
print(df)
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
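A sketch of the same mapping idea, spelled out more explicitly with Series.map:
tables = meeting.set_index('person')['table']
same_table = df['p1'].map(tables) == df['p2'].map(tables)
df.loc[same_table, 'value'] += 100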
This could be an approach, using df.to_records():
# One set of people per table
groups = meeting.groupby('table').agg(set)['person'].to_list()
# Each record is (index, p1, p2, value); bump value when {p1, p2} is one of the table groups
df['value'] = [row[-1] + 100 if set(list(row)[1:3]) in groups else row[-1]
               for row in df.to_records()]
Output:
df
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140

Create multiple dataframes from one

I have a dataframe like this:
name time session1 session2 session3
Alex 135 10 3 5
Lee 136 2 6 4
I want to make multiple dataframes based on each session. For example, dataframe one has name, time, and session1; dataframe two has name, time, and session2; and dataframe three has name, time, and session3. I want to use a for loop (or any better way), but I don't know how to select columns 1, 2, 3 one time and then columns 1, 2, 4, etc. Does anyone have an idea how to do that? The data is saved in a pandas dataframe; I just don't know how to type it here on Stack Overflow. Thank you.
I don't think you need to create a new dictionary for that.
Just slice your data frame directly whenever needed:
df[['name', 'time', 'session1']]
If you think the following design can help you, you can set name and time to be the index (df.set_index(['name', 'time'])) and then simply use
df['session1']
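For instance, a minimal sketch of that index-based design using the question's data:
import pandas as pd

df = pd.DataFrame({'name': ['Alex', 'Lee'],
                   'time': [135, 136],
                   'session1': [10, 2],
                   'session2': [3, 6],
                   'session3': [5, 4]})

indexed = df.set_index(['name', 'time'])
# Double brackets keep the slice a DataFrame rather than a Series
session1_df = indexed[['session1']].reset_index()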
Organize it into a dictionary of dataframes:
dict_of_dfs = {f'df {i}': df[['name', 'time', i]] for i in df.columns[2:]}
Then you can access each dataframe as you would any other dictionary values:
>>> dict_of_dfs['df session1']
name time session1
0 Alex 135 10
1 Lee 136 2
>>> dict_of_dfs['df session2']
name time session2
0 Alex 135 3
1 Lee 136 6
