I'm trying to take a dataframe of patient data and create a new df that includes their name and date if they had an encounter with three services on the same date.
First I have a dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['Bob', 'Charlie', 'Bob', 'Sam', 'Bob', 'Sam', 'Chris'],
'date': ['06-02-2023', '01-02-2023', '06-02-2023', '20-12-2022', '06-02-2023','08-06-2015', '26-08-2020'],
'department': ['urology', 'urology', 'oncology', 'primary care', 'radiation', 'primary care', 'oncology']})
I tried a groupby on name and date with an agg function to collect the departments into a list:
df_group = df.groupby(['name', 'date']).agg({'department': pd.Series.unique})
For Bob, this made department contain [urology, oncology, radiation].
Now, when I try to search the department lists to find just the rows that contain a given department, I get an error.
df_group.loc[df_group['department'].str.contains('primary care')]
for instance results in KeyError: '[nan nan nan nan nan] not in index'
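(As far as I can tell, .str.contains returns NaN for the list-like cells the groupby produced, and indexing with an all-NaN boolean mask raises that KeyError. A per-row membership check does work, e.g. with apply; the name mask here is just for illustration:)
mask = df_group['department'].apply(lambda deps: 'primary care' in deps)
df_group.loc[mask]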
I assume there is a much easier way, but ultimately I just want a dataframe of people and the dates on which they had encounters for urology, oncology, and radiation. For the df above, that would result in:
Name Date
Bob 06-02-2023
Easy solution
# define a set of departments to check for
s = {'urology', 'oncology', 'radiation'}
# groupby and aggregate to identify the combination
# of name and date that has all the required departments
out = df.groupby(['name', 'date'], as_index=False)['department'].agg(s.issubset)
Result
# out
name date department
0 Bob 06-02-2023 True
1 Charlie 01-02-2023 False
2 Chris 26-08-2020 False
3 Sam 08-06-2015 False
4 Sam 20-12-2022 False
# out[out['department'] == True]
name date department
0 Bob 06-02-2023 True
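If you only want the name/date pairs, you can use the boolean column as a mask and drop it afterwards (a small follow-up, assuming the out frame above):
# keep only the rows flagged True, and only the name and date columns
out.loc[out['department'], ['name', 'date']]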
Related
A first dataframe has a column containing categories, which are the same as the headers of the second. There are multiple row entries per name in df1 (one row per category per name), while df2 will have one row per name. All rows for a given name occur in sequence in df1.
(The dataframes df1 and df2 and the desired output are reproduced in the code under "Code:" below.)
How can I map data from df1 to df2?
More specifically, how can I map multiple rows from df1 to 1 row and the respective columns of df2 in a more efficient way than looping twice to check for each category under each name?
Any help is appreciated,
Have a great day
Code:
import pandas as pds
df1 = pds.DataFrame({'Client': ['Rick', 'Rick', 'John'],
                     'Category': ['Service1', 'Service2', 'Service1'],
                     'Amount': [250, 6, 79]})
df2 = pds.DataFrame(columns=['Client', 'Due_Date', 'Service1', 'Service2'])
output = pds.DataFrame({'Client': ['Rick', 'John'], 'Due_Date': [None, None],
                        'Service1': [250, 79], 'Service2': [6, 0]})
This is an alternative approach using .pivot() and .assign():
df1_pivot = (df1.pivot(index='Client', columns='Category', values='Amount')
.reset_index().assign(Due_Date=None))
df_out = df2.assign(**df1_pivot)
print(df_out)
Client Due_Date Service1 Service2
0 John None 79.0 NaN
1 Rick None 250.0 6.0
You're looking for pandas.DataFrame.pivot:
out = (df1.pivot(index="Client", columns="Category")
.reset_index()
.set_axis(["Client", "Service1", "Service2"], axis=1)
.assign(Due_Date= None)
)
NB: I suggest using import pandas as pd, as per the import convention.
Output:
print(out)
Client Service1 Service2 Due_Date
0 John 79.0 NaN None
1 Rick 250.0 6.0 None
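If you would rather not hard-code the column names in set_axis, here is a sketch that flattens the pivoted MultiIndex instead (same data as above, just a different way to name the columns):
out = df1.pivot(index="Client", columns="Category").reset_index()
# ('Amount', 'Service1') -> 'Service1'; ('Client', '') -> 'Client'
out.columns = [c[1] or c[0] for c in out.columns]
out = out.assign(Due_Date=None)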
Suppose I have a dataset with three columns: Name, Application and Duration.
I am trying to figure out how to group by Name and Application, where a hop to a different application ends the current grouping and starts a new one; if I return to the original application, it counts as a new grouping.
(The sample data is reproduced in the code of the first answer below.)
My desired output would be:
1. John, Excel, 5 mins
2. John, Spotify, 1 mins
3. John, Excel, 1 mins
4. John, Spotify, 2 mins
5. Emily, Excel, 5 mins
6. John, Excel, 3 mins
I have been attempting to do this in Pandas but I cannot manage to ensure that it aggregates by different application hops, even if it comes back to a previous application.
You can use Pandas .shift() to compare the values of the series with the next row, build up a session value based on the "hops", and then group by that session value.
import pandas as pd
df = pd.DataFrame({
'name' : ['John', 'John', 'John', 'John', 'John', 'Emily', 'Emily', 'John'],
'app' : ['Excel','Excel','Spotify','Excel','Spotify','Excel', 'Excel', 'Excel'],
'duration':[3,2,1,1,2,4,1,3]})
session = ((df.name != df.name.shift()) | (df.app != df.app.shift())).cumsum()
df2 = df.groupby(['name', 'app', session], as_index=False, sort=False)['duration'].sum()
print(df2)
Output:
name app duration
0 John Excel 5
1 John Spotify 1
2 John Excel 1
3 John Spotify 2
4 Emily Excel 5
5 John Excel 3
One solution would be to add a column to define hops, then group by that column:
hop_id = 1
for i in df.index:
    df.loc[i, 'hop_id'] = hop_id
    # a new hop starts when the next row has a different Name or Application
    # (guard the last row, which has no next row; assumes a default RangeIndex)
    if i != df.index[-1] and ((df.loc[i, 'Name'] != df.loc[i + 1, 'Name'])
                              or (df.loc[i, 'Application'] != df.loc[i + 1, 'Application'])):
        hop_id = hop_id + 1
df.groupby('hop_id')['Duration'].sum()
How do I turn the headers inside the rows into columns?
For example, I have the DataFrame produced by the code under EDIT below and would like to reshape it into the layout shown under "looking to get".
EDIT:
Code to produce the current example df:
import pandas as pd
df = pd.DataFrame({'Date': [2020, 2021, 2022],
                   'James': '', ' Sales': [3, 4, 5], ' City': 'NY', ' DIV': 'a',
                   'KIM': '', ' Sales ': [3, 4, 5], ' City ': 'SF', ' DIV ': 'b'}).T.reset_index()
index 0 1 2
0 Date 2020 2021 2022
1 James
2 Sales 3 4 5
3 City NY NY NY
4 DIV a a a
5 KIM
6 Sales 3 4 5
7 City SF SF SF
8 DIV b b b
looking to get
Name City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 3 4 5
I think the best way is to iterate over the first column: if a name (e.g. James) has no indent, it turns into a value for a new column that applies until another unindented value (KIM) comes up. So I need a way to put the unindented headers into a new column, stopping whenever a new unindented header appears.
Edit 2: There are not only two names (KIM or James); there are about 20 names. Nor are there only the three second levels (Sales, City, DIV): different names have more than 3 second levels, and some have 7. The only thing that is consistent is that the names are not indented but the second levels are.
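For what it's worth, that indentation rule is enough to reshape the flat frame directly. Below is a sketch, assuming the leading-space convention from the EDIT above holds (unindented labels are names, indented labels are second levels); the intermediate names (years, body, wide) are mine:
import pandas as pd

# the frame from the EDIT above
df = pd.DataFrame({'Date': [2020, 2021, 2022],
                   'James': '', ' Sales': [3, 4, 5], ' City': 'NY', ' DIV': 'a',
                   'KIM': '', ' Sales ': [3, 4, 5], ' City ': 'SF', ' DIV ': 'b'}).T.reset_index()

# the 'Date' row holds the year headers
years = df.loc[df['index'] == 'Date'].iloc[0, 1:].tolist()
body = df[df['index'] != 'Date'].copy()

# unindented labels are names; forward-fill them onto the indented rows
is_name = ~body['index'].str.startswith(' ')
body['Name'] = body['index'].where(is_name).ffill()
body = body[~is_name]
body['Account'] = body['index'].str.strip()
body = body.drop(columns='index')
body.columns = years + ['Name', 'Account']

# keep the Sales rows; attach City and DIV (constant per name) as columns
wide = body.set_index(['Name', 'Account'])
sales = wide.xs('Sales', level='Account').copy()
sales['City'] = wide.xs('City', level='Account').iloc[:, 0]
sales['DIV'] = wide.xs('DIV', level='Account').iloc[:, 0]
sales['Account'] = 'Sales'
print(sales.reset_index()[['Name', 'City', 'DIV', 'Account'] + years])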
Using a slightly simpler example, this works, but it sure ain't pretty:
df = pd.DataFrame({
'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City',],
'2020': ['', '3', 'NY', '', '4', 'SF'],
'2021': ['', '4', 'NY', '', '5', 'SF'],
})
def rows_to_columns(group):
    for value in group.date.values:
        if value != group.person.values[0] and value != 'Sales':
            temp_column = '_' + value
            group.loc[group['date'] == value, temp_column] = group['2020']
            group[value.lower()] = (
                group[temp_column]
                .fillna(method='ffill')
                .fillna(method='bfill')
            )
            group.drop([temp_column], axis=1, inplace=True)
    return group
df.loc[df['2020']=='', 'person'] = df.date
df.person = df.person.fillna(method='ffill')
new_df = (df
.groupby('person')
.apply(lambda x:rows_to_columns(x))
.drop(['date'], axis=1)
.loc[df.date=='Sales']
)
The basic idea is to:
Copy the name into a separate column and fill that column using .fillna(method='ffill'). This works if the assumption holds that every person's block begins with the person's name. Otherwise it wreaks havoc.
All other values, such as 'div' and 'city', are converted by rows_to_columns(group). The function iterates over all rows in a group that are neither the person's name nor 'Sales', copies the value from the row into a temp column, creates a new column for that row, and uses ffill and bfill to fill it out. It then deletes the temp column and returns the group.
The resulting data frame is the intended format once the column 'Sales' is dropped.
Note: This solution probably does not work well on larger datasets.
You gave more details, and I see you are not working with multi-level indexes. In that case, the best way would be to create the DataFrame already in the format you need. The way you are creating the first DataFrame is not well structured: the names (James/KIM) are columns with empty values, with no link to the other values, and the stacking relies on blank spaces in strings. Take a look at multi-indexing and generate a dataframe you can work with, or create the dataframe in the format you need in the end.
-- Answer considering multi-level indexes --
From the little information provided, I see your DataFrame is stacked, meaning you have multiple index levels. The first level is the person (James/KIM) and the second level is Sales/City/DIV. So your DataFrame should be created like this:
import pandas
multi_index = pandas.MultiIndex.from_tuples([
('James', 'Sales'), ('James', 'City'), ('James', 'DIV'),
('KIM', 'Sales'), ('KIM', 'City'), ('KIM', 'DIV')])
year_2020 = pandas.Series([3, 'NY', 'a', 4, 'SF', 'b'], index=multi_index)
year_2021 = pandas.Series([4, 'NY', 'a', 5, 'SF', 'b'], index=multi_index)
year_2022 = pandas.Series([5, 'NY', 'a', 6, 'SF', 'b'], index=multi_index)
frame = { '2020': year_2020, '2021': year_2021, '2022': year_2022}
df = pandas.DataFrame(frame)
print(df)
2020 2021 2022
James Sales 3 4 5
City NY NY NY
DIV a a a
KIM Sales 4 5 6
City SF SF SF
DIV b b b
Now that you have the multi_level DataFrame, you have many ways to transform it. This is what we will do to make it one level:
sales_df = df.xs('Sales', axis=0, level=1).copy()
div_df = df.xs('DIV', axis=0, level=1).copy()
city_df = df.xs('City', axis=0, level=1).copy()
The results will be:
print(sales_df)
2020 2021 2022
James 3 4 5
KIM 4 5 6
print(div_df)
2020 2021 2022
James a a a
KIM b b b
print(city_df)
2020 2021 2022
James NY NY NY
KIM SF SF SF
You are discarding any information about DIV or City changing across years, so we can reduce the City and DIV dataframes to Series, taking the first year as reference:
div_series = div_df.iloc[:,0]
city_series = city_df.iloc[:,0]
Take the sales DF as reference, and add the City and DIV series:
sales_df['DIV'] = div_series
sales_df['City'] = city_series
sales_df['Account'] = 'Sales'
Now reorder the columns as you wish:
sales_df = sales_df[['City', 'DIV', 'Account', '2020', '2021', '2022']]
print(sales_df)
City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 4 5 6
I have a list of names and want to retrieve each name's corresponding information from different dataframes, to form a new dataframe.
I converted the list into a one-column dataframe, planning to look up its corresponding values in the different dataframes.
I have tried:
import pandas as pd
data = {'Name': ["David","Mike","Lucy"]}
data_h = {'Name': ["David","Mike","Peter", "Lucy"],
'Hobby': ['Music','Sports','Cooking','Reading'],
'Member': ['Yes','Yes','Yes','No']}
data_s = {'Name': ["David","Lancy", "Mike","Lucy"],
'Speed': [56, 42, 35, 66],
'Location': ['East','East','West','West']}
df = pd.DataFrame(data)
df_hobby = pd.DataFrame(data_h)
df_speed = pd.DataFrame(data_s)
df['Hobby'] = df.lookup(df['Name'], df_hobby['Hobby'])
print (df)
But it returns the error message as:
ValueError: Row labels must have same size as column labels
I have also tried:
df = pd.merge(df, df_hobby, on='Name')
It works but it includes unnecessary columns.
What would be a smart and efficient way to do this, especially when there are many to-be-looked-up dataframes?
Thank you.
Filter each dataframe down to the merge key and the columns to append, like:
df = (pd.merge(df, df_hobby[['Name','Hobby']], on='Name')
.merge(df_speed[['Name','Location']], on='Name'))
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
If you want to work with a list of dataframes, use this solution with the same column filtering:
dfList = [df,
df_hobby[['Name','Hobby']],
df_speed[['Name','Location']]]
from functools import reduce
df = reduce(lambda df1,df2: pd.merge(df1,df2,on='Name'), dfList)
print (df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
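Note that pd.merge defaults to an inner join, so a name missing from any one of the frames drops out of the result entirely. To keep every name from the base list, a left join should work (a small variation on the reduce call above):
df = reduce(lambda df1, df2: pd.merge(df1, df2, on='Name', how='left'), dfList)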
I already have an idea as to how I'm going to do this - I'm just curious about whether my method is the most efficient.
So for instance, let's say that for whatever reason, I have a table of employee and client data (a sample is reconstructed in the second answer below).
The first 4 columns in the table are all repeated - they just say info about the employee. The reason these rows repeat is because that employee handles multiple clients.
In some cases, I am missing info on the Age and Employee duration of an employee. Another colleague gave me this information in an excel sheet.
So now, I have info on Brian's and Dennis' age and employment duration, and I need to fill all rows with their employee IDs based on the information. My plan for doing that is this:
data = {"14": # Brian's Employee ID
{"Age":31,
:"Employment Duration":3},
"21": # Dennis' Employee ID
{"Age":45,
"Employment Duratiaon":12}
}
After making the above dictionary of dictionaries with the necessary values, my plan is to iterate over each row in the above dataframe, and fill in the 'Age' and 'Employment Duration' columns based on the value in 'Employee ID':
for index, row in df.iterrows():
    if row["Employee ID"] in data:
        df.loc[index, "Age"] = data[row["Employee ID"]]["Age"]
        df.loc[index, "Employment Duration"] = data[row["Employee ID"]]["Employment Duration"]
That's my plan for populating the missing values!
I'm curious about whether there's a simpler way that's just not presenting itself to me, because this was the first thing that sprang to mind!
Don't iterate over rows in pandas when you can avoid it. Instead, make the most of the pandas library with approaches like these:
Assume we have a dataframe:
data = pd.DataFrame({
'name' : ['john', 'john', 'mary', 'mary'],
'age' : ['', '', 25, 25]
})
Which looks like:
name age
0 john
1 john
2 mary 25
3 mary 25
We can apply a lambda function like so:
# use x['name'], not x.name: .name on a row Series is its index label
data['age'] = data.apply(lambda x: 27 if x['name'] == 'john' else x['age'], axis=1)
Or we can use pandas .loc:
# assign through a single .loc call to avoid chained-indexing warnings
data.loc[data['name'] == 'john', 'age'] = 27
Test them out and compare how long each take to execute vs. iterating over rows.
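Here is a minimal timing sketch, assuming the small data frame above (absolute numbers will vary with data size and hardware):
import timeit
import pandas as pd

data = pd.DataFrame({'name': ['john', 'john', 'mary', 'mary'],
                     'age': ['', '', 25, 25]})

def with_apply():
    data.apply(lambda x: 27 if x['name'] == 'john' else x['age'], axis=1)

def with_loc():
    data.loc[data['name'] == 'john', 'age'] = 27

print('apply:', timeit.timeit(with_apply, number=1000))
print('loc:  ', timeit.timeit(with_loc, number=1000))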
Ensure missing values are represented as null values (np.NaN). The second set of information should be stored in another DataFrame with the same column labels.
Then, by setting the index to 'Employee ID', update will align on the indices and fill in the missing values.
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'Employee ID': ["11", "11", "14", "21"],
'Name': ['Alan', 'Alan', 'Brian', 'Dennis'],
'Age': [14,14, np.NaN, np.NaN],
'Employment Duration': [3,3, np.NaN, np.NaN],
'Clients Handled': ['A', 'B', 'C', 'G']})
data = {"14": {"Age": 31, "Employment Duration": 3},
"21": {"Age": 45, "Employment Duration": 12}}
df2 = pd.DataFrame.from_dict(data, orient='index')
Code
#df = df.replace('', np.NaN) # If not null in your dataset
df = df.set_index('Employee ID')
df.update(df2, overwrite=False)
print(df)
Name Age Employment Duration Clients Handled
Employee ID
11 Alan 14.0 3.0 A
11 Alan 14.0 3.0 B
14 Brian 31.0 3.0 C
21 Dennis 45.0 12.0 G
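As an aside, fillna also accepts a DataFrame and aligns on index and columns, so the same fill can be written without mutating df in place (a sketch reusing the un-updated df and df2 from the Sample Data above):
# fill df's NaN cells from df2, aligned on 'Employee ID' and column labels
df_filled = df.set_index('Employee ID').fillna(df2)
print(df_filled)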