Please, I need help; I recently started learning Python. How do I merge rows with the same “PatientID” and the same “Resource” into one row, with “StartDate” and “EndDate” as the average of the merged rows?
Given df is the name for the pandas.DataFrame containing your data.
To get the earliest StartDate and EndDate of each patient's resource, you can write:
# Group by the 'PatientID' and 'Resource' columns
grouped_df = df.groupby(['PatientID', 'Resource'])
# Take the earliest 'StartDate' and 'EndDate' in each group
grouped_df = grouped_df[['StartDate', 'EndDate']].min()
# Move 'PatientID' and 'Resource' out of the index back into columns
grouped_df.reset_index(inplace=True)
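The question actually asks for the average rather than the earliest date. A minimal sketch of that variant, assuming StartDate and EndDate are already parsed as datetime columns and you are on a recent pandas version where .mean() supports datetime columns:
# Average 'StartDate' and 'EndDate' per (PatientID, Resource) group
avg_df = (df.groupby(['PatientID', 'Resource'])[['StartDate', 'EndDate']]
            .mean()
            .reset_index())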
I am working with an Excel sheet in pandas and analysing some data from it.
Inside the Excel sheet I have 8 columns: one is Timestamp, another is City, and so on (Domain, State, etc.).
I want to analyse the City and Timestamp columns only.
I selected the City and Timestamp columns from the Excel sheet into a DataFrame. I found the city count (how many rows contain the same city) using cities_df['Count'] = df['City_Town_Village '].value_counts()
After finding the city count, I computed the percentage of each city using cities_df['PctCnt'] = (cities_df['Count']/sum(cities_df['Count'])*100).apply("{0:.2f}".format)
Now my question: as I compute the city count, the rows inside my dataframe decrease. My df has 238 rows, but after the count it drops to 128. No issue at all until now; they decrease just because of the count.
I also have the Timestamp column inside my df. Say, for the city Delhi, some people registered on 28-May-2021 and some on 29-May-2021, and so on. But after finding the city count, my df only shows the timestamp of the earliest date, i.e. 28-May.
I don't know why this is happening. I actually want to segregate the data into two-week periods and plot the graph week-wise, and also the city percentages.
Here is my Excel file
This is the code I'm using:
import pandas as pd
df = pd.read_excel('PCS_NWR_Sheet.xlsx')
df.head()
pd.set_option('display.max_rows', 300)
cities_df = pd.DataFrame()
cities_df['Count'] = df['City_Town_Village '].value_counts()
cities_df.index.names=['City']
cities_df.reset_index(inplace = True)
cities_df['Timestamp'] = df['Timestamp']
column_names = ['Timestamp', 'City', 'Count']
cities_df = cities_df.reindex(columns=column_names)
cities_df['PctCnt'] =(cities_df['Count']/sum(cities_df['Count'])*100).apply("{0:.2f}".format)
Metro_list = ['Hyderabad', 'Kolkata', 'Delhi', 'Pune', 'Bengaluru', 'Noida', 'Kanpur', 'Gurgaon']
top_metro=cities_df[cities_df['City'].isin(Metro_list)]
top_metro
.value_counts() will return a Series whose length equals the number of unique elements in what you are counting, so you are getting fewer rows because it is grouping them.
I can think of two ways to solve the question (if I understand it right).
Do .value_counts() on both the date column and the city column.
df[['City_Town_Village','Date']].value_counts()
If you don't currently have your timestamps as dates, you'll need to make a date column first (you probably won't be able to group on full datetimes, since the times will vary); a sketch of that step follows below. This will give a series where the row count equals the number of existing combinations of the two columns.
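A possible sketch of that step, assuming the raw timestamps live in a column called Timestamp as in the question:
import pandas as pd
# Strip the time-of-day so rows can be grouped by calendar date
df['Date'] = pd.to_datetime(df['Timestamp']).dt.date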
Make a separate dataframe with the value_counts of the town column, and merge the two. That is, if you want a column in your main dataframe that holds the number of times each town appears anywhere in the data, that count table is a different size (as we said), so you store it separately and bring it back in as needed.
df2 = pd.DataFrame(df['City_Town_Village'].value_counts())
df2.reset_index(inplace=True) # by default, building the df from value_counts() makes the city/town the index; this restores a normal index
df2.columns = ['City_Town_Village','Count'] #rename the columns
df = df.merge(df2,how='left',on='City_Town_Village')
This adds the Count column to df, holding how many times that City/Town appears in the original dataset.
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames are similar (not equal) only in this ID column, while the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR['ID'] that are equal to the dataframe DT['ID']? Knowing that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe, but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you just want a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
I have a list of company names, dates, and pe ratios.
I need to find the average of the previous 10 years of data as of a given date, such that only month-end dates are considered.
For example, if I need to find the average as of 31 Dec 2015, I first need to find the data for all previous month ends from 31/12/2005 to 31/12/2015, and then their average.
Sample data I have:
Required output:
Here is what I have done so far:
df = pd.read_csv('daily_valuation_ratios_cc.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
columns = ['pe', 'price_bv', 'mcap_ns', 'ev_ebidta']
df_mean = df.groupby('Company Name')[columns].resample('M').mean()
But this method takes the mean of the daily values and shows one result per month, unlike my sample output.
I am new to pandas, please help.
Edit:
df3 = df.groupby(['Company Name','year','month'])
df3.first()
This code works; now I just have one problem: exporting the dataframe with to_csv. Please help.
A dataframe has a method called groupby that groups rows by the values in a column, and the groups can then be aggregated.
So if you were to run data.groupby('pe'), you would get a GroupBy object keyed on that column.
Now if you were to tack on .describe(), you would get the standard deviation/mean/min/etc.
Example:
data.groupby('pe').describe()
Edit: You can also use built-in aggregate functions such as .max()/.mean()/etc. with groupby().
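Applied to the question's data, a rough sketch (assuming df is indexed by date as in the question's own code, and that "month-end" means the last available row of each month):
# Last available 'pe' value in each month, per company
month_end = df.groupby('Company Name')['pe'].resample('M').last()
# Average of the previous 10 years (up to 120 month-end values) per company
avg_10y = (month_end.groupby(level='Company Name')
                    .rolling(window=120, min_periods=1)
                    .mean())
The result can then be written out with avg_10y.to_csv('month_end_averages.csv') if needed (the filename here is only a placeholder).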
I have a DataFrame with a date_time column. The date_time column contains a date and time. I also managed to convert the column to a datetime object.
I want to create a new DataFrame containing all the rows of a specific DAY.
I managed to do it when I set the date column as the index and used the "loc" method.
Is there a way to do it even if the date column is not set as the index? I only found a method which returns the rows between two days.
You can use the groupby() function. Let's say your dataframe is df:
df_group = df.groupby('Date') # assuming the column containing dates is called Date.
Now you can access rows of any date by passing the date in the get_group function,
df_group.get_group('date_here')
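A sketch along the same lines, assuming the column is called date_time and already holds datetime values; grouping on the calendar date puts rows whose times differ but fall on the same day into one group:
import pandas as pd
# Group on the calendar date so rows with different times on the same day match
by_day = df.groupby(df['date_time'].dt.date)
# All rows of one specific day (the date here is only an example)
one_day = by_day.get_group(pd.Timestamp('2021-05-28').date())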
My df1 is something like the first table in the image below, with the key column being Name. I want to add new rows from another dataframe, df2, which has only Name, Year, and Value columns. The new rows should get added based on Name; the other columns would just repeat the same value per Name. The result should be similar to the second table in the image below. How can I do this in pandas?
Create a sub-table df3 from df1 consisting of Group, Name, and Other, and keep only distinct records. Then left join df2 and df3 to get the desired result.
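A minimal sketch of that, assuming df1 has the columns Group, Name, Other, Year, and Value, and df2 has Name, Year, and Value (the column names are taken from the description and may need adjusting):
import pandas as pd
# Distinct per-Name attributes from df1
df3 = df1[['Group', 'Name', 'Other']].drop_duplicates()
# Left join: each df2 row picks up its Group/Other values by Name
new_rows = df2.merge(df3, on='Name', how='left')
# Append the enriched rows to df1 if a single combined table is wanted
result = pd.concat([df1, new_rows], ignore_index=True)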