I have not been able to find an answer to the following online. Your valuable help would be much appreciated.
I have a DataFrame like this with 20k rows:
ID Date Country
2008-0001 2008-01-02 India
2008-0001 2008-01-02 France
2008-0002 2008-01-03 USA
I want to take all the duplicates in ID, such as rows 1 and 2, and renumber each extra occurrence to one above the highest number after the dash for that year.
So, for instance, because 2008-0002 already exists (assume 0002 is the highest number after the dash in that column for that year), I want to increment past it, so one of the duplicate 2008-0001 values would become 2008-0003.
I can identify and drop duplicates using the following code
drop_duplicate_df = train_df.drop_duplicates(['ID'])
but this is not what I need.
I believe this will get it done. It assumes the ID column has first been split into an integer year column ID1 and an integer suffix column ID2:
# assumes df has integer columns ID1 (year) and ID2 (suffix after the dash)
isdup = df.duplicated(subset=['ID1', 'ID2'])
dups, uniques = df[isdup], df[~isdup]

ids = ['ID1', 'ID2']
for i, row in dups.iterrows():
    while (row[ids] == uniques[ids]).all(axis=1).any():
        row.loc['ID2'] += 1
    uniques = uniques.append(row)  # use pd.concat([uniques, row.to_frame().T]) in pandas >= 2.0

id1 = uniques.ID1.astype(str)
id2 = uniques.ID2.astype(str).str.zfill(4)
uniques.loc[:, 'ID'] = id1 + '-' + id2
uniques.drop(['ID1', 'ID2'], axis=1, inplace=True)
print(uniques.sort_index())
ID Date Country
0 2008-0001 2008-01-02 India
1 2008-0003 2008-01-02 France
2 2008-0002 2008-01-03 USA
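For larger frames, the same renumbering idea can be sketched without an explicit Python loop. This is a hedged sketch, assuming IDs always follow the YYYY-NNNN pattern and that each duplicate within a year gets the next free number above that year's current maximum:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['2008-0001', '2008-0001', '2008-0002'],
    'Date': ['2008-01-02', '2008-01-02', '2008-01-03'],
    'Country': ['India', 'France', 'USA'],
})

# Split each ID into its year prefix and integer suffix
parts = df['ID'].str.split('-', expand=True)
df['year'] = parts[0]
df['num'] = parts[1].astype(int)

# Highest existing suffix per year, broadcast back to every row
max_num = df.groupby('year')['num'].transform('max')

# Every occurrence of an ID after the first is a duplicate
is_dup = df.groupby('ID').cumcount() > 0

# Give each duplicate, in order within its year, the next free number
dup_offset = df[is_dup].groupby('year').cumcount() + 1
df.loc[is_dup, 'num'] = max_num[is_dup] + dup_offset

df['ID'] = df['year'] + '-' + df['num'].astype(str).str.zfill(4)
df = df.drop(columns=['year', 'num'])
print(df)
```

On the sample this yields 2008-0001, 2008-0003, 2008-0002, matching the desired output.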
The below works with the sample data, and assumes you have data for several years that you want to relabel according to the same logic:
df.Date = pd.to_datetime(df.Date)  # to datetime to extract years
years = df.groupby(df.Date.dt.year)  # analysis per year

new_df = pd.DataFrame()
for year, data in years:
    # use the per-year maximum suffix; note all duplicates within a year get the same new ID
    data.loc[data.duplicated(subset='ID'), 'ID'] = '{0}-{1}'.format(
        year, str(int(data.ID.max().split('-')[1]) + 1).zfill(4))
    new_df = pd.concat([new_df, data])
to get:
ID Date Country
0 2008-0001 2008-01-02 India
1 2008-0003 2008-01-02 France
2 2008-0002 2008-01-03 USA
Basically this is the challenge I have: a data set with a time range and a unique ID, where I need to find whether an ID is duplicated within a date range.
123 transaction 1/1/2021
345 transaction 1/1/2021
123 transaction 1/2/2021
123 transaction 1/20/2021
where I want to return 1 for ID 123, because the duplicate transaction is within a range of 7 days.
I can do this with Excel, and I added some more date ranges depending on the day, for example Wednesday a range of up to 6 days, Thursday 5 days, Friday 4 days. But I have no idea how to accomplish this with pandas...
The reason I want to do this with pandas is that each data set has up to 1M rows, which takes forever in Excel, and on top of that I need to split by category, so it's just a pain to do all that manual work.
Are there any recommendations or ideas on how to accomplish that task?
The df:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(
    """id,trans_date
123,1/1/2021
345,1/1/2021
123,1/2/2021
123,1/20/2021
345,1/3/2021
"""
))  # added extra record for demo
df
id trans_date
0 123 1/1/2021
1 345 1/1/2021
2 123 1/2/2021
3 123 1/20/2021
4 345 1/3/2021
df['trans_date'] = pd.to_datetime(df['trans_date'])
As you have to look at each of the ids separately, you can group by id and then get the maximum and minimum dates; if the difference is greater than 7 days, the group is marked True, otherwise False.
result = df.groupby('id')['trans_date'].apply(
    lambda x: (x.max() - x.min()).days > 7)
result
id
123 True
345 False
Name: trans_date, dtype: bool
If you just need the required ids, then
result.index[result].values
array([123])
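If the requirement is instead to flag ids whose consecutive transactions fall within 7 days of each other (rather than comparing only the overall min and max, which behaves differently once an id has many transactions), a hedged sketch using groupby plus diff:

```python
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(
    """id,trans_date
123,1/1/2021
345,1/1/2021
123,1/2/2021
123,1/20/2021
345,1/3/2021
"""))
df['trans_date'] = pd.to_datetime(df['trans_date'])

# Sort by date, then measure the gap between consecutive transactions per id
gaps = (df.sort_values('trans_date')
          .groupby('id')['trans_date']
          .diff()
          .dt.days)

# NaN for the first transaction of each id compares as False
within_7 = gaps.le(7)
flagged = df.loc[within_7, 'id'].unique()
print(flagged)
```

With this extra record for 345 (1/1 and 1/3 are two days apart), both ids are flagged; under the original four-row data only 123 would be.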
The context and data you've provided about your situation are scanty, but you can probably do something like this:
>>> df
id type date
0 123 transaction 2021-01-01
1 345 transaction 2021-01-01
2 123 transaction 2021-01-02
3 123 transaction 2021-01-20
>>> dupes = df.groupby(pd.Grouper(key='date', freq='W'))['id'].apply(pd.Series.duplicated)
>>> dupes
0 False
1 False
2 True
3 False
Name: id, dtype: bool
There, item 2 (the third item) is True because 123 already occurred in the past week.
As far as I can understand the question, I think this is what you need.
from datetime import datetime
import pandas as pd
df = pd.DataFrame({
    "id": [123, 345, 123, 123],
    "name": ["transaction", "transaction", "transaction", "transaction"],
    "date": ["01/01/2021", "01/01/2021", "01/02/2021", "01/10/2021"]
})

def dates_in_range(dates):
    num_days_frame = 6
    processed_dates = sorted(datetime.strptime(date, "%m/%d/%Y") for date in dates)
    difference_in_range = any(
        (processed_dates[i] - processed_dates[i - 1]).days < num_days_frame
        for i in range(1, len(processed_dates)))
    return 1 if difference_in_range else 0

group = df.groupby("id")
df_new = group.apply(lambda x: dates_in_range(x["date"]))
print(df_new)
"""
print(df_new)
id
123 1
345 0
"""
Here you first group by the id so that you get all the dates for that particular id together.
Then a function is applied to the aggregated dates: first they are sorted, and then consecutive items are checked to see whether any pair lies closer together than the defined range. The sorting makes sure that comparing consecutive differences actually detects dates that are close by.
Finally, if any pair of consecutive sorted dates differs by less than num_days_frame (6) days, we return 1, else 0.
All that being said, this might not be as performant, since each group is sorted separately. One way to avoid that is to sort the entire df first and then apply the group operation, which ensures the dates arrive already sorted.
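That pre-sorting idea could look roughly like this (a sketch with the same sample data, not the original answer's code; the 6-day threshold is kept as a constant):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [123, 345, 123, 123],
    "date": ["01/01/2021", "01/01/2021", "01/02/2021", "01/10/2021"],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

NUM_DAYS_FRAME = 6

# Sort once up front so every group's dates arrive already ordered
df = df.sort_values(["id", "date"])

# diff() gives the gap to the previous date within the group;
# any gap under the threshold flags the id with 1, otherwise 0
result = df.groupby("id")["date"].apply(
    lambda s: int(s.diff().dt.days.lt(NUM_DAYS_FRAME).any())
)
print(result)
```

This keeps the per-group work to a single vectorized diff instead of a Python-level sort and loop per group.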
I have accident data, and part of this data includes the year of the accident, the degree of injury, and the age of the injured person. This is an example of the DataFrame:
df = pd.DataFrame({'Year': ['2010', '2010', '2010', '2010', '2010', '2011', '2011', '2011', '2011'],
                   'Degree_injury': ['no_injury', 'death', 'first_aid', 'minor_injury', 'disability',
                                     'disability', 'disability', 'death', 'first_aid'],
                   'Age': [50, 31, 40, 20, 45, 29, 60, 18, 48]})
print(df)
I want three output variables to be grouped in a table by year when the age is less than 40 and get counts for number of disabilities, number of deaths, and number of minor injuries.
The output should be like this:
I generated the three variables (num_disability, num_death, num_minor_injury) when the age is < 40 as shown below.
disability_filt = (df['Degree_injury'] == 'disability') & \
                  (df['Age'] < 40)
num_disability = df[disability_filt].groupby('Year')['Degree_injury'].count()

death_filt = (df['Degree_injury'] == 'death') & \
             (df['Age'] < 40)
num_death = df[death_filt].groupby('Year')['Degree_injury'].count()

minor_injury_filt = (df['Degree_injury'] == 'minor_injury') & \
                    (df['Age'] < 40)
num_minor_injury = df[minor_injury_filt].groupby('Year')['Degree_injury'].count()
How can I combine these variables into one table as illustrated above?
Thank you in advance.
Use pivot_table after filtering your rows according to your condition:
out = df[df['Age'].lt(40)].pivot_table(index='Year', columns='Degree_injury',
values='Age', aggfunc='count', fill_value=0)
print(out)
# Output:
Degree_injury death disability minor_injury
Year
2010 1 0 1
2011 1 1 0
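pd.crosstab can produce the same table from the filtered rows, if you prefer it over pivot_table; a small equivalent sketch:

```python
import pandas as pd

df = pd.DataFrame({'Year': ['2010', '2010', '2010', '2010', '2010', '2011', '2011', '2011', '2011'],
                   'Degree_injury': ['no_injury', 'death', 'first_aid', 'minor_injury', 'disability',
                                     'disability', 'disability', 'death', 'first_aid'],
                   'Age': [50, 31, 40, 20, 45, 29, 60, 18, 48]})

# Filter first, then cross-tabulate Year against Degree_injury
under_40 = df[df['Age'] < 40]
out = pd.crosstab(under_40['Year'], under_40['Degree_injury'])
print(out)
```

crosstab fills missing combinations with 0 by default, so no fill_value is needed.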
# prep data
df2 = (df.loc[df.Age < 40]
         .groupby("Year").Degree_injury.value_counts()
         .to_frame().reset_index(level=0))
df2 = df2.rename(columns={'Degree_injury': 'Count'})
df2['Degree_injury'] = df2.index
df2
# Year Count Degree_injury
# death 2010 1 death
# minor_injury 2010 1 minor_injury
# death 2011 1 death
# disability 2011 1 disability
# pivot result
df2.pivot(index='Year',columns='Degree_injury')
# death disability minor_injury
# Year
# 2010 1.0 NaN 1.0
# 2011 1.0 1.0 NaN
I have health diagnosis data for last year and I'd like to get the count of diagnoses for each month. Here is my data:
import pandas as pd
cars2 = {'ID': [22, 100, 47, 35, 60],
         'Date': ['2020-04-11', '2021-04-12', '2020-05-13', '2020-05-14', '2020-06-15'],
         'diagnosis': ['bacteria sepsis', 'bacteria sepsis', 'Sepsis', 'Risk sepsis', 'Neonatal sepsis'],
         'outcome': ['alive', 'alive', 'dead', 'alive', 'dead']
         }
df2 = pd.DataFrame(cars2, columns=['ID', 'Date', 'diagnosis', 'outcome'])
print(df2)
How can I get diagnosis counts for each month? For example, how many diagnoses of bacteria sepsis did we have in a given month? The final result should be a table showing the value counts of diagnosis for each month.
If you want to see results per month, you can use pivot_table.
df2.pivot_table(index=['outcome','diagnosis'], columns=pd.to_datetime(df2['Date']).dt.month, aggfunc='size', fill_value=0)
Date 4 5 6
outcome diagnosis
alive Risk sepsis 0 1 0
bacteria sepsis 2 0 0
dead Neonatal sepsis 0 0 1
Sepsis 0 1 0
4, 5 and 6 are the months in your dataset.
Try playing around with the parameters here; you might land on a view that suits your ideal result better.
I modified your dataframe by setting the Date column as index:
import pandas as pd
cars2 = {'ID': [22, 100, 47, 35, 60],
         'Date': ['2020-04-11', '2021-04-12', '2020-05-13', '2020-05-14', '2020-06-15'],
         'diagnosis': ['bacteria sepsis', 'bacteria sepsis', 'Sepsis', 'Risk sepsis', 'Neonatal sepsis'],
         'outcome': ['alive', 'alive', 'dead', 'alive', 'dead']
         }
df2 = pd.DataFrame(cars2, columns=['ID', 'Date', 'diagnosis', 'outcome'])
df2.index = pd.to_datetime(df2['Date'])  # <--- set the Date column as the index (also convert it to datetime)
df2.drop('Date', inplace=True, axis=1)  # <--- drop the Date column
print(df2)
If you group the dataframe by a pd.Grouper and the columns you want to group by (diagnosis and outcome):
df2.groupby([pd.Grouper(freq='M'), 'diagnosis','outcome']).count()
Output:
ID
Date diagnosis outcome
2020-04-30 bacteria sepsis alive 1
2020-05-31 Risk sepsis alive 1
Sepsis dead 1
2020-06-30 Neonatal sepsis dead 1
2021-04-30 bacteria sepsis alive 1
Note: freq='M' in pd.Grouper groups the dataframe by month; see the pandas documentation on offset aliases for the other freq values.
Edit: Assigning the grouped dataframe to new_df and resetting the other indices except Date:
new_df = df2.groupby([pd.Grouper(freq='M'), 'diagnosis','outcome']).count()
new_df.reset_index(level=[1,2],inplace=True)
Iterate over each month and get the table separately inside df_list:
import numpy as np

df_list = []  # <--- this will contain a separate table for each month
for month in np.unique(new_df.index):
    df_list += [pd.DataFrame(new_df.loc[[month]])]
df_list[0] # <-- get the first dataframe in df_list
will return:
diagnosis outcome ID
Date
2020-04-30 bacteria sepsis alive 1
First you need to create a month variable through the to_datetime() function; then you can group by the month and compute value_counts() within each month.
import pandas as pd
df2['month'] = pd.to_datetime(df2['Date']).dt.month
df2.groupby('month').apply(lambda x: x['diagnosis'].value_counts())
month
4 bacteria sepsis 2
5 Risk sepsis 1
Sepsis 1
6 Neonatal sepsis 1
Name: diagnosis, dtype: int64
I think that by "for each month" you mean not the month figure alone, but the year-month combination. As such, let's approach it as follows:
First, create a 'year-month' column from the Date column. Then use .groupby() on this new column and get .value_counts() on the diagnosis column, as follows:
df2['year-month'] = pd.to_datetime(df2['Date']).dt.strftime("%Y-%m")
df2.groupby('year-month')['diagnosis'].value_counts().to_frame(name='Count').reset_index()
Result:
year-month diagnosis Count
0 2020-04 bacteria sepsis 1
1 2020-05 Risk sepsis 1
2 2020-05 Sepsis 1
3 2020-06 Neonatal sepsis 1
4 2021-04 bacteria sepsis 1
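If a wide table (one row per year-month, one column per diagnosis) is preferred, the same counts can be unstacked; a small sketch assuming the same cars2 data:

```python
import pandas as pd

cars2 = {'ID': [22, 100, 47, 35, 60],
         'Date': ['2020-04-11', '2021-04-12', '2020-05-13', '2020-05-14', '2020-06-15'],
         'diagnosis': ['bacteria sepsis', 'bacteria sepsis', 'Sepsis', 'Risk sepsis', 'Neonatal sepsis'],
         'outcome': ['alive', 'alive', 'dead', 'alive', 'dead']}
df2 = pd.DataFrame(cars2)

# Month periods keep the year attached, so 2020-04 and 2021-04 stay distinct
month = pd.to_datetime(df2['Date']).dt.to_period('M')
wide = df2.groupby(month)['diagnosis'].value_counts().unstack(fill_value=0)
print(wide)
```

Each row is a year-month, each column a diagnosis, with zeros for months where a diagnosis did not occur.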
I am trying to analyse a DataFrame which contains the Date as the index, and Name and Message as columns.
df.head() returns:
Name Message
Date
2020-01-01 Tom image omitted
2020-01-01 Michael image omitted
2020-01-02 James image Happy new year you wonderfully awfully people...
2020-01-02 James I was waiting for you image
2020-01-02 James QB whisperer image
This is the pivot table I was trying to call on the initial df, with the aggfunc being the count of occurrences of a word (e.g. image):
df_s = df.pivot_table(values='Message',index='Date',columns='Name',aggfunc=(lambda x: x.value_counts()['image']))
Which ideally would show, as an example:
Name Tom Michael James
Date
2020-01-01 1 1 0
2020-01-02 0 0 3
For instance, I've done another df.pivot_table using
df_m = df.pivot_table(values='Message',index='Date',columns='Name',aggfunc=lambda x: len(x.unique()))
which aggregates based on the number of unique messages in a day, and this returns the table fine.
Thanks in advance
Use Series.str.count to count the matched values in a new column added with DataFrame.assign, then pivot with sum:
df_m = (df.assign(count=df['Message'].str.count('image'))
          .reset_index()
          .pivot_table(index='Date',
                       columns='Name',
                       values='count',
                       aggfunc='sum',
                       fill_value=0))
print (df_m)
Name James Michael Tom
Date
2020-01-01 0 1 1
2020-01-02 3 0 0
This is for the fun of it, as an alternative to the same answer; it is just a play on the different options Pandas provides:
#or df1.groupby(['Date','Name']) if the index has a name
res = (df1.groupby([df1.index,df1.Name])
.Message.agg(','.join)
.str.count('image')
.unstack(fill_value=0)
)
res
Name James Michael Tom
Date
2020-01-01 0 1 1
2020-01-02 3 0 0
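As a further variant, if you only want to count rows that contain the word at least once (rather than every occurrence within a message), str.contains can stand in for str.count. A sketch with the question's sample rows inlined:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02', '2020-01-02'],
    'Name': ['Tom', 'Michael', 'James', 'James', 'James'],
    'Message': ['image omitted', 'image omitted',
                'image Happy new year you wonderfully awfully people',
                'I was waiting for you image', 'QB whisperer image'],
})

# One flag per message, then sum the flags per Date/Name cell
res = (df.assign(has_image=df['Message'].str.contains('image').astype(int))
         .pivot_table(index='Date', columns='Name',
                      values='has_image', aggfunc='sum', fill_value=0))
print(res)
```

On this sample the result matches the str.count version, since each message mentions "image" at most once; the two diverge only when a single message repeats the word.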
I need to find the total monthly cumulative number of orders. I have 2 columns, OrderDate and OrderId. I can't use a list to find the cumulative numbers since the data is so large, and the result should be in year-month format along with the cumulative order total for each month.
orderDate OrderId
2011-11-18 06:41:16 23
2011-11-18 04:41:16 2
2011-12-18 06:41:16 69
2012-03-12 07:32:15 235
2012-03-12 08:32:15 234
2012-03-12 09:32:15 235
2012-05-12 07:32:15 233
desired Result
Date CumulativeOrder
2011-11 2
2011-12 3
2012-03 6
2012-05 7
I imported my Excel file into PyCharm and used pandas to read it.
I have tried to split the datetime column into year and month and then group, but I am not getting the correct result.
df1 = df1[['OrderId','orderDate']]
df1['year'] = pd.DatetimeIndex(df1['orderDate']).year
df1['month'] = pd.DatetimeIndex(df1['orderDate']).month
df1.groupby(['year','month']).sum().groupby('year','month').cumsum()
print (df1)
Convert the column to datetimes, then to month periods with to_period, add a counter column with numpy.arange, and finally remove duplicates, keeping the last row per Date, with DataFrame.drop_duplicates:
import numpy as np
df1['orderDate'] = pd.to_datetime(df1['orderDate'])
df1['Date'] = df1['orderDate'].dt.to_period('m')
#use if not sorted datetimes
#df1 = df1.sort_values('Date')
df1['CumulativeOrder'] = np.arange(1, len(df1) + 1)
print (df1)
orderDate OrderId Date CumulativeOrder
0 2011-11-18 06:41:16 23 2011-11 1
1 2011-11-18 04:41:16 2 2011-11 2
2 2011-12-18 06:41:16 69 2011-12 3
3 2012-03-12 07:32:15 235 2012-03 4
df2 = df1.drop_duplicates('Date', keep='last')[['Date','CumulativeOrder']]
print (df2)
Date CumulativeOrder
1 2011-11 2
2 2011-12 3
3 2012-03 4
Another solution:
df2 = (df1.groupby(df1['orderDate'].dt.to_period('m')).size()
.cumsum()
.rename_axis('Date')
.reset_index(name='CumulativeOrder'))
print (df2)
Date CumulativeOrder
0 2011-11 2
1 2011-12 3
2 2012-03 6
3 2012-05 7
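For completeness, here is that second solution run end-to-end on the question's full sample, self-contained with the data inlined as CSV:

```python
from io import StringIO
import pandas as pd

df1 = pd.read_csv(StringIO("""orderDate,OrderId
2011-11-18 06:41:16,23
2011-11-18 04:41:16,2
2011-12-18 06:41:16,69
2012-03-12 07:32:15,235
2012-03-12 08:32:15,234
2012-03-12 09:32:15,235
2012-05-12 07:32:15,233
"""), parse_dates=['orderDate'])

# Count orders per month period, then take the running total
df2 = (df1.groupby(df1['orderDate'].dt.to_period('M')).size()
          .cumsum()
          .rename_axis('Date')
          .reset_index(name='CumulativeOrder'))
print(df2)
```

This reproduces the desired result table: 2011-11 → 2, 2011-12 → 3, 2012-03 → 6, 2012-05 → 7.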