How to create month and year columns using regex and pandas - Python
Hello Stack Overflow community,
I have the following DataFrame:
code sum of August
AA 1000
BB 4000
CC 72262
So there are two columns, ['code', 'sum of August'].
I have to convert this DataFrame so that it has the columns ['month', 'year', 'code', 'sum of August']:
month year code sum of August
8 2020 AA 1000
8 2020 BB 4000
8 2020 CC 72262
The ['sum of August'] column is sometimes named just ['August'] or ['august']. It can also be ['sum of November'], ['November'], or ['november'].
I thought of using regex to extract the month name and convert it to the month number.
Can anyone please help me with this?
Thanks in advance!
You can do the following:
month = {1: 'january',
         2: 'february',
         3: 'march',
         4: 'april',
         5: 'may',
         6: 'june',
         7: 'july',
         8: 'august',
         9: 'september',
         10: 'october',
         11: 'november',
         12: 'december'}
Let's say your DataFrame is called df. Then you can create the month column automatically using the following:
# Find the month whose name appears anywhere in the lower-cased column names
df['month'] = [i for i, j in month.items() if j in " ".join(df.columns).lower()][0]
code sum of August month
0 AA 1000 8
1 BB 4000 8
2 CC 72262 8
In other words: if a month's name appears anywhere in the column names, in any casing, the comprehension returns that month's number.
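Since the question specifically mentions regex, here is a hedged sketch of the same idea using re instead of a plain membership test (it assumes the month dict above, and that the data is for the current year, since the column header carries no year information):

import re
import pandas as pd

# Reverse lookup: month name -> month number
month_numbers = {name: num for num, name in month.items()}

# Match any month name in the column headers, case-insensitively
pattern = re.compile('|'.join(month.values()), flags=re.IGNORECASE)
match = pattern.search(" ".join(df.columns))
if match:
    df['month'] = month_numbers[match.group(0).lower()]
    # assumption: no year in the header, so use the current year
    df['year'] = pd.Timestamp.now().year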
It looks like you're trying to convert month names to their numbers, and the columns can be uppercase or lowercase.
This might work:
months = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
          'august', 'september', 'october', 'november', 'december']
monthNum = []  # if you're collecting results in a list, just to make this run
sumOfMonths = ['sum of august', 'sum of NovemBer']  # just to show functionality

for sumOfMonth in sumOfMonths:
    for idx, month in enumerate(months):
        if month in sumOfMonth.lower():  # the column name contains one of the month keywords
            monthNum.append(str(idx + 1))  # assuming a list: store the index + 1
I hope this helps! Of course, this isn't exactly what you'd use as-is: fill in your own variables, and swap out append() if you're not collecting results in a list.
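For what it's worth, a sketch of how the loop above might be wired to the question's actual DataFrame (the hard-coded 2020 is an assumption taken from the expected output in the question):

for idx, month in enumerate(months):
    if month in " ".join(df.columns).lower():
        df['month'] = idx + 1
        break
df['year'] = 2020  # assumed from the question's expected output
# reorder to the requested ['month', 'year', ...] layout
df = df[['month', 'year'] + [c for c in df.columns if c not in ('month', 'year')]]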
Related
How to groupby a dataframe by month while keeping other string columns?
A sample of my dataframe is as follows:

Date_Closed Owner Case_Closed_Count
2022-07-19 JH 1
2022-07-18 JH 2
2022-07-17 JH 5
2022-07-19 DT 3
2022-07-15 DT 1
2022-07-01 DT 1
2022-06-30 JW 30
2022-06-28 JH 2

My goal is to get a sum of case count per owner per month, which looks like:

Month Owner Case_Closed_Count
2022-07 JH 8
2022-07 DT 5
2022-06 JW 30
2022-06 JH 2

Here is the code I have so far:

df = pd.to_datetime(df['Date_Closed'])
month = df.Date_Closed.dt.to_period("M")
G = df.groupby(month).agg({'Case_Closed_Count': 'sum'})

With the code above I manage to get the case closed count grouped by month, but how do I keep the owner column?
Here is one way to do it:

df['Date_Closed'] = pd.to_datetime(df['Date_Closed'])
df.groupby([df['Date_Closed'].dt.strftime('%Y-%m'), 'Owner']).sum().reset_index()

  Date_Closed Owner  Case_Closed_Count
0     2022-06    JH                  2
1     2022-06    JW                 30
2     2022-07    DT                  5
3     2022-07    JH                  8
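If you prefer a real period over a formatted string (periods sort chronologically and can be converted back to timestamps), a sketch of the same groupby using dt.to_period:

df['Date_Closed'] = pd.to_datetime(df['Date_Closed'])
out = (df.groupby([df['Date_Closed'].dt.to_period('M'), 'Owner'])['Case_Closed_Count']
         .sum()
         .reset_index()
         .rename(columns={'Date_Closed': 'Month'}))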
How to change the data of a particular column and multiply them depending upon the specific values in another column using pandas?
I want to select the year '2019/0' (a string) from the column 'year of entry' and multiply only those rows' 'Grades' (which is in another column) by 2:

year of entry Grades
2019/0 14
2010/0 21
2019/0 15

This is what I have tried so far:

df.loc[df("Year of Entry"),'2018/9'] = df("Grades")*2

It's been giving me an error, and I'm not sure if this is the right method.
You can use:

df.loc[df['year of entry'].eq('2019/0'), 'Grades'] *= 2

NB: the modification is in place. Modified df:

  year of entry  Grades
0        2019/0      28
1        2010/0      21
2        2019/0      30
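If you'd rather not modify df in place, a sketch of the same logic with mask(), which returns a new frame (df2 is just an illustrative name):

# mask() replaces values where the condition holds, leaving the rest untouched
df2 = df.assign(Grades=df['Grades'].mask(df['year of entry'].eq('2019/0'),
                                         df['Grades'] * 2))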
Dataframe of the Top X Values of the Top Y Days - Pandas Groupby
I have data about three variables where I want to find the largest X values of one variable on a per-day basis. Previously I wrote some code to find the hour where the max value of the day occurred, but now I want to add some options to find more max hours per day. I've been able to find the Top X values per day for all the days, but I've gotten stuck on narrowing it down to the Top X values from the Top X days. I've included pictures (Data, Identified Top 2 Hours) detailing what the end result would hopefully look like.

Code:

df = pd.DataFrame(
    {'ID': ['ID_1'] * 24,
     'Year': [2018] * 24,
     'Month': [6] * 24,
     'Day': [12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14,
             15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17],
     'Hour': [19, 20, 21, 22, 11, 12, 13, 19, 19, 20, 21, 22,
              18, 19, 20, 21, 19, 20, 21, 23, 19, 20, 21, 22],
     'var_1': [0.83, 0.97, 0.69, 0.73, 0.66, 0.68, 0.78, 0.82, 1.05, 1.05, 1.08, 0.88,
               0.96, 0.81, 0.71, 0.88, 1.08, 1.02, 0.88, 0.79, 0.91, 0.91, 0.80, 0.96],
     'var_2': [47.90, 42.85, 67.37, 57.18, 66.13, 59.96, 52.63, 54.75, 32.54, 36.58, 36.99, 37.23,
               46.94, 52.80, 68.79, 50.84, 37.79, 43.54, 48.04, 38.01, 42.22, 47.13, 50.96, 44.19],
     'var_3': [99.02, 98.10, 98.99, 99.12, 98.78, 98.90, 99.09, 99.20, 99.22, 99.11, 99.18, 99.24,
               99.00, 98.90, 98.87, 99.07, 99.06, 98.86, 98.92, 99.32, 98.93, 98.97, 98.99, 99.21]})

# Get the top 2 var_2 values each day
top_two_var2_each_day = df.groupby(['ID', 'Year', 'Month', 'Day'])['var_2'].nlargest(2)
top_two_var2_each_day = top_two_var2_each_day.reset_index()
# set level_4 index to the current index
top_two_var2_each_day = top_two_var2_each_day.set_index('level_4')
# use the index from top_two_var2_each_day to get the rows of df with the other
# variables' values when the top 2 values occurred
top_2_all_vars = df[df.index.isin(top_two_var2_each_day.index)]

End Goal Result: I figure the best way would be to average the two hours to identify which days have the largest average, then go back into the top_2_all_vars dataframe and grab the rows where those days occur. I am unsure how to proceed.

mean_day = top_2_all_vars.groupby(['ID', 'Year', 'Month', 'Day'], as_index=False)['var_2'].mean()
top_2_day = mean_day.nlargest(2, 'var_2')

Final Dataframe: this is the result I am trying to find, a dataframe consisting of the Top 2 values of var_2 from each of the Top 2 days.

Code I previously used to find the single largest value of each day, but I don't know how to make it work for more than a single max per day:

# For each ID and Day, find the hour where the max value of var_2 occurred and save the index location
df_idx = df.groupby(['ID', 'Year', 'Month', 'Day'])['var_2'].transform(max) == df['var_2']
# Now that the hour has been found, store the rows in a new dataframe based on the saved index location
top_var2_hour_of_each_day = df[df_idx]

Using groupbys may not be the best way to go about it, but I am open to anything.
This is one approach. If your data spans multiple months, it's a lot harder to deal with when the month and day are in different columns, so first I made a new column called 'Date' which just combines the month and the day:

df['Date'] = df['Month'].astype('str') + "-" + df['Day'].astype('str')

Next we need the top two values of var_2 per day, and then their average. We can create a really simple function to find exactly that:

def topTwoMean(series):
    top = series.sort_values(ascending=False).iloc[0]
    second = series.sort_values(ascending=False).iloc[1]
    return (top + second) / 2

We then use our function, sort by the average of var_2 to get the highest 2 days, and save those dates to a list:

maxDates = df.groupby('Date').agg({'var_2': [topTwoMean]})\
    .sort_values(by=('var_2', 'topTwoMean'), ascending=False)\
    .reset_index()['Date']\
    .head(2)\
    .to_list()

Finally we filter by the dates chosen above, then find the highest two var_2 values on those days:

df[df['Date'].isin(maxDates)]\
    .groupby('Date')\
    .apply(lambda x: x.sort_values('var_2', ascending=False).head(2))\
    .reset_index(drop=True)

     ID  Year  Month  Day  Hour  var_1  var_2  var_3  Date
0  ID_1  2018      6   12    21   0.69  67.37  98.99  6-12
1  ID_1  2018      6   12    22   0.73  57.18  99.12  6-12
2  ID_1  2018      6   13    11   0.66  66.13  98.78  6-13
3  ID_1  2018      6   13    12   0.68  59.96  98.90  6-13
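A more compact variant of the same steps, assuming the 'Date' column created above (nlargest replaces the manual sort-and-head; result is just an illustrative name):

# Average of the top two var_2 values per day
day_means = df.groupby('Date')['var_2'].apply(lambda s: s.nlargest(2).mean())
# The two days with the highest such average
top_days = day_means.nlargest(2).index
# The top two var_2 rows from each of those days
result = (df[df['Date'].isin(top_days)]
          .groupby('Date', group_keys=False)
          .apply(lambda g: g.nlargest(2, 'var_2')))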
How to count the number of dropoffs per month for dataframe column
I have a dataframe that has records from 2011 to 2018. One of the columns has the drop_off_date, which is the date when the customer left the rewards program. I want to count, for each month between 2011 and 2018, how many people dropped off during that month. So for the 84-month period, I want the count of people who dropped off in each month, using the drop_off_date column. I changed the column to datetime, and I know I can use the .agg and .count methods, but I am not sure how to count per month. I honestly do not know what the next step would be.

Example of the data:

Record ID | store ID | drop_off_date
a1274c212 | 12876 | 2011-01-27
a1534c543 | 12877 | 2011-02-23
a1232c952 | 12877 | 2018-12-02

The result should look like this:

Month: | # of dropoffs:
Jan 2011 | 15
........
Dec 2018 | 6
What I suggest is to work directly with the strings in the column drop_off_date and strip them to keep only the year and month (this assumes the dates are still strings like '2011-01-27'):

df['drop_off_ym'] = df.drop_off_date.apply(lambda x: x[:-3])

Then you apply a groupby on the newly created column and then a count():

df_counts_by_month = df.groupby('drop_off_ym')['StoreId'].count()
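If the column has already been converted to datetime (as the question mentions), slicing will fail; a hedged equivalent in that case would be dt.strftime instead of string slicing:

df['drop_off_ym'] = df['drop_off_date'].dt.strftime('%Y-%m')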
Using your data, I'm assuming your date has been cast to a datetime value with errors='coerce' to handle outliers. You should then drop any NAs from it so you're only dealing with customers who actually dropped off. You can do this in a multitude of ways; I would do a simple:

df = df.dropna(subset=['drop_off_date'])
print(df)

   Record ID  store ID drop_off_date
0  a1274c212     12876    2011-01-27
1  a1534c543     12877    2011-02-23
2  a1232c952     12877    2018-12-02

Let's create a month column to use as an aggregate:

df['Month'] = df['drop_off_date'].dt.strftime('%b')

Then we can do a simple groupby on Record ID as a count (assuming you only want to count unique IDs?):

df1 = df.groupby(df['Month'])['Record ID'].count().reset_index()
print(df1)

  Month  Record ID
0   Dec          1
1   Feb          1
2   Jan          1

EDIT: To account for years, first let's create a year helper column:

df['Year'] = df['drop_off_date'].dt.year
df1 = df.groupby(['Month', 'Year'])['Record ID'].count().reset_index()
print(df1)

  Month  Year  Record ID
0   Dec  2018          1
1   Feb  2011          1
2   Jan  2011          1
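One caveat with '%b' strings is that they sort alphabetically (Dec before Feb, as in the output above). A sketch using dt.to_period('M') instead keeps the buckets in chronological order and folds the year in automatically:

df['Month'] = df['drop_off_date'].dt.to_period('M')
counts = df.groupby('Month')['Record ID'].count()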
Pandas extend index date using group by
I have a series of transactions similar to this table:

ID Customer Date Amount
1 A 6/12/2018 33,223.00
2 A 9/20/2018 635.00
3 B 8/3/2018 8,643.00
4 B 8/30/2018 1,231.00
5 C 5/29/2018 7,522.00

However, I need to get the mean amount of the last six months (as of today). I was using:

df.groupby('Customer').resample('W')['Amount'].sum()

and got something like this:

CustomerCode PayDate
A 2018-05-21 268
  2018-05-28 0.00
  2018-06-11 0.00
  2018-06-18 472,657
  2018-06-25 0.00

However, with this solution I only get the range of dates where the customers had an amount. I need to extend the weeks for each customer so I can get the whole six-month range (in weeks). In this example, for customer A I would need everything from the week of '2018-04-05' (which is exactly six months ago from today) up to the week of today, filled with 0, of course, since there was no amount.
Here is the solution I found to my question.

First I create the dates I want (the last six months, at weekly frequency):

dates = pd.date_range(datetime.date.today() - datetime.timedelta(6 * 365 / 12),
                      pd.Timestamp.today(), freq='W')

Then I create a multi-index from the product of the customers and the dates:

multi_index = pd.MultiIndex.from_product([pd.Index(df['Customer'].unique()), dates],
                                         names=('Customer', 'Date'))

Finally I reindex the resampled series with the newly created multi-index and fill the missing values with zeroes (both operations return new objects, so the result has to be assigned back):

df = df.reindex(multi_index)
df = df.fillna(0)
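The question originally asked for the mean amount of the last six months; with the zero-filled weekly series in hand, a sketch of that final step might be:

# mean of the weekly sums per customer over the reindexed six-month window
six_month_mean = df.groupby(level='Customer').mean()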
Resample is super flexible. To get a 6-month sum instead of the weekly sum you currently have, all you need is:

df.groupby('Customer').resample('6M')['Amount'].sum()

That groups by month end; month start would be '6MS'. More documentation on the available frequencies can be found here: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
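As a quick check of the anchor difference, a small sketch (s is an assumed daily dummy series):

s = pd.Series(1, index=pd.date_range('2018-01-01', '2018-12-31', freq='D'))
print(s.resample('6M').sum())   # bins labelled at month ends
print(s.resample('6MS').sum())  # bins labelled at month starts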