How to remove certain string from column in pandas dataframe - python

I want to remove a certain keywords or string in a column from pandas dataframe.
The dataframe df looks like this:
YEAR WEEK
2019 WK-01
2019 WK-02
2019 WK-03
2019 WK-14
2019 WK-25
2020 WK-06
2020 WK-07
I would like to remove WK-and 0 from the WEEK column so that my output will looks like this:
YEAR WEEK
2019 1
2019 2
2019 3
2019 14
2019 25
2020 6
2020 7

You can try:
df['WEEK'] = df['WEEK'].str.extract('(\d*)$').astype(int)
Output:
YEAR WEEK
0 2019 1
1 2019 2
2 2019 3
3 2019 14
4 2019 25
5 2020 6
6 2020 7

Shave off the first three characters and convert to int.
df['WEEK'] = df['WEEK'].str[3:].astype(int)

Related

use pandas to tet previous year sales in the same row

I have a table from different companies' sales.
company_name sales year
A 200 2019
A 100 2018
A 30 2017
B 15 2019
B 30 2018
B 45 2017
Now, I want to add a previous year's sales in the same row just like
company_name sales year previous_sales
A 200 2019 100
A 100 2018 30
A 30 2017 Nan
B 15 2019 30
B 30 2018 45
B 45 2017 Nan
I tried to use the code like this, but I failed to get the right result
df["previous_sales"] = df.groupby(['company_name', 'year'])['sales'].shift()

How to calculate cumulative sum in python using pandas of all the columns except the first one that contain names?

Here's the data in csv format:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
Jack 1 15 25 3 5 11 5 8 3
Jill 5 10 32 5 5 14 6 8 7
I don't want Name column to be include as it gives an error.
I tried
df.cumsum()
Try with set_index and reset_index to keep the name column:
df.set_index('Name').cumsum().reset_index()
Output:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Jack 1 15 25 3 5 11 5 8 3
1 Jill 6 25 57 8 10 25 11 16 10

How to roll the data on "n" months partitions in pandas

I have a dataset that looks like below:
month year value
1 2019 20
2 2019 13
3 2019 10
4 2019 20
5 2019 13
6 2019 10
7 2019 20
8 2019 13
9 2019 10
10 2019 20
11 2019 13
12 2019 10
1 2020 20
2 2020 13
3 2020 10
4 2020 40
Please assume that each month and year occurs multiple times and also there are much more columns. What I wanted to create is multiple dataframes in a 6 months window. I dont want to have aggregations.
The partitioned dataset should include data in the below criteria. Please help me with pandas. I know the naive way is to manually use conditions to select the dataframe. But I guess there will be more effective way in doing this operations at one go.
month 1-6 year 2019
month 2-7 year 2019
month 3-8 year 2019
month 4-9 year 2019
month 5-10 year 2019
month 6-11 year 2019
month 7-12 year 2019
month 8-1 year 2019,2020
month 9-2 year 2019,2020
month 10-3 year 2019,2020
month 11-3 year 2019,2020
What I have tried so far:
for i, j in zip(range(1,12), range(6,13)):
print(i,j) # this is for 2019
I can take this i and j and plug it in months and repeat the same for 2020 as well. But there will be a better way where it would be easy to create a list of dataframes.
With a datetime index and pd.Grouper, you can proceed as follows
df = pd.DataFrame(np.random.randn(12,3),
index = pd.date_range(pd.Timestamp.now(), periods = 12),
)
df_grouped = df.groupby(pd.Grouper(freq = "6M"))
[df_grouped.get_group(x) for x in df_grouped.groups]

Information matrix from pandas dataframe

I have a pandas dataframe like the following:
Customer Id year
0 1510220024 2017
1 1510270013 2017
2 1511160047 2017
3 1512100014 2017
4 1603180006 2017
5 1605030030 2017
6 1605160013 2017
7 1606060008 2017
8 1510220024 2018
9 1606270014 2017
10 1608080011 2017
11 1608090002 2017
12 1511160047 2018
13 1606270014 2018
And I want to build the following matrix from the above dataframe:
2017 2018
2017 11 3
2018 3 3
This matrix tells that there were total 11 customers in year 2017 and three of them also appeared in 2018 and so on. In actual, I have 7 years of data so it would be 7x7 matrix. I am struggling for a while now but can't get this right.
merge + crosstab:
m = df.merge(df, left_on='Customer Id', right_on='Customer Id')
pd.crosstab(m.year_x, m.year_y)
year_y 2017 2018
year_x
2017 11 3
2018 3 3

Pandas Rolling mean with GroupBy and Sort

I have a DataFrame that looks like:
f_period f_year f_month subject month year value
20140102 2014 1 a 1 2018 10
20140109 2014 1 a 1 2018 12
20140116 2014 1 a 1 2018 8
20140202 2014 2 a 1 2018 20
20140209 2014 2 a 1 2018 15
20140102 2014 1 b 1 2018 10
20140109 2014 1 b 1 2018 12
20140116 2014 1 b 1 2018 8
20140202 2014 2 b 1 2018 20
20140209 2014 2 b 1 2018 15
The f_period is the date when a forecast for a SKU (column subject) was made. The month and year column is the period for which the forecast was made. For example, the first row says that on 01/02/2018, the model was forecasting to set 10 units of product a in month 1 of year2018.
I am trying to create a rolling average prediction by subject, by month for 2 f_months. The DataFrame should look like:
f_period f_year f_month subject month year value mnthly_avg rolling_2_avg
20140102 2014 1 a 1 2018 10 10 13
20140109 2014 1 a 1 2018 12 10 13
20140116 2014 1 a 1 2018 8 10 13
20140202 2014 2 a 1 2018 20 17.5 null
20140209 2014 2 a 1 2018 15 17.5 null
20140102 2014 1 b 1 2018 10 10 13
20140109 2014 1 b 1 2018 12 10 13
20140116 2014 1 b 1 2018 8 10 13
20140202 2014 2 b 1 2018 20 17.5 null
20140209 2014 2 b 1 2018 15 17.5 null
Things I tried:
I was able to get mnthly_avg by :
data_df['monthly_avg'] = data_df.groupby(['f_month', 'f_year', 'year', 'month', 'period', 'subject']).\
value.transform('mean')
I tried getting the rolling_2_avg :
rolling_monthly_df = data_df[['f_year', 'f_month', 'subject', 'month', 'year', 'value', 'f_period']].\
groupby(['f_year', 'f_month', 'subject', 'month', 'year']).value.mean().reset_index()
rolling_monthly_df['rolling_2_avg'] = rolling_monthly_df.groupby(['subject', 'month']).\
value.rolling(2).mean().reset_index(drop=True)
This gave me an unexpected output. I don't understand how it calculated the values for rolling_2_avg
How do I group by subject and month and then sort by f_month and then take the average of the next two-month average?
Unless I'm misunderstanding it seems simpler than what you've done. What about this?
grp = pd.DataFrame(df.groupby(['subject', 'month', 'f_month'])['value'].sum())
grp['rolling'] = grp.rolling(window=2).mean()
grp
Output:
value rolling
subject month f_month
a 1 1 30 NaN
2 35 32.5
b 1 1 30 32.5
2 35 32.5
I would be a bit careful with Josh's solution. If you want to group by the subject you can't use the rolling function like that as it will roll across subjects (i.e. it will eventually take the mean of a month from subject A and B, rather than giving a null which you might prefer).
An alternative can be to split the dataframe and run the rolling individually (I noticed that you want the nulls by the end of the dataframe, whereas you might wanna sort the dataframe before and after):
for unique_subject in df['subject'].unique():
df_subject = df[df['subject'] == unique_subject]
df_subject['rolling'] = df_subject['value'].rolling(window=2).mean()
print(df_subject) # just to print, you may wanna concatenate these

Categories