How to remove leading '0' from my column? (Python)

I am trying to remove the leading '0' from my data.
My dataframe looks like this:
Id Year Month Day
1 2019 01 15
2 2019 03 30
3 2019 10 20
4 2019 11 18
Note: the 'Year', 'Month' and 'Day' columns have dtype object.
I get the 'Year', 'Month' and 'Day' columns by extracting them from a date.
I want to remove the '0' at the beginning of each month.
Desired Output:
Id Year Month Day
1 2019 1 15
2 2019 3 30
3 2019 10 20
4 2019 11 18
What I tried to do so far:
df['Month'].str.lstrip('0')
But it did not work.
Any solution? Thank you!

You could use the re package and apply a regex to it:
import re
import pandas as pd

# Create sample data
d = pd.DataFrame(data={"Month": ["01", "02", "03", "10", "11"]})
d["Month"] = d["Month"].apply(lambda x: re.sub(r"^0+", "", x))
Result:
0 1
1 2
2 3
3 10
4 11
Name: Month, dtype: object
If you are 100% sure that the Month column will always contain numbers, then you could simply do:
d["Month"] = d["Month"].astype(int)
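For completeness, the str.lstrip approach from the question also works; the likely issue is that the result was never assigned back to the column. A minimal sketch on made-up sample data:

```python
import pandas as pd

# Sample data mirroring the question
df = pd.DataFrame({"Month": ["01", "03", "10", "11"]})

# str.lstrip returns a new Series; assign it back to keep the change
df["Month"] = df["Month"].str.lstrip("0")
print(df["Month"].tolist())  # ['1', '3', '10', '11']
```

Note that lstrip only removes leading zeros, so "10" and "11" are left intact.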

Related

Pandas groupby operations between groups

I have a DataFrame with 4 fields: Location, Year, Week and Sales. I would like to know the difference in Sales between two years, preserving the granularity of the dataset. That is, for each Location, Year and Week, I would like to know the difference to the same week of another Year.
The following will generate a DataFrame with a similar structure:
import numpy as np
import pandas as pd

raw_data = {'Location': ['A']*30 + ['B']*30 + ['C']*30,
            'Year': 3*([2018]*10 + [2019]*10 + [2020]*10),
            'Week': 3*(3*list(range(1, 11))),
            'Sales': np.random.randint(100, size=90)}
df = pd.DataFrame(raw_data)
Location Year Week Sales
A 2018 1 67
A 2018 2 93
A 2018 … 67
A 2019 1 49
A 2019 2 38
A 2019 … 40
B 2018 1 18
… … … …
Could you please show me what would be the best approach?
Thank you very much
You can do it using groupby and shift:
df["Next_Years_Sales"] = df.groupby(["Location", "Week"])["Sales"].shift(-1)
df["YoY_Sales_Difference"] = df["Next_Years_Sales"] - df["Sales"]
Spot checking it:
df[(df["Location"] == "A") & (df["Week"] == 1)]
Out[37]:
Location Year Week Sales Next_Years_Sales YoY_Sales_Difference
0 A 2018 1 99 10.0 -89.0
10 A 2019 1 10 3.0 -7.0
20 A 2020 1 3 NaN NaN
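Note that shift(-1) relies on the rows being ordered by Year within each (Location, Week) group. If the input may be unsorted, sorting first makes the result order-independent; a sketch using the question's sample data:

```python
import numpy as np
import pandas as pd

# Reproducible sample data in the shape of the question
rng = np.random.default_rng(0)
raw_data = {'Location': ['A']*30 + ['B']*30 + ['C']*30,
            'Year': 3*([2018]*10 + [2019]*10 + [2020]*10),
            'Week': 3*(3*list(range(1, 11))),
            'Sales': rng.integers(100, size=90)}
df = pd.DataFrame(raw_data)

# Sort so that shift(-1) within each group really is "next year's" value
df = df.sort_values(["Location", "Week", "Year"])
df["Next_Years_Sales"] = df.groupby(["Location", "Week"])["Sales"].shift(-1)
df["YoY_Sales_Difference"] = df["Next_Years_Sales"] - df["Sales"]
```

The last year in each group gets NaN, since there is no following year to compare against.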

loop to filter rows based on multiple column conditions pandas python

df
month year Jelly Candy Ice_cream.....and so on
JAN 2010 12 11 10
FEB 2010 13 1 2
MAR 2010 12 2 1
....
DEC 2019 2 3 4
I have code to extract DataFrames where the month name is JAN, FEB, etc. across all years. For example:
[IN]filterJan=df[df['month']=='JAN']
filterJan
[OUT]
month year Jelly Candy Ice_cream.....and so on
JAN 2010 12 11 10
JAN 2011 13 1 2
....
JAN 2019 2 3 4
I am trying to make a loop for this process.
[IN]for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    filter[month]=df[df['month']==month]
[OUT]
----> 3 filter[month]=batch1_clean_Sales_database[batch1_clean_Sales_database['month']==month]
TypeError: 'type' object does not support item assignment
If I print the dataframes it works, but I want to store them and reuse them later:
[IN]for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    print(df[df['month']==month])
I think you can create a dictionary of DataFrames:
d = dict(tuple(df.groupby('month')))
Your solution should be changed to:
d = {}
for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    d[month] = df[df['month']==month]
Then it is possible to select each month like d['Jan'], which works like a regular DataFrame.
If you want to loop over the dictionary of DataFrames:
for k, v in d.items():
    print(k)
    print(v)
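Putting it together, a minimal runnable sketch (with made-up sample data in the shape of the question) showing both building the dictionary in one line and reusing one of its DataFrames later:

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question
df = pd.DataFrame({"month": ["JAN", "FEB", "JAN"],
                   "year": [2010, 2010, 2011],
                   "Jelly": [12, 13, 5]})

# One DataFrame per month, keyed by the month name
d = dict(tuple(df.groupby("month")))

print(d["JAN"])       # the two JAN rows, reusable later like any DataFrame
print(len(d["FEB"]))  # 1
```

Note the dictionary keys come from the actual column values, so if the data holds 'JAN' in upper case, looking up 'Jan' will raise a KeyError.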

Re-format Dataframe column such that any numeric month substring is replaced with month string

Looking to reformat a string column, as it is causing errors in Django. My df:
import pandas as pd
data = {'Date_Str': ['2018_11','2018_12','2019_01','2019_02','2019_03','2019_04','2019_05','2019_06','2019_07','2019_08','2019_09','2019_10']}
df = pd.DataFrame(data)
print(df)
Date_Str
0 2018_11
1 2018_12
2 2019_01
3 2019_02
4 2019_03
5 2019_04
6 2019_05
7 2019_06
8 2019_07
9 2019_08
10 2019_09
11 2019_10
My solution:
df['Date_Month'] = df.Date_Str.str[-2:]
mapper = {'01':'Jan', '02':'Feb', '03':'Mar','04':'Apr','05':'May','06':'Jun','07':'Jul','08':'Aug','09':'Sep','10':'Oct','11':'Nov','12':'Dec'}
df['Date_Month_Str'] = df.Date_Str.str[0:4] + '_' + df.Date_Month.map(mapper)
print(df)
Desired output is column Date_Month_Str or simply update Date_Str with yyyy_mmm
Date_Str Date_Month Date_Month_Str
0 2018_11 11 2018_Nov
1 2018_12 12 2018_Dec
2 2019_01 01 2019_Jan
3 2019_02 02 2019_Feb
4 2019_03 03 2019_Mar
5 2019_04 04 2019_Apr
6 2019_05 05 2019_May
7 2019_06 06 2019_Jun
8 2019_07 07 2019_Jul
9 2019_08 08 2019_Aug
10 2019_09 09 2019_Sep
11 2019_10 10 2019_Oct
Can the three lines be reduced to one? Or can Date_Str simply be updated with a one-liner?
Convert column to datetimes and then use Series.dt.strftime:
df['Date_Month_Str'] = pd.to_datetime(df.Date_Str, format='%Y_%m').dt.strftime('%Y_%b')
print(df)
Date_Str Date_Month_Str
0 2018_11 2018_Nov
1 2018_12 2018_Dec
2 2019_01 2019_Jan
3 2019_02 2019_Feb
4 2019_03 2019_Mar
5 2019_04 2019_Apr
6 2019_05 2019_May
7 2019_06 2019_Jun
8 2019_07 2019_Jul
9 2019_08 2019_Aug
10 2019_09 2019_Sep
11 2019_10 2019_Oct
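Alternatively, the original mapper approach can also be collapsed to a single line without going through datetime (a sketch; this keeps the hand-written month dictionary, which avoids any locale dependence of %b):

```python
import pandas as pd

df = pd.DataFrame({"Date_Str": ["2018_11", "2019_01"]})
mapper = {"01": "Jan", "02": "Feb", "03": "Mar", "04": "Apr",
          "05": "May", "06": "Jun", "07": "Jul", "08": "Aug",
          "09": "Sep", "10": "Oct", "11": "Nov", "12": "Dec"}

# One line: keep the "yyyy_" prefix and map the numeric month through the dict
df["Date_Month_Str"] = df["Date_Str"].str[:5] + df["Date_Str"].str[-2:].map(mapper)
```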

Matching 'Date' dataframes in Pandas to enable joins/merging

I have two csv files with pandas dataframes with a 'Date' column, which is my desired target to join the two tables (my goal is to join my two csvs by dates and merge matching dataframes by summing them).
The issue is that despite sharing the same month-year format, my first csv abbreviated the years, whereas my desired output would be mm-yyyy (for example, Aug-2012 as opposed to Aug-12).
csv1:
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
...
has 41 rows, i.e. 41 months' worth of data between Oct-12 and Feb-16
csv2:
0 Jan-2009 943690
1 Feb-2009 1062565
2 Mar-2009 210079
3 Apr-2009 -735286
4 May-2009 842933
5 Jun-2009 358691
6 Jul-2009 914953
7 Aug-2009 723427
8 Sep-2009 -837468
...
has 86 rows, i.e. 86 months' worth of data between Jan-2009 and Feb-2016
I tried initially to do something akin to a 'find and replace' function as one would in Excel.
I tried :
findlist = ['12','13','14','15','16']
replacelist = ['2012','2013','2014','2015','2016']
def findReplace(find, replace):
    s = csv1_df.read()
    s = s.replace(Date, replacement)
    csv1_dfc.write(s)
for item, replacement in zip(findlist, replacelist):
    s = s.replace(Date, replacement)
But I am getting a
NameError: name 's' is not defined
You can use to_datetime to transform to datetime format, and then strftime to adjust your format:
df['col_date'] = pd.to_datetime(df['col_date'], format="%b-%y").dt.strftime('%b-%Y')
Input:
col_date val
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
Output:
col_date val
0 Oct-2012 1154293
1 Nov-2012 885773
2 Dec-2012 -448704
3 Jan-2013 563679
4 Feb-2013 555394
5 Mar-2013 631974
6 Apr-2013 957395
7 May-2013 1104047
8 Jun-2013 693464
Note the lower case y for 2 digits year and upper case Y for 4 digits year.
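Since the end goal is joining the two tables on the date and summing the matching values, one way (a sketch with hypothetical column names, as the actual csvs aren't shown) is to normalize the first table's dates and then merge:

```python
import pandas as pd

# Hypothetical small versions of the two csvs
csv1 = pd.DataFrame({"col_date": ["Oct-12", "Nov-12"], "val": [1154293, 885773]})
csv2 = pd.DataFrame({"col_date": ["Oct-2012", "Nov-2012"], "val": [100, 200]})

# Normalize csv1's 2-digit years to match csv2's 4-digit format
csv1["col_date"] = pd.to_datetime(csv1["col_date"], format="%b-%y").dt.strftime("%b-%Y")

# Join on the now-matching date strings and sum the two value columns
merged = csv1.merge(csv2, on="col_date", suffixes=("_1", "_2"))
merged["total"] = merged["val_1"] + merged["val_2"]
```

Since csv2 covers a longer period, an inner merge keeps only the overlapping months; use how="outer" to keep all of them instead.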

Cumulative sum (pandas)

Apologies if this has been asked already.
I am trying to create a yearly cumulative sum for all order-points within a certain customer account, and am struggling.
Essentially, I want to create `YearlyTotal` below:
Customer Year Date Order PointsPerOrder YearlyTotal
123456 2016 11/2/16 A939 1 20
123456 2016 3/13/16 A102 19 19
789089 2016 7/15/16 A123 7 7
I've tried:
df['YEARLYTOTAL'] = df.groupby(by=['Customer','Year'])['PointsPerOrder'].cumsum()
But this produces YearlyTotal in the wrong order (i.e., YearlyTotal of A939 is 1 instead of 20).
Not sure if this matters, but Customer is a string (the database has leading zeroes -- don't get me started). sort_values(by=['Customer','Year','Date'],ascending=True) at the front also produces an error.
Help?
Use [::-1] for reversing dataframe:
df['YEARLYTOTAL'] = df[::-1].groupby(by=['Customer','Year'])['PointsPerOrder'].cumsum()
print (df)
Customer Year Date Order PointsPerOrder YearlyTotal YEARLYTOTAL
0 123456 2016 11/2/16 A939 1 20 20
1 123456 2016 3/13/16 A102 19 19 19
2 789089 2016 7/15/16 A123 7 7 7
first make sure Date is a datetime column:
In [35]: df.Date = pd.to_datetime(df.Date)
now we can do:
In [36]: df['YearlyTotal'] = df.sort_values('Date').groupby(['Customer','Year'])['PointsPerOrder'].cumsum()
In [37]: df
Out[37]:
Customer Year Date Order PointsPerOrder YearlyTotal
0 123456 2016 2016-11-02 A939 1 20
1 123456 2016 2016-03-13 A102 19 19
2 789089 2016 2016-07-15 A123 7 7
PS this solution will NOT depend on the order of records...
