How to normalize the following dates inside a pandas dataframe? - python

I have the following dates dataframe:
dates
0 2012 10 4
1
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
6
7 2013 03 19
8 2016 2 5
9 2011 2 19
10
11 2011 05 23
12 2012 04 5
How can I normalize the dates column into:
dates
0 2012 10 04
1
2 2012 01 19
3 2020 06 11
4 2020 10 07
5 2019 11 12
6
7 2013 03 19
8 2016 02 05
9 2011 02 19
10
11 2011 05 23
12 2012 04 05
I tried regex, splitting, and tweaking each column separately; however, I am overcomplicating the task. Is it possible to normalize this into the latter dataframe? The rule is to zero-pad the month and day if they are a single digit, and to prepend 20 to the string if the year is two digits; the target format is yyyymmdd.

Solution:
x = (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']  # keep only rows that contain a date
     .str.split(expand=True)                                     # split on whitespace into 3 columns
     .rename(columns={0: 'year', 1: 'month', 2: 'day'})
     .astype(int))
x.loc[x.year <= 50, 'year'] += 2000                              # two-digit years become 20xx
df['new'] = pd.to_datetime(x, errors='coerce').dt.strftime('%Y%m%d')  # assemble dates, zero-pad parts
Result:
In [148]: df
Out[148]:
dates new
0 2012 10 4 20121004
1 NaN
2 2012 01 19 20120119
3 20 6 11 20200611
4 20 10 7 20201007
5 19 11 12 20191112
6 NaN
7 2013 03 19 20130319
8 2016 2 5 20160205
9 2011 2 19 20110219
10 NaN
11 2011 05 23 20110523
12 2012 04 5 20120405
Explanation:
In [149]: df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
Out[149]:
0 2012 10 4
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 03 19
8 2016 2 5
9 2011 2 19
11 2011 05 23
12 2012 04 5
Name: dates, dtype: object
In [152]: (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
...: .str.split(expand=True)
...: .rename(columns={0:'year',1:'month',2:'day'})
...: .astype(int))
Out[152]:
year month day
0 2012 10 4
2 2012 1 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 3 19
8 2016 2 5
9 2011 2 19
11 2011 5 23
12 2012 4 5
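The remaining two steps (not shown above): bump the two-digit years into the 2000s, then let pd.to_datetime assemble real dates from the year/month/day columns (errors='coerce' turns any impossible date into NaT), and strftime('%Y%m%d') zero-pads the months and days. Continuing the session, roughly (prompt numbers are illustrative):
In [153]: x.loc[x.year <= 50, 'year'] += 2000
In [154]: pd.to_datetime(x, errors='coerce').dt.strftime('%Y%m%d')
Out[154]:
0     20121004
2     20120119
3     20200611
4     20201007
5     20191112
7     20130319
8     20160205
9     20110219
11    20110523
12    20120405
dtype: object
Assigning this to df['new'] aligns on the index, which is why the empty rows come back as NaN.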

Related

Convert a Python data frame with different 'year' columns into a continuous time series

Is it possible to convert a Pandas dataframe like the one below into a time series where each year follows the last one?
This is likely what df.unstack(level=1) is meant for.
import numpy as np
import pandas as pd

np.random.seed(111)  # reproducibility
df = pd.DataFrame(
    data={
        "2009": np.random.randn(12),
        "2010": np.random.randn(12),
        "2011": np.random.randn(12),
    },
    index=range(1, 13),
)
print(df)
Out[45]:
2009 2010 2011
1 -1.133838 -1.440585 0.570594
2 0.384319 0.773703 0.915420
3 1.496554 -1.027967 -1.669341
4 -0.355382 -0.090986 0.482714
5 -0.787534 0.492003 -0.310473
6 -0.459439 0.424672 2.394690
7 -0.059169 1.283049 1.550931
8 -0.354174 0.315986 -0.646465
9 -0.735523 -0.408082 -0.928937
10 -1.183940 -0.067948 -1.654976
11 0.238894 -0.952427 0.350193
12 -0.589920 -0.110677 -0.141757
df_out = df.unstack(1).reset_index()
df_out.columns = ["year", "month", "value"]
print(df_out)
Out[46]:
year month value
0 2009 1 -1.133838
1 2009 2 0.384319
2 2009 3 1.496554
3 2009 4 -0.355382
4 2009 5 -0.787534
5 2009 6 -0.459439
6 2009 7 -0.059169
7 2009 8 -0.354174
8 2009 9 -0.735523
9 2009 10 -1.183940
10 2009 11 0.238894
11 2009 12 -0.589920
12 2010 1 -1.440585
13 2010 2 0.773703
14 2010 3 -1.027967
15 2010 4 -0.090986
16 2010 5 0.492003
17 2010 6 0.424672
18 2010 7 1.283049
19 2010 8 0.315986
20 2010 9 -0.408082
21 2010 10 -0.067948
22 2010 11 -0.952427
23 2010 12 -0.110677
24 2011 1 0.570594
25 2011 2 0.915420
26 2011 3 -1.669341
27 2011 4 0.482714
28 2011 5 -0.310473
29 2011 6 2.394690
30 2011 7 1.550931
31 2011 8 -0.646465
32 2011 9 -0.928937
33 2011 10 -1.654976
34 2011 11 0.350193
35 2011 12 -0.141757
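If the goal is a single series indexed by real dates rather than year/month columns, the two key columns can be assembled into datetimes. A minimal sketch (assuming the day can be fixed to the first of the month):
df_out["date"] = pd.to_datetime(pd.DataFrame({
    "year": df_out["year"].astype(int),   # column labels were strings like "2009"
    "month": df_out["month"],
    "day": 1,                             # assumption: the day of month doesn't matter
}))
ts = df_out.set_index("date")["value"]    # one continuous monthly time series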

How do I calculate a cumulative sum in Python using pandas over all the columns except the first one, which contains names?

Here's the data in csv format:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
Jack 1 15 25 3 5 11 5 8 3
Jill 5 10 32 5 5 14 6 8 7
I don't want the Name column to be included, as it gives an error.
I tried
df.cumsum()
Try set_index and reset_index to keep the Name column:
df.set_index('Name').cumsum().reset_index()
Output:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Jack 1 15 25 3 5 11 5 8 3
1 Jill 6 25 57 8 10 25 11 16 10
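Note that cumsum accumulates down the rows by default (Jill's row above is Jack + Jill). If the intent was instead a running total across the year columns for each name, pass axis=1 — a sketch:
df.set_index('Name').cumsum(axis=1).reset_index()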

Fill column with value from previous year from the same month

How can I use the value from the same month in the previous year to fill values in the following table for 2020:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
The final desired result is the following:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
B 4 2020 40
A 4 2020 40
A 5 2020 20
A 6 2020 15
A 7 2020 17
A 8 2020 18
A 9 2020 12
A 10 2020 11
A 11 2020 19
A 12 2020 15
I tried using pandas groupby but not sure if that is the right approach.
IIUC, we can pivot, forward-fill within each Category, then stack back:
s = (df.pivot_table(index=['Category', 'Year'], columns='Month', values='Value')
     .groupby(level=0)   # group by Category
     .ffill()            # fill each year's missing months from the previous year
     .stack()
     .reset_index())
Category Year level_2 0
0 A 2019 1 15.0
1 A 2019 2 90.0
2 A 2019 3 50.0
3 A 2019 5 20.0
4 A 2019 6 15.0
5 A 2019 7 17.0
6 A 2019 8 18.0
7 A 2019 9 12.0
8 A 2019 10 11.0
9 A 2019 11 19.0
10 A 2019 12 15.0
11 A 2020 1 18.0
12 A 2020 2 53.0
13 A 2020 3 80.0
14 A 2020 5 20.0
15 A 2020 6 15.0
16 A 2020 7 17.0
17 A 2020 8 18.0
18 A 2020 9 12.0
19 A 2020 10 11.0
20 A 2020 11 19.0
21 A 2020 12 15.0
22 B 2019 2 20.0
23 B 2019 4 40.0
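Given the output shown above, the default labels level_2 and 0 can be renamed back to Month and Value (a cosmetic follow-up, not in the original answer):
s = s.rename(columns={'level_2': 'Month', 0: 'Value'})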
You can accomplish this with a combination of loc, concat, and drop_duplicates.
The idea here is to concatenate the dataframe with a copy of the 2019 data where Year is changed to 2020, and then keep only the first value for each Category, Month, Year combination.
df2 = df.loc[df['Year'] == 2019].copy()  # .copy() so the next assignment doesn't hit a view
df2['Year'] = 2020
pd.concat([df, df2]).drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
Output
Category Month Year Value
0 A 1 2019 15
1 B 2 2019 20
2 A 2 2019 90
3 A 3 2019 50
4 B 4 2019 40
5 A 5 2019 20
6 A 6 2019 15
7 A 7 2019 17
8 A 8 2019 18
9 A 9 2019 12
10 A 10 2019 11
11 A 11 2019 19
12 A 12 2019 15
13 A 1 2020 18
14 A 2 2020 53
15 A 3 2020 80
1 B 2 2020 20
4 B 4 2020 40
5 A 5 2020 20
6 A 6 2020 15
7 A 7 2020 17
8 A 8 2020 18
9 A 9 2020 12
10 A 10 2020 11
11 A 11 2020 19
12 A 12 2020 15
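One small follow-up: concat keeps the row labels from both pieces, which is why index values like 1, 4, 5 repeat above. If a clean 0..n index matters, chain a reset:
result = (pd.concat([df, df2])
          .drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
          .reset_index(drop=True))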

Create a new column in pandas dataframe based on multiple conditions

I have a dataframe like the one below, and I need to create a new column year_val taken from col2016 through col2019 based on the Years column: year_val should hold the value of col#### whose suffix matches Years.
import pandas as pd
sampleDF = pd.DataFrame({'Years': [2016, 2016, 2017, 2017, 2018, 2018, 2019, 2019],
                         'col2016': [1, 2, 3, 4, 5, 6, 7, 8],
                         'col2017': [9, 10, 11, 12, 13, 14, 15, 16],
                         'col2018': [17, 18, 19, 20, 21, 22, 23, 24],
                         'col2019': [25, 26, 27, 28, 29, 30, 31, 32]})
sampleDF['year_val'] = ?????
Use DataFrame.lookup, building the column labels by prepending 'col' to the Years values cast to string:
sampleDF['year_val'] = sampleDF.lookup(sampleDF.index, 'col' + sampleDF['Years'].astype(str))
print (sampleDF)
Years col2016 col2017 col2018 col2019 year_val
0 2016 1 9 17 25 1
1 2016 2 10 18 26 2
2 2017 3 11 19 27 11
3 2017 4 12 20 28 12
4 2018 5 13 21 29 21
5 2018 6 14 22 30 22
6 2019 7 15 23 31 31
7 2019 8 16 24 32 32
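Worth noting: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on current pandas the same row-wise pick needs another route. A sketch using numpy fancy indexing (assumes every Years value has a matching colYYYY column):
import numpy as np

cols = 'col' + sampleDF['Years'].astype(str)           # target column label per row
col_idx = sampleDF.columns.get_indexer(cols)           # positional index of each label
sampleDF['year_val'] = sampleDF.to_numpy()[np.arange(len(sampleDF)), col_idx]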
EDIT: If you check the definition of the lookup function:
result = [df.get_value(row, col) for row, col in zip(row_labels, col_labels)]
you can rewrite it as a loop with a try-except statement, using Series.at to avoid:
FutureWarning: get_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
sampleDF = pd.DataFrame({'Years': [2015, 2016, 2017, 2017, 2018, 2018, 2019, 2019],
                         'col2016': [1, 2, 3, 4, 5, 6, 7, 8],
                         'col2017': [9, 10, 11, 12, 13, 14, 15, 16],
                         'col2018': [17, 18, 19, 20, 21, 22, 23, 24],
                         'col2019': [25, 26, 27, 28, 29, 30, 31, 32]})
print (sampleDF)
Years col2016 col2017 col2018 col2019
0 2015 1 9 17 25
1 2016 2 10 18 26
2 2017 3 11 19 27
3 2017 4 12 20 28
4 2018 5 13 21 29
5 2018 6 14 22 30
6 2019 7 15 23 31
7 2019 8 16 24 32
import numpy as np

out = []
for row, col in zip(sampleDF.index, 'col' + sampleDF['Years'].astype(str)):
    try:
        out.append(sampleDF.at[row, col])
    except KeyError:
        out.append(np.nan)   # no matching colYYYY column for this year
sampleDF['year_val'] = out
print(sampleDF)
Years col2016 col2017 col2018 col2019 year_val
0 2015 1 9 17 25 NaN
1 2016 2 10 18 26 2.0
2 2017 3 11 19 27 11.0
3 2017 4 12 20 28 12.0
4 2018 5 13 21 29 21.0
5 2018 6 14 22 30 22.0
6 2019 7 15 23 31 31.0
7 2019 8 16 24 32 32.0

How do I find the count of a particular column [Model], based on another column [SoldDate] using pandas?

I have a dataframe with 3 columns: SoldDate, Model and TotalSoldCount. How do I create a new column, 'CountSoldbyMonth', which gives the count of each of the many models sold monthly? Sample data is below.
The 'CountSoldbyMonth' should always be less than the 'TotalSoldCount'.
I am new to Python.
Date Model TotalSoldCount
Jan 19 A 4
Jan 19 A 4
Jan 19 A 4
Jan 19 B 6
Jan 19 C 2
Jan 19 C 2
Feb 19 A 4
Feb 19 B 6
Feb 19 B 6
Feb 19 B 6
Mar 19 B 6
Mar 19 B 6
The new df should look like this.
Date Model TotalSoldCount CountSoldbyMonth
Jan 19 A 4 3
Jan 19 A 4 3
Jan 19 A 4 3
Jan 19 B 6 1
Jan 19 C 2 2
Jan 19 C 2 2
Feb 19 A 4 1
Feb 19 B 6 3
Feb 19 B 6 3
Feb 19 B 6 3
Mar 19 B 6 2
Mar 19 B 6 2
I tried doing
df['CountSoldbyMonth'] = df.groupby(['date','model']).totalsoldcount.transform('sum')
but it is generating a different value.
Suppose you have this data set:
date model totalsoldcount
0 Jan 19 A 110
1 Jan 19 A 110
2 Jan 19 A 110
3 Jan 19 B 50
4 Jan 19 C 70
5 Jan 19 C 70
6 Feb 19 A 110
7 Feb 19 B 50
8 Feb 19 B 50
9 Feb 19 B 50
10 Mar 19 B 50
11 Mar 19 B 50
And you want to define a new column, countsoldbymonth. You can group by the date and model columns, sum totalsoldcount with a transform, and assign the result to the new column:
s['countsoldbymonth'] = s.groupby(['date', 'model']).totalsoldcount.transform('sum')
print(s)
date model totalsoldcount countsoldbymonth
0 Jan 19 A 110 330
1 Jan 19 A 110 330
2 Jan 19 A 110 330
3 Jan 19 B 50 50
4 Jan 19 C 70 140
5 Jan 19 C 70 140
6 Feb 19 A 110 110
7 Feb 19 B 50 150
8 Feb 19 B 50 150
9 Feb 19 B 50 150
10 Mar 19 B 50 100
11 Mar 19 B 50 100
Or, if you just want to see the sums without creating a new column you can use sum instead of transform like this:
print(s.groupby(['date', 'model']).totalsoldcount.sum())
date model
Feb 19 A 110
B 150
Jan 19 A 330
B 50
C 140
Mar 19 B 100
Edit
If you just want to know how many sales were made in the month, you can do the same groupby but use count instead of sum:
df['CountSoldByMonth'] = df.groupby(['Date', 'Model']).TotalSoldCount.transform('count')
print(df)
Date Model TotalSoldCount CountSoldByMonth
0 Jan 19 A 4 3
1 Jan 19 A 4 3
2 Jan 19 A 4 3
3 Jan 19 B 6 1
4 Jan 19 C 2 2
5 Jan 19 C 2 2
6 Feb 19 A 4 1
7 Feb 19 B 6 3
8 Feb 19 B 6 3
9 Feb 19 B 6 3
10 Mar 19 B 6 2
11 Mar 19 B 6 2
It's easier to help if you give code that lets the user experiment. In this case, taking your dataframe (df) and doing the following should work:
df['CountSoldbyMonth'] = df.groupby(['Date', 'Model'])['TotalSoldCount'].transform('count')
