I have a dataframe like the one below, and I have to create a new column year_val that is equal to the values of col2016 through col2019 based on the Years column, so that the value for year_val will be the value of col#### when Years is equal to the suffix of col####
import pandas as pd
sampleDF = pd.DataFrame({'Years':[2016,2016,2017,2017,2018,2018,2019,2019],
'col2016':[1,2,3,4,5,6,7,8],
'col2017':[9,10,11,12,13,14,15,16],
'col2018':[17,18,19,20,21,22,23,24],
'col2019':[25,26,27,28,29,30,31,32]})
sampleDF['year_val'] = ?????
Use DataFrame.lookup with change values in Years column with prepend col and cast to string:
sampleDF['year_val'] = sampleDF.lookup(sampleDF.index, 'col' + sampleDF['Years'].astype(str))
print (sampleDF)
Years col2016 col2017 col2018 col2019 year_val
0 2016 1 9 17 25 1
1 2016 2 10 18 26 2
2 2017 3 11 19 27 11
3 2017 4 12 20 28 12
4 2018 5 13 21 29 21
5 2018 6 14 22 30 22
6 2019 7 15 23 31 31
7 2019 8 16 24 32 32
EDIT: If check definition of lookup function:
result = [df.get_value(row, col) for row, col in zip(row_labels, col_labels)]
you can modify it with try-except statement with Series.at for prevent:
FutureWarning: get_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
oup.append(sampleDF.at[row, col] )
sampleDF = pd.DataFrame({'Years':[2015,2016,2017,2017,2018,2018,2019,2019],
'col2016':[1,2,3,4,5,6,7,8],
'col2017':[9,10,11,12,13,14,15,16],
'col2018':[17,18,19,20,21,22,23,24],
'col2019':[25,26,27,28,29,30,31,32]})
print (sampleDF)
Years col2016 col2017 col2018 col2019
0 2015 1 9 17 25
1 2016 2 10 18 26
2 2017 3 11 19 27
3 2017 4 12 20 28
4 2018 5 13 21 29
5 2018 6 14 22 30
6 2019 7 15 23 31
7 2019 8 16 24 32
out= []
for row, col in zip(sampleDF.index, 'col' + sampleDF['Years'].astype(str)):
try:
out.append(sampleDF.at[row, col] )
except KeyError:
out.append(np.nan)
sampleDF['year_val'] = out
print (sampleDF)
Years col2016 col2017 col2018 col2019 year_val
0 2015 1 9 17 25 NaN
1 2016 2 10 18 26 2.0
2 2017 3 11 19 27 11.0
3 2017 4 12 20 28 12.0
4 2018 5 13 21 29 21.0
5 2018 6 14 22 30 22.0
6 2019 7 15 23 31 31.0
7 2019 8 16 24 32 32.0
Related
suppose I have following data frame :
data = {'age' :[10,11,12,11,11,10,11,13,13,13,14,14,15,15,15],
'num1':[10,11,12,13,14,15,16,17,18,19,20,21,22,23,24],
'num2':[20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]}
df = pd.DataFrame(data)
I want to sum rows for age 14 and 15 and keep those new values as age 14. my expected output would be like this:
age time1 time2
1 10 10 20
2 11 11 21
3 12 12 22
4 11 13 23
5 11 14 24
6 10 15 25
7 11 16 26
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
in the code below, I have tried to group.by age but it does not work for me:
df1 =df.groupby(age[age >=14])['num1', 'num2'].apply(', '.join).reset_index(drop=True).to_frame()
limit_age = 14
new = df.query("age < #limit_age").copy()
new.loc[len(new)] = [limit_age,
*df.query("age >= #limit_age").drop(columns="age").sum()]
first get the "before 14" dataframe
then assign it to a new row where
age is 14
other values are the row-wise sums of "after 14" dataframe
to get
>>> new
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160
(new.index += 1 can be used for a 1-based index at the end.)
I would use a mask and concat:
m = df['age'].isin([14, 15])
out = pd.concat([df[~m],
df[m].agg({'age': 'min', 'num1': 'sum', 'num2': 'sum'})
.to_frame().T
], ignore_index=True)
Output:
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160
Here's the data in csv format:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
Jack 1 15 25 3 5 11 5 8 3
Jill 5 10 32 5 5 14 6 8 7
I don't want Name column to be include as it gives an error.
I tried
df.cumsum()
Try with set_index and reset_index to keep the name column:
df.set_index('Name').cumsum().reset_index()
Output:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Jack 1 15 25 3 5 11 5 8 3
1 Jill 6 25 57 8 10 25 11 16 10
How can I use the value from the same month in the previous year to fill values in the following table for 2020:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
The final desired result is the following:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
B 4 2020 40
A 4 2020 40
A 5 2020 20
A 6 2020 15
A 7 2020 17
A 8 2020 18
A 9 2020 12
A 10 2020 11
A 11 2020 19
A 12 2020 15
I tried using pandas groupby but not sure if that is the right approach.
IIUC we use the pivot then ffill with stack
s=df.pivot_table(index=['Category','Year'],columns='Month',values='Value').groupby(level=0).ffill().stack().reset_index()
Category Year level_2 0
0 A 2019 1 15.0
1 A 2019 2 90.0
2 A 2019 3 50.0
3 A 2019 5 20.0
4 A 2019 6 15.0
5 A 2019 7 17.0
6 A 2019 8 18.0
7 A 2019 9 12.0
8 A 2019 10 11.0
9 A 2019 11 19.0
10 A 2019 12 15.0
11 A 2020 1 18.0
12 A 2020 2 53.0
13 A 2020 3 80.0
14 A 2020 5 20.0
15 A 2020 6 15.0
16 A 2020 7 17.0
17 A 2020 8 18.0
18 A 2020 9 12.0
19 A 2020 10 11.0
20 A 2020 11 19.0
21 A 2020 12 15.0
22 B 2019 2 20.0
23 B 2019 4 40.0
You can accomplish this with a combination of loc, concat, and drop_duplicates.
The idea here is to concatenate the dataframe with a copy of the 2019 data where year is changed to 2020, and then only keeping the first value for Category, Month, Year.
df2 = df.loc[df['Year'] == 2019, :]
df2['Year'] = 2020
pd.concat([df, df2]).drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
Output
Category Month Year Value
0 A 1 2019 15
1 B 2 2019 20
2 A 2 2019 90
3 A 3 2019 50
4 B 4 2019 40
5 A 5 2019 20
6 A 6 2019 15
7 A 7 2019 17
8 A 8 2019 18
9 A 9 2019 12
10 A 10 2019 11
11 A 11 2019 19
12 A 12 2019 15
13 A 1 2020 18
14 A 2 2020 53
15 A 3 2020 80
1 B 2 2020 20
4 B 4 2020 40
5 A 5 2020 20
6 A 6 2020 15
7 A 7 2020 17
8 A 8 2020 18
9 A 9 2020 12
10 A 10 2020 11
11 A 11 2020 19
12 A 12 2020 15
I have a dataframe with 3 columns, such as SoldDate,Model and TotalSoldCount. How do I create a new column, 'CountSoldbyMonth' which will give the count of each of the many models sold monthly? A screenshot describing the problem is given.
The ‘CountSoldbyMonth’ should always be less than the ‘TotalSoldCount’.
I am new to Python.
enter image description here
Date Model TotalSoldCount
Jan 19 A 4
Jan 19 A 4
Jan 19 A 4
Jan 19 B 6
Jan 19 C 2
Jan 19 C 2
Feb 19 A 4
Feb 19 B 6
Feb 19 B 6
Feb 19 B 6
Mar 19 B 6
Mar 19 B 6
The new df should look like this.
Date Model TotalSoldCount CountSoldbyMonth
Jan 19 A 4 3
Jan 19 A 4 3
Jan 19 A 4 3
Jan 19 B 6 1
Jan 19 C 2 2
Jan 19 C 2 2
Feb 19 A 4 1
Feb 19 B 6 3
Feb 19 B 6 3
Feb 19 B 6 3
Mar 19 B 6 2
Mar 19 B 6 2
I tried doing
df['CountSoldbyMonth'] = df.groupby(['date','model']).totalsoldcount.transform('sum')
but it is generating a different value.
Suppose you have this data set:
date model totalsoldcount
0 Jan 19 A 110
1 Jan 19 A 110
2 Jan 19 A 110
3 Jan 19 B 50
4 Jan 19 C 70
5 Jan 19 C 70
6 Feb 19 A 110
7 Feb 19 B 50
8 Feb 19 B 50
9 Feb 19 B 50
10 Mar 19 B 50
11 Mar 19 B 50
And you want to define a new column, countsoldbymonth. You can groupby the date and model columns and then sum the totalsoldcount with a transform and then create the new column:
s['countsoldbymonth'] = s.groupby([
'date',
'model'
]).totalsoldcount.transform('sum')
print(s)
date model totalsoldcount countsoldbymonth
0 Jan 19 A 110 330
1 Jan 19 A 110 330
2 Jan 19 A 110 330
3 Jan 19 B 50 50
4 Jan 19 C 70 140
5 Jan 19 C 70 140
6 Feb 19 A 110 110
7 Feb 19 B 50 150
8 Feb 19 B 50 150
9 Feb 19 B 50 150
10 Mar 19 B 50 100
11 Mar 19 B 50 100
Or, if you just want to see the sums without creating a new column you can use sum instead of transform like this:
print(s.groupby([
'date',
'model'
]).totalsoldcount.sum())
date model
Feb 19 A 110
B 150
Jan 19 A 330
B 50
C 140
Mar 19 B 100
Edit
If you just want to know how many sales were done in the month you can do the same groupby, but instead of sum use count
df['CountSoldByMonth'] = df.groupby([
'Date',
'Model'
]).TotalSoldCount.transform('count')
print(df)
Date Model TotalSoldCount CountSoldByMonth
0 Jan 19 A 4 3
1 Jan 19 A 4 3
2 Jan 19 A 4 3
3 Jan 19 B 6 1
4 Jan 19 C 2 2
5 Jan 19 C 2 2
6 Feb 19 A 4 1
7 Feb 19 B 6 3
8 Feb 19 B 6 3
9 Feb 19 B 6 3
10 Mar 19 B 6 2
11 Mar 19 B 6 2
it's easier to help if you give code that let's the user experiment. In this case, I'd think taking your dataframe (df) & doing the following should work:
df['CountSoldbyMonth'] = df.groupby(['Date','Model'])['TotalSoldCount'].transform('sum')
I have the following dates dataframe:
dates
0 2012 10 4
1
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
6
7 2013 03 19
8 2016 2 5
9 2011 2 19
10
11 2011 05 23
12 2012 04 5
How can I normalize the dates column into:
dates
0 2012 10 04
1
2 2012 01 19
3 2020 06 11
4 2020 10 07
5 2019 11 12
6
7 2013 03 19
8 2016 02 05
9 2011 02 19
10
11 2011 05 23
12 2012 04 05
I tried with regex and splitting and tweaking each column separately. However I am complicating the task. Is it possible to normalize this into the latter dataframe?. The rule is to add a 0 if the year is incomplete or a 20 at the beggining of the string if the year is incomplete, the format is yyyymmdd.
Solution:
x = (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
.str.split(expand=True)
.rename(columns={0:'year',1:'month',2:'day'})
.astype(int)
)
x.loc[x.year <= 50, 'year'] += 2000
df['new'] = pd.to_datetime(x, errors='coerce').dt.strftime('%Y%m%d')
Result:
In [148]: df
Out[148]:
dates new
0 2012 10 4 20121004
1 NaN
2 2012 01 19 20120119
3 20 6 11 20200611
4 20 10 7 20201007
5 19 11 12 20191112
6 NaN
7 2013 03 19 20130319
8 2016 2 5 20160205
9 2011 2 19 20110219
10 NaN
11 2011 05 23 20110523
12 2012 04 5 20120405
Explanation:
In [149]: df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
Out[149]:
0 2012 10 4
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 03 19
8 2016 2 5
9 2011 2 19
11 2011 05 23
12 2012 04 5
Name: dates, dtype: object
In [152]: (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
...: .str.split(expand=True)
...: .rename(columns={0:'year',1:'month',2:'day'})
...: .astype(int))
Out[152]:
year month day
0 2012 10 4
2 2012 1 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 3 19
8 2016 2 5
9 2011 2 19
11 2011 5 23
12 2012 4 5