Create a new column in pandas dataframe based on multiple conditions

Create a new column in pandas dataframe based on multiple conditions - python

I have a dataframe like the one below, and I have to create a new column year_val that is equal to the values of col2016 through col2019 based on the Years column, so that the value for year_val will be the value of col#### when Years is equal to the suffix of col####
import pandas as pd
sampleDF = pd.DataFrame({'Years':[2016,2016,2017,2017,2018,2018,2019,2019],
'col2016':[1,2,3,4,5,6,7,8],
'col2017':[9,10,11,12,13,14,15,16],
'col2018':[17,18,19,20,21,22,23,24],
'col2019':[25,26,27,28,29,30,31,32]})
sampleDF['year_val'] = ?????

Use DataFrame.lookup with change values in Years column with prepend col and cast to string:
sampleDF['year_val'] = sampleDF.lookup(sampleDF.index, 'col' + sampleDF['Years'].astype(str))
print (sampleDF)
Years col2016 col2017 col2018 col2019 year_val
0 2016 1 9 17 25 1
1 2016 2 10 18 26 2
2 2017 3 11 19 27 11
3 2017 4 12 20 28 12
4 2018 5 13 21 29 21
5 2018 6 14 22 30 22
6 2019 7 15 23 31 31
7 2019 8 16 24 32 32
EDIT: If check definition of lookup function:
result = [df.get_value(row, col) for row, col in zip(row_labels, col_labels)]
you can modify it with try-except statement with Series.at for prevent:
FutureWarning: get_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
oup.append(sampleDF.at[row, col] )
sampleDF = pd.DataFrame({'Years':[2015,2016,2017,2017,2018,2018,2019,2019],
'col2016':[1,2,3,4,5,6,7,8],
'col2017':[9,10,11,12,13,14,15,16],
'col2018':[17,18,19,20,21,22,23,24],
'col2019':[25,26,27,28,29,30,31,32]})
print (sampleDF)
Years col2016 col2017 col2018 col2019
0 2015 1 9 17 25
1 2016 2 10 18 26
2 2017 3 11 19 27
3 2017 4 12 20 28
4 2018 5 13 21 29
5 2018 6 14 22 30
6 2019 7 15 23 31
7 2019 8 16 24 32
out= []
for row, col in zip(sampleDF.index, 'col' + sampleDF['Years'].astype(str)):
try:
out.append(sampleDF.at[row, col] )
except KeyError:
out.append(np.nan)
sampleDF['year_val'] = out
print (sampleDF)
Years col2016 col2017 col2018 col2019 year_val
0 2015 1 9 17 25 NaN
1 2016 2 10 18 26 2.0
2 2017 3 11 19 27 11.0
3 2017 4 12 20 28 12.0
4 2018 5 13 21 29 21.0
5 2018 6 14 22 30 22.0
6 2019 7 15 23 31 31.0
7 2019 8 16 24 32 32.0

Related

combining specific row conditionally and add output to existing row in pandas

suppose I have following data frame :
data = {'age' :[10,11,12,11,11,10,11,13,13,13,14,14,15,15,15],
'num1':[10,11,12,13,14,15,16,17,18,19,20,21,22,23,24],
'num2':[20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]}
df = pd.DataFrame(data)
I want to sum rows for age 14 and 15 and keep those new values as age 14. my expected output would be like this:
age time1 time2
1 10 10 20
2 11 11 21
3 12 12 22
4 11 13 23
5 11 14 24
6 10 15 25
7 11 16 26
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
in the code below, I have tried to group.by age but it does not work for me:
df1 =df.groupby(age[age >=14])['num1', 'num2'].apply(', '.join).reset_index(drop=True).to_frame()

limit_age = 14
new = df.query("age < #limit_age").copy()
new.loc[len(new)] = [limit_age,
*df.query("age >= #limit_age").drop(columns="age").sum()]
first get the "before 14" dataframe
then assign it to a new row where
age is 14
other values are the row-wise sums of "after 14" dataframe
to get
>>> new
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160
(new.index += 1 can be used for a 1-based index at the end.)

I would use a mask and concat:
m = df['age'].isin([14, 15])
out = pd.concat([df[~m],
df[m].agg({'age': 'min', 'num1': 'sum', 'num2': 'sum'})
.to_frame().T
], ignore_index=True)
Output:
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160

How to calculate cumulative sum in python using pandas of all the columns except the first one that contain names?

Here's the data in csv format:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
Jack 1 15 25 3 5 11 5 8 3
Jill 5 10 32 5 5 14 6 8 7
I don't want Name column to be include as it gives an error.
I tried
df.cumsum()

Try with set_index and reset_index to keep the name column:
df.set_index('Name').cumsum().reset_index()
Output:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Jack 1 15 25 3 5 11 5 8 3
1 Jill 6 25 57 8 10 25 11 16 10

Fill column with value from previous year from the same month

How can I use the value from the same month in the previous year to fill values in the following table for 2020:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
The final desired result is the following:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
B 4 2020 40
A 4 2020 40
A 5 2020 20
A 6 2020 15
A 7 2020 17
A 8 2020 18
A 9 2020 12
A 10 2020 11
A 11 2020 19
A 12 2020 15
I tried using pandas groupby but not sure if that is the right approach.

IIUC we use the pivot then ffill with stack
s=df.pivot_table(index=['Category','Year'],columns='Month',values='Value').groupby(level=0).ffill().stack().reset_index()
Category Year level_2 0
0 A 2019 1 15.0
1 A 2019 2 90.0
2 A 2019 3 50.0
3 A 2019 5 20.0
4 A 2019 6 15.0
5 A 2019 7 17.0
6 A 2019 8 18.0
7 A 2019 9 12.0
8 A 2019 10 11.0
9 A 2019 11 19.0
10 A 2019 12 15.0
11 A 2020 1 18.0
12 A 2020 2 53.0
13 A 2020 3 80.0
14 A 2020 5 20.0
15 A 2020 6 15.0
16 A 2020 7 17.0
17 A 2020 8 18.0
18 A 2020 9 12.0
19 A 2020 10 11.0
20 A 2020 11 19.0
21 A 2020 12 15.0
22 B 2019 2 20.0
23 B 2019 4 40.0

You can accomplish this with a combination of loc, concat, and drop_duplicates.
The idea here is to concatenate the dataframe with a copy of the 2019 data where year is changed to 2020, and then only keeping the first value for Category, Month, Year.
df2 = df.loc[df['Year'] == 2019, :]
df2['Year'] = 2020
pd.concat([df, df2]).drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
Output
Category Month Year Value
0 A 1 2019 15
1 B 2 2019 20
2 A 2 2019 90
3 A 3 2019 50
4 B 4 2019 40
5 A 5 2019 20
6 A 6 2019 15
7 A 7 2019 17
8 A 8 2019 18
9 A 9 2019 12
10 A 10 2019 11
11 A 11 2019 19
12 A 12 2019 15
13 A 1 2020 18
14 A 2 2020 53
15 A 3 2020 80
1 B 2 2020 20
4 B 4 2020 40
5 A 5 2020 20
6 A 6 2020 15
7 A 7 2020 17
8 A 8 2020 18
9 A 9 2020 12
10 A 10 2020 11
11 A 11 2020 19
12 A 12 2020 15

How do I find the count of a particular column [Model], based on another column [SoldDate] using pandas?

I have a dataframe with 3 columns, such as SoldDate,Model and TotalSoldCount. How do I create a new column, 'CountSoldbyMonth' which will give the count of each of the many models sold monthly? A screenshot describing the problem is given.
The ‘CountSoldbyMonth’ should always be less than the ‘TotalSoldCount’.
I am new to Python.
enter image description here
Date Model TotalSoldCount
Jan 19 A 4
Jan 19 A 4
Jan 19 A 4
Jan 19 B 6
Jan 19 C 2
Jan 19 C 2
Feb 19 A 4
Feb 19 B 6
Feb 19 B 6
Feb 19 B 6
Mar 19 B 6
Mar 19 B 6
The new df should look like this.
Date Model TotalSoldCount CountSoldbyMonth
Jan 19 A 4 3
Jan 19 A 4 3
Jan 19 A 4 3
Jan 19 B 6 1
Jan 19 C 2 2
Jan 19 C 2 2
Feb 19 A 4 1
Feb 19 B 6 3
Feb 19 B 6 3
Feb 19 B 6 3
Mar 19 B 6 2
Mar 19 B 6 2
I tried doing
df['CountSoldbyMonth'] = df.groupby(['date','model']).totalsoldcount.transform('sum')
but it is generating a different value.

Suppose you have this data set:
date model totalsoldcount
0 Jan 19 A 110
1 Jan 19 A 110
2 Jan 19 A 110
3 Jan 19 B 50
4 Jan 19 C 70
5 Jan 19 C 70
6 Feb 19 A 110
7 Feb 19 B 50
8 Feb 19 B 50
9 Feb 19 B 50
10 Mar 19 B 50
11 Mar 19 B 50
And you want to define a new column, countsoldbymonth. You can groupby the date and model columns and then sum the totalsoldcount with a transform and then create the new column:
s['countsoldbymonth'] = s.groupby([
'date',
'model'
]).totalsoldcount.transform('sum')
print(s)
date model totalsoldcount countsoldbymonth
0 Jan 19 A 110 330
1 Jan 19 A 110 330
2 Jan 19 A 110 330
3 Jan 19 B 50 50
4 Jan 19 C 70 140
5 Jan 19 C 70 140
6 Feb 19 A 110 110
7 Feb 19 B 50 150
8 Feb 19 B 50 150
9 Feb 19 B 50 150
10 Mar 19 B 50 100
11 Mar 19 B 50 100
Or, if you just want to see the sums without creating a new column you can use sum instead of transform like this:
print(s.groupby([
'date',
'model'
]).totalsoldcount.sum())
date model
Feb 19 A 110
B 150
Jan 19 A 330
B 50
C 140
Mar 19 B 100
Edit
If you just want to know how many sales were done in the month you can do the same groupby, but instead of sum use count
df['CountSoldByMonth'] = df.groupby([
'Date',
'Model'
]).TotalSoldCount.transform('count')
print(df)
Date Model TotalSoldCount CountSoldByMonth
0 Jan 19 A 4 3
1 Jan 19 A 4 3
2 Jan 19 A 4 3
3 Jan 19 B 6 1
4 Jan 19 C 2 2
5 Jan 19 C 2 2
6 Feb 19 A 4 1
7 Feb 19 B 6 3
8 Feb 19 B 6 3
9 Feb 19 B 6 3
10 Mar 19 B 6 2
11 Mar 19 B 6 2

it's easier to help if you give code that let's the user experiment. In this case, I'd think taking your dataframe (df) & doing the following should work:
df['CountSoldbyMonth'] = df.groupby(['Date','Model'])['TotalSoldCount'].transform('sum')

How to normalize the following dates inside a pandas dataframe?

I have the following dates dataframe:
dates
0 2012 10 4
1
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
6
7 2013 03 19
8 2016 2 5
9 2011 2 19
10
11 2011 05 23
12 2012 04 5
How can I normalize the dates column into:
dates
0 2012 10 04
1
2 2012 01 19
3 2020 06 11
4 2020 10 07
5 2019 11 12
6
7 2013 03 19
8 2016 02 05
9 2011 02 19
10
11 2011 05 23
12 2012 04 05
I tried with regex and splitting and tweaking each column separately. However I am complicating the task. Is it possible to normalize this into the latter dataframe?. The rule is to add a 0 if the year is incomplete or a 20 at the beggining of the string if the year is incomplete, the format is yyyymmdd.

Solution:
x = (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
.str.split(expand=True)
.rename(columns={0:'year',1:'month',2:'day'})
.astype(int)
)
x.loc[x.year <= 50, 'year'] += 2000
df['new'] = pd.to_datetime(x, errors='coerce').dt.strftime('%Y%m%d')
Result:
In [148]: df
Out[148]:
dates new
0 2012 10 4 20121004
1 NaN
2 2012 01 19 20120119
3 20 6 11 20200611
4 20 10 7 20201007
5 19 11 12 20191112
6 NaN
7 2013 03 19 20130319
8 2016 2 5 20160205
9 2011 2 19 20110219
10 NaN
11 2011 05 23 20110523
12 2012 04 5 20120405
Explanation:
In [149]: df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
Out[149]:
0 2012 10 4
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 03 19
8 2016 2 5
9 2011 2 19
11 2011 05 23
12 2012 04 5
Name: dates, dtype: object
In [152]: (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
...: .str.split(expand=True)
...: .rename(columns={0:'year',1:'month',2:'day'})
...: .astype(int))
Out[152]:
year month day
0 2012 10 4
2 2012 1 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 3 19
8 2016 2 5
9 2011 2 19
11 2011 5 23
12 2012 4 5

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Create a new column in pandas dataframe based on multiple conditions - python

Related

combining specific row conditionally and add output to existing row in pandas

How to calculate cumulative sum in python using pandas of all the columns except the first one that contain names?

Fill column with value from previous year from the same month

How do I find the count of a particular column [Model], based on another column [SoldDate] using pandas?

How to normalize the following dates inside a pandas dataframe?

Categories

Resources