I have this dataset:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011]
df['month'] = [1,2,3,4,5,6,1,2,3,4,5,6]
df['after'] = [0,0,0,1,1,1,0,0,0,1,1,1]
df['campaign'] = [0,0,0,0,0,0,1,1,1,1,1,1]
df['sales'] = [10000,11000,12000,10500,10000,9500,7000,8000,5000,6000,6000,7000]
And I want a new column, date, that combines year and month into a year-month date. I tried:
df['my_month'] = df['year']*100 + df['month'] + 1
But I'm stuck on what to do next. Any help will be greatly appreciated.
If you need the start date of the month, then:
df['date'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str))
Sample Output
year month after campaign sales date
0 2011 1 0 0 10000 2011-01-01
1 2011 2 0 0 11000 2011-02-01
2 2011 3 0 0 12000 2011-03-01
3 2011 4 1 0 10500 2011-04-01
Edit, as per the comment
When a year-month format is required:
df['date'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str)).dt.to_period('M')
Sample Output
year month after campaign sales date
0 2011 1 0 0 10000 2011-01
1 2011 2 0 0 11000 2011-02
2 2011 3 0 0 12000 2011-03
3 2011 4 1 0 10500 2011-04
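Note that .dt.to_period('M') yields a Period dtype rather than a string. If a plain string such as 2011-01 is wanted instead, a small variation on the same construction using dt.strftime would be:
# Format the parsed dates as 'YYYY-MM' strings instead of Period values
df['date'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str)).dt.strftime('%Y-%m')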
import pandas as pd
from datetime import date
def get_date(year, month):
    # First day of the given month
    return date(year, month, 1)

def create_dataframe():
    df = pd.DataFrame()
    df['year'] = [2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011]
    df['month'] = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
    df['after'] = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
    df['campaign'] = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    df['sales'] = [10000, 11000, 12000, 10500, 10000, 9500, 7000, 8000, 5000, 6000, 6000, 7000]
    df['my_month'] = df.apply(lambda x: get_date(x.year, x.month), axis=1)
    print(df.to_string())

if __name__ == '__main__':
    create_dataframe()
Output
year month after campaign sales my_month
0 2011 1 0 0 10000 2011-01-01
1 2011 2 0 0 11000 2011-02-01
2 2011 3 0 0 12000 2011-03-01
3 2011 4 1 0 10500 2011-04-01
4 2011 5 1 0 10000 2011-05-01
5 2011 6 1 0 9500 2011-06-01
6 2011 1 0 1 7000 2011-01-01
7 2011 2 0 1 8000 2011-02-01
8 2011 3 0 1 5000 2011-03-01
9 2011 4 1 1 6000 2011-04-01
10 2011 5 1 1 6000 2011-05-01
11 2011 6 1 1 7000 2011-06-01
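As a side note, the row-wise apply can be avoided on larger frames: pd.to_datetime accepts a DataFrame with year, month and day columns. A sketch on the same df (this yields datetime64 timestamps rather than datetime.date objects):
# Build first-of-month dates from the year/month columns plus a constant day
df['my_month'] = pd.to_datetime(df[['year', 'month']].assign(day=1))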
I have a dataframe with 3 levels of multi-index columns:
quarter Q1 Q2 Totals
year 2021 2022 2021 2022
qty orders qty orders qty orders qty orders qty orders
month name
January 40 2 5 1 1 2 0 0 46 5
February 20 8 2 3 4 6 0 0 26 17
March 2 10 7 4 3 3 0 0 12 17
Totals 62 20 14 8 8 11 0 0 84 39
After doing a groupby on levels (0, 2), I have the following subtotals dataframe:
quarter Q1 Q2 Totals
qty orders qty orders qty orders
month name
January 45 3 1 2 46 5
February 22 11 4 6 26 17
March 9 14 3 3 12 17
Totals 76 28 8 11 84 39
I need to insert the second into the first, without upsetting the columns, levels or index so that I get the following dataframe:
quarter Q1 Q2 Totals
year 2021 2022 Subtotal 2021 2022 Subtotal
qty orders qty orders qty orders qty orders qty orders qty orders qty orders
month name
January 40 2 5 1 45 3 1 2 0 0 1 2 46 5
February 20 8 2 3 22 11 4 6 0 0 4 6 26 17
March 2 10 7 4 9 14 3 3 0 0 3 3 12 17
Totals 62 20 14 8 76 28 8 11 0 0 8 11 84 39
How do I do this?
With your initial dataframe (before groupby):
import pandas as pd
from itertools import product
df = pd.DataFrame(
    [
        [40, 2, 5, 1, 1, 2, 0, 0],
        [20, 8, 2, 3, 4, 6, 0, 0],
        [2, 10, 7, 4, 3, 3, 0, 0],
        [62, 20, 14, 8, 8, 11, 0, 0],
    ],
    columns=pd.MultiIndex.from_product(
        [("Q1", "Q2"), ("2021", "2022"), ("qty", "orders")]
    ),
    index=["January", "February", "March", "Totals"],
)
Here is one way to do it, using product from the Python standard library's itertools module (imported above); a nested for-loop would also work:
# Add the new subtotal columns
for level1, level2 in product(["Q1", "Q2"], ["qty", "orders"]):
    df.loc[:, (level1, "subtotal", level2)] = (
        df.loc[:, (level1, "2021", level2)] + df.loc[:, (level1, "2022", level2)]
    )

# Sort the columns into the desired order
df = df.reindex(
    pd.MultiIndex.from_product(
        [("Q1", "Q2"), ("2021", "2022", "subtotal"), ("qty", "orders")]
    ),
    axis=1,
)
Then:
print(df)
# Output
Q1 Q2 \
2021 2022 subtotal 2021 2022
qty orders qty orders qty orders qty orders qty orders
January 40 2 5 1 45 3 1 2 0 0
February 20 8 2 3 22 11 4 6 0 0
March 2 10 7 4 9 14 3 3 0 0
Totals 62 20 14 8 76 28 8 11 0 0
subtotal
qty orders
January 1 2
February 4 6
March 3 3
Totals 8 11
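As a side note, the subtotal columns can also be computed without spelling out the years, by summing over the year level of the columns. A sketch, assuming it starts from the frame as built above, before the loop adds the subtotal columns (the sub name is just illustrative):
# Transpose, group the (quarter, measure) column levels, sum over the years, transpose back
sub = df.T.groupby(level=[0, 2]).sum().T
for quarter in ("Q1", "Q2"):
    for measure in ("qty", "orders"):
        df[(quarter, "subtotal", measure)] = sub[(quarter, measure)]
# Then reorder the columns with the same reindex as above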
I have the following problem:
If a year or day does not exist in df2, then a price and a listing_id are still filled in during the merge, but they should be NaN.
The second problem is that when merging, rows that share the same day get the temperature copied onto every match, for example:
d = {'id': [1], 'day': [1], 'temperature': [20], 'year': [2001]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
d2 = {'id': [122, 244], 'day': [1, 1],
'listing_id': [2, 4], 'price': [20, 440], 'year': [2001, 2001]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 1 4 440 2001
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               on='day', how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 1 1 20 2001 4 440 # <-- The second temperature is wrong :/
This should not be so, because if I later have another date from year 2002 which was on day 1 with a temperature of 30 and I want to calculate the average, I get (20 + 20 + 30) / 3 ≈ 23.3. It should be (20 + 30) / 2 = 25. Therefore, if a value has already been matched once, the duplicate row should contain NaN instead.
Code Snippet
d = {'id': [1, 2, 3, 4, 5], 'day': [1, 2, 3, 4, 2],
'temperature': [20, 40, 50, 60, 20], 'year': [2001, 2002, 2004, 2005, 1999]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
1 2 2 40 2002
2 3 3 50 2004
3 4 4 60 2005
4 5 2 20 1999
d2 = {'id': [122, 244, 387, 4454, 521], 'day': [1, 2, 3, 4, 2],
'listing_id': [2, 4, 5, 6, 7], 'price': [20, 440, 500, 6600, 500],
'year': [2001, 2002, 2004, 2005, 2005]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 2 4 440 2002
2 387 3 5 500 2004
3 4454 4 6 6600 2005
4 521 2 7 500 2005
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               on='day', how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 40 2002 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 4 440
6 5 2 20 1999 7 500
What I want
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 NaN 2005 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 NaN NaN
IIUC:
>>> df.merge(df2[['day', 'listing_id', 'price', 'year']],
             on=['day', 'year'], how='outer')
id day temperature year listing_id price
0 1.0 1 20.0 2001 2.0 20.0
1 2.0 2 40.0 2002 4.0 440.0
2 3.0 3 50.0 2004 5.0 500.0
3 4.0 4 60.0 2005 6.0 6600.0
4 5.0 2 20.0 1999 NaN NaN
5 NaN 2 NaN 2005 7.0 500.0
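As a quick sanity check on merges like this, the indicator=True flag labels the origin of every row; a sketch with the same frames:
# _merge is 'both', 'left_only' or 'right_only' for each row
check = df.merge(df2[['day', 'listing_id', 'price', 'year']],
                 on=['day', 'year'], how='outer', indicator=True)
print(check['_merge'].value_counts())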
I have the following type of multi-index dataframe:
import pandas as pd

col3 = [0, 0, 0, 0, 2, 4, 6, 0, 0, 0, 100, 200, 300, 400]
col4 = [0, 0, 0, 0, 4, 6, 8, 0, 0, 0, 200, 900, 400, 500]
d = {'Unit': [1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6],
     'Year': [2014, 2015, 2016, 2017, 2015, 2016, 2017, 2017, 2014, 2015, 2014, 2015, 2016, 2017],
     'col3': col3, 'col4': col4}
df = pd.DataFrame(data=d)
new_df = df.groupby(['Unit', 'Year']).sum()
print(new_df)
col3 col4
Unit Year
1 2014 0 0
2015 0 0
2016 0 0
2017 0 0
2 2015 2 4
2016 4 6
2017 6 8
3 2017 0 0
4 2014 0 0
5 2015 0 0
6 2014 100 200
2015 200 900
2016 300 400
2017 400 500
In reality it is larger, of course, but this does the job. In this dataframe I want to remove all Units which have only one year entry. So I want to have this:
col3 col4
Unit Year
1 2014 0 0
2015 0 0
2016 0 0
2017 0 0
2 2015 2 4
2016 4 6
2017 6 8
6 2014 100 200
2015 200 900
2016 300 400
2017 400 500
Thank you in advance for your help,
Jen
Use GroupBy.transform with 'size' on any column to get the per-group counts, compare for not equal to 1 with Series.ne, and filter by boolean indexing:
df = new_df[new_df.groupby(level=0)['col3'].transform('size').ne(1)]
Or get values of index by Index.get_level_values and filter by Index.duplicated:
df = new_df[new_df.index.get_level_values(0).duplicated(keep=False)]
print(df)
col3 col4
Unit Year
1 2014 0 0
2015 0 0
2016 0 0
2017 0 0
2 2015 2 4
2016 4 6
2017 6 8
6 2014 100 200
2015 200 900
2016 300 400
2017 400 500
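A third option, arguably the most readable though typically slower on large frames, is GroupBy.filter, which drops whole groups that fail a predicate; a sketch with the same new_df:
# Keep only units that appear with more than one year
df = new_df.groupby(level=0).filter(lambda g: len(g) > 1)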
I am trying to find a way to generate a caseid variable across a very large dataset. I would like the caseid variable to do two things: (1) increase by 1 when y = 1 (importantly, caseid's value should increase in the row after y = 1 is observed), and (2) increase by 1 when case changes in value, i.e., from A to B.
Example data is below:
case = pd.Series(['A', 'A', 'A', 'A',
'B', 'B', 'B', 'B',
'C', 'C', 'C', 'C'])
y = pd.Series([0, 1, 0, 0,
0, 1, 0, 0,
0, 0, 1, 0])
year = [2016, 2017, 2018, 2019,
2016, 2017, 2018, 2019,
2016, 2017, 2018, 2019]
caseid = pd.Series([1, 1, 2, 2,
3, 3, 4, 4,
5, 5, 5, 6])
data = {'case': case, 'y': y, 'year': year, 'caseid': caseid}
df = pd.DataFrame(data)
case y year caseid
0 A 0 2016 1
1 A 1 2017 1
2 A 0 2018 2
3 A 0 2019 2
4 B 0 2016 3
5 B 1 2017 3
6 B 0 2018 4
7 B 0 2019 4
8 C 0 2016 5
9 C 0 2017 5
10 C 1 2018 5
11 C 0 2019 6
I would greatly appreciate your generous help!
Use a boolean mask along with Series.cumsum:
df['caseid'] = (df['case'].ne(df['case'].shift(fill_value=df.loc[0, 'case'])) |
                df['y'].shift(fill_value=0).eq(1)).cumsum() + 1
print(df)
case y year caseid
0 A 0 2016 1
1 A 1 2017 1
2 A 0 2018 2
3 A 0 2019 2
4 B 0 2016 3
5 B 1 2017 3
6 B 0 2018 4
7 B 0 2019 4
8 C 0 2016 5
9 C 0 2017 5
10 C 1 2018 5
11 C 0 2019 6
This works:
df['caseid'] = ((~(df.case == df.case.shift())) | (df.y.shift() == 1)).cumsum()
Credits: @Quang Hoang (only a bracket was missing).
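For readers following along, the one-liner can be unpacked into named steps; a sketch against the df built in the question:
# True on the first row and wherever case changes value
new_case = df['case'].ne(df['case'].shift())
# True on the row immediately after y == 1
after_y = df['y'].shift(fill_value=0).eq(1)
# Every True starts a new id; the cumulative sum numbers them 1, 2, 3, ...
df['caseid'] = (new_case | after_y).cumsum()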
Consider the following dataframe:
df = pd.read_csv("data.csv")
print(df)
Category Year Month Count1 Count2
0 a 2017 December 5 9
1 a 2018 January 3 5
2 b 2017 October 7 6
3 b 2017 November 4 1
4 b 2018 March 3 3
I want to achieve this:
Category Year Month Count1 Count2
0 a 2017 October
1 a 2017 November
2 a 2017 December 5 9
3 a 2018 January 3 5
4 a 2018 February
5 a 2018 March
6 b 2017 October 7 6
7 b 2017 November 4 1
8 b 2017 December
9 b 2018 January
10 b 2018 February
11 b 2018 March 3 3
Here is what I've done so far:
months = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6, "July": 7, "August": 8, "September": 9, "October": 10, "November": 11, "December": 12}
df["Date"] = pd.to_datetime(10000 * df["Year"] + 100 * df["Month"].apply(months.get) + 1, format="%Y%m%d")
date_min = df["Date"].min()
date_max = df["Date"].max()
new_index = pd.MultiIndex.from_product([df["Category"].unique(), pd.date_range(date_min, date_max, freq="M")], names=["Category", "Date"])
df = df.set_index(["Category", "Date"]).reindex(new_index).reset_index()
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month_name()
df = df[["Category", "Year", "Month", "Count1", "Count2"]]
In the resulting dataframe, the last month (March) is missing and all "Count1" and "Count2" values are NaN.
This is complicated by the fact that you want to fill the category as well as the missing dates. One solution is to create a separate data frame for each category and then concatenate them all together.
df['Date'] = pd.to_datetime('1 '+df.Month.astype(str)+' '+df.Year.astype(str))
df_ix = pd.Series(1, index=df.Date.sort_values()).resample('MS').first().reset_index()
df_list = []
for cat in df.Category.unique():
    # Note: query uses @ (not #) to reference the local variable cat
    df_temp = (df.query('Category == @cat')
               .merge(df_ix, on='Date', how='right')
               .get(['Date', 'Category', 'Count1', 'Count2'])
               .sort_values('Date'))
    df_temp.Category = cat
    df_temp = df_temp.fillna(0)
    df_temp.loc[:, ['Count1', 'Count2']] = df_temp.get(['Count1', 'Count2']).astype(int)
    df_list.append(df_temp)
df2 = pd.concat(df_list, ignore_index=True)
df2['Month'] = df2.Date.apply(lambda x: x.strftime('%B'))
df2['Year'] = df2.Date.apply(lambda x: x.year)
df2.drop('Date', axis=1)
# returns:
Category Count1 Count2 Month Year
0 a 0 0 October 2017
1 a 0 0 November 2017
2 a 5 9 December 2017
3 a 3 5 January 2018
4 a 0 0 February 2018
5 a 0 0 March 2018
6 b 7 6 October 2017
7 b 4 1 November 2017
8 b 0 0 December 2017
9 b 0 0 January 2018
10 b 0 0 February 2018
11 b 3 3 March 2018
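For completeness: the reindex attempt in the question most likely fails only because freq="M" in pd.date_range produces month-end dates, while the Date column built from year and month holds month starts, so nothing aligns (hence the NaN values and the missing March). A sketch of the one-line fix, using month-start frequency:
# 'MS' yields first-of-month dates, matching the constructed Date column
new_index = pd.MultiIndex.from_product(
    [df["Category"].unique(), pd.date_range(date_min, date_max, freq="MS")],
    names=["Category", "Date"],
)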