How to concat two dataframes in python

I have two data frames, and I want to join them so that I can check the quantity for a given week in every year in a single data frame.
df1= City Week qty Year
hyd 35 10 2015
hyd 36 15 2015
hyd 37 11 2015
hyd 42 10 2015
hyd 23 10 2016
hyd 32 15 2016
hyd 37 11 2017
hyd 42 10 2017
pune 35 10 2015
pune 36 15 2015
pune 37 11 2015
pune 42 10 2015
pune 23 10 2016
pune 32 15 2016
pune 37 11 2017
pune 42 10 2017
df2= city Week qty Year
hyd 23 10 2015
hyd 32 15 2015
hyd 35 12 2016
hyd 36 15 2016
hyd 37 11 2016
hyd 42 10 2016
hyd 43 12 2016
hyd 44 18 2016
hyd 35 11 2017
hyd 36 15 2017
hyd 37 11 2017
hyd 42 10 2017
hyd 51 14 2017
hyd 52 17 2017
pune 35 12 2016
pune 36 15 2016
pune 37 11 2016
pune 42 10 2016
pune 43 12 2016
pune 44 18 2016
pune 35 11 2017
pune 36 15 2017
pune 37 11 2017
pune 42 10 2017
pune 51 14 2017
pune 52 17 2017
I want to join the two data frames as shown in the result: for each city, I want to append the quantity for that week in every year into a single data frame.
city Week qty Year y2016_wk qty y2017_wk qty y2015_week qty
hyd 35 10 2015 2016_35 12 2017_35 11 nan nan
hyd 36 15 2015 2016_36 15 2017_36 15 nan nan
hyd 37 11 2015 2016_37 11 2017_37 11 nan nan
hyd 42 10 2015 2016_42 10 2017_42 10 nan nan
hyd 23 10 2016 nan nan 2017_23 x 2015_23 10
hyd 32 15 2016 nan nan 2017_32 y 2015_32 15
hyd 37 11 2017 2016_37 11 nan nan 2015_37 x
hyd 42 10 2017 2016_42 10 nan nan 2015_42 y
pune 35 10 2015 2016_35 12 2017_35 11 nan nan
pune 36 15 2015 2016_36 15 2017_36 15 nan nan
pune 37 11 2015 2016_37 11 2017_37 11 nan nan
pune 42 10 2015 2016_42 10 2017_42 10 nan nan

You can break down your task into a few steps:
1. Combine your dataframes df1 and df2.
2. Create a list of dataframes from the combined dataframe, splitting by year. At the same time, rename the columns to reflect the year and set the index to Week.
3. Finally, concatenate along axis=1 and reset_index.
Here is an example:
import pandas as pd
df = pd.concat([df1, df2], ignore_index=True)
# Build one renamed, Week-indexed frame per year
dfs = [df[df['Year'] == y]
       .rename(columns=lambda x: x + '_' + str(y) if x != 'Week' else x)
       .set_index('Week')
       for y in df['Year'].unique()]
res = pd.concat(dfs, axis=1).reset_index()
Result:
print(res)
Week qty_2015 Year_2015 qty_2016 Year_2016 qty_2017 Year_2017
0 35 10.0 2015.0 12.0 2016.0 11.0 2017.0
1 36 15.0 2015.0 15.0 2016.0 15.0 2017.0
2 37 11.0 2015.0 11.0 2016.0 11.0 2017.0
3 42 10.0 2015.0 10.0 2016.0 10.0 2017.0
4 43 NaN NaN 12.0 2016.0 NaN NaN
5 44 NaN NaN 18.0 2016.0 NaN NaN
6 51 NaN NaN NaN NaN 14.0 2017.0
7 52 NaN NaN NaN NaN 17.0 2017.0
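One caveat: the result above reflects a single city. With both hyd and pune in df, Week values repeat within each year, and concatenating along axis=1 on a duplicate index will fail. A minimal sketch that keeps City in the index instead, assuming df2's lowercase city column has first been renamed to City, and noting that df1 and df2 overlap for 2017:
# Keep (City, Week) unique per year; drop rows duplicated across df1/df2
dfs = [df[df['Year'] == y]
       .drop_duplicates(subset=['City', 'Week'])
       .rename(columns=lambda x: x + '_' + str(y) if x not in ('City', 'Week') else x)
       .set_index(['City', 'Week'])
       for y in df['Year'].unique()]
res = pd.concat(dfs, axis=1).reset_index()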

Personally, I don't think your example output is that readable, so unless you need that format for a specific reason, I would consider using a pivot table. I also think the code required is cleaner.
import pandas as pd
df3 = pd.concat([df1, df2], ignore_index=True)
# pivot requires unique (Week, Year) pairs, so this assumes a single city
df4 = df3.pivot(index='Week', columns='Year', values='qty')
print(df4)
Year 2015 2016 2017
Week
35 10.0 12.0 11.0
36 15.0 15.0 15.0
37 11.0 11.0 11.0
42 10.0 10.0 10.0
43 NaN 12.0 NaN
44 NaN 18.0 NaN
51 NaN NaN 14.0
52 NaN NaN 17.0
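The same caveat applies once both cities are included: pivot raises on duplicate (Week, Year) pairs, whereas pivot_table aggregates them (mean by default). A sketch keeping City as an extra index level, again assuming a consistent 'City' column name:
# pivot_table tolerates duplicates by aggregating; index is (City, Week)
df5 = pd.pivot_table(df3, values='qty', index=['City', 'Week'], columns='Year')
print(df5)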

Related

Pandas drop nan in a specific row ('Feb-29') and shift remaining rows up

I have a pandas dataframe containing several years of time series data as columns. Each series starts in November and ends in the subsequent year. I'm trying to deal with NaNs in non-leap years. The structure can be recreated with something like this:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
ndays = 151
sdates = [(datetime(2019,11,1) + timedelta(days=x)).strftime("%b-%d") for x in range(ndays)]
columns=list(range(2016,2021))
df = pd.DataFrame(np.random.randint(0,100,size=(ndays, len(columns))), index=sdates, columns=columns)
df.loc["Feb-29",2017:2019] = np.nan
df.loc["Feb-28":"Mar-01"]
Out[61]:
2016 2017 2018 2019 2020
Feb-28 36 59.0 74.0 19.0 24
Feb-29 85 NaN NaN NaN 6
Mar-01 24 75.0 49.0 99.0 82
What I want to do is remove the "Feb-29" NaN data only (in the non-leap years) and then shift the data in those columns up a row, leaving the leap years as-is. Something like this, with Mar-01 and subsequent rows shifted up for 2017 through 2019:
2016 2017 2018 2019 2020
Feb-28 36 59.0 74.0 19.0 24
Feb-29 85 75.0 49.0 99.0 6
Mar-01 24 42.0 21.0 41.0 82
I don't care that "Mar-01" data will be labelled as "Feb-29" as eventually I'll be replacing the string date index with an integer index.
Note that I didn't include this in the example, but I have NaNs at the start and end of the dataframe in varying rows that I do not want to remove (i.e., I can't just remove all NaN data; I need to target "Feb-29" specifically).
It sounds like you don't actually want to shift dates up, but rather number them correctly based on the day of the year? If so, this will work:
First, make the DataFrame long instead of wide:
import pandas as pd

df = pd.DataFrame(
    {
        "2016": {"Feb-28": 36, "Feb-29": 85, "Mar-01": 24},
        "2017": {"Feb-28": 59.0, "Feb-29": None, "Mar-01": 75.0},
        "2018": {"Feb-28": 74.0, "Feb-29": None, "Mar-01": 49.0},
        "2019": {"Feb-28": 19.0, "Feb-29": None, "Mar-01": 99.0},
        "2020": {"Feb-28": 24, "Feb-29": 6, "Mar-01": 82},
    }
)
df = (
    df.melt(ignore_index=False, var_name="year", value_name="value")
    .reset_index()
    .rename(columns={"index": "month-day"})
)
df
month-day year value
0 Feb-28 2016 36.0
1 Feb-29 2016 85.0
2 Mar-01 2016 24.0
3 Feb-28 2017 59.0
4 Feb-29 2017 NaN
5 Mar-01 2017 75.0
6 Feb-28 2018 74.0
7 Feb-29 2018 NaN
8 Mar-01 2018 49.0
9 Feb-28 2019 19.0
10 Feb-29 2019 NaN
11 Mar-01 2019 99.0
12 Feb-28 2020 24.0
13 Feb-29 2020 6.0
14 Mar-01 2020 82.0
Then remove rows containing an invalid date and get the day of the year for remaining days:
df["date"] = pd.to_datetime(
df.apply(lambda row: " ".join(row[["year", "month-day"]]), axis=1), errors="coerce",
)
df = df[df["date"].notna()]
df["day_of_year"] = df["date"].dt.dayofyear
df
month-day year value date day_of_year
0 Feb-28 2016 36.0 2016-02-28 59
1 Feb-29 2016 85.0 2016-02-29 60
2 Mar-01 2016 24.0 2016-03-01 61
3 Feb-28 2017 59.0 2017-02-28 59
5 Mar-01 2017 75.0 2017-03-01 60
6 Feb-28 2018 74.0 2018-02-28 59
8 Mar-01 2018 49.0 2018-03-01 60
9 Feb-28 2019 19.0 2019-02-28 59
11 Mar-01 2019 99.0 2019-03-01 60
12 Feb-28 2020 24.0 2020-02-28 59
13 Feb-29 2020 6.0 2020-02-29 60
14 Mar-01 2020 82.0 2020-03-01 61
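From here, if a wide layout is still wanted, the integer day of the year can serve as the index the question mentions switching to eventually; a small sketch, assuming the long df built above:
# One column per year, rows aligned on the integer day of the year
wide = df.pivot(index="day_of_year", columns="year", values="value")
print(wide)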
I think this would do the trick?
import numpy as np
import pandas as pd

# Columns with a real value on Feb-29 are the leap years
is_leap = ~np.isnan(df.loc["Feb-29"])
leap = df.loc[:, is_leap]
nonleap = df.loc[:, ~is_leap]
# Drop Feb-29 from non-leap columns, then realign everything by position
nonleap = nonleap.drop(index="Feb-29")
df2 = pd.concat([
    leap.reset_index(drop=True),
    nonleap.reset_index(drop=True)
], axis=1).sort_index(axis=1)
This should move all the nulls to the end by sorting on pd.isnull, which I believe is what you want:
# note: this moves every NaN in each column to the end, not just Feb-29
df = df.apply(lambda x: sorted(x, key=pd.isnull), axis=0)
Before:
2016 2017 2018 2019 2020
Feb-28 35 85.0 88.0 46.0 19
Feb-29 41 NaN NaN NaN 52
Mar-01 86 92.0 29.0 44.0 36
After:
2016 2017 2018 2019 2020
Feb-28 35 85.0 88.0 46.0 19
Feb-29 41 92.0 29.0 44.0 52
Mar-01 86 32.0 50.0 27.0 36

Set "Year" column to individual columns to create a panel

I am trying to reshape the following dataframe into panel data form by moving the "Year" column so that each year is an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want the years to each be individual columns; this is an example:
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack but I don't think I want a multilevel index as a result. I have been looking through the documentation for to_frame etc., but I can't see what I am looking for.
If anyone can help that would be great!
Use set_index with append=True, then select the column '0' and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
pivot_table can help.
df2 = pd.pivot_table(df, values='0', columns='Award Year', index=['State'])
df2
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0

Adding column in pandas based on values from other columns with conditions

I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sale, or NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment, I am grouping on the unit, keeping units that have several rows, and extracting the information associated with the minimal date for those units. I then join this table with my original table, keeping only the rows whose dates differ between the two merged tables.
I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix and join to append the new DataFrame to the original:
# if the real data are not sorted, sort first:
# df = df.sort_values(['unit', 'year', 'month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print(df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN

Inserting Index along with column values from one dataframe to another

If I have two data frames df1 and df2:
df1
yr
24 1984
30 1985
df2
d m yr
16 12 4 2012
17 13 10 1976
18 24 4 98
I would like to have a dataframe that gives the output below. Could you help with the function that would achieve this?
d m yr
16 12 4 2012
17 13 10 1976
18 24 4 98
24 NaN NaN 1984
30 NaN NaN 1985
You are looking to concat two dataframes:
res = pd.concat([df2, df1], sort=False)
print(res)
d m yr
16 12.0 4.0 2012
17 13.0 10.0 1976
18 24.0 4.0 98
24 NaN NaN 1984
30 NaN NaN 1985
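As a side effect of the NaNs, d and m are upcast to float (12.0, 4.0, and so on). If keeping them as integers matters, one option is pandas' nullable Int64 dtype; a sketch:
# Nullable integers tolerate the NaN rows introduced by df1
res = pd.concat([df2, df1], sort=False).astype({'d': 'Int64', 'm': 'Int64'})
print(res)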

Rolling Mean with Groupby object in Pandas returns null

I have this initial DataFrame in Pandas
A B C D E
0 23 2015 1 14937 16.25
1 23 2015 1 19054 7.50
2 23 2015 2 14937 16.75
3 23 2015 2 19054 17.25
4 23 2015 3 14937 71.75
5 23 2015 3 19054 15.00
6 23 2015 4 14937 13.00
7 23 2015 4 19054 37.75
8 23 2015 5 14937 4.25
9 23 2015 5 19054 18.25
10 23 2015 6 14937 16.50
11 23 2015 6 19054 1.00
I create a GroupBy object because I would like to obtain a rolling mean grouped by columns A, B, C, D:
DfGby = Df.groupby(['A','B', 'C','D'])
Then I compute the rolling mean:
DfMean = pd.DataFrame(DfGby.rolling(center=False,window=3)['E'].mean())
But I obtain
E
A B C D
23 2015 1 14937 0 NaN
19054 1 NaN
2 14937 2 NaN
19054 3 NaN
3 14937 4 NaN
19054 5 NaN
4 14937 6 NaN
19054 7 NaN
5 14937 8 NaN
19054 9 NaN
6 14937 10 NaN
19054 11 NaN
What is the problem here?
If I want to obtain this result, how could I do it?
A B C D E
0 23 2015 1 14937 NaN
1 23 2015 2 14937 NaN
2 23 2015 2 14937 16.6
3 23 2015 1 14937 35.1
4 23 2015 2 14937 33.8
5 23 2015 3 14937 29.7
6 23 2015 4 14937 11.3
7 23 2015 4 19054 NaN
8 23 2015 5 19054 NaN
9 23 2015 5 19054 13.3
10 23 2015 6 19054 23.3
11 23 2015 6 19054 23.7
12 23 2015 6 19054 19.0
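A likely explanation, offered as a sketch rather than a definitive answer: grouping by all of A, B, C, and D makes every group a single row, so a rolling window of 3 never has enough observations and each mean comes out NaN. Grouping by ['A', 'B', 'D'] only, so that each group spans the C values, produces non-null means once a group reaches three rows:
# Each (A, B, D) group now holds six rows ordered by C, so the window fills up
DfMean = (Df.groupby(['A', 'B', 'D'])['E']
            .rolling(center=False, window=3)
            .mean()
            .reset_index())
print(DfMean)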
