Inserting Index along with column values from one dataframe to another - python

If I have two data frames df1 and df2:
df1
yr
24 1984
30 1985
df2
d m yr
16 12 4 2012
17 13 10 1976
18 24 4 98
I would like to have a dataframe that gives the output below. Which function could help me achieve this?
d m yr
16 12 4 2012
17 13 10 1976
18 24 4 98
24 NaN NaN 1984
30 NaN NaN 1985

You are looking to concat two dataframes:
res = pd.concat([df2, df1], sort=False)
print(res)
d m yr
16 12.0 4.0 2012
17 13.0 10.0 1976
18 24.0 4.0 98
24 NaN NaN 1984
30 NaN NaN 1985
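A self-contained reproduction of the answer, building the two frames from the question (index values taken from the example). Because df1 only has the yr column, concat fills the missing d and m cells with NaN:

```python
import pandas as pd

# df1 has only 'yr'; df2 has 'd', 'm', 'yr'
df1 = pd.DataFrame({"yr": [1984, 1985]}, index=[24, 30])
df2 = pd.DataFrame({"d": [12, 13, 24], "m": [4, 10, 4],
                    "yr": [2012, 1976, 98]}, index=[16, 17, 18])

# sort=False keeps df2's column order instead of sorting columns alphabetically
res = pd.concat([df2, df1], sort=False)
```

Note that d and m become float columns in the result, because integer columns cannot hold NaN.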

Related

Python/Pandas outer merge not including all relevant columns

I want to merge the following 2 data frames in Pandas but the result isn't containing all the relevant columns:
L1aIn[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L1aIn
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a 2021 3 29 1
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a 2021 3 29 1
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b 2021 3 29 1
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a 2021 3 29 1
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a 2021 3 29 1
L2Std[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L2Std
0 oco2_L2StdGL_35861a_210329_B10206r_21042704283... 35861 GL a 2021 3 29 1
1 oco2_L2StdXS_35860a_210329_B10206r_21042700342... 35860 XS a 2021 3 29 1
2 oco2_L2StdND_35852a_210329_B10206r_21042622540... 35852 ND a 2021 3 29 1
3 oco2_L2StdGL_35862a_210329_B10206r_21042622403... 35862 GL a 2021 3 29 1
4 oco2_L2StdTG_35856a_210329_B10206r_21042622422... 35856 TG a 2021 3 29 1
>>> df = L1aIn.copy(deep=True)
>>> df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a ... NaN NaN NaN NaN
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a ... NaN NaN NaN NaN
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b ... NaN NaN NaN NaN
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a ... NaN NaN NaN NaN
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a ... NaN NaN NaN NaN
5 NaN 35861 GL a ... 2021.0 3.0 29.0 1.0
6 NaN 35860 XS a ... 2021.0 3.0 29.0 1.0
7 NaN 35852 ND a ... 2021.0 3.0 29.0 1.0
8 NaN 35862 GL a ... 2021.0 3.0 29.0 1.0
9 NaN 35856 TG a ... 2021.0 3.0 29.0 1.0
[10 rows x 13 columns]
>>> df.columns
Index(['Filename', 'OrbitNumber', 'OrbitMode', 'OrbitModeCounter', 'Year',
'Month', 'Day', 'L1aIn'],
dtype='object')
I want the resulting merged table to include both the "L1aIn" and "L2Std" columns, but as you can see it doesn't; it only picks up the original columns from L1aIn.
I'm also puzzled about why it seems to be returning a dataframe object rather than None.
A toy example works fine for me, but the real-life one does not. What circumstances provoke this kind of behavior for merge?
merge returns a new DataFrame rather than modifying df in place, so it seems you just need to assign the output to a variable:
merged_df = df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
print(merged_df.columns)
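A toy sketch of the same point, with made-up rows shaped like the question's data: merge never modifies its caller, and an outer merge keeps the value columns from both sides.

```python
import pandas as pd

# Minimal stand-ins for L1aIn and L2Std (values invented for illustration)
L1aIn = pd.DataFrame({"OrbitNumber": [35863, 35862, 35861],
                      "OrbitMode": ["DP", "DP", "DP"],
                      "OrbitModeCounter": ["a", "a", "b"],
                      "L1aIn": [1, 1, 1]})
L2Std = pd.DataFrame({"OrbitNumber": [35861, 35860],
                      "OrbitMode": ["GL", "XS"],
                      "OrbitModeCounter": ["a", "a"],
                      "L2Std": [1, 1]})

# The result must be captured; L1aIn itself is left untouched
merged_df = L1aIn.merge(L2Std, how="outer",
                        on=["OrbitNumber", "OrbitMode", "OrbitModeCounter"])
```

Inspecting merged_df.columns (not L1aIn.columns) shows both L1aIn and L2Std present.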

cells full of NaN with the use of stack() in python to convert headers to rows

I have this dataset: two columns sit under a single header (age, sex) and the other four columns sit under two headers (wk1 and wk2):
wk1 wk2
ID age sex Note1 Note2 Note1 Note2
1123 22 M 10 22 233 2
1198 34 M 9 4 44 23
101 28 F 3 6 3 43
When I use pd.read_excel('file', header=[0,1], index_col=0).stack(0).reset_index()
I get this result:
ID level_0 level_1 Note 1 Note 2 age sex
0 1123 Unnamed: 1_level_0 NaN NaN 22.0 NaN
1 1123 Unnamed: 2_level_0 NaN NaN NaN M
2 1123 wk1 10.0 22.0 NaN NaN
3 1123 wk2 233.0 2.0 NaN NaN
4 1198 Unnamed: 1_level_0 NaN NaN 34.0 NaN
5 1198 Unnamed: 2_level_0 NaN NaN NaN M
6 1198 wk1 9.0 4.0 NaN NaN
...
What I want to get as results is :
ID age sex Wk Note 1 Note 2
1123 22 M wk1 10 22
1198 34 M wk1 9 4
101 28 F wk1 3 6
1123 22 M wk2 233 2
1198 34 M wk2 44 23
101 28 F wk2 3 43
You'll have to do some reshaping of your data with replace, ffill/bfill,
and pd.crosstab.
Honestly, it's easier to fix this sort of data at the source if you can. I've dealt with similar data from a SAP system, which would always come out pivoted with lots of merged cells and random rows (written by many years of SAP analysts).
import numpy as np
import pandas as pd

df = pd.read_excel(file, header=[0, 1], index_col=[0]).stack([0, 1])\
       .reset_index()\
       .rename(columns={'ID': 'header',
                        'level_0': 'ID',
                        'level_1': 'weeks',
                        0: 'value'})
df['weeks'] = df['weeks'].replace('Unnamed.*', np.nan, regex=True).bfill().ffill()
df1 = pd.crosstab(
    [df["ID"], df["weeks"]], df["header"], values=df["value"], aggfunc=lambda x: x
).reset_index()
print(df1.groupby('ID').ffill())
weeks Note1 Note2 age sex
0 wk1 3 6 28 F
1 wk2 3 43 28 F
2 wk1 10 22 22 M
3 wk2 233 2 22 M
4 wk1 9 4 34 M
5 wk2 44 23 34 M
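An alternative sketch that avoids crosstab: pull the single-header columns (age, sex) out first, stack only the week level of the remaining columns, then merge the two pieces back on ID. The frame below reconstructs the spreadsheet in code (the "Unnamed: …" level names are assumed from the question's stack output; with a real file you'd get them from read_excel):

```python
import pandas as pd

# Hypothetical reconstruction of the two-level-header sheet
cols = pd.MultiIndex.from_tuples([
    ("Unnamed: 1_level_0", "age"), ("Unnamed: 2_level_0", "sex"),
    ("wk1", "Note1"), ("wk1", "Note2"),
    ("wk2", "Note1"), ("wk2", "Note2"),
])
df = pd.DataFrame(
    [[22, "M", 10, 22, 233, 2],
     [34, "M", 9, 4, 44, 23],
     [28, "F", 3, 6, 3, 43]],
    index=pd.Index([1123, 1198, 101], name="ID"),
    columns=cols,
)

# ID-level columns: drop the meaningless "Unnamed" top level
id_cols = df[["Unnamed: 1_level_0", "Unnamed: 2_level_0"]].droplevel(0, axis=1)

# Week columns: stack level 0 (wk1/wk2) into the rows
weeks = df[["wk1", "wk2"]].stack(0).rename_axis(["ID", "Wk"]).reset_index()

# Re-attach age/sex to every (ID, Wk) row
out = weeks.merge(id_cols.reset_index(), on="ID")
```

This yields one row per ID and week with age and sex repeated, matching the desired shape.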

Adding column in pandas based on values from other columns with conditions

I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sales and NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment I am doing some grouping on the unit, keeping the units that have several rows, then extracting the information for the rows of these units with the minimal date. I then join this table with my original table, keeping only the rows whose dates differ between the two merged tables.
I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix and join to append the new DataFrame to the original:
#if the real data are not sorted newest-first
#df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print(df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
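The same approach end to end, reconstructing the question's data so the shift direction is visible: shift(-1) pulls the next row within each unit group, which is the previous sale when rows are ordered newest-first as here.

```python
import pandas as pd

df = pd.DataFrame({
    "unit":  [1, 1, 2, 2, 3, 3, 4],
    "year":  [2018, 2013, 2015, 2015, 2017, 2002, 2016],
    "month": [6, 4, 10, 2, 4, 6, 1],
    "price": [100, 70, 80, 110, 120, 90, 55],
})

# shift(-1) within each unit: each sale gets the values of the next
# (older) row; the oldest sale per unit gets NaN
prev = df.groupby("unit", sort=False).shift(-1).add_prefix("prev_")
res = df.join(prev)
```

The new columns become float because NaN cannot live in integer columns.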

How to concat two dataframes in python

I have two data frames, and I want to join them so that I can check the quantity for a given week in every year within a single data frame.
df1= City Week qty Year
hyd 35 10 2015
hyd 36 15 2015
hyd 37 11 2015
hyd 42 10 2015
hyd 23 10 2016
hyd 32 15 2016
hyd 37 11 2017
hyd 42 10 2017
pune 35 10 2015
pune 36 15 2015
pune 37 11 2015
pune 42 10 2015
pune 23 10 2016
pune 32 15 2016
pune 37 11 2017
pune 42 10 2017
df2= city Week qty Year
hyd 23 10 2015
hyd 32 15 2015
hyd 35 12 2016
hyd 36 15 2016
hyd 37 11 2016
hyd 42 10 2016
hyd 43 12 2016
hyd 44 18 2016
hyd 35 11 2017
hyd 36 15 2017
hyd 37 11 2017
hyd 42 10 2017
hyd 51 14 2017
hyd 52 17 2017
pune 35 12 2016
pune 36 15 2016
pune 37 11 2016
pune 42 10 2016
pune 43 12 2016
pune 44 18 2016
pune 35 11 2017
pune 36 15 2017
pune 37 11 2017
pune 42 10 2017
pune 51 14 2017
pune 52 17 2017
I want to join the two data frames as shown in the result: for each city, append the quantity for that week in every year to a single data frame.
city Week qty Year y2016_wk qty y2017_wk qty y2015_week qty
hyd 35 10 2015 2016_35 12 2017_35 11 nan nan
hyd 36 15 2015 2016_36 15 2017_36 15 nan nan
hyd 37 11 2015 2016_37 11 2017_37 11 nan nan
hyd 42 10 2015 2016_42 10 2017_42 10 nan nan
hyd 23 10 2016 nan nan 2017_23 x 2015_23 10
hyd 32 15 2016 nan nan 2017_32 y 2015_32 15
hyd 37 11 2017 2016_37 11 nan nan 2015_37 x
hyd 42 10 2017 2016_42 10 nan nan 2015_42 y
pune 35 10 2015 2016_35 12 2017_35 11 nan nan
pune 36 15 2015 2016_36 15 2017_36 15 nan nan
pune 37 11 2015 2016_37 11 2017_37 11 nan nan
pune 42 10 2015 2016_42 10 2017_42 10 nan nan
You can break down your task into a few steps:
Combine your dataframes df1 and df2.
Create a list of dataframes from your combined dataframe, splitting by year.
At the same time, rename columns to reflect year, set index to Week.
Finally, concatenate along axis=1 and reset_index.
Here is an example:
df = pd.concat([df1, df2], ignore_index=True)
dfs = [df[df['Year'] == y].rename(columns=lambda x: x+'_'+str(y) if x != 'Week' else x)\
.set_index('Week') for y in df['Year'].unique()]
res = pd.concat(dfs, axis=1).reset_index()
Result:
print(res)
Week qty_2015 Year_2015 qty_2016 Year_2016 qty_2017 Year_2017
0 35 10.0 2015.0 12.0 2016.0 11.0 2017.0
1 36 15.0 2015.0 15.0 2016.0 15.0 2017.0
2 37 11.0 2015.0 11.0 2016.0 11.0 2017.0
3 42 10.0 2015.0 10.0 2016.0 10.0 2017.0
4 43 NaN NaN 12.0 2016.0 NaN NaN
5 44 NaN NaN 18.0 2016.0 NaN NaN
6 51 NaN NaN NaN NaN 14.0 2017.0
7 52 NaN NaN NaN NaN 17.0 2017.0
Personally, I don't think your example output is that readable, so unless you need that format for a specific reason I would consider using a pivot table instead. I also think the code required is cleaner.
import pandas as pd
df3 = pd.concat([df1, df2], ignore_index=True)
df4 = df3.pivot(index='Week', columns='Year', values='qty')
print(df4)
Year 2015 2016 2017
Week
35 10.0 12.0 11.0
36 15.0 15.0 15.0
37 11.0 11.0 11.0
42 10.0 10.0 10.0
43 NaN 12.0 NaN
44 NaN 18.0 NaN
51 NaN NaN 14.0
52 NaN NaN 17.0
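One caveat with the pivot shown above: it ignores the city column, and pivot raises on duplicate (Week, Year) pairs once both cities are present. A sketch of a variant that keeps the city in the index, using pivot_table (the sum aggregation for any duplicates is an assumption; the tiny frames below stand in for df1 and df2):

```python
import pandas as pd

# Cut-down stand-ins for df1 (2015 data) and df2 (2016 data)
df1 = pd.DataFrame({"City": ["hyd", "hyd", "pune", "pune"],
                    "Week": [35, 36, 35, 36],
                    "qty":  [10, 15, 10, 15],
                    "Year": [2015, 2015, 2015, 2015]})
df2 = pd.DataFrame({"City": ["hyd", "hyd", "pune", "pune"],
                    "Week": [35, 36, 35, 36],
                    "qty":  [12, 15, 12, 15],
                    "Year": [2016, 2016, 2016, 2016]})

combined = pd.concat([df1, df2], ignore_index=True)

# One row per (City, Week), one column per Year
wide = pd.pivot_table(combined, index=["City", "Week"],
                      columns="Year", values="qty", aggfunc="sum")
```

Each city then gets its own block of rows, so hyd and pune with the same week number no longer collide.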

Python Pandas: Custom rolling window calculation

I'm looking to take the most recent value in a rolling window and divide it by the mean of all numbers in said window.
What I tried:
df.a.rolling(window=7).mean()/df.a[-1]
This doesn't work because df.a[-1] is always the most recent of the entire dataset. I need the last value of the window.
I've done a ton of searching today. I may be searching the wrong terms, or not understanding the results, because I have not gotten anything useful.
Any pointers would be appreciated.
Aggregation (using mean()) over a rolling window returns a pandas Series object with the same indexing as the original column. You can simply aggregate the rolling window and then divide the result elementwise against the original column.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(30), columns=['A'])
df
# returns:
A
0 0
1 1
2 2
...
27 27
28 28
29 29
You can use a rolling mean to get a series with the same index.
df.A.rolling(window=7).mean()
# returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 3.0
7 4.0
...
26 23.0
27 24.0
28 25.0
29 26.0
Because it is indexed, you can simply divide by df.A to get your desired results.
df.A.rolling(window=7).mean() / df.A
# returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0.500000
7 0.571429
8 0.625000
9 0.666667
10 0.700000
11 0.727273
12 0.750000
13 0.769231
14 0.785714
15 0.800000
16 0.812500
17 0.823529
18 0.833333
19 0.842105
20 0.850000
21 0.857143
22 0.863636
23 0.869565
24 0.875000
25 0.880000
26 0.884615
27 0.888889
28 0.892857
29 0.896552
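Note that the ratio above is mean-over-last. If you want the ratio the question asked for, last value divided by the window mean, just flip the division; the rolling mean is aligned on the same index either way:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30), columns=['A'])

# Last value of each window divided by that window's mean
ratio = df.A / df.A.rolling(window=7).mean()
```

At position 6 the window is 0..6 with mean 3.0, so the ratio is 6 / 3.0 = 2.0; the first six positions stay NaN because the window is incomplete.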
