Merge a specific column from multiple dataframes with different lengths - python

Assuming there are three dataframes:
df1
  Color      date
0     A      2011
1     B    201411
2     C  20151231
3     A      2019
df2
  Color      date
0     A      2013
1     B  20151111
2     C    201101
df3
  Color      date
0     A      2011
1     B    201411
2     C  20151231
3     A      2019
4     Y  20070212
I want to create a new dataframe by extracting only the 'date' column from each.
The output I want:
New df
   df1-date  df2-date  df3-date
0      2011      2013      2011
1    201411  20151111    201411
2  20151231    201101  20151231
3      2019       NaN      2019
4       NaN       NaN  20070212
Because the dataframes have different lengths, I want the empty cells set to NaN.
I tried merge and concat but keep getting errors.
Thank you for reading.

One more approach:
df1.join(df2['date'], rsuffix='df2', how='outer').join(df3['date'], rsuffix='df3', how='outer')
Output
  Color        date     datedf2   datedf3
0     A      2011.0      2013.0      2011
1     B    201411.0  20151111.0    201411
2     C  20151231.0    201101.0  20151231
3     A      2019.0         NaN      2019
4   NaN         NaN         NaN  20070212
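Note the dates print as 2011.0 etc.: the outer join introduces NaN, and NaN forces an integer column to float. If keeping integers matters, a minimal sketch using pandas' nullable Int64 dtype (assumes pandas >= 0.24; the name joined is just for illustration) avoids the upcast:
joined = (df1.astype({'date': 'Int64'})
             .join(df2['date'].astype('Int64'), rsuffix='df2', how='outer')
             .join(df3['date'].astype('Int64'), rsuffix='df3', how='outer'))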

This involves two problems: (1) merging multiple dataframes, and (2) merging on a duplicated key ('Color' repeats within each dataframe).
from functools import reduce

def multikey(x):
    # groupby + cumcount creates an additional key that de-duplicates 'Color'
    return x.assign(key=x.groupby('Color').cumcount())

# then use reduce to merge the dataframes pairwise
df = reduce(lambda left, right:
            pd.merge(left, right, on=['Color', 'key'], how='outer'),
            list(map(multikey, [df1, df2, df3])))
df
  Color      date_x  key      date_y      date
0     A      2011.0    0      2013.0      2011
1     B    201411.0    0  20151111.0    201411
2     C  20151231.0    0    201101.0  20151231
3     A      2019.0    1         NaN      2019
4     Y         NaN    0         NaN  20070212
Note: the column names here can always be adjusted with rename, and the helper key dropped.
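For example, a minimal clean-up sketch (the target names are taken from the desired output above):
df = (df.drop(columns='key')
        .rename(columns={'date_x': 'df1-date', 'date_y': 'df2-date', 'date': 'df3-date'}))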
Method 2, using concat: this does not consider the duplicated key and simply aligns on the index.
s = pd.concat([df1, df2, df3], keys=['df1', 'df2', 'df3'], axis=1)
s.columns = s.columns.map('_'.join)
s = s.filter(like='_date')
s
     df1_date    df2_date  df3_date
0      2011.0      2013.0      2011
1    201411.0  20151111.0    201411
2  20151231.0    201101.0  20151231
3      2019.0         NaN      2019
4         NaN         NaN  20070212
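If only the 'date' columns are needed, a more direct variant of the same concat idea (a sketch; out is an illustrative name) selects them up front and names the columns via keys:
out = pd.concat([d['date'] for d in [df1, df2, df3]], axis=1,
                keys=['df1-date', 'df2-date', 'df3-date'])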

Related

Replacing missing values with different values in the same column of a pandas dataframe

   A     B      C   D
1  2010  one    0   0
2  2020  one    2   4
3  2007  two    0   8
4  2010  one    8   4
5  2020  four   6  12
6  2007  three  0  14
7  2006  four   7  14
8  2010  two   10  12
I need to replace the 0s with the average of that year's non-zero C values. For example, the 2010 C value would be 9. What is the best way to do this? I have over 10,000 rows.
You can use replace to change the 0s to np.nan in column C, and fillna to map in the yearly averages:
import numpy as np

df['C'] = df['C'].replace({0: np.nan})
df['C'] = df['C'].fillna(
    df['A'].map(
        df.groupby('A')['C']
          .mean()        # NaN-aware: ignores the former 0s
          .fillna(0)     # years with only 0s map back to 0
          .to_dict()
    )
)
print(df)
   A     B      C     D
0  2010  one    9.0   0
1  2020  one    2.0   4
2  2007  two    0.0   8
3  2010  one    8.0   4
4  2020  four   6.0  12
5  2007  three  0.0  14
6  2006  four   7.0  14
7  2010  two   10.0  12
2007 ends up back at 0.0 because it has no values other than 0s in the initial data: its group mean is NaN, and the fillna(0) on the mapping turns that back into 0.
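For reference, a more compact variant of the same idea using groupby(...).transform('mean') (a sketch, assuming the original df and the same treatment of all-zero years; c is just a temporary name):
c = df['C'].replace(0, np.nan)
df['C'] = c.fillna(c.groupby(df['A']).transform('mean')).fillna(0)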
Here is how I would do it, in two steps:
1: Find the average of the non-zero C values for each year and put it in a dict, e.g. {2010: 9.0, ...}:
my_year_dict = df[df['C'] != 0].groupby('A')['C'].mean().to_dict()
2: Use apply with a lambda over the rows, looking each year up in the dict:
df['new_C'] = df.apply(lambda row: my_year_dict.get(row['A'], 0) if row['C'] == 0 else row['C'], axis=1)
Hope it can be a start!

Replace NaN in rows using data from other rows [duplicate]

I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
 1     1      10
 1     1     nan
 1     1     nan
 1     2     nan
 1     2      20
 1     2     nan
 1     3     nan
 1     3     nan
I want to use columns one and two as keys: within each group of rows sharing the same key values, if column three is not entirely NaN, fill the missing entries from the row in that group that does have a value in column three.
Here is my desired result:
one | two | three
 1     1      10
 1     1      10
 1     1      10
 1     2      20
 1     2      20
 1     2      20
 1     3     nan
 1     3     nan
You can see that the group with keys (1, 3) stays NaN because no existing value is available for it.
I have tried groupby + fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have also tried forward fill, which gave me a rather strange result where it seemed to forward-fill column two instead. I am using this code for the forward fill:
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
If there is only one non-NaN value per group, use ffill (forward fill) and bfill (backward fill) per group, which needs apply with a lambda:
df['three'] = (df.groupby(['one','two'], sort=False)['three']
                 .apply(lambda x: x.ffill().bfill()))
print (df)
   one  two  three
0    1    1   10.0
1    1    1   10.0
2    1    1   10.0
3    1    2   20.0
4    1    2   20.0
5    1    2   20.0
6    1    3    NaN
7    1    3    NaN
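A shorter equivalent for this one-value-per-group case (a sketch) is transform('first'), since the 'first' aggregation returns the first non-NaN value of each group:
df['three'] = df.groupby(['one','two'], sort=False)['three'].transform('first')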
But if there are multiple values per group and NaN needs to be replaced by some derived constant - e.g. the mean per group:
print (df)
   one  two  three
0    1    1   10.0
1    1    1   40.0
2    1    1    NaN
3    1    2    NaN
4    1    2   20.0
5    1    2    NaN
6    1    3    NaN
7    1    3    NaN
df['three'] = (df.groupby(['one','two'], sort=False)['three']
                 .apply(lambda x: x.fillna(x.mean())))
print (df)
   one  two  three
0    1    1   10.0
1    1    1   40.0
2    1    1   25.0
3    1    2   20.0
4    1    2   20.0
5    1    2   20.0
6    1    3    NaN
7    1    3    NaN
You can sort the data by the column with missing values, then groupby and forward-fill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
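This works because sort_values places the NaNs last within each group, so ffill propagates the one existing value; note it also reorders the rows. A sketch that restores the original order afterwards:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
df.sort_index(inplace=True)  # restore the original row order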

Pandas Error: '[nan nan] not found in axis' while dropping a column without labels

I'm trying to drop the first two columns in a dataframe that has NaN for column headers. The dataframe looks like this:
15  NaN          NaN          NaN  Energy Supply  Energy Supply  Renewable Energy
17  NaN  Afghanistan  Afghanistan              1              2                 3
18  NaN      Albania      Albania              1              2                 3
19  NaN      Algeria      Algeria              1              2                 3
I need to drop the first two columns labeled NaN. I tried df = df.drop(df.columns[[1,2]], axis=1), which returns this error:
KeyError: '[nan nan] not found in axis'
What am I missing?
It is strange to have NaN as column names. Try filtering to the columns that do not start with NaN, using a regex (the ^(?!NaN) is a negative lookahead that rejects names beginning with 'NaN'):
df.filter(regex='^(?!NaN).+', axis=1)
Using your data
print(df)
15 NaN NaN.1 NaN.2 EnergySupply EnergySupply.1 \
0 17 NaN Afghanistan Afghanistan 1 2
1 18 NaN Albania Albania 1 2
2 19 NaN Algeria Algeria 1 2
RenewableEnergy
0 3
1 3
2 3
Solution
print(df.filter(regex='^(?!NaN).+', axis=1))
15 EnergySupply EnergySupply.1 RenewableEnergy
0 17 1 2 3
1 18 1 2 3
2 19 1 2 3
When the NaN columns exist, I had to use a case-insensitive version of the regex from wwnde's answer in order for it to successfully filter out the columns:
df = df.filter(regex='(?i)^(?!NaN).+', axis=1)
Other suggestions, such as df = df[df.columns.dropna()] and df = df.drop(np.nan, axis=1), do not work, but the above did.
I'm guessing this is related to the painful reality of np.nan == np.nan not evaluating to true, but ultimately it seems like a bug with pandas.
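If the headers really are missing values (float NaN) rather than the literal string 'NaN' that the regex matches, a mask-based column selection is another option (a sketch):
df = df.loc[:, df.columns.notna()]  # keep only columns whose label is not NaN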

undefined row to column, group by year and month

I am trying to change the structure of the data inside the dataframe:
year  month  count  reason
2001      1      1       a
2001      2      3       b
2001      3      4       c
2005      1      4       a
2005      1      3       c
The new dataframe should look like:
year  month  count  reason_a  reason_b  reason_c
2001      1      1         1         0         0
2001      2      3         0         3         0
2001      3      4         0         0         4
2005      1      7         4         0         3
Can anyone show some Python code to do it? Thank you in advance.
Using
DataFrame.join() - Join columns of another DataFrame.
pandas.get_dummies() - Convert categorical variable into dummy/indicator variables.
DataFrame.mul() - Get Multiplication of dataframe and other, element-wise (binary operator mul).
DataFrame.groupby() - Group DataFrame or Series using a mapper or by a Series of columns.
DataFrameGroupBy.agg() - Aggregate using callable, string, dict, or list of string/callables.
Ex.
dummies = df.join(pd.get_dummies(df["reason"], prefix='reason').mul(df['count'], axis=0))
f = {'count': 'sum', 'reason_a': 'first', 'reason_b': 'first', 'reason_c': 'last'}
df1 = dummies.groupby(['year','month'], sort=False, as_index=False).agg(f)
print(df1)
O/P:
   year  month  count  reason_a  reason_b  reason_c
0  2001      1      1         1         0         0
1  2001      2      3         0         3         0
2  2001      3      4         0         0         4
3  2005      1      7         4         0         3
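The first/last choices in the aggregation dict are tuned to this particular data; a sketch that is a bit more robust simply sums the weighted dummy columns, which also handles a reason repeating within a (year, month) group:
dummies = df.join(pd.get_dummies(df['reason'], prefix='reason').mul(df['count'], axis=0))
df1 = (dummies.groupby(['year','month'], sort=False, as_index=False)
              [['count', 'reason_a', 'reason_b', 'reason_c']].sum())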
Using pivot_table:
df2 = pd.pivot_table(df, index=["year","month"], values=["count"], columns="reason").reset_index().fillna(0)
df2.columns = [i[0] if i[0] != "count" else f"reason_{i[1]}" for i in df2.columns]
df2["count"] = df2.iloc[:, 2:5].sum(axis=1)
print(df2)
   year  month  reason_a  reason_b  reason_c  count
0  2001      1       1.0       0.0       0.0    1.0
1  2001      2       0.0       3.0       0.0    3.0
2  2001      3       0.0       0.0       4.0    4.0
3  2005      1       4.0       0.0       3.0    7.0
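A third option along the same lines (a sketch; out is an illustrative name) is pd.crosstab, which does the pivot and the aggregation in one call:
out = (pd.crosstab(index=[df['year'], df['month']], columns=df['reason'],
                   values=df['count'], aggfunc='sum')
         .fillna(0).add_prefix('reason_').reset_index())
out['count'] = out.filter(like='reason_').sum(axis=1)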

How to calculate quarterly differences and add missing quarters with counts in Python pandas

I have a dataframe like the one below. I need to find the missing quarters per Id and fill in the count between them. The dataframe is:
year    Data   Id
2019Q4  57170  A
2019Q3  55150  A
2019Q2  51109  A
2019Q1  51109  A
2018Q1  57170  B
2018Q4  55150  B
2017Q4  51109  C
2017Q2  51109  C
2017Q1  51109  C
The expected output is the range of missing quarters per Id, with a count:
Id  start   end     count
B   2018Q2  2018Q3  2
C   2017Q3  2017Q3  1
How can I achieve this using Python pandas?
Use:
#changed data for a more general solution - multiple missing years per group
print (df)
   year   Data Id
0  2015  57170  A
1  2016  55150  A
2  2019  51109  A
3  2023  51109  A
4  2000  47740  B
5  2002  44563  B
6  2003  43643  C
7  2004  42050  C
8  2007  37312  C
#add missing rows for absent years per Id by reindex
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
         .reset_index(name='val'))
print (df1)
    Id  year  val
0    A  2015    A
1    A  2016    A
2    A  2017  NaN
3    A  2018  NaN
4    A  2019    A
5    A  2020  NaN
6    A  2021  NaN
7    A  2022  NaN
8    A  2023    A
9    B  2000    B
10   B  2001  NaN
11   B  2002    B
12   C  2003    C
13   C  2004    C
14   C  2005  NaN
15   C  2006  NaN
16   C  2007    C
#boolean mask marking the non-NaN rows, saved to a variable for reuse
m = df1['val'].notnull().rename('g')
#the cumulative sum creates a unique group id for each run of consecutive NaNs
df1.index = m.cumsum()
#keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first','last','size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
print (df2)
  Id  first  last  size
0  A   2017  2018     2
1  A   2020  2022     3
2  B   2001  2001     1
3  C   2005  2006     2
EDIT:
#convert to datetimes
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
#resample to month starts with asfreq
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .resample('MS').asfreq()
         .rename('val')
         .reset_index())
print (df1)
    Id       year  val
0    A 2015-05-01    A
1    A 2015-06-01  NaN
2    A 2015-07-01    A
3    A 2015-08-01  NaN
4    A 2015-09-01    A
5    B 2000-01-01    B
6    B 2000-02-01  NaN
7    B 2000-03-01    B
8    C 2003-01-01    C
9    C 2003-02-01    C
10   C 2003-03-01  NaN
11   C 2003-04-01  NaN
12   C 2003-05-01    C
m = df1['val'].notnull().rename('g')
#the cumulative sum creates a unique group id for each run of consecutive NaNs
df1.index = m.cumsum()
#keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first','last','size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
print (df2)
  Id       first        last  size
0  A  2015-06-01  2015-06-01     1
1  A  2015-08-01  2015-08-01     1
2  B  2000-02-01  2000-02-01     1
3  C  2003-03-01  2003-04-01     2
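Since the original question is about quarters, a hypothetical adaptation of the same pattern (a sketch; assumes the 'year' strings look like '2019Q4') converts them via a quarterly PeriodIndex and resamples to quarter starts:
df['year'] = pd.PeriodIndex(df['year'], freq='Q').to_timestamp()
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .resample('QS').asfreq()
         .rename('val')
         .reset_index())
#then reuse the mask/cumsum aggregation above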
