Merge should adopt NaN values if no matching value exists [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have the following problem:
If a year or day does not exist in df2, a price and a listing_id are still filled in during the merge, but they should be NaN.
The second problem is that when several rows share the same day, the temperature is also copied onto the second row during the merge, for example:
import pandas as pd

d = {'id': [1], 'day': [1], 'temperature': [20], 'year': [2001]}
df = pd.DataFrame(data=d)
print(df)
   id  day  temperature  year
0   1    1           20  2001
d2 = {'id': [122, 244], 'day': [1, 1],
      'listing_id': [2, 4], 'price': [20, 440], 'year': [2001, 2001]}
df2 = pd.DataFrame(data=d2)
print(df2)
    id  day  listing_id  price  year
0  122    1           2     20  2001
1  244    1           4    440  2001
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               left_on='day', right_on='day', how='left')
print(df3)
   id  day  temperature  year  listing_id  price
0   1    1           20  2001           2     20
1   1    1           20  2001           4    440   # <-- The second temperature is wrong :/
This should not be so, because if I later also have a row from year 2002 on day 1 with a temperature of 30 and I want to calculate the average, I get (20 + 20 + 30) / 3 ≈ 23.3. It should be (20 + 30) / 2 = 25. Therefore, once a value has already been filled, the duplicate should hold NaN instead.
Code Snippet
d = {'id': [1, 2, 3, 4, 5], 'day': [1, 2, 3, 4, 2],
     'temperature': [20, 40, 50, 60, 20], 'year': [2001, 2002, 2004, 2005, 1999]}
df = pd.DataFrame(data=d)
print(df)
   id  day  temperature  year
0   1    1           20  2001
1   2    2           40  2002
2   3    3           50  2004
3   4    4           60  2005
4   5    2           20  1999
d2 = {'id': [122, 244, 387, 4454, 521], 'day': [1, 2, 3, 4, 2],
      'listing_id': [2, 4, 5, 6, 7], 'price': [20, 440, 500, 6600, 500],
      'year': [2001, 2002, 2004, 2005, 2005]}
df2 = pd.DataFrame(data=d2)
print(df2)
     id  day  listing_id  price  year
0   122    1           2     20  2001
1   244    2           4    440  2002
2   387    3           5    500  2004
3  4454    4           6   6600  2005
4   521    2           7    500  2005
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               left_on='day', right_on='day', how='left')
print(df3)
   id  day  temperature  year  listing_id  price
0   1    1           20  2001           2     20
1   2    2           40  2002           4    440
2   2    2           40  2002           7    500
3   3    3           50  2004           5    500
4   4    4           60  2005           6   6600
5   5    2           20  1999           4    440
6   5    2           20  1999           7    500
What I want
   id  day  temperature  year  listing_id  price
0   1    1           20  2001           2     20
1   2    2           40  2002           4    440
2   2    2          NaN  2005           7    500
3   3    3           50  2004           5    500
4   4    4           60  2005           6   6600
5   5    2           20  1999         NaN    NaN

IIUC:
>>> df.merge(df2[['day', 'listing_id', 'price', 'year']],
...          on=['day', 'year'], how='outer')
    id  day  temperature  year  listing_id   price
0  1.0    1         20.0  2001         2.0    20.0
1  2.0    2         40.0  2002         4.0   440.0
2  3.0    3         50.0  2004         5.0   500.0
3  4.0    4         60.0  2005         6.0  6600.0
4  5.0    2         20.0  1999         NaN     NaN
5  NaN    2          NaN  2005         7.0   500.0
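Note that the outer merge upcasts the integer columns to float, because NaN does not fit in an integer dtype (hence the 1.0, 20.0, ... above). If you want to keep integers, a minimal sketch using pandas' nullable Int64 extension dtype:

df3 = df.merge(df2[['day', 'listing_id', 'price', 'year']],
               on=['day', 'year'], how='outer')
# Nullable integers hold missing values as <NA> instead of forcing float64
df3 = df3.astype({'id': 'Int64', 'temperature': 'Int64',
                  'listing_id': 'Int64', 'price': 'Int64'})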

Related

How to add sub-total columns to a multilevel columns dataframe?

I have a dataframe with 3 levels of multi-index columns:
quarter        Q1                      Q2              Totals
year         2021      2022      2021      2022
            qty orders qty orders qty orders qty orders qty orders
month name
January      40    2    5    1    1    2    0    0   46    5
February     20    8    2    3    4    6    0    0   26   17
March         2   10    7    4    3    3    0    0   12   17
Totals       62   20   14    8    8   11    0    0   84   39
After doing a groupby on levels (0, 2), I have the following subtotals dataframe:
quarter        Q1        Q2      Totals
            qty orders qty orders qty orders
month name
January      45    3    1    2   46    5
February     22   10    4    6   26   16
March         9   14    3    3   12   17
Totals       76   28    8   11   84   39
I need to insert the second into the first, without disturbing the columns, levels, or index, so that I get the following dataframe:
quarter        Q1                           Q2                     Totals
year         2021      2022   Subtotal    2021      2022   Subtotal
            qty orders qty orders qty orders qty orders qty orders qty orders qty orders
month name
January      40    2    5    1   45    3    1    2    0    0    1    2   46    5
February     20    8    2    3   22   10    4    6    0    0    4    6   26   16
March         2   10    7    4    9   14    3    3    0    0    3    3   12   17
Totals       62   20   14    8   76   28    8   11    0    0    8   11   84   39
How do I do this?
With your initial dataframe (before groupby):
import pandas as pd

df = pd.DataFrame(
    [
        [40, 2, 5, 1, 1, 2, 0, 0],
        [20, 8, 2, 3, 4, 6, 0, 0],
        [2, 10, 7, 4, 3, 3, 0, 0],
        [62, 20, 14, 8, 8, 11, 0, 0],
    ],
    columns=pd.MultiIndex.from_product(
        [("Q1", "Q2"), ("2021", "2022"), ("qty", "orders")]
    ),
    index=["January", "February", "March", "Totals"],
)
Here is one way to do it (using product from the Python standard library's itertools module; otherwise a nested for-loop is also possible):
from itertools import product

# Add new columns
for level1, level2 in product(["Q1", "Q2"], ["qty", "orders"]):
    df.loc[:, (level1, "subtotal", level2)] = (
        df.loc[:, (level1, "2021", level2)] + df.loc[:, (level1, "2022", level2)]
    )

# Sort columns
df = df.reindex(
    pd.MultiIndex.from_product(
        [("Q1", "Q2"), ("2021", "2022", "subtotal"), ("qty", "orders")]
    ),
    axis=1,
)
Then:
print(df)
# Output
            Q1                                    Q2                    \
          2021        2022     subtotal         2021        2022
           qty orders  qty orders  qty orders    qty orders  qty orders
January     40      2    5      1   45      3      1      2    0      0
February    20      8    2      3   22     11      4      6    0      0
March        2     10    7      4    9     14      3      3    0      0
Totals      62     20   14      8   76     28      8     11    0      0

          subtotal
           qty orders
January      1      2
February     4      6
March        3      3
Totals       8     11
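A more generic alternative (a sketch, starting from df as originally constructed above, before the loop): sum across the year level for every quarter at once, then concatenate the subtotal columns back in.

# Sum over the year level for each (quarter, measure) pair
subtotal = df.T.groupby(level=[0, 2]).sum().T
# Label the new columns with "subtotal" on the year level
subtotal.columns = pd.MultiIndex.from_tuples(
    [(q, "subtotal", m) for q, m in subtotal.columns]
)
# Concatenate and restore the intended column order
out = pd.concat([df, subtotal], axis=1).reindex(
    pd.MultiIndex.from_product(
        [("Q1", "Q2"), ("2021", "2022", "subtotal"), ("qty", "orders")]
    ),
    axis=1,
)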

Count Number of Rows within Time Interval in Pandas Dataframe

Say we have this data:
list1, list2, list3 = [1, 2, 3, 4], [1990, 1990, 1990, 1991], [2009, 2009, 2009, 2009]
df = pd.DataFrame(list(zip(list1, list2, list3)), columns=['Index', 'Y0', 'Y1'])
>>> df
   Index    Y0    Y1
0      1  1990  2009
1      2  1990  2009
2      3  1990  2009
3      4  1991  2009
I want to count, for each year, how many rows fall within that year, excluding the Y0 year itself (i.e. Y0 < year <= Y1).
So say we start at the first available year, 1990:
How many rows do we count? Zero.
1991:
Three (rows 1, 2, 3).
1992:
Four (rows 1, 2, 3, 4).
...
2009:
Four (rows 1, 2, 3, 4).
So I want to end up with a dataframe that says:
Count  Year
0      1990
3      1991
4      1992
...     ...
4      2009
My attempt:
df['Y0'] = pd.to_datetime(df['Y0'], format='%Y')
df['Y1'] = pd.to_datetime(df['Y1'], format='%Y')
# Group by the interval between Y0 and Y1
df = df.groupby([df['Y0'].dt.year, df['Y1'].dt.year]).agg({'count'})
df.columns = ['count', 'Y0 count', 'Y1 count']
# sum the total
df_sum = pd.DataFrame(df.groupby(df.index)['count'].sum())
But the result doesn't look right.
Appreciate any help.
You could do:
import numpy as np

min_year = df[['Y0', 'Y1']].values.min()
max_year = df[['Y0', 'Y1']].values.max()
year_range = np.arange(min_year, max_year + 1)
# The (n_rows, 1) arrays broadcast against the (n_years,) range to an
# (n_rows, n_years) boolean matrix; summing over axis=0 counts rows per year
counts = ((df[['Y0']].values < year_range)
          & (year_range <= df[['Y1']].values)).sum(axis=0)
o = pd.DataFrame({'counts': counts, 'year': year_range})
    counts  year
0        0  1990
1        3  1991
2        4  1992
3        4  1993
4        4  1994
5        4  1995
6        4  1996
7        4  1997
8        4  1998
9        4  1999
10       4  2000
11       4  2001
12       4  2002
13       4  2003
14       4  2004
15       4  2005
16       4  2006
17       4  2007
18       4  2008
19       4  2009
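The same counting rule can also be written with pandas' IntervalIndex (a sketch, assuming the integer Y0/Y1 columns of the original frame): closed='right' encodes "exclude Y0, include Y1".

# Each row becomes a right-closed interval (Y0, Y1]
intervals = pd.IntervalIndex.from_arrays(df['Y0'], df['Y1'], closed='right')
counts = pd.DataFrame({'counts': [intervals.contains(y).sum() for y in year_range],
                       'year': year_range})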
The following should do your job:
counts = []
years = []

def count_in_interval(year):
    # Count rows with Y0 < year <= Y1
    n = 0
    for i in range(len(df)):
        if df['Y0'][i] < year <= df['Y1'][i]:
            n += 1
    return n

for i in range(1990, 2010):
    counts.append(count_in_interval(i))
    years.append(i)

result = pd.DataFrame(zip(counts, years), columns=['Count', 'Year'])

How to use apply or transform with a more complex function

I am relatively new to Python and pandas, and I am trying to perform a group-wise operation using apply but am struggling to get it working.
My data frame looks like this:
Year  Country  Val1  Val2  Fact
2005  A           1     3     1
2006  A           2     4     2
2007  A           3     5     2
2008  A           4     3     1
2009  A           4     3     1
2010  A           4     3     1
2005  B           5     7     2
2006  B           6     6     2
2007  B           7     5     1
2008  B           8     6     2
2009  B           8     6     2
2010  B           8     6     2
For each country in each year, I need to calculate
(country mean for period 2005-2008 - value in 2005)/4 * Fact * (Year - 2005) + value in 2005
So far I have read up on the use of apply and transform and looked at questions related to both functions (e.g. 1 and 2), and I thought that my problem could be solved with a group-wise apply.
I tried to set it up like so:
import pandas as pd

df = pd.DataFrame({'Year': [2005, 2006, 2007, 2008, 2009, 2010, 2005, 2006, 2007, 2008, 2009, 2010],
                   'Country': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'Val1': [1, 2, 3, 4, 4, 4, 5, 6, 7, 8, 8, 8],
                   'Val2': [3, 4, 5, 3, 3, 3, 7, 6, 5, 6, 6, 6],
                   'Fact': [1, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2]})
def func(grp):
    grad = grp[(grp['Year'] > 2004) & (grp['Year'] < 2009)].transform('mean')
    ref = grp[grp['Year'] == 2005]
    grad = (grad - ref) / 4
    res = grad * grp['Fact'] * (grp['Year'] - 2015) * ref
    return res

df.groupby('Country').apply(func)
Running the code yields
Country Fact Val1 Val2 Year 0 1 2 3 4 5 6 7 8 9 10 11
Country
A 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
B 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
However, I hoped to receive something along the lines of this:
Year  Country   Val1    Val2   Fact
2005  A        1       3          1
2006  A        1.75    3.375      2
2007  A        2.5     3.75       2
2008  A        2.125   3.5625     1
2009  A        2.125   3.5625     1
2010  A        2.125   3.5625     1
2005  B        5       7          2
2006  B        5.75    6.5        2
2007  B        5.75    6.5        1
2008  B        7.25    5.5        2
2009  B        7.25    5.5        2
2010  B        7.25    5.5        2
I would be very grateful if anybody could point me towards a solution for this problem.
It is better not to do this within one function:
# Mean of 2005-2008 per country
s1 = df.loc[df.Year.between(2005, 2008)].groupby('Country').mean()[['Val1', 'Val2']]
# 2005 baseline value per country
s2 = df.loc[df.Year.eq(2005), ['Country', 'Val1', 'Val2']].set_index('Country')
# (Year - 2005) * Fact per row
s3 = df.Year.sub(2005) * df.Fact
# slope * factor + baseline, broadcast row-wise
s = (s1 - s2).div(4).reindex(df.Country).values * s3.values[:, None] + s2.reindex(df.Country).values
df.loc[:, ['Val1', 'Val2']] = s
df
df
    Year Country   Val1    Val2  Fact
0   2005       A  1.000  3.0000     1
1   2006       A  1.750  3.3750     2
2   2007       A  2.500  3.7500     2
3   2008       A  2.125  3.5625     1
4   2009       A  2.500  3.7500     1
5   2010       A  2.875  3.9375     1
6   2005       B  5.000  7.0000     2
7   2006       B  5.750  6.5000     2
8   2007       B  5.750  6.5000     1
9   2008       B  7.250  5.5000     2
10  2009       B  8.000  5.0000     2
11  2010       B  8.750  4.5000     2
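If you do want a single group-wise function, here is a sketch with apply (run it on a fresh copy of the original df, since the code above overwrites Val1/Val2 in place; the names are illustrative):

def interpolate(grp):
    grp = grp.copy()
    base = grp.loc[grp['Year'] == 2005, ['Val1', 'Val2']].iloc[0]
    mean = grp.loc[grp['Year'].between(2005, 2008), ['Val1', 'Val2']].mean()
    slope = (mean - base) / 4
    factor = (grp['Year'] - 2005) * grp['Fact']
    # slope (2,) times factor (n, 1) broadcasts to (n, 2)
    grp[['Val1', 'Val2']] = slope.values * factor.values[:, None] + base.values
    return grp

out = df.groupby('Country', group_keys=False).apply(interpolate)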

grouping by count, year and displaying the last occurrence and its count

In the following dataframe
d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
     'tin': [12, 23, 24, 28, 30, 12, 7],
     'ptin': [12, 23, 28, 22, 12, 12, 0]}
df = pd.DataFrame(data=d)
If I run the following code:
df = (df.groupby(['ptin', 'tin', 'year'])
        .apply(lambda x: x['tin'].isin(x['ptin']).astype(int).sum())
        .reset_index(name='matches'))
df
I get the following result:
   ptin  tin   year  matches
0    12  3.0   1999        0
1    12  3.0   2001        0
2    22  1.0   2002        0
3    23  1.0   2002        0
This gives me the tin values matching ptin, grouped by year.
Now if I want to find the last occurrence of, say, tin == 12, I should get 2001. I want to add that column, as well as the difference between 1999 and 2001 (which is two) in a different column, such that my answer looks like below:
   ptin  tin   year  matches  lastoccurrence  length
0    12  3.0   1999        0               0       0
1    12  3.0   2001        0            2001       2
2    22  1.0   2002        0            2002       1
3    23  1.0   2002        0            2002       1
Any help would be appreciated. I could take a solution in either pandas or SQL if that is possible.
I think this will do magic (at least partially?):
df['duration'] = df.sort_values(['ptin', 'year']).groupby('ptin')['year'].diff()
df = df.dropna(subset=['duration'])
print(df)
   ptin  tin  year  matches  duration
2    12   12  2001        1       2.0
3    12   30  2004        0       3.0
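To get the requested lastoccurrence and length columns, one possible continuation (a sketch only; the question's expected output is ambiguous for ptin values that occur once, so this follows the diff-based reading above):

df = df.sort_values(['ptin', 'year'])
# Gap in years to the previous occurrence of the same ptin (0 for the first)
df['length'] = df.groupby('ptin')['year'].diff().fillna(0).astype(int)
# Report the year only for repeat occurrences, 0 otherwise
df['lastoccurrence'] = df['year'].where(df['length'] > 0, 0)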

How to get the max out of a group by on two columns and sum on third in a pandas dataframe?

So I used a groupby on a pandas dataframe, which looks like this:
df.groupby(['year','month'])['AMT'].agg('sum')
And I get something like this:
year  month
2003  1          114.00
      2         9195.00
      3          300.00
      5          200.00
      6          450.00
      7           68.00
      8          750.00
      9         3521.00
      10         250.00
      11         799.00
      12        1000.00
2004  1         8551.00
      2         9998.00
      3        17334.00
      4         2525.00
      5        16014.00
      6         9132.00
      7        10623.00
      8         7538.00
      9         3650.00
      10        7733.00
      11       10128.00
      12        4741.00
2005  1         6965.00
      2         3208.00
      3         8630.00
      4         7776.00
      5        11950.00
      6        11717.00
      7         1510.00
...
2015  7      1431441.00
      8       966974.00
      9      1121650.00
      10     1200104.00
      11     1312191.90
      12      482535.00
2016  1      1337343.00
      2      1465068.00
      3      1170113.00
      4      1121691.00
      5      1302936.00
      6      1518047.00
      7      1251844.00
      8       825215.00
      9      1491626.00
      10     1243877.00
      11     1632252.00
      12      750995.50
2017  1       905974.00
      2      1330182.00
      3      1382628.52
      4      1146789.00
      5      1201425.00
      6      1278701.00
      7      1172596.00
      8      1517116.50
      9      1108609.00
      10     1360841.00
      11     1340386.00
      12      860686.00
What I want is to select just the max of the summed column, so that the final dataframe has only the max from each year, something like:
year  month
2003  2         9195.00
2004  3        17334.00
2005  5        11950.00
... and so on
What do I have to add to my groupby aggregation to do this?
I think you need DataFrameGroupBy.idxmax:
s = df.groupby(['year', 'month'])['AMT'].sum()
out = s.loc[s.groupby(level=0).idxmax()]
# working in newer pandas versions:
# out = df.loc[df.groupby('year').idxmax()]
print(out)
year  month
2003  2         9195.0
2004  3        17334.0
2005  5        11950.0
Name: AMT, dtype: float64
If multiple max values per year are possible:
out = s[s == s.groupby(level=0).transform('max')]
print(out)
year  month
2003  2         9195.0
2004  3        17334.0
2005  5        11950.0
Name: AMT, dtype: float64
You can use GroupBy + transform with 'max'. Note this gives multiple maximums for any year where a tie exists, which may or may not be what you require.
As you have requested, it's possible to do this in 2 steps: first summing, then calculating maximums by year.
df = pd.DataFrame({'year': [2003, 2003, 2003, 2004, 2004, 2004],
                   'month': [1, 2, 2, 1, 1, 2],
                   'AMT': [100, 200, 100, 100, 300, 100]})

# STEP 1: sum by year + month
df2 = df.groupby(['year', 'month']).sum().reset_index()

# STEP 2: filter for max by year
res = df2[df2['AMT'] == df2.groupby(['year'])['AMT'].transform('max')]

print(res)
   year  month  AMT
1  2003      2  300
2  2004      1  400
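As a cross-check, the idxmax approach from the first answer gives the same rows on this toy frame (a quick sketch):

s = df.groupby(['year', 'month'])['AMT'].sum()
print(s.loc[s.groupby(level=0).idxmax()])
# year  month
# 2003  2        300
# 2004  1        400
# Name: AMT, dtype: int64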
