I have one dataframe that contains daily sales data (DF).
I have another dataframe that contains quarterly data (DF1).
This is what the quarterly dataframe DF1 looks like:
Date Computer Sale In Person Sales Net Sales
1/29/2021 1 2 3
4/30/2021 2 4 6
7/29/2021 3 6 9
1/29/2022 4 8 12
5/1/2022 5 10 15
7/30/2022 6 12 18
This is what the daily dataframe DF looks like:
Date Num of people
1/30/2021 45
1/31/2021 35
2/1/2021 25
5/1/2021 20
5/2/2021 15
The quarterly dataframe has the columns Computer Sales, In Person Sales, and Net Sales.
How do I merge these columns into the daily dataframe so that the quarterly figures appear on each daily row? I want the final result to look like this:
Date Num of people Computer Sale In Person Sales Net Sales
1/30/2021 45 1 2 3
1/31/2021 35 1 2 3
2/1/2021 25 1 2 3
5/1/2021 20 2 4 6
5/2/2021 15 2 4 6
So, for example, I want 1/30/2021 to carry the figures from the 1/29/2021 quarterly row, and once the daily dates go past 4/30/2021, the next quarterly row's figures should be used.
Please let me know if I need to be more specific.
A possible solution, using pd.merge_asof:
import pandas as pd

# df1 = quarterly data, df2 = daily data; for each daily row, take the
# most recent quarterly row on or before that date
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
pd.merge_asof(df2, df1, on='Date', direction='backward')
Output:
Date Num of people Computer Sale In Person Sales Net Sales
0 2021-01-30 45 1 2 3
1 2021-01-31 35 1 2 3
2 2021-02-01 25 1 2 3
3 2021-05-01 20 2 4 6
4 2021-05-02 15 2 4 6
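Note that pd.merge_asof requires both frames to be sorted on the merge key, so if the dates are not already in order you might sort first, e.g.:

import pandas as pd

# merge_asof expects both inputs sorted by the 'on' key
df1 = df1.sort_values('Date')
df2 = df2.sort_values('Date')
result = pd.merge_asof(df2, df1, on='Date', direction='backward')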
A B C D
1 2010 one 0 0
2 2020 one 2 4
3 2007 two 0 8
4 2010 one 8 4
5 2020 four 6 12
6 2007 three 0 14
7 2006 four 7 14
8 2010 two 10 12
I need to replace each 0 with the average of the non-zero C values for that year (column A). For example, the 2010 value would be 9. What is the best way to do this? I have over 10,000 rows.
You can use replace to change the 0's to np.nan in column C, then use fillna with a map of the yearly averages:
import numpy as np

# turn the zeros into NaN so they are ignored by the group means
df['C'] = df['C'].replace(0, np.nan)

# fill each NaN with the mean of its year's remaining values (0 if the year has none)
df['C'] = df['C'].fillna(df['A'].map(df.groupby('A')['C'].mean().fillna(0)))
print(df)
A B C D
0 2010 one 9.0 0
1 2020 one 2.0 4
2 2007 two 0.0 8
3 2010 one 8.0 4
4 2020 four 6.0 12
5 2007 three 0.0 14
6 2006 four 7.0 14
7 2010 two 10.0 12
The 2007 rows end up as 0.0 because there are no values other than 0's for that year in the initial data (the group mean is NaN, which we fill with 0).
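If you prefer to avoid the intermediate dict/map, the same idea can be written (a sketch of an equivalent approach) with groupby.transform:

import numpy as np

# treat zeros as missing, fill each gap with its year's mean, and fall back
# to 0 for years (like 2007) that have no non-zero values at all
c = df['C'].replace(0, np.nan)
df['C'] = c.fillna(c.groupby(df['A']).transform('mean')).fillna(0)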
Here is how I would approach it. The code below is pseudo-code.
1: Find the average for each year and put it in a dict (see the runnable sketch below):
my_year_dict = {'2020': xxx, '2021': xxx}
2: Use apply with a lambda to swap each 0 for its year's average:
df['C'] = df.apply(lambda row: my_year_dict[row['A']] if row['C'] == 0 else row['C'], axis=1)
Hope it can be a start!
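For step 1, a minimal runnable sketch (assuming, as in the table above, that the year lives in column A and the values in column C) could build the dict with a groupby, ignoring the zero rows so they don't drag the averages down:

# per-year mean of the non-zero C values, keyed by year
my_year_dict = df.loc[df['C'] != 0].groupby('A')['C'].mean().to_dict()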
I have an example data frame, let's call it df. I want to add more numbers to df, but I don't want to append them after the NaN's (which would mean starting at index 7); I want to start filling in from index 3.
year number letter
0 1945 10 a
1 1950 15 b
2 1955 20 c
3 1960 NaN NaN
4 1965 NaN NaN
5 1970 NaN NaN
6 1975 NaN NaN
Let's say we have a column like this:
number2
0 25
1 30
2 35
3 40
my target is to get a df like this
year number letter
0 1945 10 a
1 1950 15 b
2 1955 20 c
3 1960 25 NaN
4 1965 30 NaN
5 1970 35 NaN
6 1975 40 NaN
I hope I explained it well enough. Thank you for your support!
number2 = [25, 30, 35, 40]

# fill the NaN slots in 'number' (indices 3-6) with the new values, in order;
# this assumes len(number2) matches the number of NaN rows
df.loc[df.number.isna(), 'number'] = number2

Result df: the number column is now filled exactly as in the target frame above, while letter stays NaN for those rows.
I have a pandas dataframe for which I'm trying to compute an expanding windowed aggregation after grouping by columns. The data structure is something like this:
df = pd.DataFrame([['A',1,2015,4],['A',1,2016,5],['A',1,2017,6],['B',1,2015,10],['B',1,2016,11],['B',1,2017,12],
['A',1,2015,24],['A',1,2016,25],['A',1,2017,26],['B',1,2015,30],['B',1,2016,31],['B',1,2017,32],
['A',2,2015,4],['A',2,2016,5],['A',2,2017,6],['B',2,2015,10],['B',2,2016,11],['B',2,2017,12]],columns=['Typ','ID','Year','dat'])\
.sort_values(by=['Typ','ID','Year'])
i.e.
Typ ID Year dat
0 A 1 2015 4
6 A 1 2015 24
1 A 1 2016 5
7 A 1 2016 25
2 A 1 2017 6
8 A 1 2017 26
12 A 2 2015 4
13 A 2 2016 5
14 A 2 2017 6
3 B 1 2015 10
9 B 1 2015 30
4 B 1 2016 11
10 B 1 2016 31
5 B 1 2017 12
11 B 1 2017 32
15 B 2 2015 10
16 B 2 2016 11
17 B 2 2017 12
In general, the number of years per Typ-ID and the number of rows per Typ-ID-Year vary. I need to group this dataframe by the columns Typ and ID, then compute an expanding windowed median & std of all observations by Year. I would like to get output like this:
Typ ID Year median std
0 A 1 2015 14.0 14.14
1 A 1 2016 14.5 11.56
2 A 1 2017 15.0 10.99
3 A 2 2015 4.0 0
4 A 2 2016 4.5 0
5 A 2 2017 5.0 0
6 B 1 2015 20.0 14.14
7 B 1 2016 20.5 11.56
8 B 1 2017 21.0 10.99
9 B 2 2015 10.0 0
10 B 2 2016 10.5 0
11 B 2 2017 11.0 0
Hence, I want something like a groupby on ['Typ','ID','Year'], where the median & std for each Typ-ID-Year are computed over all rows with the same Typ-ID up to and including that Year.
How can I do this without manual iteration?
There's been no activity on this question, so I'll post the solution I found.
mn = df.groupby(by=['Typ','ID']).dat.expanding().median().reset_index().set_index('level_2')
mylast = lambda x: x.iloc[-1]
mn = mn.join(df['Year'])
mn = mn.groupby(by=['Typ','ID','Year']).agg(mylast).reset_index()
My solution follows this algorithm:
group the data, compute the windowed median, and get the original index back
with the original index back, get the year back from the original dataframe
group by the grouping columns, taking the last (in order) value for each
This gives the output desired. The same process can be followed for the standard deviation (or any other statistic desired).
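For instance, the standard deviation could be obtained with the same three steps (a sketch reusing the mylast helper defined above):

# expanding std within each Typ/ID group, restored to the original row index
sd = df.groupby(by=['Typ', 'ID']).dat.expanding().std().reset_index().set_index('level_2')
# bring the Year back from the original frame, then keep the last value per Typ/ID/Year
sd = sd.join(df['Year'])
sd = sd.groupby(by=['Typ', 'ID', 'Year']).agg(mylast).reset_index()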
The operation that I want to do is similar to a merge. For example, with an inner merge we get a data frame containing rows that are present in the first AND second data frames. With an outer merge we get a data frame with rows that are present EITHER in the first OR in the second data frame.
What I need is a data frame containing the rows that are present in the first data frame AND NOT present in the second one. Is there a fast and elegant way to do it?
Consider the following:
df_one is the first DataFrame
df_two is the second DataFrame
Present in the first DataFrame and not in the second DataFrame
Solution: by index
df = df_one[~df_one.index.isin(df_two.index)]
The index can be replaced by whichever column you wish to use for the exclusion.
In the above example, I've used the index as the reference between both DataFrames.
Additionally, you can build the same kind of exclusion with a boolean pandas.Series on a column, as sketched below.
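A minimal sketch of that column-based variant (assuming a hypothetical shared column named 'id' in both frames):

# keep the df_one rows whose id does not appear in df_two
mask = ~df_one['id'].isin(df_two['id'])
df = df_one[mask]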
How about something like the following?
print(df1)
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
3 Nets 1988 6
4 Nets 2001 8
5 Nets 2000 10
6 Heat 2004 6
7 Pacers 2003 12
print(df2)
Team Year foo
0 Pacers 2003 12
1 Heat 2004 6
2 Nets 1988 6
As long as there is a non-key, commonly named column, you can let the added suffixes do the work (if there is no non-key common column, you could create one to use temporarily ... df1['common'] = 1 and df2['common'] = 1):
new = df1.merge(df2, on=['Team','Year'], how='left')
print(new[new.foo_y.isnull()])
Team Year foo_x foo_y
0 Hawks 2001 5 NaN
1 Hawks 2004 4 NaN
2 Nets 1987 3 NaN
4 Nets 2001 8 NaN
5 Nets 2000 10 NaN
Or you can use isin, but you would have to create a single key:
df1['key'] = df1['Team'] + df1['Year'].astype(str)
df2['key'] = df2['Team'] + df2['Year'].astype(str)
print(df1[~df1.key.isin(df2.key)])
Team Year foo key
0 Hawks 2001 5 Hawks2001
2 Nets 1987 3 Nets1987
4 Nets 2001 8 Nets2001
5 Nets 2000 10 Nets2000
6 Heat 2004 6 Heat2004
7 Pacers 2003 12 Pacers2003
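If you go this route, you may want to drop the temporary key column once you have the result, e.g.:

# anti-join via the helper key, then discard the helper column
result = df1[~df1.key.isin(df2.key)].drop(columns='key')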
You could get incorrect results if your non-key column has cells with NaN.
print(df1)
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
3 Nets 1988 6
4 Nets 2001 8
5 Nets 2000 10
6 Heat 2004 6
7 Pacers 2003 12
8 Problem 2112 NaN
print(df2)
Team Year foo
0 Pacers 2003 12
1 Heat 2004 6
2 Nets 1988 6
3 Problem 2112 NaN
new = df1.merge(df2, on=['Team','Year'], how='left')
print(new[new.foo_y.isnull()])
Team Year foo_x foo_y
0 Hawks 2001 5 NaN
1 Hawks 2004 4 NaN
2 Nets 1987 3 NaN
4 Nets 2001 8 NaN
5 Nets 2000 10 NaN
6 Problem 2112 NaN NaN
The Problem team in 2112 has no value for foo in either table. So the left join here falsely reports that row, which actually matches in both DataFrames, as not being present in the right DataFrame.
Solution:
What I do is add a marker column to the right DataFrame and set a value for all of its rows. Then, after the join, you can check whether that column is NaN to find the records that exist only in the left DataFrame (here I mark both frames, which is why in_df1 appears in the merged output below).
df1['in_df1'] = 'yes'
df2['in_df2'] = 'yes'
print(df2)
Team Year foo in_df2
0 Pacers 2003 12 yes
1 Heat 2004 6 yes
2 Nets 1988 6 yes
3 Problem 2112 NaN yes
new = df1.merge(df2, on=['Team','Year'], how='left')
print(new[new.in_df2.isnull()])
Team Year foo_x foo_y in_df1 in_df2
0 Hawks 2001 5 NaN yes NaN
1 Hawks 2004 4 NaN yes NaN
2 Nets 1987 3 NaN yes NaN
4 Nets 2001 8 NaN yes NaN
5 Nets 2000 10 NaN yes NaN
NB. The problem row is now correctly filtered out, because it has a value for in_df2.
Problem 2112 NaN NaN yes yes
I suggest using the 'indicator' parameter in merge. Also, if 'on' is None, it defaults to the intersection of the columns in both DataFrames.
new = df1.merge(df2,how='left', indicator=True) # adds a new column '_merge'
new = new[(new['_merge']=='left_only')].copy() #rows only in df1 and not df2
new = new.drop(columns='_merge').copy()
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
4 Nets 2001 8
5 Nets 2000 10
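A compact variant of the same idea (a sketch), chaining the filter and the cleanup:

# left merge with an indicator, keep rows found only in df1, drop the helper column
only_in_df1 = (
    df1.merge(df2, how='left', indicator=True)
       .query("_merge == 'left_only'")
       .drop(columns='_merge')
)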
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row.
Information column is Categorical-type and takes on a value of
“left_only” for observations whose merge key only appears in ‘left’ DataFrame,
“right_only” for observations whose merge key only appears in ‘right’ DataFrame,
and “both” if the observation’s merge key is found in both.