Avoiding duplicate data in a dataframe concat/merge/join

I am trying to concat 2 DataFrames, but .join is creating an unwanted duplicate.
df_ask:
timestamp price volume
1520259290 10.5 100
1520259275 10.6 2000
1520259275 10.55 200
df_bid:
timestamp price volume
1520259290 10.25 500
1520259280 10.2 300
1520259275 10.1 400
I tried:
depth = pd.concat([df_ask,df_bid], axis=1, keys=['Ask Orders','Bid Orders'])
but that returns an error which I do understand ("concat failed Reindexing only valid with uniquely valued Index objects")
and I tried:
df_ask.join(df_bid, how='outer', lsuffix='_ask', rsuffix='_bid')
Which gives no error, but gives the following dataframe:
timestamp price_ask volume_ask price_bid volume_bid
1520259290 10.5 100 10.25 500
1520259280 NaN NaN 10.2 300
1520259275 10.6 2000 10.1 400
1520259275 10.55 200 10.1 400
My problem is the repeated 10.1 and 400 at timestamp 1520259275. They weren't in the original df_bid dataframe twice and should only be in this df once. Having two rows of the same timestamp is correct as there are two ask rows at this time, however there should only be one bid information row associated with this timestamp. The other should be NaN.
In other words, what I'm looking for is this:
timestamp price_ask volume_ask price_bid volume_bid
1520259290 10.5 100 10.25 500
1520259280 NaN NaN 10.2 300
1520259275 10.6 2000 10.1 400
1520259275 10.55 200 NaN NaN
I've looked through the merge/join/concat documentation and this question but I can't find what I'm looking for. Thanks in advance

You are implicitly assuming that the first instance of an index should be aligned with the other first instance of an index. In that case, use groupby + cumcount to establish an ordering of each unique index.
df_ask = df_ask.set_index(df_ask.groupby('timestamp').cumcount(), append=True)
df_bid = df_bid.set_index(df_bid.groupby('timestamp').cumcount(), append=True)
df_ask.join(df_bid, how='outer', lsuffix='_ask', rsuffix='_bid')
              price_ask  volume_ask  price_bid  volume_bid
timestamp
1520259275 0      10.60      2000.0      10.10       400.0
           1      10.55       200.0        NaN         NaN
1520259280 0        NaN         NaN      10.20       300.0
1520259290 0      10.50       100.0      10.25       500.0
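For reference, here is a self-contained sketch of the same groupby + cumcount idea, built from the sample data in the question. It assumes timestamp is the index of both frames. The helper add_occurrence_level and the level name "occurrence" are made-up names for illustration, not part of pandas.

```python
import pandas as pd

# Sample data from the question, with timestamp as the index.
df_ask = pd.DataFrame(
    {"timestamp": [1520259290, 1520259275, 1520259275],
     "price": [10.5, 10.6, 10.55],
     "volume": [100, 2000, 200]}
).set_index("timestamp")
df_bid = pd.DataFrame(
    {"timestamp": [1520259290, 1520259280, 1520259275],
     "price": [10.25, 10.2, 10.1],
     "volume": [500, 300, 400]}
).set_index("timestamp")


def add_occurrence_level(df):
    """Append a 0, 1, 2, ... counter per timestamp as a second index level.

    Hypothetical helper: the function and level names are illustrative only.
    """
    occurrence = df.groupby(level="timestamp").cumcount().rename("occurrence")
    return df.set_index(occurrence, append=True)


# Joining on (timestamp, occurrence) pairs the n-th ask with the n-th bid
# of each timestamp, so a single bid row is no longer repeated.
depth = add_occurrence_level(df_ask).join(
    add_occurrence_level(df_bid), how="outer", lsuffix="_ask", rsuffix="_bid"
)
print(depth)
```

If the extra level is not wanted afterwards, depth.reset_index(level="occurrence", drop=True) should bring back the plain timestamp index.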

Related

Get the monthly average by dividing the monthly sum by its row count

I have the following dataframe:
created_time shares_count
2021-07-01 250.0
2021-07-31 501.0
2021-08-02 48.0
2021-08-05 300.0
2021-08-07 200.0
2021-09-06 28.0
2021-09-08 100.0
2021-09-25 100.0
2021-09-30 200.0
I grouped it by month like this:
df_groupby_monthly = df.groupby(pd.Grouper(key='created_time',freq='M')).sum()
df_groupby_monthly
Now, how do I get the average of shares_count for each month, i.e. the monthly sum divided by the number of rows in that month?
For example: month 07 has 2 rows, so the average should be 751.0/2 = 375.5; month 08 has 3 rows, so 548.0/3 = 182.666; month 09 has 4 rows, so 428.0/4 = 107.0.
The final output should look like this:
created_time shares_count
2021-07-31 375.5
2021-08-31 182.666
2021-09-30 107.0
I have tried the following:
df.groupby(pd.Grouper(key='created_time',freq='M')).apply(lambda x: x['shares_count'].sum()/len(x))
This works for a single column, but it is hard to extend to multiple columns.

df['created_time'] = pd.to_datetime(df['created_time'])
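# select the value columns explicitly so the datetime column itself is not averaged as well
output = df.groupby(df['created_time'].dt.to_period('M'))[['shares_count']].mean().round(2).reset_index()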
output
###
created_time shares_count
0 2021-07 375.50
1 2021-08 182.67
2 2021-09 107.00

Use this code:
df=df.groupby(pd.Grouper(key='created_time',freq='M')).agg({'shares_count':['sum', 'count']}).reset_index()
df['ss']=df[('shares_count','sum')]/df[('shares_count','count')]
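Since sum divided by count is just the mean, the multi-column case can probably be handled by selecting several columns before calling mean. The sketch below uses the question's data. The second column, likes_count, is a hypothetical addition that only shows several columns being averaged at once.

```python
import pandas as pd

df = pd.DataFrame({
    "created_time": pd.to_datetime([
        "2021-07-01", "2021-07-31", "2021-08-02", "2021-08-05", "2021-08-07",
        "2021-09-06", "2021-09-08", "2021-09-25", "2021-09-30",
    ]),
    "shares_count": [250.0, 501.0, 48.0, 300.0, 200.0, 28.0, 100.0, 100.0, 200.0],
})
# Hypothetical second column, not in the original question.
df["likes_count"] = df["shares_count"] * 2

# Group by calendar month and average every selected column in one call;
# mean() already divides each monthly sum by that month's row count.
value_columns = ["shares_count", "likes_count"]
monthly_mean = df.groupby(df["created_time"].dt.to_period("M"))[value_columns].mean()
print(monthly_mean.round(2))
```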

Merge several Dataframes with outside temperature and power generation

I have several dataframes for heating devices, each containing one year of data. One time step is 15 min, and each df has two columns: outside_temp and heat_production. Each df looks like this:
outside_temp heat_production
0 11.1 200
1 11.1 150
2 11.0 245
3 11.0 0
4 11.0 300
5 10.9 49
6
.
.
.
35037 -5.1 450
35038 -5.1 450
35039 -5.1 450
35040 -5.2 600
I now want to know how much heat_production I need at which outside_temp across all heating devices (and therefore across all dataframes). I was thinking about groupby or something similar, but I don't know the best way to handle this amount of data. When merging the dfs directly, the problem is that each outside temperature occurs several times, and the heat production of course differs. To solve this, I could imagine taking the average heat_production for each device at a given outside_temp. It can of course also happen that a device never measured a specific temperature (e.g. the device is located in a warmer or colder area), so NaN values are possible.
In the end I want to fit some kind of polynomial/sigmoid function to see how much heat_production is necessary at a given outside temperature.
I want to end up with a dataframe like this:
outside_temp heat_production_average_device_1 heat_production_average_device_2 ...etc
-20.0 790 NaN
-19.9 789 NaN
-19.8 788 790
-19.7 NaN 780
-19.6 770 NaN
.
.
.
19.6 34 0
19.7 32 0
19.8 30 0
19.9 32 0
20.0 0 0
Any idea what the best way to do this is?

Given:
>>> df1
outside_temp heat_production
0 11.1 200
1 11.1 150
2 11.0 245
>>> df2
outside_temp heat_production
3 11.0 0
4 11.0 300
5 10.9 49
Doing:
def my_func(i, df):
    renamer = {'heat_production': f'heat_production_average_device_{i}'}
    return (df.groupby('outside_temp')
              .mean()
              .rename(columns=renamer))
dfs = [df1, df2]
dfs = [my_func(i+1, df) for i, df in enumerate(dfs)]
df = pd.concat(dfs, axis=1)
print(df)
Output:
heat_production_average_device_1 heat_production_average_device_2
outside_temp
11.0 245.0 150.0
11.1 175.0 NaN
10.9 NaN 49.0

Pandas Turning multiple rows with different types into 1 row with multiple columns for each type

Given the following df,
Account contract_date type item_id quantity price tax net_amount
ABC123 2020-06-17 P 1409 1000 0.355 10 400
ABC123 2020-06-17 S 1409 2000 0.053 15 150
ABC123 2020-06-17 C 1409 500 0.25 5 180
ABC123 2020-06-17 S 1370 5000 0.17 30 900
DEF456 2020-06-18 P 7214 3000 0.1793 20 600
I would like to reshape df, grouped by Account, contract_date and item_id, and split the values of the different types into separate columns. I can do this with a for loop/apply, but I am looking for a groupby, pivot, or other vectorized/pythonic solution. The intended result is as follows:
Account contract_date item_id quantity_P quantity_S quantity_C price_P price_S price_C tax_P tax_S tax_C net_amount_P net_amount_S net_amount_C
ABC123 2020-06-17 1409 1000 2000 500 0.355 0.053 0.25 10 15 5 400 150 180
ABC123 2020-06-17 1370 0 5000 0 0 0.17 0 0 30 0 0 900 0
DEF456 2020-06-18 7214 3000 0 0 0.1793 0 0 20 0 0 600 0 0
*Although it looks a bit off for the alignment, you may copy the df and use df = pd.read_clipboard() to read the table. Appreciate your help. Thank you.
Edit: The error I am getting using df.pivot(index=['Account', 'contract_date', 'item_id'], columns=['type'])

Use df.pivot:
In [1660]: df.pivot(index=['Account', 'contract_date', 'item_id'], columns=['type'])
Out[1660]:
                             quantity                   price                     tax                  net_amount
type                                C       P       S       C       P       S       C       P       S           C       P       S
Account contract_date item_id
ABC123  2020-06-17    1370        NaN     NaN  5000.0     NaN     NaN   0.170     NaN     NaN    30.0         NaN     NaN   900.0
                      1409      500.0  1000.0  2000.0    0.25  0.3550   0.053     5.0    10.0    15.0       180.0   400.0   150.0
DEF456  2020-06-18    7214        NaN  3000.0     NaN     NaN  0.1793     NaN     NaN    20.0     NaN         NaN   600.0     NaN
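To get from the pivoted frame to the flat layout in the question (quantity_P, quantity_S, ... with 0 instead of NaN), one possible follow-up is to fill the gaps and join the two column levels. This is a sketch that assumes pandas 1.1 or newer, where pivot accepts a list for index. The variable name wide is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "Account": ["ABC123", "ABC123", "ABC123", "ABC123", "DEF456"],
    "contract_date": ["2020-06-17"] * 4 + ["2020-06-18"],
    "type": ["P", "S", "C", "S", "P"],
    "item_id": [1409, 1409, 1409, 1370, 7214],
    "quantity": [1000, 2000, 500, 5000, 3000],
    "price": [0.355, 0.053, 0.25, 0.17, 0.1793],
    "tax": [10, 15, 5, 30, 20],
    "net_amount": [400, 150, 180, 900, 600],
})

# One row per (Account, contract_date, item_id), one column per (value, type).
wide = df.pivot(index=["Account", "contract_date", "item_id"], columns="type")

# Missing type/value combinations become 0, as in the intended result.
wide = wide.fillna(0)

# Collapse the two column levels into names such as "quantity_P".
wide.columns = [f"{value}_{kind}" for value, kind in wide.columns]
wide = wide.reset_index()
print(wide)
```

The columns come out grouped as C, P, S per value. To get the exact P, S, C order shown in the question, the frame can be reindexed with an explicit column list.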

How do you separate a pandas dataframe by year in python?

I am trying to make a graph that shows the average temperature each day over a year by averaging 19 years of NOAA data (side note, is there any better way to get historical weather data because the NOAA's seems super inconsistent). I was wondering what the best way to set up the data would be. The relevant columns of my data look like this:
DATE PRCP TAVG TMAX TMIN TOBS
0 1990-01-01 17.0 NaN 13.3 8.3 10.0
1 1990-01-02 0.0 NaN NaN NaN NaN
2 1990-01-03 0.0 NaN 13.3 2.8 10.0
3 1990-01-04 0.0 NaN 14.4 2.8 10.0
4 1990-01-05 0.0 NaN 14.4 2.8 11.1
... ... ... ... ... ... ...
10838 2019-12-27 0.0 NaN 15.0 4.4 13.3
10839 2019-12-28 0.0 NaN 14.4 5.0 13.9
10840 2019-12-29 3.6 NaN 15.0 5.6 14.4
10841 2019-12-30 0.0 NaN 14.4 6.7 12.2
10842 2019-12-31 0.0 NaN 15.0 6.7 13.9
10843 rows × 6 columns
The DATE column is the datetime64[ns] type
Here's my code:
import pandas as pd
from matplotlib import pyplot as plt
data = pd.read_csv('1990-2019.csv')
#separate the data by station
oceanside = data[data.STATION == 'USC00047767']
downtown = data[data.STATION == 'USW00023272']
oceanside.loc[:,'DATE'] = pd.to_datetime(oceanside.loc[:,'DATE'],format='%Y-%m-%d')
#This is the area I need help with:
oceanside['DATE'].dt.year

I've been trying to separate the data by year, so I can then average it. I would like to do this without using a for loop because I plan on doing this with much larger data sets and that would be super inefficient. I looked in the pandas documentation but I couldn't find a function that seemed like it would do that. Am I missing something? Is that even the right way to do it?
I am new to pandas/python data analysis so it is very possible the answer is staring me in the face.
Any help would be greatly appreciated!

Create a dict of dataframes where each key is a year:
df_by_year = dict()
for year in oceanside['DATE'].dt.year.unique():
    df_by_year[year] = oceanside[oceanside['DATE'].dt.year == year]
Get the data for a single year:
oceanside[oceanside['DATE'].dt.year == 2019]
Get the average for each year (numeric columns only, since STATION is a string column):
oceanside.groupby(oceanside['DATE'].dt.year).mean(numeric_only=True)
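If the goal is the plot described in the question (one average value per calendar day, taken over all years), grouping by month and day instead of by year may be closer to what is needed. The sketch below uses made-up data in place of the NOAA file. Only the DATE and TMAX column names come from the question.

```python
import pandas as pd

# Made-up stand-in for the NOAA data: three years of daily rows where TMAX
# simply equals (year - 1990), so the expected averages are easy to check.
dates = pd.date_range("1990-01-01", "1992-12-31", freq="D")
oceanside = pd.DataFrame(
    {"DATE": dates, "TMAX": (dates.year - 1990).astype(float)}
)

# Group by calendar month and day rather than day-of-year, so that
# 1 March lands in the same group in leap years and non-leap years.
month = oceanside["DATE"].dt.month.rename("month")
day = oceanside["DATE"].dt.day.rename("day")
daily_avg = oceanside.groupby([month, day])["TMAX"].mean()
print(daily_avg.head())
```

Note that 29 February only has data from leap years, so its average is based on fewer rows than the other days.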

Add Missing Date Index in a multiindex dataframe

I am working with a multi index data frame that has a date column and location_id as indices.
index_1 = ['2020-01-01', '2020-01-03', '2020-01-04']
index_2 = [100,200,300]
index = pd.MultiIndex.from_product([index_1,
index_2], names=['Date', 'location_id'])
df = pd.DataFrame(np.random.randint(10,100,9), index)
df
0
Date location_id
2020-01-01 100 19
200 75
300 39
2020-01-03 100 11
200 91
300 80
2020-01-04 100 36
200 56
300 54
I want to fill in missing dates, with just one location_id and fill it with 0:
0
Date location_id
2020-01-01 100 19
200 75
300 39
2020-01-02 100 0
2020-01-03 100 11
200 91
300 80
2020-01-04 100 36
200 56
300 54
How can I achieve that? This is helpful but only if my data frame was not multi indexed.

You can get the unique values of the Date index level, generate all dates between the min and max with pd.date_range, and use difference with the unique Date values to get the missing ones. Then reindex df with the union of the original index and a MultiIndex.from_product made of the missing dates and the min of the location_id level.
#unique dates
m = df.index.unique(level=0)
# reindex
df = df.reindex(df.index.union(
pd.MultiIndex.from_product([pd.date_range(m.min(), m.max())
.difference(pd.to_datetime(m))
.strftime('%Y-%m-%d'),
[df.index.get_level_values(1).min()]])),
fill_value=0)
print(df)
0
2020-01-01 100 91
200 49
300 19
2020-01-02 100 0
2020-01-03 100 41
200 25
300 51
2020-01-04 100 44
200 40
300 54
Instead of pd.MultiIndex.from_product, you can also use product from itertools. Same result, but maybe faster.
from itertools import product
df = df.reindex(df.index.union(
list(product(pd.date_range(m.min(), m.max())
.difference(pd.to_datetime(m))
.strftime('%Y-%m-%d'),
[df.index.get_level_values(1).min()]))),
fill_value=0)
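The same steps, split into named intermediate variables, might be easier to follow. This is only a sketch: it uses fixed numbers instead of random ones, and the column name value and the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

dates = ["2020-01-01", "2020-01-03", "2020-01-04"]
locations = [100, 200, 300]
index = pd.MultiIndex.from_product([dates, locations], names=["Date", "location_id"])
df = pd.DataFrame({"value": np.arange(10, 19)}, index=index)

# 1. Dates that are present, and the dates missing between the first and last.
present = pd.to_datetime(df.index.unique(level="Date"))
missing = pd.date_range(present.min(), present.max()).difference(present)

# 2. Pair every missing date with the smallest location_id only.
first_location = df.index.get_level_values("location_id").min()
extra = pd.MultiIndex.from_product(
    [missing.strftime("%Y-%m-%d"), [first_location]], names=df.index.names
)

# 3. Reindex on the combined index; the new rows get 0.
filled = df.reindex(df.index.union(extra), fill_value=0)
print(filled)
```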

A pandas index is immutable, so you need to construct a new one. Move the location_id index level to a column, keep the unique rows, and call asfreq to create rows for the missing dates. Assign the result to df2. Finally, use df.align to join both indices, then fillna.
df1 = df.reset_index(-1)
df2 = df1.loc[~df1.index.duplicated()].asfreq('D').ffill()
df_final = df.align(df2.set_index('location_id', append=True))[0].fillna(0)
Out[75]:
0
Date location_id
2020-01-01 100 19.0
200 75.0
300 39.0
2020-01-02 100 0.0
2020-01-03 100 11.0
200 91.0
300 80.0
2020-01-04 100 36.0
200 56.0
300 54.0

unstack/stack and asfreq/reindex would work:
new_df = df.unstack(fill_value=0)
new_df.index = pd.to_datetime(new_df.index)
new_df.asfreq('D').fillna(0).stack('location_id')
Output:
0
Date location_id
2020-01-01 100 78.0
200 25.0
300 89.0
2020-01-02 100 0.0
200 0.0
300 0.0
2020-01-03 100 79.0
200 23.0
300 11.0
2020-01-04 100 30.0
200 79.0
300 72.0
