Merge several Dataframes with outside temperature and power generation - python

I have several dataframes of heating devices, each containing one year of data at 15-minute time steps. Each df has two columns: outside_temp and heat_production. Each df looks like this:
       outside_temp  heat_production
0              11.1              200
1              11.1              150
2              11.0              245
3              11.0                0
4              11.0              300
5              10.9               49
...             ...              ...
35037          -5.1              450
35038          -5.1              450
35039          -5.1              450
35040          -5.2              600
I now want to know, across all heating devices (and therefore across all dataframes), how much heat_production is needed at which outside_temp. I was thinking about groupby or something else, but I don't know the best way to handle this amount of data. When merging the dfs directly, the problem is that the same outside temperature occurs several times and the heat production of course differs. To solve this, I could imagine taking the average heat_production of each device at a given outside_temperature. It can of course also be the case that a device never measured a specific temperature (e.g. the device is located in a warmer or colder area), so NaN values are possible.
In the end I want to fit some kind of polynomial/sigmoid function to see how much heat_production is necessary at a given outside temperature.
I want to end up with a dataframe like this:
outside_temp  heat_production_average_device_1  heat_production_average_device_2  ...etc
       -20.0                               790                               NaN
       -19.9                               789                               NaN
       -19.8                               788                               790
       -19.7                               NaN                               780
       -19.6                               770                               NaN
         ...                               ...                               ...
        19.6                                34                                 0
        19.7                                32                                 0
        19.8                                30                                 0
        19.9                                32                                 0
        20.0                                 0                                 0
Any idea what's the best way to do this?

Given:
>>> df1
outside_temp heat_production
0 11.1 200
1 11.1 150
2 11.0 245
>>> df2
outside_temp heat_production
3 11.0 0
4 11.0 300
5 10.9 49
Doing:
def my_func(i, df):
    renamer = {'heat_production': f'heat_production_average_device_{i}'}
    return (df.groupby('outside_temp')
              .mean()
              .rename(columns=renamer))

dfs = [df1, df2]
dfs = [my_func(i+1, df) for i, df in enumerate(dfs)]
df = pd.concat(dfs, axis=1)
print(df)
Output:
              heat_production_average_device_1  heat_production_average_device_2
outside_temp
10.9                                        NaN                              49.0
11.0                                      245.0                             150.0
11.1                                      175.0                               NaN
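To get to the polynomial fit mentioned in the question, one option is to average across the device columns and fit with numpy.polyfit. A minimal sketch, assuming the merged df from above (the degree of 3 is an arbitrary choice):
import numpy as np

# average heat production over all devices, ignoring NaNs
mean_heat = df.mean(axis=1).dropna()

# fit a 3rd-degree polynomial of heat_production vs. outside_temp
coeffs = np.polyfit(mean_heat.index.to_numpy(), mean_heat.to_numpy(), deg=3)
poly = np.poly1d(coeffs)

print(poly(-10.0))  # estimated heat production at -10 °C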


Reorganizing pandas dataframe turning Column into new Header, Original Header to be part of multiindex with a pre-existing Column

I have been tasked with reorganizing a fairly large data set for analysis. I want to make a dataframe where each employee has a list of Stats associated with their Employee Number ordered based on how many periods they have been with the company. The data does not go all the way back to the start of the company so some employees will not appear in the first period. My guess is there's some combination of pivot and merge that I am unable to wrap my head around.
df1 looks like this:
Periods since Start Period Employee Number Wage Sick Days
0 3 202001 101 20 14
1 2 202001 102 15 12
2 1 202001 103 10 17
3 4 202002 101 20 14
4 3 202002 102 20 10
5 2 202002 103 10 13
6 5 202003 101 25 13
7 4 202003 102 20 9
8 3 202003 103 10 13
And I want df2 (Column# for reference only):
Column1  Column2     Column3  Column4  Column5
                         101      102      103
1        Wage            NaN      NaN       10
1        Sick Days       NaN      NaN       17
2        Wage            NaN       15       10
2        Sick Days       NaN       12       13
3        Wage             20       20       10
3        Sick Days        14       10       13
4        Wage             20       20      NaN
4        Sick Days        14        9      NaN
Column1 = 'Periods since Start'
Column2 = "Stat" e.g. 'Wage', 'Sick Days'
Column3 - Column 5 Headers = 'Employee Number'
First thoughts were to try pivot/merge/stack but I have had no good results.
The second option I thought of was to create a dataframe with the index and headers that I wanted and then populate it from df1
import pandas as pd
import numpy as np
stat_list = ['Wage', 'Sick Days']
largest_period = df1['Periods since Start'].max()
df2 = np.tile(stat_list, largest_period)
df2 = pd.DataFrame(data=df2, columns = ['Stat'])
df2['Period_Number'] = df2.groupby('Stat').cumcount()+1
df2 = pd.DataFrame(index = df2[['Period_Number', 'Stat']],
                   columns = df1['Employee Number'])
Which Yields:
Employee Number 101 102 103
(1, 'Wage') NaN NaN NaN
(1, 'Sick Days') NaN NaN NaN
(2, 'Wage') NaN NaN NaN
(2, 'Sick Days') NaN NaN NaN
(3, 'Wage') NaN NaN NaN
(3, 'Sick Days') NaN NaN NaN
(4, 'Wage') NaN NaN NaN
(4, 'Sick Days') NaN NaN NaN
But I am at a loss on how to populate it.
You can .melt and then .unstack the dataframe.
Finish up with some multiindex column cleanup using .droplevel, passing axis=1 to drop the unnecessary level on the columns rather than the default axis=0, which would drop index levels. You can also use reset_index() to bring the index columns back into your dataframe:
df = (df.melt(id_vars=['Periods since Start', 'Employee Number'],
              value_vars=['Wage', 'Sick Days'])
        .set_index(['Periods since Start', 'Employee Number', 'variable'])
        .unstack(1)
        .droplevel(0, axis=1)
        .reset_index())
df
Out[1]:
Employee Number Periods since Start variable 101 102 103
0 1 Sick Days NaN NaN 17.0
1 1 Wage NaN NaN 10.0
2 2 Sick Days NaN 12.0 13.0
3 2 Wage NaN 15.0 10.0
4 3 Sick Days 14.0 10.0 13.0
5 3 Wage 20.0 20.0 10.0
6 4 Sick Days 14.0 9.0 NaN
7 4 Wage 20.0 20.0 NaN
8 5 Sick Days 13.0 NaN NaN
9 5 Wage 25.0 NaN NaN
When melting the dataframe, you can pass var_name= since the default is "variable". If you do that, make sure to change the column name when using set_index() as well.
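For example, a minimal sketch of the same chain with var_name='Stat' (the name is only illustrative):
df = (df.melt(id_vars=['Periods since Start', 'Employee Number'],
              value_vars=['Wage', 'Sick Days'],
              var_name='Stat')
        .set_index(['Periods since Start', 'Employee Number', 'Stat'])
        .unstack(1)
        .droplevel(0, axis=1)
        .reset_index())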
Try this: first melt the dataframe, keeping Periods since Start, Employee Number, and Period as identifier columns. Next, pivot the dataframe, making rows and columns with the 'value' column from melt as the values of the pivoted dataframe. Lastly, clean up the index with reset_index and remove the column index header name using rename_axis:
df.melt(['Periods since Start', 'Employee Number', 'Period'])\
  .pivot(['Periods since Start', 'variable'], 'Employee Number', 'value')\
  .reset_index()\
  .rename_axis(None, axis=1)
Output:
Periods since Start variable 101 102 103
0 1 Sick Days NaN NaN 17.0
1 1 Wage NaN NaN 10.0
2 2 Sick Days NaN 12.0 13.0
3 2 Wage NaN 15.0 10.0
4 3 Sick Days 14.0 10.0 13.0
5 3 Wage 20.0 20.0 10.0
6 4 Sick Days 14.0 9.0 NaN
7 4 Wage 20.0 20.0 NaN
8 5 Sick Days 13.0 NaN NaN
9 5 Wage 25.0 NaN NaN
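Note that newer pandas versions warn about (or disallow) positional arguments to DataFrame.pivot, so the keyword form is safer; assuming a recent pandas, the pivot step would read:
df.melt(['Periods since Start', 'Employee Number', 'Period'])\
  .pivot(index=['Periods since Start', 'variable'],
         columns='Employee Number',
         values='value')\
  .reset_index()\
  .rename_axis(None, axis=1)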

How do you separate a pandas dataframe by year in python?

I am trying to make a graph that shows the average temperature each day over a year by averaging 19 years of NOAA data (side note, is there any better way to get historical weather data because the NOAA's seems super inconsistent). I was wondering what the best way to set up the data would be. The relevant columns of my data look like this:
DATE PRCP TAVG TMAX TMIN TOBS
0 1990-01-01 17.0 NaN 13.3 8.3 10.0
1 1990-01-02 0.0 NaN NaN NaN NaN
2 1990-01-03 0.0 NaN 13.3 2.8 10.0
3 1990-01-04 0.0 NaN 14.4 2.8 10.0
4 1990-01-05 0.0 NaN 14.4 2.8 11.1
... ... ... ... ... ... ...
10838 2019-12-27 0.0 NaN 15.0 4.4 13.3
10839 2019-12-28 0.0 NaN 14.4 5.0 13.9
10840 2019-12-29 3.6 NaN 15.0 5.6 14.4
10841 2019-12-30 0.0 NaN 14.4 6.7 12.2
10842 2019-12-31 0.0 NaN 15.0 6.7 13.9
10843 rows × 6 columns
The DATE column is the datetime64[ns] type
Here's my code:
import pandas as pd
from matplotlib import pyplot as plt
data = pd.read_csv('1990-2019.csv')
# separate the data by station
oceanside = data[data.STATION == 'USC00047767']
downtown = data[data.STATION == 'USW00023272']
oceanside.loc[:,'DATE'] = pd.to_datetime(oceanside.loc[:,'DATE'],format='%Y-%m-%d')
#This is the area I need help with:
oceanside['DATE'].dt.year
I've been trying to separate the data by year, so I can then average it. I would like to do this without using a for loop because I plan on doing this with much larger data sets and that would be super inefficient. I looked in the pandas documentation but I couldn't find a function that seemed like it would do that. Am I missing something? Is that even the right way to do it?
I am new to pandas/python data analysis so it is very possible the answer is staring me in the face.
Any help would be greatly appreciated!
Create a dict of dataframes where each key is a year
df_by_year = dict()
for year in oceanside.DATE.dt.year.unique():
    data = oceanside[oceanside.DATE.dt.year == year]
    df_by_year[year] = data
Get data by a single year
oceanside[oceanside.DATE.dt.year == 2019]
Get average for each year
oceanside.groupby(oceanside.DATE.dt.year).mean()
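Since the goal in the question is the average for each calendar day across all years, a similar groupby on month and day (rather than year) should get there. A minimal sketch, assuming DATE is already datetime:
# mean temperature per calendar day, averaged over all years
daily_avg = oceanside.groupby([oceanside.DATE.dt.month,
                               oceanside.DATE.dt.day])[['TMAX', 'TMIN']].mean()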

Handling complex data transformations with pivot_table in Python

I am working with a pandas DataFrame with a shape of 7837 rows and 19 columns. I am interested in getting the number of times a product_id appears per month (the Date column), together with the associated amount, because a product_id can have various amounts. So I am looking for a way to say, for example, product_id 1921 with amount 59 appeared ....
Here is a small version of the pandas dataframe:
print(df)
CompanyName Produktname product_id amount Date
0 companyA productA 1921 59.0 Jan-2020
1 companyB productB 114 NaN May-2020
2 companyC productC 469 NaN Feb-2020
3 companyD productD 569 18.0 Jun-2020
4 companyE productE 569 18.0 March-2020
I think pivot_table might be helpful. I wanted to first see how many times each product_id appeared with the date as the column
pd.pivot_table(df, index="product_id", values= "product_id" ,columns="Date", aggfunc="count")
but I get an error:
ValueError: Grouper for 'product_id' not 1-dimensional
Is there a way around this or a more efficient way to handle this?
IIUC use:
df = df.pivot_table(index="product_id", values= "amount" ,columns="Date", aggfunc="count")
print (df)
Date Feb-2020 Jan-2020 Jun-2020 March-2020 May-2020
product_id
114 NaN NaN NaN NaN 0.0
469 0.0 NaN NaN NaN NaN
569 NaN NaN 1.0 1.0 NaN
1921 NaN 1.0 NaN NaN NaN
For the correct column order it is possible to parse the dates first and rename afterwards (note that %b expects abbreviated month names such as Mar-2020, so a value like March-2020 would need a different format):
df['Date'] = pd.to_datetime(df['Date'], format='%b-%Y')
df = df.pivot_table(index="product_id",
                    values="amount",
                    columns="Date",
                    aggfunc="count",
                    fill_value=0).rename(columns=lambda x: x.strftime('%b-%Y'))
print (df)
Date Jan-2020 Feb-2020 Mar-2020 May-2020 Jun-2020
product_id
114 0 0 0 0 0
469 0 0 0 0 0
569 0 0 1 0 1
1921 1 0 0 0 0
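If the goal is only to count how often each product_id occurs per month, regardless of whether amount is NaN, pd.crosstab is an alternative worth considering, and the total amount per product and month can be obtained with a second pivot_table call. A sketch, not from the original answer:
# count every row, even those where amount is NaN
counts = pd.crosstab(df['product_id'], df['Date'])

# total amount per product and month
totals = df.pivot_table(index="product_id", columns="Date",
                        values="amount", aggfunc="sum")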

Insert Value to a dataframe column

I have a pandas dataframe
0 1 2 3
0 173.0 147.0 161 162.0
1 NaN NaN 23 NaN
I just want to add a value to a column, such as
3
0 161
1 23
2 181
But I can't use the approach of loc and iloc, because the file can contain columns of any length and I will not know the loc and iloc values. Hence I just want to add a value to a column. Thanks in advance.
I believe you need setting with enlargement:
df.loc[len(df.index), 2] = 181
print (df)
0 1 2 3
0 173.0 147.0 161.0 162.0
1 NaN NaN 23.0 NaN
2 NaN NaN 181.0 NaN
If that 2x4 dataframe is your original dataframe, you can add an extra row to it with pandas.concat().
For example:
pandas.concat([your_original_dataframe, pandas.DataFrame([[181]], columns=[2])], axis=0)
This will add 181 at the bottom of column 2
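If you also want a continuous 0..n index rather than the appended row keeping index 0, passing ignore_index=True should do it, e.g.:
pandas.concat([your_original_dataframe,
               pandas.DataFrame([[181]], columns=[2])],
              axis=0, ignore_index=True)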

Avoiding duplicate data in a dataframe concat/merge/join

I am trying to concat 2 DataFrames, but .join is creating an unwanted duplicate.
df_ask:
timestamp price volume
1520259290 10.5 100
1520259275 10.6 2000
1520259275 10.55 200
df_bid:
timestamp price volume
1520259290 10.25 500
1520259280 10.2 300
1520259275 10.1 400
I tried:
depth = pd.concat([df_ask,df_bid], axis=1, keys=['Ask Orders','Bid Orders'])
but that returns an error which I do understand ("concat failed Reindexing only valid with uniquely valued Index objects")
and I tried:
df_ask.join(df_bid, how='outer', lsuffix='_ask', rsuffix='_bid')
Which gives no error, but gives the following dataframe:
timestamp price_ask volume_ask price_bid volume_bid
1520259290 10.5 100 10.25 500
1520259280 NaN NaN 10.2 300
1520259275 10.6 2000 10.1 400
1520259275 10.55 200 10.1 400
My problem is the repeated 10.1 and 400 at timestamp 1520259275. They weren't in the original df_bid dataframe twice and should only be in this df once. Having two rows of the same timestamp is correct as there are two ask rows at this time, however there should only be one bid information row associated with this timestamp. The other should be NaN.
ie What I'm looking for is this:
timestamp price_ask volume_ask price_bid volume_bid
1520259290 10.5 100 10.25 500
1520259280 NaN NaN 10.2 300
1520259275 10.6 2000 10.1 400
1520259275 10.55 200 NaN NaN
I've looked through the merge/join/concat documentation and this question but I can't find what I'm looking for. Thanks in advance
You are implicitly assuming that the first occurrence of an index value in one dataframe should be aligned with the first occurrence of that value in the other. In that case, use groupby + cumcount to establish an ordering within each unique index value.
df_ask = df_ask.set_index(df_ask.groupby('timestamp').cumcount(), append=True)
df_bid = df_bid.set_index(df_bid.groupby('timestamp').cumcount(), append=True)
df_ask.join(df_bid, how='outer', lsuffix='_ask', rsuffix='_bid')
price_ask volume_ask price_bid volume_bid
timestamp
1520259275 0 10.60 2000.0 10.10 400.0
1 10.55 200.0 NaN NaN
1520259280 0 NaN NaN 10.20 300.0
1520259290 0 10.50 100.0 10.25 500.0
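If you want the timestamp back as a regular column without the helper counter level, it can be dropped again afterwards. A small follow-up sketch:
result = df_ask.join(df_bid, how='outer', lsuffix='_ask', rsuffix='_bid')
# drop the cumcount level and restore timestamp as a column
result = result.droplevel(1).reset_index()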
