Shift DataFrame and fill up NaN - python

I want to create a DataFrame which includes the hourly heat demand of n consumers (here n=5) for 10 hours.
--> DataFrame called "Village", n columns (each representing a consumer) and 10 rows (10 hours)
All consumers follow the same demand profile, with the only difference that it is shifted by a random amount of hours. The random number follows a normal distribution.
I managed to create a list of discrete numbers that follow a normal distribution, and I managed to create a DataFrame with n columns where the same demand profile gets shifted by that random number.
The problem I can't solve is that NaN appears instead of filling up the shifted hours with the values that were cut off because of the shift.
Example: if the demand profile gets shifted by 1 hour (like consumer 5, for example), "NaN" now appears as the demand in the first hour. Instead of "NaN" I would like the value of the 10th hour of the original demand profile to appear (4755.005240). So instead of shifting the values of the demand profile, I kind of want it more to "rotate".
heat_demand
0 1896.107462
1 1964.878199
2 2072.946499
3 2397.151402
4 3340.292937
5 4912.195496
6 6159.893152
7 5649.024821
8 5157.805271
9 4755.005240
Consumer 1 Consumer 2 Consumer 3 Consumer 4 Consumer 5
0 1896.107462 NaN 1964.878199 NaN NaN
1 1964.878199 NaN 2072.946499 NaN 1896.107462
2 2072.946499 NaN 2397.151402 NaN 1964.878199
3 2397.151402 1896.107462 3340.292937 1896.107462 2072.946499
4 3340.292937 1964.878199 4912.195496 1964.878199 2397.151402
5 4912.195496 2072.946499 6159.893152 2072.946499 3340.292937
6 6159.893152 2397.151402 5649.024821 2397.151402 4912.195496
7 5649.024821 3340.292937 5157.805271 3340.292937 6159.893152
8 5157.805271 4912.195496 4755.005240 4912.195496 5649.024821
9 4755.005240 6159.893152 NaN 6159.893152 5157.805271
Could someone give me a hint on how to solve this problem? Thanks a lot in advance and kind regards
Luise
import numpy as np
import pandas as pd
import os

path = os.path.dirname(os.path.abspath(__file__))

# Create a list with discrete numbers following a normal distribution
n = 5
timeshift_1h = np.random.normal(loc=0.1085, scale=1.43825, size=n)
timeshift_1h = np.round(timeshift_1h).astype(int)
print("Time Shift in h:", timeshift_1h)

# Read the Standard Load Profile
cols = ["heat_demand"]
df_StandardLoadProfile = pd.read_excel(os.path.join(path, '10_h_example.xlsx'), usecols=cols)
print(df_StandardLoadProfile)

# Create a df for n consumers, whose demand equals a shifted StandardLoadProfile.
# It is shifted by a random amount of hours, taken from the list timeshift_1h
Village = pd.DataFrame()
for i in range(1, n + 1):
    a = timeshift_1h[i - 1]
    name = "Consumer {}".format(i)
    Village[name] = df_StandardLoadProfile["heat_demand"].shift(a)
print(Village)

There's a very nice numpy function for that use case, namely np.roll (see here for the documentation). It takes an array and shifts it by the number of steps specified with shift.
For your example, this could look like the following:
import pandas as pd
import numpy as np
df = pd.read_csv("demand.csv")
df['Consumer 1'] = np.roll(df["heat_demand"], shift=1)
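Applied to the Village loop from the question, a minimal sketch (using an inline stand-in for the profile from 10_h_example.xlsx, which isn't available here):

```python
import numpy as np
import pandas as pd

# Stand-in for the heat_demand column read from the Excel file
demand = np.array([1896.107462, 1964.878199, 2072.946499, 2397.151402,
                   3340.292937, 4912.195496, 6159.893152, 5649.024821,
                   5157.805271, 4755.005240])

n = 5
timeshift_1h = np.round(np.random.normal(loc=0.1085, scale=1.43825, size=n)).astype(int)

# np.roll rotates instead of shifting: values pushed off one end
# reappear at the other end, so no NaN is introduced.
Village = pd.DataFrame({
    "Consumer {}".format(i + 1): np.roll(demand, timeshift_1h[i])
    for i in range(n)
})
print(Village)
```

Each column is then a pure rotation of the original profile, whatever the drawn shift is.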

You could fill the NaN values from the reversed column (note that for shifts greater than 1 this fills the gap in reverse order compared with a true rotation) -
df = pd.DataFrame(np.arange(10))
df
# 0
#0 0
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
#7 7
#8 8
#9 9
df[0].shift(3).fillna(pd.Series(reversed(df[0])))
#0 9.0
#1 8.0
#2 7.0
#3 0.0
#4 1.0
#5 2.0
#6 3.0
#7 4.0
#8 5.0
#9 6.0

Related

Preserving id columns in dataframe after applying assign and groupby

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
 # .drop(columns = 'gestationalAgeInWeeks')               # don't need this
 .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc'] # select the value column here
 .max()
 .unstack()
 .add_prefix('abdomCirc_')  # prefix the new tm columns
 .reset_index()             # and bring the ids back as columns here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .pivot_table(index=['MotherID', 'PregnancyID'], columns='tm',
                values='abdomCirc', aggfunc='max')
   .add_prefix('abdomCirc_')  # remove this if you don't want the prefix
   .reset_index()
)
Output:
tm MotherID PregnancyID abdomCirc_1 abdomCirc_2 abdomCirc_3
0 abdomCirc_0 abdomCirc_0 NaN 200.0 NaN
1 abdomCirc_1 abdomCirc_1 NaN 315.0 350.0
2 abdomCirc_2 abdomCirc_2 180.0 NaN NaN
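As a self-contained check, the pivot_table approach run end to end on the question's sample data (a sketch; after reset_index the id columns survive as regular columns, so they come along in df.info() and to_csv):

```python
import pandas as pd
import numpy as np

# The sample measurements from the question
df = pd.DataFrame({
    'PregnancyID': [0, 0, 1, 1, 1, 2, 2, 2],
    'MotherID':    [0, 0, 1, 1, 1, 2, 2, 2],
    'gestationalAgeInWeeks': [14, 21, 20, 25, 30, 8, 9, 18],
    'abdomCirc': [150, 200, 294, 315, 350, 170, 180, np.nan],
})

# Bin weeks into trimesters, take the max per trimester, keep ids as columns
out = (df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
         .pivot_table(index=['MotherID', 'PregnancyID'], columns='tm',
                      values='abdomCirc', aggfunc='max')
         .add_prefix('abdomCirc_')
         .reset_index())
print(out)
```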

Iterating through rows of a dataframe to fill column in python

I have the following dataframe (called data_coh):
roi mag phase coherence
0 1 0.699883 0.0555903 NaN
1 2 0.640482 0.1053 NaN
2 3 0.477865 1.14926 NaN
3 4 0.128119 2.28403 NaN
4 5 0.563046 2.53091 NaN
5 6 0.58869 0.94647 NaN
6 7 0.428383 1.13915 NaN
7 8 0.164036 1.95959 NaN
8 9 0.27912 3.07456 NaN
9 10 0.244237 2.78111 NaN
10 11 0.696592 2.61011 NaN
11 12 0.237346 3.01836 NaN
For every row, I want to calculate its coherence value as follows (note that I want to use the imaginary unit j):
import math
import cmath
for roin, val in enumerate(data_coh):
    data_coh.loc[roin, 'coherence'] = mag*math.cos(phase) + mag*math.sin(phase)*j
First of all, it is not able to perform the computation (which is calculating a complex number based on magnitude and phase); j is the imaginary unit (from cmath).
But in addition, even when j is left out, the allocation to the rows is not done correctly. Why is that, and how can it be corrected?
No need to iterate or to import math or cmath; just pandas and numpy:
import pandas as pd
import numpy as np

df['coherence'] = df['mag'] * (np.cos(df['phase']) + 1j * np.sin(df['phase']))
# Result
df
    roi       mag    phase           coherence
0     1  0.699883  0.05559  0.698802+0.038887j
1     2  0.640482  0.10530  0.636934+0.067318j
2     3  0.477865  1.14926  0.195525+0.436033j
3     4  0.128119  2.28403 -0.083826+0.096890j
4     5  0.563046  2.53091 -0.461279+0.322866j
5     6  0.588690  0.94647  0.344119+0.477638j
6     7  0.428383  1.13915  0.179221+0.389091j
7     8  0.164036  1.95959 -0.062182+0.151793j
8     9  0.279120  3.07456 -0.278493+0.018696j
9    10  0.244237  2.78111 -0.228539+0.086149j
10   11  0.696592  2.61011 -0.600502+0.353041j
11   12  0.237346  3.01836 -0.235546+0.029175j
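The polar-to-complex conversion the question asks for can also be written with a single np.exp, since mag*e^(1j*phase) equals mag*cos(phase) + 1j*mag*sin(phase). A sketch with the first rows of the data:

```python
import numpy as np
import pandas as pd

data_coh = pd.DataFrame({'mag':   [0.699883, 0.640482, 0.477865],
                         'phase': [0.0555903, 0.1053, 1.14926]})

# mag * e^(1j*phase) encodes magnitude and phase in one complex number
data_coh['coherence'] = data_coh['mag'] * np.exp(1j * data_coh['phase'])
print(data_coh)
```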

How to assign a value from a range in a list?

I want to assign a value to a list from another list of values. I'm trying to find where a value from one list belongs in a range given by another list.
I tried .merge but it didn't work, and I tried a for loop to go through the lists, but I was not able to connect all the pieces.
I have two lists, and I want to produce the third table below.
import numpy as np
import pandas as pd

s = pd.Series([0, 1001, 2501])
t = pd.Series([1000, 2500, 4000])
u = pd.Series([6.5, 8.5, 10])
df = pd.DataFrame(s, columns=["LRange"])
df["uRange"] = t
df["Cost"] = u
print(df)

p = pd.Series([550, 1240, 2530, 230])
dp = pd.DataFrame(p, columns=["Power"])
print(dp)
LRange uRange Cost
0 0 1000 6.5
1 1001 2500 8.5
2 2501 4000 10
   Power
0    550
1   1240
2   2530
3    230
I want my result to be:
   Power  Cost p/kW
0    550        6.5
1   1240        8.5
2   2530       10.0
3    230        6.5
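Since the ranges are contiguous, one possible sketch uses pd.cut to bin each Power value by the upper bounds and map it to the corresponding Cost (assuming the df and dp built above):

```python
import pandas as pd

df = pd.DataFrame({'LRange': [0, 1001, 2501],
                   'uRange': [1000, 2500, 4000],
                   'Cost':   [6.5, 8.5, 10]})
dp = pd.DataFrame({'Power': [550, 1240, 2530, 230]})

# Bin edges: one edge below the first lower bound, then every upper bound
edges = [df['LRange'].iloc[0] - 1] + df['uRange'].tolist()
# pd.cut labels each Power with the Cost of the interval it falls into
dp['Cost p/kW'] = pd.cut(dp['Power'], bins=edges, labels=df['Cost'].tolist()).astype(float)
print(dp)
```

This relies on the Cost values being unique (pd.cut requires unique labels); otherwise a merge on the binned interval would be needed instead.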

How to update a Pandas DataFrame without duplicates

Currently I'm working on live-timing software for a motorsport application. For that I have to crawl a live-timing webpage and copy the data into a big DataFrame. This DataFrame is the source of several diagrams I want to make. To keep my DataFrame up to date, I have to crawl the webpage very often.
I can download the data and save it as a pandas DataFrame. But my problem is the step from the downloaded DataFrame to the big DataFrame that includes all the data.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6], 'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': ['1:30,000', '1:45,000', '1:50,000', '1:25,333', '1:13,366', '1:17,000'],
                    'Laps': ['1', '1', '1', '1', '1', '1']})
df2 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6], 'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                    'Laps': ['2', '2', '2', '2', '2', '2']})
df3 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6], 'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': ['1:31,000', '1:41,000', '1:51,000', '1:21,333', '1:11,366', '1:11,000'],
                    'Laps': ['2', '2', '2', '2', '2', '2']})
df1.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df2.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df3.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df1 shows a DataFrame from previous laps.
df2 shows a DataFrame in the second lap. The lap is not completed, so I have a NaN.
df3 shows a DataFrame after the second lap is completed.
My target is to have just one row for each lap per car per class.
Either I have the problem that I get duplicates from incomplete laps, or all data gets overwritten.
I hope that someone can help me with this problem.
Thank you so far.
MrCrunsh
If I understand your problem correctly, your issue is that you have overlapping data for the second lap: information while the lap is still in progress and information after it's over. If you want to put all the information for a given lap in one row, I'd suggest using multi-index columns or changing the column names to reflect the difference between measurements during and after laps.
df = pd.concat([df1, df3])
df = pd.concat([df, df2], axis=1, keys=['after', 'during'])
The result will look like this:
after during
Pos Zeit Pos Zeit
CLS Nr. Laps
V4 24 1 5 1:13,366 NaN NaN
2 5 1:11,366 5.0 NaN
55 1 4 1:25,333 NaN NaN
2 4 1:21,333 4.0 NaN
985 1 6 1:17,000 NaN NaN
2 6 1:11,000 6.0 NaN
V5 13 1 1 1:30,000 NaN NaN
2 1 1:31,000 1.0 NaN
30 1 3 1:50,000 NaN NaN
2 3 1:51,000 3.0 NaN
700 1 2 1:45,000 NaN NaN
2 2 1:41,000 2.0 NaN
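If you would rather keep a single set of columns and simply let completed laps overwrite their incomplete versions, one possible sketch is combine_first: values from the newest crawl win, and NaNs in the crawl fall back to what is already stored. Shown here with trimmed one-row versions of the question's frames:

```python
import numpy as np
import pandas as pd

def update_big(big, new):
    # Values from the new crawl take precedence; NaNs in the crawl
    # (e.g. a lap time that isn't posted yet) keep the stored value.
    return new.combine_first(big)

df1 = pd.DataFrame({'Pos': [1], 'CLS': ['V5'], 'Nr.': ['13'],
                    'Zeit': ['1:30,000'], 'Laps': ['1']}).set_index(['CLS', 'Nr.', 'Laps'])
df2 = pd.DataFrame({'Pos': [1], 'CLS': ['V5'], 'Nr.': ['13'],
                    'Zeit': [np.nan], 'Laps': ['2']}).set_index(['CLS', 'Nr.', 'Laps'])
df3 = pd.DataFrame({'Pos': [1], 'CLS': ['V5'], 'Nr.': ['13'],
                    'Zeit': ['1:31,000'], 'Laps': ['2']}).set_index(['CLS', 'Nr.', 'Laps'])

big = df1                   # lap 1 finished
big = update_big(big, df2)  # lap 2 in progress: row added, Zeit still NaN
big = update_big(big, df3)  # lap 2 finished: Zeit filled in, no duplicate row
print(big)
```

Because the (CLS, Nr., Laps) index identifies a lap uniquely, each lap keeps exactly one row no matter how often the page is crawled.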

Python Pandas Dataframe: length of index does not match - df['column'] = ndarray

I have a pandas DataFrame containing EOD financial data (OHLC) for analysis.
I'm using the https://github.com/cirla/tulipy library to generate technical indicator values that take a certain time period as an option. For example, ADX with timeperiod=5 shows the ADX for the last 5 days.
Because of this time period, the generated array of indicator values is always shorter than the DataFrame, because the prices of the first 5 days are used to generate the ADX for day 6.
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=14)
df['mdi_14'] = mdi14
df['pdi_14'] = pdi14
>> ValueError: Length of values does not match length of index
Unfortunately, unlike TA-Lib for example, this tulip library does not provide NaN values for these first couple of empty days...
Is there an easy way to prepend these NaNs to the ndarray?
Or insert into df at a certain index & have it create NaN for the rows before it automatically?
Thanks in advance, I've been researching for days!
Maybe make the shift yourself in the code?
period = 14
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=period
)
df['mdi_14'] = np.nan
df.loc[df.index[period - 1:], 'mdi_14'] = mdi14
I hope they will fill the first values with NaN in the lib in the future. It's dangerous to leave time-series data like this without any label.
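The same idea as a small helper that prepends the missing NaNs, so the padded array always matches the DataFrame length (a sketch; the function name is made up here):

```python
import numpy as np

def pad_front(values, target_len):
    """Prepend NaNs so that len(result) == target_len."""
    values = np.asarray(values, dtype=float)
    pad = target_len - len(values)
    # np.full builds the NaN prefix; concatenate keeps the indicator values at the tail
    return np.concatenate([np.full(pad, np.nan), values])
```

With this, `df['mdi_14'] = pad_front(mdi14, len(df))` becomes a single assignment.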
Full MCVE
df = pd.DataFrame(1, range(10), list('ABC'))
a = np.full((len(df) - 6, df.shape[1]), 2)
b = np.full((6, df.shape[1]), np.nan)
c = np.vstack([b, a])
d = pd.DataFrame(c, df.index, df.columns)
d
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 2.0 2.0 2.0
7 2.0 2.0 2.0
8 2.0 2.0 2.0
9 2.0 2.0 2.0
The C version of the tulip library includes a start function for each indicator (reference: https://tulipindicators.org/usage) that can be used to determine the output length of an indicator given a set of input options. Unfortunately, it does not appear that the python bindings library, tulipy, includes this functionality. Instead you have to resort to dynamically reassigning your index values to align the output with the original DataFrame.
Here is an example that uses the price series from the tulipy docs:
import numpy as np
import pandas as pd
import tulipy as ti

# Create the DataFrame with close prices (a list, not a set, so the order is preserved)
prices = pd.DataFrame(data=[81.06, 81.59, 82.87, 83, 83.61, 83.15, 82.84, 83.99, 84.55,
                            84.36, 85.53, 86.54, 86.89, 87.77, 87.29], columns=['close'])
# Compute the technical indicator using tulipy and save the result in a DataFrame
bbands = pd.DataFrame(data=np.transpose(ti.bbands(real=prices['close'].to_numpy(), period=5, stddev=2)))
# Dynamically realign the index; note from the tulip library documentation that the
# price/volume data is expected to be ordered "oldest to newest (index 0 is oldest)"
bbands.index += prices.index.max() - bbands.index.max()
# Put the indicator values with the original DataFrame
prices[['BBANDS_5_2_low', 'BBANDS_5_2_mid', 'BBANDS_5_2_up']] = bbands
prices.head(15)
close BBANDS_5_2_low BBANDS_5_2_mid BBANDS_5_2_up
0 81.06 NaN NaN NaN
1 81.59 NaN NaN NaN
2 82.87 NaN NaN NaN
3 83.00 NaN NaN NaN
4 83.61 80.530042 82.426 84.321958
5 83.15 81.494061 82.844 84.193939
6 82.84 82.533343 83.094 83.654657
7 83.99 82.471983 83.318 84.164017
8 84.55 82.417750 83.628 84.838250
9 84.36 82.435203 83.778 85.120797
10 85.53 82.511331 84.254 85.996669
11 86.54 83.142618 84.994 86.845382
12 86.89 83.536488 85.574 87.611512
13 87.77 83.870324 86.218 88.565676
14 87.29 85.288871 86.804 88.319129
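The realignment trick can also be expressed through pandas index alignment: wrap the short indicator array in a Series whose index starts at the right offset, and pandas fills the unmatched head rows with NaN automatically. A sketch with stand-in data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'close': np.arange(10.0)})
period = 5
indicator = np.arange(100.0, 106.0)  # stand-in output of length len(df) - period + 1

# A Series aligned to the tail of the index; the first period-1 rows become NaN
df['ind'] = pd.Series(indicator, index=df.index[period - 1:])
print(df)
```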
