I have a pandas dataframe with a variable number of columns. I'd like to numerically integrate each column of the dataframe so that I can evaluate the definite integral from row 0 to row 'n'. I have a function that works on a 1D array, but is there a better way to do this in a pandas dataframe so that I don't have to iterate over columns and cells? I was thinking of some way of using applymap, but I can't see how to make it work.
This is the function that works on a 1D array:
import numpy as np

def findB(x, y):
    # Cumulative trapezoidal integration of y over x for a 1D array.
    y_int = np.zeros(y.size)
    y_int_min = np.zeros(y.size)  # unused
    y_int_max = np.zeros(y.size)  # unused
    end = y.size - 1
    y_int[0] = (y[1] + y[0]) / 2 * (x[1] - x[0])
    for i in range(1, end, 1):
        j = i + 1
        # trapezoid for the current interval plus the running total
        y_int[i] = (y[j] + y[i]) / 2 * (x[j] - x[i]) + y_int[i - 1]
    return y_int
I'd like to replace it with something that calculates multiple columns of a dataframe all at once, something like this:
B_df = y_df.applymap(integrator)
EDIT:
Starting dataframe dB_df:
Sample1 1 dB Sample1 2 dB Sample1 3 dB Sample1 4 dB Sample1 5 dB Sample1 6 dB
0 2.472389 6.524537 0.306852 -6.209527 -6.531123 -4.901795
1 6.982619 -0.534953 -7.537024 8.301643 7.744730 7.962163
2 -8.038405 -8.888681 6.856490 -0.052084 0.018511 -4.117407
3 0.040788 5.622489 3.522841 -8.170495 -7.707704 -6.313693
4 8.512173 1.896649 -8.831261 6.889746 6.960343 8.236696
5 -6.234313 -9.908385 4.934738 1.595130 3.116842 -2.078000
6 -1.998620 3.818398 5.444592 -7.503763 -8.727408 -8.117782
7 7.884663 3.818398 -8.046873 6.223019 4.646397 6.667921
8 -5.332267 -9.163214 1.993285 2.144201 4.646397 0.000627
9 -2.783008 2.288842 5.836786 -8.013618 -7.825365 -8.470759
Ending dataframe B_df:
Sample1 1 B Sample1 2 B Sample1 3 B Sample1 4 B Sample1 5 B Sample1 6 B
0 0.000038 0.000024 -0.000029 0.000008 0.000005 0.000012
1 0.000034 -0.000014 -0.000032 0.000041 0.000036 0.000028
2 0.000002 -0.000027 0.000010 0.000008 0.000005 -0.000014
3 0.000036 0.000003 -0.000011 0.000003 0.000002 -0.000006
4 0.000045 -0.000029 -0.000027 0.000037 0.000042 0.000018
5 0.000012 -0.000053 0.000015 0.000014 0.000020 -0.000023
6 0.000036 -0.000023 0.000004 0.000009 0.000004 -0.000028
7 0.000046 -0.000044 -0.000020 0.000042 0.000041 -0.000002
8 0.000013 -0.000071 0.000011 0.000019 0.000028 -0.000036
9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
In the above example,
(x[j]-x[i]) = 0.000008
First of all, you can get the same result with vectorized operations. Each element of the integration is just the mean of the current and next y values, scaled by the corresponding difference in x, and the final integral is the cumulative sum of these elements. In pandas that looks something like
def findB(x, y):
    """
    x : pandas.Series
    y : pandas.DataFrame
    """
    mean_y = (y[:-1] + y.shift(-1)[:-1]) / 2              # average of current and next y
    delta_x = x.shift(-1)[:-1] - x[:-1]                    # spacing between consecutive x
    scaled_int = mean_y.multiply(delta_x, axis='index')    # trapezoid area per interval
    cumulative_int = scaled_int.cumsum(axis='index')
    return cumulative_int.shift(1).fillna(0)
Here DataFrame.shift and Series.shift are used to match the indices of the "next" elements to the current ones. You have to use DataFrame.multiply rather than the * operator so that the proper axis is used ('index' rather than the default 'columns'). Finally, DataFrame.cumsum performs the integration step, and DataFrame.fillna ensures that you have a row of zeros as in the original solution. The advantage of using native pandas functions throughout is that you can pass in a dataframe with any number of columns and have it operate on all of them simultaneously.
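As a rough check, you could call it on the question's dB_df with a hypothetical x series (the original x values are not shown; here they are assumed to be evenly spaced by the 0.000008 quoted above):
import numpy as np
import pandas as pd

# hypothetical x values: evenly spaced by 8e-6, one per row of dB_df
x = pd.Series(np.arange(len(dB_df)) * 8e-6)
B_df = findB(x, dB_df)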
Do you really need the numeric values of the integral, or would a picture do? If a picture is enough, it is easier with pyplot.
import pandas as pd
import matplotlib.pyplot as plt
# Introduce a column *bin* holding left limits of our bins.
df['bin'] = pd.cut(df['volume2'], 50).apply(lambda bin: bin.left)
# Group by bins and calculate *f*.
g = df[['bin', 'universe']].groupby('bin').sum()
# Plot the function using cumulative=True.
plt.hist(list(g.index), bins=50, weights=list(g['universe']), cumulative=True)
plt.show()
Related
I have the following dataframe.
For each time point (row), A1, A2, A3; A4, A5, A6; ... are groups of 3 replicates. I would like to get the average and standard deviation for each group of 3 per row and add them to a new df.
I have tried:
new_df['A1-A3_mean']=np.mean(df[['A1','A2','A3']],axis=1)
new_df['A1-A3_std']=np.std(df[['A1','A2','A3']],axis=1)
which works but is quite manual and time consuming. I tried using groupby('Time').agg({'mean','std'}) but I don't know how to specify that it should always take 3 columns. Ideally the resulting columns would be named A1-3_mean / A1-3_stdev.
Thanks in advance!
You can try:
N = 3
cols = list(df.drop(columns='time'))
mapper = {c: f'{cols[i//N*N]}-{cols[i//N*N + N - 1]}' for i, c in enumerate(cols)}  # e.g. A1, A2, A3 -> 'A1-A3'
g = df[cols].rename(columns=mapper).groupby(level=0, axis=1)
out = pd.concat({x: g.agg(x) for x in ['mean', 'std']}, axis=1)
Output:
       mean                std
      A1-A3     A4-A6     A1-A3     A4-A6
0  4.666667  3.000000  2.886751  2.000000
1  2.666667  4.333333  1.154701  3.214550
2  6.333333  4.333333  2.309401  1.154701
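To see what the rename does, here is the mapper evaluated on six hypothetical replicate columns:
cols = ['A1', 'A2', 'A3', 'A4', 'A5', 'A6']
N = 3
mapper = {c: f'{cols[i//N*N]}-{cols[i//N*N + N - 1]}' for i, c in enumerate(cols)}
print(mapper)
# {'A1': 'A1-A3', 'A2': 'A1-A3', 'A3': 'A1-A3', 'A4': 'A4-A6', 'A5': 'A4-A6', 'A6': 'A4-A6'}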
I have a function, which I am trying to apply to a dask dataframe, that calculates cooling assuming certain storage capacity and rate limits. It takes the 15-minute timestep values of cooling that a building uses and returns the amount that a certain storage rate can accommodate.
def cooling_kwh_by_case(row, storage_capacity, storage_rate):
    if ((row['daily_cooling_kwh'] <= storage_capacity/row['cop']) & (row['max_cooling_kw'] <= storage_rate/row['cop'])):
        return row['daily_cooling_kwh']
    elif ((row['daily_cooling_kwh'] <= storage_capacity/row['cop']) & (row['max_cooling_kw'] > storage_rate/row['cop'])):
        daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: sum(min(x,storage_rate/(4*row['cop']))))
        return daily_groupby.loc[(row.building_date)]
    else:
        n_largest = 1
        daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
        while ((daily_groupby.loc[(row.building_date)]) <= (storage_capacity/row['cop'])) & (n_largest < net_load_w_times.groupby('index')['electricity_cooling_kwh'].count()):
            n_largest += 1
            daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
        return min(storage_capacity/row['cop'],net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest-1).sum()).loc[(row.building_date)])
When I apply it, this is my error message.
<ipython-input-22-88e243d194c6> in cooling_kwh_by_case()
16 n_largest = 1
17 daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
---> 18 while ((daily_groupby.loc[(row.building_date)]) <= (storage_capacity/row['cop'])) & (n_largest < net_load_w_times.groupby('index')['electricity_cooling_kwh'].count()):
19 n_largest += 1
20 daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
I think the issue I'm running into is the way I try to calculate the value I want in the else branch, which covers the cases where the cooling kWh is larger than the storage_capacity parameter. To calculate this value, I apply a function to find when the sum of the largest 15-minute cooling kWh values for the day exceeds storage_capacity, and I then return the sum of those largest values.
The dataframe that I am grouping by inside the function to compute the return value is called net_load_w_times:
time electricity_cooling_kwh \
building_id
2 2016-07-05 19:00:00 0.050000
2 2016-07-05 22:00:00 3.200000
2 2016-07-05 16:00:00 5.779318
2 2016-07-05 20:00:00 1.888300
2 2016-07-05 18:00:00 7.490000
electricity_heating_kwh total_site_electricity_kwh iso_zone \
building_id
2 0.000000 19.529506 MISO-E
2 0.045235 6.310719 MISO-E
2 0.000000 22.514705 MISO-E
2 0.018624 13.474863 MISO-E
2 0.005464 18.192927 MISO-E
index date
building_id
2 2|2016-10-24 2016-10-24
2 2|2016-03-05 2016-03-05
2 2|2016-08-14 2016-08-14
2 2|2016-03-05 2016-03-05
2 2|2016-03-05 2016-03-05
Desired Output:
Given cooling_kwh_by_case(row, 8, 5), it should output:
7.717618, because this is the maximum cooling kWh that can be accumulated before exceeding 8.
Dask dataframes are lazy and do not work within control flow like if-else statements or for loops. I recommend trying to find solutions within the pandas API, like the where method.
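A minimal sketch of that idea, assuming a plain pandas frame with one row per building-day and columns 'daily_cooling_kwh', 'max_cooling_kw' and 'cop', plus hypothetical pre-computed columns 'case2_kwh' and 'case3_kwh' for the two data-dependent branches (e.g. produced by a separate groupby on the 15-minute data and merged back in):
import numpy as np

def cooling_kwh_by_case_vec(df, storage_capacity, storage_rate):
    # Vectorised branch selection instead of row-wise if/elif/else.
    fits_capacity = df['daily_cooling_kwh'] <= storage_capacity / df['cop']
    fits_rate = df['max_cooling_kw'] <= storage_rate / df['cop']
    conditions = [
        fits_capacity & fits_rate,   # case 1: the whole day fits within both limits
        fits_capacity & ~fits_rate,  # case 2: capacity is fine but the rate is limiting
    ]
    choices = [
        df['daily_cooling_kwh'],
        df['case2_kwh'],             # hypothetical pre-computed rate-limited total
    ]
    return np.select(conditions, choices, default=df['case3_kwh'])
The case-2 and case-3 totals still have to be computed once, but as whole-column operations (for example a clip followed by a groupby sum) rather than inside a per-row apply with control flow.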
I have created two series and I want to create a third series by doing element-wise multiplication of the first two. My code is given below:
new_samples = 10 # Number of samples in series
a = pd.Series([list(map(lambda x:x,np.linspace(2,2,new_samples)))],index=['Current'])
b = pd.Series([list(map(lambda x:x,np.linspace(10,0,new_samples)))],index=['Voltage'])
c = pd.Series([x*y for x,y in zip(a.tolist(),b.tolist())],index=['Power'])
My output is:
TypeError: can't multiply sequence by non-int of type 'list'
To keep things clear, I am pasting my actual for-loop code below. My data frame already has three columns: Current, Voltage, Power. For my requirement, I have to append new lists of values to the existing Voltage and Current columns, while the Power values are created by multiplying the values just created. My code is given below:
for i,j in zip(IV_start_index,IV_start_index[1:]):
    isc_act = module_allData_df['Current'].iloc[i:j-1].max()
    isc_indx = module_allData_df['Current'].iloc[i:j-1].idxmax()
    sample_count = int((j-i)/(module_allData_df['Voltage'].iloc[i]-module_allData_df['Voltage'].iloc[j-1]))
    new_samples = int(sample_count * (module_allData_df['Voltage'].iloc[isc_indx]))
    missing_current = pd.Series([list(map(lambda x:x,np.linspace(isc_act,isc_act,new_samples)))],index=['Current'])
    missing_voltage = pd.Series([list(map(lambda x:x,np.linspace(module_allData_df['Voltage'].iloc[isc_indx],0,new_samples)))],index=['Voltage'])
    print(missing_current.tolist()*missing_voltage.tolist())
Sample data: module_allData_df.head()
Voltage Current Power
0 33.009998 -0.004 -0.13204
1 33.009998 0.005 0.16505
2 32.970001 0.046 1.51662
3 32.950001 0.087 2.86665
4 32.919998 0.128 4.21376
Sample data: module_allData_df.iloc[120:126] (you will need this too)
Voltage Current Power
120 0.980000 5.449 5.34002
121 0.920000 5.449 5.01308
122 0.860000 5.449 4.68614
123 0.790000 5.449 4.30471
124 33.110001 -0.004 -0.13244
125 33.110001 0.005 0.16555
Sample data: IV_start_index[:5]
[0, 124, 251, 381, 512]
Based on @jezrael's answer, I have successfully created three separate Series. How do I append them to the main dataframe? My requirement is explained in the following plot.
The problem is that each of your Series is a single element containing a list, so vectorized operations are not possible.
a = pd.Series([list(map(lambda x:x,np.linspace(2,2,new_samples)))],index=['Current'])
b = pd.Series([list(map(lambda x:x,np.linspace(10,0,new_samples)))],index=['Voltage'])
print (a)
Current [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
dtype: object
print (b)
Voltage [10.0, 8.88888888888889, 7.777777777777778, 6....
dtype: object
So I believe you need to remove the [] and, if necessary, add the name parameter:
a = pd.Series(list(map(lambda x:x,np.linspace(2,2,new_samples))), name='Current')
b = pd.Series(list(map(lambda x:x,np.linspace(10,0,new_samples))),name='Voltage')
print (a)
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
5 2.0
6 2.0
7 2.0
8 2.0
9 2.0
Name: Current, dtype: float64
print (b)
0 10.000000
1 8.888889
2 7.777778
3 6.666667
4 5.555556
5 4.444444
6 3.333333
7 2.222222
8 1.111111
9 0.000000
Name: Voltage, dtype: float64
c = a * b
print (c)
0 20.000000
1 17.777778
2 15.555556
3 13.333333
4 11.111111
5 8.888889
6 6.666667
7 4.444444
8 2.222222
9 0.000000
dtype: float64
EDIT:
If you want the multiplied Series as output, only the last two lines of your loop need to change:
missing_current = pd.Series(list(map(lambda x:x,np.linspace(isc_act,isc_act,new_samples))))
missing_voltage = pd.Series(list(map(lambda x:x,np.linspace(module_allData_df['Voltage'].iloc[isc_indx],0,new_samples))))
print(missing_current *missing_voltage)
It's easier using numpy.
import numpy as np
new_samples = 10 # Number of samples in series
a = np.linspace(2,2,new_samples)   # np.linspace already returns an ndarray
b = np.linspace(10,0,new_samples)
c = a*b
print(c)
Output:
array([20.        , 17.77777778, 15.55555556, 13.33333333, 11.11111111,
        8.88888889,  6.66666667,  4.44444444,  2.22222222,  0.        ])
Since you are doing everything with a pandas DataFrame anyway, you can use the code below.
import numpy as np
import pandas as pd
new_samples = 10 # Number of samples in series
df = pd.DataFrame({'Current':np.linspace(2,2,new_samples),'Voltage':np.linspace(10,0,new_samples)})
df['Power'] = df['Current'] * df['Voltage']
print(df.to_string(index=False))
Output:
Current Voltage Power
2.0 10.000000 20.000000
2.0 8.888889 17.777778
2.0 7.777778 15.555556
2.0 6.666667 13.333333
2.0 5.555556 11.111111
2.0 4.444444 8.888889
2.0 3.333333 6.666667
2.0 2.222222 4.444444
2.0 1.111111 2.222222
2.0 0.000000 0.000000
Because they are Series, you should be able to just multiply them: c = a * b.
You could also add a and b to a DataFrame, and then c becomes the third column, as sketched below.
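A minimal sketch of that idea, reusing the linspace values from the question:
import numpy as np
import pandas as pd

new_samples = 10
a = pd.Series(np.linspace(2, 2, new_samples), name='Current')
b = pd.Series(np.linspace(10, 0, new_samples), name='Voltage')

df = pd.concat([a, b], axis=1)               # a and b as columns of a frame
df['Power'] = df['Current'] * df['Voltage']  # c as the third column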
I found a behavior in pandas that I'm not able to explain.
I am studying a database of audio features with N+2 columns: an ID, the time t, and N audio features related to time t. For various reasons, I would like to put in every row the features of the next T time steps as well (yes, the same data will be repeated up to T times). I have therefore written a function that creates additional feature columns containing data from the successive time steps. I have implemented it in three ways, as you can see in the attached code, and one of them is not working, which is surprising to me since it works if the underlying data structures are numpy arrays. Can anybody explain to me why?
def create_datapoints_for_dnn(df, T):
    """
    Here we take the data frame with chroma features at time t and create all features at times t+1, t+2, ..., t+T-1.
    :param df: initial data frame of chroma features
    :param T: number of time steps to keep
    :return: expanded data frame of chroma features
    """
    res = df.copy()
    original_labels = df.columns.values
    n_steps = df.shape[0]  # the number of time steps in this song
    nans = pd.Series(np.full(n_steps, np.NaN)).values  # a column of nans of the correct length
    for n in range(1, T):
        new_labels = [ol + '+' + str(n) for ol in original_labels[2:]]
        for nl, ol in zip(new_labels, original_labels[2:]):
            # df.assign would use the name "nl" instead of what nl contains, so we build and unpack a dictionary
            res = res.assign(**{nl: nans})  # create a new column

            # CORRECT BUT EXTREMELY SLOW
            # for i in range(n_steps - (T - 1)):
            #     res.iloc[i, res.columns.get_loc(nl)] = df.iloc[n+i, df.columns.get_loc(ol)]

            # CORRECT AND FAST
            res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[:, df.columns.get_loc(ol)].shift(-n)

            # NOT WORKING
            # res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[n:, df.columns.get_loc(ol)]
    return res[: - (T - 1)]  # drop the last T-1 rows because time t+T-1 is not defined for them
Data example (put it in a csv):
songID,time,A_t,A#_t
CrossEra-0850,0.0,0.0,0.0
CrossEra-0850,0.1,0.0,0.0
CrossEra-0850,0.2,0.0,0.0
CrossEra-0850,0.3,0.31621,0.760299
CrossEra-0850,0.4,0.0,0.00107539
CrossEra-0850,0.5,0.0,0.142832
CrossEra-0850,0.6,0.8506459999999999,0.12481600000000001
CrossEra-0850,0.7,0.0,0.21206399999999997
CrossEra-0850,0.8,0.0796207,0.28227399999999997
CrossEra-0850,0.9,2.55144,0.169434
CrossEra-0850,1.0,3.4581699999999995,0.08014550000000001
CrossEra-0850,1.1,3.1061400000000003,0.030419599999999998
Code to run it
import pandas as pd
import numpy as np
T = 4 # how many successive steps we want to put in a single row
df = pd.read_csv('path_to_csv')
res = create_datapoints_for_dnn(df, T)
res.to_csv('path_to_output', index=False)
Use pd.DataFrame.shift and concat
The f-string requires Python 3.6+. Otherwise use '+{}'.format(i)
cols = ['songID', 'time']
d = df.drop(['songID', 'time'], axis=1)

df[cols].join(
    pd.concat(
        [d.shift(-i).add_suffix(f'+{i}') for i in range(4)],
        axis=1
    )
)
songID time A_t+0 A#_t+0 A_t+1 A#_t+1 A_t+2 A#_t+2 A_t+3 A#_t+3
0 CrossEra-0850 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.316210 0.760299
1 CrossEra-0850 0.1 0.000000 0.000000 0.000000 0.000000 0.316210 0.760299 0.000000 0.001075
2 CrossEra-0850 0.2 0.000000 0.000000 0.316210 0.760299 0.000000 0.001075 0.000000 0.142832
3 CrossEra-0850 0.3 0.316210 0.760299 0.000000 0.001075 0.000000 0.142832 0.850646 0.124816
4 CrossEra-0850 0.4 0.000000 0.001075 0.000000 0.142832 0.850646 0.124816 0.000000 0.212064
5 CrossEra-0850 0.5 0.000000 0.142832 0.850646 0.124816 0.000000 0.212064 0.079621 0.282274
6 CrossEra-0850 0.6 0.850646 0.124816 0.000000 0.212064 0.079621 0.282274 2.551440 0.169434
7 CrossEra-0850 0.7 0.000000 0.212064 0.079621 0.282274 2.551440 0.169434 3.458170 0.080146
8 CrossEra-0850 0.8 0.079621 0.282274 2.551440 0.169434 3.458170 0.080146 3.106140 0.030420
9 CrossEra-0850 0.9 2.551440 0.169434 3.458170 0.080146 3.106140 0.030420 NaN NaN
10 CrossEra-0850 1.0 3.458170 0.080146 3.106140 0.030420 NaN NaN NaN NaN
11 CrossEra-0850 1.1 3.106140 0.030420 NaN NaN NaN NaN NaN NaN
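Folding that back into the question's function signature (a sketch that keeps the same shift/concat approach and reproduces the original function's dropping of the last T-1 rows):
import pandas as pd

def create_datapoints_for_dnn(df, T):
    cols = ['songID', 'time']
    d = df.drop(cols, axis=1)
    expanded = df[cols].join(
        pd.concat([d.shift(-i).add_suffix(f'+{i}') for i in range(T)], axis=1)
    )
    return expanded[:-(T - 1)]  # time t+T-1 is undefined for the last T-1 rows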
For a network, I want to plot the probability that two nodes are connected as a function of their distance from each other.
I have two pandas Series: one (distance) is the distance between each pair of nodes, and the other (adjacency) is filled with zeros and ones and tells whether the nodes are connected.
My idea was to use cut and value_counts to first compute the number of pairs having a distance inside bins, which works fine:
factor = pandas.cut(distance, 100)
num_bin = pandas.value_counts(factor)
Now, if I had a vector of the same size as num_bin with the number of connected nodes inside each bin, I would have my probability. But how do I compute this vector?
My problem is how to know, among let's say the 3 pairs of nodes inside the second bin, how many are connected.
Thanks
You could use crosstab for this:
import numpy as np
import pandas as pd
factor = pd.cut(distance, 100)
# the crosstab dataframe with the value counts in each bucket
ct = pd.crosstab(factor, adjacency, margins=True,
                 rownames=['distance'], colnames=['adjacency'])
# from here computing the probability of nodes being adjacent is straightforward
ct['prob'] = np.true_divide(ct[1], ct['All'])
Which gives a dataframe of this form:
>>> ct
adjacency 0 1 All prob
distance
(0.00685, 0.107] 7 4 11 0.363636
(0.107, 0.205] 6 9 15 0.600000
(0.205, 0.304] 6 6 12 0.500000
(0.304, 0.403] 5 2 7 0.285714
(0.403, 0.502] 4 6 10 0.600000
(0.502, 0.6] 8 3 11 0.272727
(0.6, 0.699] 6 2 8 0.250000
(0.699, 0.798] 4 6 10 0.600000
(0.798, 0.896] 4 5 9 0.555556
(0.896, 0.995] 5 2 7 0.285714
All 55 45 100 0.450000
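For a self-contained test, the two Series can be simulated with random data of the right shape (purely illustrative values, not the poster's network):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
distance = pd.Series(rng.uniform(0, 1, size=100))    # pairwise distances
adjacency = pd.Series(rng.integers(0, 2, size=100))  # 1 if the pair is connected, else 0

factor = pd.cut(distance, 10)
ct = pd.crosstab(factor, adjacency, margins=True,
                 rownames=['distance'], colnames=['adjacency'])
ct['prob'] = ct[1] / ct['All']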