I have created two Series and want to create a third by element-wise multiplication of the first two. My code is given below:
new_samples = 10 # Number of samples in series
a = pd.Series([list(map(lambda x:x,np.linspace(2,2,new_samples)))],index=['Current'])
b = pd.Series([list(map(lambda x:x,np.linspace(10,0,new_samples)))],index=['Voltage'])
c = pd.Series([x*y for x,y in zip(a.tolist(),b.tolist())],index=['Power'])
My output is:
TypeError: can't multiply sequence by non-int of type 'list'
To keep things clear, I am pasting my actual for-loop code below. My data frame already has three columns: Current, Voltage, and Power. I need to add new lists of values to the existing Voltage and Current columns, while the Power values are obtained by multiplying those newly created values. My code is given below:
for i,j in zip(IV_start_index,IV_start_index[1:]):
    isc_act = module_allData_df['Current'].iloc[i:j-1].max()
    isc_indx = module_allData_df['Current'].iloc[i:j-1].idxmax()
    sample_count = int((j-i)/(module_allData_df['Voltage'].iloc[i]-module_allData_df['Voltage'].iloc[j-1]))
    new_samples = int(sample_count * (module_allData_df['Voltage'].iloc[isc_indx]))
    missing_current = pd.Series([list(map(lambda x:x,np.linspace(isc_act,isc_act,new_samples)))],index=['Current'])
    missing_voltage = pd.Series([list(map(lambda x:x,np.linspace(module_allData_df['Voltage'].iloc[isc_indx],0,new_samples)))],index=['Voltage'])
    print(missing_current.tolist()*missing_voltage.tolist())
Sample data: module_allData_df.head()
Voltage Current Power
0 33.009998 -0.004 -0.13204
1 33.009998 0.005 0.16505
2 32.970001 0.046 1.51662
3 32.950001 0.087 2.86665
4 32.919998 0.128 4.21376
Sample data: module_allData_df.iloc[120:126], in case you need this as well
Voltage Current Power
120 0.980000 5.449 5.34002
121 0.920000 5.449 5.01308
122 0.860000 5.449 4.68614
123 0.790000 5.449 4.30471
124 33.110001 -0.004 -0.13244
125 33.110001 0.005 0.16555
sample data: IV_start_index[:5]
[0, 124, 251, 381, 512]
Based on @jezrael's answer, I have successfully created three separate Series. How do I append them to the main dataframe? My requirement is explained in the following plot.
The problem is that each Series contains a single element holding a list, so vectorized operations are not possible.
a = pd.Series([list(map(lambda x:x,np.linspace(2,2,new_samples)))],index=['Current'])
b = pd.Series([list(map(lambda x:x,np.linspace(10,0,new_samples)))],index=['Voltage'])
print (a)
Current [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
dtype: object
print (b)
Voltage [10.0, 8.88888888888889, 7.777777777777778, 6....
dtype: object
So I believe you need to remove the [] and, if necessary, add the name parameter:
a = pd.Series(list(map(lambda x:x,np.linspace(2,2,new_samples))), name='Current')
b = pd.Series(list(map(lambda x:x,np.linspace(10,0,new_samples))),name='Voltage')
print (a)
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
5 2.0
6 2.0
7 2.0
8 2.0
9 2.0
Name: Current, dtype: float64
print (b)
0 10.000000
1 8.888889
2 7.777778
3 6.666667
4 5.555556
5 4.444444
6 3.333333
7 2.222222
8 1.111111
9 0.000000
Name: Voltage, dtype: float64
c = a * b
print (c)
0 20.000000
1 17.777778
2 15.555556
3 13.333333
4 11.111111
5 8.888889
6 6.666667
7 4.444444
8 2.222222
9 0.000000
dtype: float64
EDIT:
If you want the multiplied Series as output, the last rows need to be:
missing_current = pd.Series(list(map(lambda x:x,np.linspace(isc_act,isc_act,new_samples))))
missing_voltage = pd.Series(list(map(lambda x:x,np.linspace(module_allData_df['Voltage'].iloc[isc_indx],0,new_samples))))
print(missing_current *missing_voltage)
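For the follow-up about appending the new values to the main dataframe, one possible approach is to collect the new Voltage, Current and Power values in a small DataFrame and join it with pd.concat. This is only a sketch: main_df is a stand-in for module_allData_df with illustrative values, and new_samples and last_voltage are assumptions chosen for the example. It appends at the end of the frame; inserting after each individual sweep would need the frame to be split at IV_start_index first.

```python
import numpy as np
import pandas as pd

# Stand-in for module_allData_df (illustrative values only).
main_df = pd.DataFrame({
    "Voltage": [0.92, 0.86, 0.79],
    "Current": [5.449, 5.449, 5.449],
})
main_df["Power"] = main_df["Voltage"] * main_df["Current"]

isc_act = main_df["Current"].max()          # current held constant for the new rows
last_voltage = main_df["Voltage"].iloc[-1]  # assumed starting voltage for the new rows
new_samples = 4                             # assumed number of rows to add

# Build the missing rows as flat numeric columns so that * works element-wise.
missing = pd.DataFrame({
    "Voltage": np.linspace(last_voltage, 0, new_samples),
    "Current": np.full(new_samples, isc_act),
})
missing["Power"] = missing["Voltage"] * missing["Current"]

# Append the new rows and renumber the index.
extended = pd.concat([main_df, missing], ignore_index=True)
print(extended)
```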
It's easier using numpy.
import numpy as np
new_samples = 10 # Number of samples in series
a = np.array(np.linspace(2,2,new_samples))
b = np.array(np.linspace(10,0,new_samples))
c = a*b
print(c)
Output:
array([20.        , 17.77777778, 15.55555556, 13.33333333, 11.11111111,
        8.88888889,  6.66666667,  4.44444444,  2.22222222,  0.        ])
Since you are doing everything with pandas dataframes, use the code below.
import numpy as np
import pandas as pd
new_samples = 10 # Number of samples in series
df = pd.DataFrame({'Current':np.linspace(2,2,new_samples),'Voltage':np.linspace(10,0,new_samples)})
df['Power'] = df['Current'] * df['Voltage']
print(df.to_string(index=False))
Output:
Current Voltage Power
2.0 10.000000 20.000000
2.0 8.888889 17.777778
2.0 7.777778 15.555556
2.0 6.666667 13.333333
2.0 5.555556 11.111111
2.0 4.444444 8.888889
2.0 3.333333 6.666667
2.0 2.222222 4.444444
2.0 1.111111 2.222222
2.0 0.000000 0.000000
Because they are Series, you should be able to just multiply them: c = a * b
You could also add a and b to a data frame, so that c becomes the third column.
Related
I have a Pandas Dataframe with a float column. The values in that column have many decimal points but I only need 2 decimal points. I don't want to round, but truncate the value after the second digit.
This is what I have so far; however, with this operation I always get NaNs:
t['latitude']=[18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
sub = "."
t['latitude'].astype(str).str.slice(start=t['latitude'].astype(str).str.find(sub), stop=t['latitude'].astype(str).str.find(sub)+2)
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
Name: latitude, dtype: float64
The simplest way to truncate:
import pandas as pd
t = pd.DataFrame()
t['latitude']=[18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
t['latitude'] = (t['latitude'] * 100).astype(int) / 100
print(t)
>>
latitude
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
Use np.round -
s = pd.Series([18.3988, 18.4439, 18.3467, 37.5079, 38.1102, 38.2927])
s_rounded = np.round(s, 2)
Output
0 18.40
1 18.44
2 18.35
3 37.51
4 38.11
5 38.29
dtype: float64
If you don't want to round, but just truncate -
s.astype(str).str.split('.').apply(lambda x: str(x[0]) + '.' + str(x[1])[:2])
Output
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
dtype: object
Use numpy.trunc for a vectorized operation:
n = 2 # number of decimals to keep
np.trunc(df['latitude'].mul(10**n)).div(10**n)
# to assign
# df['latitude'] = np.trunc(df['latitude'].mul(10**n)).div(10**n)
output:
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
Name: latitude, dtype: float64
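The same idea can be wrapped in a small helper. This is a sketch only; the name truncate_decimals and the sample values (including the negative one) are made up for illustration. Because binary floating point cannot represent every decimal exactly, a value that already sits on a two-decimal boundary can occasionally land one step low after scaling.

```python
import numpy as np
import pandas as pd

def truncate_decimals(values: pd.Series, n: int = 2) -> pd.Series:
    """Drop digits beyond the n-th decimal place without rounding."""
    factor = 10 ** n
    # np.trunc discards the fractional part, moving toward zero,
    # so negative values are truncated rather than floored.
    return np.trunc(values * factor) / factor

lat = pd.Series([18.398, 18.4439, 18.346, -1.239], name="latitude")
truncated = truncate_decimals(lat)
print(truncated)
```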
x = 12.3614
y = round(x,2)
print(y)  # 12.36
Easiest is Series.round (which rounds rather than truncates), but you can also try .str.extract:
t['latitude'] = (t['latitude'].astype(str)
                 .str.extract(r'(.*\.\d{0,2})')
                 .astype(float))
print(t)
latitude
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
import re
t = [18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
truncated_lat=[]
for lat in t:
    # raw string so that \. is passed to the regex engine unchanged
    truncated_lat.append(float(re.findall(r'[0-9]+\.[0-9]{2}', str(lat))[0]))
print(truncated_lat)
Output:
[18.39, 18.44, 18.34, 37.5, 38.11, 38.29]
Try math.trunc after scaling by 100:
import math
for i in t['latitude']:
    print(math.trunc(i * 100) / 100)
Comparing 2 series objects of different sizes:
IN[248]:df['Series value 1']
Out[249]:
0 70
1 66.5
2 68
3 60
4 100
5 12
Name: Stu_perc, dtype: int64
IN[250]:benchmark_value
#benchmark is a subset of data from df2, based on certain filters
Out[251]:
0 70
Name: Stu_perc, dtype: int64
Basically, I wish to compare df['Series value 1'] with benchmark_value and return, in a column named Matching list, the values that are at least 95% of the benchmark value. Both objects are pandas Series, but their sizes differ, so the comparison fails.
Input given:
IN[252]:df['Matching list']=(df2['Series value 1']>=0.95*benchmark_value)
OUT[253]: ValueError: Can only compare identically-labeled Series objects
Output wanted:
[IN]:
df['Matching list']=(df2['Stu_perc']>=0.95*benchmark_value)
#0.95*Benchmark value is 66.5 in this case.
df['Matching list']
[OUT]:
0 70
1 66.5
2 68
3 NULL
4 100
5 NULL
Because benchmark_value is a Series, you need to select its first value as a scalar with Series.iat, and then set the non-matching rows to NaN with Series.where:
benchmark_value = pd.Series([70], index=[0])
val = benchmark_value.iat[0]
df2['Matching list']= df2['Stu_perc'].where(df2['Stu_perc']>=0.95*val)
print (df2)
Stu_perc Matching list
0 70.0 70.0
1 66.5 66.5
2 68.0 68.0
3 60.0 NaN
4 100.0 100.0
5 12.0 NaN
A more general solution, which also works if benchmark_value is empty, uses next with iter to return the first value of the Series, or a default value (here 0) if it does not exist:
benchmark_value = pd.Series([], dtype='float64')
val = next(iter(benchmark_value), 0)
df2['Matching list']= df2['Stu_perc'].where(df2['Stu_perc']>=0.95*val)
print (df2)
Stu_perc Matching list
0 70.0 70.0
1 66.5 66.5
2 68.0 68.0
3 60.0 60.0
4 100.0 100.0
5 12.0 12.0
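The two variants above can be combined into one short sketch. The Series named scores stands in for df2['Stu_perc'], and the fallback threshold of 0.0 for an empty benchmark is an assumption carried over from the default used above.

```python
import pandas as pd

# Stand-in for df2['Stu_perc'].
scores = pd.Series([70, 66.5, 68, 60, 100, 12], name="Stu_perc")
benchmark_value = pd.Series([70], name="Stu_perc")

# Reduce the one-row Series to a plain scalar; fall back to 0.0 if it is empty.
if benchmark_value.empty:
    threshold = 0.0
else:
    threshold = 0.95 * benchmark_value.iloc[0]

# where keeps values that satisfy the condition and puts NaN elsewhere.
matching = scores.where(scores >= threshold)
print(matching)
```

Comparing against a scalar sidesteps index alignment entirely, which is why the length mismatch between the two Series no longer matters.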
Is your benchmark value a single value?
If yes, you might need to convert benchmark_value, which is a Series, to a number (without an index) by using df['Matching list']=(df['Stu_perc']>=0.95*benchmark_value.values)
It seems benchmark_value is a Series with a single row rather than an actual number, so I believe you need to extract the value first.
Note that the comparison alone returns a Series of Booleans. To get just the values that you want, you can use the where function.
Try this:
df['Matching list'] = df2['Stu_perc'].where(df2['Stu_perc'] >= 0.95 * benchmark_value.iloc[0])
I have many dataframes (timeseries) that are of different lengths ranging between 28 and 179. I need to make them all of length 104. (upsampling those below 104 and downsampling those above 104)
For upsampling, linear interpolation is sufficient for my needs. For downsampling, the mean of the values should be good.
To get all files to be the same length, I thought that I need to make all dataframes start and end at the same dates.
I was able to downsample all to the size of the smallest dataframe (i.e. 28) using below lines of code:
df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'), inplace=True)
resampled=df.resample('120D').mean()
However, this does not give good results when I feed the data into my model, because it shrinks the longer files so much that the data is distorted.
This is what I tried so far:
df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'), inplace=True)
if df.shape[0]>100: resampled=df.resample('D').mean()
elif df.shape[0]<100: resampled=df.astype(float).resample('33D').interpolate(axis=0, method='linear')
else: break
Now, in the above lines of code, I am getting the files to be the same length (length 100). The downsampling part works fine too.
What's not working is the interpolation in the upsampling part. It just returns dataframes of length 100 with the first value of every column copied over to all the rows.
What I need is to make them all size 104 (average size). This means any df of length>104 needs to downsampled and any df of length<104 needs to be upsampled.
As an example, please consider the two dfs as follows:
>>df1
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
>>df2
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 6 -3 -2
4 4 0 -4
5 8 2 -6
6 10 4 -8
7 12 6 -10
Suppose the avg length is 6, the expected output would be:
df1 upsampled to length 6 using interpolation, e.g. resample(rule).interpolate().
And df2 downsampled to length 6 using resample(rule).mean() .
Update:
If I could get all the files to be upsampled to 179, that would be fine as well.
I assume the problem is that when you resample in the up-sampling case, the other values are not kept. With your example df1, you can see this by using asfreq on one column:
print (df1.set_index(pd.date_range(start='1/1/1991' ,periods=len(df1), end='1/1/2000'))[1]
.resample('33D').asfreq().isna().sum(0))
#99 rows are nan on the 100 length resampled dataframe
So when you call interpolate instead of asfreq, it actually interpolates from just the first value, meaning that the first value is "repeated" over all the rows.
To get the result you want, call mean before interpolating, even in the up-sampling case:
print (df1.set_index(pd.date_range(start='1/1/1991' ,periods=len(df1), end='1/1/2000'))[1]
.resample('33D').mean().interpolate().head())
1991-01-01 3.000000
1991-02-03 3.060606
1991-03-08 3.121212
1991-04-10 3.181818
1991-05-13 3.242424
Freq: 33D, Name: 1, dtype: float64
and you will get values as you want.
To conclude, I think in both up-sampling and down-sampling cases, you can use the same command
resampled = (df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'))
.resample('33D').mean().interpolate())
Because the interpolate would not affect the result in the down-sampling case.
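If the goal is an exact target length such as 104, an alternative that avoids choosing a resampling rule is to interpolate each column onto an evenly spaced grid of positions. This is a swapped-in technique, not the resample-based method above, and it uses linear interpolation in both directions rather than the mean when shrinking. The function name resample_to_length is hypothetical.

```python
import numpy as np
import pandas as pd

def resample_to_length(df: pd.DataFrame, target_len: int) -> pd.DataFrame:
    """Linearly interpolate every column of df onto target_len evenly spaced rows."""
    old_pos = np.linspace(0.0, 1.0, num=len(df))      # positions of the existing rows
    new_pos = np.linspace(0.0, 1.0, num=target_len)   # positions of the requested rows
    data = {
        col: np.interp(new_pos, old_pos, df[col].to_numpy(dtype=float))
        for col in df.columns
    }
    return pd.DataFrame(data)

df1 = pd.DataFrame({"a": [3, 5, 9, 11], "b": [-1, -3, -5, -7], "c": [0, 2, 0, -2]})
out = resample_to_length(df1, 7)
print(out)
```

The first and last rows are preserved, which may or may not be desirable for data where the endpoints are noisy.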
Here is my version using skimage.transform.resize() function:
df1 = pd.DataFrame({
'a': [3,5,9,11],
'b': [-1,-3,-5,-7],
'c': [0,2,0,-2]
})
df1
a b c
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
import pandas as pd
import numpy as np
from skimage.transform import resize
def df_resample(df1, num=1):
    df2 = pd.DataFrame()
    for key, value in df1.items():  # DataFrame.iteritems() was removed in pandas 2.0
        temp = value.to_numpy()/value.abs().max() # normalize
        resampled = resize(temp, (num,1), mode='edge')*value.abs().max() # de-normalize
        df2[key] = resampled.flatten().round(2)
    return df2
df2 = df_resample(df1, 20) # resample to 20 rows
df2
a b c
0 3.0 -1.0 0.0
1 3.0 -1.0 0.0
2 3.0 -1.0 0.0
3 3.4 -1.4 0.4
4 3.8 -1.8 0.8
5 4.2 -2.2 1.2
6 4.6 -2.6 1.6
7 5.0 -3.0 2.0
8 5.8 -3.4 1.6
9 6.6 -3.8 1.2
10 7.4 -4.2 0.8
11 8.2 -4.6 0.4
12 9.0 -5.0 0.0
13 9.4 -5.4 -0.4
14 9.8 -5.8 -0.8
15 10.2 -6.2 -1.2
16 10.6 -6.6 -1.6
17 11.0 -7.0 -2.0
18 11.0 -7.0 -2.0
19 11.0 -7.0 -2.0
I have two data frames with same column names.
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.282690 0.073650 22.160800 4050.311360
1 4208.98 5.5 0.490580 0.084925 44.323130 4208.973512
2 4374.94 9.0 0.714830 0.114290 86.964970 4374.927110
3 4379.74 9.0 0.314040 0.091070 30.442710 4379.760601
4 4398.01 14.0 0.504150 0.098450 52.832360 4398.007473
5 4502.21 8.0 0.562780 0.101090 60.559960 4502.205220
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.276350 0.077770 22.876240 4050.310469
1 4208.98 5.5 0.493035 0.084065 44.095755 4208.974363
2 4374.94 6.0 0.716760 0.111550 85.111070 4374.927649
3 4379.74 1.0 0.299070 0.098400 31.325300 4379.759339
4 4398.01 6.0 0.508810 0.084530 45.783740 4398.004164
5 4502.21 9.0 0.572320 0.100540 61.252070 4502.205764
Both dataframes have the same column names, and the first column, wave, is identical in both. I want to take the difference of all the columns except the first one, i.e. wave.
So the resulting dataframe should contain the wave column and the differences of all the other columns.
How can I do that?
I believe you need to extract the column names with Index.difference and then use DataFrame.sub:
cols = df1.columns.difference(['wave'])
#it is possible to specify multiple columns
#cols = df1.columns.difference(['wave','MeasredWave'])
#assigning to df1[cols] leaves the other columns of df1 untouched
df1[cols] = df1[cols].sub(df2[cols])
print (df1)
wave num stlines fwhm EWs MeasredWave
0 4050.32 0.0 0.006340 -0.00412 -0.715440 0.000891
1 4208.98 0.0 -0.002455 0.00086 0.227375 -0.000851
2 4374.94 3.0 -0.001930 0.00274 1.853900 -0.000539
3 4379.74 8.0 0.014970 -0.00733 -0.882590 0.001262
4 4398.01 8.0 -0.004660 0.01392 7.048620 0.003309
5 4502.21 -1.0 -0.009540 0.00055 -0.692110 -0.000544
cols = df1.columns.difference(['wave'])
#assigning to df2[cols] leaves the other columns of df2 untouched
df2[cols] = df1[cols].sub(df2[cols])
print (df2)
wave num stlines fwhm EWs MeasredWave
0 4050.32 0.0 0.006340 -0.00412 -0.715440 0.000891
1 4208.98 0.0 -0.002455 0.00086 0.227375 -0.000851
2 4374.94 3.0 -0.001930 0.00274 1.853900 -0.000539
3 4379.74 8.0 0.014970 -0.00733 -0.882590 0.001262
4 4398.01 8.0 -0.004660 0.01392 7.048620 0.003309
5 4502.21 -1.0 -0.009540 0.00055 -0.692110 -0.000544
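The approach above relies on both frames having their rows in the same order. If that is not guaranteed, one option is to make wave the index so that pandas aligns the rows by wavelength before subtracting. The following is a sketch with made-up two-row frames, not the data from the question.

```python
import pandas as pd

# Illustrative frames; note that df2 lists the wavelengths in a different order.
df1 = pd.DataFrame({"wave": [4050.32, 4208.98], "num": [3.0, 5.5], "fwhm": [0.25, 0.5]})
df2 = pd.DataFrame({"wave": [4208.98, 4050.32], "num": [5.0, 1.0], "fwhm": [0.25, 0.125]})

# Subtraction aligns on the index (wave), so row order no longer matters.
diff = (df1.set_index("wave") - df2.set_index("wave")).reset_index()
print(diff)
```

A wavelength present in only one of the frames would show up as a row of NaN rather than raising an error.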
I have a pandas dataframe with a variable number of columns. I'd like to numerically integrate each column of the dataframe so that I can evaluate the definite integral from row 0 to row 'n'. I have a function that works on a 1D array, but is there a better way to do this in a pandas dataframe so that I don't have to iterate over columns and cells? I was thinking of some way of using applymap, but I can't see how to make it work.
This is the function that works on a 1D array:
def findB(x,y):
    y_int = np.zeros(y.size)
    y_int_min = np.zeros(y.size)
    y_int_max = np.zeros(y.size)
    end = y.size-1
    y_int[0]=(y[1]+y[0])/2*(x[1]-x[0])
    for i in range(1,end,1):
        j=i+1
        y_int[i] = (y[j]+y[i])/2*(x[j]-x[i]) + y_int[i-1]
    return y_int
I'd like to replace it with something that calculates multiple columns of a dataframe all at once, something like this:
B_df = y_df.applymap(integrator)
EDIT:
Starting dataframe dB_df:
Sample1 1 dB Sample1 2 dB Sample1 3 dB Sample1 4 dB Sample1 5 dB Sample1 6 dB
0 2.472389 6.524537 0.306852 -6.209527 -6.531123 -4.901795
1 6.982619 -0.534953 -7.537024 8.301643 7.744730 7.962163
2 -8.038405 -8.888681 6.856490 -0.052084 0.018511 -4.117407
3 0.040788 5.622489 3.522841 -8.170495 -7.707704 -6.313693
4 8.512173 1.896649 -8.831261 6.889746 6.960343 8.236696
5 -6.234313 -9.908385 4.934738 1.595130 3.116842 -2.078000
6 -1.998620 3.818398 5.444592 -7.503763 -8.727408 -8.117782
7 7.884663 3.818398 -8.046873 6.223019 4.646397 6.667921
8 -5.332267 -9.163214 1.993285 2.144201 4.646397 0.000627
9 -2.783008 2.288842 5.836786 -8.013618 -7.825365 -8.470759
Ending dataframe B_df:
Sample1 1 B Sample1 2 B Sample1 3 B Sample1 4 B Sample1 5 B Sample1 6 B
0 0.000038 0.000024 -0.000029 0.000008 0.000005 0.000012
1 0.000034 -0.000014 -0.000032 0.000041 0.000036 0.000028
2 0.000002 -0.000027 0.000010 0.000008 0.000005 -0.000014
3 0.000036 0.000003 -0.000011 0.000003 0.000002 -0.000006
4 0.000045 -0.000029 -0.000027 0.000037 0.000042 0.000018
5 0.000012 -0.000053 0.000015 0.000014 0.000020 -0.000023
6 0.000036 -0.000023 0.000004 0.000009 0.000004 -0.000028
7 0.000046 -0.000044 -0.000020 0.000042 0.000041 -0.000002
8 0.000013 -0.000071 0.000011 0.000019 0.000028 -0.000036
9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
In the above example,
(x[j]-x[i]) = 0.000008
First of all, you can achieve a similar result using vectorized operations. Each element of the integration is just the mean of the current and next y value scaled by the corresponding difference in x. The final integral is just the cumulative sum of these elements. You can achieve the same result by doing something like
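def findB(x, y):
    """
    x : pandas.Series
    y : pandas.DataFrame
    """
    # mean of each y value and the one that follows it
    mean_y = (y[:-1] + y.shift(-1)[:-1]) / 2
    # spacing between consecutive x values
    delta_x = x.shift(-1)[:-1] - x[:-1]
    # axis='index' matches delta_x against the rows rather than the column labels
    scaled_int = mean_y.multiply(delta_x, axis='index')
    cumulative_int = scaled_int.cumsum(axis='index')
    return cumulative_int.shift(1).fillna(0)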
Here DataFrame.shift and Series.shift are used to match the indices of the "next" elements to the current. You have to use DataFrame.multiply rather than the * operator to ensure that the proper axis is used ('index' vs 'column'). Finally, DataFrame.cumsum provides the final integration step. DataFrame.fillna ensures that you have a first row of zeros as you did in the original solution. The advantage of using all the native pandas functions is that you can pass in a dataframe with any number of columns and have it operate on all of them simultaneously.
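The same cumulative trapezoid rule can also be sketched with plain NumPy arrays and wrapped back into a DataFrame. The function name cumulative_trapezoid_df is hypothetical, and unlike the original loop this version returns a leading row of zeros followed by the running integral for every remaining row, so the output has the same length as the input.

```python
import numpy as np
import pandas as pd

def cumulative_trapezoid_df(x, y_df: pd.DataFrame) -> pd.DataFrame:
    """Cumulative trapezoidal integral of every column of y_df with respect to x."""
    x = np.asarray(x, dtype=float)
    y = y_df.to_numpy(dtype=float)
    dx = np.diff(x)[:, None]                 # column vector so it broadcasts across columns
    areas = (y[1:] + y[:-1]) / 2 * dx        # area of each trapezoid
    running = np.cumsum(areas, axis=0)       # running total down the rows
    out = np.vstack([np.zeros((1, y.shape[1])), running])
    return pd.DataFrame(out, index=y_df.index, columns=y_df.columns)

x = pd.Series([0.0, 1.0, 2.0, 3.0])
y_df = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 1.0, 1.0, 1.0]})
B_df = cumulative_trapezoid_df(x, y_df)
print(B_df)
```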
Are you really looking for numeric values of the integral? Maybe you just need a plot? In that case it is easier to use pyplot.
import matplotlib.pyplot as plt
# Introduce a column *bin* holding left limits of our bins.
df['bin'] = pd.cut(df['volume2'], 50).apply(lambda bin: bin.left)
# Group by bins and calculate *f*.
g = df[['bin', 'universe']].groupby('bin').sum()
# Plot the function using cumulative=True.
plt.hist(list(g.index), bins=50, weights=list(g['universe']), cumulative=True)
plt.show()