I have a CSV file that looks like the one shown in the picture. It contains several rows whose values are all zero, sitting between rows of valid data. For each such row I want values interpolated from the rows above and below. I tried df.interpolate(method='linear', limit_direction='forward'), but the zeros are not treated as NaN values, so it didn't work.
First replace all the zeros with np.nan; interpolate will then work correctly:
import pandas as pd
import numpy as np
data = [
[7260,-4.458639405975710,-4.,7.E-08,0.1393070275997700,0.,-0.11144176562682400],
[8030,-4.452569075111660,-4.,4.E-08,0.1347428577024860,-0.1001462206643270,-0.04915374942019220],
[498,-4.450785570790800,-4.437233532812810,1.E-07,0.1577349354100960,-0.1628636478696300,-0.05505793797144350],
[1500,-4.450303023388150,-4.429207978066990,1.E-07,0.1219543073754720,-0.1886731968341070,-0.14408112469719300],
[6600,-4.462030024237730,-4.4286701710604900,4.E-08,0.100803412848051,-0.1840333872203410,-0.18430271378600200],
[8860,0.0,0.0,0.0,0.0,0.0,0.0],
[530,-4.453994378096950,-4.0037494206318200,-9.E-08,0.0594973737919224,1.0356594366090900,-0.03173366589936420],
[6904,-4.449221525263950,-3.1840342819501800,-2.E-07,0.0918042463623589,1.5125956674286500,-0.01150704151230230],
[7700,-4.454965896625150,-3.041102261967650,-1.E-07,0.1211292098853800,1.837772463779190,0.0680406376006960],
[6463,-4.4524324374160600,-3.1096025723730000,-4.E-08,0.1920291560629040,2.062490856824510,0.10665282217392200],
]
df = pd.DataFrame(data, columns=range(98, 105)) \
.replace(0, np.nan) \
.interpolate(method ='linear', limit_direction ='forward')
print(df)
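This gives the output below. Note that replace(0, np.nan) also turns the genuine 0. in row 0 of column 103 into NaN, and it stays NaN because forward interpolation has no earlier value to start from: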
98 99 100 101 102 103 104
0 7260 -4.458639 -4.000000 7.000000e-08 0.139307 NaN -0.111442
1 8030 -4.452569 -4.000000 4.000000e-08 0.134743 -0.100146 -0.049154
2 498 -4.450786 -4.437234 1.000000e-07 0.157735 -0.162864 -0.055058
3 1500 -4.450303 -4.429208 1.000000e-07 0.121954 -0.188673 -0.144081
4 6600 -4.462030 -4.428670 4.000000e-08 0.100803 -0.184033 -0.184303
5 8860 -4.458012 -4.216210 -2.500000e-08 0.080150 0.425813 -0.108018
6 530 -4.453994 -4.003749 -9.000000e-08 0.059497 1.035659 -0.031734
7 6904 -4.449222 -3.184034 -2.000000e-07 0.091804 1.512596 -0.011507
8 7700 -4.454966 -3.041102 -1.000000e-07 0.121129 1.837772 0.068041
9 6463 -4.452432 -3.109603 -4.000000e-08 0.192029 2.062491 0.106653
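If legitimate zeros should survive, one option is to blank out only the rows whose value columns are all zero before interpolating. The following is a minimal sketch under that assumption; the column names id, a and b and the tiny frame are made up for illustration and are not from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data: row 1 is the "all zero" placeholder row,
# while the 0.0 in row 0 / column "b" is a genuine measurement.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "a": [1.0, 0.0, 3.0],
    "b": [0.0, 0.0, 4.0],
})
value_cols = ["a", "b"]

# True only for rows in which every value column is exactly zero.
all_zero = df[value_cols].eq(0).all(axis=1)

filled = df.copy()
filled.loc[all_zero, value_cols] = np.nan
filled[value_cols] = filled[value_cols].interpolate(
    method="linear", limit_direction="forward"
)
print(filled)
```

Here only row 1 is interpolated; the isolated zero in row 0 is left untouched.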
I have a bunch of data points in a time series in a pandas dataframe. Each column is supposedly independent of the others. I want to create a Monte Carlo process to calculate expected values for each of the columns. My assumption is that the underlying data follows a Brownian motion pattern, so I need to generate a normal distribution over the differences between consecutive points in time.
I transform my data like this:
diffs = (data.diff() / data.shift(1))
This is what I have at the moment:
data = diffs.describe()
This gives the following output:
A B C
count 4986.000000 4963.000000 1861.000000
mean 0.000285 0.000109 0.000421
std 0.015759 0.015426 0.014676
...
I process it like this to generate more samples:
import numpy as np
desired_samples = 1000
random = np.random.default_rng().normal(loc=[data.loc[["mean"]].to_numpy()], scale=[data.loc[["std"]].to_numpy()], size=[len(data.columns), desired_samples])
However this gives me an error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (441, 1000) and arg 1 with shape (1, 1, 441).
What I want is just a matrix of random values whose columns have the same std and mean as the sample's columns. That is, if I called describe() on the result, I'd get something like:
A B C
count 1000.0 1000.0 1000.0
mean 0.000285 0.000109 0.000421
std 0.015759 0.015426 0.014676
...
What'd be the correct way to generate those samples?
You could use apply() to create a data frame of random normal values using the attributes of the associated columns.
Generate Test Data
nv = 50
d = {'A':np.random.normal(1,1,nv),'B':np.random.normal(2,2,nv),'C':np.random.normal(3,3,nv)}
df = pd.DataFrame(d)
print(df)
A B C
0 0.276252 -2.833479 5.746740
1 1.562030 1.497242 2.557416
2 0.883105 -0.861824 3.106192
3 0.352372 0.014653 4.006219
4 1.475524 3.151062 -1.392998
5 2.011649 -2.289844 4.371251
6 3.230964 3.578058 0.610422
7 0.366506 3.391327 0.812932
8 1.669673 -1.021665 4.262500
9 1.835547 4.292063 6.983015
10 1.768208 4.029970 3.971751
...
45 0.501706 0.926860 7.008008
46 1.759266 -0.215047 4.560403
47 1.899167 0.690204 -0.538415
48 1.460267 1.506934 1.306303
49 1.641662 1.066182 0.049233
df.describe()
A B C
count 50.000000 50.000000 50.000000
mean 0.962083 1.522234 2.992492
std 1.073733 1.848754 2.838976
Generate Random Values with same approx (calculated) Mean and STD
mat = df.apply(lambda x: np.random.normal(x.mean(),x.std(),100))
print(mat)
A B C
0 0.234955 2.201961 1.910073
1 1.973203 3.528576 5.925673
2 -0.858201 2.234295 1.741338
3 2.245650 2.805498 0.135784
4 1.913691 2.134813 2.246989
.. ... ... ...
95 2.996207 2.248727 2.792658
96 0.663609 4.533541 1.518872
97 0.848259 -0.348086 2.271724
98 3.672370 1.706185 -0.862440
99 0.392051 0.832358 -0.354981
[100 rows x 3 columns]
mat.describe()
A B C
count 100.000000 100.000000 100.000000
mean 0.877725 1.332039 2.673327
std 1.148153 1.749699 2.447532
If you want the matrix as a NumPy array:
mat.to_numpy()
array([[ 0.78881292, 3.09428714, -1.22757096],
[ 0.13044099, -1.02564025, 2.6566989 ],
[ 0.06090083, 1.50629474, 3.61487469],
[ 0.71418932, 1.88441111, 5.84979454],
[ 2.34287411, 2.58478867, -4.04433653],
[ 1.41846256, 0.36414635, 8.47482082],
[ 0.46765842, 1.37188986, 3.28011085],
[ 0.87433273, 3.45735286, 1.13351138],
[ 1.59029413, 4.0227165 , 3.58282534],
[ 2.23663894, 2.75007385, -0.36242541],
[ 1.80967311, 1.29206572, 1.73277577],
[ 1.20787923, 2.75529187, 4.64721489],
[ 2.33466341, 6.43830387, 4.31354348],
[ 0.87379125, 3.00658046, 4.94270155],
etc ...
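The shape mismatch in the question comes from wrapping the means and stds in extra list and DataFrame dimensions, and from putting the column count first in size. A sketch of a direct call, assuming the statistics come from describe() as in the question: pass 1-D arrays of length n_columns and ask for size=(desired_samples, n_columns), so the per-column parameters broadcast along the last axis. The source frame here is synthetic test data, not the asker's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real differences frame.
source = pd.DataFrame({
    "A": rng.normal(1, 1, 500),
    "B": rng.normal(2, 2, 500),
    "C": rng.normal(3, 3, 500),
})
stats = source.describe()

desired_samples = 1000
means = stats.loc["mean"].to_numpy()  # 1-D, one entry per column
stds = stats.loc["std"].to_numpy()    # 1-D, one entry per column

# loc/scale of shape (n_columns,) broadcast against size=(rows, n_columns).
samples = pd.DataFrame(
    rng.normal(loc=means, scale=stds, size=(desired_samples, len(means))),
    columns=source.columns,
)
print(samples.describe())
```

Note the single brackets in stats.loc["mean"]: they return a Series, so to_numpy() yields a flat array rather than a 2-D one.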
I currently have data containing a location name, latitude, longitude and a number value associated with each location. My final goal is a dataframe that holds, for each location, the sum of the values of the locations within specific distance ranges of it. A sample dataframe is below:
IDVALUE,Latitude,Longitude,NumberValue
ID1,44.968046,-94.420307,1
ID2,44.933208,-94.421310,10
ID3,33.755787,-116.359998,15
ID4,33.844843,-116.54911,207
ID5,44.92057,-93.44786,133
ID6,44.240309,-91.493619,52
ID7,44.968041,-94.419696,39
ID8,44.333304,-89.132027,694
ID9,33.755783,-116.360066,245
ID10,33.844847,-116.549069,188
ID11,44.920474,-93.447851,3856
ID12,44.240304,-91.493768,189
Firstly, I managed to get the distances between each of them using the haversine function. Using the code below I turned the latlongs into radians and then created a matrix where the diagonals are infinite values.
df_latlongs['LATITUDE'] = np.radians(df_latlongs['LATITUDE'])
df_latlongs['LONGITUDE'] = np.radians(df_latlongs['LONGITUDE'])
dist = DistanceMetric.get_metric('haversine')
latlong_df = pd.DataFrame(dist.pairwise(df_latlongs[['LATITUDE','LONGITUDE']].to_numpy())*6373, columns=df_latlongs.IDVALUE.unique(), index=df_latlongs.IDVALUE.unique())
np.fill_diagonal(latlong_df.values, math.inf)
This distance matrix is then in kilometres. What I'm struggling with next is to be able to filter the distances of each of the locations and get the total number of values within a range and link this to the original dataframe.
Below is the code I have used to filter the distance matrix to get all of the locations within 500 meters:
latlong_df_rows = latlong_df[latlong_df < 0.5]
latlong_df_rows = latlong_df_rows.dropna(how='all', axis=0)
latlong_df_rows = latlong_df_rows.dropna(how='all', axis=1)
My attempt was to then get, for each location, a list of the locations within this range using the code below:
within_range_df = latlong_df_rows.apply(lambda row: row[row < 0.05].index.tolist(), axis=1)
within_range_df = within_range_df.to_frame()
within_range_df = within_range_df.dropna(how='all', axis=0)
within_range_df = within_range_df.dropna(how='all', axis=1)
From here I was going to try and get the NumberValue from the original dataframe by looping through the list of values to obtain another column for the number for that location. Then sum all of them. The final dataframe would ideally look like the following:
ID VALUE,<500m,500-1000m,>1000m
ID1,x1,y1,z1
ID2,x2,y2,z2
ID3,x3,y3,z3
ID4,x4,y4,z4
ID5,x5,y5,z5
ID6,x6,y6,z6
ID7,x7,y7,z7
ID8,x8,y8,z8
ID9,x9,y9,z9
ID10,x10,y10,z10
ID11,x11,y11,z11
ID12,x12,y12,z12
Where x y and z are the total number values for the nearest locations for different distances. I know this is probably really weird and overcomplicated so any tips to change the question or anything else that is needed I'll be happy to provide. Cheers
I would define a helper function, making use of BallTree, e.g.
from sklearn.neighbors import BallTree
import pandas as pd
import numpy as np
df = pd.read_csv('input.csv')
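We use query_radius() to get the indices of all points within the radius (each point's own index is included, since its distance to itself is zero), then a list comprehension to look up the values and sum them: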
locations_radians = np.radians(df[["Latitude","Longitude"]].values)
tree = BallTree(locations_radians, leaf_size=12, metric='haversine')
def summed_numbervalue_for_radius(radius_in_m=100):
    distance_in_meters = radius_in_m
    earth_radius = 6371000
    radius = distance_in_meters / earth_radius
    ids_within_radius = tree.query_radius(locations_radians, r=radius, count_only=False)
    values_as_array = np.array(df.NumberValue)
    summed_values = [values_as_array[ix].sum() for ix in ids_within_radius]
    return np.array(summed_values)
With the helper function you can do, for instance:
df = df.assign( sum_100=summed_numbervalue_for_radius(100))
df = df.assign( sum_500=summed_numbervalue_for_radius(500))
df = df.assign( sum_1000=summed_numbervalue_for_radius(1000))
df = df.assign( sum_1000_to_5000=summed_numbervalue_for_radius(5000)-summed_numbervalue_for_radius(1000))
This will give you:
IDVALUE Latitude Longitude NumberValue sum_100 sum_500 sum_1000 \
0 ID1 44.968046 -94.420307 1 40 40 40
1 ID2 44.933208 -94.421310 10 10 10 10
2 ID3 33.755787 -116.359998 15 260 260 260
3 ID4 33.844843 -116.549110 207 395 395 395
4 ID5 44.920570 -93.447860 133 3989 3989 3989
5 ID6 44.240309 -91.493619 52 241 241 241
6 ID7 44.968041 -94.419696 39 40 40 40
7 ID8 44.333304 -89.132027 694 694 694 694
8 ID9 33.755783 -116.360066 245 260 260 260
9 ID10 33.844847 -116.549069 188 395 395 395
10 ID11 44.920474 -93.447851 3856 3989 3989 3989
11 ID12 44.240304 -91.493768 189 241 241 241
sum_1000_to_5000
0 10
1 40
2 0
3 0
4 0
5 0
6 10
7 0
8 0
9 0
10 0
11 0
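The sums above are cumulative and include each location's own value. If disjoint bands that exclude the location itself are wanted, as in the table sketched in the question, one possible approach is a plain NumPy haversine matrix combined with boolean masks. This is only a sketch: the helper haversine_matrix_m, the band names and the three-row sample are illustrative assumptions, and a dense matrix only suits data sets small enough to hold n x n distances in memory.

```python
import numpy as np
import pandas as pd

def haversine_matrix_m(lat_deg, lon_deg, earth_radius_m=6371000.0):
    """Pairwise great-circle distances in metres (hypothetical helper)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * earth_radius_m * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Three rows taken from the sample data in the question.
df = pd.DataFrame({
    "IDVALUE": ["ID1", "ID2", "ID7"],
    "Latitude": [44.968046, 44.933208, 44.968041],
    "Longitude": [-94.420307, -94.421310, -94.419696],
    "NumberValue": [1, 10, 39],
})

dist = haversine_matrix_m(df["Latitude"], df["Longitude"])
np.fill_diagonal(dist, np.inf)  # a location never counts as its own neighbour

values = df["NumberValue"].to_numpy()
bands = {"lt_500m": (0, 500), "500_to_1000m": (500, 1000), "gt_1000m": (1000, np.inf)}
for name, (lower, upper) in bands.items():
    mask = (dist >= lower) & (dist < upper)  # the infinite diagonal fails `< upper`
    df[name] = mask.astype(int) @ values     # row-wise sum of neighbours' values
print(df)
```

For larger inputs the BallTree helper is the better fit; a band can then be obtained by subtracting two cumulative sums, as done for sum_1000_to_5000.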
I created a database and I am trying to substitute the categorical variables with some numerical values that I calculated via 'pivot'. In my code, I am trying to iterate through the whole dataframe, and if a cell in one of the categorical columns has the same value as one of the elements in 'sublist_names', it should be replaced by the element in 'sublist_values' located in the same position as the value in 'sublist_names'.
For example, while iterating the dataframe and each of the categorical columns, the first value of column called 'Name' is the string 'tom'. 'tom' is exactly the 7th element in 'sublist_names', which means it should be replaced by the 7th element in 'sublist_values' which is equal to 150.
I was able to obtain all the needed values but when it comes to solving this last task by iterating the whole dataframe instead of working column by column, I am not sure how to do it.
I hope I explained clearly, but for any questions feel free to ask.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = [['tom', 10,6,'brown',200],
['nick', 15,5.10,'red',150],
['juli', 14,5.5,'black',170]
,['peter', 10,6,'blue',290],
['axel', 15,5.10,'yellow',190],
['william', 14,5.5,'yellow',170]
,['tom', 10,6,'orange',100],
['tom', 15,5.10,'brown',150],
['angela', 14,5.5,'black',160]
,['peter', 10,6,'purple',220],
['nick', 15,5.10,'orange',150],
['aroon', 14,5.5,'red',170] ]
df = pd.DataFrame(data, columns=['Name', 'Age','height','color','weight'])
categorical_variables= (df.select_dtypes('object') ) # categorical variables
categ_var_list=(list(categorical_variables))
print(categ_var_list)
condition_pivot_list_names=[]
pivot_values_list=[]
for i in categ_var_list:
    condition_pivot = df.pivot_table(index=i, values='weight', aggfunc=np.mean)
    pivot_names = (condition_pivot.index.values.tolist())
    condition_pivot_list_names.append(pivot_names)
    pivot_values_draft = ((condition_pivot.values.tolist()))
    pivot_values = [i[0] for i in pivot_values_draft]
    pivot_values_list.append(pivot_values)
print(condition_pivot_list_names, 'condition pivot list names')
print(pivot_values_list,'pivot values list')
sublist_names=[(sublists) for sublists in condition_pivot_list_names]
print(sublist_names)
sublist_values=[(sublists1) for sublists1 in pivot_values_list]
print(sublist_values)
def myfunc(x):
    if x in sublist_names:
        index=sublist_names.index(x)
        return sublist_values[index]
    return x
df['Name'] = df['Name'].apply(lambda x: myfunc(x))
print(df['Name'])
This is what print(df['Name']) shows:
0 tom
1 nick
2 juli
3 peter
4 axel
5 william
6 tom
7 tom
8 angela
9 peter
10 nick
11 aroon
And this is what should show:
0 150
1 150
2 170
3 255
4 190
5 170
6 150
7 150
8 160
9 255
10 150
11 170
You have two categorical columns, Name and color, so you can do something like this:
df['Name'] = df['Name'].apply(lambda x: myfunc(x))
Then you can create a function myfunc() which receives x from the code above. That code iterates over the column and passes the value of each row, one by one, to the function. Inside the function you can define the logic to convert the categorical values, something like this:
def myfunc(x):
    if x in sublist_names:
        index=sublist_names.index(x)
        return sublist_values[index]
    return x
Do the same thing for the column color.
Try this:
df.Name = np.where(df.groupby('Name', as_index=False)['Name'].cumcount().eq(0), df.Name, df.weight)
Output:
Name Age height color weight
0 tom 10 6.0 brown 200
1 nick 15 5.1 red 150
2 juli 14 5.5 black 170
3 peter 10 6.0 blue 290
4 axel 15 5.1 yellow 190
5 william 14 5.5 yellow 170
6 100 10 6.0 orange 100
7 150 15 5.1 brown 150
8 angela 14 5.5 black 160
9 220 10 6.0 purple 220
10 150 15 5.1 orange 150
11 aroon 14 5.5 red 170
Okay, I see your problem: sublist_names and sublist_values are lists of lists, so a plain name is never found in them. Flatten them first by adding the code below before the function declaration.
sub_names=[]
sub_values=[]
for i in sublist_names:
    sub_names.extend(i)
for i in sublist_values:
    sub_values.extend(i)
Also don't forget to update the variable names in myfunc().
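As an alternative to building and flattening the lookup lists by hand, the same replacement can be sketched with groupby().transform('mean'), which returns the group mean aligned to every original row. This is an assumed reading of the goal (replace each category with the mean weight of its group); the shortened frame and the explicit categorical_cols list are illustrative.

```python
import pandas as pd

# Shortened version of the question's data.
df = pd.DataFrame({
    "Name": ["tom", "nick", "tom", "peter", "peter"],
    "color": ["brown", "red", "orange", "blue", "purple"],
    "weight": [200, 150, 100, 290, 220],
})

# Listed explicitly here; in the question these come from select_dtypes.
categorical_cols = ["Name", "color"]

encoded = df.copy()
for col in categorical_cols:
    # Mean weight per category, broadcast back onto each row of that category.
    encoded[col] = df.groupby(col)["weight"].transform("mean")
print(encoded)
```

Because each column is grouped separately, a value that happens to appear in two different columns cannot pick up the wrong mean, which a single flattened lookup list would not guarantee.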
To best illustrate consider the following SQL Illustration:
Table StockPrices, BarSeqId is a sequential number where each increment is information from next minute of trading.
The goal to achieve in pandas data frame is to transform this data:
StockPrice BarSeqId LongProfitTarget
105 0 109
100 1 105
103 2 107
103 3 108
104 4 110
105 5 113
into this data:
StockPrice BarSeqId LongProfitTarget TargetHitBarSeqId
106 0 109 Nan
100 1 105 3
103 2 107 5
105 3 108 Nan
104 4 110 Nan
107 5 113 Nan
The aim is to create a new column that records the earliest future bar (sequential time frame after the current one) at which the price target is hit.
Here is how it could be achieved in SQL:
SELECT S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget,
min(S2.BarSeqId) as TargetHitBarSeqId
FROM StockPrices S1
left outer join StockPrices S2 on S1.BarSeqId<s2.BarSeqId and
S2.StockPrice>=S1.LongProfitTarget
GROUP BY S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget
I would like the answer to be as follows:
someDataFrame['TargetHitBarSeqId'] = (pandas expression here ...)
assume that someDataFrame already has columns: StockPrice, BarSeqId, LongProfitTarget
data edited to illustrate case
so in the second row result should be
100 1 105 3
and NOT
100 1 105 0
since 3 and not 0 occurs after 1.
It is important that the barseq in question shall occur in the future (greater than current BarSeq)
df = pd.DataFrame({'StockPrice':[105,100,103,105,104,107],'BarSeqId':[0,1,2,3,4,5],
'LongProfitTarget':[109,105,107,108,110,113]})
def get_barseqid(longProfitTarget):
    try:
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.iloc[idx].BarSeqId
    except:
        return np.nan
df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)
Here's one solution:
import pandas as pd
import numpy as np
df = <your input data frame>
def get_barseqid(longProfitTarget):
    try:
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.iloc[idx].BarSeqId
    except:
        return np.nan
df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)
Output:
StockPrice BarSeqId LongProfitTarget TargetHitBarSeqId
0 100 1 105 3.0
1 103 2 107 5.0
2 105 3 108 NaN
3 104 4 110 NaN
4 107 5 113 NaN
import pandas as pd
import numpy as np
df = pd.DataFrame({'StockPrice':[105,100,103,105,104,107],'BarSeqId':[0,1,2,3,4,5],
'LongProfitTarget':[109,105,107,108,110,113]})
def get_barseqid(longProfitTarget, barseq):
    try:
        # first row that meets the target AND lies strictly after the current bar
        idx = df[(df.StockPrice >= longProfitTarget) & (df.BarSeqId > barseq)].index[0]
        return df.iloc[idx].BarSeqId
    except IndexError:
        # no future bar reaches the target
        return np.nan
df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget'], row['BarSeqId']), axis=1)
df
The key misunderstanding for me was the need to use the & operator, rather than the regular 'and' keyword, to combine the two conditions.
Assuming data is manageable, consider a cross join followed by filter and groupby, which would replicate the SQL query:
cdf = pd.merge(df.assign(key=1), df.assign(key=1), on='key', suffixes=['','_'])\
.query('(BarSeqId < BarSeqId_) & (LongProfitTarget <= StockPrice_)')\
.groupby(['StockPrice', 'BarSeqId', 'LongProfitTarget'])['BarSeqId_'].min()
print(cdf)
# StockPrice BarSeqId LongProfitTarget
# 100 1 105 3
# 103 2 107 5
# Name: BarSeqId_, dtype: int64
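For comparison, the same "earliest future bar that reaches the target" logic can be sketched with NumPy broadcasting instead of a row-wise apply or a cross join. Like the cross join, it builds an n x n intermediate, so it assumes the data is small enough for that; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "StockPrice": [105, 100, 103, 105, 104, 107],
    "BarSeqId": [0, 1, 2, 3, 4, 5],
    "LongProfitTarget": [109, 105, 107, 108, 110, 113],
})

price = df["StockPrice"].to_numpy()
target = df["LongProfitTarget"].to_numpy()
seq = df["BarSeqId"].to_numpy()

# hit[i, j] is True when bar j lies after bar i and its price meets bar i's target.
hit = (price[None, :] >= target[:, None]) & (seq[None, :] > seq[:, None])

# Smallest qualifying BarSeqId per row; rows with no hit keep +inf, then become NaN.
earliest = np.where(hit, seq[None, :], np.inf).min(axis=1)
df["TargetHitBarSeqId"] = np.where(np.isinf(earliest), np.nan, earliest)
print(df)
```

Taking the minimum over candidate BarSeqId values, rather than the first True position, means the rows do not need to be sorted by BarSeqId.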
I have a fairly large (~5000 rows) DataFrame, with a number of variables, say 2 ['max', 'min'], sorted by 4 parameters, ['Hs', 'Tp', 'wd', 'seed']. It looks like this:
>>> data.head()
Hs Tp wd seed max min
0 1 9 165 22 225 18
1 1 9 195 16 190 18
2 2 5 165 43 193 12
3 2 10 180 15 141 22
4 1 6 180 17 219 18
>>> len(data)
4500
I want to keep only the first 2 parameters and get the maximum standard deviation for all 'seed's calculated individually for each 'wd'.
In the end, I'm left with unique (Hs, Tp) pairs with the maximum standard deviations for each variable. Something like:
>>> stdev.head()
Hs Tp max min
0 1 5 43.31321 4.597629
1 1 6 43.20004 4.640795
2 1 7 47.31507 4.569408
3 1 8 41.75081 4.651762
4 1 9 41.35818 4.285991
>>> len(stdev)
30
The following code does what I want, but since I have little understanding about DataFrames, I'm wondering if these nested loops can be done in a different and more DataFramy way =)
import pandas as pd
import numpy as np
#
#data = pd.read_table('data.txt')
#
# don't worry too much about this ugly generator,
# it just emulates the format of my data...
total = 4500
data = pd.DataFrame()
data['Hs'] = np.random.randint(1,4,size=total)
data['Tp'] = np.random.randint(5,15,size=total)
data['wd'] = [[165, 180, 195][np.random.randint(0,3)] for _ in range(total)]
data['seed'] = np.random.randint(1,51,size=total)
data['max'] = np.random.randint(100,250,size=total)
data['min'] = np.random.randint(10,25,size=total)
# and here it starts. would the creators of pandas pull their hair out if they see this?
# can this be made better?
stdev = pd.DataFrame(columns = ['Hs', 'Tp', 'max', 'min'])
i=0
for hs in set(data['Hs']):
    data_Hs = data[data['Hs'] == hs]
    for tp in set(data_Hs['Tp']):
        data_tp = data_Hs[data_Hs['Tp'] == tp]
        stdev.loc[i] = [
            hs,
            tp,
            max([np.std(data_tp[data_tp['wd']==wd]['max']) for wd in set(data_tp['wd'])]),
            max([np.std(data_tp[data_tp['wd']==wd]['min']) for wd in set(data_tp['wd'])])]
        i+=1
Thanks!
PS: if curious, these are statistics on variables that depend on sea waves. Hs is wave height, Tp wave period, wd wave direction, the seeds represent different realizations of an irregular wave train, and min and max are the peaks of my variable during a certain exposure time. After all this, by means of the standard deviation and average, I can fit some distribution to the data, like Gumbel.
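This could be a one-liner, if I understood you correctly:
data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']].std(ddof=0).groupby(level=['Hs', 'Tp']).max()
(The originally posted form ended in .max(level=[0, 1]); the level argument was removed in pandas 2.0, so the second groupby replaces it. Include reset_index() on the end if you want.)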
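To make the two stages of that one-liner visible, here is a small hand-checkable sketch: first the population standard deviation (ddof=0, matching np.std) per (Hs, Tp, wd), then the maximum of those over wd for each (Hs, Tp). The eight-row frame is invented for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "Hs":   [1, 1, 1, 1, 1, 1, 1, 1],
    "Tp":   [5, 5, 5, 5, 6, 6, 6, 6],
    "wd":   [165, 165, 180, 180, 165, 165, 180, 180],
    "seed": [1, 2, 1, 2, 1, 2, 1, 2],
    "max":  [10, 14, 20, 20, 1, 3, 5, 9],
    "min":  [1, 1, 2, 4, 0, 2, 3, 3],
})

# Stage 1: spread across seeds, computed separately for every wave direction.
per_wd = data.groupby(["Hs", "Tp", "wd"])[["max", "min"]].std(ddof=0)

# Stage 2: keep the largest spread over the directions for each (Hs, Tp) pair.
stdev = per_wd.groupby(level=["Hs", "Tp"]).max().reset_index()
print(stdev)
```

For (Hs=1, Tp=5) the per-direction deviations of max are 2 and 0, so the result keeps 2; the same reasoning gives 1 for min.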