Get values at specific altitudes based on unevenly distributed observations - python

Input data looks like this (pandas DataFrame):
    index  altitude  temperature
0  669084      76.0          NaN
1  669085     190.0        -70.0
2  669086     384.0       -290.0
3  669087     693.0       -430.0
4  669088     883.0       -290.0
5  669089     963.0       -250.0
6  669090     989.0       -250.0
7  669091    1259.0       -380.0
...
It's essentially the result of a single vertical sounding. Measurements are made at "random" altitudes, and I need to calculate values at specific altitudes, like 100 m, 300 m, 500 m, 1000 m and so on.
I presume it calls for some form of interpolation, but I'm not sure what the best approach is.
What is the best practice for that using Python, numpy and pandas?

reindex & interpolate
First we set altitude as the index so we can reindex to every "whole" altitude number.
Then we interpolate temperature between the measurements.
Notice that in this case the value at 100 m cannot be truly interpolated, since the lowest valid measurement is at 190 m (the reading at 76 m is NaN); the backward pass simply backfills it:
min_alt = int(df['altitude'].min())
max_alt = int(df['altitude'].max()) + 1
newdf = df.set_index('altitude').reindex(range(min_alt, max_alt)).reset_index()
# carry the original row label forward to each new altitude
newdf['index'] = newdf['index'].ffill()
# linear interpolation between measurements, then a backward pass to
# fill the leading NaNs below 190 m (assignment avoids inplace on a column)
newdf['temperature'] = newdf['temperature'].interpolate()
newdf['temperature'] = newdf['temperature'].interpolate(limit_direction='backward')
Output
      altitude     index  temperature
0           76  669084.0   -70.000000
1           77  669084.0   -70.000000
2           78  669084.0   -70.000000
3           79  669084.0   -70.000000
4           80  669084.0   -70.000000
...        ...       ...          ...
1179      1255  669090.0  -378.074074
1180      1256  669090.0  -378.555556
1181      1257  669090.0  -379.037037
1182      1258  669090.0  -379.518519
1183      1259  669091.0  -380.000000
[1184 rows x 3 columns]
Then if we check the values at 300 m, 500 m, and 1000 m:
newdf.query('altitude.isin([300, 500, 1000])')
Output
     altitude     index  temperature
224       300  669085.0  -194.742268
424       500  669086.0  -342.556634
924      1000  669090.0  -255.296296
We can see that temperature is interpolated.
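Alternatively, if you only need a handful of target altitudes rather than a value for every metre, numpy's piecewise-linear interpolator can evaluate them directly. A minimal sketch, assuming df is the sounding DataFrame from the question:
import numpy as np
targets = [100, 300, 500, 1000]
# drop the NaN reading and ensure altitudes are increasing, as np.interp requires
valid = df.dropna(subset=['temperature']).sort_values('altitude')
temps = np.interp(targets, valid['altitude'], valid['temperature'])
# np.interp clamps outside the measured range, so 100 m (below the lowest
# valid reading at 190 m) simply repeats its temperature of -70.0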

Related

How to plot Numerical Values in matplotlib

So I have this kind of database:
     Time Type  Profit
2      82  s/l   -51.3
5       9  t/p  164.32
8      38  s/l  -53.19
11     82  s/l   -54.4
14    107  s/l  -54.53
..    ...  ...     ...
730   111  s/l  -70.72
731   111  s/l  -70.72
732   111  s/l  -70.72
733   113  s/l  -65.13
734   113  s/l  -65.13
[239 rows x 3 columns]
I want to plot a chart with X as the time (already converted to hours of the week) and Y as the profit (which can be positive or negative). For each hour (X), I would like two bars showing the profit. Negative profits should also be displayed as positive values, but in a separate bar.
For example, with -65 and 70, both would show as positive (65 and 70) on the chart, but the loss would have a different bar color.
This is my code so far:
#reading the csv file
data = pd.read_csv(filename)
df = pd.DataFrame(data, columns = ['Time','Type','Profit']).astype(str)
#turns time column into hours of week
df['Time'] = df['Time'].apply(lambda x: findHourOfWeek(x))
#Takes in winning trades (t/p) and losing trades(s/l)
df = df[(df['Type'] == 't/p') | (df['Type'] == 's/l')]
#Plots the chart
ax = df.plot(title='Profits and Losses (Hour Of Week)',kind='bar')
#ax.legend(['Losses', 'Winners'])
plt.xlabel('Hour of Week')
plt.ylabel('Amount Of Profit/Loss')
plt.show()
You can groupby, unstack and plot:
(df.groupby(['Time','Type']).Profit.sum().abs()
   .unstack('Type')
   .plot.bar()
)
For your sample data above, this produces a grouped bar chart with one s/l bar and one t/p bar per hour.
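As a self-contained sketch (the sample rows and the color mapping are illustrative additions, and mapping colors by column name needs a reasonably recent pandas), this also gives losses and winners different bar colors:
import matplotlib.pyplot as plt
import pandas as pd
# hypothetical sample standing in for the asker's CSV
df = pd.DataFrame({
    'Time':   [82, 9, 38, 82, 107],
    'Type':   ['s/l', 't/p', 's/l', 's/l', 's/l'],
    'Profit': [-51.3, 164.32, -53.19, -54.4, -54.53],
})
ax = (df.groupby(['Time', 'Type']).Profit.sum().abs()
        .unstack('Type')  # one column, and hence one bar color, per trade type
        .plot.bar(color={'s/l': 'tab:red', 't/p': 'tab:green'},
                  title='Profits and Losses (Hour Of Week)'))
ax.set_xlabel('Hour of Week')
ax.set_ylabel('Amount of Profit/Loss')
plt.show()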

Trying to lookup a value from a pandas dataframe within a range of two rows in the index dataframe

I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index to a new column in the load data if the Fat of that load is not greater than the next Wet Fat.
Below is a .head() sample of each dataframe and the code I tried. It raised ValueError: Can only compare identically-labeled Series objects.
pricing
    Price_Per_Ton  Wet_Fat
0             306       10
1             339       11
2             382       12
3             430       13
4             481       14
5             532       15
6             580       16
7             625       17
8             665       18
9             700       19
10            728       20
11            750       21
12            766       22
13            778       23
14            788       24
15            797       25
grower_moo
      Load Ticket  Net Fruit Weight  Net MOO  Percent_MOO    Fat
0  L2019000011817             56660      833     1.448872  21.92
1  L2019000011816             53680     1409     2.557679  21.12
2  L2019000011815             53560     1001     1.834644  21.36
3  L2019000011161             62320     2737     4.207080  21.41
4  L2019000011160             57940     1129     1.911324  20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output: a grower_moo['Fat'] of 13.60 is not greater than the next Wet_Fat of 14, so it gets the 13-Fat price of $430 per ton.
grower_moo_with_price
      Load Ticket  Net Fruit Weight  Net MOO  Percent_MOO    Fat  price_per_ton
0  L2019000011817             56660      833     1.448872  21.92            750
1  L2019000011816             53680     1409     2.557679  21.12            750
2  L2019000011815             53560     1001     1.834644  21.36            750
3  L2019000011161             62320     2737     4.207080  21.41            750
4  L2019000011160             57940     1129     1.911324  20.06            728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key. For each row in the left DataFrame, a "backward" search [the default] selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
In the following code, I use your example inputs, but with column names using underscores _ instead of spaces.
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
      Load_Ticket  Net_Fruit_Weight  Net_MOO  Percent_MOO    Fat  Price_Per_Ton  Wet_Fat
0  L2019000011160             57940     1129     1.911324  20.06            728     20.0
1  L2019000011816             53680     1409     2.557679  21.12            750     21.0
2  L2019000011815             53560     1001     1.834644  21.36            750     21.0
3  L2019000011161             62320     2737     4.207080  21.41            750     21.0
4  L2019000011817             56660      833     1.448872  21.92            750     21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
      Load_Ticket  Net_Fruit_Weight  Net_MOO  Percent_MOO    Fat  Price_Per_Ton
0  L2019000011160             57940     1129     1.911324  20.06            728
1  L2019000011816             53680     1409     2.557679  21.12            750
2  L2019000011815             53560     1001     1.834644  21.36            750
3  L2019000011161             62320     2737     4.207080  21.41            750
4  L2019000011817             56660      833     1.448872  21.92            750
Another option is to concat the two frames and filter:
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']
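For reference, a numpy-only sketch of the same "as of" lookup, assuming pricing stays sorted ascending by Wet_Fat as in the sample:
import numpy as np
# position of the last Wet_Fat not greater than each load's Fat
# (assumes every Fat >= the smallest Wet_Fat, else the index wraps to -1)
idx = np.searchsorted(pricing['Wet_Fat'].values, grower_moo['Fat'].values, side='right') - 1
grower_moo['price_per_ton'] = pricing['Price_Per_Ton'].values[idx]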

Iterating over pandas rows to get minimum

Here is my dataframe:
Date        cell  tumor_size(mm)
25/10/2015   113              51
22/10/2015   222              50
22/10/2015   883              45
20/10/2015   334              35
19/10/2015   564              47
19/10/2015   123              56
22/10/2014   345              36
13/12/2013   456              44
What I want to do is compare the sizes of the tumors detected on different days. Let's consider cell 222 as an example: I want to compare its size to other cells, but only those detected on earlier days. E.g. I will not compare it with cell 883, because they were detected on the same day, nor with cell 113, because it was detected later on.
As my dataset is very large, I have to iterate over the rows. If I explain it in a non-pythonic way:
for cell 222:
    get_size_distance (absolute value):
        (50 - 35 = 15), (50 - 47 = 3), (50 - 56 = 6), (50 - 36 = 14), (50 - 44 = 6)
    get_minimum = 3; I got this value when comparing with cell 564, so I will name it as the pair for cell 222
Then do the same for cell 883
The resulting output should look like this:
Date        cell  tumor_size(mm)  pair  size_difference
25/10/2015   113              51   222                1
22/10/2015   222              50   123                6
22/10/2015   883              45   456                1
20/10/2015   334              35   345                1
19/10/2015   564              47   456                3
19/10/2015   123              56   456               12
22/10/2014   345              36   456                8
13/12/2013   456              44   NaN              NaN
I will really appreciate your help.
It's not pretty, but I believe it does the trick:
import pandas as pd
from datetime import datetime
a = pd.read_clipboard()
# Cut off the last row since it had a faulty date. You can skip this.
df = a.copy().iloc[:-1]
# Convert to dates and order by date descending just in case.
df['Date'] = df.Date.apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))
df = df.sort_values('Date', ascending=False)
# Rename the size column so itertuples exposes it as a valid attribute
df = df.rename(columns={"tumor_size(mm)": 'tumor_size'})
# These will be our lists of pairs and size differences.
pairs = []
diffs = []
# Loop over all unique dates
for date in df.Date.unique():
    # Only take rows from earlier dates.
    compare_df = df.loc[df.Date < date].copy()
    # Loop over each cell for this date and find the minimum difference
    for row in df.loc[df.Date == date].itertuples():
        # If no earlier cells are available, use NaNs.
        if compare_df.empty:
            pairs.append(float('nan'))
            diffs.append(float('nan'))
        # Otherwise take the earlier row with the lowest absolute difference
        else:
            compare_df['size_diff'] = abs(compare_df.tumor_size - row.tumor_size)
            row_of_interest = compare_df.loc[compare_df.size_diff == compare_df.size_diff.min()]
            pairs.append(row_of_interest.cell.values[0])
            diffs.append(row_of_interest.size_diff.values[0])
df['pair'] = pairs
df['size_difference'] = diffs
returns:
        Date  cell  tumor_size   pair  size_difference
0 2015-10-25   113          51  222.0              1.0
1 2015-10-22   222          50  564.0              3.0
2 2015-10-22   883          45  564.0              2.0
3 2015-10-20   334          35  345.0              1.0
4 2015-10-19   564          47  345.0             11.0
5 2015-10-19   123          56  345.0             20.0
6 2014-10-22   345          36    NaN              NaN
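For completeness, here is a vectorized sketch of the same idea via a self cross join (not from the original answer; it assumes the df prepared above, needs pandas >= 1.2 for how='cross', and its memory use grows quadratically with the row count):
import pandas as pd
# pair every row with every strictly earlier row
m = df.merge(df, how='cross', suffixes=('', '_prev'))
m = m[m['Date_prev'] < m['Date']]
m['size_difference'] = (m['tumor_size'] - m['tumor_size_prev']).abs()
# per cell, keep the earlier row with the smallest absolute difference
best = m.loc[m.groupby('cell')['size_difference'].idxmin(),
             ['cell', 'cell_prev', 'size_difference']]
# cells with no earlier rows get NaN via the left merge
result = df.merge(best.rename(columns={'cell_prev': 'pair'}), on='cell', how='left')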

How to slice pandas DataFrame based on values from another Dataframe without using for-loop?

I have a DataFrame df1:
df1.head() =
               id  type  position
dates
2000-01-03  17378   600       400
2000-01-03   4203   600       150
2000-01-03  18321   600      5000
2000-01-03   6158   600      1000
2000-01-03    886   600     10000
2000-01-03  17127   600       800
2000-01-03  18317  1300       110
2000-01-03   5536   600       207
2000-01-03   5132   600     20000
2000-01-03  18191   600      2000
And a second DataFrame df2:
df2.head() =
                   dt_f        dt_l
id_y id_x
670  715     2000-02-14  2003-09-30
704  2963    2000-02-11  2004-01-13
886  18350   2000-02-09  2001-09-24
1451 18159   2005-11-14  2007-03-06
2175 8648    2007-02-28  2007-09-19
2236 18321   2001-04-05  2002-07-02
2283 2352    2007-03-07  2007-09-19
     6694    2007-03-07  2007-09-17
     13865   2007-04-19  2007-09-19
     14348   2007-08-10  2007-09-19
     15415   2007-03-07  2007-09-19
2300 2963    2001-05-30  2007-09-26
I need to slice df1 for each value of id_x and count the number of rows within the interval dt_f:dt_l. This has to be done again for the values of id_y. Finally, the results should be merged on df2, giving as output the following DataFrame:
df_result.head() =
                   dt_f        dt_l  n_x  n_y
id_y id_x
670  715     2000-02-14  2003-09-30    8   10
704  2963    2000-02-11  2004-01-13   13   25
886  18350   2000-02-09  2001-09-24   32   75
1451 18159   2005-11-14  2007-03-06   48    6
where n_x (n_y) corresponds to the number of rows contained in the interval dt_f:dt_l for each value of id_x (id_y).
Here is the for-loop I have used:
idx_list = df2.index.tolist()
k = 1
for j in idx_list:
    n_y = df1[df1.id == j[0]][df2['dt_f'].iloc[k]:df2['dt_l'].iloc[k]]['id'].count()
    n_x = df1[df1.id == j[1]][df2['dt_f'].iloc[k]:df2['dt_l'].iloc[k]]['id'].count()
    k += 1
Would it be possible to do it without using a for-loop? DataFrame df1 contains around 30000 rows and I am afraid a loop will slow down the process too much, since this is a small part of the whole script.
You want something like this:
# Merge the tables together, keeping the date index as a column
mg = df1.reset_index().merge(df2, left_on='id', right_on='id_x')
# Select only the rows whose date lies between the start and end
mg = mg[(mg['dates'] > mg['dt_f']) & (mg['dates'] < mg['dt_l'])]
# Finally count by id_x
mg.groupby('id_x').count()
You'll need to tidy up the columns afterwards and repeat for id_y.
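A sketch of the full round trip under the same assumptions (df1's date index is named dates, df2's index has been reset so id_x and id_y are ordinary columns, and interval_counts is a made-up helper name):
df2r = df2.reset_index()
def interval_counts(key):
    # attach each df1 row to the intervals for this id column, keep the
    # rows whose date falls inside [dt_f, dt_l], and count rows per id
    mg = df1.reset_index().merge(df2r, left_on='id', right_on=key)
    mg = mg[(mg['dates'] >= mg['dt_f']) & (mg['dates'] <= mg['dt_l'])]
    return mg.groupby(key).size()
df2r['n_x'] = df2r['id_x'].map(interval_counts('id_x'))
df2r['n_y'] = df2r['id_y'].map(interval_counts('id_y'))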

Applying matrix product in specific pandas columns

I have a pandas DataFrame structured in the following way
    0     1    2     3         4         5         6      7      8     9
0  42  2012  106  1200  0.112986 -0.647709 -0.303534  31.73  14.80  1096
1  42  2012  106  1200  0.185159 -0.588728 -0.249392  31.74  14.80  1097
2  42  2012  106  1200  0.199910 -0.547780 -0.226356  31.74  14.80  1096
3  42  2012  106  1200  0.065741 -0.796107 -0.099782  31.70  14.81  1097
4  42  2012  106  1200  0.116718 -0.780699 -0.043169  31.66  14.78  1094
5  42  2012  106  1200  0.280035 -0.788511 -0.171763  31.66  14.79  1094
6  42  2012  106  1200  0.311319 -0.663151 -0.271162  31.78  14.79  1094
Columns 4, 5 and 6 are actually the components of a vector. I want to apply a matrix multiplication to these columns, that is, to replace columns 4, 5 and 6 with the vector resulting from the multiplication of the original vector with a matrix.
What I did was
DC = [[ .. definition of multiplication matrix .. ]]
def rotate(vector):
    return dot(DC, vector)
data[[4,5,6]] = data[[4,5,6]].apply(rotate, axis='columns')
Which I thought should work, but the returned DataFrame is exactly the same as the original.
What am I missing here?
Your code is correct but very slow. You can use the values property to get the ndarray and use dot() to transform all the vectors at once:
import numpy as np
import pandas as pd
DC = np.random.randn(3, 3)
df = pd.DataFrame(np.random.randn(1000, 10))
df2 = df.copy()
# vectorized: rotate all row vectors at once
df[[4,5,6]] = np.dot(DC, df[[4,5,6]].values.T).T
# row-by-row apply, for comparison
def rotate(vector):
    return np.dot(DC, vector)
df2[[4,5,6]] = df2[[4,5,6]].apply(rotate, axis='columns')
df.equals(df2)
On my PC, it's about 90x faster.
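Equivalently, with the @ matrix-multiplication operator (Python 3.5+), since np.dot(DC, X.T).T is the same as X @ DC.T:
df[[4,5,6]] = df[[4,5,6]].to_numpy() @ DC.T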
