I'm just getting into Pandas, and want to figure out a good way of holding time-varying data corresponding to multiple trials.
A concrete example might be:
Trial 1: Salinity = 0.1 (unchanging), pH (at time 1, 2, ...)
Trial 2: Salinity = 0.1 (unchanging), pH (at time 1, 2, ...)
Trial 3: Salinity = 0.2 (unchanging), pH (at time 1, 2, ...)
Trial 4: Salinity = 0.2 (unchanging), pH (at time 1, 2, ...)
You'll notice that experiments can be repeated multiple times with the same initial parameters (the salinity), but with different time-varying variables (pH).
A DataFrame is 2-dimensional, so I would have to create a DataFrame for each trial. Is this the best way to go about it, and how would I be able to combine them (ex: get the average pH over time for trials with the same initial setup)?
You can aggregate the data across Trials in a single pd.DataFrame. Below is an example.
import pandas as pd

df = pd.DataFrame({'Trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
                   'Date': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                   'Salinity': [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
                                0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
                   'pH': [2, 4, 1, 4, 6, 8, 3, 2, 9, 3, 1, 4, 6, 11, 4, 6]})
df = df.set_index(['Trial', 'Date', 'Salinity'])
# pH
# Trial Date Salinity
# 1 1 0.1 2
# 2 0.1 4
# 3 0.1 1
# 4 0.1 4
# 2 1 0.1 6
# 2 0.1 8
# 3 0.1 3
# 4 0.1 2
# 3 1 0.2 9
# 2 0.2 3
# 3 0.2 1
# 4 0.2 4
# 4 1 0.2 6
# 2 0.2 11
# 3 0.2 4
# 4 0.2 6
Explanation
In your dataframe construction, include an identifier column, in this case Trial, holding an integer identifier for each trial.
Setting the index to ['Trial', 'Date', 'Salinity'] provides a natural index for pandas to use for grouping, indexing and slicing.
For example, df.loc[(1, 2, 0.1)] will return a pd.Series derived from the dataframe indicating pH = 4.
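And to answer the combining part of the question, the average pH over time for trials with the same initial setup is a groupby on the index levels:

df.groupby(level=['Salinity', 'Date'])['pH'].mean()
# Salinity  Date
# 0.1       1       4.0
#           2       6.0
#           3       2.0
#           4       3.0
# 0.2       1       7.5
#           2       7.0
#           3       2.5
#           4       5.0
# Name: pH, dtype: float64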
Related
An example dataframe is constructed below. How could I drop rows with non-unique values in column 'signal'?
import pandas as pd

cols = ['signal', 'metabolite', 'adduct', 's_ind', 'm_ind', 'a_ind', 'distance']
data = [[0.500001, 1.000002, -0.5, 1, 1, 2, 0.000001],
        [0.500001, 0.000002, 0.5, 1, 2, 1, 0.000001],
        [0.500002, 1.000002, -0.5, 2, 1, 2, 0.000000],
        [0.500002, 0.000002, 0.5, 2, 2, 1, 0.000000],
        [0.500003, 1.000002, -0.5, 3, 1, 2, 0.000001],
        [0.500003, 0.000002, 0.5, 3, 2, 1, 0.000001],
        [1.000000, 1.000002, -0.5, 4, 1, 2, 0.499998],
        [1.000000, 0.000002, 0.5, 4, 2, 1, 0.499998],
        [0.000001, 1.000002, -0.5, 5, 1, 2, 0.500001],
        [0.000001, 0.000002, 0.5, 5, 2, 1, 0.500001]]
df = pd.DataFrame(data=data, columns=cols)
display(df)
Just call drop_duplicates and pass the column list to the subset parameter; it keeps only the first occurrence of each duplicated value (you can pass one or more columns on which to identify the non-unique values).
df.drop_duplicates(subset=['signal'])
signal metabolite adduct s_ind m_ind a_ind distance
0 0.500001 1.000002 -0.5 1 1 2 0.000001
2 0.500002 1.000002 -0.5 2 1 2 0.000000
4 0.500003 1.000002 -0.5 3 1 2 0.000001
6 1.000000 1.000002 -0.5 4 1 2 0.499998
8 0.000001 1.000002 -0.5 5 1 2 0.500001
You can also pass keep=False if you don't want to keep any of the non-unique values at all.
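For instance, with the dataframe above every value in signal happens to occur exactly twice, so keep=False drops all of them:

df.drop_duplicates(subset=['signal'], keep=False)
# Empty DataFrame
# Columns: [signal, metabolite, adduct, s_ind, m_ind, a_ind, distance]
# Index: []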
You're looking for DataFrame.drop_duplicates():
df = df.drop_duplicates("signal")
In the code below, I get the expected result for x1:
import numpy as np
x1 = np.arange(0.5, 10.4, 0.8)
print(x1)
[ 0.5 1.3 2.1 2.9 3.7 4.5 5.3 6.1 6.9 7.7 8.5 9.3 10.1]
But in the code below, when I set dtype=int, why is the result of x2 not [ 0 1 2 2 3 4 5 6 6 7 8 9 10]? Instead I am getting x2 as [ 0 1 2 3 4 5 6 7 8 9 10 11 12], where the last value 12 overshoots the end value of 10.4. Please clarify this concept for me.
import numpy as np
x2 = np.arange(0.5, 10.4, 0.8, dtype=int)
print(x2)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12]
According to the docs: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html
stop : number
End of interval. The interval does not include this value, except in some cases where step is not an integer and floating point round-off affects the length of out.
arange : ndarray
Array of evenly spaced values.
For floating point arguments, the length of the result is ceil((stop - start)/step). Because of floating point overflow, this rule may result in the last element of out being greater than stop.
So here the length of the result will be:
In [33]: np.ceil((10.4-0.5)/0.8)
Out[33]: 13.0
Hence we see the overshoot to 12 in the case of np.arange(0.5, 10.4, 0.8, dtype=int): the length is computed as 13 from the float arguments, and with the integer dtype it behaves as if stop=13 with the default start of 0, hence the output we observe is:
In [35]: np.arange(0.5, 10.4, 0.8, dtype=int)
Out[35]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
Hence the better way of generating integer ranges is to use integer parameters, like so:
In [25]: np.arange(0, 11, 1)
Out[25]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
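If you actually wanted the truncated version of the float sequence (the [ 0 1 2 2 3 4 5 6 6 7 8 9 10] you expected), one way is to generate the float range first and cast afterwards:

np.arange(0.5, 10.4, 0.8).astype(int)
# array([ 0,  1,  2,  2,  3,  4,  5,  6,  6,  7,  8,  9, 10])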
Let's say that I have the following data-frame:
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3], "date": [pd.Timestamp(2002, 2, 2), pd.Timestamp(2003, 3, 3), pd.Timestamp(2004, 4, 4), pd.Timestamp(2005, 5, 5), pd.Timestamp(2006, 6, 6), pd.Timestamp(2007, 7, 7), pd.Timestamp(2008, 8, 8), pd.Timestamp(2009, 9, 9), pd.Timestamp(2010, 10, 10), pd.Timestamp(2011, 11, 11)], "numeric": [0.9, 0.4, 0.2, 0.6, np.nan, 0.8, 0.7, np.nan, np.nan, 0.5], "nominal": [0, 1, 0, 1, 0, 0, 0, 1, 1, 1]})
What I want to achieve is to strip rows at the end of each group (assuming that the rows are grouped by id), such that rows are removed until a non-nan value appears in the numeric column. Additionally, the last row of each group always has a non-nan value in the numeric column, and that last row should always be removed. So, the resulting data-frame is:
result_df = pd.DataFrame({"id": [1, 1, 2, 3],
                          "date": [pd.Timestamp(2002, 2, 2), pd.Timestamp(2003, 3, 3),
                                   pd.Timestamp(2005, 5, 5), pd.Timestamp(2008, 8, 8)],
                          "numeric": [0.9, 0.4, 0.6, 0.7],
                          "nominal": [0, 1, 1, 0]})
More explanation on how we get to the resulting data-frame:
For id == 1 only the last row is removed since in the row before the last one there is a value for the numeric column.
For id == 2 the last two rows are removed because the last row is removed by default and the row before the last one has a nan value.
For id == 3 the last three rows are removed because the last row is removed by default and the first non-nan value is on the fourth row counting from below.
Moreover, what I am currently doing is:
df.groupby("id", as_index=False).apply(lambda x: x.iloc[:-1]).reset_index(drop=True)
However, this only removes the last row for each group and I want to remove the last N rows based on the condition explained above.
Please let me know if you need any further information and looking forward to your answers!
For the specific example you have posted, just dropping the NaNs before grouping does the trick:
df = df.dropna().groupby('id').apply(lambda x: x.iloc[:-1]).reset_index(drop=True)
df
Out[58]:
id date numeric nominal
0 1 2002-02-02 0.9 0
1 1 2003-03-03 0.4 1
2 2 2005-05-05 0.6 1
3 3 2008-08-08 0.7 0
If you have non-contiguous NaNs and you want to remove only the last block of NaNs:
def strip_rows(X):
    # drop the last row unconditionally, then keep dropping rows while the
    # last remaining value in the 'numeric' column (position 2) is NaN
    X = X.iloc[:-1, :]
    while pd.isna(X.iloc[-1, 2]):
        X = X.iloc[:-1, :]
    return X
df_1 = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3],
                     "date": [pd.Timestamp(2002, 2, 2),
                              pd.Timestamp(2003, 3, 3),
                              pd.Timestamp(2004, 4, 4),
                              pd.Timestamp(2005, 5, 5),
                              pd.Timestamp(2006, 6, 6),
                              pd.Timestamp(2007, 7, 7),
                              pd.Timestamp(2008, 8, 8),
                              pd.Timestamp(2009, 9, 9),
                              pd.Timestamp(2010, 10, 10),
                              pd.Timestamp(2011, 11, 11),
                              pd.Timestamp(2011, 12, 12),
                              pd.Timestamp(2012, 1, 1)],
                     "numeric": [0.9, 0.4, 0.2, 0.6, np.nan, 0.8, 0.7, np.nan, np.nan, 0.5, np.nan, 0.3],
                     "nominal": [0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1]})
df_2 = df_1.groupby('id').apply(strip_rows).reset_index(drop=True)
df_1
Out[151]:
id date numeric nominal
0 1 2002-02-02 0.9 0
1 1 2003-03-03 0.4 1
2 1 2004-04-04 0.2 0
3 2 2005-05-05 0.6 1
4 2 2006-06-06 NaN 0
5 2 2007-07-07 0.8 0
6 3 2008-08-08 0.7 0
7 3 2009-09-09 NaN 1
8 3 2010-10-10 NaN 1
9 3 2011-11-11 0.5 1
10 3 2011-12-12 NaN 0
11 3 2012-01-01 0.3 1
df_2
Out[152]:
id date numeric nominal
0 1 2002-02-02 0.9 0
1 1 2003-03-03 0.4 1
2 2 2005-05-05 0.6 1
3 3 2008-08-08 0.7 0
4 3 2009-09-09 NaN 1
5 3 2010-10-10 NaN 1
6 3 2011-11-11 0.5 1
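If you prefer to avoid the explicit while loop, the same trimming can also be sketched with last_valid_index (strip_rows_alt is just an illustrative name, and it assumes each group still contains at least one non-NaN numeric value after dropping its last row):

def strip_rows_alt(g):
    # illustrative alternative: drop the last row, then keep everything
    # up to the last non-NaN value in the 'numeric' column
    g = g.iloc[:-1]
    return g.loc[:g['numeric'].last_valid_index()]

df_1.groupby('id').apply(strip_rows_alt).reset_index(drop=True)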
I have a dataset of stores with 2D locations at daily timestamps. I am trying to match up each row with weather measurements made at stations at some other locations, also with daily timestamps, such that the Cartesian distance between each store and matched station is minimized. The weather measurements have not been performed daily, and the station positions may vary, so this is a matter of finding the closest station for each specific store at each specific day.
I realize that I can construct nested loops to perform the matching, but I am wondering if anyone here can think of some neat way of using pandas dataframe operations to accomplish this. A toy example dataset is shown below. For simplicity, it has static weather station positions.
import pandas as pd

store_df = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
    'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
    'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

weather_station_df = pd.DataFrame({
    'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'weather': [20, 21, 19, 17, 16, 18, 19, 17],
    'x': [0, 0, 0, 5, 5, 3, 3, 3],
    'y': [2, 2, 2, 1, 1, 3, 3, 3],
    'date': [1, 2, 3, 1, 3, 1, 2, 3]})
The data below is the desired outcome. I have included station_id only for clarification.
store_id date station_id weather
0 1 1 1 20
1 1 2 1 21
2 1 3 1 19
3 2 1 2 17
4 2 2 3 19
5 2 3 2 16
6 3 1 3 18
7 3 2 3 19
8 3 3 3 17
The idea of the solution is to build the table of all combinations,
df = store_df.merge(weather_station_df, on='date', suffixes=('_store', '_station'))
calculate the distance
df['dist'] = (df.x_store - df.x_station)**2 + (df.y_store - df.y_station)**2
and choose the minimum per group:
df.groupby(['store_id', 'date']).apply(lambda x: x.loc[x.dist.idxmin(), ['station_id', 'weather']]).reset_index()
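Equivalently, you can pick the per-group minimum without apply, using the same merged df and dist column as above:

idx = df.groupby(['store_id', 'date'])['dist'].idxmin()
df.loc[idx, ['store_id', 'date', 'station_id', 'weather']].reset_index(drop=True)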
If you have a lot of data, then you can do the join per group.
import numpy as np

def distance(x1, x2, y1, y2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)
#Join On Date to get all combinations of store and stations per day
df_all = store_df.merge(weather_station_df, on=['date'])
#Apply distance formula to each combination
df_all['distances'] = distance(df_all['x_y'], df_all['x_x'], df_all['y_y'], df_all['y_x'])
#Get Minimum distance for each day Per store_id
df_mins = df_all.groupby(['date', 'store_id'])['distances'].min().reset_index()
#Use resulting minimums to get the station_id matching the min distances
closest_stations_df = df_mins.merge(df_all, on=['date', 'store_id', 'distances'], how='left')
#filter out the unnecessary columns
result_df = closest_stations_df[['store_id', 'date', 'station_id', 'weather', 'distances']].sort_values(['store_id', 'date'])
Edited to use the vectorized distance formula.
While answering another question, I ran into a problem that I believe I solved in a roundabout way; it could have been done better, but I was clueless.
There are two lists:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
optimal_partition is one of the integer partitions of the number 8 into 4 parts.
I would like to sort optimal_partition in a way that matches the percentage distribution as closely as possible, which means each individual part should match the magnitude of the corresponding percentage as closely as possible.
So 3 -> 0.4, 2 -> 0.27 and 0.23 and 1 -> 0.1
So the final result should be
[2, 2, 3, 1]
The way I ended up solving this was
>>> from operator import itemgetter
>>> percent = [0.23, 0.27, 0.4, 0.1]
>>> optimal_partition = [3, 2, 2, 1]
>>> optimal_partition_percent = zip(sorted(optimal_partition),
                                    sorted(enumerate(percent),
                                           key=itemgetter(1)))
>>> optimal_partition = [e for e, _ in sorted(optimal_partition_percent,
                                              key=lambda e: e[1][0])]
>>> optimal_partition
[2, 2, 3, 1]
Can you suggest an easier way to solve this?
By easier I mean, without the need to implement multiple sorting, and storing and later rearranging based on index.
A couple more examples:
percent = [0.25, 0.25, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
result = [2, 2, 3, 1]
percent = [0.2, 0.2, 0.4, 0.2]
optimal_partition = [3, 2, 2, 1]
result = [1, 2, 3, 2]
from numpy import take, argsort

take(opt, argsort(argsort(perc)[::-1]))

Here argsort(perc)[::-1] lists the positions of perc from largest to smallest, and the outer argsort inverts that permutation, so each position receives the element of opt (given in descending order, as in the examples) whose rank matches that position's percentage.

or without imports:

zip(*sorted(zip(sorted(range(len(perc)), key=perc.__getitem__)[::-1], opt)))[1]
# Test
l = [([0.23, 0.27, 0.4, 0.1], [3, 2, 2, 1]),
     ([0.25, 0.25, 0.4, 0.1], [3, 2, 2, 1]),
     ([0.2, 0.2, 0.4, 0.2], [3, 2, 2, 1])]

def f1(perc, opt):
    return take(opt, argsort(argsort(perc)[::-1]))

def f2(perc, opt):
    return zip(*sorted(zip(sorted(range(len(perc)),
                                  key=perc.__getitem__)[::-1], opt)))[1]

for i in l:
    perc, opt = i
    print f1(perc, opt), f2(perc, opt)
# output:
# [2 2 3 1] (2, 2, 3, 1)
# [2 2 3 1] (2, 2, 3, 1)
# [1 2 3 2] (1, 2, 3, 2)
Use the fact that the percentages sum to 1:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
total = sum(optimal_partition)
output = [total*i for i in percent]
Now you need to figure out a way to redistribute the fractional components somehow. Thinking out loud:
from operator import itemgetter

# index, integer part, fractional part (lists, so they can be updated in place)
intermediate = [[idx, int(val), val - int(val)] for idx, val in enumerate(output)]
# Sort the list by the fractional component
s = sorted(intermediate, key=itemgetter(2))
# Now, distribute the first item's fractional component to the rest, starting at the top:
for i, item in enumerate(s):
    fraction = item[2]
    # Go through the remaining items in reverse order
    for index in range(len(s) - 1, i, -1):
        this_fraction = s[index][2]
        if fraction + this_fraction >= 1:
            # increment this item by 1, clear the fraction, carry the remainder
            new_fraction = fraction + this_fraction - 1
            s[index][1] = s[index][1] + 1
            s[index][2] = 0
            fraction = new_fraction
        else:
            # just add the fraction to this element, clear the original element
            s[index][2] = s[index][2] + fraction
Now, I'm not sure I'd say that's "easier". I haven't tested it, and I'm sure the logic in that last section is still shaky. But it's a different approach.
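For what it's worth, a more compact way to do that redistribution is the largest-remainder idea: truncate everything, then hand the leftover units to the entries with the biggest fractional parts. A sketch (largest_remainder is just an illustrative helper, not code from the question):

def largest_remainder(percent, total):
    # illustrative helper: scale the percentages up to the desired integer total
    raw = [p * total for p in percent]
    parts = [int(r) for r in raw]
    remainders = [r - q for r, q in zip(raw, parts)]
    # give each leftover unit to the entry with the largest remainder
    leftover = total - sum(parts)
    for i in sorted(range(len(parts)), key=lambda j: remainders[j], reverse=True)[:leftover]:
        parts[i] += 1
    return parts

largest_remainder([0.23, 0.27, 0.4, 0.1], 8)   # [2, 2, 3, 1]

Ties between equal remainders are broken by position, so the third example ([0.2, 0.2, 0.4, 0.2]) comes out as [2, 2, 3, 1] rather than [1, 2, 3, 2]; the two are equally close to the requested distribution.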