Pandas - Find first occurance of number closest to an input value - python

I have a dataframe like below.
time speed
0 1 0.20
1 2 0.40
2 3 2.00
3 4 3.00
4 5 0.40
5 6 0.43
6 7 6.00
I would like to find the first occurance of a number ( in 'Speed' Column) that is closest to an input value I enter.
For example :
input value = 0.43
Expected Output :
Speed : 0.40 & corresponding Time : 2
The speed column should not be sorted for this problem.
I tried the below,but not getting the expected output.
Any help on this would be appreciated.

absolute closest
You can compute the absolute difference to your reference and get the idxmin:
speed_input = 0.43
df.loc[abs(df['speed']-speed_input).idxmin()]
output:
time 6.00
speed 0.43
Name: 5, dtype: float64
first closest with threshold:
i = 0.43
thresh = 0.03
df.loc[abs(df['speed']-i).le(thresh).idxmax()]
output:
time 2.0
speed 0.4
Name: 1, dtype: float64

One idea is round both values:
df[[(df['speed'].round(1)-round(speed_input, 1)).abs().idxmin()]]

Related

Take average of range entities and replace it in pandas column

I have dataframe where one column looks like
Average Weight (Kg)
0.647
0.88
0
0.73
1.7 - 2.1
1.2 - 1.5
2.5
NaN
1.5 - 1.9
1.3 - 1.5
0.4
1.7 - 2.9
Reproducible data
df = pd.DataFrame([0.647,0.88,0,0.73,'1.7 - 2.1','1.2 - 1.5',2.5 ,np.NaN,'1.5 - 1.9','1.3 - 1.5',0.4,'1.7 - 2.9'],columns=['Average Weight (Kg)'])
where I would like to take average of range entries and replace it in the dataframe e.g. 1.7 - 2.1 will be replaced by 1.9 , following code doesn't work TypeError: 'float' object is not iterable
np.where(df['Average Weight (Kg)'].str.contains('-'), df['Average Weight (Kg)'].str.split('-')
.apply(lambda x: statistics.mean((list(map(float, x)) ))), df['Average Weight (Kg)'])
Another possible solution, which is based on the following ideas:
Convert column to string.
Split each cell by \s-\s.
Explode column.
Convert back to float.
Group by and mean.
df['Average Weight (Kg)'] = df['Average Weight (Kg)'].astype(
str).str.split(r'\s-\s').explode().astype(float).groupby(level=0).mean()
Output:
Average Weight (Kg)
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
edit: slight change to avoid creating a new column
You could go for something like this (renamed your column name to avg, cause it was long to type :-) ):
new_average =(df.avg.str.split('-').str[1].astype(float) + df.avg.str.split('-').str[0].astype(float) ) / 2
df["avg"] = new_average.fillna(df.avg)
yields for avg:
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
Name: avg2, dtype: float64

Percent of total clusters per cluster per season using pandas

I have a pandas DataFrame that looks like this with 12 clusters in total. Certain clusters don't appear in a certain season.
I want to create a multi-line graph over the seasons of the percent of a specific cluster over each season. So if there are 30 teams in the 97-98 season and there are 10 teams in Cluster 1, then that value would be .33 since cluster 1 has one third of the total possible spots.
It'll look like this
And I want the dateset to look like this, where each cluster has its own percentage of the whole number of clusters in that season by percentage. I've tried using pandas groupby method to get a bunch of lists and then use value_counts() on that but that doesn't work since looping through df.groupby(['SEASON']) returns tuples, not a Series..
Thanks so much
Use .groupby combined with .value_counts and .unstack:
temp_df = df.groupby(['SEASON'])['Cluster'].value_counts(normalize=True).unstack().fillna(0.0)
temp_df.plot()
print(temp_df.round(2))
Cluster 0 1 2 4 5 6 7 10 11
SEASON
1996-97 0.1 0.21 0.17 0.21 0.07 0.1 0.03 0.07 0.03
1997-98 0.2 0.00 0.20 0.20 0.00 0.0 0.20 0.20 0.00

How To Iterate Over A Timespan and Calculate some Values in a Dataframe using Python?

I have a dataset like below
data = {'ReportingDate':['2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31',
'2013/6/28','2013/6/28',
'2013/6/28','2013/6/28','2013/6/28'],
'MarketCap':[' ',0.35,0.7,0.875,0.7,0.35,' ',1,1.5,0.75,1.25],
'AUM':[3.5,3.5,3.5,3.5,3.5,3.5,5,5,5,5,5],
'weight':[' ',0.1,0.2,0.25,0.2,0.1,' ',0.2,0.3,0.15,0.25]}
# Create DataFrame
df = pd.DataFrame(data)
df.set_index('Reporting Date',inplace=True)
df
Just a sample of a 8000 rows dataset.
ReportingDate starts from 2013/5/31 to 2015/10/30.
It includes data of all the months during the above period. But Only the last day of each month.
The first line of each month has two missing data. I know that
the sum of weight for each month is equal to 1
weight*AUM is equal to MarketCap
I can use the below line to get the answer I want, for only one month
a= (1-df["2013-5"].iloc[1:]['weight'].sum())
b= a* AUM
df.iloc[1,0]=b
df.iloc[1,2]=a
How can I use a loop to get the data for the whole period? Thanks
One way using pandas.DataFrame.groupby:
# If whitespaces are indeed whitespaces, not nan
df = df.replace("\s+", np.nan, regex=True)
# If not already datatime series
df.index = pd.to_datetime(df.index)
s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform(sum)
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["MarketCap"].fillna(s * df["AUM"])
Note: This assumes that dates are always only the last day so that it is equivalent to grouping by year-month. If not so, try:
s = df["weight"].fillna(1) - df.groupby(df.index.strftime("%Y%m"))["weight"].transform(sum)
Output:
MarketCap AUM weight
ReportingDate
2013-05-31 0.350 3.5 0.10
2013-05-31 0.525 3.5 0.15
2013-05-31 0.700 3.5 0.20
2013-05-31 0.875 3.5 0.25
2013-05-31 0.700 3.5 0.20
2013-05-31 0.350 3.5 0.10
2013-06-28 0.500 5.0 0.10
2013-06-28 1.000 5.0 0.20
2013-06-28 1.500 5.0 0.30
2013-06-28 0.750 5.0 0.15
2013-06-28 1.250 5.0 0.25

Group rows in fixed duration windows that satisfy multiple conditions

I have a df as below. Consider df is indexed by timestamps as dtype='datetime64[ns]' i.e. 1970-01-01 00:00:27.603046999. I am putting dummy timestamps here.
Timestamp Address Type Arrival_Time Time_Delta
0.1 2 A 0.25 0.15
0.4 3 B 0.43 0.03
0.9 1 B 1.20 0.20
1.3 1 A 1.39 0.09
1.5 3 A 1.64 0.14
1.7 3 B 1.87 0.17
2.0 3 A 2.09 0.09
2.1 1 B 2.44 0.34
I have three unique "addresses" (1, 2,3).
I have two unique "types" (A, B)
Now what I am trying to do two things in simple way (possibly using pd.Grouper and pd.Groupby functions in Panda).
I want to group rows by fixed bin of 1 duration (using timestamp values). Then in each 1sec bin, for each "address" find the mean and sum of "Time_delta" only if "Type" = A.
I want to group rows by fixed bin of 1 duration (using timestamp values). Then in each bin, for each "address", find the mean and sum of Inter-Arrival Time*.
IAT = Arrival Time (i) - Arrival Time (i-1)
Note: If the timestamps duration/length is of 100 seconds, we should have exactly 100 rows in the output dataframe and six columns i.e. two (mean, sum) for each address.
For Problem 1:
I tried the following code:
df = pd.DataFrame({'Timestamp': Timestamp, 'Address': Address,
'Type': Type, 'Arrival_Time': Arrival_time, 'Time_Delta': Time_delta})
# Set index to Datetime
index = pd.DatetimeIndex(df[df.columns[3]]*10**9) # Convert timestamp into format
df = df.set_index(index) # Set timestamp as index
df_1 = df[df.columns[2]].groupby([pd.TimeGrouper('1S'), df['Address']]).mean().unstack(fill_value=0)
which gives results:
Timestamp 1 2 3
1970-01-01 00:00:00 0.20 0.15 0.030
1970-01-01 00:00:01 0.09 0.00 0.155
1970-01-01 00:00:02 0.34 0.00 0.090
As you can see, it gives the mean Time_delta for each address in the 1S bin, But I want to add the second condition i.e. find mean for each address only if Type=A. I hope problem 1 is now clear.
For Problem 2:
Its a bit complicated. I want to do get Mean IAT for each address in the same format (See below):
One possible way is to add an extra column to original df as df['IAT'], where
for in range (1, len(df))
i = 0
df['IAT'] = df['Arrival_Time'][i] - df['Arrival_Time'][i-1] i =
i=i+1
Then apply the same above code to find mean of IAT for each address if Type=A.
Actual Data
Timestamp Address Type Time Delta Arrival Time
1970-01-01 00:00:00.000000000 28:5a:ec:16:00:22 Control frame 0.000000 Nov 10, 2017 22:39:20.538561000
1970-01-01 00:00:00.000287000 28:5a:ec:16:00:23 Data frame 0.000287 Nov 10, 2017 22:39:20.548121000
1970-01-01 00:00:00.000896000 28:5a:ec:16:00:22 Control frame 0.000609 Nov 10, 2017 22:39:20.611256000
1970-01-01 00:00:00.001388000 28:5a:ec:16:00:21 Data frame 0.000492 Nov 10, 2017 22:39:20.321745000
... ...

How do I apply a lambda function on pandas slices, and return the same format as the input data frame?

I want to apply a function to row slices of dataframe in pandas for each row and returning a dataframe with for each row the value and number of slices that was calculated.
So, for example
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(2, 10)),2))
f = lambda x: (x - x.mean())
What I want is to apply lambda function f from column 0 to 5 and from column 5 to 10.
I did this:
a = pandas.DataFrame(f(df.T.iloc[0:5,:])
but this is only for the first slice.. how can include the second slice in the code, so that my resulting output frame looks exactly as the input frame -- just that every data point is changed to its value minus the mean of the corresponding slice.
I hope it makes sense.. What would be the right way to go with this?
thank you.
You can simply reassign the result to original df, like this:
import pandas as pd
import numpy as np
# I'd rather use a function than lambda here, preference I guess
def f(x):
return x - x.mean()
df = pd.DataFrame(np.round(np.random.normal(size=(2,10)), 2))
df.T
0 1
0 0.92 -0.35
1 0.32 -1.37
2 0.86 -0.64
3 -0.65 -2.22
4 -1.03 0.63
5 0.68 -1.60
6 -0.80 -1.10
7 -0.69 0.05
8 -0.46 -0.74
9 0.02 1.54
# makde a copy of df here
df1 = df
# just reassign the slices back to the copy
# edited, omit DataFrame part.
df1.T[:5], df1.T[5:] = f(df.T.iloc[0:5,:]), f(df.T.iloc[5:,:])
df1.T
0 1
0 0.836 0.44
1 0.236 -0.58
2 0.776 0.15
3 -0.734 -1.43
4 -1.114 1.42
5 0.930 -1.23
6 -0.550 -0.73
7 -0.440 0.42
8 -0.210 -0.37
9 0.270 1.91

Categories