I have a sample dataset. Here it is:
import pandas as pd
import numpy as np
df = {'Point1': [50,50,50,45,45,35,35], 'Point2': [48,44,30,35,33,34,32], 'Dist': [4,6,2,7,8,3,6]}
df = pd.DataFrame(df)
df
And its output is here:
My goal is to find, for each group of Point1, the Dist value that satisfies a condition, together with its Point2 value.
Here is my code. (It gives an error)
if df['dist'] < 5 :
    df1 = df[df['dist'].isin(df.groupby('Point1').max()['Dist'].values)]
else :
    df1 = df[df['dist'].isin(df.groupby('Point1').min()['Dist'].values)]
df1
And here is the expected output:
So, if a group has any Dist value less than 5, I would like to take the largest of those values. If not, I would like to take the smallest Dist in the group. I hope that is clear.
IIUC, you want to find the Dist closest to 5, with priority for values lower than 5.
For this, you can compute two helper columns, sort the rows by priority, and take the first row of each group. Here 'cond' sorts values ≤5 before values >5, and 'cond2' sorts by absolute distance to 5.
thresh = 5
(df
 .assign(cond=df['Dist'].gt(thresh),
         cond2=df['Dist'].sub(thresh).abs(),
         )
 .sort_values(by=['cond', 'cond2'])
 .groupby('Point1', as_index=False).first()
 .drop(columns=['cond', 'cond2'])
)
output:
Point1 Point2 Dist
0 35 34 3
1 45 35 7
2 50 48 4
NB: this also sorts the result by Point1 in the process. If this is unwanted, one can write a function that sorts a DataFrame this way and apply it per group. Let me know if this is the case.
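If the literal rule from the question is wanted (largest Dist below the threshold, otherwise the smallest Dist) without reordering the groups, one possible sketch is below. The helper name pick_index is hypothetical, and it uses a strict < 5 comparison as stated in the question, whereas the code above treats a Dist of exactly 5 as "low".

```python
import pandas as pd

df = pd.DataFrame({'Point1': [50, 50, 50, 45, 45, 35, 35],
                   'Point2': [48, 44, 30, 35, 33, 34, 32],
                   'Dist': [4, 6, 2, 7, 8, 3, 6]})
thresh = 5

def pick_index(dist):
    # hypothetical helper: index label of the row to keep for one group
    below = dist[dist < thresh]
    # largest value under the threshold if any exist, else the smallest value
    return below.idxmax() if not below.empty else dist.idxmin()

# sort=False keeps the groups in their order of first appearance
idx = df.groupby('Point1', sort=False)['Dist'].apply(pick_index)
result = df.loc[idx.tolist()]
print(result)
```

On the sample data this keeps the rows with Dist 4, 7 and 3, the same rows as the sorted approach, but in the original Point1 order.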
Since you are using a pandas DataFrame, you can use the bracket syntax to filter the data.
In your case:
df[df['Dist'] < 5]
About the second part of the question, it was a little confusing. Can you explain more about "take the max one of these groups. If no, I would like to take the min one"?
Hi, I was trying to figure out how to divide values in a DataFrame. Here is an example using a pandas Series:
a = pd.Series([1, 2, 16,64,128,360,720])
a
-----------------
0 1
1 2
2 16
3 64
4 128
5 360
6 720
So is there any way I could divide a number in a given row by the value from the previous row?
0 2
1 8
2 4
3 2
4 2.8
5 2
Furthermore, I also tried to get the output like "if the value is double, print the index".
Thank you for your help!
It seems to me that you are trying to divide the number in each row by the number in the previous row. This can be achieved with the following code:
import pandas as pd
import numpy as np
a = pd.Series([1, 2, 16,64,128,360,720])
# element-wise ratio of each value to the one before it
division = pd.Series(np.divide(a.values[1:], a.values[:-1]))
# positions in `division` where the value exactly doubled
index = division.index[division == 2]
Note: your question is not well posed. You did not specify what you wanted to achieve, so I had to work it out from the example. You also included an incorrect code snippet. Please take care to write a clearer question next time.
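As an alternative that stays within pandas, a minimal sketch using Series.shift is shown below. It assumes the goal is the ratio of each row to the previous one, plus the index labels where the value exactly doubled. The names ratio and doubled are made up for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 16, 64, 128, 360, 720])

# shift(1) moves every value down one row, so a / a.shift(1)
# divides each row by the previous row (the first row becomes NaN)
ratio = (a / a.shift(1)).dropna()

# index labels of the rows whose value is exactly double the previous one
doubled = ratio.index[ratio == 2].tolist()

print(ratio)
print(doubled)
```

Note that here the labels refer to the later row of each pair, so label 1 means "row 1 is double row 0".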
Well, I don't know exactly what you want to divide your pandas Series by, but you can do it like this:
# If you want to store a new result
b = a / value_to_divide_by
# or if you want to apply it directly to your Series
a /= value_to_divide_by
or using a list comprehension:
b = [int(nb / your_value_here) for nb in a]
# with a minimum value required to do the division
b = [int(nb / your_value_here) for nb in a if nb > min_value]
There are probably other ways to do what you want, but these are two easy solutions.
I have a Pandas dataframe with two columns I am interested in: A categorical label and a timestamp. Presumably what I'm trying to do would also work with ordered numerical data. The dataframe is already sorted by timestamps in ascending order. I want to find out which label spans the longest time-window and select only the values associated with it in the original dataframe.
I have tried grouping the df by label, calculating the difference and selecting the maximum (longest time-window) successfully, however I'm having trouble finding an expression to select the corresponding values in the original df using this information.
Consider this example with numerical values:
d = {'cat': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C','C','C'],
'val': [1,3,5,6,8,9,0,5,10,20,4,5,6,7,8,9,10]}
df = pd.DataFrame(data = d)
Here I would expect something equivalent to df.loc[df.cat == 'B'] since B has the maximum difference of all the categories.
df.groupby('cat').val.apply(lambda x: x.max() - x.min()).max()
gives me the correct difference, but I have no idea how to use this to select the correct category in the original df.
You can go for idxmax to get the category that gave rise to maximum peak-to-peak value within groups (np.ptp does the maximum minus minimum). Then you can index with loc as you said, or query:
>>> max_cat = df.groupby("cat").val.apply(np.ptp).idxmax()
>>> max_cat
"B"
>>> df.query("cat == @max_cat")  # or df.loc[df.cat == max_cat]
cat val
6 B 0
7 B 5
8 B 10
9 B 20
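Since the question is really about timestamps, here is a small sketch of the same idea with a datetime column. The column names label and ts and the sample values are invented for illustration. It takes the per-group max minus min instead of np.ptp, which is just one way to get the span.

```python
import pandas as pd

df = pd.DataFrame({
    'label': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'ts': pd.to_datetime([
        '2021-01-01 00:00', '2021-01-01 00:10', '2021-01-01 00:20',
        '2021-01-01 01:00', '2021-01-01 03:00',
        '2021-01-01 04:00', '2021-01-01 04:05', '2021-01-01 04:30',
    ]),
})

grouped = df.groupby('label')['ts']
span = grouped.max() - grouped.min()   # one Timedelta per label
longest = span.idxmax()                # label with the longest time window

subset = df[df['label'] == longest]
print(longest)
print(subset)
```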
I can't get the average or mean of a column in pandas. I have a DataFrame. Neither of the things I tried below gives me the average of the column weight.
>>> allDF
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 0.94212
The following returns several values, not one:
allDF[['weight']].mean(axis=1)
So does this:
allDF.groupby('weight').mean()
If you only want the mean of the weight column, select the column (which is a Series) and call .mean():
In [479]: df
Out[479]:
ID birthyear weight
0 619040 1962 0.123123
1 600161 1963 0.981742
2 25602033 1963 1.312312
3 624870 1987 0.942120
In [480]: df.loc[:, 'weight'].mean()
Out[480]: 0.83982437500000007
Try df.mean(axis=0). The axis=0 argument calculates the column-wise mean of the DataFrame, so the result is one value per column. axis=1 is the row-wise mean, which is why you are getting multiple values.
Do give print(df.describe()) a shot as well. It is very helpful for getting an overall description of your DataFrame.
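To make the axis difference concrete, here is a short sketch that rebuilds the DataFrame from the question (the values are copied from the question, and the variable names other than allDF are arbitrary):

```python
import pandas as pd

allDF = pd.DataFrame({
    'ID': [619040, 600161, 25602033, 624870],
    'birthyear': [1962, 1963, 1963, 1987],
    'weight': [0.1231231, 0.981742, 1.3123124, 0.94212],
})

# axis=1 averages across the columns of each row: one value per row
row_means = allDF[['weight']].mean(axis=1)

# axis=0 (the default) averages down each column: one value per column
col_means = allDF.mean(axis=0)

# selecting the single column first gives one scalar
weight_mean = allDF['weight'].mean()

print(len(row_means))   # number of rows, not a single mean
print(weight_mean)
```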
Mean of each column in df:
A B C
0 5 3 8
1 5 3 9
2 8 4 9
df.mean()
A 6.000000
B 3.333333
C 8.666667
dtype: float64
And if you want the average over all values in all columns:
df.stack().mean()
6.0
You can use
df.describe()
to get basic statistics of the DataFrame. To get the mean of a specific column, you can use
df["columnname"].mean()
You can also access a column using the dot notation (also called attribute access) and then calculate its mean:
df.your_column_name.mean()
You can use either of the two statements below:
numpy.mean(df['col_name'])
# or
df['col_name'].mean()
Additionally, you can round the value after finding the mean:
import pandas as pd
# Create a DataFrame
df1 = {
    'Subject': ['semester1', 'semester2', 'semester3', 'semester4', 'semester1',
                'semester2', 'semester3'],
    'Score': [62.73, 47.76, 55.61, 74.67, 31.55, 77.31, 85.47]}
df1 = pd.DataFrame(df1, columns=['Subject', 'Score'])
rounded_mean = round(df1['Score'].mean()) # specified nothing as decimal place
print(rounded_mean) # 62
rounded_mean_decimal_0 = round(df1['Score'].mean(), 0) # specified decimal place as 0
print(rounded_mean_decimal_0) # 62.0
rounded_mean_decimal_1 = round(df1['Score'].mean(), 1) # specified decimal place as 1
print(rounded_mean_decimal_1) # 62.2
Do note that the column needs to have a numeric data type in the first place.
import pandas as pd
df['column'] = pd.to_numeric(df['column'], errors='coerce')
Next, find the mean of one column with mean(), or summary statistics for all numeric columns with describe().
df['column'].mean()
df.describe()
Example of result from describe:
column
count 62.000000
mean 84.678548
std 216.694615
min 13.100000
25% 27.012500
50% 41.220000
75% 70.817500
max 1666.860000
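A small sketch of what errors='coerce' does in practice is shown below. The sample strings are invented. Values that cannot be parsed become NaN, and mean() skips NaN by default.

```python
import pandas as pd

# a column read in as strings, with one entry that is not a number
raw = pd.DataFrame({'column': ['13.1', '27.5', 'n/a', '41.2']})

# unparseable entries are turned into NaN instead of raising an error
raw['column'] = pd.to_numeric(raw['column'], errors='coerce')

n_missing = raw['column'].isna().sum()
mean_value = raw['column'].mean()   # NaN is ignored, so this averages 3 values

print(raw.dtypes)
print(n_missing, mean_value)
```

It is worth checking n_missing after the conversion, since a silently coerced column can give a mean over fewer rows than expected.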
You can simply go for:
df.describe()
That will provide you with all the relevant details you need. To find the min, max or average value of a particular column (say 'weight' in your case), use:
df['weight'].mean(): for the average value
df['weight'].max(): for the maximum value
df['weight'].min(): for the minimum value
You can use the method agg (aggregate):
df.agg('mean')
It's possible to apply multiple statistics:
df.agg(['mean', 'max', 'min'])
You can follow the code below:
import pandas as pd
import numpy as np
classxii = {'Name': ['Karan', 'Ishan', 'Aditya', 'Anant', 'Ronit'],
            'Subject': ['Accounts', 'Economics', 'Accounts', 'Economics', 'Accounts'],
            'Score': [87, 64, 58, 74, 87],
            'Grade': ['A1', 'B2', 'C1', 'B1', 'A2']}
df = pd.DataFrame(classxii, index=['a', 'b', 'c', 'd', 'e'], columns=['Name', 'Subject', 'Score', 'Grade'])
print(df)
# use the lines below for the mean if you already have a DataFrame
print('mean of score is:')
print(df[['Score']].mean())
It would be great to understand how this actually works. Perhaps there is something in Python/Pandas that I don't quite understand.
I have a dataframe (price data) and would like to calculate the returns. Rows are the stocks while columns are the dates.
For simplicity, I have created the prices with some random numbers.
import pandas as pd
import numpy as np
df_price = pd.DataFrame(np.random.rand(10,10))
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1]-1
There are two things I find strange here:
My numerator and denominator are both 10 x 9. Why is the output 10 x 10, with the first column being NaNs?
Why are the results all 0, apart from the first column being NaNs? I.e. why was the calculation not performed?
Thanks.
When we do the division, pandas first aligns the index and columns of df_price.iloc[:,1:] and df_price.iloc[:,:-1]. So we need to add .values to one operand to drop its labels and skip the alignment. The output is then what we expected.
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1].values-1
Example
s=pd.Series([2,4,6])
s.iloc[1:]/s.iloc[:-1]
Out[54]:
0 NaN # here the index s.iloc[:-1] included
1 1.0
2 NaN # here the index s.iloc[1:] included
dtype: float64
From the above we can see that pandas objects align on the index first, and the alignment behaves like an outer join: labels present on only one side produce NaN.
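Another way to sidestep the alignment, sketched here under the assumption that columns are consecutive dates, is to shift the prices one column to the right with DataFrame.shift(axis=1) so that the labels already match. The smaller 4 x 5 frame and the seed are just for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 4 stocks (rows) x 5 dates (columns); +0.5 keeps prices away from zero
df_price = pd.DataFrame(rng.random((4, 5)) + 0.5)

# the .values approach: the denominator loses its labels, so no alignment
ret_values = df_price.iloc[:, 1:] / df_price.iloc[:, :-1].values - 1

# the shift approach: column j of the shifted frame holds the price of column j-1
ret_shift = (df_price / df_price.shift(axis=1) - 1).iloc[:, 1:]

print(ret_values.shape)
print(np.allclose(ret_values.to_numpy(), ret_shift.to_numpy()))
```

Both versions keep the column labels of the later date, so each return is labelled by the day it was realised on.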
I've been looking for a few hours and can't seem to find a topic related to that exact matter.
So basically, I want to apply something other than the mean on a groupby. My groupby returns two columns, 'feature_name' and 'target_name', and I want to replace the value in 'target_name' with something else: the number of occurrences of 1, of 0, the difference between both, etc.
print(df[[feature_name, target_name]])
When I print my dataframe with the column I use, I get the following : screenshot
I already have the following code to compute the mean of 'target_name' for each value of 'feature_name':
df[[feature_name, target_name]].groupby([feature_name],as_index=False).mean()
Which returns : this.
And I want to compute different things than the mean. Here are the values I want to compute in the end : what I want
In my case, the feature 'target_name' will always be equal to either 1 or 0 (with 1 being 'good' and 0 'bad').
I have seen this example in an answer:
df.groupby(['catA', 'catB'])['scores'].apply(lambda x: x[x.str.contains('RET')].count())
But I don't know how to apply this to my case as x would be simply an int.
And after solving this issue, I still need to compute more than just the count!
Thanks for reading ☺
import pandas as pd
import numpy as np
def my_func(x):
    # Create your 3 metrics here
    calc1 = x.min()
    calc2 = x.max()
    calc3 = x.sum()
    # return a pandas series
    return pd.Series(dict(metric1=calc1, metric2=calc2, metric3=calc3))
# Apply the function you created
df.groupby(...)['columns needed to calculate formulas'].apply(my_func).unstack()
Optionally, using .unstack() at the end allows you to see all your 3 metrics as column headers
As an example:
df
Out[]:
Names A B
0 In 0.820747 0.370199
1 Out 0.162521 0.921443
2 In 0.534743 0.240836
3 Out 0.910891 0.096016
4 In 0.825876 0.833074
5 Out 0.546043 0.551751
6 In 0.305500 0.091768
7 Out 0.131028 0.043438
8 In 0.656116 0.562967
9 Out 0.351492 0.688008
10 In 0.410132 0.443524
11 Out 0.216372 0.057402
12 In 0.406622 0.754607
13 Out 0.272031 0.721558
14 In 0.162517 0.408080
15 Out 0.006613 0.616339
16 In 0.313313 0.808897
17 Out 0.545608 0.445589
18 In 0.353636 0.465455
19 Out 0.737072 0.306329
df.groupby('Names')['A'].apply(my_func).unstack()
Out[]:
metric1 metric2 metric3
Names
In 0.162517 0.825876 4.789202
Out 0.006613 0.910891 3.879669
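For the 0/1 target described in the question, a possible sketch with named aggregation is shown below. The column names feature and target, the sample data, and the output column names are all assumptions for illustration. Because the target is only ever 0 or 1, its sum is the count of ones.

```python
import pandas as pd

df = pd.DataFrame({'feature': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'target':  [1, 0, 1, 1, 0, 0]})

# sum of a 0/1 column = number of ones; count = number of rows in the group
summary = df.groupby('feature')['target'].agg(n_good='sum', n_total='count')

# derive the remaining metrics from those two columns
summary['n_bad'] = summary['n_total'] - summary['n_good']
summary['diff'] = summary['n_good'] - summary['n_bad']

print(summary)
```

This avoids a Python-level function per group, which tends to matter once there are many distinct feature values.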