Pandas data frame excluding rows in a particular range - python

I have a DataFrame df as below. I want to exclude rows in a particular column, say Vader_Sentiment, whose values lie in the range -0.1 to 0.1, and keep the remaining rows.
I have tried df = df[(df['Vader_Sentiment'] < -0.1) & (df['Vader_Sentiment'] > 0.1)] but it doesn't seem to work.
Text Vader_Sentiment
A -0.010
B 0.206
C 0.003
D -0.089
E 0.025

You can use Series.between():
df.loc[~df.Vader_Sentiment.between(-0.1, 0.1)]
Text Vader_Sentiment
1 B 0.206
Three things to note:
The tilde (~) operator denotes the inverse/complement of a boolean mask.
Make sure you have numeric data: df.dtypes should show float for Vader_Sentiment, not "object".
You can pass an inclusive parameter to specify whether the interval endpoints are treated as closed or open.
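As a minimal sketch of the point above (data taken from the question; note that in recent pandas versions inclusive accepts the strings "both", "neither", "left", and "right"):

```python
import pandas as pd

df = pd.DataFrame({
    "Text": ["A", "B", "C", "D", "E"],
    "Vader_Sentiment": [-0.010, 0.206, 0.003, -0.089, 0.025],
})

# Keep rows whose sentiment falls OUTSIDE [-0.1, 0.1].
# inclusive="both" (the default in recent pandas) means the
# endpoints -0.1 and 0.1 themselves fall inside the excluded band.
out = df.loc[~df["Vader_Sentiment"].between(-0.1, 0.1, inclusive="both")]
print(out)
```

Only row B survives here, since every other sentiment value lies inside the band.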


Adding values in a column by "formula" in a pandas dataframe

I am trying to add values to a column using a formula, using the information from this question: Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?
I already have the first number of column B, and I want a formula for the rest of the column.
The dataframe looks something like this:
A B C
0.16 0.001433 25.775485
0.28 0 25.784443
0.28 0 25.792396
...
And the method I tried was:
for i in range(1, len(df)):
    df.loc[i, "B"] = df.loc[i-1, "B"] + df.loc[i, "A"] * (df.loc[i, "C"] - df.loc[i-1, "C"])
But this code produces an infinite loop. Can someone help me with this?
You can use shift and a simple assignment.
The general rule in pandas: if you find yourself using loops, you're probably doing something wrong; they're considered an anti-pattern.
df['B_new'] = df['B'].shift(-1) - df['A'] * ((df['C'] - df['C'].shift(-1)))
A B C B_new
0 0.16 0.001433 25.775485 0.001433
1 0.28 0.000000 25.784443 0.002227
2 0.28 0.000000 25.792396 NaN
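Note that the shift(-1) expression recovers values from a B column that is already filled in. If the goal is instead to build B forward from only its first value (as the question's loop does), the recurrence B[i] = B[i-1] + A[i]*(C[i] - C[i-1]) telescopes into a cumulative sum, so a loop-free sketch under that reading could be:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0.16, 0.28, 0.28],
    "C": [25.775485, 25.784443, 25.792396],
})
b0 = 0.001433  # known first value of B

# B[i] = B[i-1] + A[i] * (C[i] - C[i-1]) telescopes to
# B[i] = B[0] + sum over k <= i of A[k] * diff(C)[k]
increments = (df["A"] * df["C"].diff()).fillna(0.0)
df["B"] = b0 + increments.cumsum()
print(df)
```

Each row then satisfies the original recurrence exactly, with no Python-level loop.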

Calculating Quantiles based on a column value?

I am trying to figure out a way to calculate quantiles in pandas based on a column value. Also, can I calculate multiple different quantiles in one output?
For example I want to calculate the 0.25, 0.50 and 0.9 quantiles for
Column Minutes in df where it is <= 5 and where it is > 5 and <=10
df[df['Minutes'] <=5]
df[(df['Minutes'] >5) & (df['Minutes']<=10)]
where column Minutes is just a column containing value of numerical minutes
Thanks!
DataFrame.quantile accepts an array of quantile values.
Try:
df['Minutes'].quantile([0.25, 0.50, 0.9])
Or filter the data first:
df.loc[df['Minutes'] <= 5, 'Minutes'].quantile([0.25, 0.50, 0.9])
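To get both buckets' quantiles in one output, one option is to label each row with its bucket via pd.cut and group; a sketch with made-up minute values (column name Minutes taken from the question):

```python
import pandas as pd

df = pd.DataFrame({"Minutes": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# Bucket rows into (0, 5] and (5, 10] (right-closed, matching
# "<= 5" and "> 5 and <= 10"), then compute several quantiles
# within each bucket in a single call.
buckets = pd.cut(df["Minutes"], bins=[0, 5, 10])
result = df.groupby(buckets, observed=True)["Minutes"].quantile([0.25, 0.50, 0.9])
print(result)
```

The result is a Series indexed by (bucket, quantile), so all six numbers come back from one expression.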

How to truncate a series/column of a data frame with single operation

There are operations to round, floor, or ceiling a column/series of a dataframe, but how can one specify the precision for a column and truncate the remaining digits?
df = pd.DataFrame({"a": (1.21233123, 1.2412304498), 'b':(2.11296876, 2.09870989)})
Given this simple data frame, let's say I want to truncate columns a and b to 3 decimal places without rounding; I simply want to drop the remaining digits.
df = pd.DataFrame({"a": (1.212, 1.241), 'b':(2.112, 2.098)})
This would be the resulting df. There should be a column operation that can do this, but it seems you can only specify precision for rounding.
Use numpy.trunc with a small trick:
import numpy as np
n_precision = 3
df = np.trunc(df * (10 ** n_precision))/ (10 ** n_precision)
print(df)
a b
0 1.212 2.112
1 1.241 2.098
Since np.trunc discards the fractional part, you first multiply the numbers by 10 to the power of your precision, apply np.trunc, then divide back to get the desired output.
You can use round:
In [11]: df.round(3)
Out[11]:
a b
0 1.212 2.113
1 1.241 2.099
To "round down" you can subtract 0.001 / 2 from the DataFrame first:
In [12]: (df - 0.0005).round(3)
Out[12]:
a b
0 1.212 2.112
1 1.241 2.098
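One caveat worth noting: the subtract-then-round trick assumes non-negative values. Shifting a negative number by -0.0005 moves it further from zero, whereas np.trunc always drops the fraction toward zero. A quick sketch comparing the two on a negative value (data partly made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.21233123, -2.11296876]})

trunc = np.trunc(df["a"] * 1000) / 1000  # truncates toward zero
shifted = (df["a"] - 0.0005).round(3)    # the "round down" trick

print(trunc.tolist())    # positive value agrees; negative stays -2.112
print(shifted.tolist())  # negative value lands on -2.113 instead
```

So for mixed-sign data the np.trunc approach is the safer of the two.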

Adding calculated constant value into Python data frame

I'm new to Python, and I believe this is a very basic question (sorry for that), but I tried to look for a solution here: Better way to add constant column to pandas data frame and here: add column with constant value to pandas dataframe and in many other places...
I have a data frame like this "toy" sample:
A B
10 5
20 12
50 200
and I want to add a new column (C) which will be the division of the last cells of A and B (50/200); so in my example, I'd like to get:
A B C
10 5 0.25
20 12 0.25
50 200 0.25
I tried to use this code:
groupedAC['pNr'] = groupedAC['cIndCM'][-1:] / groupedAC['nTileCM'][-1:]
but I'm getting the result only in the last cell (I believe my code is acting as a "pointer" rather than producing a number; as I said, I tried to "convert" the result into a constant, even using temp variables, but with no success).
Your help will be appreciated!
You need to index it with .iloc[-1] instead of .iloc[-1:], because the latter returns a Series and thus when assigning back to the data frame, the index needs to be matched:
df.B.iloc[-1:] # returns a Series
#2 200
#Name: B, dtype: int64
df['C'] = df.A.iloc[-1:]/df.B.iloc[-1:] # the index has to be matched in this case, so only
# the row with index = 2 gets updated
df
# A B C
#0 10 5 NaN
#1 20 12 NaN
#2 50 200 0.25
df.B.iloc[-1] # returns a scalar
# 200
df['C'] = df.A.iloc[-1]/df.B.iloc[-1] # there's nothing to match when assigning the
# constant to a new column, the value gets broadcasted
df
# A B C
#0 10 5 0.25
#1 20 12 0.25
#2 50 200 0.25
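Equivalently, .iat gives scalar (not Series) access by position, so the same broadcast applies; a minimal sketch with the toy data from the question:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 50], "B": [5, 12, 200]})

# .iat[-1] returns a plain scalar, so the division yields a single
# number and pandas broadcasts it down the whole new column.
df["C"] = df["A"].iat[-1] / df["B"].iat[-1]
print(df)
```

Every row of C then holds 50/200 = 0.25.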

Pandas - remove cells based on value

I have a dataframe with z-scores for several values. It looks like this:
ID Cat1 Cat2 Cat3
A 1.05 -1.67 0.94
B -0.88 0.22 -0.56
C 1.33 0.84 1.19
I want to write a script that will tell me which IDs correspond with values in each category relative to a cut-off value I specify as needed. Because I am working with z-scores, I will need to compare the absolute value against my cut-off.
So if I set my cut-off at 0.75, the resulting dataframe would be:
Cat1 Cat2 Cat3
A A A
B C C
C
If I set 1.0 as my cut-off value: the dataframe above would return:
Cat1 Cat2 Cat3
A A C
C
I know that I can do queries like this:
df1 = df[df['Cat1'] > 1]
df1
df1 = df[df['Cat1'] < -1]
df1
to individually query each column and find the information I'm looking for, but this is tedious even if I figure out how to use the abs function to combine the two queries into one. How can I apply this filter to the whole dataframe?
I've come up with this skeleton of a script:
cut_off = 1.0
cols = list(df.columns)
cols.remove('ID')
for col in cols:
    # FOR CELL IN VALUE OF EACH CELL IN COLUMN:
    if abs(CELL) < cut_off:
        CELL = NaN
to basically just eliminate any values that don't meet the cut-off. If I can get this to work, it will bring me closer to my goal, but I am stuck and don't even know if I am on the right track. Again, the overall goal is to quickly figure out which cells have absolute values above the cut-off in each category and be able to list the corresponding IDs.
I apologize if anything is confusing or vague; let me know in the comments and I'll fix it. I've been trying to figure this out for most of today and my brain is somewhat fried.
You don't have to apply the filter column by column; you can mask the whole DataFrame at once:
df[df > 1]
and, similarly, assign to the masked values in place:
df[df > 1] = np.nan
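Putting it together for this question (which needs absolute values, and has a non-numeric ID column that must sit outside the mask), a sketch could set ID as the index, mask on abs, and read off the surviving IDs per category:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "B", "C"],
    "Cat1": [1.05, -0.88, 1.33],
    "Cat2": [-1.67, 0.22, 0.84],
    "Cat3": [0.94, -0.56, 1.19],
})

cut_off = 0.75
scores = df.set_index("ID")              # only numeric columns remain
masked = scores[scores.abs() > cut_off]  # values at or below cut-off -> NaN

# IDs whose |z| exceeds the cut-off, listed per category
ids = {col: masked[col].dropna().index.tolist() for col in masked.columns}
print(ids)
```

With cut_off = 0.75 this reproduces the expected lists from the question: A, B, C for Cat1 and A, C for Cat2 and Cat3.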
