Selecting Pandas DF between two values - python

I'm trying to subset a column of values that were extracted from a correlation matrix. I want to get values greater than 0.75 or less than -0.75. I tried the first line of code below and it only gave me positive values greater than 0.75. The second line of code errored out without a result.
Corr_matrix1 = Corr_matrix1[(Corr_matrix1['Coefficient'] >= abs(0.75))]
Corr_matrix1 = Corr_matrix1[(Corr_matrix1['Coefficient'] >= 0.75) & (Corr_matrix1['Coefficient'] <= -0.75)]
Any help would be appreciated.

You can do this with the DataFrame.query method, one of my favorite features of pandas; it's pretty slept on. Here's an example:
Corr_matrix1.query(
    'Coefficient <= -0.75'
    ' or Coefficient >= 0.75'
)
It looks a little odd: the expression is passed as a single string, and adjacent string literals with no comma between them are concatenated by Python into one query string (note the leading space on the second literal). If you need to reference a variable, you can use an f-string.
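For instance, a minimal sketch of the f-string variant, with a hypothetical threshold variable standing in for the hard-coded 0.75:
threshold = 0.75  # hypothetical cutoff variable
result = Corr_matrix1.query(f'Coefficient <= -{threshold} or Coefficient >= {threshold}')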

Take a look at IntervalIndex:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.html

Related

searching index of first and last value greater than - Python Pandas

I have 250 files with 1000 values each, which form a Gaussian curve, and I need to find the first and last index of the values that are bigger than half of the maximum. I loaded the files as a list of dataframes and was able to find the maximum using maxValues = dataframes_temp[1].max(). I was able to find the value closest to the HotM (half of the maximum) using index_value_min = (dataframes_temp[1] - b[i]).apply(abs).idxmin(), but the value it returns isn't greater than the HotM; that's the first problem.
The second problem: I wanted to find the last index above the HotM using:
dataframes_temp = dataframes_list[1]
dataframes_temp2 = dataframes_temp.loc[::-1]
index_value_max = (dataframes_temp2[1] - b[i]).apply(abs).idxmin()
but it didn't work; it just found the same value as in the first part.
So how can I find the indexes of the first and last values bigger than the HotM?
How about, instead of abs, using lambda x: float('inf') if x <= 0 else x, so that values at or below the half maximum will not be selected?
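A minimal sketch of that substitution, with a toy Gaussian standing in for one of the loaded files (the column name 1 follows the question's code):
import numpy as np
import pandas as pd

# Toy stand-in for one loaded file: column 1 holds a Gaussian-shaped curve.
xs = np.linspace(-5, 5, 1000)
dataframes_temp = pd.DataFrame({1: np.exp(-xs**2)})

half_max = dataframes_temp[1].max() / 2  # the "HotM"

# Penalize differences that are not strictly positive, so idxmin can only
# pick a row whose value is greater than half_max.
penalty = lambda x: float('inf') if x <= 0 else x

index_value_min = (dataframes_temp[1] - half_max).apply(penalty).idxmin()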

How to encode values that appear less than N times with special category ("Other", for instance)?

I have 12 columns of type object, and I want to encode the values that appear fewer than N times (say, 1000) in those columns as a special category ("Other"). I tried this solution, but I have 12 features to consider and I'd like something more universal. In addition, I tried doing something like this:
for col in train_data.select_dtypes('object'):
    train_data.select_dtypes('object')[col] = np.where(
        train_data.select_dtypes('object')[col].value_counts() < 1000,
        "Other",
        train_data.select_dtypes('object')[col].value_counts())
But got an error on sizes:
ValueError: Length of values (8) does not match length of index (251396)
What would a universal solution look like?
Disclaimer: since I previously only used R, I try to avoid for-loops, which is why this solution was not obvious to me.
I found the following solution for my specific problem. It is based on selecting the column values whose group size is less than N, using the .loc, groupby and transform methods.
preprocessed_cols = []
for col in train_data.select_dtypes('object').columns:
    # Size of the group each row's value belongs to, i.e. how often that value occurs.
    sizes = train_data.groupby(col)[col].transform('size')
    if (sizes < 1000).any():
        preprocessed_cols.append(col)
        # Assign on train_data itself; assigning through select_dtypes() would
        # only modify a temporary copy.
        train_data.loc[sizes < 1000, col] = 'Other'
print('There are {} features that needed replacement'.format(len(preprocessed_cols)))
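For comparison, a shorter sketch of the same idea using value_counts and Series.where (column selection assumed to be the same as in the question); it replaces every value occurring fewer than 1000 times with 'Other':
for col in train_data.select_dtypes('object').columns:
    counts = train_data[col].value_counts()
    rare_values = counts[counts < 1000].index
    # Keep a value where it is common enough, otherwise substitute 'Other'.
    train_data[col] = train_data[col].where(~train_data[col].isin(rare_values), 'Other')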

How to include two conditions using np.where, when the array has two columns?

I've seen questions adjacent to this answered a number of times, but I'm really, really new to Python, and can't seem to get those answers to work for me...
I'm trying to access every row in an np array, where both columns have values greater than 1.
So, if x is my original array, and x has 500 rows and 2 columns, I want to know which rows, of those 500, contain 2 values > 1.
I've tried a bunch of solutions, but the following two seem the closest:
Test1 = x[(x[:,0:1] > 1) & (x[:,1:2] > 1)]
# Where the first condition should look for values greater than 1 in the first column, and the second condition should look for values greater than 1 in the second column.
Test2 = np.where(x[:,0:1] > 1 & x[:,1:2] > 1)
Any help would be greatly appreciated! Thanks so much!
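For what it's worth, a sketch of how both attempts could be made to work (x below is a random stand-in for the original 500x2 array): use 1-D column views so the mask has one entry per row, and parenthesize each comparison, because & binds more tightly than >.
import numpy as np

x = np.random.rand(500, 2) * 3  # stand-in for the original 500x2 array

# One boolean per row, True when both columns exceed 1.
rows_mask = (x[:, 0] > 1) & (x[:, 1] > 1)

selected_rows = x[rows_mask]          # the qualifying rows themselves
row_indices = np.where(rows_mask)[0]  # their row numbers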

Pandas adding decimal points when using read_csv

I'm working with some csv files and using pandas to turn them into a dataframe. After that, I use an input to find values to delete
I'm hung up on one small issue: for some columns it's adding ".0" to the values in the column. It only does this in columns with numbers, so I'm guessing it's reading those columns as floats. How do I prevent this from happening?
The part that really confuses me is that it only happens in a few columns, so I can't quite figure out a pattern. I need to chop off the ".0" so I can re-import it, and I feel like it would be easiest to prevent it from happening in the first place.
Thanks!
Here's a sample of my code:
clientid = int(input('What client ID needs to be deleted?'))
df1 = pd.read_csv('Client.csv')
clientclean = df1.loc[df1['PersonalID'] != clientid]
clientclean.to_csv('Client.csv', index=None)
Ideally, I'd like all of the values to be the same as the original csv file, but without the rows with the clientid from the user input.
If PersonalID is the header of the problematic column, try this:
import numpy as np

df1 = pd.read_csv('Client.csv', dtype={'PersonalID': np.int32})
Edit:
Since an integer dtype cannot hold NaN values, the above will fail if the column contains missing data.
You can try this on each problematic column instead:
df1[col] = df1[col].fillna(-9999) # or 0 or any value you want here
df1[col] = df1[col].astype(int)
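Another option, assuming a reasonably recent pandas version, is the nullable integer dtype 'Int64' (capital I), which can hold missing values directly, so no fill value is needed:
df1 = pd.read_csv('Client.csv', dtype={'PersonalID': 'Int64'})  # nullable integer column, NaN-friendly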
You could go through each value, and if it is a float x, check whether x - int(x) is 0.0; if every value in a column passes that check, convert the column to int. Or, if you're not dealing with any non-integer numbers at all, you could just convert all float values to ints.
For an example of the latter (when your original data does not contain any non-integer numbers):
for index, row in df1.iterrows():
    for c, x in enumerate(row):
        if isinstance(x, float):
            df1.iloc[index, c] = int(x)
For an example of the former (if you want to keep non-integer numbers as non-integer numbers, but want to guarantee that integer numbers stay as integers):
import sys

for c, col in enumerate(df1.columns):
    foundNonInt = False
    for r, index in enumerate(df1.index):
        x = df1.iloc[r, c]
        if isinstance(x, float):
            if (x - int(x)) > sys.float_info.epsilon:
                foundNonInt = True
                break
    if not foundNonInt:
        # every float in this column is a whole number, so the column can be cast
        df1.iloc[:, c] = df1.iloc[:, c].astype(int)
Note, the above method is not fool-proof: if, by chance, a column that is meant to hold non-integer numbers happens to contain only values like x.0000000 (whole all the way to the last decimal place), it will be converted anyway.
It was a datatype issue.
ALollz's comment led me in the right direction. Pandas was assuming a data type of float, which added the decimal points.
I specified the datatype as object (from Akarius's comment) when using read_csv, which resolved the issue.
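A minimal sketch of that fix, with the file name taken from the question:
df1 = pd.read_csv('Client.csv', dtype=object)  # read every column as-is, so no float conversion adds ".0"
Note that with every column read as text, the clientid comparison should be done against str(clientid) rather than the int.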

Drop Pandas DataFrame lines according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 programmatically in Python (in other languages I know, it would simply be name+"_Group" with name looping over [my_df1, my_df2]; how do you do that in Python?), I built a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but trying to work with Sample[1] in the for doesn't work
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your aggregated data frame, it subsets your initial data and uses boolean logic to see whether any of the 'Value' entries in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
