Getting rid of outlier rows in multiple columns of a pandas dataframe - python

I have a pandas data frame with many columns (>100). I standardized all the column values so every column is centered at 0 (mean 0 and std 1). I want to get rid of all the rows that have any value below -2 or above 2, taking all the columns into account. By this I mean: say in the first column rows 2, 3, 4 are outliers and in the second column rows 3, 4, 5, 6 are outliers; then I would like to get rid of rows [2, 3, 4, 5, 6].
What I am trying to do is use a for loop to pass over every column and collect the row indexes that are outliers in a list. At the end I have a list of lists with the outlier row indexes of every column, from which I take the unique values to obtain the row indexes I should get rid of. My problem is I don't know how to slice the data frame so it doesn't contain these rows. I was thinking of using an %in% operator (as in R), but it doesn't accept a list of lists. I show my code below.
### Getting rid of the outliers
'''
We are going to get rid of the outliers who are outside the range of -2 to 2.
'''
import numpy as np

aux_features = features_scaled.values
n_cols = aux_features.shape[1]
n_rows = aux_features.shape[0]
outliers_index = []
for i in range(n_cols):
    variable = aux_features[:, i]  # We take one column at a time
    condition = (variable < -2) | (variable > 2)  # We establish the condition for the outliers
    index = np.where(condition)
    outliers_index.append(index)
outliers = [j for i in outliers_index for j in i]
outliers_2 = np.array([j for i in outliers for j in i])
unique_index = list(np.unique(outliers_2))  # This is the final list with all the indexes that contain outliers.
total_index = list(range(n_rows))
aux = (total_index in unique_index)
outliers_2 contains all the row indexes (including repetitions); unique_index then keeps only the unique values, so I end up with all the row indexes that have outliers. This is the part where I am stuck. If anyone knows how to complete it, or has a better idea of how to get rid of these outliers, please share (I guess my method would be very time consuming for really large datasets).

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.standard_normal(size=(1000, 5)))  # example data
cleaned = df[~(np.abs(df) > 2).any(axis=1)]
Explanation:
Compare the dataframe against the threshold. This returns a dataframe of booleans:
np.abs(df) > 2
Check whether each row contains an outlier. This evaluates to True for every row where at least one outlier exists:
(np.abs(df) > 2).any(axis=1)
Finally, select all rows without outliers using the ~ (negation) operator:
df[~(np.abs(df) > 2).any(axis=1)]
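For completeness, the index list built in the question's loop can also be used directly. A minimal sketch, assuming unique_index holds the outlier row labels collected above:

cleaned = df.drop(index=unique_index)  # drop the outlier rows by label
# equivalently, keep only the rows whose values all stay within [-2, 2]:
cleaned = df[(np.abs(df) <= 2).all(axis=1)]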

Related

Taking the column names from the first row that has fewer than x NaNs

I have data as follows:
import pandas as pd
url_cities="https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
df_cities = pd.read_excel(url_cities)
print(df_cities.iloc[0:20,])
The column names can be found in row 15, but I would like this row number to be determined automatically. I thought the best way would be to take the first row that has fewer than 10 NaN values.
I combined this answer with this one to do the following:
amount_Nan = df_cities.shape[1] - df_cities.count(axis=1)
# OR df.isnull().sum(axis=1).tolist()
print(amount_Nan)
col_names_index = next(i for i in amount_Nan if i < 3)
print(col_names_index)
df_cities.columns = df_cities.iloc[col_names_index]
The problem is that col_names_index keeps returning 0, while it should be 15. I think it is because iterating over amount_Nan yields the NaN counts themselves rather than the row indexes, which makes next(i for i in amount_Nan if i < 3) work differently than expected.
The thing is that I do not really understand why. Can anyone help?
IIUC, you can get the first index of a non-missing value in the second column using DataFrame.iloc with Series.notna and Series.idxmax, set the column names from this row, and filter out the values before this row by index:
i = df_cities.iloc[:, 1].notna().idxmax()
df_cities.columns = df_cities.iloc[i].tolist()
df_cities = df_cities.iloc[i+1:]
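If you want to keep the exact criterion from the question (the first row with fewer than 10 NaNs), here is a minimal sketch; nan_counts and header_row are illustrative names:

nan_counts = df_cities.isna().sum(axis=1)          # NaNs per row
header_row = nan_counts[nan_counts < 10].index[0]  # first row label with fewer than 10 NaNs
df_cities.columns = df_cities.iloc[header_row].tolist()
df_cities = df_cities.iloc[header_row + 1:]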

Modifying Dataframes Stored in a List of Dataframes

I am segmenting data into a set of Pandas dataframes that have identical structure. Each dataframe has the columns listed in cnames (all with unique names) and nrows rows, identified by an integer-valued index running from 0 to nrows-1. There are nframes segments in total, each containing 3 dataframes.
The goal is, within each segment, to calculate the quotient of two of the dataframes and send the result to the third. I've implemented and tested a process that works, but I have a question as to why a slight variation of the process doesn't.
The steps (and variation) are as follows:
Initialize data frames:
Ldf_num = [pd.DataFrame(0.0, index=range(0, nrows), columns=cnames) for x in range(0, nframes)]
Ldf_den = [pd.DataFrame(0.0, index=range(0, nrows), columns=cnames) for x in range(0, nframes)]
Ldf_quo = [pd.DataFrame(0.0, index=range(0, nrows), columns=cnames) for x in range(0, nframes)]
Populate data frames:
#For loop over a set of data-records stored as a list of lists:
#Determine x, the index of the data frame related to this record, from the data
df_num = Ldf_num[x]
df_den = Ldf_den[x]
#Derive values (including row) for each column of the data frame, and store them as...
df_num[cname][row] += derived_value1
df_den[cname][row] += derived_value2
Determine quotient for each set of dataframes:
for x in range(0, nframes):
    df_num = Ldf_num[x]
    df_den = Ldf_den[x]
    Ldf_quo[x] = df_num.div(df_den)
The above version of step 3 worked, i.e. I can print each dataframe in the quotient list and see that they have different values that match the numerator and denominator values.
3b. However, the version below did not work:
for x in range(0, nframes):
    df_num = Ldf_num[x]
    df_den = Ldf_den[x]
    df_quo = Ldf_quo[x]
    df_quo = df_num.div(df_den)
...as all entries in all dataframes in the list Ldf_quo contained their initial value of 0.
Can anyone explain why, when I assign a variable to a single dataframe stored in a list of dataframes and then change values through that variable, it changes the values in the original dataframe in the list (as in step 2)...
...but when I send the output of the div method to a variable assigned to a single dataframe in the list (as in step 3b), the values in the original dataframe do not change? (I can get the desired result by sending the output of div explicitly to the right slot in the list of dataframes, as in step 3.)
In 3b you are merely rebinding the local name df_quo: first it refers to the object stored at Ldf_quo[x], and then the assignment df_quo = df_num.div(df_den) makes it refer to a brand-new DataFrame, leaving the list untouched. In step 3, Ldf_quo[x] = df_num.div(df_den) assigns the new DataFrame into the list itself, which is why it works.
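A minimal sketch of the difference between rebinding a name, mutating a shared object, and assigning into the list (the data is illustrative):

import pandas as pd

Ldf = [pd.DataFrame({'a': [0.0, 0.0]}) for _ in range(2)]

# Rebinding a local name does not touch the list:
df = Ldf[0]
df = df + 1          # df now refers to a brand-new DataFrame
print(Ldf[0])        # still all zeros

# Mutating through the name does touch the list (same object):
df2 = Ldf[1]
df2['a'] += 1        # in-place modification of the shared object
print(Ldf[1])        # now all ones

# Assigning into the list replaces the stored object (as in step 3):
Ldf[0] = Ldf[0] + 1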

How can I extract a section of a pandas dataframe like the one marked in the picture below?

I am trying to extract the section (matrix) of numbers from a pandas dataframe, as marked in the picture embedded above.
Can anyone assist me? I want to perform analytics based on a section (matrix) of a bigger data frame. Thank you in advance!
You can use the .iloc[] function to select the rows and columns you want.
dataframe.iloc[5:15,6:15]
This should select rows 5-14 and columns 6-14.
Not sure if the numbers are correct but I think this method is what you were looking for.
edit: changed .loc[] to .iloc[] because we're using index values, and cleaned it up a bit
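A quick illustrative check (hypothetical 20x20 dataframe; the slice bounds are the ones from the answer):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(400).reshape(20, 20))
section = df.iloc[5:15, 6:15]  # rows 5-14, columns 6-14
print(section.shape)           # (10, 9)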
Here is the code to iterate over the whole dataframe:
# df = big data frame
shape = (10, 10)  # shape of the matrix to be analyzed, here 10x10
step = 1          # step size; iterate over every position
# or
step = 10         # step size; iterate block by block
# Keep in mind, iterating block by block will leave some data out at the end of the rows and columns.
# You can set step = shape if you are working with a matrix that isn't square; just be sure to change
# step in the code below to step[0] and step[1] respectively.
for row in range(0, df.shape[0] - shape[0] + 1, step):      # rows of big dataframe minus rows of the matrix to be analyzed
    for col in range(0, df.shape[1] - shape[1] + 1, step):  # columns of big dataframe minus columns of the matrix to be analyzed
        matrix = df.iloc[row:shape[0] + row, col:shape[1] + col]  # slice out the matrix and set it equal to 'matrix'
        # analyze matrix here
This is basically the same as @dafmedinama said; I just added more commenting, simplified specifying the shape of the matrix, and included a step variable so you don't have to iterate over every single position when you move the matrix.
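If performance matters, NumPy's sliding_window_view (available since NumPy 1.20) can produce all sub-matrices at once; this is an alternative sketch, not part of the original answer:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

windows = sliding_window_view(df.to_numpy(), (10, 10))
# windows[row, col] is the 10x10 block whose top-left corner is at (row, col)
print(windows.shape)  # (n_rows - 9, n_cols - 9, 10, 10)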
Let sub_rows and sub_cols be the dimensions of the sub-dataframe to be extracted:
import pandas as pd

sub_rows = 10  # number of rows to be extracted
sub_cols = 3   # number of columns to be extracted
if sub_rows > len(df.index):
    print("Defined sub dataframe rows are more than in the original dataframe")
elif sub_cols > len(df.columns):
    print("Defined sub dataframe columns are more than in the original dataframe")
else:
    for i in range(0, len(df.index) - sub_rows + 1):
        for j in range(0, len(df.columns) - sub_cols + 1):
            sub_df = df.iloc[i:i + sub_rows, j:j + sub_cols]  # extracted sub-dataframe
            # Put here the code you need for your analysis

How do you filter rows in a dataframe based on the column numbers from a Python list?

I have a Pandas dataframe with two columns, x and y, that correspond to a large signal. It is about 3 million rows in size.
[Figure: plot of the wavelength signal from the dataframe]
I am trying to isolate the peaks from the signal. After using scipy, I got a 1D Python list corresponding to the indexes of the peaks. However, they are not the actual x-values of the signal, but just the index of their corresponding row:
from scipy.signal import find_peaks
peaks, _ = find_peaks(y, height=(None, peakline))  # y is the signal column; peakline is a height bound defined elsewhere in my code
So, I decided I would just filter the original dataframe by setting all values in its y column to NaN unless they were at an index found in the peak list. I did this iteratively; however, since there are 3,000,000 rows, it is extremely slow:
peak_index = 0
for data_index in list(data.index):
    if data_index != peaks[peak_index]:
        data.iloc[data_index, 1] = float('NaN')
    else:
        peak_index += 1
Does anyone know what a faster method of filtering a Pandas dataframe might be?
Looping is in most cases extremely inefficient when it comes to pandas. Assuming you just need a filtered DataFrame containing the x and y values only where y is a peak, you may use the following piece of code:
df.iloc[peaks]
Alternatively, if you want the original DataFrame with the y column retaining its peak values and NaN otherwise, you can use:
df['y'] = df['y'].where(df.index.isin(peaks))
Finally, since you seem to care about just the x values of the peaks, you might just rework the first piece in the following way:
df.iloc[peaks].x
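An end-to-end sketch on synthetic data (the sine wave and the height threshold are illustrative, not from the question):

import numpy as np
import pandas as pd
from scipy.signal import find_peaks

x = np.linspace(0, 10, 1000)
y = np.sin(5 * x) + np.random.normal(0, 0.05, x.size)
df = pd.DataFrame({'x': x, 'y': y})

peaks, _ = find_peaks(df['y'].to_numpy(), height=0.5)
peak_x = df.iloc[peaks]['x']                   # x-values of the peaks
df['y'] = df['y'].where(df.index.isin(peaks))  # keep peaks, NaN elsewhere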

In pure python (no numpy, etc.) how can I find the mean of certain columns of a two dimensional list?

I currently use CSV reader to create a two dimensional list. First, I strip off the header information, so my list is purely data. Sadly, a few columns are text (dates, etc) and some are just for checking against other data. What I'd like to do is take certain columns of this data and obtain the mean. Other columns I just need to ignore. What are the different ways that I can do this? I probably don't care about speed, I'm doing this once after I read the csv and my CSV files are maybe 2000 or so rows and only 30 or so columns.
This assumes that all rows are of equal length; if they're not, you may have to add a few try/except cases.
lst = []  # This is the rows and columns, assuming the rows contain the columns
column = 2
temp = 0
for row in range(len(lst)):
    temp += lst[row][column]
mean = temp / len(lst)
To test whether an element is a number, in most cases I use:
try:
    float(element)  # int may also work depending on your data
except ValueError:
    pass
Hope this helps; I can't test this code, as I'm on my phone.
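A sketch combining the two ideas above, so non-numeric entries (dates, check columns, ...) are simply skipped; column_mean is an illustrative name:

def column_mean(rows, column):
    total, count = 0.0, 0
    for row in rows:
        try:
            total += float(row[column])
            count += 1
        except ValueError:
            pass  # skip non-numeric entries
    return total / count if count else float('nan')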
Try this:
def avg_columns(list_name, *column_numbers):
    running_sum = 0
    for col in column_numbers:
        for row in range(len(list_name)):
            running_sum += list_name[row][col]
    return running_sum / (len(list_name) * len(column_numbers))
You pass it the name of the list, and the indexes of the columns (starting at 0), and it will return the average of those columns.
l = [
    [1, 2, 3],
    [1, 2, 3],
]
print(avg_columns(l, 0)) # returns 1.0, the avg of the first column (index 0)
print(avg_columns(l, 0, 2)) # returns 2.0, the avg of column indices 0 and 2 (first and third)
