How to delete multiple rows in a pandas DataFrame based on condition? - python

I know how to delete rows and columns from a DataFrame using the .drop() method, by passing axis and labels.
Here's the DataFrame:
Now, if I want to remove all rows whose STNAME falls in the range from Arizona all the way to Colorado, how should I do it?
I know I could pass row labels 2 to 7 to .drop(), but if I have a lot of data and don't know the starting and ending indexes, that won't be possible.

Might be kinda hacky, but here is an option:
import numpy as np
index1 = df.index[df['STNAME'] == 'Arizona'][0]    # label of the first Arizona row
index2 = df.index[df['STNAME'] == 'Colorado'][-1]  # label of the last Colorado row
df = df.drop(np.arange(index1, index2 + 1))
This takes the index of the first Arizona row and the index of the last Colorado row, and drops every row between them, inclusive. It assumes a default integer index and that the rows are ordered by STNAME.
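When the states to remove are known by name, a boolean mask avoids computing labels by hand. The following is a minimal sketch on made-up data (the POP column and its values are hypothetical, not the asker's data); the second variant mirrors the first-to-last idea above using label slicing.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's DataFrame
df = pd.DataFrame({
    "STNAME": ["Alabama", "Alaska", "Arizona", "Arizona", "Arkansas",
               "California", "Colorado", "Colorado", "Connecticut"],
    "POP": [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

# Variant 1: drop rows whose STNAME is in an explicit list of states
states_to_drop = ["Arizona", "Arkansas", "California", "Colorado"]
by_mask = df[~df["STNAME"].isin(states_to_drop)]

# Variant 2: drop everything from the first Arizona row to the last Colorado row
first = df.index[df["STNAME"] == "Arizona"][0]
last = df.index[df["STNAME"] == "Colorado"][-1]
by_range = df.drop(df.loc[first:last].index)  # .loc slices include both ends
```

The mask variant does not depend on row order or on the index type, so it is probably the safer choice when the list of states is known up front.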

Pulling columns of dataframe into separate dataframe, then replacing duplicates with mean values

I'm new to the world of python so I apologize in advance if this question seems pretty rudimentary. I'm trying to pull columns of one dataframe into a separate dataframe. I want to replace the duplicate columns from the first dataframe with one column that contains the mean values into the second dataframe. I hope this makes sense!
To provide some background, I am tracking gene expression over certain time points. I have a dataframe that is 17 rows x 33 columns. Every row in this data frame corresponds to a particular exon. Every column on this data frame corresponds to a time-point (AGE).
The dataframe looks like this:
Some of these columns contain the same name (age) and I'd like to calculate the mean of ONLY the columns with the same name, so that, for example, I get one column for "12 pcw" rather than three separate columns for "12 pcw." After which I hope to pull these values from the first dataframe into a second dataframe for averaged values.
I'm hoping to use a for loop to loop through each age (column) to get the average expression across the subjects.
I will explain my process so far below:
#1) Get list of UNIQUE string names from age list
unique_ages = set(column_names)
#2) Create an empty dataframe that gives an outline of what I want my averaged data to fit/be put in
mean_df = pd.DataFrame(index=exons, columns=unique_ages)
#3) Now I want to loop through each age to get the average expression across the donors present. This is where I'm trying to utilize a for loop to create a pipeline to process other data frames that I will be working with in the future.
for age in unique_ages:
    print(age)
    age_df = pd.DataFrame() ##pull columns of df as separate df that have this string
    if len(age_df.columns) > 1: ##check if df has >1 SAME column, if so, take avg across SAME columns
        mean = df.mean(axis=1)
        mean_df[age] = mean
    else:
        ## just pull out the values and put them into your temp_df
#4) Now, with my new averaged array (or the same array if multiple ages are NOT present), I want to place this array into my 'temp_df' under the appropriate column. I understand that I should use the 'age' variable provided by the for loop to get the proper name of the column in my temp df. However, I'm not sure how to do this. This has all been quite a steep learning curve, and I feel like there's a simple solution, but I can't seem to wrap my head around it. Any help would be greatly appreciated.
There is no need for a for loop (there often isn't with Pandas :)). You can simply use df.groupby(lambda x:x, axis=1).mean(). An example:
data = [[1,2,3],[4,5,6]]
cols = ['col1', 'col2', 'col2']
df = pd.DataFrame(data=data, columns=cols)
#    col1  col2  col2
# 0     1     2     3
# 1     4     5     6
df = df.groupby(lambda x:x, axis=1).mean()
#    col1  col2
# 0   1.0   2.5
# 1   4.0   5.5
The groupby function takes another function (the lambda), which is called with each column name and returns the group that column belongs to. In our case, we want the column name itself to be the group. So for the third column, named col2, it says 'this column belongs to the group named col2', which already exists (because the second column was processed earlier). You then apply the aggregation you want, in this case mean().
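Recent pandas releases deprecate the axis=1 argument of groupby, so on a newer version the same idea can be written by transposing, grouping on the index, and transposing back. This is a sketch of that alternative, not the answerer's original code; mean_df is just an illustrative name.

```python
import pandas as pd

# Same toy data as above: two columns share the name "col2"
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["col1", "col2", "col2"])

# After transposing, the duplicate column names become duplicate index labels,
# so grouping on index level 0 averages the rows that share a label.
mean_df = df.T.groupby(level=0).mean().T
```

Here mean_df has one column per unique name, with col2 holding the row-wise mean of the two original col2 columns.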

MultiIndex (multilevel) column names from Dataframe rows

I have a rather messy dataframe in which I need to assign the first 3 rows as multilevel column names.
This is my dataframe, and I need index 3, 4 and 5 to be my multiindex column names.
For example, 'MINERAL TOTAL' should be level 0 until the next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas will read in the top row as the sole header row. You can pass the header argument into pandas.read_excel() that indicates how many rows are to be used as headers. This can be either an int, or list of ints. See the pandas.read_excel() documentation for more information.
You mentioned that you are unable to use pandas.read_excel(). However, if you already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you need to build an array of the header rows, which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is this includes the "NaN" values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    # Replace each missing value with the last non-missing value before it
    return pd.Series(iterable).ffill().to_list()
zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())
array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
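Put together, the steps above can be sketched end to end as follows. The raw frame here is hypothetical: the labels 'OTHER' and 'a' through 'd' and all data values are invented for illustration, and n_header, levels and body are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical raw frame: three header rows followed by two data rows
raw = pd.DataFrame([
    ["MINERAL TOTAL", np.nan, np.nan, "OTHER"],
    ["TRATAMIENTO (ts)", np.nan, "LEY Cu(%)", "LEY Cu(%)"],
    ["a", "b", "c", "d"],
    [1, 2, 3, 4],
    [5, 6, 7, 8],
])

n_header = 3
# Forward-fill each header row so blank cells inherit the label to their left
levels = [raw.iloc[i].ffill().to_list() for i in range(n_header)]

# Keep only the data rows, renumber them, and attach the MultiIndex columns
body = raw.iloc[n_header:].reset_index(drop=True)
body.columns = pd.MultiIndex.from_arrays(levels)
```

Because the header rows and the data shared columns in raw, the data columns of body will likely have object dtype; converting them afterwards (for example with pd.to_numeric per column) may be needed.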
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.ffill().set_index([3, 4, 5]).T

How to create binary representations of words in pandas column?

I have a column which contains lists of variable sizes. The lists contain a limited amount of short text values. Around 60 unique values all together.
0 ["AC","BB"]
1 ["AD","CB", "FF"]
2 ["AA","CC"]
3 ["CA","BB"]
4 ["AA"]
I want to turn these values into columns in my DataFrame, where each column holds 1 if the value is present in that row and 0 if not.
I know I could expand the lists, call unique, and set those values as new columns. But after that I don't know what to do.
Here's one way:
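df = pd.get_dummies(df.explode('val')).groupby(level=0).sum()
NOTE: Grouping on level=0 groups rows by their index label. explode repeats the original index label for every list element, so summing per label collapses the exploded rows back into one row per original row. (The older spelling .sum(level=0) was removed in pandas 2.0.)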
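As a minimal sketch of the approach on the sample data from the question (the column name val is taken from the answer; exploded and dummies are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "val": [["AC", "BB"], ["AD", "CB", "FF"], ["AA", "CC"], ["CA", "BB"], ["AA"]],
})

# One row per list element; the original row label is repeated for each element
exploded = df["val"].explode()

# One indicator column per unique value, then collapse back to one row per label
dummies = pd.get_dummies(exploded).groupby(level=0).sum().astype(int)
```

Exploding the Series rather than the whole DataFrame keeps the new column names equal to the raw values (no val_ prefix). If a list could contain the same value twice, sum() would produce a 2; using max() instead would keep the result strictly binary.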

How to get index number for a row meeting specific condition

I am curious to know how to grab index number off of a dataframe that's meeting a specific condition. I've been playing with pandas.Index.get_loc, but no luck.
I've loaded a csv file, and it's structured in a way that has 1000+ rows with all column values filled in, but in the middle there is one completely empty row, and the data starts again. I wanted to get the index # of the row, so I can remove/delete all the subsequent rows that come after the empty row.
This is the way I identified the empty row, df[df["ColumnA"] ==None], but no luck in getting the row index number for that row. Please help!
What you most likely want is pd.DataFrame.dropna:
Return object with labels on given axis omitted where alternately any
or all of the data are missing
If the row is empty, you can simply do this:
df = df.dropna(how='all')
If you want to find indices of null rows, you can use pd.DataFrame.isnull:
res = df[df.isnull().all(axis=1)].index
To remove the first empty row and every row that comes after it:
df = df[df.index < res[0]]
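A small self-contained sketch of this truncation, assuming made-up data (ColumnB and all values are hypothetical). It also guards against the case where no empty row exists, since indexing an empty result with [0] would raise an IndexError. Note that missing values loaded from a CSV are stored as NaN, which does not compare equal to None, which is likely why the == None test in the question found nothing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one completely empty row in the middle
df = pd.DataFrame({
    "ColumnA": [1.0, 2.0, np.nan, 4.0],
    "ColumnB": ["x", "y", np.nan, "z"],
})

# Index labels of rows where every column is missing
empty_rows = df.index[df.isnull().all(axis=1)]

if len(empty_rows) > 0:
    # Keep only the rows that come before the first empty row
    truncated = df.loc[df.index < empty_rows[0]]
else:
    truncated = df
```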

Reindex a dataframe with duplicate index values

So I imported and merged 4 csv's into one dataframe called data. However, upon inspecting the dataframe's index with:
index_series = pd.Series(data.index.values)
index_series.value_counts()
I see that multiple index entries have 4 counts. I want to completely reindex the data dataframe so each row now has a unique index value. I tried:
data.reindex(np.arange(len(data)))
which gave the error "ValueError: cannot reindex from a duplicate axis." A Google search leads me to think this error occurs because up to 4 rows share the same index value. Any idea how I can do this reindexing without dropping any rows? I don't particularly care about the order of the rows either, as I can always sort it.
UPDATE:
So in the end I did find a way to reindex like I wanted.
data['index'] = np.arange(len(data))
data = data.set_index('index')
As I understand it, I just added a new column called 'index' to my data frame, and then set that column as my index.
As for my csv's, they were the four csv's under "download loan data" on this page of Lending Club loan stats.
It's pretty easy to replicate your error with this sample data:
In [92]: data = pd.DataFrame( [33,55,88,22], columns=['x'], index=[0,0,1,2] )
In [93]: data.index.is_unique
Out[93]: False
In [94]: data.reindex(np.arange(len(data))) # same error message
The problem is because reindex requires unique index values. In this case, you don't want to preserve the old index values, you merely want new index values that are unique. The easiest way to do that is:
In [95]: data.reset_index(drop=True)
Out[95]:
    x
0  33
1  55
2  88
3  22
Note that you can leave off drop=True if you want to retain the old index values; they are then kept as a regular column.
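Since the duplicates here came from combining several CSVs, the duplicate index can also be avoided at the point where the frames are combined. The sketch below assumes the pieces were stacked with pd.concat (the question does not say how they were merged); part1 and part2 are hypothetical stand-ins for the loaded files.

```python
import pandas as pd

# Hypothetical stand-ins for two of the loaded CSVs
part1 = pd.DataFrame({"x": [33, 55]})
part2 = pd.DataFrame({"x": [88, 22]})

# Plain concat keeps each frame's own labels, producing duplicates (0, 1, 0, 1)
dup = pd.concat([part1, part2])

# ignore_index=True assigns a fresh 0..n-1 index while concatenating
data = pd.concat([part1, part2], ignore_index=True)

# reset_index without drop=True keeps the old labels in a column named "index"
kept = dup.reset_index()
```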
