I have a pandas dataframe df. In every column eventually the values are '-' until the end of the dataframe. I would like to find the final row where there is no '-' value. How can I do that?
df.isin(['-'])
gives me a dataframe full of Trues and Falses. So I want the last row that only has False in it.
Since df.isin(['-']) is element-wise, first reduce it to one Boolean per row with any(axis=1). Rows that contain a '-' anywhere are flagged by:
df.isin(['-']).any(axis=1)
To get the last row that contains no '-' at all, negate that row mask with the ~ (not) operator and pick the final row with tail(1):
df[~df.isin(['-']).any(axis=1)].tail(1)
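A runnable sketch of the idea on a toy frame (the data and column names are invented for illustration):

```python
import pandas as pd

# Toy frame: '-' placeholders appear toward the end of each column.
df = pd.DataFrame({
    "a": [1, 2, 3, "-", "-"],
    "b": [4, 5, 6, 7, "-"],
})

# One Boolean per row: True if the row contains no '-' anywhere.
clean = df[~df.isin(["-"]).any(axis=1)]

# The final fully-clean row (here the row at index 2).
last_clean = clean.tail(1)
print(last_clean)
```

Note that a row like index 3 above (one real value, one '-') is also excluded, since any(axis=1) flags a row as soon as a single '-' appears in it.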
Hi, I am dropping duplicates from a dataframe based on one column, i.e. "ID". So far I have been dropping the duplicates and keeping the first occurrence, but I want to keep the first (top) two occurrences instead of only one, so I can compare the values of the first two rows of another column, "similarity_score".
data_2 = data.sort_values('similarity_score' , ascending = False)
data_2.drop_duplicates(subset=['ID'], keep='first').reset_index()
Let us sort the values, then do groupby + head:
data.sort_values('similarity_score', ascending=False).groupby('ID').head(2)
Alternatively, you can use groupby + nlargest, which gives the top two scores per ID (note this returns only the 'similarity_score' column, not the full rows):
data.groupby('ID')['similarity_score'].nlargest(2).droplevel(1)
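A small self-contained sketch of the sort + groupby + head approach, using made-up data with the question's column names:

```python
import pandas as pd

# Illustrative data: three rows for ID 'x', one for 'y'.
data = pd.DataFrame({
    "ID": ["x", "x", "x", "y"],
    "similarity_score": [0.2, 0.9, 0.5, 0.7],
})

# Sort descending by score, then keep the top two rows per ID.
top2 = (data.sort_values("similarity_score", ascending=False)
            .groupby("ID")
            .head(2))
print(top2)
```

head(2) keeps whole rows, so every other column survives, which is what you need to compare the two top-scoring rows side by side.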
I have the following condition which eliminates the last row of a group if that last row does not equal 'no' for the column outcome:
m1 = df.groupby(['id'])['outcome'].tail(1) != 'no'
I then use this condition to drop these rows from the dataframe:
df = df.drop(m1[m1].index)
However I do not know how to do the opposite and instead of dropping these rows from the original df, extract entire rows that satisfy the m1 condition. Any suggestions?
From the comments:
df.loc[m1[m1].index, :]
will work.
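As a toy illustration (invented data), the same mask index can either drop or extract those last-of-group rows:

```python
import pandas as pd

# Toy frame: two groups; only group 1 ends with something other than 'no'.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "outcome": ["no", "yes", "no", "no"],
})

# True where the last row of a group has outcome != 'no'.
m1 = df.groupby("id")["outcome"].tail(1) != "no"

# Dropping those rows (the original approach)...
dropped = df.drop(m1[m1].index)

# ...versus extracting exactly those rows instead.
extracted = df.loc[m1[m1].index, :]
print(extracted)
```

m1 is indexed by the original row labels (groupby tail keeps them), which is why m1[m1].index plugs straight into both drop and loc.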
I have a data frame DF in Python and I want to filter its rows based on 2 columns.
In particular, I want to remove the rows where orderdate is earlier than the startdate
How can I reverse/negate the condition inside the following code to achieve what I want?
DF = DF.loc[DF['orderdate']<DF['startdate']]
I could rewrite the condition as below, but that would also drop rows where either date is NaT, and I want to keep those:
DF = DF.loc[DF['orderdate']>=DF['startdate']]
Inserting ~ in front of the parenthesized condition negates it: rows that satisfy the original condition are removed, and every other row, including rows where the comparison is False only because of a NaT, is kept.
DF = DF.loc[~(DF['orderdate']<DF['startdate'])]
1 - loc compares the 'orderdate' column with the 'startdate' column row by row. Where the condition is true, it returns the row labels, which are stored in the ids index.
2 - The drop method deletes those rows from the dataframe. Its parameters are the index of row labels to remove and inplace=True, which makes the operation modify the dataframe itself; with inplace=False it returns a modified copy instead.
# Get index labels of rows where orderdate >= startdate
ids = DF.loc[DF['orderdate'] >= DF['startdate']].index
# Delete these row indexes from dataFrame
DF.drop(ids, inplace=True)
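A minimal sketch on invented dates showing why the ~ version keeps the NaT rows while the >= version would not:

```python
import pandas as pd

# Illustrative frame: one valid-late order, one early order, one NaT order.
DF = pd.DataFrame({
    "orderdate": pd.to_datetime(["2021-01-05", "2021-01-01", None]),
    "startdate": pd.to_datetime(["2021-01-03", "2021-01-02", "2021-01-01"]),
})

# Any comparison against NaT is False, so (orderdate < startdate) is False
# for the NaT row, and negating with ~ keeps it.
kept = DF.loc[~(DF["orderdate"] < DF["startdate"])]
print(kept)
```

Only the row where orderdate really is earlier than startdate is removed; the NaT row survives, which is exactly the behavior the >= rewrite fails to give.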
I'm trying to clean an excel file that has some random formatting. The file has blank rows at the top, with the actual column headings at row 8. I've gotten rid of the blank rows, and now want to use the row 8 string as the true column headings in the dataframe.
I use this code to get the position of the column headings by searching for the string 'Destination' in the whole dataframe, and then take the location of the True value in the Boolean mask to get the list for renaming the column headers:
boolmsk=df.apply(lambda row: row.astype(str).str.contains('Destination').any(), axis=1)
print(boolmsk)
hdrindex=boolmsk.index[boolmsk == True].tolist()
print(hdrindex)
hdrstr=df.loc[7]
print(hdrstr)
df2=df.rename(columns=hdrstr)
However, when I try to use hdrindex as a variable, I get errors when the second dataframe is created (i.e. when I try to use hdrstr to replace the column headings):
boolmsk=df.apply(lambda row: row.astype(str).str.contains('Destination').any(), axis=1)
print(boolmsk)
hdrindex=boolmsk.index[boolmsk == True].tolist()
print(hdrindex)
hdrstr=df.loc[hdrindex]
print(hdrstr)
df2=df.rename(columns=hdrstr)
How do I use a variable to specify an index, so that the resulting list can be used as column headings?
I assume the indicator of the actual header row in your dataframe is the string "Destination". Let's find where it is:
start_tag = df.eq("Destination").any(axis=1)
We'll keep the index of the first occurrence of "Destination" for later use:
start_row = df.loc[start_tag].index.min()
Using index number we will get list of values in the "header" row:
new_col_names = df.iloc[start_row].values.tolist()
And here we can assign new column names to dataframe:
df.columns = new_col_names
From here you can play with new dataframe, actual column names and proper indexing:
df2 = df.iloc[start_row+1:].reset_index(drop=True)
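Putting the steps together on a toy frame (the values are invented; the marker string "Destination" comes from the question):

```python
import pandas as pd

# Toy frame mimicking an Excel sheet with junk rows above the real header.
df = pd.DataFrame([
    ["junk", None],
    [None, None],
    ["Destination", "Count"],   # the real header row
    ["Berlin", 3],
    ["Paris", 5],
])

# Locate the row containing the marker string.
start_tag = df.eq("Destination").any(axis=1)
start_row = df.loc[start_tag].index.min()

# Promote that row to column headers, then keep only the data below it.
df.columns = df.iloc[start_row].values.tolist()
df2 = df.iloc[start_row + 1:].reset_index(drop=True)
print(df2)
```

This relies on the default RangeIndex, so the label returned by index.min() and the position expected by iloc coincide; with a custom index you would need get_loc or a positional lookup instead.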
I am curious to know how to grab the index number of a dataframe row that meets a specific condition. I've been playing with pandas.Index.get_loc, but no luck.
I've loaded a csv file, and it's structured in a way that has 1000+ rows with all column values filled in, but in the middle there is one completely empty row, and the data starts again. I wanted to get the index # of the row, so I can remove/delete all the subsequent rows that come after the empty row.
This is the way I identified the empty row, df[df["ColumnA"] ==None], but no luck in getting the row index number for that row. Please help!
What you most likely want is pd.DataFrame.dropna:
Return object with labels on given axis omitted where alternately any
or all of the data are missing
If the row is empty, you can simply do this:
df = df.dropna(how='all')
If you want to find indices of null rows, you can use pd.DataFrame.isnull:
res = df[df.isnull().all(axis=1)].index
To remove rows with indices greater than the first empty row:
df = df[df.index < res[0]]
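A small end-to-end sketch with invented data, combining the null-row lookup with the truncation step:

```python
import numpy as np
import pandas as pd

# Toy frame with one completely empty row in the middle; everything
# after it is data we want to discard.
df = pd.DataFrame({
    "ColumnA": ["a", "b", np.nan, "stale", "stale"],
    "ColumnB": [1, 2, np.nan, 9, 9],
})

# Index labels of every all-null row.
res = df[df.isnull().all(axis=1)].index

# Keep only the rows above the first empty row.
df = df[df.index < res[0]]
print(df)
```

Note the comparison df["ColumnA"] == None from the question never matches: missing values load as NaN, and NaN does not compare equal to anything, which is why isnull() is the right test here.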