How can I drop rows in pandas based on a condition? - python

I'm trying to drop some rows in a pandas data frame, but I'm getting this error: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
I have a list of desired items that I want to stay in the Data Frame, so I wrote this:
import sys
import pandas as pd

biog = sys.argv[1]
df = pd.read_csv(biog, sep='\t')
desired = ['Affinity Capture-Luminescence', 'Affinity Capture-MS', 'Affinity Capture-Western', 'Co-crystal Structure', 'Far Western', 'FRET', 'PCA', 'Reconstituted Complex']
new_df = df[['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM']]
for i in desired:
    print(i)
    new_df.drop(new_df[new_df.EXPERIMENTAL_SYSTEM != i].index, inplace=True)
print(new_df)
It works if I apply a single condition at a time, but once the for loop is added it doesn't work anymore.
I didn't include the data here because it is too large; I hope this is enough.
Thanks for the help.

You can keep the rows where a column's value is in a list of values with isin; no loop is needed. (The loop version fails because each iteration drops every row whose EXPERIMENTAL_SYSTEM differs from the current item, so after the second iteration nothing is left.)
new_df = new_df[new_df['EXPERIMENTAL_SYSTEM'].isin(desired)]
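A minimal, self-contained sketch of the isin approach; the column values here are made up, since the original data isn't posted:

```python
import pandas as pd

# Toy stand-in for the real data (these values are hypothetical).
df = pd.DataFrame({
    'OFFICIAL_SYMBOL_A': ['a1', 'a2', 'a3'],
    'OFFICIAL_SYMBOL_B': ['b1', 'b2', 'b3'],
    'EXPERIMENTAL_SYSTEM': ['FRET', 'Two-hybrid', 'PCA'],
})
desired = ['FRET', 'PCA']

# isin builds one boolean mask; rows matching any desired value are kept.
new_df = df[df['EXPERIMENTAL_SYSTEM'].isin(desired)]
print(new_df)  # only the FRET and PCA rows remain
```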

Related

How to fix autogenerated index in dataframe by real index after getting data from pd.read_html

I am not able to find out how to set my dataframe's column index properly.
I tried some methods but could not find the right one.
import pandas as pd
df = pd.read_html('sbi.html')
data = df[1]
I want the second row, the one containing "Narration", as my column index.
Set header parameter to 1:
data = pd.read_html('sbi.html', header=1)[0]
Or use skiprows parameter:
data = pd.read_html('sbi.html', skiprows=1)[0]
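The header parameter behaves the same way across the pandas readers; a small sketch with read_csv and made-up data shows the effect (read_html with header=1 acts analogously on the HTML table):

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for the table: the first line is junk,
# the second line holds the real column names.
raw = "junk1,junk2\nNarration,Amount\npayment,100\n"

# header=1 makes pandas use the second row as the column names.
data = pd.read_csv(StringIO(raw), header=1)
print(data.columns.tolist())  # ['Narration', 'Amount']
```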

Need help to solve the Unnamed and to change it in dataframe in pandas

How do I set my column names from "Unnamed" to the first line of my dataframe in Python?
import pandas as pd
df = pd.read_excel('example.xls', 'Day_Report', index_col=None, skipfooter=31)
df = df.dropna(how='all',axis=1)
df = df.dropna(how='all')
df = df.drop(2)
To set the column names (assuming that's what you mean by "indexes") to the first row, you can use
df.columns = df.loc[0, :].values
Following that, if you want to drop the first row, you can use
df.drop(0, inplace=True)
Edit
As coldspeed correctly notes below, if the source of this is reading a CSV, then adding the skiprows=1 parameter is much better.
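A runnable sketch of the row-promotion approach on a made-up frame (the real file isn't shown, so the data here is hypothetical):

```python
import pandas as pd

# Hypothetical frame whose real header landed in row 0,
# leaving autogenerated column labels.
df = pd.DataFrame([['name', 'age'], ['alice', 30], ['bob', 25]])

# Promote the first row to column names, then drop it and reindex.
df.columns = df.loc[0, :].values
df.drop(0, inplace=True)
df.reset_index(drop=True, inplace=True)
print(df.columns.tolist())  # ['name', 'age']
```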

Why is my code not filtering, and still printing everything in the columns rather than what it should be filtering by?

This is my first post on here, so please excuse any errors in my posting.
Here is my code. I am trying to filter and print this data, but when the script runs it prints all of the column headers, which is not what I want. Any tips?
import pandas as pd
data = pd.read_csv("/Users/andrewschaper/Desktop/EQR_Data/EQR_Transactions_1.csv", low_memory=False)
print("Total rows: {0}".format(len(data)))
print(list(data))
df = pd.DataFrame(data)
df.sort_values(by=['Filing_Quarter','product_name','time_zone','increment_name'],ascending=[True, True, True, True])
df.filter(items=['RECID', 'transaction_unique_id', 'ferc_tariff_reference', 'contract_service_agreement', 'transaction_unique_identifier', 'transaction_begin_date', 'transaction_end_date', 'trade_date', 'exchange_brokerage_service', 'point_of_delivery_specific_location', 'class_name', 'term_name', 'increment_peaking_name'])
print(df)
You don't need df = pd.DataFrame(data) at all, since read_csv already returns a DataFrame. I think this will do:
# List of columns you want.
items = [x, y, z]
# Print the filtered dataframe.
df[items]
Keep in mind that df[items] only prints your filtered dataframe; it does not drop the other columns from df. If that's what you want, assign it: filtered_df = df[items]. Also note that sort_values and filter return new DataFrames rather than modifying df in place, so in your code their results are discarded because they are never assigned.
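A short sketch of the assign-back pattern, using a made-up stand-in for the EQR data (column names shortened):

```python
import pandas as pd

# Hypothetical stand-in for the EQR transactions file.
df = pd.DataFrame({
    'RECID': [2, 1, 3],
    'trade_date': ['2020-02', '2020-01', '2020-03'],
    'extra_col': ['x', 'y', 'z'],
})

# sort_values and column selection both return new frames;
# keep the results by assigning them.
items = ['RECID', 'trade_date']
filtered_df = df.sort_values(by='RECID')[items]
print(filtered_df)  # sorted by RECID, extra_col gone
```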

pandas: subtracting two columns and saving result as an absolute

I have a csv file opened in pandas and a new DataFrame I'm creating. There's a column I need to create (the two last lines, commented out): the absolute value of the difference of two columns. I've tried a number of ideas, and all of them raise an error.
import pandas as pd
import numpy as np
df = pd.read_csv(filename_read)
ids = df['id']
oosDF = pd.DataFrame()
oosDF['id'] = ids
oosDF['pred'] = pred
oosDF['y'] = df['target']
#oosDF['diff'] = oosdF['pred'] - oosDF['y']
#oosDF['diff'] = oosDF.abs()
I think you need a new DataFrame from a subset (column names in double []), rename the column, and then take the absolute value of the difference of the columns:
oosDF = df[['id', 'pred', 'target']].rename(columns={'target': 'y'})
oosDF['diff'] = (oosDF['pred'] - oosDF['y']).abs()
In your first commented line, you have oosdF instead of oosDF.
In your second commented line, you're applying abs() to the whole dataframe. It should be applied to the difference instead: oosDF['diff'] = (oosDF['pred'] - oosDF['y']).abs()
Hope this helps!
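A minimal sketch of the fix, with made-up numbers:

```python
import pandas as pd

# Hypothetical predictions and targets.
oosDF = pd.DataFrame({'id': [1, 2], 'pred': [3.0, 1.0], 'y': [1.0, 4.0]})

# Subtract the columns, then take the absolute value of the result.
oosDF['diff'] = (oosDF['pred'] - oosDF['y']).abs()
print(oosDF['diff'].tolist())  # [2.0, 3.0]
```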

How to compare two CSV files and get the difference?

I have two CSV files,
a1.csv
city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
Aguila,Arizona,http://www.co.apache.az.us/planning-and-zoning-division/zoning-ordinances/
a2.csv
city,state,link
Aguila,Arizona,http://www.co.apache.az.us
I want to get the difference.
Here is my attempt:
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
mask = a.isin(b.to_dict(orient='list'))
# Reverse the mask and remove null rows.
# Upside is that index of original rows that
# are now gone are preserved (see result).
c = a[~mask].dropna()
print(c)
Expected Output:
city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
But instead of the expected rows I am getting an empty result:
Empty DataFrame
Columns: [city, state, link]
Index: []
I want to compare based on the first two columns (city and state); if they match, remove the row.
You can use pandas to read in two files, join them and remove all duplicate rows:
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
ab = pd.concat([a, b], axis=0)
ab = ab.drop_duplicates(keep=False)  # returns a new DataFrame unless inplace=True
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
First, concatenate the DataFrames, then drop the duplicates while still keeping the first one. Then reset the index to keep it consistent.
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a,b], axis=0)
c.drop_duplicates(keep='first', inplace=True) # Set keep to False if you don't want any
# of the duplicates at all
c.reset_index(drop=True, inplace=True)
print(c)
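A self-contained sketch of the concat + drop_duplicates diff. The links are shortened placeholders, and the overlapping row is made an exact full-row match; note that drop_duplicates compares whole rows, so rows that differ only in link would not be removed:

```python
import pandas as pd
from io import StringIO

# Shortened stand-ins for a1.csv and a2.csv; the shared row
# matches exactly across both files.
a1 = "city,state,link\nAguila,Arizona,link1\nAkChin,Arizona,link2\nAguila,Arizona,link3\n"
a2 = "city,state,link\nAguila,Arizona,link3\n"
a = pd.read_csv(StringIO(a1))
b = pd.read_csv(StringIO(a2))

# keep=False drops every copy of a duplicated row, leaving
# only the rows unique to one file.
c = pd.concat([a, b]).drop_duplicates(keep=False).reset_index(drop=True)
print(c)  # the link1 and link2 rows
```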
