delete string in a pandas dataframe - python

I have a pandas DataFrame called df with data.
I have a second pandas DataFrame (called df_outlier) containing only some keys (which also exist in df), and I want to remove those keys from df.
I was looking for something like the following - but that might not be the right approach. The key contains alphanumeric values (letters and numbers), so it is not an int.
clean_df = (df['ID'] - df_outlier['ID'])
Any ideas? Thanks.

To filter a df using multiple values from another df, we can use isin. This returns a boolean mask that is True for the rows whose values exist in the passed-in list/Series. To filter out those values, we invert the mask with the negation operator ~:
clean_df = df[~df['ID'].isin(df_outlier['ID'])]
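A minimal runnable sketch, with made-up alphanumeric IDs:
import pandas as pd

# Hypothetical data: alphanumeric IDs, some of which are outliers.
df = pd.DataFrame({'ID': ['A1', 'B2', 'C3', 'D4'], 'value': [10, 20, 30, 40]})
df_outlier = pd.DataFrame({'ID': ['B2', 'D4']})

# isin builds a boolean mask; ~ inverts it so we keep only non-outlier rows.
clean_df = df[~df['ID'].isin(df_outlier['ID'])]
print(clean_df)
#    ID  value
# 0  A1     10
# 2  C3     30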

Related

Map Value to Specific Row and Column - Python Pandas

I have a data set where I want to match a row by index and change the value of a column within that row.
I have looked at map and loc, and I have been able to locate the data using df.loc, but that filters the data down; all I want to do is change the value in a column of that row when the row is found.
What is the best approach? My original post can be found here:
Original post
It's simple to do in Excel, but I'm struggling with Pandas.
Edit:
I have this so far, which seems to work, but each Total cell ends up holding a long list of numbers along with dtype: int64 instead of a single value:
import pandas as pd
df = pd.read_csv(r'C:\Users\david\Documents\test.csv')
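# The dict below maps the code 2.1 to an entire Series, so map() puts that
# whole Series into every cell where Code == 2.1 - hence the extra numbers.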
multiply = {2.1: df['Rate'] * df['Quantity']}
df['Total'] = df['Code'].map(multiply)
df.head()
How do I get around this?
The pandas method mask is likely a good option here. Mask takes two main arguments: a condition and something with which to replace values that meet that condition.
If you're trying to replace values with a formula that draws on values from multiple dataframe columns, you'll also want to pass in an additional axis argument.
The condition: this would be something like, for instance:
df['Code'] == 2.1
The replacement value: this can be a single value, a Series/DataFrame, or a function/callable. For your purposes, a Series computed from the other columns works well:
df['Rate'] * df['Quantity']
The axis: because you're passing aligned values rather than a scalar as the replacement, you can tell mask() how to align those values. It might look something like this:
axis=0
So all together, the code would read like this:
df['Total'] = df['Code'].mask(
    df['Code'] == 2.1,
    df['Rate'] * df['Quantity'],
    axis=0
)
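For context, a self-contained sketch with made-up Code/Rate/Quantity values (the original test.csv isn't available):
import pandas as pd

# Hypothetical data standing in for test.csv.
df = pd.DataFrame({
    'Code': [2.1, 3.0, 2.1],
    'Rate': [5, 7, 9],
    'Quantity': [2, 4, 6],
})

# Where Code == 2.1, Total becomes Rate * Quantity; elsewhere it keeps
# the original Code value, since mask only replaces where the condition is True.
df['Total'] = df['Code'].mask(
    df['Code'] == 2.1,
    df['Rate'] * df['Quantity'],
    axis=0
)
print(df)
#    Code  Rate  Quantity  Total
# 0   2.1     5         2   10.0
# 1   3.0     7         4    3.0
# 2   2.1     9         6   54.0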

How to check if only the integer portion of the elements in two pandas data columns match?

I checked the answer here, but it doesn't work for me:
How to get the integer portion of a float column in pandas
This is because I need to write further conditional statements that will operate on the exact values in the columns and the corresponding values in other columns.
So basically, I am hoping that for my two dataframes df1 and df2 I can form a concatenated dataframe using
dfn_c = pd.concat([df1, df2], axis=1)
and then write something like
dfn_cn = dfn_c.loc[df1.X1.isin(df2['X2'])]
where X1 and X2 are the respective columns. The above line of course makes an exact comparison, whereas I want to compare only the integer portion and then form the new dataframe.
IIUC, try casting both columns to int, then compare:
dfn_cn = dfn_c.loc[df1['X1'].astype(int).isin(df2['X2'].astype(int))]
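A small illustration with hypothetical values. Note that astype(int) truncates toward zero, so for negative floats this differs from flooring:
import pandas as pd

df1 = pd.DataFrame({'X1': [1.25, 2.70, 5.10]})
df2 = pd.DataFrame({'X2': [1.90, 3.30]})
dfn_c = pd.concat([df1, df2], axis=1)

# 1.25 -> 1 matches 1.90 -> 1; the other integer parts (2, 5) match nothing.
dfn_cn = dfn_c.loc[df1['X1'].astype(int).isin(df2['X2'].astype(int))]
print(dfn_cn)
#      X1   X2
# 0  1.25  1.9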

Delete rows with a certain value in Python and Pandas

I want to delete rows that have certain values. The values I want to delete contain a "+" and look like this:
cooperative+parallel
passive+prosocial
My dataset consists of 900,000 rows, and about 2,000 values have the problem I mentioned.
I want code something like this:
df = df[df.columnname != '+']
The above is for one column (it's not working well), but I would also like an example for the whole dataset.
I would prefer a solution in Pandas.
Many thanks
Use Series.str.contains with the mask inverted by ~, escaping the + because it is a special regex character. Apply it to all object columns selected with DataFrame.select_dtypes via DataFrame.apply, and use DataFrame.any to test for at least one match per row:
df1 = df[~df.select_dtypes(object).apply(lambda x: x.str.contains(r'\+')).any(axis=1)]
Or use regex=False, in which case the + needs no escaping:
df1 = df[~df.select_dtypes(object).apply(lambda x: x.str.contains('+', regex=False)).any(axis=1)]
For a single column:
df = df[~df['columnname'].str.contains('+', regex=False)]
Documentation is here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
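A quick sketch on toy data (the column names here are made up):
import pandas as pd

df = pd.DataFrame({
    'behavior': ['cooperative+parallel', 'solitary', 'passive+prosocial'],
    'score': [1, 2, 3],
})

# Drop any row where any object column contains a literal '+'.
df1 = df[~df.select_dtypes(object).apply(lambda x: x.str.contains('+', regex=False)).any(axis=1)]
print(df1)
#    behavior  score
# 1  solitary      2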

Pandas: How to replace zero values in a column with the mean of that column, for all columns with zero values

I have a dataframe with multiple values that are zero.
I want to replace the zero values with the mean of their column, without repeating code.
I have columns called runtime, budget, and revenue that all contain zeros, and I want to replace those zero values with the mean of each column.
I have tried to do it one column at a time, like this:
print(df['budget'].mean())
# 14624286.0643
df['budget'] = df['budget'].replace(0, 14624286.0643)
Is there a way to write a function so that I don't have to repeat this code for each column with zero values?
Since this is a pandas DataFrame, we can use mask to turn every 0 into np.nan, then fillna with the column means:
df = df.mask(df == 0).fillna(df.mean())
The same can be achieved directly with the replace method, without fillna:
df.replace(0, df.mean(axis=0), inplace=True)
Method info:
Replace values given in "to_replace" with "value". Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
How about iterating through all columns and replacing them?
for col in df.columns:
    val = df[col].mean()
    df[col] = df[col].replace(0, val)
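A minimal sketch on made-up data. Note that all three approaches above compute the mean with the zeros still included (matching the question's own calculation); if you want the mean of only the nonzero values, mask first and take the mean of the result:
import pandas as pd

df = pd.DataFrame({'runtime': [90, 0, 110], 'budget': [0, 50, 100]})

# Mean including zeros, as in the answers above.
print(df.mask(df == 0).fillna(df.mean()))

# Mean of the nonzero values only: mask first, then take the mean.
masked = df.mask(df == 0)
print(masked.fillna(masked.mean()))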

Validating pandas dataframe columns

I have a dataframe with columns as below -
u'wellthie_issuer_identifier', u'issuer_name', u'service_area_identifier', u'hios_plan_identifier', u'plan_year', u'type'
I need to validate the values in each column and end up with a dataframe that is valid.
For example, I need to check that the plan_year column satisfies the validation below:
presence: true, numericality: true, length: { is: 4 }
The hios_plan_identifier column must satisfy the regex below:
format: /\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\z/,
presence: true, length: { minimum: 10 },
The type column must contain one of:
in: ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan']
There are a lot of columns I need to validate; I have tried to give example data above.
I am able to check the regex with str.contains(r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z', regex=True)
Similarly, I can check the other validations individually. I am confused about how to put all the validations together. Should I put them all in an if block with and conditions? Is there an easy way to validate the dataframe columns? Need help here.
There are multiple pandas functions you could make use of. Basically, the syntax to filter your dataframe by content is:
df = df[(condition1) & (condition2) & ...] # filter the df and assign to the same df
Specifically for your case, you could replace each condition with one of the following expressions:
df[some_column] == some_value
df[some_column].isin(some_list_of_values) # Checks whether the column value is one of the values in the list
df[some_column].str.contains() # Use it the same way as str.contains()
df[some_column].str.isdigit() # Same usage as str.isdigit(); checks whether the string is all digits (make sure the column type is string first)
df[some_column].str.len() == 4 # Keeps strings with a length of 4
Finally, if you want to reset the index, you could use df = df.reset_index(drop=True) to reset your output df index to 0,1,2,...
Edit: to check for NaN, NaT, or None values, you could use
df[some_column].isnull()
For multiple columns, you could use
df[[col1, col2]].isin(valuelist).all(axis=1)
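Putting the pieces together for the columns in the question, a sketch might look like this (the data is hypothetical, and all columns are assumed to already be strings):
import pandas as pd

# Made-up rows; the third row fails every check.
df = pd.DataFrame({
    'plan_year': ['2019', '2019', '20x9'],
    'hios_plan_identifier': ['12345AB123-01', '12345ABCDE-TMP', 'bad-id'],
    'type': ['MetalPlan', 'DualPlan', 'UnknownPlan'],
})

hios_regex = r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z'
valid_types = ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan',
               'DualPlan', 'MedicaidPlan', 'ChipPlan']

df = df[
    df['plan_year'].notnull()                              # presence
    & df['plan_year'].str.isdigit()                        # numericality
    & (df['plan_year'].str.len() == 4)                     # length is 4
    & df['hios_plan_identifier'].str.contains(hios_regex)  # format regex
    & df['type'].isin(valid_types)                         # allowed values
].reset_index(drop=True)
print(df)  # only the first two rows survive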
