Remove specific values contained in DataFrame - python

I am trying to check whether a specific value is contained anywhere in a certain column of my dataframe. I am using the following code, which should remove rows containing "0.0". However, it seems to be removing rows that do not contain "0.0" as well.
mydataset = mydataset[mydataset['Latitude'].astype(str).str.contains('0.0') == False]
An example of the data follows; the rows highlighted in red are the ones being removed when the above code is applied.

Here is the problem: . is a special character in regex, so you either need regex=False or to escape it with \. For the inverted mask, use ~:
mydataset = mydataset[~mydataset['Latitude'].astype(str).str.contains('0.0', regex=False)]
Or:
mydataset = mydataset[~mydataset['Latitude'].astype(str).str.contains(r'0\.0')]
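A quick way to see the difference on toy values (the data here is invented for the illustration):
import pandas as pd

s = pd.Series(['0.05', '40.3030', '12.5'])
print(s.str.contains('0.0').tolist())               # [True, True, False]  - '.' matches any character, so '030' in '40.3030' hits too
print(s.str.contains('0.0', regex=False).tolist())  # [True, False, False] - only the literal text '0.0' matches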

If you are using a pandas dataframe you can conditionally drop rows from your dataframe in the following manner:
mydataset = mydataset[mydataset['Latitude'].astype(str) != '0.0']
If you are trying to remove all 0 values and not just 0.0, don't convert to string; compare against the number directly and it will drop any zero value, as in the sketch below.
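A minimal sketch of that numeric comparison (assuming Latitude is a numeric column):
# keep only rows whose Latitude is not exactly zero
mydataset = mydataset[mydataset['Latitude'] != 0]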

Delete rows with a certain value in Python and Pandas

I want to delete rows that have certain values. The values that I want to delete contain a "+" and are as follows:
cooperative+parallel
passive+prosocial
My dataset consists of 900,000 rows, and about 2,000 values have the problem I mentioned.
I want the code something like this:
df = df[df.columnname != '+']
The above is for one column (it's not working well), but I would also like an example for the whole dataset.
I prefer the solution in Pandas.
Many thanks
Use Series.str.contains with the mask inverted by ~, and escape the + because it is a special regex character. Apply it to all object columns selected by DataFrame.select_dtypes via DataFrame.apply, with DataFrame.any to test for at least one match per row:
df1 = df[~df.select_dtypes(object).apply(lambda x: x.str.contains(r'\+')).any(axis=1)]
Or use regex=False:
df1 = df[~df.select_dtypes(object).apply(lambda x: x.str.contains('+', regex=False)).any(axis=1)]
For a single column only:
df = df[~df['columnname'].str.contains('+', regex=False)]
Documentation is here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
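A quick check on made-up data (the values are invented for the example):
import pandas as pd

df = pd.DataFrame({'columnname': ['cooperative+parallel', 'passive', 'prosocial']})
print(df[~df['columnname'].str.contains('+', regex=False)])
# only the rows without a literal '+' ('passive' and 'prosocial') remain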

How to check split values of a column without changing the dataframe?

I am trying to find rows that hit specific conditions and put a value in the column col.
My current implementation is:
df.loc[~(df['myCol'].isin(myInfo)), 'col'] = 'ok'
In the future, myCol will hold multiple pieces of info, so I need to split the value in myCol without changing the dataframe and check whether any of the split values are in myInfo. If one of them is, the current row should get the value 'ok' in the column col. Is there an elegant way to do this without actually splitting and saving into an extra variable?
Currently, I do not know how the multiple info will be represented (either separated by a character or just concatenated one after one, each consisting of 4 alphanumeric values).
Let's say you need to split on "-" for your myCol column.
sep = '-'
deconcat = df['myCol'].str.split(sep, expand=True)
new_df = df.join(deconcat)
The new_df DataFrame will have the same index as df, so you can work with new_df and then join back to df to filter it however you want.
You can apply the .isin check from your code to each of the new split columns to get your desired result, as in the sketch below.
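Putting it together with the logic from the question (myInfo, col and the ~isin assignment are taken from the question; the '-' separator is the assumption from above):
# flag rows where at least one split part is in myInfo,
# then mirror the original ~isin assignment
hit = deconcat.isin(myInfo).any(axis=1)
df.loc[~hit, 'col'] = 'ok'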
Source: the code is taken from the pyjanitor documentation, which has a built-in function, deconcatenate_column, that does this; see the source code for deconcatenate_column.

I have missing data in my pandas dataframe. How can I tell python not to include it in a new dataframe?

I have a text file mart_export.txt full of two different types of keys that looks like this
Gene stable ID RefSeq match transcript
ENSG00000243959
ENSG00000206698
ENSG00000265684
ENSG00000251990
ENSG00000241552
ENSG00000050767 NM_173465.4
As you can see, most of the right column doesn't have any data, but I am trying to build a new pandas dataframe out of just the indices that have values for both columns. Here is my script so far
#Put the biomart export in a pandas dataframe
mart = pd.read_csv("mart_export.txt", delimiter="\t")
#Create new list of records with Gene Stable Id and RefSeq numbers
d = {'Gene Stable ID': [], 'RefSeq ID': []}
for i in mart:
    if mart['RefSeq match transcript'] != NaN:
        d['Gene Stable ID'].append(mart['Gene stable ID'])
        d['RefSeq ID'].append(mart['RefSeq match transcript'])
In Spyder, the values in the second column that are blank are labeled NaN, but when I try to use this value in my code, I get an error in python that says NaN is not defined. How can I specify to python what a blank looks like?
You can drop rows or columns using dropna() method of pandas DataFrame.
In your case, it would be:
mart.dropna(axis="rows", inplace=True)
You can also drop columns containing NaNs, specify the how argument, and so on; check the dropna docs linked above.
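As a sketch of what the loop was aiming for (column names taken from the question), dropna with a subset keeps only the rows where the RefSeq column is filled:
# keep only rows that have a RefSeq match transcript, then select both columns
valid = mart.dropna(subset=['RefSeq match transcript'])
result = valid[['Gene stable ID', 'RefSeq match transcript']]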
To detect NaN, you may use pd.isna or pd.isnull.
However, mart is DataFrame, so mart['RefSeq match transcript'] is a column.
mart['RefSeq match transcript'] == something will return a Series.
Therefore, a condition like if mart['RefSeq match transcript'] == something will always raise an error (the truth value of a Series is ambiguous), no matter what value you compare against.
You either need to dropna as shown in the other answer, or filter out NaN as below:
mart_noNaN = mart[~mart['RefSeq match transcript'].isna()]
Notice the ~ negation in front of mart.

Validating pandas dataframe columns

I have a dataframe with columns as below -
u'wellthie_issuer_identifier', u'issuer_name', u'service_area_identifier', u'hios_plan_identifier', u'plan_year', u'type'
I need to validate values in each column and finally have a dataframe which is valid.
For example, I need to check that the plan_year column satisfies the validation below:
presence: true, numericality: true, length: { is: 4 }
The hios_plan_identifier column must satisfy the regex below:
format: /\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\z/,
presence: true, length: { minimum: 10 },
The type column must contain one of:
in: ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan']
There are a lot of columns that I need to validate. I have tried to give example data.
I am able to check the regex with str.contains(r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z', regex=True)
Similarly, I can check the other validations individually. I am confused as to how to put all the validations together. Should I put them all in an if with and conditions? Is there an easy way to validate the dataframe columns? Need help here.
There are multiple pandas functions you could use. Basically, the syntax to filter your dataframe by content is:
df = df[(condition1) & (condition2) & ...] # filter the df and assign to the same df
Specifically for your case, you could replace condition with following functions(expressions):
df[some_column] == some_value
df[some_column].isin(some_list_of_values) # This check whether the value of the column is one of the values in the list
df[some_column].str.contains() # You can use it the same as str.contains()
df[some_column].str.isdigit() # Same usage as str.isdigit(), check whether string is all digits, need to make sure column type is string in advance
df[some_column].str.len() == 4 # Filter string with length of 4
Finally, if you want to reset the index, you could use df = df.reset_index(drop=True) to reset your output df index to 0,1,2,...
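For example, a sketch that combines the checks from the question into one mask (the column names and rules come from the post; treat this as a starting point, not a drop-in validator):
valid_types = ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan']
hios_re = r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(-?\d{2})*)\Z'

plan_year = df['plan_year'].astype(str)
mask = (
    plan_year.str.isdigit() & (plan_year.str.len() == 4)          # presence, numericality, length 4
    & df['hios_plan_identifier'].str.contains(hios_re, na=False)  # format regex
    & (df['hios_plan_identifier'].str.len() >= 10)                # minimum length 10
    & df['type'].isin(valid_types)                                # allowed values
)
df = df[mask].reset_index(drop=True)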
Edit: To check for NaN, NaT, None values you could use
df[some_column].isnull()
For multiple columns, you could use
df[[col1, col2]].isin(valuelist).all(axis=1)

Delete string in a pandas dataframe

I have a pandas dataframe df with data.
I have a second pandas dataframe (called df_outlier) with only some keys (that obviously also exist in df), and I want to remove them from df.
I was looking for something like the following - but that might not be the right approach. The key contains alphanumeric values (letters and numbers), so it is not an int.
clean_df = (df['ID'] - df_outlier['ID'])
Any ideas? Thanks.
To filter a df using multiple values from another df, we can use isin; this returns a boolean mask for the rows where the values exist in the passed-in list/Series. To filter out these values, we use the negation operator ~ to invert the mask:
clean_df = df[~df['ID'].isin(df_outlier['ID'])]
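A quick illustration on made-up data (the IDs here are invented):
import pandas as pd

df = pd.DataFrame({'ID': ['a1x9', 'b2y8', 'c3z7', 'd4w6']})
df_outlier = pd.DataFrame({'ID': ['b2y8', 'd4w6']})

clean_df = df[~df['ID'].isin(df_outlier['ID'])]
print(clean_df)   # only the rows with ID 'a1x9' and 'c3z7' remain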
