I want to delete rows that have certain values. The values that I want to delete contain a "+" and are as follows:
cooperative+parallel
passive+prosocial
My dataset consists of 900,000 rows, and about 2,000 values contain the "+" pattern I mentioned.
I want the code to be something like this:
df = df[df.columnname != '+']
The above is for one column (it's not working well), but I would also like an example for the whole dataset.
I prefer the solution in Pandas.
Many thanks
Use Series.str.contains with an inverted mask (~) and escape the +, because it is a special regex character. Apply this to all object columns selected by DataFrame.select_dtypes, and use DataFrame.any to test whether at least one column matches:
df1 = df[~df.select_dtypes(object).apply(lambda x: x.str.contains(r'\+')).any(axis=1)]
Or use regex=False:
df1 = df[~df.select_dtypes(object).apply(lambda x: x.str.contains('+', regex=False)).any(axis=1)]
For a single column only:
df = df[~df['columnname'].str.contains('+', regex=False)]
Documentation is here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
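A minimal, self-contained sketch of the whole-dataset variant, using an invented sample DataFrame (the column names here are placeholders, not from the question):
import pandas as pd

df = pd.DataFrame({
    'behaviour': ['cooperative+parallel', 'passive', 'prosocial'],
    'note': ['ok', 'passive+prosocial', 'ok'],
    'score': [1, 2, 3],
})

# Drop rows where any object (string) column contains a literal '+'
mask = df.select_dtypes(object).apply(lambda x: x.str.contains('+', regex=False)).any(axis=1)
df1 = df[~mask]
print(df1)  # only the 'prosocial' / 'ok' row remains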
I need help cleaning a very large dataframe. One of the columns, "PostingTimeUtc", should contain only dates, but several rows were inserted incorrectly and have strings of text instead. How can I select all the rows where "PostingTimeUtc" has a string instead of a date and drop them?
I'm new to this site and to coding, so please let me know if I'm being vague.
Please remember to add examples, even if they are short.
This may work in your case:
from pandas.api.types import is_datetime64_any_dtype as is_datetime
df[df['column name'].map(is_datetime)]
Here map applies the is_datetime function (returning True or False) to each element, and the resulting Boolean mask is used to filter the dataframe.
Don't forget to assign the result back to df to retain it, as the operation is not done in place:
df = df[df['column name'].map(is_datetime)]
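If map(is_datetime) does not behave as expected on an object column with mixed content, an alternative sketch (assuming the column is literally named 'PostingTimeUtc' as in the question) is to coerce with pd.to_datetime and keep only the rows that parsed:
import pandas as pd

parsed = pd.to_datetime(df['PostingTimeUtc'], errors='coerce')  # unparseable values become NaT
df = df[parsed.notna()]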
I am assuming it's a pandas DataFrame. You can filter rows on the basis of a regex like this:
df[df.column_name.str.contains('your regex here')]
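Applied to the first question above, where rows containing a literal '+' should be dropped (the column name is a placeholder):
df = df[~df.column_name.str.contains(r'\+', na=False)]  # na=False treats missing values as non-matches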
I have a DataFrame that has columns with numbers, but these numbers are represented as strings. I want to find these columns automatically, without specifying which columns should be numeric. How can I do this in pandas?
You can utilise str.contains from pandas on the column names:
>>> df.columns[df.columns.str.contains('.*[0-9].*', regex=True)]
The regex can be modified to accommodate a wide range of patterns you want to search for.
You can first convert using pd.to_numeric (coercing failures to NaN) and then combine_first with the original column:
df['COL_NAME'] = pd.to_numeric(df['COL_NAME'],errors='coerce').combine_first(df['COL_NAME'])
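To find such columns automatically rather than naming one, a sketch along the same lines (the column names here are invented for illustration):
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['x', 'y', 'z'], 'c': [1.0, 2.0, 3.0]})

# Object columns whose every value survives pd.to_numeric are numbers stored as strings
obj_cols = df.select_dtypes(object).columns
numeric_like = [c for c in obj_cols if pd.to_numeric(df[c], errors='coerce').notna().all()]
print(numeric_like)  # ['a']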
I am trying to check if a specific value is contained anywhere in a certain column of my dataframe. I am using the following code, which should clear data containing "0.0". However, it seems like it is clearing data that does not contain "0.0" as well.
mydataset = mydataset[mydataset['Latitude'].astype(str).str.contains('0.0') == False]
Example of the data is as follows; highlighted in red are the rows being removed upon applying the above code.
The problem here is that . is a special character in regex, so you either need regex=False or have to escape it with \. For the inverted mask, use ~:
mydataset = mydataset[~mydataset['Latitude'].astype(str).str.contains('0.0', regex=False)]
Or:
mydataset = mydataset[~mydataset['Latitude'].astype(str).str.contains(r'0\.0')]
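A quick illustration of the difference between the regex and literal interpretations (sample values invented):
import pandas as pd

s = pd.Series(['0.0', '10203.7', '45.6'])
print(s.str.contains('0.0').tolist())               # [True, True, False] -- '.' matches any character, so '020' also matches
print(s.str.contains('0.0', regex=False).tolist())  # [True, False, False] -- literal match only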
If you are using a pandas dataframe, you can conditionally drop rows from your dataframe in the following manner:
mydataset = mydataset[mydataset.Latitude.astype(str) != '0.0']
If you are trying to remove all 0 values and not just 0.0, skip the string conversion and compare numerically; that will drop any 0 value.
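For example, to drop every row whose Latitude is exactly zero:
mydataset = mydataset[mydataset['Latitude'] != 0]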
I am trying to find rows hitting specific conditions and put a value in the column col.
My current implementation is:
df.loc[~(df['myCol'].isin(myInfo)), 'col'] = 'ok'
In the future, myCol will have multiple pieces of info. So I need to split the value in myCol without changing the dataframe and check if any of the split values are in myInfo. If one of them is, the current row should get the value 'ok' in the column col. Is there an elegant way without really splitting and saving in an extra variable?
Currently, I do not know how the multiple pieces of info will be represented (either separated by a character or just concatenated one after another, each consisting of 4 alphanumeric characters).
Let's say you need to split on "-" for your myCol column.
sep = '-'
deconcat = df['myCol'].str.split(sep, expand=True)
new_df = df.join(deconcat)
The new_df DataFrame will have the same index as df, so you can do whatever you need with new_df and then join it back to df to filter it however you want.
You can apply the .isin check from above to each of the new split columns to get your desired result.
Source:
Code taken from the pyjanitor documentation, which has a built-in function, deconcatenate_column, that does this.
Source code for deconcatenate_column
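Putting the pieces together for the original question, a sketch (assuming the parts in myCol are separated by '-', and reusing the myInfo and col names from the question):
# Split each myCol value, test every piece against myInfo,
# and mark rows where none of the pieces match with 'ok'
parts = df['myCol'].str.split('-', expand=True)
hit = parts.isin(myInfo).any(axis=1)
df.loc[~hit, 'col'] = 'ok'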
I have a pandas dataframe with data, df.
I have a second pandas dataframe (called df_outlier) with only some keys (that obviously also exist in df) and I want to remove them from df.
I was looking for something like the following function - but that might not be the right approach. The key contains alphanumeric values - so letters and numbers. So it is not an int.
clean_df = (df['ID'] - df_outlier['ID'])
Any ideas? Thanks.
To filter a df using multiple values from another df, we can use isin. This returns a boolean mask for the rows where the values exist in the passed-in list/Series. In order to filter out these values, we use the negation operator ~ to invert the mask:
clean_df = df[~df['ID'].isin(df_outlier['ID'])]
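A minimal demonstration with invented data:
import pandas as pd

df = pd.DataFrame({'ID': ['a1', 'b2', 'c3', 'd4'], 'val': [10, 20, 30, 40]})
df_outlier = pd.DataFrame({'ID': ['b2', 'd4']})

# Keep only the rows whose ID does not appear in df_outlier
clean_df = df[~df['ID'].isin(df_outlier['ID'])]
print(clean_df)  # rows with IDs 'a1' and 'c3'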