Find columns that contain numbers stored as strings - python

I have a DataFrame that has columns with numbers, but these numbers are represented as strings. I want to find these columns automatically, without specifying which columns should be numeric. How can I do this in pandas?

You can use str.contains from pandas:
>>> df.columns[df.columns.str.contains('.*[0-9].*', regex=True)]
The regex can be modified to accommodate a wide range of patterns you want to search for.
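For example, with made-up column names, this selects every label that contains a digit (note that it inspects the column labels, not the values stored in them):
import pandas as pd
# hypothetical frame whose column labels contain digits
df = pd.DataFrame({'col1': ['1', '2'], 'col2': ['3', '4'], 'name': ['a', 'b']})
print(df.columns[df.columns.str.contains('.*[0-9].*', regex=True)])
# Index(['col1', 'col2'], dtype='object')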

You can first filter using pd.to_numeric and then combine_first with the original column:
df['COL_NAME'] = pd.to_numeric(df['COL_NAME'],errors='coerce').combine_first(df['COL_NAME'])
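If the goal is to find which object columns hold numbers stored as strings, a minimal sketch along these lines (sample data is made up) works:
import pandas as pd
df = pd.DataFrame({'a': ['1', '2', '3'],   # numbers stored as strings
                   'b': ['x', 'y', 'z'],   # genuine text
                   'c': [1.0, 2.0, 3.0]})  # already numeric
# object columns where every value converts cleanly with pd.to_numeric
obj_cols = df.select_dtypes(object).columns
numeric_like = [c for c in obj_cols if pd.to_numeric(df[c], errors='coerce').notna().all()]
print(numeric_like)  # ['a']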

Related

Extract number of ranking position in pandas dataframe

I have a pandas dataframe with a column named ranking_pos. All the rows of this column look like this: #123 of 12,216.
The output I need is only the number of the ranking, so for this example: 123 (as an integer).
How do I extract the number after the # and get rid of the " of 12,216" part?
Currently the type of the column is object; just converting it to integer with .astype() doesn't work because of the other characters.
You can use .str.extract:
df['ranking_pos'].str.extract(r'#(\d+)').astype(int)
or you can use .str.split():
df['ranking_pos'].str.split(' of ').str[0].str.replace('#', '').astype(int)
df.loc[:,"ranking_pos"] =df.loc[:,"ranking_pos"].str.replace("#","").astype(int)

How do I extract numbers from the strings in a pandas column of 'object'?

I have a dataframe named 'x'.
This dataframe holds the size and type of houses (e.g. 35A, 9B, 50C...); it is of type 'object' and contains missing values.
I want to extract only numbers from this dataframe and convert them to numeric type.
What should I do in this case?
I tried the following, but it didn't work:
df['x'] = df['x'].str[0:2]
df['x'] = pd.to_numeric(df['x'])
Output
ValueError: Unable to parse string "9A" at position 3766
I would use str.extract here:
df['x'] = pd.to_numeric(df['x'].str.extract(r'^(\d+)', expand=False))
The challenge with trying to use a pure substring approach is that we don't necessarily know how many characters to take. Regex gets around this problem.
Your attempt assumes that the first two characters of every string in the x column are digits. Unfortunately, you have a row where x is 9A, and the slice "9A" doesn't convert to a numeric value.
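A small sketch with values in the style described (sample data is made up), keeping only the leading digits and letting missing values pass through:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': ['35A', '9B', '50C', np.nan]})
# expand=False returns a Series, which pd.to_numeric accepts; NaN stays NaN
df['x'] = pd.to_numeric(df['x'].str.extract(r'^(\d+)', expand=False))
print(df['x'].tolist())  # [35.0, 9.0, 50.0, nan]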

Delete rows with a certain value in Python and Pandas

I want to delete rows that have certain values. The values that I want to delete contain a "+" and are as follows:
cooperative+parallel
passive+prosocial
My dataset consists of 900,000 rows, and about 2,000 of them contain values like the ones I mentioned.
I want the code something like this:
df = df[df.columnname != '+']
The above is for one column (it's not working well), but I would also like an example for the whole dataset.
I prefer the solution in Pandas.
Many thanks
Use Series.str.contains and invert the mask with ~, escaping + because it is a special regex character. Apply it with DataFrame.apply to all object columns selected by DataFrame.select_dtypes, and test for at least one match per row with DataFrame.any:
df1 = df[~df.select_dtypes(object).apply(lambda x: x.str.contains(r'\+')).any(axis=1)]
Or use regex=False:
df1 = df[~df.select_dtypes(object).apply(lambda x: x.str.contains('+', regex=False)).any(axis=1)]
df = df[~df['columnname'].str.contains('+', regex=False)]
Documentation is here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
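A toy example (column names made up) of dropping any row where at least one text column contains a literal +:
import pandas as pd
df = pd.DataFrame({'columnname': ['cooperative+parallel', 'passive', 'prosocial'],
                   'other': ['a', 'passive+prosocial', 'b']})
# rows where any object column contains '+'
mask = df.select_dtypes(object).apply(lambda x: x.str.contains('+', regex=False)).any(axis=1)
df1 = df[~mask]
print(df1)  # only the row without a '+' anywhere remains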

read row and convert float to integer in pandas

I have a dataframe with multiple rows and columns. One of my columns (let's call it column A) contains a mix of strings, strings combined with integers (e.g. RSE1023), integers only, and floats only. I want to find a way to convert the rows of column A that are floats to integers. Probably with something that can scan through the column in the dataframe, find the rows that are floats, and make them integers?
You could try something like:
df['A'] = df['A'].apply(lambda r: int(r) if isinstance(r, float) else r)
In pandas you do not give datatypes to rows but to columns.
A trick you could use would be to .transpose() the dataframe, turning the rows into columns and vice versa.
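A minimal sketch of the apply approach, assuming a mixed column like the one described (note that a NaN in the column is also a float and would make int(r) raise):
import pandas as pd
df = pd.DataFrame({'A': ['RSE1023', 7, 3.0, 12.0]})
# convert only the float entries, leave strings and ints untouched
df['A'] = df['A'].apply(lambda r: int(r) if isinstance(r, float) else r)
print(df['A'].tolist())  # ['RSE1023', 7, 3, 12]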

Use to_numeric on certain columns only in PANDAS

I have a dataframe with 15 columns. Five of those columns contain numbers, but some of the entries are either blanks or words. I want to convert those to zero.
I am able to convert the entries in one of the columns, but when I try to do that for multiple columns it doesn't work. I tried this for one column:
pd.to_numeric(Tracker_sample['Product1'],errors='coerce').fillna(0)
and it works, but when I try this for multiple columns:
pd.to_numeric(Tracker_sample[['product1','product2','product3','product4','Total']],errors='coerce').fillna(0)
I get the error: arg must be a list, tuple, 1-d array, or Series
I think it is the way I am calling the columns to be fixed. I am new to pandas so any help would be appreciated. Thank you
You can use:
Tracker_sample[['product1','product2','product3','product4','Total']].apply(pd.to_numeric, errors='coerce').fillna(0)
With a for loop?
for col in ['product1', 'product2', 'product3', 'product4', 'Total']:
    Tracker_sample[col] = pd.to_numeric(Tracker_sample[col], errors='coerce').fillna(0)
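A quick sanity check on made-up data with blanks and words mixed into the numeric columns:
import pandas as pd
Tracker_sample = pd.DataFrame({'product1': ['1', '', 'three'],
                               'product2': ['4', '5', ''],
                               'Total': ['5', '5', 'n/a']})
cols = ['product1', 'product2', 'Total']
# coerce non-numeric entries to NaN, then replace them with 0
Tracker_sample[cols] = Tracker_sample[cols].apply(pd.to_numeric, errors='coerce').fillna(0)
print(Tracker_sample)  # blanks and words become 0.0, real numbers survive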
