I have a pandas dataframe with a column named ranking_pos. All the rows of this column look like this: #123 of 12,216.
The output I need is only the number of the ranking, so for this example: 123 (as an integer).
How do I extract the number after the # and get rid of the of 12,216?
Currently the type of the column is object, just converting it to integer with .astype() doesn't work because of the other characters.
You can use .str.extract:
df['ranking_pos'].str.extract(r'#(\d+)', expand=False).astype(int)
or you can use .str.split():
df['ranking_pos'].str.split(' of ').str[0].str.replace('#', '').astype(int)
df["ranking_pos"] = df["ranking_pos"].str.replace(r"#| of .*", "", regex=True).astype(int)
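For reference, here is a minimal end-to-end sketch of the extract approach. The sample rows and the new column name rank are made up for illustration, and the third row assumes the ranking itself may contain a thousands separator, which the question does not confirm.

```python
import pandas as pd

# Hypothetical sample data in the format described in the question.
df = pd.DataFrame(
    {"ranking_pos": ["#123 of 12,216", "#7 of 980", "#4,501 of 12,216"]}
)

# Capture the digits (and optional thousands separators) right after '#'.
# expand=False returns a Series instead of a one-column DataFrame.
rank = df["ranking_pos"].str.extract(r"#([\d,]+)", expand=False)

# Drop the commas, then convert to integers.
df["rank"] = rank.str.replace(",", "", regex=False).astype(int)

print(df["rank"].tolist())
```

If some rows might not match the pattern, extract returns NaN for them and astype(int) will raise, so in that case pd.to_numeric is probably the safer conversion.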
I have a dataframe called "nums" and am trying to find the value of the column "angle" by specifying the values of other columns like this:
nums[(nums['frame']==300)&(nums['tad']==6)]['angl']
When I do so, I do not get a singular number and cannot do calculations on them. What am I doing wrong?
First of all, in general you should use .loc rather than chaining indexers like that:
>>> s = nums.loc[(nums['frame']==300)&(nums['tad']==6), 'angl']
Now, to get the float, you can call the .item() method, which works only when the selection contains exactly one value.
>>> s.item()
-0.466331
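A self-contained sketch of the same idea, using a small made-up nums frame (only the column names frame, tad and angl come from the question; the values are invented):

```python
import pandas as pd

# Hypothetical data; only the column names are taken from the question.
nums = pd.DataFrame(
    {
        "frame": [300, 300, 301],
        "tad": [5, 6, 6],
        "angl": [0.12, -0.466331, 0.9],
    }
)

# Build the boolean mask once, then select rows and column in a single .loc call.
mask = (nums["frame"] == 300) & (nums["tad"] == 6)
s = nums.loc[mask, "angl"]  # still a Series, here with one element

# .item() returns the plain Python scalar and raises ValueError
# if the selection does not contain exactly one element.
value = s.item()
print(value)
```

If the mask can match several rows, something like s.iloc[0] or s.tolist() is likely a better fit than .item().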
I have a dataframe named 'x'.
This dataframe is about the size and type of houses (eg 35A, 9B, 50C..) and is of type 'object' and contains missing values.
I want to extract only numbers from this dataframe and convert them to numeric type.
What should I do in this case?
I tried the following, but it didn't work:
df['x'] = df['x'].str[0:2]
df['x'] = pd.to_numeric(df['x'])
Output
ValueError: Unable to parse string "9A" at position 3766
I would use str.extract here:
df['x'] = pd.to_numeric(df['x'].str.extract(r'^(\d+)', expand=False))
The challenge with trying to use a pure substring approach is that we don't necessarily know how many characters to take. Regex gets around this problem.
You are making the assumption that for your strings in the x column, the first two characters will always be digits. Unfortunately, you have a row where x is 9A which doesn't convert to a numeric value.
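Here is a small sketch of the regex approach on made-up data that includes a missing value. The column name x_num is just for illustration.

```python
import pandas as pd

# Hypothetical house types, including one missing value.
df = pd.DataFrame({"x": ["35A", "9B", "50C", None, "120D"]})

# Take the leading run of digits, however long it is.
# expand=False gives a Series, which pd.to_numeric accepts.
digits = df["x"].str.extract(r"^(\d+)", expand=False)

# Missing or non-matching rows come out as NaN.
df["x_num"] = pd.to_numeric(digits)

print(df)
```

Because of the missing value, the result is a float column. If you need integers with missing values, converting afterwards with .astype("Int64") is one option.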
I have a DataFrame that has columns with numbers, but these numbers are represented as strings. I want to find these columns automatically, without telling which column should be numeric. How can I do this in pandas?
You can use str.contains from pandas. Note that this matches column names that contain a digit, not column values:
>>> df.columns[df.columns.str.contains('.*[0-9].*', regex=True)]
The regex can be modified to accommodate a wide range of patterns you want to search for.
You can first filter using pd.to_numeric and then combine_first with original column:
df['COL_NAME'] = pd.to_numeric(df['COL_NAME'],errors='coerce').combine_first(df['COL_NAME'])
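To find such columns automatically, one possible sketch is to try the conversion on every non-numeric column and keep the ones where nothing fails to parse. The helper name numeric_like_columns and the sample frame are hypothetical, not part of pandas.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame: "a" holds numbers as strings, "b" and "c" do not fully parse.
df = pd.DataFrame(
    {
        "a": ["1", "2", "3"],
        "b": ["x", "y", "z"],
        "c": ["1.5", "2", "n/a"],
    }
)


def numeric_like_columns(frame):
    """Return names of non-numeric columns whose values all parse as numbers."""
    found = []
    for col in frame.columns:
        if is_numeric_dtype(frame[col]):
            continue  # already numeric, nothing to detect
        parsed = pd.to_numeric(frame[col], errors="coerce")
        # errors="coerce" turns unparseable values into NaN, so equal
        # non-null counts mean every original value was parsed.
        no_failures = parsed.notna().sum() == frame[col].notna().sum()
        if no_failures and parsed.notna().any():
            found.append(col)
    return found


cols = numeric_like_columns(df)
df[cols] = df[cols].apply(pd.to_numeric)
print(cols)
```

Whether a column with a few unparseable entries such as "n/a" should still count as numeric is a judgment call; here it is excluded.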
I try to select a subset of the object type column cells with str.split(pat="'")
dataset['pictures'].str.split(pat=",")
I want to extract the numbers 40092 and 39097 and the two picture dates into two columns, ID and DATE, but the result is a single column consisting of NaNs.
'pictures' column:
{"col1":"40092","picture_date":"2017-11-06"}
{"col1":"39097","picture_date":"2017-10-31"}
...
Here's what I understood from your question:
You have a pandas DataFrame in which one of the columns contains JSON strings (or any other strings that need to be parsed into multiple columns).
E.g.
df = pd.DataFrame({'pictures': [
'{"col1":"40092","picture_date":"2017-11-06"}',
'{"col1":"39097","picture_date":"2017-10-31"}']
})
You want to parse the two elements ('col1' and 'picture_date') into two separate columns for further processing (or perhaps just one of them)
Define a function for parsing the row:
import json
def parse_row(r):
    j = json.loads(r['pictures'])
    return j['col1'], j['picture_date']
And use Pandas DataFrame.apply() method as follows
df1=df.apply(parse_row, axis=1,result_type='expand')
The result is a new dataframe with two columns - each containing the parsed data:
       0           1
0  40092  2017-11-06
1  39097  2017-10-31
If you need just one column you can return a single element from parse_row (instead of a two element tuple in the above example) and just use df.apply(parse_row).
If the values are not in json format, just modify parse_row accordingly (Split, convert string to numbers, etc.)
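As an alternative sketch, assuming the column really holds JSON strings, you could parse every cell with json.loads and build the new frame in one step. The names ID and DATE come from the question; the type conversions at the end are optional extras.

```python
import json

import pandas as pd

df = pd.DataFrame(
    {
        "pictures": [
            '{"col1":"40092","picture_date":"2017-11-06"}',
            '{"col1":"39097","picture_date":"2017-10-31"}',
        ]
    }
)

# Parse each JSON string into a dict, then let pandas turn the dicts into columns.
parsed = pd.DataFrame(df["pictures"].map(json.loads).tolist(), index=df.index)

# Rename to the column names asked for in the question.
parsed = parsed.rename(columns={"col1": "ID", "picture_date": "DATE"})

# Optional: convert the strings to proper numeric and datetime types.
parsed["ID"] = parsed["ID"].astype(int)
parsed["DATE"] = pd.to_datetime(parsed["DATE"])

print(parsed)
```

If the cells are already Python dicts rather than strings, the json.loads step can be dropped. That would also explain the NaNs in the question, since the .str methods return NaN for non-string values.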
Thanks for the replies but I solved it by loading the 'pictures' column from the dataset into a list:
picturelist= dataset['pictures'].values.tolist()
Afterwards I created a dataframe from that list and concatenated it with the original dataset without the pictures column:
two_new_columns = pd.DataFrame(picturelist, index=dataset.index)
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)
I have a column that has the values 1,2,3....
I need to change this value to Cluster_1, Cluster_2, Cluster_3... dynamically. My original table looks like below, where cluster_predicted is a column, containing integer value and I need to convert these numbers to cluster_0, cluster_1...
I have tried the below code
clustersDf['clusterDfCategorical'] = "Cluster_" + str(clustersDf['clusterDfCategorical'])
But this is giving me a very weird output as shown below.
import pandas as pd
df = pd.DataFrame()
df['cols']=[1,2,3,4,5]
df['vals']=['one','two','three','four','five']
df['cols'] =df['cols'].astype(str)
df['cols']= 'confuse_'+df['cols']
print(df)
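Try the code above. The string conversion is what is causing the problem for you: str() applied to a whole column produces one string for the entire Series rather than one string per row.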
One way to convert to string is to use astype
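Applied to the column from the question, a minimal sketch could look like this. The sample values and the new column name cluster_label are assumptions for illustration.

```python
import pandas as pd

# Hypothetical cluster assignments; the column name follows the question.
clustersDf = pd.DataFrame({"cluster_predicted": [0, 1, 2, 1]})

# astype(str) converts each element, so the prefix is added row by row.
clustersDf["cluster_label"] = "Cluster_" + clustersDf["cluster_predicted"].astype(str)

print(clustersDf["cluster_label"].tolist())
```

By contrast, str(clustersDf["cluster_predicted"]) returns the printed representation of the whole Series as one string, and that single string then gets assigned to every row.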