I need to extract numeric values from a string inside a pandas DataFrame.
Let's say the DataFrame cell is as follows (stored as a string):
[1.234,2.345]
I can get the first value with the following:
print(df['column_name'].str.extract('(\d+.\d+)',).astype('float'))
Output:
1.234
To get both values, my idea was to do the following:
print(df['column_name'].str.extract('(\d+.\d+),(\d+.\d+)',).astype('float'))
but the output is then as follows:
NaN NaN
Expected output:
1.234 2.345
Why not just pd.eval:
>>> df['Float'] = pd.eval(df['String'])
>>> df
String Float
0 [1.234, 2.345] [1.234, 2.345]
1 [1.234, 2.345] [1.234, 2.345]
>>>
If you want to use a regex to extract floats, you can use str.findall:
>>> df['String'].str.findall(r'(-?\d+\.?\d+)').str.join(' ')
0 1.234 2.345
Name: String, dtype: object
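As a sketch of the regex route with one column per value: the pattern below escapes the dot and allows optional whitespace around the comma, which is only a guess at why the two-group pattern in the question returned NaN. The sample rows and the column names first and second are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one cell without and one with a space after the comma.
df = pd.DataFrame({"column_name": ["[1.234,2.345]", "[1.234, 2.345]"]})

# Escape the dot so it only matches a literal '.', and tolerate whitespace
# around the comma separating the two numbers.
pattern = r"(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)"

# Two capture groups give a two-column DataFrame of strings; cast to float.
values = df["column_name"].str.extract(pattern).astype(float)
values.columns = ["first", "second"]
print(values)
```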
Old answer:
Use ast.literal_eval:
import ast
df = pd.DataFrame({'String': ['[1.234, 2.345]']})
df['Float'] = df['String'].apply(ast.literal_eval)
Output:
>>> df
String Float
0 [1.234, 2.345] [1.234, 2.345]
>>> type(df.at[0, 'String'][0])
str
>>> type(df.at[0, 'Float'][0])
float
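If you then want one column per value, one possible follow-up is to build a new DataFrame from the parsed lists. This is a sketch that assumes every cell holds a list of the same length; the second sample row and the names parsed and parts are hypothetical.

```python
import ast

import pandas as pd

df = pd.DataFrame({"String": ["[1.234, 2.345]", "[3.5, 4.25]"]})

# Parse each string into a real Python list of floats.
parsed = df["String"].apply(ast.literal_eval)

# Spread the lists into columns, keeping the original row index.
parts = pd.DataFrame(parsed.tolist(), index=df.index)
print(parts)
```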
You can use Series.str.split after stripping the surrounding brackets (otherwise the pieces "[1.234" and "2.345]" cannot be cast to float). If you want one column per value you must set expand=True.
So the result might look like:
your_dataframe['your_column_name'].str.strip("[]").str.split(",", expand=True).astype(float)
This question is the reverse of
"In Python, how to specify a format when converting int to string?"
Here I want to convert the string "0001" to the integer 1 and the string "0023" to the integer 23.
I want to apply this to a pandas DataFrame, since I have a column that looks like this:
dic = {'UPCCode': ["00783927275569", "0007839272755834", "003485934573", "06372792193", "8094578237"]}
df = pd.DataFrame(data=dic)
I want it to become something like this:
dic = {'UPCCode': [783927275569, 7839272755834, 3485934573, 6372792193, 8094578237]}
df = pd.DataFrame(data=dic)
If I use int(001) or float(0023), I get this error:
SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers
The best way is to use pd.to_numeric:
df['UPCCode'] = pd.to_numeric(df['UPCCode'])
print(df)
UPCCode
0 783927275569
1 7839272755834
2 3485934573
3 6372792193
4 8094578237
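If some codes might not be purely numeric, pd.to_numeric can turn those into NaN instead of raising. A small sketch, with made-up sample values; note that the result is then float, because NaN cannot be stored in a plain integer column.

```python
import pandas as pd

# Hypothetical column: two zero-padded codes and one malformed entry.
codes = pd.Series(["00783927275569", "0023", "not-a-code"])

# errors="coerce" replaces anything unparseable with NaN.
numbers = pd.to_numeric(codes, errors="coerce")
print(numbers)
```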
Here is a quick solution, just use the astype method:
>>> df = df.astype(int)
>>> df
UPCCode
0 783927275569
1 7839272755834
2 3485934573
3 6372792193
4 8094578237
If you want to apply this to the column 'UPCCode' alone, assign the result back to that column:
df['UPCCode'] = df['UPCCode'].astype(int)
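Another way is to remove the leading zeros first and then cast. Use lstrip rather than strip, because strip would also remove trailing zeros and change the value:
df["UPCCode"] = df["UPCCode"].str.lstrip("0")
df["UPCCode"] = df["UPCCode"].astype(int)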
I have the following pandas dataframe with only one column:
column_name
0 cc_call_center_sk
1 cc_call_center_id
2 cc_rec_start_date
3 cc_rec_end_date
What I want to do is to extract each element inside that pandas column and put it into a string like this:
my_string = ['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date',
'cc_rec_end_date']
I tried to do this with the following code:
my_list = column_names.values.tolist()
However, the output is a list of lists, which is not what I want:
[['cc_call_center_sk'], ['cc_call_center_id'], ['cc_rec_start_date'], ['cc_rec_end_date']]
df.names.tolist() generates the expected result:
>>> df.names.tolist()
['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date', 'cc_rec_end_date']
For example:
>>> df=pd.DataFrame([['cc_call_center_sk'], ['cc_call_center_id'], ['cc_rec_start_date'], ['cc_rec_end_date']], columns=['names'])
>>> df
names
0 cc_call_center_sk
1 cc_call_center_id
2 cc_rec_start_date
3 cc_rec_end_date
>>> df.names.tolist()
['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date', 'cc_rec_end_date']
Are you sure you do not group the values, or perform some other preprocessing, before obtaining df.names?
You can use the tolist method on the 'column_name' series. Note that my_string is a list of strings, not a string. The name you have assigned is not appropriate.
>>> import pandas as pd
>>> df = pd.DataFrame(['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date', 'cc_rec_end_date'],
... columns=['column_name'])
>>> df
column_name
0 cc_call_center_sk
1 cc_call_center_id
2 cc_rec_start_date
3 cc_rec_end_date
>>>
>>> df['column_name'].tolist()
['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date', 'cc_rec_end_date']
If you prefer the dot notation, the following code does the same.
>>> df.column_name.tolist()
['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date', 'cc_rec_end_date']
Let's say you have a data frame named df which looks like this:
df
column_name
0 cc_call_center_sk
1 cc_call_center_id
2 cc_rec_start_date
3 cc_rec_end_date
then:
my_string = df.column_name.values.tolist()
or:
my_string = df['column_name'].values.tolist()
would give you the result that you want. Here is the result when you print my_string:
['cc_call_center_sk',
'cc_call_center_id',
'cc_rec_start_date',
'cc_rec_end_date']
What your current code does is equivalent to this:
my_strings = df.values.tolist()
This would give you a list of lists with the number of lists in the outer list being equal to the number of observations in your data frame. Each list would contain all the feature information pertaining to 1 observation.
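The difference can be seen side by side in a short sketch (the names nested and flat are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"column_name": ["cc_call_center_sk", "cc_call_center_id"]})

# Whole DataFrame: one inner list per row.
nested = df.values.tolist()

# Single column (a Series): one flat list of values.
flat = df["column_name"].tolist()

print(nested)
print(flat)
```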
I hope I was clear in explaining that to you.
Thank you
I want to extract numbers using a regular expression.
df['price'][0]
has
'[<em class="letter" id="infoJiga">3,402,000</em>]'
And I want to extract 3402000
How can I get this in pandas dataframe?
Since the value is a string, try the code below.
# df['price'][0] returns '[<em class="letter" id="infoJiga">3,402,000</em>]'
x = df['price'][0]
# take the text after the first '>' and keep only the digits
y = ''.join(c for c in x.split('>')[1] if c.isdigit())
print(y)
Output: 3402000
Hope it works.
The simplest regex, assuming nothing about the surrounding markup, may be ([\d,]+). Then you can remove the commas and use pandas' to_numeric function.
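A sketch of that regex idea applied to a whole column. Anchoring the group between '>' and '<' and the price_num column name are assumptions for illustration, not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({"price": [
    '[<em class="letter" id="infoJiga">3,402,000</em>]',
    '[<em class="letter" id="infoJiga">2,000</em>]',
]})

# Capture the run of digits and commas between '>' and '<'.
digits = df["price"].str.extract(r">([\d,]+)<", expand=False)

# Drop the thousands separators, then convert to numbers.
df["price_num"] = pd.to_numeric(digits.str.replace(",", "", regex=False))
print(df["price_num"])
```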
Are all your values formatted the same way? If so, you can use a simple regular expression to extract the numeric values then convert them to int.
import pandas as pd
import re
test_data = ['[<em class="letter" id="infoJiga">3,402,000</em>]','[<em class="letter" id="infoJiga">3,401,000</em>]','[<em class="letter" id="infoJiga">3,400,000</em>]','[<em class="letter" id="infoJiga">2,000</em>]']
df = pd.DataFrame(test_data)
>>> df[0]
0 [<em class="letter" id="infoJiga">3,402,000</em>]
1 [<em class="letter" id="infoJiga">3,401,000</em>]
2 [<em class="letter" id="infoJiga">3,400,000</em>]
3 [<em class="letter" id="infoJiga">2,000</em>]
Name: 0, dtype: object
Define a function that extracts the number and returns it as an integer:
def get_numeric(data):
    # capture everything between the first '>' and the last '<'
    match = re.search(r'>(.+)<', data)
    if match:
        return int(match.group(1).replace(',', ''))
    return None
Apply it to the DataFrame:
df[1] = df[0].apply(get_numeric)
>>> df[1]
0 3402000
1 3401000
2 3400000
3 2000
Name: 1, dtype: int64
As shown in the screenshot below, I have 2 columns in an Excel file. I'm trying to reduce the precision of the number fields, e.g. 100.54000000000001 to 100.540. The numbers are stored as strings, so when I convert the column to float using
df['Unnamed: 5'] = pd.to_numeric(df['Unnamed: 5'], errors='coerce')
it converts the non-numeric strings to NaN. Can anyone help me with this issue? I'm trying to convert only the numeric values, and words should remain strings.
EDIT: It would be acceptable to convert the numeric values back to string after rounding them. My code is as follows:
>>> import pandas as pd
>>> import numpy as np
>>> xl = pd.ExcelFile("WSJ_template.xls")
>>> xl.sheet_names
[u'losers', u'winners']
>>> dfw = xl.parse("winners")
>>> dfw.head()
<output>
>>> dfw = dfw.apply(pd.to_numeric, errors='coerce').combine_first(dfw)
>>> dfw = dfw.replace(np.nan, '', regex=True)
>>> dfw
<output>
As you already identified, we're best off using pd.DataFrame.apply. The only difference is rather than using a built-in function, we'll define our own.
We'll start off by filling the DataFrame (this is a placeholder, you already have this covered):
df = pd.DataFrame(columns=['Unnamed: 5', 'Unnamed: 6'],
data=[['NaN', 'NaN'],
['Average', 'Weekly'],
['100.540000000001', '0.2399999999999999'],
['99.3299999999998', '0.1700000000000001'],
['95.4800000000004', 'change'],
['bid', '1.929999999999999']])
Now we define a function to use to convert values. This function should try to cast to float and if it works, return the rounded value. If it doesn't work, just return the original value. Here's one possible route:
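def round_only_nums(val):
    try:
        # numeric strings: round to 3 decimals and return as a string
        return '%s' % round(float(val), 3)
    except (TypeError, ValueError):
        # anything that is not a number is returned unchanged
        return val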
Next, let's apply that to the columns that need processing:
cols_to_process = ['Unnamed: 5', 'Unnamed: 6']
for col in cols_to_process:
    df[col] = df[col].apply(round_only_nums)
And our results:
>>> df
Unnamed: 5 Unnamed: 6
0 nan nan
1 Average Weekly
2 100.54 0.24
3 99.33 0.17
4 95.48 change
5 bid 1.93
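Note that round() drops trailing zeros, so 100.54 is shown rather than the 100.540 mentioned in the question. If a fixed number of decimals is wanted, one possible variant is to format instead of round. This is only a sketch; the function name format_only_nums and the ndigits parameter are hypothetical.

```python
import pandas as pd


def format_only_nums(val, ndigits=3):
    """Return numeric-looking values as fixed-decimal strings, others unchanged."""
    try:
        return f"{float(val):.{ndigits}f}"
    except (TypeError, ValueError):
        return val


col = pd.Series(["Average", "100.54000000000001", "bid", "0.2399999999999999"])
formatted = col.apply(format_only_nums)
print(formatted.tolist())
```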
I have this type of DataFrame that I want to use. But because the data I imported uses the letter i for the imaginary part of the complex numbers, Python doesn't let me convert them.
5.0 0.01511+0.0035769i
5.0298 0.015291+0.0075383i
5.0594 0.015655+0.0094534i
5.0874 0.012456+0.011908i
5.1156 0.015332+0.011174i
5.1458 0.015758+0.0095832i
How can I proceed to change the i to j in each row of the DataFrame?
Thank you.
If you have a string like this: complexStr = "0.015291+0.0075383i", you could do:
complexFloat = complex(complexStr[:-1] + 'j')
If your data is a string like this: line = "5.0 0.01511+0.0035769i", you have to separate the first part, like this:
number, complexStr = line.split()
complexFloat = complex(complexStr[:-1] + 'j')
>>> complexFloat
(0.01511+0.0035769j)
>>> type(complexFloat)
<class 'complex'>
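To do the same for every row of a DataFrame, one option is to replace the letter on the whole column and then convert each string. A minimal sketch, assuming the column holds strings; the column names a and b are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [5.0, 5.0298],
    "b": ["0.01511+0.0035769i", "0.015291+0.0075383i"],
})

# Swap the trailing 'i' for Python's 'j', then parse each string as complex.
df["b"] = df["b"].str.replace("i", "j", regex=False).apply(complex)
print(df)
```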
I'm not sure how you obtain your dataframe, but if you're reading it from a text file with a suitable header, then you can use a converter function to sort out the 'i' -> 'j' so that your dtype is created properly:
For file test.df:
a b
5.0 0.01511+0.0035769i
5.0298 0.015291+0.0075383i
5.0594 0.015655+0.0094534i
5.0874 0.012456+0.011908i
5.1156 0.015332+0.011174i
5.1458 0.015758+0.0095832i
the code
import pandas as pd
df = pd.read_table('test.df', delimiter=r'\s+',
                   # replace the trailing 'i' with 'j' so complex() can parse the value
                   converters={'b': lambda v: complex(v.replace('i', 'j'))}
                   )
gives df as:
a b
0 5.0000 (0.01511+0.0035769j)
1 5.0298 (0.015291+0.0075383j)
2 5.0594 (0.015655+0.0094534j)
3 5.0874 (0.012456+0.011908j)
4 5.1156 (0.015332+0.011174j)
5 5.1458 (0.015758+0.0095832j)
with column dtypes:
a float64
b complex128