I am trying to extract all numbers, including decimals, dots and commas, from a string using pandas.
This is my DataFrame
rate_number
0 92 rate
0 33 rate
0 9.25 rate
0 (4,396 total
0 (2,620 total
I tried using df['rate_number'].str.extract('(\d+)', expand=False) but the results were not correct.
The DataFrame I need to extract should be the following:
rate_number
0 92
0 33
0 9.25
0 4,396
0 2,620
You can try this:
df['rate_number'] = df['rate_number'].replace(r'\(|[a-zA-Z]+', '', regex=True)
Better answer:
df['rate_number_2'] = df['rate_number'].str.extract('([0-9][,.]*[0-9]*)')
Output:
rate_number rate_number_2
0 92 92
1 33 33
2 9.25 9.25
3 4,396 4,396
4 2,620 2,620
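For reference, a minimal reproducible sketch of that extract call (the sample strings are taken from the question; expand=False keeps the result a Series):
import pandas as pd

df = pd.DataFrame({'rate_number': ['92 rate', '33 rate', '9.25 rate',
                                   '(4,396 total', '(2,620 total']})

# first digit, then any run of commas/dots followed by more digits
df['rate_number_2'] = df['rate_number'].str.extract(r'([0-9][,.]*[0-9]*)', expand=False)
print(df)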
Dan's comment above is not very noticeable but worked for me:
new_df_arr = []
for df in df_arr:
    df = df.astype(str)
    df_copy = df.copy()
    for i in range(1, len(df.columns)):
        # alternative: .str.replace(r'[^0-9]+', '')
        df_copy[df.columns[i]] = df_copy[df.columns[i]].str.extract(r'(\d+[.]?\d*)', expand=False)
    new_df_arr.append(df_copy)
There is a small error with the asterisk's position:
df['rate_number_2'] = df['rate_number'].str.extract('([0-9]*[,.][0-9]*)')
Note, however, that this pattern requires a comma or dot, so plain integers like 92 no longer match; the pattern in the answer above handles both cases.
I'm testing a simple imputation method on the side using a copy of my dataset. I'm essentially trying to impute missing values with categorical means grouped by the target variable.
df_test_2 = train_df.loc[:, ['Survived', 'Age']].copy()  # copy of dataset for testing

# creating impute function
def impute(df, variable):
    if 'Survived' == 0:
        df[variable] = df[variable].fillna(30.7)
    else:
        df[variable] = df[variable].fillna(28.3)

# imputing
impute(df_test_2, 'Age')
The output is that the imputation is successful, but the values added are 30 and 28 instead of 30.7 and 28.3.
'Age' is float64.
Thank you
Edit: I had simply copied an old version of the function call here and have corrected it now. That wasn't the issue in my original code; the problem persists.
Have a look at this to see what may be going on. To test it, I set up a simple case:
import pandas as pd
import numpy as np
data = {'Survived' : [0,1,1,0,0,1], 'Age' :[12.2,45.4,np.nan,np.nan,64.3,44.3]}
df = pd.DataFrame(data)
df
This gave the data set:
Survived Age
0 0 12.2
1 1 45.4
2 1 NaN
3 0 NaN
4 0 64.3
5 1 44.3
I ran your function exactly as written:
def impute(df, variable):
    if 'Survived' == 0:
        df[variable] = df[variable].fillna(30.7)
    else:
        df[variable] = df[variable].fillna(28.3)
and this yielded the following result:
   Survived   Age
0         0  12.2
1         1  45.4
2         1  28.3
3         0  28.3
4         0  64.3
5         1  44.3
As you can see, at index 3 the Age was filled with the wrong value (28.3 instead of 30.7). The problem is 'Survived' == 0: this compares the string 'Survived' to the integer 0, which is always False, so the else branch always runs.
What you may want is
df2 = df[df['Survived'] == 0].fillna(30.7)
df3 = df[df['Survived'] == 1].fillna(28.3)
dfout = df2.append(df3)
and the output is
Survived Age
0 0 12.2
3 0 30.7
4 0 64.3
1 1 45.4
2 1 28.3
5 1 44.3
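As a side note, df.append was removed in later pandas versions (pd.concat([df2, df3]) is the drop-in replacement), and since the goal is group-wise means, a groupby-based fill avoids splitting the frame at all. A minimal sketch reusing the test data above:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Survived': [0, 1, 1, 0, 0, 1],
                   'Age': [12.2, 45.4, np.nan, np.nan, 64.3, 44.3]})

# fill each group's NaNs with that group's own mean, keeping the original row order
df['Age'] = df.groupby('Survived')['Age'].transform(lambda s: s.fillna(s.mean()))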
I think it is better to use the apply() method available in pandas. It applies a custom function over a DataFrame, row-wise or column-wise.
Here is a related post: Stack Question
Pandas documentation: Doc Apply df
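A minimal sketch of what that could look like here, assuming the fill values 30.7 and 28.3 from the question:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Survived': [0, 1, 1, 0, 0, 1],
                   'Age': [12.2, 45.4, np.nan, np.nan, 64.3, 44.3]})

# row-wise: pick the fill value based on the 'Survived' column
def fill_age(row):
    if pd.isna(row['Age']):
        return 30.7 if row['Survived'] == 0 else 28.3
    return row['Age']

df['Age'] = df.apply(fill_age, axis=1)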
I need to erase all the content of the cell, except the information contained between > and <.
I have a 100 x 15 dataframe that looks something like this:
df = pd.DataFrame(['irus 1/3 km >A001< absc ','#$ jiadhf 3 >A002<', '#AB >A003<'], columns=['AFF'])
df
AFF
0 irus 1/3 km >A001< absc
1 #$ jiadhf 3 >A002<
2 #AB >A003<
I need to get a result like this:
AFF
0 A001
1 A002
2 A003
I found that I need to use a command similar to re.sub('[^>]+>', '', y), but after several attempts I still can't get exactly the info I need.
Can somebody give me a hand?
You could use str.extract() with a capturing group:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(['irus 1/3 km >A001< absc ','#$ jiadhf 3 >A002<', '#AB >A003<'], columns=['AFF'])
In [3]: df['AFF'] = df['AFF'].str.extract(r">([A-Z0-9]+)<")
In [4]: print(df)
AFF
0 A001
1 A002
2 A003
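If you prefer the re.sub route mentioned in the question, an equivalent sketch (not part of the answer above) could be:
import re
import pandas as pd

df = pd.DataFrame(['irus 1/3 km >A001< absc ', '#$ jiadhf 3 >A002<', '#AB >A003<'],
                  columns=['AFF'])

# keep only the text between '>' and '<'
df['AFF'] = df['AFF'].apply(lambda y: re.sub(r'.*>([^<]+)<.*', r'\1', y))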
I do not understand why pandas is rounding the values in the column where I divide the values of two other columns. I want the numbers in the new column with two decimals, but the values are rounded. I have checked the dtypes of the columns and both are float64.
import os
import re
import pandas as pd
import numpy as np

# CURRENT DIRECTORY
cd = os.path.dirname(os.getcwd())

# concatenate csv files
dfList = []
for root, dirs, files in os.walk(cd):
    for fname in files:
        if re.match("output_contigs_SCMgenes.csv", fname):
            frame = pd.read_csv(os.path.join(root, fname))
            dfList.append(frame)
df = pd.concat(dfList)

# replace NaN in SCM column with 0
df['SCM'].fillna(0, inplace=True)

# add column with genes/SCM
df['genes/SCM'] = df['genes'] / df['SCM']
The output is as follows:
genome contig genes SCM genes/SCM
0 20900 48 1 0 inf
1 20900 37 130 103 1
2 20900 35 1 1 1
3 20900 1 79 66 1
4 20900 66 5 3 2
But I want the last column to contain unrounded values with at least 2 decimals, not rounded ones.
I could reproduce this behaviour by setting the pd.options.display.precision to 0:
In [4]: df['genes/SCM'] = df['genes']/df['SCM']
In [5]: df
Out[5]:
genome contig genes SCM genes/SCM
0 20900 48 1 0 inf
1 20900 37 130 103 1.262136
2 20900 35 1 1 1.000000
3 20900 1 79 66 1.196970
4 20900 66 5 3 1.666667
In [6]: pd.options.display.precision = 0
In [7]: df
Out[7]:
genome contig genes SCM genes/SCM
0 20900 48 1 0 inf
1 20900 37 130 103 1
2 20900 35 1 1 1
3 20900 1 79 66 1
4 20900 66 5 3 2
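If that turns out to be the culprit, restoring the default display precision is enough; a quick sketch:
import pandas as pd

# reset the repr back to the pandas default (6 digits)
pd.reset_option('display.precision')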
Check your Pandas & Numpy options
For rounding off to the desired number of digits after the decimal, e.g. 2 digits as asked in the question:
df.round({'genes/SCM': 2})
For multiple columns:
df.round({'col1_name': 1, 'col2_name': 2})
Also, check that the precision option is not set to 0; pd.set_option('precision', 5) can be used to set it appropriately, where 5 is the desired number of digits after the decimal.
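A quick check of the difference, with made-up values:
import pandas as pd

df = pd.DataFrame({'genes/SCM': [1.262136, 1.196970, 1.666667]})
print(df.round({'genes/SCM': 2}))  # 1.26, 1.20, 1.67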
Can't be sure because I can't reproduce, but you can try:
from __future__ import division
at the very top of your script.
Try using the round() function; note the parentheses, so that round applies to the quotient rather than only to df['SCM']:
df['genes/SCM'] = (df['genes'] / df['SCM']).round(2)
I had faced a similar issue. If you're reading the data from a csv, use the option float_precision='round_trip':
pd.read_csv(resultant_file, sep='\t', float_precision='round_trip')
It will hold on to your precision; if you don't use this option, pandas limits the precision for speed (see @MarkDickinson's comment).
And if it's related to displaying the data frame in a Jupyter notebook, set the display.precision option:
pd.set_option("precision", 20)
I have 2 time series.
df = pd.DataFrame([
    ['1/10/12', 10],
    ['1/11/12', 11],
    ['1/12/12', 13],
    ['1/14/12', 12],
], columns=['Time', 'n'])
df.index = pd.to_datetime(df['Time'])

df1 = pd.DataFrame([
    ['1/13/12', 88],
], columns=['Time', 'n'])
df1.index = pd.to_datetime(df1['Time'])
I am trying to align the time series so the index is in order. I am guessing reindex_like is what I need, but I'm not sure how to use it.
Here is my desired output
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Here is what you need:
df.append(df1).sort().reset_index(drop=True)
(Note: DataFrame.sort() and append() were removed in later pandas versions; the concat/sort_index answer below shows the modern equivalent.)
If you need to compile more pieces together, it is more efficient to use pd.concat(<names of all your dataframes as a list>).
P.S. Your code is a bit redundant: you don't need to cast Time into the index if you don't need it there. You can sort values based on any column, like this:
import pandas as pd

df = pd.DataFrame([
    ['1/10/12', 10],
    ['1/11/12', 11],
    ['1/12/12', 13],
    ['1/14/12', 12],
], columns=['Time', 'n'])

df1 = pd.DataFrame([
    ['1/13/12', 88],
], columns=['Time', 'n'])

df.append(df1).sort_values('Time')
You can use concat, sort_index and reset_index:
df = pd.concat([df,df1]).sort_index().reset_index(drop=True)
print df
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Or you can use ordered_merge (renamed merge_ordered in later pandas versions):
print pd.ordered_merge(df, df1)
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
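For newer pandas, a minimal sketch of the same call with the renamed function:
import pandas as pd

df = pd.DataFrame({'Time': ['1/10/12', '1/11/12', '1/12/12', '1/14/12'],
                   'n': [10, 11, 13, 12]})
df1 = pd.DataFrame({'Time': ['1/13/12'], 'n': [88]})

# merge_ordered does an ordered (outer by default) merge on the shared columns
print(pd.merge_ordered(df, df1))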
I have the following input file:
"Name",97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,
And I am reading it in with:
#!/usr/bin/env python
import pandas as pd
import sys
import numpy as np

filename = sys.argv[1]
df = pd.read_csv(filename, header=None)
for col in df.columns[2:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
print df
However, I get the error
df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2241, in __getattr__
return object.__getattribute__(self, name)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 188, in __get__
return self.construct_accessor(instance)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 528, in _make_str_accessor
raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
This worked OK in pandas 0.14 but does not work in pandas 0.17.0.
It's happening because your last column is empty, so it gets converted to NaN:
In [417]:
import io
import numpy as np
import pandas as pd

t = """'Name',97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[417]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 'Name' 97.7 0A 0A 65M 0A 100M 5M 75M 100M 90M 90M 99M 90M 0#
15 16
0 0N# NaN
If you slice your range up to the last row then it works:
In [421]:
for col in df.columns[2:-1]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
df
Out[421]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 'Name' 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN
Alternatively you can just select the cols that are object dtype and run the code (skipping the first col as this is the 'Name' entry):
In [428]:
for col in df.select_dtypes([np.object]).columns[1:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
df
Out[428]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 'Name' 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN
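Another defensive option (a sketch, not from the answer above) is to drop all-NaN columns before looping, so the trailing empty field never reaches the .str accessor:
# continuing from the session above: drop columns that are entirely NaN, then extract
df = df.dropna(axis=1, how='all')
for col in df.columns[2:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)', expand=False).astype(float)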
I got this error while working in Eclipse. It turned out that the project interpreter had somehow been reset to Python 2.7 (after an update, I believe). Setting it back to Python 3.6 resolved the issue, after several crashes, restarts and warnings along the way.
While I know this is not a solution to the problem posed here, I thought it might be useful for others who arrive at this page after searching for this error.
In this case we have to use the str.replace() method on that Series, but first we have to convert it to str type. Suppose df1.Patient contains values like 's125', 's45', 's588', 's244', 's125', 's123':
df1 = pd.read_csv("C:\\Users\\Gangwar\\Desktop\\competitions\\cancer prediction\\kaggle_to_students.csv")
df1['Patient'] = df1['Patient'].astype(str)
df1['Patient'] = df1['Patient'].str.replace('s', '').astype(int)
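If the column may contain other non-digit characters, a slightly more general sketch strips everything that is not a digit (regex=True is passed explicitly, as newer pandas requires it for patterns):
import pandas as pd

s = pd.Series(['s125', 's45', 's588', 's244', 's125', 's123'])

# remove every non-digit character, then cast to int
s = s.str.replace(r'\D', '', regex=True).astype(int)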