Why is python pandas dataframe rounding my values?

I do not understand why my pandas DataFrame is rounding the values in the column where I divide the values of two other columns. I want the numbers in the new column to have two decimals, but the values are rounded. I have checked the dtypes of the columns and both are "float64".
import os
import re
import pandas as pd
import numpy as np

# CURRENT DIRECTORY
cd = os.path.dirname(os.getcwd())

# concatenate csv files
dfList = []
for root, dirs, files in os.walk(cd):
    for fname in files:
        if re.match("output_contigs_SCMgenes.csv", fname):
            frame = pd.read_csv(os.path.join(root, fname))
            dfList.append(frame)
df = pd.concat(dfList)

# replace NaN in SCM column with 0
df['SCM'].fillna(0, inplace=True)

# add column with genes/SCM
df['genes/SCM'] = df['genes']/df['SCM']
The output is as follows:
   genome  contig  genes  SCM  genes/SCM
0   20900      48      1    0        inf
1   20900      37    130  103          1
2   20900      35      1    1          1
3   20900       1     79   66          1
4   20900      66      5    3          2
But I want the last column to contain unrounded values with at least 2 decimals.

I could reproduce this behaviour by setting pd.options.display.precision to 0:
In [4]: df['genes/SCM'] = df['genes']/df['SCM']
In [5]: df
Out[5]:
   genome  contig  genes  SCM  genes/SCM
0   20900      48      1    0        inf
1   20900      37    130  103   1.262136
2   20900      35      1    1   1.000000
3   20900       1     79   66   1.196970
4   20900      66      5    3   1.666667
In [6]: pd.options.display.precision = 0
In [7]: df
Out[7]:
   genome  contig  genes  SCM  genes/SCM
0   20900      48      1    0        inf
1   20900      37    130  103          1
2   20900      35      1    1          1
3   20900       1     79   66          1
4   20900      66      5    3          2
Check your Pandas & Numpy options
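To rule this out in your own session, you can inspect and reset the display option first (a minimal sketch using the standard pandas options API):
import pandas as pd

# the default display precision is 6 digits after the decimal point
print(pd.get_option("display.precision"))

# restore the default in case it was changed elsewhere in your code
pd.reset_option("display.precision")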

For rounding off with a desired number of digits after the decimal, e.g. 2 digits as asked in the question:
df.round({'genes/SCM': 2})
For multiple columns:
df.round({'col1_name': 1, 'col2_name': 2})
Also, check that precision is not set to 0; pd.set_option('precision', 5) can be used to set the precision appropriately, where 5 is the desired number of digits after the decimal.
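Note that, as far as I know, newer pandas versions have deprecated the bare 'precision' alias (removed in pandas 2.0), so the fully qualified option key is safer:
pd.set_option('display.precision', 5)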

Can't be sure because I can't reproduce, but you can try:
from __future__ import division
at the very top of your script.
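For context, this only matters on Python 2, where / between two integers is floor division; the future import switches it to true division (on Python 3 it is a no-op):
# Python 2 without the import:           130 / 103  ->  1
# with from __future__ import division:  130 / 103  ->  1.2621359223300972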

Try using the round() function. Note that it must apply to the result of the division rather than to df['SCM'] alone, so wrap the quotient in parentheses:
df['genes/SCM'] = (df['genes']/df['SCM']).round(2)

I had faced a similar issue. If you're reading the data from a CSV, use the option float_precision='round_trip':
pd.read_csv(resultant_file, sep='\t', float_precision='round_trip')
It will hold on to your precision; if you don't use this option, the parser trades precision for speed (see @MarkDickinson's comment).
And if it's related to displaying the DataFrame in a Jupyter notebook, then set display.precision as follows:
pd.set_option("precision", 20)

Related

Pandas Extract Number with decimals from String

I am trying to extract all numbers, including decimals with dots and commas, from a string using pandas.
This is my DataFrame
rate_number
0 92 rate
0 33 rate
0 9.25 rate
0 (4,396 total
0 (2,620 total
I tried using df['rate_number'].str.extract('(\d+)', expand=False) but the results were not correct.
The DataFrame I need to extract should be the following:
rate_number
0 92
0 33
0 9.25
0 4,396
0 2,620
You can try this:
df['rate_number'] = df['rate_number'].replace('\(|[a-zA-Z]+', '', regex=True)
Better answer:
df['rate_number_2'] = df['rate_number'].str.extract('([0-9][,.]*[0-9]*)')
Output:
rate_number rate_number_2
0 92 92
1 33 33
2 9.25 9.25
3 4,396 4,396
4 2,620 2,620
Dan's comment above is not very noticeable but worked for me:
new_df_arr = []
for df in df_arr:
    df = df.astype(str)
    df_copy = df.copy()
    for i in range(1, len(df.columns)):
        # extract the numeric part; alternatively: .replace(r'[^0-9]+', '')
        df_copy[df.columns[i]] = df_copy[df.columns[i]].str.extract('(\d+[.]?\d*)', expand=False)
    new_df_arr.append(df_copy)
There is a small error with the asterisk's position:
df['rate_number_2'] = df['rate_number'].str.extract('([0-9]*[,.][0-9]*)')
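For completeness, a minimal end-to-end sketch with the sample values from the question, which also converts the extracted token to a float by stripping the thousands separator:
import pandas as pd

df = pd.DataFrame({'rate_number': ['92 rate', '33 rate', '9.25 rate',
                                   '(4,396 total', '(2,620 total']})
# pull out the numeric token (digits with optional , or . separators)
df['rate_number_2'] = df['rate_number'].str.extract(r'([0-9][,.]*[0-9]*)', expand=False)
# to compute with the values, drop the thousands separator and cast
df['as_float'] = df['rate_number_2'].str.replace(',', '', regex=False).astype(float)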

Group pandas dataframe by quantile of single column

Sorry if this is a duplicate post; I can't find a related post though.
import numpy as np
import pandas as pd
from random import seed

seed(100)  # note: random.seed does not seed NumPy; np.random.seed(100) would be needed for reproducible draws
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
What I'd like is to group P by the quartiles/quantiles/deciles/etc. of column A and then calculate an aggregate statistic (such as the mean) for each group. I can define deciles of the column as
P['A'].quantile(np.arange(10) / 10)
I'm not sure how to groupby the deciles of A. Thanks in advance!
If you want to group P e.g. by quartiles, run:
gr = P.groupby(pd.qcut(P.A, 4, labels=False))
Then you can perform any operations on these groups.
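For example, to compute per-quartile means (a short sketch; deciles work the same way, just ask qcut for 10 bins):
gr.mean()
# or in one line, without keeping the intermediate groupby object:
P.groupby(pd.qcut(P.A, 4, labels=False))['B'].mean()
# deciles instead of quartiles:
P.groupby(pd.qcut(P.A, 10, labels=False))['B'].mean()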
For presentation, below you have just a printout of P limited to
20 rows:
for key, grp in gr:
    print(f'\nGroup: {key}\n{grp}')
which gives:
Group: 0
A B
0 8 24
3 10 94
10 9 93
15 4 91
17 7 49
Group: 1
A B
7 34 24
8 15 60
12 27 4
13 31 1
14 13 83
Group: 2
A B
4 52 98
5 53 66
9 58 16
16 59 67
18 47 65
Group: 3
A B
1 67 87
2 79 48
6 98 14
11 86 2
19 61 14
As you can see, each group (quartile) has 5 members, so the grouping is
correct.
As a supplement
If you are interested in borders of each quartile, run:
pd.qcut(P.A, 4, labels=False, retbins=True)[1]
qcut then returns 2 results (a tuple). The first element (number 0) is the result returned before, but this time we are interested in the second element (number 1): the bin borders.
For your data they are:
array([ 4. , 12.25, 40.5 , 59.5 , 98. ])
So e.g. the first quartile is between 4 and 12.25.
You can use the quantile Series to make another column marking each row with its quantile label, and then group by that column. numpy's searchsorted is very useful for this:
import numpy as np
import pandas as pd
from random import seed
seed(100)
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
q = P['A'].quantile(np.arange(10) / 10)
P['G'] = P['A'].apply(lambda x : q.index[np.searchsorted(q, x, side='right')-1])
Since the quantile Series stores the lower bounds of the quantile intervals, be sure to pass side='right' to np.searchsorted: otherwise the minimum of A would return position 0, and after subtracting 1 you would wrap around to the last label instead of the first.
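A tiny illustration of why side='right' matters (with hypothetical bounds, not the actual deciles of A):
import numpy as np

q = np.array([0.0, 10.0, 20.0, 30.0])       # lower bounds of 4 bins
np.searchsorted(q, 15, side='right') - 1    # -> 1: 15 falls in the bin starting at 10
np.searchsorted(q, 0, side='right') - 1     # -> 0: the minimum lands in the first bin
np.searchsorted(q, 0, side='left') - 1      # -> -1: would wrap around to the last bin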
Now you can elaborate your statistics by doing, for example:
P.groupby('G').agg(['sum', 'mean'])  # add to the list all the statistic methods you wish

Shifted results in pandas rolling mean

In the result below (first 5 rows), you can see the Freq column and the rolling-mean (window 3) column MMeans calculated using pandas:
Freq MMeans
0 215 NaN
1 453 NaN
2 277 315.000000
3 38 256.000000
4 1 105.333333
I was expecting MMeans to start at index 1, since index 1 is the centre of the window (0-1-2). Is there an option that I am missing in the rolling method?
edit 1
print(pd.DataFrame({
    'Freq': eff,
    'MMeans': dF['Freq'].rolling(3).mean()}))
edit 2
Sorry @Yuca for not being as clear as I'd like. Here are the columns I'd like pandas to return:
Freq MMeans
0 215 NaN
1 453 315.000000
2 277 256.000000
3 38 105.333333
4 1 29.666667
which are not the results returned with min_periods=2
Use min_periods=1:
df['rol_mean'] = df['Freq'].rolling(3, min_periods=1).mean()
output:
Freq MMeans rol_mean
0 215 NaN 215.000000
1 453 NaN 334.000000
2 277 315.000000 315.000000
3 38 256.000000 256.000000
4 1 105.333333 105.333333
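That said, the desired output in edit 2 looks like a centred window rather than a shorter minimum window; if that is the goal, rolling also accepts center=True (a sketch, assuming the same Freq column):
df['MMeans'] = df['Freq'].rolling(3, center=True).mean()
# index 1 now holds the mean of rows 0-2, index 2 the mean of rows 1-3, and so on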

align timeseries in pandas

I have 2 time series.
import pandas as pd

df = pd.DataFrame([
    ['1/10/12', 10],
    ['1/11/12', 11],
    ['1/12/12', 13],
    ['1/14/12', 12],
], columns=['Time', 'n'])
df.index = pd.to_datetime(df['Time'])

df1 = pd.DataFrame([
    ['1/13/12', 88],
], columns=['Time', 'n'])
df1.index = pd.to_datetime(df1['Time'])
I am trying to align the time series so the index is in order. I am guessing reindex_like is what I need, but I'm not sure how to use it.
Here is my desired output
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Here is what you need:
df.append(df1).sort_index().reset_index(drop=True)
If you need to combine more pieces, it is more efficient to use pd.concat(<list of all your dataframes>).
P.S. Your code is a bit redundant: you don't need to cast Time into the index if you don't need it there. You can sort values based on any column, like this:
import pandas as pd

df = pd.DataFrame([
    ['1/10/12', 10],
    ['1/11/12', 11],
    ['1/12/12', 13],
    ['1/14/12', 12],
], columns=['Time', 'n'])

df1 = pd.DataFrame([
    ['1/13/12', 88],
], columns=['Time', 'n'])

df.append(df1).sort_values('Time')
You can use concat, sort_index and reset_index:
df = pd.concat([df,df1]).sort_index().reset_index(drop=True)
print df
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Or you can use ordered_merge:
print pd.ordered_merge(df, df1)
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
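A side note: in current pandas versions ordered_merge has been removed; as far as I know the equivalent call is:
pd.merge_ordered(df, df1)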

Pandas error "Can only use .str accessor with string values"

I have the following input file:
"Name",97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,
And I am reading it in with:
#!/usr/bin/env python
import pandas as pd
import sys
import numpy as np

filename = sys.argv[1]
df = pd.read_csv(filename, header=None)
for col in df.columns[2:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
print df
However, I get the error
df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2241, in __getattr__
return object.__getattribute__(self, name)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 188, in __get__
return self.construct_accessor(instance)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 528, in _make_str_accessor
raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
This worked OK in pandas 0.14 but does not work in pandas 0.17.0.
It's happening because your last column is empty, so it gets converted to NaN:
In [417]:
import io
t = """'Name',97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[417]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 'Name' 97.7 0A 0A 65M 0A 100M 5M 75M 100M 90M 90M 99M 90M 0#
15 16
0 0N# NaN
If you slice your column range so it stops before the last (empty) column, then it works:
In [421]:
for col in df.columns[2:-1]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
df
Out[421]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 'Name' 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN
Alternatively you can just select the cols that are object dtype and run the code (skipping the first col as this is the 'Name' entry):
In [428]:
for col in df.select_dtypes([np.object]).columns[1:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
df
Out[428]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 'Name' 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN
I got this error while working in Eclipse. It turned out that the project interpreter had somehow been reset to Python 2.7 (after an update, I believe). Setting it back to Python 3.6 resolved the issue.
While I know this is not a solution to the problem posed here, I thought it might be useful for others who reach this page after searching for this error.
In this case we have to use the str.replace() method on that series, but first we have to convert it to str type:
# df1.Patient contains values like 's125', 's45', 's588', 's244', 's125', 's123'
df1 = pd.read_csv("C:\\Users\\Gangwar\\Desktop\\competitions\\cancer prediction\\kaggle_to_students.csv")
df1['Patient'] = df1['Patient'].astype(str)
df1['Patient'] = df1['Patient'].str.replace('s', '').astype(int)
