I have files of the below format in a text file which I am trying to read into a pandas dataframe.
895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|0.4089938425|0.8224291972|0.8652525793|0.6829942860|0.5139162227|
As you can see, there are 10 digits after the decimal point in the input file.
df = pd.read_csv('mockup.txt',header=None,delimiter='|')
When I read it into a DataFrame, I do not get the last 4 digits:
df[5].head()
0 0.467798
1 0.258165
2 0.860384
3 0.803388
4 0.249820
Name: 5, dtype: float64
How can I get the complete precision as present in the input file? I have some matrix operations that need to be performed, so I cannot cast the values to strings.
I figured out that I have to do something with dtype, but I am not sure where I should use it.
It is only a display problem; see the docs on display options:
# temporarily set the display precision
with pd.option_context('display.precision', 10):
    print(df)
0 1 2 3 4 5 6 7 \
0 895 2015-4-23 19 10000 LA 0.4677978806 0.477346934 0.4089938425
8 9 10 11 12
0 0.8224291972 0.8652525793 0.682994286 0.5139162227 NaN
EDIT: (Thank you Mark Dickinson):
Pandas uses a dedicated decimal-to-binary converter that sacrifices perfect accuracy for the sake of speed. Passing float_precision='round_trip' to read_csv fixes this. See the documentation for more.
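A minimal sketch of both points above, using the sample row from the question fed through io.StringIO instead of the mockup.txt file (that substitution is an assumption made so the snippet is self-contained). It reads the row with float_precision='round_trip' and then checks the parsed value against the literal in the file:

```python
import io

import pandas as pd

# The sample row from the question; note the trailing '|' produces an extra empty column.
raw = (
    "895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|0.4089938425|"
    "0.8224291972|0.8652525793|0.6829942860|0.5139162227|\n"
)

# round_trip asks the parser to use the slower, exact string-to-float conversion.
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    delimiter="|",
    float_precision="round_trip",
)

value = float(df[5].iloc[0])
print(repr(value))  # repr shows the shortest string that round-trips to the stored float

# The display option only affects printing, not the stored values.
with pd.option_context("display.precision", 10):
    print(df[5])
```

The stored float64 already carries more digits than the default display shows, so no dtype change is needed for the matrix operations.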
For a current project, I am planning to run a scikit-learn Stochastic Gradient Booster algorithm over a CSV set that includes numerical data.
When the script calls sgbr.fit(X_train, y_train), however, I receive a ValueError: could not convert string to float: with no further detail about which value cannot be converted.
I assume that this error is not related to the Python code itself but rather the CSV input. I have however already checked the CSV file to confirm all sections exclusively include floats:
Does anyone have an idea why the ValueError is appearing without further positional indication?
I think there is no direct function that reports the position.
You can try converting with pd.to_numeric and errors='coerce'; values that cannot be parsed become NaN:
print (df)
column
0 01
1 02
2 03
3 04
4 05
5 LS
print (pd.to_numeric(df['column'], errors='coerce'))
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 NaN
Name: column, dtype: float64
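Building on the coercion above, one way to get the positional indication the question asks for is to compare the coerced result with the original column: rows that were not missing before but are NaN afterwards are the ones that failed to parse. This is only a sketch on the toy column from the answer; the names converted and bad are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"column": ["01", "02", "03", "04", "05", "LS"]})

# Unparseable strings become NaN instead of raising ValueError.
converted = pd.to_numeric(df["column"], errors="coerce")

# Rows that had a value originally but failed to convert.
bad = df.loc[converted.isna() & df["column"].notna(), "column"]
print(bad)
```

Running the same check on each column of the real CSV should show which rows and values scikit-learn is choking on.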
Note: Contrived example. Please don't hate on forecasting and I don't need advice on it. This is strictly a Pandas how-to question.
Example - One Solution
I have two different sized DataFrames, one representing sales and one representing a forecast.
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
The forecast needs to be with the latest sales, which is at the end of the list of sales numbers [5, 6, 7, 5]. Other times, I might want it at other locations (please don't ask why, I just need it this way).
This works:
df = pd.concat([sales, forecast], ignore_index=True, axis=1)
df.columns = ['sales', 'forecast'] # Not necessary, making next command pretty
df.forecast = df.forecast.shift(len(sales) - len(forecast))
This gives me the desired outcome:
Question
What I want to know is: Can I concatenate to the end of the sales data without performing the additional shift (the last command)? I'd like to do this in one step instead of two. concat or something similar is fine, but I'd like to skip the shift.
I'm not hung up on having two lines of code. That's okay. I want a solution with the maximum possible performance. My application is sensitive to every millisecond we throw at it on account of huge volumes.
Not sure if that is much faster but you could do
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
forecast.index = sales.index[-forecast.shape[0]:]
which gives
forecast
6 5.0
7 5.5
8 6.0
9 5.0
and then simply
pd.concat([sales, forecast], axis=1)
yielding the desired outcome:
sales forecast
0 5 NaN
1 3 NaN
2 5 NaN
3 6 NaN
4 4 NaN
5 4 NaN
6 5 5.0
7 6 5.5
8 7 6.0
9 5 5.0
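A one-line solution using the same idea, as mentioned by @Dark in the comments, would be (the inplace argument of set_axis was removed in pandas 2.0, and returning a new object is the default, so it is omitted here):
pd.concat([sales, forecast.set_axis(sales.index[-len(forecast):])], axis=1)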
giving the same output.
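The question also mentions wanting the forecast at other locations. The same re-indexing idea extends to that case; the sketch below wraps it in a hypothetical helper, place_forecast, whose name and start parameter are made up for illustration. No performance claim is made here, so it is worth timing against the shift version on real data.

```python
import pandas as pd

sales = pd.DataFrame({"sales": [5, 3, 5, 6, 4, 4, 5, 6, 7, 5]})
forecast = pd.DataFrame({"forecast": [5, 5.5, 6, 5]})


def place_forecast(sales, forecast, start):
    """Align forecast with the rows of sales beginning at position start (hypothetical helper)."""
    # Borrow the index labels of the target rows, then let concat align on the index.
    target = sales.index[start:start + len(forecast)]
    return pd.concat([sales, forecast.set_axis(target)], axis=1)


# Place the forecast against rows 2..5 instead of the last four rows.
result = place_forecast(sales, forecast, start=2)
print(result)
```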
Not duplicate because I'm asking about pandas round().
I have a dataframe with some columns with numbers. I run
df = df.round(decimals=6)
That successfully shortened the long decimals, correctly writing 15.367857 instead of 15.36785699998, but I still get values such as 1.0 or 16754.0 with a trailing zero.
How do I get rid of the trailing zeros in all the columns, once I ran pandas df.round() ?
I want to save the dataframe as a csv, and need the data to show the way I wish.
df = df.round(decimals=6).astype(object)
Converting to object will allow mixed representations. But, keep in mind that this is not very useful from a performance standpoint.
df
A B
0 0.149724 -0.770352
1 0.606370 -1.194557
2 10.000000 10.000000
3 10.000000 10.000000
4 0.843729 -1.571638
5 -0.427478 -2.028506
6 -0.583209 1.114279
7 -0.437896 0.929367
8 -1.025460 1.156107
9 0.535074 1.085753
df.round(6).astype(object)
A B
0 0.149724 -0.770352
1 0.60637 -1.19456
2 10 10
3 10 10
4 0.843729 -1.57164
5 -0.427478 -2.02851
6 -0.583209 1.11428
7 -0.437896 0.929367
8 -1.02546 1.15611
9 0.535074 1.08575
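Since the goal is a CSV file, another option is to turn the numbers into trimmed strings just before writing. This swaps in a different technique from the astype(object) answer above: a hypothetical helper, trim, formats each value to 6 decimals and strips trailing zeros. It assumes the columns are all numeric with no missing values (NaN would be written as the text nan).

```python
import pandas as pd


def trim(x):
    """Format to 6 decimals, then drop trailing zeros and a dangling decimal point."""
    return f"{x:.6f}".rstrip("0").rstrip(".")


df = pd.DataFrame({"A": [15.36785699998, 1.0], "B": [16754.0, -0.5]})

# Apply the formatter element by element; the result holds strings, not floats.
trimmed = df.apply(lambda col: col.map(trim))

csv_text = trimmed.to_csv(index=False)
print(csv_text)
```

The resulting frame is only suitable for output, so keep the numeric frame around for any further computation.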
I am trying to solve one of Coursera's homework assignments for beginners.
I have read the data and tried to convert it as shown in the code below. I am looking for the frequency distribution of the variables in question, which is why I am trying to round the values. I tried several methods, but none gives me what I expect (see below).
import pandas as pd
import numpy as np
# load the data file
data = pd.read_csv('gapminder-2.csv', low_memory=False)
# number of observations (rows)
print(len(data))
# number of variables (columns)
print(len(data.columns))
# convert_objects() has been removed from pandas; pd.to_numeric(errors='coerce') replaces it
sub1 = pd.DataFrame({'income': pd.to_numeric(data['incomeperperson'], errors='coerce'),
                     'alcohol': pd.to_numeric(data['alcconsumption'], errors='coerce'),
                     'suicide': pd.to_numeric(data['suicideper100th'], errors='coerce')})
sub1.apply(pd.Series.round)
income = sub1['income'].value_counts(sort=False)
print(income)
However, I got
285.224449 1
2712.517199 1
21943.339898 1
1036.830725 1
557.947513 1
What I expect:
285 1
2712 1
21943 1
1036 1
557 1
You can use Series.round():
ser = pd.Series([1.1,2.1,3.1,5.1])
print(ser)
0 1.1
1 2.1
2 3.1
3 5.1
dtype: float64
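From here you can use .round(); its decimals argument defaults to 0 per the docs. The result is still float64, so current pandas versions print it with a trailing .0:
print(ser.round())
0 1.0
1 2.0
2 3.0
3 5.0
dtype: float64
round() returns a new Series, so to keep the result you need to reassign it: ser = ser.round().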
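One detail worth checking against the expected output in the question: values such as 2712 and 557 look truncated rather than rounded (rounding 2712.517199 gives 2713). The sketch below shows both on the five sample values from the question and converts to integers so the .0 disappears; it assumes there are no missing values, since a plain int cast fails on NaN.

```python
import numpy as np
import pandas as pd

income = pd.Series([285.224449, 2712.517199, 21943.339898, 1036.830725, 557.947513])

# Round to the nearest whole number, then cast so values print without ".0".
rounded = income.round().astype(int)

# Drop the fractional part instead, which matches the expected output in the question.
truncated = np.trunc(income).astype(int)

print(rounded.value_counts(sort=False))
print(truncated.value_counts(sort=False))
```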
I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.
In [19]: status
Out[19]:
seconds questions
0 751479 9005591
1 751539 9207129
2 751599 9208994
3 751659 9210429
4 751719 9211944
5 751779 9213287
6 751839 9214916
7 751899 9215924
8 751959 9216676
9 752019 9217533
I need the change in percent of 'questions' column and then sort on it. This does not work:
status.pct_change('questions').sort('questions').head()
Any suggestions?
Try this way instead:
>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
seconds questions change
1 751539 9207129 0.022379
2 751599 9208994 0.000203
6 751839 9214916 0.000177
4 751719 9211944 0.000164
3 751659 9210429 0.000156
5 751779 9213287 0.000146
7 751899 9215924 0.000109
9 752019 9217533 0.000093
8 751959 9216676 0.000082
0 751479 9005591 NaN
pct_change can be performed on Series as well as DataFrames and accepts an integer argument for the number of periods you want to calculate the change over (the default is 1).
I've also assumed that you want to sort on the 'change' column with the greatest percentage changes showing first...
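For reference, here is a self-contained version of the steps above, rebuilding the status frame from the values in the question. It also shows the periods argument; the names top and two_step are illustrative only.

```python
import pandas as pd

status = pd.DataFrame({
    "seconds": [751479, 751539, 751599, 751659, 751719,
                751779, 751839, 751899, 751959, 752019],
    "questions": [9005591, 9207129, 9208994, 9210429, 9211944,
                  9213287, 9214916, 9215924, 9216676, 9217533],
})

# Fractional change from the previous row; the first row has no predecessor, so it is NaN.
status["change"] = status["questions"].pct_change()

# Largest changes first; NaN rows are placed last by default.
ranked = status.sort_values("change", ascending=False)
top = ranked.head(3)
print(top)

# Change over two rows instead of one.
two_step = status["questions"].pct_change(periods=2)
```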