Not duplicate because I'm asking about pandas round().
I have a dataframe with some columns with numbers. I run
df = df.round(decimals=6)
That correctly rounds the long decimals (15.36785699998 becomes 15.367857), but whole numbers still come out as 1.0 or 16754.0, with a trailing zero.
How do I get rid of the trailing zeros in all the columns after running df.round()?
I want to save the dataframe as a CSV, and I need the values written exactly this way.
df = df.round(decimals=6).astype(object)
Converting to object allows mixed representations. Keep in mind, though, that object columns are slower to work with than float columns.
df
A B
0 0.149724 -0.770352
1 0.606370 -1.194557
2 10.000000 10.000000
3 10.000000 10.000000
4 0.843729 -1.571638
5 -0.427478 -2.028506
6 -0.583209 1.114279
7 -0.437896 0.929367
8 -1.025460 1.156107
9 0.535074 1.085753
df.round(6).astype(object)
A B
0 0.149724 -0.770352
1 0.60637 -1.19456
2 10 10
3 10 10
4 0.843729 -1.57164
5 -0.427478 -2.02851
6 -0.583209 1.11428
7 -0.437896 0.929367
8 -1.02546 1.15611
9 0.535074 1.08575
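Since the goal is a CSV file, another option is to format each value as text just before writing. The following is only a sketch under that assumption: strip_trailing_zeros is a hypothetical helper (not part of pandas), and the small frame is made-up sample data.

```python
import io

import pandas as pd


def strip_trailing_zeros(value, decimals=6):
    """Format a number with at most `decimals` places and no trailing zeros."""
    if pd.isna(value):
        return ""  # missing values become empty CSV fields
    text = f"{value:.{decimals}f}"
    # Only strip when there is a decimal point, so "100" is never shortened
    return text.rstrip("0").rstrip(".") if "." in text else text


# Made-up sample data for illustration
df = pd.DataFrame({"A": [15.36785699998, 1.0], "B": [16754.0, 0.5]})

# Column-wise Series.map works across pandas versions
as_text = df.apply(lambda col: col.map(strip_trailing_zeros))

buffer = io.StringIO()  # stands in for a real file path
as_text.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting frame holds strings, so it is meant for output only; keep the numeric frame for any further calculations.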
I want to convert my dataset to a numeric type, but I can't because some values contain two dots. I'm using df.apply(pd.to_numeric). The error I get is:
ValueError: Unable to parse string "1.232.2" at position 1
My dataset looks like this:
Price Value
1.232.2 1.235.3
2.345.2 1.234.2
3.343.5 5.433.3
I need to remove the first dot. For example:
Price Value
1232.2 1235.3
2345.2 1234.2
3343.5 5433.3
Any help is appreciated. Thank you.
Here's a way to do this.
Convert string to float format (multiple dots to single dot)
A regex handles this.
Regex: '\.(?=.*\.)'
Explanation:
\. --> matches a literal .
(?=.*\.) --> lookahead: only match if another . follows later in the string, so every . except the last one matches
Each match is replaced with ''
The code for this is:
df['Price'] = df['Price'].str.replace(r'\.(?=.*\.)', '', regex=True)
df['Value'] = df['Value'].str.replace(r'\.(?=.*\.)', '', regex=True)
If you also want to convert the result to numeric, wrap the call in pd.to_numeric():
df['Price'] = pd.to_numeric(df['Price'].str.replace(r'\.(?=.*\.)', '', regex=True))
df['Value'] = pd.to_numeric(df['Value'].str.replace(r'\.(?=.*\.)', '', regex=True))
The output of this will be:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 1232.2 1235.3
1 2345.2 1234.2
2 3343.5 5433.3
3 123.45 45625.5
4 0.825 00.0
5 000.2 55.5
6 1234 4567
7 NaN NaN
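The pd.to_numeric() version of the solution will look like this:
After Cleansing DataFrame:
Note: the Price column is printed with 3 decimal places because one of its values has 3 decimal places. This is only how pandas displays a float column; the stored values are not changed.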
Price Value
0 1232.200 1235.3
1 2345.200 1234.2
2 3343.500 5433.3
3 123.450 45625.5
4 0.825 0.0
5 0.200 55.5
6 1234.000 4567.0
7 NaN NaN
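For a version that can be run on its own, here is a short sketch of the same regex approach. The sample rows are a subset of the data shown above, and the name ALL_BUT_LAST_DOT is only introduced for this illustration.

```python
import numpy as np
import pandas as pd

# Matches every '.' that still has another '.' somewhere after it
ALL_BUT_LAST_DOT = r"\.(?=.*\.)"

df = pd.DataFrame({
    "Price": ["1.232.2", "123.45", "1234", np.nan],
    "Value": ["1.235.3", "456.25.5", "4567", np.nan],
})

# Remove the extra dots column by column; NaN entries pass through untouched
cleaned = df.apply(lambda col: col.str.replace(ALL_BUT_LAST_DOT, "", regex=True))

# Convert the cleaned strings to floats
numeric = cleaned.apply(pd.to_numeric)
print(numeric)
```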
Discard data if more than one period (.) in data
To process every column in the dataframe, use applymap() (deprecated since pandas 2.1 in favor of DataFrame.map()); to process a specific column, use apply(). Use pd.isnull() to detect NaN so those values are skipped.
The code below handles NaN, numbers without decimal places, numbers with one period, and numbers with multiple periods. It assumes the data in the columns are either NaNs or strings made of digits and periods, with no letters or other non-digit characters. If you need the code to check for digits only, let me know.
The code also assumes that you want to discard the leading numbers. If you want to concatenate the numbers instead, a different solution is needed. For example, 1.2345.67 becomes 2345.67 and the 1 is discarded; 1.2.3.4.5 becomes 4.5 and 1.2.3 is discarded. If this is NOT what you want, the code needs to change.
You can do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Price': ['1.232.2', '2.345.2', '3.343.5', '123.45', '0.825', '0.0.0.2', '1234', np.nan],
                   'Value': ['1.235.3', '1.234.2', '5.433.3', '456.25.5', '0.0.0', '5.5.5', '4567', np.nan]})
print(df)
def remove_dots(x):
    # Leave NaN alone; otherwise keep only the last two dot-separated parts
    return x if pd.isnull(x) else '.'.join(x.rsplit('.', 2)[-2:])
df = df.applymap(remove_dots)
print (df)
The output of this will be:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 232.2 235.3
1 345.2 234.2
2 343.5 433.3
3 123.45 25.5
4 0.825 0.0
5 0.2 5.5
6 1234 4567
7 NaN NaN
If you want to change specific columns only, then you can use apply.
df['Price'] = df['Price'].apply(lambda x: x if pd.isnull(x) else '.'.join(x.rsplit('.',2)[-2:]))
df['Value'] = df['Value'].apply(lambda x: x if pd.isnull(x) else '.'.join(x.rsplit('.',2)[-2:]))
print(df)
The before and after output is the same as above:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 232.2 235.3
1 345.2 234.2
2 343.5 433.3
3 123.45 25.5
4 0.825 0.0
5 0.2 5.5
6 1234 4567
7 NaN NaN
I need to fill each NA value with the mean of the three values that precede it.
This is my dataset:
RECEIPT_MONTH_YEAR NET_SALES
0 2014-01-01 818817.20
1 2014-02-01 362377.20
2 2014-03-01 374644.60
3 2014-04-01 NA
4 2014-05-01 NA
5 2014-06-01 NA
6 2014-07-01 NA
7 2014-08-01 46382.50
8 2014-09-01 55933.70
9 2014-10-01 292303.40
10 2014-10-01 382928.60
Is this dataset a .csv file or a dataframe? Is the NA a NaN or a string?
import pandas as pd
import numpy as np
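df = pd.read_csv('your dataset', sep=' ')
df = df.replace('NA', np.nan)  # replace() returns a new frame, so assign it back
df = df.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas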
You mention the mean of 3 values, but the code above simply forward-fills the last observation before the NaNs begin. This is often a good approach for forecasting (better than taking means in some cases, if persistence is important). To use the mean of the three values before the first NaN instead:
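# Run this instead of the forward fill above
ind = df['NET_SALES'].index[df['NET_SALES'].isna()]  # positions of the NaNs
Meanof3 = df['NET_SALES'].iloc[ind[0]-3:ind[0]].mean(skipna=True)  # mean of the 3 rows before the first NaN
df['NET_SALES'] = df['NET_SALES'].fillna(Meanof3)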
The answer could be generalised and improved with more information about the dataset, for example whether you always want the mean of the last 3 measurements before any NA. The code above finds the indices that are NaN and then takes the mean of the 3 values before the first one, ignoring any NaNs.
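This is simple, but it works:
df_data.fillna(0, inplace=True)
for i in range(0, len(df_data)):
    # 0 marks a former NaN; assumes a default integer index and at least 3 rows before the first gap
    if df_data['NET_SALES'][i] == 0.00:
        condtn = df_data['NET_SALES'][i-1] + df_data['NET_SALES'][i-2] + df_data['NET_SALES'][i-3]
        # .loc avoids the chained-assignment problem of df_data['NET_SALES'][i] = ...
        df_data.loc[i, 'NET_SALES'] = condtn / 3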
You could use fillna (assuming that your NA is already np.nan) and rolling mean:
import pandas as pd
import numpy as np
df = pd.DataFrame([818817.2,362377.2,374644.6,np.nan,np.nan,np.nan,np.nan,46382.5,55933.7,292303.4,382928.6], columns=["NET_SALES"])
df["NET_SALES"] = df["NET_SALES"].fillna(df["NET_SALES"].shift(1).rolling(3, min_periods=1).mean())
Out:
NET_SALES
0 818817.2
1 362377.2
2 374644.6
3 518613.0
4 368510.9
5 374644.6
6 NaN
7 46382.5
8 55933.7
9 292303.4
10 382928.6
If you want to include the imputed values I guess you'll need to use a loop.
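Such a loop could look roughly like the sketch below. It assumes the NA values are already np.nan and that each gap should be filled from up to three preceding values, including values that were just imputed; fill_with_trailing_mean is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd


def fill_with_trailing_mean(series, window=3):
    """Fill each NaN with the mean of up to `window` preceding values,
    counting values that were imputed earlier in the same pass."""
    values = series.astype(float).tolist()
    for i, value in enumerate(values):
        if np.isnan(value):
            previous = [p for p in values[max(0, i - window):i] if not np.isnan(p)]
            if previous:  # leave the NaN if nothing precedes it
                values[i] = sum(previous) / len(previous)
    return pd.Series(values, index=series.index, name=series.name)


df = pd.DataFrame(
    [818817.2, 362377.2, 374644.6, np.nan, np.nan, np.nan, np.nan,
     46382.5, 55933.7, 292303.4, 382928.6],
    columns=["NET_SALES"],
)
filled = fill_with_trailing_mean(df["NET_SALES"])
print(filled)
```

Unlike the rolling version, this leaves no NaN at index 6, because the values imputed at indices 3 to 5 feed into the next mean.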
I have text files in the format below, which I am trying to read into a pandas dataframe.
895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|0.4089938425|0.8224291972|0.8652525793|0.6829942860|0.5139162227|
As you can see, there are 10 digits after the decimal point in the input file.
df = pd.read_csv('mockup.txt',header=None,delimiter='|')
When I read it into a dataframe, I do not get the last 4 digits:
df[5].head()
0 0.467798
1 0.258165
2 0.860384
3 0.803388
4 0.249820
Name: 5, dtype: float64
How can I get the full precision present in the input file? I need to perform some matrix operations on the data, so I cannot cast it to string.
I figured out that I have to do something with dtype, but I am not sure where to use it.
It is only a display problem, see the docs:
# temporarily set display precision
with pd.option_context('display.precision', 10):
    print(df)
0 1 2 3 4 5 6 7 \
0 895 2015-4-23 19 10000 LA 0.4677978806 0.477346934 0.4089938425
8 9 10 11 12
0 0.8224291972 0.8652525793 0.682994286 0.5139162227 NaN
EDIT: (Thank you Mark Dickinson):
Pandas uses a dedicated decimal-to-binary converter that sacrifices perfect accuracy for the sake of speed. Passing float_precision='round_trip' to read_csv fixes this. See the documentation for more.
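A small sketch of that option, assuming a shortened copy of the sample line from the question held in memory (io.StringIO stands in for 'mockup.txt'):

```python
import io

import pandas as pd

# Shortened copy of the sample line; the trailing '|' yields an extra empty column
raw = "895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|\n"

df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    delimiter="|",
    float_precision="round_trip",  # parse floats the way Python's float() does
)

# Show more digits when printing; the stored values are unaffected by this option
with pd.option_context("display.precision", 10):
    print(df)
```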
I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.
In [19]: status
Out[19]:
seconds questions
0 751479 9005591
1 751539 9207129
2 751599 9208994
3 751659 9210429
4 751719 9211944
5 751779 9213287
6 751839 9214916
7 751899 9215924
8 751959 9216676
9 752019 9217533
I need the change in percent of 'questions' column and then sort on it. This does not work:
status.pct_change('questions').sort('questions').head()
Any suggestions?
Try this way instead:
>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
questions seconds change
0 9005591 751479 NaN
1 9207129 751539 0.022379
2 9208994 751599 0.000203
6 9214916 751839 0.000177
4 9211944 751719 0.000164
3 9210429 751659 0.000156
5 9213287 751779 0.000146
7 9215924 751899 0.000109
9 9217533 752019 0.000093
8 9216676 751959 0.000082
pct_change can be performed on Series as well as DataFrames and accepts an integer argument for the number of periods you want to calculate the change over (the default is 1).
I've also assumed that you want to sort on the 'change' column with the greatest percentage changes showing first...
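As a self-contained illustration, the sketch below rebuilds the first four rows of the question's data and also shows the periods argument. The column name change_2 is made up for this example, and na_position is spelled out so that the NaN row is placed last regardless of defaults.

```python
import pandas as pd

# First four rows of the data from the question
status = pd.DataFrame({
    "seconds": [751479, 751539, 751599, 751659],
    "questions": [9005591, 9207129, 9208994, 9210429],
})

# Change relative to the previous row (periods defaults to 1)
status["change"] = status["questions"].pct_change()

# Change relative to the row two steps back
status["change_2"] = status["questions"].pct_change(periods=2)

# Largest changes first; the first row has no predecessor, so its NaN goes last
ranked = status.sort_values("change", ascending=False, na_position="last")
print(ranked)
```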
I have an enormous timeseries of functions stored in a pandas dataframe in an HDF5 store and I want to make plots of a certain transform of every function in the timeseries. Since the number of plots is so large, and plotting them takes so long, I've used fork() and numpy.array_split() to break the indices up and run several plots in parallel.
Doing things this way means that every process has a copy of the whole timeseries. Since the number of processes I can run is limited by the total amount of memory I use, I would like each process to store only its own chunk of the dataframe.
How can I split up a pandas dataframe?
np.array_split works pretty well for this use case.
In [40]: df = DataFrame(np.random.randn(5,10))
In [41]: df
Out[41]:
0 1 2 3 4 5 6 7 8 9
0 -1.998163 -1.973708 0.461369 -0.575661 0.862534 -1.326168 1.164199 -1.004121 1.236323 -0.339586
1 -0.591188 -0.162782 0.043923 0.101241 0.120330 -1.201497 -0.108959 -0.033221 0.145400 -0.324831
2 0.114842 0.200597 2.792904 0.769636 -0.698700 -0.544161 0.838117 -0.013527 -0.623317 -1.461193
3 1.309628 -0.444961 0.323008 -1.409978 -0.697961 0.132321 -2.851494 1.233421 -1.540319 1.107052
4 0.436368 0.627954 -0.942830 0.448113 -0.030464 0.764961 -0.241905 -0.620992 1.238171 -0.127617
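Because the question is about giving each process only its own rows, a row-wise split may be what is needed. One way to sketch it, assuming it is acceptable to split the row positions with np.array_split and then slice with iloc (split_rows is a hypothetical helper name):

```python
import numpy as np
import pandas as pd


def split_rows(df, n_chunks):
    """Return a list of row-wise chunks of df, as evenly sized as possible."""
    # Split the row positions rather than the frame itself
    position_groups = np.array_split(np.arange(len(df)), n_chunks)
    return [df.iloc[positions] for positions in position_groups]


df = pd.DataFrame(np.random.randn(5, 10))
chunks = split_rows(df, 3)
for chunk in chunks:
    print(chunk.shape)
```

Each worker can then be handed one element of chunks instead of the full frame.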
array_split returns a list of 3 DataFrames here; the loop just pretty-prints them.
In [43]: for dfs in np.array_split(df,3,axis=1):
   ....:     print(dfs, "\n")
....:
0 1 2 3
0 -1.998163 -1.973708 0.461369 -0.575661
1 -0.591188 -0.162782 0.043923 0.101241
2 0.114842 0.200597 2.792904 0.769636
3 1.309628 -0.444961 0.323008 -1.409978
4 0.436368 0.627954 -0.942830 0.448113
4 5 6
0 0.862534 -1.326168 1.164199
1 0.120330 -1.201497 -0.108959
2 -0.698700 -0.544161 0.838117
3 -0.697961 0.132321 -2.851494
4 -0.030464 0.764961 -0.241905
7 8 9
0 -1.004121 1.236323 -0.339586
1 -0.033221 0.145400 -0.324831
2 -0.013527 -0.623317 -1.461193
3 1.233421 -1.540319 1.107052