Pandas appending .0 to a number - python

I'm having an issue with pandas that has me a little baffled. I have a file with a lot of numeric values that do not need calculations. Most of them come out just fine, but a couple are getting ".0" appended to the end.
Here is a sample input file:
Id1 Id2 Age Id3
"SN19602","1013743", "24", "23523"
"SN20077","2567897", "28", "24687"
And the output being generated:
Id1 Id2 Age Id3
"SN19602","1013743.0", "24", "23523"
"SN20077","2567897.0", "28", "24687"
Can anyone explain why some, but not all, of the numeric values are getting .0 appended, and whether there is any way to prevent it? It causes a problem in the next step of my process, which consumes the CSV output.
I have tried converting the data frame, and the column itself, to a string, but it made no impact. Ideally I do not want to list each column to convert, because I have a very large number of columns and would have to manually go through the output file to figure out which ones are getting the .0 appended and code for them. Any suggestions appreciated.
import pandas as pd
import csv
df_inputFile = pd.read_csv("InputFile.csv")
df_mappingFile = pd.read_csv("MappingFile.csv")
df_merged = df_inputFile.merge(df_mappingFile, left_on="Id", right_on="Id", how="left")
#This isn't affecting the output
df_merged.astype(str)
df_merged.to_csv("Output.csv", index=False, quoting=csv.QUOTE_ALL)

pandas.DataFrame.to_csv has a parameter float_format, which takes a regular float formatting string. This should work:
df_merged.to_csv("Output.csv", index=False, quoting=csv.QUOTE_ALL, float_format='%.0f')

Pandas is probably inferring the dtype of that column as float, which is why .0 is being appended to the data. You can use dtype=object in pd.read_csv:
df_inputFile = pd.read_csv("InputFile.csv", dtype=object)
This makes pandas read and treat all columns as strings.
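If you would rather not force every column to string, dtype also accepts a per-column mapping; a minimal sketch using the Id2 column from the question:
df_inputFile = pd.read_csv("InputFile.csv", dtype={"Id2": str})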

I like loops. They are slow, but easy to understand. The logic is simple to follow, and it also allows different formatting/decimals for each column.
Something like:
final_out = open("Output.txt", 'w')
for index, row in df.iterrows():
    # Format each value with no decimal places; the column names are examples.
    print('{:.0f}'.format(row['A']), '{:.0f}'.format(row['B']), '{:.0f}'.format(row['C']), sep=",", file=final_out)
final_out.close()
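If speed ever matters, here is a vectorized sketch of the same idea (the column names A, B, C are placeholders, as in the loop above; note that on pandas 2.1+ applymap has been renamed DataFrame.map):
df[['A', 'B', 'C']].applymap('{:.0f}'.format).to_csv("Output.txt", index=False, header=False)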
I think the best/fastest way to do this is with something like tabulate or a pretty printer.
First convert your dataframe to an array; this is easy:
array = df.values
Then you can use something neat like tabulate:
from tabulate import tabulate as tb
final_out = open("Output.txt", 'w')
print(tb(array, numalign="right", floatfmt=".0f"), file=final_out)
final_out.close()
You can read up a little more on tabulate or pretty printers; the above is a contextual example to get you started.
Similar to the loop above, tabulate allows a separator, which could be a comma; see https://pypi.python.org/pypi/tabulate under "Usage of the command line utility".
Pretty sure a pretty printer can do this too, and it could very well be a better choice.
Both of these use the new Python print function. If you use Python 2.7, you will need this nifty little statement as the first non-comment line in your script:
from __future__ import print_function

I recently faced this issue. In my case, a column similar to the Id2 column in the question had an empty cell that pandas interpreted as NaN, and all the other cells of that column got a trailing .0.
Reading the file with keep_default_na=False helps to avoid those trailing .0.
my_df = pd.read_csv("data.csv", keep_default_na=False)
P.S.: I know this is rather a late answer, but this worked for me without enforcing data types while reading the data or having to use float_format.
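For anyone wondering why an empty cell causes this: a pandas integer column cannot hold NaN (at least without the newer nullable integer dtypes), so the whole column gets upcast to float. A minimal demonstration:
import pandas as pd
print(pd.Series([1013743, 2567897]).dtype)  # int64
print(pd.Series([1013743, None]).dtype)     # float64 -- hence the trailing .0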

Related

python/pandas : Pandas changing the value adding extra digits in values [duplicate]

I have a csv file containing numerical values such as 1524.449677. There are always exactly 6 decimal places.
When I import the csv file via pandas read_csv, the column (among others) automatically gets the datatype object. My issue is that the values are shown as 2470.6911370000003 when they actually should be 2470.691137, or the value 2484.30691 is shown as 2484.3069100000002.
This seems to be a datatype issue in some way. I tried to explicitly provide the data type when importing via read_csv by giving the dtype argument as {'columnname': np.float64}. Still the issue did not go away.
How can I get the values imported and shown exactly as they are in the source csv file?
Pandas uses a dedicated decimal-to-binary converter that sacrifices accuracy for speed.
Passing float_precision='round_trip' to read_csv fixes this.
Check out this page for more detail on this.
After processing your data, if you want to save it back to a csv file, you can pass float_format="%.nf" (where n is the number of decimal places) to the corresponding method.
A full example:
import pandas as pd
df_in = pd.read_csv(source_file, float_precision='round_trip')
df_out = ... # some processing of df_in
df_out.to_csv(target_file, float_format="%.3f") # for 3 decimal places
I realise this is an old question, but maybe this will help someone else:
I had a similar problem, but couldn't quite use the same solution. Unfortunately, the float_precision option only exists with the C engine, not with the python engine. So if you have to use the python engine for some other reason (for example, because the C engine can't deal with regex literals as delimiters), this little "trick" worked for me:
In the pd.read_csv arguments, pass dtype='str' and then convert your dataframe to whatever dtype you want, e.g. df = df.astype('float64').
Bit of a hack, but it seems to work. If anyone has any suggestions on how to solve this in a better way, let me know.

Any idea how to import this data set?

I have the following dataset:
https://github.com/felipe0216/fdfds/blob/master/sheffield_weather_station.csv
I know it is a csv, so I can use the pd.read_csv() function. However, if you look at the file, it contains neither comma-separated nor tab-separated values, and I am not sure what separator it uses exactly. Any ideas about how to open it?
The proper way to do this is as follows:
df = pd.read_csv("sheffield_weather_station.csv", comment="#", delim_whitespace=True)
You should first download the .csv file. You have to tell pandas that there are comment lines and that the columns are separated by whitespace; the number of spaces does not matter.
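Note that delim_whitespace has since been deprecated in newer pandas releases (2.2+); assuming the same file, a regex separator should behave equivalently:
df = pd.read_csv("sheffield_weather_station.csv", comment="#", sep=r"\s+")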

How to make read_csv more flexible with numbers and whitespaces

I want to read a txt file with pandas, and the problem is that the separator/delimiter consists of a number followed by a minimum of two blanks.
I already tried something similar to this code (How to make separator in pandas read_csv more flexible wrt whitespace?):
pd.read_csv("whitespace.txt", header=None, delimiter=r"\s+")
This only works when the separator is one blank or more. So I adjusted it to the following code:
delimiter=r"\d\s\s+"
But this separates my dataframe whenever it sees two blanks or more, whereas I strictly need the number before them, followed by at least two blanks. Does anyone have an idea how to fix this?
My data looks as follows:
I am an example of a dataframe
I have Problems to get read
100,00
So How can I read it
20,00
so the first row should be:
I am an example of a dataframe I have Problems to get read 100,00
followed by the second row:
So How can I read it 20,00
I'd try it like this.
I'd manipulate the text file before attempting to parse it into a dataframe, as follows:
import pandas as pd
import re
# Read the whole file and join the broken lines back together.
with open("whitespace.txt", "r") as f:
    g = f.read().replace("\n", " ")
prepared_text = re.sub(r'(\d+,\d+)', r'\1#', g)
df = pd.DataFrame({'My columns':prepared_text.split('#')})
print(df)
This gives the following:
My columns
0 I am an example of a dataframe I have Problems...
1 So How can I read it 20,00
2
I guess this would suffice as long as the input file isn't too large, but using the re module with substitution gives you the control you seek.
The (\d+,\d+) parentheses mark a group we want to match: basically, any of the numbers in your text file.
Then we use \1, which is called a backreference, to refer to the matched group when specifying the replacement. So each match of \d+,\d+ is replaced by itself followed by #.
Then we use the inserted character as a delimiter.
There are some good examples here:
https://lzone.de/examples/Python%20re.sub
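As a quick illustration of the substitution on a single made-up string:
import re
# Each "digits,digits" group is matched and replaced by itself plus a trailing "#".
print(re.sub(r'(\d+,\d+)', r'\1#', "get read 100,00  So How can I read it 20,00"))
# get read 100,00#  So How can I read it 20,00#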


pandas read_csv is putting all values in one column and one row

I've sought an answer on multiple forums and YouTube, but to no avail; sorry in advance if it is widely available and my keywords just weren't right.
I'm attempting to execute a simple pandas.read_csv('.csv', sep=','). However, the output I'm receiving is not splitting the data into multiple columns as I imagine it should.
I'm getting back all of my headers in one row, separated by commas. The same is true for each line item tied to the respective headers.
I've tried setting this data up in a dataframe, manipulating the headers, and manually adding the headers, with no success.
For better understanding, I've copied and pasted what I'm seeing from an IPython notebook:
In [15]:
import pandas as pd
pd.read_csv('C:\Users\Dale\Desktop\ShpData\TrackerTW0.csv',sep=',')
Out[15]:
PurchaseOrderNumber,ShipmentFinalDestinationCity,TransferPointCity,POType,PlannedMode,ProgramType,FreightPaymentTerms,ContainerNumber,BL/AWB#,Mode,ShipmentFinalDestinationLocation,CarrierSCAC,Carrier,Forwarder,BrandDesc,POLCity,PODCity,InDCOutlookDate,InDCOriginalDate,AnticipatedShipDate,PlannedStockedDate,ExFactoryActualDate(LT),OriginConsolActualDate(LT),DepartLoadPortActualDate(LT),FullOutGatefromOceanTerminal(CYorPort)ActualDate(LT),DPArrivalActualDate(LT),FreightAvailableActualDate(LT),DestConsolActualDate(LT),DomDepartActualDate(LT),YardArrivalActualDate(LT),CarrierDropActualDate(LT),InDCActualDate(LT),StockedActualDate(LT),Vessel,VesselETADischargePortCity,DPArrivalOutlookDate,VesselETADischargePortActualDate(LT),FullOutGatefromOceanTerminal(CYorPort)OutlookDate,StockedOutlookDate,ShipmentLeg#,Metrics,TotalShippedQty
0 1251708,Rugby,Tuticorin,Initial Order,Ocean,Re...
1 1262597,Rugby,Hong Kong,Initial Order,Ocean,Re...
Thanks
You might want to try this; you have around 40 columns.
import pandas as pd
df = pd.read_csv('input.csv', names=['PurchaseOrderNumber','ShipmentFinalDestinationCity','TransferPointCity','POType','PlannedMode','ProgramType','FreightPaymentTerms','ContainerNumber','BL/AWB#','Mode','ShipmentFinalDestinationLocation','CarrierSCAC','Carrier','Forwarder','BrandDesc','POLCity','PODCity','InDCOutlookDate','InDCOriginalDate','AnticipatedShipDate','PlannedStockedDate','ExFactoryActualDate(LT)','OriginConsolActualDate(LT)','DepartLoadPortActualDate(LT)','FullOutGatefromOceanTerminal(CYorPort)ActualDate(LT)','DPArrivalActualDate(LT)','FreightAvailableActualDate(LT)','DestConsolActualDate(LT)','DomDepartActualDate(LT)','YardArrivalActualDate(LT)','CarrierDropActualDate(LT)','InDCActualDate(LT)','StockedActualDate(LT)','Vessel','VesselETADischargePortCity','DPArrivalOutlookDate','VesselETADischargePortActualDate(LT)','FullOutGatefromOceanTerminal(CYorPort)OutlookDate','StockedOutlookDate','ShipmentLeg#','Metrics','TotalShippedQty'])
print(df)
Recently, I wanted to process a csv file with code like this:
data = pd.read_csv(dir, sep=" ")
print(data)
The output also put all the values in one row. Then I just used the default sep value, and the problem was solved:
data = pd.read_csv(dir, sep=",")
The situation seems different from the one the asker raised, but I hope it's helpful for some other friends like me. This is my first comment; I hope it's not too bad!
It may not be the best option, but it works!
Read the file as it is:
df = pd.read_csv('input.csv')
Get all the column names and assign them to a variable:
names = df.columns.str.split(',').tolist()
Split all the values by ',':
df = df.iloc[:, 0].str.split(',', expand=True)
Finally, assign 'names' to the column names, and that's it!
df.columns = names
I was also having the same issue: all of the columns were coming out as one value. The following worked for me.
df = pd.read_csv('/content/Reviews.csv',
sep=',',
error_bad_lines=False,
engine='python')
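One caveat: error_bad_lines was deprecated in pandas 1.3 and later removed; on current versions the equivalent option, as far as I know, is on_bad_lines:
df = pd.read_csv('/content/Reviews.csv',
                 sep=',',
                 on_bad_lines='skip',
                 engine='python')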
