Can I output large numbers to csv as strings - python

I have a txt file with several columns, some containing large numbers. When I read it in through Python and output it to a csv, the numbers change and I lose important information. Example of the txt file:
Identifier
12450006300638672
12450006300638689
12450006300638693
Example csv output:
Identifier Changed_format_in_csv
1.245E+16 12450006300638600
1.245E+16 12450006300638600
1.245E+16 12450006300638600
Is there a way I can get the file to output to a csv without it changing the large numbers? I have a lot of other columns with a mix of string and numeric data types, but I was thinking that if I could output everything as a string it would be fine.
This is what I've tried:
import pandas as pd
file1 = 'file.txt'
df = pd.read_csv(file1, sep="|", names=['Identifier'], index_col=False, dtype=str)
df.to_csv('file_new.csv', index=False)
I want the csv file to look like the txt file. I was hoping that setting dtype=str would help, but it doesn't. Any help would be appreciated.

Short story:
I think this problem is related to the data type pandas uses when interpreting the content of 'file.txt'.
You could try:
df = df.assign(Identifier=lambda x: x['Identifier'].astype(int))
Long story:
I created file.txt with this content:
12450006300638672
12450006300638689
12450006300638693
Using pandas v0.23.3, I couldn't reproduce your problem with your displayed code, as shown here:
>>> import pandas as pd
>>> df = pd.read_csv('file.txt', sep="|", names=['Identifier'], index_col=False, dtype=str)
>>> df.to_csv('file_new.csv', index=False)
>>> print(df)
Identifier
0 12450006300638672
1 12450006300638689
2 12450006300638693
>>> exit()
$ cat file_new.csv
Identifier
12450006300638672
12450006300638689
12450006300638693
But I could reproduce your problem using pd.read_csv(..., dtype=float) instead:
>>> import pandas as pd
>>> df = pd.read_csv('file.txt', sep="|", names=['Identifier'], index_col=False, dtype=float)
>>> df.to_csv('file_new.csv', index=False)
>>> print(df)
Identifier
0 1.245001e+16
1 1.245001e+16
2 1.245001e+16
>>> exit()
$ cat file_new.csv
Identifier
1.2450006300638672e+16
1.2450006300638688e+16
1.2450006300638692e+16
That seems to match your case: the integers are being interpreted as floats.
If for some reason you can't read them as integers from the start, you can convert them back as follows:
>>> import pandas as pd
>>> df = pd.read_csv('file.txt', sep="|", names=['Identifier'], index_col=False, dtype=float)
>>> print(df)
Identifier
0 1.245001e+16
1 1.245001e+16
2 1.245001e+16
>>> df = df.assign(Identifier=lambda x: x['Identifier'].astype(int))
>>> print(df)
Identifier
0 12450006300638672
1 12450006300638688
2 12450006300638692
>>> df.to_csv('file_new.csv', index=False)
>>> exit()
$ cat file_new.csv
Identifier
12450006300638672
12450006300638688
12450006300638692

It's not pandas that's changing the large numbers, it's the app you're using to view the CSV. To hint to CSV apps that those numbers should be treated as strings, make sure that they're quoted in the output:
import csv
df.to_csv('file_new.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)
It should look like this:
"Identifier"
"12450006300638672"
"12450006300638689"
"12450006300638693"
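Putting the two answers together, a minimal end-to-end sketch (the in-memory stand-in for file.txt is my assumption; with a real file, pass the filename instead):

```python
import csv
import io
import pandas as pd

# stand-in for file.txt, using the sample identifiers from the question
txt = "12450006300638672\n12450006300638689\n12450006300638693\n"

# read everything as strings so pandas never converts to float
df = pd.read_csv(io.StringIO(txt), sep="|", names=["Identifier"],
                 index_col=False, dtype=str)

# quote every non-numeric field; since the column is str, all cells get quoted
out = df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC)
print(out)
```

The quotes hint to spreadsheet apps that the values are text, so the full digits survive.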

Related

Splitting a column into 2 in a csv file using python

I have a .csv file with 100 rows of data, displayed like this:
"Jim 1234"
"Sam 1235"
"Mary 1236"
"John 1237"
What I'm trying to achieve is splitting the numbers from the names into 2 columns in python
Edit: using
import pandas as pd
df = pd.read_csv('test.csv', sep='\s+')
df.to_csv('result.csv', index=False)
I managed to get it to display in Excel, but the numbers still do not show up in column B as I expected.
Your data has only one column; the name and number are separated by a tab inside it:
pd.read_csv('test.csv', quoting=1, header=None, squeeze=True) \
.str.split('\t', expand=True) \
.to_csv('result.csv', index=False, header=False)
A very simple way:
data = pd.DataFrame(['Jim1234', 'Sam4546'])
data[0].str.split(r'(\d+)', expand=True)
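For what it's worth, splitting on a capturing group keeps the digits but also leaves a trailing empty column; str.extract may be the cleaner tool here (a sketch, not part of the original answer):

```python
import pandas as pd

data = pd.DataFrame(['Jim1234', 'Sam4546'])

# split on a capturing group: the matched digits are kept as their own
# column, but a trailing empty-string column appears after the match
parts = data[0].str.split(r'(\d+)', expand=True)
print(parts)

# extract yields exactly one column per capturing group, no leftovers
name_num = data[0].str.extract(r'([A-Za-z]+)(\d+)')
print(name_num)
```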
If your file resembles the screenshot in the original post (csv file content), then the following code will work:
import pandas as pd
df = pd.read_csv('a.csv', header=None, delimiter='\s')
df

Handle variable as file with pandas dataframe

I would like to create a pandas dataframe out of a list variable.
With pd.DataFrame() I am not able to declare a delimiter, which leads to just one column per list entry.
If I use pd.read_csv() instead, I of course receive the following error:
ValueError: Invalid file path or buffer object type: <class 'list'>
Is there a way to use pd.read_csv() with my list without first saving the list to a csv and reading that csv file in a second step?
I also tried pd.read_table(), which also needs a file or buffer object.
Example data (separated by tab stops):
Col1 Col2 Col3
12 Info1 34.1
15 Info4 674.1
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
Current workaround:
with open(f'{filepath}tmp.csv', 'w', encoding='UTF8') as f:
    for line in test:
        f.write(line + "\n")
df = pd.read_csv(f'{filepath}tmp.csv', sep='\t', index_col=1)
import pandas as pd
df = pd.DataFrame([x.split('\t') for x in test])
print(df)
If you want the first row as the header, then:
df.columns = df.iloc[0]
df = df[1:]
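Note that everything stays a string with this approach; if the numeric columns should be numbers, a small follow-up sketch (using the test list from the question):

```python
import pandas as pd

test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1", "15\tInfo4\t674.1"]
df = pd.DataFrame([x.split('\t') for x in test])

# promote the first row to the header, then drop it from the data
df.columns = df.iloc[0]
df = df[1:]

# the cells are still strings; convert the numeric columns explicitly
df['Col1'] = pd.to_numeric(df['Col1'])
df['Col3'] = pd.to_numeric(df['Col3'])
print(df.dtypes)
```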
It seems simpler to convert it to a nested list, as in the other answer:
import pandas as pd
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
data = [line.split('\t') for line in test]
df = pd.DataFrame(data[1:], columns=data[0])
but you can also convert it back to a single string (or get it directly from a file or from the network as a single string) and then use io.BytesIO or io.StringIO to simulate a file in memory.
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
single_string = "\n".join(test)
file_like_object = io.StringIO(single_string)
df = pd.read_csv(file_like_object, sep='\t')
or shorter
df = pd.read_csv(io.StringIO("\n".join(test)), sep='\t')
This method is popular when you get data from the network (socket, web API) as a single string.

Pandas Dataframe.to_csv decimal=',' doesn't work

In Python, I'm writing my Pandas Dataframe to a csv file and want to change the decimal delimiter to a comma (,). Like this:
results.to_csv('D:/Data/Kaeashi/BigData/ProcessMining/Voorbeelden/Voorbeeld/CaseEventsCel.csv', sep=';', decimal=',')
But the decimal delimiter in the csv file is still a "."
Why? What am I doing wrong?
If the decimal parameter doesn't work, maybe it's because the type of the column is object (check the dtype in the last line of the output of df[column_name]).
That can happen if some rows have values that couldn't be parsed as numbers.
You can force the column to change type:
Change data type of columns in Pandas.
But that can make you lose the non-numerical data in that column.
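As a sketch of that conversion (the column name and data here are made up), pd.to_numeric with errors='coerce' turns unparseable cells into NaN instead of raising:

```python
import pandas as pd

# hypothetical object-dtype column: one cell isn't a valid number
df = pd.DataFrame({'val': ['1.5', '2.25', 'n/a']})

# coerce: bad cells become NaN and the column becomes float64,
# so to_csv's decimal=',' can take effect
df['val'] = pd.to_numeric(df['val'], errors='coerce')
out = df.to_csv(sep=';', decimal=',', index=False)
print(out)
```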
This functionality wasn't added until pandas 0.16.0:
Added decimal option in to_csv to provide formatting for non-‘.’ decimal separators (GH781)
Upgrade pandas to something more recent and it will work. The code below uses the 10 minute tutorial and pandas version 0.18.1
>>> import pandas as pd
>>> import numpy as np
>>> dates = pd.date_range('20130101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>> df
A B C D
2013-01-01 -0.157833 1.719554 0.564592 -0.228870
2013-01-02 -0.316600 1.545763 -0.206499 0.793412
2013-01-03 1.905803 1.172803 0.744010 1.563306
2013-01-04 -0.142676 -0.362548 -0.554799 -0.086404
2013-01-05 1.708246 -0.505940 -1.135422 0.810446
2013-01-06 -0.150899 0.794215 -0.628903 0.598574
>>> df.to_csv("test.csv", sep=';', decimal=',')
This creates a "test.csv" file that looks like this:
;A;B;C;D
2013-01-01;-0,157833276159;1,71955439009;0,564592278787;-0,228870244247
2013-01-02;-0,316599953358;1,54576303958;-0,206499307398;0,793411528039
2013-01-03;1,90580284184;1,17280324924;0,744010110291;1,56330623177
2013-01-04;-0,142676406494;-0,36254842687;-0,554799190671;-0,0864039782679
2013-01-05;1,70824597265;-0,50594004498;-1,13542154086;0,810446051841
2013-01-06;-0,150899136973;0,794214730009;-0,628902891897;0,598573645748
When the data is an object rather than a plain float type, for example Python's decimal.Decimal, first change the type and then write to the CSV file:
import pandas as pd
from decimal import Decimal
data_frame = pd.DataFrame(data={'col1': [1.1, 2.2], 'col2': [Decimal(3.3), Decimal(4.4)]})
data_frame.to_csv('report_decimal_dot.csv', sep=';', decimal=',', float_format='%.2f')
data_frame = data_frame.applymap(lambda x: float(x) if isinstance(x, Decimal) else x)
data_frame.to_csv('report_decimal_comma.csv', sep=';', decimal=',', float_format='%.2f')
Somehow I don't get this to work either. I always just end up using the following script to rectify it. It's dirty, but it works for my purposes:
for col in df.columns:
    try:
        df[col] = df[col].apply(lambda x: float(x.replace('.', '').replace(',', '.')))
    except (ValueError, AttributeError):
        pass
EDIT: I misread the question; you might use the same tactic the other way around by changing all your floats to strings :). Then again, you should probably just figure out why it's not working. Do post it if you get it to work.
This example is supposed to work (it works for me):
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(10))
with open('Data/out.csv', 'w') as f:
    s.to_csv(f, index=True, header=True, decimal=',', sep=';', float_format='%.3f')
out.csv:
;0
0;0,091
1;-0,009
2;-1,427
3;0,022
4;-1,270
5;-1,134
6;-0,965
7;-1,298
8;-0,854
9;0,150
I don't see exactly why your code doesn't work, but anyway, try adapting the above example to your needs.

how to avoid removing 0 from msb in python pandas dataframe

I have data in a column like 0123456789; after reading it from a file, it ends up as 123456789. The column name is msisdn.
How do I fix this issue?
I am using the following pandas script:
#!/usr/bin/env python
import gc
import pandas
csv1 = pandas.read_csv('/home/subin/Desktop/a.txt')
csv2 = pandas.read_csv('/home/subin/Desktop/b.txt')
merged = pandas.merge(csv1, csv2,left_on=['MSISDN'],right_on=['MSISDN'],how='left',suffixes=('#x', '#y'), sort=True).fillna('0')
merged.to_csv("/home/subin/Desktop/amergeb_out.txt", index=False, float_format='%.0f')
You can cast the column msisdn to string with the dtype parameter of read_csv:
import io
import pandas as pd
temp = u"""msisdn
0123456789
0123456789"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), dtype={'msisdn': str})
print(df)
print (df)
msisdn
0 0123456789
1 0123456789
csv1 = pandas.read_csv('/home/subin/Desktop/a.txt',dtype=str)
csv2 = pandas.read_csv('/home/subin/Desktop/b.txt',dtype={'MSISDN': str})
merged = pandas.merge(csv1, csv2,left_on=['MSISDN'],right_on=['MSISDN'],how='left',suffixes=('#x', '#y'), sort=True).fillna('0')
merged.to_csv("/home/subin/Desktop/amergeb_out.txt", index=False, float_format='%.0f')
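If the leading zeros are already gone (the column arrived as integers), a hedged sketch that pads them back with str.zfill, assuming every msisdn should be 10 digits wide:

```python
import pandas as pd

# zeros already stripped because the column was parsed as integers
df = pd.DataFrame({'msisdn': [123456789, 23456789]})

# cast to string and left-pad to the assumed fixed width of 10
df['msisdn'] = df['msisdn'].astype(str).str.zfill(10)
print(df)
```

This only works if the true width is fixed and known; reading with dtype=str in the first place is the safer fix.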

pandas create data frame, floats are objects, how to convert?

I have a text file:
sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442
I import it:
with open('file.txt', 'r') as fo:
    notes = next(fo)
    headers, *raw_data = [row.strip('\r\n').split('\t') for row in fo]  # get column headers and data
names = [row[0] for row in raw_data]  # extract the first column (sample names)
data = np.array([row[1:] for row in raw_data], dtype=float)  # drop the first column
If I then convert it:
s = pd.DataFrame(data,index=names,columns=headers[1:])
the data is recognized as floats. I can get the sample names back as a column with s = s.reset_index().
If I do
s = pd.DataFrame(raw_data,columns=headers)
the floats are objects and I cannot perform standard calculations.
How would you build the data frame? Is it better to import the data as a dict?
BTW, I am using Python 3.3.
You can parse your data file directly into a data frame as follows:
df = pd.read_csv('file.txt', sep='\t', index_col='sample')
Which will give you:
value1 value2
sample
A 0.12120 0.2354
B 0.23493 1.3442
[2 rows x 2 columns]
Then, you can do your computations.
To parse such a file, one should use pandas' read_csv function.
Below is a minimal example showing the use of read_csv with the parameter delim_whitespace set to True:
import pandas as pd
from StringIO import StringIO # Python2 or
from io import StringIO # Python3
data = \
"""sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442"""
# Creation of the dataframe
df = pd.read_csv(StringIO(data), delim_whitespace=True)
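To see the contrast between the two constructions in the question: built from the raw string lists, every column comes out as object dtype, and astype(float) on the value columns restores numeric behavior (a sketch mirroring the question's sample data):

```python
import pandas as pd

headers = ['sample', 'value1', 'value2']
raw_data = [['A', '0.1212', '0.2354'], ['B', '0.23493', '1.3442']]

# built from strings, the value columns are object dtype
s = pd.DataFrame(raw_data, columns=headers).set_index('sample')
print(s.dtypes)

# convert them to float so standard calculations work
s = s.astype(float)
print(s['value1'].mean())
```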
