How to avoid removing the leading 0 from a column in a pandas DataFrame - Python

I have data in a column like 0123456789, but after reading it from a file it becomes 123456789. The column name is MSISDN.
How can I fix this issue?
I am using the following pandas script:
#!/usr/bin/env python
import gc
import pandas
csv1 = pandas.read_csv('/home/subin/Desktop/a.txt')
csv2 = pandas.read_csv('/home/subin/Desktop/b.txt')
merged = pandas.merge(csv1, csv2, on='MSISDN', how='left', suffixes=('#x', '#y'), sort=True).fillna('0')
merged.to_csv("/home/subin/Desktop/amergeb_out.txt", index=False, float_format='%.0f')

You can cast the msisdn column to string with the dtype parameter of read_csv:
import io
import pandas as pd

temp = u"""msisdn
0123456789
0123456789"""
#after testing replace io.StringIO(temp) with the real filename
df = pd.read_csv(io.StringIO(temp), dtype={'msisdn': str})
print(df)
       msisdn
0  0123456789
1  0123456789

csv1 = pandas.read_csv('/home/subin/Desktop/a.txt', dtype=str)
csv2 = pandas.read_csv('/home/subin/Desktop/b.txt', dtype={'MSISDN': str})
merged = pandas.merge(csv1, csv2, on='MSISDN', how='left', suffixes=('#x', '#y'), sort=True).fillna('0')
merged.to_csv("/home/subin/Desktop/amergeb_out.txt", index=False, float_format='%.0f')
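If the zeros were already stripped at read time and the numbers have a known fixed width (an assumption here: 10-digit MSISDNs), a small sketch padding them back with str.zfill:

```python
import io
import pandas as pd

temp = u"""MSISDN
123456789
987654321"""
# The leading zeros are already gone, so read as-is
# and pad each value back out to 10 digits.
df = pd.read_csv(io.StringIO(temp))
df['MSISDN'] = df['MSISDN'].astype(str).str.zfill(10)
print(df)
```

This only works when every number really has the same fixed width; otherwise dtype=str at read time is the safe fix.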

Related

Extra column appears when appending selected row from one csv to another in Python

I have this code which appends a column of a csv file as a row to another csv file:
def append_pandas(s, d):
    import pandas as pd
    df = pd.read_csv(s, sep=';', header=None)
    df_t = df.T
    df_t.iloc[0:1, 0:1] = 'Time Point'
    df_t.at[1, 0] = 1
    df_t.columns = df_t.iloc[0]
    df_new = df_t.drop(0)
    pdb = pd.read_csv(d, sep=';')
    newpd = pdb.append(df_new)
    newpd.to_csv(d, sep=';')
The result is supposed to look like this:
Instead, every time the row is appended, there is an extra "Unnamed" column appearing on the left:
Do you know how to fix that?
Please help :(
My csv documents from which I select a column look like this:
You have to add index=False to your to_csv() call; otherwise pandas writes the DataFrame index as an extra unnamed column.

Splitting a column into 2 in a csv file using python

I have a .csv file with 100 rows of data displayed like this
"Jim 1234"
"Sam 1235"
"Mary 1236"
"John 1237"
What I'm trying to achieve is splitting the numbers from the names into two columns in Python.
Edit:
Using
import pandas as pd
df = pd.read_csv('test.csv', sep=r'\s+')
df.to_csv('result.csv', index=False)
I managed to get it to display like this in Excel. However, the numbers still do not show up in column B as I expected.
Your data has only one column, with a tab delimiter:
pd.read_csv('test.csv', quoting=1, header=None, squeeze=True) \
    .str.split('\t', expand=True) \
    .to_csv('result.csv', index=False, header=False)
(In pandas 2.0+, squeeze=True was removed; use pd.read_csv(...).squeeze('columns') instead.)
A very simple way:
data = pd.DataFrame(['Jim1234', 'Sam4546'])
data[0].str.split(r'(\d+)', expand=True)
If your file resembles the screenshot referenced in the original post, the following code will work:
import pandas as pd
df = pd.read_csv('a.csv', header=None, delimiter=r'\s')
df
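Another sketch for the original file layout, assuming each row is a single quoted "Name number" string: read it as one column, then split on the first space.

```python
import io
import pandas as pd

temp = u'''"Jim 1234"
"Sam 1235"
"Mary 1236"'''
# Each line is one quoted field; the csv parser strips the quotes,
# leaving a single column of "Name number" strings.
s = pd.read_csv(io.StringIO(temp), header=None)[0]
df = s.str.split(' ', n=1, expand=True)
df.columns = ['name', 'number']
print(df)
```

The column names 'name' and 'number' are made up for illustration; write out with df.to_csv('result.csv', index=False) as in the question.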

Pandas cuts off empty columns from csv file

I have a csv file that has columns with no content, just headers, and I want them to be included in the resulting DataFrame, but pandas cuts them off by default. Is there any way to solve this with read_csv, not read_excel?
IIUC, you need header=None:
from io import StringIO
import pandas as pd
data = """
not_header_1,not_header_2
"""
df = pd.read_csv(StringIO(data), sep=',')
print(df)
OUTPUT:
Empty DataFrame
Columns: [not_header_1, not_header_2]
Index: []
Now, with header=None
df = pd.read_csv(StringIO(data), sep=',', header=None)
print(df)
OUTPUT:
              0             1
0  not_header_1  not_header_2

Can I output large numeric to csv as a string

I have a txt file with several columns, some containing large numbers, and when I read it in through Python and output it to a csv the numbers change and I lose important info. Example of the txt file:
Identifier
12450006300638672
12450006300638689
12450006300638693
Example csv output:
Identifier    Changed_format_in_csv
1.245E+16     12450006300638600
1.245E+16     12450006300638600
1.245E+16     12450006300638600
Is there a way I can get the file to output to a csv without it changing the large numbers? I have a lot of other columns that are a mix of string and numeric data types, but I was thinking that if I could output everything as a string it would be fine.
This is what I've tried:
import pandas as pd
file1 = 'file.txt'
df = pd.read_csv(file1, sep="|", names=['Identifier'], index_col=False, dtype=str)
df.to_csv('file_new.csv', index=False)
I want the csv file to look like the txt file. I was hoping setting dtype=str would help, but it doesn't. Any help would be appreciated.
Short story:
I think this problem is related to the data type pandas uses when interpreting the content of 'file.txt'.
You could try:
df = df.assign(Identifier=lambda x: x['Identifier'].astype(int))
Long story:
I created file.txt with this content:
12450006300638672
12450006300638689
12450006300638693
Using pandas v0.23.3, I couldn't reproduce your problem with your displayed code, as shown here:
>>> import pandas as pd
>>> df = pd.read_csv('file.txt', sep="|", names=['Identifier'], index_col=False, dtype=str)
>>> df.to_csv('file_new.csv', index=False)
>>> print(df)
Identifier
0 12450006300638672
1 12450006300638689
2 12450006300638693
>>> exit()
$ cat file_new.csv
Identifier
12450006300638672
12450006300638689
12450006300638693
But I could reproduce your problem using pd.read_csv(..., dtype=float) instead:
>>> import pandas as pd
>>> df = pd.read_csv('file.txt', sep="|", names=['Identifier'], index_col=False, dtype=float)
>>> df.to_csv('file_new.csv', index=False)
>>> print(df)
Identifier
0 1.245001e+16
1 1.245001e+16
2 1.245001e+16
>>> exit()
$ cat file_new.csv
Identifier
1.2450006300638672e+16
1.2450006300638688e+16
1.2450006300638692e+16
This seems to be your case: the integers are being read as floats. Note that the float round trip has already lost precision (...689 became ...688 above), so reading with dtype=str, as in the first transcript, is the safe option. If you nevertheless end up with floats, you can cast them back as follows:
>>> import pandas as pd
>>> df = pd.read_csv('file.txt', sep="|", names=['Identifier'], index_col=False, dtype=float)
>>> print(df)
Identifier
0 1.245001e+16
1 1.245001e+16
2 1.245001e+16
>>> df = df.assign(Identifier=lambda x: x['Identifier'].astype(int))
>>> print(df)
Identifier
0 12450006300638672
1 12450006300638688
2 12450006300638692
>>> df.to_csv('file_new.csv', index=False)
>>> exit()
$ cat file_new.csv
Identifier
12450006300638672
12450006300638688
12450006300638692
It's not pandas that's changing the large numbers, it's the app you're using to view the CSV. To hint to CSV apps that those numbers should be treated as strings, make sure that they're quoted in the output:
import csv
df.to_csv('file_new.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)
It should look like this:
"Identifier"
"12450006300638672"
"12450006300638689"
"12450006300638693"
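A minimal round trip showing the effect (a sketch, writing to an in-memory buffer; the column name is taken from the question):

```python
import csv
import io
import pandas as pd

df = pd.DataFrame({'Identifier': ['12450006300638672', '12450006300638689']})
buf = io.StringIO()
# QUOTE_NONNUMERIC wraps every non-numeric field in quotes, hinting
# to spreadsheet apps that the values are text, not numbers.
df.to_csv(buf, index=False, quoting=csv.QUOTE_NONNUMERIC)
print(buf.getvalue())
```

Note the column must already hold strings (dtype=str at read time) for the quoting to apply; numeric dtypes are written unquoted.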

Having trouble removing headers when using pd.read_csv

I have a .csv that contains column headers and is displayed below. I need to suppress the column labeling when I ingest the file as a data frame.
date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7
When I issue the following command:
df = pd.read_csv('c:/temp1/test_csv.csv', usecols=[4,5], names = ["zip","weight"], header = 0, nrows=10)
I get:
zip weight
0 1417464 3546600
I have tried various manipulations of header=True and header=0. If I don't use header=0, then the columns will all print out on top of the rows like so:
zip weight
height locale
0 1417464 3546600
I have tried skiprows=0 and 1, but neither removes the headers; the command just skips the specified line.
I could really use some additional insight or a solution. Thanks in advance for any assistance you can provide.
Tiberius
Using the example of @jezrael, if you want to skip the header and suppress the column labeling:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
#after testing replace io.StringIO(temp) with the real filename
df = pd.read_csv(io.StringIO(temp), usecols=[4, 5], header=None, skiprows=1)
print(df)
         4    5
0  3546600  254
I'm not sure I entirely understand why you want to remove the headers, but you could comment out the header line as follows as long as you don't have any other rows that begin with 'd':
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='d') # comments out lines beginning with 'date,color' . . .
>>> df
3 4
0 1417464 3546600
It would be better to comment out the line in the csv file with the crosshatch character (#) and then use the same approach (again, as long as you have not commented out any other lines with a crosshatch):
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='#') # comments out lines with #
>>> df
3 4
0 1417464 3546600
I think you are right.
So you can change column names to a and b:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
#after testing replace io.StringIO(temp) with the real filename
df = pd.read_csv(io.StringIO(temp), usecols=[4, 5], names=['a', 'b'], header=0, nrows=10)
print(df)
         a    b
0  3546600  254
Now these columns have new names instead of weight and height.
df = pd.read_csv(io.StringIO(temp), usecols=[4, 5], header=0, nrows=10)
print(df)
    weight  height
0  3546600     254
You can check the docs for read_csv:
header : int, list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
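To summarise the docs quote: header=0 marks the first line as the (replaceable) header, and names supplies the replacements for the selected columns. A compact sketch, using the question's csv content inline:

```python
import io
import pandas as pd

temp = u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
# header=0 consumes the first line as the old header; names replaces it
# for the two columns picked out by usecols.
df = pd.read_csv(io.StringIO(temp), usecols=[4, 5], names=['a', 'b'], header=0)
print(df)
```

The names 'a' and 'b' are arbitrary placeholders; any two labels work.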
