Pandas, read in file without a separator between columns - python

I want to read in a file that looks like this:
1.49998061E-01 2.49996769E-01 3.99994830E-01 5.99992245E-01 9.99987075E-01
1.49998061E+00 2.49996769E+00 5.99992245E+00 9.99987075E+00 1.99997415E+01
4.99993537E+01 9.99987075E+01 .00000000E+00-2.70636350E+03-6.37027451E+03
-1.97521328E+04-4.64928272E+04-1.09435407E+05-3.39323088E+05-7.98702345E+05
-1.87999269E+06-5.82921376E+06-1.37207895E+07-2.26385807E+07-4.25429547E+07
-7.60167523E+07-1.25422049E+08-2.35690283E+08-3.88862033E+08-7.30701955E+08
-1.30546599E+09-2.15348023E+09-4.04455001E+09-4.54896210E+09-5.32533888E+09
So, each column is denoted by a 15 character sequence, but there's no official separator. Does pandas have a way of doing this?

Yes! its called pd.read_fwf
from io import StringIO
import pandas as pd
txt = """ 1.49998061E-01 2.49996769E-01 3.99994830E-01 5.99992245E-01 9.99987075E-01
1.49998061E+00 2.49996769E+00 5.99992245E+00 9.99987075E+00 1.99997415E+01
4.99993537E+01 9.99987075E+01 .00000000E+00-2.70636350E+03-6.37027451E+03
-1.97521328E+04-4.64928272E+04-1.09435407E+05-3.39323088E+05-7.98702345E+05
-1.87999269E+06-5.82921376E+06-1.37207895E+07-2.26385807E+07-4.25429547E+07
-7.60167523E+07-1.25422049E+08-2.35690283E+08-3.88862033E+08-7.30701955E+08
-1.30546599E+09-2.15348023E+09-4.04455001E+09-4.54896210E+09-5.32533888E+09"""
pd.read_fwf(StringIO(txt), widths=[15] * 5, header=None)
0 1 2 3 4
0 1.499981e-01 2.499968e-01 3.999948e-01 5.999922e-01 9.999871e-01
1 1.499981e+00 2.499968e+00 5.999922e+00 9.999871e+00 1.999974e+01
2 4.999935e+01 9.999871e+01 0.000000e+00 -2.706363e+03 -6.370275e+03
3 -1.975213e+04 -4.649283e+04 -1.094354e+05 -3.393231e+05 -7.987023e+05
4 -1.879993e+06 -5.829214e+06 -1.372079e+07 -2.263858e+07 -4.254295e+07
5 -7.601675e+07 -1.254220e+08 -2.356903e+08 -3.888620e+08 -7.307020e+08
6 -1.305466e+09 -2.153480e+09 -4.044550e+09 -4.548962e+09 -5.325339e+09

Let's look at using pd.read_fwf:
df = pd.read_fwf(csv_file,widths=[15]*5,header=None)

Let's do like that:
for example: housing.data
dataset = pd.read_csv('c:/1/housing.data', engine = 'python', sep='\s+', header=None)

Related

Modify column data before a specific character using Regex in pandas

I'm trying to modify the Address column data by removing all the characters before the comma.
Sample data:
**ADDRESS**
0 Ksfc Layout,Bangalore
1 Vishweshwara Nagar,Mysore
2 Jigani,Bangalore
3 Sector-1 Vaishali,Ghaziabad
4 New Town,Kolkata
Expected Output:
**ADDRESS**
0 Bangalore
1 Mysore
2 Bangalore
3 Ghaziabad
4 Kolkata
I tried this code but it's not working can someone correct the code?
import pandas as pd
import regex as re
data = pd.read_csv("train.csv")
data.ADDRESS.replace(re.sub(r'.*,',"", data.ADDRESS), regex=True, inplace=True)
Try this:
data.ADDRESS = data.ADDRESS.str.split(',').str[-1]
You can do it without a regex:
def removeFirst(x):
return x.split(",")[-1]
df['ADDRESS'] = df['ADDRESS'].apply(removeFirst)
You can try like this without Regex:
data['ADDRESS'] = data['ADDRESS'].str.split(',').str[-1]
Use Series.str.replace:
data['ADDRESS'] = data['ADDRESS'].str.replace(r'.*,', '')
See proof

make boxplot with columns from 2 dataframes [python seaborn]

I want to make a boxplot like this:
My data looks like this. It's separated into a control and an intervention dataframe.
control_df
# of Days
0 10
1 12
2 30
intervention_df
# of Days
0 2
1 1
2 2
Unfortunately, I'm not able to easily put it into sns.boxplot. Any advice on how to format it to graph appreciated.
MVE below:
import pandas as pd
# This is how my actual data is coming in
data_control = {'# of Days':[10,12,30]}
data_intervention = {'# of Days':[2,1,2]}
control_df = pd.DataFrame(data_control)
intervention_df = pd.DataFrame(data_intervention)
# This is me manually making it better for a boxplot
boxplot_data = {'Type':['control','control','control','intervention','intervention','intervention'],
'# of Days':[10,12,30,2,1,2]}
import seaborn as sns
sns.boxplot(x='Type',y='# of Days',data=boxplot_data)
IIUC, you can use pd.concat with keys parameter and droplevel:
sns.boxplot(data=pd.concat([control_df, intervention_df],
axis=1,
keys=['Control', 'Intervention']).droplevel(1, axis=1))
Output:

Python Parse a Text file and convert it to a dataframe

I need help parsing a specific string from this text file and then converting it to a dataframe.
I am trying to parse this portion of the text file:
Graph Stats for Max-Clique:
|V|: 566834
|E|: 659570
d_max: 8
d_avg: 2
p: 4.10563e-06
|T|: 31315
T_avg: 0
T_max: 5
cc_avg: 0.0179651
cc_global: 0.0281446
After parsing the text file, I need to make it into a dataframe where the columns are |V|,|E|, |T|, T_avg, T_max, cc_avg, and cc_global. Please advice! Thanks :)
You can read directly to a Pandas dataframe via pd.read_csv. Just remember to use an appropriate sep parameter. You can set your index column as the first and transpose:
import pandas as pd
from io import StringIO
x = StringIO("""|V|: 566834
|E|: 659570
d_max: 8
d_avg: 2
p: 4.10563e-06
|T|: 31315
T_avg: 0
T_max: 5
cc_avg: 0.0179651
cc_global: 0.0281446""")
# replace x with 'file.txt'
df = pd.read_csv(x, sep=': ', header=None, index_col=[0]).T
Result
print(df)
0 |V| |E| d_max d_avg p |T| T_avg T_max \
1 566834.0 659570.0 8.0 2.0 0.000004 31315.0 0.0 5.0
0 cc_avg cc_global
1 0.017965 0.028145

Problems importing text file into Python/Pandas

I am trying to load in a really messy text file into Python/Pandas. Here is an example of what the data in the file looks like
('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:24','viewed_home_page'),('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:36','viewed_search_results'),('41aa8fac-1bd8-4f95-918c-413879ed43f1','bcca257d-68d3-47e6-bc58-52c166f3b27b','Madison, WI','2014-08-16 17:42:31','visit_start')
Here is my code
import pandas as pd
cols=['ID','Visit','Market','Event Time','Event Name']
table=pd.read_table('C:\Users\Desktop\Dump.txt',sep=',', header=None,names=cols,nrows=10)
But when I look at the table, it still does not read correctly.
All of the data is mainly on one row.
You could use ast.literal_eval to parse the data into a Python tuple of tuples, and then you could call pd.DataFrame on that:
import pandas as pd
import ast
cols=['ID','Visit','Market','Event Time','Event Name']
with open(filename, 'rb') as f:
data = ast.literal_eval(f.read())
df = pd.DataFrame(list(data), columns=cols)
print(df)
yields
ID Visit \
0 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
1 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
2 41aa8fac-1bd8-4f95-918c-413879ed43f1 bcca257d-68d3-47e6-bc58-52c166f3b27b
Market Event Time Event Name
0 Seattle, WA 2014-08-05 10:06:24 viewed_home_page
1 Seattle, WA 2014-08-05 10:06:36 viewed_search_results
2 Madison, WI 2014-08-16 17:42:31 visit_start

pandas to_csv: suppress scientific notation in csv file when writing pandas to csv

I am writing a pandas df to a csv. When I write it to a csv file, some of the elements in one of the columns are being incorrectly converted to scientific notation/numbers. For example, col_1 has strings such as '104D59' in it. The strings are mostly represented as strings in the csv file, as they should be. However, occasional strings, such as '104E59', are being converted into scientific notation (e.g., 1.04 E 61) and represented as integers in the ensuing csv file.
I am trying to export the csv file into a software package (i.e., pandas -> csv -> software_new) and this change in data type is causing problems with that export.
Is there a way to write the df to a csv, ensuring that all elements in df['problem_col'] are represented as string in the resulting csv or not converted to scientific notation?
Here is the code I have used to write the pandas df to a csv:
df.to_csv('df.csv', encoding='utf-8')
I also check the dtype of the problem column:
for df.dtype, df['problem_column'] is an object
For python 3.xx (Python 3.7.2)&
In [2]: pd.__version__ Out[2]: '0.23.4':
Options and Settings
For visualization of the dataframe pandas.set_option
import pandas as pd #import pandas package
# for visualisation fo the float data once we read the float data:
pd.set_option('display.html.table_schema', True) # to can see the dataframe/table as a html
pd.set_option('display.precision', 5) # setting up the precision point so can see the data how looks, here is 5
df = pd.DataFrame(np.random.randn(20,4)* 10 ** -12) # create random dataframe
Output of the data:
df.dtypes # check datatype for columns
[output]:
0 float64
1 float64
2 float64
3 float64
dtype: object
Dataframe:
df # output of the dataframe
[output]:
0 1 2 3
0 -2.01082e-12 1.25911e-12 1.05556e-12 -5.68623e-13
1 -6.87126e-13 1.91950e-12 5.25925e-13 3.72696e-13
2 -1.48068e-12 6.34885e-14 -1.72694e-12 1.72906e-12
3 -5.78192e-14 2.08755e-13 6.80525e-13 1.49018e-12
4 -9.52408e-13 1.61118e-13 2.09459e-13 2.10940e-13
5 -2.30242e-13 -1.41352e-13 2.32575e-12 -5.08936e-13
6 1.16233e-12 6.17744e-13 1.63237e-12 1.59142e-12
7 1.76679e-13 -1.65943e-12 2.18727e-12 -8.45242e-13
8 7.66469e-13 1.29017e-13 -1.61229e-13 -3.00188e-13
9 9.61518e-13 9.71320e-13 8.36845e-14 -6.46556e-13
10 -6.28390e-13 -1.17645e-12 -3.59564e-13 8.68497e-13
11 3.12497e-13 2.00065e-13 -1.10691e-12 -2.94455e-12
12 -1.08365e-14 5.36770e-13 1.60003e-12 9.19737e-13
13 -1.85586e-13 1.27034e-12 -1.04802e-12 -3.08296e-12
14 1.67438e-12 7.40403e-14 3.28035e-13 5.64615e-14
15 -5.31804e-13 -6.68421e-13 2.68096e-13 8.37085e-13
16 -6.25984e-13 1.81094e-13 -2.68336e-13 1.15757e-12
17 7.38247e-13 -1.76528e-12 -4.72171e-13 -3.04658e-13
18 -1.06099e-12 -1.31789e-12 -2.93676e-13 -2.40465e-13
19 1.38537e-12 9.18101e-13 5.96147e-13 -2.41401e-12
And now write to_csv using the float_format='%.15f' parameter
df.to_csv('estc.csv',sep=',', float_format='%.15f') # write with precision .15
file output:
,0,1,2,3
0,-0.000000000002011,0.000000000001259,0.000000000001056,-0.000000000000569
1,-0.000000000000687,0.000000000001919,0.000000000000526,0.000000000000373
2,-0.000000000001481,0.000000000000063,-0.000000000001727,0.000000000001729
3,-0.000000000000058,0.000000000000209,0.000000000000681,0.000000000001490
4,-0.000000000000952,0.000000000000161,0.000000000000209,0.000000000000211
5,-0.000000000000230,-0.000000000000141,0.000000000002326,-0.000000000000509
6,0.000000000001162,0.000000000000618,0.000000000001632,0.000000000001591
7,0.000000000000177,-0.000000000001659,0.000000000002187,-0.000000000000845
8,0.000000000000766,0.000000000000129,-0.000000000000161,-0.000000000000300
9,0.000000000000962,0.000000000000971,0.000000000000084,-0.000000000000647
10,-0.000000000000628,-0.000000000001176,-0.000000000000360,0.000000000000868
11,0.000000000000312,0.000000000000200,-0.000000000001107,-0.000000000002945
12,-0.000000000000011,0.000000000000537,0.000000000001600,0.000000000000920
13,-0.000000000000186,0.000000000001270,-0.000000000001048,-0.000000000003083
14,0.000000000001674,0.000000000000074,0.000000000000328,0.000000000000056
15,-0.000000000000532,-0.000000000000668,0.000000000000268,0.000000000000837
16,-0.000000000000626,0.000000000000181,-0.000000000000268,0.000000000001158
17,0.000000000000738,-0.000000000001765,-0.000000000000472,-0.000000000000305
18,-0.000000000001061,-0.000000000001318,-0.000000000000294,-0.000000000000240
19,0.000000000001385,0.000000000000918,0.000000000000596,-0.000000000002414
And now write to_csv using the float_format='%f' parameter
df.to_csv('estc.csv',sep=',', float_format='%f') # this will remove the extra zeros after the '.'
For more details check pandas.DataFrame.to_csv
Use the float_format argument:
In [11]: df = pd.DataFrame(np.random.randn(3, 3) * 10 ** 12)
In [12]: df
Out[12]:
0 1 2
0 1.757189e+12 -1.083016e+12 5.812695e+11
1 7.889034e+11 5.984651e+11 2.138096e+11
2 -8.291878e+11 1.034696e+12 8.640301e+08
In [13]: print(df.to_string(float_format='{:f}'.format))
0 1 2
0 1757188536437.788086 -1083016404775.687134 581269533538.170288
1 788903446803.216797 598465111695.240601 213809584103.112457
2 -829187757358.493286 1034695767987.889160 864030095.691202
Which works similarly for to_csv:
df.to_csv('df.csv', float_format='{:f}'.format, encoding='utf-8')
If you would like to use the values as formated string in a list, say as part of csvfile csv.writier, the numbers can be formated before creating a list:
with open('results_actout_file','w',newline='') as csvfile:
resultwriter = csv.writer(csvfile, delimiter=',')
resultwriter.writerow(header_row_list)
resultwriter.writerow(df['label'].apply(lambda x: '%.17f' % x).values.tolist())

Categories