Python: Parse a text file and convert it to a dataframe

I need help parsing a specific string from this text file and then converting it to a dataframe.
I am trying to parse this portion of the text file:
Graph Stats for Max-Clique:
|V|: 566834
|E|: 659570
d_max: 8
d_avg: 2
p: 4.10563e-06
|T|: 31315
T_avg: 0
T_max: 5
cc_avg: 0.0179651
cc_global: 0.0281446
After parsing the text file, I need to make it into a dataframe where the columns are |V|, |E|, |T|, T_avg, T_max, cc_avg, and cc_global. Please advise! Thanks :)

You can read this directly into a pandas DataFrame via pd.read_csv. Just remember to use an appropriate sep parameter (a multi-character separator such as ': ' needs the python engine), set the first column as the index, and transpose:
import pandas as pd
from io import StringIO
x = StringIO("""|V|: 566834
|E|: 659570
d_max: 8
d_avg: 2
p: 4.10563e-06
|T|: 31315
T_avg: 0
T_max: 5
cc_avg: 0.0179651
cc_global: 0.0281446""")
# replace x with 'file.txt'
df = pd.read_csv(x, sep=': ', engine='python', header=None, index_col=0).T
Result
print(df)
0 |V| |E| d_max d_avg p |T| T_avg T_max \
1 566834.0 659570.0 8.0 2.0 0.000004 31315.0 0.0 5.0
0 cc_avg cc_global
1 0.017965 0.028145
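If the stats block sits inside a larger log file and you only want those seven columns, a small manual parse can be simpler than read_csv. Here is a minimal sketch, assuming a hypothetical file name graph_stats.txt and that the block runs from the "Graph Stats for Max-Clique:" header to the next blank line (or the end of the file):
import pandas as pd

wanted = ['|V|', '|E|', '|T|', 'T_avg', 'T_max', 'cc_avg', 'cc_global']

stats = {}
with open('graph_stats.txt') as fh:  # hypothetical file name
    in_block = False
    for line in fh:
        line = line.strip()
        if line.startswith('Graph Stats for Max-Clique:'):
            in_block = True        # start collecting key: value pairs
            continue
        if in_block and ':' in line:
            key, value = line.split(':', 1)
            stats[key.strip()] = float(value)
        elif in_block and not line:
            break                  # a blank line ends the block

df = pd.DataFrame([stats])[wanted]  # one-row dataframe, only the wanted columns
print(df)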

Related

How to filter a txt document with pandas where a row has strings, ints and floats

I have a format for a txt file like this:
501NA NA 1 9.517 6.338 0.776
502NA NA 2 2.683 7.229 0.642
503NA NA 3 6.856 9.313 0.543
504NA NA 4 9.412 3.246 0.808
505NA NA 5 1.994 2.141 0.620
506NA NA 6 3.571 9.574 0.575
I've got pandas to read the txt file, which I'm happy about. But when I try to filter it based on a condition, it says it can't. I want pandas to spit the data back out in the exact format it came in, basically output it as plain text.
here's my code:
import pandas as pd
data = pd.read_csv("blockbig2.gro", sep=r"\s+", header=None, keep_default_na=False)
data.columns = ['id', 'NA', 'index', 'x', 'y', 'z']
print(data)
equation_x = (data.x - 5) ** 2
equation_y = (data.y - 5) ** 2
eq = equation_x + equation_y
data[eq <= 24].to_txt('step1.txt', float_format="%.3f", index=False, header=False)
The print command gives me the right format, which I like. But what part am I missing?
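pandas has no DataFrame.to_txt method, which is most likely where the code above fails; to_csv with a space separator and a float format writes the same whitespace-delimited layout back to a text file. A minimal sketch of that last step (the radius test and column names are taken from the question; the exact output format is an assumption and will not reproduce fixed-width .gro alignment):
import pandas as pd

data = pd.read_csv('blockbig2.gro', sep=r'\s+', header=None, keep_default_na=False)
data.columns = ['id', 'NA', 'index', 'x', 'y', 'z']

# squared distance from (5, 5) in the x-y plane, as in the question
eq = (data.x - 5) ** 2 + (data.y - 5) ** 2

# DataFrame has no to_txt; to_csv with sep=' ' writes a plain text file
data[eq <= 24].to_csv('step1.txt', sep=' ', float_format='%.3f',
                      index=False, header=False)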

Pandas - Extract a string starting with a particular character

It should be fairly simple yet I'm not able to achieve it.
I have a dataframe df1, having a column "name_str". Example below:
name_str
0 alp:ha
1 bra:vo
2 charl:ie
I have to create another column containing, say, the 5 characters that start after the colon (:). I've written the following code:
import pandas as pd
data = {'name_str':["alp:ha", "bra:vo", "charl:ie"]}
#indx = ["name_1",]
df1 = pd.DataFrame(data=data)
n = df1['name_str'].str.find(":") + 1
df1['slize'] = df1['name_str'].str.slice(n, 2)
print(df1)
But the output is disappointing: it's all NaN.
name_str slize
0 alp:ha NaN
1 bra:vo NaN
2 charl:ie NaN
The output should've been:
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Would anyone please help? Appreciate it.
You can use str.extract to extract everything after the colon with this regular expression: :(.*)
df1['slize'] = df1.name_str.str.extract(':(.*)')
>>> df1
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Edit, based on your updated question
If you'd like to extract up to 5 characters after the colon, then you can use this modification:
df1['slize'] = df1.name_str.str.extract(':(.{,5})')
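If you would rather avoid a regular expression, the same result can be had by splitting once on the colon and slicing; a small sketch using the same example data:
import pandas as pd

df1 = pd.DataFrame({'name_str': ["alp:ha", "bra:vo", "charl:ie"]})

# split once on the colon, keep the part after it, then take at most 5 characters
df1['slize'] = df1['name_str'].str.split(':', n=1).str[1].str[:5]
print(df1)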

Pandas, read in file without a separator between columns

I want to read in a file that looks like this:
1.49998061E-01 2.49996769E-01 3.99994830E-01 5.99992245E-01 9.99987075E-01
1.49998061E+00 2.49996769E+00 5.99992245E+00 9.99987075E+00 1.99997415E+01
4.99993537E+01 9.99987075E+01 .00000000E+00-2.70636350E+03-6.37027451E+03
-1.97521328E+04-4.64928272E+04-1.09435407E+05-3.39323088E+05-7.98702345E+05
-1.87999269E+06-5.82921376E+06-1.37207895E+07-2.26385807E+07-4.25429547E+07
-7.60167523E+07-1.25422049E+08-2.35690283E+08-3.88862033E+08-7.30701955E+08
-1.30546599E+09-2.15348023E+09-4.04455001E+09-4.54896210E+09-5.32533888E+09
So, each column is denoted by a 15 character sequence, but there's no official separator. Does pandas have a way of doing this?
Yes! It's called pd.read_fwf:
from io import StringIO
import pandas as pd
txt = """ 1.49998061E-01 2.49996769E-01 3.99994830E-01 5.99992245E-01 9.99987075E-01
1.49998061E+00 2.49996769E+00 5.99992245E+00 9.99987075E+00 1.99997415E+01
4.99993537E+01 9.99987075E+01 .00000000E+00-2.70636350E+03-6.37027451E+03
-1.97521328E+04-4.64928272E+04-1.09435407E+05-3.39323088E+05-7.98702345E+05
-1.87999269E+06-5.82921376E+06-1.37207895E+07-2.26385807E+07-4.25429547E+07
-7.60167523E+07-1.25422049E+08-2.35690283E+08-3.88862033E+08-7.30701955E+08
-1.30546599E+09-2.15348023E+09-4.04455001E+09-4.54896210E+09-5.32533888E+09"""
pd.read_fwf(StringIO(txt), widths=[15] * 5, header=None)
0 1 2 3 4
0 1.499981e-01 2.499968e-01 3.999948e-01 5.999922e-01 9.999871e-01
1 1.499981e+00 2.499968e+00 5.999922e+00 9.999871e+00 1.999974e+01
2 4.999935e+01 9.999871e+01 0.000000e+00 -2.706363e+03 -6.370275e+03
3 -1.975213e+04 -4.649283e+04 -1.094354e+05 -3.393231e+05 -7.987023e+05
4 -1.879993e+06 -5.829214e+06 -1.372079e+07 -2.263858e+07 -4.254295e+07
5 -7.601675e+07 -1.254220e+08 -2.356903e+08 -3.888620e+08 -7.307020e+08
6 -1.305466e+09 -2.153480e+09 -4.044550e+09 -4.548962e+09 -5.325339e+09
Let's look at using pd.read_fwf (here csv_file is the path to your file):
df = pd.read_fwf(csv_file, widths=[15] * 5, header=None)
You can also do it like this, for example with housing.data:
dataset = pd.read_csv('c:/1/housing.data', engine='python', sep=r'\s+', header=None)
Note that this relies on whitespace between the values, which the sample above does not always have.
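read_fwf can also take explicit column boundaries via its colspecs parameter, which is handy when the fields are not all the same width. A small sketch equivalent to the widths=[15] * 5 call above (the file name is a placeholder):
import pandas as pd

# explicit (start, end) character positions, equivalent to widths=[15] * 5
colspecs = [(i * 15, (i + 1) * 15) for i in range(5)]
df = pd.read_fwf('data.txt', colspecs=colspecs, header=None)  # 'data.txt' is hypothetical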

Importing txt as dataframe in python

I have a txt file with the following format:
[(u'this guy',u'hey there',u'dfd fasd awe wedsad,daeraes',1),
(u'that guy',u'cya',u'dfd fasd es',1),
(u'another guy',u'hi',u'dfawe wedsad,daeraes',-1)]
and I would like to import it in python as a dataframe with 4 columns. I have tried:
trial = []
for line in open('filename.txt', 'r'):
    trial.append(line.rstrip())
which gives each line as text. Using:
import pandas as pd
pd.read_csv('filename.txt', sep=",", header = None)
Using read_csv from pandas and splitting on commas also treats the commas inside the text fields as separators:
0 1 2 3 4 5
0 [(u'this guy' u'hey there' u'dfd fasd awe wedsad daeraes' 1) NaN
1 (u'that guy' u'cya' u'dfd fasd es' 1) NaN NaN
2 (u'another guy' u'hi' u'dfawe wedsad daeraes' -1)] NaN
Any idea how to get around that?
Assuming you have the data in data.txt:
import pandas as pd

py_array = eval(open("data.txt").read())
dataframe = pd.DataFrame(py_array)
Python needs to parse the file first.
It doesn't make sense to use read_csv, since the format isn't close enough to csv.
I'm assuming you mean python, not matlab.
The data is already a matrix.
aa=[(u'this guy',u'hey there',u'dfd fasd awe wedsad,daeraes',1),
(u'that guy',u'cya',u'dfd fasd es',1),
(u'another guy',u'hi',u'dfawe wedsad,daeraes',-1)]
for i in range(3):
    for j in range(4):
        print(aa[i][j])
output:
this guy
hey there
dfd fasd awe wedsad,daeraes
1
that guy
cya
dfd fasd es
1
another guy
hi
dfawe wedsad,daeraes
-1
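Since eval will execute whatever happens to be in the file, ast.literal_eval is a safer way to parse literal data like this list of tuples. A sketch that also names the four columns (the column names here are made up for illustration):
import ast
import pandas as pd

# ast.literal_eval only accepts literals (tuples, lists, strings, numbers),
# so it cannot run arbitrary code the way eval can
with open('filename.txt') as fh:
    records = ast.literal_eval(fh.read())

# hypothetical column names, purely for illustration
df = pd.DataFrame(records, columns=['speaker', 'greeting', 'text', 'label'])
print(df)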

pandas to_csv: suppress scientific notation in csv file when writing pandas to csv

I am writing a pandas df to a csv. When I write it to a csv file, some of the elements in one of the columns are being incorrectly converted to scientific notation/numbers. For example, col_1 has strings such as '104D59' in it. The strings are mostly represented as strings in the csv file, as they should be. However, occasional strings, such as '104E59', are being converted to scientific notation (e.g., 1.04E+61) and represented as numbers in the ensuing csv file.
I am trying to export the csv file into a software package (i.e., pandas -> csv -> software_new) and this change in data type is causing problems with that export.
Is there a way to write the df to a csv, ensuring that all elements in df['problem_col'] are represented as string in the resulting csv or not converted to scientific notation?
Here is the code I have used to write the pandas df to a csv:
df.to_csv('df.csv', encoding='utf-8')
I also checked the dtype of the problem column:
according to df.dtypes, df['problem_col'] is object
For Python 3.x (here Python 3.7.2) and pandas 0.23.4:
Options and settings
For visualisation of the dataframe, use pandas.set_option:
import numpy as np
import pandas as pd

# for visualisation of the float data once we read it:
pd.set_option('display.html.table_schema', True)  # show the dataframe/table as html
pd.set_option('display.precision', 5)             # display precision of 5 digits
df = pd.DataFrame(np.random.randn(20, 4) * 10 ** -12)  # create a random dataframe
Check the column data types:
df.dtypes  # check the datatype of each column
[output]:
0 float64
1 float64
2 float64
3 float64
dtype: object
Dataframe:
df # output of the dataframe
[output]:
0 1 2 3
0 -2.01082e-12 1.25911e-12 1.05556e-12 -5.68623e-13
1 -6.87126e-13 1.91950e-12 5.25925e-13 3.72696e-13
2 -1.48068e-12 6.34885e-14 -1.72694e-12 1.72906e-12
3 -5.78192e-14 2.08755e-13 6.80525e-13 1.49018e-12
4 -9.52408e-13 1.61118e-13 2.09459e-13 2.10940e-13
5 -2.30242e-13 -1.41352e-13 2.32575e-12 -5.08936e-13
6 1.16233e-12 6.17744e-13 1.63237e-12 1.59142e-12
7 1.76679e-13 -1.65943e-12 2.18727e-12 -8.45242e-13
8 7.66469e-13 1.29017e-13 -1.61229e-13 -3.00188e-13
9 9.61518e-13 9.71320e-13 8.36845e-14 -6.46556e-13
10 -6.28390e-13 -1.17645e-12 -3.59564e-13 8.68497e-13
11 3.12497e-13 2.00065e-13 -1.10691e-12 -2.94455e-12
12 -1.08365e-14 5.36770e-13 1.60003e-12 9.19737e-13
13 -1.85586e-13 1.27034e-12 -1.04802e-12 -3.08296e-12
14 1.67438e-12 7.40403e-14 3.28035e-13 5.64615e-14
15 -5.31804e-13 -6.68421e-13 2.68096e-13 8.37085e-13
16 -6.25984e-13 1.81094e-13 -2.68336e-13 1.15757e-12
17 7.38247e-13 -1.76528e-12 -4.72171e-13 -3.04658e-13
18 -1.06099e-12 -1.31789e-12 -2.93676e-13 -2.40465e-13
19 1.38537e-12 9.18101e-13 5.96147e-13 -2.41401e-12
And now write to_csv using the float_format='%.15f' parameter:
df.to_csv('estc.csv', sep=',', float_format='%.15f')  # write with 15 digits of precision
file output:
,0,1,2,3
0,-0.000000000002011,0.000000000001259,0.000000000001056,-0.000000000000569
1,-0.000000000000687,0.000000000001919,0.000000000000526,0.000000000000373
2,-0.000000000001481,0.000000000000063,-0.000000000001727,0.000000000001729
3,-0.000000000000058,0.000000000000209,0.000000000000681,0.000000000001490
4,-0.000000000000952,0.000000000000161,0.000000000000209,0.000000000000211
5,-0.000000000000230,-0.000000000000141,0.000000000002326,-0.000000000000509
6,0.000000000001162,0.000000000000618,0.000000000001632,0.000000000001591
7,0.000000000000177,-0.000000000001659,0.000000000002187,-0.000000000000845
8,0.000000000000766,0.000000000000129,-0.000000000000161,-0.000000000000300
9,0.000000000000962,0.000000000000971,0.000000000000084,-0.000000000000647
10,-0.000000000000628,-0.000000000001176,-0.000000000000360,0.000000000000868
11,0.000000000000312,0.000000000000200,-0.000000000001107,-0.000000000002945
12,-0.000000000000011,0.000000000000537,0.000000000001600,0.000000000000920
13,-0.000000000000186,0.000000000001270,-0.000000000001048,-0.000000000003083
14,0.000000000001674,0.000000000000074,0.000000000000328,0.000000000000056
15,-0.000000000000532,-0.000000000000668,0.000000000000268,0.000000000000837
16,-0.000000000000626,0.000000000000181,-0.000000000000268,0.000000000001158
17,0.000000000000738,-0.000000000001765,-0.000000000000472,-0.000000000000305
18,-0.000000000001061,-0.000000000001318,-0.000000000000294,-0.000000000000240
19,0.000000000001385,0.000000000000918,0.000000000000596,-0.000000000002414
And now write to_csv using the float_format='%f' parameter; note that '%f' defaults to 6 decimal places, so values as small as these would be rounded down to 0.000000:
df.to_csv('estc.csv', sep=',', float_format='%f')
For more details check pandas.DataFrame.to_csv
Use the float_format argument:
In [11]: df = pd.DataFrame(np.random.randn(3, 3) * 10 ** 12)
In [12]: df
Out[12]:
0 1 2
0 1.757189e+12 -1.083016e+12 5.812695e+11
1 7.889034e+11 5.984651e+11 2.138096e+11
2 -8.291878e+11 1.034696e+12 8.640301e+08
In [13]: print(df.to_string(float_format='{:f}'.format))
0 1 2
0 1757188536437.788086 -1083016404775.687134 581269533538.170288
1 788903446803.216797 598465111695.240601 213809584103.112457
2 -829187757358.493286 1034695767987.889160 864030095.691202
Which works similarly for to_csv:
df.to_csv('df.csv', float_format='{:f}'.format, encoding='utf-8')
If you would like to use the values as formatted strings in a list, say as part of a csv.writer, the numbers can be formatted before creating the list:
import csv

with open('results_actout_file', 'w', newline='') as csvfile:
    resultwriter = csv.writer(csvfile, delimiter=',')
    resultwriter.writerow(header_row_list)
    resultwriter.writerow(df['label'].apply(lambda x: '%.17f' % x).values.tolist())
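The answers above deal with float columns; if the real problem is that strings such as '104E59' are being reinterpreted as numbers somewhere in the pandas -> csv -> other-software chain, a common approach is to make sure the column really holds strings before writing and to quote non-numeric fields, so a downstream reader is less likely to guess a numeric type. A sketch (the column name comes from the question, the sample values are made up):
import csv
import pandas as pd

df = pd.DataFrame({'problem_col': ['104D59', '104E59'], 'value': [1.5, 2.5]})

# force the column to plain strings, then quote every non-numeric field so that
# '104E59' is written as "104E59" rather than something a reader may parse as 1.04E+61
df['problem_col'] = df['problem_col'].astype(str)
df.to_csv('df.csv', encoding='utf-8', quoting=csv.QUOTE_NONNUMERIC)
Whether the receiving software honours the quotes depends on that software; if it still converts the value, importing the column explicitly as text on that side is the remaining option.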
