Importing txt as dataframe in python

I have a txt file with the following format:
[(u'this guy',u'hey there',u'dfd fasd awe wedsad,daeraes',1),
(u'that guy',u'cya',u'dfd fasd es',1),
(u'another guy',u'hi',u'dfawe wedsad,daeraes',-1)]
and I would like to import it in python as a dataframe with 4 columns. I have tried:
trial = []
for line in open('filename.txt', 'r'):
    trial.append(line.rstrip())
which gives each line as a plain string. Using:
import pandas as pd
pd.read_csv('filename.txt', sep=",", header = None)
Using read_csv from pandas with a comma separator also splits on the commas inside the text of the fields:
0 1 2 3 4 5
0 [(u'this guy' u'hey there' u'dfd fasd awe wedsad daeraes' 1) NaN
1 (u'that guy' u'cya' u'dfd fasd es' 1) NaN NaN
2 (u'another guy' u'hi' u'dfawe wedsad daeraes' -1)] NaN
Any idea how to get around that?

Assuming you have the data in data.txt.
py_array = eval(open("data.txt").read())
dataframe = pd.DataFrame(py_array)
Python needs to parse the file first.
It doesn't make sense to use read_csv, since the format isn't close enough to CSV.
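If the file might come from an untrusted source, ast.literal_eval is a safer drop-in for eval here, since it only accepts Python literals (a sketch, assuming the same data.txt):
import ast
import pandas as pd

# literal_eval parses literals only, so it cannot execute arbitrary code
py_array = ast.literal_eval(open("data.txt").read())
dataframe = pd.DataFrame(py_array)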

I'm assuming you mean python, not matlab.
The data is already a matrix.
aa=[(u'this guy',u'hey there',u'dfd fasd awe wedsad,daeraes',1),
(u'that guy',u'cya',u'dfd fasd es',1),
(u'another guy',u'hi',u'dfawe wedsad,daeraes',-1)]
for i in range(3):
    for j in range(4):
        print(aa[i][j])
output:
this guy
hey there
dfd fasd awe wedsad,daeraes
1
that guy
cya
dfd fasd es
1
another guy
hi
dfawe wedsad,daeraes
-1
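To get the four-column dataframe the question asks for, the nested list can go straight into the DataFrame constructor (a sketch; the column names here are invented):
import pandas as pd

df = pd.DataFrame(aa, columns=['name', 'greeting', 'text', 'label'])
print(df)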

Related

How to filter a txt document with pandas where a row has strings, ints and floats

I have a format for a txt file like this:
501NA NA 1 9.517 6.338 0.776
502NA NA 2 2.683 7.229 0.642
503NA NA 3 6.856 9.313 0.543
504NA NA 4 9.412 3.246 0.808
505NA NA 5 1.994 2.141 0.620
506NA NA 6 3.571 9.574 0.575
I've got pandas to read the txt file, which I am happy about. But when I try to filter it based on a condition and write it out, it says it can't. I want pandas to spit out the data in the exact format it came in... basically output it as a txt string.
here's my code:
import pandas as pd
data = pd.read_csv("blockbig2.gro", sep=r"\s+", header=None, keep_default_na=False)
data.columns = ['id', 'NA', 'index', 'x', 'y', 'z']
print(data)
equation_x = ((data.x)-5)**2
equation_y = ((data.y)-5)**2
eq = equation_x + equation_y
data[eq <= 24].to_txt('step1.txt', float_format="%.3f", index=False, header=False)
the print command gives me the right format, which I like. But what part am I missing?
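One possible fix (a sketch, not from the original thread): pandas has no to_txt method, but to_string reproduces the layout that print shows:
# to_string mirrors the printed layout; float_format takes a callable here
out = data[eq <= 24].to_string(index=False, header=False, float_format=lambda v: f"{v:.3f}")
with open('step1.txt', 'w') as f:
    f.write(out)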

Import CSV file where last column has many separators [duplicate]

This question already has an answer here:
python pandas read_csv delimiter in column data
(1 answer)
Closed 2 years ago.
The dataset looks like this:
region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
As you can see, the last column sometimes has separators in it. This makes it hard to import the CSV file in pandas using read_csv(). The only way I can import the file is by adding the parameter error_bad_lines=False to the function. But this way I'm losing some of the data.
How can I import the CSV file without losing data?
I would read the file as one single column and parse manually:
import pandas as pd

df = pd.read_csv(filename, sep='\t')
pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region', 'state', 'latitude', 'longitude']])
pat = '^' + pat + ',(?P<status>.*)$'
df = df.iloc[:, 0].str.extract(pat)
Output:
region state latitude longitude status
0 florida FL 27.8333 -81.717 open,for,activity
1 georgia GA 32.9866 -83.6487 open
2 hawaii HI 21.1098 -157.5311 illegal,stuff
3 iowa IA 42.0046 -93.214 medical,limited
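Note that str.extract returns strings for every group, so the numeric columns may need converting afterwards, e.g.:
df[['latitude', 'longitude']] = df[['latitude', 'longitude']].apply(pd.to_numeric)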
Have you tried the old-school technique with the split function? A major downside is that you'd end up losing data or bumping into errors if your data has a , in any of the first 4 fields/columns, but if not, you could use it.
data = open(file, 'r').read().split('\n')
for line in data:
    items = line.split(',', 4)  # assuming 4 standard columns; the 5th can contain commas
Each row items would look, for example, like this:
['hawaii', 'HI', '21.1098', '-157.5311', 'illegal,stuff']
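Putting it together, the split rows can be collected into a dataframe (a sketch, assuming file points at the CSV and its first line is the header):
import pandas as pd

with open(file, 'r') as f:
    header = f.readline().strip().split(',')  # region,state,latitude,longitude,status
    rows = [line.strip().split(',', 4) for line in f if line.strip()]

df = pd.DataFrame(rows, columns=header)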

Pandas - Extract a string starting with a particular character

It should be fairly simple yet I'm not able to achieve it.
I have a dataframe df1 with a column "name_str". Example below:
name_str
0 alp:ha
1 bra:vo
2 charl:ie
I have to create another column comprising, say, 5 characters that start after the colon (:). I've written the following code:
import pandas as pd
data = {'name_str':["alp:ha", "bra:vo", "charl:ie"]}
#indx = ["name_1",]
df1 = pd.DataFrame(data=data)
n= df1['name_str'].str.find(":")+1
df1['slize'] = df1['name_str'].str.slice(n,2)
print(df1)
But the output is disappointing, all NaN:
name_str slize
0 alp:ha NaN
1 bra:vo NaN
2 charl:ie NaN
The output should've been:
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Would anyone please help? Appreciate it.
You can use str.extract to extract everything after the colon with this regular expression: :(.*)
df1['slize'] = df1.name_str.str.extract(':(.*)')
>>> df1
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Edit, based on your updated question
If you'd like to extract up to 5 characters after the colon, then you can use this modification:
df1['slize'] = df1.name_str.str.extract(':(.{,5})')
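An equivalent without regex is to split on the colon and slice the result:
df1['slize'] = df1['name_str'].str.split(':').str[1].str[:5]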

Python Parse a Text file and convert it to a dataframe

I need help parsing a specific string from this text file and then converting it to a dataframe.
I am trying to parse this portion of the text file:
Graph Stats for Max-Clique:
|V|: 566834
|E|: 659570
d_max: 8
d_avg: 2
p: 4.10563e-06
|T|: 31315
T_avg: 0
T_max: 5
cc_avg: 0.0179651
cc_global: 0.0281446
After parsing the text file, I need to make it into a dataframe where the columns are |V|, |E|, |T|, T_avg, T_max, cc_avg, and cc_global. Please advise! Thanks :)
You can read directly to a Pandas dataframe via pd.read_csv. Just remember to use an appropriate sep parameter. You can set your index column as the first and transpose:
import pandas as pd
from io import StringIO
x = StringIO("""|V|: 566834
|E|: 659570
d_max: 8
d_avg: 2
p: 4.10563e-06
|T|: 31315
T_avg: 0
T_max: 5
cc_avg: 0.0179651
cc_global: 0.0281446""")
# replace x with 'file.txt'
df = pd.read_csv(x, sep=': ', header=None, index_col=[0], engine='python').T
Result
print(df)
0 |V| |E| d_max d_avg p |T| T_avg T_max \
1 566834.0 659570.0 8.0 2.0 0.000004 31315.0 0.0 5.0
0 cc_avg cc_global
1 0.017965 0.028145
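To keep only the columns the question asks for, select them by name afterwards (the stat names become column labels after the transpose):
df = df[['|V|', '|E|', '|T|', 'T_avg', 'T_max', 'cc_avg', 'cc_global']]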

Replacing punctuation in a data frame based on punctuation list [duplicate]

This question already has answers here:
Fast punctuation removal with pandas
(4 answers)
Closed 4 years ago.
Using Canopy and Pandas, I have data frame a which is defined by:
a=pd.read_csv('text.txt')
df=pd.DataFrame(a)
df.columns=["test"]
test.txt is a single-column file containing strings with text, numbers, and punctuation.
Assuming df looks like:
test
%hgh&12
abc123!!!
porkyfries
I want my results to be:
test
hgh12
abc123
porkyfries
Effort so far:
from string import punctuation  # import the punctuation list from Python itself
a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]  # define the dataframe
for p in list(punctuation):
    df2 = df.test.str.replace(p, '')
    df2 = pd.DataFrame(df2)
df2
The code above basically just returns the same data set.
Appreciate any leads.
Edit: Reason why I am using Pandas is because data is huge, spanning to bout 1M rows, and future usage of the coding will be applied to list that go up to 30M rows.
Long story short, I need to clean data in a very efficient manner for big data sets.
Using replace with the correct regex would be easier:
In [41]:
import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
text
0 test
1 %hgh&12
2 abc123!!!
3 porkyfries
[4 rows x 1 columns]
use regex with the pattern [^\w\s], which matches anything that is not alphanumeric or whitespace
In [49]:
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
text
0 test
1 hgh12
2 abc123
3 porkyfries
[4 rows x 1 columns]
For removing punctuation from a text column in your dataframe:
In:
import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)
pattern
Out:
'[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~]'
In:
df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df
Out:
text
0 book...regh
1 book...
2 boo,
3 book.
4 ball,
5 ballnroll"
6 "rope"
7 rick %
In:
df['text'] = df['text'].str.replace(pattern, '', regex=True)
df
You can replace the pattern with your desired character. Ex - replace(pattern, '$')
Out:
text
0 bookregh
1 book
2 boo
3 book
4 ball
5 ballnroll
6 rope
7 rick
Translate is often considered the cleanest and fastest way to remove punctuation (source)
import string
# Python 3: build a delete-table with str.maketrans (this keeps the double quote, as the original did)
table = str.maketrans('', '', string.punctuation.replace('"', ''))
text = text.translate(table)
You may find that it works better to remove punctuation in 'a' before loading it into pandas.
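If you do want to stay inside pandas, Series.str.translate accepts the same kind of table built with str.maketrans (a sketch, assuming the 'test' column from the question and that all punctuation should go):
import string

# delete every punctuation character from each string in the column
table = str.maketrans('', '', string.punctuation)
df['test'] = df['test'].str.translate(table)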
