Why is re not working on my file? - python

I am using a regular expression to remove all the apostrophes in my textfile. I need to encode it in utf-8 for my other functions to work. So when I try this:
    import re
    import codecs
    dataset=[]
    with codecs.open(sys.argv[1], 'r', 'utf8') as fil:
        for line in fil:
            lines=[re.sub("'","",line) for line in fil]
            print(lines)
            dataset.append(lines.lower().strip().split())
Output:
    [] #on printing lines
    Traceback (most recent call last):
      File "preproc.py", line 112, in <module>
        dataset.append(lines.lower().strip().split())
    AttributeError: 'list' object has no attribute 'lower'
Textfile contains a string like this: It's an amazing day she's said
It returns the same thing back to me on printing line.

So after an SO chat session, the question is really this: given a list of lists of words, how do you replace the Unicode apostrophes and maintain the original data structure?
Given this data structure, strip out the \u2019 Unicode characters:
    s = [[u'wasn\u2019t', u'right', u'part', u'say', u'things',
          u'she\u2019s', u'hurt', u'terribly', u'she\u2019s',
          u'speaking']]
Here's one working example of how to do this:
    quotes_to_remove = [u"'", u"\u2019", u"\u2018"]
    new_s = []
    for line in s:
        new_line = []
        for word in line:
            for quote in quotes_to_remove:
                word = word.replace(quote, "")
            new_line.append(word)
        new_s.append(new_line)
    print(new_s)
produces:
[[u'wasnt', u'right', u'part', u'say', u'things', u'shes',
u'hurt', u'terribly', u'shes', u'speaking']]
Also worth noting: the asker is working in Python 2.7.10, and the code provided in this answer has not been tested on Python 3.
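For readers on Python 3, the same idea can be sketched with str.translate and a nested list comprehension. This is only an illustration under that assumption; the names QUOTES_TO_REMOVE and strip_quotes are made up here and are not part of the original answer.

```python
# Map each quote character's code point to None so translate() deletes it.
QUOTES_TO_REMOVE = {ord(ch): None for ch in "'\u2019\u2018"}


def strip_quotes(rows):
    """Return a new list of lists with quote characters removed from every word."""
    return [[word.translate(QUOTES_TO_REMOVE) for word in row] for row in rows]


s = [["wasn\u2019t", "right", "part", "say", "things",
      "she\u2019s", "hurt", "terribly", "she\u2019s", "speaking"]]
new_s = strip_quotes(s)
print(new_s)
```

The original list is left untouched, and the list-of-lists shape is preserved because the comprehension mirrors the nesting of the input.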

I think it can work like this:
    import re
    import codecs
    with codecs.open("textfile.txt", "r", "utf-8") as f:
        lines = list(f)  # a file object does not support item assignment, so copy the lines into a list
    for i, line in enumerate(lines):
        lines[i] = re.sub("'", "", line)
        print(lines[i])
Your original method does not assign the new values back to the list.
Here are two simple experiments that show the difference.
1.
    list1 = [2,3,5,4,1,1,1,2,2,5,1]
    for num in list1:
        num = 1
    print(list1)
output: [2, 3, 5, 4, 1, 1, 1, 2, 2, 5, 1]
2.
    list1 = [2,3,5,4,1,1,1,2,2,5,1]
    for i, num in enumerate(list1):
        list1[i] = 1
    print(list1)
output: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
So that is why your result is wrong. This is not a regex question! Hope it helps. :)
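Applied to the original question, two things stand out: the nested `for line in fil` loops both draw from the same file iterator, which is most likely why `lines` printed as an empty list, and `.lower()` has to be called on each string rather than on the list. A minimal sketch of one way to restructure it, assuming Python 3 and that each line should become a list of lowercase words; `clean_lines` is a hypothetical helper name:

```python
import re


def clean_lines(lines):
    """Remove apostrophes from each line, lowercase it, and split it into words."""
    dataset = []
    for line in lines:
        # Strip the ASCII apostrophe and the two curly single quotes.
        cleaned = re.sub("['\u2019\u2018]", "", line)
        dataset.append(cleaned.lower().strip().split())
    return dataset


# Any iterable of strings works, including an open file object:
#     with open(path, encoding="utf-8") as fil:
#         dataset = clean_lines(fil)
print(clean_lines(["It's an amazing day she's said\n"]))
```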

Related

Removing line break and writing lists without square brackets and commas to a text file in python

I'm facing a few issues with regard to writing some arguments to a text file. Below are the outputs I need to see in my text file.
1. I want to write an output like this to the text file.
Input:
Hello
World
Output:
HelloWorld
2. I want to write an output like this into a text file.
Input:
[1, 2, 3, 4, 5]
Output:
1,2,3,4,5
I tried several ways to do this but couldn't find a proper way.
Hope to seek some help.
The Code:
    progressList = [120, 0, 0] #A variable which won't change (i.e. this variable only has this value)
    resultList = ['Progress', 'Trailer'] #Each '' represents one user input
    #loop for progress
    with open("data.txt", "a") as f: # Used append as per my requirement
        i = 0 #iterator
        while i < len(resultList):
            # f.write(resultList)
            if resultList[i] == "Progress":
                j = 0
                f.write("Progress - ")
                for j in range(3):
                    while j < 2:
                        f.write(', ', join(progressList[j]))
                        break
                    if j == 2:
                        f.write(progressList[j], end='')
                        break
Output (textfile):
Progress - 120, 0, 0
Thanks.
The 1st case would be something like this:
>>> s = '''hello
... world'''
>>> ''.join(s.split())
'helloworld'
>>>
The 2nd one is funny:
>>> s = "[1, 2, 3, 4, 5]"
>>> exec ('a = ' + s)
>>> ','.join([str(i) for i in a])
'1,2,3,4,5'
hope it helps
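Neither snippet above actually writes to a file, so here is a small sketch of doing that with str.join, assuming Python 3. The "Progress - " label and the data.txt name come from the question; the variable names and the use of a temporary directory are illustrative choices.

```python
import os
import tempfile

words = ["Hello", "World"]
numbers = [1, 2, 3, 4, 5]
progress_list = [120, 0, 0]

# Write into a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "data.txt")

with open(path, "a") as f:
    # Case 1: concatenate the pieces with no separator or line break between them.
    f.write("".join(words) + "\n")
    # Case 2: join() only accepts strings, so convert each number first.
    f.write(",".join(str(n) for n in numbers) + "\n")
    # The format shown in the question's expected output.
    f.write("Progress - " + ", ".join(map(str, progress_list)) + "\n")

with open(path) as f:
    written = f.read()
print(written)
```

Building the whole line with join and writing it once avoids the nested loops in the question, and it sidesteps the fact that file.write takes a single string argument.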

Read List in List [duplicate]

I have a text file with 3 lines of data in it.
[1, 2, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 1, 3, 3]
[1, 1, 3, 3, 3, 1, 1, 1, 1, 2, 1, 1, 1, 3, 3]
[1, 2, 3, 1, 3, 1, 1, 3, 1, 3, 1, 1, 1, 3, 3]
I try to open the file and read the data in it:
    with open("rafine.txt") as f:
        l = [line.strip() for line in f.readlines()]
    f.close()
Now I have a list in a list.
If I say print(l[0]) it shows me [1, 2, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 1, 3, 3]
But I want to get the numbers in it.
So when I write print(l[0][0])
I want to see 1 but it shows me [
How can I fix this?
You can use literal_eval to parse the lines from the file & build the matrix:
    from ast import literal_eval
    with open("test.txt") as f:
        matrix = []
        for line in f:
            row = literal_eval(line)
            matrix.append(row)
    print(matrix[0][0])
    print(matrix[1][4])
    print(matrix[2][8])
result:
1
3
1
    import json
    with open("rafine.txt") as f:
        for line in f.readlines():
            line = json.loads(line)
            print(line)
The best approach depends on what assumptions you make about the data in your text file:
ast.literal_eval
If the data in your file is formatted the same way it would be inside Python source code, the best approach is to use literal_eval:
    from ast import literal_eval
    data = [] # will contain list of lists
    with open("filename") as f:
        for line in f:
            row = literal_eval(line)
            data.append(row)
or, the short version:
    with open(filename) as f:
        data = [literal_eval(line) for line in f]
re.findall
If you can make few assumptions about the data, using regular expressions to find all digits might be a way forward. The below builds lists by simply extracting any digits in the text file, regardless of separators or other characters in the file:
    import re
    data = [] # will contain list of lists
    with open("filename") as f:
        for line in f:
            row = [int(i) for i in re.findall(r'\d+', line)]
            data.append(row)
or, in short:
    with open(filename) as f:
        data = [[int(i) for i in re.findall(r'\d+', line)] for line in f]
handwritten parsing
If neither option is suitable, you can always parse by hand, tailored to the exact format:
    data = [] # will contain list of lists
    with open(filename) as f:
        for line in f:
            row = [int(i) for i in line.strip()[1:-1].split(", ")]
            data.append(row)
The strip() call drops the trailing newline, and the [1:-1] slice removes the first and last characters (the brackets); split(", ") then splits the rest into a list. for i in ... iterates over the items in this list (assigning each item to i), and int(i) converts i to an integer.
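To compare the three approaches side by side, here is a small sketch that runs them on the sample data from the question. It assumes Python 3 and uses io.StringIO in place of a real file, purely to keep the example self-contained.

```python
import io
import re
from ast import literal_eval

SAMPLE = (
    "[1, 2, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 1, 3, 3]\n"
    "[1, 1, 3, 3, 3, 1, 1, 1, 1, 2, 1, 1, 1, 3, 3]\n"
    "[1, 2, 3, 1, 3, 1, 1, 3, 1, 3, 1, 1, 1, 3, 3]\n"
)

# 1. Treat each line as a Python literal.
with_literal_eval = [literal_eval(line) for line in io.StringIO(SAMPLE)]

# 2. Extract every run of digits, ignoring brackets and separators.
with_regex = [[int(i) for i in re.findall(r"\d+", line)]
              for line in io.StringIO(SAMPLE)]

# 3. Parse by hand: strip whitespace, drop the brackets, split on ", ".
by_hand = [[int(i) for i in line.strip()[1:-1].split(", ")]
           for line in io.StringIO(SAMPLE)]

print(with_literal_eval[0][0])
print(with_literal_eval == with_regex == by_hand)
```

On this input all three produce the same list of lists; they differ in how they react to unexpected content, for example negative numbers are lost by the `\d+` pattern.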

Python reading from file into multiple lists

I don't suppose someone could point me in the right direction?
I'm wondering how best to pull values out of a text file, then break them up and put them back into lists at the same position as their corresponding values.
I'm sorry if this isn't clear; maybe this will make it clearer. This is the code that outputs the file:
    #while loop
    with open('values', 'a') as o:
        o.write("{}, {}, {}, {}, {}, {}, {}\n".format(FirstName[currentRow],Surname[currentRow], AnotherValue[currentRow], numberA, numberB))
        currentRow+1
I would like to do the opposite and take the values, formatted as above and put them back into lists at the same place. Something like:
    #while loop
    with open('values', 'r') as o:
        o.read("{}, {}, {}, {}, {}, {}, {}\n".format(FirstName[currentRow],Surname[currentRow], AnotherValue[currentRow], numberA, numberB))
        currentRow +1
Thanks
I think the best corresponding way to do it is calling split on the text read in:
    FirstName[currentRow], Surname[currentRow], AnotherValue[currentRow], numberA, numberB = o.readline().strip().split(", ")
There is no real equivalent of formatted input, like scanf in C.
You should be able to do something like the following:
    first_names = []
    surnames = []
    another_values = []
    number_as = []
    number_bs = []
    for line in open('values', 'r'):
        fn, sn, av, na, nb = line.strip().split(',')
        first_names.append(fn)
        surnames.append(sn)
        another_values.append(av)
        number_as.append(float(na))
        number_bs.append(float(nb))
The for line in open() part iterates over each line in the file and the fn, sn, av, na, nb = line.strip().split(',') bit strips the newline \n off the end of each line and splits it on the commas.
In practice, though, I would probably use the csv module or something like pandas, which handle edge cases better. For example, the above approach will break if a name or some other value has a comma in it!
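As a rough illustration of the csv suggestion, the sketch below round-trips rows through csv.writer and csv.reader so that a comma inside a name survives. It assumes Python 3; the sample names and numbers are invented, and io.StringIO stands in for the real file.

```python
import csv
import io

rows = [
    ("Ada", "Lovelace", "x", 1.5, 2.0),
    ("Smith, Jr.", "Jones", "y", 3.0, 4.5),  # note the comma inside the name
]

# Write: csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

# Read back into one list per column.
first_names, surnames, another_values, number_as, number_bs = [], [], [], [], []
buffer.seek(0)
for fn, sn, av, na, nb in csv.reader(buffer):
    first_names.append(fn)
    surnames.append(sn)
    another_values.append(av)
    number_as.append(float(na))
    number_bs.append(float(nb))

print(first_names)
print(number_bs)
```

With a real file, the csv documentation recommends opening it with newline="" so the module can handle line endings itself.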
    with open('values.txt', 'r') as f:
        first_names, last_names, another_values, a_values, b_values = \
            (list(tt) for tt in zip(*[l.rstrip().split(',') for l in f]))
Unless you need to update the results, the list conversion list(tt) for tt in is unnecessary.
On Python 2 you may use izip from itertools instead of zip.
If you are allowed to decide the file format, saving and loading as JSON may be useful.
    import json
    #test data (wrapped in list() so that json can also serialize it on Python 3)
    FirstName = list(range(5))
    Surname = list(range(5,11))
    AnotherValue = list(range(11,16))
    numberAvec = list(range(16,21))
    numberBvec = list(range(21,26))
    #save
    all_values = [FirstName,Surname,AnotherValue,numberAvec,numberBvec]
    with open("values.txt","w") as fp:
        json.dump(all_values,fp)
    #load
    with open("values.txt","r") as fp:
        FirstName,Surname,AnotherValue,numberAvec,numberBvec = json.load(fp)
values.txt:
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25]]

Converting strings read from a file into an array or list

I have Python code that generates sets of numbers as arrays and stores them in a file. The file looks as follows:
set([0, 2, 3])
set([0, 1, 3])
set([0, 1, 2])
I have another Python script that reads this file and needs to convert each text line back to an array.
Method to read the file:
    def get_sets_from_file (self,file_name):
        file_handle = open(file_name, "r")
        all_sets_from_file = file_handle.read()
        print all_sets_from_file
Once the text line is read, I need a mechanism to convert it back to an array.
Thanks,
Bhavesh.
EDIT-1:
Based on the suggestions given below, I have changed the file format to a comma-separated file:
set([8, 6, 7]),
set([8, 5, 7]),
set([8, 4, 7]),
set([8, 3, 7]),
You can apply this to each line in your file:
>>> line = "set([0, 2, 3])" #or "set([0, 2, 3]),"
>>> import re
>>> r = r"set\(\[(.*)\]\)"
>>> m = re.search(r, line)
>>> match = m.group(1)
>>> a = [int(item.strip()) for item in match.split(',')]
>>> a
[0, 2, 3]
>>>
That could be implemented in your code as:
    def get_sets_from_file (self,file_name):
        total = []
        with open(file_name, "r") as fhdl:
            for line in fhdl:
                a = do_the_regex_thing_above  # placeholder for the regex steps shown above
                total.append(a)
        return total
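A self-contained version of that outline might look like the following sketch. It is written as plain functions rather than a method, accepts any iterable of lines (so an open file works too), and the names SET_PATTERN, parse_set_line and get_sets_from_lines are hypothetical.

```python
import re

# Capture whatever sits between "set([" and "])"; a trailing comma is ignored.
SET_PATTERN = re.compile(r"set\(\[(.*)\]\)")


def parse_set_line(line):
    """Return the integers in a line like 'set([0, 2, 3])', or None if no match."""
    m = SET_PATTERN.search(line)
    if m is None:
        return None
    return [int(item.strip()) for item in m.group(1).split(",") if item.strip()]


def get_sets_from_lines(lines):
    """Parse every matching line; lines that do not match are skipped."""
    total = []
    for line in lines:
        parsed = parse_set_line(line)
        if parsed is not None:
            total.append(parsed)
    return total


sample = ["set([8, 6, 7]),\n", "set([8, 5, 7]),\n", "\n"]
print(get_sets_from_lines(sample))
```

Skipping non-matching lines is a design choice made for this sketch; raising an error instead would surface malformed input earlier.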
Edit (based on the comments from @Droogans):
This code will work with no change for the CSV version of your document as you depicted it in the new edit.
However, the problem would be greatly simplified if you have access to the code that produces the current output. If so, it would be more effective to pickle your data or serialize it as JSON. That way you could recover your list of sets simply by loading the generated output with pickle or json.
It looks like your file contains properly formed python code. You can use this:
read each line of the file into a variable (m)
>>> m = "set([1, 3, 2])"
>>> eval(m)
set([1, 2, 3])
>>>
eval is considered very dangerous because it will do anything you ask it to (like reformat your disk or whatever). But since you know what is in the file you want to evaluate this might be the way for you to go.
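If the contents of the file cannot be fully trusted, one alternative is to peel off the set( ... ) wrapper and hand only the list literal to ast.literal_eval, which accepts Python literals but not arbitrary expressions or calls. This is a sketch of a different technique from the eval answer above, assuming Python 3; parse_set is a made-up name.

```python
from ast import literal_eval


def parse_set(text):
    """Turn a string such as 'set([1, 3, 2])' into a set without calling eval."""
    text = text.strip().rstrip(",")
    if not (text.startswith("set(") and text.endswith(")")):
        raise ValueError("unexpected format: %r" % text)
    inner = text[len("set("):-1]  # leaves the list literal, e.g. '[1, 3, 2]'
    return set(literal_eval(inner))


print(parse_set("set([1, 3, 2])") == {1, 2, 3})
```

The trailing-comma handling covers the comma-separated variant of the file shown in the question's edit.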
If you just want to read and write simple lists of integers to/from a file:
    import os
    sets = [
        set([0, 2, 3]),
        set([0, 1, 3]),
        set([0, 1, 2]),
    ]
    def write_sets(path, sets):
        with open(path, 'wb') as stream:
            for item in sets:
                item = ' '.join(str(number) for number in item)
                stream.write(item + os.linesep)
    def read_sets(path, sets):
        sets = []
        with open(path, 'rb') as stream:
            for line in stream:
                sets.append(set(int(number) for number in line.split()))
        return sets
    path = 'tmp/sets.txt'
    write_sets(path, sets)
    print read_sets(path, sets)
    # [set([0, 2, 3]), set([0, 1, 3]), set([0, 1, 2])]
Why don't you serialize your data in a format that you can easily deserialize from a string?
JSON perfectly fits here.
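A minimal sketch of the JSON route, assuming Python 3 and that you control the code that writes the file. JSON has no set type, so each set is stored as a sorted list and rebuilt on load; io.StringIO stands in for the file here.

```python
import io
import json

sets = [{0, 2, 3}, {0, 1, 3}, {0, 1, 2}]

# Save: convert each set to a sorted list, since json cannot serialize sets.
stream = io.StringIO()
json.dump([sorted(s) for s in sets], stream)

# Load: rebuild the sets from the stored lists.
stream.seek(0)
loaded = [set(items) for items in json.load(stream)]

print(stream.getvalue())
print(loaded == sets)
```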

Python - Parsing Columns and Rows

I am running into some trouble with parsing the contents of a text file into a 2D array/list. I cannot use built-in libraries, so have taken a different approach. This is what my text file looks like, followed by my code
1,0,4,3,6,7,4,8,3,2,1,0
2,3,6,3,2,1,7,4,3,1,1,0
5,2,1,3,4,6,4,8,9,5,2,1
    def twoDArray():
        network = [[]]
        filename = open('twoDArray.txt', 'r')
        for line in filename.readlines():
            col = line.split(line, ',')
            row = line.split(',')
        network.append(col,row)
        print "Network = "
        print network
    if __name__ == "__main__":
        twoDArray()
I ran this code but got this error:
    Traceback (most recent call last):
      File "2dArray.py", line 22, in <module>
        twoDArray()
      File "2dArray.py", line 8, in twoDArray
        col = line.split(line, ',')
    TypeError: an integer is required
I am using the comma to separate both row and column as I am not sure how I would differentiate between the two - I am confused about why it is telling me that an integer is required when the file is made up of integers
Well, I can explain the error. You're using str.split() and its usage pattern is:
str.split(separator, maxsplit)
You're using str.split(string, separator) and that isn't a valid call to split. Here is a direct link to the Python docs for this:
http://docs.python.org/library/stdtypes.html#str.split
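A short illustration of the two arguments, assuming Python 3 (where the same mistake still raises a TypeError, although the wording of the message differs from the Python 2 traceback in the question):

```python
line = "1,0,4,3,6,7,4,8,3,2,1,0"

# One argument: split on every comma.
print(line.split(","))

# Two arguments: the second is maxsplit, the maximum number of splits.
print(line.split(",", 2))  # at most 2 splits, so 3 pieces

# Passing a string where maxsplit is expected fails.
error_message = None
try:
    line.split(line, ",")
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```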
To directly answer your question, there is a problem with the following line:
col = line.split(line, ',')
If you check the documentation for str.split, you'll find the description to be as follows:
str.split([sep[, maxsplit]])
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most
maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified, then there is no limit on the number of splits (all possible splits are made).
This is not what you want. You are not trying to specify the number of splits you want to make.
Consider replacing your for loop and network.append with this:
    for line in filename.readlines():
        # line is a string representing the values for this row
        row = line.split(',')
        # row is the list of number strings for this row, such as ['1', '0', '4', ...]
        cols = [int(x) for x in row]
        # cols is the list of numbers for this row, such as [1, 0, 4, ...]
        network.append(cols)
        # Put this row into network, such that network is [[1, 0, 4, ...], [...], ...]
"""I cannot use built-in libraries""" -- do you really mean "cannot" as in you have tried to use the csv module and failed? If so, say so. Do you mean that "may not" as in you are forbidden to use a built-in module by the terms of your homework assignment? If so, say so.
Here is an answer that works. It doesn't leave a newline attached to the end of the last item in each row. It converts the numbers to int so that you can use them for whatever purpose you have. It fixes other errors that nobody else has mentioned.
    def twoDArray():
        network = []
        # filename = open('twoDArray.txt', 'r')
        # "filename" is a very weird name for a file HANDLE
        f = open('twoDArray.txt', 'r')
        # for line in filename.readlines():
        # readlines reads the whole file into memory at once.
        # That is quite unnecessary.
        for line in f: # just iterate over the file handle
            line = line.rstrip('\n') # remove the newline, if any
            # col = line.split(line, ',')
            # wrong args, as others have said.
            # In any case, only 1 split call is necessary
            row = line.split(',')
            # now convert string to integer
            irow = [int(item) for item in row]
            # network.append(col,row)
            # list.append expects only ONE arg
            # indentation was wrong; you need to do this once per line
            network.append(irow)
        print "Network = "
        print network
    if __name__ == "__main__":
        twoDArray()
Omg...
    network = []
    filename = open('twoDArray.txt', 'r')
    for line in filename.readlines():
        network.append(line.split(','))
This gives you the following structure (with the values still as strings, and a trailing newline on the last item of each row):
    [
    [1,0,4,3,6,7,4,8,3,2,1,0],
    [2,3,6,3,2,1,7,4,3,1,1,0],
    [5,2,1,3,4,6,4,8,9,5,2,1]
    ]
Or do you need some other structure as output? Please add what you need as output.
    class TwoDArray(object):
        @classmethod
        def fromFile(cls, fname, *args, **kwargs):
            splitOn = kwargs.pop('splitOn', None)
            mode = kwargs.pop('mode', 'r')
            with open(fname, mode) as inf:
                return cls([line.strip('\r\n').split(splitOn) for line in inf], *args, **kwargs)
        def __init__(self, data=[[]], *args, **kwargs):
            dataType = kwargs.pop('dataType', lambda x:x)
            super(TwoDArray,self).__init__()
            self.data = [[dataType(i) for i in line] for line in data]
        def __str__(self, fmt=str, endrow='\n', endcol='\t'):
            return endrow.join(
                endcol.join(fmt(i) for i in row) for row in self.data
            )
    def main():
        network = TwoDArray.fromFile('twodarray.txt', splitOn=',', dataType=int)
        print("Network =")
        print(network)
    if __name__ == "__main__":
        main()
The input format is simple, so the solution should be simple too:
network = [map(int, line.split(',')) for line in open(filename)]
print network
csv module doesn't provide an advantage in this case:
import csv
print [map(int, row) for row in csv.reader(open(filename, 'rb'))]
If you need float instead of int:
print list(csv.reader(open(filename, 'rb'), quoting=csv.QUOTE_NONNUMERIC))
If you are working with numpy arrays:
import numpy
print numpy.loadtxt(filename, dtype='i', delimiter=',')
See Why NumPy instead of Python lists?
All examples produce arrays equal to:
[[1 0 4 3 6 7 4 8 3 2 1 0]
[2 3 6 3 2 1 7 4 3 1 1 0]
[5 2 1 3 4 6 4 8 9 5 2 1]]
Read the data from the file. Here's one way:
f = open('twoDArray.txt', 'r')
buffer = f.read()
f.close()
Parse the data into a table
table = [map(int, row.split(',')) for row in buffer.strip().split("\n")]
>>> print table
[[1, 0, 4, 3, 6, 7, 4, 8, 3, 2, 1, 0], [2, 3, 6, 3, 2, 1, 7, 4, 3, 1, 1, 0], [5, 2, 1, 3, 4, 6, 4, 8, 9, 5, 2, 1]]
Perhaps you want the transpose instead:
transpose = zip(*table)
>>> print transpose
[(1, 2, 5), (0, 3, 2), (4, 6, 1), (3, 3, 3), (6, 2, 4), (7, 1, 6), (4, 7, 4), (8, 4, 8), (3, 3, 9), (2, 1, 5), (1, 1, 2), (0, 0, 1)]
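The snippets above are written for Python 2. On Python 3, map and zip return lazy iterators, so the results need to be turned into lists explicitly. A sketch of the equivalent, with the file contents inlined as a string (an assumption made to keep the example self-contained):

```python
buffer = (
    "1,0,4,3,6,7,4,8,3,2,1,0\n"
    "2,3,6,3,2,1,7,4,3,1,1,0\n"
    "5,2,1,3,4,6,4,8,9,5,2,1\n"
)

# One list of ints per non-empty line.
table = [[int(item) for item in row.split(",")]
         for row in buffer.strip().split("\n")]

# zip(*table) groups the i-th item of every row, i.e. the columns.
transpose = [list(column) for column in zip(*table)]

print(table[0])
print(transpose[0])
```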
