I have a file with data in the form:
Your
Name
I am reading the file and want to convert the data into a list of lists, where each word becomes its own list of characters. I tried the code below:
def return_list():
    a1_filename = tkinter.filedialog.askopenfilename()
    a1_file = open(a1_filename, 'r')
    grade = []
    line = a1_file.readline()
    while line != '':
        for words in line:
            b = words.rstrip('\n')
            grade.append([b])
        line = a1_file.readline()
    return grade
My output is:
[['Y'], ['o'], ['u'], ['r'], [''], ['N'], ['a'], ['m'], ['e'], ['']]
But what I am trying to get is
[['Y','o','u','r'], ['N','a','m','e']]
You have two problems. The main one is that you're trying to build a two-level data structure with a single-level construction. Instead, build the list of letters you want, and then append that list to your master list.
The second problem is that you're appending [b], a one-element list, so each character ends up wrapped in its own list.
while line != '':
    chars = []
    for words in line:
        b = words.rstrip(' \n')
        if b:  # skip the empty string left behind by whitespace characters
            chars.append(b)
    grade.append(chars)
    line = a1_file.readline()
You needed to add a temporary list inside your loop. That said, we can make a change here to properly close the file. In your example you never call a1_file.close(), so maybe you don't know that you need to close the file when you are done. To avoid forgetting, it is best to use a with open statement, as it closes the file automatically after completion.
Try this:
def return_list():
    a1_filename = 'test_file'
    grade = []  # set up the main list to be returned before the open statement
    with open(a1_filename, 'r') as a1_file:  # with open closes the file automatically
        line = a1_file.readline().strip()  # strip whitespace before the while test
        while line != '':
            temp_list = []  # stores one word's characters before appending to grade
            for char in line:
                if char != '\n':
                    temp_list.append(char)
            grade.append(temp_list)  # append the temp list to the main list
            line = a1_file.readline()
    return grade

print(return_list())
Results:
[['Y', 'o', 'u', 'r'], ['N', 'a', 'm', 'e']]
If you would like a short and simple version, we can use a fun one-liner that cleans up the newlines and generates our list of lists at the same time. That said, if it hurts readability I would avoid one-liners, but if you can understand one just by reading it, it is a fun and clean option:
with open('test_file') as f:
    print([[char for char in word] for word in f.read().splitlines() if word])
Or:
with open('test_file') as f: print([[char for char in word] for word in f.read().splitlines() if word])
Results:
[['Y', 'o', 'u', 'r'], ['N', 'a', 'm', 'e']]
I put your example file into test.txt. Here is another approach where a filter follows a map; it's ugly, but it works as one expression:
with open("test.txt", 'r') as f:
    data = [*filter(lambda x: x != [],
                    map(lambda x: list(x.strip("\n")),
                        f.readlines()))]
This should solve the problem.
with open('path/filename') as f:
lines = f.read().splitlines()
lines = [[char for char in line] for line in list(filter(None, lines))]
You're pretty close - just gotta make sure that each list gets appended to a master list, so that you have your list of lists.
def return_list():
    a1_filename = tkinter.filedialog.askopenfilename()
    a1_file = open(a1_filename, 'r')
    list_of_lists = []
    line = a1_file.readline()
    while line != '':
        grade = []  # start a fresh list for each line
        for words in line:
            b = words.rstrip('\n')
            if b:
                grade.append(b)
        list_of_lists.append(grade)
        line = a1_file.readline()
    return list_of_lists
grade.append([b]) is going to wrap b in a list and then append it. That's not quite what you're looking for - instead, why not make a new list entry for each line?
grade = []
line = a1_file.readline()
while line != '':
    subgrade = []
    for words in line:
        b = words.rstrip('\n')
        if b:  # skip the empty string left by the newline
            subgrade.append(b)
    grade.append(subgrade)
    line = a1_file.readline()
return grade
change your
for words in line:
to
for words in line.split():
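Note that on its own this change yields whole words ([['Your'], ['Name']]); to get the character lists from the question, combine it with list(). A minimal sketch with inline sample data standing in for the file:

```python
lines = ["Your\n", "Name\n"]  # stand-in for the file's lines
grade = []
for line in lines:
    for word in line.split():     # split() also drops the trailing newline
        grade.append(list(word))  # list() breaks the word into characters
print(grade)  # [['Y', 'o', 'u', 'r'], ['N', 'a', 'm', 'e']]
```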
A list comprehension offers a more succinct and Pythonic solution. Just update the loop:
while line != '':
    grade += [list(w.rstrip('\n')) for w in line.split(' ')]
    line = a1_file.readline()
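For illustration, here is the same loop with the file reads replaced by inline sample lines (a sketch, not the original file handling):

```python
grade = []
for line in ["Your\n", "Name\n"]:  # stand-in for the a1_file.readline() loop
    grade += [list(w.rstrip('\n')) for w in line.split(' ')]
print(grade)  # [['Y', 'o', 'u', 'r'], ['N', 'a', 'm', 'e']]
```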
Related
I have a list of strings and another list of "bad" substrings, and I want to remove every occurrence of the bad substrings from my list, so the output of the below would be foo, bar, foobar, foofoo. Currently I have tried a few things, for example:
mylist = ['foo!', 'bar\\n', 'foobar!!??!!', 'foofoo::!*']
remove_list = ['\\n', '!', '*', '?', ':']
for remove in remove_list:
    for strings in mylist:
        strings = strings.replace(remove, ' ')
The above code doesn't work. At one point I assigned the result to a new variable and appended it afterwards, but that didn't work well either, because if there were two issues in a string it would be appended twice.
You changed the temporary variable, not the original list. Instead, assign the result back into mylist:
for bad in remove_list:
    for pos, string in enumerate(mylist):
        mylist[pos] = string.replace(bad, ' ')
Try this:
mylist = ['foo!', 'bar\\n', 'foobar!!??!!', 'foofoo::!*']
bads = ['\\n', '!', '*', '?', ':']
result = []
for s in mylist:
    # s is a temporary copy of the list element
    for bad in bads:
        s = s.replace(bad, '')  # remove every bad substring from it
    result.append(s)
print(result)
Could be implemented more concisely, but this way it's more understandable.
I had a hard time interpreting the question, but I see you have the desired result at the top of your question.
mylist = ['foo!', 'bar\\n', 'foobar!!??!!', 'foofoo::!*']
remove_list = ['\\n', '!', '*', '?', ':']
output = []
for strings in mylist:
    for remove in remove_list:
        strings = strings.replace(remove, '')
    output.append(strings)
import re

regex = re.compile(r'[^a-zA-Z]')  # match any character that is not a letter
for list1 in mylist:
    t = regex.sub('', list1)
    print(t)
If you just want to get rid of non-letter characters, this does the job and avoids comparing two separate lists entirely.
Why not have regex do the work for you? No nested loops this way, and re.escape takes care of escaping the special characters:
import re
mylist = ['foo!', 'bar\\n', 'foobar!!??!!', 'foofoo::!*']
remove_list = ['\\n', '!', '*', '?', ':']
removals = re.compile('|'.join(re.escape(bad) for bad in remove_list))
print([removals.sub('', s) for s in mylist])
['foo', 'bar', 'foobar', 'foofoo']
Another solution is a list comprehension, but note that a naive double comprehension such as [word.replace(bad, '') for word in mylist for bad in remove_list] removes only one bad substring per result. Fold all the replacements into each word instead:
from functools import reduce
list_good = [reduce(lambda w, bad: w.replace(bad, ''), remove_list, word) for word in mylist]
my_list = ["foo!", "bar\\n", "foobar!!??!!", "foofoo::*!"]
to_remove = ["!", "\\n", "?", ":", "*"]
for index, item in enumerate(my_list):
    for char in to_remove:
        if char in item:
            item = item.replace(char, "")
    my_list[index] = item
print(my_list)  # outputs ['foo', 'bar', 'foobar', 'foofoo']
I have a txt file with sentences and labels written column-wise, looking like:
O are
O there
O any
O good
B-GENRE romantic
I-GENRE comedies
O out
B-YEAR right
I-YEAR now
O show
O me
O a
O movie
O about
B-PLOT cars
I-PLOT that
I-PLOT talk
I want to read data from this txt file into two nested lists.
The desired output should be like:
labels = [['O','O','O','O','B-GENRE','I-GENRE','O','B-YEAR','I-YEAR'],['O','O','O','O','O','B-PLOT','I-PLOT','I-PLOT']]
sentences = [['are','there','any','good','romantic','comedies','out','right','now'],['show','me','a','movie','about','cars','that','talk']]
I have tried with the following:
with open("engtrain.bio.txt", "r") as f:
    lsta = []
    for line in f:
        lsta.append([x for x in line.replace("\n", "").split()])
But I get the following output:
[['O', 'are'],
['O', 'there'],
['O', 'any'],
['O', 'good'],
['B-GENRE', 'romantic'],
['I-GENRE', 'comedies'],
['O', 'out'],
['B-YEAR', 'right'],
['I-YEAR', 'now'],
[],
['O', 'show'],
['O', 'me'],
['O', 'a'],
['O', 'movie'],
['O', 'about'],
['B-PLOT', 'cars'],
['I-PLOT', 'that'],
['I-PLOT', 'talk']]
Update
I also tried the following:
with open("engtest.bio.txt", "r") as f:
    lines = f.readlines()
labels = []
sentences = []
for l in lines:
    as_list = l.split("\t")
    labels.append(as_list[0])
    sentences.append(as_list[1].replace("\n", ""))
Unfortunately, I still get an error:
IndexError Traceback (most recent call last)
<ipython-input-66-63c266df6ace> in <module>()
6 as_list = l.strip().split("\t")
7 labels.append(as_list[0])
----> 8 sentences.append(as_list[1].replace("\n", ""))
IndexError: list index out of range
The original data are from this link (engtest.bio or entrain.bio): https://groups.csail.mit.edu/sls/downloads/movie/
Could you help me please?
Thanks in advance
Iterate over each line and split it by tab:
labels = [[]]
sentences = [[]]
with open('engtrain.bio', 'r') as f:
    for line in f.readlines():
        line = line.rstrip()
        if line:
            label, sentence = line.split('\t')
            labels[-1].append(label)
            sentences[-1].append(sentence)
        else:
            labels.append([])
            sentences.append([])
Output labels:
[['O', 'O', 'O', 'B-ACTOR', 'I-ACTOR'], ['O', 'O', 'O', 'O', 'B-ACTOR', 'I-ACTOR', 'O', 'O', 'B-YEAR'] ...
Output sentences:
[['what', 'movies', 'star', 'bruce', 'willis'], ['show', 'me', 'films', 'with', 'drew', 'barrymore', 'from', 'the', '1980s'] ...
The lines in your file can be logically grouped into sections, separated by blank lines. So you in fact have a two-level data structure: you need to process a list of sections, and inside each section a list of lines. The text file itself, of course, is a flat list of lines, so we need to reconstruct the two levels.
This is a very general pattern, so here's one way to code it that can be reused, regardless of what you need to do inside each section:
labels = []
sentences = []
# Prepare next section
inner_labels = []
inner_sentences = []
with open('engtrain.bio.txt') as f:
    for line in f.readlines():
        if len(line.strip()) == 0:
            # Finish previous section
            labels.append(inner_labels)
            sentences.append(inner_sentences)
            # Prepare next section
            inner_labels = []
            inner_sentences = []
            continue
        # Process line in section
        l, s = line.strip().split()
        inner_labels.append(l)
        inner_sentences.append(s)
# Finish previous section
labels.append(inner_labels)
sentences.append(inner_sentences)
To reuse this in a different situation, just re-define "Prepare next section", "Process line in section", and "Finish previous section".
There may be a more pythonic way to pre-process the list of lines, etc, but this is a reliable pattern that gets the job done.
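One such pre-processing sketch uses itertools.groupby to split the flat line list into runs of blank and non-blank lines; the sample lines below are a hypothetical stand-in for the file's contents:

```python
from itertools import groupby

# Hypothetical inline stand-in for the file's lines
lines = ["O\tare\n", "O\tthere\n", "\n", "O\tshow\n", "B-PLOT\tcars\n"]

labels, sentences = [], []
# groupby yields runs of blank (key True) and non-blank (key False) lines
for is_blank, section in groupby((l.strip() for l in lines), key=lambda l: l == ''):
    if is_blank:
        continue
    pairs = [l.split('\t') for l in section]
    labels.append([p[0] for p in pairs])
    sentences.append([p[1] for p in pairs])

print(labels)     # [['O', 'O'], ['O', 'B-PLOT']]
print(sentences)  # [['are', 'there'], ['show', 'cars']]
```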
all_labels, all_sentences = [], []
with open('inp', 'r') as f:
    lines = f.readlines()
lines.append('')  # make sure we process the last sentence
labels, sentences = [], []
for line in lines:
    line = line.strip()
    if not line:  # detect the end of a sentence
        if len(labels):  # make sure we got some words here
            all_labels.append(labels)
            all_sentences.append(sentences)
        labels, sentences = [], []
        continue
    # extend the current sentence
    label, sentence = line.split()
    labels.append(label)
    sentences.append(sentence)
print(all_labels)
print(all_sentences)
I’m a programming neophyte and would like some assistance in understanding why the following algorithm is behaving in a particular manner.
My objective is for the function to read in a text file containing words (can be capitalized), strip the whitespace, split the items into separate lines, convert all capital first characters to lowercase, remove all single characters (e.g., “a”, “b”, “c”, etc.), and add the resulting words to a list. All words are to be a separate item in the list for further processing.
Input file:
A text file (‘sample.txt’) contains the following data - “a apple b Banana c cherry”
Desired output:
['apple', 'banana', 'cherry']
In my initial attempt I tried to iterate through the list of words to test if their length was equal to 1. If so, the word was to be removed from the list, with the other words remaining in the list. This resulted in the following, non-desired output: [None, None, None]
filename = 'sample.txt'
with open(filename) as input_file:
    word_list = input_file.read().strip().split(' ')
word_list = [word.lower() for word in word_list]
word_list = [word_list.remove(word) for word in word_list if len(word) == 1]
print(word_list)
Produced non-desired output = [None, None, None]
My next attempt was to instead iterate through the list for words to test if their length was greater than 1. If so, the word was to be added to the list (leaving the single characters behind). The desired output was achieved using this method.
filename = 'sample.txt'
with open(filename) as input_file:
    word_list = input_file.read().strip().split(' ')
word_list = [word.lower() for word in word_list]
word_list = [word for word in word_list if len(word) > 1]
print(word_list)
Produced desired output = ['apple', 'banana', 'cherry']
My questions are:
Why didn’t the initial code produce the desired result when it seemed to be the most logical and most efficient?
What is the best ‘Pythonic’ way to achieve the desired result?
The reasons you got the output you got are:
You're removing items from the list as you're looping through it
You are trying to use the output of list.remove (which just modifies the list and returns None)
Your last list comprehension (word_list = [word_list.remove(word) for word in word_list if len(word) == 1]) is essentially equivalent to this:
new_word_list = []
for word in word_list:
    if len(word) == 1:
        new_word_list.append(word_list.remove(word))
word_list = new_word_list
And as you loop through it this happens:
# word_list == ['a', 'apple', 'b', 'banana', 'c', 'cherry']
# new_word_list == []
word = word_list[0] # word == 'a'
new_word_list.append(word_list.remove(word))
# word_list == ['apple', 'b', 'banana', 'c', 'cherry']
# new_word_list == [None]
word = word_list[1] # word == 'b'
new_word_list.append(word_list.remove(word))
# word_list == ['apple', 'banana', 'c', 'cherry']
# new_word_list == [None, None]
word = word_list[2] # word == 'c'
new_word_list.append(word_list.remove(word))
# word_list == ['apple', 'banana', 'cherry']
# new_word_list == [None, None, None]
word_list = new_word_list
# word_list == [None, None, None]
The best 'Pythonic' way to do this (in my opinion) would be:
with open('sample.txt') as input_file:
    file_content = input_file.read()

word_list = []
for word in file_content.strip().split(' '):
    if len(word) == 1:
        continue
    word_list.append(word.lower())

print(word_list)
In your first approach, you are storing the result of word_list.remove(word) in the list, which is None, because list.remove() returns nothing; it just acts on the given list in place.
Your second approach is the Pythonic way to achieve your goal.
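As a sketch, the lowercasing and filtering comprehensions can also be merged into a single pass (sample text inlined here in place of the file read):

```python
text = "a apple b Banana c cherry"  # stand-in for input_file.read().strip()
word_list = [w.lower() for w in text.split(' ') if len(w) > 1]
print(word_list)  # ['apple', 'banana', 'cherry']
```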
The second attempt is the most Pythonic. The first one can still be achieved by iterating over a copy of the list, so that removals don't disturb the iteration:
filename = 'sample.txt'
with open(filename) as input_file:
    word_list = input_file.read().strip().split(' ')
word_list = [word.lower() for word in word_list]
for word in word_list[:]:  # [:] makes a copy to iterate over
    if len(word) == 1:
        word_list.remove(word)
print(word_list)
Why didn’t the initial code produce the desired result when it seemed
to be the most logical and most efficient?
It's advised never to alter a list while iterating over it: the iterator walks the list by index, so removing an item shifts everything after it and causes elements to be skipped.
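A tiny sketch of that skipping behaviour: every element below matches the removal condition, yet some survive, because each removal shifts the items the iterator has not reached yet.

```python
nums = ['a', 'x', 'y', 'b', 'z']
for item in nums:       # the iterator tracks positions in the shrinking list
    if len(item) == 1:  # every element matches this condition
        nums.remove(item)
print(nums)  # ['x', 'b'] -- not the empty list you might expect
```

Iterating over a copy (for item in nums[:]) avoids the problem.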
What is the best ‘Pythonic’ way to achieve the desired result?
Your second attempt. But I'd use a better naming convention and your comprehensions can be combined as you're only making them lowercase in the first one:
word_list = input_file.read().strip().split(' ')
filtered_word_list = [word.lower() for word in word_list if len(word) > 1]
I am writing a mini program with a function that reads in a text file and returns the individual words from its sentences. However, I am having trouble seeing the individual words printed even though I return them. I don't really understand why, unless I have a big problem with my whitespace. Can you please help? For your information, I am only a beginner. The program asks the user for a filename, reads the file in the function, turns the file into a list, finds the individual words, and stores them in that list.
file_input = input("enter a filename to read: ")

#unique_words = []
def file(user):
    unique_words = []
    csv_file = open(user + ".txt","w")
    main_file = csv_file.readlines()
    csv_file.close()
    for i in main_list:
        if i not in unique_words:
            unique_words.append(i)
    return unique_words

#display the results of the file being read in
print(file(file_input))
Sorry, I am using Notepad; my file contains:
check to see if checking works
It seems you only have one word on each line of your file.
def read_file(user):
    with open(user + ".txt", "r") as f:
        data = [line.strip() for line in f.readlines()]
    return list(set(data))
--- update ---
If you have more than one word per line, separated by spaces:
def read_file(user):
    with open(user + ".txt", "r") as f:
        data = [item.strip() for line in f.readlines() for item in line.split(' ')]
    return list(set(data))
In fact, I cannot reproduce your problem. Given a proper CSV input file 1) such as:
a,b,c,d
e,f,g,h
i,j,k,l
your program prints this, which apart from the last '' seems fine:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', '']
However, you can significantly simplify your code.
instead of appending a , to each line, and then joining by "", just join by , (this will also get rid of that last '')
do the strip directly in join, using a generator expression
main_string = ",".join(line.strip() for line in main_file)
instead of join and then split, use a double-for-loop list comprehension:
main_list = [word for line in csv_file for word in line.strip().split(",")]
instead of doing all this by hand, use the csv module:
main_list = [word for row in csv.reader(csv_file) for word in row]
assuming that order is not important, use a set to remove duplicates:
unique_words = set(main_list)
and if order is important, you can (ab)use collections.OrderedDict:
unique_words = list(collections.OrderedDict((x, None) for x in main_list))
use with to open and close the file
Putting it all together:
import csv

def read_file(user):
    with open(user + ".txt") as csv_file:
        main_list = [word for row in csv.reader(csv_file) for word in row]
    unique_words = set(main_list)  # or OrderedDict, see above
    return unique_words
1) Update: The reason it does not work on the "Example text..." file shown in your edit is that it is not a CSV file. CSV means "comma-separated values", but the words in that file are separated by spaces, so you will have to split by spaces instead of by commas:
def read_file(user):
    with open(user + ".txt") as text_file:
        main_list = [word for line in text_file for word in line.strip().split()]
    return set(main_list)
If all you want is a list of each word that occurs in the text, you are doing far too much work. You want something like this:
unique_words = []
all_words = []
with open(file_name, 'r') as in_file:
    text_lines = in_file.readlines()  # read in all lines from the file as a list
for line in text_lines:
    all_words.extend(line.split())  # extend the list of all words with the words in this line
unique_words = list(set(all_words))  # reduce the list of all words to unique words
You can simplify your code by using a set because it will only contain unique elements.
user_file = input("enter a filename to read: ")

# function to read any file
def read_file(user):
    unique_words = set()
    csv_file = open(user + ".txt", "r")
    main_file = csv_file.readlines()
    csv_file.close()
    for line in main_file:
        line = line.split(',')
        unique_words.update([x.strip() for x in line])
    return list(unique_words)

# display the results of the file being read in
print(read_file(user_file))
The output for a file with the contents:
Hello, world1
Hello, world2
is
['world2', 'world1', 'Hello']
This is my code:
line = input('Line: ')
if 'a' in line:
    print(line.replace('a', 'afa'))
elif 'e' in line:
    print(line.replace('e', 'efe'))
It's obviously not finished, but I was wondering, let's say there was an 'a' and an 'e', how would I replace both of them in the same statement?
Why not:
import re
text = 'hello world'
res = re.sub('([aeiou])', r'\1f\1', text)
# hefellofo woforld
line = input('Line: ')
line = line.replace('a', 'afa')
line = line.replace('e', 'efe')
line = line.replace('i', 'ifi')
line = line.replace('o', 'ofo')
line = line.replace('u', 'ufu')
print(line)
Got it!
let's say there was an 'a' and an 'e', how would I replace both of them in the same statement?
You can chain the replace() calls:
print(line.replace('a', 'afa').replace('e', 'efe'))
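As a sketch of another standard-library route, str.maketrans accepts a dict mapping single characters to replacement strings, so every vowel can be handled in one translate() call:

```python
# Build a translation table mapping each vowel v to v + 'f' + v
vowel_map = str.maketrans({v: v + 'f' + v for v in 'aeiou'})
line = 'hello world'  # stand-in for the input() value
print(line.translate(vowel_map))  # hefellofo woforld
```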
my_string = 'abcdefghij'
replace_objects = {'a': 'b', 'c': 'd'}
for key in replace_objects:
    my_string = my_string.replace(key, replace_objects[key])
If you've got a load of replacements to do and you want to keep adding to the replacement list over time, it's quite easy with a dictionary, although a regex (the re module) is often preferred.