I am writing a mini program with a function that reads in a text file and returns the individual words from it. However, I am having trouble seeing the individual words printed, even though I return them. I don't really get why, unless I have a big problem with my whitespace. Can you please help? For your information, I am only a beginner. The program asks the user for a filename; it then reads the file in the function, which should turn the file into a list, find the individual words from the list, and store them in that list.
file_input = input("enter a filename to read: ")

#unique_words = []
def file(user):
    unique_words = []
    csv_file = open(user + ".txt", "r")
    main_file = csv_file.readlines()
    csv_file.close()
    main_string = "".join(line.strip() + "," for line in main_file)
    main_list = main_string.split(",")
    for i in main_list:
        if i not in unique_words:
            unique_words.append(i)
    return unique_words

#display the results of the file being read in
print(file(file_input))
Sorry, I am using Notepad; the file contains:
check to see if checking works
It seems you only have one word on each line of your file.

def read_file(user):
    with open(user + ".txt", "r") as f:
        data = [line.strip() for line in f.readlines()]
    return list(set(data))
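For illustration (a minimal sketch; words.txt is a made-up filename for this demo), note that going through a set removes duplicates but also discards the original order of the words:

```python
# Sketch: write a small sample file with one word per line, then read it
# back with the answer's function. The filename "words.txt" is hypothetical.
with open("words.txt", "w") as f:
    f.write("apple\nbanana\napple\ncherry\n")

def read_file(user):
    with open(user + ".txt", "r") as f:
        data = [line.strip() for line in f.readlines()]
    return list(set(data))

words = read_file("words")
# Duplicates are gone, but the order of the result is arbitrary.
```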
--- Update ---

If you have more than one word per line, separated by spaces:

def read_file(user):
    with open(user + ".txt", "r") as f:
        data = [item.strip() for line in f.readlines() for item in line.split(' ')]
    return list(set(data))
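One caveat with split(' '): splitting on a single space keeps empty strings when words are separated by runs of spaces or tabs, whereas split() with no argument splits on any whitespace:

```python
line = "hello   world\tagain"

single_space = line.split(' ')  # empty strings appear for repeated spaces
any_whitespace = line.split()   # splits on any run of whitespace
```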
In fact, I cannot reproduce your problem. Given a proper CSV input file 1) such as
a,b,c,d
e,f,g,h
i,j,k,l
your program prints this, which apart from the last '' seems fine:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', '']
However, you can significantly simplify your code.
- Instead of appending a "," to each line and then joining by "", just join by "," (this will also get rid of that last ''). Do the strip directly in the join, using a generator expression:

  main_string = ",".join(line.strip() for line in main_file)

- Instead of joining and then splitting, use a double-for list comprehension:

  main_list = [word for line in csv_file for word in line.strip().split(",")]

- Instead of doing all this by hand, use the csv module:

  main_list = [word for row in csv.reader(csv_file) for word in row]

- Assuming that order is not important, use a set to remove duplicates:

  unique_words = set(main_list)

- If order is important, you can (ab)use collections.OrderedDict:

  unique_words = list(collections.OrderedDict((x, None) for x in main_list))

- Use with to open and close the file.
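As an aside (not part of the original answer): on Python 3.7+ a plain dict also preserves insertion order, so dict.fromkeys achieves the same order-preserving deduplication without importing collections:

```python
main_list = ['b', 'a', 'b', 'c', 'a']

# dict keys keep first-seen order on Python 3.7+, so this removes
# duplicates while preserving the order of first occurrence.
unique_words = list(dict.fromkeys(main_list))
```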
Putting it all together:
import csv

def read_file(user):
    with open(user + ".txt") as csv_file:
        main_list = [word for row in csv.reader(csv_file) for word in row]
        unique_words = set(main_list)  # or OrderedDict, see above
    return unique_words
1) Update: The reason it does not work on the "Example text..." file shown in your edit is that it is not a CSV file. CSV means "comma-separated values", but the words in that file are separated by spaces, so you will have to split on spaces instead of commas:
def read_file(user):
    with open(user + ".txt") as text_file:
        main_list = [word for line in text_file for word in line.strip().split()]
    return set(main_list)
If all you want is a list of each word that occurs in the text, you are doing far too much work. You want something like this:
unique_words = []
all_words = []

with open(file_name, 'r') as in_file:
    text_lines = in_file.readlines()  # read in all lines from the file as a list

for line in text_lines:
    all_words.extend(line.split())  # extend the list of all words with the words in this line

unique_words = list(set(all_words))  # reduce the list of all words to the unique words
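If you don't need the intermediate all_words list, the same idea fits in a single set comprehension (a sketch; sample_words.txt is a made-up filename for this demo):

```python
# Create a small stand-in input file so the example is self-contained.
with open("sample_words.txt", "w") as f:
    f.write("the cat sat\non the mat\n")

# Collect every whitespace-separated word into a set, then listify it.
with open("sample_words.txt") as in_file:
    unique_words = list({word for line in in_file for word in line.split()})
```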
You can simplify your code by using a set because it will only contain unique elements.
user_file = input("enter a filename to read: ")

#function to read any file
def read_file(user):
    unique_words = set()
    csv_file = open(user + ".txt", "r")
    main_file = csv_file.readlines()
    csv_file.close()
    for line in main_file:
        line = line.split(',')
        unique_words.update([x.strip() for x in line])
    return list(unique_words)

#display the results of the file being read in
print(read_file(user_file))
The output for a file with the contents:
Hello, world1
Hello, world2
is
['world2', 'world1', 'Hello']
Related
I have a txt file with sentences and labels written column-wise, which looks like:
O are
O there
O any
O good
B-GENRE romantic
I-GENRE comedies
O out
B-YEAR right
I-YEAR now
O show
O me
O a
O movie
O about
B-PLOT cars
I-PLOT that
I-PLOT talk
I want to read data from this txt file into two nested lists.
The desired output should be like:
labels = [['O','O','O','O','B-GENRE','I-GENRE','O','B-YEAR','I-YEAR'],['O','O','O','O','O','B-PLOT','I-PLOT','I-PLOT']]
sentences = [['are','there','any','good','romantic','comedies','out','right','now'],['show','me','a','movie','about','cars','that','talk']]
I have tried with the following:
with open("engtrain.bio.txt", "r") as f:
    lsta = []
    for line in f:
        lsta.append([x for x in line.replace("\n", "").split()])
But I have the following output:
[['O', 'are'],
['O', 'there'],
['O', 'any'],
['O', 'good'],
['B-GENRE', 'romantic'],
['I-GENRE', 'comedies'],
['O', 'out'],
['B-YEAR', 'right'],
['I-YEAR', 'now'],
[],
['O', 'show'],
['O', 'me'],
['O', 'a'],
['O', 'movie'],
['O', 'about'],
['B-PLOT', 'cars'],
['I-PLOT', 'that'],
['I-PLOT', 'talk']]
Update
I also tried the following:
with open("engtest.bio.txt", "r") as f:
    lines = f.readlines()

labels = []
sentences = []
for l in lines:
    as_list = l.split("\t")
    labels.append(as_list[0])
    sentences.append(as_list[1].replace("\n", ""))
Unfortunately, I still get an error:
IndexError Traceback (most recent call last)
<ipython-input-66-63c266df6ace> in <module>()
6 as_list = l.strip().split("\t")
7 labels.append(as_list[0])
----> 8 sentences.append(as_list[1].replace("\n", ""))
IndexError: list index out of range
The original data are from this link (engtest.bio or entrain.bio): https://groups.csail.mit.edu/sls/downloads/movie/
Could you help me please?
Thanks in advance
Iterate over each line and split it by tab:
labels = [[]]
sentences = [[]]

with open('engtrain.bio', 'r') as f:
    for line in f.readlines():
        line = line.rstrip()
        if line:
            label, sentence = line.split('\t')
            labels[-1].append(label)
            sentences[-1].append(sentence)
        else:
            labels.append([])
            sentences.append([])
Output labels:
[['O', 'O', 'O', 'B-ACTOR', 'I-ACTOR'], ['O', 'O', 'O', 'O', 'B-ACTOR', 'I-ACTOR', 'O', 'O', 'B-YEAR'] ...
Output sentences:
[['what', 'movies', 'star', 'bruce', 'willis'], ['show', 'me', 'films', 'with', 'drew', 'barrymore', 'from', 'the', '1980s'] ...
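One caveat: if the file ends with a blank line (or contains consecutive blank lines), this approach leaves empty inner lists in the result; they are easy to filter out afterwards:

```python
# Example result with an empty group left over from a trailing blank line.
labels = [['O', 'O'], [], ['B-PLOT']]

# Keep only the non-empty groups.
labels = [group for group in labels if group]
```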
The lines in your file can be logically grouped into sections, separated by blank lines. So you have in fact a two-level data structure: you need to process a list of sections, and inside each section you need to process a list of lines. Of course, the text file is a flat list of lines, so we need to reconstruct the two levels.
This is a very general pattern, so here's one way to code it that can be reused, regardless of what you need to do inside each section:
labels = []
sentences = []

# Prepare next section
inner_labels = []
inner_sentences = []

with open('engtrain.bio.txt') as f:
    for line in f.readlines():
        if len(line.strip()) == 0:
            # Finish previous section
            labels.append(inner_labels)
            sentences.append(inner_sentences)
            # Prepare next section
            inner_labels = []
            inner_sentences = []
            continue
        # Process line in section
        l, s = line.strip().split()
        inner_labels.append(l)
        inner_sentences.append(s)

# Finish previous section
labels.append(inner_labels)
sentences.append(inner_sentences)
To reuse this in a different situation, just re-define "Prepare next section", "Process line in section", and "Finish previous section".
There may be a more pythonic way to pre-process the list of lines, etc, but this is a reliable pattern that gets the job done.
all_labels, all_sentences = [], []

with open('inp', 'r') as f:
    lines = f.readlines()

lines.append('')  # make sure we process the last sentence
labels, sentences = [], []
for line in lines:
    line = line.strip()
    if not line:  # detect the end of a sentence
        if len(labels):  # make sure we got some words here
            all_labels.append(labels)
            all_sentences.append(sentences)
            labels, sentences = [], []
        continue
    # extend the current sentence
    label, sentence = line.split()
    labels.append(label)
    sentences.append(sentence)

print(all_labels)
print(all_sentences)
I have a file containing data in the form:
Your
Name
I am reading the file and want to convert the data into a list, but with each word as a separate list of characters. I tried the code below:
def return_list():
    a1_filename = tkinter.filedialog.askopenfilename()
    a1_file = open(a1_filename, 'r')
    grade = []
    line = a1_file.readline()
    while (line != ''):
        for words in line:
            b = words.rstrip('\n')
            grade.append([b])
        line = a1_file.readline()
    return grade
My output is:
[['Y'], ['o'], ['u'], ['r'], [''], ['N'], ['a'], ['m'], ['e'], ['']]
But what I am trying to get is
[['Y','o','u','r'], ['N','a','m','e']]
You have two problems. The main one is that you're trying to build a two-level data structure with a single, flat sequence of appends. Instead, build the list of letters you want for each line, and then append that list to your master list.
The second problem is that you're appending [b], a one-element list, so each character ends up wrapped in its own list.
while (line != ''):
    chars = []
    for ch in line:
        b = ch.rstrip(' \n')
        if b:  # skip the stripped newline character
            chars.append(b)
    grade.append(chars)
    line = a1_file.readline()
You needed to add a temporary list in your loop. That said, we can make a change here to help close the file: in your example you never call a1_file.close(), so maybe you don't know that you need to close the file when you are done. To avoid forgetting, it is best to use a with open statement, as it will automatically close the file after completion.
Try this:
def return_list():
    a1_filename = 'test_file'
    grade = []  # set up the main list to be returned before the open statement
    with open(a1_filename, 'r') as a1_file:  # use with open so the file is properly closed
        line = a1_file.readline().strip()  # strip whitespace here before the while statement
        while line != '':
            temp_list = []  # stores each word's characters before appending to the main list grade
            for char in line:
                if char != '\n':
                    temp_list.append(char)
            grade.append(temp_list)  # append temp list to main list
            line = a1_file.readline()
    return grade

print(return_list())
Results:
[['Y', 'o', 'u', 'r'], ['N', 'a', 'm', 'e']]
If you would like a short and simple version, we can use a fun one-liner to clean up the newlines and generate our list of lists at the same time. That said, if it affects readability I would avoid one-liners, but if you can understand the one-liner just by reading it, they are a fun and clean option to use:
with open('test_file') as f:
    print([[char for char in word] for word in f.read().splitlines() if word])
Or:
with open('test_file') as f: print([[char for char in word] for word in f.read().splitlines() if word])
Results:
[['Y', 'o', 'u', 'r'], ['N', 'a', 'm', 'e']]
I put your example file into test.txt; here is another approach, with a map followed by a filter. It's ugly, but it works in one expression:

with open("test.txt", 'r') as f:
    data = [*filter(lambda x: x != [],
                    map(lambda x: list(x.strip("\n")),
                        f.readlines()))]
This should solve the problem.
with open('path/filename') as f:
    lines = f.read().splitlines()
    lines = [[char for char in line] for line in list(filter(None, lines))]
You're pretty close; you just have to make sure that each list gets appended to a master list, so that you have your list of lists.
def return_list():
    a1_filename = tkinter.filedialog.askopenfilename()
    a1_file = open(a1_filename, 'r')
    list_of_lists = []
    line = a1_file.readline()
    while (line != ''):
        grade = []  # start a fresh list for each line
        for char in line.rstrip('\n'):
            grade.append(char)
        list_of_lists.append(grade)
        line = a1_file.readline()
    return list_of_lists
grade.append([b]) is going to wrap b in a list and then append it. That's not quite what you're looking for; instead, why not make a new list entry for each line?
grade = []
line = a1_file.readline()
while (line != ''):
    subgrade = []
    for ch in line:
        b = ch.rstrip('\n')
        if b:  # skip the stripped newline character
            subgrade.append(b)
    grade.append(subgrade)
    line = a1_file.readline()
return grade
change your
for words in line:
to
for words in line.split():
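The difference matters because iterating over a string yields single characters, while split() yields whole words:

```python
line = "Your Name"

chars = list(line)    # iterating a string gives individual characters
words = line.split()  # splitting gives whole words
```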
A list comprehension offers a more succinct and pythonic solution. Just update the loop:

while (line != ''):
    grade += [list(w.rstrip('\n')) for w in line.split(' ')]
    line = a1_file.readline()
file_str = input("Enter poem: ")
my_file = open(file_str, "r")
words = file_str.split(',' or ';')
I have a file on my computer that contains a really long poem, and I want to see if there are any words that are duplicated per line (hence it being split by punctuation).
I have that much, and I don't want to use a module or Counter, I would prefer to use loops. Any ideas?
You can use sets to track seen items and duplicates:
>>> words = 'the fox jumped over the lazy dog and over the bear'.split()
>>> seen = set()
>>> dups = set()
>>> for word in words:
        if word in seen:
            if word not in dups:
                print(word)
                dups.add(word)
        else:
            seen.add(word)
the
over
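Applied per line, as the question asks, the same two-set idea just resets for each line (a sketch; the lines list stands in for the poem file):

```python
# Stand-in for the lines read from the poem file.
lines = ["the fox jumped over the lazy dog the end",
         "a quiet line with no repeats"]

dups_per_line = []
for line in lines:
    seen, dups = set(), set()  # reset the tracking sets for each line
    for word in line.split():
        if word in seen:
            dups.add(word)
        else:
            seen.add(word)
    dups_per_line.append(dups)
```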
with open(r"specify the path of the file") as f:
    data = f.read().split()
if set(i for i in data if data.count(i) > 1):
    print("Duplicates found")
else:
    print("None")
Solved! I can give the explanation with a working program.
File content of sam.txt:
Hello this is star hello the data are Hello so you can move to the
hello
file_content = []
resultant_list = []
repeated_element_list = []

with open(file="sam.txt", mode="r") as file_obj:
    file_content = file_obj.readlines()

print("\n debug the file content ", file_content)

for line in file_content:
    temp = line.strip('\n').split(" ")  # strip the newline and split the line on spaces, stored as a list
    for word in temp:
        resultant_list.append(word)

print("\n debug resultant_list", resultant_list)

# Now this is the main for loop to check each string against the strings after it
for ii in range(0, len(resultant_list)):
    # is_repeated checks whether the element count is greater than 1; if so, proceed with the duplicate logic
    is_repeated = resultant_list.count(resultant_list[ii])
    if is_repeated > 1:
        if resultant_list[ii] not in repeated_element_list:
            for2count = ii + 1
            # this for loop shifts the iterator to the following strings
            for jj in range(for2count, len(resultant_list)):
                if resultant_list[ii] == resultant_list[jj]:
                    repeated_element_list.append(resultant_list[ii])

print("The repeated strings are {}\n and total counts {}".format(repeated_element_list, len(repeated_element_list)))
Output:
debug the file content ['Hello this is abdul hello\n', 'the data are Hello so you can move to the hello']
debug resultant_list ['Hello', 'this', 'is', 'abdul', 'hello', 'the', 'data', 'are', 'Hello', 'so', 'you', 'can', 'move', 'to', 'the', 'hello']
The repeated strings are ['Hello', 'hello', 'the']
and total counts 3
Thanks
def Counter(text):
    d = {}
    for word in text.split():
        d[word] = d.get(word, 0) + 1
    return d

There are your loops :/

To split on punctuation, just use:

import re

matches = re.split("[!.?]", my_corpus)
for match in matches:
    print(Counter(match))
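For example (my_corpus here is just a stand-in string for the poem text):

```python
import re

def Counter(text):
    # Hand-rolled word counter: a dict mapping word -> number of occurrences.
    d = {}
    for word in text.split():
        d[word] = d.get(word, 0) + 1
    return d

my_corpus = "the fox saw the dog. the dog ran!"
# Split the corpus into sentences on punctuation, then count each one.
counts = [Counter(match) for match in re.split("[!.?]", my_corpus)]
```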
For this kind of file:

A hearth came to us from your hearth
foreign hairs with hearth are same are hairs

this will check the whole poem:
lst = []
with open("coz.txt") as f:
    for line in f:
        for word in line.split():  # split by gaps (spaces)
            if word not in lst:
                lst.append(word)
            else:
                print(word)
Output:
>>>
hearth
hearth
are
hairs
>>>
As you see, "hearth" is printed twice here, because the whole poem contains three occurrences of "hearth".
To check line by line:
lst = []
lst2 = []
with open("coz.txt") as f:
    for line in f:
        for word in line.split():
            lst2.append(word)
for x in lst2:
    if x not in lst:
        lst.append(x)
        lst2.remove(x)
print(set(lst2))
>>>
{'hearth', 'are', 'hairs'}
>>>
The sample below strips punctuation and converts the text from a ranbo.txt file into lower case...
Help me split this on whitespace.

infile = open('ranbo.txt', 'r')
lowercased = infile.read().lower()
for c in string.punctuation:
    lowercased = lowercased.replace(c, "")
white_space_words = lowercased.split(?????????)
print white_space_words
Now, after this split, how can I find how many words are in this list? The count or len function?
white_space_words = lowercased.split()

splits on any run of whitespace characters.
'a b \t cd\n ef'.split()
returns
['a', 'b', 'cd', 'ef']
But you could also do it the other way round:
import re
words = re.findall(r'\w+', text)
returns a list of all "words" from text.
Get its length using len():
len(words)
and if you want to join them into a new string with newlines:
text = '\n'.join(words)
As a whole:

import re

with open('ranbo.txt', 'r') as f:
    lowercased = f.read().lower()

words = re.findall(r'\w+', lowercased)
number_of_words = len(words)
text = '\n'.join(words)
I'm trying to convert a set I've defined into a list so I can use it for indexing.
seen = set()
for line in p:
    for word in line.split():
        if word not in seen and not word.isdigit():
            seen.add(word)
been = list(seen)
The set seems to contain items just fine. However the list is always empty when I monitor its value in the variable explorer (and when I later call the index function).
What am I doing wrong?
EDIT: This is the entire code. I'm trying to find the location of words in 'p' in 'o' and chart the number of its occurrences in a single line. It's a huge list of words so manually entering anything is out of the question.
p = open("p.txt", 'r')
o = open("o.txt", 'r')
t = open("t.txt", 'w')

lines = p.readlines()
vlines = o.readlines()

seen = set()
for line in p:
    for word in line.split():
        if word not in seen and not word.isdigit():
            seen.add(word)
been = list(seen)

for i in lines:
    thisline = i.split()
    thisline[:] = [word for word in thisline if not word.isdigit()]
    count = len(thisline)
    j = []
    j.append(count)
    for sword in thisline:
        num = thisline.count(sword)
        #index = 0
        #for m in vlines:
        #    if word is not m:
        #        index += 1
        ix = been.index(sword)
        j.append(' ' + str(ix) + ':' + str(num))
    j.append('\n')
    for item in j:
        t.write("%s" % item)
Output should be in the format '(total number of items in line) (index):(no. of occurrences)'.
I think I'm pretty close but this part is bugging me.
Your code is working just fine.
>>> p = '''
the 123 dogs
chased 567 cats
through 89 streets'''.splitlines()
>>> seen = set()
>>> for line in p:
        for word in line.split():
            if word not in seen and not word.isdigit():
                seen.add(word)

>>> been = list(seen)
>>>
>>> seen
set(['streets', 'chased', 'cats', 'through', 'the', 'dogs'])
>>> been
['streets', 'chased', 'cats', 'through', 'the', 'dogs']
Unless there's a reason why you want to read line by line you can simply replace this:
seen = set()
for line in p:
    for word in line.split():
        if word not in seen and not word.isdigit():
            seen.add(word)
been = list(seen)
with:
been = list(set([w for w in open('p.txt', 'r').read().split() if not w.isdigit()]))
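One caveat: that one-liner never closes the file handle. A with block keeps the same comprehension while closing the file properly (a sketch; the first two lines just create a stand-in p.txt so the example is self-contained):

```python
# Stand-in input file for the demo.
with open("p.txt", "w") as f:
    f.write("the 123 dogs\nchased 567 cats\n")

# Same deduplication, but the file is closed when the block exits.
with open("p.txt") as f:
    been = list({w for w in f.read().split() if not w.isdigit()})
```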