I'm trying to parse a txt file and put sentences in a list that fit my criteria.
The text file consists of several thousand lines and I'm looking for lines that start with a specific string, let's call this string 'start'.
The lines in this text file can belong together and are somehow separated with \n at random.
This means I have to look for any line that starts with 'start', put it in an empty string 'complete' and then continue scanning each line after that to see if it also starts with 'start'.
If not then I need to append it to 'complete' because then it is part of the entire sentence. If it does I need to append 'complete' to a list, create a new, empty 'complete' string and start appending to that one. This way I can loop through the entire text file without paying attention to the number of lines a sentence consists of.
My code thus far:
import sys, string
lines_1=[]
startswith = ('keys', 'values', 'files', 'folders', 'total')
completeline = ''
with open (sys.argv[1]) as f:
    data = f.read()
for line in data:
    if line.lower().startswith(startswith):
        completeline = line
    else:
        completeline += line
lines_1.append(completeline)
# check some stuff in output
for l in lines_1:
    print "______"
    print l
print len(lines_1)
However this puts the entire content in 1 item in the list, where I'd like everything to be separated.
Keep in mind that the lines composing one sentence can span one, two, 10 or 1000 lines, so it needs to spot the next startswith value, append the existing completeline to the list and then fill completeline up with the next sentence.
Much obliged!
Two issues:
Iterating over a string, not lines:
When you iterate over a string, the value yielded is a character, not a line. This means for line in data: is going character by character through the string. Split your input by newlines, returning a list, which you then iterate over. e.g. for line in data.split('\n'):
Overwriting the completeline inside the loop
You append a completed line at the end of the loop, but not when you start recording a new line inside the loop. Change the if in the loop to something like this:
if line.lower().startswith(startswith):
    if completeline:
        lines_1.append(completeline)
    completeline = line
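Putting both fixes together, a minimal sketch could look like the following. The function name group_records and the choice to join continuation lines with a single space are assumptions made for illustration; the original code concatenates the lines with no separator.

```python
def group_records(lines, prefixes):
    """Merge continuation lines into the record that precedes them.

    `prefixes` is a tuple of lowercase strings that mark the start of a record.
    group_records is a hypothetical helper name, not part of the original code.
    """
    records = []
    current = None  # None means no record has been started yet
    for raw in lines:
        line = raw.rstrip("\n")
        if line.lower().startswith(prefixes):
            # A new record begins, so store the one collected so far.
            if current is not None:
                records.append(current)
            current = line
        elif current is not None:
            # Continuation line: joined with a space (an assumption).
            current += " " + line
    # The last record is still pending when the loop ends.
    if current is not None:
        records.append(current)
    return records


sample = ["keys: 3\n", "values: 7\n", "spread over\n", "two lines\n", "total: 10\n"]
print(group_records(sample, ("keys", "values", "total")))
```

Because iterating over an open file yields lines, the file object can be passed directly as `lines`, which avoids reading the whole file into one string first.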
For a task like this
"I'm trying to parse a txt file and put sentences in a list that fit my criteria"
I usually prefer using a dictionary, for example:
from collections import defaultdict
def satisfiesCriteria(criteria, sentence):
    # case-insensitive prefix test; lower() must be called, not just referenced
    return sentence.lower().startswith(criteria)
seperatedItems = defaultdict(list)
# fileDataAsAList is a placeholder for your own list of sentences
for sentence in fileDataAsAList:
    if satisfiesCriteria("start", sentence):
        seperatedItems["start"].append(sentence)
Something like this should suffice. The code is just meant to give you an idea of what you might do. You can keep a list of criteria and loop over them, which adds the sentences matching each criterion to the dictionary, something like this:
mycriterias = ['start','begin','whatever']
for criteria in mycriterias:
    for sentence in fileDataAsAList:
        if satisfiesCriteria(criteria, sentence):
            seperatedItems[criteria].append(sentence)
mind the spellings :p
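For reference, here is a self-contained sketch of the same dictionary idea. The name group_by_prefix and the sample sentences are made up for illustration, and the loop is arranged so that each sentence is stored under the first matching criterion only, which is an assumption about the desired behaviour.

```python
from collections import defaultdict


def group_by_prefix(sentences, prefixes):
    """Group sentences under the first prefix they start with (case-insensitive)."""
    grouped = defaultdict(list)
    for sentence in sentences:
        lowered = sentence.lower()
        for prefix in prefixes:
            if lowered.startswith(prefix):
                grouped[prefix].append(sentence)
                break  # stop at the first matching prefix
    return grouped


result = group_by_prefix(
    ["Start here", "begin now", "something else", "start again"],
    ["start", "begin"],
)
print(dict(result))
```

Sentences that match none of the prefixes are simply left out of the dictionary.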
Related
load_datafile() takes a single string parameter representing the filename of a datafile.
This function must read the content of the file, convert all letters to their lowercase, and store
the result in a string, and finally return that string. I will refer to this string as data throughout
this specification, you may rename it. You must also handle all exceptions in case the datafile
is not available.
Sample output:
data = load_datafile('harry.txt')
print(data)
the hottest day of the summer so far was drawing to a close and a drowsy silence
lay over the large, square houses of privet drive.
load_wordfile() takes a single string argument representing the filename of a wordfile.
This function must read the content of the wordfile and store all words in a one-dimensional
list and return the list. Make sure that the words do not have any additional whitespace or newline character in them. You must also handle all exceptions in case the files are not
available.
Sample outputs:
pos_words = load_wordfile("positivewords.txt")
print(pos_words[2:9])
['abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed',
'acclamation']
neg_words = load_wordfile("negativewords.txt")
print(neg_words[10:19])
['aborts', 'abrade', 'abrasive', 'abrupt', 'abruptly', 'abscond', 'absence',
'absent-minded', 'absentee']
MY CODE BELOW
def load_datafile('harryPotter.txt'):
    data = ""
    with open('harryPotter.txt') as file:
        lines = file.readlines()
        temp = lines[-1].lower()
    return data
Your code has two main problems. The first one is that you are assigning an empty string to the variable data and returning it, so no matter what you do with the contents of the file you always return an empty string. The second one is that file.readlines() returns a list of strings, where each line in the file is an element on the list and you are only converting the last element lines[-1] to lowercase.
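To fix your code you should make sure that you store the contents of the file in the data variable, and you should apply the lower() function to every line of the file and not just the last one. Also note that a function parameter has to be a name such as file_name; a string literal like 'harryPotter.txt' in the def line is a syntax error. Something like this: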
def load_datafile(file_name):
    data = ''
    with open(file_name) as file:
        lines = file.readlines()
        for line in lines:
            # each line already ends with its own '\n', so don't add another
            data = data + line.lower()
    return data
The previous example is not the best way of doing this but it's very easy to understand what is happening and I think that is more important when you are starting. To make it more efficient you might want to change it to:
def load_datafile(file_name):
    with open(file_name) as file:
        # join with '' because readlines() keeps each line's trailing '\n'
        return ''.join(line.lower() for line in file.readlines())
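The specification also asks for exception handling and for a load_wordfile() function, neither of which appears in the examples above. A rough sketch of both follows. Returning an empty string or an empty list when the file cannot be read, and catching only OSError (which covers a missing or unreadable file), are assumptions; the assignment does not say what should happen on failure.

```python
def load_datafile(file_name):
    """Return the whole file as one lowercase string, or '' if it cannot be read."""
    try:
        with open(file_name) as file:
            return file.read().lower()
    except OSError as err:  # e.g. FileNotFoundError, PermissionError
        print("Could not read", file_name, "-", err)
        return ""


def load_wordfile(file_name):
    """Return the words of the file, one per line, stripped of whitespace."""
    try:
        with open(file_name) as file:
            # strip() removes the newline and stray spaces; blank lines are skipped
            return [line.strip() for line in file if line.strip()]
    except OSError as err:
        print("Could not read", file_name, "-", err)
        return []
```

With these definitions, data = load_datafile('harry.txt') and pos_words = load_wordfile("positivewords.txt") can be called as in the sample output from the assignment.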
I am using Python-3 and I am reading a text file which can have multiple paragraphs separated by '\n'. I want to split all those paragraphs into a separate list. There can be n number of paragraphs in the input file.
So this split and output list creation should happen dynamically thereby allowing me to view a particular paragraph by just entering the paragraph number as list[2] or list[3], etc....
So far I have tried the below process :
input = open("input.txt", "r") #Reading the input file
lines = input.readlines() #Creating a List with separate sentences
str = '' #Declaring a empty string
for i in range(len(lines)):
    if len(lines[i]) > 2: #If the length of a line is < 2, It means it can be a new paragraph
        str += lines[i]
This method will not store paragraphs into a new list (as I am not sure how to do it). It will just remove the line with '\n' and stores all the input lines into str variable. When I tried to display the contents of str, it is showing the output as words. But I need them as sentences.
And my code should store all the sentences until first occurence of '\n' into a separate list and so on.
Any ideas on this ?
UPDATE
I found a way to print all the lines that are present until '\n'. But when I try to store them into the list, it is getting stored as letters, not as whole sentences. Below is the code snippet for reference
input = open("input.txt", "r")
lines = input.readlines()
input_ = []
for i in range(len(lines)):
    if len(lines[i]) <= 2:
        for j in range(i):
            input_.append(lines[j]) #This line is storing as letters.
even "input_ += lines" is storing as letters, Not as sentences.
Any idea how to modify this code to get the desired output ?
Don't forget to call input.close() when you are done reading, otherwise the file handle stays open.
Alternatively you can use with.
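#Using "with" closes the file automatically, so you don't need to write file.close()
with open("input.txt","r") as file:
    file_ = file.read().split("\n")
file_ is now a list with each line as a separate item. That matches your paragraphs if every paragraph sits on a single line; if paragraphs are separated by a blank line, split on "\n\n" instead.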
It's as simple as 2 lines.
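If the paragraphs are separated by blank lines and each one spans several lines, a small helper along these lines may be closer to what the question asks for. The name split_paragraphs is made up for illustration, and collapsing each paragraph onto a single line is an assumption about the desired output.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs at blank lines and return them as a list."""
    # One or more blank (or whitespace-only) lines separate two paragraphs.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse the line breaks inside each paragraph into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


sample = "First paragraph,\nsecond line.\n\nSecond paragraph.\n\n\nThird one.\n"
paragraphs = split_paragraphs(sample)
print(paragraphs[1])  # Second paragraph.
```

To use it on a file, read the contents inside a with block and pass the resulting string to the function.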
So I have a text file that is structured like this:
Product ID List:
ABB:
578SH8
EFC025
EFC967
CNC:
HDJ834
HSLA87
...
...
This file continues on with many companies' names and Id's below them. I need to then get the ID's of the chosen company and append them to a list, where they will be used to search a website. Here is the current line I have to get the data:
PID = open('PID.txt').read().split()
This works great if there are only Product ID's of only 1 company in there and no text. This does not work for what I plan on doing however... How can I have the reader read from (an example) after where it says ABB: to before the next company? I was thinking maybe add some kind of thing in the file like ABB END to know where to cut to, but I still don't know how to cut out between lines in the first place... If you could let me know, that would be great!
Two consecutive newlines act as a delimiter, so just split there and construct a dictionary of the data:
data = {i.split()[0]: i.split()[1:] for i in open('PID.txt').read().split('\n\n')}
Since the file is structured like that you could follow these steps:
Split based on the two newline characters \n\n into a list
Split each element of that list on a single newline character \n
Drop the first element for a list containing the IDs for each company
Use the first element (mentioned above) as needed for the company name (make sure to remove the colon)
Also, take a look at regular expressions for parsing data like this.
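The steps above can be sketched as follows. This assumes, as both answers do, that a blank line separates the company blocks in the real file; the name parse_companies is hypothetical, and dropping blocks that have no IDs (such as the "Product ID List:" header) is a choice made here for illustration.

```python
def parse_companies(text):
    """Map each company name to the list of product IDs that follows it."""
    companies = {}
    for block in text.split("\n\n"):  # step 1: split on blank lines
        # step 2: split the block into non-empty lines
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        # skip blocks without IDs or without a "NAME:" heading
        if len(lines) < 2 or not lines[0].endswith(":"):
            continue
        name = lines[0].rstrip(":")   # step 4: company name without the colon
        companies[name] = lines[1:]   # step 3: everything after the name
    return companies


sample = "Product ID List:\n\nABB:\n578SH8\nEFC025\nEFC967\n\nCNC:\nHDJ834\nHSLA87\n"
print(parse_companies(sample)["ABB"])  # ['578SH8', 'EFC025', 'EFC967']
```

The IDs of the chosen company are then a plain dictionary lookup, for example parse_companies(open('PID.txt').read())['ABB'].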
with open('file.txt', 'r') as f: # open the file
    next(f) # skip the first line
    results = {} # initialize a dictionary
    for line in f: # iterate through the remainder of the file
        if ':' in line: # if the line contains a :
            current = line.strip() # strip the whitespace
            results[current] = [] # and add it as a dictionary entry
        elif line.strip(): # otherwise, and if content remains after stripping whitespace,
            results[current].append(line.strip()) # append this line to the relevant list
This should at least get you started, you will likely have better luck using dictionaries than lists, at least for the first part of your logic. By what method will you pass the codes along?
a = {}
f1 = open(r"C:\sotest.txt", 'r')  # raw string so the backslash is taken literally
current_key = ''
for row in f1:
    strrow = row.strip('\n')
    if strrow == "":
        pass
    elif ":" in strrow:
        current_key = strrow.strip(':')
        a[current_key] = []
    else:
        a[current_key].append(strrow)
f1.close()
for key in a:
    print(key)
    for item in a[key]:
        print(item)
I have a textfile that I wanna count the word "quack" in.
textfile named "quacker.txt" example:
This is the textfile quack.
Oh, and how quack did quack do in his exams back in 2009?\n Well, he passed with nine P grades and one B.\n He says that quack he wants to go to university in the\n future but decided to try and make a career on YouTube before that Quack....\n So, far, it’s going very quack well Quack!!!!
So here I want 7 as the output.
readf= open("quacker.txt", "r")
lst= []
for x in readf:
    lst.append(str(x).rstrip('\n'))
readf.close()
#above gives a list of each row.
cv=0
for i in lst:
    if "quack" in i.strip():
        cv+=1
above only works for one "quack" in the element of the list
Well if the file isn't too long, you could try:
with open('quacker.txt') as f:
    text = f.read().lower() # make it all lowercase so the count works below
quacks = text.count('quack')
As @PadraicCunningham mentioned in the comments, this would also count the 'quack' in
words like 'quacks' or 'quacking'. But if that's not an issue, then this is fine.
You're incrementing by one if the line contains the string, but what if the line has several occurrences of 'quack'?
Try checking each word instead (lowercased, so that 'Quack' is counted too):
for line in lst:
    for word in line.split():
        if 'quack' in word.lower():
            cv+=1
You need to lower, strip and split to get an accurate count:
from string import punctuation
with open("test.txt") as f:
    quacks = sum(word.lower().strip(punctuation) == "quack"
                 for line in f for word in line.split())
print(quacks)
7
You need to split each line in the file into individual words or you will get false positives using in or count. word.lower().strip(punctuation) lowers each word and removes any leading or trailing punctuation, and sum adds up all the times word.lower().strip(punctuation) == "quack" is True.
In your own code x is already a string, so calling str(x) is unnecessary. You could also just check each line the first time you iterate; there is no need to add the strings to a list and then iterate a second time. The reason you only get one returned is most likely that all the data is actually on a single line. You are also comparing quack to Quack, which will not match; you need to lower the string.
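An alternative to stripping punctuation by hand is a regular expression with word boundaries. This is a different technique from the one in the answer above, shown only as a sketch; count_word is a made-up helper name.

```python
import re


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # \b matches at a word boundary, so 'quacks' or 'quacking' do not count.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


sample = "This is quack. How quack did Quack do? quacks and quacking, very quack well Quack!!!!"
print(count_word(sample, "quack"))  # 5
```

For a file, pass the result of f.read() as `text`.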
I'm somewhat new to Python. I'm trying to sort through a list of strings and integers. The list contains some symbols that need to be filtered out (i.e. ro!ad should end up road). Also, they are all on one line separated by a space. So I need to use 2 arguments: one for the input file and one for the output file. It should be sorted with numbers first and then the words without the special characters, each on a different line. I've been looking at loads of list functions but am having some trouble putting this together as I've never had to do anything like this. Any takers?
So far I have the basic stuff
#!/usr/bin/python
import sys
try:
    infilename = sys.argv[1] #outfilename = sys.argv[2]
except:
    print "Usage: ",sys.argv[0], "infile outfile"; sys.exit(1)
ifile = open(infilename, 'r')
#ofile = open(outfilename, 'w')
data = ifile.readlines()
r = sorted(data, key=lambda item: (int(item.partition(' ')[0])
                                   if item[0].isdigit() else float('inf'), item))
ifile.close()
print '\n'.join(r)
#ofile.writelines(r)
#ofile.close()
The output shows exactly what was in the file but exactly as the file is written and not sorted at all. The goal is to take a file (arg1.txt) and sort it and make a new file (arg2.txt) which will be cmd line variables. I used print in this case to speed up the editing but need to have it write to a file. That's why the output file areas are commented but feel free to tell me I'm stupid if I screwed that up, too! Thanks for any help!
When you have an issue like this, it's usually a good idea to check your data at various points throughout the program to make sure it looks the way you want it to. The issue here seems to be in the way you're reading in the file.
data = ifile.readlines()
is going to read in the entire file as a list of lines. But since all the entries you want to sort are on one line, this list will only have one entry. When you try to sort the list, you're passing a list of length 1, which is going to just return the same list regardless of what your key function is. Try changing the line to
data = ifile.readlines()[0].split()
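You may not even need the key function any more, since strings that start with a digit sort before strings that start with a letter by default. Note, though, that a plain string sort compares character by character, so "10" ends up before "9"; keep a numeric key if the numbers must be in numeric order. I don't see anything in your code to remove special characters though.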
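A short illustration of the difference between the default string sort and a numeric sort. The token list is invented for the example, and the tuple key is just one possible way to put numbers first.

```python
tokens = "10 9 2 road apple".split()

# Default sort compares strings character by character.
as_text = sorted(tokens)
print(as_text)  # ['10', '2', '9', 'apple', 'road']

# Numbers first in numeric order, then words alphabetically.
numbers_first = sorted(
    tokens,
    key=lambda t: (0, int(t), "") if t.isdigit() else (1, 0, t),
)
print(numbers_first)  # ['2', '9', '10', 'apple', 'road']
```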
Since they are all on the same line, you don't really need readlines:
with open('some.txt') as f:
    data = f.read() #now data = "item 1 item2 etc..."
You can use re to filter out unwanted characters:
import re
data = "ro!ad"
fixed_data = re.sub("[!?#$]","",data)
partition may be overkill:
data = "hello 23frank sam wilbur"
my_list = data.split() # ["hello","23frank","sam","wilbur"]
print(sorted(my_list))
However, you will need to do more to force the numbers to sort numerically, maybe something like:
numbers = [x for x in my_list if x[0].isdigit()]
strings = [x for x in my_list if not x[0].isdigit()]
sorted_list = sorted(numbers,key=lambda x:int(re.sub("[^0-9]","",x))) + sorted(strings)
Also, they are all on one line separated by a space.
So your file contains a single line?
data = ifile.readlines()
This makes data into a list of the lines in your file. All 1 of them.
r = sorted(...)
This makes r the sorted version of that list.
To get the words from the line, you can .read() the entire file as a single string, and .split() it (by default, it splits on whitespace).
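Pulling the pieces from these answers together, an end-to-end sketch might look like this. The function names are invented, and treating every character other than ASCII letters and digits as a symbol to remove is an assumption; the question does not list the exact symbols.

```python
import re


def sort_tokens(text):
    """Clean the whitespace-separated tokens and return numbers first, then words."""
    # Remove everything except ASCII letters and digits (an assumption).
    cleaned = (re.sub(r"[^A-Za-z0-9]", "", tok) for tok in text.split())
    tokens = [tok for tok in cleaned if tok]
    numbers = sorted((t for t in tokens if t.isdigit()), key=int)
    words = sorted(t for t in tokens if not t.isdigit())
    return numbers + words


def sort_file(in_path, out_path):
    """Read one line of tokens from in_path and write them sorted, one per line."""
    with open(in_path) as src:
        ordered = sort_tokens(src.read())
    with open(out_path, "w") as dst:
        dst.write("\n".join(ordered) + "\n")


print(sort_tokens("ro!ad 10 app#le 9 2"))  # ['2', '9', '10', 'apple', 'road']
```

In a script, sort_file(sys.argv[1], sys.argv[2]) would connect this to the two command-line arguments, after importing sys and checking that both arguments were given.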