I have a function that scrapes speeches from millercenter.org and returns the processed speech. However, every one of my speeches has the word "transcript" at the beginning (that's just how it's coded into the HTML). So, all of my text files look like this:
\n <--- there's really just a newline here, not literally '\n'
transcript
fourscore and seven years ago, blah blah blah
I have these saved on my U:/ drive - how can I iterate through these files and remove 'transcript' from each?
Edit:
import glob

speech_dict = {}
for filename in glob.glob("U:/FALL 2015/ENGL 305/NLP Project/Speeches/*.txt"):
    with open(filename, 'r') as inputFile:
        filecontent = inputFile.read()
        filecontent.replace('transcript', '', 1)
        speech_dict[filename] = filecontent  # put the speeches into a dictionary to run through the algorithm
This is not doing anything to change my speeches. 'transcript' is still there.
I also tried putting it into my text-processing function, but that doesn't work, either:
import urllib2
from bs4 import BeautifulSoup

def processURL(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    item_str_processed = punctuation.sub(' ', item_str)  # `punctuation` is a compiled regex defined elsewhere
    item_str_processed_final = item_str_processed.replace('—', ' ').replace('transcript', '', 1)
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return filename, item_str_processed_final  # giving back the filename and the text itself
Here's an example url I run through processURL: http://millercenter.org/president/harding/speeches/speech-3805
You can use Python's str.replace() for this, but keep in mind that strings are immutable: replace() returns a new string and leaves the original untouched, so you have to assign the result back:

data = data.replace('transcript', '', 1)

This line replaces 'transcript' with '' (the empty string). The final parameter is the maximum number of replacements to make: 1 replaces only the first instance of 'transcript'; leave it out to replace all instances.
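Applied to the loop from the question, the fix is just that reassignment - a minimal sketch, assuming the same directory layout:

import glob

speech_dict = {}
for filename in glob.glob("U:/FALL 2015/ENGL 305/NLP Project/Speeches/*.txt"):
    with open(filename, 'r') as inputFile:
        filecontent = inputFile.read()
    # replace() returns a new string; the original is unchanged, so reassign it
    filecontent = filecontent.replace('transcript', '', 1)
    speech_dict[filename] = filecontent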
If you know that the data you want always starts on line x then do this:

with open('filename.txt', 'r') as fin:
    for _ in range(x):  # this loop will skip x lines
        next(fin)
    for line in fin:
        # do something with the line
        print(line)
Or let's say you want to remove any lines up to and including transcript (note that lines read from a file keep their trailing newline, so compare against the stripped line):

with open('filename.txt', 'r') as fin:
    for line in fin:  # this loop skips lines until it has consumed the *transcript* line
        if line.strip() == 'transcript':
            break
    # if you want to skip the empty line after *transcript*
    next(fin)  # skips the next line
    for line in fin:
        # do something with the line
        print(line)
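Putting the pieces together for your case - a minimal sketch (assuming the folder layout from the question) that drops everything up to and including the transcript line and writes the rest back to each file:

import glob

for filename in glob.glob("U:/FALL 2015/ENGL 305/NLP Project/Speeches/*.txt"):
    with open(filename, 'r') as fin:
        lines = fin.readlines()
    # drop everything up to and including the 'transcript' line
    for i, line in enumerate(lines):
        if line.strip() == 'transcript':
            lines = lines[i + 1:]
            break
    with open(filename, 'w') as fout:
        fout.writelines(lines)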
My starting file was a .txt one that looked like this:
https://www.website.com/something1/id=39494 notes !!!! other notes
https://www.website2.com/something1/id=596774 ... notes2 !! other notes2
and so on.. so, very messy.
to clean it up I did:
import re

with open('file.txt', 'r') as filehandle:
    places = [current_place.rstrip() for current_place in filehandle.readlines()]
filtered = [x for x in places if x.strip()]
This gave me a list of websites (without blank lines in between), but still with the notes in the same strings.
My goal is to first get a list of "cleaned" websites without any notes afterwards:
https://www.website.com/something1/id=39494
https://www.website2.com/something1/id=596774
For that I thought to target the space after the end of the website and get rid of all the words afterwards:

for s in filtered:
    f = re.search('\s')
This returns an error, but even if it worked it wouldn't return what I thought.
The second step is to strip the website of some characters and compose it like: https://www.website.com/embed/id=39494
but this would come later.
I just wonder how I can achieve the first step - getting rid of the notes after the websites - and end up with a clean list.
If each line consists of a URL followed by a space and any other text, you can simply split by the space and take the first element of each line:
urls = []
with open('file.txt') as filehandle:
    for line in filehandle:
        if not line.strip():
            continue  # skip empty lines
        urls.append(line.split(" ")[0])

# now the variable `urls` should contain all the URLs you are looking for
EDIT: second step
for url in urls:
    print('<iframe src="{}"></iframe>'.format(url))
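If you instead want the rewritten URL itself (the https://www.website.com/embed/id=39494 form mentioned in the question), here is a rough sketch, assuming every URL has the shape https://www.<site>.com/<segment>/id=<number>:

for url in urls:
    host, sep, rest = url.partition('.com/')  # split once, right after the domain
    embed_url = host + sep + 'embed/' + rest.split('/')[-1]
    print(embed_url)  # e.g. https://www.website.com/embed/id=39494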
You can use this:
# read the lines
with open('file.txt', 'r') as f:
    strlist = f.readlines()

# list to store the URLs
webs = []
for x in strlist:
    webs.append(x.split(' ')[0])
print(webs)
If the URL is not always at the beginning of the line, you can try the pattern

https?://www\.\w+\.com/\w+/id=(\d+)

with re.search (re.match only matches at the start of the string); group 0 then gives you the whole URL and the capture group gives you the id.

Code example

import re

with open('file.txt') as file:
    for line in file:
        m = re.search(r'https?://www\.\w+\.com/\w+/id=(\d+)', line)
        if m:
            print("URL=%s" % m.group(0))
            print("ID=%d" % int(m.group(1)))
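For the two sample lines above, this prints:

URL=https://www.website.com/something1/id=39494
ID=39494
URL=https://www.website2.com/something1/id=596774
ID=596774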
I have the following problem. I am supposed to open a CSV file (it's an Excel table) and read it without using any library.
I have tried a lot already, and I now have the first row as a tuple inside a list - but only the first line, the header, and no other rows.
This is what I have so far.
with open(path, 'r+') as file:
    results=[]
    text = file.readline()
    while text != '':
        for line in text.split('\n'):
            a=line.split(',')
            b=tuple(a)
            results.append(b)
            return results
The output should be every line as a tuple, and all the tuples in a list.
My question is: how can I read the other lines in Python?
I am really sorry, I am new to programming altogether, so I have a really hard time finding my mistake.
Thank you very much in advance for helping me out!
This problem has appeared many times on Stack Overflow, so you should be able to find working code. But it is much better to use the module csv for this.

You have wrong indentation: you use return results right after reading the first line, so it exits the function and never tries to read the other lines.

But even after changing this there are still other problems, so it still will not read the next lines.

You use readline(), so you read only the first line, and your loop works all the time on the same line - and it may never end, because you never set text = ''.

You should use read() to get all the text, which you later split into lines using split("\n"), or you could use readlines() to get all the lines as a list (and then you don't need split()). Or you can use for line in file:. In all of these situations you don't need while.
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        text = file.read()
        for line in text.split('\n'):
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        lines = file.readlines()
        for line in lines:
            line = line.rstrip('\n')  # remove `\n` at the end of line
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        for line in file:
            line = line.rstrip('\n')  # remove `\n` at the end of line
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
None of these versions will work correctly if you have '\n' or , inside an item that shouldn't be treated as the end of a row or as a separator between items. Such items are wrapped in " ", which also makes it harder to strip them. All of these problems can be resolved with the standard module csv.
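A minimal sketch with csv (assuming a plain comma-separated file, as above) that returns the same list of tuples while handling quoted fields correctly:

import csv

def read_csv(path):
    with open(path, newline='') as file:
        reader = csv.reader(file)  # handles ',' and '\n' inside quoted fields
        return [tuple(row) for row in reader]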
Your code is pretty good and you are near the goal:

with open(path, 'r+') as file:
    results=[]
    text = file.read()
    #while text != '':
    for line in text.split('\n'):
        a=line.split(',')
        b=tuple(a)
        results.append(b)
    return results
Your code:

with open(path, 'r+') as file:
    results=[]
    text = file.readline()
    while text != '':
        for line in text.split('\n'):
            a=line.split(',')
            b=tuple(a)
            results.append(b)
            return results
So enjoy learning :)
One caveat is that the csv may not end with a blank line, as this would result in an ugly tuple at the end of the list like ('',) (which looks like a smiley).
To prevent this you have to check for empty lines: an if line != '': after the for will do the trick, as sketched below.
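For example, in the read()-based version:

for line in text.split('\n'):
    if line != '':  # skip the empty string a trailing newline produces
        results.append(tuple(line.split(',')))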
I am trying to extract data from a .txt file in Python. My goal is to capture the last occurrence of a certain word and show the next line, so I reverse the text and read from behind. In this case I search for the word 'MEC' and show the next line, but I capture every occurrence of the word, not just the first one found (i.e. the last one in the file).
Any idea what I need to do?
Thanks!
This is what my code looks like:
import re
from file_read_backwards import FileReadBackwards

with FileReadBackwards("camdex.txt", encoding="utf-8") as file:
    for l in file:
        lines = l
        while line:
            if re.match('MEC', line):
                x = (file.readline())
                x2 = (x.strip('\n'))
                print(x2)
                break
            line = file.readline()
The txt file contains this:
MEC
29/35
MEC
28,29/35
And my code prints this output:
28,29/35
29/35
And my objective is to print only this:
28,29/35
This will give you the result as well: loop through the lines, appending the line after each match to a list, then print the list's last element.
import re

with open("data\camdex.txt", encoding="utf-8") as file:
    result = []
    for line in file:
        if re.match('MEC', line):
            x = file.readline()
            result.append(x.strip('\n'))
    print(result[-1])
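For the sample file above, result ends up as ['29/35', '28,29/35'], so this prints 28,29/35.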
Get rid of the extra imports and overhead. Read your file normally, remembering the line that follows the most recent "MEC":

with open("camdex.txt", encoding="utf-8") as file:
    take_next = False
    for line in file:
        if take_next:
            last = line
        take_next = line.startswith("MEC")

print(last.rstrip('\n'))  # for the sample file this prints 28,29/35
If the file is very large, then reading backwards makes sense -- seeking to the end and backing up will be faster than reading to the end.
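A corrected backwards version as a sketch, assuming the file_read_backwards package from the question (it yields lines last-to-first, without their trailing newlines). Reading backwards, the line that follows "MEC" in the file is the one yielded just before it, so remember the previous line:

from file_read_backwards import FileReadBackwards

with FileReadBackwards("camdex.txt", encoding="utf-8") as frb:
    following = None  # the line after the current one, in normal file order
    for line in frb:
        if line.startswith("MEC") and following is not None:
            print(following)  # prints 28,29/35 for the sample file
            break
        following = line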
I have a function that loops through a file that looks like this:
"#" XDI/1.0 XDAC/1.4 Athena/0.9.25
"#" Column.4: pre_edge
Content
That is to say that after the "#" there is a comment. My function aims to read each line and, if it starts with a specific word, select what comes after the ":".

For example, given these two lines, I would like to read through them, and if a line starts with "#" and contains the word "Column.4", the word "pre_edge" should be stored.
An example of my current approach follows:
with open(file, "r") as f:
    for line in f:
        if line.startswith('#'):
            word = line.split(" Column.4:")[1]
        else:
            print("n")
I think my trouble is specifically this: after finding a line that starts with "#", how can I parse/search through it and save its content if it contains the desired word?
In case the # comment contains the string Column.4: as stated above, you could parse it this way:

with open(filepath) as f:
    for line in f:
        if line.startswith('#'):
            # here you process comment lines
            if 'Column.4' in line:
                first, remainder = line.split('Column.4: ')
                # remainder contains everything after '# Column.4: ',
                # so to get the first word ->
                word = remainder.split()[0]
        else:
            # here you can process lines that are not comments
            pass
Note

It is also good practice to use the for line in f: statement instead of f.readlines() (as mentioned in other answers), because this way you don't load all the lines into memory but process them one by one.
You should start by reading the file into a list and then work through that instead:
file = 'test.txt'  # <- call file whatever you want

with open(file, "r") as f:
    txt = f.readlines()

for line in txt:
    if line.startswith('"#"'):
        word = line.split(" Column.4: ")
        try:
            print(word[1])
        except IndexError:
            print(word)
    else:
        print("n")
Output:
>>> ['"#" XDI/1.0 XDAC/1.4 Athena/0.9.25\n']
>>> pre_edge
A try/except is used because the first line also starts with "#" and can't be split with the current logic.
Also, as a side note, in the question the lines start with "#" including the quotation marks, so the startswith() check was altered accordingly.
with open('stuff.txt', 'r+') as f:
    data = f.readlines()

for line in data:
    words = line.split()
    if words and ('#' in words[0]) and ("Column.4:" in words):
        print(words[-1])
        # pre_edge
I have a file with thousands of lines of data in it.
I am reading the file in and editing it.
The following tag is about ~900 lines in or more (it varies per file):
<Report name="test" xmlns:cm="http://www.example.org/cm">
I need to remove that line and everything before it in several files, so I need the code to search for that tag and delete it and everything above it.
It will not always be 900 lines down - it will vary - but the tag will always be the same.
I already have the code to read in the lines and write to a file. I just need the logic for finding that line and removing it and everything before it.
I tried reading the file in line by line and then writing to a new file once it hits on that string, but the logic is incorrect:
readFile = open(firstFile)
lines = readFile.readlines()
readFile.close()

w = open('test', 'w')
for item in lines:
    if (item == '<Report name="test" xmlns:cm="http://www.example.org/cm">'):
        w.writelines(item)
w.close()
In addition, the exact string will not be the same in each file; the value "test" will be different. Perhaps I need to check only for the tag prefix <Report name.
You can use a flag like tag_found to check when lines should be written to the output. You initially set the flag to False, and then change it to True once you've found the right tag. When the flag is True, you copy the line to the output file.
TAG = '<Report name="test" xmlns:cm="http://www.example.org/cm">'

tag_found = False
with open('tag_input.txt') as in_file:
    with open('tag_output.txt', 'w') as out_file:
        for line in in_file:
            if not tag_found:
                if line.strip() == TAG:
                    tag_found = True
            else:
                out_file.write(line)
PS: The with open(filename) as in_file: syntax uses what Python calls a "context manager". The short explanation is that it automatically takes care of closing the file safely for you when the with: block is finished, so you don't have to remember to put in my_file.close() statements.
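As a rough sketch (simplified - the real protocol uses the __enter__/__exit__ methods), the with block above behaves like this try/finally:

in_file = open('tag_input.txt')
try:
    for line in in_file:
        pass  # process the line here
finally:
    in_file.close()  # runs even if an exception was raised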
You can use a regular expression to match your line:

regex1 = '^<Report name=.*xmlns:cm="http://www.example.org/cm">$'

Get the index of the item that matches the regex:

listIndex = [i for i, item in enumerate(lines) if re.search(regex1, item)]

Slice the list, starting just after the last matching tag line:

listLines = lines[listIndex[-1] + 1:]

And write to a file (lines from readlines() keep their newlines, so join with the empty string):

with open("filename.txt", "w") as fileOutput:
    fileOutput.write("".join(listLines))
The above is pseudocode; in full, try something like this:
import re

regex1 = '^<Report name=.*xmlns:cm="http://www.example.org/cm">$'  # variable #name
regex2 = '^<Report name=.*xmlns:cm=.*>$'                           # variable #name & #xmlns:cm

with open(firstFile, "r") as fileInput:
    listLines = fileInput.readlines()

listIndex = [i for i, item in enumerate(listLines) if re.search(regex1, item)]
# listIndex = [i for i, item in enumerate(listLines) if re.search(regex2, item)]  # uncomment for variable #name & #xmlns:cm

with open("out_" + firstFile, "w") as fileOutput:
    fileOutput.write("".join(listLines[listIndex[-1] + 1:]))  # keep only what follows the last matching tag