Using Python 2.7, I am trying to scrape the title from a page, but cut it off before the closing title tag if I find one of these characters: .-_<| (as I'm just trying to get the name of the company/website). I have some code working, but I'm sure there must be a simpler way. I'm open to suggestions for libraries (Beautiful Soup, Scrapy, etc.), but I would be happiest doing it without one, as I'm slowly learning my way around Python right now. You can see my code searches for each of the characters individually rather than for all of them at once. I was hoping there was a find(x or x) function, but I could not find one. Later I will also be doing the same thing but looking for any digit in the 0-9 range.
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # a list of tuples, not a set

def findTitle(webaddress):
    ourUrl = opener.open(webaddress).read()
    ourUrlLower = ourUrl.lower()
    positionStart = ourUrlLower.find("<title>")
    if positionStart == -1:
        return "Insert Title Here"
    endTitleSignals = ['.', ',', '-', '_', '#', '+', ':', '|', '<']
    positionEnd = positionStart + 50
    for e in endTitleSignals:
        positionHolder = ourUrlLower.find(e, positionStart + 1)
        if positionHolder < positionEnd and positionHolder != -1:
            positionEnd = positionHolder
    return ourUrl[positionStart + 7:positionEnd]

print findTitle('http://www.com')
The regular expression library (re) could help, but if you'd like to learn more about general Python rather than specialized libraries, you could do it with sets, which are something you'll want to know about.
string = "garbage1and2recycling"
charlist = ['1', '2']
charset = set(charlist)  # the built-in set type; the old sets module is deprecated

index = 0
for index in range(len(string)):
    if string[index] in charset:
        break

print(index)  # 7
Note that you could do the above using just charlist instead of charset, but that would take longer to run.
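If you do reach for re later, the same first-match search collapses into a single call. A minimal sketch, using a character class built from the symbols in the question (the sample string is just an illustration):

```python
import re

string = "garbage1and2recycling"

# search for the first character that is either a digit or one of the
# title-terminating symbols from the question ('-' is escaped inside the class)
match = re.search(r'[0-9.,\-_#+:|<]', string)
if match:
    print(match.start())  # index of the first matching character: 7
```

This also answers the "find(x or x)" wish: a character class matches any one of its members, so one search covers every terminator at once.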
Related
I'm trying to use list indices as arguments for a function that performs regex searches and substitutions over some text files. The different search patterns have been assigned to variables and I've put the variables in a list that I want to feed the function as it loops through a given text.
When I call the function using a list index as an argument nothing happens (the program runs, but no substitutions are made in my text files), however, I know the rest of the code is working because if I call the function with any of the search variables individually it behaves as expected.
When I give the print function the same list index as I'm trying to use to call my function it prints exactly what I'm trying to give as my function argument, so I'm stumped!
search1 = re.compile(r'pattern1')
search2 = re.compile(r'pattern2')
search3 = re.compile(r'pattern3')

searches = ['search1', 'search2', 'search2']

i = 0
for …
    …
    def fun(find):
        …
    fun(searches[i])
    if i <= 2:
        i += 1
    …
As mentioned, if I use fun(search1) the script edits my text files as wished. Likewise, if I add the line print(searches[i]) it prints search1 (etc.), which is what I'm trying to give as an argument to fun.
Being new to Python and programming, I have a limited investigative skill set, but after poking around as best I could and subsequently running print(searches.index(search1)) and getting a pattern1 is not in list error, my leading (and only) theory is that I'm giving my function the actual regex expression rather than the variable it's stored in.
Much thanks for any forthcoming help!
Try changing your searches list to [search1, search2, search3] instead of ['search1', 'search2', 'search2'] (the latter contains plain strings, not the compiled regex objects).
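To illustrate the difference, a small sketch (the pattern strings are placeholders):

```python
import re

search1 = re.compile(r'pattern1')
search2 = re.compile(r'pattern2')
search3 = re.compile(r'pattern3')

# a list of strings: indexing yields the name 'search1', which is useless to re
names = ['search1', 'search2', 'search3']

# a list of compiled pattern objects: indexing yields the pattern itself
searches = [search1, search2, search3]

print(type(names[0]))                              # <class 'str'>
print(searches[0].search('xxpattern1xx').group())  # pattern1
```

The variable names never matter to the list; it stores references to whatever objects you put in it, so indexing hands your function the compiled pattern directly.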
Thanks to all for the help. eyl327's comment that I should use a list or dictionary to store my regular expressions pointed me in the right direction.
However, because I was using regex in my search patterns, I couldn't get it to work until I also created a list of compiled expressions (discovered via this thread on stored regex strings).
Very appreciative of juanpa.arrivillaga's point that I should have provided an MRE (please forgive me; with a highly limited skill set, this in itself can be hard to do). I'll just give an excerpt of a slightly amended version of my actual code demonstrating the answer (once again, please forgive its long-windedness; I'm not presently able to do anything more elegant):
…
# put regex search patterns in a list
rawExps = ['search pattern 1', 'search pattern 2', 'search pattern 3']

# create a new list of compiled search patterns
compiledExps = [regex.compile(expression, regex.V1) for expression in rawExps]

i = 0
storID = 0
newText = ""

for file in filepathList:
    for expression in compiledExps:
        with open(file, 'r') as text:
            thisText = text.read()
            lines = thisText.splitlines()
        setStorID = regex.search(compiledExps[i], thisText)
        if setStorID is not None:
            storID = int(setStorID.group())
        for line in lines:
            def idSub(find):
                global storID
                global newText
                match = regex.search(find, line)
                if match is not None:
                    newLine = regex.sub(find, str(storID), line) + "\n"
                    newText = newText + newLine
                    storID = plus1(int(storID), 1)
                else:
                    newLine = line + "\n"
                    newText = newText + newLine
            # a list index can be used as an argument in the function call
            idSub(compiledExps[i])
        if i <= 2:
            i += 1
        write()
        newText = ""
    i = 0
I have been trying to find and replace across XML text runs using the Python LXML library. My current attempt:
v = ''
position = -1
positions = []

for cur in root.iter('t'):
    v = v + cur.text
    position = position + len(cur.text)
    positions.append(position)

v = v.replace("test string", "this is a test string")

i = 0
start = 0
end = positions[0]
for cur in root.iter('t'):
    try:
        cur.text = v[start:end]
        print cur.text
        start = end
        i = i + 1
        end = positions[i]
    except IndexError:
        break
Simplified Example XML:
<t>this is a test</t>
<w>asdfasdfsadf</w>
<t>test</t>
<w>ajsdfkladkjsf</w>
<t> string</t>
This approach puts the contents of the tags into a string and keeps track of the indices where the splits occur. It then does the replacements on the string and writes the string back into the XML at roughly the same positions. Unfortunately, once I open the file in Word there are random spaces missing and added elsewhere. I know that this method will probably not work, but what is the best way to go about solving this problem with the lxml parser?
EDIT: I do not want to read the file in as a string and edit it, I want to edit the file and keep all the formatting.
I'm using Python to search a large text file for a certain string, below the string is the data that I am interested in performing data analysis on.
def my_function(filename, variable2, variable3, variable4):
    array1 = []
    with open(filename) as a:
        special_string = str('info %d info =*' % variable3)
        for line in a:
            if special_string == array1:
                array1 = [next(a) for i in range(9)]
                line = next(a)
                break
            elif special_string != c:
                c = line.strip()
In the special_string variable, whatever comes after info = can vary, so I am trying to put a wildcard operator as seen above. The only way I can get the function to run though is if I put in the exact string I want to search for, including everything after the equals sign as follows:
special_string = str('info %d info = more_stuff' %variable3)
How can I assign a wildcard operator to the rest of the string to make my function more robust?
If your special string always occurs at the start of a line, then you can use the below check (where special_string does not have the * at the end):
line.startswith(special_string)
Otherwise, please do look at the module re in the standard library for working with regular expressions.
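A minimal sketch of both approaches; the sample line and the prefix are made up for illustration:

```python
import re

line = "info 100000 info = more_stuff and other data"
special_string = "info 100000 info ="

# simple prefix check: no wildcard needed if the string starts the line
print(line.startswith(special_string))  # True

# regex equivalent that also captures whatever follows the '='
match = re.match(r'info\s+\d+\s+info\s*=\s*(.*)', line)
if match:
    print(match.group(1))  # more_stuff and other data
```

startswith is enough to decide whether you have reached the header line; the regex buys you the extra ability to pull out the varying part after the equals sign in the same step.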
Have you thought about using something like this?
Based on your input, I'm assuming the following:
variable3 = 100000
special_string = str('info %d info = more_stuff' % variable3)

import re

pattern = re.compile(r'(info\s*\d+\s*info\s*=\s*)(.*)')
output = pattern.findall(special_string)
print(output[0][1])
Which would return:
more_stuff
In Python 3.4 I am trying to make a web crawler to check if a certain file is on a website. The problem is that the files can start with approximately 30 different names. (Some have 2 letters only, some have 3). I think my problem is similar to this (Wildcard or * for matching a datetime python 2.7) but it does not seem to work in Python 3.4.
My basic code is like this;
url_test = 'http://www.example.com/' + 'AAA' + '_file.pdf'
What do I need to do to search from a prespecified list of values that should go where AAA is. They can be either 2 or 3 alphanumeric characters. A wildcard operation will also work for me.
Thanks!
On the off chance that I understand the problem correctly, then this should do it:
for item in aaa_list:
    print('http://www.example.com/' + item + '_file.pdf')
or, if you want to have a list of all possible values you can save that too:
urls = ['http://www.example.com/' + item + '_file.pdf' for item in aaa_list]
from itertools import product
import string

for num_letters in [2, 3]:
    for chars in product(string.ascii_letters, repeat=num_letters):
        prefix = "".join(chars)
        url = "http://www.example.com/{}_file.pdf".format(prefix)
        # now look for the url
Just some questions regarding Python 3.
def AllMusic():
    myList1 = ["bob"]
    myList2 = ["dylan"]
    x = myList1[0]
    z = myList2[0]
    y = "-->Then 10 Numbers?"
    print("AllMusic")
    print("http://www.allmusic.com/artist/" + x + "-" + z + "-mn" + y)
This is my code so far.
I want to write a program that prints out the variable y.
When you go to AllMusic.com. The different artists have unique 10 numbers.
For example, www.allmusic.com/artist/the-beatles-mn0000754032, www.allmusic.com/artist/arcade-fire-mn0000185591.
x is the first word of the artist's name and z is the second. Everything works, but I can't figure out a way to find that 10 digit number and return it for each artist I input into my Python program.
I figured out that when you go to google and type for example "Arcade Fire AllMusic", in the first result just under the heading it gives you the url of the site. www.allmusic.com/artist/arcade-fire-mn0000185591
How can I copy that 10 digit code, 0000185591, into my python program and print it out for me to see.
I wouldn't use Google at all; you can use the search on the site itself. There are many useful tools to help you do web scraping in Python: I'd recommend installing BeautifulSoup. Here's a small script you can experiment with:
import urllib.request
from bs4 import BeautifulSoup

def get_artist_link(artist):
    base = 'http://www.allmusic.com/search/all/'
    # encode spaces
    query = artist.replace(' ', '%20')
    url = base + query
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page.read(), 'html.parser')
    artists = soup.find_all("li", class_="artist")
    for artist in artists:
        print(artist.a.attrs['href'])

if __name__ == '__main__':
    get_artist_link('the beatles')
    get_artist_link('arcade fire')
For me this prints out:
/artist/the-beatles-mn0000754032
/artist/arcade-fire-mn0000185591
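From hrefs shaped like those, the 10 digit code itself can be pulled out with a small regex; a sketch (the sample hrefs are the two printed above):

```python
import re

hrefs = [
    '/artist/the-beatles-mn0000754032',
    '/artist/arcade-fire-mn0000185591',
]

# the code is the 10 digits after the trailing 'mn'
for href in hrefs:
    match = re.search(r'mn(\d{10})$', href)
    if match:
        print(match.group(1))  # 0000754032, then 0000185591
```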