In Python 3.4 I am trying to make a web crawler to check whether a certain file is on a website. The problem is that the files can start with approximately 30 different names (some have only 2 letters, some have 3). I think my problem is similar to this (Wildcard or * for matching a datetime python 2.7), but it does not seem to work in Python 3.4.
My basic code is like this:
url_test = 'http://www.example.com/' + 'AAA' + '_file.pdf'
What do I need to do to search from a prespecified list of values that should go where AAA is? They can be either 2 or 3 alphanumeric characters. A wildcard operation would also work for me.
Thanks!
On the off chance that I understand the problem correctly, this should do it:
for item in aaa_list:
    print('http://www.example.com/' + item + '_file.pdf')
or, if you want a list of all the possible URLs, you can save that too:
urls = ['http://www.example.com/' + item + '_file.pdf' for item in aaa_list]
If you would rather brute-force every possible 2- or 3-letter prefix instead of using a known list:
from itertools import product
import string

for num_letters in [2, 3]:
    for chars in product(string.ascii_letters, repeat=num_letters):
        prefix = "".join(chars)
        url = "http://www.example.com/{}_file.pdf".format(prefix)
        # now look for the url
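For the "now look for the url" step on Python 3.4, here is a minimal sketch using a HEAD request via urllib.request; the prefix list and the example.com URL are placeholders, not your real data:
import urllib.request
import urllib.error

def file_exists(url):
    # Send a HEAD request; an HTTP error (e.g. 404) means the file is absent.
    request = urllib.request.Request(url, method='HEAD')
    try:
        urllib.request.urlopen(request)
        return True
    except urllib.error.HTTPError:
        return False

aaa_list = ['AAA', 'AB', 'XYZ']  # hypothetical prefixes
for item in aaa_list:
    url = 'http://www.example.com/' + item + '_file.pdf'
    if file_exists(url):
        print(url)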
I have this URL string:
Hdf5File= '/home/Windows-Share/SCS931000126/20170101.h5'
I want to get two desired outputs from this string:
1- 'SCS931000126'
2- '20170101'
To extract them, I wrote this regular expression:
import re
print(re.split(r'/', (re.split(r'[a-f]',Hdf5File)[4]))[1])
print(re.split(r'\.', (re.split(r'/', (re.split(r'[a-f]',Hdf5File)[4]))[2]))[0])
This gives me the desired output (if there is a better way to extract these outputs, please let me know).
But this part of the URL, /home/Windows-Share/, might change. Is there any way to get my desired outputs, which are always at the end of the string, regardless of the part of the URL that might change?
For example, if I have:
Hdf5File='/home/dal/windows-Share/SCS931000126/20170101.h5'
then I can't re-use my regex. Is there any way to do this in a more reusable way?
Do you need re.split? You can just as well use str.split for this one:
In [294]: x, y = Hdf5File.split('/')[-2:]
In [296]: x, y.split('.')[0]
Out[296]: ('SCS931000126', '20170101')
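If you do want a regex that stays reusable, one option is to anchor it at the end of the string so the leading directories no longer matter. A sketch; it assumes the filename always ends in .h5:
import re

Hdf5File = '/home/dal/windows-Share/SCS931000126/20170101.h5'
m = re.search(r'/([^/]+)/([^/.]+)\.h5$', Hdf5File)
if m:
    print(m.group(1), m.group(2))  # SCS931000126 20170101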
While a simple split will work as cᴏʟᴅsᴘᴇᴇᴅ already demonstrated, you can also use os.path to get parts of your url:
import os
Hdf5File= '/home/Windows-Share/SCS931000126/20170101.h5'
f = os.path.basename(Hdf5File)
d = os.path.basename(os.path.dirname(Hdf5File))
print( d, f ) # SCS931000126 20170101.h5
# and to remove the file extension:
f = os.path.splitext(f)[0]
print(f) # 20170101
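Since you appear to be on Python 3 (judging by the print calls), pathlib offers the same result as well; a sketch, using PurePosixPath so the parsing does not depend on the OS you run on:
from pathlib import PurePosixPath

p = PurePosixPath('/home/Windows-Share/SCS931000126/20170101.h5')
print(p.parent.name)  # SCS931000126
print(p.stem)         # 20170101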
In the code below, I am trying to use len(list) to count the number of strings in an array in each of the tags variables from the while loop. When I tried it with a sample list at the bottom, list2, it printed 5, which works; but when I did it with my real data, it was counting the characters in the array, not the number of strings. I need help figuring out why that is. I am new to Python, so the simplest way possible, please!
#!/usr/bin/python
import json
import csv
from pprint import pprint
with open('data.json') as data_file:
    data = json.load(data_file)
#pprint(data)

# calc number of alert records in json file
x = len(data['alerts'])
count = 0
while (count < x):
    tags = str(data['alerts'][count]['tags']).replace("u\"","\"").replace("u\'","\'")
    list = "[" + tags.strip('[]') + "]"
    print list
    print len(list)
    count = count + 1

list2 = ['redi', 'asd', 'rrr', 'www', 'qqq']
print len(list2)
Your list construction list = "[" + tags.strip('[]') + "]" creates a string, not a list. So yes, len works: it counts the characters in your string.
Your tags construction also looks a bit off: you have a dictionary of data (data['alerts']) which you then convert to a string and strip off its '[]'. Why not just get the value itself?
Also, list is a horrible name for your variable; it clashes with the built-in type.
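A minimal sketch of that idea, assuming each alert's 'tags' value is already a list in the JSON (I can't see your data.json, so that structure is an assumption):
import json

with open('data.json') as data_file:
    data = json.load(data_file)

for alert in data['alerts']:
    tags = alert['tags']  # assumed to already be a list; no string conversion
    print tags
    print len(tags)       # counts the tags themselves, not characters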
list = "[" + tags.strip('[]') + "]"
print list
print len(list)
Ironically, list is a string, not a list. That's why calling len on it "was counting the characters in the array"
You need to make sure that your variable is a list rather than a str.
Try:
print(type(yourList))
If it shows that it is a str, then try this:
len(list(yourList))
Hope this answers your question.
And when you want to build a list variable, try this:
myList = []
for blah in blahblah:
    myList.append(blah)
I hope this solves your problem.
Using Python 2.7, I am trying to scrape the title from a page, but cut it off before the closing title tag if I find one of these characters: .-_<| (I'm just trying to get the name of the company/website). I have some code working, but I'm sure there must be a simpler way. I'm open to suggestions for libraries (BeautifulSoup, Scrapy, etc.), but I would be happiest doing it without one, as I am slowly learning my way around Python right now. You can see my code searches individually for each of the characters rather than all at once; I was hoping there was a find(x or x) function, but I could not find one. Later I will also be doing the same thing but looking for any numbers within the 0-9 range.
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def findTitle(webaddress):
    url = webaddress
    ourUrl = opener.open(url).read()
    ourUrlLower = ourUrl.lower()
    x = 0
    positionStart = ourUrlLower.find("<title>", x)
    if positionStart == -1:
        return "Insert Title Here"
    endTitleSignals = ['.', ',', '-', '_', '#', '+', ':', '|', '<']
    positionEnd = positionStart + 50
    for e in endTitleSignals:
        positionHolder = ourUrlLower.find(e, positionStart + 1)
        if positionHolder < positionEnd and positionHolder != -1:
            positionEnd = positionHolder
    return ourUrl[positionStart + 7:positionEnd]

print findTitle('http://www.com')
The regular expression library (re) could help, but if you'd like to learn more general Python instead of specialized libraries, you could do it with sets, which are something you'll want to know about.
string = "garbage1and2recycling"
charlist = ['1', '2']
charset = set(charlist)  # the built-in set type; the old 'sets' module is deprecated

index = 0
for index in range(len(string)):
    if string[index] in charset:
        break

print(index)  # 7
Note that you could do the above using just charlist instead of charset, but membership tests on a list take longer to run.
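And since you mention digits later: a sketch of the re approach for the title question, where a character class plays the role of the find(x or x) you were looking for (the title string here is made up):
import re

title = "Example Company - About Us"
match = re.search(r"[.,\-_#+:|<0-9]", title)
cut = title[:match.start()] if match else title
print cut.strip()  # Example Company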
Hopefully there isn't a duplicate question that I've overlooked, because I've been scouring this forum for someone who has posted a problem similar to the one below...
Basically, I've created a Python script that scrapes the callsigns of each ship from the URL shown below and appends them to a list. In short it works; however, whenever I iterate through the list and display each element, there seem to be '[' and ']' around each of the callsigns. I've shown the output of my script below:
Output
*********************** Contents of 'listOfCallSigns' List ***********************
0 ['311062900']
1 ['235056239']
2 ['305500000']
3 ['311063300']
4 ['236111791']
5 ['245639000']
6 ['235077805']
7 ['235011590']
As you can see, it shows the square brackets for each callsign. I have a feeling that this might be down to an encoding problem within the BeautifulSoup library.
Ideally, I want the output to be without any of the square brackets and just the callsign as a string.
*********************** Contents of 'listOfCallSigns' List ***********************
0 311062900
1 235056239
2 305500000
3 311063300
4 236111791
5 245639000
6 235077805
7 235011590
The script I'm currently using is shown below:
My script
# Importing the modules needed to run the script
from bs4 import BeautifulSoup
import urllib2
import re
import requests
import pprint
# Declaring the url for the port of hull
url = "http://www.fleetmon.com/en/ports/Port_of_Hull_5898"
# Opening and reading the contents of the URL using the module 'urllib2'
# Scanning the entire webpage, finding a <table> tag with the id 'vessels_in_port_table' and finding all <tr> tags
portOfHull = urllib2.urlopen(url).read()
soup = BeautifulSoup(portOfHull)
table = soup.find("table", {'id': 'vessels_in_port_table'}).find_all("tr")
# Declaring a list to hold the call signs of each ship in the table
listOfCallSigns = []
# For each row in the table, using a regular expression to extract the first 9 numbers from each ship call-sign
# Adding each extracted call-sign to the 'listOfCallSigns' list
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4])))

print "\n\n*********************** Contents of 'listOfCallSigns' List ***********************\n"

# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row
Does anyone know how to remove the square brackets surrounding each callsign and just display the string?
Thanks in advance! :)
Change the last lines to:
# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row[0]  # <-- added a [0] here
Alternatively, you can also add the [0] here:
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4]))[0])  # <-- added a [0] here
The explanation here is that re.findall(...) returns a list (in your case, with a single element in it). So, listOfCallSigns ends up being a "list of sublists each containing a single string":
>>> listOfCallSigns
>>> [ ['311062900'], ['235056239'], ['311063300'], ['236111791'],
['245639000'], ['305500000'], ['235077805'], ['235011590'] ]
When you enumerate your listOfCallSigns, the row variable is basically the re.findall(...) that you appended earlier in the code (that's why you can add the [0] after either of them).
So row and re.findall(...) are both of type "list of string(s)" and look like this:
>>> row
>>> ['311062900']
And to get the string inside the list, you need to access its first element, i.e.:
>>> row[0]
>>> '311062900'
Hope this helps!
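An alternative sketch, if you'd rather never index into findall's result: re.search returns a single match object instead of a list (this reuses table and listOfCallSigns from your script):
import re

for i, row in enumerate(table):
    if i:
        match = re.search(r"\d{9}", str(row.find_all('td')[4]))
        if match:
            listOfCallSigns.append(match.group())  # a plain string, no sublist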
This can also be done by stripping the unwanted characters from the string like so:
a = "string with bad characters []'] in here"
a = a.translate(None, "[]'")
print a
I have an array of strings like:
urls_parts=['week', 'weeklytop', 'week/day']
And I need to monitor the inclusion of these strings in my URL, so this example should be triggered by the 'weeklytop' part only:
url='www.mysite.com/weeklytop/2'
for part in urls_parts:
    if part in url:
        print part
But it is of course triggered by 'week' too.
What is the way to do it right?
Oops, let me specify my question a bit.
I need that code not to trigger when url='www.mysite.com/week/day/2' and part='week'.
The only URLs that should trigger on part='week' are ones like url='www.mysite.com/week/2' or 'www.mysite.com/week/2-second', for example.
This is how I would do it.
import re

urls_parts = ['week', 'weeklytop', 'week/day']
urls_parts = sorted(urls_parts, key=len, reverse=True)
rexes = [re.compile(r'{part}\b'.format(part=part)) for part in urls_parts]

urls = ['www.mysite.com/weeklytop/2', 'www.mysite.com/week/day/2', 'www.mysite.com/week/4']
for url in urls:
    for i, rex in enumerate(rexes):
        if rex.search(url):
            print url
            print urls_parts[i]
            print
            break
OUTPUT
www.mysite.com/weeklytop/2
weeklytop
www.mysite.com/week/day/2
week/day
www.mysite.com/week/4
week
The suggestion to sort by length came from @Roman.
Sort your list by len and break from the loop at the first match.
try something like this:
>>> print(re.findall('\\weeklytop\\b', 'www.mysite.com/weeklytop/2'))
['weeklytop']
>>> print(re.findall('\\week\\b', 'www.mysite.com/weeklytop/2'))
[]
program:
>>> import re
>>> urls_parts = ['week', 'weeklytop', 'week/day']
>>> url = 'www.mysite.com/weeklytop/2'
>>> for parts in urls_parts:
...     if re.findall('\\' + parts + r'\b', url):
...         print(parts)
output:
weeklytop
Why not use urls_parts like this?
['/week/', '/weeklytop/', '/week/day/']
A slight change in your code would solve this issue:
>>> for part in urls_parts:
...     if part in url.split('/'):  # splitting the URL string with '/' as delimiter
...         print part
weeklytop
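One more hedged sketch that also handles the multi-segment part 'week/day', by comparing whole path segments and preferring the longest match (the helper name segment_match is my own, not from any library):
def segment_match(url, part):
    # True only when 'part' appears as complete path segment(s),
    # so 'week' does not fire inside 'weeklytop'.
    return '/' + part + '/' in url + '/'

urls_parts = ['week', 'weeklytop', 'week/day']
for url in ['www.mysite.com/weeklytop/2', 'www.mysite.com/week/day/2', 'www.mysite.com/week/2']:
    hits = [p for p in urls_parts if segment_match(url, p)]
    if hits:
        print url, max(hits, key=len)  # the most specific (longest) part wins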