I am trying to scrape a series of websites that look like the following three examples:
www.examplescraper.com/fghxbvn/17901234.html
www.examplescraper.com/fghxbvn/17911102.html
www.examplescraper.com/fghxbvn/17921823.html
Please keep in mind that there are 200 of these websites, and I'd like to iterate through them in a loop rather than copying and pasting each one into a script.
The base is www.examplescraper.com/fghxbvn/, followed by a year, then four digits that do not follow a pattern, and then .html.
So in the first website:
base = www.examplescraper.com/fghxbvn/
year = 1790
four random digits = 1234.html
I would like to call (in beautiful soup) a url where url:
url = base + str(year) + str(any four ints) + ".html"
My question:
How do I (in Python) recognize any four digits? They can be any digits. I don't need to generate four ints or return the four ints; I just need Python to accept any four ints to feed into Beautiful Soup.
I don't exactly follow your question, but you can use the re module to easily parse out text of a specific format like you have here. For instance:
>>> import re
>>> url = "www.examplescraper.com/fghxbvn/17901234.html"
>>> re.match( r"(\S+/)(\d{4})(\d{4})\.html", url ).groups()
('www.examplescraper.com/fghxbvn/', '1790', '1234')
This splits up the URL into a tuple like you described. Be sure to read the documentation on the re module. HTH
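Building on the regex above, here is a minimal sketch of how the same pattern could filter a list of links inside a loop. The `candidates` list and the `split_url` helper are made up for illustration; in practice the links would have to come from somewhere else (for example an index page), since a URL with unknown digits cannot be requested directly.

```python
import re

# Assumed layout: <base><4-digit year><4 arbitrary digits>.html
URL_PATTERN = re.compile(r"^(\S+/)(\d{4})(\d{4})\.html$")

def split_url(url):
    """Return (base, year, digits), or None if the URL does not fit the layout."""
    match = URL_PATTERN.match(url)
    return match.groups() if match else None

# Hypothetical list of candidate links, e.g. collected from an index page.
candidates = [
    "www.examplescraper.com/fghxbvn/17901234.html",
    "www.examplescraper.com/fghxbvn/17911102.html",
    "www.examplescraper.com/fghxbvn/about.html",
]

# Keep only the links that have the year + four digits shape.
matching = [url for url in candidates if split_url(url)]
```

Each entry of `matching` could then be fetched and handed to Beautiful Soup one at a time.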
Whenever possible when dealing with URLs, you should consider using the urlparse module, which is built for parsing URLs.
However, yours is not a well-formed URL for urlparse (hint: it does not start with a scheme/protocol such as 'http').
For your particular task, you can use regular expressions, something of this sort:
>>> s = 'www.examplescraper.com/fghxbvn/17901234.html'
>>> import re
>>> p = re.compile(r'(\d{4})\.html')
>>> p.search(s).groups()[0]
'1234'
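To illustrate the urlparse hint, here is a rough sketch that combines it with the regex. It assumes Python 3, where the module lives at `urllib.parse`, and it assumes that prepending `http://` to a scheme-less URL is acceptable; `last_four_digits` is a made-up helper name.

```python
import posixpath
import re
from urllib.parse import urlparse

def last_four_digits(url):
    """Return the four digits after the year, or None if the file name does not fit."""
    # urlparse only recognises the host when a scheme is present,
    # so add one if it is missing (assumption: plain http is fine here).
    if "://" not in url:
        url = "http://" + url
    filename = posixpath.basename(urlparse(url).path)
    # Expect exactly: 4-digit year, 4 more digits, then ".html".
    match = re.fullmatch(r"\d{4}(\d{4})\.html", filename)
    return match.group(1) if match else None
```

Working on the parsed path rather than the raw string means a query string or fragment on the URL would not confuse the match.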
>>> s="www.examplescraper.com/fghxbvn/17901234.html"
>>> s.split("/")
['www.examplescraper.com', 'fghxbvn', '17901234.html']
>>> base='/'.join( s.split("/")[0:-1] )
>>> base
'www.examplescraper.com/fghxbvn'
>>> year = s.split("/")[-1][:4]
>>> year
'1790'
>>> fourrandom = s.split("/")[-1][4:]
>>> fourrandom
'1234.html'
>>>
I have a string template that looks like 'my_index-{year}'.
I do something like string_template.format(year=year) where year is some string. Result of this is some string that looks like my_index-2011.
Now, to my question: I have a string like my_index-2011 and my template 'my_index-{year}'. What might be a slick way to extract the {year} portion?
[Note: I know of the existence of parse library]
There is this module called parse which provides an opposite to format() functionality:
Parse strings using a specification based on the Python format() syntax.
>>> from parse import parse
>>> s = "my_index-2011"
>>> f = "my_index-{year}"
>>> parse(f, s)['year']
'2011'
An alternative option, since you are extracting a year, would be to use the dateutil parser in fuzzy mode:
>>> from dateutil.parser import parse
>>> parse("my_index-2011", fuzzy=True).year
2011
Use the split() string function to split the string into two parts around the dash, then grab just the second part.
mystring = "my_index-2011"
year = mystring.split("-")[1]
I assume "year" is 4 digits and you have multiple indexes
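import re
# idx is the list of index names and data is the text to search (see the sample values under the update)
res = ''
patterns = ['%s-[0-9]{4}' % index for index in idx]
for index, pattern in zip(idx, patterns):
    res += ' '.join(re.findall(pattern, data)).replace(index + '-', '') + ' '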
---update---
dummyString = 'adsf-1234 fsfdr lkjdfaif ln ewr-1234 adsferggs sfdgrsfgadsf-3456'
dummyIdx = ['ewr','adsf']
output
1234 1234 3456
Yes, a regex would be helpful here.
In [1]: import re
In [2]: s = 'my_string-2014'
In [3]: print( re.search(r'\d{4}', s).group(0) )
2014
Edit: I should have mentioned your regex can be more sophisticated. You can haul out a subcomponent of a more specific string, for example:
In [4]: print( re.search(r'my_string-(\d{4})$', s).group(1) )
2014
Given the problem you presented, I think any "find the year" formula should be expressible in terms of a regular expression.
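As a sketch of that idea without the parse library, the format template itself can be turned into a regex with named groups. This is only an illustration: `template_to_regex` and `extract` are made-up names, and it assumes every field in the template is named (no bare `{}`) and that fields can be matched lazily as "any text".

```python
import re
from string import Formatter

def template_to_regex(template):
    """Build a compiled regex from a str.format-style template with named fields."""
    parts = []
    for literal, field, _spec, _conversion in Formatter().parse(template):
        parts.append(re.escape(literal))          # literal text must match exactly
        if field is not None:
            parts.append("(?P<%s>.+?)" % field)   # each field becomes a named group
    return re.compile("^" + "".join(parts) + "$")

def extract(template, text):
    """Return a dict of field values, or None if text does not fit the template."""
    match = template_to_regex(template).match(text)
    return match.groupdict() if match else None
```

With the strings from the question, `extract('my_index-{year}', 'my_index-2011')` gives back a dict with the single key `year`.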
You are going to want to use the string method split to split on "-", and then catch the last element as your year:
year = "any_index-2016".split("-")[-1]
Because you caught the last element (using -1 as the index), your index can have hyphens in them, and you will still extract the year appropriately.
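A small sketch of the same "take the last piece" idea, with a sanity check added. The helper name `year_from_index` and the four-digit check are my own assumptions, not part of the answer above.

```python
def year_from_index(name):
    """Return the text after the last hyphen, or None if it is not four digits."""
    # rpartition splits on the LAST hyphen, so hyphens in the index name are fine.
    _, sep, tail = name.rpartition("-")
    if sep and len(tail) == 4 and tail.isdigit():
        return tail
    return None
```

Returning None instead of raising keeps the caller in control when a string has no year suffix.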
I have a string - Python :
string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
Expected output is :
"Atlantis-GPS-coordinates"
I know that the expected output is ALWAYS surrounded by "/bar/" on the left and "/" on the right :
"/bar/Atlantis-GPS-coordinates/"
Proposed solution would look like :
a = string.find("/bar/")
b = string.find("/",a+5)
output=string[a+5:b]
This works, but I don't like it.
Does someone know a beautiful function or tip ?
You can use split:
>>> string.split("/bar/")[1].split("/")[0]
'Atlantis-GPS-coordinates'
Some efficiency from adding a max split of 1 I suppose:
>>> string.split("/bar/", 1)[1].split("/", 1)[0]
'Atlantis-GPS-coordinates'
Or use partition:
>>> string.partition("/bar/")[2].partition("/")[0]
'Atlantis-GPS-coordinates'
Or a regex:
>>> re.search(r'/bar/([^/]+)', string).group(1)
'Atlantis-GPS-coordinates'
Depends on what speaks to you and your data.
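If this comes up often, the partition variant could be wrapped in a small helper. This is just a sketch; `between` is a made-up name, and returning None when a marker is missing is my own choice.

```python
def between(text, left, right):
    """Return the text between the first `left` and the next `right`, or None."""
    _, found, rest = text.partition(left)
    if not found:
        return None          # left marker is not present
    value, found, _ = rest.partition(right)
    return value if found else None   # None if the right marker never follows

string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
result = between(string, "/bar/", "/")
```

Unlike the chained split, this does not raise an IndexError when `/bar/` is absent.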
What you have isn't all that bad. I'd write it as:
start = string.find('/bar/') + 5
end = string.find('/', start)
output = string[start:end]
as long as you know that /bar/WHAT-YOU-WANT/ is always going to be present. Otherwise, I would reach for the regular expression knife:
>>> import re
>>> PATTERN = re.compile('^.*/bar/([^/]*)/.*$')
>>> s = '/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/'
>>> match = PATTERN.match(s)
>>> match.group(1)
'Atlantis-GPS-coordinates'
import re
pattern = '(?<=/bar/).+?/'
string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
result = re.search(pattern, string)
print string[result.start():result.end() - 1]
# "Atlantis-GPS-coordinates"
That is a Python 2.x example. What it does first is:
1. (?<=/bar/) means only process the following regex if this precedes it (so that /bar/ must be before it)
2. '.+?/' means any amount of characters up until the next '/' char
Hope that helps some.
If you need to do this kind of search a bunch it is better to 'compile' this search for performance, but if you only need to do it once don't bother.
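For the "compile it once" case, a minimal sketch might look like this. `BAR_SEGMENT` and `bar_segments` are illustrative names, and I use `[^/]+` instead of the lazy `.+?/` so the trailing slash does not have to be trimmed afterwards.

```python
import re

# Compiled once and reused for every search.
BAR_SEGMENT = re.compile(r"(?<=/bar/)[^/]+")

def bar_segments(paths):
    """Return the segment after /bar/ for each path, or None where there is none."""
    results = []
    for path in paths:
        match = BAR_SEGMENT.search(path)
        results.append(match.group() if match else None)
    return results
```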
Using re (slower than other solutions):
>>> import re
>>> string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
>>> re.search(r'(?<=/bar/)[^/]+(?=/)', string).group()
'Atlantis-GPS-coordinates'
svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz
From the string above I need to fetch Rev1223, so I was wondering whether a regular expression can do that. I'd like something like string.search("Rev" up to the next /).
So far I have split the string using split:
s1,s2,s3,s4,s5 = string.split("/",4)
You don't need a regex to do this. It is as simple as:
str = 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz'
str.split('/')[-2]
Here is a quick python example
>>> import re
>>> s = 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz'
>>> p = re.compile(r'.*/(Rev\d+)/.*')
>>> p.match(s).groups()[0]
'Rev1223'
Find second part from the end using regex, if preferred:
/(Rev\d+)/[^/]+$
http://regex101.com/r/cC6fO3/1
>>> import re
>>> m = re.search(r'/(Rev\d+)/[^/]+$', 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz')
>>> m.groups()[0]
'Rev1223'
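The regex approach can be wrapped so that a path without a revision folder does not blow up on `.groups()`. A rough sketch, where `find_revision` is a made-up name and I assume the revision is always a whole path segment of the form Rev followed by digits:

```python
import re

REVISION = re.compile(r"/(Rev\d+)/")

def find_revision(path):
    """Return the RevNNNN path segment, or None if the path has none."""
    match = REVISION.search(path)
    return match.group(1) if match else None
```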
Regex to retrieve the last portion of a string:
https://play.google.com/store/apps/details?id=com.lima.doodlejump
I'm looking to retrieve the string that follows id=
The following regex didn't seem to work in python
sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
re.search("id=(.*?)", sampleURL).group(1)
The above should give me an output:
com.lima.doodlejump
Is my search group right?
Your regular expression
(.*?)
will not work because it matches between zero and unlimited times, as few times as possible (because of the ?). With nothing after it to force a longer match, it settles for the empty string. So, you have the following choices of RegEx
(.*) # Matches the rest of the string
(.*?)$ # Matches till the end of the string
But, you don't need RegEx at all here, simply split the string like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
print data.split("id=", 1)[-1]
Output
com.lima.doodlejump
If you really have to use RegEx, you can do like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
import re
print re.search("id=(.*)", data).group(1)
Output
com.lima.doodlejump
I'm surprised that nobody has mentioned urlparse yet...
>>> s = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> import urlparse  # Python 2 module; in Python 3 it lives in urllib.parse
>>> urlparse.urlparse(s)
ParseResult(scheme='https', netloc='play.google.com', path='/store/apps/details', params='', query='id=com.lima.doodlejump', fragment='')
>>> urlparse.parse_qs(urlparse.urlparse(s).query)
{'id': ['com.lima.doodlejump']}
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id']
['com.lima.doodlejump']
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id'][0]
'com.lima.doodlejump'
The HUGE advantage here is that if the url query string gets more components then it could easily break the other solutions which rely on a simple str.split. It won't confuse urlparse however :).
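Here is a small sketch of that point for Python 3, where the same functions live in `urllib.parse`. The extra `&hl=en` parameter is made up to show the difference, and `query_value` is an illustrative helper name.

```python
from urllib.parse import urlparse, parse_qs

def query_value(url, key):
    """Return the first value of a query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get(key)
    return values[0] if values else None

# Hypothetical URL with one more query component appended.
url = "https://play.google.com/store/apps/details?id=com.lima.doodlejump&hl=en"

via_urlparse = query_value(url, "id")   # only the id value
via_split = url.split("id=")[1]         # also drags along "&hl=en"
```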
Just split it in the place you want:
id = url.split('id=')[1]
If you print id, you'll get:
com.lima.doodlejump
Regex isn't needed here :)
However, in case there are multiple id=s in your string, and you only wanted the last one:
id = url.split('id=')[-1]
Hope this helps!
This works:
>>> import re
>>> sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> re.search("id=(.+)", sampleURL).group(1)
'com.lima.doodlejump'
>>>
Instead of capturing non-greedily for zero or more characters, this code captures greedily for one or more.
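The difference between the lazy and greedy groups can be checked directly. A quick sketch using the URL from the question (the variable names are just for illustration):

```python
import re

sample = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"

# Lazy group with nothing after it: matches as little as possible, i.e. nothing.
lazy = re.search(r"id=(.*?)", sample).group(1)

# Lazy group anchored to the end of the string: forced to consume the rest.
anchored = re.search(r"id=(.*?)$", sample).group(1)

# Greedy group: takes everything up to the end on its own.
greedy = re.search(r"id=(.+)", sample).group(1)
```

Here `lazy` ends up as the empty string, while `anchored` and `greedy` both hold the package name.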
I have the following string from which I want to extract the q and geocode values.
?since_id=261042755432763393&q=salvia&geocode=39.862712%2C-75.33958%2C10mi
I've tried the following regular expression.
expr = re.compile('\[\=\](.*?)\[\&\]')
vals = expr.match(str)
However, vals is None. I'm also not sure how to find something before, say, q= versus =.
No need for a regex (using Python 3):
>>> from urllib.parse import parse_qs
>>> query = parse_qs(str[1:])
>>> query
{'q': ['salvia'], 'geocode': ['39.862712,-75.33958,10mi'], 'since_id': ['261042755432763393']}
>>> query['q']
['salvia']
>>> query['geocode']
['39.862712,-75.33958,10mi']
Obviously, str contains your input.
Since (according to your tag) you are using Python 2.7, I think you need to change the import statement to this, though:
from urlparse import parse_qs
and if you were using Python before version 2.6, the import statement is
from cgi import parse_qs
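If a regex is still preferred for pulling out one specific key, something along these lines might do. It is only a sketch: `raw_param` is a made-up name, and note that the value comes back still percent-encoded, which parse_qs would have decoded for you.

```python
import re

def raw_param(query, key):
    """Return the raw (still percent-encoded) value for key, or None."""
    # The key must sit at the start or right after '?' or '&',
    # so that 'id' does not match inside 'since_id'.
    pattern = r"(?:^|[?&])" + re.escape(key) + r"=([^&]*)"
    match = re.search(pattern, query)
    return match.group(1) if match else None

query = "?since_id=261042755432763393&q=salvia&geocode=39.862712%2C-75.33958%2C10mi"
```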
I think this can be easily done without regex:
string = '?since_id=261042755432763393&q=salvia&geocode=39.862712%2C-75.33958%2C10mi'
parts = string[1:].split('&')  # the [1:] is to leave out the '?'
pairs = {}
for part in parts:
    try:
        key, value = part.split('=')
        pairs[key] = value
    except ValueError:  # skip parts that are not exactly key=value
        pass
And pairs should contain all the key-value pairs of the string.
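One thing to watch with the manual split is that the values stay percent-encoded (`%2C` instead of a comma). A sketch of the same loop with decoding added, assuming Python 3; `parse_pairs` is a made-up name:

```python
from urllib.parse import unquote_plus

def parse_pairs(query):
    """Split a query string into a dict, decoding percent-escapes and '+'."""
    pairs = {}
    for part in query.lstrip("?").split("&"):
        key, sep, value = part.partition("=")
        if sep:  # ignore parts without an '=' sign
            pairs[unquote_plus(key)] = unquote_plus(value)
    return pairs

string = "?since_id=261042755432763393&q=salvia&geocode=39.862712%2C-75.33958%2C10mi"
pairs = parse_pairs(string)
```

Using partition instead of split also keeps values that themselves contain an `=` sign.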