I have the following string from which I want to extract the q and geocode values.
?since_id=261042755432763393&q=salvia&geocode=39.862712%2C-75.33958%2C10mi
I've tried the following regular expression.
expr = re.compile('\[\=\](.*?)\[\&\]')
vals = expr.match(str)
However, vals is None. I'm also not sure how to find something before, say, q= versus =.
No need for a regex (using Python 3):
>>> from urllib.parse import parse_qs
>>> query = parse_qs(str[1:])
>>> query
{'q': ['salvia'], 'geocode': ['39.862712,-75.33958,10mi'], 'since_id': ['261042755432763393']}
>>> query['q']
['salvia']
>>> query['geocode']
['39.862712,-75.33958,10mi']
Obviously, str contains your input.
Since (according to your tag) you are using Python 2.7, I think you need to change the import statement to this, though:
from urlparse import parse_qs
and if you were using Python before version 2.6, the import statement is
from cgi import parse_qs
I think this can be easily done without regex:
string = '?since_id=261042755432763393&q=salvia&geocode=39.862712%2C-75.33958%2C10mi'
parts = string[1:].split('&')  # the [1:] is to leave out the '?'
pairs = {}
for part in parts:
    try:
        key, value = part.split('=')
        pairs[key] = value
    except ValueError:  # skip malformed pairs rather than catching everything
        pass
And pairs should contain all the key-value pairs of the string.
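Note that the manual split leaves percent-encoded characters in place (geocode stays as 39.862712%2C-75.33958%2C10mi). If you also want those decoded, a small sketch using unquote (Python 3's urllib.parse; on 2.x the same function lives in urllib):

```python
from urllib.parse import unquote

string = '?since_id=261042755432763393&q=salvia&geocode=39.862712%2C-75.33958%2C10mi'
pairs = {}
for part in string[1:].split('&'):   # [1:] drops the leading '?'
    key, _, value = part.partition('=')
    pairs[key] = unquote(value)      # decodes %2C back to ','

print(pairs['geocode'])  # 39.862712,-75.33958,10mi
```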
I get JSON from another server, like:
{"field":"zxczxczcx_{{name}}_qweqweqwe"}
So the question is: how do I format that value?
I've tried
d = {"field":"zxczxczcx_{{name}}_qweqweqwe"}
d['field'].format('any_string')
But it just removes one pair of curly braces and outputs
"zxczxczcx_{name}_qweqweqwe"
Maybe you can use the replace method?
d = {"field":"zxczxczcx_{{name}}_qweqweqwe"}
d['field'] = d['field'].replace('{{name}}','any_string')
print(d)
Based on your comments (this uses the re module (regular expressions) to find the {{x}} pattern) :
import re
tokens_to_replace = re.findall('{{.*}}', d['field'])
for token in tokens_to_replace:
    d['field'] = d['field'].replace(token, d[token[2:-2]])
tokens_to_replace will have value: ['{{name}}']
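One caveat: '{{.*}}' is greedy, so two tokens on the same line would be matched as one. A sketch of the same idea with re.sub and a callback; the replacements mapping here is hypothetical, standing in for wherever your substitution values come from:

```python
import re

d = {"field": "zxczxczcx_{{name}}_qweqweqwe"}
replacements = {"name": "any_string"}  # hypothetical mapping of token names to values

# \w+ inside the braces matches each token separately, unlike the greedy '{{.*}}'
d['field'] = re.sub(r'\{\{(\w+)\}\}', lambda m: replacements[m.group(1)], d['field'])
print(d['field'])  # zxczxczcx_any_string_qweqweqwe
```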
I have a string "ProductId%3D967164%26Colour%3Dbright-royal" and I want to extract data using a regex so that the output is 967164bright-royal.
I have tried (?:ProductId%3D|Colour%3D)(.*) in Python, but I'm getting 967164%26Colour%3Dbright-royal as output.
Can anyone please help me find the right regex for this?
You don't need a regex here, use urllib.parse module:
from urllib.parse import parse_qs, unquote
qs = "ProductId%3D967164%26Colour%3Dbright-royal"
d = parse_qs(unquote(qs))
print(d)
# Output:
{'ProductId': ['967164'], 'Colour': ['bright-royal']}
Final output:
>>> ''.join(i[0] for i in d.values())
'967164bright-royal'
Update
>>> ''.join(re.findall(r'%3D(\S*?)(?=%26|$)', qs))
'967164bright-royal'
The alternation matches on the first part; you cannot get a single match for 2 separate parts of the string.
If you want to capture both values using a regex in a capture group:
(?:ProductId|Colour)%3D(\S*?)(?=%26|$)
Regex demo
import re
pattern = r"(?:ProductId|Colour)%3D(\S*?)(?=%26|$)"
s = "ProductId%3D967164%26Colour%3Dbright-royal"
print(''.join(re.findall(pattern, s)))
Output
967164bright-royal
If you must use a regular expression and you can guarantee that the string will always be formatted the way you expect, you could try this.
import re
pattern = r"ProductId%3D(\d+)%26Colour%3D(.*)"
string = "ProductId%3D967164%26Colour%3Dbright-royal"
matches = re.match(pattern, string)
print(f"{matches[1]}{matches[2]}")
I have the following saved as a string in a variable:
window.dataLayer=[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]
How do I extract the values of each element?
Goal would be to have something like this:
articleCondition = 'new'
categoryNr = '12345'
...
In Python there are many ways to get a value out of a string: you can use a regex, the eval function, and more.
Method 1
value = 'window.dataLayer=[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]'
value = value.split('=')[1]
data = eval(value)[0]
articleCondition = data['articleCondition']
Method 2
using regex
import re
re.findall(r'"articleCondition":"(\w*)"', value)
With a regex you can be more creative and build a more general pattern.
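A word of caution on Method 1: eval will execute arbitrary code, which is risky on data received from another server. If you don't want to bring in json, ast.literal_eval is a safer drop-in here, since it only parses Python literals:

```python
import ast

value = 'window.dataLayer=[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]'
data = ast.literal_eval(value.split('=')[1])[0]  # parses literals only, never executes code
print(data['articleCondition'])  # New
```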
You have a list of dictionaries. Use the dictionary key to get the value.
Ex:
dataLayer=[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]
print(dataLayer[0]["articleCondition"])
print(dataLayer[0]["categoryNr"])
Output:
New
12345
Use json. Your string is:
>>> s = 'window.dataLayer=[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]'
You can get the right hand side of the = with a split:
>>> s.split('=')[1]
'[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]'
Then parse it with the json module:
>>> import json
>>> t = json.loads(s.split('=')[1])
>>> t[0]['articleCondition']
'New'
Please note that this works because you have double quotes in the RHS. Single quotes are not allowed in JSON.
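A quick sketch of that caveat: the same payload with single quotes is rejected by json.loads:

```python
import json

print(json.loads('{"key": "value"}'))   # double quotes: valid JSON
try:
    json.loads("{'key': 'value'}")      # single quotes: not valid JSON
except json.JSONDecodeError as e:
    print('rejected:', e)
```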
svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz
From the following string I need to fetch Rev1223. I was wondering if there is a regex to do that; something like string.search("Rev" up to the next /).
So far I have split it like this:
s1, s2, s3, s4, s5 = string.split("/", 4)
You don't need a regex to do this. It is as simple as:
str = 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz'
str.split('/')[-2]
Here is a quick python example
>>> import re
>>> s = 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz'
>>> p = re.compile(r'.*/(Rev\d+)/.*')
>>> p.match(s).groups()[0]
'Rev1223'
Find second part from the end using regex, if preferred:
/(Rev\d+)/[^/]+$
http://regex101.com/r/cC6fO3/1
>>> import re
>>> m = re.search(r'/(Rev\d+)/[^/]+$', 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz')
>>> m.groups()[0]
'Rev1223'
I am trying to scrape a series of websites that look like the following three examples:
www.examplescraper.com/fghxbvn/17901234.html
www.examplescraper.com/fghxbvn/17911102.html
www.examplescraper.com/fghxbvn/17921823.html
Please keep in mind that there are 200 of these websites, and I'd like to iterate through them in a loop rather than copy and paste each one into a script.
Where the base is www.examplescraper.com/fghxbvn/, then there's a year, followed by four digits that do not follow a pattern and then .html.
So in the first website:
base = www.examplescraper.com/fghxbvn/
year = 1790
four random digits = 1234.html
I would like to call (in beautiful soup) a url where url:
url = base + str(year) + str(any four ints) + ".html"
My question:
How do I (in Python) recognize any four digits? They can be any digits. I don't need to generate four ints or return the four ints I just need Python to accept any four ints to feed into beautiful soup.
I don't exactly follow your question, but you can use the re module to easily parse out text of a specific format like you have here. For instance:
>>> import re
>>> url = "www.examplescraper.com/fghxbvn/17901234.html"
>>> re.match(r"(\S+/)(\d{4})(\d{4})\.html", url).groups()
('www.examplescraper.com/fghxbvn/', '1790', '1234')
This splits up the URL into a tuple like you described. Be sure to read the documentation on the re module. HTH
Whenever possible when dealing with URLs, you should consider using the urlparse module.
It works for parsing URLs, but yours is not a well-formed URL for urlparse (hint: it does not start with a scheme/protocol such as 'http').
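If you prepend a scheme yourself, urlparse does handle it; a minimal sketch (Python 3 import shown; on 2.x it's from urlparse import urlparse):

```python
from urllib.parse import urlparse

s = 'www.examplescraper.com/fghxbvn/17901234.html'
parsed = urlparse('http://' + s)   # urlparse needs a scheme to separate host and path
print(parsed.netloc)  # www.examplescraper.com
print(parsed.path)    # /fghxbvn/17901234.html
```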
For your particular task, you can use regular expressions, something of this sort:
>>> s = 'www.examplescraper.com/fghxbvn/17901234.html'
>>> import re
>>> p = re.compile(r'(\d{4})\.html')
>>> p.search(s).groups()[0]
'1234'
>>> s="www.examplescraper.com/fghxbvn/17901234.html"
>>> s.split("/")
['www.examplescraper.com', 'fghxbvn', '17901234.html']
>>> base='/'.join( s.split("/")[0:-1] )
>>> base
'www.examplescraper.com/fghxbvn'
>>> year = s.split("/")[-1][:4]
>>> year
'1790'
>>> fourrandom = s.split("/")[-1][4:]
>>> fourrandom
'1234.html'