I have this url = 'http://www.bhaskar.com/uttar_pradesh/lucknow/='. After the "=" sign comes a Hindi word, which is the word being searched for. I want to add that word as a parameter to the URL, so that I only need to change the word each time and not the whole URL. I tried this:
>>> url = 'http://www.bhaskar.com/uttar_pradesh/lucknow/='
>>> word = 'word1'
>>> conj = url + word
but this gives me the Hindi word as escaped byte sequences, like this:
>>> conj
'http://www.bhaskar.com/uttar_pradesh/lucknow/=\xe0\xa6\xb8\xe0\xa6\xb0'
Can anyone help?
but this gives me the Bengali word in unicode
No, it does not :)
When you type temp at the interactive prompt, Python displays an unambiguous representation of the string, with non-ASCII bytes shown as escape sequences. When you type print(temp), however, you get a more user-friendly rendering of the same string. Either way, the string held by temp is the same all the time; it is only presented in different ways. See, for example, what happens if you take the escaped version, put it in a variable, and print it:
>>> temp2 = 'http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=\xe0\xa6\xb8\xe0\xa6\xb0'
>>> print(temp2)
http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=সর
Actually, you can create the string by using escaped values in all characters, not only the Bengali one:
>>> temp3 = '\x68\x74\x74\x70\x3a\x2f\x2f\x77\x77\x77\x2e\x63\x66\x69\x6c\x74\x2e\x69\x69\x74\x62\x2e\x61\x63\x2e\x69\x6e\x2f\x69\x6e\x64\x6f\x77\x6f\x72\x64\x6e\x65\x74\x2f\x66\x69\x72\x73\x74\x3f\x6c\x61\x6e\x67\x6e\x6f\x3d\x33\x26\x71\x75\x65\x72\x79\x77\x6f\x72\x64\x3d\xe0\xa6\xb8\xe0\xa6\xb0'
>>> print(temp3)
http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=সর
In the end, all these strings are the same:
>>> temp == temp2
True
>>> temp == temp3
True
So, don't worry, you have the correct string in the variable. You are only getting a problem if the escaped string is displayed elsewhere. Finish your program, run it until the end and you'll see there will be no errors.
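If the goal is simply to swap the word in and out, a small sketch like the following may help (it assumes Python 3's urllib.parse.quote; in Python 2 the equivalent is urllib.quote). Percent-encoding the word is usually what the server expects for non-ASCII characters:
from urllib.parse import quote

base_url = 'http://www.bhaskar.com/uttar_pradesh/lucknow/='

def build_url(word):
    # quote() percent-encodes the UTF-8 bytes of the word, e.g. %E0%A6%B8%E0%A6%B0
    return base_url + quote(word)

print(build_url(u'\u09b8\u09b0'))
# http://www.bhaskar.com/uttar_pradesh/lucknow/=%E0%A6%B8%E0%A6%B0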
How can I most effectively cut out the part of a string that starts after the first occurrence of '=#=' and ends at the next '=#='? For example:
From a large string
'321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
The python code returns:
'I-LOVE-STACK-OVER-FLOW'
Any help will be appreciated.
Using split():
s = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
st = '=#='
ed = '=#='
print((s.split(st))[1].split(ed)[0])
Using regex (note the non-greedy (.*?), so the match stops at the first closing '=#='):
import re
s = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
st = '=#='
ed = '=#='
print(re.search('%s(.*?)%s' % (st, ed), s).group(1))
OUTPUT:
I-LOVE-STACK-OVER-FLOW
In addition to @DirtyBit's answer, if you also want to handle cases with more than two '=#=' delimiters, you can split the string and then join every other element:
s = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319=#=|I-ALSO-LOVE-SO=#=3123123'
parts = s.split('=#=')
print(''.join([parts[i] for i in range(1,len(parts),2)]))
Output
I-LOVE-STACK-OVER-FLOW|I-ALSO-LOVE-SO
The explanation is in the comments.
import re

# ori_str is the original string from the question
ori_list = re.split("=#=", ori_str)
# The goal is to find the parts wrapped between "=#=" delimiters,
# so after the split the even positions are the parts outside of "=#="
# and the odd positions are what you want.
for i in range(len(ori_list)):
    if i % 2 == 1:  # odd position
        print(ori_list[i])
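For example, with the sample string from the question as ori_str, this prints both wrapped parts ('xx$q' also sits between two '=#=' delimiters):
import re

ori_str = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
ori_list = re.split("=#=", ori_str)
print([ori_list[i] for i in range(len(ori_list)) if i % 2 == 1])
# ['I-LOVE-STACK-OVER-FLOW', 'xx$q']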
I am programming a parser for an old dictionary and I'm trying to find a pattern like re.findall("{.*}", string) in a string.
A debug print after the check shows that only a few strings match, although all of the strings contain a pattern like {...}.
Even copying the string and matching it interactively in the idle shell
gives a match, but inside the rest of the code, it simply does not.
Is it possible that this problem is caused by the actual python interpreter?
I cannot figure out any other problem...
Thanks for your help.
The code snippet looks like this:
for aParse in chunklist:
    aSigle = aParse[1]
    aParse = aParse[0]
    print("to be parsed", aParse)
    aContext = Context()
    aContext._init_("")
    aContext.ID = contextID
    aContext.source = aSigle
    # here, aParse is the string containing {Abriss}
    # which is part of a lexicon entry
    metamatches = re.findall(r"\{.*\}", aParse)
    print("metamatches: ", metamatches)
    for meta in metamatches:
        aMeta = meta.replace("{", "").replace("}", "")
        aMeta = aMeta.split()
        for elem in aMeta:
            ...
Try this:
d = {0: "{.test1}", 1: "{.test1}", 2: "{.test1}", 3: "{.test1}"}
for value in d.itervalues():
    if "{" in value:
        value = value.replace("{", " ")
        print value
or, if you want to remove both "{" and "}":
for value in d.itervalues():
    if "{" in value:
        value = value.strip('{}')
        print value
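The same idea in Python 3 syntax (dict.itervalues() no longer exists and print is a function); note that strip('{}') only removes braces at the ends of the string:
d = {0: "{.test1}", 1: "{.test1}", 2: "{.test1}", 3: "{.test1}"}
for value in d.values():
    if "{" in value:
        print(value.strip('{}'))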
Try this:
data = re.findall(r"\{([^\}]*)\}", aParse, re.I | re.S)
So, in a really simplified scenario, a lexical entry looks like this:
"headword" {meta, meaning} context [reference for context].
So, I was chunking (split()) the entry at [...] with a regex. That works fine so far. Then, after separating the headword, I tried to find the meta/meaning with a regex that finds all patterns of the form {...}. Since that regex didn't work, I replaced it with this function:
def findMeta(self, string, alist):
    opened = 0
    closed = 0
    for char in enumerate(string):
        if char[1] == "{":
            opened = char[0]
        elif char[1] == "}":
            closed = char[0]
            meta = string[opened:closed + 1]
            alist.append(meta)
            # note: str.replace returns a new string, so this call does not modify 'string'
            string.replace(meta, "")
Now it's effectively much faster and the meaning component is correctly analysed. The remaining question is: how reliable are the regexes I use to find other information (e.g. orthographic variants, introduced by "s.}")? Should they work, or is it possible that the IDLE shell is simply not capable of parsing a 1000-line program correctly (and compiling all the regexes)? An example of a string whose meta should actually have been found is: " {stm.} {der abbruch thut, den armen das gebührende vorenthält} [Renn.]"
The algorithm finds the first, saying this word is a noun, but the second, its translation, is not recognized.
... This is medieval German, sorry for that! Thank you for all your help.
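For what it's worth, a quick sketch with that sample string suggests the culprit is the greedy pattern rather than the interpreter: "{.*}" swallows everything from the first "{" to the last "}", while the non-greedy character class from the answer above returns both groups separately:
import re

aParse = ' {stm.} {der abbruch thut, den armen das gebührende vorenthält} [Renn.]'
print(re.findall(r"\{.*\}", aParse))
# ['{stm.} {der abbruch thut, den armen das gebührende vorenthält}']
print(re.findall(r"\{([^\}]*)\}", aParse))
# ['stm.', 'der abbruch thut, den armen das gebührende vorenthält']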
Simplifying my task, let's say I want to find any words written in Hebrew on some web page.
So I know that Hebrew char codes are U+05D0 to U+05EA.
I want to write something like:
expr = "[\u05D0-\u05EA]+"
url = "https://en.wikipedia.org/wiki/Category:Countries"
web_handle = urllib2.urlopen(url)
website_text = web_handle.read()
matches = re.findall(expr, website_text)
for item in matches:
    print item
The output I would expect is:
עברית
But instead the output is a lot of Chinese/Japanese characters.
You can just use the standard Python unicode escape notation within a character class:
re.findall(u"[\u05D0-\u05EA]+", website_text, re.U)
The expression should be:
expr = u"[\u05D0-\u05EA]+"
Notice the 'u' at the beginning.
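Putting both remarks together, a sketch along these lines should print the Hebrew words (Python 2 style, as in the question; decoding the page assumes it is served as UTF-8):
import re
import urllib2

expr = u"[\u05D0-\u05EA]+"
url = "https://en.wikipedia.org/wiki/Category:Countries"
website_text = urllib2.urlopen(url).read().decode("utf-8")  # match against unicode, not raw bytes
for item in re.findall(expr, website_text, re.U):
    print item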
I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droid1lwp
string2: 1.1
Because there are multiple uses-permission lines, I want to get the permissions as a list containing:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write the python regular expression to get the strings I want?
Thanks.
Assuming the code block you provided is one long string, here stored in a variable called input_string:
import re

name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): a lookbehind; it checks what comes immediately before the match and only returns strings that are preceded by name='. The \ in front of = and ' escapes them so they are treated as literal characters (harmless here, since neither is a regex metacharacter). name=' is not returned with the result; we just know that every result we get is preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character or underscore. \. is an escaped period, so the regex knows we mean a literal . and not the "any character" metacharacter represented by an unescaped period. Putting these in [] means we'll accept any alphanumeric character, _, or .. The + afterwards means at least one (but possibly more) of [\w\.]. Finally, the ? means don't be greedy: we're telling the regex to take the smallest possible match that meets these specifications, since + could otherwise run through an unlimited number of repeats of anything matched by [\w\.].
(?=\'): a lookahead; it checks what comes immediately after the match and only returns strings that are followed by '. The \ is again an escape, since otherwise the regex or Python's string parsing might misinterpret '. This final ' is not returned with our results; we just know that in the original string it followed any result we do end up getting.
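For example, with the four lines from the question stored in input_string, these expressions should yield roughly the following:
import re

input_string = """package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'"""

name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r"(?<=android\.permission\.)[A-Z_]+(?=\')", input_string)

print name          # jp.tjkapp.droid1lwp
print versionName   # 1.1
print permissions   # ['WRITE_APN_SETTINGS', 'RECEIVE_BOOT_COMPLETED', 'ACCESS_NETWORK_STATE']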
You can do this without regex by reading the file content line by line.
>>> def split_string(s):
... if s.startswith('package'):
... return [i.split('=')[1] for i in s.split() if "=" in i]
... elif s.startswith('uses-permission'):
... return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>
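If the surrounding quote characters are unwanted, they can be stripped from the returned values afterwards, for example:
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'").strip("'")
'WRITE_APN_SETTINGS'
>>> [v.strip("'") for v in split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")]
['jp.tjkapp.droid1lwp', '2', '1.1']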
Here is one example:
#!/usr/bin/env python
inputFile = open("test.txt", "r").readlines()
for line in inputFile:
    if line.startswith("package"):
        words = line.split()
        string1 = words[1].split("=")[1].replace("'", "")
        string2 = words[3].split("=")[1].replace("'", "")
The test.txt file contains the input data you mentioned earlier.
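A small sketch of how the same loop might also collect the permissions into a list (the permissions variable is illustrative, not part of the original snippet):
permissions = []
for line in inputFile:
    if line.startswith("uses-permission"):
        # keep only the last dotted component, e.g. WRITE_APN_SETTINGS
        permissions.append(line.strip().split(".")[-1].replace("'", ""))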
I want to remove all strange characters from a string to make it "URL safe". Therefore, I have a function that goes like this:
def urlize(url, safe=u''):
    intab = u"àáâãäåòóôõöøèéêëçìíîïùúûüÿñ" + safe
    outtab = u"aaaaaaooooooeeeeciiiiuuuuyn" + safe
    trantab = dict((ord(a), b) for a, b in zip(intab, outtab))
    return url.lower().translate(trantab).strip()
This works just great, but now I want to reuse that function to allow additional special characters, for example the quotation mark.
urlize(u'This is sóme randóm "text" that í wánt to process',u'"')
...and that throws the following error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
TypeError: expected a character buffer object
I have tried the following, but they did not work:
urlize(u'text',u'\"')
intab = u"àáâãäåòóôõöøèéêëçìíîïùúûüÿñ%s" , safe
--EDIT--
The full function looks like this
def urlize(url, safe=u''):
    intab = u"àáâãäåòóôõöøèéêëçìíîïùúûüÿñ" + safe
    outtab = u"aaaaaaooooooeeeeciiiiuuuuyn" + safe
    trantab = dict((ord(a), b) for a, b in zip(intab, outtab))
    translated_url = url.lower().translate(trantab).strip()
    pos = 0
    stop = len(translated_url)
    new_url = ''
    last_division_char = False
    while pos < stop:
        if not translated_url[pos].isalnum() and translated_url[pos] not in safe:
            if (not last_division_char) and (pos != stop - 1):
                new_url += '-'
                last_division_char = True
        else:
            new_url += translated_url[pos]
            last_division_char = False
        pos += 1
    return new_url
--EDIT-- Goal
What I want is to normalize text so that I can put it in the URL myself and use it like an ID. For example, if I want to show the products of a category, I'd rather put "ninos-y-bebes" than "niños-y-bebés" (Spanish for kids and babies). I really don't want all the áéíóúñ (which are the special characters in Spanish) in my URL, but I don't want to get rid of them either. That's why I would like to replace all characters that look the same (not 100% of them, I don't care) and then delete all non-alphanumeric characters left.
The unidecode module is a safer option (it will handle other special symbols like "degree"):
>>> from unidecode import unidecode
>>> s = u'This is sóme randóm "text" that í wánt to process'
>>> unidecode(s)
'This is some random "text" that i want to process'
>>> import urllib
>>> urllib.urlencode(dict(x=unidecode(s)))[2:]
'This+is+some+random+%22text%22+that+i+want+to+process'
[ update ]
i think i'm already doing that -> u"aaaaaaooooooeeeeciiiiuuuuyn" – Marco Bruggmann
Fair enough, if you are willing to keep track of every unicode character out there for your translation table (accented characters are not the only issue; there are a whole lot of symbols ready to rain on your parade).
Worse, many unicode symbols may be visually identical to their ASCII counterparts, leading to hard-to-diagnose errors.
[ update ]
What about something like:
>>> safe_chars = 'abcdefghijklmnopqrstuvwxyz01234567890-_'
>>> filter(lambda x: x in safe_chars, "i think i'm already doing that")
'ithinkimalreadydoingthat'
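In Python 3, filter returns an iterator rather than a string, so the same check is usually written with a join:
>>> ''.join(c for c in "i think i'm already doing that" if c in safe_chars)
'ithinkimalreadydoingthat'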
[ update ]
@Daenyth I tried it, but I only get errors: from urllib import urlencode => urlencode('google.com/') => TypeError: not a valid non-string sequence or mapping object – Marco Bruggmann
The urlencode function is intended to produce query-string formatted output (a=1&b=2&c=3). It expects key/value pairs:
>>> urllib.urlencode(dict(url='google.com/'))
'url=google.com%2F'
>>> help(urllib.urlencode)
Help on function urlencode in module urllib:
urlencode(query, doseq=0)
Encode a sequence of two-element tuples or dictionary into a URL query string.
If any values in the query arg are sequences and doseq is true, each
sequence element is converted to a separate parameter.
If the query arg is a sequence of two-element tuples, the order of the
parameters in the output will match the order of parameters in the
input.
[ update ]
That will work without a doubt, but what I want is to normalize text so that I can put it in the URL myself and use it like an ID. For example, if I want to show the products of a category, I'd rather put "ninos-y-bebes" than "niños-y-bebés" (Spanish for kids and babies). I really don't want all the áéíóúñ (which are the special characters in Spanish) in my URL, but I don't want to get rid of them either. That's why I would like to replace all characters that look the same (not 100% of them, I don't care) and then delete all non-alphanumeric characters left.
Ok, Marco, what you want is a routine to create the so-called slugs, isn't it?
You can do it in one line:
>>> s = u'This is sóme randóm "text" that í wánt to process'
>>> allowed_chars = 'abcdefghijklmnopqrstuvwxyz0123456789'
>>> ''.join([ x if x in allowed_chars else '-' for x in unidecode(s.lower()) ])
u'this-is-some-random--text--that-i-want-to-process'
>>> s = u"Niños y Bebés"
>>> ''.join([ x if x in allowed_chars else '-' for x in unidecode(s.lower()) ])
u'ninos-y-bebes'
>>> s = u"1ª Categoria, ½ docena"
>>> ''.join([ x if x in allowed_chars else '-' for x in unidecode(s.lower()) ])
u'1a-categoria--1-2-docena'
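Wrapped up as a small helper function (a sketch building on the one-liner above; it assumes the unidecode package is installed):
from unidecode import unidecode

ALLOWED_CHARS = 'abcdefghijklmnopqrstuvwxyz0123456789-_'

def slugify(text):
    # transliterate to ASCII, lowercase, and replace everything else with '-'
    ascii_text = unidecode(text).lower()
    return ''.join(c if c in ALLOWED_CHARS else '-' for c in ascii_text)

print(slugify(u"Niños y Bebés"))   # ninos-y-bebes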