Remove \n from python string - python

I have scraped a webpage using beautiful soup.
I'm trying to get rid of a '\n' character which isn't eliminated no matter what I try.
My effort so far:
wr=str(loc[i-1]).strip()
wr=wr.replace(r"\[|'u|\\n","")
print(wr)
Output:
[u'\nWong; Voon Hon (Singapore, SG
Kandasamy; Ravi (Singapore, SG
Narasimalu; Srikanth (Singapore, SG
Larsen; Gerner (Hinnerup, DK
Abeyasekera; Tusitha (Aarhus N, DK
How do I eliminate the [u'\n? What am I doing wrong?
The full code is here.

You need to escape the newline character (double the backslash):
rep=["[","u'","\\n"]
for r in rep:
wr=wr.replace(r,"")
This is the same as @cricket_007's answer; however, the second part of his answer does not work for me. To my knowledge, str.replace() does not support this kind of regular expression lookup.

You need to escape the backslash or use a raw string. Otherwise, it's a newline character, not a literal \n.
Also, I don't think BeautifulSoup is outputting unicode strings; you are seeing the string representation in Python as u'blah'.
And you shouldn't need a list of elements to remove. With re.sub, the expression can be
r"\[|u'|\\n"

Related

Read only Arabic text from file in Python [duplicate]

When using Regex in Python, it's easy to use brackets to represent a range of characters a-z, but this doesn't seem to be working for other languages, like Arabic:
import re
pattern = '[ي-ا]'
p = re.compile(pattern)
This results in a long error report that ends with
raise error("bad character range")
sre_constants.error: bad character range
How can this be fixed?
Since Arabic characters are rendered from right to left, the correct string below, which reads "from ا to ي", is rendered backwards (try to select the string if you want to confirm):
'[ا-ي]'
Console output:
>>> re.compile('[ا-ي]')
<_sre.SRE_Pattern object at 0x6001f0a80>
>>> re.compile('[ا-ي]', re.DEBUG)
in
range (1575, 1610)
<_sre.SRE_Pattern object at 0x6001f0440>
So your pattern '[ي-ا]' is actually "from ي to ا", which is an invalid range, since the code point of ا is smaller than the code point of ي.
To prevent confusion, Ignacio Vazquez-Abrams's suggestion of using Unicode escape is a good alternative to the solution I provide above.
Use Unicode escapes instead.
>>> re.compile('[\u0627-\u064a]')
<_sre.SRE_Pattern object at 0x237f460>
The approved answer does work; however, the Unicode range [\u0627-\u064a] does not include variations of the letter 'ا' such as 'أ', 'آ' or 'إ', nor the letter 'و' with its variation 'ؤ'. (I wanted to comment/suggest an edit to the approved answer but there's a queue.)
So in case someone is (re)visiting this question and needs those letter variations, a Unicode range that worked better for me was [\u0600-\u06FF], making the answer:
pattern = re.compile('[\u0600-\u06FF]')
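A quick, hedged comparison of the two ranges; the sample phrase is assumed purely for illustration:
import re

text = 'سأذهب إلى المدرسة'  # hypothetical sample containing hamza variants such as أ and إ
print(re.findall('[\u0627-\u064a]+', text))  # the hamza variants break the matches apart
print(re.findall('[\u0600-\u06FF]+', text))  # the whole Arabic block matches each word intact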

How to remove accent in Python 3.5 and get a string with unicodedata or other solutions?

I am trying to get a string to use in the Google Geocoding API. I've checked a lot of threads but I am still facing a problem and I don't understand how to solve it.
I need addresse1 to be a string without any special characters. addresse1 is, for example: "32 rue d'Athènes Paris France".
addresse1= collect.replace(' ','+').replace('\n','')
addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')
Here I got a string without any accents... Oh no... it is not a string but bytes. So I did what was suggested and decoded:
addresse1=addresse1.decode('utf-8')
But then addresse1 is exactly the same as at the beginning... What do I have to do? What am I doing wrong? Or what don't I understand about unicode? Or is there a better solution?
Thanks,
Stéphane.
With the 3rd-party package unidecode:
>>> import unidecode
>>> unidecode.unidecode("32 rue d'Athènes Paris France")
"32 rue d'Athenes Paris France"
addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')
You probably meant .encode('ascii', 'ignore'), to remove non-ASCII characters. UTF-8 contains all characters, so encoding to it doesn't get rid of any, and an encode-decode cycle with it is a no-op.
is there a better solution?
It depends what you are trying to do.
If you only want to remove diacritical marks and not lose all other non-ASCII characters, you could read unicodedata.category for each character after NFKD-normalising and remove those in category M.
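A minimal sketch of that idea, assuming a unicode input string:
import unicodedata

def strip_marks(s):
    # Decompose, then drop combining marks (Unicode categories starting with "M")
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if not unicodedata.category(c).startswith('M'))

print(strip_marks(u"32 rue d'Athènes Paris France"))  # 32 rue d'Athenes Paris France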
If you want to transliterate to ASCII that becomes a language-specific question that requires custom replacements (for example in German ö becomes oe, but not in Swedish).
If you just want to fudge a string into ASCII because having non-ASCII characters in it causes some code to break, it is of course much better to fix that code to work properly with all Unicode characters than to mangle good data. The letter è is not encodable in ASCII, but neither are 99.9989% of all characters so that hardly makes it “special”. Code that only supports ASCII is lame.
The Google Geocoding API can work with Unicode perfectly well so there is no obvious reason you should need to do any of this.
ETA:
url2= 'maps.googleapis.com/maps/api/geocode/json?address=' + addresse1 ...
Ah, you need to URL-encode any data you inject into a URL. That's not just for Unicode: the above will break for many ASCII punctuation symbols too. Use urllib.quote to encode a single string, or urllib.urlencode to convert multiple parameters:
params = dict(
    address=address1.encode('utf-8'),
    key=googlekey
)
url2 = '...?' + urllib.urlencode(params)
(In Python 3 it's urllib.parse.quote and urllib.parse.urlencode, and they choose UTF-8 automatically so you don't have to encode manually there.)
data2 = urllib.request.urlopen(url2).read().decode('utf-8')
data3=json.loads(data2)
json.loads accepts byte strings, so you should be safe to omit the UTF-8 decode. Anyway, json.load will read directly from a file-like object, so you shouldn't have to load the data into a string at all:
data3 = json.load(urllib.request.urlopen(url2))
Generally, there are two approaches: (1) regular expressions and (2) str.translate.
1) regular expressions
Decompose the string, then remove characters from the Unicode block \u0300-\u036f (combining diacritical marks):
import unicodedata
import re
word = unicodedata.normalize("NFD", word)
word = re.sub("[\u0300-\u036f]", "", word)
It removes accents, circumflexes, diaereses, and so on:
pingüino > pinguino
εἴκοσι εἶσι > εικοσι εισι
For some languages, it could be another block, such as [\u0559-\u055f] for Armenian script.
2) str.translate
First, create a replacement table (case-sensitive) and then apply it:
repl = str.maketrans(
    "áéúíó",
    "aeuio"
)
word.translate(repl)
Multi-character replacements are made as follows:
repl = {
    ord("æ"): "ae",
    ord("œ"): "oe",
}
word.translate(repl)
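A quick check of both tables (a minimal sketch):
repl_single = str.maketrans("áéúíó", "aeuio")
print("canción".translate(repl_single))  # cancion

repl_multi = {ord("æ"): "ae", ord("œ"): "oe"}
print("œuvre".translate(repl_multi))     # oeuvre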
I had a similar problem where I was generating tags that users might have to type on their phone.
Without using 3rd-party packages you can simplify bobince's answer above:
collect = "32 rue d'Athènes Paris France"
unicode_collect = unicodedata.normalize('NFD', collect)
address1 = unicode_collect.encode('ascii', 'ignore').decode('utf-8')
address1:
"32 rue d'Athenes Paris France"
You can use the translate() method from Python.
Here's an example copied from tutorialspoint.com:
#!/usr/bin/python
from string import maketrans  # required for the maketrans function (Python 2)

intab = "aeiou"
outtab = "12345"
trantab = maketrans(intab, outtab)

s = "this is string example....wow!!!"
print s.translate(trantab)
This outputs:
th3s 3s str3ng 2x1mpl2....w4w!!!
So you can define which characters you wish to replace more easily than with replace().
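Note that the snippet above is Python 2; in Python 3, maketrans moved to a static method on str. A minimal sketch of the equivalent:
trantab = str.maketrans("aeiou", "12345")
print("this is string example....wow!!!".translate(trantab))
# th3s 3s str3ng 2x1mpl2....w4w!!!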

Extracting URLs from XML in Python

I read this thread about extracting url's from a string. https://stackoverflow.com/a/840014/326905
Really nice; I got all URLs from an XML document containing http://www.blabla.com with
>>> s = '<link href="http://www.blabla.com/blah" />
<link href="http://www.blabla.com" />'
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']
But I can't figure out how to customize the regex to omit the double quote at the end of the URL.
First I thought that this is the clue:
re.findall(r'(https?://\S+\")', s)
or this
re.findall(r'(https?://\S+\Z")', s)
but it isn't.
Can somebody help me out and tell me how to omit the double quote at the end?
Btw, the question mark after the "s" of https means the "s" can occur or not occur. Am I right?
>>>from lxml import html
>>>ht = html.fromstring(s)
>>>ht.xpath('//a/#href')
['http://www.blabla.com/blah', 'http://www.blabla.com']
You're already using a character class (albeit a shorthand version). I might suggest modifying the character class a bit so that you don't need a lookahead. Simply add the quote as part of the character class:
re.findall(r'(https?://[^\s"]+)', s)
This still says "one or more characters that are not whitespace," but with the addition of not allowing double quotes either. So the overall expression is "one or more characters that are neither whitespace nor a double quote."
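A quick check against the question's sample string (hedged sketch):
import re

s = '<link href="http://www.blabla.com/blah" /> <link href="http://www.blabla.com" />'
print(re.findall(r'(https?://[^\s"]+)', s))
# ['http://www.blabla.com/blah', 'http://www.blabla.com']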
You want the double quotes to appear as a look-ahead:
re.findall(r'(https?://\S+)(?=\")', s)
This way they won't appear as part of the match. Also, yes, the ? means the preceding character is optional.
See example here: http://regexr.com?347nk
I used to extract URLs from text through this piece of code:
url_rgx = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# convert string to lower case
text = text.lower()
matches = re.findall(url_rgx, text)
# patch the 'http://' part if it is missed
urls = ['http://%s'%url[0] if not url[0].startswith('http') else url[0] for url in matches]
print urls
It works great!
Thanks. I just read this https://stackoverflow.com/a/13057368/326905 and checked out this, which also works:
re.findall(r'"(https?://\S+)"', urls)

How to remove \xa0 from string in Python?

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?
I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):
EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?
\xa0 is actually non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.
string = string.replace(u'\xa0', u' ')
When you .encode('utf-8'), it encodes the unicode to UTF-8, which means each code point can be represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.
Read up on http://docs.python.org/howto/unicode.html.
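For example, a quick Python 2 console sketch:
>>> u'\xa0'.encode('utf-8')
'\xc2\xa0'
>>> u'foo\xa0bar'.replace(u'\xa0', u' ')
u'foo bar'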
Please note: this answer is from 2012; Python has moved on, and you should be able to use unicodedata.normalize now.
There are many useful things in Python's unicodedata library. One of them is the .normalize() function.
Try:
new_str = unicodedata.normalize("NFKD", unicode_str)
Replace NFKD with any of the other methods listed in the link above if you don't get the results you're after.
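For instance, a minimal sketch (the sample string is assumed):
import unicodedata

unicode_str = u'Dear Parent,\xa0This is a test message.'
new_str = unicodedata.normalize("NFKD", unicode_str)
print(repr(new_str))  # the \xa0 becomes a plain space: u'Dear Parent, This is a test message.'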
After trying several methods, to summarize, this is how I did it. Following are two ways of avoiding/removing \xa0 characters from a parsed HTML string.
Assume we have our raw HTML as follows:
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
So let's try to clean this HTML string:
from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print text_string
#u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'
The above code produces these \xa0 characters in the string. To remove them properly, we can use one of two methods.
Method # 1 (Recommended):
The first one is BeautifulSoup's get_text method with the strip argument set to True.
So our code becomes:
clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks
Method # 2:
The other option is to use Python's unicodedata library:
import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD",text_string)
print clean_text
# u'Dear Parent,This is a test message,kindly ignore it.Thanks'
I have also detailed these methods in this blog which you may want to refer to.
Try using .strip() at the end of your line; line.strip() worked well for me.
Try this:
string = string.replace('\xa0', ' ')
I ran into this same problem pulling some data from a sqlite3 database with python. The above answers didn't work for me (not sure why), but this did: line = line.decode('ascii', 'ignore') However, my goal was deleting the \xa0s, rather than replacing them with spaces.
I got this from this super-helpful unicode tutorial by Ned Batchelder.
Try this code:
import re
re.sub(r'[^\x00-\x7F]+', '', 'paste your string here').decode('utf-8', 'ignore').strip()
Python recognizes it as a space character, so you can split without arguments and join with a normal space:
line = ' '.join(line.split())
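For example, a quick sketch (Python 3, where strings are unicode):
line = 'foo\xa0bar  baz'
print(' '.join(line.split()))  # foo bar baz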
I ended up here while googling for the problem with a non-printable character. I use MySQL UTF-8 general_ci and deal with the Polish language. For problematic strings I have to proceed as follows:
text = text.replace('\xc2\xa0', ' ')
It is just a fast workaround and you probably should try something with the right encoding setup.
In Beautiful Soup, you can pass get_text() the strip parameter, which strips whitespace from the beginning and end of the text. This will remove \xa0 or any other whitespace if it occurs at the start or end of the string. Beautiful Soup replaced the \xa0 with an empty string, and this solved the problem for me.
mytext = soup.get_text(strip=True)
It's the equivalent of a space character, so strip it
print(string.strip())  # no more \xa0
0xA0 (Unicode) is 0xC2A0 in UTF-8. .encode('utf8') will just take your Unicode 0xA0 and replace it with UTF-8's 0xC2A0. Hence the appearance of the 0xC2s... Encoding is not replacing, as you've probably realized by now.
You can try string.strip()
It worked for me! :)
Generic version with a regular expression (it will remove all literal \xNN escape sequences from the text):
import re

def remove_control_chars(s):
    # Removes literal escape sequences such as \xa0 that remain in the text
    return re.sub(r'\\x..', '', s)
This is how I solved this issue, as I encountered \xa0 in an HTML-encoded string.
I discovered that a non-breaking space is inserted to ensure that a word and subsequent HTML markup are not separated due to resizing of a page.
This presents a problem for the parsing code, as it introduces codec encoding issues. What made it hard was that we are not privy to the encoding used. From Windows machines it can be Latin-1 or CP1252 (Western ISO), but more recent OSes have standardized on UTF-8. By normalizing the unicode data, we strip \xa0:
my_string = unicodedata.normalize('NFKD', my_string).encode('ASCII', 'ignore')

How do I get a regular expression to recognize non-ASCII characters as letters?

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.
My problem is that when I print the information the öäå are gone.
I'm extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.
So if I'm correct, then the regexes are removing the öäå, right? I mean, the only thing that should be left of the hex char after the regex is the x, but there are no x's in place of öäå on my page, so this little theory may not be correct. Anyway, right or wrong, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (I don't know if that helps in solving the problem).
EDIT: The encoding on the Swedish site is utf-8 and the encoding on my site is also utf-8.
EDIT2: You can use ISO-8859-10 for Swedish, but according to Google Chrome the encoding is Unicode (utf-8) on this specific site.
Always work in unicode and only convert to an encoded representation when necessary.
For this particular situation, you also need to use the re.U flag so \w matches unicode letters:
#coding: utf-8
import re
location = "öäå".decode('utf-8')
location = re.sub(r'([^\w])+', '', location, flags=re.U)
print location # prints öäå
It would help if you could dump the strings before and after each step.
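A hedged note: in Python 3, strings are unicode and \w matches Unicode letters by default, so the re.U flag is only needed on Python 2. A quick sketch:
import re

# Python 3: \w is Unicode-aware by default
print(re.sub(r'([^\w])+', '', 'ö ä å!'))  # öäå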
Check your value of re.UNICODE first; see this.
