Python URL Characters - python

I am really new to Python and coding in general, but I have been making some good strides.
I am able to pull some data off of the web through an API, and the result should be a string. What I am seeing, though, are some instances such as "& amp;" and " &quot". (I modified the character sets so they would print properly to the screen.)
I figure there is a way to clean this string and remove the characters so that it looks like it does on a computer screen. I tried searching for URL decoding, but admittedly I don't even know if that is the solution.
Any help on how to remove these "extra" characters and produce a readable string will be greatly appreciated!
Many thanks in advance,
Brock

xml.sax.saxutils.unescape(data[, entities]): Unescape '&amp;', '&lt;', and '&gt;' in a string of data.
You can unescape other strings of data by passing a dictionary as the optional entities parameter. The keys and values must all be strings; each key will be replaced with its corresponding value. '&amp;', '&lt;', and '&gt;' are always unescaped, even if entities is provided.
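As a minimal sketch of the above, assuming Python 3 (the question does not state a version): the standard library's html.unescape decodes every named HTML entity, including &quot;, while xml.sax.saxutils.unescape needs &quot; passed in through the entities dictionary. The sample string is made up for illustration.

```python
import html
from xml.sax.saxutils import unescape

# Hypothetical sample of what the API might return.
raw = "Tom &amp; Jerry said &quot;hi&quot;"

# html.unescape handles all named HTML entities, including &quot;.
clean_html = html.unescape(raw)

# saxutils.unescape only knows &amp;, &lt; and &gt; unless given extra entities.
clean_sax = unescape(raw, {"&quot;": '"'})

print(clean_html)
print(clean_sax)
```

If the data is HTML rather than XML, html.unescape is probably the simpler choice, since no entity dictionary has to be maintained.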

Related

Hidden characters in integer-like string

I scraped data about fundraising from the web and put it into a table.
As I start to clean the data, I see that some elements, for instance "2 000000", are read as "2\xa0000000" by the machine.
1/ What does that mean?
2/ How can I remove it? (I want to convert the whole column to integers.)
Best,
To fix a DataFrame column, use:
df['col'] = df['col'].str.replace(r'\D', '', regex=True).astype(int)
The issue is that you have escape sequences read in as Unicode characters in the string. The easiest way to remove those characters, without calling replace for each specific occurrence, is to use the unicodedata module.
Specifically:
from unicodedata import normalize
string1 = "2\xa0000000"
new_string = normalize('NFKD', string1)
print(new_string)
Output:
2 000000
unicodedata is part of the Python standard library, so there is nothing to install. I find this approach better because the normalization handles many kinds of formatting at once, so you do not need a separate replace for each badly formatted character you come across.
It's an escape sequence.
The character with hex code A0 is a non-breaking space, so in most cases you can treat it as a space. In my experience, it mostly comes up when processing data generated by Microsoft Office products, or data from the web where people have put HTML markup in it.
Unfortunately, split() on a Python 2 byte string (for example; I don't know how you process your data) will not treat it as whitespace, although str.split() with no arguments in Python 3 does. Since it is just a distinct character, you can solve the issue with:
longstring.replace('\xA0', ' ').split()
PS: Reading your question again, it seems the character should be dropped entirely so that the value becomes the number two million. In that case, replace '\xA0' with an empty string instead.
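Putting the two answers together, here is a minimal sketch for a single value, assuming Python 3 and non-negative whole numbers (a minus sign or decimal point would be dropped). The helper name to_int is hypothetical.

```python
import unicodedata

def to_int(text):
    # NFKD turns the non-breaking space (U+00A0) into a plain space...
    normalized = unicodedata.normalize("NFKD", text)
    # ...and keeping only ASCII digits removes that space and any other separators.
    digits = "".join(ch for ch in normalized if "0" <= ch <= "9")
    return int(digits)

value = to_int("2\xa0000000")
print(value)
```

For a whole column, the same function could be applied element by element, for example with a list comprehension.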

Replacing strings in a text and ignoring certain parts

I found many programs online to replace text in a string or file with words prescribed in a dictionary. For example, https://www.daniweb.com/programming/software-development/code/216636/multiple-word-replace-in-text-python
But I was wondering how to get the program to ignore certain parts of the text. For instance, I would like it to ignore parts that are ensconced within say % signs (%Please ignore this%). Better still, how do I get it to ignore the text within but remove the % sign at the end of the run.
Thank you.
This could very easily be done with regular expressions, although they may not be supported by any online programs you find. You will probably need to write something yourself and then use regexes as your dict's search keys.
A good place to start playing around with regex is: http://regexr.com
In the replacement dictionary, add an entry for each word you want ignored. For example, have teh replaced with the, but %teh% replaced with teh. For the program in the link you could have
wordDic = {
'booster': 'rooster',
'%booster%': 'booster'
}
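The regex approach mentioned above could look roughly like the following sketch. It is not the linked program; the function name replace_outside_percent and the sample sentence are made up, and it assumes the % signs always come in pairs. A single pattern matches either a %...% span or a dictionary word, so protected text is passed through with only its % signs removed.

```python
import re

def replace_outside_percent(text, word_map):
    # Longest words first, so a longer key wins over a shorter one it starts with.
    words = "|".join(re.escape(w) for w in sorted(word_map, key=len, reverse=True))
    # Group 1: text protected by % signs. Group 2: a whole word from the dictionary.
    pattern = re.compile(r"%([^%]*)%|\b(" + words + r")\b")

    def swap(match):
        if match.group(1) is not None:
            # Protected span: keep the text, drop the surrounding % signs.
            return match.group(1)
        return word_map[match.group(2)]

    return pattern.sub(swap, text)

word_map = {"booster": "rooster"}
result = replace_outside_percent("The booster crowed, but %the booster% stayed.", word_map)
print(result)
```

Because the protected span is consumed by the same pattern, words inside it are never seen by the dictionary branch, so there is no need for a separate '%booster%' entry per word.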

How to recognize special eol character when I see it, using Python?

I'm scraping a set of originally pdf files, using Python. Having gotten them to text, I had a lot of trouble getting the line endings out. I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not a '\n', or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by doing a call to my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it in? Is there a way I can refer to it as a code, perhaps? I can't get Python to yield what it actually IS. All I ever see if I print it, or call unicode(special_eol) is the character in its functional usage as a newline.
Please help! Thanks, and sorry if I'm missing something obvious.
To determine what specific character that is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) an ASCII-printable representation of the character:
>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
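For Python 3, a comparable sketch uses ascii(), ord() and unicodedata.name(). The helper name describe is hypothetical, and U+2028 (LINE SEPARATOR) is only an example; the question does not say which character was actually found.

```python
import unicodedata

def describe(ch):
    # ascii() escapes anything non-ASCII, so invisible characters become visible.
    escaped = ascii(ch)
    # The code point in the usual U+XXXX notation.
    code_point = "U+%04X" % ord(ch)
    # Control characters have no Unicode name, hence the fallback value.
    name = unicodedata.name(ch, "<unnamed>")
    return escaped, code_point, name

info = describe("\u2028")
print(info)
```

Once the code point is known, the character can be written as an escape in source code, for example my_str.replace('\u2028', '').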

Python/Scrapy question: How to get cleaner results?

My task for a project is to data mine a website for specific names. My experience with python isn't high. When I scraped all the names, they come out in this format:
[u'Bob Joe']
[u'Tim Tom']
[u'Anne Frank']
[u'superman']
How can I clean up these values? What does the 'u' signify? Is my xpath wrong? Would I have to clean it up in a scrapy pipeline (I'd like to avoid this)? I just want the names and not the extra junk around it.
In Python 2, the 'u' prefix indicates that it's a Unicode string. [u'Bob Joe'] is a list containing a Unicode string.
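Since each result is a one-element list, one way to get just the names is to take the first element of each list. This is a plain-Python sketch using the values from the question; the variable names are made up. Inside a spider, Scrapy selectors also offer extract_first(), which returns the first match as a string instead of a list.

```python
# Results as shown in the question: each one is a list holding one string.
rows = [[u'Bob Joe'], [u'Tim Tom'], [u'Anne Frank'], [u'superman']]

# Take the first element of each non-empty list to get the bare string.
names = [row[0] for row in rows if row]

for name in names:
    print(name)
```

Printing the string itself, rather than the list that contains it, is what makes the brackets, quotes and u prefix disappear.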

How to compare unicode strings with entity ref to non-unicode string

I am evaluating hundreds of thousands of html files. I am looking for particular parts of the files. There can be small variations in the way the files were created.
For example, in one file I can have a section heading (after I converted it to upper case, then split and joined the text to get rid of possibly inconsistent white space):
u'KEY1A\x97RISKFACTORS'
In another file I could have:
'KEY1ARISKFACTORS'
I am trying to create a dictionary of possible responses, and I want to compare these two and conclude that they are equal. But every substitution I try to run on the first string to remove the '\x97' does not seem to work.
There are a fair number of variations of keys with various representations of entities so I would really like to create a dictionary more or less automatically so I have something like:
key_dict={u'KEY1A\x97RISKFACTORS':'KEY1ARISKFACTORS','KEY1ARISKFACTORS':'KEY1ARISKFACTORS',. . .}
I am assuming that since when I run
S1='A'
S2=u'A'
S1==S2
I get
True
I should be able to compare these once the html entities are handled
What I specifically tried to do is
new_string=u'KEY1A\x97RISKFACTORS'.replace('|','')
I got an error
Sorry, I have been at this since last night. SLott pointed out something, and I see I used the wrong label. I hope this makes more sense.
You are correct that if S1='A' and S2 = u'A', then S1 == S2. Instead of assuming this though, you can do a simple test:
key_dict= {u'A':'Value1',
'A':'Value2'}
print key_dict
print u'A' == 'A'
This outputs:
{u'A': 'Value2'}
True
That resolved, let's look at:
new_string=u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace('|','')
There's a problem here: \x97 is the value you're trying to replace in the target string. However, your search string is '|', which is hex value 0x7C (in both ASCII and Unicode) and clearly not the value you need to replace. Even if the target and search string were both ASCII or both Unicode, you still would not find the '\x97'. The second problem is that you are trying to search for a non-Unicode string in a Unicode string. The easiest solution, and the one that makes the most sense, is to simply search for u'\x97':
print u'KEY1A\x97DEMOGRAPHICRESPONSES'
print u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace(u'\x97', u'')
Outputs:
KEY1A\x97DEMOGRAPHICRESPONSES
KEY1ADEMOGRAPHICRESPONSES
Why not the obvious .replace(u'\x97','')? Where does the idea of that '|' come from?
>>> s = u'KEY1A\x97DEMOGRAPHICRESPONSES'
>>> s.replace(u'\x97', '')
u'KEY1ADEMOGRAPHICRESPONSES'
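To build the dictionary more or less automatically, as the question asks, one option is to reduce every heading to a canonical key instead of replacing each stray character by hand. This is a sketch in Python 3 under the assumption that the meaningful part of a heading is only ASCII letters and digits; normalize_key and the third sample variant are made up for illustration.

```python
import re

def normalize_key(heading):
    # Uppercase first, then drop everything except A-Z and 0-9.
    # This removes '\x97', whitespace, dashes and any other separators.
    return re.sub(r"[^A-Z0-9]", "", heading.upper())

variants = ["KEY1A\x97RISKFACTORS", "KEY1ARISKFACTORS", "Key 1A - Risk Factors"]

# Map every variant seen in the files to its canonical form.
key_dict = {variant: normalize_key(variant) for variant in variants}
print(key_dict)
```

With this, two headings can be compared directly through normalize_key(a) == normalize_key(b), without listing every variant in advance.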
