Python/Scrapy question: How to get cleaner results? - python

My task for a project is to data mine a website for specific names. I don't have much experience with Python. When I scraped all the names, they came out in this format:
[u'Bob Joe']
[u'Tim Tom']
[u'Anne Frank']
[u'superman']
How can I clean up these values? What does the 'u' signify? Is my xpath wrong? Would I have to clean it up in a scrapy pipeline (I'd like to avoid this)? I just want the names and not the extra junk around them.

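In Python 2, the 'u' prefix indicates a Unicode string, so [u'Bob Joe'] is a list containing one Unicode string. The brackets and the prefix show up only because you are printing the list itself; take the element out of the list (for example with [0]) and you get the bare name.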
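A minimal sketch of that idea, assuming the scraped values are one-element lists like the ones in the question (the `scraped` data below is made up for illustration). Scrapy selectors also offer extract_first() (get() in newer releases), which returns a single string instead of a list, but this example uses plain Python only:

```python
# Hypothetical data shaped like the output shown in the question.
scraped = [[u'Bob Joe'], [u'Tim Tom'], [u'Anne Frank'], [u'superman']]

# Take the first element of each non-empty list to get the bare string.
names = [row[0] for row in scraped if row]

for name in names:
    print(name)  # prints the name without brackets, quotes, or the u prefix
```

No pipeline is needed for this; the unpacking can happen right where the value is extracted.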

Related

Hidden characters in integer-like string

I scraped data about fundraising from the web and put it into a table.
As I start to clean the data, I see that some elements, for instance "2 000000", are read as "2\xa0000000" by the machine.
1/ What does that mean?
2/ How can I remove it? (I want to convert the whole column to integers.)
Best,
To fix a DataFrame column, use:
df['col'] = df['col'].str.replace(r'\D', '', regex=True).astype(int)
The issue is that the string contains Unicode characters that are displayed as escape sequences. The easiest way to deal with them without calling replace for each specific one is the unicodedata module.
Specifically:
from unicodedata import normalize
string1 = "2\xa0000000"
new_string = normalize('NFKD', string1)
print(new_string)
Output:
2 000000
unicodedata is part of the Python standard library, so there is nothing to install. I find this better because the normalization handles many kinds of formatting characters, so you do not need a separate replace each time you run into another one.
It's an escape sequence: the character with hex code A0 is a non-breaking space. You can treat it as a space in most cases. In my experience, it mostly comes up when processing data generated by Microsoft Office products, or data from the web where people have used the HTML entity for it.
Unfortunately, in Python 2 the split() method of a byte string (for example; I don't know how you process your data) may not treat it as whitespace, although str.split() with no arguments in Python 3 does. Since it is just a distinct character, you can handle it explicitly with:
longstring.replace('\xA0', ' ').split()
PS: Reading your question again, it seems the character should be dropped entirely so that the value becomes the number two million. In that case, replace '\xA0' with an empty string instead.
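Putting the answers above together, here is a small sketch for a single value, assuming Python 3 and the sample string from the question. It shows that normalization alone still leaves a plain space, so the non-digit characters have to be removed before converting to int:

```python
import re
from unicodedata import normalize

raw = "2\xa0000000"  # sample value from the question

# NFKD turns the non-breaking space into an ordinary space.
normalized = normalize("NFKD", raw)

# Removing every non-digit character leaves something int() accepts.
digits_only = re.sub(r"\D", "", raw)
value = int(digits_only)

print(repr(normalized))
print(value)
```

Stripping all non-digits would also remove a minus sign or a decimal separator, so this only fits columns of non-negative whole numbers.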

How to recognize special eol character when I see it, using Python?

I'm scraping a set of originally pdf files, using Python. Having gotten them to text, I had a lot of trouble getting the line endings out. I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not a '\n', or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by doing a call to my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it in? Is there a way I can refer to it as a code, perhaps? I can't get Python to yield what it actually IS. All I ever see if I print it, or call unicode(special_eol) is the character in its functional usage as a newline.
Please help! Thanks, and sorry if I'm missing something obvious.
To determine what specific character that is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) an ASCII-printable representation of the character:
>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
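For Python 3, a rough equivalent could look like the sketch below. The character U+2028 (LINE SEPARATOR) is only an assumed stand-in for the unknown separator; substitute whatever character you have isolated:

```python
import unicodedata

# Hypothetical stand-in for the isolated character; the real one may differ.
special_eol = "\u2028"

escaped = ascii(special_eol)        # ASCII-only representation of the string
code_point = hex(ord(special_eol))  # numeric code point as hex
# The second argument is a fallback for characters that have no name.
name = unicodedata.name(special_eol, "<unnamed>")

print(escaped, code_point, name)
```

Once the code point is known, the character can be written in source code as an escape such as "\u2028" rather than kept in memory.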

How do I get a regular expression to recognize non-ASCII characters as letters?

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.
My problem is that when I print the information the öäå are gone.
I'm extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.
So if I'm correct, the regexes are removing the öäå, right? That would mean the only thing left of the hex char after the regex is the x, but there is no x in place of öäå on my page, so maybe this little theory is not correct. Either way, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (I don't know if that helps in solving the problem).
EDIT: The encoding on the Swedish site is utf-8 and the encoding on my site is also utf-8.
EDIT2: You can use ISO-8859-10 for Swedish, but according to google chrome the encoding is Unicode(utf-8) on this specific site
Always work in unicode and only convert to an encoded representation when necessary.
For this particular situation, you also need to use the re.U flag so \w matches unicode letters:
#coding: utf-8
import re
location = "öäå".decode('utf-8')
location = re.sub(r'([^\w])+', '', location, flags=re.U)
print location # prints öäå
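The snippet above is Python 2. As a sketch of the same idea in Python 3, where str is already Unicode and \w matches Unicode letters by default (the sample text is made up for illustration):

```python
import re

# Hypothetical sample text containing Swedish letters.
location = "Malmö, Västerås!"

# With a str pattern, \w is Unicode-aware by default in Python 3.
unicode_aware = re.sub(r"[^\w]+", "", location)

# re.ASCII restricts \w to [a-zA-Z0-9_], which reproduces the original problem.
ascii_only = re.sub(r"[^\w]+", "", location, flags=re.ASCII)

print(unicode_aware)
print(ascii_only)
```

The second call shows what the symptom looks like: the non-ASCII letters are stripped along with the punctuation.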
It would help if you could dump the strings before and after each step.
Check your value of re.UNICODE first.

Wildcards in a Search and Replace Dictionary

I am fairly new to Python so if my terminology is wrong I apologize. I am running Python 2.6.5, I am not sure about updating to 3.0 since Python was initially downloaded with my spatial analysis software.
I am writing a program to search and replace column headers in multiple comma delimited text files. Since there are over a hundred headers and they are the same in all the files I decided to create a dictionary and 'pickle' to save all the replacements (got the idea from reading other posts). My issue comes in when I noticed there are tabs and spaces within the text file column headings, for example:
..."Prev Roll #: ","Prev Prime/Sub","Frontage : ","Depth : ","Area : ","Unit of Measure : ",...
So I thought why not just stick in a wildcard at the end of my key term so the search will match it no matter how many spaces are dividing the name and the colon. I was trying the * wildcard, but it doesn't work, when I run it no matches/replacements are made. Am I using the wildcard correctly? Is what I'm trying to do even possible? Or should I do away with the dictionary pickle?
Below is a sample of what I'm trying to do
import cPickle as pickle
general_D = { ....
"Prev Prime/Sub" : "PrvPrimeSub",
"Frontage*" : "Frontage",
"Depth*" : "Depth",
"Area*" : "Area",
"Unit of Measure*" : "UnitMeasure",
Thanks for the input!
Use the csv module to parse and write your comma-separated data.
Use the string strip() method to remove unwanted spaces and tabs.
Do not include * in your dict key names. They will not glob as you hope. They just represent literal *s there.
It is probably better to use json instead of pickle. JSON is human-readable, independent of programming language. Pickle may have problems even across different versions of Python.
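The points above could be combined roughly as follows. This is a sketch written for Python 3 (the asker is on 2.6.5, where io.StringIO would need unicode input); the `replacements` table, `clean_header`, and `rename_headers` are hypothetical names, and the sample line is adapted from the question:

```python
import csv
import io
import json

# Hypothetical replacement table; keys are the cleaned header names, no wildcards.
replacements = {
    "Prev Prime/Sub": "PrvPrimeSub",
    "Frontage": "Frontage",
    "Depth": "Depth",
    "Area": "Area",
    "Unit of Measure": "UnitMeasure",
}

def clean_header(header):
    # Drop trailing colons, spaces, and tabs, then any leading whitespace.
    return header.rstrip(" :\t").strip()

def rename_headers(headers, table):
    # Headers without an entry in the table are kept in their cleaned form.
    cleaned = [clean_header(h) for h in headers]
    return [table.get(h, h) for h in cleaned]

sample = '"Prev Roll #: ","Prev Prime/Sub","Frontage : ","Depth\t: ","Area : ","Unit of Measure : "\n'
headers = next(csv.reader(io.StringIO(sample)))
new_headers = rename_headers(headers, replacements)

# The table round-trips through JSON, so it can be stored as a readable file.
restored = json.loads(json.dumps(replacements))
print(new_headers)
```

Normalizing the header before the lookup replaces the wildcard: the dictionary keys stay exact strings, and the variable spacing is handled in one place.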

Python URL Characters

I am really new to Python and coding in general, but I have been making some good strides.
I am able to pull some data off of the web through an API, and the result should be a string. What I am seeing, though, are some instances such as "& amp;" and "& quot;". (I modified the character sets so they would print properly to the screen.)
I figure there is a way to clean this string and remove the characters so that it looks like it does on a computer screen. I tried searching for urldecoding, but admittedly I don't even know if that is the solution.
Any help on how to remove these "extra" characters and produce a readable string will be greatly appreciated!
Many thanks in advance,
Brock
xml.sax.saxutils.unescape(data[, entities]): Unescape '&amp;', '&lt;', and '&gt;' in a string of data.
You can unescape other strings of data by passing a dictionary as the optional entities parameter. The keys and values must all be strings; each key will be replaced with its corresponding value. '&amp;', '&lt;', and '&gt;' are always unescaped, even if entities is provided.
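A short sketch of how this might be applied, assuming Python 3 and a made-up input string. It also shows html.unescape from the standard library, an alternative not mentioned in the answer that handles '&quot;' and other named entities without an extra dictionary:

```python
from html import unescape as html_unescape
from xml.sax.saxutils import unescape as sax_unescape

# Hypothetical API response containing escaped entities.
raw = "Tom &amp; Jerry say &quot;hi&quot;"

# By default only &amp;, &lt;, and &gt; are handled.
sax_default = sax_unescape(raw)

# Additional entities can be supplied through the optional dictionary.
sax_extra = sax_unescape(raw, {"&quot;": '"'})

# html.unescape knows the full set of HTML named entities.
html_all = html_unescape(raw)

print(sax_default)
print(sax_extra)
print(html_all)
```

These are HTML/XML entities rather than URL encoding, which is why searching for urldecoding did not turn up a fix.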
