I have a task to check whether a page is in English, and the following code fails with the error:
expected string or bytes-like object
train = pd.read_csv('UDSclub/wikipedia_train.csv')
def get_language(page):
res = re.search('[a-z][a-z].wikipedia.org', page)
if res:
return res.group(0)[0:2]
return 'na'
train['language'].apply(get_language)
I am trying to create a new column in the DataFrame that marks whether the page is in English or not.
Try this:
train = pd.read_csv('UDSclub/wikipedia_train.csv')

def get_language(x):
    # escape the dots so they are matched literally
    res = re.search(r'[a-z][a-z]\.wikipedia\.org', x['page'])
    if res:
        language = res.group(0)[0:2]
    else:
        language = 'na'
    x['language'] = language
    return x

train = train.apply(get_language, axis=1)
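Alternatively, since the goal is just to flag English pages, a vectorized version avoids apply entirely. This is only a sketch, assuming the URLs live in the 'page' column as above; the is_english name is just an example:

# pull the two-letter language code straight out of the URL; rows without a match become 'na'
train['language'] = train['page'].str.extract(r'([a-z]{2})\.wikipedia\.org', expand=False).fillna('na')
# boolean column marking English pages
train['is_english'] = train['language'] == 'en'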
I'm trying to apply a preprocessor that removes HTML tags from the fields of a TensorFlow dataset. I use this in an implementation of t5, which uses seqio.
In my_preprocessor, if I use text = row[field_index], it works fine: during training it prints several rows and continues. But if I use text = normalize_text(row[field_index]), it hangs in a long or seemingly infinite loop, as if it were trying to apply the function to every row of the whole dataset!
def normalize_text(text):
"""Lowercase and remove quotes and tags from a TensorFlow string."""
text = tf.strings.lower(text)
text = tf.strings.regex_replace(text,"'(.*)'", r"\1")
text = tf.strings.regex_replace(text,"<[^>]+>", " ")
return text
def make_add_field_names_preprocessor(
field_names: Sequence[str], field_indices: Optional[Sequence[int]] = None,
) -> Callable:
def my_preprocessor(ds):
def to_inputs_and_targets(*row):
ret = {}
count=0
tf.print("=======================")
for field_name, field_index in zip(field_names, field_indices):
# if I use this, it works okay
text = row[field_index]
# if I use the following it falls to a long or endless loop
text = normalize_text(row[field_index])
tf.print(count, "Row:",text)
ret[field_name] = text
count+=1
return ret
return ds.map(to_inputs_and_targets,
num_parallel_calls=tf.data.experimental.AUTOTUNE)
return my_preprocessor
Update: The preprocessor gets two field names, "inputs" for row[1] and "targets" for row[2]. I noticed that the problem mostly occurs when I call normalize_text on row[2].
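For reference, normalize_text seems to behave as expected when called eagerly on a single scalar string (a quick standalone check with a made-up sample; the printed value is what I expect outside the pipeline, not what happens inside ds.map):

import tensorflow as tf

sample = tf.constant("<b>'Some Text'</b>")
print(normalize_text(sample))  # expect: tf.Tensor(b' some text ', shape=(), dtype=string)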
I've got some functions with SQL statements. My first function is fine because I get only one result.
My second function returns a large list of error codes and I don't know how to get them back into the response.
TypeError: <sqlalchemy.engine.result.ResultProxy object at 0x7f98b85ef910> is not JSON serializable
I've tried everything and need help.
My Code:
def topalarms():
customer_name = request.args.get('customer_name')
machine_serial = request.args.get('machine_serial')
#ts = request.args.get('ts')
#ts_start = request.args.get('ts')
if (customer_name is None) or (machine_serial is None):
return missing_param()
# def form_response(response, session):
# response['customer'] = customer_name
# response['serial'] = machine_serial
# return do_response(customer_name, form_response)
def form_response(response, session):
result_machine_id = machine_id(session, machine_serial)
if not result_machine_id:
response['Error'] = 'Seriennummer nicht vorhanden/gefunden'
return
#response[''] = result_machine_id[0]["id"]
machineid = result_machine_id[0]["id"]
result_errorcodes = error_codes(session, machineid)
response['ErrorCodes'] = result_errorcodes
return do_response(customer_name, form_response)
def machine_id(session, machine_serial):
stmt_raw = '''
SELECT
id
FROM
machine
WHERE
machine.serial = :machine_serial_arg
'''
utc_now = datetime.datetime.utcnow()
utc_now_iso = pytz.utc.localize(utc_now).isoformat()
utc_start = datetime.datetime.utcnow() - datetime.timedelta(days = 30)
utc_start_iso = pytz.utc.localize(utc_start).isoformat()
stmt_args = {
'machine_serial_arg': machine_serial,
}
stmt = text(stmt_raw).columns(
#ts_insert = ISODateTime
)
result = session.execute(stmt, stmt_args)
ts = utc_now_iso
ts_start = utc_start_iso
ID = []
for row in result:
ID.append({
'id': row[0],
'ts': ts,
'ts_start': ts_start,
})
return ID
def error_codes(session, machineid):
stmt_raw = '''
SELECT
name
FROM
identifier
WHERE
identifier.machine_id = :machineid_arg
'''
stmt_args = {
'machineid_arg': machineid,
}
stmt = text(stmt_raw).columns(
#ts_insert = ISODateTime
)
result = session.execute(stmt, stmt_args)
errors = []
for row in result:
errors.append(result)
#({'result': [dict(row) for row in result]})
#errors = {i: result[i] for i in range(0, len(result))}
#errors = dict(result)
return errors
My problem is the error_codes function; something is wrong with my result.
My output should look like this:
ABCNormal
ABCSafety
Alarm_G01N01
Alarm_G01N02
Alarm_G01N03
Alarm_G01N04
Alarm_G01N05
I think you need to take a closer look at what you are doing correctly with your working function and compare that to your non-working function.
Firstly, what do you think this code does?
for row in result:
errors.append(result)
This adds to errors one copy of the result object for each row in result. So if you have six rows in result, errors contains six copies of result. I suspect this isn't what you are looking for. You want to be doing something with the row variable.
Taking a closer look at your working function, you are taking the first value out of the row, using row[0]. So, you probably want to do the same in your non-working function:
for row in result:
errors.append(row[0])
I don't have SQLAlchemy set up so I haven't tested this: I have provided this answer based solely on the differences between your working function and your non-working function.
You need a json serializer. I suggest using Marshmallow: https://marshmallow.readthedocs.io/en/stable/
There are some great tutorials online on how to do this.
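For example, a minimal sketch of that idea (marshmallow 3 style; the schema name and helper are mine, mapping the "name" column selected in error_codes, and untested against your setup):

from marshmallow import Schema, fields

class ErrorCodeSchema(Schema):
    name = fields.Str()  # matches the "name" column returned by the query

def serialize_error_codes(rows):
    # rows: the result rows from session.execute(...)
    return ErrorCodeSchema(many=True).dump(rows)  # plain list of dicts, JSON serializable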
I'm working on a small project to retrieve information about books from the Google Books API using Python 3. For this I make a call to the API, read out the variables and store them in a list. For a search like "linkedin" this works perfectly. However, when I enter "Google", it reads the second title from the JSON input. How can this happen?
Please find my code below (Google_Results is the class I use to initialize the variables):
import requests
def Book_Search(search_term):
parms = {"q": search_term, "maxResults": 3}
r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
print(r.url)
results = r.json()
i = 0
for result in results["items"]:
try:
isbn13 = str(result["volumeInfo"]["industryIdentifiers"][0]["identifier"])
isbn10 = str(result["volumeInfo"]["industryIdentifiers"][1]["identifier"])
title = str(result["volumeInfo"]["title"])
author = str(result["volumeInfo"]["authors"])[2:-2]
publisher = str(result["volumeInfo"]["publisher"])
published_date = str(result["volumeInfo"]["publishedDate"])
description = str(result["volumeInfo"]["description"])
pages = str(result["volumeInfo"]["pageCount"])
genre = str(result["volumeInfo"]["categories"])[2:-2]
language = str(result["volumeInfo"]["language"])
image_link = str(result["volumeInfo"]["imageLinks"]["thumbnail"])
dict = Google_Results(isbn13, isbn10, title, author, publisher, published_date, description, pages, genre,
language, image_link)
gr.append(dict)
print(gr[i].title)
i += 1
except:
pass
return
gr = []
Book_Search("Linkedin")
I am a beginner to Python, so any help would be appreciated!
It does so because there is no publisher entry in volumeInfo of the first entry, thus it raises a KeyError and your except captures it. If you're going to work with fuzzy data you have to account for the fact that it will not always have the expected structure. For simple cases you can rely on dict.get() and its default argument to return a 'valid' default entry if an entry is missing.
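For example, with the structure in your snippet, something like this avoids the KeyError (the fallback value is just an illustration):

publisher = result["volumeInfo"].get("publisher", "Unknown")  # returns "Unknown" instead of raising KeyError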
Also, there are a few conceptual problems with your function: it relies on a global gr, which is bad design; it shadows the built-in dict type; and it captures all exceptions, guaranteeing that you cannot exit your code even with a SIGINT... I'd suggest you convert it to something a bit more sane:
def book_search(search_term, max_results=3):
results = [] # a list to store the results
parms = {"q": search_term, "maxResults": max_results}
r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
try: # just in case the server doesn't return valid JSON
for result in r.json().get("items", []):
if "volumeInfo" not in result: # invalid entry - missing volumeInfo
continue
result_dict = {} # a dictionary to store our discovered fields
result = result["volumeInfo"] # all the data we're interested is in volumeInfo
isbns = result.get("industryIdentifiers", None) # capture ISBNs
if isinstance(isbns, list) and isbns:
for i, t in enumerate(("isbn10", "isbn13")):
if len(isbns) > i and isinstance(isbns[i], dict):
result_dict[t] = isbns[i].get("identifier", None)
result_dict["title"] = result.get("title", None)
authors = result.get("authors", None)  # capture authors
if isinstance(authors, list) and authors:
    # join the list instead of slicing its string representation
    result_dict["author"] = ", ".join(authors)
result_dict["publisher"] = result.get("publisher", None)
result_dict["published_date"] = result.get("publishedDate", None)
result_dict["description"] = result.get("description", None)
result_dict["pages"] = result.get("pageCount", None)
genres = result.get("categories", None)  # capture genres
if isinstance(genres, list) and genres:
    result_dict["genre"] = ", ".join(genres)
result_dict["language"] = result.get("language", None)
result_dict["image_link"] = result.get("imageLinks", {}).get("thumbnail", None)
# make sure Google_Results accepts keyword arguments like title, author...
# and make them optional as they might not be in the returned result
gr = Google_Results(**result_dict)
results.append(gr) # add it to the results list
except ValueError:
return None # invalid response returned, you may raise an error instead
return results # return the results
Then you can easily retrieve as much info as possible for a term:
gr = book_search("Google")
And it will be far more tolerant of data omissions, provided that your Google_Results type makes most of the entries optional.
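If it helps, Google_Results could be as simple as a container whose fields are all optional keyword arguments. Here is a rough sketch based on the fields used above (adjust it to whatever your real class does):

class Google_Results:
    def __init__(self, title=None, author=None, publisher=None, published_date=None,
                 description=None, pages=None, genre=None, language=None,
                 image_link=None, isbn10=None, isbn13=None):
        # every field defaults to None so missing API entries don't break construction
        self.title = title
        self.author = author
        self.publisher = publisher
        self.published_date = published_date
        self.description = description
        self.pages = pages
        self.genre = genre
        self.language = language
        self.image_link = image_link
        self.isbn10 = isbn10
        self.isbn13 = isbn13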
Following @Coldspeed's recommendation, it became clear that missing information in the JSON file caused the exception to be raised. Since I only had a pass statement there, the entire result was skipped. I will therefore have to adapt the try/except blocks so errors are handled properly.
Thanks for the help guys!
I am trying to develop an Infobox parser in Python which supports all the languages of Wikipedia. The parser will get the infobox data and return it in a dictionary.
The keys of the dictionary will be the properties being described (e.g. Population, City name, etc...).
The problem is that Wikipedia has slightly different page contents for each language. But the most important thing is that the API response structure for each language can also be different.
For example, the API response for 'Paris' in English contains this Infobox:
{{Infobox French commune |name = Paris |commune status = [[Communes of France|Commune]] and [[Departments of France|department]] |image = <imagemap> File:Paris montage.jpg|275px|alt=Paris montage
and in Greek, the corresponding part for 'Παρίσι' is:
[...] {{Πόλη (Γαλλία) | Πόλη = Παρίσι | Έμβλημα =Blason paris 75.svg | Σημαία =Mairie De Paris (SVG).svg | Πλάτος Σημαίας =120px | Εικόνα =Paris - Eiffelturm und Marsfeld2.jpg [...]
In the second example, there isn't any 'Infobox' occurrence after the {{. Also, in the API response the name = Paris is not the exact translation for Πόλη = Παρίσι. (Πόλη means city, not name)
Because of such differences between the responses, my code fails.
Here is the code:
class WikipediaInfobox():
# Class to get and parse the Wikipedia Infobox Data
infoboxArrayUnprocessed = [] # Maintains the order which the data is displayed.
infoboxDictUnprocessed = {} # Still Contains Brackets and Wikitext coding. Will be processed more later...
language="en"
def getInfoboxDict(self, infoboxRaw): # Get the Infobox in Dict and Array form (Unprocessed)
if infoboxRaw.strip() == "":
return {}
boxLines = [line.strip().replace(" "," ") for line in infoboxRaw.splitlines()]
wikiObjectType = boxLines[0]
infoboxData = [line[1:] for line in boxLines[1:]]
toReturn = {"wiki_type":wikiObjectType}
for i in infoboxData:
key = i.split("=")[0].strip()
value = ""
if i.strip() != key + "=":
value=i.split("=")[1].strip()
self.infoboxArrayUnprocessed.append({key:value})
toReturn[key]=value
self.infoboxDictUnprocessed = toReturn
return toReturn
def getInfoboxRaw(self, pageTitle, followRedirect = False, resetOld=True): # Get Infobox in Raw Text
if resetOld:
infoboxDict = {}
infoboxDictUnprocessed = {}
infoboxArray = []
infoboxArrayUnprocessed = []
params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"timestamp|user|comment|content" }
params["titles"] = "%s" % urllib.quote(pageTitle.encode("utf8"))
qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
url = "http://" + self.language + ".wikipedia.org/w/api.php?%s" % qs
tree = etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')
if len(revs) == 0:
return ""
if "#REDIRECT" in revs[-1].text and followRedirect == True:
redirectPage = revs[-1].text[revs[-1].text.find("[[")+2:revs[-1].text.find("]]")]
return self.getInfoboxRaw(redirectPage,followRedirect,resetOld)
elif "#REDIRECT" in revs[-1].text and followRedirect == False:
return ""
infoboxRaw = ""
if "{{Infobox" in revs[-1].text: # -> No Multi-language support:
infoboxRaw = revs[-1].text.split("{{Infobox")[1].split("}}")[0]
return infoboxRaw
def __init__(self, pageTitle = "", followRedirect = False): # Constructor
if pageTitle != "":
self.language = guess_language.guessLanguage(pageTitle)
if self.language == "UNKNOWN":
self.language = "en"
infoboxRaw = self.getInfoboxRaw(pageTitle, followRedirect)
self.getInfoboxDict(infoboxRaw) # Now the parsed data is in self.infoboxDictUnprocessed
Some parts of this code were found on this blog...
I don't want to reinvent the wheel, so maybe someone has a nice solution for multi-language support and neat parsing of the Infobox section of Wikipedia.
I have seen many alternatives, like DBPedia or some other parsers that MediaWiki recommends, but I haven't found anything that suits my needs, yet. I also want to avoid scraping the page with BeautifulSoup, because it can fail on some cases, but if it is necessary it will do.
If something isn't clear enough, please ask. I want to help as much as I can.
Wikidata is definitely the first choice these days if you want structured data. That said, if in the future you still need to parse data out of Wikipedia articles, especially as you are using Python, I can recommend mwparserfromhell, a Python library aimed at parsing wikitext that has an option to extract templates and their attributes. It won't directly fix your issue, since the templates differ between languages, but it may be useful if you keep trying to parse wikitext.
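As a rough illustration of that template extraction (a sketch using mwparserfromhell's parse/filter_templates API; the sample wikitext is made up):

import mwparserfromhell

wikitext = "{{Infobox French commune |name = Paris |area = 105.4}}"
code = mwparserfromhell.parse(wikitext)
for template in code.filter_templates():
    print(str(template.name).strip())  # e.g. "Infobox French commune"
    for param in template.params:
        print(str(param.name).strip(), "=", str(param.value).strip())

This works the same way regardless of the template's language, so the remaining work is mapping each language's parameter names (e.g. name vs. Πόλη) to your dictionary keys.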
I am trying to gather a bunch of links using XPath, which then need to be scraped from the next page; however, I keep getting an error saying it can only parse strings. I looked at the type of lk and it was a string after I cast it, so what seems to be wrong?
def unicode_to_string(types):
try:
types = unicodedata.normalize("NFKD", types).encode('ascii', 'ignore')
return types
except:
return types
def getData():
req = "http://analytical360.com/access-points"
page = urllib2.urlopen(req)
tree = etree.HTML(page.read())
i = 0
for lk in tree.xpath('//a[@class="sabai-file sabai-file-image sabai-file-type-jpg "]//@href'):
print "Scraping Vendor #" + str(i)
trees = etree.HTML(urllib2.urlopen(unicode_to_string(lk)))
for ll in trees.xpath('//table[@id="archived"]//tr//td//a//@href'):
final = etree.HTML(urllib2.urlopen(unicode_to_string(ll)))
You should pass strings to etree.HTML(), not the file-like object returned by urllib2.urlopen.
Perhaps change the code like so:
trees = etree.HTML(urllib2.urlopen(unicode_to_string(lk)).read())
for i, ll in enumerate(trees.xpath('//table[@id="archived"]//tr//td//a//@href')):
final = etree.HTML(urllib2.urlopen(unicode_to_string(ll)).read())
Also, you don't seem to increment i.
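For the counter itself, enumerate on the outer loop would handle it; a sketch reusing the XPath from your getData (untested):

for i, lk in enumerate(tree.xpath('//a[@class="sabai-file sabai-file-image sabai-file-type-jpg "]//@href')):
    print "Scraping Vendor #" + str(i)
    trees = etree.HTML(urllib2.urlopen(unicode_to_string(lk)).read())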