I defined a class to handle blocks of tweets so I could manage them a little more easily:
import re
import pandas as pd

class twitter_block(object):
def __init__(self):
self.tweets = []
self.df = pd.DataFrame()
self.tag = ''
def load(self, data):
self.tweets = [x for x in data]
Then I defined a method as part of a pipeline:
def clean(self):
HTTP_PATTERN = '^https?:\/\/.*[\r\n]*'
AT_PATTERN = '#\w+ ?'
# take away links
self.tweets = [re.sub(HTTP_PATTERN, '', str(x), flags=re.MULTILINE) for x in self.tweets]
# take away # signs
self.tweets = [re.sub(AT_PATTERN,'',str(x)) for x in self.tweets]
But when I call this:
tweet = load_data('The_Donald.json')
block = twitter_block(tag='donald')
block.load(data=tweet)
block.clean()
block.print()
it returns the 1504 tweets that I loaded into the block object, same as before, with no links cleaned or anything. Although, actually, it does remove the # signs... but this method,
def smilecheck(self):
#save a tweet if there is a smiley there
smiley_pattern = '^(:\(|:\))+$'
for tweet in self.tweets:
if re.match(smiley_pattern, str(tweet)):
pass
else:
self.tweets.remove(tweet)
does not remove the tweets without smileys; it returns 1504 tweets, the same as I put in... Any help, guys? I'm sure this is a problem with the way I am approaching objects.
I believe the problem is that you are using re.match() instead of re.search()
You want to find tweets that contain a smiley anywhere in the text, but re.match() only matches at the beginning of the string; re.search() scans the whole string.
See python -- re.match vs re.search
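For example, smilecheck could use re.search() with an unanchored pattern, and build a new list instead of removing items from self.tweets while iterating over it (removing during iteration can silently skip elements). A sketch, keeping your class structure:
def smilecheck(self):
    # keep only tweets that contain a smiley anywhere in the text
    # (re is already imported in your module)
    smiley_pattern = r':\(|:\)'
    self.tweets = [t for t in self.tweets if re.search(smiley_pattern, str(t))]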
I'm trying to apply a preprocessor to remove HTML tags from the fields of a TensorFlow dataset. I use this in an implementation of T5, which uses seqio.
In my_preprocessor, if I use text = row[field_index], it works okay: during training it prints several rows and continues. But if I use text = normalize_text(row[field_index]), it hangs in an infinite or very long loop, as if it wants to apply it to all the rows of the dataset!
from typing import Callable, Optional, Sequence

import tensorflow as tf

def normalize_text(text):
"""Lowercase and remove quotes and tags from a TensorFlow string."""
text = tf.strings.lower(text)
text = tf.strings.regex_replace(text,"'(.*)'", r"\1")
text = tf.strings.regex_replace(text,"<[^>]+>", " ")
return text
def make_add_field_names_preprocessor(
field_names: Sequence[str], field_indices: Optional[Sequence[int]] = None,
) -> Callable:
def my_preprocessor(ds):
def to_inputs_and_targets(*row):
ret = {}
count=0
tf.print("=======================")
for field_name, field_index in zip(field_names, field_indices):
# if I use this, it works okay
text = row[field_index]
# if I use the following it falls to a long or endless loop
text = normalize_text(row[field_index])
tf.print(count, "Row:",text)
ret[field_name] = text
count+=1
return ret
return ds.map(to_inputs_and_targets,
num_parallel_calls=tf.data.experimental.AUTOTUNE)
return my_preprocessor
Update: The preprocessor gets two field names, "inputs" for row[1] and "targets" for row[2]. I noticed that the problem mainly occurs when I call normalize_text on row[2].
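For reference, normalize_text on its own produces the expected output when run eagerly on a single string (the sample below is made up, just to show what the function does):
sample = tf.constant("<p>Some 'Quoted' HTML</p>")
print(normalize_text(sample).numpy())
# b' some quoted html '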
In my Python file, I have created a class called Download. Here is the code for the class:
import requests, json, os, pytube, threading
class Download:
def __init__(self, url, json=False, get=False, post=False, put=False, unwanted="", wanted="", unwanted2="", wanted2="", unwanted3="", wanted3=""):
self.url = url
self.json = json
self.get = get
self.post = post
self.put = put
self.unwanted = unwanted
self.wanted = wanted
self.unwanted2 = unwanted2
self.wanted2 = wanted2
self.unwanted3 = unwanted3
self.wanted3 = wanted3
def downloadJson(self):
if self.get is True:
downloadJson = requests.get(self.url)
downloadJson = str(downloadJson.content)
downloadJsonS = str(downloadJson) # This saves the downloaded JSON file as string
if self.json is True:
with open("downloadedJson.json", "w") as writeDownloadedJson:
writeDownloadedJson.write(json.dumps(downloadJson))
writeDownloadedJson.close()
with open("downloadedJson.json", "r") as replaceUnwanted:
a = replaceUnwanted.read()
x = a.replace(self.unwanted, self.wanted)
# y = a.replace(self.unwanted2, self.wanted2)
# z = a.replace(self.unwanted3, self.wanted3)
print(x)
with open("downloadedJson.json", "w") as writeUnwanted:
# writeUnwanted.write(y)
# writeUnwanted.write(z)
writeUnwanted.write(x)
else:
# with open("downloadedJson.json", "w")as j:
# j.write(downloadJsonS)
# j.close()
pass
I have written all of this by myself, and I understand how it works. My objective is to remove all the unwanted characters that come in the JSON file once it is downloaded, such as \\n, \' or \n. I have many arguments in the __init__() function, like __init__(unwanted="", wanted="", unwanted2="") and so on.
This way, when passing any character to the unwanted parameter, such as \\n, it replaces every occurrence of it with a space. This part works properly. The commented-out lines are the ones I was using before, but they did not work: only the replacement from one argument would be applied.
Is there any way of applying all the unwanted-character replacements, one for each argument, using threads? If it is not possible using threads, is there any alternative?
By the way, this is the file where I am using the class (main.py):
from downloader import Download
with open("url.txt", "r")as url:
x = Download(url.read(), get=True, json=True, unwanted="\\n")
x.downloadJson()
Thanks
You could apply the replacements one after another:
x = a.replace(self.unwanted, self.wanted)
x = x.replace(self.unwanted2, self.wanted2)
x = x.replace(self.unwanted3, self.wanted3)
You could also chain the replacements together, but that would quickly become hard to read:
x = a.replace(...).replace(...).replace(...)
By the way, instead of having multiple unwantedN and wantedN arguments, it would probably be a lot easier to use a list of (unwanted, wanted) pairs, something like this:
def __init__(self, url, json=False, get=False, post=False, put=False, replacements=[]):
self.url = url
self.json = json
self.get = get
self.post = post
self.put = put
self.replacements = replacements
And then you could perform the replacements in a loop:
x = a
for unwanted, wanted in self.replacements:
x = x.replace(unwanted, wanted)
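For example, main.py could then pass all the replacements in one go (a sketch, assuming the rest of the class stays the same):
from downloader import Download

with open("url.txt", "r") as url:
    x = Download(url.read(), get=True, json=True,
                 replacements=[("\\n", " "), ("\\'", " "), ("\n", " ")])
x.downloadJson()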
I create a URL with {} placeholders and str.format() so I can change the URL on the fly.
It works totally fine on my PC.
But once I upload it and run it on Scrapinghub, one of the many substitutions (state) does not work, while the others work fine: the URL contains %7B%7D&, which is a pair of URL-encoded curly braces.
Why does this happen? What am I missing when referencing the state variable?
This is how the URL is built in my code:
def __init__(self):
self.state = 'AL'
self.zip = '35204'
self.tax_rate = 0
self.years = [2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017]
def parse_m(self, response):
r = json.loads(response.text)
models = r['models']
year = response.meta['year']
make = response.meta['make']
for model in models:
for milage in [40000,50000,60000,70000,80000,90000,100000]:
url = '****/vehicles/?year={}&make={}&model={}&state={}&mileage={}&zip={}'.format(year,make, model, self.state, milage, self.zip)
And this is the URL I see in the Scrapinghub log:
***/vehicles/?year=2010&make=LOTUS&model=EXIGE%20S&state=%7B%7D&mileage=100000&zip=35204
This is not a Scrapinghub issue; it has to be something in your code. If I do the following:
>>> "state={}".format({})
'state={}'
This would end up being
state=%7B%7D
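That literal {} then gets percent-encoded when the request URL is built, which is exactly the %7B%7D in your log. For example, with Python 3's urllib.parse (shown here just to illustrate the encoding):
>>> from urllib.parse import quote
>>> quote("{}")
'%7B%7D'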
I would add
assert type(self.state) is str
to my code to ensure this situation doesn't happen, and if it does, you get an AssertionError.
I'm working on a small project retrieving information about books from the Google Books API using Python 3. For this I make a call to the API, read out the variables, and store them in a list. For a search like "linkedin" this works perfectly. However, when I enter "Google", it reads the second title from the JSON input. How can this happen?
Please find my code below (Google_Results is the class I use to initialize the variables):
import requests
def Book_Search(search_term):
parms = {"q": search_term, "maxResults": 3}
r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
print(r.url)
results = r.json()
i = 0
for result in results["items"]:
try:
isbn13 = str(result["volumeInfo"]["industryIdentifiers"][0]["identifier"])
isbn10 = str(result["volumeInfo"]["industryIdentifiers"][1]["identifier"])
title = str(result["volumeInfo"]["title"])
author = str(result["volumeInfo"]["authors"])[2:-2]
publisher = str(result["volumeInfo"]["publisher"])
published_date = str(result["volumeInfo"]["publishedDate"])
description = str(result["volumeInfo"]["description"])
pages = str(result["volumeInfo"]["pageCount"])
genre = str(result["volumeInfo"]["categories"])[2:-2]
language = str(result["volumeInfo"]["language"])
image_link = str(result["volumeInfo"]["imageLinks"]["thumbnail"])
dict = Google_Results(isbn13, isbn10, title, author, publisher, published_date, description, pages, genre,
language, image_link)
gr.append(dict)
print(gr[i].title)
i += 1
except:
pass
return
gr = []
Book_Search("Linkedin")
I am a beginner to Python, so any help would be appreciated!
It does so because there is no publisher entry in volumeInfo of the first entry, thus it raises a KeyError and your except captures it. If you're going to work with fuzzy data you have to account for the fact that it will not always have the expected structure. For simple cases you can rely on dict.get() and its default argument to return a 'valid' default entry if an entry is missing.
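For example (using the result variable from the loop in your code):
volume_info = result.get("volumeInfo", {})
publisher = volume_info.get("publisher", "Unknown")  # falls back to "Unknown" instead of raising KeyError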
Also, there are a few conceptual problems with your function: it relies on a global gr, which is bad design, it shadows the built-in dict type, and it captures all exceptions, guaranteeing that you cannot exit your code even with a SIGINT... I'd suggest you convert it to something a bit more sane:
def book_search(search_term, max_results=3):
results = [] # a list to store the results
parms = {"q": search_term, "maxResults": max_results}
r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
try: # just in case the server doesn't return valid JSON
for result in r.json().get("items", []):
if "volumeInfo" not in result: # invalid entry - missing volumeInfo
continue
result_dict = {} # a dictionary to store our discovered fields
result = result["volumeInfo"] # all the data we're interested is in volumeInfo
isbns = result.get("industryIdentifiers", None) # capture ISBNs
if isinstance(isbns, list) and isbns:
for i, t in enumerate(("isbn10", "isbn13")):
if len(isbns) > i and isinstance(isbns[i], dict):
result_dict[t] = isbns[i].get("identifier", None)
result_dict["title"] = result.get("title", None)
authors = result.get("authors", None) # capture authors
if isinstance(authors, list) and len(authors) > 2: # you're slicing from 2
result_dict["author"] = str(authors[2:-2])
result_dict["publisher"] = result.get("publisher", None)
result_dict["published_date"] = result.get("publishedDate", None)
result_dict["description"] = result.get("description", None)
result_dict["pages"] = result.get("pageCount", None)
genres = result.get("authors", None) # capture genres
if isinstance(genres, list) and len(genres) > 2: # since you're slicing from 2
result_dict["genre"] = str(genres[2:-2])
result_dict["language"] = result.get("language", None)
result_dict["image_link"] = result.get("imageLinks", {}).get("thumbnail", None)
# make sure Google_Results accepts keyword arguments like title, author...
# and make them optional as they might not be in the returned result
gr = Google_Results(**result_dict)
results.append(gr) # add it to the results list
except ValueError:
return None # invalid response returned, you may raise an error instead
return results # return the results
Then you can easily retrieve as much info as possible for a term:
gr = book_search("Google")
And it will be far more tolerant of data omissions, provided that your Google_Results type makes most of the entries optional.
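In case it helps, here is one possible shape for Google_Results that satisfies those assumptions (an illustrative sketch, not your actual class): every field is an optional keyword argument.
class Google_Results:
    def __init__(self, isbn10=None, isbn13=None, title=None, author=None,
                 publisher=None, published_date=None, description=None,
                 pages=None, genre=None, language=None, image_link=None):
        self.isbn10 = isbn10
        self.isbn13 = isbn13
        self.title = title
        self.author = author
        self.publisher = publisher
        self.published_date = published_date
        self.description = description
        self.pages = pages
        self.genre = genre
        self.language = language
        self.image_link = image_link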
Following @Coldspeed's recommendation, it became clear that missing information in the JSON file caused the exception to be raised. Since I only had a "pass" statement there, it skipped the entire result. Therefore I will have to adapt the try/except statements so errors do get handled properly.
Thanks for the help guys!
Apologies if this isn't totally clear - I'm a Python copy-the-code-and-try-to-make-it-work developer.
I'm using the Google NLP API in Python 2.7.
When I use analyze_entities(), I can get and print the name, entity type and salience.
Mentions is supposed to contain the noun type: PROPER or COMMON, per this page:
https://cloud.google.com/natural-language/docs/reference/rest/v1beta1/Entity#EntityMention
I can't get mention type from the returned dictionary.
Here's my hideous code:
import os

from google.cloud import language

def entities_text(text, client):
"""Detects entities in the text."""
language_client = client
# Instantiates a plain text document.
document = language_client.document_from_text(text)
# Detects entities in the document. You can also analyze HTML with:
# document.doc_type == language.Document.HTML
entities = document.analyze_entities()
return entities
articles = os.listdir('articles')
for f in articles:
language_client = language.Client()
fname = "articles/" + f
thisfile = open(fname,'r')
content = thisfile.read()
entities = entities_text(content, language_client)
for e in entities:
name = e.name.strip()
type = e.entity_type.strip()
if e.name.strip()[0].isupper() and len(e.name.strip()) > 2:
print name, type, e.salience, e.mentions
That returns this:
RELATED OTHER 0.0019081507 [u'RELATED']
Zoe 3 PERSON 0.0016676666 [u'Zoe 3']
Where the value in [] is the mentions.
If I try to get mentions.type, I get an attribute not found error.
I'd appreciate any input.
1) Do not call the "AnalyzeEntities" function, but call the "AnnotateText" one instead.
2) Check for "Proper". Examine its value, it should be "PROPER" and not "PROPER_UNKNOWN" nor "NOT_PROPER".