Python Search and Scrape [closed]

I have a problem and I'd like to know if it's worth spending the time to solve it with Python. I have a large CSV file of scientific names of fishes. I would like to cross-reference that CSV file with a large database of fish morphology information (www.fishbase.ca) and have the code return the maximum length of each fish. Basically, I need code that will search the fishbase website for each fish, find the maximum length info on its page, and return it to me in a CSV file. The last two parts are relatively straightforward, but the first part is where I'm stuck. Thanks in advance.

It looks like you can generate the URL directly from the genus and species, i.e.
rainbow trout (Oncorhynchus mykiss) becomes
http://www.fishbase.ca/summary/Oncorhynchus-mykiss.html
so something like
def make_url(genus, species):
    return (
        "http://www.fishbase.ca/summary/{}-{}.html"
        .format(genus.title(), species.lower())
    )
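For example, a quick check of the generated URL:
print(make_url("oncorhynchus", "mykiss"))
# -> http://www.fishbase.ca/summary/Oncorhynchus-mykiss.html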
Looking at the page source, the html is severely unsemantic; while parsing html with regular expressions is evil and awful, I really think it's the easiest method in this case:
import re

fishlength = re.compile(r"max length : ([\d.]+) ([cm]{1,2})", re.I).search

def get_length_in_cm(html):
    m = fishlength(html)
    if m:  # match found
        value = float(m.group(1))
        unit = m.group(2)
        if unit == "cm":
            return value
        elif unit == "m":
            return value * 100.
        else:
            raise ValueError("Unknown unit: {}".format(unit))
    else:
        raise ValueError("Length not found")
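A quick sanity check against an illustrative snippet (the text below is made up for the example, not copied from fishbase):
sample = "Max length : 122.0 cm TL male/unsexed"   # illustrative text only
print(get_length_in_cm(sample))                    # -> 122.0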
then grabbing each page,
import csv
import requests
from time import sleep

DELAY = 2
GENUS_COL = 4
SPECIES_COL = 5

with open("fish.csv") as inf:
    next(inf)  # skip header row
    for row in csv.reader(inf):
        url = make_url(row[GENUS_COL], row[SPECIES_COL])
        # should add error handling, in case
        # that page doesn't exist
        html = requests.get(url).text
        length = get_length_in_cm(html)
        # now store the length value somewhere
        # be nice, don't pound their site
        sleep(DELAY)
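To produce the output CSV the question asks for, one option (a sketch; the output filename and column layout here are assumptions, not from the question) is to collect (genus, species, length) tuples in a list while looping and write them out afterwards:
results = []  # fill inside the loop above: results.append((row[GENUS_COL], row[SPECIES_COL], length))

with open("fish_lengths.csv", "w", newline="") as outf:
    writer = csv.writer(outf)
    writer.writerow(["genus", "species", "max_length_cm"])
    writer.writerows(results)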

So, in order to use this information in other web applications, you would normally use an API to get hold of their data.
Fishbase.ca (or .org) does not have an official public-facing API. There was some chat in 2013 about creating a RESTful API, which would be just the ticket for what you need, but it hasn't happened yet (don't hold your breath).
An alternative is taking the name of the fish you need to look up, dropping it into the URI (e.g. www.fishbase.ca/fish/Rainbow+Trout) and then using XQuery or similar to drill down the DOM to find the maximum length.
Unfortunately, fishbase does not have the sort of URIs needed for this method either; the URI it exposes for Rainbow Trout uses an ID rather than a name, so you cannot easily look a fish up by name.
I would suggest looking for another data provider that offers either of these two kinds of API.
Regarding the second method: the site owners may not appreciate you using their website in this manner. Ask them beforehand if you can.

Related

String Operations and Manipulation in Python [closed]

I'm building a scraping spider and I would like some help on how to extract the right information from each response in Python.
response.css(".print-acta-temp::text").get()
'TEMPORADA 2021-2022'
I would like to know how to collect only the 2021-2022. Should I use the str command?
response.css(".print-acta-data::text").get()
'Data: 14-05-2022, 19:00h'
I need to extract only the date into one variable and the time into another variable.
response.css(".print-acta-comp::text").get()
' CADET PRIMERA DIVISIÓ - GRUP 2'
I need to collect the data before the first space, the data collected between the 2 spaces and finally the number into another variable.
response.css(".print-acta-jornada::text").get()
'Jornada 28'
I need to collect the data after the first space.
If you trust the website to always produce the value you want prefixed by 'TEMPORADA ', you can use
tu_string = 'TEMPORADA 2021-2022'
nueva_string = tu_string.replace('TEMPORADA ','')
print (nueva_string)
like, there's regex and all of that, but you can worry about learning that later, tbh.
I need to collect the data before the first space, the data collected between the 2 spaces and finally the number into another variable.
a simple way to do this is to split
teva_string = 'CADET PRIMERA DIVISIÓ - GRUP 2'
teva_lista = teva_string.split(' ')
print (teva_lista)
Any decision on how to parse a string is going to depend on one's assumptions about what form the strings are going to take. In the particular case of 'TEMPORADA 2021-2022', doing my_string.split(' ')[1] will get the years. 'Data: 14-05-2022, 19:00h'.split(' ') will get the list ['Data:', '14-05-2022,', '19:00h'], while 'Data: 14-05-2022, 19:00h'.split(',') will get ['Data: 14-05-2022', ' 19:00h']. You can also use datetime libraries or regular expressions, with the latter allowing for more customization if the form of your data varies.
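Putting that together for the specific strings in the question (a sketch that assumes the text always has exactly the shapes shown above):
raw_data = response.css(".print-acta-data::text").get()       # 'Data: 14-05-2022, 19:00h'
_, date_part, time_part = raw_data.split(' ')                  # ['Data:', '14-05-2022,', '19:00h']
match_date = date_part.rstrip(',')                             # '14-05-2022'
match_time = time_part.rstrip('h')                             # '19:00'

raw_jornada = response.css(".print-acta-jornada::text").get()  # 'Jornada 28'
jornada = raw_jornada.split(' ', 1)[1]                         # '28'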

What is the best way to fetch all data (posts and their comments) from Reddit? [closed]

I have a requirement to analyse all the comments in a subreddit (e.g. r/dogs, say from 2015 onward) using the Python Reddit API Wrapper (PRAW). I'd like the data to be stored as JSON.
I have found a way to get all of the top, new, hot, etc. submissions as a JSON using PRAW:
new_submissions = reddit.subreddit("dogs").new()
top_submissions = reddit.subreddit("dogs").top("all")
However, these are only the submissions for the subreddit, and do not include comments. How can I get the comments for these posts as well?
You can use the Python Pushshift.io API Wrapper (PSAW) to get all the most recent submissions and comments from a specific subreddit, and can even do more complex queries (such as searching for specific text inside a comment). The docs are available here.
For example, you can use the search_submissions() function to get the first 1000 submissions to r/dogs from 2015 onward:
import datetime as dt
import praw
from psaw import PushshiftAPI
r = praw.Reddit(...)
api = PushshiftAPI(r)
start_epoch=int(dt.datetime(2015, 1, 1).timestamp()) # Could be any date
submissions_generator = api.search_submissions(after=start_epoch, subreddit='dogs', limit=1000) # Returns a generator object
submissions = list(submissions_generator) # You can then use this, store it in mongoDB, etc.
Alternatively, to get the first 1000 comments from r/dogs in 2015, you can use the search_comments() function:
start_epoch=int(dt.datetime(2015, 1, 1).timestamp()) # Could be any date
comments_generator = api.search_comments(after=start_epoch, subreddit='dogs', limit=1000) # Returns a generator object
comments = list(comments_generator)
As you can see, PSAW still uses PRAW, and so returns PRAW objects for submissions and comments, which may be handy.
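And since each item PSAW yields here is a PRAW submission object, you can expand its comment tree with PRAW directly (a sketch; replace_more(limit=0) resolves the "load more comments" placeholders first):
for submission in submissions:
    submission.comments.replace_more(limit=0)    # resolve "load more comments" stubs
    for comment in submission.comments.list():   # flattened list of the whole comment tree
        print(comment.id, comment.body)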

Return data from for loop without return statement [closed]

I have a list of 1000 meals. I want to fetch the top 10 meals by calling an API. The API runs an algorithm. The algorithm is slow and takes time to do the calculation on all 1000 meals, so to minimize the time it creates CSV files in chunks of 100 meals (i.e. 10 CSV files for 1000 meals) and merges them all at the end into a single merge_All.csv covering all 1000 meals.
Here is how the API should work from the client side:
The client calls the API for the first time. The API starts the algorithm, creates the 1st CSV chunk, and returns the result to the client from that CSV, i.e. the top 10 from meals 1-100.
The client is not satisfied with the result and calls the API again. The API should now return the top 10 meals from the 2nd CSV generated, i.e. the top 10 from meals 100-200.
There might be a case where the client waits for some time to get better results; in the meantime the API might have created a 7th CSV file. When the client then calls the API, it should return the top 10 meals from the 7th CSV (i.e. meals 600-700).
If the API has generated merge_All.csv, the client should get the top 10 from that CSV at call time.
The solution I am looking for is:
I want to keep the for loop inside evaluate_combinations running until it generates all the temporary CSV files and the final CSV file, and at the same time return the first possible_food_list_temp, so that I can read the first temporary CSV file in the API method and return its data to the user. Currently the problem is that evaluate_combinations returns only the first temporary list and generates only one CSV file, which is wrong; instead, it should run through every combination in the for loop and generate more than one temporary CSV file.
I have tried the yield approach, but it requires a while loop.
Code Extract for better understanding:
def evaluate_combinations(...params):  # parameters elided in this extract
    for c in combinations:
        if loop_cnt % iter_limit == 0 or loop_cnt == total_comb:
            possible_food_list += possible_food_list_temp
            # generate_food() does some calculation and creates a CSV file with a bunch of data
            generate_food(possible_food_list_temp, choice, N, item_ids, items)
            if len(possible_food_list_temp) > 0:
                return possible_food_list_temp
                print('\nTemporary output generated!')  # never reached: sits after the return
            else:
                print('\nOutput not generated. Trying other combinations...')
            possible_food_list_temp = []
    return possible_food_list
You described your structural error quite neatly, but didn't recognize it:
I want to keep this for loop inside 'evaluate_combinations' running till it generates all temp CSV files and final CSV file, and at the same time return the first 'possible_food_list_temp'...
You expect your module to carry out two different executions at the same time. This requires multi-processing / multi-threading. Your code does not include any such attempt.
It's time to do a little more research. Work through the tutorials and examples, starting with subprocesses. Spawn a child process to do your 100-meal chunking and make its results available. You may want another process to accumulate the results and keep them available on request. Your function's purpose now reduces to grabbing the latest CSV and reporting back to the caller.
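A minimal sketch of that idea, assuming a worker function produce_chunks() that wraps your chunk-writing loop and temporary files named temp_*.csv (both names are hypothetical, not taken from your code):
import glob
import multiprocessing as mp

def produce_chunks():
    # hypothetical stand-in for evaluate_combinations(): writes temp_001.csv, temp_002.csv, ...
    ...

def latest_chunk_path():
    # return whichever temporary CSV was written most recently, if any
    files = sorted(glob.glob("temp_*.csv"))
    return files[-1] if files else None

if __name__ == "__main__":
    worker = mp.Process(target=produce_chunks, daemon=True)
    worker.start()
    # each API call can now read the top 10 rows from latest_chunk_path()
    # while the worker keeps generating further chunks in the background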
Does that get you moving?

Fetching place details (specifically reviews) with GooglePlaces in Python 3

I am completely new to this module and Python in general, yet wanted to start some sort of a fun project in my spare time.
I have a specific question concerning the GooglePlaces module for Python: how do I retrieve the reviews of a place knowing only its Place ID?
So far I have done...
from googleplaces import GooglePlaces, types, lang
google_places = GooglePlaces('API KEY')
query_result = google_places.get_place(place_id="ChIJB8wSOI11nkcRI3C2IODoBU0")
print(query_result) #<Place name="Starbucks", lat=48.14308250000001, lng=11.5782337>
print(query_result.get_details()) # Prints None
print(query_result.rating) # Prints the rating of 4.3
I am completely lost here, because I cannot get access to the object's details. Maybe I am missing something, yet would be very thankful for any guidance through my issue.
If you are completely lost, just read the docs :)
Example from https://github.com/slimkrazy/python-google-places:
for place in query_result.places:
    # Returned places from a query are place summaries.
    # The following method has to make a further API call.
    place.get_details()
    # Referencing any of the attributes below, prior to making a call to
    # get_details(), will raise a googleplaces.GooglePlacesAttributeError.
    print(place.details)  # A dict matching the JSON response from Google.
See the problem with your code now?
print(query_result.get_details()) # Prints None
should be
query_result.get_details() # Fetch details
print(query_result.details) # Prints details dict
Regarding the results, the Google docs state:
reviews[]: a JSON array of up to five reviews. If a language parameter was specified in the Place Details request, the Places Service will bias the results to prefer reviews written in that language. Each review consists of several components.
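So once get_details() has been called, the reviews should be reachable from that details dict (a sketch; the 'reviews', 'rating' and 'text' keys follow Google's response format and may be absent for some places):
query_result.get_details()                   # fetch the full details first
details = query_result.details               # dict matching Google's JSON response
for review in details.get('reviews', []):    # up to five reviews, per the docs
    print(review.get('rating'), review.get('text'))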

Filter two-dimensional table to find most used product by client on large table [closed]

I have a csv file with the following structure. I would normally use pandas to do this but I don't think it's appropriate here as I'm trying to run this on up to 100 million lines. What data structure should I use?
Client, Product, Usage
"John","Coke",3
"John","Pepsi",5
"Sam","Budweiser",7
"Sam","Pepsi",6
I want to output a table that gives me the most used product by client, so:
Client, Most_used_product
"John","Pepsi"
"Sam","Budweiser"
How can I do this?
Assuming your client-products CSV has no repeated (client, product) rows like this:
"John","Coke",3
"John","Pepsi",5
"John","Coke",5
and also ignoring the heading row (Client, Product, Usage), the following code should work:
import csv

most_used_products = dict()

with open('your_csv_filename.csv', 'rb') as csvfile:
    products_reader = csv.reader(csvfile, delimiter=',')
    for client, product, str_usage in products_reader:
        usage = int(str_usage)
        if client not in most_used_products:
            most_used_products[client] = (product, usage)
        else:
            used_product, product_usage = most_used_products[client]
            if usage > product_usage:
                most_used_products[client] = (product, usage)

for client, product_info in most_used_products.items():
    product_name, _ = product_info
    print '"%s","%s"' % (client, product_name)
Holding 100 million entries and then sorting is a little bit tricky unless you use a database format like SQLite. Python has an interface for SQLite if this is really what you want to do.
Perhaps instead of loading the whole thing into memory and sorting it you could iterate over the file, line by line, only holding the maximum entry for each person. This would cut down your memory requirements and would mean you wouldn't have to sort a very large data structure, which is computationally expensive.
Just using straightforward python it might look something like this:
clientDict = {}

def addToDict(client, prod, num, clientDict):
    clientDict[client] = {"num": num, "prod": prod}

with open("test.csv", "r") as csvFile:
    next(csvFile)  # skip the header row
    for line in csvFile:
        (client, prod, num) = line.split(',')
        num = int(num)
        if client in clientDict:
            if clientDict[client]["num"] <= num:
                addToDict(client, prod, num, clientDict)
        else:
            addToDict(client, prod, num, clientDict)
This would return a dictionary with each client as a key and their favourite product and its quantity as the value. Assuming your table doesn't have too many unique clients, this is more efficient than loading the whole file into memory.
This solution also doesn't take into account what happens if a person has two entries with identical numbers (e.g. John,Coke,4 and John,Pepsi,4), nor does it accumulate repeated entries (e.g. a later John,Coke,4 will simply replace an earlier John,Coke,2). The program is easily modified to take these discrepancies into account.
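For example, if repeated (client, product) rows should accumulate rather than replace each other, one possible variant (names are illustrative) is to sum usages into a nested dict first and pick each client's maximum afterwards:
from collections import defaultdict

totals = defaultdict(lambda: defaultdict(int))   # client -> product -> summed usage

with open("test.csv", "r") as csvFile:
    next(csvFile)                                # skip the header row
    for line in csvFile:
        client, prod, num = line.split(',')      # client/product keep their quotes, as in the loop above
        totals[client][prod] += int(num)

# for each client, keep the product with the highest summed usage
favourites = {client: max(products, key=products.get)
              for client, products in totals.items()}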
