BeautifulSoup randomly gets stuck in for loop - python

I've been using BeautifulSoup to extract multiple pages of reviews from websites. It has mostly worked wonders, but on large datasets it keeps getting stuck at seemingly random points.
My code is always along the lines of the following.
for x in range(len(reviews)):
    reviewsoups.append(BeautifulSoup(requests.get(reviews[x]).text, 'html.parser'))
I've never gotten any errors (apart from the occasional ConnectionReset), but the loop just seems to get stuck at random points, to the point where I consistently have to interrupt the kernel (which often takes 10+ minutes to actually work) and restart the process from the index where the loop got stuck.
It also seems that, in some cases, using my laptop at all while the code is running (opening Chrome, etc.) aggravates the situation.
Can anyone help? It’s just incredibly irritating having to sit by my laptop waiting just in case something like this happens.
Thanks in advance.
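One common cause of this kind of hang is that requests.get() has no timeout by default, so a single stalled connection can block forever. A minimal sketch with a timeout and basic error handling, reusing the reviews and reviewsoups names from the snippet above:

import requests
from bs4 import BeautifulSoup

session = requests.Session()                  # reuse one connection pool for all the URLs
reviewsoups = []
for url in reviews:
    try:
        resp = session.get(url, timeout=10)   # give up after 10 seconds instead of hanging
        reviewsoups.append(BeautifulSoup(resp.text, 'html.parser'))
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")       # note the failure and move on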

I think I’ve found a solution.
So I was trying to 'soup' 9000 URLs. What I did was create variables iteratively using the globals() function, with the idea of having each variable store 100 soups, so 90 variables with 100 soups each rather than one list with 9000.
I noticed that the first few hundred were very quick and then things slowed down; running 100 at a time, instead of constantly appending to an already massive list, made a difference.
I also got no crashes.
Bear in mind I only tried this with the last 1000 or so, after I got stuck around the 8000 mark, but it was much quicker and there were no technical issues.
Next time I will set up a for loop that works through each variable and appends, e.g., the 1056th soup to the 11th variable as its 56th element, if that makes sense (see the sketch below).
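A plain list of lists gives that same 100-per-bucket layout without going through globals(); a rough sketch along those lines (reviews is the URL list from the question, and the timeout is my own addition):

import requests
from bs4 import BeautifulSoup

CHUNK_SIZE = 100
chunks = []                                # ends up as ~90 lists of up to 100 soups each
for i, url in enumerate(reviews):
    if i % CHUNK_SIZE == 0:
        chunks.append([])                  # start a new bucket every 100 URLs
    soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')
    chunks[-1].append(soup)
# e.g. the 1056th soup (index 1055) lands in chunks[10] as its 56th element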

Related

Web Scraping 24/7 without Ip-spoofing (theoretical) - Python

Before I ask the actual question, I want to mention that IP spoofing is illegal in a number of countries, so please do not attempt to use any code provided here or in the answers below. Moreover, I am not confident that I know exactly what is and is not considered legal in the subject I'm about to open, so I want to state that none of the code provided here has been or will be actually tested. Everything should just 'theoretically work'!
With that said, a friend of mine asked me if I could make a bot that would monitor a market 24/7 and, should it find an item at the desired price (or lower), automatically buy it. The only problem I think my code would have in theory is in this line here:
requests.get(url, some_var_name, timeout=1)
# I think the timeout will need to be at least 2 realistically, but anyway!
This runs in a loop that's supposed to run 24/7, sending one request per second (optimally) to the server. After a while, the most likely outcome is that this will be treated as a DDoS attack, and to prevent it the IP will be blocked, so we won't be able to get the prices from the requests anymore. An easy solution would be proxies: just buy a list of them from somewhere, or even get some for free and possibly infect your own computer with something or have all your requests monitored by someone. That's one (illegal, I think) way to achieve it. Now the questions I have are:
Is there a way to achieve the same result without proxies? And is there a way to make the whole thing legal so I can actually test it out on a real market?
Thanks in advance, y'all. If in the end I find out that there's nothing to be done here, at the very least it was a fun challenge to take on!
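Purely as an untested, theoretical sketch (per the disclaimer above), the polling loop being described might look something like this; check_price is a placeholder for whatever parses the market page, and the actual buying step is left out:

import time
import requests

def poll_market(url, max_price, check_price, interval=2):
    delay = interval
    while True:
        try:
            resp = requests.get(url, timeout=2)
            resp.raise_for_status()
            price = check_price(resp.text)     # hypothetical parsing step
            if price is not None and price <= max_price:
                return price                   # the auto-buy logic would go here
            delay = interval                   # a good response resets the backoff
        except requests.RequestException:
            delay = min(delay * 2, 60)         # back off when the server pushes back
        time.sleep(delay)

Backing off and keeping the request rate modest is also the main non-proxy lever against getting blocked.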

PsychoPy Video Task Freezes After Several Iterations of For Loop

EDIT: I've since taken a different approach with my task, so the solution to this isn't necessary anymore, but I'm leaving the question up for posterity. Based on this post (Manual pyglet loop freezes after a few iterations), which describes a similar issue, I believe something inherent to pyglet was the problem. The solution recommended in the cited thread did not work for me.
Brand new to Python and PsychoPy. I inherited a script that will run a task, but we're encountering a major issue. The overview is that participants see a set of instructions, and then begin a loop in which they see background information about a movie clip, then watch that movie clip (using MovieStim3 and pyglet) while we collect live, ongoing ratings from them using a joystick, and then that loop continues to reiterate through the 9 stimuli.
The problem is that the script freezes on the 6th iteration of the for loop every time. It will play that video through and then freeze on the last frame. The first five iterations work perfectly, though. No error messages are produced, and I have to force quit in order to get PsychoPy to close. It also fails to produce a log or any data.
Here's what I've tried so far that hasn't worked:
Changing the order of the films
Reducing all films to a length of 5s
Changing the films from .mp4 to .mov
Changing which films we show
I think an additional piece of information that's helpful to know is that if I reduce my stimuli list to 5 films, the script executes perfectly from start to finish. Naturally, I assumed that the person who originally coded this must have limited the number of possible iterations to 5, but neither they nor I can find such a parameter. Knowing this, I thought a simple solution might be to make two separate loops with two separate stimuli lists (both under 6 items) and have the script iterate through them sequentially. However, I run into the exact same issue. This makes me think it has to be something outside of my stimulus presentation loop, but I'm simply at a loss to figure out what it might be. If anyone could offer any assistance or guidance, it would be immensely appreciated.
Because I can't isolate the problem to one area and there's a character limit here, I'm going to share the whole script and the associated stimuli via my github repository, but if there's anything more targeted that would be helpful for me to share here, please don't hesitate to ask me. I'm sure it's also very bulky code, so apologies for that as well. If you've gotten this far, thanks so much for your time.

Making webpage hit count/trending section for blog

I am trying to implement a simple feature: sorting articles by the most viewed in the last 24 hours.
I did this by saving the timestamp every time the webpage gets a request.
Now the most "trending" article is the one that has maximum timestamps saved in T-24 hours.
However, I don't think that this is the right way to do it as the list keeps getting bigger, and it must be slowing down the webpages (which will become noticeable after a critical number of views).
[I'm using Django]
You can use django-hitcount. It is pretty easy and useful; I have used it in many projects.
https://pypi.org/project/django-hitcount/
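If you do roll it yourself instead, a minimal sketch of the timestamp-per-view idea with plain Django models might look like this (model and field names are hypothetical); pruning rows older than 24 hours in a periodic job keeps the table from growing without bound:

from datetime import timedelta
from django.db import models
from django.utils import timezone

class Article(models.Model):
    title = models.CharField(max_length=200)

class ArticleHit(models.Model):
    article = models.ForeignKey(Article, on_delete=models.CASCADE, related_name="hits")
    viewed_at = models.DateTimeField(auto_now_add=True, db_index=True)

def trending(limit=10):
    # Articles with the most hits recorded in the last 24 hours, most-hit first.
    cutoff = timezone.now() - timedelta(hours=24)
    return (Article.objects
            .filter(hits__viewed_at__gte=cutoff)
            .annotate(recent_hits=models.Count("hits"))
            .order_by("-recent_hits")[:limit])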

Can you run python wikipedia api multiple times at the same time?

I have a list of all Wikipedia articles and I want to scrape the body of each for research purposes. My script is working fine, but at the current speed this will take 40 days straight.
My question is:
Can I run this script, let's say, 10 times in parallel in different terminal windows, if I just split it like this:
Script 1:
start point: 0
end point: len(list)/10
Script 2:
start point: len(list)/10
end point: 2*len(list)/10
...
and so on, up to Script 10.
This could leave me with 4 days, which is reasonable imo.
Does my approach work? Is there a better approach?
Thanks.
Possible, yes; ideal, no. Why do you think it's running so slowly? Also, are you using the wiki API or are you scraping the site? There are factors that affect either of the two, so knowing what you're actually doing will help us give a better answer.
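As a sketch of a "better approach" (not something spelled out in the thread), the same split can happen inside a single script with a thread pool, so there is no need to pre-compute start and end indices by hand; fetch_article below is a stand-in for whatever the existing script does with one title:

from concurrent.futures import ThreadPoolExecutor

def fetch_article(title):
    # Placeholder: the existing per-article API call / scraping / saving goes here.
    return title

def scrape_all(titles, workers=10):
    # Each worker grabs the next title as soon as it is free,
    # which is roughly the ten-terminal-windows idea in one process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_article, titles))

Whether 10 workers actually gives a 10x speedup still depends on the answer's questions: network latency and the API's own rate expectations, not just CPU.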

python or database?

I am reading a CSV file into a list of lists in Python. It is around 100 MB right now; in a couple of years that file will grow to 2-5 GB. I am doing lots of log calculations on the data, and the 100 MB file takes the script around 1 minute to process. After doing a lot of fiddling with the data, the script creates URLs that point to Google Charts and then downloads the charts locally.
Can I continue to use Python on a 2 GB file, or should I move the data into a database?
I don't know exactly what you are doing, but a database will just change how the data is stored. In fact it might take longer, since most reasonable databases put constraints on columns and do additional processing for those checks. In many cases, having the whole file locally and going through it doing calculations is going to be more efficient than querying and writing it back to the database (subject to disk speeds, network and database contention, etc.). But in some cases a database may speed things up, especially because indexing makes it easy to get subsets of the data.
You mentioned logs, so before you go database-crazy, I have the following ideas for you to check out. I'm not sure whether you have to keep going through every log since the beginning of time to download charts and expect that to grow to 2 GB, or whether you eventually expect 2 GB of traffic per day/week.
ARCHIVING -- you can archive old logs, say every few months. Copy the production logs to an archive location and clear the live logs out. This will keep the file size reasonable. If you are wasting time accessing the file to find the small piece you need then this will solve your issue.
You might want to consider converting to Java or C. Especially on loops and calculations, you might see a speedup of a factor of 30 or more. This will probably reduce the time immediately, but over time, as the data creeps up, some day this will slow down as well. If you have no bound on the amount of data, eventually even hand-optimized assembly by the world's greatest programmer will be too slow. But it might buy you 10x the time...
You also may want to think about figuring out the bottleneck (is it disk access, is it CPU time?) and, based on that, working out a scheme to do the task in parallel. If it is processing, look into multi-threading (and eventually multiple computers); if it is disk access, consider splitting the file among multiple machines. It really depends on your situation. But I suspect archiving might eliminate the need here.
As was suggested, if you are doing the same calculations over and over again, then just store them. Whether you use a database or a file this will give you a huge speedup.
If you are downloading stuff and that is a bottleneck, look into conditional gets using the if modified request. Then only download changed items. If you are just processing new charts then ignore this suggestion.
Oh and if you are sequentially reading a giant log file, looking for a specific place in the log line by line, just make another file storing the last file location you worked with and then do a seek each run.
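A rough sketch of that remember-where-you-stopped idea, with a made-up offset file name:

import os

OFFSET_FILE = "last_offset.txt"          # hypothetical bookkeeping file

def read_new_lines(log_path):
    offset = 0
    if os.path.exists(OFFSET_FILE):
        with open(OFFSET_FILE) as f:
            offset = int(f.read().strip() or 0)
    with open(log_path) as log:
        log.seek(offset)                 # jump straight to where the last run stopped
        new_lines = log.readlines()
        with open(OFFSET_FILE, "w") as f:
            f.write(str(log.tell()))     # remember the new position for next time
    return new_lines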
Before an entire database, you may want to think of SQLite.
Finally a "couple of years" seems like a long time in programmer time. Even if it is just 2, a lot can change. Maybe your department/division will be laid off. Maybe you will have moved on and your boss. Maybe the system will be replaced by something else. Maybe there will no longer be a need for what you are doing. If it was 6 months I'd say fix it. but for a couple of years, in most cases, I'd say just use the solution you have now and once it gets too slow then look to do something else. You could make a comment in the code with your thoughts on the issue and even an e-mail to your boss so he knows it as well. But as long as it works and will continue doing so for a reasonable amount of time, I would consider it "done" for now. No matter what solution you pick, if data grows unbounded you will need to reconsider it. Adding more machines, more disk space, new algorithms/systems/developments. Solving it for a "couple of years" is probably pretty good.
If you need to go through all the lines each time you perform the "fiddling", it wouldn't really make much difference, assuming the actual "fiddling" is what's eating your cycles.
Perhaps you could store the results of your calculations somehow; then a database would probably be nice. Also, databases have methods for ensuring data integrity and the like, so a database is often a great place for storing large sets of data (duh! ;)).
I'd only put it into a relational database if:
The data is actually relational and expressing it that way helps shrink the size of the data set by normalizing it.
You can take advantage of triggers and stored procedures to offload some of the calculations that your Python code is performing now.
You can take advantage of queries to only perform calculations on data that's changed, cutting down on the amount of work done by Python.
If none of those things is true, I don't see much difference between a database and a file. Both ultimately have to be stored on the file system.
If Python has to process all of it, and getting it into memory means loading an entire data set, then there's no difference between a database and a flat file.
2 GB of data in memory could mean page swapping and thrashing by your application. I would be careful and get some data before blaming the problem on the file. Accessing the data from a database won't, by itself, solve a paging problem.
If your data's flat, I see less advantage in a database, unless "flat" == "highly denormalized".
I'd recommend some profiling to see what's consuming CPU and memory before I made a change. You're guessing about the root cause right now. Better to get some data so you know where the time is being spent.
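A quick way to get that data is the standard-library profiler; process_csv here is a placeholder for whatever the script's main routine is:

import cProfile
import pstats

cProfile.run("process_csv('data.csv')", "profile.out")   # process_csv / data.csv are placeholders
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)            # show the 20 most expensive calls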
I always reach for a database for larger datasets.
A database gives me some stuff for "free"; that is, I don't have to code it.
searching
sorting
indexing
language-independent connections
Something like SQLite might be the answer for you.
Also, you should investigate the "nosql" databases; it sounds like your problem might fit well into one of them.
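A minimal sketch of loading the CSV into SQLite so queries can pull just the slice that is needed rather than the whole file (file and column names are made up):

import csv
import sqlite3

conn = sqlite3.connect("logs.db")
conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, value REAL)")   # hypothetical schema
with open("data.csv", newline="") as f:
    rows = ((r["ts"], float(r["value"])) for r in csv.DictReader(f))
    conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
conn.commit()

# Later runs can query a subset instead of re-reading the whole CSV:
for ts, value in conn.execute("SELECT ts, value FROM logs WHERE ts >= ?", ("2010-01-01",)):
    pass  # do the log calculations on just these rows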
At 2 GB you may start running up against speed issues. I work with model simulations that pull in hundreds of CSV files, and it takes about an hour to go through 3 iterations, or about 20 minutes per loop.
This is a matter of personal preference, but I would go with something like PostgreSQL, because it combines the speed of Python with the capacity of a SQL-driven relational database. I encountered the same issue a couple of years ago, when my Access DB was corrupting itself and crashing on a daily basis. It was either MySQL or Postgres, and I chose Postgres because of its Python friendliness. Not to say MySQL would not work with Python, because it does, which is why I say it's personal preference.
Hope that helps with your decision-making!
