My API that parses data with BeautifulSoup4 and Python is too slow - python

I have built an API that parses the GitHub contribution data of each account, arranges it by month, week, or day, and returns it as JSON.
Responding to just one request takes approximately 2 seconds (~1800 ms).
Link to my GitHub repository.
contributions.py in the repository is the Python code that does the above.
THE POINT OF THE QUESTION: What makes my API slow?
Is it just too much data to parse (about 365 entries)?
Or the way the API builds the JSON string?
Thank you in advance for answering and helping me.

"Why is my code slow?" is a really hard question to answer. There's basically an unlimited number of possible reasons that could be. I may not be able to answer the question, but I can provide some suggestions to hopefully help you answer it for yourself.
There are dozens of questions to ask... What kind of hardware are you using? What kind of network/internet connection do you have? Is it just slow on the first request, or all requests? Is it just slow on the call to one type of request (daily, weekly, monthly) or all? etc. etc.
You indicate overall request times of ~1800 ms, but as you pointed out, there are a lot of things happening during the processing of that request. In my experience, one of the best ways to find out is often to add some timing code to narrow down the scope of the slowness.
For example, one quick and dirty way to do this is to use the python time module. I quickly added some code to the weekly contributions method:
import time

# [...]

@app.route("/contributions/weekly/<uname>")
def contributionsWeekly(uname):
    before = time.time()
    rects = getContributionsElement(uname)
    after = time.time()
    timeToGetContribs = after - before
    # [...] timeToIterateRects and timeToBuildJson are measured the same way,
    # around the rect-iteration and JSON-building steps respectively
    print(' timeToGetContribs: ' + str(timeToGetContribs))
    print('timeToIterateRects: ' + str(timeToIterateRects))
    print('   timeToBuildJson: ' + str(timeToBuildJson))
Running this code locally produced the following results:
timeToGetContribs: 0.8678717613220215
timeToIterateRects: 0.011543750762939453
timeToBuildJson: 1.5020370483398438e-05
(Note the e-05 on the end of the last time... very tiny amount of time).
From these results, we know that the time to get the contributions is taking the majority of the full request. Now we can drill down into that method to try to further isolate the most time consuming part. The next set of results shows:
timeToOpenUrl: 0.5734567642211914
timeToInstantiateSoup: 0.3690469264984131
timeToFindRects: 0.0023255348205566406
From this it appears that the majority of the time is spent actually opening the URL and retrieving the HTML (meaning that network latency, internet connection speed, GitHub server response time, etc are the likely suspects). The next heaviest is the time it actually takes to instantiate the BeautifulSoup parser.
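For reference, the same timing approach can be pushed down into the contribution-fetching helper. This is only a sketch; the exact URL handling and parser calls in contributions.py may differ:
import time
import urllib.request
from bs4 import BeautifulSoup

def getContributionsElement(uname):
    # NOTE: the URL and parser below are assumptions; only the placement of
    # the timing calls matters here.
    before = time.time()
    html = urllib.request.urlopen('https://github.com/users/' + uname + '/contributions').read()
    timeToOpenUrl = time.time() - before

    before = time.time()
    soup = BeautifulSoup(html, 'html.parser')
    timeToInstantiateSoup = time.time() - before

    before = time.time()
    rects = soup.find_all('rect')
    timeToFindRects = time.time() - before

    print('        timeToOpenUrl: ' + str(timeToOpenUrl))
    print('timeToInstantiateSoup: ' + str(timeToInstantiateSoup))
    print('      timeToFindRects: ' + str(timeToFindRects))
    return rects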
Take all of these concrete numbers with a grain of salt. These are on my hardware (12 year old PC) and my local internet connection. On your system, the numbers will likely vary, and may even be significantly different. The point is, the best way to track down slowness is to go through some basic troubleshooting steps to identify where the slowness is occurring. After you've identified the problem area(s), you can likely search for more specific answers, or ask more targeted questions.

Related

Web Scraping 24/7 without Ip-spoofing (theoretical) - Python

Now, before I ask the actual question, I want to mention that IP spoofing is illegal in a number of countries, so please do not attempt to use any code provided here or in the answers below. Moreover, I am not confident that I know exactly what is considered legal and what is not in the subject I'm about to open, therefore I want to state that every piece of code provided here has not been and will not be actually tested. Everything should just 'theoretically work'!
Having mentioned that, a friend of mine asked me if I could make a bot that would monitor a market 24/7 and, should it find an item at the desired price (or lower), automatically buy it. Now, the only problem in theory that I think my code would have is in this line here:
import requests

# assuming the requests library is what's being used here
some_var_name = requests.get(url, timeout=1)
# think that timeout will need to be at least 2 realistically but anyway!
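Fleshed out, the loop I have in mind would look roughly like this (the market URL, the price threshold, and the parse/buy steps are placeholders, and, as said above, none of this has been or will be run):
import time
import requests

TARGET_PRICE = 100                         # placeholder threshold
URL = 'https://example.com/market/item'    # placeholder market URL

while True:                                # intended to run 24/7
    try:
        response = requests.get(URL, timeout=2)
        # ... parse the price out of response.text and buy if it is
        #     at or below TARGET_PRICE (market-specific, omitted here) ...
    except requests.RequestException:
        pass                               # network hiccup; try again on the next pass
    time.sleep(1)                          # roughly one request per second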
This runs in a loop that's supposed to run 24/7, sending one request per second (optimally) to the server. After a while, the most likely outcome is that this will be viewed as a DDoS attack, and to prevent it the IP will be blocked, so we won't be able to get the prices from the requests anymore. An easy solution would be proxies: just have a list of them bought from somewhere, or even get some for free (possibly infecting your own computer with something, or getting all your requests monitored by someone), but yeah... that's one (illegal, I think) way to achieve it. Now, the questions I have are:
Is there a way to achieve the same result without proxies? And is there a way to make the whole thing legal so I can actually test it out on a real market?
Thanks in advance, y'all; if in the end I find out that there's nothing to be done here, at the very least it was a fun challenge to take on!

Making webpage hit count/trending section for blog

I am trying to make this simple feature of sorting by the most viewed articles in the last 24 hours.
I did this by saving the timestamp every time the webpage gets a request.
Now the most "trending" article is the one that has the most timestamps saved within the last 24 hours.
However, I don't think this is the right way to do it, as the list keeps getting bigger and must be slowing down the pages (which will become noticeable after a critical number of views).
[I'm using Django]
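For concreteness, the kind of query this boils down to might look roughly like this in the Django ORM (the Article model, the 'views' relation, and its timestamp field are hypothetical stand-ins for whatever stores the saved timestamps):
from datetime import timedelta

from django.db.models import Count
from django.utils import timezone

from blog.models import Article   # hypothetical app and model

# Count only the view timestamps recorded in the last 24 hours
cutoff = timezone.now() - timedelta(hours=24)
trending = (Article.objects
            .filter(views__timestamp__gte=cutoff)    # 'views' = hypothetical related view records
            .annotate(recent_views=Count('views'))
            .order_by('-recent_views'))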
You can use django-hitcount. It is pretty easy and useful, I used it in many projects.
https://pypi.org/project/django-hitcount/

Data analysis of log files – How to find a pattern?

My company has slightly more than 300 vehicle-based Windows CE 5.0 mobile devices that all share the same software and usage model: Direct Store Delivery during the day, then a Tcom at the home base every night. There is some unknown event (or events) that results in a device freaking out and rebooting itself in the middle of the day. The frequency of this issue is ~10 times per week across the fleet of computers, which all reboot daily, 6 days a week. The math is 300*6 = 1800 boots per week (at least); 10/1800 = 0.5%. I realize that number is very low, but it is more than my boss wants to have.
My challenge is to find a way to scan through several thousand logfile.txt files and try to find some sort of pattern. I KNOW there is a pattern here somewhere. I've got a couple of ideas of where to start, but I wanted to throw this out to the community and see what suggestions you all might have.
A bit of background on this issue. The application starts a new log file at each boot. In an orderly (control) log file, you see the app start up, do its thing all day, and then begin a shutdown process in a somewhat orderly fashion 8-10 hours later. In a problem log file, you see the device start up and then the log ends, without any shutdown sequence at all, in less than 8 hours. It then starts a new log file, which shares the same date as the logfile1.old that it made in the rename process. The application we have was home-grown by Windows developers who are no longer with the company. Even better, no one currently knows who has the source.
I'm aware of the various CE tools that can be used to detect memory leaks (DevHealth, retail messages, etc.) and we are investigating that route as well; however, I'm convinced that there is a pattern to be found that I'm just not smart enough to find. There has to be a way to do this using Perl or Python that I'm just not seeing. Here are two ideas I have.
Idea 1 – Look for trends in word usage.
Create an array of every unique word used in the entire log file and output a count of each word. Once I had a count of the words being used, I could run some stats on them and look for the non-normal events. Perhaps the word "purple" is used 500 times in a 1000-line control log (there might be some math there?) and only 4 times in a 500-line problem log? Perhaps there is a unique word that is only seen in the problem files. Maybe I could get a reverse "word cloud"?
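A minimal sketch of Idea 1 using Python's collections.Counter (the file paths and the control/problem split are placeholders):
from collections import Counter

def word_counts(path):
    counts = Counter()
    with open(path, errors='ignore') as f:
        for line in f:
            counts.update(line.split())
    return counts

# Placeholder paths: point these at a known-good log and a known-bad log
control = word_counts('control/logfile1.txt')
problem = word_counts('problem/logfile1.txt')

# Words that appear in the problem log but never in the control log
only_in_problem = set(problem) - set(control)
print(sorted(only_in_problem))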
Idea 2 – Categorize lines by entry type and then look for trends in the sequence of entry types.
The log files already have a predictable schema that looks like this: Level|date|time|system|source|message
I'm 99% sure there is a visible pattern here that I just can't find. All of the logs got turned up to "super duper verbose", so there is a boatload of fluff (25 log lines per second, 40k lines per file) that makes this even more challenging. If there isn't a unique word (Idea 1), then a pattern in the sequence of entry types has almost got to be where the answer is. How do I do this?
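A sketch of Idea 2, splitting each line on the schema above and counting how often one entry type follows another (treating level + source as the "entry type" is just one possible choice, and the file path is a placeholder):
from collections import Counter

def entry_types(path):
    # Yield an "entry type" (level + source) for each line that matches the schema
    with open(path, errors='ignore') as f:
        for line in f:
            parts = line.rstrip('\n').split('|')
            if len(parts) >= 6:
                level, date, time_, system, source, message = parts[:6]
                yield level + ':' + source

def transition_counts(path):
    # Count how often one entry type is immediately followed by another
    counts = Counter()
    previous = None
    for current in entry_types(path):
        if previous is not None:
            counts[(previous, current)] += 1
        previous = current
    return counts

print(transition_counts('problem/logfile1.txt').most_common(20))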
Item 3 – Hire a Windows CE platform developer
Yes, we are going down that path as well, but I KNOW there is a pattern I'm missing. They will use the tools that I don't have (or make the tools that we need) to figure out what's up. I suspect that there might be a memory leak, radio event, or other event that the platform tools will, I'm sure, show.
Item 4 – Something I’m not even thinking of that you have used.
There have got to be tools out there that do this that aren't as prestigious as a well-executed Python script, and I'm willing to go down that path; I just don't know what those tools are.
Oh yeah, I can't post log files to the web, so don't ask. The users are promising to report trends when they see them, but I'm not exactly hopeful on that front. All I need to find is either a pattern in the logs, or steps to duplicate the problem.
So there you have it. What tools or techniques can I use to even start on this?
I was wondering if you'd looked at the ELK stack? It's an acronym for Elasticsearch, Logstash, and Kibana, and it fits your use case closely; it's often used for analysis of large numbers of log files.
Elasticsearch and Kibana give you a UI that lets you interactively explore and chart data for trends. Very powerful and quite straightforward to set up on a Linux platform, and there's a Windows version too. (Took me a day or two of setup, but you get a lot of functional power from it.) The software is free to download and use. You could use this in a style similar to ideas 1/2.
https://www.elastic.co/webinars/introduction-elk-stack
http://logz.io/learn/complete-guide-elk-stack/
On the question of Python / idea 4 (which ELK could be considered part of): I haven't done this for log files, but I have used regexes to search for and extract text patterns from documents using Python. That may also help you find patterns if you have some leads on the sorts of patterns you are looking for.
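For example, a very small sketch of that kind of regex search over a log file (the keywords in the pattern and the file path are just examples):
import re

# Example pattern: lines whose message mentions a reboot/reset-type event
pattern = re.compile(r'(memory|reset|watchdog|exception)', re.IGNORECASE)

with open('problem/logfile1.txt', errors='ignore') as f:
    for line in f:
        if pattern.search(line):
            print(line.rstrip())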
Just a couple of thoughts; hope they help.
There is no input data at all to this problem so this answer will be basically pure theory, a little collection of ideas you could consider.
To analyze patterns across a large number of logs, you could certainly create some graphs displaying the relevant data, which could help narrow down the problem; Python is very good for this kind of task.
You could also transform the logs and insert them into a database; that way you'd be able to query for the relevant suspicious events much faster and even compare all your logs at once.
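As a rough sketch of that idea, the pipe-delimited logs could be loaded into SQLite and then queried (the file locations and the example query are placeholders; the columns follow the schema mentioned in the question):
import glob
import sqlite3

conn = sqlite3.connect('logs.db')
conn.execute('CREATE TABLE IF NOT EXISTS entries '
             '(logfile TEXT, level TEXT, date TEXT, time TEXT, '
             'system TEXT, source TEXT, message TEXT)')

for path in glob.glob('logs/*.txt'):      # placeholder location of the log files
    with open(path, errors='ignore') as f:
        rows = [[path] + line.rstrip('\n').split('|', 5)
                for line in f if line.count('|') >= 5]
    conn.executemany('INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?, ?)', rows)
conn.commit()

# Example query: which sources log the most messages across all files?
for row in conn.execute('SELECT source, COUNT(*) FROM entries '
                        'GROUP BY source ORDER BY COUNT(*) DESC LIMIT 20'):
    print(row)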
A simpler approach could be just focusing on a single log showing the crash: instead of wasting a lot of effort or resources trying to find some kind of generic pattern, start by reading through one log in order to catch suspicious "events" that could produce the crash.
My favourite approach for this type of tricky problem is different from the previous ones: instead of focusing on analyzing or even parsing the logs, I'd just try to reproduce the bug(s) in a deterministic way locally (you don't even need to have the source code). Sometimes it's really difficult to replicate the production environment in your dev environment, but it is definitely time well invested. All the effort you put into this process will help you not only to solve these bugs but to improve your software much faster. Remember, the more times you're able to iterate, the better.
Another approach could be coding a little script that would allow you to replay the logs that ended in a crash; I'm not sure whether that will be easy in your environment, though. Usually this strategy works quite well with production software using web services, where there are a lot of request/response pairs to replay.
In any case, without seeing the type of data in your logs, I can't be more specific or give more concrete details.

How to properly unit test a web app?

I'm teaching myself backend and frontend web development (I'm using Flask, if it matters) and I need a few pointers when it comes to unit testing my app.
I am mostly concerned with these different cases:
The internal consistency of the data: that's the easy one - I'm aiming for 100% coverage when it comes to issues like the login procedure and, most generally, checking that everything that happens between the Python code and the database after every request remains consistent.
The JSON responses: what I'm doing at the moment is performing a test request for every GET/POST call on my app and then asserting that the JSON response must be this-and-that, but honestly I don't quite appreciate the value of doing this - maybe because my app is still at an early stage?
Should I keep testing every json response for every request?
If yes, what are the long-term benefits?
External APIs: I read conflicting opinions here. Say I'm using an external API to translate some text:
Should I test only the very high level API, i.e. see if I get the access token and that's it?
Should I test that the returned json is what I expect?
Should I test nothing to speed up my test suite and don't make it dependent from a third-party API?
The outputted HTML: I'm lost on this one as well. Say I'm testing the function add_post():
Should I test that on the page that follows the request the desired post is actually there?
I started checking for the presence of strings/HTML tags in the raw response.data, but then I kind of gave up because 1) it takes a lot of time and 2) I would have to constantly rewrite the tests since I'm changing the app so often.
What is the recommended approach in this case?
Thank you and sorry for the verbosity. I hope I made myself clear!
Most of this is personal opinion and will vary from developer to developer.
There are a ton of python libraries for unit testing - that's a decision best left to you as the developer of the project to find one that fits best with your tool set / build process.
This isn't exactly 'unit testing' per se; I'd consider it more like integration testing. That's not to say it isn't valuable, it's just a different task and will often use different tools. For something like this, testing will pay off in the long run because you'll have peace of mind that your bug fixes and feature additions aren't impacting your end-to-end behaviour. If you're already doing it, I would continue. These sorts of tests are highly valuable when refactoring down the road to ensure consistent functionality.
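As an illustration of the JSON-response case above, such a test might look roughly like this with Flask's built-in test client (the route, payload, and application module are made up):
import unittest

from myapp import app   # hypothetical Flask application module


class PostApiTests(unittest.TestCase):
    def setUp(self):
        app.config['TESTING'] = True
        self.client = app.test_client()

    def test_get_post_returns_expected_json(self):
        # Hypothetical route and payload, only to show the shape of the test
        response = self.client.get('/api/posts/1')
        self.assertEqual(response.status_code, 200)
        data = response.get_json()
        self.assertEqual(data['id'], 1)
        self.assertIn('title', data)


if __name__ == '__main__':
    unittest.main()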
I would not waste time testing 3rd party APIs. It's their job to make sure their product behaves reliably. You'll be there all day if you start testing 3rd party features. A big reason to use 3rd party APIs is so you don't have to test them. If you ever discover that your app is breaking because of a 3rd party API it's probably time to pick a different API. If your project scales to a size where you're losing thousands of dollars every time that API fails you have a whole new ball of issues to deal with (and hopefully the resources to address them) at that time.
In general, I don't test static content or HTML. There are tools out there (web scraping tools) that will let you crawl your own website to check for consistent functionality. I would personally leave this as a last priority for the final stages of refinement, if you have time. The look and feel of most websites changes so often that writing tests isn't worth it. Look and feel is also really easy to test manually because it's so visual.

Reading RSS Feeds: What Aggregators Do That I'm Not

I drop the following feed into Google Reader, and it updates normally.
http://www.indeed.ca/rss?q=&l=Hamilton%2C+ON
However, when I use any of a number of approaches suggested thither and yon on the 'net that simply involve reading from this source and parsing the XML, I receive the same 20 items.
What is Google Reader doing that I should be doing in my code so that I receive new items?
Thanks for your advice. Incidentally, I'm coding in Python.
RSS aggregators "poll" the sources, i.e., they repeat the HTTP query periodically on each source, and check if anything new appears in the results. That's unfortunate, as polling always is, as it wastes resources in an unending series of "are we there yet?" questions (kind of like taking a toddler along in a long car drive;-), and nevertheless implies delays (if you poll a given source every hour, say, you'll wait up to an hour to see some results).
Unfortunately, in the RSS architecture itself, there are no alternatives, no way to ask for a "callback" when new stuff appears or opt for a saner "publish-subscribe architecture".
A good effort to remedy that is pubsubhubbub, but it inevitably requires cooperation (above and beyond the RSS standards) from RSS sources and aggregators -- so it needs very wide takeup before it can be called "a solution" to the problem, though, technically, it already is (for cooperating sites;-).
So back to your question, you're doing nothing wrong: you just need to poll periodically, like RSS aggregators do, in order to get to see new results eventually.
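For example, a minimal sketch of that kind of polling with the feedparser library, remembering which entries have already been seen (the one-hour interval is arbitrary):
import time

import feedparser

FEED_URL = 'http://www.indeed.ca/rss?q=&l=Hamilton%2C+ON'
seen = set()

while True:
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        key = entry.get('id') or entry.get('link')
        if key not in seen:
            seen.add(key)
            print('new item:', entry.get('title'))
    time.sleep(60 * 60)   # poll once an hour; adjust to taste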
1) Have you tried with other RSS feeds?
2) If so, it sounds like some kind of cache... Are you behind some proxy?
