I have a list of all Wikipedia articles and I want to scrape the body of each one for research purposes. My script works fine, but at the current speed this will take 40 days straight.
My question is:
Can I run this script, let's say, 10 times in parallel in different terminal windows, if I just set it up like this:
Script 1:
start point: 0
end point: len(list)/10
Script 2:
start point: len(list)/10
end point: len(list)*2/10
...and so on, up to Script 10.
That would cut it down to about 4 days, which is reasonable IMO.
Does my approach work? Is there a better approach?
Thanks.
Possible yes, ideal no. Why do you think it's running so slowly? Also, are you using the wiki API or are you scraping the site? There are factors that affect either of the two, so knowing what you're actually doing will help us give a better answer.
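If you do end up parallelising it yourself, you don't need ten terminal windows; one script with a worker pool splits the list for you. Here is only a rough sketch of the idea, assuming your script has some per-article scraping function; scrape_article, the requests call and the URL below are placeholders, not your actual code:

from concurrent.futures import ThreadPoolExecutor
import requests  # assuming the existing script fetches pages over HTTP

def scrape_article(url):
    # placeholder for whatever per-article work your script already does
    return requests.get(url).text

urls = ["https://en.wikipedia.org/wiki/Python_(programming_language)"]  # your full article list goes here

# 10 workers mirrors the "10 terminal windows" idea; the pool handles the chunking
with ThreadPoolExecutor(max_workers=10) as pool:
    bodies = list(pool.map(scrape_article, urls))

The caveat is the same either way: how much speed-up you get depends on whether the bottleneck is your network, your parsing, or rate limiting on Wikipedia's side, which is why the API question matters.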
EDIT: I've since taken a different approach with my task such that the solution to this isn't necessary anymore, but I'm leaving the question up for posterity. Because of this post (Manual pyglet loop freezes after a few iterations) describing a similar issue, I believe something inherent to pyglet was the issue. The solution recommended in the cited thread did not work for me.
Brand new to Python and PsychoPy. I inherited a script that will run a task, but we're encountering a major issue. The overview is that participants see a set of instructions and then begin a loop in which they see background information about a movie clip, then watch that movie clip (using MovieStim3 and pyglet) while we collect live, ongoing ratings from them using a joystick; that loop then iterates through the 9 stimuli.
The problem is that the script freezes on the 6th iteration of the for loop every time. It will play the video through and then freeze on the last frame. The first five iterations seem to work perfectly, though. No error messages are produced and I have to force quit to get PsychoPy to close. It also fails to produce a log or any data.
Here's what I've tried so far that hasn't worked:
Changing the order of the films
Reducing all films to a length of 5s
Changing the films from .mp4 to .mov
Changing which films we show
I think an additional piece of information that's helpful to know is that if I reduce my stimuli list to 5 films, the script executes perfectly from start to finish. Naturally, I assumed that the person who originally coded this must have limited the number of possible iterations to 5, but neither they nor I can find such a parameter. Knowing this, I thought a simple solution might be to make two separate loops and two separate stimuli lists (both under 6 items) and have it iterate through those sequentially. However, I run into the exact same issue. This makes me think it has to be something outside of my stimuli presentation loop; however, I'm at a loss to figure out what it might be. If anyone could offer any assistance or guidance, it would be immensely appreciated.
Because I can't isolate the problem to one area and there's a character limit here, I'm going to share the whole script and the associated stimuli via my GitHub repository, but if there's anything more targeted that would be helpful for me to share here, please don't hesitate to ask me. I'm sure it's also very bulky code, so apologies for that as well. If you've gotten this far, thanks so much for your time.
The task I need to do is to write some Python code so that the computer can execute another Python script, provided by my colleague, at 8 am every day. (The code he provides is used to automatically download things from the internet or to process some Excel sheets.)
Actually, I have already found a previous post on Stack Overflow which closely matches what I want:
Python script to do something at the same time every day
However, I am a Python beginner, so I don't know the "exact / practical" way to execute this code. My questions could be stupid, but I hope someone can still help me.
Let me describe my problems below...
Scenario Background:
Each day, my colleague presses Ctrl+L to lock his computer (not log off or shut down) before leaving the office, and the computer eventually goes into sleep mode.
He comes to the office at around 9 am, but he hopes my Python code can automatically execute some of his Python scripts at 8 am each day.
The final result should be like this:
Automatically wake up the computer (I think this step can be done with a Windows setting instead of Python. Would that be easier?)
Even if the computer is still locked, my Python code should still automatically execute the other script at 8 am.
So my code is as follows (based on that previous Stack Overflow post):
from datetime import datetime, timedelta
from threading import Timer

def seconds_until_8am():
    now = datetime.today()
    target = now.replace(hour=8, minute=0, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # timedelta avoids day=x.day+1 breaking at month end
    return (target - now).total_seconds()

def daily_task():
    # Put my colleague's code here
    Timer(seconds_until_8am(), daily_task).start()  # reschedule for the following day

t = Timer(seconds_until_8am(), daily_task)
t.start()
My real question is:
I write this code in Spyder (this is the only environment I know how to use...). Should I just press F5 to execute it all, and then it's done?
Or should I run the code from cmd or something like that?
Will my code still work after I close Spyder?
If my code still works after I close Spyder, what is the exact way to stop it?
Sorry, I know my questions could be very stupid or I may even be asking the wrong questions.
But I usually only use Python for some very simple data processing and analysis, and I have never used it for something this practical, so I have no idea how to do it even though I have already googled it.
I have made an API that parses the GitHub contribution data of an account, arranges it by month, week, or day, and returns it as JSON.
Responding to just one request takes approximately 2 seconds (1800 ms).
Link to my GitHub repository.
contributions.py in repository is the python code that does the above things.
THE POINT OF THE QUESTION: What makes my API slow?
Is it just too much data to parse (about 365 entries)?
Or the way the API builds the JSON string?
Thank you in advance for answering and helping me.
"Why is my code slow?" is a really hard question to answer. There's basically an unlimited number of possible reasons that could be. I may not be able to answer the question, but I can provide some suggestions to hopefully help you answer it for yourself.
There are dozens of questions to ask... What kind of hardware are you using? What kind of network/internet connection do you have? Is it just slow on the first request, or all requests? Is it just slow on the call to one type of request (daily, weekly, monthly) or all? etc. etc.
You are indicating overall request times of ~1800 ms, but as you pointed out, there are a lot of things happening during the processing of that request. In my experience, one of the best ways to find out is often to add some timing code to narrow down the scope of the slowness.
For example, one quick and dirty way to do this is to use the Python time module. I quickly added some code to the weekly contributions method:
import time
# [...]
@app.route("/contributions/weekly/<uname>")
def contributionsWeekly(uname):
    before = time.time()
    rects = getContributionsElement(uname)
    after = time.time()
    timeToGetContribs = after - before
    # [...]
    print(' timeToGetContribs: ' + str(timeToGetContribs))
    print('timeToIterateRects: ' + str(timeToIterateRects))
    print('   timeToBuildJson: ' + str(timeToBuildJson))
Running this code locally produced the following results:
timeToGetContribs: 0.8678717613220215
timeToIterateRects: 0.011543750762939453
timeToBuildJson: 1.5020370483398438e-05
(Note the e-05 on the end of the last time... very tiny amount of time).
From these results, we know that the time to get the contributions is taking the majority of the full request. Now we can drill down into that method to try to further isolate the most time consuming part. The next set of results shows:
timeToOpenUrl: 0.5734567642211914
timeToInstantiateSoup: 0.3690469264984131
timeToFindRects: 0.0023255348205566406
From this it appears that the majority of the time is spent actually opening the URL and retrieving the HTML (meaning that network latency, internet connection speed, GitHub server response time, etc are the likely suspects). The next heaviest is the time it actually takes to instantiate the BeautifulSoup parser.
Take all of these concrete numbers with a grain of salt. These are on my hardware (12 year old PC) and my local internet connection. On your system, the numbers will likely vary, and may even be significantly different. The point is, the best way to track down slowness is to go through some basic troubleshooting steps to identify where the slowness is occurring. After you've identified the problem area(s), you can likely search for more specific answers, or ask more targeted questions.
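If you end up sprinkling timing code in more places, a tiny helper keeps it readable. This is just an illustrative sketch using the standard library, not something from the repository (the label and the commented usage are made up):

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # print how long the wrapped block took; quick-and-dirty profiling only
    start = time.time()
    yield
    print('{}: {:.4f}s'.format(label, time.time() - start))

# example usage inside a request handler:
# with timed('getContribs'):
#     rects = getContributionsElement(uname)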
My company has slightly more than 300 vehicle-based Windows CE 5.0 mobile devices that all share the same software and usage model of Direct Store Delivery during the day, then doing a Tcom at the home base every night. There is an unknown event (or events) that results in a device freaking out and rebooting itself in the middle of the day. The frequency of this issue is ~10 times per week across the fleet of computers, which all reboot daily, 6 days a week. The math is 300*6 = 1800 boots per week (at least), and 10/1800 ≈ 0.5%. I realize that number is very low, but it is more than my boss wants to have.
My challenge is to find a way to scan through several thousand logfile.txt files and try to find some sort of pattern. I KNOW there is a pattern here somewhere. I've got a couple of ideas of where to start, but I wanted to throw this out to the community and see what suggestions you all might have.
A bit of background on this issue. The application starts a new log file at each boot. In an orderly (control) log file, you see the app startup, do its thing all day, and then start a shutdown process in a somewhat orderly fashion 8-10 hours later. In a problem log file, you see the device startup and then the log ends without any shutdown sequence at all in a time less than 8 hours. It then starts a new log file which shares the same date as the logfile1.old that it made in the rename process. The application that we have was home grown by windows developers that are no longer with the company. Even better, they don’t currently know who has the source at the moment.
I'm aware of the various CE tools that can be used to detect memory leaks (DevHealth, retail messages, etc.), and we are investigating that route as well; however, I'm convinced that there is a pattern to be found that I'm just not smart enough to find. There has to be a way to do this using Perl or Python that I'm just not seeing. Here are two ideas I have.
Idea 1 – Look for trends in word usage.
Create an array of every unique word used in the entire log file and output a count of each word. Once I had a count of the words being used, I could run some stats on them and look for the non-normal events. Perhaps the word "purple" is used 500 times in a 1000-line control log (there might be some math there?) and only 4 times in a 500-line problem log? Perhaps there is a unique word that is only seen in the problem files. Maybe I could get a reverse "word cloud"?
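Something along these lines is what I picture for Idea 1; the file paths are made up and the tokenisation is crude, this is only a sketch:

from collections import Counter

def word_counts(path):
    counts = Counter()
    with open(path, errors='ignore') as f:  # CE logs may contain odd characters
        for line in f:
            counts.update(line.lower().split())
    return counts

# compare a known-good log against a known-bad one
good = word_counts('control/logfile.txt')
bad = word_counts('problem/logfile.txt')

# words that only ever show up in the problem log are the first suspects
print(sorted(set(bad) - set(good)))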
Idea 2 – Categorize lines by entry type and then look for trends in the sequence of entry types.
The log files already have a predictable schema that looks like this: Level|date|time|system|source|message
I'm 99% sure there is a visible pattern here that I just can't find. All of the logs got turned up to "super duper verbose", so there is a boatload of fluff (25 log entries per second, 40k lines per file) that makes this even more challenging. If there isn't a unique word, then the pattern almost has to be in the sequence of entries. How do I do this?
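Roughly what I have in mind for Idea 2, again just a sketch with made-up paths, splitting on the Level|date|time|system|source|message schema above:

from collections import Counter

def tail_entry_types(path, n=50):
    # the last entries before a log abruptly ends are the most interesting ones
    with open(path, errors='ignore') as f:
        lines = f.readlines()[-n:]
    types = []
    for line in lines:
        fields = line.split('|')
        if len(fields) >= 6:  # Level|date|time|system|source|message
            types.append((fields[0], fields[4]))  # (level, source) pair
    return types

# (level, source) pairs seen at the end of a problem log but not a control log
problem_tail = Counter(tail_entry_types('problem/logfile.txt'))
control_tail = Counter(tail_entry_types('control/logfile.txt'))
print((problem_tail - control_tail).most_common(10))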
Item 3 – Hire a Windows CE platform developer
Yes, we are going down that path as well, but I KNOW there is a pattern I'm missing. They will use the tools that I don't have (or make the tools that we need) to figure out what's up. I suspect that there might be a memory leak, a radio event, or some other event that the platform tools will surely show.
Item 4 – Something I’m not even thinking of that you have used.
There have got to be tools out there that do this that aren't as prestigious as a well-executed Python script, and I'm willing to go down that path; I just don't know what those tools are.
Oh yeah, I can't post log files to the web, so don't ask. The users are promising to report trends when they see them, but I'm not exactly hopeful on that front. All I need to find is either a pattern in the logs, or steps to duplicate the issue.
So there you have it. What tools or techniques can I use to even start on this?
I was wondering if you'd looked at the ELK stack? It's an acronym for Elasticsearch, Logstash and Kibana, and it fits your use case closely; it's often used for analysis of large numbers of log files.
Elasticsearch and Kibana give you a UI that lets you interactively explore and chart data for trends. Very powerful and quite straightforward to set up on a Linux platform, and there's a Windows version too. (Took me a day or two of setup, but you get a lot of functional power from it.) The software is free to download and use. You could use this in a style similar to ideas 1 / 2.
https://www.elastic.co/webinars/introduction-elk-stack
http://logz.io/learn/complete-guide-elk-stack/
On the question of Python / idea 4 (which ELK could be considered part of): I haven't done this for log files, but I have used regex to search for and extract text patterns from documents using Python. That may also help you find patterns if you have some leads on the sorts of patterns you are looking for.
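As a trivial illustration of what I mean (the search term and the path here are invented; you'd substitute whatever you're hunting for):

import re

# invented example: collect every line that mentions the radio
pattern = re.compile(r'radio', re.IGNORECASE)

with open('problem/logfile.txt', errors='ignore') as f:
    hits = [line for line in f if pattern.search(line)]

print(len(hits))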
Just a couple of thoughts; hope they help.
There is no input data at all for this problem, so this answer will be basically pure theory, a little collection of ideas you could consider.
To analyze patterns out of a bunch of logs, you could definitely create some graphs displaying relevant data, which could help to narrow down the problem; Python is really very good for this kind of task.
You could also transform/insert the logs into a database; that way you'd be able to query the relevant suspicious events much faster and even compare all your logs at scale.
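A minimal sketch of the database idea using sqlite3 from the standard library; the table layout simply mirrors the Level|date|time|system|source|message schema from the question, and the path and example query are invented:

import sqlite3

conn = sqlite3.connect('logs.db')
conn.execute('CREATE TABLE IF NOT EXISTS entries '
             '(level TEXT, date TEXT, time TEXT, system TEXT, source TEXT, message TEXT, logfile TEXT)')

def load(path):
    with open(path, errors='ignore') as f:
        rows = [line.rstrip('\n').split('|', 5) + [path]
                for line in f if line.count('|') >= 5]
    conn.executemany('INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?, ?)', rows)
    conn.commit()

load('problem/logfile.txt')
# suspicious events can then be queried across every log at once, e.g.:
#   SELECT source, COUNT(*) FROM entries WHERE level = 'ERROR' GROUP BY source ORDER BY 2 DESC;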
A simpler approach could be just focusing on a single log showing the crash: instead of wasting a lot of effort or resources trying to find some kind of generic pattern, start by reading through one log in order to catch suspicious "events" which could produce the crash.
My favourite approach for these types of tricky problems is different from the previous ones: instead of focusing on analyzing or even parsing the logs, I'd just try to reproduce the bug(s) in a deterministic way locally (you don't even need to have the source code). Sometimes it's really difficult to replicate the production environment in your dev environment, but it is definitely time well invested. All the effort you put into this process will help you not only to solve these bugs but also to improve your software much faster. Remember, the more times you're able to iterate, the better.
Another approach could be coding a little script which would allow you to replay the logs that crashed; I'm not sure whether that'll be easy in your environment, though. Usually this strategy works quite well with production software using web services, where there are a lot of tuples with data requests and data retrieves.
In any case, without seeing the type of data in your logs, I can't be more specific or give much more concrete detail.
I have been stuck on this problem for a long time and I want to know how it's done in real / big company projects.
Suppose I have a project to build a website. Now I divide the project into sub-tasks and do them.
But, you know, suppose I have task 1 in hand, like exporting the page to PDF. I spend 3 days on that, come across various problems and many Stack Overflow questions, and in the end I solve it.
Now, 4 months later, someone tells me that there is an error in the code.
By then I have completely forgotten (60%) how I did it and why I did it that way. I document the code, but I can't write the whole story in the code.
Then I have to spend a lot of time on the code to find out what the problem was that made me add this line, etc.
I want to know whether there is any way to log the steps taken in completing the project,
so that I can see how I ended up with the code, what errors I got, what questions I asked on SO, etc.
How do people do this in practice? Which software should I use?
I know that in our project management software, JIRA, we have tasks, but that does not cover what steps I took to solve those tasks.
What is the best way, so that when I look back at my 2-year-old project, I know how I solved a particular task?
If you are already using JIRA, consider integrating it with your SCM.
When committing your changes to SCM refer to your JIRA issue number in comments. Like the following:
PORTAL-778 fixed the alignment issue with PDF exports
JIRA periodically connects to your SCM and parses the comments. You can easily find out changes made for a particular issue.
Please see the following link for more information
Integrating JIRA with Subversion
Every time you revisit code, make a list of the information you are not finding. Then the next time you create code, make sure that information is present. It can be in comments, a wiki, bug reports, or even text notes in a separate file. Make the notes useful for other people, so private notebooks aren't a good idea except for personal notes.