Collecting large amounts of data efficiently

I have a program that creates a solar system, integrates until a close encounter between adjacent planets occurs (or until 10e+9 years have elapsed), then writes two data points to a file. The try/except block acts as a flag for when planets get too close. This process is repeated 16,000 times. It is all done with the REBOUND module, a software package that integrates the motion of particles under the influence of gravity.
for i in range(0,16000):
    def P_dist(p1, p2):
        x = sim.particles[p1].x - sim.particles[p2].x
        y = sim.particles[p1].y - sim.particles[p2].y
        z = sim.particles[p1].z - sim.particles[p2].z
        dist = np.sqrt(x**2 + y**2 + z**2)
        return dist
    init_periods = [sim.particles[1].P,sim.particles[2].P,sim.particles[3].P,sim.particles[4].P,sim.particles[5].P]
    try:
        sim.integrate(10e+9*2*np.pi)
    except rebound.Encounter as error:
        print(error)
        print(sim.t)
        for j in range(len(init_periods)-1):
            distance = P_dist(j, j+1)
            print(j,":",j+1, '=', distance)
            if distance <= .01: # returns the period ratio of the two planets that had the close encounter and the inner orbital period between the two
                p_r = init_periods[j+1]/init_periods[j]
                with open('good.txt', 'a') as data: # opens a file, writing the x & y values for the graph
                    data.write(str(math.log10(sim.t/init_periods[j])))
                    data.write('\n')
                    data.write(str(p_r))
                    data.write('\n')
Whether or not there is a close encounter depends mostly on a random value I have assigned, and that random value also controls how long a simulation can run. For instance, when I set the random value to its maximum of 9.99, a close encounter happened at approximately 11e+8 years (approximately 14 hours). The random values range from 2 to 10, and close encounters happen more often on the lower side. On every iteration in which a close encounter occurs, my code writes to the file, which I believe may be taking up a lot of simulation time. Since the majority of my simulation time is spent trying to locate close encounters, I'd like to save some time by finding a way to collect the data needed without having to append to the file every iteration.
Since I'm attempting to plot the data collected from this simulation, would creating two arrays and outputting data into those be faster? Or is there a way to only have to write to a file once, when all 16000 iterations are complete?
sim is a variable holding all of the information about the solar system.
This is not the full code, I left out the part where I created the solar system.
count = 0
data = open('good.txt', 'a+')
....
if distance <= .01:
count+=1
while(count<=4570)
data.write(~~~~~~~)
....
data.close()

The problem isn't that you write every time you find a close encounter. It's that, for each encounter, you open the file, write one output record, and close the file. All the opening and appending is slow. Try this, instead: open the file once, and do only one write per record.
# Near the top of the program
data = open('good.txt', 'a')
...
    if distance <= .01: # returns the period ratio of the two planets that had the close encounter and the inner orbital period between the two
        # Write one output record
        p_r = init_periods[j+1]/init_periods[j]
        data.write(str(math.log10(sim.t/init_periods[j])) + '\n' +
                   str(p_r) + '\n')
...
data.close()
This should work well, as writes will get buffered, and will often run in parallel with the next computation.
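If you would rather write only once, after all iterations are complete (as the question asks), here is a minimal sketch of that idea. It assumes the two values per encounter fit comfortably in memory, which should hold for 16,000 runs. The names encounter_record and write_records are made up for illustration and are not part of REBOUND.

```python
import math

def encounter_record(t, inner_period, outer_period):
    """Return (log10(t / inner period), period ratio) for one close encounter."""
    return math.log10(t / inner_period), outer_period / inner_period

def write_records(path, records):
    """Append all collected records in one pass, two lines per record."""
    with open(path, 'a') as data:
        data.writelines(f"{x}\n{y}\n" for x, y in records)

# Hypothetical usage inside the simulation loop:
#   records = []
#   ...
#   if distance <= .01:
#       records.append(encounter_record(sim.t, init_periods[j], init_periods[j+1]))
#   ...
#   write_records('good.txt', records)   # once, after all iterations
```

The same list can be passed straight to the plotting code, so the file becomes optional. The trade-off is that a crash partway through a multi-hour run loses everything collected so far, whereas the open-once approach above keeps what has already been flushed.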

Related

Picking up where program left off after error encountered

I am running a program row-wise on a pandas dataframe that takes a long time to run.
The problem is, the VPN connection to the database can suddenly be lost, so I lose all my progress.
Currently, what I am doing is splitting the large dataframe into smaller chunks (500 rows at a time), and running the program on each chunk in a for loop. The result of the processing of each chunk is saved to my hard drive.
However, the chunks are still 500 rows each, so I can still lose a lot of progress when the connection is lost. Plus, I have to manually check to see where I got up to and adjust the code to pick up where the connection was lost.
What is the best way to write the code to "remember" which row the program is up to and pick up exactly where it left off once I re-establish the connection?
Current:
size = 500
list_of_dfs = np.split(large_df, range(size, len(large_df), size))
together_list = []
for count, chunk in enumerate(list_of_dfs):
    # Process
    chunk_processed = process_chunk(chunk)
    chunk_processed.to_csv(f"processed_{count}.csv")
    together_list.append(chunk_processed)
# merge lists together into one df
all_chunks_together = pd.concat(together_list)
Thanks in advance

You could use the existing csv files to remember where to pick up:
import os

size = 500
list_of_dfs = np.split(large_df, range(size, len(large_df), size))
together_list = []
for count, chunk in enumerate(list_of_dfs):
    csv_file = f"processed_{count}.csv"
    if os.path.isfile(csv_file):
        # This chunk was finished in an earlier run, so reload it.
        # to_csv wrote the index as the first column, hence index_col=0.
        chunk_processed = pd.read_csv(csv_file, index_col=0)
    else:
        chunk_processed = process_chunk(chunk)
        chunk_processed.to_csv(csv_file)
    together_list.append(chunk_processed)
# merge lists together into one df
all_chunks_together = pd.concat(together_list)
You would still have to re-start your program manually every time it loses the connection. To avoid this, you could catch the exception (assuming you're getting one on connection loss) and continue like in this example:
import random
random.seed(64)
l = []
while len(l) < 3:
    try:
        l = []
        for n in range(3):
            l.append(n)
            x = 1 / random.randint(0,1) # div by 0 error with 50% probability
    except ZeroDivisionError:
        print("error, trying again")
print(l)
which yields
error, trying again
error, trying again
error, trying again
error, trying again
error, trying again
error, trying again
error, trying again
[0, 1, 2]
The downside of this approach is that you potentially re-read the csv files quite often. But assuming this is fast and you can wait, it may be fine. At least you would have no manual work to do anymore.
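If losing up to a whole chunk is still too much, the same two ideas (checkpoint files plus catch-and-retry) can be combined at row level. The sketch below is only an illustration under some assumptions: each row's result can be serialised as JSON, a lost connection surfaces as ConnectionError (your database driver may raise something else), and run_with_resume, process_row and checkpoint_path are hypothetical names.

```python
import json
import os

def run_with_resume(rows, process_row, checkpoint_path, max_attempts=5):
    """Process rows in order, appending one JSON line per finished row.

    Rows already recorded in the checkpoint file are skipped, so a restart
    or a caught error repeats at most the row that was in progress.
    """
    results = []
    if os.path.isfile(checkpoint_path):
        with open(checkpoint_path) as f:
            results = [json.loads(line) for line in f if line.strip()]

    attempts = 0
    while len(results) < len(rows):
        try:
            with open(checkpoint_path, 'a') as f:
                for row in rows[len(results):]:
                    result = process_row(row)
                    f.write(json.dumps(result) + '\n')
                    f.flush()  # hand the line to the OS before moving on
                    results.append(result)
        except ConnectionError:  # assumed exception type
            attempts += 1
            if attempts >= max_attempts:
                raise
    return results
```

Because the number of lines in the checkpoint file equals the number of finished rows, nothing has to be tracked by hand. In a real setting you would probably also sleep or reconnect inside the except block before retrying.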

Python - Comparing each item in array into another ( Dynamo )

I'm new to programming and only started writing my first lines of code last week.
I'm writing a script in a program called Dynamo, to be used in my project. After some research, it appears that I need to use Python.
What I need the script to do is look at a bunch of lines (in a program called Revit), pick up the geometry of each line, and then detect whether any other line has a start point or end point that is in contact with that geometry. I then want to split the line at that point. This can be done with Curve.SplitByPoints, but I need some way to compare ALL lines to ALL start/end points, with the output in a form that can be used to split each curve at the point, i.e. the line together with the point at which to cut it.
I tried to explain that the best I could...
code :
import clr
clr.AddReference('ProtoGeometry')
from Autodesk.DesignScript.Geometry import *
dataEnteringNode = IN
Line = IN[0] # Line
LPS = IN[1] # Line Point Start
LPE = IN[2] # Line Point End
LPC = IN[3] # Line Point Combined // Maybe not needed
T = 100 # Tolerance of Intersection
INT1 = [] # Blank Variable for First Loop Results
INT2 = [] # Blank Variable for First Loop Results
result1 = [] # Blank Variable for Second Loop Results
result2 = [] # Blank Variable for Second Loop Results
for i in range(0, len(LPS)):
    distance = Curve.DistanceTo(LPS[i], Line[i])
    INT1.append(distance)
for i in range(0, len(LPE)):
    distance = Curve.DistanceTo(LPE[i], Line[i])
    INT2.append(distance)
for i in range(0, len(INT1)):
    if INT1[i] > T: # compare the i-th distance, not the whole list
        result1.append('T1')
    else:
        result1.append('F1')
for i in range(0, len(INT2)):
    if INT2[i] > T: # compare the i-th distance, not the whole list
        result2.append('T2')
    else:
        result2.append('F2')
# Assign your output to the OUT variable.
OUT = result1, result2
EDIT:
Sorry, I knew explaining this would be tricky for me.
I'll attempt to simplify it.
I want something like:
if curve intersects with StartPoint or EndPoint
    Curve.SplitByPoints(Curve, Intersecting_Point)
So I'm hoping for something similar, so that when a start or end point intersects a curve, the curve is split into 2 curves at that point.
I want the above to work on a range of lines. I drew a diagram and attempted to upload it, but for some reason it now says I need 10 rep to post an image, meaning I can't upload a new diagram and had to remove the ones I had in.
Thanks for the help! I'm sorry for my explanation skills.

If-statement seemingly ignored by Write operation

What I am trying to do here is write the latitude and longitude of a Pokémon sighting to a text file if it isn't already in there. Since I am using an infinite loop, I added an if-statement that prevents an already existing pair of coordinates from being added.
Note that I also have a list, Coordinates, that stores the same information. The list works, as no repeats are added (I checked). However, the text file has the same coordinates appended over and over again, even though it theoretically shouldn't, since the write is inside the same if-block as the list update.
import requests
pokemon_url = 'https://pogo.appx.hk/top'
while True:
    response = requests.get(pokemon_url)
    response.raise_for_status()
    pokemon = response.json()[0:]
    Sighting = 0
    Coordinates = [None] * 100
    for num in range(len(pokemon)):
        if pokemon[num]['pokemon_name'] == 'Aerodactyl':
            Lat = pokemon[num]['latitude']
            Long = pokemon[num]['longitude']
            if (Lat, Long) not in Coordinates:
                Coordinates[Sighting] = (Lat, Long)
                file = open("aerodactyl.txt", "a")
                file.write(str(Lat) + "," + str(Long) + "\n")
                file.close()
                Sighting += 1
For clarity purposes, this is the output

You need to put your Sighting and Coordinates variables outside of the while loop if you do not want them to reset on every iteration.
However, there are a lot more things wrong with the code. Without trying it, here's what I spot:
You have no exit condition for the while loop. Please don't do this to the poor website. You'll essentially be spamming requests.
file.close should be file.close(), but overall you should only need to open the file once, not on every single iteration of the loop. Open it once, and close once you're done (assuming you will add an exit condition).
Slicing from 0 (response.json()[0:]) is unnecessary. By default the list starts at index 0. This may be a convoluted way to get a new list, but that seems unnecessary here.
Coordinates should not be a hard-coded list of 100 Nones. Just use a set to track existing coordinates.
Get rid of Sighting altogether. It doesn't make sense if you're re-issuing the request over and over again. If you want to iterate through the pokémon from one response, use enumerate if you need the index.
It's generally good practice to use snake case for Python variables.
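To make those points concrete, here is a rough sketch of the de-duplication step on its own. It assumes the JSON response is a list of dicts with the pokemon_name, latitude and longitude keys used in the question. The function name record_new_sightings and its parameters are invented for this example. The set and the open file are created once by the caller and passed in, so they survive across polls.

```python
def record_new_sightings(pokemon, seen, out_file, name='Aerodactyl'):
    """Write coordinates of not-yet-seen sightings of `name` to out_file.

    `seen` is a set owned by the caller, so it is not reset between polls.
    Returns the number of new sightings written.
    """
    new_count = 0
    for entry in pokemon:
        if entry['pokemon_name'] != name:
            continue
        coords = (entry['latitude'], entry['longitude'])
        if coords not in seen:
            seen.add(coords)
            out_file.write(f"{coords[0]},{coords[1]}\n")
            new_count += 1
    return new_count
```

A caller would create seen = set() and open aerodactyl.txt once, then call this function on response.json() inside a loop that has an exit condition and a pause between requests.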

Try this:
#!/usr/bin/env python
import urllib2  # Python 2 module; in Python 3 use urllib.request instead
import json
pokemon_url = 'https://pogo.appx.hk/top'
pokemon = urllib2.urlopen(pokemon_url)
pokeset = json.load(pokemon)
Coordinates = [None] * 100
for num in range(len(pokeset)):
    if pokeset[num]['pokemon_name'] == 'Aerodactyl':
        Lat = pokeset[num]['latitude']
        Long = pokeset[num]['longitude']
        if (Lat, Long) not in Coordinates:
            Coordinates.append((Lat, Long))
            file = open("aerodactyl.txt", "a")
            file.write(str(Lat) + "," + str(Long) + "\n")
            file.close()

get playing wav audio level as output

I want to make a speaking mouth which moves or emits light or something when a playing wav file emits sound. So I need to detect when a wav file is speaking or when it is in a silence between words. Currently I'm using a pygame script that I have found
import pygame
pygame.mixer.init()
pygame.mixer.music.load("my_sentence.wav")
pygame.mixer.music.play()
while pygame.mixer.music.get_busy() == True:
    continue
I guess I could add a check inside the while loop that looks at the sound's output level, or something like that, and then sends it to one of the GPIO outputs. But I don't know how to achieve that.
Any help would be much appreciated

You'll need to inspect the WAV file to work out when the voice is present. The simplest way to do this is look for loud and quiet periods. Because sound works with waves, when it's quiet the values in the wave file won't change very much, and when it's loud they'll be changing a lot.
One way of estimating loudness is the variance. As the article on variance shows, this can be defined as E[(X - mu)^2], which could be written average((X - average(X))^2). Here, X is the value of the signal at a given point (the values stored in the WAV file, called sample in the code). If it's changing a lot, the variance will be large.
This would let you calculate the loudness of an entire file. However, you want to track how loud the file is at any given time, which means you need a form of moving average. An easy way to get this is with a first-order low-pass filter.
I haven't tested the code below so it's extremely unlikely to work, but it should get you started. It loads the WAV file, uses low-pass filters to track the mean and variance, and works out when the variance goes above and below a certain threshold. Then, while playing the WAV file it keeps track of the time since it started playing, and prints out whether the WAV file is loud or quiet.
Here's what you might still need to do:
Fix all my deliberate mistakes in the code
Add something useful to react to the loud/quiet changes
Change the threshold and reaction_time to get good results with your audio
Add some hysteresis (a variable threshold) to stop the light flickering
I hope this helps!
import wave
import struct
import time
import pygame

def get_loud_times(wav_path, threshold=10000, time_constant=0.1):
    '''Work out which parts of a WAV file are loud.
    - threshold: the variance threshold that is considered loud
    - time_constant: the approximate reaction time in seconds'''
    wav = wave.open(wav_path, 'r')
    length = wav.getnframes()
    samplerate = wav.getframerate()
    assert wav.getnchannels() == 1, 'wav must be mono'
    assert wav.getsampwidth() == 2, 'wav must be 16-bit'
    # Our result will be a list of (time, is_loud) giving the times
    # when the audio switches from loud to quiet and back.
    is_loud = False
    result = [(0., is_loud)]
    # The following values track the mean and variance of the signal.
    # When the variance is large, the audio is loud.
    mean = 0
    variance = 0
    # If alpha is small, mean and variance change slower but are less noisy.
    alpha = 1 / (time_constant * float(samplerate))
    for i in range(length):
        sample_time = float(i) / samplerate
        # unpack returns a 1-tuple; the trailing comma extracts the value
        sample, = struct.unpack('<h', wav.readframes(1))
        # mean is the average value of sample
        mean = (1-alpha) * mean + alpha * sample
        # variance is the average value of (sample - mean) ** 2
        variance = (1-alpha) * variance + alpha * (sample - mean) ** 2
        # check if we're loud, and record the time if this changes
        new_is_loud = variance > threshold
        if is_loud != new_is_loud:
            result.append((sample_time, new_is_loud))
        is_loud = new_is_loud
    return result

def play_sentence(wav_path):
    loud_times = get_loud_times(wav_path)
    pygame.mixer.music.load(wav_path)
    start_time = time.time()
    pygame.mixer.music.play()
    for (t, is_loud) in loud_times:
        # wait until the time described by this entry
        sleep_time = start_time + t - time.time()
        if sleep_time > 0:
            time.sleep(sleep_time)
        # do whatever
        print('loud' if is_loud else 'quiet')
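For the hysteresis item in the to-do list, one possible approach is to use two thresholds instead of one: switch to loud only above the higher one and back to quiet only below the lower one. The sketch below is untested against real audio. The function name and both threshold parameters are my own invention, and it works on a plain sequence of variance values rather than on the WAV file directly.

```python
def loud_flags_with_hysteresis(variances, on_threshold, off_threshold):
    """Turn a sequence of variance values into loud/quiet flags.

    The state becomes loud only when a value exceeds on_threshold and
    becomes quiet again only when a value drops below off_threshold
    (with off_threshold < on_threshold), so values hovering between
    the two thresholds do not make the output flicker.
    """
    is_loud = False
    flags = []
    for v in variances:
        if is_loud:
            if v < off_threshold:
                is_loud = False
        elif v > on_threshold:
            is_loud = True
        flags.append(is_loud)
    return flags
```

Inside get_loud_times the same two-branch test would replace the single comparison against threshold. How far apart to set the two thresholds depends on your audio.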

Trialhandler and time measuring in psychopy

For a go-NoGo Task I want to organize pictures with the data.TrialHandler class from psychopy:
trials = data.TrialHandler(ImageList, nReps=1, method='random')
Now I want to write a loop in which PsychoPy goes through the dictionary, presents the first set of pictures (e.g. A_n), and then moves on to the second set, and so on through the sixth set. I tried the following:
import glob, os, random, sys, time
import numpy.random as rnd
from psychopy import visual, core, event, monitors, gui, logging, data
im_a = glob.glob('./a*') # upload pictures of the a-type, gives out a List of .jpg-files
im_n = glob.glob('./n*') # n-type
im_e = glob.glob('./e*') # e-type
# combining Lists of Pictures
A_n = im_a + im_n
N_a = im_n + im_a
A_e = im_a + im_e
E_a = im_e + im_a
E_n = im_e + im_n
N_e = im_n + im_e
# making a Dictionary of Pictures and Conditions
PicList = [A_n, N_a, A_e, E_a, E_n, N_e] # just the six Combinations
CondList = [im_a,im_n,im_a,im_e,im_e,im_n] # images that are in the GO-Condition
ImageList = []
for imagelist, condition in zip(PicList, CondList):
    ImageList.append({'imagelist':imagelist,'condition':condition}) # to associate the picturelist with the GO Conditionlist
For the header I asked a separate question: Combining and associating multiple dictionaries
# Set Experiment
win = visual.Window(color='white',units='pix', fullscr=False)
fixCross=visual.TextStim(win,text='+',color='black',pos=(0.0,0.0), height=40)
corrFb = visual.TextStim(win,text='O',height=40,color='green',pos=[0,0])
incorrFb = visual.TextStim(win,text='X',height=40, color='red',pos=[0,0])
# Start Experiment
trials = data.TrialHandler(ImageList, nReps=1, method='random')
rt_clock = core.Clock()
bitmap = visual.ImageStim(win)
for liste in ImageList[0:5]: # to loop through all 6 conditions
    keys = []
    for i,Pictures in enumerate(liste): # to loop through all pictures in each condition
        bitmap.setImage(Pictures) # attribute error occurs, not if I use Pictures[0][0], even though in this case every picture is the same
        bitmap.draw()
        win.flip()
        rt_clock.reset()
        resp = False
        while rt_clock.getTime() < 2.0: # time limit is defined as 2 s
            if not resp:
                resp = event.getKeys(keyList=['space'])
                rt = rt_clock.getTime()
        if bool(resp) is (Pictures in CondList): # at this point python should have access to the Dictionary in which the associated GO Pictures are saved
            corrFb.draw()
            accu=1 # doesn't work yet
        else:
            incorrFb.draw()
            accu=0
        win.flip()
        core.wait(0.5)
        trials.addData('rt_'+str(i), rt) # is working well when loop: if trial in trials: ...; in this case TrialHandler is not used, therefore trials.addData is not working
        trials.addData('accu_'+str(i), accu)
trials.saveAsExcel(datanames)
core.quit()
There are a few problems in this code. First, it presents only one picture six times rather than six different pictures [1].
Second, a totally different problem [2] is the time measuring and the saving of the accuracy, which the TrialHandler does, but only once per trial. So it adds up all the RTs for each trial. I want to get the RT for each image. I tried a few things, like an extra TrialHandler for the stimuli and an extra loop at the end, which gives me the last RT but not each one. --> is answered below!!!
for stimuli in stimulus: stimulus.addData('rt', rt)
I know these four questions are a lot for one post, but maybe somebody can give me some good ideas for how to solve them... Thanks everybody!

The reason for your problem labelled [1] is that you set the image to PicList[0][0], which never changes. As Mike is suggesting above, you need:
for i,thisPic in enumerate(PicList):
    bitmap.setImage(thisPic) #not PicList[0][0]
But maybe you need to go back to basics so that you actually use the trial handler to handle your trials ;-)
Create a single list of dictionaries where one dictionary represents one trial, and then run through those in order (tell the TrialHandler to use the list 'sequential' rather than 'random'). So the loops that you're currently using should just be to create your list of condition dicts, not to run the trials. Then pass that one list to the trial handler:
trials = data.TrialHandler(trialList=myCondsListInOrder, nReps=1, method='sequential')
for thisTrial in trials:
    pic = thisTrial['pic']
    stim.setImage(pic)
    ...
    trials.addData('rt', rt)
    trials.addData('acc',acc)
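As a rough sketch of how that single list of trial dictionaries might be built from the picture lists in the question: the helper build_trial_list and the 'block' and 'is_go' keys are hypothetical names chosen for illustration, and only the 'pic' key matches the loop above. It is plain Python and does not touch PsychoPy itself.

```python
def build_trial_list(pic_lists, go_lists):
    """Flatten per-block picture lists into one list of trial dicts.

    pic_lists: one list of image paths per block (e.g. A_n, N_a, ...)
    go_lists:  for each block, the images that count as GO trials
    """
    trials = []
    for block, (pics, go_pics) in enumerate(zip(pic_lists, go_lists)):
        go_set = set(go_pics)  # fast membership test
        for pic in pics:
            trials.append({'block': block, 'pic': pic, 'is_go': pic in go_set})
    return trials
```

With the question's variables this would be called as build_trial_list(PicList, CondList), and the result passed to the TrialHandler with method='sequential'. Each picture then gets its own row in the output, so one 'rt' and one 'acc' column are enough.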
Also, I would output your data not using the Excel format, but the 'long wide' format:
trials.saveAsWideText('mydataFile.csv')
best wishes,
Jon

(A) This isn't relevant to your question but will improve performance.
The line:
bitmap = visual.ImageStim(win)
Shouldn't occur within the loop, i.e. you should initialise each stimulus only once; then within a loop you just update the properties of that stimulus, e.g. with bitmap.setImage(…). So shift this initialisation line to the top, where you create the TextStims.
(B) [deleted: I hadn't paid attention to the first code block.]
(C)
bitmap.draw(pictures)
This line doesn't take any arguments. It should just be bitmap.draw(). And anyway, it isn't clear what 'pictures' refers to. Remember that Python is case sensitive. This isn't the same thing as 'Pictures' defined in the loop above. I'm guessing that you want to update what picture is being shown? In that case, then you need to be doing the bitmap.setImage(…) line within this loop, not above, where you will always be drawing a fixed picture as that is the only one that gets set on each trial.
(D) Re the RTs, you are saving this only once per trial (check the indentation). If you want to save one per image, you need to indent these lines again. Also, you only get one line per trial in the data output. If you want to record multiple RTs per trial, you will need to give them unique names, e.g. rt_1, rt_2, …, rt_6 so they each appear in a separate column. e.g. you could use an enumerator for this loop:
for i, picture in enumerate(PicList):
    # lots of code
    # then save data:
    trials.addData('rt_'+str(i), rt)
