Picking up where program left off after error encountered - python

I am running a program row-wise on a pandas dataframe that takes a long time to run.
The problem is, the VPN connection to the database can suddenly be lost, so I lose all my progress.
Currently, what I am doing is splitting the large dataframe into smaller chunks (500 rows at a time), and running the program on each chunk in a for loop. The result of the processing of each chunk is saved to my hard drive.
However, the chunks are still 500 rows each, so I can still lose a lot of progress when the connection is lost. Plus, I have to manually check to see where I got up to and adjust the code to pick up where the connection was lost.
What is the best way to write the code to "remember" which row the program is up to and pick up exactly where it left off once I re-establish the connection?
Current:

size = 500
list_of_dfs = np.split(large_df, range(size, len(large_df), size))

together_list = []
for count, chunk in enumerate(list_of_dfs):
    # Process
    chunk_processed = process_chunk(chunk)
    chunk_processed.to_csv(f"processed_{count}.csv")
    together_list.append(chunk_processed)

# merge lists together into one df
all_chunks_together = pd.concat(together_list)
Thanks in advance

You could use the existing CSV files to remember where to pick up:

import os

size = 500
list_of_dfs = np.split(large_df, range(size, len(large_df), size))

together_list = []
for count, chunk in enumerate(list_of_dfs):
    csv_file = f"processed_{count}.csv"
    if os.path.isfile(csv_file):
        # chunk was already processed in an earlier run: reload it from disk
        chunk_processed = pd.read_csv(csv_file)
    else:
        chunk_processed = process_chunk(chunk)
        chunk_processed.to_csv(csv_file)
    together_list.append(chunk_processed)

# merge lists together into one df
all_chunks_together = pd.concat(together_list)
You would still have to restart your program manually every time it loses the connection. To avoid this, you could catch the exception (assuming you get one on connection loss) and continue, like in this example:
import random

random.seed(64)

l = []
while len(l) < 3:
    try:
        l = []
        for n in range(3):
            l.append(n)
            x = 1 / random.randint(0, 1)  # div by 0 error with 50% probability
    except:
        print("error, trying again")

print(l)
which yields
error, trying again
error, trying again
error, trying again
error, trying again
error, trying again
error, trying again
error, trying again
[0, 1, 2]
The downside of this approach is that you potentially re-read the CSV files quite often. But assuming that is fast and you can wait, it should be fine. At least you would no longer have any manual work to do.
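Putting both ideas together, a rough sketch might look like the following. It assumes process_chunk raises an exception when the connection drops; the 30-second retry delay is arbitrary, and the broad except should ideally be narrowed to the specific connection error you actually see:

import os
import time

size = 500
list_of_dfs = np.split(large_df, range(size, len(large_df), size))

together_list = []
for count, chunk in enumerate(list_of_dfs):
    csv_file = f"processed_{count}.csv"
    if os.path.isfile(csv_file):
        # this chunk survived a previous run: just reload it
        together_list.append(pd.read_csv(csv_file, index_col=0))
        continue
    while True:
        try:
            chunk_processed = process_chunk(chunk)
            break
        except Exception as e:
            print(f"chunk {count} failed ({e}), retrying in 30 seconds")
            time.sleep(30)  # give the VPN time to come back
    chunk_processed.to_csv(csv_file)
    together_list.append(chunk_processed)

all_chunks_together = pd.concat(together_list)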

Related

Why does my for loop with if else clause run so slow?

TL;DR:
I'm trying to understand why the below for loop is incredibly slow, taking hours to run on a dataset of 160K entries.
I have a working solution using a function and .apply(), but I want to understand why my homegrown solution is so bad. I'm obviously a huge beginner with Python:
popular_or_not = []
counter = 0
for id in df['id']:
    if df['popularity'][df['id'] == id].values == 0:
        popular_or_not.append(0)
    else:
        popular_or_not.append(1)
    counter += 1

df['popular_or_not'] = popular_or_not
df
In more detail:
I'm currently learning Python for data science, and I'm looking at this dataset on Kaggle: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks
I'm interested in predicting/modelling the popularity score. It is not normally distributed:
plt.bar(df['popularity'].value_counts().index, df['popularity'].value_counts().values)
I would like to add a column, to say whether a track is popular or not, with popular tracks being those that get a score of 5 and above and unpopular being the others.
I have tried the following solution, but it runs incredibly slowly, and I'm not sure why. It runs fine on a very small subset, but would take a few hours to run on the full dataset:
popular_or_not = []
counter = 0
for id in df['id']:
    if df['popularity'][df['id'] == id].values == 0:
        popular_or_not.append(0)
    else:
        popular_or_not.append(1)
    counter += 1

df['popular_or_not'] = popular_or_not
df
This alternative solution works fine:
def check_popularity(score):
    if score > 5:
        return 1
    else:
        #pdb.set_trace()
        return 0

df['popularity'].apply(check_popularity).value_counts()
df['popular_or_not'] = df['popularity'].apply(check_popularity)
I think understanding why my first solution doesn't work might be an important part of my Python learning.
Thanks everyone for your comments. I'm going to summarize them below as an answer to my question, but please feel free to jump in if anything is incorrect:
The reason my initial for loop was so slow is that df['popularity'][df['id'] == id] evaluates df['id'] == id, a comparison against the entire 160k-row column, once for every single row. That is roughly 160k x 160k comparisons in total, which is why it takes hours.
For this type of operation, instead of iterating over a pandas dataframe thousands of times, it's a good idea to think of applying vectorization - a set of tools and methods for processing a whole column in a single operation at C speed. This is what I did with the following code:
def check_popularity(score):
    if score > 5:
        return 1
    else:
        #pdb.set_trace()
        return 0

df['popularity'].apply(check_popularity).value_counts()
df['popular_or_not'] = df['popularity'].apply(check_popularity)
By using .apply with a pre-defined function, I get the same result, but in seconds instead of hours.
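For completeness, the same column can also be built without .apply, using a fully vectorized comparison on the whole Series; a minimal sketch using the same threshold as check_popularity:

import numpy as np

# boolean mask over the whole column, cast to 0/1
df['popular_or_not'] = (df['popularity'] > 5).astype(int)

# equivalent alternative
df['popular_or_not'] = np.where(df['popularity'] > 5, 1, 0)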

How to speed up execute_async insertion to Cassandra using the Python Driver

I'm attempting to load data into Cassandra using the python driver. The fastest I've been able to get is around 6k writes/second. My csv that I'm reading from has around 1.15 million rows leading to an overall insertion time of around 3 minutes and 10 seconds. I really need to get this time down to 2 minutes or less to keep up with data as it comes in.
My data consists of 1.15 million rows with 52 columns.
Currently I'm using the session.execute_async function to insert the data. Changing how many async requests I allow at one time does seem to speed it up. It seems that blocking after about 5-6k requests leads to the fastest insertion rate.
I did attempt batch inserts but they were abysmally slow.
Here is my current method for inserting the data into Cassandra.
# insert data into cassandra table
# imports for the driver objects used below
from cassandra import ConsistencyLevel
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster, ExecutionProfile
from cassandra.query import SimpleStatement

execution_profile = ExecutionProfile(request_timeout=10)
profiles = {'node1': execution_profile}

auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
cluster = Cluster(['11.111.11.11'], 9042, auth_provider=auth_provider, execution_profiles=profiles)
session = cluster.connect()  # connect to your keyspace

# Read csv rows into cassandra
count = 0
futures = []
with open('/massaged.csv') as f:
    next(f)  # skip the header row
    for line in f:
        query = SimpleStatement("INSERT INTO hrrr.hrrr_18hr( loc_id,utc,sfc_vis,sfc_gust,sfc_pres,sfc_hgt,sfc_tmp,sfc_snow_0Xacc,sfc_cnwat,sfc_weasd,sfc_snowc,sfc_snod,two_m_tmp,two_m_pot,two_m_spfh,two_m_dpt,two_m_rh,ten_m_ugrd,ten_m_vgrd,ten_m_wind_1hr_max,ten_m_maxuw_1hr_max,ten_m_maxvw_1hr_max,sfc_cpofp,sfc_prate,sfc_apcp_0Xacc,sfc_weasd_0Xacc,sfc_frozr_0Xacc,sfc_frzr_0Xacc,sfc_ssrun_1hr_acc,sfc_bgrun_1hr_acc,sfc_apcp_1hr_acc,sfc_weasd_1hr_acc,sfc_frozr_1hr_acc,sfc_csnow,sfc_cicep,sfc_cfrzr,sfc_crain,sfc_sfcr,sfc_fricv,sfc_shtfl,sfc_lhtfl,sfc_gflux,sfc_vgtyp,sfc_cape,sfc_cin,sfc_dswrf,sfc_dlwrf,sfc_uswrf,sfc_ulwrf,sfc_vbdsf,sfc_vddsf,sfc_hpbl) VALUES (%s)" % (line), consistency_level=ConsistencyLevel.ONE)
        futures.append(session.execute_async(query, execution_profile='node1'))
        count += 1
        if count % 5000 == 0:
            for f in futures:
                f.result()  # blocks until remaining inserts are completed.
            futures = []
            print("rows processed: " + str(count))

# Catch any remaining async requests that haven't finished
for f in futures:
    f.result()  # blocks until remaining inserts are completed.
print("rows processed: " + str(count))
I need to get my insertion time down to around 2 minutes or less (roughly 10K insertions per second). Should I be using multiprocessing to achieve this or am I using the execute_async function incorrectly?
UPDATE
As per Alex's suggestion, I attempted to implement a prepared statement. This is what I came up with, but it seems to be significantly slower. Any thoughts on what I've done wrong?
hrrr_prepared = session.prepare("INSERT INTO hrrr.hrrr_18hr( loc_id,utc,...,sfc_hpbl) VALUES (?, ..., ?)")

for row in range(0, len(data)):
    futures.append(session.execute_async(hrrr_prepared, tuple(data.iloc[row])))
    count += 1
    if count % 5000 == 0:
        for f in futures:
            f.result()  # blocks until remaining inserts are completed.
        futures = []
        print("rows processed: " + str(count))
NOTE: I put the "..." in the prepared statement for readability, the actual code does not have that.
The big speedup should come from using prepared statements instead of SimpleStatement: a prepared statement is parsed only once (outside of the loop), and afterwards only the data is sent to the server together with the query ID. With SimpleStatement, the query is parsed every time.
Also, you can potentially improve throughput if you don't wait for all futures to complete, but instead use some kind of "counting semaphore" that stops you from exceeding a maximum number of "in-flight" requests while letting you send a new request as soon as an earlier one finishes. I'm not a Python expert, so I can't say exactly how to do this, but you can look at the Java implementation to understand the idea.
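In Python, that counting-semaphore idea can be sketched with threading.Semaphore and the driver's future callbacks, roughly like this, reusing hrrr_prepared and data from the question's update (the in-flight limit is illustrative and would need tuning against your cluster):

from threading import Semaphore

MAX_IN_FLIGHT = 5000
semaphore = Semaphore(MAX_IN_FLIGHT)

def on_success(result):
    semaphore.release()

def on_error(exc):
    semaphore.release()
    print("insert failed: " + str(exc))

for row in range(len(data)):
    semaphore.acquire()  # blocks only when MAX_IN_FLIGHT requests are already pending
    future = session.execute_async(hrrr_prepared, tuple(data.iloc[row]))
    future.add_callbacks(on_success, on_error)

# drain the remaining in-flight requests before shutting down
for _ in range(MAX_IN_FLIGHT):
    semaphore.acquire()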

My shuffle algorithm crashes when more trials than objects

I am trying to make an experiment where a folder is scanned for images. For each trial, a target is shown and some (7) distractor images. Afterward, in half the trials people are shown the target image and in the other half, they are shown an image that wasn't in the previous display.
My current code sort of works, but only if there are fewer trials than objects:
repeats = 20

# Scan dir for images
jpgs = []
for path, dirs, files in os.walk(directory):
    for f in files:
        if f.endswith('.jpg'):
            jpgs.append(f)

# Shuffle up jpgs
np.random.shuffle(jpgs)

# Create list with target and probe object, half random, half identical
display = []
question = []
sameobject = []
position = np.repeat([0,1,2,3,4,5,6,7], repeats)
for x in range(1, (repeats*8)+1):
    display.append(jpgs[x])
    if x % 2 == 0:
        question.append(jpgs[-x])
        sameobject.append(0)
    else:
        question.append(jpgs[x])
        sameobject.append(1)

# Concatenate objects together
together = np.c_[display, question, position, sameobject]
np.random.shuffle(together)

for x in together:
    # Shuffle and set image
    np.random.shuffle(jpgs)
    myList = [i for i in jpgs if i != together[trial,0]]
    myList = [i for i in myList if i != together[trial,1]]
    # Set correct image for target
    myList[int(together[trial,2])] = together[trial,0]
First of all, I am aware that this is horrible code. But it gets the job done coarsely. With 200 jpgs and a repeat of 20, it works. If repeat is set to 30 it crashes.
Here is an example with repeat too high:
File "H:\Code\Stims\BetaObjectPosition.py", line 214, in <module>
display.append(jpgs[x])
IndexError: list index out of range
Is there a way to update my code in a way that allows more trials while all objects are used as evenly as possible (one object should not be displayed 3 times while another is displayed 0) over an entire experiment?
Full, reproducible example
Bonus points if anyone can see an obvious way to balance the way the 7 distractor images are selected too.
Thanks for taking your time to read this. I hope you can help me onwards.
The solution that changes your code the least is to change each use of jpgs[x] to jpgs[x % len(jpgs)] [1]. This should get rid of the IndexError; it basically wraps the list index "around the edges", making sure it is never too large. Although I'm not sure how it will interact with the jpgs[-x] call.
An alternative would be to implement a class that produces a longer sequence of objects from a shorter one.
Example:
from random import shuffle

class InfiniteRepeatingSequence(object):

    def __init__(self, source_list):
        self._source = source_list
        self._current = []

    def next(self):
        if len(self._current) == 0:
            # copy the source
            self._current = self._source[:]
            shuffle(self._current)
        # get and remove an item from a list
        return self._current.pop()
This class repeats the list indefinitely. It makes sure to use each element once before re-using the list.
It can easily be turned into an iterator (try changing next to __next__). But be careful since the class above produces an infinite sequence of elements.
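For illustration, a minimal usage sketch with the jpgs list and repeats value from the question (picking the matching probe/question images is left out):

jpg_source = InfiniteRepeatingSequence(jpgs)

display = []
for x in range(repeats * 8):
    # never runs off the end of the list, and uses every image once before repeating any
    display.append(jpg_source.next())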
[1] See "How does % work in Python?" for an explanation of the modulo operator.
Edit: Added link to modulo question.

Collecting large amounts of data efficiently

I have a program that creates a solar system, integrates until a close encounter between adjacent planets occurs (or until 10e+9 years), then writes two data points to a file. The try/except acts as a flag for when planets get too close. This process is repeated 16,000 times. This is all done by importing the module REBOUND, which is a software package that integrates the motion of particles under the influence of gravity.
for i in range(0, 16000):

    def P_dist(p1, p2):
        x = sim.particles[p1].x - sim.particles[p2].x
        y = sim.particles[p1].y - sim.particles[p2].y
        z = sim.particles[p1].z - sim.particles[p2].z
        dist = np.sqrt(x**2 + y**2 + z**2)
        return dist

    init_periods = [sim.particles[1].P, sim.particles[2].P, sim.particles[3].P, sim.particles[4].P, sim.particles[5].P]

    try:
        sim.integrate(10e+9*2*np.pi)
    except rebound.Encounter as error:
        print(error)
        print(sim.t)
        for j in range(len(init_periods)-1):
            distance = P_dist(j, j+1)
            print(j, ":", j+1, '=', distance)
            if distance <= .01:  # returns the period ratio of the two planets that had the close encounter and the inner orbital period between the two
                p_r = init_periods[j+1]/init_periods[j]
                with open('good.txt', 'a') as data:  # opens a file writing the x & y values for the graph
                    data.write(str(math.log10(sim.t/init_periods[j])))
                    data.write('\n')
                    data.write(str(p_r))
                    data.write('\n')
Whether or not there is a close encounter depends mostly on a random value I have assigned, and that random value also controls how long a simulation can run. For instance, I chose the random value to have a maximum of 9.99, and a close encounter happened at approximately 11e+8 years (which took approximately 14 hours). The random values range from 2-10, and close encounters happen more often on the lower side. On every iteration where a close encounter occurs, my code writes to the file, which I believe may be taking up a lot of simulation time. Since the majority of my simulation time is taken up by trying to locate close encounters, I'd like to shed some time by finding a way to collect the data needed without having to append to the file every iteration.
Since I'm attempting to plot the data collected from this simulation, would creating two arrays and outputting data into those be faster? Or is there a way to only have to write to a file once, when all 16000 iterations are complete?
sim is a variable holding all of the information about the solar system.
This is not the full code, I left out the part where I created the solar system.
count = 0
data = open('good.txt', 'a+')
....
if distance <= .01:
    count += 1
    while(count <= 4570)
        data.write(~~~~~~~)
....
data.close()
The problem isn't that you write every time you find a close encounter. It's that, for each encounter, you open the file, write one output record, and close the file. All the opening and appending is slow. Try this, instead: open the file once, and do only one write per record.
# Near the top of the program
data = open('good.txt', 'a')
...
if distance <= .01:  # returns the period ratio of the two planets that had the close encounter and the inner orbital period between the two
    # Write one output record
    p_r = init_periods[j+1]/init_periods[j]
    data.write(str(math.log10(sim.t/init_periods[j])) + '\n' +
               str(p_r) + '\n')
...
data.close()
This should work well, as writes will get buffered, and will often run in parallel with the next computation.
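Alternatively, if you would rather touch the file only once, a rough sketch (keeping the same two-lines-per-encounter format) is to collect the pairs in a list during the 16,000 iterations and write them all after the loop:

results = []  # collected (log10(t / P_inner), period_ratio) pairs

# ... inside the close-encounter check ...
if distance <= .01:
    p_r = init_periods[j+1]/init_periods[j]
    results.append((math.log10(sim.t/init_periods[j]), p_r))

# after all 16000 iterations are finished
with open('good.txt', 'w') as data:
    for x_val, y_val in results:
        data.write(str(x_val) + '\n' + str(y_val) + '\n')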

Tracking how many elements processed in generator

I have a problem in which I process documents from files using Python generators. The number of files I need to process is not known in advance. Each file contains records which consume a considerable amount of memory. Because of that, generators are used to process the records. Here is a summary of the code I am working on:
def process_all_records(files):
    for f in files:
        fd = open(f, 'r')
        recs = read_records(fd)
        recs_p = (process_records(r) for r in recs)
        write_records(recs_p)
My process_records function checks the content of each record and only returns the records which have a specific sender. My problem is the following: I want to have a count of the number of elements being returned by read_records. I have been keeping track of the number of records in the process_records function using a list:
def process_records(r):
    if r.sender('sender_of_interest'):
        records_list.append(1)
    else:
        records_list.append(0)
    ...
The problem with this approach is that records_list could grow without bounds depending on the input. I want to be able to consume the content of records_list once it grows to a certain point and then restart the process. For example, after 20 records have been processed, I want to find out how many records are from 'sender_of_interest' and how many are from other sources, and then empty the list. Can I do this without using a lock?
You could make your generator a class with an attribute that contains a count of the number of records it has processed. Something like this:
class RecordProcessor(object):
    def __init__(self, recs):
        self.recs = recs
        self.processed_rec_count = 0

    def __iter__(self):
        # generator that makes the instance iterable, so it can be passed straight to write_records
        for r in self.recs:
            if r.sender('sender_of_interest'):
                self.processed_rec_count += 1
            # process record r...
            yield r  # processed record

def process_all_records(files):
    for f in files:
        fd = open(f, 'r')
        recs_p = RecordProcessor(read_records(fd))
        write_records(recs_p)
        print('records processed:', recs_p.processed_rec_count)
Here's the straightforward approach. Is there some reason why something this simple won't work for you?
seen = 0
matched = 0

def process_records(r):
    global seen, matched
    seen = seen + 1
    if r.sender('sender_of_interest'):
        matched = matched + 1
        records_list.append(1)
    else:
        records_list.append(0)
    if seen > 1000 or someOtherTimeBasedCriteria:
        print("%d of %d total records had the sender of interest" % (matched, seen))
        seen = 0
        matched = 0
If you have the ability to close your stream of messages and re-open them, you might want one more total seen variable, so that if you had to close that stream and re-open it later, you could go to the last record you processed and pick up there.
In this code, someOtherTimeBasedCriteria might be a timestamp check. You can record the current time in milliseconds when you begin processing, and then if the current time is more than 20,000 ms (20 seconds) later, reset the seen/matched counters.
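A variation that avoids globals entirely is to do the counting in a small wrapper generator around read_records and report every N records; a sketch (the wrapper name, report interval, and message are illustrative):

def count_and_report(records, report_every=1000):
    seen = 0
    matched = 0
    for r in records:
        seen += 1
        if r.sender('sender_of_interest'):
            matched += 1
        if seen % report_every == 0:
            print("%d of %d records so far had the sender of interest" % (matched, seen))
        yield r

def process_all_records(files):
    for f in files:
        fd = open(f, 'r')
        recs = count_and_report(read_records(fd))
        recs_p = (process_records(r) for r in recs)
        write_records(recs_p)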
