How to estimate dataframe real size in pyspark? - python

How to determine a dataframe size?
Right now I estimate the real size of a dataframe as follows:
headers_size = key for key in df.first().asDict()
rows_size = df.map(lambda row: len(value for key, value in row.asDict()).sum()
total_size = headers_size + rows_size
It is too slow and I'm looking for a better way.

Currently I am using the below approach, but not sure if this is the best way:
df.persist(StorageLevel.Memory)
df.count()
On the spark-web UI under the Storage tab you can check the size which is displayed in MB's and then I do unpersist to clear the memory:
df.unpersist()

nice post from Tamas Szuromi http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
def _to_java_object_rdd(rdd):
""" Return a JavaRDD of Object by unpickling
It will convert each Python object into Java object by Pyrolite, whenever the
RDD is serialized in batch or not.
"""
rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
JavaObj = _to_java_object_rdd(df.rdd)
nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)

Related

How to apply dask method to apply functions on files in list?

first of all, thanks for this community and all advice we can retrieve, it's really appreciate!
This is my first venture into parallel processing and I have been looking into Dask by my own but I am having trouble actually coding it... to be honest I am really lost
In on of my project, I want to trigger URL and retrieve observations data (meteorological station) from xml files.
For each URL, I run some different process in order to: retreive data from URL, parsing XML information to dataframe, apply a filter and store data in MySQL database.
So i need to loop these process over thousands of URL (station)...
I wrote a sequential code , and it take 300s to finish computation which is really to long and not efficient.
As we are applying the same process for each station, I think I can speed-up all the computations, but I don't know where to start. I used delayed from dask but I don't think it's the best approach.
This is my code so far:
First I have some functions.
def xml_to_dataframe(ood_xml):
tmp_file = wget.download(ood_xml)
prstree = ETree.parse(tmp_file)
root = prstree.getroot()
################ Section to retrieve data for one station and apply parameter
all_obs = []
for obs in root.iter('observations'):
ood_observation = []
for n, param in enumerate(list_parameters):
x=obs.find(variable_to_check).text
ood_observation.append(x)
all_obs.append(ood_observation)
return(pd.DataFrame(all_obs, columns=list_parameters))
def filter_criteria(df,threshold,criteria):
if criteria in df.columns:
result = []
for index, row in df.iterrows():
if pd.to_numeric(row[criteria],errors='coerce') >= threshold:
result.append(index)
return result
else:
#print(criteria + ' parameter does not exist for this station !!! ')
return([])
def get_and_filter_data(filename,criteria,threshold):
try:
xmlToDf = xml_to_dataframe(filename)
final_df = xmlToDf.loc[filter_criteria(xmlToDf,threshold,criteria)]
some msql connection and instructions....
except:
pass
and then the main code I want to parallelise:
criteria = 'temperature'
threshold = 22
filenames =[url1.html, url2.html, url3.html]
for file in filenames:
get_and_filter_data(file,criteria,threshold)
Do you have any advice to do it ?
Many thanks for your help !
Guillaume
Not 100% sure this is what you are after, but one way is via delayed:
from dask import delayed, compute
delayeds = [delayed(get_and_filter_data)(file,criteria,threshold) for file in filenames]
results = compute(delayeds)

How to save results from an API call that uses a pandas column for the requests before the whole thing times out when using apply?

I have a pandas dataframe with strings that I'm using to query an API and return the results.
I'm trying to call the API using a function and .apply and then save the results from the api call into a csv file. The problem is that I'm trying to do 10000+ requests and my kernel/notebook crashes. Basically I'm trying to do a big operation and I'm guessing I'm running out of memory. So I'm trying to think of a way I can do these api calls and save the results and not have it all crash. My version with .apply works with a small amount of data but not once it gets larger.
So my notebook code currently looks something like this.
df = pd.read_csv('bigstringlist.csv')
df = df.loc[0:3000]
My function looks something like this.
def api_fetch_func(address):
sleep(.2)
API_PRIVATE = 'awewaefawefawef'
encoded = urllib.parse.quote(address)
query ='https://apitocall' + str(encoded) + \
'.json?limit=1&key=' \
+ API_PRIVATE
response = requests.get(query)
while True:
try:
jsonResponse = response.json()
break
except:
response = requests.get(query)
try:
return jsonResponse['results']
except:
return
else:
return
Then I'm calling the function like so
df['response_col'] = df['string_col'].apply(api_fetch_func)
Something tells me that .apply isn't the right thing to do here. Would be better if I just push the api responses into an array or another dataframe?
Should I just use .iterrows to loop over the list of strings and call the function? Something tells me .apply tries to jam too much into memory and that's why this doesn't work.
So I was going to try
results = []
for index, row in df.iterrows():
# call API
# push results to array
Or is there another way to do this?
If it's a memory issue, what I'd do is write the API calling function as a generator with the yield statement. Then, you can loop through the api_fetch_function generator and save smaller data frames for the csv files rather than holding everything in memory in one go.
for idx, response in api_fetch_generator():
if idx % 500 == 0:
df = create_df() # create a fresh df as you did above with 'string_col'.
df['response_col'] = df['string_col'].apply(response)
if (idx % 500 == 0) and idx != 0:
# Save the df using idx to control the file name
df.to_csv(f"response_batch_{idx / 500}.csv")
# Combine the csv's after everything is saved.

Store ndarray in a PyTable (and how to define the Col()-type)

TL;DR: I have a PyTable with a float32 Col and get an error when writing a numpy-float32-array into it. (How) can I store a numpy-array (float32) in the Column of a PyTables table?
I'm new to PyTables - following a recommendation of TFtables (a lib to use HDF5 in Tensorflow), I'm using it to store all my HDF5 data (currently distributed in batches in several files with each three datasets) within a table in a single HDF5 file. Datasets are
'data' : (n_elements, 1024, 1024, 4)#float32
'label' : (n_elements, 1024, 1024, 1)#uint8
'weights' : (n_elements, 1024, 1024, 1)#float32
where the n_elements are distributed over several files that I want to merge into one now (to allow unordered access).
So when I build my table, I figured each dataset represents a column. I built everything in a generic way that allows to do this for an arbitrary number of datasets:
# gets dtypes (and shapes) of the dsets (accessed by dset_keys = ['data', 'label', 'weights']
dtypes, shapes = _determine_shape(hdf5_files, dset_keys)
# to dynamically generate a table, I'm using a dict (not a class as in the PyTables tutorials)
# the dict is (conform with the doc): { 'col_name' : Col()-class-descendent }
table_description = {dset_keys[i]: tables.Col.from_dtype(dtypes[i]) for i in range(len(dset_keys))}
# create a file, a group-node and attach a table to it
h5file = tables.open_file(destination_file, mode="w", title="merged")
group = h5file.create_group("/", 'main', 'Node for data table')
table = h5file.create_table(group, 'data_table', table_description, "Collected data with %s" % (str(val_keys)))
The dtypes that I get for each dsets (read with h5py) are obviously the ones of the numpy arrays (ndarray) that reading the dset returns: float32 or uint8. So the Col()-types are Float32Col an UInt8Col. I naively assumed that I can now write a float32-array into this col, but filling in data with:
dummy_data = np.zeros([1024,1024,3], float32) # normally data read from other files
sample = table.row
sample['data'] = dummy_data
results in TypeError: invalid type (<class 'numpy.ndarray'>) for column ``data``. So now I feel stupid for assuming I'd be able to write an array in there, BUT there are no "ArrayCol()" types offered, neither are there any hints in the PyTables doc as to whether or how it is possible to write an array into a column. How do I do this?
There are "shape" arguments in the Col() class and it's descendents, so it should be possible, otherwise what are these for?!
I know it's a bit late, but I think the answer to your problem lies in the shape parameter for Float32Col.
Here's how it's used in the documentation:
from tables import *
from numpy import *
# Describe a particle record
class Particle(IsDescription):
name = StringCol(itemsize=16) # 16-character string
lati = Int32Col() # integer
longi = Int32Col() # integer
pressure = Float32Col(shape=(2,3)) # array of floats (single-precision)
temperature = Float64Col(shape=(2,3)) # array of doubles (double-precision)
# Open a file in "w"rite mode
fileh = open_file("tutorial2.h5", mode = "w")
# Get the HDF5 root group
root = fileh.root
# Create the groups:
for groupname in ("Particles", "Events"):
group = fileh.create_group(root, groupname)
# Now, create and fill the tables in Particles group
gparticles = root.Particles
# Create 3 new tables
for tablename in ("TParticle1", "TParticle2", "TParticle3"):
# Create a table
table = fileh.create_table("/Particles", tablename, Particle, "Particles: "+tablename)
# Get the record object associated with the table:
particle = table.row
# Fill the table with 257 particles
for i in xrange(257):
# First, assign the values to the Particle record
particle['name'] = 'Particle: %6d' % (i)
particle['lati'] = i
particle['longi'] = 10 - i
########### Detectable errors start here. Play with them!
particle['pressure'] = array(i*arange(2*3)).reshape((2,4)) # Incorrect
#particle['pressure'] = array(i*arange(2*3)).reshape((2,3)) # Correct
########### End of errors
particle['temperature'] = (i**2) # Broadcasting
# This injects the Record values
particle.append()
# Flush the table buffers
table.flush()
Here's the link to the part of the documentation I'm referring to
https://www.pytables.org/usersguide/tutorials.html
Edit: I just saw that the tables.Col.from_type(type, shape) allows using the precision of a type (float32 instead of float alone). The rest stays the same (takes a string and shape).
The factory function tables.Col.from_kind(kind, shape) can be used to construct a Col-Type that supports ndarrays. What "kind" is and how to use this isn't documented anywhere I found; however with trial and error I found that allowed "kind"s are strings of basic datatypes. I.e.: 'float', 'uint', ... without the precision (NOT 'float64')
Since I get numpy.dtypes from h5py reading a dataset (dset.dtype), these have to be cast to str and the precision needs to be removed.
In the end the relevant lines look like this:
# get key, dtype and shapes of elements per dataset from the datasource files
val_keys, dtypes, element_shapes = _get_dtypes(datasources, element_axis=element_axis)
# for storing arrays in columns apparently one has to use "kind"
# "kind" cannot be created with dtype but only a string representing
# the dtype w/o precision, e.g. 'float' or 'uint'
dtypes_kind = [''.join(i for i in str(dtype) if not i.isdigit()) for dtype in dtypes]
# create table description as dictionary
description = {val_keys[i]: tables.Col.from_kind(dtypes_kind[i], shape=element_shapes[i]) for i in range(len(val_keys))}
Then writing data into the table finally works as suggested:
sample = table.row
sample[key] = my_array
Since it all felt a bit "hacky" and isn't documented well, I am still wondering, whether this is not an intended use for PyTables and would leave this question open for abit to see if s.o. knows more about this...

How to speed up iterating through an array / matrix? Tried both pandas and numpy arrays

I would like to do some feature enrichment through a large 2 dimensional array (15,100m).
Working on a sample set with 100'000 records showed that I need to get this faster.
Edit (data model info)
To simplify, let's say we have only two relevant columns:
IP (identifier)
Unix (timestamp in seconds since 1970)
I would like to add a 3rd column, counting how many times this IP has shown up in the past 12 hours.
End edit
My first attempt was using pandas, because it was comfortable working with named dimensions, but too slow:
for index,row in tqdm_notebook(myData.iterrows(),desc='iterrows'):
# how many times was the IP address (and specific device) around in the prior 5h?
hours = 12
seen = myData[(myData['ip']==row['ip'])
&(myData['device']==row['device'])
&(myData['os']==row['os'])
&(myData['unix']<row['unix'])
&(myData['unix']>(row['unix']-(60*60*hours)))].shape[0]
ip_seen = myData[(myData['ip']==row['ip'])
&(myData['unix']<row['unix'])
&(myData['unix']>(row['unix']-(60*60*hours)))].shape[0]
myData.loc[index,'seen'] = seen
myData.loc[index,'ip_seen'] = ip_seen
Then I switched to numpy arrays and hoped for a better result, but it is still too slow to run against the full dataset:
# speed test numpy arrays
for i in np.arange(myArray.shape[0]):
hours = 12
ip,device,os,ts = myArray[i,[0,3,4,12]]
ip_seen = myArray[(np.where((myArray[:,0]==ip)
& (myArray[:,12]<ts)
& (myArray[:,12]>(ts-60*60*hours) )))].shape[0]
device_seen = myArray[(np.where((myArray[:,0]==ip)
& (myArray[:,2] == device)
& (myArray[:,3] == os)
& (myArray[:,12]<ts)
& (myArray[:,12]>(ts-60*60*hours) )))].shape[0]
myArray[i,13]=ip_seen
myArray[i,14]=device_seen
My next idea would be to iterate only once, and maintain a growing dictionary of the current count, instead of looking backwards in every iteration.
But that would have some other drawbacks (e.g. how to keep track when to reduce count for observations falling out of the 12h window).
How would you approach this problem?
Could it be even an option to use low level Tensorflow functions to involve a GPU?
Thanks
The only way to speed up things is not looping. In your case you can try using rolling with a window of the time span that you want, using the Unix timestamp as a datetime index (assuming that records are sorted by timestamp, otherwise you would need to sort first). This should work fine for the ip_seen:
ip = myData['ip']
ip.index = pd.to_datetime(myData['unix'], unit='s')
myData['ip_seen'] = ip.rolling('5h')
.agg(lambda w: np.count_nonzero(w[:-1] == w[-1]))
.values.astype(np.int32)
However, when the aggregation involves multiple columns, like in the seen column, it gets more complicated. Currently (see Pandas issue #15095) rolling functions do not support aggregations spanning two dimensions. A workaround could be merging the columns of interest into a single new series, for example a tuple (which may work better if values are numbers) or a string (which may be better is values are already strings). For example:
criteria = myData['ip'] + '|' + myData['device'] + '|' + myData['os']
criteria.index = pd.to_datetime(myData['unix'], unit='s')
myData['seen'] = criteria.rolling('5h')
.agg(lambda w: np.count_nonzero(w[:-1] == w[-1]))
.values.astype(np.int32)
EDIT
Apparently rolling only works with numeric types, which leaves as with two options:
Manipulate the data to use numeric types. For the IP this is easy, since it actually represents a 32 bit number (or 64 if IPv6 I guess). For device and OS, assuming they are strings now, it get's more complicated, you would have to map each possible value to an integer and the merge it with the IP in a long value, e.g. putting these in the higher bits or something like that (maybe even impossible with IPv6, since the biggest integers NumPy supports right now are 64 bits).
Roll over the index of myData (which should now be not datetime, because rolling cannot work with that either) and use the index window to get the necessary data and operate:
# Use sequential integer index
idx_orig = myData.index
myData.reset_index(drop=True, inplace=True)
# Index to roll
idx = pd.Series(myData.index)
idx.index = pd.to_datetime(myData['unix'], unit='s')
# Roll aggregation function
def agg_seen(w, data, fields):
# Use slice for faster data frame slicing
slc = slice(int(w[0]), int(w[-2])) if len(w) > 1 else []
match = data.loc[slc, fields] == data.loc[int(w[-1]), fields]
return np.count_nonzero(np.all(match, axis=1))
# Do rolling
myData['ip_seen'] = idx.rolling('5h') \
.agg(lambda w: agg_seen(w, myData, ['ip'])) \
.values.astype(np.int32)
myData['ip'] = idx.rolling('5h') \
.agg(lambda w: agg_seen(w, myData, ['ip', 'device', 'os'])) \
.values.astype(np.int32)
# Put index back
myData.index = idx_orig
This is not how rolling is meant to be used, though, and I'm not sure if this gives much better performance than just looping.
as mentioned in the comment to #jdehesa, I took another approach which allows me to only iterate once through the entire dataset and pull the (decaying) weight from an index.
decay_window = 60*60*12 # every 12
decay = 0.5 # fall by 50% every window
ip_idx = pd.DataFrame(myData.ip.unique())
ip_idx['ts_seen'] = 0
ip_idx['ip_seen'] = 0
ip_idx.columns = ['ip','ts_seen','ip_seen']
ip_idx.set_index('ip',inplace=True)
for index, row in myData.iterrows(): # all
# How often was this IP seen?
prior_ip_seen = ip_idx.loc[(row['ip'],'ip_seen')]
prior_ts_seen = ip_idx.loc[(row['ip'],'ts_seen')]
delay_since_count = row['unix']-ip_idx.loc[(row['ip'],'ts_seen')]
new_ip_seen = prior_ip_seen*decay**(delay_since_count/decay_window)+1
ip_idx.loc[(row['ip'],'ip_seen')] = new_ip_seen
ip_idx.loc[(row['ip'],'ts_seen')] = row['unix']
myData.iloc[index,14] = new_ip_seen-1
That way the result is not the fixed time window as requested initially, but prior observations "fade out" over time, giving frequent recent observations a higher weight.
This feature carries more information than the simplified (and turned out more expensive) approach initially planned.
Thanks for your input!
Edit
In the meantime I switched to numpy arrays for the same operation, which now only takes a fraction of the time (loop with 200m updates in <2h).
Just in case somebody looks for a starting point:
%%time
import sys
## temporary lookup
ip_seen_ts = [0]*365000
ip_seen_count = [0]*365000
cnt = 0
window = 60*60*12 # 12h
decay = 0.5
counter = 0
chunksize = 10000000
store = pd.HDFStore('store.h5')
t = time.process_time()
try:
store.remove('myCount')
except:
print("myData not present.")
for myHdfData in store.select_as_multiple(['myData','myFeatures'],columns=['ip','unix','ip_seen'],chunksize=chunksize):
print(counter, time.process_time() - t)
#display(myHdfData.head(5))
counter+=chunksize
t = time.process_time()
sys.stdout.flush()
keep_index = myHdfData.index.values
myArray = myHdfData.as_matrix()
for row in myArray[:,:]:
#for row in myArray:
i = (row[0].astype('uint32')) # IP as identifier
u = (row[1].astype('uint32')) # timestamp
try:
delay = u - ip_seen_ts[i]
except:
delay = 0
ip_seen_ts[i] = u
try:
ip_seen_count[i] = ip_seen_count[i]*decay**(delay/window)+1
except:
ip_seen_count[i] = 1
row[3] = np.tanh(ip_seen_count[i]-1) # tanh to normalize between 0 and 1
myArrayAsDF = pd.DataFrame(myArray,columns=['c_ip','c_unix','c_ip2','ip_seen'])
myArrayAsDF.set_index(keep_index,inplace=True)
store.append('myCount',myArrayAsDF)
store.close()

Efficiently insert massive amount of rows in Psycopg2

I need to efficiently insert about 500k (give or take 100k) rows of data into my PostgreSQL database. After a generous amount of google-ing, I've gotten to this solution, averaging about 150 (wall-clock) seconds.
def db_insert_spectrum(curs, visual_data, recording_id):
sql = """
INSERT INTO spectrums (row, col, value, recording_id)
VALUES %s
"""
# Mass-insertion technique
# visual_data is a 2D array (a nx63 matrix)
values_list = []
for rowIndex, rowData in enumerate(visual_data):
for colIndex, colData in enumerate(rowData): # colData is the value
value = [(rowIndex, colIndex, colData, recording_id)]
values_list.append(value)
psycopg2.extras.execute_batch(curs, sql, values_list, page_size=1000)
Is there a faster way?
Based on the answers given here, COPY is the fastest method. COPY reads from a file or file-like object.
Since memory I/O is many orders of magnitude faster than disk I/O, it is faster to write the data to a StringIO file-like object than to write to an actual file.
The psycopg docs show an example of calling copy_from with a StringIO as input.
Therefore, you could use something like:
try:
# Python2
from cStringIO import StringIO
except ImportError:
# Python3
from io import StringIO
def db_insert_spectrum(curs, visual_data, recording_id):
f = StringIO()
# visual_data is a 2D array (a nx63 matrix)
values_list = []
for rowIndex, rowData in enumerate(visual_data):
items = []
for colIndex, colData in enumerate(rowData):
value = (rowIndex, colIndex, colData, recording_id)
items.append('\t'.join(map(str, value))+'\n')
f.writelines(items)
f.seek(0)
cur.copy_from(f, 'spectrums', columns=('row', 'col', 'value', 'recording_id'))
I don't know whether .execute_batch can accept generator, but can u try something like:
def db_insert_spectrum(curs, visual_data, recording_id):
sql = """
INSERT INTO spectrums (row, col, value, recording_id)
VALUES %s
"""
data_gen = ((rIdx, cIdx, value, recording_id) for rIdx, cData in enumerate(visual_data)
for cIdx, value in enumerate(cData))
psycopg2.extras.execute_batch(curs, sql, data_gen, page_size=1000)
It might be faster.

Categories