RAM consumption by pandas DataFrame - python

I am trying to work with around 100 csv files to do a time series analysis.
To keep things efficient, I've structured my read_csv code so that it reads all the files once and doesn't have to repeat the same process again and again. My code is as follows:
import pandas as pd
from itertools import chain, combinations

start_date = '2016-06-01'
end_date = '2017-09-02'
allocation = 170000
# contains 100 symbols
usesymbols = ['']
cost_matrix = []

def data():
    dates = pd.date_range(start_date, end_date)
    df = pd.DataFrame(index=dates)
    for symbol in usesymbols:
        df_temp = pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)), usecols=['Date', 'Close'],
                              parse_dates=True, index_col='Date', na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    return df

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1))

power_set = list(powerset(usesymbols))
dataframe = data()
Problem is that if I run the above code with 15 symbols it works perfectly.
But that's not sufficient, I want to use 100 symbols.
If I run the code with 100 items in usesymbols, my RAM is used up completely and the machine freezes.
Is there anything that can be done to avoid this situation?
Edited Part:
1) I've 16 GB RAM.
2) The issue is with the variable power_set: if I don't call the powerset function, the data is retrieved without trouble.

DataFrame.memory_usage(index=False)
Return:
sizes : Series
A series with column names as index and memory usage of columns with units of bytes.
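Since the edit above points at power_set rather than the DataFrame, it may help to count how many subsets list(powerset(...)) has to hold. A set of n symbols has 2**n - 1 non-empty subsets, so 15 symbols give 32,767 tuples while 100 symbols give roughly 1.27e30, which no amount of RAM can store. The sketch below is only an illustration: the symbol names are made up, and restricting the subset size or consuming the iterator lazily with islice are assumptions about what the analysis might tolerate, not something the question states.

```python
from itertools import chain, combinations, islice
import math

def powerset(iterable):
    # Same generator as in the question: all non-empty subsets, smallest first
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))

def powerset_size(n):
    # Number of non-empty subsets of an n-element set
    return 2 ** n - 1

# Hypothetical symbol names, only used to get a list of length 100
symbols = [f"SYM{i}" for i in range(100)]

n_subsets = powerset_size(len(symbols))
print(powerset_size(15))  # 32767
print(n_subsets)          # 1267650600228229401496703205375

# Consuming the iterator lazily never materialises the full list
first_five = list(islice(powerset(symbols), 5))

# Limiting the subset size keeps the count manageable, e.g. all pairs
pairs_count = math.comb(len(symbols), 2)
print(pairs_count)  # 4950
```

If every subset really has to be evaluated, the list has to go: iterate over powerset(usesymbols) directly, and expect the run time (not just the memory) to be the limiting factor for 100 symbols.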

Related

Python - Multiprocessing very large parquet file

I have stumbled across similar questions but all of them used either .csv or .txt files and a solution was to read in the data line by line or in chunks. I am not aware of that being possible with parquet files as they were designed to be read by columns and not rows.
My first attempt below works well with a smaller subset/test dataset created from the original full dataset.
import os
from multiprocessing import Pool

import numpy as np
import pandas as pd

def process_single_group(group_df):
    # Super simple version of my function
    group_df['E'] = group_df['C'] + group_df['D']
    group_df['F'] = group_df['C'] - group_df['D']
    group_df['G'] = group_df['C'] * group_df['D']
    return group_df

def group_of_groups_loop(group_of_group_df):
    df_agg = pd.DataFrame()
    for i, group_df in enumerate(group_of_group_df):
        t = process_single_group(group_df)
        df_agg = pd.concat([df_agg, t])
    return df_agg

num_processes = os.cpu_count()
pool = Pool(processes=num_processes)
df = pd.read_parquet('path/to/dataset.parquet')
# With small dataset, there are about 300 groups
# With full dataset, there are about ~800k groups
groupby = df.groupby(by=['A', 'B'])
# Split the groups into chunks of groups
groupby_split = np.array_split(groupby, num_processes)
# Create a list where each element is a chunk of groups
# The starmap expects this sort of input
input_list = [[gb] for gb in groupby_split]
x = pd.concat(pool.starmap(group_of_groups_loop, input_list))
# close() has to be called before join()
pool.close()
pool.join()
x.to_parquet('path/to/save/file.parquet')
but when I switch to the full parquet file, I get the error:
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
which I expected.
My solution to this was to break the very large number of groups into smaller chunks of groups (size similar to the subset) and loop over each one like with the subset earlier.
EDIT I forgot to add the second np.array_split within the loop.
def process_single_group(group_df):
    # Super simple version of my function
    group_df['E'] = group_df['C'] + group_df['D']
    group_df['F'] = group_df['C'] - group_df['D']
    group_df['G'] = group_df['C'] * group_df['D']
    return group_df

def group_of_groups_loop(group_of_group_df):
    df_agg = pd.DataFrame()
    for i, group_df in enumerate(group_of_group_df):
        t = process_single_group(group_df)
        df_agg = pd.concat([df_agg, t])
    return df_agg

num_processes = os.cpu_count()
pool = Pool(processes=num_processes)
df = pd.read_parquet('path/to/dataset.parquet')
# With small dataset, there are about 300 groups
# With full dataset, there are about ~800k groups
groupby = df.groupby(by=['A', 'B'])
# Split the large number of groups into smaller chunks
N = 10
groupby_split = np.array_split(groupby, N)
final_agg_df = pd.DataFrame()
# iterate over each of the smaller chunks
for groupby_group in groupby_split:
    groupby_split_split = np.array_split(groupby_group, num_processes)
    # Create a list to use as argument for starmap
    input_list = [[gb] for gb in groupby_split_split]
    x = pd.concat(pool.starmap(group_of_groups_loop, input_list))
    final_agg_df = pd.concat([final_agg_df, x])
# close() has to be called before join(), and only once all chunks are done
pool.close()
pool.join()
final_agg_df.to_parquet('path/to/save/file.parquet')
But this is still giving me the same error...
I thought that since the pool was created prior to reading in the large parquet file (I read a solution earlier that mentioned doing this) that each process would only be given the small chunk of groups.
I am wondering if there is something I missed? And also if there is a better way of doing this in general (queue? dask? function logic?)
Thanks in advance!
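One commonly reported cause of this struct.error is a single argument or result passed between processes that is larger than about 2 GiB once pickled, so the size of each task matters more than the number of loop iterations. The sketch below shows one way to keep every task small: cut the work into fixed-size batches and stream them through pool.imap. It is a minimal stdlib illustration under stated assumptions, not a drop-in fix: batched, process_batch, run, and the (key, c, d) record layout are hypothetical stand-ins for the groups and for group_of_groups_loop in the question.

```python
from itertools import islice
from multiprocessing import Pool

def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process_batch(batch):
    # Hypothetical stand-in for group_of_groups_loop:
    # each record is (key, c, d) and gets the sum, difference and product
    return [(key, c + d, c - d, c * d) for key, c, d in batch]

def run(records, batch_size=1000, processes=2):
    """Send one small batch per task so no single message gets large."""
    results = []
    with Pool(processes=processes) as pool:
        for part in pool.imap(process_batch, batched(records, batch_size)):
            results.extend(part)
    return results
```

run(...) should be called under an `if __name__ == "__main__":` guard. With DataFrames, the same idea would mean batching the group keys (or writing each batch's result to its own parquet file) rather than collecting one huge frame in the parent process.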

How to multicore processing a for loop with iterrows in python

I have a massive dataset that could use multicore processing.
I have a dataframe that has sequences and blocksize for each row.
I wrote a loop that extracts the sequence and block size for each row and calculates a score from a function from a package called localcider.
I can't figure out how to run it in parallel.
Can somebody help?
omega = []
AA = list('FYW')
for i, row in df.iterrows():
    seq = df['IDRseq'][i]
    b = df['bsize'][i]
    bsize = [b-1, b]
    SeqOb = SequenceParameters(seq, blobsize=bsize)
    omega.append(SeqOb.get_kappa_X(AA))
s1 = pd.Series(omega, name='omega')
df = df.assign(omega=s1.values)
After a lot of googling, I came across pandarallel.
I think this is the most intuitive way of doing what I want.
I am posting the code for future reference.
from pandarallel import pandarallel

# n is the number of workers; I set it to (CPU cores - 1) so the system stays more stable
pandarallel.initialize(progress_bar=True, nb_workers=n)

def something(x):
    # do stuff
    return result

df['result'] = df.parallel_apply(something, axis=1)
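For comparison, the same row-wise pattern can be sketched with only the standard library's concurrent.futures, in case installing pandarallel is not an option. This is a rough sketch under assumptions: score_row is a hypothetical stand-in for the localcider call (it just counts F/Y/W residues in the first bsize positions), and parallel_scores is a made-up helper name.

```python
from concurrent.futures import ProcessPoolExecutor

AROMATIC = set("FYW")

def score_row(args):
    """Hypothetical stand-in for the per-row score."""
    seq, bsize = args
    # Count F/Y/W residues among the first bsize positions
    return sum(ch in AROMATIC for ch in seq[:bsize])

def parallel_scores(seqs, bsizes, workers=2):
    """Score each (sequence, block size) pair in worker processes, keeping input order."""
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # chunksize groups several rows per task to reduce inter-process overhead
        return list(executor.map(score_row, zip(seqs, bsizes), chunksize=64))
```

Under an `if __name__ == "__main__":` guard, something like df['omega'] = parallel_scores(df['IDRseq'], df['bsize']) would then fill the column; the worker function has to be defined at module level so it can be pickled.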

How to convert a single column containing JSON with 250 variables to 250 separate column dataset using arrays?

I have an issue converting a JSON column (which contains around 250 variables) into 250 separate columns. I can do it with a pandas DataFrame, but it takes 30 minutes for just 46k rows, and with the full 0.5 million rows in the database the kernel sometimes crashes because it runs out of memory.
Can somebody help me with code using NumPy arrays (which should decrease the conversion time and reduce the memory footprint)?
The JSON column has data in below format:
My code :
import json

list_features = []
for x in records:
    list_ = list(x)
    json_acceptable_string = list_[4].read()
    list_features.append(json.loads(json_acceptable_string))
Once I have list_features, I preprocess it and feed it into a machine learning pipeline. This doesn't work for large data.
I think this could help you build your NumPy array:
import json
import numpy as np

variable_name_list = ['var1', 'var2', ..., 'var250']  # placeholder: list all 250 names here
# dtype=object keeps full-length values; dtype=str would allocate one-character strings
list_features = np.empty(shape=(len(records), len(variable_name_list)), dtype=object)
for index in range(len(records)):
    list_ = list(records[index])
    json_acceptable_string = list_[4].read()
    tmp_feature_list = []
    tmp_feature_dict = json.loads(json_acceptable_string)
    for var_name in variable_name_list:
        if var_name not in tmp_feature_dict:
            tmp_feature_list.append("missing_val")
        else:
            tmp_feature_list.append(tmp_feature_dict[var_name])
    list_features[index] = np.asarray(tmp_feature_list, dtype=str)
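The lookup-with-default step above can also be written with dict.get, which avoids the explicit if/else. The following is a small stdlib-only sketch of that idea; json_rows_to_table and the two-column example are hypothetical, and it assumes each JSON string decodes to a flat object.

```python
import json

def json_rows_to_table(json_strings, columns, missing="missing_val"):
    """Turn JSON object strings into rows with one value per requested column."""
    table = []
    for text in json_strings:
        record = json.loads(text)
        # dict.get returns the default when a variable is absent from this record
        table.append([record.get(name, missing) for name in columns])
    return table

# Tiny made-up input: the second record lacks "var2"
rows = json_rows_to_table(
    ['{"var1": 1, "var2": "a"}', '{"var1": 2}'],
    ["var1", "var2"],
)
print(rows)  # [[1, 'a'], [2, 'missing_val']]
```

The resulting list of rows can be handed to np.array(rows, dtype=object) or to a DataFrame constructor in one step, which tends to be cheaper than growing a DataFrame row by row.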

Applying a function to every observation in a dataframe

I have a large df of coordinates that I'm putting through a function (a reverse geocoder).
How can I run the function over the whole df without iterating (which takes very long)?
Example df:
Latitude Longitude
0 -25.66026 28.0914
1 -25.67923 28.10525
2 -30.68456 19.21694
3 -30.12345 22.34256
4 -15.12546 17.12365
After running through the function I want (without a for loop...) a df:
City
0 HappyPlace
1 SadPlace
2 AveragePlace
3 CoolPlace
4 BadPlace
Note: I don't need to know how to do reverse geocoding; this is a question about applying a function to a whole df without iteration.
EDIT:
using df.apply() might not work as my code looks like this:
for i in range(len(df)):
    results = g.reverse_geocode(df['LATITUDE'][i], df['LONGITUDE'][i])
    city.append(results.city)
Slower approach: iterate through the list of geo points and fetch the city for each point, one at a time.
import pandas as pd
import time

d = {'Latitude': [-25.66026,-25.67923,-30.68456,-30.12345,-15.12546,-25.66026,-25.67923,-30.68456,-30.12345,-15.12546], 'Longitude': [28.0914, 28.10525,19.21694,22.34256,17.12365,28.0914, 28.10525,19.21694,22.34256,17.12365]}
df = pd.DataFrame(data=d)

# example method of g.reverse_geocode() -> geo_reverse
def geo_reverse(lat, long):
    time.sleep(2)
    # assuming that your reverse_geocode will take 2 seconds
    print(lat, long)

for i in range(len(df)):
    results = geo_reverse(df['Latitude'][i], df['Longitude'][i])
Because each call sleeps for 2 seconds (time.sleep(2)), the program above takes at least 20 seconds to process all ten geo points.
A better approach than the above:
import pandas as pd
import threading
import time

d = {'Latitude': [-25.66026,-25.67923,-30.68456,-30.12345,-15.12546,-25.66026,-25.67923,-30.68456,-30.12345,-15.12546], 'Longitude': [28.0914, 28.10525,19.21694,22.34256,17.12365,28.0914, 28.10525,19.21694,22.34256,17.12365]}
df = pd.DataFrame(data=d)

def runnable_method(f, args):
    result_info = [threading.Event(), None]
    def runit():
        result_info[1] = f(args)
        result_info[0].set()
    threading.Thread(target=runit).start()
    return result_info

def gather_results(result_infos):
    results = []
    for i in range(len(result_infos)):
        result_infos[i][0].wait()
        results.append(result_infos[i][1])
    return results

def geo_reverse(args):
    time.sleep(2)
    return "City Name of ("+str(args[0])+","+str(args[1])+")"

geo_points = []
for i in range(len(df)):
    tuple_i = (df['Latitude'][i], df['Longitude'][i])
    geo_points.append(tuple_i)

result_info = [runnable_method(geo_reverse, geo_point) for geo_point in geo_points]
cities_result = gather_results(result_info)
print(cities_result)
Notice that geo_reverse takes 2 seconds per call to fetch the data for a geo point. In this second example the calls wait concurrently, so the ten points finish in roughly 2 seconds instead of 20.
Note: Try both approaches, assuming that your geo_reverse takes approx. 2 seconds to fetch data. The first approach takes about 20+1 seconds for ten points and grows linearly with the number of inputs, while the second stays close to 2+1 seconds as long as the machine and the geocoding service can handle that many simultaneous threads and requests.
Assume that the g.reverse_geocode() method is geo_reverse() in the code above. Run both approaches separately and see the difference for yourself.
Explanation:
Take a look at the major part of the code above: it builds a list of tuples, then uses a list comprehension to pass each tuple to a dynamically created thread:
# Converting df of geo points into a list of tuples
geo_points = []
for i in range(len(df)):
    tuple_i = (df['Latitude'][i], df['Longitude'][i])
    geo_points.append(tuple_i)
# List comprehension with the custom methods creates the runnable threads
result_info = [runnable_method(geo_reverse, geo_point) for geo_point in geo_points]
# Gather the result from each thread
cities_result = gather_results(result_info)
print(cities_result)
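The hand-rolled thread and Event bookkeeping above can also be expressed with the standard library's ThreadPoolExecutor, which additionally caps the number of simultaneous threads. The sketch below is an assumption-laden variant, not the answer's code: geo_reverse is still a fake that sleeps (here for 0.05 s to keep the demo quick), and reverse_geocode_all and max_workers=8 are illustrative choices.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def geo_reverse(point):
    # Fake lookup standing in for g.reverse_geocode(); sleeps instead of calling a service
    time.sleep(0.05)
    lat, lon = point
    return f"City Name of ({lat},{lon})"

def reverse_geocode_all(points, max_workers=8):
    """Run the lookups on a bounded pool of threads and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(geo_reverse, points))

points = [(-25.66026, 28.0914), (-25.67923, 28.10525), (-30.68456, 19.21694)]
cities = reverse_geocode_all(points)
print(cities)
```

With a DataFrame, points could be built with list(zip(df['Latitude'], df['Longitude'])) and the result assigned back as df['City'] = cities. A bounded pool is usually kinder to a rate-limited geocoding service than one thread per row.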

Dask read_csv: skip periodically occurring lines

I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the number of atoms; the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the number of atoms and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like the Dask dataframe reader does not support passing a function to skiprows.
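To make the predicate concrete: with 3 atoms per frame, every frame occupies 5 lines, and the first two lines of each frame (line numbers 0 and 1 modulo 5) are the ones to drop. A quick check in plain Python, using the same skip function as above:

```python
atom_nr = 3  # atoms per frame, as in the example file

def skip(line_nr):
    # True for the first two lines of every (atom_nr + 2)-line frame
    return line_nr % (atom_nr + 2) < 2

skipped = [n for n in range(10) if skip(n)]
kept = [n for n in range(10) if not skip(n)]
print(skipped)  # [0, 1, 5, 6]
print(kept)     # [2, 3, 4, 7, 8, 9]
```

This only works when every frame has the same number of atoms, which holds for the trajectory format shown here.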
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO
import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed

atom_nr = ...
filename = ...

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)

bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offers this kind of functionality (to my knowledge).
Long Answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (and then up until the next delimiter). You can control how fine you want your partitions to be using this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data
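If the goal is simply the first n frames, another option is to stop reading early in plain Python before anything is handed to Dask. The sketch below is not Dask code and not part of the answer above: iter_xyz_frames is a hypothetical helper that walks an XYZ file frame by frame, and islice stops it after the requested number of frames, so the rest of the trajectory is never parsed.

```python
from io import StringIO
from itertools import islice

def iter_xyz_frames(lines):
    """Yield (comment, atoms) per frame; atoms is a list of (name, x, y, z) tuples."""
    lines = iter(lines)
    for header in lines:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between frames
        n_atoms = int(header)
        comment = next(lines).rstrip("\n")
        atoms = []
        for _ in range(n_atoms):
            name, x, y, z = next(lines).split()
            atoms.append((name, float(x), float(y), float(z)))
        yield comment, atoms

# The two-frame example from the question, held in memory for the demo
sample = """3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
"""

# Only the first frame is parsed; the second is never touched
first_frames = list(islice(iter_xyz_frames(StringIO(sample)), 1))
print(first_frames[0][0])  # timestep 1
```

For a real file, passing an open file object instead of StringIO(sample) works the same way, since the generator only pulls lines as they are needed.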
