Running arithmetic over big data in Python Pandas

I am working with data that has over 2 million columns. Pandas itself handles the data quickly, but I want to apply a linear regression to it, and that takes a very long time with plain Python code. The equation is computed for every single column of price closes, which is what makes it so slow. Would NumPy or any other library process this data more efficiently, and is there a function that could speed this up by a good amount?
%%time
%matplotlib notebook
import pandas as pd
import math
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from copy import deepcopy
from itertools import groupby
minute_2015_20 = 'input.csv'
data = pd.read_csv(minute_2015_20, low_memory=False)
# reverse the row order of the table
data1 = data.iloc[::-1].reset_index(drop=True)
data1

Since I don't actually know your approach, my first thought would be to make sure you are running a vectorized operation rather than a loop.
If you are already doing that, you can try parallelism on chunks of the data.
This should make the operations faster.
Another approach, provided you have access to a GPU, would be to convert the data to a tensor and perform the operations on the GPU.
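For example, if the per-column calculation is an ordinary least-squares line fit against time, NumPy can fit every column in a single call instead of looping over columns in Python. A minimal sketch (the column count and the regressor are assumptions, since the actual equation isn't shown):
import numpy as np
import pandas as pd
# Stand-in data: many columns of closing prices (shape and values are assumed)
closes = pd.DataFrame(np.random.rand(1_000, 500))
x = np.arange(len(closes))  # time index used as the regressor
# np.polyfit accepts a 2-D y, so one call fits a line to every column at once;
# coeffs has shape (2, n_columns): row 0 holds slopes, row 1 holds intercepts
coeffs = np.polyfit(x, closes.to_numpy(), deg=1)
slopes, intercepts = coeffs[0], coeffs[1]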


Can't fit dataframe with fbprophet using dask to read the csv into a dataframe

References:
https://examples.dask.org/applications/forecasting-with-prophet.html?highlight=prophet
https://facebook.github.io/prophet/
A few things to note:
I've got a total of 48 GB of RAM.
Here are the versions of the libraries I'm using:
Python 3.7.7
dask==2.18.0
fbprophet==0.6
pandas==1.0.3
The only reason I'm importing pandas is for this line: pd.options.mode.chained_assignment = None
This helps with dask erroring when I'm using dask.distributed.
So, I have a 21 GB csv file that I am reading using dask and jupyter notebook...
I've tried to read it from my mysql database table, however, the kernel eventually crashes.
I've tried multiple combinations of using my local network of workers, threads, available memory and available storage_memory, and even tried not using distributed at all. I have also tried chunking with pandas (not with the line mentioned above related to pandas); however, even with chunking, the kernel still crashes...
I can now load the csv with dask and apply a few transformations, such as setting the index and adding the columns (names) that fbprophet requires, but I am still not able to compute the dataframe with df.compute(), which I think is why I am receiving the error I am with fbprophet. After I have added the columns y and ds, with the appropriate dtypes, I receive the error Truth of Delayed objects is not supported, and I think this is because fbprophet expects the dataframe not to be lazy, which is why I'm trying to run compute beforehand. I have also bumped up the RAM on the client to allow it to use the full 48 GB, as I suspected that it may be trying to load the data twice; however, this still failed, so most likely this wasn't the case / isn't causing the problem.
Alongside this, fbprophet is also mentioned in the documentation of dask for applying machine learning to dataframes, yet I really don't understand why this isn't working... I've also tried modin with ray, and with dask, with basically the same result.
Another question... regarding memory usage
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB
I am getting this warning when assigning the client, reading the csv file, and applying operations/transformations to the dataframe; however, the allotted memory is larger than the csv file itself, so this confuses me...
What I have done to try and solve this myself:
- Googling of course, did not find anything :-/
- Asking a Discord help channel, on multiple occasions
- Asking an IRC help channel, on multiple occasions
Anyways, would really appreciate any help on this problem!!!
Thank you in advance :)
MCVE
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet
pd.options.mode.chained_assignment = None
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a','b']].mean(axis=1)
m = Prophet(daily_seasonality=True)
m.fit(df)
# ERROR: Truth of Delayed objects is not supported
Unfortunately Prophet doesn't support Dask dataframes today.
The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.
You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.
Here is one possible approach based on this type of custom pipeline, using a small dataset (to create an MCVE) - every step in the pipeline will be delayed
Imports
import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet
Generate some data in a .csv, with column names Time (UTC), a and b
def generate_csv(nrows, fname):
    df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
    df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
    df.to_csv(fname, index=False)
First write the load function from the pipeline, to load the .csv with Pandas, and delay its execution using the dask.delayed decorator
- it might be good to use read_csv with nrows to see how the pipeline performs on a subset of the data, rather than loading it all
- this will return a dask.delayed object and not a pandas.DataFrame
@delayed
def load_data(fname, nrows=None):
    return pd.read_csv(fname, nrows=nrows)
Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame
@delayed
def process_data(df):
    df = df.rename(columns={"Time (UTC)": "ds"})
    df["y"] = df[["a", "b"]].mean(axis=1)
    return df
Last function - this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object
@delayed
def analyze(df, horizon):
    m = Prophet(daily_seasonality=True)
    m.fit(df)
    future = m.make_future_dataframe(periods=horizon)
    forecast = m.predict(future)
    return forecast
Run the pipeline (if running from a Python script, this requires __name__ == "__main__")
- the output of the pipeline (a forecast by fbprophet) is stored in the variable result, which is delayed
- when this output is computed, it will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet), so it can be evaluated using result.compute()
if __name__ == "__main__":
    horizon = 8
    num_rows_data = 40
    num_rows_to_load = 35
    csv_fname = "my_file.csv"
    generate_csv(num_rows_data, csv_fname)
    client = Client()  # modify this as required
    df = load_data(csv_fname, nrows=num_rows_to_load)
    df = process_data(df)
    result = analyze(df, horizon)
    forecast = result.compute()
    client.close()
    assert len(forecast) == num_rows_to_load + horizon
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())
Output
ds yhat yhat_lower yhat_upper
0 1850-01-01 0.330649 0.095788 0.573378
1 1850-01-02 0.493025 0.266692 0.724632
2 1850-01-03 0.573344 0.348953 0.822692
3 1850-01-04 0.491388 0.246458 0.712400
4 1850-01-05 0.307939 0.066030 0.548981

Load mat file into pandas data frame

I have a problem loading a specific Matlab file into a pandas DataFrame. Basically, it is a multidimensional array consisting of 2x4 arrays and then either 3 or 4 columns. I searched and found Convert mat file to pandas dataframe and matlab data file to pandas DataFrame, but neither works for me. Here is what I am using; this creates a DataFrame for one of the nests. I would like to have a column which tells me where I am, i.e. the first column tells whether it is all_params[0] or all_params[1], the second distinguishes the next level for each of those, etc.
import numpy as np
import pandas as pd
import scipy.io as sio
all_params = sio.loadmat(path+'all_params')
all_params = all_params['all_params']
pd.DataFrame(all_params[0][0][:], columns=['par1', 'par2', 'par3'])
It must be simple, I'm just not able to figure it out. Or is there a way to do it directly using scipy or another loading tool?
Thanks

Python: Do not remove the data when I stop the program. Loading a very big database only once

So, I have this database with thousands of rows and columns. At the start of the program I load the data and assign a variable to it:
data=np.loadtxt('database1.txt',delimiter=',')
Since this database contains many elements, it takes minutes to start the program. Is there a way in Python (similar to .mat files in Matlab) that lets me load the data only once, even when I stop the program and then run it again? Currently my time is wasted waiting for the program to load the data whenever I just change a small thing for testing.
Firstly, NumPy isn't great at reading large text files; Pandas is much stronger at this.
So just stop using np.loadtxt and start using pd.read_csv instead.
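A drop-in replacement could look like this (a minimal sketch, assuming database1.txt is plain comma-separated text with no header row):
import pandas as pd
# pandas' fast C parser reads the file, then hands back a plain NumPy array
data = pd.read_csv('database1.txt', delimiter=',', header=None).to_numpy()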
But if you still want to use NumPy,
I think np.fromfile() is more efficient and faster than np.loadtxt().
So my advice is to try:
data = np.fromfile('database1.txt', sep=',')
instead of:
data = np.loadtxt('database1.txt',delimiter=',')
You could pickle to cache your data.
import pickle
import os
import numpy as np
if os.path.isfile("cache.p"):
    with open("cache.p", "rb") as f:
        data = pickle.load(f)
else:
    data = np.loadtxt('database1.txt', delimiter=',')
    with open("cache.p", "wb") as f:
        pickle.dump(data, f)
The first time it will be very slow; later executions will be pretty fast.
I just tested this with a file containing 1 million rows and 20 columns of random floats: it took ~30 s the first time and ~0.4 s the following times.
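A similar caching sketch using NumPy's own binary .npy format instead of pickle (the cache file name here is just an assumption); loading an .npy file is likewise far faster than re-parsing the text:
import os
import numpy as np
cache_file = 'database1.npy'  # hypothetical cache path
if os.path.isfile(cache_file):
    data = np.load(cache_file)  # near-instant on later runs
else:
    data = np.loadtxt('database1.txt', delimiter=',')
    np.save(cache_file, data)  # write the parsed array once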

Using Dask with Python causes issues when running Pandas code

I am trying to work with Dask because my dataframe has become so large that pandas by itself can't process it. I read my dataset in as follows and get a result that looks odd; I'm not sure why it's not outputting the dataframe:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import dask.bag as db
import json
%matplotlib inline
Leads = db.read_text('Leads 6.4.18.txt')
Leads
This returns (instead of my pandas dataframe):
dask.bag<bag-fro..., npartitions=1>
Then when I try to rename a few columns:
Leads_updated = Leads.rename(columns={'Business Type': 'Business_Type', 'Lender Type': 'Lender_Type'})
Leads_updated
I get:
AttributeError: 'Bag' object has no attribute 'rename'
Can someone please explain what I am not doing correctly? The objective is to just use Dask for all these steps, since the data is too big for regular Python/Pandas. My understanding is that the syntax used under Dask should be the same as in Pandas.

How to speed up pandas.DataFrame plotting?

I am building a GUI in PySide where I have to keep redrawing pandas.DataFrame objects.
I found out from this simple snippet of code that the pandas DataFrame object df takes much longer to plot than the numpy.array object, despite the fact that the plots are nearly identical. This is too slow for my GUI. Why is this so much slower?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = np.cumsum(np.random.randn(100, 10), axis=0)
df = pd.DataFrame(data)
df.plot() # Compare the speed of this line... (slow)
plt.plot(data) # to this line. (fast)
I like the way that the pandas.DataFrame plots look, especially because in my real example my x-axis is datetime data from pandas. I do not know how to format a matplotlib.pyplot x-axis to look good with datetime data.
How do I speed up pandas.DataFrame plotting?
