How to use incremental PCA on dask dataframe? - python

I am using a dask dataframe which cannot be loaded directly into memory because of its size. I want to perform dimensionality reduction on top of it using incremental PCA.
My dataframe is sparse in nature, so the question is: can I do this, and if so, how?
image_features_df.head(3)
feat1 feat2 feat3 ... feat25087 feat25088 fid selling_price
0 0.0 0.0 0.0 ... 0.0 0.0 2 269.00
4 0.3 0.1 0.0 ... 0.0 0.8 26 1720.00
6 0.8 0.0 0.0 ... 0.0 0.1 50 18145.25
The above is a view of my dataframe. I want the output to retain 95% of the cumulative variance. How can I do that?
My dataframe has 100,000 rows and 25,088 columns, so please suggest a solution that is memory efficient.

Have a look at the PCA implementation in dask-ml, https://ml.dask.org/modules/generated/dask_ml.decomposition.PCA.html;
it might already work for your case, as it uses the tsqr algorithm (https://arxiv.org/abs/1301.1071).
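Below is a minimal sketch of how that might look, assuming the feature columns are named feat1 ... feat25088 as in the head() output above; the upper bound of 100 components is an illustrative assumption, not something from the original post.
from dask_ml.decomposition import PCA

# keep only the feature columns and convert to a dask array with known chunk sizes
feature_cols = [c for c in image_features_df.columns if c.startswith('feat')]
X = image_features_df[feature_cols].to_dask_array(lengths=True)

# fit a PCA with a generous upper bound on components (computed out of core via tsqr)
pca = PCA(n_components=100, random_state=0)
X_reduced = pca.fit_transform(X)

# keep just enough components to reach 95% cumulative explained variance
cumulative = pca.explained_variance_ratio_.cumsum()
n_keep = int((cumulative < 0.95).sum()) + 1
X_95 = X_reduced[:, :n_keep]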


In Python/IPython, is there any way to access or call an object or variable's properties solely from its name within a list?

I have a Jupyter Notebook. I know it's not optimal for large work, but in many circumstances it's the tool I have to use.
After some computations, I end up with several pandas DataFrames in memory that I would like to pickle. So I do
df_name.to_pickle(filename)
However, I wanted to create a list of all DataFrame using
df_list = %who DataFrame
And then I wanted to do something like
for varname in df_list:
    varname.to_pickle(f'{varname}.pickle')
This of course doesn't work, because varname is a string, not a DataFrame object with the associated .to_pickle method.
So my stupid question is, what's the best way to access the actual object varname refers to, and not just the string with its name?
Note: If I create a list of the actual DataFrame, these are quite big objects in memory, so I will probably run into memory issues.
Thanks
As @matszwecja pointed out in the comments, the most reasonable way is to collect them as you make them. It will also be clearest to yourself and others later, and it is more robust and easier to debug as you develop the code.
However, you seemed to be thinking more abstractly about iterating over the dataframes in the kernel's namespace, and it is possible to do that and step through pickling the dataframes automatically. It's just not that easy, perhaps. For example, you already found you cannot simply make a usable list with df_list = %who DataFrame. (It shows the names in the output cell but not in a way Python can use.)
Here's an option that would work if you really did want to do it. This first part sets up some dummy dataframes and then makes a list of dictionaries of them:
import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
input ='''
River_Level Rainfall
0.876 0.0
0.877 0.8
0.882 0.0
0.816 0.0
0.826 0.0
0.836 0.0
0.817 0.8
0.812 0.0
0.816 0.0
0.826 0.0
0.836 0.0
0.807 0.8
0.802 0.0
'''
df_name_one = pd.read_table(StringIO(input), header=0, index_col=None, delim_whitespace=True)
input ='''
River_Level Rainfall
0.976 0.1
0.977 0.5
0.982 0.0
0.916 0.3
0.926 0.0
0.996 9.0
0.917 0.8
0.912 0.0
0.916 0.0
0.926 0.1
0.836 0.0
0.907 0.6
0.902 0.0
'''
df_name_two = pd.read_table(StringIO(input), header=0, index_col=None, delim_whitespace=True)
list_of_dfs_dicts = []
for obj_name in dir():
    obj_type_str = str(type(eval(obj_name)))
    #print(obj_type_str)
    if "DataFrame" in obj_type_str:
        #print(obj_name)
        #print(obj_type_str)
        list_of_dfs_dicts.append({obj_name: eval(obj_name)})
Now each entry in the list is a dictionary mapping the name of the dataframe object to the dataframe itself. That can be iterated over and pickled via a single line in a notebook:
[df.to_pickle(f'{varname}.pkl') for d in list_of_dfs_dicts for varname,df in d.items()];
That actually equates to this, which is easier to read:
for d in list_of_dfs_dicts:
    for varname, df in d.items():
        df.to_pickle(f'{varname}.pkl')
For this self-contained answer, I put the entire dataframe into the collected list and dictionary. Memory wasn't a concern here with these small dataframes, and I wanted the example to illustrate things in small steps.
However, memory was a concern of yours. You can vary the collection step so that it stores only the names, not the dataframes themselves, like so:
import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
input ='''
River_Level Rainfall
0.876 0.0
0.877 0.8
0.882 0.0
0.816 0.0
0.826 0.0
0.836 0.0
0.817 0.8
0.812 0.0
0.816 0.0
0.826 0.0
0.836 0.0
0.807 0.8
0.802 0.0
'''
df_name_one = pd.read_table(StringIO(input), header=0, index_col=None, delim_whitespace=True)
input ='''
River_Level Rainfall
0.976 0.1
0.977 0.5
0.982 0.0
0.916 0.3
0.926 0.0
0.996 9.0
0.917 0.8
0.912 0.0
0.916 0.0
0.926 0.1
0.836 0.0
0.907 0.6
0.902 0.0
'''
df_name_two = pd.read_table(StringIO(input), header=0, index_col=None, delim_whitespace=True)
df_list = []
for obj_name in dir():
    obj_type_str = str(type(eval(obj_name)))
    if "DataFrame" in obj_type_str:
        df_list.append(obj_name)

for df_name in df_list:
    eval(df_name).to_pickle(f'{df_name}.pkl')
Bear in mind, though, that eval() is something to be careful with; in particular, it opens the door to code injection.
Also, by doing it this way you aren't checking anything. For example, while developing you could erroneously create a lot of dataframes at some point (example), and if those were still in your kernel's namespace, they'd ALL get pickled by the pickling step. That's why collecting what you want as you go along is more practical and safer/more robust in the long run. I just thought your idea of using df_list = %who DataFrame was intriguing.
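As a side note, here is a minimal sketch of the same idea without eval(), scanning the notebook's global namespace directly with globals(); it assumes the code runs at the top level of the notebook, where globals() is the same user namespace that %who inspects.
import pandas as pd

for varname, obj in list(globals().items()):
    # skip IPython's internal names like _, __, _oh
    if isinstance(obj, pd.DataFrame) and not varname.startswith('_'):
        obj.to_pickle(f'{varname}.pkl')
The same caveat applies: anything in the namespace that happens to be a DataFrame gets pickled.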

Degree Centrality and Clustering Coefficient in an Adjacency Matrix

Based on a dataset extracted from this link: Brain and Cosmic Web samples, I'm trying to do some complex network analysis.
The paper The Quantitative Comparison Between the Neuronal Network and the Cosmic Web claims to have used this dataset, as well as its adjacency matrices,
"Mij, i.e., a matrix with rows/columns equal to the number of detected nodes, with value Mij = 1 if the nodes are separated by a distance ≤ l_link, or Mij = 0 otherwise".
I then probed into the matrix, like so:
import pandas as pd
from astropy.io import fits

with fits.open('mind_dataset/matrix_CEREBELLUM_large.fits') as data:
    matrix_cerebellum = pd.DataFrame(data[0].data)
which does not give a sparse 0/1 matrix, but rather a matrix of distances between nodes expressed in pixels.
I've learned that the correspondence between 1 pixel and scale is:
neuronal_web_pixel = 0.32 # micrometers
I came up with a function to convert pixels to microns:
def pixels_to_scale(df, mind=False, cosmos=False):
    one_pixel_equals_parsec = cosmic_web_pixel
    one_pixel_equals_micron = neuronal_web_pixel
    if mind:
        df = df / one_pixel_equals_micron
    if cosmos:
        df = df / one_pixel_equals_parsec
    return df
Then another function to binarize the matrix after the conversion:
def binarize_matrix(df, mind=False, cosmos=False):
    if mind:
        brain_Llink = 16.0  # microns
        # distances less than 16 microns
        brain_mask = (df <= brain_Llink)
        # convert to 1
        df = df.where(brain_mask, 1.0)
    if cosmos:
        cosmos_Llink = 1.2  # 1.2 Mpc
        brain_mask = (df <= cosmos_Llink)
        df = df.where(brain_mask, 1.0)
    return df
Finally, with:
matrix_cerebellum = pixels_to_scale(matrix_cerebellum, mind=True)
matrix_cerebellum = binarize_matrix(matrix_cerebellum, mind=True)
matrix_cerebellum.head(5) prints my sparse matrix of (mostly) 0.0s and 1.0s:
0 1 2 3 4 5 6 7 8 9 ... 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 rows × 1858 columns
Now I would like to calculate:
Degree Centrality of the network, given by the formula:
Cd(j) = Kj / (n - 1)
where Kj is the number of (undirected) connections to/from node j and n is the total number of nodes in the entire network.
Clustering Coefficient, which quantifies the existence of infrastructure within the local vicinity of nodes, given by the formula:
C(j) = 2yj / (Kj(Kj - 1))
in which yj is the number of links between the neighbouring nodes of node j.
For finding Degree Centrality, I have tried:
# find connections by adding matrix row values
matrix_cerebellum['K'] = matrix_cerebellum.sum(axis=1)
# applying formula
matrix_cerebellum['centrality'] = matrix_cerebellum['K']/matrix_cerebellum.shape[0]-1
Generates:
... K centrality
9.0 -0.995156
6.0 -0.996771
7.0 -0.996771
11.0 -0.996233
11.0 -0.994080
According to the paper, I should be finding:
"For the cerebellum slices we measured 〈k〉 ∼ 1.9 − 3.7",
for the average number of connections per node.
Also, I'm finding negative centralities.
Does anyone know how to apply any of these formulas based on the dataframe above?
This is not really a programming question, but I will try to answer it. The webpage with the data sources states that the adjacency matrix files for brain samples give distances between connected nodes, expressed in pixels of the images used to reconstruct the networks. The paper then explains that to get the real adjacency matrix Mij (with 0 and 1 values only) the authors consider as connected those nodes whose distance is at most 16 micrometers. I don't see any information on how many pixels in the image correspond to one micrometer; this would be needed to compute the same matrix Mij that the authors used in their calculations.
Furthermore, the value 〈k〉 is not the degree centrality or the clustering coefficient (which depend on a node), but rather the average number of connections per node in the network, computed using the matrix Mij. The paper then compares the observed distributions of degree centralities and clustering coefficients in the brain and cosmic networks to the distributions one would see in a random network with the same number of nodes and the same value of 〈k〉. The conclusion is that brain and cosmic networks are highly non-random.
Edits:
1. The conversion of 0.32 micrometers per pixel seems to be right. In the files with data on brain samples (both for cortex and cerebellum) the largest value is 50 pixels, which with this conversion corresponds to 16 micrometers. This suggests that the authors of the paper already thresholded the matrices, listing in them only distances not exceeding 16 micrometers. In view of this, to obtain the matrix Mij with 0 and 1 values only, one simply needs to replace all non-zero values with 1. An issue is that using the matrices obtained in this way one gets 〈k〉 = 9.22 for cerebellum and 〈k〉 = 7.13 for cortex, which is somewhat outside the ranges given in the paper. I don't know how to account for this discrepancy.
2. Negative centrality values are due to a mistake (missing parentheses) in the code. It should be:
matrix_cerebellum['centrality'] = matrix_cerebellum['K']/(matrix_cerebellum.shape[0] - 1)
3. Clustering coefficient and degree centrality of each node can be computed using tools provided by the networkx library:
from astropy.io import fits
import networkx as nx
# get the adjacency matrix for cortex
with fits.open('matrix_CORTEX_large.fits') as data:
    M = data[0].data
M[M > 0] = 1
# create a graph object
G_cortex = nx.from_numpy_matrix(M)
# compute degree centrality of all nodes
centrality = nx.degree_centrality(G_cortex)
# compute clustering coefficient of all nodes
clustering = nx.clustering(G_cortex)
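As a follow-up to point 1, the average number of connections per node, 〈k〉, mentioned in the paper can be read off the same graph object; a small sketch (not part of the original answer):
import numpy as np

# average degree <k> = 2 * (number of edges) / (number of nodes)
k_mean = 2 * G_cortex.number_of_edges() / G_cortex.number_of_nodes()

# equivalently, the mean of the per-node degrees
k_mean_alt = np.mean([deg for _, deg in G_cortex.degree()])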

Count All Occurrences of a Specific Value in a Dask Dataframe

I have a dask dataframe with thousands of columns and rows as follows:
pprint(daskdf.head())
grid lat lon ... 2014-12-29 2014-12-30 2014-12-31
0 0 48.125 -124.625 ... 0.0 0.0 -17.034216
1 0 48.625 -124.625 ... 0.0 0.0 -19.904214
4 0 42.375 -124.375 ... 0.0 0.0 -8.380443
5 0 42.625 -124.375 ... 0.0 0.0 -8.796803
6 0 42.875 -124.375 ... 0.0 0.0 -7.683688
I want to count all occurrences in the entire dataframe where a certain value appears. In pandas, this can be done as follows:
pddf[pddf==500].count().sum()
I'm aware that you can't translate all pandas functions/syntax with dask, but how would I do this with a dask dataframe? I tried doing:
daskdf[daskdf==500].count().sum().compute()
but this yielded a "Not Implemented" error.
As in many cases where a row-wise pandas method is not yet explicitly implemented in dask, you can use map_partitions. In this case it might look like:
daskdf.map_partitions(lambda df: df[df == 500].count()).sum().compute()
You can experiment with whether also doing a .sum() within the lambda helps (it would produce smaller intermediates) and with what the meta= argument to map_partitions should look like.
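For example, a sketch of that variant, assuming the question's daskdf and the value 500; wrapping the per-partition count in a one-element Series and the meta spec of (None, 'int64') are assumptions about how to describe the per-partition result to dask:
import pandas as pd

# each partition produces a one-row Series holding its local count of matches
per_partition = daskdf.map_partitions(
    lambda df: pd.Series(df[df == 500].count().sum()),
    meta=(None, 'int64'),
)
total = per_partition.sum().compute()  # add up the per-partition counts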

Does the test set need data cleaning in machine learning?

I am working on an interesting machine learning project about the NYC taxi data (https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-04.csv); the target is predicting the tip amount. The raw data looks like this (2 data samples):
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag \
0 2 2017-04-01 00:03:54 2017-04-01 00:20:51 N
1 2 2017-04-01 00:00:29 2017-04-01 00:02:44 N
RatecodeID PULocationID DOLocationID passenger_count trip_distance \
0 1 25 14 1 5.29
1 1 263 75 1 0.76
fare_amount extra mta_tax tip_amount tolls_amount ehail_fee \
0 18.5 0.5 0.5 1.00 0.0 NaN
1 4.5 0.5 0.5 1.45 0.0 NaN
improvement_surcharge total_amount payment_type trip_type
0 0.3 20.80 1 1.0
1 0.3 7.25 1 1.0
There are five different 'payment_type' values, indicated by the numbers 1, 2, 3, 4, 5.
I find that the 'tip_amount' is only meaningful when 'payment_type' is 1; 'payment_type' 2, 3, 4, 5 all have zero tip:
for i in range(1, 6):
    print(raw[raw["payment_type"] == i][['tip_amount', 'payment_type']].head(2))
gives:
tip_amount payment_type
0 1.00 1
1 1.45 1
tip_amount payment_type
5 0.0 2
8 0.0 2
tip_amount payment_type
100 0.0 3
513 0.0 3
tip_amount payment_type
59 0.0 4
102 0.0 4
tip_amount payment_type
46656 0.0 5
53090 0.0 5
First question: I want to build a regression model for 'tip_amount'. If I use 'payment_type' as a feature, can the model automatically handle this kind of behavior?
Second question: We know that the 'tip_amount' is actually not zero for 'payment_type' 2, 3, 4, 5, it's just not correctly recorded. If I drop these data samples and only keep 'payment_type' == 1, then when the model is used on an unseen test dataset it cannot predict zero tip for 'payment_type' 2, 3, 4, 5, so I have to keep 'payment_type' as an important feature, right?
Third question: Let's say I keep all the different 'payment_type' data samples and the model is able to predict zero tip amount for 'payment_type' 2, 3, 4, 5, but is this what we really want? The underlying true tip should not be zero; that's just how the data was recorded.
A common saying in machine learning is "garbage in, garbage out." Often, feature selection and data preprocessing are more important than your model architecture.
First question:
Yes
Second question:
Since payment_type values of 2, 3, 4, 5 all result in a zero tip, why not keep it simple: replace all payment types that are not 1 with 0. This will let your model easily correlate 1 with a recorded tip and 0 with no recorded tip. It also reduces the amount of things your model will have to learn in the future.
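A minimal sketch of that replacement (assuming the data is in a pandas DataFrame called raw, as in the question's snippet):
# 1 = the only payment type with recorded tips, 0 = every other payment type
raw['payment_type'] = (raw['payment_type'] == 1).astype(int)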
Third question:
If the "underlying true tip" is not reflected in the data, then it is simply impossible for your model to learn it. Whether this inaccurate representation of the truth is what we want or not what we want is a decision for you to make. Ideally you would have data that shows the actual tip.
Preprocessing your data is very important and will help your model tremendously. Besides making some changes to your payment_type feature, you should also look into normalizing your data, which will help your machine learning algorithm better generalize the relations in your data.
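A small sketch of that normalization, using scikit-learn's StandardScaler on a few of the numeric columns from the question's sample (the exact column list is an illustrative choice):
from sklearn.preprocessing import StandardScaler

numeric_cols = ['trip_distance', 'fare_amount', 'tolls_amount', 'total_amount']
raw[numeric_cols] = StandardScaler().fit_transform(raw[numeric_cols])
In practice you would fit the scaler on the training split only and reuse the fitted scaler to transform the test split, so no information from the test set leaks into training.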

Modelling and Plotting Fat Tails - Python

I am working with stock indices. I have a NumPy array which contains the daily returns data for the index for the last 25 years or so. I have plotted the empirical PDF and also the corresponding normal PDF to show how much the actual data deviates from a normal distribution.
My questions are:-
Is there a Pythonic way to test whether my left tail is actually a fat tail or not?
And in the above graph, how do I mark a point/threshold beyond which I can say the tail is fat?
Consider scipy.stats.kurtosistest and scipy.stats.skewtest.
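A minimal sketch of those tests, assuming the daily returns are in a 1-D NumPy array called returns (the name is illustrative); small p-values indicate kurtosis/skewness inconsistent with a normal distribution, i.e. evidence of fat tails:
from scipy import stats

kurt_stat, kurt_p = stats.kurtosistest(returns)
skew_stat, skew_p = stats.skewtest(returns)
print(f"kurtosis test: stat={kurt_stat:.3f}, p={kurt_p:.3g}")
print(f"skew test:     stat={skew_stat:.3f}, p={skew_p:.3g}")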
To your second question, use .axvline to mark your line there. Depending on how granular the bins are, try finding the first point left of zero that meets the following condition:
df
Out[20]:
Normal Empirical
Bin
-1.0 0 2.0
-0.9 1 2.5
-0.8 2 3.0
-0.7 3 3.5
-0.6 4 4.0
-0.5 5 4.5
-0.4 6 5.0
-0.3 7 6.0
-0.2 8 8.0
-0.1 9 10.0
0.0 10 12.0
df.index[(df.Normal.shift() < df.Empirical.shift())
& (df.Normal == df.Empirical)].values
Out[38]: array([-0.6])
And lastly, you could consider plotting the actual histogram in addition to the fitted distribution, and using an inset, as is done here.
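A short sketch of marking that threshold with .axvline, assuming ax is the matplotlib Axes the two PDFs were plotted on and df is the binned table above:
threshold = df.index[(df.Normal.shift() < df.Empirical.shift())
                     & (df.Normal == df.Empirical)].values[0]
ax.axvline(threshold, color='red', linestyle='--', label='fat-tail threshold')
ax.legend()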
