Pyspark random split changes distribution of data

I found some very strange behavior in PySpark when I use randomSplit. I have a column is_clicked that takes the values 0 or 1, and there are far more zeros than ones. After a random split I would expect the ones to be spread uniformly through each split. Instead, the first rows of each split all have is_clicked=1, followed by rows that all have is_clicked=0. The number of clicks in the original dataframe df is 9 out of 1000 (which is what I expect), but after the random split it is 1000 out of 1000. If I take more rows, they are all is_clicked=1 until there are no more such rows, and then they are followed by rows with is_clicked=0.
Does anyone know why the distribution changes after a random split? How can I make is_clicked uniformly distributed after the split?

So PySpark does indeed sort the data when it performs randomSplit. Here is a quote from the code:
It is possible that the underlying dataframe doesn't guarantee the
ordering of rows in its constituent partitions each time a split is
materialized which could result in overlapping splits. To prevent
this, we explicitly sort each input partition to make the ordering
deterministic. Note that MapTypes cannot be sorted and are explicitly
pruned out from the sort order.
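The solution is either to reshuffle the data after the split or to use filter instead of randomSplit:
Solution 1:
from pyspark.sql import functions as sf
df = df.withColumn('rand', sf.rand(seed=42)).orderBy('rand')
df_train, df_test = df.randomSplit([0.5, 0.5])
# randomSplit sorts each partition, so reshuffle the splits afterwards
df_train = df_train.orderBy('rand')
df_test = df_test.orderBy('rand')
Solution 2:
# uses the 'rand' column added in Solution 1
df_train = df.filter(df.rand < 0.5)
df_test = df.filter(df.rand >= 0.5)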
Here is a blog post with more details.
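To see why a sort inside randomSplit can produce the pattern from the question without changing the overall click rate, here is a toy simulation in plain Python. It is only a sketch of the idea, not Spark's implementation: the function toy_random_split, the descending sort order, and the data sizes are all assumptions made for illustration.

```python
import random


def toy_random_split(rows, weight, seed):
    """Toy model: sort the rows, then send each row to the first split
    with probability `weight` and to the second split otherwise."""
    rng = random.Random(seed)
    # Deterministic ordering before sampling. Descending order is an
    # assumption chosen so that the 1s come first, as in the question.
    ordered = sorted(rows, reverse=True)
    first, second = [], []
    for row in ordered:
        (first if rng.random() < weight else second).append(row)
    return first, second


# Hypothetical data: 90 clicks among 10,000 rows, in random order.
rows = [1] * 90 + [0] * 9910
random.Random(0).shuffle(rows)

train, test = toy_random_split(rows, weight=0.5, seed=42)

head_rate = sum(train[:20]) / 20          # click rate in the first 20 rows
overall_rate = sum(train) / len(train)    # click rate in the whole split
print(head_rate, overall_rate)
```

In this toy model the first rows of the split are dominated by clicks, while the click rate over the whole split stays close to that of the input. If Spark behaves the same way, only the row order is affected, which is why reshuffling after the split is enough.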

Related

pandas : sampling avoiding twice same values in different samples

I have a dataframe with 5000 rows.
I want to draw 4 random samples of 300 rows each from the dataframe. Each sample should contain no duplicates, and no row should appear in more than one sample; i.e. I don't want a row to appear in both sample 1 and sample 3, for example.
I have tried df.sample(300, replace=False), but that is not enough.
I have also searched the forum but didn't find what I want.
How can I do this in pandas without doing batch groups?

I don't think there is a pandas function specifically for that, but how about doing this:
import pandas as pd
df = pd.DataFrame({"col": range(5000)})
sample = df.sample(1200, replace=False) # 4 x 300 rows, drawn without replacement
sample.duplicated().any()
>> False # <-- no duplicates
samples = [sample.iloc[i-300:i] for i in range(300, 1500, 300)] # <-- 4 samples
Since .sample with replace=False returns a random selection without replacement, slicing it into consecutive chunks of 300 rows gives four samples that share no rows, which is what you want.
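The same idea can be wrapped in a small helper that also checks the result. This is a sketch only; the name disjoint_samples and its parameters are made up for illustration.

```python
import pandas as pd


def disjoint_samples(df, n_samples, size, seed=None):
    """Return n_samples DataFrames of `size` rows each that share no rows."""
    # A single draw without replacement guarantees every selected row is unique.
    pool = df.sample(n=n_samples * size, replace=False, random_state=seed)
    # Cut the shuffled pool into consecutive, non-overlapping chunks.
    return [pool.iloc[i * size:(i + 1) * size] for i in range(n_samples)]


df = pd.DataFrame({"col": range(5000)})
samples = disjoint_samples(df, n_samples=4, size=300, seed=0)

# The row labels of all samples together should contain no repeats.
all_labels = pd.concat(samples).index
print(all_labels.is_unique, [len(s) for s in samples])
```

Passing a seed makes the samples reproducible; leaving it as None gives a different draw on every call.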

Return String Similarity Scores between two String Columns - Pandas

I'm trying to build search-based results, wherein I have an input dataframe with one row that I want to compare against another dataframe with almost 1 million rows. I'm using a package called Record Linkage.
However, I'm not able to handle typos. Let's say I have "HSBC" in my original data and the user types "HKSBC"; I want to return only the "HSBC" results. Comparing the strings with Jaro-Winkler similarity, I get the following result:
from pyjarowinkler import distance
distance.get_jaro_distance("hksbc", "hsbc", winkler=True, scaling=0.1)
>> 0.94
However, I'm not able to give "HSBC" as an output, so I want to create a new column in my pandas dataframe in which I compute the string similarity scores, and then keep only the rows whose score is above a particular threshold.
Also, the main bottleneck is that I have almost 1 million rows, so I need to compute it really fast.
P.S. I have no intention of using fuzzywuzzy; preferably either Jaccard or Jaro-Winkler.
P.P.S. Any other ideas for handling typos in a search-based setting are also welcome.

I was able to solve it using record linkage alone. Basically, it does an initial indexing and generates candidate links (you can refer to the documentation on "SortedNeighbourhoodindexing" for more info), i.e. it builds a MultiIndex between the two dataframes that need to be compared, which I did manually.
So here is my code:
import recordlinkage
df['index'] = 1 # this will be static since I'll have only one input value
df['index_2'] = range(1, len(df)+1)
df.set_index(['index', 'index_2'], inplace=True)
candidate_links=df.index
df.reset_index(drop=True, inplace=True)
df.index = range(1, len(df)+1)
# once the candidate links has been generated you need to reset the index and compare with the input dataframe which basically has only one static index, i.e. 1
compare_cl = recordlinkage.Compare()
compare_cl.string('Name', 'Name', label='Name', method='jarowinkler') # 'Name' is the column name which is there in both the dataframe
features = compare_cl.compute(candidate_links,df_input,df) # df_input is the i/p df having only one index value since it will always have only one row
print(features)
                   Name
index index_2
1     13446    0.494444
      13447    0.420833
      13469    0.517949
Now I can apply a filter like this:
features = features[features['Name'] > 0.9] # setting the threshold which will filter away my not-so-close names.
Then,
df = df[df.index.isin(features.index.get_level_values('index_2'))] # features is indexed by (index, index_2), and df.index holds the index_2 values
This filters my results and gives me the final dataframe, containing only the rows whose name score is greater than the threshold set by the user.
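For readers who want to see what the threshold filter above is doing, here is a minimal, hand-written Jaro-Winkler similarity in pure Python. It is an illustrative sketch, not the code used by recordlinkage or pyjarowinkler, and a pure-Python loop like this would likely be too slow for a million rows; the names jaro, jaro_winkler, and the sample list are assumptions.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


# Hypothetical candidate names and a query containing a typo.
names = ["HSBC", "Barclays", "Citibank"]
query = "HKSBC"
scores = {name: jaro_winkler(query.lower(), name.lower()) for name in names}
matches = [name for name, score in scores.items() if score > 0.9]
print(scores, matches)
```

With names held in a pandas column, the same function could be applied through Series.map and the result compared against the threshold, at the cost of running Python code once per row.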

Partition matrix into smaller matrices based on multiple values

So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity, I'll work with a much smaller matrix as an example. The 7 columns consist of (in this order): item number, an x and y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). Using pandas, I've imported my matrix, called A, from an Excel sheet; it looks like this:
What I need to do is partition this based on both labels (i.e. so I have one matrix that is all the 13G + Aa together, another matrix that is 14G + Aa, and another one that is 14G + Ab together; I would wind up with 3 separate 2x7 matrices). The reason is that I need to run a bunch of statistics on the numbers in the "Marker" column for each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 "marker" numbers, and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices in the real data set, I was trying to figure out some way to label the smaller matrices something like M1, M2, ..., M500 (or whatever number it ends up being) so that later I can use loops to apply statistics to each individual matrix all at once without having to write it 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.as_matrix() #Convert excel sheet to matrix
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.

Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv(r"C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
# move both label columns into a two-level MultiIndex
df = df.set_index(["Label 1", "Label 2"]).sort_index()
print(df)
Here, we extracted all unique values in the columns "Label 1" and "Label 2" (they are useful for looping later on). In the df.set_index line, we moved those columns out of the DataFrame's data: now they act as a two-level index for your other columns. For example, in order to access the DataFrame slice from your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
Lastly, instead of saving each sliced DataFrame into a list or dictionary and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create a sliced DataFrame, you perform the calculations, save them to an output/results file or DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        if (L1, L2) not in df.index:
            continue  # skip label combinations that do not occur in the data
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
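As an alternative to the explicit loop above, pandas can split the data by both labels and compute the statistics in a single groupby call. The sketch below uses a made-up six-row table; the column names "Label 1", "Label 2", and "Marker" and all values are assumptions standing in for the real file.

```python
import pandas as pd

# Hypothetical stand-in for the real data set.
df = pd.DataFrame({
    "Label 1": ["13G", "13G", "14G", "14G", "14G", "14G"],
    "Label 2": ["Aa", "Aa", "Aa", "Aa", "Ab", "Ab"],
    "Marker": [1.0, 3.0, 2.0, 4.0, 10.0, 20.0],
})

# One row of statistics per (Label 1, Label 2) combination that occurs.
stats = df.groupby(["Label 1", "Label 2"])["Marker"].agg(["mean", "std", "count"])
print(stats)

# If the sub-tables themselves are needed, keep them in a dict keyed by
# the label pair instead of naming them M1, M2, ...
groups = {key: sub_df for key, sub_df in df.groupby(["Label 1", "Label 2"])}
print(groups[("14G", "Ab")])
```

A groupby only produces groups for label pairs that are present, so there is no need to handle missing combinations separately.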

Unable to drop rows in pandas DataFrame which contain zeros

I'm editing a large dataframe in Python. How do you drop entire rows from the dataframe if a specific column has the value 0.0 in that row?
When I drop the 0.0s in the overall satisfaction column, the edits do not show up in my scatterplot matrix of the large dataframe.
I have tried:
filtered_df = filtered_df.drop([('overall_satisfaction'==0)], axis=0)
I also tried replacing 0.0 with nulls and dropping the nulls:
filtered_df = filtered_df.['overall_satisfaction'].replace(0.0, np.nan), axis=0)
filtered_df = filtered_df[filtered_NZ_df['overall_satisfaction'].notnull()]
What concept am I missing? Thanks :)

It seems like your values are small enough to be displayed as zeros, but are not actually zeros. This usually happens when floating-point calculations produce really small numbers that approach zero but are not quite zero, so equality comparisons do not give you the result you're looking for.
In cases like this, numpy has a handy function called isclose that lets you test whether a number is close enough to another number within a certain tolerance.
In your case, doing
df = df[~np.isclose(df['overall_satisfaction'], 0)]
seems to work.
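A small sketch of the difference between an exact comparison and np.isclose, using made-up values (the 1e-12 entry stands in for a number that is displayed as zero but is not exactly zero):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one exact zero and one value that is only nearly zero.
df = pd.DataFrame({"overall_satisfaction": [0.0, 1e-12, 3.5, 4.0]})

# An exact comparison removes only the true zero.
exact = df[df["overall_satisfaction"] != 0]

# np.isclose treats values within its default tolerance of 0 as zero.
tolerant = df[~np.isclose(df["overall_satisfaction"], 0)]

print(len(exact), len(tolerant))
```

Both filters return a new DataFrame, so the result has to be assigned (and the plot rebuilt from it) for the change to show up.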

What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.
Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?
My initial thoughts are:
1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data
2) If some percentage of the values in the column is unique (e.g., >=20%), then the column very likely contains continuous data
I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?
Edit: Someone requested that I explain why 2) didn't work well. There were some tests cases where we still had continuous values in a column but there weren't many unique values in the column. The heuristic in 2) obviously failed in that case. There were also issues where we had a categorical column that had many, many unique values, e.g., passenger names in the Titanic data set. Same column type misclassification problem there.

Here are a couple of approaches:
1) Find the ratio of the number of unique values to the total number of values. Something like the following:
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold
2) Check if the top n unique values account for more than a certain proportion of all values:
top_n = 10
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold
Approach 1) has generally worked better for me than approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categories have high frequency while a large number of categories have low frequency.
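The two approaches above can be compared side by side on a small synthetic table. This is a sketch under stated assumptions: the function name likely_categorical, the thresholds, and the generated columns are illustrative and would need tuning on real data.

```python
import numpy as np
import pandas as pd


def likely_categorical(df, max_unique_ratio=0.05, top_n=10, min_top_share=0.8):
    """Apply both heuristics to every column and report the two verdicts."""
    verdicts = {}
    for var in df.columns:
        col = df[var].dropna()
        if len(col) == 0:
            continue  # nothing to judge in an empty column
        unique_ratio = col.nunique() / len(col)
        top_share = col.value_counts(normalize=True).head(top_n).sum()
        verdicts[var] = {
            "by_unique_ratio": bool(unique_ratio < max_unique_ratio),
            "by_top_n_share": bool(top_share > min_top_share),
        }
    return pd.DataFrame(verdicts).T


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=200),   # few categories
    "price": rng.normal(size=200),                             # continuous
    "passenger": [f"name_{i}" for i in range(200)],            # unique strings
})
result = likely_categorical(df)
print(result)
```

Both heuristics flag the low-cardinality column and reject the continuous one, but they also reject the column of unique names, which is the kind of misclassification described in the question.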

There are many places where you could "steal" the definitions of formats that can be cast as a "number". ##,#e-# would be one such format, just to illustrate. Maybe you'll be able to find a library that does this.
I try to cast everything to numbers first, and for whatever is left there's no other option but to keep it as categorical.

You could define which datatypes count as numeric and then exclude the corresponding variables.
If the initial dataframe is df:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
dataframe = df.select_dtypes(exclude=numerics)
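A short sketch of the same idea on a made-up table. Instead of listing each numeric dtype by name, it uses the "number" selector that select_dtypes accepts; the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical data with integer, float, and string columns.
df = pd.DataFrame({
    "age": [31, 45, 27],
    "fare": [7.25, 71.28, 8.05],
    "name": ["Ann", "Bob", "Cy"],
})

numeric_part = df.select_dtypes(include="number")  # candidate continuous columns
other_part = df.select_dtypes(exclude="number")    # candidate categorical columns

print(list(numeric_part.columns), list(other_part.columns))
```

This separates columns purely by dtype, so integer-coded categories would still end up on the numeric side.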

I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.
If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.
If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.
But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?

I've been thinking about a similar problem, and the more I consider it, the more it seems that this is itself a classification problem that could benefit from training a model.
I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:
% floats: percentage of values that are float
% int: percentage of values that are whole numbers
% string: percentage of values that are strings
% unique string: number of unique string values / total number
% unique integers: number of unique integer values / total number
mean numerical value (non numerical values considered 0 for this)
std deviation of numerical values
and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
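As a rough sketch of the feature-extraction step described above, the function below computes a few of those per-column features with pandas. The function name column_features, the exact feature definitions, and the sample data are assumptions; in particular, non-numeric values are skipped when computing the mean rather than counted as 0.

```python
import pandas as pd


def column_features(col):
    """Summarise one column with a few simple type and uniqueness features."""
    values = col.dropna()
    n = len(values)
    if n == 0:
        return {}
    is_str = values.map(lambda v: isinstance(v, str)).astype(bool)
    # Numeric view of the non-string values; unparseable entries are dropped.
    numeric = pd.to_numeric(values[~is_str], errors="coerce").dropna().astype(float)
    is_whole = numeric == numeric.round()
    return {
        "frac_string": float(is_str.mean()),
        "frac_whole": float(is_whole.sum() / n),
        "frac_float": float((~is_whole).sum() / n),
        "frac_unique": values.nunique() / n,
        "mean_numeric": float(numeric.mean()) if len(numeric) else 0.0,
        "std_numeric": float(numeric.std()) if len(numeric) > 1 else 0.0,
    }


# Hypothetical columns: a string category, whole numbers, and real numbers.
df = pd.DataFrame({
    "breed": ["pug", "lab", "pug", "lab"],
    "hour": [1, 8, 22, 8],
    "weight": [7.5, 30.2, 8.1, 29.9],
})
features = pd.DataFrame({name: column_features(df[name]) for name in df.columns}).T
print(features)
```

Each row of the resulting table is one training example for the proposed model; the labels (categorical, ordinal, quantitative) would still have to be assigned by hand for the training datasets.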
Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.
A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g. in the forest-cover-type-prediction Kaggle contest, you would automatically know that soil type is a single categorical variable.

IMO the opposite strategy, identifying categoricals, is better, because it depends on what the data is about. Technically, address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.
For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which would probably need hardcoded (and translated) levels to look for, such as "good", "bad", ".agree.", "very .*", ...) or int values in the 0-8 range + NA.
Countries and such things might also be identifiable...
Age groups (".-.") might also work.

I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.
import pandas as pd
def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
    """Removes categorical features using a given method.
    X: pd.DataFrame, dataframe to remove categorical features from."""
    if method=='fraction_unique':
        unique_fraction = X.apply(lambda col: len(pd.unique(col))/len(col))
        reduced_X = X.loc[:, unique_fraction>min_fraction_unique]
    elif method=='named_columns':
        non_cat_cols = [col not in cat_cols for col in X.columns]
        reduced_X = X.loc[:, non_cat_cols]
    else:
        raise ValueError("unknown method: " + method)
    return reduced_X
You can then call this function, giving a pandas df as X and you can either remove named categorical columns or you can choose to remove columns with a low number of unique values (specified by min_fraction_unique).
