Randomly selecting from Pandas groups with equal probability -- unexpected behavior - python

I have 12 unique groups that I am trying to randomly sample from, each with a different number of observations. I want to randomly sample from the entire population (dataframe) with each group having the same probability of being selected. The simplest example of this would be a dataframe with 2 groups.
groups probability
0 a 0.25
1 a 0.25
2 b 0.5
Using np.random.choice(df['groups'], p=df['probability'], size=100), each draw now has a 50% chance of selecting group a and a 50% chance of selecting group b.
To come up with the probabilities I used the formula:
(1. / num_groups) / size_of_groups
or in Python:
num_groups = len(df['groups'].unique()) # 2
size_of_groups = df.groupby('groups').size() # {a: 2, b: 1}
(1. / num_groups) / size_of_groups
Which returns
groups
a 0.25
b 0.50
This works great until I get past 10 unique groups, after which I start getting weird distributions. Here is a small example:
np.random.seed(1234)
group_size = 12
groups = np.arange(group_size)
probs = np.random.uniform(size=group_size)
probs = probs / probs.sum()
g = np.random.choice(groups, size=10000, p=probs)
df = pd.DataFrame({'groups': g})
prob_map = ((1. / len(df['groups'].unique())) / df.groupby('groups').size()).to_dict()
df['probability'] = df['groups'].map(prob_map)
plt.hist(np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True))
plt.xticks(np.arange(group_size))
plt.show()
I would expect a fairly uniform distribution with a large enough sample size, but instead I get these "wings" (irregular spikes) in the histogram whenever the number of groups is 11 or more. If I change the group_size variable to 10 or lower, I do get the desired uniform distribution.
I can't tell if the problem is with my formula for calculating the probabilities, or possibly a floating point precision problem? Anyone know a better way to accomplish this, or a fix for this example?
Thanks in advance!

You are using hist, which defaults to 10 bins:
plt.rcParams['hist.bins']
10
Pass group_size as the bins parameter:
plt.hist(
    np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True),
    bins=group_size)

There is no problem with your calculations. Your resulting array is:
arr = np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True)
If you check the value counts:
pd.Series(arr).value_counts().sort_index()
Out:
0 855
1 800
2 856
3 825
4 847
5 835
6 790
7 847
8 834
9 850
10 806
11 855
dtype: int64
It is pretty close to a uniform distribution. The problem is the histogram's default number of bins (10), which cannot give each of the 12 discrete groups its own bin. Instead, try this:
bins = np.linspace(-0.5, 11.5, num=13)
pd.Series(arr).plot.hist(bins=bins)
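For a check that avoids the histogram entirely, here is a minimal sketch reusing df and the probability column built above: the weights should sum to 1 and every group frequency should land near 1/12.
import numpy as np
import pandas as pd

# the per-row weights built above should sum to 1 across the whole frame
assert np.isclose(df['probability'].sum(), 1.0)

# empirical frequency of each group; with equal weights every value should be near 1/12
draws = np.random.choice(df['groups'], p=df['probability'], size=100000)
print(pd.Series(draws).value_counts(normalize=True).sort_index())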

Python plotting numerical columns of dataframe in loop while dynamically changing xtick frequency

Context
I'm trying to produce plots across a dataframe for value_counts.
I'm unable to share the dataset I've used as it's work-related, but I have used another dataset below.
Blocker
There are 3 main issues:
1. The line plt.xticks(np.arange(min(df_num[c]), max(df_num[c])+1, aaa)) raises "ValueError: arange: cannot compute length".
2. The xticks overlap.
3. The xticks at times aren't at the frequency specified below.
# load dataset
df = sns.load_dataset('mpg')
# subset dataset
df_num = df.select_dtypes(['int64', 'float64'])
# Loop over columns - plots
for c in df_num.columns:
    fig = plt.figure(figsize=[10, 5])
    bins1 = df_num[c].nunique() + 1
    # plot
    ax = df[c].plot(kind='hist', color='orange', bins=bins1, edgecolor='w')
    # dynamic xtick frequency
    if df_num[c].nunique() <= 30:
        aaa = 1
    elif 30 < df_num[c].nunique() <= 50:
        aaa = 3
    elif 50 < df_num[c].nunique() <= 60:
        aaa = 6
    elif 60 < df_num[c].nunique() <= 70:
        aaa = 7
    elif 70 < df_num[c].nunique() <= 80:
        aaa = 8
    elif 80 < df_num[c].nunique() <= 90:
        aaa = 9
    elif 90 < df_num[c].nunique() <= 100:
        aaa = 10
    elif 90 < df_num[c].nunique() <= 100:  # duplicate of the previous condition, never reached
        aaa = 20
    else:
        aaa = 40
    # format plot
    plt.xticks(np.arange(min(df_num[c]), max(df_num[c]) + 1, aaa))
    ax.set_title(c)
@Cimbali
The ticks are at times on the bin edge and at other times partly inside a bin.
Is it possible to have one or the other consistently?
TL;DR: define histogram bins and ticks based on the range of values and not the number of unique values.
Your histogram plots make some assumptions that might not be verified, in particular that all unique values are distributed identically. If that's not the case (which in general it isn't), then the range from min to max has little to do with the number of unique values (especially with floating point values, where unique values mean very little).
In particular, when you plot histograms, your bins (on the x-axis) correspond to the values (left). If you plot bars (right), you would get one bar per unique value, but not distributed based on the x-axis.
Here’s a simple example:
>>> s = pd.DataFrame([1, 1, 2, 5])
>>> s.plot(kind='hist')
>>> s.value_counts().plot(kind='bar')
You can see there are only 3 unique values, but the index range (and number of bars) runs from min to max on the histogram (left). If you only defined 3 bins, then 1 and 2 would land in the same bar.
The bar plot (right) has one bar per unique value, but then your x-axis is no longer proportional to the values.
So instead, let’s define the number of bars and indexes from the range of values:
>>> df_range = df_num.max() - df_num.min()
>>> df_range
mpg 37.6
cylinders 5.0
displacement 387.0
horsepower 184.0
weight 3527.0
acceleration 16.8
model_year 12.0
dtype: float64
>>> df_bins = df_range.div(10).round().astype(int).clip(lower=df_range.transform(np.ceil), upper=50)
>>> df_bins
mpg 39
cylinders 6
displacement 50
horsepower 50
weight 50
acceleration 18
model_year 13
dtype: int64
Here’s an example of plotting using these number of bins:
>>> for col, n in df_bins.iteritems():
... fig = plt.figure(figsize=(10,5))
... df[col].plot.hist(bins=n, title=col)
You can also define xticks in addition to bin sizes, but again, for histograms you have to take the range into account, not the number of unique values (so you could compute the ticks from the bins too). Your rules make for some pretty weird results, though, especially on very wide ranges:
>>> ticks = pd.Series(index=df_range.index, dtype=int)
>>> ticks[df_range < 30] = 1
>>> ticks[(30 < df_range) & (df_range <= 50)] = 3
>>> ticks[(50 < df_range) & (df_range <= 100)] = np.floor(df_range.div(10)) + 1
>>> ticks[100 < df_range] = 40
>>> for col, n in df_bins.iteritems():
... fig = plt.figure(figsize=(10,5))
... df[col].plot.hist(bins=n, title=col, xticks=np.arange(df[col].min(), df[col].max() + 1, ticks[col]))
Note that you could also use np.linspace to define the ticks from the min, max, and number (instead of min, max, and interval).
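A rough sketch of that np.linspace variant, reusing df and df_bins from above (the choice of 10 ticks is just an illustration):
for col, n in df_bins.items():
    fig = plt.figure(figsize=(10, 5))
    # ten evenly spaced ticks between the column's min and max
    ticks = np.linspace(df[col].min(), df[col].max(), num=10)
    df[col].plot.hist(bins=int(n), title=col, xticks=ticks)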

Optimizing interpolation of values in Pandas

I have been struggling with an optimization problem in Pandas.
I had developed a script to apply a computation on every line of a relatively small DataFrame (a few thousand rows, a few dozen columns).
I relied heavily on the apply() function, which was obviously a poor choice in most cases.
After a round of optimization, only one method still takes time, and I haven't found an easy solution for it:
Basically my dataframe contains a list of video viewing statistics with the number of people who watched the video for every quartile (how many have watched 0%, 25%, 50%, etc) such as :
video_name  video_length  video_0  video_25  video_50  video_75  video_100
video_1     6             1000     500       300       250       5
video_2     30            1000     500       300       250       5
I am trying to interpolate the statistics to be able to answer "how many people would have watched each quartile of the video if it lasted X seconds"
Right now my function takes the dataframe and a "new_length" parameter, and calls apply() on each line.
The function which handles each line computes the time marks for each quartile (so 0, 7.5, 15, 22.5 and 30 for the 30s video), and time marks for each quartile given the new length (so to reduce the 30s video to 6s, the new time marks would be 0, 1.5, 3, 4.5 and 6).
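That mark computation is just a scaled quartile vector; a tiny sketch, with the lengths taken from the example above:
import numpy as np

quartiles = np.array([0, 0.25, 0.5, 0.75, 1.0])
old_marks = quartiles * 30   # [ 0. ,  7.5, 15. , 22.5, 30. ]
new_marks = quartiles * 6    # [0. , 1.5, 3. , 4.5, 6. ]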
I build a dataframe containing the time marks as index, and the stats as values in the first column:
index (time marks)  view_stats
0                   1000
7.5                 500
15                  300
22.5                250
30                  5
1.5                 NaN
3                   NaN
4.5                 NaN
I then call DataFrame.interpolate(method="index") to fill the NaN values.
It works and gives me the result I expect, but it takes a whopping 11 s for a 3k-row dataframe, and I believe that has to do with the use of the apply() method combined with the creation of a new dataframe to interpolate the data for each line.
Is there an obvious way to achieve the same result "in place", e.g. by avoiding the apply / new dataframe approach and working directly on the original dataframe?
EDIT: The expected output when calling the function with 6 as the new length parameter would be :
video_name  video_length  video_0  video_25  video_50  video_75  video_100  new_video_0  new_video_25  new_video_50  new_video_75  new_video_100
video_1     6             1000     500       300       250       5          1000         500           300           250           5
video_2     6             1000     500       300       250       5          1000         900           800           700           600
The first line would be untouched because the video is already 6s long.
In the second line, the video would be cut from 30s to 6s, so the new quartile marks would be at 0, 1.5, 3, 4.5 and 6s, and the stats would be interpolated between 1000 and 500, which were the values at the old 0% and 25% time marks.
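For instance, interpolating the 30s row onto the new 6s marks with np.interp gives exactly those numbers:
import numpy as np

old_marks = [0, 7.5, 15, 22.5, 30]        # quartile times of the 30s video
stats = [1000, 500, 300, 250, 5]          # viewers at each old mark
new_marks = [0, 1.5, 3, 4.5, 6]           # quartile times after cutting to 6s

print(np.interp(new_marks, old_marks, stats))   # [1000.  900.  800.  700.  600.]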
EDIT2: I do not care if I need to add temporary columns, time is an issue, memory is not.
As a reference, this is my code :
def get_value(marks, asset, mark_index) -> int:
    value = marks["count"][asset["new_length_marks"][mark_index]]
    if isinstance(value, pandas.Series):
        res = value.iloc[0]
    else:
        res = value
    return math.ceil(res)

def length_update_row(row, assets, **kwargs):
    asset_name = row["asset_name"]
    asset = assets[asset_name]
    # assets is a dict containing the list of files and the old and "new" video marks,
    # pre-calculated
    marks = pandas.DataFrame(data=[int(row["video_start"]), int(row["video_25"]),
                                   int(row["video_50"]), int(row["video_75"]),
                                   int(row["video_completed"])],
                             columns=["count"],
                             index=asset["old_length_marks"])
    marks = marks.combine_first(pandas.DataFrame(data=NaN, columns=["count"],
                                                 index=asset["new_length_marks"][1:]))
    marks = marks.interpolate(method="index")
    row["video_25"] = get_value(marks, asset, 1)
    row["video_50"] = get_value(marks, asset, 2)
    row["video_75"] = get_value(marks, asset, 3)
    row["video_completed"] = get_value(marks, asset, 4)
    return row

def length_update_stats(report: pandas.DataFrame,
                        assets: dict) -> pandas.DataFrame:
    new_report = report.apply(lambda row: length_update_row(row, assets), axis=1)
    return new_report
IIUC, you could use np.interp:
# get the old x values
xs = df['video_length'].values[:, None] * [0, 0.25, 0.50, 0.75, 1]
# the corresponding y values
ys = df.iloc[:, 2:].values
# note that 6 is the new length, repeated once per row
nxs = np.repeat(6, len(df))[:, None] * [0, 0.25, 0.50, 0.75, 1]
res = pd.DataFrame(data=np.array([np.interp(nxi, xi, yi) for nxi, xi, yi in zip(nxs, xs, ys)]), columns="new_" + df.columns[2:] )
print(res)
Output
new_video_0 new_video_25 new_video_50 new_video_75 new_video_100
0 1000.0 500.0 300.0 250.0 5.0
1 1000.0 900.0 800.0 700.0 600.0
And then concat across the second axis:
output = pd.concat((df, res), axis=1)
print(output)
Output (concat)
video_name video_length video_0 ... new_video_50 new_video_75 new_video_100
0 video_1 6 1000 ... 300.0 250.0 5.0
1 video_2 30 1000 ... 800.0 700.0 600.0
[2 rows x 12 columns]
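If the target length varied per row, the same pattern would still apply; here is a sketch assuming a hypothetical new_lengths series, reusing xs, ys and df from above:
# hypothetical per-row target lengths (one value per row of df)
new_lengths = pd.Series([6, 10])
nxs = new_lengths.values[:, None] * [0, 0.25, 0.50, 0.75, 1]
res = pd.DataFrame(data=np.array([np.interp(nxi, xi, yi)
                                  for nxi, xi, yi in zip(nxs, xs, ys)]),
                   columns="new_" + df.columns[2:])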

making numerical categories of pandas data

I tried to find a reference for making an extra column that is categorical based on another column. I already tried the pandas categorical documentation, and Stack Overflow does not seem to have this, but I think it must exist, so maybe I am using the wrong search tags?
for example
Size Size_cat
10 0-50
50 0-50
150 50-500
450 50-500
5000 1000-9000
10000 >9000
notice that the size category 500-1000 is missing (because no number falls in that range)
The problem lies here is that I create a pandas crosstable later like this:
summary_table = pd.crosstab(index=[res_sum["Type"], res_sum["Size"]], columns=[res_sum["Found"]], margins=True)
summary_table = summary_table.div(summary_table["All"] / 100, axis=0)
After some editing of this table I get this kind of result:
Found Exact Near No
Type Size
DEL 50 80 20 0
100 60 40 0
500 80 20 0
1000 60 40 0
5000 40 60 0
10000 20 80 0
DEL_Total 56.666667 43.333333 0
DUP 50 0 0 100
100 0 0 100
500 0 100 0
1000 0 100 0
5000 0 100 0
10000 20 80 0
DUP_Total 3.333333 63.333333 33.333333
The problem is that now (Size) just shows the raw sizes, and therefore this table can vary in size. If 5000-DEL is missing in the data, that category will also disappear, and then DUP has 6 categories and DEL 5. Additionally, if I add more sizes this table will become very large. So I want to make categories of the sizes, but always retain the same categories, even if some of them are empty.
I hope I am clear, because it is kinda hard to explain.
this is what I tried already:
highest_size = res['Size'].max()
categories = int(math.ceil(highest_size / 100.0) * 100.0)
categories = int(categories / 10)
labels = ["{0} - {1}".format(i, i + categories) for i in range(0, highest_size, categories)]
print(highest_size)
print(categories)
print(labels)
10000
1000
['0 - 1000', '1000 - 2000', '2000 - 3000', '3000 - 4000', '4000 - 5000', '5000 - 6000', '6000 - 7000', '7000 - 8000', '8000 - 9000', '9000 - 10000']
I get numeric categories, but of course now they depend on the highest number, and the categories change based on the data. Additionally, I still need to link them to the 'Size' column in pandas. This does not work:
df['group'] = pd.cut(df.value, range(0, highest_size), right=False, labels=labels)
If possible I would like to make my own categories instead of using range to get equal steps, like in the first example above (otherwise it takes way too long to get to 10000 with steps of 100, and taking steps of 1000 loses a lot of detail in the smaller regions).
See a mock-up below to help you get the logic. Basically, you bin the Score into custom groups by using cut (or even a lambda or map) and passing the value to the function GroupMapping. Let me know if it works.
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry', 'Sally', 'Mary', 'John', 'Francis', 'Devon', 'James', 'Holly', 'Molly', 'Nancy', 'Ben'],
    'Score': [1143, 2040, 2500, 3300, 3143, 2330, 2670, 2140, 2890, 3493, 1723]
})

def GroupMapping(dl):
    if int(dl) <= 1000:
        return '0-1000'
    elif 1000 < dl <= 2000:
        return '1000 - 2000'
    elif 2000 < dl <= 3000:
        return '2000 - 3000'
    elif 3000 < dl <= 4000:
        return '3000 - 4000'
    else:
        return 'None'

#df["Group"] = df['Score'].map(GroupMapping)
#df["Group"] = df['Score'].apply(lambda row: GroupMapping(row))
df['Group'] = pd.cut(df['Score'], [0, 1000, 2000, 3000, 4000],
                     labels=['0-1000', '1000 - 2000', '2000 - 3000', '3000 - 4000'])
df
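To get the exact buckets from the first example in the question, including the ones that happen to be empty, you can also pass fixed edges. A minimal sketch, with df standing for the question's frame that has the Size column and the edges assumed from that example:
import numpy as np
import pandas as pd

# fixed edges and labels taken from the first example in the question
bins = [0, 50, 500, 1000, 9000, np.inf]
labels = ['0-50', '50-500', '500-1000', '1000-9000', '>9000']
df['Size_cat'] = pd.cut(df['Size'], bins=bins, labels=labels, include_lowest=True)

# the categorical dtype keeps empty buckets, so downstream tables keep a fixed shape
print(df['Size_cat'].value_counts(sort=False))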

python pandas: how to drop items in a dataframe

I have a huge number of points in my dataframe, so I want to drop some of them (ideally keeping the mean values).
e.g. currently I have
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
Is there any way to drop some amount of items, based on sampling?
To give more details: my problem is the number of values at very close intervals, e.g. 1491928756421062 and 1491928756421187.
The resulting chart is extremely dense, and instead I wanted to somehow have a mean value for those close intervals. Maybe grouped by a second...
I would use sample(), but as you said it selects randomly. If you want to take a sample according to some logic, for instance only keeping rows whose value satisfies mean * 0.9 < value < mean * 1.1, you can try the following code. It all depends on your sampling strategy.
As an example, something like this could be done.
test.csv:
1491928756414930,4643
1491928756419607,166
1491928756419790,120
1491928756419927,142
1491928756420083,121
1491928756420217,109
1491928756420409,52
1491928756420476,105
1491928756420605,35
1491928756420654,120
1491928756420787,105
1491928756420907,93
1491928756421013,37
1491928756421062,112
1491928756421187,41
sampling:
df = pd.read_csv("test.csv", ",", header=None)
mean = df[1].mean()
my_sample = df[(mean *.90 < df[1]) & (df[1] < mean * 1.10)]
You're looking for resample. Note that the date column looks like microseconds since the epoch, hence unit='us':
df.set_index(pd.to_datetime(df.date, unit='us')).calltime.resample('s').mean()
This is a more complete example
tidx = pd.date_range('2000-01-01', periods=10000, freq='10ms')
df = pd.DataFrame(dict(calltime=np.random.randint(200, size=len(tidx))), tidx)
fig, axes = plt.subplots(2, figsize=(25, 10))
df.plot(ax=axes[0])
df.resample('s').mean().plot(ax=axes[1])
fig.tight_layout()
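If you would rather not convert to datetimes at all, a rough alternative is to group the question's date/calltime frame by whole seconds directly on the integer timestamps (assuming, as above, that they are microseconds since the epoch):
# integer-divide the microsecond timestamps down to whole seconds,
# then average calltime within each second
per_second = df.groupby(df['date'] // 1_000_000)['calltime'].mean()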

pandas qcut not putting equal number of observations into each bin

I have a data frame, from which I can select a column (series) as follows:
df:
value_rank
275488 90
275490 35
275491 60
275492 23
275493 23
275494 34
275495 75
275496 40
275497 69
275498 14
275499 83
... ...
value_rank is a previously created percentile rank from a larger data set. What I am trying to do is to create bins of this data set, e.g. quintiles:
pd.qcut(df.value_rank, 5, labels=False)
275488 4
275490 1
275491 3
275492 1
275493 1
275494 1
275495 3
275496 2
... ...
This appears fine, as expected, but it isn't.
In fact, I have 1569 rows. The nearest number divisible by 5 bins is 1565, which should give 1565 / 5 = 313 observations in each bin. There are 4 extra records, so I would expect to have 4 bins with 314 observations and one with 313. Instead, I get this:
obs = pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()
0 329
3 314
1 313
4 311
2 302
I have no nans in df, and cannot think of any reason why this is happening. Literally beginning to tear my hair out!
Here is a small example:
df:
value_rank
286742 11
286835 53
286865 40
286930 31
286936 45
286955 27
287031 30
287111 36
287269 30
287310 18
pd.qcut gives this:
pd.qcut(df.value_rank, 5, labels = False).value_counts()
bin count
1 3
4 2
3 2
0 2
2 1
There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!
qcut is trying to compensate for repeating values. This is easier to visualize if you return the bin limits along with your qcut results:
In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')
In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
array([ 11. , 25.2, 30. , 33. , 41. , 53. ]))
You can see that there was no choice but to set the bin limit at 30, so qcut had to "steal" one of the expected values from the third bin and place it in the second. I'm thinking that this is just happening at a larger scale with your percentiles, since you're basically condensing their ranks into a 1 to 100 scale. Any reason not to just run qcut directly on the data instead of the percentiles, or to return percentiles that have greater precision?
Just try the code below, which breaks ties by ranking the values in order of appearance first:
pd.qcut(df['value_rank'].rank(method='first'), 5, labels=False)
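A quick check with the same 10-value series from the answer above shows this does give equal bins:
test_series = pd.Series([11, 18, 27, 30, 30, 31, 36, 40, 45, 53], name='value_rank')
binned = pd.qcut(test_series.rank(method='first'), 5, labels=False)
print(binned.value_counts().sort_index())   # exactly 2 observations in each of the 5 bins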
If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.
test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
np.random.seed(42) #set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list)) #add noise to data
test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)
Output:
(0 0
1 0
2 1
3 2
4 1
5 2
6 3
7 3
8 4
9 4
Name: value_rank, dtype: int64,
array([ 11.37454012, 25.97573801, 30.42160255, 33.11683016,
41.81316392, 53.70807258]))
So, now we have two 0's, two 1's, two 2's, two 3's and two 4's!
Disclaimer
Obviously, use this at your discretion, because results can vary based on your data, for instance how large your data set is and/or how the values are spaced. The above "trick" works well for integers because even though we are "salting" the test_list, it will still rank order in the sense that there won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be tricky and you may have to reduce the size of the noise accordingly. For instance, if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.
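A tiny sketch of that float case, using the values from the sentence above:
float_list = [2.1, 5.3, 5.3, 5.4]
np.random.seed(42)
# divide the noise by 10 so the salt cannot reorder genuinely different float values
salted = np.array(float_list) + np.random.random(len(float_list)) / 10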
This problem arises from duplicate values. A possible solution to force equal sized bins is to use the index as the input for pd.qcut after sorting the dataframe:
import random
df = pd.DataFrame({'A': [random.randint(3, 9) for x in range(20)]}).sort_values('A').reset_index()
del df['index']
df = df.reset_index()
df['A'].plot.hist(bins=30);
picture: https://i.stack.imgur.com/ztjzn.png
df.head()
df['qcut_v1'] = pd.qcut(df['A'], q=4)
df['qcut_v2'] = pd.qcut(df['index'], q=4)
df
picture: https://i.stack.imgur.com/RB4TN.png
df.groupby('qcut_v1').count().reset_index()
picture: https://i.stack.imgur.com/IKtsW.png
df.groupby('qcut_v2').count().reset_index()
picture: https://i.stack.imgur.com/4jrkU.png
sorry I cannot post images since I don't have at least 10 reputation on stackoverflow -.-
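Since the result tables are only linked as images above, here is a quick textual check of the bin sizes with the same df:
print(df.groupby('qcut_v1').size())   # unequal counts, caused by duplicate values in A
print(df.groupby('qcut_v2').size())   # 5 rows per bin, since the index has no duplicates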
