Downsample pandas data frame based on count column - python

I have thousands of data frames like the following, though much larger (1,000,000 rows, 100 columns).
import pandas as pd

data = pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
                     'count': [45, 66, 6, 6, 1, 432, 3],
                     'Value': ['Apple', 'Boy', 'Car', 'Corn', 'Anne', 'Barnes', 'Bayesian']})
I want to randomly sample from this data frame, using the count column as a weight, and make a new data frame in which the resampled counts sum to N.
The relative proportions should stay approximately the same, no resampled count should exceed the original count for that row, and the values in cols1 (or any other column except Value and count) should remain the same.
For example, if N was 50, it might look like:
pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
              'count': [4, 7, 1, 1, 0, 37, 0],
              'Value': ['Apple', 'Boy', 'Car', 'Corn', 'Anne', 'Barnes', 'Bayesian']})
How can this be done?
Efficiency is key, otherwise I could expand the data frame based on count and randomly sample without replacement, then merge it back together.
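For reference, here is a minimal sketch of that expand-and-sample baseline (slow on large frames, shown only as a comparison point; assumes data as defined above and N = 50):
N = 50
# repeat each row `count` times, draw N of the expanded rows without replacement,
# then count how many draws each original row received
expanded = data.loc[data.index.repeat(data['count'])]
sampled = expanded.sample(n=N, replace=False)
new_counts = sampled.groupby(sampled.index).size().reindex(data.index, fill_value=0)
baseline = data.assign(count=new_counts)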
Thanks,
Jack

Using multinomial sampling, this is relatively easy.
import numpy as np
from itertools import chain

def downsample(df, N):
    # use the counts as sampling weights and draw N items across all rows at once
    prob = df['count'] / sum(df['count'])
    df['count'] = list(chain.from_iterable(np.random.multinomial(n=N, pvals=prob, size=1)))
    # drop rows whose resampled count is zero
    df = df[df['count'] != 0]
    return df
For OP's example:
downsample(data, 50)
returns:
    Value  cols1  count
1     Boy      5      1
3    Corn      4     16
5  Barnes     32     33
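One caveat: multinomial sampling is with replacement, so a resampled count can occasionally exceed the original count for a row. If that constraint must hold exactly, a hedged alternative is to sample without replacement via the multivariate hypergeometric distribution (the helper name downsample_exact is made up here; assumes NumPy >= 1.18 for np.random.default_rng):
import numpy as np

def downsample_exact(df, N, seed=None):
    # Draw N items without replacement from a pool in which each row contributes
    # `count` items, so no resampled count can exceed the original count.
    rng = np.random.default_rng(seed)
    out = df.copy()
    out['count'] = rng.multivariate_hypergeometric(out['count'].to_numpy(), N)
    return out[out['count'] != 0]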

Related

A more efficient way to take samples from a pandas DataFrame

I have a piece of code like this:
import pandas as pd
data = {
    'col1': [17, 2, 3, 4, 5, 5, 10, 22, 31, 11, 65, 86],
    'col2': [6, 7, 8, 9, 10, 31, 46, 12, 20, 37, 91, 32],
    'col3': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
}
df = pd.DataFrame(data)
sampling_period = 3
abnormal_data = set()
for i in range(sampling_period):
    # get rows at index [0, 3, 6, 9, ...], [1, 4, 7, 10, ...], and [2, 5, 8, 11, ...]
    df_sampled = df[i::sampling_period]
    diff = df_sampled - df_sampled.shift(1)
    # columns where diff >= 5 are considered abnormal
    abnormal_df = df_sampled[diff >= 5].dropna(how="all", axis=1)
    abnormal_data = abnormal_data.union(set(abnormal_df.columns))
print(f"abnormal_data: {abnormal_data}")
What the code above does is the following:
Sample all the columns in df based on sampling_period.
If the difference between two consecutive elements in df_sampled is larger than or equal to 5, mark that column as abnormal.
Return the abnormal columns.
Is there any way to avoid the for loop in the code?
The code above takes a long time to run when sampling_period and df become large, and I'd like it to run faster.
For example, when my sampling_period is 60 and df.shape is (20040, 3562), it takes about 683 seconds to run the above code.
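Since df_sampled.shift(1) inside the loop compares each row with the row sampling_period positions earlier in the original frame, the whole loop can likely be collapsed into a single DataFrame.diff call. A hedged sketch, assuming df and sampling_period as defined above:
# diff(periods=p) computes row i minus row i-p, which is exactly the set of
# differences the per-offset loop accumulates across all offsets
diff = df.diff(periods=sampling_period)
abnormal_data = set(df.columns[(diff >= 5).any()])
print(f"abnormal_data: {abnormal_data}")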

How to loop through a 3-D array and multiple dataframes?

Background info
multi_data is a 3-D array of shape (10, 5, 5). For this example, multi_data = np.arange(250).reshape(10,5,5)
Each of the 10 matrices has 5x5 states (A-E).
The matrices are in order and represent time in years, in increments of 1,
starting from multi_data[0], which contains the matrix values for year 1, up to multi_data[9] for year 10.
Example of multi_data at year 1
multi_data[0]
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
Customers usually make a purchase within a few years (not immediately on signup); for example, this customer made a purchase in year 3.
Hence the matrix calculation for this customer starts at year 3.
Each user has a current_state (A-E), and I need to transform the user data so that I can multiply it by the matrices. For example, customer 1 has current state B, so the amount goes in the second element of the array: customer1 = np.array([0, 1000, 0, 0, 0])
dataframe 1 (customers)
cust_id | state | amount | year
      1 |   B   |   1000 |    3
      2 |   D   |    500 |    2
multi_data = np.arange(250).reshape(10, 5, 5)
customer1 = np.array([0, 1000, 0, 0, 0])

output = customer1
results = []
for arr in multi_data[3:4]:  # customer purchases at year 3, hence I am multiplying customer1 by the matrix at year 3
    output = output @ arr
    results.append(output)
example of output
results = [array([80000, 81000, 82000, 83000, 84000])]
I then need to multiply the results by dataframe 2
dataframe_2
| year | lim % |
|    1 |  0.19 |
|    2 |  0.11 |
|    3 |  0.02 |
|   10 |  0.23 |
so I multiply the results by lim % at year 3.
dataframe2=dataframe2.loc[dataframe2['year'] == 3]
results=dataframe2['LimitPerc'].values * results
example output results
[array([1600,1620,1640,1660,1680])]
I then need to multiply these results by the year-4 matrix and then by the year-4 lim %, and so on until year 10 is reached.
like this:
customer1 = [np.array([1600, 1620, 1640, 1660, 1680])]
output = customer1
results = []
for arr in multi_data[4:5]:  # multiplying by the year-4 matrix
    output = output @ arr
    results.append(output)
dataframe2 = dataframe2.loc[dataframe2['year'] == 4]
results = dataframe2['LimitPerc'].values * results
Is there an easier, less manual way to do this? I need to continue this calculation until year 10 for each customer, and I need to save the results for each customer after every calculation.
Additional info:
I am currently looping through all customers and years as below, but my problem is that I have a lot of vlookup-type calculations like dataframe2 that need to be computed in between each year for each customer, and I have to save the results for each customer after each calculation.
results_dict = {}
for _id, c, y in zip(cust_id, cust_amt, year):
    results = []
    for m in multi_data[y:]:
        c = c @ m
        results.append(c)
    results_dict[_id] = results
Unfortunately, since you need all the intermediate results, I don't think it's possible to optimise this much, so you will need the loop. If you didn't need the intermediate results, you could precompute the matrix product for each year up to year 10. However, here it is not useful.
To integrate the look-ups in your loop, you could just put all the dataframes in a list and use the DataFrame index to query the values. Also, you can convert the state to an integer index. Note that you don't need to create the customer1 vector. Since it's non-zero only in one position, you can directly extract the relevant row of the matrix and multiply it by amount.
Sample data:
import pandas as pd
import numpy as np
customer_data = pd.DataFrame({"cust_id": [1, 2, 3, 4, 5, 6, 7, 8],
                              "state": ['B', 'E', 'D', 'A', 'B', 'E', 'C', 'A'],
                              "cust_amt": [1000, 300, 500, 200, 400, 600, 200, 300],
                              "year": [3, 3, 4, 3, 4, 2, 2, 4],
                              "group": [10, 25, 30, 40, 55, 60, 70, 85]})
state_list = ['A', 'B', 'C', 'D', 'E']

# All lookups should be dataframes with the year and/or group and the value, like these.
lookup1 = pd.DataFrame({'year': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'lim %': 0.1})
lookup2 = pd.concat([pd.DataFrame({'group': g, 'lookup_val': 0.1, 'year': range(1, 11)})
                     for g in customer_data['group'].unique()]).explode('year')
multi_data = np.arange(250).reshape(10,5,5)
Preprocessing:
# Put all lookups in order of calculation in this list.
lookups = [lookup1, lookup2]
# Preprocessing.
# Transform the state to categorical code to use it as array index.
customer_data['state'] = pd.Categorical(customer_data['state'],
                                        categories=state_list,
                                        ordered=True).codes

# Set index on lookups.
for i in range(len(lookups)):
    if 'group' in lookups[i].columns:
        lookups[i] = lookups[i].set_index(['year', 'group'])
    else:
        lookups[i] = lookups[i].set_index(['year'])
Calculating results:
results = {}
for customer, state, amount, start, group in customer_data.itertuples(name=None, index=False):
    for year in range(start, len(multi_data) + 1):
        if year == start:
            # first year: take the row of the matrix for the customer's state, scaled by the amount
            results[customer] = [[amount * multi_data[year - 1, state, :]]]
        else:
            # later years: multiply last year's final result by this year's matrix
            results[customer].append([results[customer][-1][-1] @ multi_data[year - 1]])
        for lookup in lookups:
            if isinstance(lookup.index, pd.MultiIndex):
                value = lookup.loc[(year, group)].iat[0]
            else:
                value = lookup.loc[year].iat[0]
            results[customer][-1].append(value * results[customer][-1][-1])
Accessing the results:
# Here are examples of how you obtain the results from the dictionary:
# Customer 1 at start year, first result.
results[1][0][0]
# Customer 1 at start year, second result (which would be after lookup1 here).
results[1][0][1]
# Customer 1 at start year, third result (which would be after lookup2 here).
results[1][0][2]
# Customer 1 at year start+1, first result.
results[1][1][0]
# ...
# Customer c at year start+y, result k+1.
results[c][y][k]
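As a quick sanity check with the sample data above: customer 1 starts in year 3 with state B (categorical code 1), so the first stored result should be 1000 times row 1 of the year-3 matrix multi_data[2]:
print(results[1][0][0])
# [55000 56000 57000 58000 59000]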

Group rows based on +- threshold on high dimensional object

I have a large df with coordinates in multiple dimensions. I am trying to create classes (objects) based on a threshold difference between the coordinates. An example df is as below:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
Based on this df, I want to assign each row to a class using a ±2 threshold across all coordinates, so the df gets a unique group name added to each row. The output for this threshold function is:
 x   y   z  group
 1  10   7    -
 2  14   6    -
 3   5   2   G1
 4  14  43    -
 5   3   1   G1
 6  12  40    -
It is similar to clustering, but I want to work with my own threshold functions. How can this be done in Python?
EDIT
To clarify: the threshold applies to similar coordinates. All rows within the ± threshold across all coordinates are grouped as a single object. It can also be seen as grouping rows based on a threshold across all columns and assigning a unique label to each group.
As far as I understood, what you need is the apply function. It was not very clear from your statement whether you need all the differences between the coordinates, or just the neighbouring differences (x-y and y-z): row 5 has a difference of 4 between its x and z coordinates, but is still assigned to class G1.
That's why I wrote it for both possibilities, and you can choose the one you need:
import pandas as pd
import numpy as np

def your_specific_function(row):
    # For all pairwise differences, use this instead:
    # diffs = np.array([abs(row.x - row.y), abs(row.y - row.z), abs(row.x - row.z)])
    # For only x - y and y - z, use this:
    diffs = np.diff(row)
    statement = all(diffs <= 2)
    if statement:
        return 'G1'
    else:
        return '-'

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12], 'z': [7, 6, 2, 43, 1, 40]})
df['group'] = df.apply(your_specific_function, axis=1)
print(df.head())
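Reading the EDIT literally (rows belong together when every coordinate differs by at most 2 between the rows, not within a row), here is a hedged sketch using pairwise maximum absolute differences; the grouping loop below is only illustrative and does not merge groups transitively:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12], 'z': [7, 6, 2, 43, 1, 40]})

coords = df.to_numpy()
# Chebyshev distance between every pair of rows: the largest coordinate-wise difference
dist = np.abs(coords[:, None, :] - coords[None, :, :]).max(axis=2)
similar = dist <= 2

labels = ['-'] * len(df)
group_no = 0
for i in range(len(df)):
    if labels[i] == '-':
        members = np.flatnonzero(similar[i])
        if len(members) > 1:  # at least one other row within the threshold
            group_no += 1
            for j in members:
                labels[j] = f'G{group_no}'
df['group'] = labels
print(df)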

Rolling mean with intervals

How can I efficiently compute the rolling mean at fixed intervals?
import numpy as np
import pandas as pd
n=50
s = pd.Series(data = np.random.randint(0,10,n), index = pd.date_range(pd.to_datetime('today').floor('D'), freq='D', periods = n))
E.g., in the series above, with an interval of 4 days and 3 elements, the i-th element of the new series will be s'_i = (s_(i-4) + s_(i-8) + s_(i-12)) / 3.
Have you checked out pandas.DataFrame.rolling? It might have what you're looking for.
If I understand correctly, here is an example with an array of 1 to 50:
interval = 4
window = 3
data = np.linspace(1,50,50)
arr = pd.Series(np.array(data)[::interval]) #subset data by every 4th value
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window) #look forward 3 spaces on every 4th value
arr.rolling(indexer).mean() #take the mean of the window
The output would be an array [5, 9, 13, 17, ...], 5 corresponding to averaging 1, 5, and 9 and 9 being the average of 5, 9, and 13.
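If the backward-looking definition from the question is needed exactly (the i-th value is the mean of s_(i-4), s_(i-8) and s_(i-12), computed for every element of the original daily index), a hedged sketch using positional shifts of the series s defined in the question:
interval = 4
window = 3
# mean of the values interval, 2*interval, ..., window*interval positions back;
# the first interval*window entries are NaN because they have no full history
backward = sum(s.shift(k * interval) for k in range(1, window + 1)) / window
print(backward.dropna().head())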

How can I split a DataFrame into multiple DataFrames of fewer rows?

If I have a DataFrame that is composed of 100 rows w/ 4 columns for sake of example, how can I create 5 new DataFrames that are each composed of 20 rows w/ 4 columns?
That is, if an arbitrary column of the original DataFrame holds the list [0, 1, 2, 3, ..., 98, 99], how would I create 5 new DataFrames such that the first DataFrame's arbitrary column holds the list [0, 1, 2, ..., 19], the second DataFrame's arbitrary column holds the list [20, 21, 22, ..., 39], etc.?
I tried the following on a DataFrame consisting of a single column A that holds the list [0, 1, 2, 3, ..., 98, 99], but it gave me 100 CSV files each w/ a single row rather than the desired 5 CSV files each w/ 20 rows:
import pandas as pd
import numpy as np
list = []
for i in range(0, 100):
    list.append(i)
df = pd.DataFrame(data=list, columns=['A'])
groups = df['A'].groupby(np.arange(len(df['A']/10)))
for (frameno, frame) in groups:
    frame.to_csv("/Users/ephemeralhappiness/Desktop/Cycle Test/" + "%s.csv" % frameno)
Just change your groupby to:
# to get 5 groups
nrows = 20
groups = df.groupby(df.index // nrows)
print(groups.ngroups)
5
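To come back to the original goal of one CSV per chunk, a hedged usage sketch on top of that grouping (the output file names here are just illustrative):
# writes 0.csv ... 4.csv, each holding 20 consecutive rows
for frameno, frame in groups:
    frame.to_csv("%s.csv" % frameno)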
