Here is my problem: I have to generate some synthetic data (7 or 8 columns) that are correlated with each other (using the Pearson coefficient). I can do this easily, but then I have to insert a percentage of duplicates into each column (yes, the Pearson coefficient will be lower), different for each column.
The problem is that I don't want to insert those duplicates by hand, because in my case that would be like cheating.
Does anyone know how to generate correlated data that already contains duplicates? I've searched, but most questions are about dropping or avoiding duplicates.
Language: Python 3
To generate correlated data I'm using this simple code: Generating correlated data
Try something like this:
indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))
# NumPy arrays have no .append method, so stack the selected rows onto the original
array = np.vstack([array, array[indices]])
Here I assume that your data is stored in array, which is an ndarray where each row contains your 7/8 columns of data.
The above code creates an array of random row indices, selects those rows, and appends them to the array again.
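For example, a minimal sketch of how this could look on a small toy array (the array and percentage here are made up for illustration):
import numpy as np
array = np.random.random(size=(5, 3))   # 5 rows, 3 columns of synthetic data
percentage = 0.4                        # fraction of duplicate rows to append
indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))
array = np.vstack([array, array[indices]])
print(array.shape)  # (7, 3): the original 5 rows plus 2 duplicated rows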
I found the solution.
I'll post the code; it might be helpful for someone.
# These are the data, generated randomly with a given shape
rnd = np.random.random(size=(10**7, 8))
# This array represents a column of the covariance matrix (I want correlated data,
# so I randomly choose numbers between 0.8 and 0.95)
# I added 7 more columns with varying ranges of values (all above 0.7)
attr1 = np.random.uniform(0.8, .95, size=(8, 1))
# attr2, ..., attr8 are built like attr1
# corr_mat is the matrix obtained by joining the columns
corr_mat = np.column_stack((attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8))
from statsmodels.stats.correlation_tools import cov_nearest
# Using that function I find the nearest covariance matrix to my matrix,
# to be sure that it is positive definite
a = cov_nearest(corr_mat)
from scipy.linalg import cholesky
upper_chol = cholesky(a)
# Finally, compute the matrix product of rnd and upper_chol
ans = rnd @ upper_chol
# ans now holds randomly generated correlated data (high correlation, but it is customizable)
# Next I create a pandas DataFrame with the values of ans
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4',
                                'att5', 'att6', 'att7', 'att8'])
# The last step is to round the float values of ans to a varying number of decimals,
# so I get duplicates in varying percentages
a = df.values
for i in range(8):
    trunc = np.random.randint(5, 12)
    print(trunc)
    a.T[i] = a.T[i].round(decimals=trunc)
# The float values of ans have 16 decimals, so I randomly choose an int
# between 5 and 12 and use it to round each value in that column
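A duplicate rate like the ones reported below can be computed per column, for example as the share of values that repeat an earlier value in the same column (one possible definition; a minimal sketch assuming df holds the rounded values):
for col in df.columns:
    dup_rate = df[col].duplicated().mean() * 100
    print('duplicate rate attribute:', col, '=', dup_rate)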
Finally, these are my duplicate percentages for each column:
duplicate rate attribute: att1 = 5.159390000000002
duplicate rate attribute: att2 = 11.852260000000001
duplicate rate attribute: att3 = 12.036079999999998
duplicate rate attribute: att4 = 35.10611
duplicate rate attribute: att5 = 4.6471599999999995
duplicate rate attribute: att6 = 35.46553
duplicate rate attribute: att7 = 0.49115000000000464
duplicate rate attribute: att8 = 37.33252
Goal: round each individual value in the two-dimensional (2D) dataframe to N decimal places while preserving the overall sum of the dataframe (to N decimal places). For example, if the overall table sum was 500.29239, then sum-safe rounding to 2 decimal places should result in an overall sum of 500.29. The pandas.DataFrame.round function rounds each individual value but does not necessarily preserve the overall sum. The iteround.saferound() function performs sum-safe rounding, but only on one-dimensional (1D) objects.
I have a solution (see my 'answer' below), but I wondered if anyone has a nicer (one-line) solution. My current solution involves converting the 2D dataframe into a 1D object, and then applying the iteround.saferound() function which seems to only work on 1D objects, and then converting it back into a 2D dataframe.
I have not come across full existing solutions to this question online, so I figured there might be value in posting the question and my current solution.
I'm new - sorry if I'm not doing this right! I am happy to edit my question/answer/both as needed.
My (OP) own current solution that I am looking to improve upon / simplify:
import pandas as pd
import iteround # For rounding values in a table without changing overall sum
import numpy as np
import itertools as it # A number of iterator building blocks
#########################################
# Create example pandas dataframe table #
#########################################
def generate_example_pandas_df():
"""
Generate pandas dataframe of 36 rows and 6 columns with random values.
Returns
-------
example_data : DataFrame
Pandas DataFrame containing 36 rows and 6 columns with random values.
"""
np.random.seed(123)
prefix = ["North", "East", "South", "West", "Middle", "Statistical"]
suffix = ["port", "ville", "shire", "end", "don", "mouth"]
nyear, ystart = 6, 2016
area_names = ["".join(x) for x in it.product(prefix, suffix)]
# Data is array with rows equal to length of area names list and columns
# equal to value of nyear variable, filled with random numbers between 0 to 100.
example_data = pd.DataFrame(
np.random.uniform(0.00, 100.00, size=(len(area_names), nyear)),
columns=[int(y + ystart) for y in range(nyear)],
index=area_names,
)
return example_data
##################
# Apply rounding #
##################
def round_2d_pandas_dataframe(data, decimal_places=0, print_results=True):
"""
Sum-safe round two-dimensional (2D) Pandas DataFrame to chosen decimal places (dp).
This function rounds each value in a 2D Pandas DataFrame to the chosen number of
dp while preserving the overall sum of DataFrame (it is 'sum-safe').
For example, applying this function to a DataFrame that sums to 500.1919
with decimal_places set to 1 will result in a DataFrame that sums to 500.2,
whereas `pandas.DataFrame.round()` rounds individual values without
regard for the overall sum of the DataFrame.
Parameters
----------
data : DataFrame
The input Pandas DataFrame to apply the rounding to.
decimal_places : int, optional
Chosen number of decimal places to round to. The default is 0.
print_results : bool, optional
Whether to print a results summary to console. The default is True.
Returns
-------
output_data : DataFrame
The output Pandas DataFrame after rounding.
"""
# Convert the 2D DataFrame into a 1D DataFrame for compatibility with .saferound()
output_data = data.stack().to_frame().T
output_data.iloc[0] = iteround.saferound(output_data.iloc[0], decimal_places)
# Convert from 1D DataFrame back to 2D DataFrame
output_data = output_data.T.unstack()
# Drop the superfluous extra index header created by .unstack()
output_data.columns = output_data.columns.droplevel()
differences = data - output_data
# Identify the greatest positive and negative individual value changes
max_pos_change = round((differences.max(axis=1)).max(axis=0), decimal_places + 3)
max_neg_change = round((differences.min(axis=1)).min(axis=0), decimal_places + 3)
# Identify the before and after values for the top-left cell (to present as example)
top_left_pre = data.iloc[0].iloc[0]
top_left_post = output_data.iloc[0].iloc[0]
overall_sum_change = round(
data.sum().sum() - output_data.sum().sum(), decimal_places + 3
)
if print_results is True:
print(f"Decimal places rounded to: {decimal_places}")
print(f"Individual value changes : " f"{max_neg_change} to {max_pos_change}")
print(f"Overall sum pre-rounding : {data.sum().sum()}")
print(f"Overall sum post-rounding: {output_data.sum().sum()}")
print(f"Overall table sum change : {overall_sum_change}")
print(f"Example of top left value: {top_left_pre} became {top_left_post}")
return output_data
input_df = generate_example_pandas_df()
output_df = round_2d_pandas_dataframe(
data=input_df, decimal_places=1, print_results=True
)
"""
2016 2017 2018 2019 2020 2021
Northport 69.646919 28.613933 22.685145 55.131477 71.946897 42.310646
Decimal places rounded to: 1
Individual value changes : -0.0511 to 0.0485
Overall sum pre-rounding : 10918.659674395305
Overall sum post-rounding: 10918.7
Overall table sum change : -0.0403
Example of top left value: 69.64691855978616 became 69.6
"""
I have a pandas DataFrame indexed by ID and sorted by value. I want to create a sample of n=20000 pairs (40000 rows in total), where the 2 rows of each pair are consecutive. I want to perform additional calculations on these 2 consecutive/paired rows.
E.g. if I set the sample size to n=2, I want to randomly pick pairs and find the difference in distance within each of the following picks.
Additional condition: the value difference within a pair can't exceed 4000.
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
Then the distance of the following pair, etc.
cg20826792 29425 0.657369
cg33045430 29407 1.708055
Sample of the original DataFrame:
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
cg12045430 29407 0.708055
cg20826792 29425 0.657369
cg33045430 69407 1.708055
cg40826792 59425 0.857369
cg47454306 88407 0.708055
cg60826792 96425 2.857369
I tried using df_sample = df.sample(n=20000), but then I got a bit lost trying to figure out how to get the next row for each value in df_sample.
The original shape is (480136, 14).
If it doesn't matter to always have (odd, even) pairs (which decreases randomness a bit), you can select N odd rows and take the next even row for each:
N = 20000
# get the indices of N random ODD rows
idx = df.loc[::2].sample(n=N).index
# create a boolean mask to identify the rows
m = df.index.to_series().isin(idx)
# select those OR the next ones
df_sample = df.loc[m|m.shift()]
Example output on the toy DataFrame (N=3):
index value distance
2 cg12045430 29407 0.708055
3 cg20826792 29425 0.657369
4 cg33045430 69407 1.708055
5 cg40826792 59425 0.857369
6 cg47454306 88407 0.708055
7 cg60826792 96425 2.857369
Increasing randomness
The drawback of the above approach is that there is a bias towards always having (odd, even) pairs. To overcome this, we can first remove a random fraction of the DataFrame: small enough to still leave enough rows to choose from, but large enough to shift the (odd, even) pairs to (even, odd) in many locations. The fraction of rows to remove should be tuned depending on the initial size and the sample size; I used 20-30% here:
N = 20000
frac = 0.2
idx = (df
.drop(df.sample(frac=frac).index)
.loc[::2].sample(n=N)
.index
)
m = df.index.to_series().isin(idx)
df_sample = df.loc[m|m.shift()]
# check:
# len(df_sample)
# 40000
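Once you have df_sample, the per-pair difference in distance (what the question ultimately asks for) could be computed with something like this (a sketch, assuming the columns shown in the sample data):
pairs = df_sample.reset_index(drop=True)
# even positions are the first row of each pair, odd positions the second
distance_diff = pairs['distance'].iloc[1::2].to_numpy() - pairs['distance'].iloc[0::2].to_numpy()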
Here's my first attempt (I only just noticed your additional constraint, and I'm not sure whether you need a precise number of samples; if so, you'll have to do some fudging after the filtering step below).
import random
# Temporarily reset index so we can have something that we can add one to.
df = df.reset_index(level=0)
# Choose the first index of each pair.
# Use random.sample if you don't want repeats,
# or random.choice if you don't mind them.
# The code below does allow overlapping pairs such as (1,2) and (2,3).
first_indices = np.array(random.sample(sorted(df.index[:-1]), 4))
# Filter out those indices where the value difference with the next row down is too large.
mask = np.array([abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000 for i in first_indices])
first_indices = first_indices[mask]
# Interleave this array with the same numbers, plus 1.
c = np.empty((first_indices.size * 2,), dtype=first_indices.dtype)
c[0::2] = first_indices
c[1::2] = first_indices + 1
# Filter
df_sample = df[df.index.isin(c)]
# Restore original index if required.
df = df.set_index("index")
Hope that helps. Regarding the part where I use a mask to filter the indices, this answer might help if you need faster alternatives: Filtering (reducing) a NumPy Array
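As a rough sketch of such a faster alternative, the mask above could also be built in a vectorized way (this assumes the reset integer index used in the code above, before the original index is restored):
diffs = (df["value"].shift(-1) - df["value"]).abs()
mask = diffs.loc[first_indices].to_numpy() <= 4000
first_indices = first_indices[mask]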
I have a set of data such that the 1st column is age (numerical), the 2nd column is gender (categorical), and the 3rd column is savings (numerical).
What I want to do is find the mean and standard deviation if a column holds numerical data, and find the mode if it holds categorical data.
I tried to find the indices where the type is "num" and use them in a for loop to calculate the mean and standard deviation, with the remaining indices used to calculate the mode of the categorical data (in this case the 2nd column); however, I got stuck in the loop.
import numpy as np
data = np.array([[11, "male",1222],[23,"female",333],[15,"male",542]])
# type of the data above
types = ["num","cat","num"]
idx = []
for i in range(2):
if (types[i] == "num"):
idx.append(types[i].index)
for i in idx:
np.mean(data[:,i].astype("float64"))
I hope the code is able to obtain the mean and standard deviation for the numerical data and the mode for the categorical data. If possible, try not to bring in any other package (I'm not sure whether `index' has its own package or not).
The parentheses around the if condition are not needed (though they are harmless); the real issue is that types[i].index appends the string method itself rather than a column position, and range(2) skips the last column. Append the loop index over the full range instead:
...
idx = []
for i in range(len(types)):
    if types[i] == "num":
        idx.append(i)
...
Edit:
Instead of looping over a range, I would suggest iterating over your types list with enumerate; that way you get the index of each item directly.
for index, _type in enumerate(types):
    if _type == 'num':
        idx.append(index)
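Putting it together, a minimal sketch that computes the mean and standard deviation for the numerical columns and the mode for the categorical one, using only NumPy (since the question prefers not to pull in extra packages), could look like this:
import numpy as np

data = np.array([[11, "male", 1222], [23, "female", 333], [15, "male", 542]])
types = ["num", "cat", "num"]

for i, t in enumerate(types):
    col = data[:, i]
    if t == "num":
        values = col.astype("float64")
        print("column", i, "mean:", np.mean(values), "std:", np.std(values))
    else:
        # mode: the most frequent value in the column
        uniques, counts = np.unique(col, return_counts=True)
        print("column", i, "mode:", uniques[np.argmax(counts)])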
Not sure how to correctly phrase this, but here goes:
What's the easiest way to create a one-column dataframe in Python that holds ones and zeros, and the length is determined by some input?
For example, say that I have a sample size of 1000, of which 100 are successes (ones). The number of zeros would then be the sample size (i.e., 1000) minus the successes. So the output would be a df with a length of 1000, of which 100 rows contain a one and 900 contain a zero.
From what you describe, a simple list would do the trick. Otherwise, you can use numpy.array or pandas.DataFrame/pandas.Series (more table-like).
import numpy as np
import pandas as pd
input_length = 1000
# List approach:
my_list = [0 for i in range(input_length)]
# Numpy array:
my_array = np.zeros(input_length)
# With Pandas:
my_table = pd.Series(0, index=range(input_length))
All these create a vector of zeroes, then you assign the successes (ones) as you please. If these were to follow some known distribution, numpy also has methods to generate random vectors that follow them (see here).
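For example, for exactly 100 successes out of 1000 you could set the first 100 entries to one and then shuffle (a rough sketch building on the numpy array above):
my_array = np.zeros(input_length, dtype=int)
my_array[:100] = 1
np.random.shuffle(my_array)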
If you're really looking for the pandas approach, it can also be combined with the previous ones. That is, you can assign a list or numpy.array to the values of your Series/DataFrame. For example, imagine you want to draw 1000 random samples of a binomial distribution with p=0.5:
p=0.5
my_data = pd.Series(np.random.binomial(1, p, input_length))
In addition to N.P.'s answer, you could do something like this:
import pandas as pd
import numpy as np
def generate_df(df_len):
    values = np.random.binomial(n=1, p=0.1, size=df_len)
    return pd.DataFrame({'value': values})
df = generate_df(1000)
edit:
More complete function:
def generate_df(df_len, option, p_success=0.1):
    '''
    Generate a pandas DataFrame with one single field filled with
    1s and 0s in p_success proportion and length df_len.
    Input:
    - df_len: int, length of the 1st dimension of the DataFrame
    - option: string, determines how the sample will be generated
        * random: according to a Bernoulli distribution with p=p_success
        * fixed: failures first, and a fixed proportion of successes p_success
        * fixed_shuffled: fixed proportion of successes p_success, random order
    - p_success: proportion of successes among the total
    Output:
    - df: pandas DataFrame
    '''
    if option == 'random':
        values = np.random.binomial(n=1, p=p_success, size=df_len)
    elif option in ('fixed_shuffled', 'fixed'):
        n_success = int(df_len*p_success)
        n_fail = df_len - n_success
        values = [0]*n_fail + [1]*n_success
        if option == 'fixed_shuffled':
            np.random.shuffle(values)
    else:
        raise Exception('Unknown option: {}'.format(option))
    df = pd.DataFrame({'value': values})
    return df
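For instance, the question's 1000-row case with exactly 100 ones in random positions would then be (assuming the function above):
df = generate_df(1000, option='fixed_shuffled', p_success=0.1)
print(df['value'].sum())  # 100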
I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
a b
1 0.277438 0.042671
.. ... ...
499 0.570952 0.865869
[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
a b
0 99 99
.. .. ..
499 58 84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
pd.expanding_quantile(df.T.unstack(), integer/100.0)
for integer in range(0,101,1)})
percentile_mask = pd.Series(index = df.unstack().unstack().unstack().index)
for integer in range(0,100,1):
percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
(df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse-quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast at dealing with mostly sorted data (and at inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data and, on each row iteration, add the new values to our existing list and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the best fit algorithmically, but a pure-Python implementation would probably be slower than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
    """ Reconstruct a DataFrame of expanding quantiles by row """
    # Construct the skeleton of the DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.nan, index=df.index, columns=df.columns)
    # Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)
    # We want to maintain that sorted_array[:length] has data and is sorted
    length = 0
    # Iterates over ndarray rows
    for i, row_array in enumerate(df.values):
        # Extract non-NaN numpy array from row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]
        # Add new data to our sorted_array and sort.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")
        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length
        # Insert values into quantile_df (by position, to avoid chained assignment)
        quantile_df.iloc[i, np.flatnonzero(~row_is_nan)] = quantile_row
    return quantile_df
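For example, applied to the question's DataFrame (a usage sketch; multiply by 100 and round if you want 0-100 percentiles as in the question's target output):
df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])
quantile_df = quantiles_by_row(df)
percentile_df = (quantile_df * 100).round().astype(int)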
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right' which determines whether you want your prospective inserted position to be the first or last suitable position possible. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
It's not quite clear yet, but do you want a cumulative sum divided by the total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezing seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)
for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    # * 2 should be * N if you have N columns
    combined = preceding_rows.values.reshape((1, len(preceding_rows) * 2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'a'],
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'b'],
        kind='weak'
    )