I'm working on a script that reads an Excel file and draws a sample whose size comes from a sample-size calculation. I want the sample to be spread evenly across all unique categories, but I'm not entirely sure where to start.
import pandas as pd
import random
df = pd.read_excel("C:/Users/bryanmccormack/Desktop/Test_Catalog.xlsx")
df2 = df.loc[(df['Track Item']=='Y')]
category_total = df2['Category'].nunique()
total_rows = len(df2.axes[0])
ss = (2.58**2)*(.5)*(1-.5)/.04**2
ss2 = 1+(ss-1/total_rows)
ss3 = ss/ss2
ss4 = round(ss3 * 1000)
category = ss4 / category_total
df3 = df2.groupby('Category').apply(lambda x: x.sample(category))
df2 has 3774 items, and the sample formula gives 999 items, but I'm getting this error: "ValueError: Cannot take a larger sample than population when 'replace=False'"
Any idea why my code is wrong?
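That error is raised when at least one category has fewer rows than the number requested from it. Below is a minimal sketch of one way around it: cap the per-category sample at the category's size. The helper names (corrected_sample_size, sample_evenly) and the toy data are made up for illustration. I'm also assuming the formula is meant to be the usual finite-population correction, in which case note that `ss-1/total_rows` in the question parses as `ss - (1/total_rows)`.

```python
import pandas as pd

def corrected_sample_size(population, z=2.58, p=0.5, e=0.04):
    """Sample size with a finite-population correction (assumed intent)."""
    n0 = (z ** 2) * p * (1 - p) / e ** 2
    # Parentheses matter: (n0 - 1) / population, not n0 - 1 / population.
    return n0 / (1 + (n0 - 1) / population)

def sample_evenly(df, by, total, seed=0):
    """Split `total` evenly across groups, never asking a group for more rows than it has."""
    per_group = int(total) // df[by].nunique()
    parts = [
        group.sample(n=min(len(group), per_group), random_state=seed)
        for _, group in df.groupby(by)
    ]
    return pd.concat(parts)

# Hypothetical toy data standing in for the Excel sheet.
toy = pd.DataFrame({
    "Category": ["A"] * 10 + ["B"] * 2 + ["C"] * 5,
    "Item": range(17),
})
picked = sample_evenly(toy, "Category", total=12)
counts = picked["Category"].value_counts().to_dict()
```

Because small categories are capped, the result can contain fewer rows than `total`; whether to top it up from the larger categories is a design choice the question leaves open.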
Related
I am currently working on a project where I need to check whether two distributions are the same. For that I have two data frames, both containing numeric values only:
1) db_df - which is from the db
2) data - which is the user-uploaded data frame
I have to compare every column of db_df with every column of data, find the similar columns in data, and offer them to the user as suggestions for the db column.
Both data frames have 100 rows and 239 columns.
`
import time

import pandas as pd
from scipy.stats import kstest

row_list = []
suggestions = dict()
s = time.time()
db_data_columns = db_df.columns
data_columns = data.columns
for i in db_data_columns:
    col_list = list()
    for j in data_columns:
        # perform the Kolmogorov-Smirnov test and keep only the p-value
        col_list.append(kstest(db_df[i], data[j])[1])
    row_list.append(col_list)
print(f"=== AFTER FOR TIME {time.time()-s}")
# rows: columns of data, columns: columns of db_df
df = pd.DataFrame(row_list).T
df.columns = db_df.columns
df.index = data.columns
for i in df.columns:
    # keep the three best matches for db column i with p-value above 0.05
    sorted_df = df.sort_values(by=[i], ascending=False)
    sorted_df = sorted_df[sorted_df > 0.05]
    sorted_df = sorted_df[:3].loc[:, i:i]
    sorted_df = sorted_df.dropna()
    suggestions[sorted_df.columns[0]] = list(sorted_df.to_dict().values())[0]
`
After computing the p-values of every column in db_df against every column in data, I need to select the top 3 columns from data for each column in db_df.
**Overall this takes 14 seconds, which is far too long. Is there any way to get it under 5 seconds?**
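For the selection step only, here is a hedged sketch that avoids re-sorting the whole p-value table once per column. It assumes a table laid out like df in the question (rows are columns of data, columns are columns of db_df). The function name top_suggestions and the toy numbers are made up. It does not touch the nested kstest loop, which is probably where most of the 14 seconds go.

```python
import pandas as pd

def top_suggestions(pvalues, alpha=0.05, k=3):
    """For each column, return the k largest p-values above alpha as {row label: p-value}."""
    out = {}
    for col in pvalues.columns:
        s = pvalues[col]
        out[col] = s[s > alpha].nlargest(k).to_dict()
    return out

# Toy p-value table: index = columns of `data`, columns = columns of `db_df`.
pvalues = pd.DataFrame(
    {"a": [0.9, 0.01, 0.5, 0.2, 0.3], "b": [0.01, 0.02, 0.03, 0.04, 0.2]},
    index=list("vwxyz"),
)
result = top_suggestions(pvalues)
```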
I have a df of 300000 rows and 25 columns.
Here's a link to 21 rows of the dataset
I have added a unique index to all the rows, using uuid.uuid4().
Now I only want a random portion of the dataset (say 25%). Here is what I am trying to do to get it, but it's not working:
def gen_uuid(self, df, percentage = 1.0, uuid_list = []):
    for i in range(df.shape[0]):
        uuid_list.append(str(uuid.uuid4()))
    uuid_pd = pd.Series(uuid_list)
    df_uuid = df.copy()
    df_uuid['id'] = uuid_pd
    df_uuid = df_uuid.set_index('id')
    if (percentage == 1.0) : return df_uuid
    else:
        uuid_list_sample = random.sample(uuid_list, int(len(uuid_list) * percentage))
        return df_uuid[df_uuid.index.any() in uuid_list_sample]
But this raises KeyError: False.
The uuid_list_sample that I generate is the correct length.
So I have 2 questions:
1) How do I get the above code to work as intended, i.e. return a random portion of the pandas df based on the index?
2) How do I, in general, get a percentage of a whole pandas data frame? I was looking at pandas.DataFrame.quantile, but I'm not sure that does what I'm looking for.
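A minimal sketch covering both questions, using a made-up 100-row frame. The expression `df_uuid.index.any() in uuid_list_sample` evaluates to a single Python bool, so `df_uuid[False]` is looked up as a column label, which explains the KeyError. Selecting by a list of ids can be done with `Index.isin`, and a random fraction of any frame can be taken with `DataFrame.sample(frac=...)`. The helper name add_uuid_index is hypothetical.

```python
import random
import uuid

import pandas as pd

def add_uuid_index(df):
    """Return a copy of df indexed by freshly generated UUID strings."""
    out = df.copy()
    out.index = pd.Index([str(uuid.uuid4()) for _ in range(len(out))], name="id")
    return out

df = pd.DataFrame({"value": range(100)})  # toy data
df_uuid = add_uuid_index(df)

# Question 2: a random 25% of the rows, directly.
quarter = df_uuid.sample(frac=0.25, random_state=0)

# Question 1: the same idea via an explicit list of sampled ids.
ids = random.sample(list(df_uuid.index), int(len(df_uuid) * 0.25))
subset = df_uuid.loc[df_uuid.index.isin(ids)]
```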
I'm working with a relatively large dataset (approx 5m observations, made up of about 5.5k firms).
I needed to run OLS regressions with a 60 month rolling window for each firm. I noticed that the performance was insanely slow when I ran the following code:
for idx, sub_df in master_df.groupby("firm_id"):
    # OLS code
However, when I first split my dataframe into about 5.5k dfs and then iterated over each of the dfs, the performance improved dramatically.
grouped_df = master_df.groupby("firm_id")
df_list = [group for group in grouped_df]
for df in df_list:
    my_df = df[1]
    # OLS code
I'm talking 1-2 weeks of time (24/7) to complete in the first version compared to 8-9 hours tops.
Can anyone please explain why splitting the master df into N smaller dfs and then iterating over each smaller df performs better than iterating over the same number of groups within the master df?
Thanks ever so much!
I'm unable to reproduce your observation. Here's some code that generates data and then times the direct and indirect methods separately. The time taken is very similar in either case.
Is it possible that you accidentally sorted the dataframe by the group key between the runs? Sorting by group key results in a noticeable difference in run time.
Otherwise, I'm beginning to think that there might be some other differences in your code. It would be great if you could post the full code.
import numpy as np
import pandas as pd
from datetime import datetime

def generate_data():
    ''' returns a Pandas DF with columns 'firm_id' and 'score' '''
    # configuration
    np.random.seed(22)
    num_groups = 50000  # number of distinct groups in the DF
    mean_group_length = 200  # how many records per group?
    cov_group_length = 0.10  # throw in some variability in the num records per group
    # simulate group lengths
    stdv_group_length = mean_group_length * cov_group_length
    group_lengths = np.random.normal(
        loc=mean_group_length,
        scale=stdv_group_length,
        size=(num_groups,)).astype(int)
    group_lengths[group_lengths <= 0] = mean_group_length
    # final length of DF
    total_length = sum(group_lengths)
    # compute entries for group key column
    firm_id_list = []
    for i, l in enumerate(group_lengths):
        firm_id_list.extend([(i + 1)] * l)
    # construct the DF; data column is 'score' populated with Numpy's U[0, 1)
    result_df = pd.DataFrame(data={
        'firm_id': firm_id_list,
        'score': np.random.rand(total_length)
    })
    # Optionally, shuffle or sort the DF by group keys
    # ALTERNATIVE 1: (badly) unsorted df
    result_df = result_df.sample(frac=1, random_state=13).reset_index(drop=True)
    # ALTERNATIVE 2: sort by group key
    # result_df.sort_values(by='firm_id', inplace=True)
    return result_df

def time_method(df, method):
    ''' time 'method' with 'df' as its argument '''
    t_start = datetime.now()
    method(df)
    t_final = datetime.now()
    delta_t = t_final - t_start
    print(f"Method '{method.__name__}' took {delta_t}.")
    return

def process_direct(df):
    ''' direct for-loop over groupby object '''
    for group, df in df.groupby('firm_id'):
        m = df.score.mean()
        s = df.score.std()
    return

def process_indirect(df):
    ''' indirect method: generate groups first as list and then loop over list '''
    grouped_df = df.groupby('firm_id')
    group_list = [pair for pair in grouped_df]
    for pair in group_list:
        m = pair[1].score.mean()
        s = pair[1].score.std()

df = generate_data()
time_method(df, process_direct)
time_method(df, process_indirect)
I've imported deque from collections to limit the size of my data frame. When new data is entered, the older ones should be progressively deleted over time.
Big Picture:
I'm creating a data frame of historical values covering the previous 26 days, counted from whatever the current day is.
Confusion:
I think my data comes in each minute as a Series, so I tried to restrict its length using deque's maxlen and then load the result into a data frame. However, I just get NaN values.
Code:
import numpy as np
import pandas as pd
from collections import deque
def initialize(context):
    context.stocks = (symbol('AAPL'))

def before_trading_start(context, data):
    data = data.history(context.stocks, 'close', 20, '1m').dropna()
    length = 5
    d = deque(maxlen = length)
    data = d.append(data)
    index = pd.DatetimeIndex(start='2016-04-03 00:00:00', freq='S', periods=length)
    columns = ['price']
    df = pd.DataFrame(index=index, columns=columns, data=data)
    print df
How can I get this to work?
Mike
If I understand the question correctly, you want to keep all the values from the last twenty-six days. Is the following function enough for you?
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        .loc[lambda x: x.index > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
If the dates are not in the index:
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        # the following line is changed for values in a specific column
        .loc[lambda x: x['column_with_date'] > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
Don't forget to change the hard coded timezone if you are not in France. :-)
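If you would rather keep the deque from the question, note that `deque.append` returns None, so `data = d.append(data)` throws the data away, and that would be consistent with the all-NaN frame. Here is a minimal sketch with made-up prices and timestamps (no Quantopian API involved): append one (timestamp, price) pair at a time and build the frame from the deque's contents.

```python
from collections import deque

import pandas as pd

length = 5
window = deque(maxlen=length)  # older entries fall off automatically

# Hypothetical per-minute prices standing in for the real feed.
start = pd.Timestamp("2016-04-03 00:00:00")
prices = [10.0, 10.5, 11.0, 10.8, 11.2, 11.5, 11.1]
for minute, price in enumerate(prices):
    window.append((start + pd.Timedelta(minutes=minute), price))

# Build the frame from the deque's current contents.
df = pd.DataFrame(list(window), columns=["time", "price"]).set_index("time")
```

After seven appends only the last five rows remain, because maxlen=5 discards the two oldest.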
I have a large temperature time series that I'm performing some functions on. I'm taking hourly observations and creating daily statistics. After I'm done with my calculations, I want to use the grouped year and Julian days that are objects in the Groupby ('aa' below) and the drangeT and drangeHI arrays that come out and make an entirely new DataFrame with those variables. Code is below:
import numpy as np
import scipy.stats as st
import pandas as pd
city = ['BUF']#,'PIT','CIN','CHI','STL','MSP','DET']
mons = np.arange(5,11,1)
for a in city:
    data = 'H:/Classwork/GEOG612/Project/'+a+'Data_cut.txt'
    df = pd.read_table(data,sep='\t')
    df['TempF'] = ((9./5.)*df['TempC'])+32.
    df1 = df.loc[df['Month'].isin(mons)]
    aa = df1.groupby(['Year','Julian'],as_index=False)
    maxT = aa.aggregate({'TempF':np.max})
    minT = aa.aggregate({'TempF':np.min})
    maxHI = aa.aggregate({'HeatIndex':np.max})
    minHI = aa.aggregate({'HeatIndex':np.min})
    drangeT = maxT - minT
    drangeHI = maxHI - minHI
    df2 = pd.DataFrame(data = {'Year':aa.Year,'Day':aa.Julian,'TRange':drangeT,'HIRange':drangeHI})
All variables in the df2 command are of length 8250, but I get this error message when I run it:
ValueError: cannot copy sequence with size 3 to array axis with dimension 8250
Any suggestions are welcomed and appreciated. Thanks!
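One possible approach, sketched on made-up numbers: let the group keys live in the index while computing the ranges, then turn them back into columns with reset_index. That avoids passing groupby objects such as aa.Year into the DataFrame constructor. The column names follow the question; the function name daily_ranges and the toy values are assumptions.

```python
import pandas as pd

def daily_ranges(df):
    """Daily max-minus-min of TempF and HeatIndex per (Year, Julian)."""
    grouped = df.groupby(["Year", "Julian"])[["TempF", "HeatIndex"]]
    ranges = grouped.max() - grouped.min()  # index is (Year, Julian)
    ranges = ranges.rename(columns={"TempF": "TRange", "HeatIndex": "HIRange"})
    return ranges.reset_index().rename(columns={"Julian": "Day"})

# Toy hourly observations standing in for the real file.
obs = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000],
    "Julian": [1, 1, 2, 2],
    "TempF": [50.0, 60.0, 55.0, 70.0],
    "HeatIndex": [52.0, 65.0, 55.0, 75.0],
})
df2 = daily_ranges(obs)
```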