I have a small dataframe consisting of two columns, an ORG column and a percentage column. The dataframe is sorted largest to smallest based on the percentage column.
I'd like to create a while loop that adds up the values in the percentage column up until it hits a value of .80 (80%).
So far I've tried:
retail_pareto = 0
counter = 0
while retail_pareto < .80:
    retail_pareto += retailerDF[counter]['RETAILER_PCT_OF_CHANGE']
    counter += 1
This does not work: both the counter and the retail_pareto value remain at zero, with no real error message to help me troubleshoot what I'm doing incorrectly. Ideally, I'd like to end up with a list of the orgs with the largest percentages that together add up to 80%.
I'm not exactly sure what to try next. I've searched these forums, but haven't found anything similar yet.
Any advice or help is much appreciated. Thank you.
Example Dataframe:
ORG PCT
KST 0.582561
ISL 0.290904
BOV 0.254456
BRH 0.10824
GNT 0.0913631
DSH 0.023441
RDM -0.0119665
JBL -0.0348893
JBD -0.071883
WEG -0.232227
The output that I would expect would be something along the lines of:
ORG PCT
KST 0.582561
ISL 0.290904
Use:
df_filtered = df.loc[df['PCT'].shift(fill_value=0).cumsum().le(0.80),:]
# if you don't want to include the row where the cumulative sum first exceeds 0.80, use:
#df_filtered = df.loc[df['PCT'].cumsum().le(0.80),:]
print(df_filtered)
ORG PCT
0 KST 0.582561
1 ISL 0.290904
Can you use this example to help you?
import pandas as pd
retail_pareto = 0
orgs = []
for i, row in retailerDF.iterrows():
    if retail_pareto <= .80:
        retail_pareto += row['RETAILER_PCT_OF_CHANGE']
        orgs.append(row)
    else:
        break
new_df = pd.DataFrame(orgs)
Edit: made it more like your example and added the new DataFrame.
Instead of your loop, take a more pandasonic approach.
Start with computing an additional column containing cumulative sum
of RETAILER_PCT_OF_CHANGE:
df['pct_cum'] = df.RETAILER_PCT_OF_CHANGE.cumsum()
For your data, the result is:
ORG RETAILER_PCT_OF_CHANGE pct_cum
0 KST 0.582561 0.582561
1 ISL 0.290904 0.873465
2 BOV 0.254456 1.127921
3 BRH 0.108240 1.236161
4 GNT 0.091363 1.327524
5 DSH 0.023441 1.350965
6 RDM -0.011967 1.338999
7 JBL -0.034889 1.304109
8 JBD -0.071883 1.232226
9 WEG -0.232227 0.999999
And now, to print the rows which together account for 80% of the change, ending on the first row above the limit, run:
df[df.pct_cum.shift(1).fillna(0) < 0.8]
The result, together with the cumulated sum, is:
ORG RETAILER_PCT_OF_CHANGE pct_cum
0 KST 0.582561 0.582561
1 ISL 0.290904 0.873465
My Problem
I am trying to create a column in python which is the conditional smoothed moving 14 day average of another column. The condition is that I only want to include positive values from another column in the rolling average.
I am currently using the following code which works exactly how I want it to, but it is really slow because of the loops. I want to try and re-do it without using loops. The dataset is simply the last closing price of a stock.
Current Working Code
import numpy as np
import pandas as pd
csv1 = pd.read_csv('stock_price.csv', delimiter = ',')
df = pd.DataFrame(csv1)
df['delta'] = df.PX_LAST.pct_change()
df.loc[df.index[0], 'avg_gain'] = 0
for x in range(1, len(df.index)):
    if df["delta"].iloc[x] > 0:
        df["avg_gain"].iloc[x] = ((df["avg_gain"].iloc[x - 1] * 13) + df["delta"].iloc[x]) / 14
    else:
        df["avg_gain"].iloc[x] = ((df["avg_gain"].iloc[x - 1] * 13) + 0) / 14
df
Correct Output Example
Dates PX_LAST delta avg_gain
03/09/2018 43.67800 NaN 0.000000
04/09/2018 43.14825 -0.012129 0.000000
05/09/2018 42.81725 -0.007671 0.000000
06/09/2018 43.07725 0.006072 0.000434
07/09/2018 43.37525 0.006918 0.000897
10/09/2018 43.47925 0.002398 0.001004
11/09/2018 43.59750 0.002720 0.001127
12/09/2018 43.68725 0.002059 0.001193
13/09/2018 44.08925 0.009202 0.001765
14/09/2018 43.89075 -0.004502 0.001639
17/09/2018 44.04200 0.003446 0.001768
Attempted Solutions
I tried to create a new column that contains only the positive values and then tried to create the smoothed moving average of that new column, but it doesn't give me the right answer:
df['new_col'] = df['delta'].apply(lambda x: x if x > 0 else 0)
df['avg_gain'] = df['new_col'].ewm(14,min_periods=1).mean()
The maths behind it is as follows:
Avg_Gain = ((Avg_Gain(t-1) * 13) + (New_Col * 1)) / 14
where New_Col only equals the positive values of Delta
Does anyone know how I might be able to do it?
Cheers
This should speed up your code:
df['avg_gain'] = df[df['delta'] > 0]['delta'].rolling(14).mean()
Does your current code converge to zero? If you can provide the data, it would be easier for folks to do some analysis.
I would suggest you add a column which is 0 where the value is < 0 and keeps the original value where it is >= 0. Then take the running average of this new column.
df['new_col'] = df['delta'].apply(lambda x: x if x >= 0 else 0)
df['avg_gain'] = df['new_col'].rolling(14).mean()
This would take into account zeros instead of just discarding them.
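For what it's worth, here is a loop-free sketch of a different route: the recursion in the question, Avg_Gain = (Avg_Gain(t-1) * 13 + New_Col) / 14, is exactly an exponentially weighted mean with alpha = 1/14 and adjust=False, so something along these lines should reproduce the expected output (file and column names assumed from the question):
import pandas as pd
df = pd.read_csv('stock_price.csv', delimiter=',')
df['delta'] = df.PX_LAST.pct_change()
# zero out the negative deltas (and the initial NaN), then apply Wilder-style
# smoothing: avg_gain(t) = (13 * avg_gain(t-1) + new_col(t)) / 14,
# which is ewm with alpha = 1/14 and adjust=False
df['avg_gain'] = (
    df['delta'].clip(lower=0)
               .fillna(0)
               .ewm(alpha=1/14, adjust=False)
               .mean()
)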
I'm looking to add a field or two to my data set that represent the difference in sales from the last week to the current week and from the current week to the next week.
My dataset is about 4.5 million rows, so I'm looking for an efficient way of doing this. Currently I'm getting into a lot of iteration and for loops, and I'm quite sure I'm going about this the wrong way, but I'm trying to write code that will be reusable on other datasets, and there are situations where you might have nulls or no change in sales week to week (in which case there is no record).
The dataset looks like the following:
Store Item WeekID WeeklySales
1 1567 34 100.00
2 2765 34 86.00
3 1163 34 200.00
1 1567 35 160.00
. .
. .
. .
I have each week as its own dictionary and then each store sales for that week in a dictionary within. So I can use the week as a key and then within the week I access the store's dictionary of item sales.
weekly_sales_dict = {}
for i in df['WeekID'].unique():
    store_items_dict = {}
    subset = df[df['WeekID'] == i]
    subset = subset.groupby(['Store', 'Item']).agg({'WeeklySales': 'sum'}).reset_index()
    for j in subset['Store'].unique():
        storeset = subset[subset['Store'] == j]
        store_items_dict.update({str(j): storeset})
    weekly_sales_dict.update({str(i): store_items_dict})
Then I iterate through each week in the weekly_sales_dict and compare each store/item within it to the week behind it (I planned to do the same for the next week as well). The 'lag_list' I create can be indexed by week, store, and Item so I was going to iterate through and add the values to my df as a new lag column but I feel I am way overthinking this.
count = 0
key_list = list(df['WeekID'].unique())
lag_list = []
for k,v in weekly_sales_dict.items():
    if count != 0 and count != len(df['WeekID'].unique()) - 1:
        prev_wk = weekly_sales_dict[str(key_list[(count - 1)])]
        current_wk = weekly_sales_dict[str(key_list[count])]
        for i in df['Store'].unique():
            prev_df = prev_wk[str(i)]
            current_df = current_wk[str(i)]
            for j in df['Item'].unique():
                print('in j')
                if j in list(current_df['Item'].unique()) and j in list(prev_df['Item'].unique()):
                    item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values - prev_df[prev_df['Item'] == int(j)]['WeeklySales'].values
                    df[df['Item'] == j][df['Store'] == i][df['WeekID'] == key_list[count]]['lag'] = item_lag[0]
                    lag_list.append((str(i), str(j), item_lag[0]))
                elif j in list(current_df['Item'].unique()):
                    item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values
                    lag_list.append((str(i), str(j), item_lag[0]))
                else:
                    pass
        count += 1
    else:
        count += 1
Using .diff() the problem was solved. I sorted all rows by week, then created a subset with a multi-index by grouping on store, item, and week. Finally I used .diff() with a period of 1 and ended up with the sales difference from the current week to the week prior.
df = df.sort_values(by = 'WeekID')
subset = df.groupby(['Store', 'Item', 'WeekID']).agg({'WeeklySales': 'sum'})
subset['lag'] = subset[['WeeklySales']].diff(1)
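For the second field (current week to next week), here is a sketch along the same lines using groupby plus diff, so each Store/Item pair is only differenced against its own history; the toy data below is assumed, shaped like the question's table:
import pandas as pd
# toy data in the shape described in the question
df = pd.DataFrame({
    'Store':       [1, 1, 1, 2, 2],
    'Item':        [1567, 1567, 1567, 2765, 2765],
    'WeekID':      [34, 35, 36, 34, 35],
    'WeeklySales': [100.0, 160.0, 150.0, 86.0, 90.0],
})
df = df.sort_values(['Store', 'Item', 'WeekID'])
grouped = df.groupby(['Store', 'Item'])['WeeklySales']
df['diff_prev_week'] = grouped.diff(1)    # current week minus previous week
df['diff_next_week'] = -grouped.diff(-1)  # next week minus current week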
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe the problems I am having and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators; however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22), as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This doesn't have to make use of any SQL queries either. I've also included a section to immediately discard any redundant genes that don't have any SNPs falling within their range. This makes use of a double for-loop, which I normally try to avoid, but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]
    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
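One possible direction, offered only as a sketch: build a pandas IntervalIndex over the gene ranges of each chromosome and look every SNP position up in it, which avoids both the per-gene loop and a full cross merge. This assumes the gene intervals on a chromosome do not overlap (get_indexer raises an error for overlapping intervals; in that case a merge-and-filter approach like the one below would be needed).
import pandas as pd
all_dfs = []
for chrom, snps in snp_df.groupby('chromosome'):
    genes = gene_df[gene_df['chromosome'] == chrom]
    # one interval per gene, inclusive at both ends to match chr_start <= BP <= chr_stop
    intervals = pd.IntervalIndex.from_arrays(genes['chr_start'],
                                             genes['chr_stop'],
                                             closed='both')
    idx = intervals.get_indexer(snps['BP'])  # -1 where a SNP falls in no gene
    hits = snps.loc[idx >= 0, ['SNP']].copy()
    hits['feature_id'] = genes['feature_id'].to_numpy()[idx[idx >= 0]]
    all_dfs.append(hits)
all_genic_snps = pd.concat(all_dfs)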
You can use the following to accomplish what you're looking for:
merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
I have a loop within my function that is supposed to find the max rate and min rate, and compute the average. The function I wrote does this correctly, but how can I keep the row information when I find the max and min within my data? I'm a beginner at Python, but here is the loop that I have.
max_rate = -1
min_rate = 25
count = 0
sum = 0
with open(file_names, "r") as file_out:
    # skips the headers in the file
    next(file_out)
    for line in file_out:
        values = line.split(",")
        # since rate is index 6 that is what we are going to compare to the values above
        if float(values[6]) > max_rate:
            max_rate = float(values[6])
        if float(values[6]) < min_rate:
            min_rate = float(values[6])
        count += 1
        # sum up all rates in the rates column
        sum = float(values[6]) + sum
avg_rate = sum / count
print(avg_rate)
I have printed the average just to test my function. Hopefully the question I am asking makes sense: I don't just want the 6th index, I want the rest of the row information that has the min or the max. An example would be to get the company name, state, zip, and rate. Don't worry about indentation; I don't know if I formatted it right in the code block here, but all the indents are right in my code.
It looks like you're working with CSV or other table-like data. Pandas handles this really well. An example would be:
import pandas as pd
df = pd.read_csv('something.csv')
print(df)
print(f'\nMax Rate: {df.rate.max()}')
print(f'Avg Rate: {df.rate.mean()}')
print(f'Min Rate: {df.rate.min()}')
print(f'Last Company (Alphabetically): {df.company_name.max()}')
Yields:
company_name state zip rate
0 Company1 Inc. Texas 76189 0.6527
1 Company2 LLC. Pennsylvania 18657 0.7265
2 Company3 Corp Indiana 47935 0.5267
Max Rate: 0.7265
Avg Rate: 0.6353
Min Rate: 0.5267
Last Company (Alphabetically): Company3 Corp
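To get the whole row (company name, state, zip, rate) rather than just the value, idxmax and idxmin give the index label of the extreme row, which .loc can then pull out; a small sketch assuming the same file and column names as above:
import pandas as pd
df = pd.read_csv('something.csv')
# full rows holding the maximum and minimum rate
row_with_max_rate = df.loc[df['rate'].idxmax()]
row_with_min_rate = df.loc[df['rate'].idxmin()]
print(row_with_max_rate)
print(row_with_min_rate)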
Try this:
max_rate = []
min_rate = []
count = 0
total = 0
with open(file_names, "r") as file_out:
    # skips the headers in the file
    next(file_out)
    for line in file_out:
        values = line.split(",")
        # keep the whole row whose rate (index 6) is the largest / smallest so far
        max_rate = max(values, max_rate or values, key=lambda x: float(x[6]))
        min_rate = min(values, min_rate or values, key=lambda x: float(x[6]))
        # sum up all rates in the rates column
        total += float(values[6])
        count += 1
avg_rate = total / count
print(avg_rate)
This keeps the whole row (as a list) for the min and the max of the 6th column, as you intended. The max_rate or values expression falls back to values while max_rate is still empty (which is the case in the first iteration of the for loop); that prevents an IndexError inside the key function. The same applies to min_rate.
An important change I've made to your code is the name of the variable sum. That's a Python built-in, and it's not good practice to shadow it with a variable name, so prefer something like total or total_sum instead.
Those suggestions are great, thanks. I also found out that I could just assign the line to a variable underneath my if statements. Then at the beginning of my function I can assign these variables to an empty string, like:
info_high = ""
info_low = ""
info_high = line
info_low = line
and it will save the row information I need; then I would just index out the pieces of information that I need.
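A small sketch of that pattern, under the same assumptions as the original loop (rate in the 7th comma-separated column, file name as in the question):
max_rate = -1.0
min_rate = 25.0
info_high = ""  # full CSV line for the row with the max rate
info_low = ""   # full CSV line for the row with the min rate
count = 0
total = 0.0
with open(file_names, "r") as file_out:
    next(file_out)  # skip the header line
    for line in file_out:
        values = line.split(",")
        rate = float(values[6])
        if rate > max_rate:
            max_rate = rate
            info_high = line  # keep the whole row alongside the value
        if rate < min_rate:
            min_rate = rate
            info_low = line
        total += rate
        count += 1
avg_rate = total / count
print(avg_rate)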
I have a dataframe stockData which looks like this:
Name: BBG.XLON.VOD.S_MKTCAP_EUR,
04/02/2008 125761.8868
05/02/2008 124513.4973
06/02/2008 124299.8368
07/02/2008 122973.7429
08/02/2008 123451.0086
11/02/2008 122948.5002
12/02/2008 124336.3475
13/02/2008 124546.6607
14/02/2008 124434.8762
15/02/2008 123370.2129
18/02/2008 123246.854
19/02/2008 121965.328
20/02/2008 119154.8945
I am trying to create an exponentially weighted moving average with an alpha of 0.1, so the resulting dataframe should look like:
Name: BBG.XLON.VOD.S_MKTCAP_EUR, expon
04/02/2008 125761.8868 125761.8868
05/02/2008 124513.4973 125637.0478
06/02/2008 124299.8368 125503.3267
07/02/2008 122973.7429 125250.3683
08/02/2008 123451.0086 125070.4324
11/02/2008 122948.5002 124858.2391
12/02/2008 124336.3475 124806.05
13/02/2008 124546.6607 124780.111
14/02/2008 124434.8762 124745.5876
15/02/2008 123370.2129 124608.0501
18/02/2008 123246.854 124471.9305
19/02/2008 121965.328 124221.2702
20/02/2008 119154.8945 123714.6327
I have tried using the following from pandas:
stockData['expon'] = pd.ewma(stockData[unique_id+"_MKTCAP_EUR"], span = 0.1)
but get a result which does not equal what I am expecting:
Name: BBG.XLON.VOD.S_MKTCAP_EUR, expon
04/02/2008 125761.8868 125761.8868
05/02/2008 124513.4973 123681.2377
06/02/2008 124299.8368 124062.4362
07/02/2008 122973.7429 121107.3884
08/02/2008 123451.0086 124216.9907
11/02/2008 122948.5002 122075.8313
12/02/2008 124336.3475 126868.3597
13/02/2008 124546.6607 124942.6688
14/02/2008 124434.8762 124220.0306
15/02/2008 123370.2129 121296.275
18/02/2008 123246.854 123004.4148
19/02/2008 121965.328 119431.9075
20/02/2008 119154.8945 113577.3494
Could someone let me know what I need to do in order to return the expected result, please?
Also, if I wanted to return just the last value in the exponentially weighted series (123714.6327), could someone let me know how that would be possible?
Thanks
Just simplifying column names:
df.columns = ['date', 'ticker']
Use adjust=False (see docs for calculation of weights)
df['emwa'] = pd.ewma(df.ticker, alpha=0.1, adjust=False)
date ticker emwa
0 04/02/2008 125761.8868 125761.886800
1 05/02/2008 124513.4973 125637.047850
2 06/02/2008 124299.8368 125503.326745
3 07/02/2008 122973.7429 125250.368361
4 08/02/2008 123451.0086 125070.432384
5 11/02/2008 122948.5002 124858.239166
6 12/02/2008 124336.3475 124806.049999
7 13/02/2008 124546.6607 124780.111069
8 14/02/2008 124434.8762 124745.587583
9 15/02/2008 123370.2129 124608.050114
10 18/02/2008 123246.8540 124471.930503
11 19/02/2008 121965.3280 124221.270253
12 20/02/2008 119154.8945 123714.632677
and to get the last value:
df.emwa.iloc[-1]
123714.632677
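Note that pd.ewma has since been removed from pandas; on current versions the same calculation goes through the .ewm accessor. A sketch with the same column names as in the answer above:
df['emwa'] = df['ticker'].ewm(alpha=0.1, adjust=False).mean()
# last value of the exponentially weighted series
print(df['emwa'].iloc[-1])  # 123714.632677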