Is there an optimal solution for this? - python

I have a large dataset with the columns ID, year, and program. I want to filter out IDs where one program, say A, has a later year (for example, 2019 > 2018) than another program, say B. I have a solution, but it involves a loop; I want to know if there is another way of doing this.
My code:
unique = list(set(finalAD['ID']))
IDFiltered = []
for i in unique:
    data = finalAD[finalAD['ID'] == i]
    AD1 = data[data['Program'].str.match('AD')]
    ind = list(AD1.index.values)
    AD2 = data.drop(ind)
    date1 = AD1['Year'].max()
    date2 = AD2['Year'].min()
    if date2 > date1:
        IDFiltered.append(i)
newData = finalAD[finalAD['ID'].isin(IDFiltered)]
newData.reset_index(drop=True, inplace=True)
newData.head()
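
One loop-free alternative is to compute both aggregates with groupby and compare them after aligning on ID. A minimal sketch, assuming the same finalAD frame with columns ID, Program, and Year:

import pandas as pd

is_ad = finalAD['Program'].str.match('AD')
max_ad = finalAD[is_ad].groupby('ID')['Year'].max()      # latest AD year per ID
min_other = finalAD[~is_ad].groupby('ID')['Year'].min()  # earliest non-AD year per ID
# NaN comparisons are False, so IDs missing either group drop out,
# matching the loop's behavior.
keep = max_ad.index[min_other.reindex(max_ad.index) > max_ad]
newData = finalAD[finalAD['ID'].isin(keep)].reset_index(drop=True)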

Related

How can I optimize the groupby.apply(function) in Python?

I have a function that uses collections.deque to track daily stock-in based on FIFO. An order is fulfilled if possible and subtracted from stock accordingly. I use the function in groupby.apply(my_function).
I'm struggling with where to place the second loop. Both loops work properly when run on their own, but I cannot get them working combined.
The dataset is about 1.5 million rows.
Thanks.
DOS = 7
WIP = 1
df_fin['list_stock'] = 0
df_fin['stock_new'] = 0

def create_stocklist(x):
    x['date_diff'] = x['dates'] - x['dates'].shift()
    x['date_diff'] = x['date_diff'].fillna(0)
    x['date_diff'] = (x['date_diff'] / np.timedelta64(1, 'D')).astype(int)
    x['list_stock'] = x['list_stock'].astype(object)
    x['stock_new'] = x['stock_new'].astype(object)
    var_stock = DOS * [0]
    sl = deque([0], maxlen=DOS)
    for i in x.index:
        order = x['order_bin'][i]
        if x['date_diff'][i] > 0:
            for p in range(0, x['date_diff'][i]):
                if p == WIP:
                    sl.appendleft(x.return_bin[i-1])
                else:
                    sl.appendleft(0)
            sl_list = list(sl)
            sl_list.reverse()
            new_list = []
        # From here the loop does not work as I wanted it to.
        # I want to loop over the created sl_list
        # and then start the loop above with the outcome of the loop below.
        for elem in sl_list:
            while order > 0:
                val = max(0, elem - order)
                order = abs(min(0, elem - order))
                new_list.append(val)
                break
            else:
                new_list.append(elem)
        new_list.reverse()
        x.at[i, 'list_stock'] = new_list
        sl = deque(new_list)
    return x

df_fin.groupby(by=['ID']).apply(create_stocklist)
You do not have access to sl_list inside the second loop; you should define it in the enclosing scope, for example right at the start of the outer for loop:
for i in x.index:
    # define it just here
    sl_list = []
    order = x['order_bin'][i]

Is there a way to optimize this code in order to run faster?

Hi there, I am working on an application and I am using this piece of code to create new columns in a data frame so I can make some calculations. However, it is really slow, and I would like to try a new approach.
I have read about multiprocessing, but I am not sure how and where to use it, so I am asking for your help.
def create_exposed_columns(df):
    df['MONTH_INITIAL_DATE'] = df['INITIAL_DATE'].dt.to_period('M')
    df['MONTH_FINAL_DATE'] = df['FINAL_DATE'].dt.to_period('M')
    df['Diff'] = df['MONTH_FINAL_DATE'] - df['MONTH_INITIAL_DATE']
    list_1 = []
    for index, row in df.iterrows():
        valor = 1
        initial_date = row['INITIAL_DATE']
        diff = row['Diff']
        temporal_list = {}
        list_1.append(temporal_list)
        # diff.n is the number of months between the two dates
        # (the original snippet used an undefined meses_iterables here)
        for i in range(diff.n + 1):
            date = initial_date + relativedelta(months=+1 * i)
            if len(str(date.month)) == 1:
                value = {str(date.year) + '-0' + str(date.month): valor}
                temporal_list.update(value)
            else:
                value = {str(date.year) + '-' + str(date.month): valor}
                temporal_list.update(value)
    df_2 = pd.DataFrame(list_1)
    df = df.reset_index()
    df = pd.concat([df, df_2], axis=1)
    return df
I have no idea where to start, so any kind of help will be useful.
Thanks
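
One direction that avoids iterrows entirely, as a hedged sketch (assuming the same INITIAL_DATE and FINAL_DATE datetime columns): build each row's month-to-1 mapping with pd.period_range and let the DataFrame constructor align the columns.

import pandas as pd

def create_exposed_columns_fast(df):
    start = df['INITIAL_DATE'].dt.to_period('M')
    end = df['FINAL_DATE'].dt.to_period('M')
    # One dict per row, e.g. {'2020-01': 1, '2020-02': 1, ...};
    # str(Period) already zero-pads the month.
    rows = [{str(p): 1 for p in pd.period_range(s, e, freq='M')}
            for s, e in zip(start, end)]
    exposed = pd.DataFrame(rows, index=df.index).fillna(0)
    return pd.concat([df, exposed], axis=1)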

Setting dictionary values while iterating through a 'for-loop'

I'm trying to create a nested dictionary with a set of values pulled from a for-loop, to measure growth and revenue amounts for various customer-product pairings. However, when I loop through a dataframe to set elements of the dictionary, each dictionary element ends up with the same values. What's going on here?
I have already tried changing various elements of how the lists are built, but to no avail.
'''
TP_Name = customer name
Service_Level_1 = service name
100.2014 is just a marker to show that someone has started consuming the service
tpdict is already created with necessary nesting below with empty values at each endpoint
'''
for col in pivotdf.columns:
    growthlist = []
    amountlist = []
    first = True
    TP_Name, Service_Level_1 = col.split('___')
    for row in pivotdf[col]:
        if first == True:
            past = row + .00001
            first = False
        if row == 0 and past < .0001:
            growth = 0
        elif row != 0 and past == .00001:
            growth = 100.2014
        else:
            current = row
            growth = (current - past) / past
            growth = round(growth, 4)
        growthlist.append(growth)
        past = row + .00001
        amountlist.append(row)
    tpdict[TP_Name][Service_Level_1]['growth'] = growthlist
    tpdict[TP_Name][Service_Level_1]['amount'] = amountlist
'''
problem: Each value ends up being the same thing
'''
Expected results:
{'CUSTOMER NAME': {'PRODUCT1': {'growth': [unique_growthlist], 'amount': [unique_amountlist]}, 'PRODUCT2': {'growth': [unique_growthlist],'amount': [unique_amountlist]}}}
A dictionary is a set of key-value pairs (as I am sure you know). If you write to a dictionary with a key that already exists, the dictionary overwrites the value for that key.
Example:
d = dict()
d[1] = 'a' # d = {1: 'a'}
d[1] = 'b' # d = {1: 'b'}
Your project seems like a good use case for a namedtuple in Python.
A namedtuple is basically a lightweight class/object.
My example code may be wrong because I don't know how your for loop is working (commenting helps everyone). That being said, here is an example.
I only make this recommendation because dictionaries consume ~33% more memory than the objects they hold (though they are much faster).
from collections import namedtuple

Customer = namedtuple('Customer', 'name products')
Product = namedtuple('Product', 'growth amount')

customers = []
for col in pivotdf.columns:
    products = []
    growthlist = []
    amountlist = []
    first = True
    TP_Name, Service_Level_1 = col.split('___')
    for row in pivotdf[col]:
        if first == True:
            past = row + .00001
            first = False
        if row == 0 and past < .0001:
            growth = 0
        elif row != 0 and past == .00001:
            growth = 100.2014
        else:
            current = row
            growth = (current - past) / past
            growth = round(growth, 4)
        growthlist.append(growth)
        past = row + .00001
        amountlist.append(row)
    cur_product = Product(growth=growthlist, amount=amountlist)  # Create a new product
    products.append(cur_product)  # Add that product to our customer
    # Create a new customer with our products
    cur_customer = Customer(name=TP_Name, products=products)
    customers.append(cur_customer)  # Add our customer to our list of customers
Here customers is a list of Customer namedtuples that we can use as objects.
For example this is how we can print them out.
for customer in customers:
    print(customer.name, customer.products)  # Print each name and their products
    for growth, amount in customer.products:
        print(growth, amount)  # Print growth and amount for each product.

Summarizing data in pandas dataframe

I have a dataframe that looks like:
respondent_id,group_number,member_id
1,1,3
1,1,4
1,2,1
....
My goal is to output two counts for each respondent ID: the number of groups that include their own ID as a member, and the number that don't.
For example, the above table would output:
respondent_id,my_groups,other_groups
1,1,1
My best guess is to do something like:
rg_g = df.groupby(['respondent_id','group_number'])
rg_g.apply(lambda g: g.respondent_id in g.id.values)
But I don't know where to go from there.
Updated answer (it is not the best code, but it works):
Initialization:
test_data = pd.DataFrame(np.random.randint(5, size=(11, 3)), columns=['respondent_id', 'group_number', 'member_id'])
test_data.loc[3, 'member_id'] = np.nan
test_data.loc[5, 'member_id'] = np.nan
test_data.loc[7, 'member_id'] = np.nan
test_data.loc[8, 'member_id'] = np.nan
test_data.loc[9, 'member_id'] = np.nan
test_data.loc[10, 'member_id'] = np.nan
Code:
# calculate the groups where the respondent has the member_id
d_nn = test_data[test_data.member_id.notnull()]
# or, for example: test_data[test_data.member_id != 0]
d_is_n = test_data[test_data.member_id.isnull()]
d_nn = pd.DataFrame({'count': d_nn.groupby(["respondent_id", "group_number"]).size()}).reset_index()
d_is_n = pd.DataFrame({'count': d_is_n.groupby(["respondent_id", "group_number"]).size()}).reset_index()
d_nn['is_member'] = 1
d_is_n['is_member'] = 0
# merge: keep is-null groups that have no matching not-null group
result = d_nn.copy()
for idx1 in range(len(d_is_n)):
    merge = True
    for idx2 in range(len(d_nn)):
        if d_nn.iloc[idx2].respondent_id == d_is_n.iloc[idx1].respondent_id and \
           d_nn.iloc[idx2].group_number == d_is_n.iloc[idx1].group_number:
            merge = False
    if merge:
        temp_d = d_is_n.iloc[idx1]
        result = pd.concat([result, temp_d.to_frame().T], ignore_index=True)
# group by respondent_id and is_member
result = pd.DataFrame({'group_number': result.groupby(["respondent_id", "is_member"]).size()}).reset_index()
print(result)
So, here's what I ended up doing. Maybe not ideal, but it seems to work. :)
import pandas as pd

rg = pd.read_csv('./in_file.csv')
rg_g = rg.groupby(['respondent_id', 'group_number'])
in_g = rg_g.filter(lambda g: g.respondent_id.iloc[0] in g.member_id.values)
out_g = rg_g.filter(lambda g: g.respondent_id.iloc[0] not in g.member_id.values)
my_count = in_g.groupby('respondent_id').group_number.nunique()
other_count = out_g.groupby('respondent_id').group_number.nunique()
pd.concat([my_count, other_count], axis=1).to_csv('./out_file.csv')
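
A more direct route, as a hedged sketch against the same rg frame: flag each row where the respondent appears as a member, reduce each (respondent_id, group_number) pair to one boolean, then count both outcomes at once.

flags = (rg.assign(is_self=rg['respondent_id'] == rg['member_id'])
           .groupby(['respondent_id', 'group_number'])['is_self'].any()
           .reset_index())
counts = (flags.groupby(['respondent_id', 'is_self']).size()
               .unstack(fill_value=0)
               .reindex(columns=[False, True], fill_value=0))
counts.columns = ['other_groups', 'my_groups']  # False -> other groups, True -> own groups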

For-loop using dictionary key reference not working

I have never used Python before but have decided to start learning it by manipulating some market data. I am having trouble with the dictionary structures. In the code for read_dict_price below, the statement dict_price_recalc[price_id][year_to_index(year), Q] = float(line2)/7.5 assigns float(line2)/7.5 to all arrays, regardless of their price_id affiliation. I wonder if it is because I didn't initialize dict_price correctly.
def read_dict_price(dat_filename, dict_price):
    ## Load data set
    dat_file = open(dat_filename, 'r')
    ## Copy arr_price
    dict_price_recalc = dict_price
    ## Iterate through each row in the data set, assigning values to variables
    for line in dat_file:
        year = int(line[11:15])
        price_id = line[0:4]
        Q = 0
        Q1 = line[19:21]
        Q2 = line[23:25]
        Q3 = line[27:29]
        Q4 = line[31:33]
        ## Truncate year_list to prepare for another run of the nested loop
        year_list = []
        year_list[:] = []
        ## Repopulate
        year_list = [Q1, Q2, Q3, Q4]
        #### This is where the mistake happens ####
        ## Iterate through each row in year_list, populating dict_price_recalc with price data
        for line2 in year_list:
            dict_price_recalc[price_id][year_to_index(year), Q] = float(line2)/7.5
            Q += 1
    return dict_price_recalc
My code initializing dict_price is below:
def init_dict_price(dat_filename):
    price_id = {}
    dat_file = open(dat_filename, 'r')
    np_array = np.zeros(shape=(100, 4))  # Zeros as placeholders
    np_array[:] = np.NaN
    for line in dat_file:
        key = line[:11]
        price_id[key] = np_array
    return price_id
I appreciate any pointers you can provide.
The line price_id[key] = np_array sets the same array for every key, so every key points to the same array. You probably meant price_id[key] = np_array.copy().
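
A minimal sketch demonstrating the aliasing (hypothetical keys, not from the original data):

import numpy as np

shared = np.zeros(3)
d = {'a': shared, 'b': shared}   # both keys reference the same array
d['a'][0] = 99
print(d['b'][0])                 # 99.0 -- the write is visible under both keys

d2 = {'a': shared.copy(), 'b': shared.copy()}  # independent copies
d2['a'][1] = 42
print(d2['b'][1])                # 0.0 -- unaffected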
