how to iterate over a list with condition - python

I know this question might be a bit inappropriate for this platform but I have nowhere else to ask for help.
I'm new to Python and I'm trying to learn how to iterate over a list with some conditions. Here is my problem: for each unique route (from one point to another), I want to choose the most profitable supplier. A profitable supplier is one that was cheaper (i.e. had a lower cost) than the other suppliers on most days of the week. The dataset has the following columns: 1st - From, 2nd - To, 3rd - Day of the week, 4th - supplier's number, 5th - Cost.
To solve this task, I've decided firstly to create a new column and list with unique routes.
df_routes['route'] = df_routes['From'] + '-' + df_routes['Where']
routes = df_routes['route'].unique()
len(routes)
And then iterate over it, but I do not fully understand what the structure should look like. My guess is that it should be something like this:
for i, route in enumerate(routes):
    x = df_routes[df_routes['route'] == route]
    if x['supplier'].nunique() == 1:
        print(route, supplier)
    else:
        ...
I don't know how to structure it further, or whether this is even the right structure. So how should it look?
I will really appreciate any help (tips, hints, snippets of code) on this question.

This is solved more efficiently with pandas functions than with looping.
Let df be a portion of your dataframe for the first two routes. First we sort by cost and group by the route and the 'Day'. This will tell us for each day and each route which supplier is the cheapest:
df1 = df.sort_values('Cost', ascending = True).groupby(['From','To', 'Day']).first()
df1 looks like this:
Supplier Cost
From To Day
BGW MOW 1 3 75910
2 3 75990
3 3 27340
4 3 75990
5 11 19880
6 3 75440
7 11 24740
OSS UUS 1 47 65650
2 47 47365
3 47 70635
4 47 47365
5 47 62030
6 47 62030
7 47 71010
Next we count the number of mentions for each supplier for each route:
df2 = df1.groupby(['From','To'])['Supplier'].value_counts().rename('days').reset_index(level=2)
df2 looks like this:
Supplier days
From To
BGW MOW 3 5
MOW 11 2
OSS UUS 47 7
e.g. for the first route, supplier 3 was the cheapest for 5 days and supplier 11 for 2 days
Now we just pick the first (most-mentioned) supplier for each route:
df3 = df2.groupby(['From','To']).first()
df3 is the final output and looks like this:
Supplier days
From To
BGW MOW 3 5
OSS UUS 47 7
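
For reference, a condensed sketch of the same idea in a single pass, assuming the column names From, To, Day, Supplier and Cost used above (ties between suppliers are broken arbitrarily):
# cheapest row per route and day, then the supplier that wins the most days per route
cheapest = df.loc[df.groupby(['From', 'To', 'Day'])['Cost'].idxmin()]
best = (cheapest.groupby(['From', 'To'])['Supplier']
                .agg(lambda s: s.value_counts().idxmax())
                .reset_index())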

Group the dataframe by the columns ['From', 'To', 'Day'] and use the aggregate min on 'Cost' to get the result:
df.groupby(['From', 'To', 'Day'], as_index=False)['Cost'].min()

Related

looking for better iteration approach for slicing a dataframe

First post: I apologize in advance for sloppy wording (and possibly poor searching if this question has been answered ad nauseam elsewhere - maybe I don't know the right search terms yet).
I have data in 10-minute chunks and I want to perform calculations on a column ('input') grouped by minute (i.e. 10 separate 60-second blocks - not a rolling 60 second period) and then store all ten calculations in a single list called output.
The 'seconds' column records the second from 1 to 600 in the 10-minute period. If no data was entered for a given second, there is no row for that number of seconds. So, some minutes have 60 rows of data, some have as few as one or two.
Note: the calculation (my_function) is not basic so I can't use groupby and np.sum(), np.mean(), etc. - or at least I can't figure out how to use groupby.
I have code that gets the job done but it looks ugly to me so I am sure there is a better way (probably several).
output = []
seconds_slicer = 0
for i in np.linspace(1, 10, 10):
    seconds_slicer += 60
    minute_slice = df[(df['seconds'] > (seconds_slicer - 60)) &
                      (df['seconds'] <= seconds_slicer)]
    calc = my_function(minute_slice['input'])
    output.append(calc)
If there is a cleaner way to do this, please let me know. Thanks!
Edit: Adding sample data and function details:
seconds input
1 1 0.000054
2 2 -0.000012
3 3 0.000000
4 4 0.000000
5 5 0.000045
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))
For this answer, we're going to repurpose "Bin pandas dataframe by every X rows".
We'll create a dataframe with missing data in the 'seconds' column, which is how I understand your data to look based on the description given:
import numpy as np
import pandas as pd

secs = [1,2,3,4,5,6,7,8,9,11,12,14,15,17,19]
data = [np.random.randint(-25,54)/100000 for _ in range(15)]
df = pd.DataFrame(data=zip(secs, data), columns=['seconds','input'])
df
seconds input
0 1 0.00017
1 2 -0.00020
2 3 0.00033
3 4 0.00052
4 5 0.00040
5 6 -0.00015
6 7 0.00001
7 8 -0.00010
8 9 0.00037
9 11 0.00050
10 12 0.00000
11 14 -0.00009
12 15 -0.00024
13 17 0.00047
14 19 -0.00002
I didn't create 600 rows, but that's okay, we'll say we want to bin every 5 seconds instead of every 60. Now, because we're just trying to use equal time measures for grouping, we can just use floor division to see which bin each time interval would end up in. (In your case, you'd divide by 60 instead)
grouped = df.groupby(df['seconds'] // 5).apply(realized_volatility).drop('seconds', axis=1)  # we drop the extra 'seconds' column because we don't care about the root sum of squares of seconds in the df
grouped
input
seconds
0 0.000441
1 0.000372
2 0.000711
3 0.000505
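
A slightly tighter variant of the same idea, shown only as a sketch with the same df and function as above, applies realized_volatility to the 'input' column directly so there is no extra 'seconds' column to drop:
grouped = df.groupby(df['seconds'] // 5)['input'].apply(realized_volatility)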

find first unique items selected by user and ranking them in order of user selection by date

I am trying to identify only the first orders of unique "items" purchased by "test" customers in a simplified sample dataframe, created below:
df = pd.DataFrame({"cust": ['A55', 'A55', 'A55', 'B080', 'B080', 'D900', 'D900', 'D900', 'D900',
                            'C019', 'C019', 'Z09c', 'A987', 'A987', 'A987'],
                   "date": ['01/11/2016', '01/11/2016', '01/11/2016', '08/17/2016', '6/17/2016',
                            '03/01/2016', '04/30/2016', '05/16/2016', '09/27/2016', '04/20/2016',
                            '04/29/2016', '07/07/2016', '1/29/2016', '10/17/2016', '11/11/2016'],
                   "item": ['A10BABA', 'A10BABA', 'A10DBDB', 'A9GABA', 'A11AD', 'G198A', 'G198A',
                            'F673', 'A11BB', 'CBA1', 'CBA1', 'DA21', 'BG10A', 'CG10BA', 'BG10A']})
df.date = pd.to_datetime(df.date)
df = df.sort_values(["cust", "date"], ascending = True)
The desired output would look as shown in the picture: all unique items ranked by date of purchase in a new column called "cust_item_rank", with any repeated (duplicated) orders of the same item by the same user removed.
To clarify further, items purchased on the same date by the same user should share the same order/rank, as shown in the picture for customer A55 (A10BABA and A10DBDB are both ranked 1).
I have spent a fair bit of time using a combination of group by and/or rank operations but unsuccessful thus far. As an example:
df["cust_item_rank"] = df.groupby("cust")["date"]["item"].rank(ascending = 1, method = "min")
Yields an error (Exception: Column(s) date already selected).
Can somebody please guide me to the desired solution here?
# Remove duplicates
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
df2['cust_item_rank'] = df2.groupby('cust').cumcount().add(1)
df2
cust date item cust_item_rank
0 A55 2016-01-11 A10BABA 1
1 A55 2016-01-11 A10DBDB 2
2 A987 2016-01-29 BG10A 1
3 A987 2016-10-17 CG10BA 2
4 B080 2016-06-17 A11AD 1
5 B080 2016-08-17 A9GABA 2
6 C019 2016-04-20 CBA1 1
7 D900 2016-03-01 G198A 1
8 D900 2016-05-16 F673 2
9 D900 2016-09-27 A11BB 3
10 Z09c 2016-07-07 DA21 1
To solve this question, I built upon the excellent initial answer by cs95, calling on the rank function in pandas as follows:
#remove duplicates as recommended by cs95
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
#rank by date after grouping by customer
df2["cust_item_rank"]= df2.groupby(["cust"])["date"].rank(ascending=1,method='dense').astype(int)
This resulted in the desired output.
It appears that this problem can be solved using either the "min" or the "dense" ranking method, but I chose "dense" to avoid skipping any ranks.
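
For what it's worth, a compact equivalent sketch of the two steps above (assuming the same df): drop_duplicates keeps the first occurrence of each cust/item pair, and a dense rank by date does the numbering:
df2 = df.drop_duplicates(['cust', 'item']).copy()
df2['cust_item_rank'] = df2.groupby('cust')['date'].rank(method='dense').astype(int)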

Pair of employees that worked together for the longest period of time - Python/Pandas

I recently had to write some code that returns the pair of employees that have worked the most together on a common project. This is the code I came up with:
Note 1: Null is read by the program as "Today"
Note 2: The data comes from a .txt file in this form:
EmpID,ProjectID,DateFrom,DateTo
1,101,2014-11-01,2015-05-01
1,103,2013-11-01,2016-05-01
2,101,2013-12-06,2014-10-06
2,103,2014-06-05,2015-05-14
3,100,2016-03-01,2018-07-03
3,102,2015-06-04,2017-09-04
3,103,2015-06-04,2017-09-04
4,102,2013-11-13,2014-03-13
4,103,2016-02-14,2017-03-15
4,104,2014-10-01,2015-12-01
5,100,2013-03-07,2015-11-07
5,101,2015-07-09,2019-01-19
5,102,2014-03-15,NULL
6,101,2014-03-15,2014-03-16
The problem I currently have is that I need to adapt/change the code to return the pair of employees that have worked together the longest (not on a single project, but across all projects combined). I am having trouble adapting my current code, which runs perfectly fine for what it is, and I am wondering whether I should just scrap all of this and start from the beginning (but that would cost me a lot of time, which I don't have at the moment). In particular, I am having difficulty obtaining the combinations of employees that have worked together on projects.
I would very much appreciate it if anyone can give me any tips! Thanks!
Edit 1: A person in the comments reminded me to mention that overlapping days should only be counted once. For example:
Persons A and B work on two projects for the entirety of June. This should be counted as 30 days of common work in total (across the two projects), not as 60 days from adding the two project durations together.
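(A small side note on the counting rule itself: a sketch of the overlap arithmetic for two date ranges, just to make the 30-days example explicit. The answer below takes a different, row-expansion route.)
from datetime import date

def overlap_days(start1, end1, start2, end2):
    # number of calendar days the two closed intervals share; 0 if they don't overlap
    latest_start = max(start1, start2)
    earliest_end = min(end1, end2)
    return max((earliest_end - latest_start).days + 1, 0)

overlap_days(date(2014, 6, 1), date(2014, 6, 30),
             date(2014, 6, 1), date(2014, 6, 30))  # 30, each shared day counted once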
Here's one of the more straightforward ways I can think of doing this:
1. Expand the timespans to a single row per date.
2. Merge all days on the same project (to get all combinations of people who worked together).
3. Remove duplicated rows of people who work together on the same day, but on different projects.
4. Count how many rows are left within each worker pairing.
Code:
import pandas as pd
import numpy as np
def expand_period_daily(df, start, stop):
    # Allows it to work for one day spans.
    df.loc[df[stop].notnull(), stop] = (df.loc[df[stop].notnull(), stop]
                                        + pd.Timedelta(hours=1))
    real_span = df[[start, stop]].notnull().all(1)
    # Resample timespans to daily fields.
    df['temp_id'] = range(len(df))
    dailydf = (df.loc[real_span, ['temp_id', start, stop]].set_index('temp_id').stack()
               .reset_index(level=-1, drop=True).rename('period').to_frame())
    dailydf = (dailydf.groupby('temp_id').apply(lambda x: x.set_index('period')
                                                .resample('d').asfreq()).reset_index())
    # Merge back other information
    dailydf = (dailydf.merge(df, on=['temp_id'])
               .drop(columns=['temp_id', start, stop]))
    return dailydf
# Make dates, fill missings.
df[['DateFrom', 'DateTo']] = df[['DateFrom', 'DateTo']].apply(pd.to_datetime, errors='coerce')
df[['DateFrom', 'DateTo']] = df[['DateFrom', 'DateTo']].fillna(pd.to_datetime('today').normalize())
dailydf = expand_period_daily(df.copy(), start='DateFrom', stop='DateTo')
# Merge, remove rows of employee with him/herself.
m = (dailydf.merge(dailydf, on=['period', 'ProjectID'])
     .loc[lambda x: x.EmpID_x != x.EmpID_y])
# Ensure A-B and B-A are grouped the same
m[['EmpID_x', 'EmpID_y']] = np.sort(m[['EmpID_x', 'EmpID_y']].to_numpy(), axis=1)
# Remove duplicated projects on same date between employee pairs
m = m.drop_duplicates(['period', 'EmpID_x', 'EmpID_y'])
m.groupby(['EmpID_x', 'EmpID_y']).size().to_frame('Days_Together')
Output:
Days_Together
EmpID_x EmpID_y
1 2 344
3 333
4 78
2 6 2
3 4 396
5 824
Test Case
To give a bit more clarity on how it handles overlaps and combines different projects, here's the following test case:
EmpID ProjectID DateFrom DateTo
0 1 101 2014-11-01 2014-11-15
1 1 103 2014-11-01 2014-11-15
2 1 105 2015-11-02 2015-11-03
3 2 101 2014-11-01 2014-11-15
4 2 103 2014-11-01 2014-11-15
5 2 105 2015-10-02 2015-11-05
6 3 101 2014-11-01 2014-11-15
Employees 1 and 2 perfectly overlap for 15 days on 2 projects in Nov 2014. They then work together for 2 additional days on another project in 2015. Employees 1, 2 and 3 all work together for 15 days on a single project.
Running with this test case we obtain:
Days_Together
EmpID_x EmpID_y
1 2 17
3 15
2 3 15
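
For completeness, a sketch of loading the sample .txt from the question before running the code above (the file name employees.txt is an assumption):
import pandas as pd

# NULL in DateTo means "still ongoing"; it is read as NaN here and filled with today's date above
df = pd.read_csv('employees.txt', na_values=['NULL'])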

Divide 2 columns and create new column with results

I have a data frame with columns:
User_id PQ_played PQ_offered
1 5 15
2 12 75
3 25 50
I need to divide PQ_played by PQ_offered to calculate the % of games played. This is what I've tried so far:
new_df['%_PQ_played'] = df.groupby('User_id').((df['PQ_played']/df['PQ_offered'])*100),as_index=True
I know that I am terribly wrong.
It's much simpler than you think.
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
PQ_offered PQ_played %_PQ_played
User_id
1 15 5 33.333333
2 75 12 16.000000
3 50 25 50.000000
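As an aside, if PQ_offered can ever be zero, a small defensive sketch (not needed for the sample data above) avoids inf values:
import numpy as np

df['%_PQ_played'] = np.where(df['PQ_offered'] > 0,
                             df['PQ_played'] / df['PQ_offered'] * 100,
                             np.nan)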
You can use lambda functions
df.groupby('User_id').apply(lambda x: (x['PQ_played']/x['PQ_offered'])*100)\
.reset_index(1, drop = True).reset_index().rename(columns = {0 : '%_PQ_played'})
You get
User_id %_PQ_played
0 1 33.333333
1 2 16.000000
2 3 50.000000
I totally agree with @mVChr and think you are over-complicating what you need to do. If you are simply trying to add an additional column, then his response is spot on. If you truly need to groupby, it is worth noting that this is typically used for aggregation, e.g. sum(), count(), etc. If, for example, you had several records with non-unique values in the User_id column, then you could create the additional column using
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
and then perform an aggregation. Let's say you wanted to know the average percentage of offered games that were played for each user; you could do something like
new_df = df.groupby('User_id', as_index=False)['%_PQ_played'].mean()
This would yield (numbers are arbitrary)
User_id %_PQ_played
0 1 52.777778
1 2 29.250000
2 3 65.000000

Automated combinatorial DataFrame generation in Python/pandas

I'm quite new to pandas and Python, and I'm coming from a background in biochemistry and drug discovery. One frequent task that I'd like to automate is the conversion of a list of combinations of drug treatments and proteins into a format that contains all such combinations.
For instance, if I have a DataFrame containing a given set of combinations:
https://github.com/colinhiggins/dillydally/blob/master/input.csv, I'd like to turn it into https://github.com/colinhiggins/dillydally/blob/master/output.csv, such that each protein (1, 2, and 3) is copied n times to an output DataFrame, where the number of rows, n, is the number of drugs and drug concentrations plus one for a no-drug row for each protein.
Ideally, the degree of combination would be dictated by some other table that indicates relationships, for example if proteins 1 and 2 are to be treated with drugs 1, 2, and 3 but that protein 2 isn't treated with any drugs.
I'm thinking some kind of nested for loop is going to be required, but I can't wrap my head around just quite how to start it.
Consider the following solution
from itertools import product
import pandas
protein = ['protein1' , 'protein2' , 'protein3' ]
drug = ['drug1' , 'drug2', 'drug3']
drug_concentration = [100,30,10]
df = pandas.DataFrame.from_records(list(product(protein, drug, drug_concentration)),
                                   columns=['protein', 'drug', 'drug_concentration'])
>>> df
protein drug drug_concentration
0 protein1 drug1 100
1 protein1 drug1 30
2 protein1 drug1 10
3 protein1 drug2 100
4 protein1 drug2 30
5 protein1 drug2 10
6 protein1 drug3 100
7 protein1 drug3 30
8 protein1 drug3 10
9 protein2 drug1 100
10 protein2 drug1 30
11 protein2 drug1 10
12 protein2 drug2 100
13 protein2 drug2 30
14 protein2 drug2 10
15 protein2 drug3 100
16 protein2 drug3 30
17 protein2 drug3 10
18 protein3 drug1 100
19 protein3 drug1 30
20 protein3 drug1 10
21 protein3 drug2 100
22 protein3 drug2 30
23 protein3 drug2 10
24 protein3 drug3 100
25 protein3 drug3 30
26 protein3 drug3 10
This is basically a cartesian product you're after, which is the functionality of the product function in the itertools module. I'm admittedly confused why you want the empty rows that just list out the proteins with NaNs in the other columns. Not sure if that was intentional or accidental. If the datatypes were uniform and numeric, this is similar functionality to what's known as a meshgrid.
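If the no-drug rows are wanted after all, one possible way to add them (a sketch reusing the protein list and df built above) is to create them separately and concatenate, using a stable sort so each no-drug row stays at the top of its protein block:
no_drug = pandas.DataFrame({'protein': protein,
                            'drug': [None] * len(protein),
                            'drug_concentration': [None] * len(protein)})
df_full = (pandas.concat([no_drug, df], ignore_index=True)
                 .sort_values('protein', kind='mergesort', ignore_index=True))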
I've worked through part of this with the help of "add one row in a pandas.DataFrame", using the method recommended by ShikharDua of creating a list of dicts, each dict corresponding to a row in the eventual DataFrame.
The code is:
import pandas

data = pandas.read_csv('input.csv')
dict1 = {"protein": "", "drug": "", "drug_concentration": ""}  # should be able to get this automatically using the dataframe columns, I think
rows_list = []
for unique_protein in data.protein.unique():
    dict1 = {"protein": unique_protein, "drug": "", "drug_concentration": ""}
    rows_list.append(dict1)
    for unique_drug in data.drug.unique():
        for unique_drug_conc in data.drug_concentration.unique():
            dict1 = {"protein": unique_protein, "drug": unique_drug, "drug_concentration": unique_drug_conc}
            rows_list.append(dict1)
df = pandas.DataFrame(rows_list)
df
df
It isn't as flexible as I was hoping, since the extra row for each protein with no drugs is hard-coded into the nested for loops, but at least it's a start. I guess I can add some if statements within each for loop.
I've improved upon the earlier version:
- enclosed it in a function
- added a check, driven by another input CSV file that contains the same proteins in column A and either true or false in column B labeled "treat with drugs", for proteins that won't be treated with drugs
- skipped null values. I noticed that my example input.csv had equal-length columns, and the function started going a little nuts with NaN rows if they had unequal lengths
- set the initial dictionary keys from the columns of the initial input CSV instead of hard-coding them
I tested this with some real data (hence the change from input.csv to realinput.csv), and it works quite nicely.
Code for a fully functional python file follows:
import pandas
import os
os.chdir("path_to_directory_containing_realinput_and_boolean_file")
realinput = pandas.read_csv('realinput.csv')
rows_list = []
dict1 = dict.fromkeys(realinput.columns,"")
prot_drug_bool = pandas.read_csv('protein_drug_bool.csv')
prot_drug_bool.index = prot_drug_bool.protein
prot_drug_bool = prot_drug_bool.drop("protein",axis=1)
def null_check(value):
    return pandas.isnull(value)

def combinator(input_table):
    for unique_protein in input_table.protein.unique():
        dict1 = dict.fromkeys(realinput.columns, "")
        dict1['protein'] = unique_protein
        rows_list.append(dict1)
        # .ix is gone from current pandas; .loc plus .any() reads the single boolean flag instead
        if prot_drug_bool.loc[unique_protein].any():
            for unique_drug in input_table.drug.unique():
                if not null_check(unique_drug):
                    for unique_drug_conc in input_table.drug_concentration.unique():
                        if not null_check(unique_drug_conc):
                            dict1 = dict.fromkeys(realinput.columns, "")
                            dict1['protein'] = unique_protein
                            dict1['drug'] = unique_drug
                            dict1['drug_concentration'] = unique_drug_conc
                            rows_list.append(dict1)
    df = pandas.DataFrame(rows_list)
    return df
df2 = combinator(realinput)
df2.to_csv('realoutput.csv')
I'd still like to make it more versatile by getting away from hard-coding any dictionary keys and letting the user-defined input.csv column headers dictate the output. Also, I'd like to move away from the defined three-column setup to handle any number of columns.
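
One possible direction for that, offered only as a sketch (combinator_generic is a hypothetical helper, and it assumes the first column still identifies the protein while every remaining column should be combined exhaustively):
from itertools import product
import pandas

def combinator_generic(input_table):
    key_col, *combo_cols = input_table.columns            # first column names the entity
    rows = []
    for key in input_table[key_col].unique():
        rows.append({key_col: key})                        # the no-treatment row
        values = [input_table[c].dropna().unique() for c in combo_cols]
        for combo in product(*values):
            rows.append({key_col: key, **dict(zip(combo_cols, combo))})
    return pandas.DataFrame(rows, columns=input_table.columns)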
