Merge 3 multiindex dataframes to make one multiindex dataframe - python

I have 3 dataframes:
df1 is
Total Total Total
Tool Technology One Two Three
Alt AI 15 6 9
df2 is
Check Check Check
Tool Technology One Two Three
Alt AI 10 4 6
df3 is
Uncheck Uncheck Uncheck
Tool Technology One Two Three
Alt AI 18 11 7
After merging the final data frame should be like
Total Total Total Check Check Check Uncheck Uncheck Uncheck
Tool Technology One Two Three One Two Three One Two Three
Alt AI 15 6 9 10 4 6 18 11 7
How can I achieve this?

Here are three approaches to merging that work with multi-index rows. It's not clear whether your DataFrames have multi-index columns as well; please provide the data using the to_dict() method.
import io
import pandas as pd
# header=1 skips the first line and uses the second line as the column names
df1 = pd.read_csv(io.StringIO(""" Total Total Total
Tool Technology One Two Three
Alt AI 15 6 9 """), sep=r"\s+", header=1).set_index(["Tool","Technology"])
df1 = df1.rename(columns={c:f"Total {c}" for c in df1.columns})
df2 = pd.read_csv(io.StringIO(""" Check Check Check
Tool Technology One Two Three
Alt AI 10 4 6"""), sep=r"\s+", header=1).set_index(["Tool","Technology"])
df2 = df2.rename(columns={c:f"Check {c}" for c in df2.columns})
# three equivalent ways to combine frames that share the same row index
print(pd.concat([df1,df2], axis=1).to_string())
print(df1.join(df2).to_string())
print(df1.merge(df2, on=["Tool","Technology"]).to_string())
output
Total One Total Two Total Three Check One Check Two Check Three
Tool Technology
Alt AI 15 6 9 10 4 6
Total One Total Two Total Three Check One Check Two Check Three
Tool Technology
Alt AI 15 6 9 10 4 6
Total One Total Two Total Three Check One Check Two Check Three
Tool Technology
Alt AI 15 6 9 10 4 6
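If the goal is a two-level column header as in the expected output of the question, one possible sketch is to pass keys to pd.concat. This assumes each frame has plain single-level columns One, Two, Three and the same (Tool, Technology) row index; the variable names total, check and uncheck are made up for illustration.

```python
import pandas as pd

# Shared two-level row index, as in the question
idx = pd.MultiIndex.from_tuples([("Alt", "AI")], names=["Tool", "Technology"])
cols = ["One", "Two", "Three"]

# Hypothetical stand-ins for df1, df2 and df3 with single-level columns
total = pd.DataFrame([[15, 6, 9]], index=idx, columns=cols)
check = pd.DataFrame([[10, 4, 6]], index=idx, columns=cols)
uncheck = pd.DataFrame([[18, 11, 7]], index=idx, columns=cols)

# keys= adds an outer column level, giving (Total, One), (Total, Two), ...
merged = pd.concat([total, check, uncheck], axis=1,
                   keys=["Total", "Check", "Uncheck"])
print(merged.to_string())
```

If the three frames already carry the Total/Check/Uncheck level in their columns, a plain pd.concat([df1, df2, df3], axis=1) without keys should be enough.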

Related

How to change only second row in multiple groups of a dataframe

I would like for each group in a data frame df_task containing three rows, to modify the second row of the column Task.
import pandas as pd
df_task = pd.DataFrame({'Days':[5,5,5,20,20,20,10,10],
'Task':['Programing','Presentation','Training','Development','Presentation','Workshop','Coding','Communication']},)
df_task.groupby(["Days"])
This is the expected output. If a group contains three rows, the value of Task from the first row is appended to the value of Task from the second row, as shown in the new column New_Task. If a group has two rows, nothing is modified:
Days Task New_Task
0 5 Programing Programing
1 5 Presentation Presentation,Programing
2 5 Training Training
3 20 Development Development
4 20 Presentation Presentation,Development
5 20 Workshop Workshop
6 10 Coding Coding
7 10 Communication Communication
Your requirement is pretty straightforward. Try:
import numpy as np
groups = df_task.groupby('Days')
# enumeration of the rows within groups
enums = groups.cumcount()
# sizes of the groups broadcast to each row
sizes = groups['Task'].transform('size')
# update only the second row of groups with more than two rows
df_task['New_Task'] = np.where(enums.eq(1) & sizes.gt(2),
                               df_task['Task'] + ',' + groups['Task'].shift(fill_value=''),
                               df_task['Task'])
print(df_task)
Output:
Days Task New_Task
0 5 Programing Programing
1 5 Presentation Presentation,Programing
2 5 Training Training
3 20 Development Development
4 20 Presentation Presentation,Development
5 20 Workshop Workshop
6 10 Coding Coding
7 10 Communication Communication
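To see why the condition selects the right rows, it can help to print the helper Series on their own. The sketch below rebuilds df_task from the question; the name mask is introduced here only for illustration.

```python
import pandas as pd

df_task = pd.DataFrame({
    'Days': [5, 5, 5, 20, 20, 20, 10, 10],
    'Task': ['Programing', 'Presentation', 'Training', 'Development',
             'Presentation', 'Workshop', 'Coding', 'Communication'],
})

groups = df_task.groupby('Days')
enums = groups.cumcount()                  # position of each row inside its group
sizes = groups['Task'].transform('size')   # size of the group each row belongs to
mask = enums.eq(1) & sizes.gt(2)           # second row of groups with 3+ rows

print(enums.tolist())  # [0, 1, 2, 0, 1, 2, 0, 1]
print(sizes.tolist())  # [3, 3, 3, 3, 3, 3, 2, 2]
print(mask.tolist())   # [False, True, False, False, True, False, False, False]
```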

how to iterate over a list with condition

I know this question might be a bit inappropriate for this platform but I have nowhere else to ask for help.
I'm new to python and I'm trying to learn how to iterate over a list with some conditions. I have the next problem - for each unique link from Where to Where, I want to choose one of the most profitable suppliers. A profitable supplier is a supplier that most of the days of the week turned out to be cheaper (that is, had a lower cost) than other suppliers. The dataset is the following where columns are: 1st-From, 2nd-To, 3rd-Day in a week, 4th-supplier's number, 5th-Cost.
To solve this task, I've decided firstly to create a new column and list with unique routes.
df_routes['route'] = df_routes['From'] + '-' + df_routes['Where']
routes = df_routes['route'].unique()
len(routes)
And then iterate over it but I do not fully understand how the structure should look like. My guess is that it should be something like this:
for i, route in enumerate(routes):
    x = df_routes[df_routes['route'] == route]
    if x['supplier'].nunique() == 1:
        print(route, x['supplier'].iloc[0])
    else:
        ...
I don't know how to structure it further or whether this is the right structure. So what should it look like?
I will really appreciate any help (tips, hints, snippets of code) on this question.
This is solved more efficiently with pandas functions than with a loop.
Let df be a portion of your dataframe for the first two routes. First we sort by cost and group by the route and the 'Day'. This will tell us for each day and each route which supplier is the cheapest:
df1 = df.sort_values('Cost', ascending = True).groupby(['From','To', 'Day']).first()
df1 looks like this:
              Supplier   Cost
From To  Day
BGW  MOW 1           3  75910
         2           3  75990
         3           3  27340
         4           3  75990
         5          11  19880
         6           3  75440
         7          11  24740
OSS  UUS 1          47  65650
         2          47  47365
         3          47  70635
         4          47  47365
         5          47  62030
         6          47  62030
         7          47  71010
Next we count the number of mentions for each supplier for each route:
df2 = df1.groupby(['From','To'])['Supplier'].value_counts().rename('days').reset_index(level=2)
df2 looks like this:
          Supplier  days
From To
BGW  MOW         3     5
     MOW        11     2
OSS  UUS        47     7
For example, on the first route supplier 3 was the cheapest on 5 days and supplier 11 on 2 days.
Now we just pick the first (most-mentioned) supplier for each route:
df3 = df2.groupby(['From','To']).first()
df3 is the final output and looks like this:
          Supplier  days
From To
BGW  MOW         3     5
OSS  UUS        47     7
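The same idea can be tried end to end on a small table. The data below is invented purely for illustration (one route, three days, two suppliers), and this sketch uses idxmin instead of sort-and-first to pick the cheapest row per day, which is a swapped-in variant rather than the exact steps above.

```python
import pandas as pd

# Hypothetical toy data: one route A-B, three days, two suppliers
df = pd.DataFrame({
    "From": ["A"] * 6,
    "To": ["B"] * 6,
    "Day": [1, 1, 2, 2, 3, 3],
    "Supplier": [1, 2, 1, 2, 1, 2],
    "Cost": [10, 20, 30, 15, 5, 50],
})

# Row with the lowest cost for every (route, day)
cheapest = df.loc[df.groupby(["From", "To", "Day"])["Cost"].idxmin()]

# For every route, the supplier that was cheapest on the most days
best = cheapest.groupby(["From", "To"])["Supplier"].agg(
    lambda s: s.value_counts().idxmax()
)
print(best)
```

Here supplier 1 is cheapest on days 1 and 3 and supplier 2 on day 2, so supplier 1 is picked for route A-B. Ties are resolved by whichever supplier value_counts lists first.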
Group the dataframe by the columns ['From', 'To', 'Day'] and aggregate 'Cost' with min to get the lowest cost for each route and day:
df.groupby(['From','To', 'Day'])['Cost'].min().reset_index()

Python: Clustering with grouped data

With grouped data I mean the following: Assume we have a data set which is grouped by a single feature, e.g. customer data, which is grouped by the single customer:
Customer | Purchase Nr | Item | Paid Amount ($)
1 1 TShirt 15
1 2 Trousers 25
1 3 Scarf 10
2 1 Underwear 5
2 2 Dress 35
2 3 Trousers 30
2 4 TShirt 10
3 1 TShirt 8
3 2 Socks 5
4 1 Shorts 13
I want to find clusters in such a way that all of a customer's purchases are in one single cluster, in other words, that a customer does not appear in two clusters.
I thought about grouping the data set by customer with a groupby, though it is difficult to express all the information in the columns for one customer in only one row. Further, the order of purchases is important to me, e.g. whether a T-shirt was bought first or second.
Is there any cluster algorithm which includes information about groups like this?
Thank you!

Fill DataFrame row values based on another dataframe row's values pandas

DataFrame1:
Device MedDescription Quantity
RWCLD Acetaminophen (TYLENOL) 325 mg Tab 54
RWCLD Ampicillin Inj (AMPICILLIN) 2 g Each 13
RWCLD Betamethasone Inj *5mL* (CELESTONE SOLUSPAN) 30 mg (5 mL) Each 2
RWCLD Calcium Carbonate Chew (500mg) (TUMS) 200 mg Tab 17
RWCLD Carboprost Inj *1mL* (HEMABATE) 250 mcg (1 mL) Each 5
RWCLD Chlorhexidine Gluc Liq *UD* (PERIDEX/PERIOGARD) 0.12 % (15 mL) Each 5
Data Frame2:
Device DrwSubDrwPkt MedDescription BrandName MedID PISAlternateID CurrentQuantity Min Max StandardStock ActiveOrders DaysUnused
RWC-LD RWC-LD_MAIN Drw 1-Pkt 12 Mag/AlOH/Smc 200-200-20/5 *UD* (MYLANTA/MAALOX) (30 mL) Each MYLANTA/MAALOX A03518 27593 7 4 10 N Y 3
RWC-LD RWC-LD_MAIN Drw 1-Pkt 20 ceFAZolin in Dextrose(ISO-OS) (ANCEF/KEFZOL) 1 g (50 mL) Each ANCEF/KEFZOL A00984 17124 6 5 8 N N 2
RWC-LD RWC-LD_MAIN Drw 1-Pkt 22 Clindamycin Phosphate/D5W (CLEOCIN) 900 mg (50 mL) IV Premix CLEOCIN A02419 19050 7 6 8 N N 2
What I want to do is append DataFrame2 values to DataFrame1 ONLY if the 'MedDescription' matches. When it finds a match, I would like to add only certain columns from DataFrame2 [Min, Max, DaysUnused], which are all integers.
I had an iterative solution where I access the dataframe 1 object 1 row at a time and then check for a match with dataframe 2, once found I append the column numbers from there to the original dataFrame.
Is there a better way? It is making my computer slow to a crawl as I have thousands upon thousands of rows.
It sounds like you want to merge the target columns ('MedDescription', 'Min', 'Max', 'DaysUnused') into df1 based on a matching 'MedDescription'.
I believe the best way to do this is as follows:
target_cols = ['MedDescription', 'Min', 'Max', 'DaysUnused']
merged = df1.merge(df2[target_cols], on='MedDescription', how='left')
how='left' ensures that all the data in df1 is returned, and only the target columns in df2 are appended if MedDescription matches.
Note: It is easier for others if you copy the results of df1/df2.to_dict(). The data above is difficult to parse.
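As a rough illustration of what how='left' does, here is a sketch on two tiny made-up frames. The drug names and numbers are hypothetical, and the column name DaysUnused is assumed from the header shown in the question.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames
df1 = pd.DataFrame({
    "MedDescription": ["Drug A", "Drug B"],
    "Quantity": [54, 13],
})
df2 = pd.DataFrame({
    "MedDescription": ["Drug B", "Drug C"],
    "BrandName": ["B-brand", "C-brand"],
    "Min": [4, 5],
    "Max": [10, 8],
    "DaysUnused": [3, 2],
})

target_cols = ["MedDescription", "Min", "Max", "DaysUnused"]
# Every row of df1 is kept; Min/Max/DaysUnused are filled only where
# MedDescription matches, and are NaN otherwise
merged = df1.merge(df2[target_cols], on="MedDescription", how="left")
print(merged)
```

Because unmatched rows receive NaN, the added integer columns are converted to floats in the result.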
This sounds like an opportunity to use Pandas' built-in functions for joining datasets. You should be able to join on MedDescription with the desired columns from DataFrame2. The join function in Pandas is very efficient and should far outperform your method of looping through the rows.
Pandas has documentation on merging datasets that includes some good examples, and you can find ample literature on the concepts of joins in SQL tutorials.
pd.merge(ld,ldAc,on='MedDescription',how='outer')
This is the way I used to join the 2 DataFrames, it seems to work, although it deleted one of the Indexes that contained the devices.

Automated combinatorial DataFrame generation in Python/pandas

I'm quite new to pandas and python, and I'm coming from a background in biochemistry and drug discovery. One frequent task that I'd like to automate is the conversion of a list of combination of drug treatments and proteins to a format that contains all such combinations.
For instance, if I have a DataFrame containing a given set of combinations:
https://github.com/colinhiggins/dillydally/blob/master/input.csv, I'd like to turn it into https://github.com/colinhiggins/dillydally/blob/master/output.csv such that each protein (1, 2, and 3) is copied n times to an output DataFrame, where the number of rows, n, is the number of combinations of drugs and drug concentrations plus one for a no-drug row for each protein.
Ideally, the degree of combination would be dictated by some other table that indicates relationships, for example if proteins 1 and 2 are to be treated with drugs 1, 2, and 3 but that protein 2 isn't treated with any drugs.
I'm thinking some kind of nested for loop is going to be required, but I can't wrap my head around just quite how to start it.
Consider the following solution
from itertools import product
import pandas
protein = ['protein1' , 'protein2' , 'protein3' ]
drug = ['drug1' , 'drug2', 'drug3']
drug_concentration = [100,30,10]
df = pandas.DataFrame.from_records(list(product(protein, drug, drug_concentration)), columns=['protein', 'drug', 'drug_concentration'])
>>> df
protein drug drug_concentration
0 protein1 drug1 100
1 protein1 drug1 30
2 protein1 drug1 10
3 protein1 drug2 100
4 protein1 drug2 30
5 protein1 drug2 10
6 protein1 drug3 100
7 protein1 drug3 30
8 protein1 drug3 10
9 protein2 drug1 100
10 protein2 drug1 30
11 protein2 drug1 10
12 protein2 drug2 100
13 protein2 drug2 30
14 protein2 drug2 10
15 protein2 drug3 100
16 protein2 drug3 30
17 protein2 drug3 10
18 protein3 drug1 100
19 protein3 drug1 30
20 protein3 drug1 10
21 protein3 drug2 100
22 protein3 drug2 30
23 protein3 drug2 10
24 protein3 drug3 100
25 protein3 drug3 30
26 protein3 drug3 10
This is basically a Cartesian product you're after, which is what the product function in the itertools module provides. I'm admittedly confused about why you want the empty rows that just list the proteins with NaNs in the other columns. I'm not sure whether that was intentional or accidental. If the datatypes were uniform and numeric, this would be similar to what's known as a meshgrid.
I've worked through part of this with the help of the question "add one row in a pandas.DataFrame", using the method recommended by ShikharDua of creating a list of dicts, each dict corresponding to a row in the eventual DataFrame.
The code is:
import pandas
data = pandas.read_csv('input.csv')
dict1 = {"protein":"","drug":"","drug_concentration":""} #should be able to get this automatically using the dataframe columns, I think
rows_list = []
for unique_protein in data.protein.unique():
    # one no-drug row per protein
    dict1 = {"protein":unique_protein,"drug":"","drug_concentration":""}
    rows_list.append(dict1)
    # one row per drug and concentration for this protein
    for unique_drug in data.drug.unique():
        for unique_drug_conc in data.drug_concentration.unique():
            dict1 = {"protein":unique_protein,"drug":unique_drug,"drug_concentration":unique_drug_conc}
            rows_list.append(dict1)
df = pandas.DataFrame(rows_list)
df
It isn't as flexible as I was hoping, since the extra row for each protein with no drugs is hard-coded into the nested for loops, but at least it's a start. I guess I can add some if statements within each for loop.
I've improved upon the earlier version:
Enclosed it in a function.
Added a check for proteins that won't be treated with drugs, read from another input CSV file that contains the same proteins in column A and either true or false in column B, labeled "treat with drugs".
Skips null values. I noticed that my example input.csv had equal-length columns, and the function started going a little nuts with NaN rows if they had unequal lengths.
Initial dictionary keys are set from the columns of the initial input CSV instead of being hard-coded.
I tested this with some real data (hence the change from input.csv to realinput.csv), and it works quite nicely.
Code for a fully functional python file follows:
import pandas
import os
os.chdir("path_to_directory_containing_realinput_and_boolean_file")
realinput = pandas.read_csv('realinput.csv')
rows_list = []
dict1 = dict.fromkeys(realinput.columns,"")
prot_drug_bool = pandas.read_csv('protein_drug_bool.csv')
prot_drug_bool.index = prot_drug_bool.protein
prot_drug_bool = prot_drug_bool.drop("protein",axis=1)
def null_check(value):
    return pandas.isnull(value)
def combinator(input_table):
    for unique_protein in input_table.protein.unique():
        # one no-drug row per protein
        dict1 = dict.fromkeys(realinput.columns,"")
        dict1['protein']=unique_protein
        rows_list.append(dict1)
        # .ix was removed from pandas; .loc selects the row by label and
        # .iloc[0] reads the single remaining true/false column
        if prot_drug_bool.loc[unique_protein].iloc[0]:
            for unique_drug in input_table.drug.unique():
                if not null_check(unique_drug):
                    for unique_drug_conc in input_table.drug_concentration.unique():
                        if not null_check(unique_drug_conc):
                            dict1 = dict.fromkeys(realinput.columns,"")
                            dict1['protein']=unique_protein
                            dict1['drug']=unique_drug
                            dict1['drug_concentration']=unique_drug_conc
                            rows_list.append(dict1)
    df = pandas.DataFrame(rows_list)
    return df
df2 = combinator(realinput)
df2.to_csv('realoutput.csv')
I'd still like to make it more versatile by getting away from hard-coding any dictionary keys and letting the user-defined input.csv column headers dictate the output. Also, I'd like to move away from the defined three-column setup to handle any number of columns.
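One possible direction for handling any number of columns is to build the combinations from whatever columns the input has, using itertools.product as suggested earlier. This is only a sketch: the function name combine_columns and the example values are hypothetical, and it does not add the extra no-drug row per protein or consult the true/false table.

```python
from itertools import product

import pandas as pd


def combine_columns(table):
    """Return every combination of the non-null unique values of each column."""
    # One array of unique, non-null values per input column
    uniques = [table[col].dropna().unique() for col in table.columns]
    # The column headers of the input dictate the headers of the output
    return pd.DataFrame(list(product(*uniques)), columns=table.columns)


# Hypothetical input with columns of unequal effective length
example = pd.DataFrame({
    "protein": ["p1", "p2", "p3"],
    "drug": ["d1", "d2", None],
    "drug_concentration": [100.0, 30.0, None],
})
combos = combine_columns(example)
print(len(combos))  # 3 proteins x 2 drugs x 2 concentrations = 12
```

Adding a fourth column to the input would extend the product without any change to the function.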
