DataFrame1:
Device MedDescription Quantity
RWCLD Acetaminophen (TYLENOL) 325 mg Tab 54
RWCLD Ampicillin Inj (AMPICILLIN) 2 g Each 13
RWCLD Betamethasone Inj *5mL* (CELESTONE SOLUSPAN) 30 mg (5 mL) Each 2
RWCLD Calcium Carbonate Chew (500mg) (TUMS) 200 mg Tab 17
RWCLD Carboprost Inj *1mL* (HEMABATE) 250 mcg (1 mL) Each 5
RWCLD Chlorhexidine Gluc Liq *UD* (PERIDEX/PERIOGARD) 0.12 % (15 mL) Each 5
DataFrame2:
Device DrwSubDrwPkt MedDescription BrandName MedID PISAlternateID CurrentQuantity Min Max StandardStock ActiveOrders DaysUnused
RWC-LD RWC-LD_MAIN Drw 1-Pkt 12 Mag/AlOH/Smc 200-200-20/5 *UD* (MYLANTA/MAALOX) (30 mL) Each MYLANTA/MAALOX A03518 27593 7 4 10 N Y 3
RWC-LD RWC-LD_MAIN Drw 1-Pkt 20 ceFAZolin in Dextrose(ISO-OS) (ANCEF/KEFZOL) 1 g (50 mL) Each ANCEF/KEFZOL A00984 17124 6 5 8 N N 2
RWC-LD RWC-LD_MAIN Drw 1-Pkt 22 Clindamycin Phosphate/D5W (CLEOCIN) 900 mg (50 mL) IV Premix CLEOCIN A02419 19050 7 6 8 N N 2
What I want to do is append DataFrame2 values to DataFrame1 ONLY if the 'MedDescription' matches. When it finds a match, I would like to add only certain columns from DataFrame2 ['Min', 'Max', 'DaysUnused'], which are all integers.
I had an iterative solution that accesses DataFrame1 one row at a time, checks for a match in DataFrame2, and, once found, appends those columns to the original DataFrame.
Is there a better way? It is slowing my computer to a crawl, as I have thousands upon thousands of rows.
It sounds like you want to merge the target columns ('MedDescription', 'Min', 'Max', 'DaysUnused') into df1 based on a matching 'MedDescription'.
I believe the best way to do this is as follows:
target_cols = ['MedDescription', 'Min', 'Max', 'DaysUnused']
df1 = df1.merge(df2[target_cols], on='MedDescription', how='left')
how='left' ensures that all the data in df1 is returned, and the target columns from df2 are appended only where MedDescription matches.
Note: It is easier for others if you copy the results of df1/df2.to_dict(). The data above is difficult to parse.
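For illustration, a to_dict() dump of the first row of df1 would look like this, and it pastes back losslessly:
>>> df1.head(1).to_dict()
{'Device': {0: 'RWCLD'}, 'MedDescription': {0: 'Acetaminophen (TYLENOL) 325 mg Tab'}, 'Quantity': {0: 54}}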
This sounds like an opportunity to use pandas' built-in functions for joining datasets - you should be able to join on MedDescription with the desired columns from DataFrame2. The join functions in pandas are very efficient and should far outperform your method of looping through the rows.
Pandas has documentation on merging datasets that includes some good examples, and you can find ample literature on the concepts of joins in SQL tutorials.
pd.merge(ld, ldAc, on='MedDescription', how='outer')
This is the way I joined the two DataFrames. It seems to work, although it dropped the index that contained the devices.
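That's expected: merge returns a result with a fresh integer index unless you merge on the index itself. A sketch of one workaround, assuming the devices were the index of ld (the 'Device' index name is my assumption):
# Turn the device index into a column, merge, then restore it as the index
merged = (ld.reset_index()
            .merge(ldAc[target_cols], on='MedDescription', how='left')
            .set_index('Device'))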
I am working with a large dataset that I've stored in a pandas DataFrame. All of the methods I've written to operate on this dataset work on DataFrames, but some of them don't work on GroupBy objects.
I've come to a point in my code where I would like to group all data by author name (which I was able to achieve easily via .groupby()). Unfortunately, this outputs a GroupBy object, which isn't very useful to me when I want to use DataFrame-only methods.
I've searched tons of other posts but haven't found any satisfying answer... how do I convert this GroupBy object back into a DataFrame? (Note: it is much too large for me to manually select the groups and concatenate them into a DataFrame; I need something automated.)
Not exactly sure I understand, so if this isn't what you are looking for, please comment.
Creating a dataframe:
import pandas as pd
df = pd.DataFrame({'author': ['gatsby', 'king', 'michener', 'michener', 'king', 'king', 'tolkein', 'gatsby'],
                   'b': range(13, 21)})
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
#create the groupby object
dfg = df.groupby('author')
In [44]: dfg
Out[44]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002169D24DB20>
#show groupby works using count()
dfg.count()
b
author
gatsby 2
king 3
michener 2
tolkein 1
But I think this is what you want: how to revert dfg back to a DataFrame. You just need to apply some function to it that doesn't change the data. This is one way:
df_reverted = dfg.apply(lambda x: x)
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
This is another way, and it may be faster; note the DataFrame names df and dfg.
df[dfg['b'].transform('count') > 0]
It takes the per-group count via transform and keeps every group with a count greater than zero (so everything); transform returns a boolean Series aligned with the original dataframe, which is then applied as a mask against df.
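A third route (my addition, not from the answers above): iterate the GroupBy and concatenate the pieces, which also rebuilds a plain DataFrame, with rows ordered by group key:
# Each iteration yields a (key, sub-DataFrame) pair; concat stitches the
# sub-frames back into one DataFrame.
df_again = pd.concat(group for _, group in dfg)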
Consider the following dataframe. I have the columns 'product', 'buys', 'buy_again', and 'again_index'.
PS: again_index is buy_again/buys.
product buys buy_again again_index
a 3 2 0.667
b 40 10 0.25
c 2 1 0.5
d 420 70 0.166
e 87 21 0.241
f 28 4 0.142
Now, the numbers for buys and buy_again are very skewed, and it is unfair to compare product d to product a based on the raw index. I want to normalize the data using pandas in such a way that the index can be used to compare one product directly to another, irrespective of whether it is new (e.g., one with just 3 buys) or old (e.g., one with 420 buys), so that the index can be my deciding factor for a product's performance.
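The thread shows no accepted method, so here is a minimal sketch of one standard option, a Bayesian (smoothed) average: each product's ratio is shrunk toward the global repeat rate, so products with few buys can't dominate on a lucky 2-of-3. The smoothing strength k and the choice of prior are my assumptions, not from the question.
import pandas as pd

df = pd.DataFrame({
    'product': list('abcdef'),
    'buys': [3, 40, 2, 420, 87, 28],
    'buy_again': [2, 10, 1, 70, 21, 4],
})

k = 10                                        # pseudo-observations (assumed; tune it)
m = df['buy_again'].sum() / df['buys'].sum()  # global repeat rate used as the prior
# Few buys -> index pulled strongly toward m; many buys -> barely moved.
df['smoothed_index'] = (df['buy_again'] + k * m) / (df['buys'] + k)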
I have 3 dataframes:
df1 is
Total Total Total
Tool Technology One Two Three
Alt AI 15 6 9
df2 is
Check Check Check
Tool Technology One Two Three
Alt AI 10 4 6
df3 is
Uncheck Uncheck Uncheck
Tool Technology One Two Three
Alt AI 18 11 7
After merging, the final DataFrame should look like this:
Total Total Total Check Check Check Uncheck Uncheck Uncheck
Tool Technology One Two Three One Two Three One Two Three
Alt AI 15 6 9 10 4 6 18 11 7
How can I achieve this?
Three approaches to merging work with multi-index rows. It's not clear whether your DataFrames have multi-index columns as well; if they do, please provide them using the to_dict() method.
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO(""" Total Total Total
Tool Technology One Two Three
Alt AI 15 6 9 """), sep=r"\s+", header=1).set_index(["Tool","Technology"])
df1 = df1.rename(columns={c: f"Total {c}" for c in df1.columns})
df2 = pd.read_csv(io.StringIO(""" Check Check Check
Tool Technology One Two Three
Alt AI 10 4 6"""), sep=r"\s+", header=1).set_index(["Tool","Technology"])
df2 = df2.rename(columns={c: f"Check {c}" for c in df2.columns})

print(pd.concat([df1, df2], axis=1).to_string())
print(df1.join(df2).to_string())
print(df1.merge(df2, on=["Tool","Technology"]).to_string())
output
Total One Total Two Total Three Check One Check Two Check Three
Tool Technology
Alt AI 15 6 9 10 4 6
Total One Total Two Total Three Check One Check Two Check Three
Tool Technology
Alt AI 15 6 9 10 4 6
Total One Total Two Total Three Check One Check Two Check Three
Tool Technology
Alt AI 15 6 9 10 4 6
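The code above stops at df2; assuming df3 is built the same way (with an "Uncheck" prefix, mirroring the question's third frame), it chains into the same concat:
df3 = pd.read_csv(io.StringIO(""" Uncheck Uncheck Uncheck
Tool Technology One Two Three
Alt AI 18 11 7"""), sep=r"\s+", header=1).set_index(["Tool","Technology"])
df3 = df3.rename(columns={c: f"Uncheck {c}" for c in df3.columns})
print(pd.concat([df1, df2, df3], axis=1).to_string())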
I currently have a massive dataset with a large number of rows, and I wanted to create a smaller dataframe that pulls only 2 columns from the larger one, plus how many times each name occurred in that chapter (here, 'Occurrence').
The code below is what I am using:
df1 = (Dec16.groupby(["BNF Chapter", "Name"]).size().reset_index(name="Occurrence"))
df1
It produces this:
BNF Chapter Name Occurrence
1 Aluminium hydroxide 2
1 Aluminium hydroxide + Magnesium trisilicate 2
1 Alverine 702
.......
21 Polihexanide 2
21 Potassium hydroxide 32
21 Sesame oil 22
21 Sodium chloride 222
What I would like to get is the top 10 most frequently occurring names for a given chapter, as the dataset is so large.
For example, a dataframe that pulls only
the top 10 most common names in chapter 1.
How would I go about doing this?
Many thanks!!!
You can use pandas.DataFrame.count.
I hope this guide on counting values in a pandas DataFrame can help you out.
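Neither pointer above shows the selection step itself, so here is a minimal sketch, assuming df1 from the question:
# Top 10 names in chapter 1
top10_ch1 = df1[df1["BNF Chapter"] == 1].nlargest(10, "Occurrence")

# Or the top 10 per chapter in one pass
top10_all = (df1.sort_values("Occurrence", ascending=False)
                .groupby("BNF Chapter")
                .head(10))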
I'm quite new to pandas and python, and I'm coming from a background in biochemistry and drug discovery. One frequent task that I'd like to automate is the conversion of a list of combination of drug treatments and proteins to a format that contains all such combinations.
For instance, if I have a DataFrame containing a given set of combinations:
https://github.com/colinhiggins/dillydally/blob/master/input.csv, I'd like to turn it into https://github.com/colinhiggins/dillydally/blob/master/output.csv, such that each protein (1, 2, and 3) is copied n times to an output DataFrame, where the number of rows, n, is the number of drug/drug-concentration combinations plus one for a no-drug row for each protein.
Ideally, the degree of combination would be dictated by some other table that indicates relationships: for example, proteins 1 and 2 are to be treated with drugs 1, 2, and 3, but protein 3 isn't treated with any drugs.
I'm thinking some kind of nested for loop is going to be required, but I can't wrap my head around just how to start it.
Consider the following solution:
from itertools import product
import pandas

protein = ['protein1', 'protein2', 'protein3']
drug = ['drug1', 'drug2', 'drug3']
drug_concentration = [100, 30, 10]
df = pandas.DataFrame.from_records(
    list(product(protein, drug, drug_concentration)),
    columns=['protein', 'drug', 'drug_concentration'],
)
>>> df
protein drug drug_concentration
0 protein1 drug1 100
1 protein1 drug1 30
2 protein1 drug1 10
3 protein1 drug2 100
4 protein1 drug2 30
5 protein1 drug2 10
6 protein1 drug3 100
7 protein1 drug3 30
8 protein1 drug3 10
9 protein2 drug1 100
10 protein2 drug1 30
11 protein2 drug1 10
12 protein2 drug2 100
13 protein2 drug2 30
14 protein2 drug2 10
15 protein2 drug3 100
16 protein2 drug3 30
17 protein2 drug3 10
18 protein3 drug1 100
19 protein3 drug1 30
20 protein3 drug1 10
21 protein3 drug2 100
22 protein3 drug2 30
23 protein3 drug2 10
24 protein3 drug3 100
25 protein3 drug3 30
26 protein3 drug3 10
This is basically a cartesian product you're after, which is the functionality of the product function in the itertools module. I'm admittedly confused about why you want the empty rows that just list the proteins with NaNs in the other columns; not sure if that was intentional or accidental. If the datatypes were uniform and numeric, this would be similar to what's known as a meshgrid.
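If those no-drug rows are wanted after all, one way (a sketch reusing the df and lists built above) is to prepend one blank row per protein:
# One no-drug row per protein, stacked on top of the full cartesian product
no_drug = pandas.DataFrame({'protein': protein, 'drug': None, 'drug_concentration': None})
df_full = pandas.concat([no_drug, df], ignore_index=True)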
I've worked through part of this with the help of "add one row in a pandas.DataFrame", using the method recommended there by ShikharDua: creating a list of dicts, each dict corresponding to a row in the eventual DataFrame.
The code is:
import pandas

data = pandas.read_csv('input.csv')
dict1 = {"protein": "", "drug": "", "drug_concentration": ""}  # should be able to get this automatically from the dataframe columns, I think
rows_list = []
for unique_protein in data.protein.unique():
    # one no-drug row per protein
    dict1 = {"protein": unique_protein, "drug": "", "drug_concentration": ""}
    rows_list.append(dict1)
    for unique_drug in data.drug.unique():
        for unique_drug_conc in data.drug_concentration.unique():
            dict1 = {"protein": unique_protein,
                     "drug": unique_drug,
                     "drug_concentration": unique_drug_conc}
            rows_list.append(dict1)
df = pandas.DataFrame(rows_list)
df
It isn't as flexible as I was hoping, since the extra no-drug row per protein is hard-coded into the nested for loops, but at least it's a start. I guess I can add some if statements within each for loop.
I've improved upon the earlier version:
- enclosed it in a function
- added a check for proteins that won't be treated with drugs, from another input CSV file that contains the same proteins in column A and either true or false in column B, labeled "treat with drugs"
- skipped null values. I noticed that my example input.csv had equal-length columns, and the function started going a little nuts with NaN rows when they had unequal lengths
- set the initial dictionary keys from the columns of the initial input CSV instead of hard-coding them
I tested this with some real data (hence the change from input.csv to realinput.csv), and it works quite nicely.
Code for a fully functional python file follows:
import pandas
import os

os.chdir("path_to_directory_containing_realinput_and_boolean_file")
realinput = pandas.read_csv('realinput.csv')
rows_list = []
dict1 = dict.fromkeys(realinput.columns, "")
prot_drug_bool = pandas.read_csv('protein_drug_bool.csv')
prot_drug_bool.index = prot_drug_bool.protein
prot_drug_bool = prot_drug_bool.drop("protein", axis=1)

def null_check(value):
    return pandas.isnull(value)

def combinator(input_table):
    for unique_protein in input_table.protein.unique():
        # one no-drug row per protein
        dict1 = dict.fromkeys(realinput.columns, "")
        dict1['protein'] = unique_protein
        rows_list.append(dict1)
        # only add drug rows for proteins flagged true in the boolean file
        # (.loc replaces the long-removed .ix indexer here)
        if prot_drug_bool.loc[unique_protein, "treat with drugs"]:
            for unique_drug in input_table.drug.unique():
                if not null_check(unique_drug):
                    for unique_drug_conc in input_table.drug_concentration.unique():
                        if not null_check(unique_drug_conc):
                            dict1 = dict.fromkeys(realinput.columns, "")
                            dict1['protein'] = unique_protein
                            dict1['drug'] = unique_drug
                            dict1['drug_concentration'] = unique_drug_conc
                            rows_list.append(dict1)
    df = pandas.DataFrame(rows_list)
    return df

df2 = combinator(realinput)
df2.to_csv('realoutput.csv')
I'd still like to make it more versatile by getting away from hard-coding any dictionary keys and letting the user-defined input.csv column headers dictate the output. Also, I'd like to move away from the defined three-column setup to handle any number of columns.
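As one possible direction (my sketch, not the author's code), deriving the keys from the CSV header collapses the hard-coded nesting into a product over every column's unique non-null values; the no-drug rows and the treat-with-drugs check would still need to be layered on top:
from itertools import product
import pandas

def combinator_generic(input_table):
    # Cartesian product of each column's unique non-null values,
    # for any column names and any number of columns.
    uniques = [input_table[col].dropna().unique() for col in input_table.columns]
    return pandas.DataFrame(list(product(*uniques)), columns=input_table.columns)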