I'm quite new to pandas and python, and I'm coming from a background in biochemistry and drug discovery. One frequent task that I'd like to automate is converting a list of drug treatments and proteins into a format that contains all such combinations.
For instance, if I have a DataFrame containing a given set of combinations:
https://github.com/colinhiggins/dillydally/blob/master/input.csv, I'd like to turn it into https://github.com/colinhiggins/dillydally/blob/master/output.csv, such that each protein (1, 2, and 3) is copied n times to an output DataFrame, where n is the number of drug and drug-concentration combinations plus one for a no-drug row for each protein.
Ideally, the degree of combination would be dictated by some other table that indicates relationships, for example if proteins 1 and 2 are to be treated with drugs 1, 2, and 3 but protein 3 isn't treated with any drugs.
I'm thinking some kind of nested for loop is going to be required, but I can't wrap my head around just quite how to start it.
Consider the following solution:
from itertools import product
import pandas
protein = ['protein1', 'protein2', 'protein3']
drug = ['drug1', 'drug2', 'drug3']
drug_concentration = [100, 30, 10]
df = pandas.DataFrame.from_records(list(product(protein, drug, drug_concentration)),
                                   columns=['protein', 'drug', 'drug_concentration'])
>>> df
protein drug drug_concentration
0 protein1 drug1 100
1 protein1 drug1 30
2 protein1 drug1 10
3 protein1 drug2 100
4 protein1 drug2 30
5 protein1 drug2 10
6 protein1 drug3 100
7 protein1 drug3 30
8 protein1 drug3 10
9 protein2 drug1 100
10 protein2 drug1 30
11 protein2 drug1 10
12 protein2 drug2 100
13 protein2 drug2 30
14 protein2 drug2 10
15 protein2 drug3 100
16 protein2 drug3 30
17 protein2 drug3 10
18 protein3 drug1 100
19 protein3 drug1 30
20 protein3 drug1 10
21 protein3 drug2 100
22 protein3 drug2 30
23 protein3 drug2 10
24 protein3 drug3 100
25 protein3 drug3 30
26 protein3 drug3 10
This is basically the cartesian product you're after, which is exactly the functionality of the product function in the itertools module. I'm admittedly confused about why you want the empty rows that just list the proteins with NaNs in the other columns; I'm not sure if that was intentional or accidental. If the data types were uniform and numeric, this would be similar in functionality to what's known as a meshgrid.
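If those no-drug rows really are intentional, a minimal sketch of one way to tack them on, reusing the protein list and df built above, could be:
# one extra row per protein with the drug columns left as NaN
no_drug = pandas.DataFrame({'protein': protein})
df = pandas.concat([no_drug, df], ignore_index=True)
df = df.sort_values('protein').reset_index(drop=True)  # keep each protein's rows together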
I've worked through part of this with the help of "add one row in a pandas.DataFrame", using the method recommended by ShikharDua of creating a list of dicts, each dict corresponding to a row in the eventual DataFrame.
The code is:
data = pandas.read_csv('input.csv')
dict1 = {"protein": "", "drug": "", "drug_concentration": ""}  # should be able to get this automatically using the dataframe columns, I think
rows_list = []
for unique_protein in data.protein.unique():
    dict1 = {"protein": unique_protein, "drug": "", "drug_concentration": ""}
    rows_list.append(dict1)
    for unique_drug in data.drug.unique():
        for unique_drug_conc in data.drug_concentration.unique():
            dict1 = {"protein": unique_protein, "drug": unique_drug, "drug_concentration": unique_drug_conc}
            rows_list.append(dict1)
df = pandas.DataFrame(rows_list)
df
It isn't as flexible as I was hoping, since the extra no-drug row for each protein is hard-coded into the nested for loops, but at least it's a start. I guess I can add some if statements within each for loop.
I've improved upon the earlier version:
enclosed it in a function
added a check for proteins that won't be treated with drugs, from another input CSV file that contains the same proteins in column A and either true or false in column B, labeled "treat with drugs"
skipped null values. I noticed that my example input.csv had equal-length columns, and the function started going a little nuts with NaN rows if the lengths were unequal.
set the initial dictionary keys from the columns of the initial input CSV instead of hard-coding them.
I tested this with some real data (hence the change from input.csv to realinput.csv), and it works quite nicely.
Code for a fully functional python file follows:
import pandas
import os

os.chdir("path_to_directory_containing_realinput_and_boolean_file")
realinput = pandas.read_csv('realinput.csv')
rows_list = []
dict1 = dict.fromkeys(realinput.columns, "")
prot_drug_bool = pandas.read_csv('protein_drug_bool.csv')
prot_drug_bool.index = prot_drug_bool.protein
prot_drug_bool = prot_drug_bool.drop("protein", axis=1)


def null_check(value):
    return pandas.isnull(value)


def combinator(input_table):
    for unique_protein in input_table.protein.unique():
        dict1 = dict.fromkeys(realinput.columns, "")
        dict1['protein'] = unique_protein
        rows_list.append(dict1)
        if prot_drug_bool.loc[unique_protein].item():  # treat-with-drugs flag for this protein
            for unique_drug in input_table.drug.unique():
                if not null_check(unique_drug):
                    for unique_drug_conc in input_table.drug_concentration.unique():
                        if not null_check(unique_drug_conc):
                            dict1 = dict.fromkeys(realinput.columns, "")
                            dict1['protein'] = unique_protein
                            dict1['drug'] = unique_drug
                            dict1['drug_concentration'] = unique_drug_conc
                            rows_list.append(dict1)
    df = pandas.DataFrame(rows_list)
    return df


df2 = combinator(realinput)
df2.to_csv('realoutput.csv')
I'd still like to make it more versatile by getting away from hard-coding any dictionary keys and letting the user-defined input.csv column headers dictate the output. Also, I'd like to move away from the defined three-column setup to handle any number of columns.
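In case it helps anyone later, here is an untested sketch of how that generalization might look (the name combinator_generic and the key_column argument are mine, not part of the code above); it takes the column names from the input instead of hard-coding them and handles any number of treatment columns:
from itertools import product
import pandas

def combinator_generic(input_table, key_column='protein'):
    """Sketch: pair every value of key_column with every combination of the unique,
    non-null values of the remaining columns, plus one bare row per key value."""
    other_cols = [c for c in input_table.columns if c != key_column]
    value_lists = [input_table[c].dropna().unique() for c in other_cols]
    rows = []
    for key_value in input_table[key_column].dropna().unique():
        rows.append({key_column: key_value})  # the no-drug row
        for combo in product(*value_lists):
            row = {key_column: key_value}
            row.update(zip(other_cols, combo))
            rows.append(row)
    return pandas.DataFrame(rows, columns=input_table.columns)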
I have a variable named Strike; its values change regularly because it is assigned inside a loop.
This is my DataFrame.
Code example:
for i in range(len(df.strike)):
    Strike = df.strike.iloc[i]
    list1 = ['0-50', '50-100', '100-150'.......]
    list2 = [2000, 132.4, 1467.40, ..........]
    df = []  # here I have to create the dataframe
Strike contains values like 33000, 33100, 33200, 33300... and so on; it contains at least 145 values,
which I want to use as rows.
And I have two lists which also change from time to time because they are also built inside a loop.
list1 = ['0-50', '50-100', '100-150'.......]
list1 is what I want to use as columns,
and list2 contains numeric values:
list2 = [2000, 132.4, 1467.40, ..........]
I need the dataframe in this format.
list1 should be the column names, list2 the values, and the Strike variable the rows,
but I don't understand how I can create this data frame.
IIUC you could use the DataFrame constructor directly with a reshaped numpy array as input:
import numpy as np
import pandas as pd

# dummy example
list2 = list(range(4 * 7))
list1 = ['0-50', '50-100', '100-150', '150-200']
# replace by df.strike
strike = [33000, 33100, 33200, 33300, 33400, 33500, 33600]

df = pd.DataFrame(np.array(list2).reshape((-1, len(list1))),
                  index=strike, columns=list1)
output:
0-50 50-100 100-150 150-200
33000 0 1 2 3
33100 4 5 6 7
33200 8 9 10 11
33300 12 13 14 15
33400 16 17 18 19
33500 20 21 22 23
33600 24 25 26 27
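If list1, list2, and strike really are rebuilt on every pass of your loop, one possible way to collect the results (build_lists and n_iterations below are placeholders for however you produce them each pass) is to build one such frame per iteration and stack them at the end:
frames = []
for i in range(n_iterations):              # n_iterations is a placeholder
    list1, list2, strike = build_lists(i)  # build_lists is a hypothetical helper
    frames.append(pd.DataFrame(np.array(list2).reshape((-1, len(list1))),
                               index=strike, columns=list1))
result = pd.concat(frames)  # one big frame with all strikes as the index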
First post: I apologize in advance for sloppy wording (and possibly poor searching if this question has been answered ad nauseam elsewhere; maybe I don't know the right search terms yet).
I have data in 10-minute chunks and I want to perform calculations on a column ('input') grouped by minute (i.e. 10 separate 60-second blocks - not a rolling 60 second period) and then store all ten calculations in a single list called output.
The 'seconds' column records the second from 1 to 600 in the 10-minute period. If no data was entered for a given second, there is no row for that number of seconds. So, some minutes have 60 rows of data, some have as few as one or two.
Note: the calculation (my_function) is not basic so I can't use groupby and np.sum(), np.mean(), etc. - or at least I can't figure out how to use groupby.
I have code that gets the job done but it looks ugly to me so I am sure there is a better way (probably several).
output = []
seconds_slicer = 0
for i in np.linspace(1, 10, 10):
    seconds_slicer += 60
    minute_slice = df[(df['seconds'] > (seconds_slicer - 60)) &
                      (df['seconds'] <= seconds_slicer)]
    calc = my_function(minute_slice['input'])
    output.append(calc)
If there is a cleaner way to do this, please let me know. Thanks!
Edit: Adding sample data and function details:
seconds input
1 1 0.000054
2 2 -0.000012
3 3 0.000000
4 4 0.000000
5 5 0.000045
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))
For this answer, we're going to repurpose "Bin pandas dataframe by every X rows".
We'll create a dataframe with missing data in the 'seconds' column, which is how I understand your data based on the description given:
import numpy as np
import pandas as pd

secs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 17, 19]
data = [np.random.randint(-25, 54) / 100000 for _ in range(15)]
df = pd.DataFrame(data=zip(secs, data), columns=['seconds', 'input'])
df
seconds input
0 1 0.00017
1 2 -0.00020
2 3 0.00033
3 4 0.00052
4 5 0.00040
5 6 -0.00015
6 7 0.00001
7 8 -0.00010
8 9 0.00037
9 11 0.00050
10 12 0.00000
11 14 -0.00009
12 15 -0.00024
13 17 0.00047
14 19 -0.00002
I didn't create 600 rows, but that's okay, we'll say we want to bin every 5 seconds instead of every 60. Now, because we're just trying to use equal time measures for grouping, we can just use floor division to see which bin each time interval would end up in. (In your case, you'd divide by 60 instead)
# drop the extra 'seconds' column because we don't care about the root sum of squares of seconds in the df
grouped = df.groupby(df['seconds'] // 5).apply(realized_volatility).drop('seconds', axis=1)
grouped
input
seconds
0 0.000441
1 0.000372
2 0.000711
3 0.000505
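Applied to your real layout (600 seconds in 60-second blocks), the same idea should reduce to roughly the one-liner below; note the - 1 so that seconds 1 through 60 land in the first bin. I haven't run it against your data, so treat it as a sketch:
# group seconds 1-60 into bin 0, 61-120 into bin 1, and so on
output = df.groupby((df['seconds'] - 1) // 60)['input'].apply(realized_volatility).tolist()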
I'm working on a large dataset with more than 60K rows.
I have a continuous measurement of current in a column. Each code is measured for about a second, during which the equipment takes 14/15/16/17 readings depending on its speed; the measurement then moves to the next code and again takes 14/15/16/17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current measurement.
The top 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script where the 14/15/16/17 similar readings for each code measurement are averaged into a separate column, one value per code. I have been thinking of doing this with pandas.
I want the data to look like:
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
I need some help to get this done. Please help.
First, get the indexes of every row where there's a jump. Use Pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check whether it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (three, in the case of your sample data) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or if it's really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indexes you just pulled. Then you can go ahead and average them in a list comprehension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (the same values as an indexed column rather than a list), convert the list to a pd.Series and reset_index().
pd.Series(
    [df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index(drop=True)
0    1.349073
1    1.545564
2    1.749863
3    1.960851
dtype: float64
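For the other case mentioned above, where the source dataframe has more columns you want to keep, an untested alternative sketch is to label each block by counting the jumps seen so far and then group on that label:
# cumulative count of jumps > 0.15 gives a block label: 0, 0, ..., 1, 1, ..., 2, ...
df['code'] = (df['Curr(mA)'].diff() > 0.15).cumsum()
averages = df.groupby('code')['Curr(mA)'].mean().reset_index(drop=True)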
So I am trying to open a CSV file, read its fields, fix some other fields based on them, and then save that data back to CSV. My problem is that the CSV file has 2 million rows. What would be the best way to speed this up?
The CSV file consists of
ID; DATE(d/m/y); SPECIAL_ID; DAY; MONTH; YEAR
I am counting how often a row with the same date appears in my records and then updating SPECIAL_ID based on that count.
Based on my previous research I decided to use pandas. I'll be processing even bigger sets of data in the future (1-2 GB); this one is around 119 MB, so it's crucial that I find a good, fast solution.
My code goes as follows:
df = pd.read_csv(filename, delimiter=';')
df_fixed = pd.DataFrame(columns=stolpci)  # when I process a row in df, I append it to df_fixed
d = 31
m = 12
y = 100
s = (y, m, d)
list_dates = np.zeros(s)  # 3-dimensional array
for index, row in df.iterrows():
    # PROCESSING LOGIC GOES HERE
    # IT CONSISTS OF A FEW IF STATEMENTS
    list_dates[row.DAY][row.MONTH][row.YEAR] += 1
    row['special_id'] = list_dates[row.DAY][row.MONTH][row.YEAR]
    df_fixed = df_fixed.append(row.to_frame().T)
df_fixed.to_csv(filename_fixed, sep=';', encoding='utf-8')
I added a print for every thousand rows processed. At first, my script needs 3 seconds per 1,000 rows, but the longer it runs, the slower it gets;
at row 43,000 it needs 29 seconds, and so on...
Thanks for all future help :)
EDIT:
I am adding additional information about my CSV and the expected output.
ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505__-;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505__-;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505__-;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505__-;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505__-;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505__-;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505__-;F;6;1001001;1001001_F_6;16;8;2011
I have to replace the __- in the SPECIAL_ID field with a proper number.
For example, for the row with ID = 2 the SPECIAL_ID will be 13012016505001 (__- got replaced by 001); if someone else in the CSV shares the same DAY, MONTH, and YEAR, their __- will be replaced by 002, and so on...
So the expected output for the above rows would be:
ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505001;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505001;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505001;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505001;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505001;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505001;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505002;F;6;1001001;1001001_F_6;16;8;2011
EDIT:
I changed my code to something like this: I fill a list of dicts with data and then convert that list to a dataframe and save it as CSV. This takes around 30 minutes to complete:
list_popravljeni = []
df = pd.read_csv(filename, delimiter=';')
df_dates = df.groupby(by=['dan_roj', 'mesec_roj', 'leto_roj']).size().reset_index()
for index, row in df_dates.iterrows():
    df_candidates = df.loc[(df['dan_roj'] == row['dan_roj']) &
                           (df['mesec_roj'] == row['mesec_roj']) &
                           (df['leto_roj'] == row['leto_roj'])]
    for index, row in df_candidates.iterrows():
        vrstica = {}
        vrstica['ID'] = row['identifikator']
        vrstica['SPECIAL_ID'] = row['emso'][0:11] + str(index).zfill(2)
        vrstica['day'] = row['day']
        vrstica['MONTH'] = row['MONTH']
        vrstica['YEAR'] = row['YEAR']
        list_popravljeni.append(vrstica)
pd.DataFrame(list_popravljeni, columns=list_popravljeni[0].keys())
I think this gives what you're looking for and avoids looping. Potentially it could be more efficient (I wasn't able to find a way to avoid creating counts). However, it should be much faster than your current approach.
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
df['counts'] = df['counts'].astype(str)
df['counts'] = df['counts'].str.zfill(3)
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])
I added a fake record at the end to confirm it does increment properly:
SPECIAL_ID sex age zone key day month year counts
0 13012016505001 F 1 1001001 1001001_F_1 13 1 2016 001
1 25122013505001 F 4 1001001 1001001_F_4 25 12 2013 001
2 24022012505001 F 5 1001001 1001001_F_5 24 2 2012 001
3 09032012505001 F 5 1001001 1001001_F_5 9 3 2012 001
4 21082011505001 F 6 1001001 1001001_F_6 21 8 2011 001
5 16082011505001 F 6 1001001 1001001_F_6 16 8 2011 001
6 21102011505002 F 6 1001001 1001001_F_6 16 8 2011 002
7 21102012505003 F 6 1001001 1001001_F_6 16 8 2011 003
If you want to get rid of counts, you just need:
df.drop('counts', inplace=True, axis=1)
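For completeness, a rough end-to-end sketch of the whole pass over the file might look like the following (the filenames are placeholders, and SPECIAL_ID is read as a string so the leading zeros survive):
import pandas as pd

df = pd.read_csv('input.csv', delimiter=';', dtype={'SPECIAL_ID': str})
counts = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(counts.astype(str).str.zfill(3))
df.to_csv('output.csv', sep=';', index=False, encoding='utf-8')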
DataFrame1:
Device MedDescription Quantity
RWCLD Acetaminophen (TYLENOL) 325 mg Tab 54
RWCLD Ampicillin Inj (AMPICILLIN) 2 g Each 13
RWCLD Betamethasone Inj *5mL* (CELESTONE SOLUSPAN) 30 mg (5 mL) Each 2
RWCLD Calcium Carbonate Chew (500mg) (TUMS) 200 mg Tab 17
RWCLD Carboprost Inj *1mL* (HEMABATE) 250 mcg (1 mL) Each 5
RWCLD Chlorhexidine Gluc Liq *UD* (PERIDEX/PERIOGARD) 0.12 % (15 mL) Each 5
DataFrame2:
Device DrwSubDrwPkt MedDescription BrandName MedID PISAlternateID CurrentQuantity Min Max StandardStock ActiveOrders DaysUnused
RWC-LD RWC-LD_MAIN Drw 1-Pkt 12 Mag/AlOH/Smc 200-200-20/5 *UD* (MYLANTA/MAALOX) (30 mL) Each MYLANTA/MAALOX A03518 27593 7 4 10 N Y 3
RWC-LD RWC-LD_MAIN Drw 1-Pkt 20 ceFAZolin in Dextrose(ISO-OS) (ANCEF/KEFZOL) 1 g (50 mL) Each ANCEF/KEFZOL A00984 17124 6 5 8 N N 2
RWC-LD RWC-LD_MAIN Drw 1-Pkt 22 Clindamycin Phosphate/D5W (CLEOCIN) 900 mg (50 mL) IV Premix CLEOCIN A02419 19050 7 6 8 N N 2
What I want to do is append DataFrame2 values to DataFrame1 ONLY if the 'MedDescription' matches. When it finds a match, I would like to add only certain columns from DataFrame2 (Min, Max, Days Unused), which are all integers.
I had an iterative solution where I access DataFrame1 one row at a time, check for a match in DataFrame2, and, once found, append those columns to the original DataFrame.
Is there a better way? It is slowing my computer to a crawl, as I have thousands upon thousands of rows.
It sounds like you want to merge the target columns ('MedDescription', 'Min', 'Max', 'Days Unused') to df1 based on a matching 'MedDescription'.
I believe the best way to do this is as follows:
target_cols = ['MedDescription', 'Min', 'Max', 'Days Unused']
df1.merge(df2[target_cols], on='MedDescription', how='left')
how='left' ensures that all the data in df1 is returned, and only the target columns in df2 are appended if MedDescription matches.
Note: It is easier for others if you copy the results of df1/df2.to_dict(). The data above is difficult to parse.
This sounds like an opportunity to use Pandas' built-in functions for joining datasets - you should be able to join on MedDescription with the desired columns from DataFrame2. The join function in Pandas is very efficient and should far outperform your method of looping through.
Pandas has documentation on merging datasets that includes some good examples, and you can find ample literature on the concepts of joins in SQL tutorials.
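For example, something along these lines (untested, assuming your frames are named df1 and df2 and using the same target columns as the answer above):
joined = df1.merge(df2[['MedDescription', 'Min', 'Max', 'Days Unused']],
                   on='MedDescription', how='left')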
pd.merge(ld, ldAc, on='MedDescription', how='outer')
This is the way I used to join the two DataFrames; it seems to work, although it dropped one of the indexes that contained the devices.