Lookup based on row and column header Pandas - python

How do I use the QuantityFormula column to look up values under the matching column headers? For example, where count (from QuantityFormula) == count (from the headers), take the value in that row and write it to a new column called Quantity, and do the same for all of Count, Area and Volume. It also needs to keep working when new rows are added.
I found this code online as a starting point; I'm looking to modify it or write a new piece of code that does what I need. How do I loop over the column, compare it to the headers (lookup_array == lookup_value), and store the row value of the match?
Note: the NaN columns (count, area, volume) could have values in them in future tables.
def xlookup(lookup_value, lookup_array, return_array, if_not_found: str = ''):
    match_value = return_array.loc[lookup_array == lookup_value]
    if match_value.empty:
        return f'"{lookup_value}" not found!' if if_not_found == '' else if_not_found
    else:
        return match_value.tolist()[0]

Merged['Quantity'] = Merged['QuantityFormula'].apply(xlookup, args=(Merged['NRM'], left['UoM']))
I have XLOOKUP functionality here, but I need something slightly different.

Here is one way to do it.
I used a made-up DataFrame; if you had shared the dataframe as code (preferably) or text, I would have used that. Refer to https://stackoverflow.com/help/minimal-reproducible-example
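The frame below is a plausible reconstruction of that made-up data (an assumption on my part, built so it matches the output shown further down):
import numpy as np
import pandas as pd

# reconstructed toy frame (assumed, since the original wasn't posted)
df = pd.DataFrame({
    'count':   [1.0, 1.0, np.nan, np.nan],
    'area':    [np.nan, np.nan, 1.4, 0.6],
    'formula': ['count', 'count', 'area', 'area'],
})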
# use apply along axis=1 to pick, for each row, the value from the column named in 'formula'
df['quantity'] = df.apply(lambda x: x[x['formula']], axis=1)
df
count area formula quantity
0 1.0 NaN count 1.0
1 1.0 NaN count 1.0
2 NaN 1.4 area 1.4
3 NaN 0.6 area 0.6
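Applied to the frame from the question, the same idea might look like this, assuming the values in QuantityFormula match the column headers exactly (add a mapping or fix the casing first if they don't):
# assumes QuantityFormula holds the exact header names ('Count', 'Area', 'Volume', ...)
Merged['Quantity'] = Merged.apply(lambda row: row[row['QuantityFormula']], axis=1)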

With your current data, you have NaN in the columns that aren't the one you want, and a real value only in the one you do.
So I say you just add up those three columns, which is effectively the_number_you_want + 0 + 0. You can use np.nansum() to treat the NaNs as zero.
...
import numpy as np
...
df['Quantity'] = np.nansum(df[['Count','Area','Volume']],axis=1)

Related

Pandas - Lookup value for each item in list

I am relatively new to Python and Pandas. I have two dataframes: one contains a column of comma-separated codes - the number of codes in each list can vary, and a cell can instead contain a string such as 'Not Applicable' or be blank. The other is a lookup table of the codes and a value. I want to look up the value of each individual code in each list and calculate the maximum value within that list. For example ['H302','H304'] would be [18,11] and the maximum value of those two would be 18. I then want to return the maximum value of each list as a new column on df2. If the cell contains anything else, return blank.
This process was originally written in VBA, I solved the problem there by splitting each set of codes by delimiter to a new column, then dynamically running index/matches against each code to return the value. Then it would calculate the maximum value and delete out all the generated columns. I thought at the time it was a messy way to do it and I don't want to replicate this in the Python version.
I would post what I've tried, but I can't figure out how I'd go about this - any help is appreciated!
import pandas as pd
df1 = [['H302',18],
['H312',17],
['H315',16],
['H316',15],
['H319',14],
['H320',13],
['H332',12],
['H304',11]]
df1 = pd.DataFrame(df1, columns=['Code', 'Value'])
df2 = [['H302,H304'],
['H332,H319,H312,H320,H316,H315,H302,H304'],
['H315,H312,H316'],
['H320,H332,H316,H315,H304,H302,H312'],
['H315,H319,H312,H316,H332'],
['H312'],
['Not Applicable'],
['']]
df2 = pd.DataFrame(df2, columns=['Code'])
df3 = []
for i in range(len(df2)):
    df3.append(df2['Code'][i].split(","))

max_values = []
for i in range(len(df3)):
    for j in range(len(df3[i])):
        for index in range(len(df1)):
            if df1['Code'][index] == df3[i][j]:
                df3[i][j] = df1['Value'][index]
    max_values.append(max(df3[i]))
df2["Max Value"] = max_values
First, df2 seems to be defined incorrectly (single quotes around the comma-separated strings are required). Also, don't generate a data frame from it, since you need the flexibility to have any number of elements.
Second, you would need to define the codes as the index to look for elements in the data frame. So, you would define the data frame as:
df1 = pd.DataFrame(df1, columns=['Code', 'Value']).set_index('Code')
Third, you need to loop through the second list of lists and index the elements you want before calculating the maximum using .loc. Also, you need to filter out the codes that are not in the first data frame.
result = []
for row in df2:                      # df2 kept as the original list of lists (see the first point)
    codes = row[0].split(',')        # split the comma-joined string into individual codes
    c = [code for code in codes if code in df1.index]   # keep only codes present in the lookup
    result.append(df1.loc[c, 'Value'].max())
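If you then want the maxima next to the original strings, a minimal follow-up (my addition, not part of the answer above) could be:
# attach the computed maxima to the original comma-joined strings;
# rows whose codes weren't found ('Not Applicable', blanks) end up as NaN
out = pd.DataFrame({'Code': [row[0] for row in df2], 'Max Value': result})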
Try:
df2.join(df2['Code'].str.split(',')
         .explode()
         .map(df1.set_index('Code')['Value'])
         .groupby(level=0).max()
         .rename('Value'))
Output:
                                      Code  Value
0                                H302,H304   18.0
1  H332,H319,H312,H320,H316,H315,H302,H304   18.0
2                           H315,H312,H316   17.0
3       H320,H332,H316,H315,H304,H302,H312   18.0
4                 H315,H319,H312,H316,H332   17.0
5                                     H312   17.0
6                           Not Applicable    NaN
7                                              NaN

Skip Empty cell Python Pandas

I am writing a script to count the percentage of cells that have a specific value. However, when it counts the rows it does not leave out the cells that are NaN. Basically, I do not want the script to count a cell with the value NaN as a row. I have tried everything from != "" to .isnan.
What I'm trying to do is calculate the percentage of cells that have a specific value, which is not possible if the function counts the rows with NaN values.
RELEVANT CODE
df2 = pd.DataFrame(supplier_data_df, columns=['supplier keywords', 'supplier in ocr'])
total_suppliers = df2[(df2["supplier in ocr"] != "") & (df2["supplier keywords"] != "")]
percentilesupplierkeyword = len(supplier_filtered_df)/len(total_suppliers) * 100
print(percentilesupplierkeyword,"% of supplier-keywords have an issue")
Thank you in advance.
I hope you're doing good.
You can either drop the NaN values or exclude them from your dataframe, and then perform your computations.
If you want to drop the NaN values
df2.dropna(inplace=True)
Or you could use the fillna method to fill the nan values with 0.
df2.fillna(0, inplace=True)
If you want to get the index list of the nan values
df2[df2["col1"].isna()].index.tolist()

How to extract a set of rows from one column of a dataframe using a variable column header?

I have a dataframe of multiple columns: the first is 'qty' and contains several (usually 4) replicates for several different quantities. The remaining columns represent the result of a test for the corresponding replicate at the corresponding quantity -- either a numeric value or a string ('TND' for target not detected, 'Ind' for indeterminate, etc.). Each of the columns (other than the first) represent the results for given 'targets', and there can be any number of targets in a given dataset. An example might be
qty target1 target2
1 TND TND
1 724 TND
1 TND TND
1 674 TND
5 1.4E+04 TND
5 9.2E+03 194
5 1.1E+04 TND
5 9.9E+03 TND
The ultimate goal is to get the probability of detecting each target at each concentration/quantity, so I initially calculated this using the function
def hitrate(qty, df):
    t_s = df[df.qty == qty].result
    t_s = t_s.apply(pd.to_numeric, args=('coerce',)).isna()
    return (len(t_s) - t_s.sum()) / len(t_s)
but this was when I only needed to evaluate probabilities for a single target. Before calling hitrate, I'd just ask the user what the header for their target column was, assign it to the variable tar, and use df = df.rename(columns={tar:'result'}).
Now that there are multiple targets, I can't use the hitrate function I wrote, as I need to call it in a loop such as
qtys = df['qty'].unique()
probs = np.zeros([len(qtys), len(targets)])
for i, tar in enumerate(targets):
    for idx, val in enumerate(qtys):
        probs[idx, i] = hitrate(val, data)
But the hitrate function explicitly pulls the result/target column for a given quantity by using df[df.qty == qty].result. This no longer works, since the target column changes, and trying to use something like df[df.qty == qty].targets[i] or df[df.qty == qty].tar throws an error, presumably because you can't reference a dataframe column with a variable containing the column name (like you can with the column name directly, i.e. df.result).
In the end, I need to end up with two arrays or dataframes such as (with the above example table as reference):
Table for target_1:
qty probability
1 0.5
5 1.0
Table for target_2:
qty probability
1 0.0
5 0.25
I'm sorry if the question is confusing... If so, leave a comment and I'll try to be a bit clearer. It's been a long day. Any help would be appreciated!
The most basic way of accessing a column from a DataFrame is to use square brackets (like a dict):
df['some_column']
Attribute indexing is nice, but it doesn't work in many cases (Column names with spaces, for example).
So, try something like:
target = 'target1'
...
df[df.qty == qty][target]
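Applied to the hitrate function from the question, a parameterized sketch might look like this (my adaptation, untested against the real data; the target column name is passed in instead of hard-coding result):
import numpy as np
import pandas as pd

def hitrate(qty, df, target):
    # rows at this quantity, for the requested target column
    t_s = df.loc[df.qty == qty, target]
    # non-numeric results ('TND', 'Ind', ...) become NaN, i.e. non-detections
    misses = pd.to_numeric(t_s, errors='coerce').isna()
    return (len(misses) - misses.sum()) / len(misses)

# one qty/probability table per target column
qtys = df['qty'].unique()
targets = [c for c in df.columns if c != 'qty']
tables = {tar: pd.DataFrame({'qty': qtys,
                             'probability': [hitrate(q, df, tar) for q in qtys]})
          for tar in targets}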

Combine paired rows after pandas groupby, give NaN value if ID didn't occur twice in df

I have a single dataframe containing an ID column id, and I know that the ID will exist either exactly in one row ('mismatched') or two rows ('matched') in the dataframe.
In order to select the mismatched rows and the pairs of matched rows I can use a groupby on the ID column.
Now for each group, I want to take some columns from the second (pair) row, rename them, and copy them to the first row. I can then discard all the second rows and return a single dataframe containing all the modified first rows (for each and every group).
Where there is no second row (mismatched) - it's fine to put NaN in its place.
To illustrate this see table below id=1 and 3 are a matched pair, but id=2 is mismatched:
entity id partner value
A 1 B 200
B 1 A 300
A 2 B 600
B 3 C 350
C 3 B 200
The resulting transformation should leave me with the following:
entity id partner entity_value partner_value
A 1 B 200 300
A 2 B 600 NaN
B 3 C 350 200
What's baffling me is how to come up with a generic way of getting the matching partner_value from row 2, copied into row 1 after the groupby, in a way that also works when there is no matching id.
Solution (this was tricky):
import numpy as np

dfg = df.groupby('id', sort=False)
# Create 'entity', 'id', 'partner', 'entity_value' from the first row of each group...
df2 = dfg[['entity', 'id', 'partner', 'value']].first().rename(columns={'value': 'entity_value'})
# Now insert 'partner_value' from those groups that have a second row...
df2['partner_value'] = np.nan
df2['partner_value'] = dfg['value'].nth(n=1)
   entity  id partner  entity_value  partner_value
id
1       A   1       B           200          300.0
2       A   2       B           600            NaN
3       B   3       C           350          200.0
This was tricky to get working. The short answer is that although pd.groupby(...).agg(...) in principle allows you to specify a list of tuples of (column, aggregate_function), and you could then chain those into a rename, that won't work here since we're trying to do two separate aggregate operations both on value column, and rename both their results (you get pandas.core.base.SpecificationError: Function names must be unique, found multiple named value).
Other complications:
We can't directly use groupby.nth(n) which sounds useful at first glance, except it's only on a DataFrame not a Series like df['value'], and also it silently drops groups which don't have an n'th element, not what we want. (But it does keep the index, so we can use it by first initializing the column as all-NaNs, then selectively inserting on that column, as above).
In any case the pd.groupby.agg() syntax won't even let you call nth() by just passing 'nth' as the agg_func name, since nth() is missing its n argument; you'd have to declare a lambda.
I tried defining the following function second_else_nan to use inside an agg() as above, but after much struggling I couldn't get it to work, for multiple reasons, only one of which is that you can't do two aggs on the same column:
Code:
def second_else_nan(v):
    if v.size == 2:
        return v.iloc[1]   # second element of the group
    else:
        return np.nan
(i.e. the list equivalent of the dict.get(key, default) builtin)
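For reference, the lambda-based agg mentioned above could be written roughly like this (a sketch; it should produce the same partner_value column as the nth-based insert in the solution):
# take the second value of each group where there is one, otherwise NaN
df2['partner_value'] = df.groupby('id', sort=False)['value'].agg(
    lambda v: v.iloc[1] if len(v) > 1 else np.nan)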
Here is how I would do that. First, get the first value of each group:
df_grouped = df.reset_index().groupby('id').agg("first")
Then retrieve the values that are duplicated and insert them:
df_grouped["partner_value"] = df.groupby("id")["value"].agg("last")
The only caveat is that for ids that aren't duplicated you end up with the same value repeated (instead of a NaN).
What about something like this?
grouped = df.groupby("id")
first_values = grouped.agg("first")
sums = grouped.agg("sum")
first_values["partner_value"] = sums["value"] - first_values["value"]
first_values["partner_value"].replace(0, np.nan, inplace=True)
transformed_df = first_values.copy()
Group the data by id, take the first row of each group, take the sum of the 'value' column for each group, and subtract the first row's 'value' from that sum. Then replace 0s in the resulting column with np.nan (this makes the assumption that data in the 'value' column is never 0).

Pandas not saving changes when iterating rows

let's say I have the following dataframe:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
Now I want to multiply the Shots variable by a random value (multiplier in the code) and recalculate the StG variable, which is simply Shots/Goals. The code I used is:
for index, row in df.iterrows():
    multiplier = np.random.randint(1, 5+1)
    row['Shots'] *= multiplier
    row['StG'] = float(row['Shots']) / float(row['Goals'])
Then I saved the .csv and it was identical to the original one, so after the for loop I simply used print(df) to obtain:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
If I print the values row by row during the for iteration I can see they change, but it's like they don't get saved to the df.
I think it is because I'm simply accessing the values, not the actual dataframe.
I thought I should add something like df.row[], but that fails because DataFrame has no row attribute.
Thanks for the help.
____EDIT____
for index, row in df.iterrows():
    multiplier = np.random.randint(1, 5+1)
    row['Impresions'] *= multiplier
    row['Clicks'] *= np.random.randint(1, multiplier+1)
    row['Ctr'] = float(row['Clicks']) / float(row['Impresions'])
    row['Mult'] = multiplier
    # print(row['Clicks'], row['Impresions'], row['Ctr'], row['Mult'])
The main condition is that the number of Clicks can't ever be higher than the number of Impressions.
Then I recalculate the ratio Clicks/Impressions as CTR.
I am not sure that multiplying the entire column is the best way to maintain the condition that for each row Impressions >= Clicks, hence I went row by row.
From the pandas docs about iterrows(): pandas.DataFrame.iterrows
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
The good news is you don't need to iterate over rows - you can perform the operations on columns:
# Generate an array of random integers of same length as your DataFrame
multipliers = np.random.randint(1, 5+1, size=len(df))
# Multiply corresponding elements from df['Shots'] and multipliers
df['Shots'] *= multipliers
# Recalculate df['StG']
df['StG'] = df['Shots']/df['Goals']
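The same column-wise idea should carry over to the edited example with its Clicks <= Impressions constraint (a sketch using the question's column names, including the 'Impresions' spelling):
# one multiplier per row for the impressions...
multipliers = np.random.randint(1, 5+1, size=len(df))
df['Impresions'] *= multipliers
# ...and a per-row multiplier no larger than it for the clicks,
# so Clicks stays <= Impresions if it was before
df['Clicks'] *= np.random.randint(1, multipliers + 1)
df['Ctr'] = df['Clicks'] / df['Impresions']
df['Mult'] = multipliers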
Define a function that returns a series:
def f(x):
    m = np.random.randint(1, 5+1)
    return pd.Series([x.Shots * m, x.Shots / x.Goals * m])
Apply the function to the data frame row-wise; it will return another data frame, which can be used to replace some columns in the existing data frame or to create new columns:
df[['Shots', 'StG']] = df.apply(f, axis=1)
This approach is very flexible as long as the new column values depend only on other values in the same row.
