Filling columns based on other dataframe columns - python

I have two data sets
df1 = pd.DataFrame({"skuid": ("A", "B", "C", "D"), "price": (0, 0, 0, 0)})
df2 = pd.DataFrame({"skuid": ("A", "B", "C", "D"), "salesprice": (10, 0, 0, 30), "regularprice": (9, 10, 0, 2)})
I want to fill price in df1 from salesprice and regularprice in df2, with these conditions:
If df1's skuid matches df2's skuid and df2's salesprice is not zero, use salesprice as the price value. If the skuids match but salesprice is zero, use regularprice. If there is no match, use zero as the price value.
def pric(df1, df2):
    if (df1['skuid'] == df2['skuid'] and salesprice != 0):
        price = salesprice
    elif (df1['skuid'] == df2['skuid'] and regularprice != 0):
        price = regularprice
    else:
        price = 0
I made a function with similar conditions, but it's not working. The result should look like this in df1:
skuid price
A 10
B 10
C 0
D 30
Thanks.

So there are a number of issues with the function given above. Here are a few, in no particular order:
Indentation in Python matters: https://docs.python.org/2.0/ref/indentation.html
Vectorized functions versus loops. The function you give looks vaguely like it expects to be applied on a vectorized basis, but plain Python doesn't work like that: you need to loop through the rows you want to look at (https://wiki.python.org/moin/ForLoop). Pandas does support column transformations that work without explicit loops, but they need to be invoked specifically (here is documentation for one instance of such functionality: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html).
Relatedly, accessing dataframe elements and indexing: see Indexing Pandas data frames: integer rows, named columns.
Return: if you want your Python function to give you a result, you should have it return that value. Not all programming languages require an explicit return (Julia, for example), but in Python you should.
Generality. This isn't strictly necessary in a one-off application, but your function will break if you change, for example, the column names in the dataframe. It is better practice to let the caller pass the relevant names as arguments, both for this reason and for simple flexibility.
Here is a version of your function that was more or less minimally changed to fix the specific issues above:
import pandas as pd

df1 = pd.DataFrame({"skuid": ("A", "B", "C", "D"), "price": (0, 0, 0, 0)})
df2 = pd.DataFrame({"skuid": ("A", "B", "C", "D"), "salesprice": (10, 0, 0, 30), "regularprice": (9, 10, 0, 2)})

def pric(df1, df2, id_colname, df1_price_colname, df2_salesprice_colname, df2_regularprice_colname):
    # Walk every row of df1 and, for each one, scan df2 for a matching id.
    for i in range(df1.shape[0]):
        for j in range(df2.shape[0]):
            # Matching id and a non-zero sales price: take the sales price.
            if (df1.loc[df1.index[i], id_colname] == df2.loc[df2.index[j], id_colname]
                    and df2.loc[df2.index[j], df2_salesprice_colname] != 0):
                df1.loc[df1.index[i], df1_price_colname] = df2.loc[df2.index[j], df2_salesprice_colname]
                break
            # Matching id but a zero sales price: fall back to the regular price.
            elif (df1.loc[df1.index[i], id_colname] == df2.loc[df2.index[j], id_colname]
                    and df2.loc[df2.index[j], df2_regularprice_colname] != 0):
                df1.loc[df1.index[i], df1_price_colname] = df2.loc[df2.index[j], df2_regularprice_colname]
                break
    return df1
for which entering
df1_imputed = pric(df1, df2, 'skuid', 'price', 'salesprice', 'regularprice')
print(df1_imputed['price'])
gives
0 10
1 10
2 0
3 30
Name: price, dtype: int64
Notice how the function loops through row indices before checking equality conditions on specific elements given by a row-index / column pair.
A few things to consider:
Why does the code loop through df1 "above" the loop through df2? Relatedly, what purpose does the break condition serve?
Why was the else condition omitted?
What is 'df1.loc[df1.index[i],id_colname]' all about? (hint: check one of the above links)
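For comparison, here is a vectorized sketch of the same logic using merge and numpy.where (not the minimally-changed function above). It assumes, as in the example data, that each skuid appears at most once in df2:

import numpy as np

# Align df2's prices with df1's rows by merging on the id column.
merged = df1.merge(df2, on="skuid", how="left")
sales = merged["salesprice"].fillna(0)
regular = merged["regularprice"].fillna(0)
# Prefer the sales price when it is non-zero, otherwise the regular price;
# unmatched skuids are 0 on both sides and therefore get a price of 0.
df1["price"] = np.where(sales != 0, sales, regular)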

Related

Define new column based on matching values between multiple columns in two dataframes

I'm currently trying to define a class label for a dataset I'm building. I have two different datasets that I need to consult, with df_port_call being the one that will ultimately contain the class label.
The conditions in the if statements need to be satisfied for the row to receive a class label of 1. Basically, if a row exists in df_deficiency that matches the if statement conditions listed below, the Class column in df_port_call should get a label of 1. But I'm not sure how to vectorize this and the loop is running very slowly (will take about 8 days to terminate). Any assistance here would be great!
df_port_call["Class"] = 0
for index, row in tqdm(df_port_call.iterrows()):
for index_def, row_def in df_deficiency.iterrows():
if row['MMSI'] == row_def['Primary VIN'] or row['IMO'] == row_def['Primary VIN'] or row['SHIP NAME'] == row_def['Vessel Name']:
if row_def['Inspection Date'] >= row['ARRIVAL IN USA (UTC)'] and row_def['Inspection Date'] <= row['DEPARTURE (UTC)']:
row['Class'] = 1
Without input data and an expected outcome, it's difficult to answer. However, you can use something like this with np.where:
import numpy as np

df_port_call['Class'] = \
    np.where((df_port_call['MMSI'].eq(df_deficiency['Primary VIN'])
              | df_port_call['IMO'].eq(df_deficiency['Primary VIN'])
              | df_port_call['SHIP NAME'].eq(df_deficiency['Vessel Name']))
             & df_deficiency['Inspection Date'].between(df_port_call['ARRIVAL IN USA (UTC)'],
                                                        df_port_call['DEPARTURE (UTC)']),
             1, 0)
Adapt this to your code, but I think this is the right way. Note the parentheses around the three .eq() checks, so that the date condition applies to whichever identity check matched, mirroring the nested ifs in the question. Also, comparing two Series with .eq() aligns them on their index, so this assumes the rows of the two frames line up one-to-one; for unaligned data a merge is safer (see the sketch below).
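For unaligned frames, here is a hedged sketch of that merge route (my own illustration, not tested against the real data): match each candidate key pair in turn, keep the matches whose inspection date falls inside the port-call window, and flag those rows. It assumes the date columns are already datetime-typed.

import pandas as pd

def label_class(df_port_call, df_deficiency):
    df = df_port_call.reset_index()  # keeps the original row labels in an 'index' column
    hits = []
    for left_key, right_key in [('MMSI', 'Primary VIN'),
                                ('IMO', 'Primary VIN'),
                                ('SHIP NAME', 'Vessel Name')]:
        m = df.merge(df_deficiency[[right_key, 'Inspection Date']],
                     left_on=left_key, right_on=right_key)
        in_window = m['Inspection Date'].between(m['ARRIVAL IN USA (UTC)'],
                                                 m['DEPARTURE (UTC)'])
        hits.append(m.loc[in_window, 'index'])
    matched = pd.concat(hits).unique()
    df_port_call['Class'] = 0
    df_port_call.loc[matched, 'Class'] = 1
    return df_port_call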

Please explain the following line of code for me. i.e pandas series creation using 2 columns of a dataframe

industry_usa = f500["industry"][f500["country"] == "USA"].value_counts().head(2)
f500 is a dataframe whose columns include industry and country. Why do we need to place the two columns side by side like this when creating the industry_usa series? Please explain.
I will break it down for you:
f500["industry"]: This selects the series (column) with the same name.
f500["country"] == "USA": This returns a boolean index containing True for all the rows which have their country column as USA.
f500["industry"][f500["country"] == "USA"]: As you might have guessed, this now is just like any other indexing we do in pandas. So, it selects all those "industry"s where the country is "USA".
.value_counts() : is just to do a count of the unique values. Like we have in Counter class in python
NOTE: The interesting fact is that you could change the order to - f500[f500["country"] == "USA"]["industry"] and still get the same result!!
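A minimal, hypothetical example (the real f500 data isn't shown in the question) demonstrating that both orderings agree:

import pandas as pd

f500 = pd.DataFrame({
    "country": ["USA", "USA", "USA", "China"],
    "industry": ["Tech", "Tech", "Retail", "Energy"],
})
a = f500["industry"][f500["country"] == "USA"].value_counts().head(2)
b = f500[f500["country"] == "USA"]["industry"].value_counts().head(2)
print(a.equals(b))  # True: both count only the industries of the USA rows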

How to optimally update cells based on previous cell value / How to elegantly spread values of cell to other cells?

I have a "large" DataFrame table with index being country codes (alpha-3) and columns being years (1900 to 2000) imported via a pd.read_csv(...) [as I understand, these are actually string so I need to pass it as '1945' for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iteration. My current implementation is something like this; as you can see, using two loops is not optimal, though I guess I could get rid of one by using apply(row).
def spread_values(df):
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use the apply() function, vectorisation, or a list comprehension, because this is not optimal.
The apply() function, regardless of the axis, does not seem to let me dynamically get the index/column (which I need in order to conditionally update a cell). I think the core reason I can't make the vectorised or list-comprehension options work is that I do not have a small, fixed set of column names but rather a wide range of them (all the examples I have seen use a handful of named columns).
What would be the more optimal / more elegant solution here?
OR are DataFrames not suited for my data at all? what should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This will fill all zeros in your dataframe (except for zeros occurring at the start of a column) with the previous non-zero value, per column.
If you want to do it row-wise, the .replace() function unfortunately does not accept an axis argument. But you can transpose your dataframe, replace the zeros, and transpose it back: df.T.replace(0, method='ffill').T
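A quick check of the row-wise version on the example row from the question; note that newer pandas releases deprecate the method argument of replace(), so the mask/ffill spelling in the second print is a possible alternative (both lines are a sketch, assuming integer data):

import pandas as pd

df = pd.DataFrame([[0, 0, 1, 0, 0, 3, 0, 0, 2, 1]],
                  columns=[str(y) for y in range(1900, 1910)])
print(df.T.replace(0, method='ffill').T.iloc[0].tolist())
# [0, 0, 1, 1, 1, 3, 3, 3, 2, 1]
# Same result without replace(): treat zeros as missing, forward-fill along
# each row, then restore the leading zeros.
print(df.mask(df == 0).ffill(axis=1).fillna(0).astype(int).iloc[0].tolist())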

Assign value to dataframe from another dataframe based on two conditions

I am trying to assign values from column df2['values'] to column df1['values']. However, values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should avoid using for-loops when working with dataframes. List comprehensions are great, but since I do not have a lot of experience yet, I struggle to formulate the more complicated code.
How can I work through this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?
The code above also tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
date category
0 2015-01-07 f2
1 2015-01-26 f2
2 2015-01-26 f2
3 2015-04-08 f2
4 2015-04-10 f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
values category
0 01 f1
1 02 f1
2 2.1 f1
3 2.2 f1
4 03 f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import numpy as np
import pandas as pd

df3 = df1.merge(df2, on="category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns, per OP's comment, we instead use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
common_dates = df1.merge(subset, on="date", how="left")  # keeping df1's values
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
N.B.: If more than one df1["date"] matches a date range, you'll have to drop some rows first, otherwise the duplicates break the row alignment described above.
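As a hedged sketch of that cleanup, to be run before the common_dates merge above (it arbitrarily keeps the first candidate value per date, so the merge returns exactly one row per df1 row):

# Keep a single candidate value per date before merging back onto df1.
subset = subset.drop_duplicates(subset="date")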
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1[date] in the merged dataframe that are not contained in df2[date_range]. Unfortunately, I would need more information about the content of df1[date] and df2[date_range] to write code here that would do exactly that.
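Still, as a hedged illustration of that join-then-filter idea, assuming each date_range cell holds a DatetimeIndex as shown in the question's output:

# Join on category, then keep only the rows whose date falls inside the range.
merged = df1.merge(df2, on="category")
keep = [date in dr for date, dr in zip(merged["date"], merged["date_range"])]
result = merged.loc[keep]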

beginner pandas: change row data based upon code

I'm a beginner in pandas and Python, trying to learn them.
I would like to iterate over pandas rows to apply simple coded logic.
Instead of fancy mapping functions, I just want simple coded logic,
so that I can easily adapt it later for other coded rules as well.
In my dataframe dc,
I'd like to check whether column AgeUnknown == 1 (or > 0),
and if so, it should move the value of column Age to AgeUnknown,
and then set Age to 0.0.
I tried various combinations of the code below, but it won't work.
# using a row reference #########
for index, row in dc.iterrows():
    r = row['AgeUnknown']
    if (r > 0):
        w = dc.at[index, 'Age']
        dc.at[index, 'AgeUnknown'] = w
        dc.at[index, 'Age'] = 0
Another attempt:
for index in dc.index:
    r = dc.at[index, 'AgeUnknown'].[0]  # also tried .sum here
    if (r > 0):
        w = dc.at[index, 'Age']
        dc.at[index, 'AgeUnknown'] = w
        dc.at[index, 'Age'] = 0
I also tried:
if (dc[index, 'Age'] > 0  # wasn't allowed either
Why isn't this working? As far as I understood, a dataframe should be able to be addressed like the above.
I realize you requested a solution involving iterating the df, but I thought I'd provide one that I think is more conventional.
A non-iterating solution to your problem goes something like this: 1) get all the indexes that meet your criteria, 2) set the values of the df at those indexes to what you want.
# indexes where column AgeUnknown is >0
inds = dc[dc['AgeUnknown'] > 0].index.tolist()
# copy the Age column into AgeUnknown at those indexes
dc.loc[inds, 'AgeUnknown'] = dc.loc[inds, 'Age']
# change the Age to 0 at those indexes
dc.loc[inds, 'Age'] = 0
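The same update also works with a boolean mask directly, skipping the tolist() round-trip; a minimal sketch:

# Boolean mask of the rows where AgeUnknown is > 0
mask = dc['AgeUnknown'] > 0
dc.loc[mask, 'AgeUnknown'] = dc.loc[mask, 'Age']
dc.loc[mask, 'Age'] = 0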
