Inserting Values into a Multilevel Index Dataframe - python

I have created a dataframe as shown:
idx = pd.MultiIndex.from_product([['batch1', 'batch2','batch3', 'batch4', 'batch5'], ['quiz1', 'quiz2']])
cols=['noofpresent', 'lesserthan50', 'between50and60', 'between60and70', 'between70and80', 'greaterthan80']
statdf = pd.DataFrame('-', idx, cols)
statdf
statdf.loc['quiz1', 'noofpresent'] = qdf1.b4ispresent.count()
statdf.loc['quiz2', 'noofpresent'] = qdf2.b4ispresent.count()
statdf.noopresent = qdf1.b4ispresent.count()
statdf.noopresent = qdf2.b4ispresent.count()
statdf
Then I made some calculations. I now want to append that specific calculation of the figures '50' and '53' in column 'noofpresent' in 'batch4', 'quiz1' and 'quiz2' respectively. But instead this happened...
How can I insert my data into the right place?

you can index it like this.
statdf.loc['batch4','quiz1']['noofpresent'] = qdf1.b4ispresent.count()
statdf.loc['batch4','quiz2']['noofpresent'] =qdf2.b4ispresent.count()

Related

Inserting rows in df based on groupby using value of previous row

I need to insert rows based on the column week based on the groupby type, in some cases i have missing weeks in the middle of the dataframe at different positions and i want to insert rows to fill in the missing rows as copies of the last existing row, in this case copies of week 7 to fill in the weeks 8 and 9 and copies of week 11 to fill in rows for week 12, 13 and 14 : on this table you can see the jump from week 7 to 10 and from 11 to 15:
the perfect output would be as follow: the final table with incremental values in column week the correct way :
Below is the code i have, it inserts only one row and im confused why:
def middle_values(final : DataFrame) -> DataFrame:
finaltemp= pd.DataFrame()
out= pd.DataFrame()
for i in range(0, len(final)):
for f in range(1, 52 , 1):
if final.iat[i,8]== f and final.iat[i-1,8] != f-1 :
if final.iat[i,8] > final.iat[i-1,8] and final.iat[i,8] != (final.iat[i-1,8] - 1):
line = final.iloc[i-1]
c1 = final[0:i]
c2 = final[i:]
c1.loc[i]=line
concatinated = pd.concat([c1, c2])
concatinated.reset_index(inplace=True)
concatinated.iat[i,11] = concatinated.iat[i-1,11]
concatinated.iat[i,9]= f-1
finaltemp = finaltemp.append(concatinated)
if 'type' in finaltemp.columns:
for name, groups in finaltemp.groupby(["type"]):
weeks = range(groups['week'].min(), groups['week'].max()+1)
out = out.append(pd.merge(finaltemp, pd.Series(weeks, name='week'), how='right').ffill())
out.drop_duplicates(subset=['project', 'week'], keep = 'first', inplace=True)
out.drop_duplicates(inplace = True)
out.sort_values(["Budget: Budget Name", "Budget Week"], ascending = (False, True), inplace=True)
out.drop(['level_0'], axis = 1, inplace=True)
out.reset_index(inplace=True)
out.drop(['level_0'], axis = 1, inplace=True)
return out
else :
return final
For the first part of your question. Suppose we have a dataframe like the following:
df = DataFrame({"project":[1,1,1,2,2,2], "week":[1,3,4,1,2,4], "value":[12,22,18,17,18,23]})
We can create a new multi index to get the additional rows that we need
new_index = pd.MultiIndex.from_arrays([sorted([i for i in df['project'].unique()]*52),
[i for i in np.arange(1,53,1)]*df['project'].unique().shape[0]], names=['project', 'week'])
We can then apply this index to get the new dataframe that you need with blanks in the new rows
df = df.set_index(['project', 'week']).reindex(new_index).reset_index().sort_values(['project', 'week'])
You would then need to apply a forward fill (using ffill) or a back fill (using bfill) with groupby and transform to get the required values in the rows that you need.

In pandas, how do I insert a new row into a dataframe one column value at a time

I want to insert a new row into my dataframe, one value at a time, so I know exactly which values going into which column, don't judge me.
Here is what I have but printing it show empty dataframe. I am checking if date already exist to insert a new row or get the existing row for that date.
if(trade["date"] in self.df["date"]):
row = self.df[self.df.date == trade["date"]]
else:
row = self.df.append(pd.Series(), ignore_index=True)
row["date"] = trade["date"]
row["direction"] = trade["direction"]
row["type"] = trade["type"]
row["strategy"] = trade["strategy"]
row["strike"] = trade["strike"]
row["shortLeg"] = trade["shortLeg"]
row["longLeg"] = trade["longLeg"]
row["shortLeg_strike"] = trade["shortLeg_strike"]
row["longLeg_strike"] = trade["longLeg_strike"]
row["maxRisk"] = trade["maxRisk"]
row["maxReturn"] = trade["maxReturn"]
row["returnRatio"] = trade["returnRatio"]
row["breakevenPrice"] = trade["breakevenPrice"]
row["profitTargetPrice"] = trade["profitTargetPrice"]
print(self.df)
Oh figure it out, stupid mistake, the row is not linked to the original dataframe.
self.df = row

Counting combinations in Dataframe create new Dataframe

So I have a dataframe called reactions_drugs
and I want to create a table called new_r_d where I keep track of how often a see a symptom for a given medication like
Here is the code I have but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0"
new_r_d = pd.DataFrame(columns = ['drugname', 'reaction', 'count']
for i in range(len(reactions_drugs)):
name = reactions_drugs.drugname[i]
drug_rec_act = reactions_drugs.drug_rec_act[i]
for rec in drug_rec_act:
row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
if row == []:
# create new row
new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
else:
new_r_d.at[row,'count'] += 1
Assuming the rows in your current reactions (drug_rec_act) column contain one string enclosed in a list, you can convert the values in that column to lists of strings (by splitting each string on the comma delimiter) and then utilize the explode() function and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')

Looping through Pandas data frame for a calculation based off the row values in two different list

I have been able to get the calculation to work but now I am having trouble appending the results back into the data frame e3. You can see from the picture that the values are printing out.
brand_list = list(e3["Brand Name"])
product_segment_list = list(e3['Product Segment'])
# Create a list of tuples: data
data = list(zip(brand_list, product_segment_list))
for i in data:
step1 = e3.loc[(e3['Brand Name']==i[0]) & (e3['Product Segment']==i[1])]
Delta_Price = (step1['Price'].diff(1).div(step1['Price'].shift(1),axis=0).mul(100.0))
print(Delta_Price)
it's easier to use groupby. In each loop 'r' will be just the grouped rows from e3 dataframe from each category and i an index.
new_df = []
for i,r in e3.groupby(['Brand Name','Product Segment']):
price_num = r["Price"].diff(1).values
price_den = r["Price"].shift(1).values
r['Price Delta'] = price_num/price_den
new_df.append(r)
e3_ = pd.concat(new_df, axis = 1)

Pandas Columns Operations with List

I have a pandas dataframe with two columns, the first one with just a single date ('action_date') and the second one with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill the df new columns with the number of dates in verification_date that have a difference of either over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
df = df
for i in range(len(df)):
over_360 = []
under_360 = []
for w in [(df['action_date'][i]-x).days for x in df['verification_date'][i]]:
if w > 360:
over_360.append(w)
else:
under_360.append(w)
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
return df
make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
what you want to do instead is set the value for each line calculation accordingly, you can do this by replacing the above lines with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
what it does is, it sets a value in line i and column over_360 or under_360.
you can learn more about it here.
If you don't like using set_values you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
you can check dataframe.ix here.
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.

Categories