In Python, I have a pandas DataFrame (df) that can be replicated with the code below.
import pandas as pd
data = [['2021-09-12', 'item1', 'IL', 5], ['2021-09-12', 'item2', 'CA', 7], ['2021-08-13', 'item2', 'CA', 8], ['2021-06-12', 'item3', 'NY', 10], ['2021-05-01', 'item1', 'IL', 11]]
df = pd.DataFrame(data, columns = ['date', 'product', 'state', 'sales'])
I also have two strings.
startdate = '2021-08-01'
enddate = '2021-09-12'
I am trying to group by product and state, and add a column df['sum_sales'] that sums up df['sales'] when df['date'] is between startdate and enddate.
I tried df.groupby(['product', 'state']), but I am not sure how to add the condition above.
You can use loc, between, and groupby.sum():
between returns a Boolean Series, True wherever the value falls between your two dates.
loc filters the DataFrame down to the rows where that Boolean Series is True.
groupby.sum() then returns the sum of sales per group.
startdate = '2021-08-01'
enddate = '2021-09-12'
>>> df.loc[df.date.between(startdate, enddate)].groupby(['product', 'state'])['sales'].sum()
product  state
item1    IL        5
item2    CA       15
Name: sales, dtype: int64
Note that your date column is of dtype object (strings) given the way you define your inputs; the comparison works here only because ISO-formatted date strings sort lexicographically.
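A minimal sketch of the safer route, parsing the strings into real datetimes first (assuming the same df, startdate, and enddate as above):
df['date'] = pd.to_datetime(df['date'])  # object -> datetime64
out = (df.loc[df['date'].between(startdate, enddate)]
         .groupby(['product', 'state'])['sales']
         .sum())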
I want to left join df2 onto df1 and keep the row that matches by group; if there is no matching group, I would like to keep the first row of the group instead, in order to achieve df3 (the desired result). I was hoping you could help me find the optimal solution.
Here is my code to create the two dataframes and the required result.
import pandas as pd
import numpy as np
market = ['SP', 'SP', 'SP']
underlying = ['TSLA', 'GOOG', 'MSFT']
# DF1
df = pd.DataFrame(list(zip(market, underlying)),
                  columns=['market', 'underlying'])
market2 = ['SP', 'SP', 'SP', 'SP', 'SP']
underlying2 = [None, 'TSLA', 'GBX', 'GBM', 'GBS']
client2 = [17, 12, 100, 21, 10]
# DF2
df2 = pd.DataFrame(list(zip(market2, underlying2, client2)),
                   columns=['market', 'underlying', 'client'])
market3 = ['SP', 'SP', 'SP']
underlying3 = ['TSLA', 'GOOG', 'MSFT']
client3 = [12, 17, 17]
# Desired
df3 = pd.DataFrame(list(zip(market3, underlying3, client3)),
                   columns=['market', 'underlying', 'client'])
# This works but feels suboptimal
df3 = pd.merge(df, df2,
               how='left',
               on=['market', 'underlying'])
df3 = pd.merge(df3, df2,
               how='left',
               on=['market'])
df3 = df3.drop_duplicates(['market', 'underlying_x'])
df3['client'] = df3['client_x'].combine_first(df3['client_y'])
df3 = df3.drop(labels=['underlying_y', 'client_x', 'client_y'], axis=1)
df3 = df3.rename(columns={'underlying_x': 'underlying'})
Hope you can help, thank you so much!
Store the first value (a groupby might not be necessary if every value in market is 'SP'), then merge and fill the missing clients with that first value:
fill_value = df2.groupby('market').client.first()
# if you are interested in filtering for None:
fill_value = df2.set_index('market').loc[lambda df: df.underlying.isna(), 'client']
(df
 .merge(df2,
        on=['market', 'underlying'],
        how='left')
 .set_index('market')
 .fillna({'client': fill_value}, downcast='infer')
)
       underlying  client
market
SP           TSLA      12
SP           GOOG      17
SP           MSFT      17
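If your pandas version warns that the downcast argument of fillna is deprecated, a sketch of an equivalent fill (same fill_value as above; index alignment on market does the matching):
out = df.merge(df2, on=['market', 'underlying'], how='left').set_index('market')
# fill_value is a Series indexed by market, so fillna aligns it against out's index
out['client'] = out['client'].fillna(fill_value).astype(int)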
I have two DataFrames. One has months 1-5 and a value for each month (the same for every ID); the other has an ID and a unique multiplier, e.g.:
import pandas as pd

data = [['m', 10], ['a', 15], ['c', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Unique'])
data2 = [[1, 0.2], [2, 0.3], [3, 0.01], [4, 0.5], [5, 0.04]]
df2 = pd.DataFrame(data2, columns=['Month', 'Value'])
I want to compute sum(Value / (1 + Unique)^(Month / 12)). E.g. for ID m, I want to compute Value / (1 + 10)^(Month / 12) for every row in df2, and sum the results. I wrote a for loop to do this, but since my real table has 277,000 entries it takes too long!
df['baseTotal'] = 0.0
for i in df.index.unique():
    base_total = 0.0
    for m in df2['Month'].unique():
        value = df2.loc[df2['Month'] == m, 'Value'].iloc[0]
        base_total += value / pow(1 + df.loc[i, 'Unique'], m / 12.0)
    df.loc[i, 'baseTotal'] = base_total
Is there a more efficient way to do this?
df['Unique'].apply(lambda x: (df2['Value']/((1+x) ** (df2['Month']/12))).sum())
0 0.609983
1 0.563753
2 0.571392
Name: Unique, dtype: float64
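For 277,000 rows, a fully vectorized sketch using NumPy broadcasting may be faster still (assuming the same df and df2 as above; baseTotal is a hypothetical name for the result column):
import numpy as np

# (n_ids, 1) raised to an (n_months,) exponent broadcasts to an (n_ids, n_months) grid
rates = (1 + df['Unique'].to_numpy())[:, None] ** (df2['Month'].to_numpy() / 12)
# divide each month's Value by the matching rate, then sum across months per ID
df['baseTotal'] = (df2['Value'].to_numpy() / rates).sum(axis=1)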
I am not sure of the best way to title this. I have a DataFrame in which one of the columns, let's call it 'Tags', may or may not contain a list. If 'Tags' is a list, I want to replicate that row as many times as there are unique items in 'Tags', replacing the 'Tags' value in each new row with one of those unique items.
Example:
import pandas as pd
# create dummy dataframe
df = {'Date': ['2020-10-28'],
      'Item': 'My_fake_item',
      'Tags': [['A', 'B']],
      'Count': 3}
df = pd.DataFrame(df, columns=['Date', 'Item', 'Tags', 'Count'])
This results in a single row whose Tags value is the list ['A', 'B']. I need a function that changes the DataFrame so that there is one row per tag, with all other columns duplicated.
Apply the explode method, for example:
df_exploded = (
    df.set_index(["Date", "Item", "Count"])
      .apply(pd.Series.explode)
      .reset_index()
)
will result in
df_exploded
>>>
         Date          Item  Count Tags
0  2020-10-28  My_fake_item      3    A
1  2020-10-28  My_fake_item      3    B
and there's no need to check whether each element in the column is a list. For example:
import pandas as pd
# create dummy dataframe
df = {'Date': ['2020-10-28', '2020-11-01'],
      'Item': ['My_fake_item', 'My_other_item'],
      'Tags': [['A', 'B'], 'C'],
      'Count': [3, 5]}
df = pd.DataFrame(df, columns=['Date', 'Item', 'Tags', 'Count'])
will result in
         Date           Item  Count Tags
0  2020-10-28   My_fake_item      3    A
1  2020-10-28   My_fake_item      3    B
2  2020-11-01  My_other_item      5    C
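A simpler alternative when only one column needs exploding: DataFrame.explode handles scalar entries like 'C' by passing them through unchanged (the ignore_index argument needs pandas >= 1.1):
# explode only the Tags column; non-list entries are left as-is
df_exploded = df.explode('Tags', ignore_index=True)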
Suppose I have a DataFrame like this. The index of this DataFrame is already a MultiIndex, date/id.
Column N tells how many periods back the price information is from. How could I turn column 'N' into a level of a column MultiIndex?
In this example, suppose column N has two unique values [0, 1]; the final result would have 6 columns and should look like [0/priceClose] [0/priceLocal] [0/priceUSD] [1/priceClose] [1/priceLocal] [1/priceUSD].
I finally found the following method works:
step 1: melt
step 2: pivot
# assuming date/id have been moved back into columns first (df = df.reset_index())
df = pd.melt(df, id_vars=['date', 'id', 'N'],
             value_vars=[p for p in df if p.startswith('price')],
             value_name='price')
df = pd.pivot_table(df, values='price', index=['date', 'id'],
                    columns=['variable', 'N'], aggfunc='max')
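Alternatively, a one-step sketch starting from the original df (with date/id still in the index); swaplevel puts N on the outer column level:
price_cols = [c for c in df if c.startswith('price')]
out = (df.reset_index()
         .pivot_table(index=['date', 'id'], columns='N', values=price_cols)
         .swaplevel(axis=1)
         .sort_index(axis=1))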
I have two DataFrames, let's say of the same shape, and I need to compare each cell of one against the other. If two cells mismatch, or one value is null, I have to write the bigger DataFrame to Excel with the mismatched or null cells highlighted.
I calculated the difference of the two DataFrames as another DataFrame of Boolean values.
import pandas as pd
import numpy as np

data1 = [['tom', 10], ['nick', 15], ['juli', 14]]
data2 = [['tom', 10], ['sam', 15], ['juli', 14]]
# Create the pandas DataFrames
df1 = pd.DataFrame(data1, columns=['Name', 'Age'])
df2 = pd.DataFrame(data2, columns=['Name', 'Age'])
# (in my real code df2 is read from Excel instead:)
# df2 = pd.read_excel(excel_file, sheet_name='res', header=None)
df1.replace(r'^\s*$', np.nan, regex=True, inplace=True)
df2.replace(r'^\s*$', np.nan, regex=True, inplace=True)
df1.fillna(0, inplace=True)
df2.fillna(0, inplace=True)
difference = df1 == df2  # Boolean: True where values match, False on mismatch or null
Now I want to write df1 with cells highlighted according to difference: e.g. if a cell of difference is False, I want to highlight the corresponding cell of df1 in yellow, and then write the whole highlighted df1 to Excel.
Here are df1 and df2; I want the result below as the final answer, where nick is highlighted with a background color.
I already tried pandas Styler.applymap and Styler.apply, but with no success, since two DataFrames are involved. Maybe I am not thinking about this problem straight.
df1:
df2:
You can do something like this:
def myfunc(x):
    c1 = ''
    c2 = 'background-color: red'
    condition = x.eq(df2)
    res = pd.DataFrame(np.where(condition, c1, c2),
                       index=x.index, columns=x.columns)
    return res

df1.style.apply(myfunc, axis=None)
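To get the highlighted result into Excel, the Styler itself can be written out (a minimal sketch; 'result.xlsx' is a hypothetical output path, and the openpyxl engine must be installed):
styled = df1.style.apply(myfunc, axis=None)
styled.to_excel('result.xlsx', engine='openpyxl', index=False)  # keeps the cell colors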