I am trying to apply changes to a dataframe for values that (to the best of my knowledge) can only be found using groupby. What I want is to find the minimum date value for each company so that I can set several columns (in this case df2['Research and Development Expense Lag'] and df2['Capital Expenditures Lag']) to 0 on that first row. Here is what I have so far, a groupby that returns those minimum date rows for each company:
df2.groupby('Ticker Symbol').apply(
    lambda d: d[d['Data Date'] == d['Data Date'].min()])
You are on the right track. You can get the index values for those rows and then use them with .loc[] to change values in those two columns:
df2.loc[
    df2.groupby('Ticker Symbol')
       .apply(lambda d: d[d['Data Date'] == d['Data Date'].min()])
       .index
       .get_level_values(1),
    ['Research and Development Expense Lag', 'Capital Expenditures Lag']
] = 0
The .get_level_values(1) function serves to extract the second level of the MultiIndex. The first level will contain Ticker Symbol values.
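An alternative that avoids the MultiIndex entirely is to build a boolean mask with groupby().transform('min') and pass it straight to .loc. This is only a sketch, assuming the same column names as above:

# True for each row that holds its company's earliest Data Date
is_first = df2['Data Date'] == df2.groupby('Ticker Symbol')['Data Date'].transform('min')
df2.loc[is_first, ['Research and Development Expense Lag', 'Capital Expenditures Lag']] = 0

transform returns a Series aligned with df2, so the comparison marks every row that carries a company's earliest date.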
Related
I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values from the value column.
The data looks something like this:
time value date
0 12.850000 19.195359 08-22-2019
1 9.733333 13.519543 09-19-2019
2 14.083333 9.191413 08-26-2019
3 16.616667 18.346598 08-19-2019
...
Where every date can occur multiple times, recording values at different points
during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue
comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda df: df[ df.time == df.time.min() ])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019 13344 17.573522
08-20-2019 12798 19.496609
08-21-2019 2009 20.033917
08-22-2019 5231 19.393700
08-23-2019 12848 17.784213
08-26-2019 417 9.717627
08-27-2019 6318 7.630234
I figured out how to clean up those nasty indexes to the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately my question is if there is a nicer way to perform these data manipulations that follow the general pattern of: "Group by column A, perform filtering or sorting operation on column B, grab resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
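If you instead need the last reading of each day, the same idea works with .tail(1):

last_of_day = df.sort_values('time').groupby('date').tail(1)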
This should work for you:
df.sort_values('time').groupby(['date'])['value'].agg([('Min' , 'min'), ('Max', 'max')])
For this small example, the result is a data frame indexed by date with a Min and a Max column.
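To cover the general pattern of "group by column A, sort by column B, grab from column C" in one statement, the two suggestions can also be combined with named aggregation. This is a sketch assuming pandas 0.25 or later; 'open' is just a label chosen here for the value at each day's earliest time:

# one row per date: value at the earliest time, plus the daily min and max
summary = (df.sort_values('time')
             .groupby('date')['value']
             .agg(open='first', min='min', max='max'))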
I am trying to group by the items of a list stored in a DataFrame column. The dataset being used is the Stack Overflow 2020 Survey.
The layout is roughly as follows:
... LanguageWorkedWith ... ConvertedComp ...
Respondent
1 Python;C 50000
2 C++;C 70000
I essentially want to use groupby on the unique values in the list of languages worked with and apply a mean aggregator function to ConvertedComp, like so:
LanguageWorkedWith
C++ 70000
C 60000
Python 50000
I have actually managed to achieve the desired output, but my solution seems somewhat janky, and being new to Pandas, I believe that there is probably a better way.
My solution is as follows:
# read csv
sos = pd.read_csv("developer_survey_2020/survey_results_public.csv", index_col='Respondent')
# separate each string into a list of strings, disregarding unanswered responses
temp = sos["LanguageWorkedWith"].dropna().str.split(';')
# create new DataFrame with respondent index and rows populated with known languages
langs_known = pd.DataFrame(temp.tolist(), index=temp.index)
# stack columns as rows, dropping old column names
stacked_responses = langs_known.stack().reset_index(level=1, drop=True)
# Re-indexing sos DataFrame to match stacked_responses dimension
# Concatenate reindex series to ConvertedComp series columnwise
reindexed_pays = sos["ConvertedComp"].reindex(stacked_responses.index)
stacked_with_pay = pd.concat([stacked_responses, reindexed_pays], axis='columns')
# Remove rows with no salary data
# Renaming columns
stacked_with_pay.dropna(how='any', inplace=True)
stacked_with_pay.columns = ["LWW", "Salary"]
# Group by LWW and apply median
lang_ave_pay = stacked_with_pay.groupby("LWW")["Salary"].median().sort_values(ascending=False).head()
Output:
LWW
Perl 76131.5
Scala 75669.0
Go 74034.0
Rust 74000.0
Ruby 71093.0
Name: Salary, dtype: float64
which matches the value calculated when choosing a specific language: sos.loc[sos["LanguageWorkedWith"].str.contains('Perl').fillna(False), "ConvertedComp"].median()
Any tips on how to improve/functions that provide this functionality/etc would be appreciated!
Take a data frame containing only the target columns, split the language names apart, and combine them with the salary. The next step is to convert the data from wide format to long format using melt. Then group by language name to get the median. melt docs
lww = sos[["LanguageWorkedWith","ConvertedComp"]]
lwws = pd.concat([lww['ConvertedComp'], lww['LanguageWorkedWith'].str.split(';', expand=True)], axis=1)
lwws.reset_index(drop=True, inplace=True)
df_long = pd.melt(lwws, id_vars='ConvertedComp', value_vars=lwws.columns[1:], var_name='lang', value_name='lang_name')
df_long.groupby('lang_name')['ConvertedComp'].median().sort_values(ascending=False).head()
lang_name
Perl 76131.5
Scala 75669.0
Go 74034.0
Rust 74000.0
Ruby 71093.0
Name: ConvertedComp, dtype: float64
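If your pandas version is 0.25 or later, Series.explode shortens this considerably. A sketch, assuming the same sos data frame as above:

# split the semicolon-separated languages into lists, then give each language its own row
exploded = sos.assign(lang=sos['LanguageWorkedWith'].str.split(';')).explode('lang')
exploded.groupby('lang')['ConvertedComp'].median().sort_values(ascending=False).head()

Rows with no languages listed explode to a single NaN entry and are dropped by groupby automatically.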
I have two data sets
df1 = pd.DataFrame ({"skuid" :("A","B","C","D"), "price": (0,0,0,0)})
df2 = pd.DataFrame ({"skuid" :("A","B","C","D"),"salesprice" :(10,0,0,30),"regularprice" : (9,10,0,2)})
I want to insert salesprice and regularprice into price with these conditions:
If df1's skuid matches df2's skuid and df2's salesprice is not zero, use salesprice as the price value. If the skuids match and df2's salesprice is zero, use regularprice. Otherwise, use zero as the price value.
def pric(df1, df2):
    if (df1['skuid'] == df2['skuid'] and salesprice != 0):
        price = salesprice
    elif (df1['skuid'] == df2['skuid'] and regularprice != 0):
        price = regularprice
    else:
        price = 0
I made a function with similar conditions but it's not working. The result should look like this in df1:
skuid price
A 10
B 10
C 0
D 30
Thanks.
So there are a number of issues with the function given above. Here are a few in no particular order:
Indentation in python matters https://docs.python.org/2.0/ref/indentation.html
Vectorized functions versus loops. The function you give looks vaguely like it expects to be applied on a vectorized basis, but plain python doesn't work like that. You need to loop through the rows you want to look at (https://wiki.python.org/moin/ForLoop). While pandas does support column transformations that work without explicit loops, they need to be invoked specifically (here's some documentation for one instance of such functionality: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html).
Relatedly, on accessing and indexing dataframe elements, see: Indexing Pandas data frames: integer rows, named columns
Return: if you want your python function to give you a result, you should have it return the value. Not all programming languages require this (julia, for example), but in python you must.
Generality. This isn't strictly necessary in a one-off application, but your function is vulnerable to breaking if you change, for example, the column names in the dataframe. It is better practice to allow the user to give the relevant names in the input, for this reason and for simple flexibility.
Here is a version of your function that was more or less minimally changed to fix the specific issues above:
import pandas as pd

df1 = pd.DataFrame({"skuid": ("A","B","C","D"), "price": (0,0,0,0)})
df2 = pd.DataFrame({"skuid": ("A","B","C","D"), "salesprice": (10,0,0,30), "regularprice": (9,10,0,2)})

def pric(df1, df2, id_colname, df1_price_colname, df2_salesprice_colname, df2_regularprice_colname):
    for i in range(df1.shape[0]):
        for j in range(df2.shape[0]):
            if (df1.loc[df1.index[i], id_colname] == df2.loc[df2.index[j], id_colname]
                    and df2.loc[df2.index[j], df2_salesprice_colname] != 0):
                df1.loc[df1.index[i], df1_price_colname] = df2.loc[df2.index[j], df2_salesprice_colname]
                break
            elif (df1.loc[df1.index[i], id_colname] == df2.loc[df2.index[j], id_colname]
                    and df2.loc[df2.index[j], df2_regularprice_colname] != 0):
                df1.loc[df1.index[i], df1_price_colname] = df2.loc[df2.index[j], df2_regularprice_colname]
                break
    return df1
for which entering
df1_imputed=pric(df1,df2,'skuid','price','salesprice','regularprice')
print(df1_imputed['price'])
gives
0 10
1 10
2 0
3 30
Name: price, dtype: int64
Notice how the function loops through row indices before checking equality conditions on specific elements given by a row-index / column pair.
A few things to consider:
Why does the code loop through df1 "above" the loop through df2? Relatedly, what purpose does the break condition serve?
Why was the else condition omitted?
What is 'df1.loc[df1.index[i],id_colname]' all about? (hint: check one of the above links)
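For completeness, here is a vectorized sketch of the same matching logic, using merge and numpy.where instead of the nested loops. It assumes each skuid appears at most once in df2, which holds for the example data:

import numpy as np

# align df2's prices with df1's rows by skuid
merged = df1[['skuid']].merge(df2, on='skuid', how='left')
# prefer salesprice, fall back to regularprice, otherwise 0
df1['price'] = np.where(merged['salesprice'].fillna(0) != 0, merged['salesprice'],
               np.where(merged['regularprice'].fillna(0) != 0, merged['regularprice'], 0))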
I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month ProductID(SKU) Family Sales ProporcionVenta
1 1234 FISH 10000.0 0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month in relation to the sum of sales of family-month. For example, the family fish has sold 100,000 in month 1, so in this specific case it would be calculated 10,000/100,000 (productid-month-sales/family-month-sales)
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family) & (testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family) & (testingAgain['Month']==month) & (testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth / salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku) & (testingAgain['Family']==family) & (testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code runs, and I have even printed the proportions individually and checked them in Excel; they are correct. The problem is with the last line: as soon as the code finishes running, I print testingAgain and see all proportions still listed as 0.0, even though they should have been assigned the new values.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (and even Numpy), unlike general-purpose Python, analysts should avoid using for loops, as there are many vectorized options for conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs indicate, values broadcast to match the shape of the input array.
Currently, your code attempts to assign a value to a subsetted slice of a data frame column, which should raise SettingWithCopyWarning. Such an operation does not affect the original data frame. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping altogether, since transform works nicely for assigning new data frame columns. Also, div below is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                  )
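If every SKU appears in only one row per Family and Month (as the aggregated data described in the question suggests), the numerator simplifies to the row's own Qty, so a single transform is enough. A sketch under that assumption:

# each row's Qty divided by the total Qty of its family in that month
testingAgain['ProporcionVenta'] = (testingAgain['Qty']
                                   / testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))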
I have 2 Dataframes:
df_Billed = pd.DataFrame({'Bill_Number': [220119, 220120, 220219, 220219, 220419, 220519, 220619, 221219],
                          'Date': ['1/31/2019', '2/20/2020', '2/28/2019', '6/30/2019', '6/30/2019', '6/30/2019', '6/30/2019', '12/31/2019'],
                          'Amount': [3312.5, 832.0, 10000.0, -3312.5, 8725.0, 1862.5, 3637.5, 1587.5]})
df_Received = pd.DataFrame({'Bill_Number': [220119, 220219, 220419, 220519, 220619],
                            'Date': ['4/16/2019', '5/21/2019', '8/2/2019', '8/2/2019', '8/2/2019'],
                            'Amount': [3312.5, 6687.5, 8725, 1862.5, 3637.5]})
I am trying to search for each Bill_Number in df_Billed to see if it is present in df_Received. Ideally, if it is present, I would like to take the difference between the dates in df_Billed and df_Received for that particular bill number (to see how many days it took to get paid). If the billing number is not present in df_Received, I would like to simply return all rows for that billing number in df_Billed.
EX: Since df_Billed Bill_Number 220119 is in df_Received, it would return 75 (which is the number of days it took for the bill to be paid 4/16/2019 - 1/31/2019).
EX: Since df_Billed Bill_Number 221219 is not in df_Received, it would return 12/31/2019 (which is the date it was billed).
You could probably use merge on Bill_Number initially
df_Billed=df_Billed.merge(df_Received,on='Bill_Number',how='left')
Then use apply and pandas.to_datetime to compute the difference between dates:
df_Billed['result'] = df_Billed.apply(lambda x: x.Date_x if pd.isnull(x.Date_y)
                                      else abs(pd.to_datetime(x.Date_x) - pd.to_datetime(x.Date_y)).days,
                                      axis=1)
And finally, I think you want the final result in its own column, so below I drop the merged Date_y and Amount_y columns and rename Date_x and Amount_x back to Date and Amount:
df_Billed.drop(['Date_y','Amount_y'],axis=1,inplace=True)
df_Billed.rename(columns={"Date_x": "Date","Amount_x":"Amount"},inplace=True)
The final data frame keeps the original Bill_Number, Date, and Amount columns plus the new result column, which holds the number of days to payment for bills found in df_Received and the billing date for those that are not.
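For larger data, a vectorized sketch of the same idea (starting again from the original df_Billed and df_Received, and parsing the dates once up front) avoids the row-wise apply; like the answer above, it produces a result column that mixes day counts and dates:

import numpy as np

# keep df_Billed's column names; suffix the received columns instead
merged = df_Billed.merge(df_Received, on='Bill_Number', how='left', suffixes=('', '_paid'))
billed_date = pd.to_datetime(merged['Date'])
paid_date = pd.to_datetime(merged['Date_paid'])
# days to payment where a match exists, otherwise the original billing date
merged['result'] = np.where(paid_date.isna(), merged['Date'],
                            (paid_date - billed_date).dt.days.abs())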