I have a very large dataframe that is multiindexed as ('ID','Date'); the column 'Value' tracks an individual's progress in time using a boolean.
I know that each individual starts and ends with Value = True. I've been able to locate the date for the first occurrence of False using df.loc[~df['Value'], :], but what I want to be able to do is locate the date when they switched back to True after one or more periods of False. I've tried using variations on .groupby().diff() but this is extremely slow.
Example: I want to extract "7-22-19" for individual A, below:
ID  Date      Value
A   1-30-19   True
A   3-15-19   False
A   4-1-19    False
A   7-22-19   True
A   11-13-19  True
B   2-1-19    True
etc.
As an extra caveat, a solution that is both fast (my dataframe has hundreds of thousands of IDs, so no loops, and .groupby().diff() seems to be slow) and that works with non-boolean values would be ideal (e.g. if we replace True/False with Drug X/Drug Y).
Thank you!
shift is a nice tool to detect transitions in a column. So you could find transitions from False to True within the same ID with:
df.loc[df['Value'] & (~df['Value']).shift() & (df['ID'] == df['ID'].shift())]
With your data, it gives as expected:
  ID     Date  Value
3  A  7-22-19   True
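For reference, here is a minimal self-contained sketch of the same idea using a groupby-aware shift, which handles the ID boundary for you and also covers the non-boolean caveat. The sample data is reconstructed from the question; if 'ID'/'Date' really are a MultiIndex, call reset_index() first:

import pandas as pd

df = pd.DataFrame({
    'ID':    ['A', 'A', 'A', 'A', 'A', 'B'],
    'Date':  ['1-30-19', '3-15-19', '4-1-19', '7-22-19', '11-13-19', '2-1-19'],
    'Value': [True, False, False, True, True, True],
})

# Previous Value within the same ID (NaN at each ID's first row).
prev = df.groupby('ID')['Value'].shift()

# True now, False immediately before -> the switch back to True.
print(df[df['Value'] & prev.eq(False)])   # row with Date 7-22-19

# Non-boolean variant (e.g. 'Drug X' / 'Drug Y'): any change within an ID.
changes = df[df['Value'].ne(prev) & prev.notna()]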
I am trying to assign a proportion value to a column in a specific row of my dataframe (called testingAgain). Each row represents a unique product's sales in a specific month, like this:
Month  ProductID(SKU)  Family  Sales    ProporcionVenta
1      1234            FISH    10000.0  0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month relative to the sum of sales of its family-month. For example, the family FISH sold 100,000 in month 1, so in this specific case it would be calculated as 10,000/100,000 (productid-month sales / family-month sales).
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family) &
                                        (testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family) &
                                         (testingAgain['Month']==month) &
                                         (testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku) &
                         (testingAgain['Family']==family) &
                         (testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even printed the proportions individually and recalculated them in Excel, and they are correct, but the problem is with the last line: as soon as the code finishes running, I print testingAgain and see all proportions still listed as 0.0, even though the new values should have been assigned.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (and even NumPy), unlike general-purpose Python, analysts should avoid for loops, as there are many vectorized options for conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs put it, values broadcast to match the shape of the input array.
Currently, your code is attempting to assign a value to a subsetted slice of a dataframe column, which should raise SettingWithCopyWarning. Such an operation does not affect the original dataframe. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping altogether, since transform works nicely for assigning new dataframe columns. Also, div below is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (
    testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
    .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
)
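As a quick sanity check, here is the same transform pattern on a tiny made-up frame (the two sibling SKUs and their quantities are invented so the family-month total comes to 100,000):

import pandas as pd

testingAgain = pd.DataFrame({
    'Month':  [1, 1, 1],
    'SKU':    [1234, 5678, 9012],            # hypothetical sibling SKUs
    'Family': ['FISH', 'FISH', 'FISH'],
    'Qty':    [10000.0, 60000.0, 30000.0],
})

testingAgain['ProporcionVenta'] = (
    testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
    .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
)
print(testingAgain)   # SKU 1234 -> 10000 / 100000 = 0.1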
I have a pandas DataFrame with data from an icecream freezer. Several columns describe the different temperatures in the system as well as some other things.
One column, named 'Defrost status', tells me with boolean values when the freezer was defrosting to remove excess ice.
Those 'defrosts' are what I am interested in, so I added another column named "around_defrost". This column currently only has NaN values, but I want to change them to True whenever there is a defrost within 30 minutes of that specific row in the dataframe.
The data is recorded every minute, so 30 minutes means the 30 rows before a defrost and the 30 rows after it need to be set to True.
I have tried to do this with iterrows, itertuples, and by playing with the indexes, but no success so far. If anyone has a good idea of how this could be done, I'd really appreciate it!
You need to use dataframe.rolling:
df = df.sort_values("Time")  # sort by Time
df['around_defrost'] = df['Defrost status'].rolling(60, center=True, min_periods=0).apply(
    lambda x: True if True in x else False, raw=True)
EDIT: you may need rolling(61, center=True), since you want to consider the row in question AND the 30 rows before and after it.
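If apply turns out to be slow on a big frame, a rolling max is an equivalent and usually faster sketch, assuming one row per minute and the frame already sorted by time (61 = 30 rows before + the row itself + 30 rows after):

df['around_defrost'] = (
    df['Defrost status']
      .astype(int)                                  # rolling needs numeric data
      .rolling(61, center=True, min_periods=1)
      .max()                                        # 1 if any defrost in window
      .astype(bool)
)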
I have 2 pandas dataframes, df_pe and df_merged. Both the dataframes have several rows, as well as several columns. Now, there are some specific things I would like to accomplish using these dataframes:
In df_merged, there is a column named ST which contains timestamps of various events, in a format like 2017-08-27 00:00:00. In df_pe, there are 2 columns, Ton and Toff, which contain the times when an event started and ended (e.g. for one row the Ton value is 2018-08-17 01:20:00 and the Toff value is 2018-08-17 02:30:00).
Secondly, there is a column in df_pe, namely EC. I have another dataframe called df_uniqueal, which also has a column called EC. What I would like to do is:
a. For all rows in df_merged, whenever the ST value falls between Ton and Toff in df_pe, create 2 new columns in df_merged: EC and ED. Put the value of EC from df_pe into the new EC column, and put the corresponding value from df_uniqueal into the new ED column (ED is effectively df_pe's EC mapped through df_uniqueal). If no interval matches, or NaNs are left after this procedure, put the string "NF" into df_merged's new ED column and the integer 0 into its new EC column.
I have explored SO and SE, but have not found anything substantial. Any help in this regard is highly appreciated.
This is my attempt at using for loops to iterate over the dataframes to accomplish the first condition, but it runs forever (never ending), and I don't think this is the best possible way to do it.
for i in range(len(df_merged)):
    for j in range(len(df_pe)):
        if df_pe.TOn[j] < df_merged.ST[i] < df_pe.TOff[j]:
            df_merged.EC[i] = df_pe.EC[j]
            df_merged.ED[i] = df_uniqueal.ED[df_pe.EC[j]]
        else:
            df_merged.EC[i] = 0
            df_merged.ED[i] = "NF"
EDIT
Please refer to the image for the expected output and a toy example of the dataframes.
The relevant columns are in bold (note that the column numbers may differ, but the column names are the same in this sample).
If I have understood the question correctly, hopefully this will get you started.
for i, val in df_merged['ST'].items():
    bool_idx = (df_pe['TOn'] < val) & (val < df_pe['TOff'])
    if df_pe.loc[bool_idx, 'EC'].empty:
        df_merged.loc[i, 'EC'] = 0
        df_merged.loc[i, 'ED'] = "NF"
    else:
        # take the first (assumed only) matching interval's EC
        value_from_df_pe = df_pe.loc[bool_idx, 'EC'].iloc[0]
        df_merged.loc[i, 'EC'] = value_from_df_pe
        # look up the ED that df_uniqueal maps this EC to
        value_from_df_uniqueal = df_uniqueal.loc[df_uniqueal['EC'] == value_from_df_pe, 'ED'].iloc[0]
        df_merged.loc[i, 'ED'] = value_from_df_uniqueal
Please note I have not tested this code on any data.
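For a fully vectorized variant you could also try pd.merge_asof, sketched below under the assumptions that the TOn/TOff intervals do not overlap, that ST/TOn/TOff are real datetimes, and that df_uniqueal has one row per EC (frame and column names are taken from the question):

import pandas as pd

# Match each ST to the last interval starting at or before it...
df_pe_sorted = df_pe.sort_values('TOn')
matched = pd.merge_asof(df_merged.sort_values('ST'),
                        df_pe_sorted[['TOn', 'TOff', 'EC']],
                        left_on='ST', right_on='TOn')

# ...then keep the match only if ST also falls before that interval's TOff.
inside = matched['ST'] < matched['TOff']
matched['EC'] = matched['EC'].where(inside, 0)
matched['ED'] = (matched['EC']
                 .map(df_uniqueal.set_index('EC')['ED'])
                 .fillna('NF'))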
I have two dataframes with different lengths (df, df1). They share one similar label, "collo_number". I want to search the second dataframe for every collo_number in the first dataframe. The problem is that the second dataframe contains multiple rows, for different dates, for every collo_number. So I want to sum over these dates and add the result as a new column in the first dataframe.
I am now using a loop, but it is rather slow and has to perform this operation for all 7 days in a week. Is there a way to get better performance? I tried multiple solutions but keep getting the error that I cannot use the equals sign for two dataframes with different lengths. Help would really be appreciated! Here is an example of what works, but with rather bad performance:
df5=[df1.loc[(df1.index == nasa) & (df1.afleverdag == x1) & (df1.ind_init_actie=="N"), "aantal_colli"].sum() for nasa in df.collonr]
Your description is a bit vague (hence my comment). First, what you could do is select the rows of the dataframe that you want to search:
dftmp = df1[(df1.afleverdag==x1) & (df1.ind_init_actie=='N')]
so that you don't do this for every item in the loop.
Second, use .groupby.
newseries = dftmp['aantal_colli'].groupby(dftmp.index).sum()
newseries = newseries.reindex(df.collonr.unique())  # .ix is deprecated; reindex keeps missing labels as NaN
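Putting both steps together, map is a convenient way to write the aligned sums back onto the first frame; a sketch using the names from your snippet (the new column name here is mine):

dftmp = df1[(df1.afleverdag == x1) & (df1.ind_init_actie == 'N')]
sums = dftmp.groupby(dftmp.index)['aantal_colli'].sum()
# collo numbers with no matching rows in df1 get 0
df['aantal_colli_sum'] = df['collonr'].map(sums).fillna(0)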
I'm a very basic user of Pandas, but I'm coming up against a brick wall here.
So I have one dataframe called dg which has a column called 'user_id' and two other columns which aren't needed at the moment. I also have two more dataframes (data_conv and data_retargeting) which include the same column, plus a column called 'timestamp'; however, there are multiple timestamps for each 'user_id'.
What I need is to create new columns in dg for the minimum and maximum 'timestamp' found.
I am currently able to do this through a very long-winded method with iterrows; however, for a dataframe of ~16,000 rows it took 45 minutes, and I would like to cut that down because I have larger dataframes to run this on.
for index, row in dg.iterrows():
    user_id = row['pdp_id']
    n_audft = data_retargeting[data_retargeting.pdp_id == user_id].index.min()
    n_audlt = data_retargeting[data_retargeting.pdp_id == user_id].index.max()
    n_convft = data_conv[data_conv.pdp_id == user_id].index.min()
    n_convlt = data_conv[data_conv.pdp_id == user_id].index.max()
    dg.loc[index, 'first_retargeting'] = data_retargeting.loc[n_audft, 'raw_time']
    dg.loc[index, 'last_retargeting'] = data_retargeting.loc[n_audlt, 'raw_time']
    dg.loc[index, 'first_conversion'] = data_conv.loc[n_convft, 'raw_time']
    dg.loc[index, 'last_conversion'] = data_conv.loc[n_convlt, 'raw_time']
Without going into specific code: is every user_id in dg found in data_conv and data_retargeting? If so, you can merge them (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.merge.html) into a new dataframe first, then compute the max/min and extract the desired columns. I suspect that might run quite a bit faster.
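A sketch of that idea with groupby plus merge, using the column names from the loop above and assuming 'raw_time' holds the timestamps you want the min/max of:

def first_last(events, prefix):
    # min/max raw_time per pdp_id, renamed to the desired column names
    agg = events.groupby('pdp_id')['raw_time'].agg(['min', 'max'])
    return agg.rename(columns={'min': 'first_' + prefix, 'max': 'last_' + prefix})

dg = (dg.merge(first_last(data_retargeting, 'retargeting'),
               left_on='pdp_id', right_index=True, how='left')
        .merge(first_last(data_conv, 'conversion'),
               left_on='pdp_id', right_index=True, how='left'))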