Optimal filling of pandas DataFrame column by matching values in another DataFrame - python

Basically I have two DataFrames and want to re-populate a column of the second by matching three row elements of the second with the first. To give an example, I have columns "Period" and "Hub" in both DataFrames. For each row in the second DataFrame, I want to take the value of Index (which is a date) and "Product"/"Hub" (which are strings) and find the row in the first DataFrame that has these same values (in the corresponding columns) and return the value of "Period" from that row. I can then populate my row in the second DataFrame with this value.
I have a working solution, but it's really slow. Perhaps this is just due to the size of the DataFrames (approx. 100k rows) but it's taking over an hour to process!
Anyway, this is my working solution - any tips on how to speed it up would be really appreciated!
def selectData(hub, product):
    qry = "Hub=='"+hub+"' and Product=='"+product+"'"
    return data_1.query(qry)

data_2["Period"] = data_2.apply(lambda row: selectData(row["Hub"], row["Product"]).ix[row.index, "Period"], axis=1)
EDIT: I should note that the first DataFrame is guaranteed to have a unique result to my query but contains a larger set of data than that required to populate data_2
EDIT2: I just realised this is not in fact a working solution...

If I understand your problem correctly, you want to merge these two DataFrames on the index (date), Product and Hub, and obtain Period from data_1.
I don't have your data, so I tested it on random ints. It should be very fast with 100k rows in data_1.
import numpy as np
import pandas as pd

# data_1 is the larger DataFrame
n = 100000
data_1 = pd.DataFrame(np.random.randint(1, 100, (n, 3)),
                      index=pd.date_range('2012-01-01', periods=n, freq='1min').date,
                      columns=['Product', 'Hub', 'Period']).drop_duplicates()
data_1.index.name = 'Date'
# data_2 is a random subset, without the Period column
# (.ix was removed in pandas 1.0, so positional selection uses .iloc)
data_2 = data_1.iloc[np.random.randint(0, len(data_1), 1000)][['Product', 'Hub']]
To join on index + some columns, you can do this:
data_3 = data_2.reset_index().merge(data_1.reset_index(), on=['Date','Product','Hub'], how='left')
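As a small illustration of the merge above, here is a sketch on hand-made data. The product and hub names, the dates, and the name `filled` are made up for the example. The `validate="many_to_one"` argument is an optional addition that makes pandas raise if a key in data_1 turns out not to be unique, and `set_index("Date")` puts the date back as the index afterwards.

```python
import pandas as pd

# Hypothetical lookup table: one Period per (Date, Product, Hub).
data_1 = pd.DataFrame(
    {
        "Product": ["gas", "gas", "power", "power"],
        "Hub": ["NBP", "TTF", "NBP", "TTF"],
        "Period": ["M1", "M2", "Q1", "Q2"],
    },
    index=pd.to_datetime(
        ["2012-01-01", "2012-01-01", "2012-01-02", "2012-01-02"]
    ).rename("Date"),
)

# Hypothetical frame whose Period column should be filled in.
data_2 = pd.DataFrame(
    {"Product": ["power", "gas"], "Hub": ["TTF", "NBP"]},
    index=pd.to_datetime(["2012-01-02", "2012-01-01"]).rename("Date"),
)

filled = (
    data_2.reset_index()
    .merge(
        data_1.reset_index(),
        on=["Date", "Product", "Hub"],
        how="left",               # keep every row of data_2, in order
        validate="many_to_one",   # fail loudly if data_1 keys are not unique
    )
    .set_index("Date")            # restore the date index
)
print(filled)
```

A left merge keeps the rows of data_2 in their original order, so the result can also be assigned back column-wise if the index has to stay untouched.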

Related

Get timedelta for next row within group of another pandas column

I have two columns in a pandas dataframe ('df'): one with an ID I want to group the data by, and another with a timestamp. I want to create a new column that calculates the difference between the time in the current row and the time in the next row that contains the same "ID" value. The below code achieves this, but is horribly inefficient.
dfs = []
def get_delta(id):
    # select the rows for this ID; copy so the new column is not set on a view
    id_subset = df[df['ID'] == id].copy()
    id_subset['delta'] = (id_subset['time'] - id_subset['time'].shift()).fillna(pd.NaT)
    dfs.append(id_subset)

for i in df['ID'].unique():
    get_delta(i)
final_df = pd.concat(dfs)
What is a less computationally expensive way to do this?
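One possible way to avoid the per-ID loop is to shift the timestamps within each group and subtract. The sketch below uses made-up data and assumes the rows should be ordered by time within each ID. The column names `delta_prev` and `delta_next` are invented for the example: the first matches what the loop above computes (current minus previous row), the second is the "next row" difference described in the question.

```python
import pandas as pd

# Hypothetical data: two IDs with interleaved timestamps.
df = pd.DataFrame(
    {
        "ID": ["a", "b", "a", "b", "a"],
        "time": pd.to_datetime(
            [
                "2020-01-01 00:00",
                "2020-01-01 00:10",
                "2020-01-01 00:05",
                "2020-01-01 00:40",
                "2020-01-01 00:20",
            ]
        ),
    }
)

# Assumption: differences are taken in time order within each ID.
df = df.sort_values(["ID", "time"])

grouped_time = df.groupby("ID")["time"]
# Current row minus the previous row with the same ID (NaT for the first row).
df["delta_prev"] = df["time"] - grouped_time.shift(1)
# Next row with the same ID minus the current row (NaT for the last row).
df["delta_next"] = grouped_time.shift(-1) - df["time"]
print(df)
```

Both columns are computed in a single vectorised pass, and the original index labels are preserved.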

How can I keep the original index when doing an outer merge and dropping rows?

I have a big df (rates) that contains all information, then I have a second dataframe (aig_df) that contains a couple of rows of the first one.
I need to get a third dataframe that is basically the big one (rates) without the rows of the second one (aig_df), but I need to keep the original index values of the rows that remain.
With the code I have now, I can get the third dataframe with all the information needed, but it has an integer index and I need the index that corresponds to each row (Index = Stock Ticker).
rates = pd.read_sql(
    "SELECT Ticker, Carrier, Product, Name, CDSC,StrategyTerm,ParRate,Spread,Fee,Cap FROM ProductRates ",
    conn,
).set_index('Ticker')
aig_df = rates.query('Product == "X5 Advantage AnnuitySM"')
competitors_df = pd.merge(
    rates,
    aig_df[['Carrier', 'Product', 'Name', 'CDSC', 'StrategyTerm', 'ParRate', 'Spread', 'Fee', 'Cap']],
    indicator=True,
    how='outer',
).query('_merge=="left_only"').drop('_merge', axis=1)
Is there any way to do what I need?
Thanks for your attention
In your specific case, you don't need a merge to do what you want:
result = rates[rates["Product"] != "X5 Advantage AnnuitySM"]
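To see why the boolean filter keeps the Ticker index while the merge does not, here is a small sketch. The tickers, carriers and values are invented, and `via_merge` is a name made up for the example. It also shows one way to keep the index if a merge is really needed: move the index into a column with `reset_index()` before merging and restore it afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the table read from SQL, indexed by Ticker.
rates = pd.DataFrame(
    {
        "Carrier": ["AIG", "AIG", "Other"],
        "Product": ["X5 Advantage AnnuitySM", "Other Product", "Competitor Product"],
        "Cap": [0.05, 0.04, 0.06],
    },
    index=pd.Index(["T1", "T2", "T3"], name="Ticker"),
)
aig_df = rates.query('Product == "X5 Advantage AnnuitySM"')

# Boolean indexing only selects rows, so the Ticker index is untouched.
competitors_df = rates[rates["Product"] != "X5 Advantage AnnuitySM"]

# merge() discards the index of both frames, so carry Ticker as a column
# through the merge and set it as the index again at the end.
merged = rates.reset_index().merge(aig_df.reset_index(), how="outer", indicator=True)
via_merge = (
    merged.query('_merge == "left_only"')
    .drop(columns="_merge")
    .set_index("Ticker")
)
print(competitors_df)
print(via_merge)
```

The filter is the simpler option whenever the rows to exclude can be described by a condition on rates itself.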

How to modify column names when combining rows of multiple columns into single rows in a dataframe based on a categorical value. + selective sum/mean

I'm using the pandas .groupby function to combine multiple rows into a single row.
Currently I have a dataframe df_clean which has 405 rows, and 85 columns. There are up to 3 rows that correspond to a single Batch.
My current code for combining the multiple rows is:
num_V = 84 # number of columns minus 1 (the "Batch" column that the rows are grouped by)
max_row = df_clean.groupby('Batch').Batch.count().max()
df2= (
df_clean.groupby('Batch')
.apply(lambda x: x.values[:,1:].reshape(1,-1)[0])
.apply(pd.Series)
)
This code works: it creates a dataframe df2 with one row per Batch (Batch becomes the index). However, the columns in the resulting dataframe are simply numbered (0, 1, 2, 3, ..., 249, 250, 251). Note that 84*3 = 252, i.e. (number of columns minus the Batch column) times the maximum of 3 rows per Batch.
I'm cleaning some data for analysis and I want to combine the data of several (generally 1-3) Sub_Batch values on separate rows into a single row based on their Batch. Ideally I would like to be able to determine which columns are grouped into a row and remain separate columns in the row, as well as which columns the average, or total value is reported.
for example desired input/output:
Original dataframe
Output dataframe
note: naming of columns, and that all columns are copied over and that the columns are ordered according to which sub-batch they belong to. ie Weight_2 will always correspond to the second sub_batch that is a part of that Batch, Weight_3 will correspond to the third Sub_batch that is part of the Batch.
Ideal output dataframe
note: naming of columns, and that in this dataframe there is only a single column that records the Color, as the colors are identical for all Sub-Batch values within a Batch. The individual Temperature values are recorded, as well as the average of the Temperature values for a Batch. The individual Weight values are recorded, as well as the sum of the weight values in the column 'Total_weight'.
I am 100% okay with the Output Dataframe scenario, as I will simply add the values that I want afterwards using .mean and .sum. I am simply asking whether it can be done using '.groupby', as it is not something that I have worked with before, and I know that it does have some ability to sum or average results.
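One way this could be approached is sketched below. The sample data and the names `sub`, `wide`, `summary` and `Avg_Temperature` are made up for the example; only `Batch`, `Color`, `Temperature`, `Weight` and `Total_weight` come from the description above. The idea is to number the sub-batches within each Batch with `cumcount()`, reshape with `unstack()` to get named columns such as Weight_2, and compute the per-Batch summaries separately with named aggregation.

```python
import pandas as pd

# Hypothetical cleaned data: up to 3 sub-batch rows per Batch.
df_clean = pd.DataFrame(
    {
        "Batch": ["A", "A", "B", "B", "B"],
        "Color": ["red", "red", "blue", "blue", "blue"],
        "Temperature": [20.0, 22.0, 30.0, 31.0, 32.0],
        "Weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# Number the rows 1, 2, 3, ... within each Batch.
numbered = df_clean.assign(sub=df_clean.groupby("Batch").cumcount() + 1)

# One row per Batch, one column per (variable, sub-batch number).
wide = numbered.set_index(["Batch", "sub"])[["Temperature", "Weight"]].unstack("sub")
wide.columns = [f"{name}_{i}" for name, i in wide.columns]

# Per-Batch summaries: a single Color, the mean temperature, the total weight.
summary = df_clean.groupby("Batch").agg(
    Color=("Color", "first"),
    Avg_Temperature=("Temperature", "mean"),
    Total_weight=("Weight", "sum"),
)

result = summary.join(wide)
print(result)
```

Batches with fewer than three sub-batches get NaN in the unused columns (here Temperature_3 and Weight_3 for Batch A).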

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR. The two dataframes are only similar (not equal) in this ID column; the rest of the columns are different, as is the number of rows.
How can I get the rows of dataframe MR whose 'ID' values are equal to values in DT['ID']? Note that a value in 'ID' can appear several times in the same column.
DT has 1538 rows and MR has 2060 rows.
I tried some of the lines proposed at https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and the goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
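Since the question mentions that an ID can appear several times, it may help to see how the two approaches differ in that case. The sketch below uses invented data (the columns `x` and `y` are placeholders): `isin()` keeps each matching row of MR exactly once, whereas `merge()` produces one row for every pairing of matching IDs.

```python
import pandas as pd

# Hypothetical frames with repeated IDs.
MR = pd.DataFrame({"ID": [1, 2, 2, 3, 4], "x": list("abcde")})
DT = pd.DataFrame({"ID": [2, 2, 4], "y": [10, 20, 30]})

# Filter: every MR row whose ID occurs anywhere in DT, kept once.
filtered = MR[MR["ID"].isin(DT["ID"])]

# Merge: ID 2 occurs twice on each side, giving 2 * 2 = 4 rows, plus 1 row for ID 4.
merged = pd.merge(MR, DT, on="ID")

print(len(filtered), len(merged))
```

If the row counts after a merge look surprising, duplicated keys on both sides are a likely cause.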

Changing pandas DataFrame values based on values from the same and previous rows

I have the following pandas df:
it is sorted by 'patient_id', 'StartTime', 'hour_counter'.
I'm looking to perform two conditional operations on the df:
Change the value of the Delta_Value column
Delete the entire row
Where the condition depends on the values of ParameterID or patient_id in the current row and the row before.
I managed to do that using classic programming (i.e. a simple loop in Python), but not using Pandas.
Specifically, I want to change the 'Delta_Value' to 0 or delete the entire row, if the ParameterID in the current row is different from the one at the row before.
I've tried to use .groupby().first(), but that won't work in some cases because the same patient_id can have multiple occurrences of the same ParameterID with a different ParameterID in between those occurrences. For example record 10 in the df.
And I need the records to be sorted by the StartTime & hour_counter.
Any suggestions?
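A possible vectorised approach is to compare each row with the previous row of the same patient using a grouped `shift()`. The sketch below uses invented data and makes one assumption that the question leaves open: the first row of each patient has no previous row and is therefore not flagged. The names `prev`, `changed`, `zeroed` and `dropped` are made up for the example.

```python
import pandas as pd

# Hypothetical data, already sorted by patient and time.
df = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 1, 2, 2],
        "ParameterID": [10, 10, 20, 10, 10, 30],
        "Delta_Value": [1, 2, 3, 4, 5, 6],
    }
)

# ParameterID of the previous row within the same patient (NaN for the first row).
prev = df.groupby("patient_id")["ParameterID"].shift()

# Assumption: a row is flagged only if it has a previous row and the ID changed.
changed = prev.notna() & (df["ParameterID"] != prev)

# Option 1: set Delta_Value to 0 on the flagged rows.
zeroed = df.copy()
zeroed.loc[changed, "Delta_Value"] = 0

# Option 2: delete the flagged rows entirely.
dropped = df[~changed]

print(zeroed)
print(dropped)
```

Because the comparison is made row by row rather than per group, a ParameterID that reappears later for the same patient is handled like any other change.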
