I have two columns in a pandas dataframe ('df'): one with an ID I want to group the data by, and another with a timestamp. I want to create a new column that calculates the difference between the time in the current row and the time in the next row that contains the same "ID" value. The code below achieves this, but it is horribly inefficient.
dfs = []

def get_delta(id):
    # take only the rows for this ID, then diff consecutive timestamps
    id_subset = df[df['ID'] == id].copy()
    id_subset['delta'] = (id_subset['time'] - id_subset['time'].shift()).fillna('NaT')
    dfs.append(id_subset)

[get_delta(i) for i in df['ID'].unique()]
final_df = pd.concat(dfs)
What is a less computationally expensive way to do this?
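A minimal sketch of a vectorized alternative (assuming 'time' is already a datetime column): group on 'ID' and take the differences within each group, which avoids the per-ID Python loop entirely. .diff() reproduces the shift() in the snippet above (current row minus the previous row with the same ID); shifting by -1 within the group gives next-minus-current instead.
# current row minus previous row with the same ID; the first row of each group is NaT
df['delta'] = df.groupby('ID')['time'].diff()
# or: next row with the same ID minus the current row
df['delta'] = df.groupby('ID')['time'].shift(-1) - df['time']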
I have a DataFrame that is the result of a large SQL query. I am trying to split it into two separate DataFrames, NVI and Main. Both are lists of repairs to trucks. I need to split it based on whether a row has a specific profile id, 7055; those rows go into the NVI DataFrame.
When such a job is encountered, I need to grab the values from the "RO", "Unit Number" and "Repair Date" columns. I then need to take those values and search the DataFrame again, grabbing any rows that have a matching RO and Unit Number, or a matching Unit Number and a Repair Date that is equal to or earlier than the date in the row where the 7055 was found. Those rows then need to go into the NVI df. Any remaining rows that do not match go into the Main df.
The only static value is the profile id of 7055. The RO, Unit Number and Repair Date will all be different.
class nvi_dict(dict):
    def __setitem__(self, key, value):
        key = key.profile()
        super().__setitem__(key, value)

nvisort = pd.DataFrame()

def sort_nvi_dict(row, component):
    if row['PROFILE_ID'] in cfg[component]['nvi']:
        nvi_ro = nvi_dict()
        nvi_ro['RO'] = row['RO']
        nvi_ro['UnitNum'] = row['VFUNIT']
        nvi_ro['date'] = row['REPAIR_DATE']

nvisort = nvidf.apply(lambda x: sort_nvi_dict(x, 'nvi_ro'), axis=1, result_type='expand')
I thought about using a class to create a temporary dict object to store the values from RO, UnitNum and Date, which I can then use when iterating over the df again looking for matching values.
I am using a .yml file to store dictionaries that I use to further sort each of the NVI and Main df's after they have been split, because each will then need to be sorted by truck manufacturer.
I think this might work; I'm unable to test it without the test data though...
df1 = nvisort[nvisort['profile_id'] == 7055]
df2 = pd.merge(nvisort, df1[['RO', 'Unit Number']], on=['RO', 'Unit Number'], how='right')
df3 = pd.merge(nvisort, df1[['Unit Number', 'Repair Date']], on='Unit Number', how='right')
df3 = df3[df3['Repair Date_x'] <= df3['Repair Date_y']]
df3 = df3.drop(columns='Repair Date_y')
df3 = df3.rename(columns={'Repair Date_x': 'Repair Date'})
NVI = pd.concat([df1, df2, df3])
Main = pd.concat([NVI, nvisort]).drop_duplicates(keep=False)
I'm assuming that your original/starting dataframe here is nvisort; we filter that to just the rows with a profile_id of 7055 and call that df1.
Then we get your two different criteria into df2 and df3.
df2 is just a filter on the original dataframe where RO and Unit Number match, so we can use pd.merge() to effectively get that filter.
df3 is a more complicated filter since it is less-than-or-equal, not equal. So first we do the merge to filter on matching Unit Numbers, but we also bring over the Repair Date from both tables into df3; these get _x and _y appended to the column names. We then keep the rows where the date in _x is less than or equal to the date in _y, and clean up the columns.
Last, you get Main by finding everything from the original nvisort that is not in NVI. Since NVI is a subset of nvisort, you can just concat them and drop all duplicates, leaving only data that exists in one of the dataframes.
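One caveat on my part: a row can satisfy more than one of the three filters (for example, match on RO/Unit Number and also on Unit Number/date), so NVI may contain duplicate rows after the concat. If that matters, deduplicating it is a one-liner:
NVI = pd.concat([df1, df2, df3]).drop_duplicates()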
From what I understand of your question, you want to divide a dataframe into two based on certain conditions?
df1 = df[<condition>]
where the condition can be something like df['PROFILE_ID'] == 7055 combined with a check that the unit is in Allunits, as sketched below.
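A minimal sketch of that idea (the column names and the Allunits collection are assumptions taken from the question, not tested against the real data): build one boolean mask for the NVI criteria, then use the mask and its negation to split the frame.
# assumed column names; Allunits would hold the unit numbers gathered from the 7055 rows
nvi_mask = (df['PROFILE_ID'] == 7055) | df['Unit Number'].isin(Allunits)
NVI = df[nvi_mask]    # rows matching the NVI criteria
Main = df[~nvi_mask]  # everything else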
I have a big df (rates) that contains all the information, and a second dataframe (aig_df) that contains a couple of rows of the first one.
I need to get a third dataframe that is basically the big one (rates) without the rows in the second one (aig_df), but I need to keep the indices of the remaining rows.
With the code I have now, I can get the third dataframe with all the information needed, but with an integer index; I need the index corresponding to each row (Index = Stock Ticker).
rates = pd.read_sql("SELECT Ticker, Carrier, Product, Name, CDSC,StrategyTerm,ParRate,Spread,Fee,Cap FROM ProductRates ", conn).set_index('Ticker')
aig_df = rates.query('Product == "X5 Advantage AnnuitySM"')
competitors_df = pd.merge(rates,
                          aig_df[['Carrier', 'Product', 'Name', 'CDSC', 'StrategyTerm', 'ParRate', 'Spread', 'Fee', 'Cap']],
                          indicator=True, how='outer').query('_merge == "left_only"').drop('_merge', axis=1)
Is there any way to do what I need?
Thanks for your attention
In your specific case, you don't need a merge to do what you want:
result = rates[rates["Product"] != "X5 Advantage AnnuitySM"]
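If you do need the general pattern (drop the rows that appear in aig_df while keeping the Ticker index), and assuming the Ticker index uniquely identifies those rows, you can filter on the index instead of merging, which keeps the original index intact:
# keep every row of rates whose Ticker does not appear in aig_df
competitors_df = rates[~rates.index.isin(aig_df.index)]
# equivalent alternative:
# competitors_df = rates.drop(aig_df.index)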
I have a task of reading each column of a Cassandra table into a dataframe to perform some operations. I want to feed the data cumulatively: if there are 5 columns in a table, I want:
the first column in the first iteration
the first and second columns in the second iteration, appended to the same dataframe
and so on.
I need generic code for this. Has anyone tried something similar? Please help me out with an example.
This will work:
df2 = pd.DataFrame()
for i in range(len(df.columns)):
    # append the first i+1 columns; the slices line up on their shared column names
    df2 = df2.append(df.iloc[:, 0:i + 1], sort=True)
Since the same column names repeat across iterations (and a dataframe will not hold the same column name twice), each slice lines up on the existing columns and rows keep getting added.
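Note that DataFrame.append was deprecated and removed in pandas 2.0, so on newer versions the same cumulative pattern can be written with pd.concat (a sketch with the same assumed behaviour):
import pandas as pd
# stack the cumulative column slices: first 1 column, then 2, then 3, ...
df2 = pd.concat([df.iloc[:, 0:i + 1] for i in range(len(df.columns))], sort=True)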
You can extract the names from the dataframe's schema and then access those particular columns and use them the way you want to.
names = df.schema.names
columns = []
for name in names:
    columns.append(name)
# df[columns] -- use it the way you want
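Tying that back to the cumulative requirement above, and assuming df here is a Spark DataFrame (since .schema.names is the Spark API) with the per-iteration work done in pandas, a rough sketch would be:
names = df.schema.names
for i in range(len(names)):
    # first i+1 columns of the Cassandra table as a pandas dataframe
    subset = df.select(names[:i + 1]).toPandas()
    # ... perform the per-iteration operations on subset here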
I am trying to extract data from Quandl, and I want to get the Date and 'Open' value for each row. However, I am not sure how I should do this; I've been trying different methods that haven't worked out. Below is an example:
import quandl

data = quandl.get("EOD/PG", trim_start="2011-12-12", trim_end="2011-12-30",
                  authtoken=quandl.ApiConfig.api_key)
data = data.reset_index()
sta = data[['Date', 'Open']]
for row in sta:
    price = row.iloc[:, 1]
    date = row.iloc[:, 0]
What you're doing with the code you have provided is iterating through the column names, i.e. you get 'Date' on the first iteration, and 'Open' on the next (and last).
To iterate through a dataframe by row, you can use the .iterrows() or .itertuples() methods (.iteritems() iterates over columns, not rows).
For example,
for row in data.itertuples():
    price = row.Open
    date = row.Date
Having said that, iterating through a pandas dataframe is really slow. Chances are, whatever you intend to do could be done faster by making use of pandas' vectorization, i.e. without a loop.
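For example, if the goal is just to collect all the dates and open prices, the loop-free version is whole-column access (column names taken from the snippet above):
# whole columns at once, as numpy arrays
dates = data['Date'].to_numpy()
prices = data['Open'].to_numpy()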
Basically I have two DataFrames and want to re-populate a column of the second by matching three row elements of the second with the first. To give an example, I have columns "Period" and "Hub" in both DataFrames. For each row in the second DataFrame, I want to take the value of Index (which is a date) and "Product"/"Hub" (which are strings) and find the row in the first DataFrame that has these same values (in the corresponding columns) and return the value of "Period" from that row. I can then populate my row in the second DataFrame with this value.
I have a working solution, but it's really slow. Perhaps this is just due to the size of the DataFrames (approx. 100k rows) but it's taking over an hour to process!
Anyway, this is my working solution - any tips on how to speed it up would be really appreciated!
def selectData(hub, product):
    qry = "Hub=='" + hub + "' and Product=='" + product + "'"
    return data_1.query(qry)

data_2["Period"] = data_2.apply(lambda row: selectData(row["Hub"], row["Product"]).ix[row.index, "Period"], axis=1)
EDIT: I should note that the first DataFrame is guaranteed to have a unique result to my query but contains a larger set of data than that required to populate data_2
EDIT2: I just realised this is not in fact a working solution...
If I understand your problem correctly, you want to merge these two dataframes on the index (date), Product and Hub, and obtain Period from data_1.
I don't have your data, but I tested this on random ints. It should be very fast with 100k rows in data_1.
import numpy as np
import pandas as pd

# data_1 is the larger dataframe
n = 100000
data_1 = pd.DataFrame(np.random.randint(1, 100, (n, 3)),
                      index=pd.date_range('2012-01-01', periods=n, freq='1Min').date,
                      columns=['Product', 'Hub', 'Period']).drop_duplicates()
data_1.index.name = 'Date'

# data_2 is a random subset, w/o column Period
data_2 = data_1.iloc[np.random.randint(0, len(data_1), 1000)][['Product', 'Hub']]
To join on index + some columns, you can do this:
data_3 = data_2.reset_index().merge(data_1.reset_index(), on=['Date','Product','Hub'], how='left')
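If you then want Date back as the index, you can chain a set_index onto the merge:
data_3 = (data_2.reset_index()
          .merge(data_1.reset_index(), on=['Date', 'Product', 'Hub'], how='left')
          .set_index('Date'))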