I have two dataframes of different lengths (df, df1). They share one common label, "collo_number". For every collo_number in the first dataframe I want to search the second dataframe. The problem is that the second dataframe contains multiple rows, for different dates, for each collo_number. I want to sum the values over these dates and add the result as a new column in the first dataframe.
I currently use a loop, but it is rather slow and has to perform this operation for all 7 days of the week. Is there a way to get better performance? I tried multiple solutions but keep getting the error that I cannot use the equality operator on two dataframes of different lengths. Help would really be appreciated! Here is an example of what works, but with rather bad performance.
df5 = [df1.loc[(df1.index == nasa) & (df1.afleverdag == x1) & (df1.ind_init_actie == "N"), "aantal_colli"].sum() for nasa in df.collonr]
Your description is a bit vague (hence my comment). First, what you could do is select the rows of the dataframe that you want to search:
dftmp = df1[(df1.afleverdag==x1) & (df1.ind_init_actie=='N')]
so that you don't do this for every item in the loop.
Second, use .groupby.
newseries = dftmp['aantal_colli'].groupby(dftmp.index).sum()
newseries = newseries.reindex(df.collonr.unique())  # select the sums for the collo numbers present in df; missing ones become NaN
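To get these sums back into df as a new column (the stated goal), a minimal sketch, assuming the variable names from the question, that df1 is indexed by collo number, and a made-up new column name 'aantal_colli_sum':
# filter once, group once, then map the per-collo sums onto df
dftmp = df1[(df1.afleverdag == x1) & (df1.ind_init_actie == 'N')]
sums = dftmp.groupby(dftmp.index)['aantal_colli'].sum()
# collo numbers with no match get 0, like the original .sum() over an empty selection
df['aantal_colli_sum'] = df['collonr'].map(sums).fillna(0)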
I am a newbie to Pandas, and somewhat of a newbie to Python.
I am looking at stock data, which I read in from CSV; a typical size is 500,000 rows.
The data has columns including "Open", "High", and "Low".
I need to check the data against itself - the basic algorithm is a loop similar to
Row = 0
x = get the "low" price in row Row
y = CalculateSomething(x)
go through the rest of the data, compare against y
if (a):
    append ("A") at the end of row Row  # in the dataframe
else:
    print ("B") at the end of row Row
Row = Row + 1
On the next iteration, the data pointer should reset to row 1 and go through the same process; each time, it adds notes to the dataframe at the Row index.
I looked at Pandas, and figured the way to try this would be to use two loops, and copying the dataframe to maintain two separate instances
The actual code looks like this (simplified)
import pandas as pd

df = pd.read_csv('data.csv')

calc1 = 1  # this part is confidential so set to something simple
calc2 = 2  # this part is confidential so set to something simple

def func3_df_index(df):
    dfouter = df.copy()
    for outerindex in dfouter.index:
        dfouter_openval = dfouter.at[outerindex, "Open"]
        for index in df.index:
            if df.at[index, "Low"] <= calc1 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message 1"
                break
            elif df.at[index, "High"] >= calc2 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message2"
                break
            else:
                dfouter.at[outerindex, 'notes'] = "message3"
    return dfouter
This method is taking a long time (7+ minutes per 5K rows), which will be very long for 500,000 rows. There may be data exceeding 1 million rows.
I have tried the two-loop method with the following variants:
using iloc - e.g. df.iloc[index, 2]
using at - e.g. df.at[index, "Low"]
using numpy & at - e.g. df.at[index, "Low"] = np.where((df.at[index, "Low"] < ..."
The data is floating-point values and a datetime string.
Is it better to use numpy? Maybe there is an alternative to using two loops?
Any other methods, like using R, mongo, or some other database, different from Python, would also be useful; I just need the results and am not necessarily tied to Python.
Any help and constructs would be greatly appreciated.
Thanks in advance
You are copying the dataframe and manually looping over the indices. This will almost always be slower than vectorized operations.
If you only care about one row at a time, you can simply use the csv module.
numpy is not "better"; pandas internally uses numpy
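For illustration only, here is a minimal sketch of what a vectorized version of the posted loop could look like, assuming a default 0..n-1 index from read_csv and that calc1/calc2 are scalar thresholds as in the simplified code (the real, confidential calculations may not reduce to this):
import numpy as np

calc1, calc2 = 1, 2  # stand-ins, as in the simplified code

low_hit = df["Low"].to_numpy() <= calc1
high_hit = df["High"].to_numpy() >= calc2

n = len(df)
positions = np.arange(n)

# for each row, position of the first hit at or after that row (n if none)
next_low = np.minimum.accumulate(np.where(low_hit, positions, n)[::-1])[::-1]
next_high = np.minimum.accumulate(np.where(high_hit, positions, n)[::-1])[::-1]

# the "Low" branch wins ties, matching the if/elif order in the loop
df["notes"] = np.where(
    (next_low < n) & (next_low <= next_high), "message 1",
    np.where(next_high < n, "message2", "message3"),
)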
Alternatively, load the data into a database. Examples include SQLite, MySQL/MariaDB, Postgres, or maybe DuckDB; then use query commands against that. This has the added advantage of allowing type conversion from strings to floats, so numerical analysis is easier.
If you really want to process a file in parallel directly from Python, you could move to Dask or PySpark, although Pandas should work with some tuning; for a start, Pandas' read_sql function (reading from such a database) would work better.
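As a rough sketch of the database route (the table name, file names, and the WHERE clause here are made up for illustration; the real filtering logic would go into the query):
import sqlite3
import pandas as pd

df = pd.read_csv('data.csv')

con = sqlite3.connect('stock.db')
df.to_sql('prices', con, if_exists='replace', index=False)

# numeric columns come back as REAL, so comparisons are numeric rather than string-based
sample = pd.read_sql('SELECT * FROM prices WHERE Low <= 1', con)
con.close()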
You could split the main dataset into smaller datasets, e.g. 50 sub-datasets with 10,000 rows each, to increase speed. Run the function on each sub-dataset using threading or concurrency and then combine the final results.
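A sketch of that split/apply/combine skeleton, using processes rather than threads (pure-Python pandas loops are CPU-bound, so threads rarely help); note that in this question each row looks ahead at later rows, so the chunks are not independent and the per-chunk logic would need adjusting:
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # placeholder for the real per-chunk computation
    chunk = chunk.copy()
    chunk["notes"] = "message3"  # stand-in value
    return chunk

def process_in_chunks(df: pd.DataFrame, chunk_size: int = 10_000) -> pd.DataFrame:
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    with ProcessPoolExecutor() as executor:  # call this under `if __name__ == "__main__":` on Windows/macOS
        results = list(executor.map(process_chunk, chunks))
    return pd.concat(results)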
I have a data set that contains hourly data for marketing campaigns. There are several campaigns, and not all of them are active during all 24 hours of the day. My goal is to eliminate all rows belonging to campaign-days for which I don't have the full 24 rows of data.
The raw data contains a lot of information like this:
Original Data Set
I created a dummy variable with ones to be able to count single instances of rows. This is the code I applied to be able to see the results I want to get.
tmp = df.groupby(['id','date']).count()
tmp.query('Hour > 23')
I get the following results:
Results of two lines of code
These results illustrate exactly the data that I want to keep in my data frame.
How can I eliminate the data per campaign per day that does not reach 24 rows? The objective is not the count but the real data, i.e. ungrouped, unlike what I present in the second picture.
I appreciate the guidance.
Use transform to broadcast the count over all rows of your dataframe, then use loc as a replacement for query:
out = df.loc[df.groupby(['id', 'date'])['Hour'].transform('count')
               .loc[lambda x: x > 23].index]
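An equivalent formulation, as a sketch assuming the same 'id', 'date', and 'Hour' columns, uses the transformed count directly as a boolean mask:
out = df[df.groupby(['id', 'date'])['Hour'].transform('count') >= 24]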
Drop the data you don't want before you do the groupby.
You can use .loc or .drop; I am unfamiliar with .query.
I have a very large dataframe that is multi-indexed on ('ID', 'Date'); the column 'Value' tracks an individual's progress in time using a boolean.
I know that each individual starts and ends with Value = True. I've been able to locate the date for the first occurrence of False using df.loc[~df['Value'], :], but what I want to be able to do is locate the date when they switched back to True after one or more periods of False. I've tried using variations on .groupby().diff() but this is extremely slow.
Example: I want to extract "7-22-19" for individual A, below:
ID   Date       Value
A    1-30-19    True
A    3-15-19    False
A    4-1-19     False
A    7-22-19    True
A    11-13-19   True
B    2-1-19     True
etc.
As an extra caveat, a solution that is both fast (my dataframe has hundreds of thousands of IDs, so no loops, and .groupby().diff() seems to be slow) and that works with non-booleans would be ideal (i.e. if we replace True/False with Drug X/Drug Y).
Thank you!
shift is a nice tool for detecting transitions in a column. So you could find transitions from False to True within the same ID with:
df.loc[df['Value']&((~df['Value']).shift())&(df['ID']==df['ID'].shift())]
With your data, it gives as expected:
ID Date Value
3 A 7-22-19 True
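For the non-boolean case mentioned in the question (e.g. Drug X / Drug Y instead of True/False), the same shift idea works by comparing against specific values; a sketch, assuming the switch of interest is back to 'Drug X' after a period on 'Drug Y':
mask = (
    (df['Value'] == 'Drug X')
    & (df['Value'].shift() == 'Drug Y')
    & (df['ID'] == df['ID'].shift())
)
switched_back = df.loc[mask]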
I am new to Python and I'm trying to produce a similar result of Excel's IndexMatch function with Python & Pandas, though I'm struggling to get it working.
Basically, I have 2 separate DataFrames:
The first DataFrame ('market') has 7 columns, though I only need 3 of those columns for this exercise ('symbol', 'date', 'close'). This df has 13,948,340 rows.
The second DataFrame ('transactions') has 14 columns, though I only need 2 of those columns ('i_symbol', 'acceptance_date'). This df has 1,428,026 rows.
My logic is: If i_symbol is equal to symbol and acceptance_date is equal to date: print symbol, date & close. This should be easy.
I have achieved it with iterrows(), but because of the size of the dataset it returns a single result every 3 minutes, which means I would have to run the script for 1,190 hours to get the final result.
Based on what I have read online, itertuples should be a faster approach, but I am currently getting an error:
ValueError: too many values to unpack (expected 2)
This is the code I have written (which currently produces the above ValueError):
for i_symbol, acceptance_date in transactions.itertuples(index=False):
    for symbol, date in market.itertuples(index=False):
        if i_symbol == symbol and acceptance_date == date:
            print(market.symbol + market.date + market.close)
2 questions:
Is itertuples() the best/fastest approach? If so, how can I get the above working?
Does anyone know a better way? Would indexing work? Should I use an external db (e.g. mysql) instead?
Thanks, Matt
Regarding question 1: pandas.itertuples() yields one namedtuple for each row. You can either unpack these like standard tuples or access the tuple elements by name:
for t in transactions.itertuples(index=False):
    for m in market.itertuples(index=False):
        if t.i_symbol == m.symbol and t.acceptance_date == m.date:
            print(m.symbol + m.date + m.close)
(I did not test this with data frames of your size but I'm pretty sure it's still painfully slow)
Regarding question 2: You can simply merge both data frames on symbol and date.
Rename the columns of your "transactions" DataFrame so that they are also named "symbol" and "date":
transactions = transactions[['i_symbol', 'acceptance_date']]
transactions.columns = ['symbol','date']
Then merge both DataFrames on symbol and date:
result = pd.merge(market, transactions, on=['symbol','date'])
The resulting DataFrame consists of one row for each symbol/date combination that exists in both DataFrames. The operation takes only a few seconds on my machine with DataFrames of your size.
#Parfait provided the best answer below as a comment. Very clean, worked incredibly fast - thank you.
pd.merge(market[['symbol', 'date', 'close']],
         transactions[['i_symbol', 'acceptance_date']],
         left_on=['symbol', 'date'],
         right_on=['i_symbol', 'acceptance_date'])
No need for looping.
I'm a very basic user of Pandas, but I am coming up against a brick wall here.
I have one dataframe called dg which has a column called 'user_id' and two other columns which aren't needed at the moment. I also have two more dataframes (data_conv and data_retargeting) which include the same column, plus a column called 'timestamp'; however, there are multiple timestamps for each 'user_id'.
I need to create new columns in dg for the minimum and maximum 'timestamp' found.
I am currently able to do this through a very long-winded method with iterrows; however, for a dataframe of ~16,000 rows it took 45 minutes, and I would like to cut that down because I have larger dataframes to run this on.
for index, row in dg.iterrows():
    user_id = row['pdp_id']
    n_audft = data_retargeting[data_retargeting.pdp_id == user_id].index.min()
    n_audlt = data_retargeting[data_retargeting.pdp_id == user_id].index.max()
    n_convft = data_conv[data_conv.pdp_id == user_id].index.min()
    n_convlt = data_conv[data_conv.pdp_id == user_id].index.max()
    dg.loc[index, 'first_retargeting'] = data_retargeting.loc[n_audft, 'raw_time']
    dg.loc[index, 'last_retargeting'] = data_retargeting.loc[n_audlt, 'raw_time']
    dg.loc[index, 'first_conversion'] = data_conv.loc[n_convft, 'raw_time']
    dg.loc[index, 'last_conversion'] = data_conv.loc[n_convlt, 'raw_time']
Without going into specific code: is every user_id in dg found in data_conv and data_retargeting? If so, you can merge them (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.merge.html) into a new dataframe first, then compute the max/min and extract the desired columns. I suspect that might run a little bit faster.
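A sketch of that idea using groupby/agg plus a merge, assuming the key is 'pdp_id' and the timestamp column is 'raw_time' as in the question's code (and that the min/max of 'raw_time' itself is what is wanted):
agg_ret = (data_retargeting.groupby('pdp_id')['raw_time']
           .agg(first_retargeting='min', last_retargeting='max')
           .reset_index())
agg_conv = (data_conv.groupby('pdp_id')['raw_time']
            .agg(first_conversion='min', last_conversion='max')
            .reset_index())

# one left-merge per source keeps every row of dg, even when a pdp_id has no matches
dg = dg.merge(agg_ret, on='pdp_id', how='left').merge(agg_conv, on='pdp_id', how='left')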