Comparing Pandas Dataframe Rows & Dropping rows with overlapping dates - python

I have a dataframe filled with trades taken from a trading strategy. The logic in the trading strategy needs to be updated to ensure that a trade isn't taken if the strategy is already in a trade, but that's a different problem. The trade data for many previous trades is read into a dataframe from a csv file.
Here's my problem for the data I have:
I need to do a row-by-row comparison of the dataframe to determine whether the EntryDate of row X is earlier than the ExitDate of row X-1.
A sample of my data:
Row 1:
EntryDate ExitDate
2012-07-25 2012-07-27
Row 2:
EntryDate ExitDate
2012-07-26 2012-07-29
Row 2 needs to be deleted because it is a trade that should not have occurred.
I'm having trouble identifying which rows are duplicates and then dropping them. I tried the approach in answer 3 of this question with some luck but it isn't ideal because I have to manually iterate through the dataframe and read each row's data. My current approach is below and is ugly as can be. I check the dates, and then add them to a new dataframe. Additionally, this approach gives me multiple duplicates in the final dataframe.
for i in range(0,len(df)+1):
    if i+1 == len(df): break #to keep from going past last row
    ExitDate = df['ExitDate'].irow(i)
    EntryNextTrade = df['EntryDate'].irow(i+1)
    if EntryNextTrade>ExitDate:
        line={'EntryDate':EntryDate,'ExitDate':ExitDate}
        df_trades=df_trades.append(line,ignore_index=True)
Any thoughts or ideas on how to more efficiently accomplish this?
You can click here to see a sampling of my data if you want to try to reproduce my actual dataframe.

You should use some kind of boolean mask to do this kind of operation.
One way is to create a helper column holding the previous trade's exit date (shift() moves each value down one row):
df['PrevExitDate'] = df['ExitDate'].shift()
Use this to create the mask. The first row has no previous trade, so it is always kept:
msk = df['PrevExitDate'].isna() | (df['EntryDate'] > df['PrevExitDate'])
And use loc to look at the subDataFrame where msk is True, and only the specified columns:
df.loc[msk, ['EntryDate', 'ExitDate']]
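A shifted comparison only looks at the immediately preceding row, so a trade can be rejected because of an earlier trade that was itself rejected. If that matters, a rough sketch is to walk the rows once and compare each entry against the exit of the last trade that was actually kept. The third and fourth rows below are made-up sample data, and the names `keep` and `last_exit` are illustrative only.

```python
import pandas as pd

# First two rows come from the question; the last two are hypothetical.
df = pd.DataFrame({
    "EntryDate": pd.to_datetime(["2012-07-25", "2012-07-26", "2012-07-28", "2012-07-30"]),
    "ExitDate": pd.to_datetime(["2012-07-27", "2012-07-29", "2012-07-31", "2012-08-02"]),
}).sort_values("EntryDate").reset_index(drop=True)

keep = []          # one boolean per row
last_exit = None   # exit date of the most recent trade that was kept

for entry, exit_ in zip(df["EntryDate"], df["ExitDate"]):
    ok = last_exit is None or entry > last_exit
    keep.append(ok)
    if ok:
        last_exit = exit_

# Boolean list works as a row mask for .loc
df_trades = df.loc[keep, ["EntryDate", "ExitDate"]].reset_index(drop=True)
print(df_trades)
```

This is a plain Python loop, so it is slower than a mask on large frames, but it makes one pass and never appends to a DataFrame inside the loop.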

Related

Python, pandas: summing a column based on multiple other columns and putting it into a new dataframe

I have this data set that I have been able to organise to the best of my abilities. I'm stuck on the next step. Here is a picture of the df:
My goal is to organise it in a way so that I have the columns month, genres, and time_watched_hours.
If I do the following:
df = df.groupby(['month']).sum().reset_index()
It only sums down the 1's in the genre columns, whereas I need to add each instance of that genre occurring with the time_watched_hours. For example, in the first row, it would add 4.84 hours for genre comedies. In the third row, 0.84 hours for genre_Crime, and so on.
Once that's organised, I will use the following to get it in the format I need:
df_cleaned = df.melt(id_vars='month',value_name='time_watched_hours',var_name='Genres').rename(columns=str.title)
Any advice on how to tackle this problem would be greatly appreciated! Thanks!
EDIT: Looking at this further, it would also work to replace the "1" in each row with the time_watched_hours value, then I can groupby().sum() down. Note there may be more than one value of "1" per row.
I ended up using mask on each genre column, which worked perfectly. The downside is that I had to repeat it for every column:
df['genre_Action & Adventure'] = df['genre_Action & Adventure'].mask(df['genre_Action & Adventure'] == 1, df['time_watched_hours'])
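The per-column repetition can probably be avoided by multiplying all the 0/1 genre columns by time_watched_hours in one step. The sketch below assumes every genre column name starts with `genre_`; the sample rows and the column `genre_Comedies` are invented for illustration.

```python
import pandas as pd

# Hypothetical sample in the shape described in the question.
df = pd.DataFrame({
    "month": ["2020-01", "2020-01", "2020-02"],
    "time_watched_hours": [4.84, 1.50, 0.84],
    "genre_Comedies": [1, 1, 0],
    "genre_Crime": [0, 1, 1],
})

# Assumption: all indicator columns share the "genre_" prefix.
genre_cols = [c for c in df.columns if c.startswith("genre_")]

# Row-wise multiplication turns each 1 into that row's hours and leaves 0 as 0.
weighted = df[genre_cols].mul(df["time_watched_hours"], axis=0)
weighted.insert(0, "month", df["month"])

monthly = weighted.groupby("month", as_index=False).sum()
df_cleaned = monthly.melt(
    id_vars="month", var_name="Genres", value_name="time_watched_hours"
)
print(df_cleaned)
```

Because a row can carry more than one genre flag, its hours are counted once per flagged genre, which matches the behaviour described in the EDIT.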

Changing pandas DataFrame values based on values from the same and previous rows

I have the following pandas df:
It is sorted by 'patient_id', 'StartTime', 'hour_counter'.
I'm looking to perform two conditional operations on the df:
Change the value of the Delta_Value column
Delete the entire row
Where the condition depends on the values of ParameterID or patient_id in the current row and the row before.
I managed to do that using classic programming (i.e. a simple loop in Python), but not using Pandas.
Specifically, I want to change the 'Delta_Value' to 0 or delete the entire row, if the ParameterID in the current row is different from the one at the row before.
I've tried to use .groupby().first(), but that won't work in some cases because the same patient_id can have multiple occurrences of the same ParameterID with a different ParameterID in between those occurrences. For example record 10 in the df.
And I need the records to be sorted by the StartTime & hour_counter.
Any suggestions?
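One possible direction, sketched under assumptions: shift ParameterID within each patient_id group and compare it with the current row. The sample values below are invented, and the sketch assumes the frame is already sorted as described and that the first row of each patient counts as a change.

```python
import pandas as pd

# Hypothetical data, already sorted by patient and time.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 2, 2],
    "ParameterID": [10, 10, 20, 10, 10, 10],
    "Delta_Value": [0.5, 0.7, 0.2, 0.9, 0.4, 0.6],
})

# ParameterID of the previous row within the same patient (NaN for the first row).
prev_param = df.groupby("patient_id")["ParameterID"].shift()

# True where the parameter differs from the row before, or there is no row before.
changed = prev_param.isna() | (df["ParameterID"] != prev_param)

# Option 1: set Delta_Value to 0 on those rows.
zeroed = df.copy()
zeroed.loc[changed, "Delta_Value"] = 0.0

# Option 2: delete those rows entirely.
dropped = df.loc[~changed].reset_index(drop=True)
```

Unlike .groupby().first(), the comparison is made row by row, so a ParameterID that reappears after a different one is flagged again.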

How to insert data into an existing dataframe, replacing values according to a conditional

I'm looking to insert information into an existing dataframe. This dataframe's shape is 2001 rows × 13 columns; however, only the first column has information.
I have 12 more columns, but these are not the same dimension as the main dataframe, so I'd like to insert these additional columns into the main one using a conditional.
Example dataframe:
This is an example. I want to insert the var column into the 2001 × 13 dataframe, using the date as a conditional; in case there is no date, it skips the row or simply adds a 0.
I'm really new to python and programming in general.
Without a minimal working example it is hard to give clear recommendations, but I think what you are looking for is the .loc indexer of a pd.DataFrame. What I would recommend is the following:
Selecting rows with .loc works better in your case if the dates are first converted to datetime, so a first step is to make this conversion:
# Pandas is quite smart about guessing date format. If this fails, please check the
# documentation https://docs.python.org/3/library/datetime.html to learn more about
# format strings.
df['date'] = pd.to_datetime(df['date'])
# Make this the index of your data frame.
df.set_index('date', inplace=True)
It is not clear how you intend to use conditionals or what the other columns contain. Using .loc, assigning a value is pretty straightforward:
# At Feb 1, 2020, add a value to columns 'var'.
df.loc['2020-02-01', 'var'] = 0.727868
This could also be used for ranges:
# Assuming you have a second `df2` which has a datetime column 'date' with the
# data you wish to add to `df`. This will only work if all df2['date'] are found
# in df.index. You can work out the logic for your case.
# .to_numpy() assigns by position; a plain Series would be aligned on its own index.
df.loc[df2['date'], 'var2'] = df2['vals'].to_numpy()
If the logic is too complex and the dataframe is not too large, iterating with .iterrows could be easier, especially if you are beginning with Python.
for idx, row in df.iterrows():
    if idx in list_of_other_dates:
        df.loc[idx, 'var'] = (some code here)
Please clarify your problem a bit and you will get better answers. Do not forget to check the documentation.
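If the goal is simply "match on date, and put 0 where there is no match", a left merge may be enough. This is only a sketch: the frame names `main` and `extra` and the sample dates are made up, and it assumes both frames have a 'date' column of datetime type.

```python
import pandas as pd

# Hypothetical main frame: only the date column is filled in.
main = pd.DataFrame({"date": pd.date_range("2020-01-30", periods=4, freq="D")})

# Hypothetical shorter frame holding one of the extra columns.
extra = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-02-01"]),
    "var": [0.5, 0.727868],
})

# how="left" keeps every row of `main`; dates missing from `extra` become NaN.
merged = main.merge(extra, on="date", how="left")

# Replace the missing values with 0, as the question suggests.
merged["var"] = merged["var"].fillna(0)
print(merged)
```

The same merge can be repeated for each of the remaining columns, or done once if they all live in a single frame keyed by date.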

Creating a table with boolean condition between two columns

I'm currently looking at a Reddit data set which has comments and subreddit type as two of its columns. As there are too many rows, my goal is to restrict the dataset to something smaller.
By looking at df['subreddit'].value_counts() > 10000, I am looking for subreddits with more than 10000 comments. How do I create a new dataframe that meets this condition? Would I use loc or set up some kind of if statement?
First, df['subreddit'].value_counts() returns a Series. You can turn it into a DataFrame and then filter it.
What I would do is:
aux_df = df['subreddit'].value_counts().rename_axis('subreddit').reset_index(name='amount')
filtered_df = aux_df[aux_df['amount'] > 10000]
Optionally with loc:
filtered_df = aux_df.loc[aux_df['amount'].gt(10000)]
Edit
Based on the comment, I would first create a list of all subreddits with more than 10000 comments, which is provided above, and then simply filter the original dataframe with those values:
df = df[df['subreddit'].isin(list(filtered_df['subreddit']))]
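A shorter variant of the same idea, as a sketch: keep the value_counts() result as a Series and filter the original rows by its index. The sample data is invented and `min_comments` is a hypothetical stand-in for the 10000 threshold so the example runs on a tiny frame.

```python
import pandas as pd

# Hypothetical miniature of the Reddit data set.
df = pd.DataFrame({
    "subreddit": ["python", "python", "python", "pandas", "pandas", "rust"],
    "comment": ["a", "b", "c", "d", "e", "f"],
})

min_comments = 2  # stands in for 10000 in the question

# Series indexed by subreddit name, values are comment counts.
counts = df["subreddit"].value_counts()

# Names of the subreddits above the threshold.
popular = counts[counts > min_comments].index

# Keep only the comments that belong to those subreddits.
filtered = df[df["subreddit"].isin(popular)]
print(filtered)
```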

Cryptocurrency correlation in python, working with dictionaries

I'm working with a cryptocurrency data sample in which each cell contains a dictionary. The dictionary contains the open price, close price, highest price, lowest price, volume and market cap. The columns are the corresponding dates and the index is the name of each cryptocurrency.
I don't know how to prepare the data in order for me to find the correlation between different currencies and between highest price and volume for example. How can this be done in python (pandas)...also how would I define a date range in such a situation?
Here's a link to the data sample, my coding and a printout of the data (Access is OPEN TO PUBLIC): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I
To begin with, I would suggest rearranging your data so that each currency's OHLCV values are their own columns (e.g. "btc_open | btc_high" etc.). This makes generating correlation matrices far easier. I'd also suggest beginning with only one metric (e.g. close price) and perhaps period movement (e.g. close-open) in your analysis. To answer your question:
Pandas can return a correlation matrix of all columns with:
df.corr()
If you want to use only specific columns, select those from the DataFrame:
df[["col1", "col2"]].corr()
You can return a single correlation value between two columns with the form:
df["col1"].corr(df["col2"])
If you'd like to specify a specific date range, I'd refer you to this question. I believe this will require your date column or index to be of the type datetime. If you don't know how to work with or convert to this type, I would suggest consulting the pandas documentation (perhaps begin with pandas.to_datetime).
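As a small sketch of the date-range part: once the index is a sorted DatetimeIndex, a date window can be selected with .loc and string labels, and .corr() applied to the slice. The column names `col1` and `col2` follow the placeholders above; the values and dates are made up.

```python
import pandas as pd

# Hypothetical daily data with a string date column.
df = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"],
    "col1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "col2": [2.0, 4.1, 5.9, 8.2, 9.9],
})

# Convert to datetime and use it as a sorted index.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date").sort_index()

# Label-based slicing on a DatetimeIndex includes both endpoints.
window = df.loc["2018-01-02":"2018-01-04"]

corr_matrix = window[["col1", "col2"]].corr()   # full matrix for the window
single = window["col1"].corr(window["col2"])    # one coefficient
print(corr_matrix)
```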
In future, I would suggest including a data snippet in your post. I don't believe Google Drive is an appropriate form to share data, and it definitely is not appropriate to set the data to "request access".
EDIT: I checked your data and created a smaller subset to test this method on. If there are imperfections in the data you may find problems, but I had none when I tested it on a sample of your first 100 days and 10 coins (after transposing, df.iloc[:100, :10]).
Firstly, transpose the DataFrame so columns are organised by coin and rows are dates.
df = df.T
Following this, we concatenate to a new DataFrame (result). Alternatively, concatenate to the original and drop columns after. Unfortunately I can't think of a non-iterative method. This method goes column by column, creates a DataFrame for each coin, adds the coin name prefix to the column names, then concatenates each DataFrame to the end.
result = pd.DataFrame()
coins = df.columns.tolist()
for coin in coins:
    coin_data = df[coin]
    split_coin = coin_data.apply(pd.Series).add_prefix(coin+"_")
    result = pd.concat([result, split_coin], axis=1)
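For reference, here is a self-contained sketch of the whole pipeline on invented data. It swaps in a slightly different technique: each coin's dictionaries are expanded with pd.DataFrame(list_of_dicts) instead of apply(pd.Series), and the pieces are collected in a list so pd.concat runs once. The coin names, dates, and dictionary keys are hypothetical.

```python
import pandas as pd

# Hypothetical raw layout: index = coin, columns = dates, cells = dicts.
raw = pd.DataFrame(
    {
        "2018-01-01": [{"close": 10.0, "volume": 100.0}, {"close": 1.0, "volume": 50.0}],
        "2018-01-02": [{"close": 11.0, "volume": 120.0}, {"close": 1.2, "volume": 55.0}],
        "2018-01-03": [{"close": 12.0, "volume": 90.0}, {"close": 1.1, "volume": 60.0}],
    },
    index=["btc", "eth"],
)

# Rows become dates, columns become coins.
df = raw.T

pieces = []
for coin in df.columns:
    # Expand the dicts of one coin into columns, keeping the date index.
    expanded = pd.DataFrame(df[coin].tolist(), index=df.index)
    pieces.append(expanded.add_prefix(coin + "_"))

# Single concatenation instead of growing the frame inside the loop.
result = pd.concat(pieces, axis=1)

# Example: correlation between the two close-price columns.
close_corr = result["btc_close"].corr(result["eth_close"])
print(result)
```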
