Efficient search through a dataframe - python

I am trying to search through a large data frame for a specific date. The date may have multiple values in the data_value column. After finding the date, I am extracting the maximum value from the set of possible values associated with that date.
Is there a way to make this more efficient? It runs slowly now.
max_temps = []
for date in dates:
    value = data_w[data_w['Date'] == date]['Data_Value'].max()
    max_temps.append(value)

If I understood your problem properly, then you need something like this:
temp = data_w[data_w['Date'].isin(dates)]
print(temp.groupby('Date')['Data_Value'].max())
Explanation:
First apply isin to filter your large dataframe down to the requested dates, then apply groupby and take the max within each group.
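If you also need the results back in the same order as your original dates list (as in your loop), here is a small sketch, assuming data_w has 'Date' and 'Data_Value' columns:
daily_max = data_w[data_w['Date'].isin(dates)].groupby('Date')['Data_Value'].max()
max_temps = daily_max.reindex(dates).tolist()   # same order as the dates list, NaN for dates not present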

Related

Getting the maximum date in a dataframe - Pandas does not return the maximum

I am trying to extract the most recent date from a pandas dataframe.
I have a fairly simple line of code which seems to be the recommended way from many sources online:
maxdate = df['DATE'].max()
maxdate
'31/12/2021'
However, I know my dataframe contains a record of '11/09/2022'. Is the issue to do with the date format? I don't want to change the date format, as it will impact other things I'm doing.
Any help appreciated
Pass the column values to the pd.to_datetime function, take the index of the maximum value, and get the DATE value at that index. This way you get the maximum DATE without having to modify the existing column.
df.loc[pd.to_datetime(df['DATE'], dayfirst=True).idxmax(), 'DATE']
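A small self-contained sketch of why the plain string max misleads and what the line above returns (the sample values are made up for illustration):
import pandas as pd
df = pd.DataFrame({'DATE': ['31/12/2021', '11/09/2022', '05/03/2020']})
print(df['DATE'].max())        # '31/12/2021': the lexicographic maximum, not the latest date
latest = df.loc[pd.to_datetime(df['DATE'], dayfirst=True).idxmax(), 'DATE']
print(latest)                  # '11/09/2022': the actual most recent date, still in the original format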

How to index a dataframe based on an applied function? - Pandas

I have a dataframe that I created from a master table in SQL. That new dataframe is then grouped by type, as I want to find the outliers for each group in the master table.
The function finds the outliers, showing where in the grouped dataframe the outliers occur. How do I see these outliers as part of the original dataframe, not just VOLUME but also location, SKU, group, etc.?
dataframe: HOSIERY_df
Code:
##Sku Group Data Frames
grouped_skus = sku_volume.groupby('SKUGROUP')               # group the master table by SKU group
HOSIERY_df = grouped_skus.get_group('HOSIERY')              # keep just the HOSIERY group
hosiery_outliers = find_outliers_IQR(HOSIERY_df['VOLUME'])  # outliers in the VOLUME column
hosiery_outliers
#.iloc[[hosiery_outliers]]
#hosiery_outliers
Picture to show code and output:
I know enough to realise I need to find the rows based on the location of the index, like VLOOKUP in Excel, but I need to do it in Python. I'm not sure how to pull only the 5th, 6th, 7th...3888th and 4482nd rows of HOSIERY_df.
You can provide a list of index numbers as integers to iloc, which it looks like you have tried based on your commented-out code. So you may want to make sure that find_outliers_IQR is returning a list of ints so it will work properly with iloc, or convert its output.
It looks like it is currently returning a DataFrame. You can get the index of that frame as a list like this:
hosiery_outliers.index.tolist()
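From there, a sketch of pulling the full rows, assuming find_outliers_IQR keeps the original index labels of HOSIERY_df (which get_group preserves):
outlier_rows = HOSIERY_df.loc[hosiery_outliers.index]   # full rows: VOLUME plus LOCATION, SKU, group, etc.
print(outlier_rows)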

How do I separate this dataframe column by month?

A few rows of my dataframe
The third column shows the completion time of my data. Ideally, I'd want it to just show the date, removing the second half of each element, but I'm not sure how to change the elements. I was able to change the (second) column of strings into a column of floats without the pound symbol in order to find the sum of costs. However, this column has no specific keyword I can select on to remove part of every element.
The second part of my question is whether it is possible to easily create another dataframe that contains only the 2021-05-xx or 2021-06-xx rows. I know there is a way to make another dataframe by selecting certain rows, like the top 15 or bottom 7, but I don't know if there is a way to make a dataframe from the condition I described. I'm thinking it follows Series.str.contains(), but when I put '2021-05' in the parentheses, it returns an entire column of False values.
Extracting just the date and ignoring the time from the datetime column can be done by changing the formatting of the column.
df['date'] = pd.to_datetime(df['date']).dt.date
For the second part of the question, about creating a new dataframe filtered down to only the rows between 2021-05-xx and 2021-06-xx, we can use pandas filtering. Since the 'date' column now holds date objects, compare against date objects as well:
df_filtered = df[(df['date'] >= pd.to_datetime('2021-05-01').date()) & (df['date'] <= pd.to_datetime('2021-06-30').date())]
Here we take advantage of two things: 1) pandas makes it easy to compare the chronology of dates with ordinary comparison operators; 2) any date of the form 2021-05-xx or 2021-06-xx must fall on or after the first day of May and on or before the last day of June.
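On the Series.str.contains() attempt: that usually returns nothing useful when the column no longer holds plain strings. A sketch of two ways to keep only the 2021-05-xx and 2021-06-xx rows, assuming the 'date' column was converted as above:
dt = pd.to_datetime(df['date'])
df_may_june = df[(dt.dt.year == 2021) & (dt.dt.month.isin([5, 6]))]
# or, if you prefer str.contains, cast to string first:
df_may_june = df[df['date'].astype(str).str.contains('2021-05|2021-06')]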
There are also a few GUIs that make it easy to change the formatting of columns and to filter data without having to write the code yourself. I'm the creator of one of these tools, Mito. To filter dates in Mito, you can just enter the dates using its calendar input fields and Mito will generate the equivalent pandas code for you.

Cryptocurrency correlation in python, working with dictionaries

I'm working with a cryptocurrency data sample where each cell contains a dictionary. The dictionary holds the open price, close price, highest price, lowest price, volume and market cap. The columns are the corresponding dates and the index is the name of each cryptocurrency.
I don't know how to prepare the data in order to find the correlation between different currencies, or between highest price and volume, for example. How can this be done in Python (pandas)? Also, how would I define a date range in such a situation?
Here's a link to the data sample, my coding and a printout of the data (Access is OPEN TO PUBLIC): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I
To begin with, I would suggest rearranging your data so that each currency's OHLCV values are their own columns (e.g. "btc_open | btc_high" etc.). This makes generating correlation matrices far easier. I'd also suggest beginning with only one metric (e.g. close price) and perhaps period movement (e.g. close-open) in your analysis. To answer your question:
Pandas can return a correlation matrix of all columns with:
df.corr()
If you want to use only specific columns, select those from the DataFrame:
df[["col1", "col2"]].corr()
You can return a single correlation value between two columns with the form:
df["col1"].corr(df["col2"])
If you'd like to specify a specific date range, I'd refer you to this question. I believe this will require your date column or index to be of the type datetime. If you don't know how to work with or convert to this type, I would suggest consulting the pandas documentation (perhaps begin with pandas.to_datetime).
In future, I would suggest including a data snippet in your post. I don't believe Google Drive is an appropriate form to share data, and it definitely is not appropriate to set the data to "request access".
EDIT: I checked your data and created a smaller subset to test this method on. If there are imperfections in the data you may find problems, but I had none when I tested it on a sample of your first 100 days and 10 coins (after transposing, df.iloc[:100, :10]).
Firstly, transpose the DataFrame so columns are organised by coin and rows are dates.
df = df.T
Following this, we concatenate into a new DataFrame (result). Alternatively, concatenate to the original and drop the old columns afterwards. Unfortunately I can't think of a non-iterative method. This method goes column by column, creates a DataFrame for each coin, adds the coin name as a prefix to the column names, then concatenates each DataFrame to the end.
result = pd.DataFrame()
coins = df.columns.tolist()
for coin in coins:
    # Expand the coin's dictionaries into their own columns, prefixed with the coin name.
    coin_data = df[coin]
    split_coin = coin_data.apply(pd.Series).add_prefix(coin + "_")
    result = pd.concat([result, split_coin], axis=1)
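Once result is built, the correlation calls from above apply directly. The column names below are hypothetical and depend on the prefixes your loop actually generates:
print(result[["bitcoin_close", "ethereum_close"]].corr())     # correlation between two coins' close prices
print(result["bitcoin_high"].corr(result["bitcoin_volume"]))  # highest price vs volume for one coin
# For a date range, convert the (transposed) index to datetimes and slice:
result.index = pd.to_datetime(result.index)
result = result.sort_index()
print(result.loc["2017-01-01":"2017-06-30"].corr())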

Pandas: get average over certain rows and return as a dataframe

I have a df like this
It contains speed and dir at each date's hour and minute. For example, the first row records that at 7:11 on 20060101, dir=87 and speed=5.
Now, I think the data might be too precise, and I want to use the average at each hour for later computation. How can I do it?
I can do it by groupby:
df['Hr'] = df['HrMn'].apply(lambda x: str(x)[:-2])   # keep only the hour digits of HrMn
df.groupby(['date', 'Hr'])['speed'].mean()
which would return what I want
But it is not a dataframe, so how can I use it for later computation? Specifically, I want to know:
Is the groupby approach I'm using the right approach for this problem? If so, how do I use the result as a dataframe later? (I also need dir, dir_max and other attributes as well.)
The result groupby returns is not well ordered (by date and Hr); is there any way to re-order it?
Update:
If I do df.groupby(['date', 'Hr'])['speed'].mean().unstack(), it would return
The data is certainly correct, but I still want it to follow the form of the initial dataframe,
except that HrMn -> Hr.
What you are getting is a multi-index dataframe. You can try
df.groupby(['date', 'Hr'])['speed'].mean().reset_index()
If you want the mean for the rest of the columns as well, try
df.groupby(['date', 'Hr'])[['speed', 'dir_max', 'speed_max']].mean().reset_index()
EDIT:
Applying mean on the speed column and max on dir_max and speed_max:
import numpy as np
df.groupby(['date', 'Hr']).agg({'speed': np.mean, 'dir_max': np.max, 'speed_max': np.max}).reset_index()
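For the re-ordering issue mentioned in the question, one sketch (assuming pandas is imported as pd and the columns above exist). Note that 'Hr' came from a string slice, so cast it to a number before sorting, otherwise '10' sorts before '7':
hourly = df.groupby(['date', 'Hr']).agg({'speed': 'mean', 'dir_max': 'max', 'speed_max': 'max'}).reset_index()
hourly['Hr'] = pd.to_numeric(hourly['Hr'], errors='coerce')
hourly = hourly.sort_values(['date', 'Hr']).reset_index(drop=True)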
