for loop list to dataframe - python

I have the following for loop for a dataframe
# this is my data
df=yf.download('AAPL', period='max', interval='1d' )
vwap15 = []
for i in range(0, len(df)-1):
    if i >= 15:
        vwap15.append(sum(df["Close"][i-15:i]*df["Volume"][i-15:i])/sum(df["Volume"][i-15:i]))
    else:
        vwap15.append(None)
When I run the above for loop it generates a list.
I actually want it as a dataframe that I can join to my original dataframe df.
Any insights would be appreciated, thanks!

Maybe you mean something like (right after the loop):
df["vwap15"] = vwap15
Note that you will need to fix your for loop like so (otherwise lengths will not match):
for i in range(len(df)):
Maybe you want to have a look at currently available packages for Technical Analysis indicators in Python with Pandas.
Also, try to use NaN instead of None and consider using the Pandas .rolling method when computing indicators over a time window.
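For reference, a minimal sketch of the .rolling approach, using made-up price data in place of the yfinance download. Note that rolling(15) includes the current row in the window, unlike the [i-15:i] slice in the loop above, and it produces NaN automatically for the rows before the window fills:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the yfinance download
df = pd.DataFrame({
    "Close": np.linspace(100, 120, 30),
    "Volume": np.arange(1, 31, dtype=float),
})

# 15-bar rolling VWAP: sum(price * volume) / sum(volume) over the window.
pv = (df["Close"] * df["Volume"]).rolling(15).sum()
df["vwap15"] = pv / df["Volume"].rolling(15).sum()
```

Because the result is a Series aligned to df's index, assigning it as a column needs no length bookkeeping at all.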

Related

Best way to loop through a filtered pandas Dataframe

I need to loop through a pandas DataFrame, but first I have to filter it. I need to look at how many "old_id"s are attached to each new ID.
I wrote this code and it works fine, but it doesn't scale well.
d = dict()
for new_id in new_id_list:
    d[new_id] = df[df['new_id_col'] == new_id]['old_id'].nunique()
How can I make this more efficient?
Looks like you're looking for groupby + nunique. This fetches the number of unique "old_id"s per "new_id_col":
out = df.groupby('new_id_col')['old_id'].nunique().to_dict()
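A self-contained illustration with made-up IDs (column names follow the question):

```python
import pandas as pd

df = pd.DataFrame({
    "new_id_col": ["A", "A", "A", "B", "B"],
    "old_id":     [1,   1,   2,   3,   3],
})

# Count distinct old_ids per new_id in one vectorized pass,
# instead of re-filtering the dataframe once per new_id
out = df.groupby("new_id_col")["old_id"].nunique().to_dict()
# out == {"A": 2, "B": 1}
```

The original loop scans the full dataframe once per ID (O(n * k)); groupby does a single pass.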

Python loop through two dataframes and find similar column

I am currently working on a project where my goal is to get the game scores for each NCAA men's basketball game. In order to do this, I need to use the python package sportsreference. I need to use two dataframes, one called df which has the game date and one called box_index (shown below) which has the unique link of each game. I need to get the date column replaced by the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which makes it easier to do this with regex or .contains(). I keep getting a KeyError: 0. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    # print(box)
    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i, "boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i, "date"] = box.loc[i, "boxscore_index"]

get_team_schedule("Virginia")
It seems like "box" and "df" are pandas dataframes, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of searching by index with ".loc"):
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        # plain Python strings have no .contains(); use the "in" operator
        if row_df["date"] in row_box["boxscore_index"]:
            df.at[i, 'date'] = row_box["boxscore_index"]
The ".at" accessor will overwrite the value at a given cell.
Just FYI, iterrows is more efficient than .loc; however, itertuples is about 10x faster, and zip about 100x.
The KeyError: 0 is saying you can't get the row at index 0, because there is no index value of 0 when using box.loc[i, "boxscore_index"] (the index values are the dates, for example '2020-12-22-14-virginia'). You could use .iloc, though, like box.iloc[i]["boxscore_index"]. You'd have to convert all the .loc calls to that.
Like the other post said, though, I wouldn't go down that path. I actually wouldn't even use iterrows here. I would put the box_index into a list, then iterate through that, and use pandas to filter your df dataframe. I'm making some assumptions about what df looks like, so if this doesn't work or isn't what you're looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])
    for box_index in box_index_list:
        temp_game_data = df[df["date"] == box_index]  # note: was "boxscore_index", an undefined name
        print(box_index)
        print(temp_game_data, '\n')

get_team_schedule("Virginia")
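Under the same assumption about df (a date column formatted like the start of each boxscore link, here guessed as YYYY-MM-DD), the loops can also be replaced with a dictionary lookup; everything below is made-up sample data:

```python
import pandas as pd

# Hypothetical inputs: each boxscore link starts with its game date
df = pd.DataFrame({"date": ["2020-12-22", "2020-12-26"]})
box_index_list = ["2020-12-22-14-virginia", "2020-12-26-virginia"]

# Map each 10-character date prefix to its full link, then replace in one pass;
# dates with no matching link are left unchanged by fillna
lookup = {link[:10]: link for link in box_index_list}
df["date"] = df["date"].map(lookup).fillna(df["date"])
```

This avoids the nested row-by-row scan entirely and sidesteps the KeyError, since no positional .loc lookups are involved.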

Creating multiples pandas dataframes as outputs of a function iteration over a list

I'm trying to use the function itis.hierarchy_full of the pytaxize package in order to retrieve information about a biological species from a specific Id.
The function takes only one values/Id and save all the taxonomic information inside a pandas dataframe that I can edit later.
import pandas as pd
from pytaxize import itis
test1 = itis.hierarchy_full(180530, as_dataframe = True)
I have something like 800 species Ids, and I want to automate the process to obtain 800 different dataframes.
I have somehow created a test with a small list (be aware, I am a biologist, so the code is really basic and maybe inefficient):
species = [180530, 48739, 567823]
tx = {}
for e in species:  # note: was "species2", a typo
    tx[e] = pd.DataFrame(itis.hierarchy_full(e, as_dataframe=True))
Now if I input tx (I'm using a Jupyter Notebook) I obtain a dictionary of pandas dataframes (I think it is a nested dictionary). And if I input tx[180530] I obtain exactly a single dataframe equal to the ones that I can create with the original function.
from pandas.testing import assert_frame_equal
assert_frame_equal(test_180530, sp_180530)
Now I can write something to save each result stored in dictionary as a separate dataframe:
sp_180530 = tx[180530]
sp_48739 = tx[48739]
sp_567823 = tx[567823]
Is there a way to automate the process and save each dataframe to a sp_id variable? Or even better, is there a way to change the original function where I create tx so that it outputs multiple dataframes directly?
Not exactly what you asked, but to elaborate a bit more on working with the dataframes in the dictionary: loop over the dict and use each contained dataframe one by one...
for key in tx.keys():
    df_temp = tx[key]
    # <do all your stuff to df_temp .....>
    # Save the dataframe as you want/need (I assume as CSV here)
    df_temp.to_csv(f'sp_{key}.csv')
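If separate sp_id variables are not strictly needed, another option is to keep everything in a single dataframe: passing the dict straight to pd.concat uses the species IDs as an extra index level. A sketch with toy frames standing in for the itis.hierarchy_full results:

```python
import pandas as pd

# Toy stand-ins for the per-species dataframes returned by itis.hierarchy_full
tx = {
    180530: pd.DataFrame({"rank": ["Kingdom", "Species"]}),
    48739:  pd.DataFrame({"rank": ["Kingdom"]}),
}

# One dataframe, indexed by (species id, row number)
all_species = pd.concat(tx, names=["tsn"])

# A single species is still easy to pull back out
sp_180530 = all_species.loc[180530]
```

With 800 species this keeps one object to manage instead of 800 variables, and .loc recovers any single species on demand.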

Unable to loop through Dataframe rows: Length of values does not match length of index

I'm not entirely sure why I am getting this error as I have a very simple dataframe that I am currently working with. Here is a sample of the dataframe (the date column is the index):
date       | News
2021-02-01 | This is a news headline. This is a news summary.
2021-02-02 | This is another headline. This is another summary.
So basically, all I am trying to do is loop through the dataframe one row at a time and pull the News item, use the Sentiment Intensity Analyzer on it and store the compound value into a separate list (which I am appending to an empty list). However, when I run the loop, it gives me this error:
Length of values (5085) does not match the length of index (2675)
Here is a sample of the code that I have so far:
sia = SentimentIntensityAnalyzer()
news_sentiment_list = []
for i in range(0, (df_news.shape[0]-1)):
    n = df_news.iloc[i][0]
    news_sentiment_list.append(sia.polarity_scores(n)['compound'])
df['News Sentiment'] = news_sentiment_list
I've tried the loop statement a number of different ways using the FOR loop, and I always return that error. I am honestly lost at this point =(
edit: The shape of the dataframe is: (5087, 1)
The target dataframe is df, whereas you loop on df_news; the indexes are probably not the same. You might need to merge the dataframes before doing so.
Moreover, there is an easier approach to your problem that would avoid having to loop on it. Assuming your dataframe df_news holds the column News (as shown on your table), you can add a column to this dataframe simply by doing:
sia = SentimentIntensityAnalyzer()
df_news['News Sentiment'] = df_news['News'].apply(lambda x: sia.polarity_scores(x)['compound'])
A general rule when using pandas is to avoid for-loops as much as possible; except for very specific edge cases, pandas' built-in methods will be sufficient.
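The same apply pattern with a stand-in scoring function, so the sketch runs without NLTK; in practice you would swap in sia.polarity_scores(x)['compound']:

```python
import pandas as pd

df_news = pd.DataFrame(
    {"News": ["great results today", "terrible losses reported"]},
    index=pd.to_datetime(["2021-02-01", "2021-02-02"]),
)

# Hypothetical scorer standing in for sia.polarity_scores(x)["compound"]
def compound_score(text):
    return 1.0 if "great" in text else -1.0

# One new column, guaranteed the same length as the dataframe by construction
df_news["News Sentiment"] = df_news["News"].apply(compound_score)
```

Because apply returns a Series aligned to df_news's index, the "Length of values does not match length of index" error cannot occur.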

Creating dataframe by merging a number of unknown length dataframes

I am trying to do some analysis on baseball pitch F/x data. All the pitch data is stored in a pandas dataframe with columns like 'Pitch speed' and 'X location.' I have a wrapper function (using pandas.query) that, for a given pitch, will find other pitches with similar speed and location. This function returns a pandas dataframe of unknown size. I would like to use this function over large numbers of pitches; for example, to find all pitches similar to those thrown in a single game. I have a function that does this correctly, but it is quite slow (probably because it is constantly resizing resampled_pitches):
def get_pitches_from_templates(template_pitches, all_pitches):
    resampled_pitches = pd.DataFrame(columns=all_pitches.columns.values.tolist())
    for i, row in template_pitches.iterrows():
        resampled_pitches = resampled_pitches.append(get_pitches_from_template(row, all_pitches))
    return resampled_pitches
I have tried to rewrite the function using pandas.apply on each row, or by creating a list of dataframes and then merging, but can't quite get the syntax right.
What would be the fastest way to this type of sampling and merging?
It sounds like you should use pd.concat for this:
res = []
for i, row in template_pitches.iterrows():
    res.append(get_pitches_from_template(row, all_pitches))
return pd.concat(res)
I think that a merge might be even faster. Usage of df.iterrows() isn't recommended, as it creates a new Series for every row.
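A runnable sketch of the list-then-concat pattern, with a dummy get_pitches_from_template (a simple speed filter) standing in for the real pandas.query wrapper:

```python
import pandas as pd

all_pitches = pd.DataFrame({"speed": [88.0, 90.0, 91.0, 95.0]})
template_pitches = pd.DataFrame({"speed": [90.0, 95.0]})

def get_pitches_from_template(row, pitches):
    # Dummy matcher: pitches within 1 mph of the template pitch
    return pitches[(pitches["speed"] - row["speed"]).abs() <= 1.0]

# Collect the per-template results in a list and concatenate once at the end
frames = [get_pitches_from_template(row, all_pitches)
          for _, row in template_pitches.iterrows()]
resampled = pd.concat(frames, ignore_index=True)
```

Concatenating once avoids the quadratic copying that repeated appends cause, since each append built a brand-new dataframe.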
