How to edit a dataframe row by row while iterating? - python
So I am using a script that reads a CSV into a data frame and then scrapes price data using the tickers from that data frame. The original data frame has the following columns; note there is NO 'Price' column:
df.columns = ['Ticker TV', 'Ticker YF', 'TV Name', 'Sector', 'Industry', 'URLTV']
I've printed the first couple of rows of my "updated" data frame below:
        Ticker TV  Ticker YF  ...  URLTV    Price
1            100D     100D.L  ...    URL      NaN
2            1GIS     1GIS.L  ...    URL      NaN
3            1MCS     1MCS.L  ...    URL      NaN
...           ...        ...  ...    ...      ...
2442          ZYT      ZYT.L  ...    URL      NaN
100D.L        NaN        NaN  ...    NaN  9272.50
1GIS.L        NaN        NaN  ...    NaN  8838.50
1MCS.L        NaN        NaN  ...    NaN  5364.00
As you can see, it's not working as intended. I would like to create a new column named Price and attach each price to the correct ticker, so 100D.L should get 9272.50; then, when the script iterates to the next ticker, it should add the next price value to 1GIS, and so forth.
tickerList = df['Ticker YF']

for tick in tickerList:
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    print('The Price of ' + tick + ' is ' + str(currentPriceData))
    df.at[tick, 'Price'] = currentPriceData
Assign the price using the apply method:
df['Price'] = df['Ticker YF'].apply(lambda x: str(priceData(getSummary(x))))
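For context on why the original loop grew new rows: df.at[tick, 'Price'] uses the ticker string as a row label, but the frame is indexed by integers, so pandas appends a new row per ticker instead of updating the existing one. A minimal loop-based sketch that keeps the question's structure (getSummary and priceData are the question's own scraping helpers):

# iterate over (index label, ticker) pairs so writes target existing rows
for idx, tick in df['Ticker YF'].items():
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    df.at[idx, 'Price'] = float(currentPriceData)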
tick is just the value from your 'Ticker YF' column, so you can use enumerate to get the index as well. And if you want to access the previous price to add them up, you can then just use the previous index (idx - 1):
tickerList = df['Ticker YF']

for idx, tick in enumerate(tickerList):
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    print('The Price of ' + tick + ' is ' + str(currentPriceData))
    if idx != 0:
        # idx + 1 assumes df's index starts at 1, as in the question's printout,
        # so row idx holds the running total written in the previous pass
        df.at[idx + 1, 'Price'] = float(currentPriceData) + float(df.at[idx, 'Price'])
    else:
        df.at[idx + 1, 'Price'] = float(currentPriceData)
A more "elegant" idea could be something like:
df["Single_Price"]=df["Ticker YF"].apply(lambda x: priceData(getSummary(x)))
to get the single prices. Then create the next column with the accumulated prices:
df["Price"]=df["Ticker"].apply(lambda x: df["Single_Price"][df["Ticker"]<x["Ticker"]].sum())
this will add up every Single_Price (df["Single_Price"]) from every row whose ticker sorts before the current row's ticker x (df["Ticker YF"] < x) and creates a new column Price in your dataframe.
After that you can simply delete the single prices if you don't need them with:
del df["Single_Price"]
Related
Python/Pandas: Search for date in one dataframe and return value in column of another dataframe with matching date
I have two dataframes, one with earnings dates and a code for before market/after market, and the other with daily OHLC data.

First dataframe df:

       earnDate  anncTod
103  2015-11-18     0900
104  2016-02-24     0900
105  2016-05-18     0900
...         ...      ...
128  2022-03-01     0900
129  2022-05-18     0900
130  2022-08-17     0900

Second dataframe af:

   Datetime      Open      High       Low     Close   Volume
 2005-01-03   36.3458   36.6770   35.5522   35.6833  3343500
        ...       ...       ...       ...       ...      ...
 2022-04-22  246.5500  247.2000  241.4300  241.9100  1817977

I want to take a date from the first dataframe and find the open and/or close price in the second dataframe. Depending on the anncTod value, I want to find the close price of the previous day (if it is '0900') or the open and close price on the following day (else). I'll use these numbers to calculate the overnight, intraday and close-to-close moves, which will be stored in new columns on df.

I'm not sure how to search for matching values and fetch the value from that row but a different column. I'm trying to do this with df.iloc and a for loop. Here's the full code:

import pandas as pd
import requests
import datetime as dt

ticker = 'TGT'

## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])

## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index) - 28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[:, 1:-1]

## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime', 'Open', 'High', 'Low', 'Close', 'Volume'])

## create total return, overnight and intraday columns in df
df['Total Move'] = ''  ## col #2
df['Overnight'] = ''   ## col #3
df['Intraday'] = ''    ## col #4

for date in df['earnDate']:
    if df.iloc[date, 1] == '0900':
        priorday = af.loc[af.index.get_loc(date) - 1, 0]
        priorclose = af.loc[priorday, 4]
        open = af.loc[date, 1]
        close = af.loc[date, 4]
        df.iloc[date, 2] = close / priorclose
        df.iloc[date, 3] = open / priorclose
        df.iloc[date, 4] = close / open
    else:
        print('afternoon')

I get an error:

    if df.iloc[date, 1] == '0900':
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

Converting the date columns to integers creates another error. Is there a better way I should go about doing this?

Ideal output would look like this (made-up numbers, abbreviated output):

   earnDate  anncTod  Total Move  Overnight Move  Intraday Move
 2015-11-18     0900          9%            7.2%           1.8%

but would include all the dates given in the first dataframe.

UPDATE

I swapped df.iloc for df.loc and that seems to have solved that problem. The new issue is searching for the variable 'date' in the second dataframe af. I have simplified the code to just print the value in the 'Open' column while I troubleshoot.
Here is the updated and simplified code (all else remains the same):

import pandas as pd
import requests
import datetime as dt

ticker = 'TGT'

## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])

## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index) - 28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[:, 1:-1]

## set index to earnDate
df = df.set_index(pd.DatetimeIndex(df['earnDate']))

## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime', 'Open', 'High', 'Low', 'Close', 'Volume'])

## create total return, overnight and intraday columns in df
df['Total Move'] = ''  ## col #2
df['Overnight'] = ''   ## col #3
df['Intraday'] = ''    ## col #4

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(af.loc[date, 'Open'])  ## this is the line generating the error
    else:
        print('afternoon')

I now get KeyError: '2015-11-18'
To use loc to access a certain row, the label you search for must be in the index. Specifically, that means you'll need to set the date column as the index. Ex:

import pandas as pd

df = pd.DataFrame({'earnDate': ['2015-11-18', '2015-11-19', '2015-11-20'],
                   'anncTod': ['0900', '1000', '0800'],
                   'Open': [111, 222, 333]})

df = df.set_index(df["earnDate"])

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(df.loc[date, 'Open'])

# prints
# 111
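Applied to the question's code, the same fix would go on af, since that is the frame being probed with a date label. A minimal sketch, assuming af['Datetime'] holds the same calendar dates as df['earnDate']:

## index af by date so af.loc[date, 'Open'] can find the row
af = af.set_index(pd.DatetimeIndex(af['Datetime']))
print(af.loc['2015-11-18', 'Open'])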
RSI in Spyder using data in Excel
So I have an Excel file containing data on a specific stock. My file contains about 2 months of data; it monitors the Open price, Close price, High price, Low price and Volume of trades in 5-minute intervals, so there are about 3000 rows in my file. I want to calculate the RSI (or EMA if it's easier) of a stock daily. I'm making a summary table that collects the daily data, so it converts my table of 3000+ rows into a table with only about 60 rows (each row represents one day).

Essentially I want some sort of code that groups the data by date and then calculates the RSI as a single value for that day. RSI is given by 100 - (100 / (1 + RS)), where RS = average gain of up periods / average loss of down periods.

Note: my file uses 'Datetime', so each row's 'Datetime' looks something like '2022-03-03 9:30-5:00' and the next row would be '2022-03-03 9:35-5:00', etc. So the code needs to look at just the date and ignore the time, I guess.

Some code to maybe help understand what I'm looking for. Here I'm reading my file; I want the code to take it, group the data by date and then calculate the RSI of each day using the formula I wrote above:

dat = pd.read_csv('AMD_5m.csv', index_col='Datetime',
                  parse_dates=['Datetime'],
                  date_parser=lambda x: pd.to_datetime(x, utc=True))
dates = backtest.get_dates(dat.index)

# create a summary table
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio', 'RSI']  # add additional fields if necessary
summary_table = pd.DataFrame(index=dates, columns=cols)

# loop backtest by dates

This is the code I used to fill out the other columns in my summary table; I'll put my SMA (simple moving average) function below it.

for d in dates:
    this_dat = dat.loc[dat.index.date == d]
    # find the number of observations on date d
    summary_table.loc[d]['Num. Obs.'] = this_dat.shape[0]
    # get trading (i.e. position holding) signals
    signals = backtest.SMA(this_dat['Close'].values, window=10)
    # find the number of trades on date d
    summary_table.loc[d]['Num. Trade'] = np.sum(np.diff(signals) == 1)
    # find the PnL for 100 shares
    shares = 100
    PnL = -shares * np.sum(this_dat['Close'].values[1:] * np.diff(signals))
    if np.sum(np.diff(signals)) > 0:  # close position at market close
        PnL += shares * this_dat['Close'].values[-1]
    summary_table.loc[d]['PnL'] = PnL
    # find the win ratio
    ind_in = np.where(np.diff(signals) == 1)[0] + 1
    ind_out = np.where(np.diff(signals) == -1)[0] + 1
    num_win = np.sum((this_dat['Close'].values[ind_out] - this_dat['Close'].values[ind_in]) > 0)
    if summary_table.loc[d]['Num. Trade'] != 0:
        summary_table.loc[d]['Win. Ratio'] = 1. * num_win / summary_table.loc[d]['Num. Trade']

This is my function for calculating the Simple Moving Average. I was told to try and adapt it for RSI or for EMA (Exponential Moving Average). Apparently adapting it for EMA isn't too troublesome, but I can't figure it out.
def SMA(p, window=10, signal_type='buy only'):
    # input: price "p", look-back window "window",
    # signal_type = 'buy only' (default) -- gives long signals,
    #               'sell only' -- gives sell signals,
    #               'both'      -- gives both long and short signals
    # returns a list of signals: 1 for a long position and -1 for a short position
    signals = np.zeros(len(p))
    if len(p) < window:  # no signal without sufficient data
        return signals
    sma = list(np.zeros(window) + np.nan)  # the first few prices do not give technical indicator values
    sma += [np.average(p[k:k + window]) for k in np.arange(len(p) - window)]
    for i in np.arange(len(p) - 1):
        if np.isnan(sma[i]):
            continue  # skip the open market time window
        if sma[i] < p[i] and (signal_type == 'buy only' or signal_type == 'both'):
            signals[i] = 1
        elif sma[i] > p[i] and (signal_type == 'sell only' or signal_type == 'both'):
            signals[i] = -1
    return signals
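Since the question also asks about EMA: pandas already ships an exponentially weighted window, so a daily EMA of the close does not need a hand-rolled adaptation of SMA. A minimal sketch, assuming dat is the 5-minute frame loaded above and a 10-period span:

# per-day exponential moving average of the close
ema = (dat.groupby(dat.index.date)['Close']
          .transform(lambda s: s.ewm(span=10, adjust=False).mean()))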
I have two solutions to this. One is to loop through each group and add the relevant data to summary_table; the other is to calculate the whole series and set the RSI column to it.

I first recreated the data:

import yfinance
import pandas as pd

# initially created similar data through yfinance,
# then copied this to Excel and changed the Datetime column to match yours
df = yfinance.download("AAPL", period="60d", interval="5m")
# copied it and read it back as a dataframe
df = pd.read_clipboard(sep=r'\s{2,}', engine="python")

df.head()
#                  Datetime        Open        High         Low       Close   Adj Close   Volume
# 0  2022-03-03 09:30-05:00  168.470001  168.910004  167.970001  168.199905  168.199905  5374241
# 1  2022-03-03 09:35-05:00  168.199997  168.289993  167.550003  168.129898  168.129898  1936734
# 2  2022-03-03 09:40-05:00  168.119995  168.250000  167.740005  167.770004  167.770004  1198687
# 3  2022-03-03 09:45-05:00  167.770004  168.339996  167.589996  167.718094  167.718094  2128957
# 4  2022-03-03 09:50-05:00  167.729996  167.970001  167.619995  167.710007  167.710007   968410

Then I formatted the data and created the summary_table:

df["date"] = pd.to_datetime(df["Datetime"].str[:16], format="%Y-%m-%d %H:%M").dt.date

# calculate percentage change from open to close of each row
df["gain"] = (df["Close"] / df["Open"]) - 1

# your summary table, slightly changing the index to use the dates above
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio', 'RSI']  # add additional fields if necessary
summary_table = pd.DataFrame(index=df["date"].unique(), columns=cols)

Option 1:

# loop through each group, calculate the average gain and loss, then RSI
for grp, data in df.groupby("date"):
    # average gain for gains greater than 0
    average_gain = data[data["gain"] > 0]["gain"].mean()
    # average loss for gains less than 0
    average_loss = data[data["gain"] < 0]["gain"].mean()
    # add to the relevant cell of summary_table
    summary_table["RSI"].loc[grp] = 100 - (100 / (1 + (average_gain / average_loss)))

Option 2:

# define a function to apply in the groupby
def rsi_calc(series):
    avg_gain = series[series > 0].mean()
    avg_loss = series[series < 0].mean()
    return 100 - (100 / (1 + (avg_gain / avg_loss)))

summary_table["RSI"] = df.groupby("date")["gain"].apply(rsi_calc)

Output (same for each):

summary_table.head()
#            Num. Obs. Num. Trade  PnL Win. Ratio           RSI
# 2022-03-03       NaN        NaN  NaN        NaN   -981.214015
# 2022-03-04       NaN        NaN  NaN        NaN    501.950956
# 2022-03-07       NaN        NaN  NaN        NaN   -228.379066
# 2022-03-08       NaN        NaN  NaN        NaN  -2304.451654
# 2022-03-09       NaN        NaN  NaN        NaN   -689.824739
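One caveat worth noting: the RSI values above fall outside the usual 0-100 range because the average loss is negative as computed, which makes RS negative. The conventional definition uses the magnitude of the average loss. A one-line adjustment to the sketch above, if you want bounded values:

def rsi_calc(series):
    avg_gain = series[series > 0].mean()
    avg_loss = abs(series[series < 0].mean())  # magnitude keeps RSI within [0, 100]
    return 100 - (100 / (1 + (avg_gain / avg_loss)))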
How to get values from a dict into a new column, based on values in a column
I have a dictionary that contains all of the information for company ticker : sector, for example 'AAPL': 'Technology'. I have a CSV file that looks like this:

ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
A,ARQ,2000-03-31,2000-06-12,2000-04-30,2020-09-01,-4000000,7321000000,,5057000000,2264000000,,10.27,-95000000,978000000,978000000,1261000000,166000000,2.313,0.577,98000000,98000000,0,98000000,329000000,103000000,0,0.0,0.0,256000000,359000000,0.144,359000000,256000000,256000000,0.37,0.36,0.37,4642000000,,4642000000,28969949822,,,-133000000,-0.294,1.0,1224000000,0.493,0,0,4255000000,,1622000000,0,0,0,2679000000,2186000000,493000000,29849949822,-390000000,-326000000,2000000,-13000000,0,-11000000,-341000000,95000000,-38000000,0,166000000,166000000,166000000,0,0,0.067,1010000000,214000000,572000000,0.0,6.43,,,1453000000,0,66.0,,,1826000000,297000000,2485000000,2485000000,296000000,,,,,0,714000000,1.0,452271967,452000000,457000000,5.498,7321000000,0,90000000,192000000,16.197,2871000000
A,ARQ,2000-06-30,2000-09-01,2000-07-31,2020-09-01,-6000000,7827000000,,5344000000,2483000000,,10.821,-222000000,703000000,703000000,1369000000,155000000,2.129,0.597,129000000,129000000,0,129000000,361000000,146000000,0,0.0,0.0,238000000,384000000,0.144,384000000,238000000,238000000,0.34,0.34,0.34,4902000000,,4902000000,27458542149,30,19.97,-153000000,-0.338,1.0,1301000000,0.487,0,0,4743000000,,1762000000,0,0,0,2925000000,2510000000,415000000,28032542149,-275000000,-181000000,42000000,31000000,0,73000000,-417000000,-15000000,69000000,0,155000000,155000000,155000000,0,0,0.058,1091000000,210000000,783000000,0.0,5.719,46.877,44.2,1581000000,0,61.88,2.846,2.846,2167000000,452000000,2670000000,2670000000,318000000,,,,,0,773000000,1.0,453014579,453000000,461000000,5.894,7827000000,0,83000000,238000000,17.278,2834000000

I would like to have my dictionary match up with all the tickers in the CSV file and then write the corresponding values to a column in the CSV called sector. Code:

for ticker in company_dic:
    sf1['sector'] = sf1['ticker'].apply(company_dic[ticker])

The code is giving me problems. For example, the first sector is healthcare, and I get this error:

ValueError: Healthcare is an unknown string function

Would appreciate some help. I'm sure there's a pretty simple solution for this. Maybe using iterrows()?
Use .map, not .apply, to select values from a dict by using a column value as the key, because .map is the method specifically implemented for this operation. .map will return NaN if the ticker is not in the dict.

.apply can be used, but .map should be preferred:

df['sector'] = df.ticker.apply(lambda x: company_dict.get(x))

.get will return None if the ticker isn't in the dict.

import pandas as pd

# test dataframe for this example
df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'],
                   'dimension': ['ARQ', 'ARQ', 'ARQ'],
                   'calendardate': ['1999-12-31', '2000-03-31', '2000-06-30'],
                   'datekey': ['2000-03-15', '2000-06-12', '2000-09-01']})

# in your case, load the data from the file
df = pd.read_csv('file.csv')

# display(df)
  ticker dimension calendardate     datekey
0   AAPL       ARQ   1999-12-31  2000-03-15
1   AAPL       ARQ   2000-03-31  2000-06-12
2   AAPL       ARQ   2000-06-30  2000-09-01

# dict of sectors
company_dict = {'AAPL': 'tech'}

# insert the sector column using map, at a specific column index
df.insert(loc=1, column='sector', value=df['ticker'].map(company_dict))

# display(df)
  ticker sector dimension calendardate     datekey
0   AAPL   tech       ARQ   1999-12-31  2000-03-15
1   AAPL   tech       ARQ   2000-03-31  2000-06-12
2   AAPL   tech       ARQ   2000-06-30  2000-09-01

# write the updated data back to the csv file
df.to_csv('file.csv', index=False)
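If some tickers are missing from the dict and you'd rather see a placeholder than NaN, a small follow-up sketch (the 'Unknown' label is just an example):

df['sector'] = df['ticker'].map(company_dict).fillna('Unknown')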
# faster than a for loop
temp = sf1.ticker.map(lambda x: company_dic[str(x)])
sf1['sector'] = temp

You can pass na_action='ignore' if you have NaNs in the ticker column.
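For reference, na_action is an argument of Series.map itself; a sketch of that variant (the lambda is then never called with NaN, which would otherwise raise a KeyError on str(nan)):

sf1['sector'] = sf1.ticker.map(lambda x: company_dic[str(x)], na_action='ignore')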
Efficient iteration over pandas dataframe rows to calculate values for new data frame
I'm trying to create a dataframe where the columns relate to the IDs of sold items and the row indices are the IDs of the customers who bought those items. The cells should show how much each customer bought of each item.

To get this information I read a CSV file containing a row for every transaction made by customers. The file is parsed into the frame_ variable. I retrieve the customer and article IDs using the unique() function on the corresponding columns and use them to create a new dataframe with those IDs as column headers and row indices:

with open(f"{file_path}") as file:
    frame_ = pd.read_csv(file, sep="\t", header=None)

customer_ids = list(frame_[customer_index].unique())
item_ids = list(frame_[item_index].unique())

frame = pd.DataFrame.from_dict(
    dict.fromkeys(item_ids, dict.fromkeys(customer_ids, 0)))

For the next step I want to iterate over frame_ to check every row for 3 values: the customer ID, the item ID, and the amount of sold items. The amount should be added to the current value at frame.at[customer_id, item_id]:

for index, row in frame_.iterrows():
    customer = row[customer_index]
    item = row[item_index]
    amount = abs(float(row[2]))
    frame.at[customer, item] += amount

This part is especially slow due to me using iterrows(). I looked through some questions, but because I don't quite know what exactly I'm looking for, I couldn't find any solution on how to perform my task more efficiently. Thank you for your time and any suggestions you can offer.

Edit: the original file and the frame_ dataframe contain around ~2.5 million rows.

Edit 2: added an excerpt from frame_; "..." are other columns not relevant for this part. The column headers are actually 0-8; "ID", "amount", "itemID" and "customerID" were added for readability:

ID  ...  amount  ...  ...  itemID  ...  customerID  ...
1   ...  -5.0    ...  ...  1258    ...  805214      ...
2   ...  -10.0   ...  ...  3658    ...  798125      ...
3   ...  -7.5    ...  ...  2056    ...  589012      ...

Edit 3: expected output would look something like this:

        1258  3658  2056
805214   5.0     0     0
798125     0  10.0     0
589012     0     0   7.5
Start by preparing another column with the absolute values of the amounts (though I do not fully understand why you need abs and float - aren't your amounts already positive and numeric?):

import numpy as np
frame_["amount1"] = np.abs(frame_["amount"].astype(float))

Then aggregate by the customer and item indexes:

frame = frame_.groupby(["customerID", "itemID"])["amount1"].sum()

No explicit iteration needed. You can convert the result to a "wide" format if you want:

frame.unstack().fillna(0)
#itemID      1258  2056  3658
#customerID
#589012       0.0   7.5   0.0
#798125       0.0   0.0  10.0
#805214       5.0   0.0   0.0
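An equivalent one-step route to the wide format is pivot_table; a minimal sketch under the same column names as above:

# sum amounts per customer/item pair and fill missing pairs with 0
frame = frame_.pivot_table(index="customerID", columns="itemID",
                           values="amount1", aggfunc="sum", fill_value=0)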
Extracting data from a column in a data frame in Python
I want to extract the "A"(s) from this column. After doing that, I want to be able to print the data from the other columns associated with "A" in the same row. However, my code printed this instead:

outputs:
UniqueCarrier       NaN
CancellationCode    NaN
Name: CancellationCode, dtype: object
None

The column CancellationCode looks like this:

CancellationCode:
NaN
A
NaN
B
NaN

I want to get it to print in a data frame format with the filtered rows and columns. Here is my code:

cancellation_reason = (flight_data_finalcopy["CancellationCode"] == "A")
cancellation_reasons_filtered = cancellation_reason[["UniqueCarrier", "AirlineID", "Origin"]]
print(display(cancellation_reasons_filtered))
Try this:

cancellation_reason = flight_data_finalcopy[flight_data_finalcopy["CancellationCode"] == "A"]
cancellation_reasons_filtered = cancellation_reason[["UniqueCarrier", "AirlineID", "Origin"]]
print(display(cancellation_reasons_filtered))
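The same filter and column selection can also be done in one step with .loc; a sketch assuming the same column names:

cancellation_reasons_filtered = flight_data_finalcopy.loc[
    flight_data_finalcopy["CancellationCode"] == "A",
    ["UniqueCarrier", "AirlineID", "Origin"]]
display(cancellation_reasons_filtered)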