How can I correctly write a function? (Python)

Here is my definition:
import numpy as np
from datetime import datetime
from dateutil.relativedelta import relativedelta

def fill(df_name):
    """
    Function to fill rows and dates.
    """
    # Fill Down
    for row in df_name[0]:
        if 'Unnamed' in row:
            df_name[0] = df_name[0].replace(row, np.nan)
    df_name[0] = df_name[0].ffill(limit=2)
    df_name[1] = df_name[1].ffill(limit=2)
    # Fill in Dates
    for col in df_name.columns:
        if col >= 3:
            old_dt = datetime(1998, 11, 15)
            add_dt = old_dt + relativedelta(months=col - 3)
            new_dt = add_dt.strftime('%#m/%d/%Y')
            df_name = df_name.rename(columns={col: new_dt})
and then I call:
fill(df_cars)
The first half of the function works (columns 0 and 1 have filled in correctly). However, as you can see, the columns are still labeled 0-288. When I delete the function and simply run its body directly (changing df_name to df_cars), it runs correctly and the column names are the dates specified in the second half of the function.
What could be causing the # Fill in Dates portion not to execute when it is defined in a function? Does it have to do with local variables?
0 1 2 3 4 5 ... 287 288 289 290 291 292
0 France NaN Market 3330 7478 2273 ... NaN NaN NaN NaN NaN NaT
1 France NaN World 362 798 306 ... NaN NaN NaN NaN NaN NaT
2 France NaN % 0.108709 0.106713 0.134624 ... NaN NaN NaN NaN NaN NaT
3 Germany NaN Market 1452 2025 1314 ... NaN NaN NaN NaN NaN NaT
4 Germany NaN World 209 246 182 ... NaN NaN NaN NaN NaN NaT
.. ... ... ... ... ... ... ... ... ... ... ... ... ..
349 Slovakia 0 World 1 1 0 ... NaN NaN NaN NaN NaN NaT
350 Slovakia 0 % 0.5 0.5 0 ... NaN NaN NaN NaN NaN NaT
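Yes, this is a local-variable issue. df_name.rename(...) returns a new DataFrame, and the assignment only rebinds the local name df_name inside the function; the caller's df_cars is never touched. The column assignments in the first half (df_name[0] = ...) mutate the caller's object in place, which is why the fill-down works. The usual fix is to return the result and reassign at the call site, e.g. df_cars = fill(df_cars). A minimal sketch of the difference (the frame and helper names here are invented for illustration):
import pandas as pd

def rename_locally(df):
    # rename() returns a new DataFrame; this line only rebinds the local name
    df = df.rename(columns={0: 'renamed'})

def rename_and_return(df):
    return df.rename(columns={0: 'renamed'})

df_demo = pd.DataFrame({0: [1, 2]})
rename_locally(df_demo)
print(df_demo.columns.tolist())   # [0] -- the caller's frame is unchanged
df_demo = rename_and_return(df_demo)
print(df_demo.columns.tolist())   # ['renamed']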

Related

tabula.read_pdf in python, getting a list variable and can't read it

I am using tabula to extract some data from a PDF. When I read the file, it outputs a list, not a DataFrame, and I'm having problems reading the values:
import tabula

file = "example.pdf"
path = 'data/' + file
df = tabula.read_pdf(path, pages='1', multiple_tables=False)
cliente_raw = tabula.read_pdf(path, pages=1, output_format="dataframe")
print(cliente_raw)
This is the output:
[ Beneficiario: Nury García Unnamed: 1 NIT/Cédula:
0 Dirección: Calle 115 #53-74 Apto 307 NaN Ciudad:
1 Referencia Descripción NaN
2 Spectral + Porcelai Perfect Face Kit, -/- NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
39564525 Teléfono: 601 6299329 Unnamed: 5 Unnamed: 6
0 BOGOTA (C/MARCA) País: COLOMBIA NaN NaN
1 Cantidad IVA Valor Unitario NaN Valor Total
2 1 19% 125,210 NaN 125,210
3 NaN Subtotal NaN 125,210
4 NaN IVA NaN 23,790
5 NaN TOTAL NaN 149,000 ]
The len of this variable is 1, so I don't know how to extract the values. Any help?
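tabula.read_pdf returns a list of DataFrames (one per table it detects), even when only one table is found, so index into the list first. A minimal sketch, reusing the path from the question:
import tabula

dfs = tabula.read_pdf('data/example.pdf', pages=1, multiple_tables=False)
df = dfs[0]            # the single DataFrame inside the one-element list
print(df.shape)
print(df.iloc[0, 0])   # individual cells are then reachable with normal indexing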

pandas.read_html tables not found

I'm trying to get a list of the major world indices in Yahoo Finance at this URL: https://finance.yahoo.com/world-indices.
I tried first to get the indices in a table by just running
major_indices=pd.read_html("https://finance.yahoo.com/world-indices")[0]
In this case the error was:
ValueError: No tables found
So I read a solution using selenium at pandas read_html - no tables found
The solution they came up with is (with some adjustments):
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.keys import Keys
from webdrivermanager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().download_and_install())
driver.get("https://finance.yahoo.com/world-indices")
html = driver.page_source
tables = pd.read_html(html)
data = tables[1]
Again this code gave me another error:
ValueError: No tables found
I don't know whether to keep using selenium or if pd.read_html is just fine. Either way I'm trying to get this data and don't know how to proceed. Can anyone help me?
You don't need Selenium here; you just have to set the euConsentId cookie. Without it, Yahoo serves a cookie-consent page instead of the indices table, which is why read_html finds no tables:
import pandas as pd
import requests
import uuid
url = 'https://finance.yahoo.com/world-indices'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
df = pd.read_html(html)[0]
Output:
>>> df
Symbol Name Last Price Change % Change Volume Intraday High/Low 52 Week Range Day Chart
0 ^GSPC S&P 500 4023.89 93.81 +2.39% 2.545B NaN NaN NaN
1 ^DJI Dow 30 32196.66 466.36 +1.47% 388.524M NaN NaN NaN
2 ^IXIC Nasdaq 11805.00 434.04 +3.82% 5.15B NaN NaN NaN
3 ^NYA NYSE COMPOSITE (DJ) 15257.36 326.26 +2.19% 0 NaN NaN NaN
4 ^XAX NYSE AMEX COMPOSITE INDEX 4025.81 122.66 +3.14% 0 NaN NaN NaN
5 ^BUK100P Cboe UK 100 739.68 17.83 +2.47% 0 NaN NaN NaN
6 ^RUT Russell 2000 1792.67 53.28 +3.06% 0 NaN NaN NaN
7 ^VIX CBOE Volatility Index 28.87 -2.90 -9.13% 0 NaN NaN NaN
8 ^FTSE FTSE 100 7418.15 184.81 +2.55% 0 NaN NaN NaN
9 ^GDAXI DAX PERFORMANCE-INDEX 14027.93 288.29 +2.10% 0 NaN NaN NaN
10 ^FCHI CAC 40 6362.68 156.42 +2.52% 0 NaN NaN NaN
11 ^STOXX50E ESTX 50 PR.EUR 3703.42 89.99 +2.49% 0 NaN NaN NaN
12 ^N100 Euronext 100 Index 1211.74 28.89 +2.44% 0 NaN NaN NaN
13 ^BFX BEL 20 3944.56 14.35 +0.37% 0 NaN NaN NaN
14 IMOEX.ME MOEX Russia Index 2307.50 9.61 +0.42% 0 NaN NaN NaN
15 ^N225 Nikkei 225 26427.65 678.93 +2.64% 0 NaN NaN NaN
16 ^HSI HANG SENG INDEX 19898.77 518.43 +2.68% 0 NaN NaN NaN
17 000001.SS SSE Composite Index 3084.28 29.29 +0.96% 3.109B NaN NaN NaN
18 399001.SZ Shenzhen Component 11159.79 64.92 +0.59% 3.16B NaN NaN NaN
19 ^STI STI Index 3191.16 25.98 +0.82% 0 NaN NaN NaN
20 ^AXJO S&P/ASX 200 7075.10 134.10 +1.93% 0 NaN NaN NaN
21 ^AORD ALL ORDINARIES 7307.70 141.10 +1.97% 0 NaN NaN NaN
22 ^BSESN S&P BSE SENSEX 52793.62 -136.69 -0.26% 0 NaN NaN NaN
23 ^JKSE Jakarta Composite Index 6597.99 -1.85 -0.03% 0 NaN NaN NaN
24 ^KLSE FTSE Bursa Malaysia KLCI 1544.41 5.61 +0.36% 0 NaN NaN NaN
25 ^NZ50 S&P/NZX 50 INDEX GROSS 11168.18 -9.18 -0.08% 0 NaN NaN NaN
26 ^KS11 KOSPI Composite Index 2604.24 54.16 +2.12% 788539 NaN NaN NaN
27 ^TWII TSEC weighted index 15832.54 215.86 +1.38% 0 NaN NaN NaN
28 ^GSPTSE S&P/TSX Composite index 20099.81 400.76 +2.03% 294.637M NaN NaN NaN
29 ^BVSP IBOVESPA 106924.18 1236.54 +1.17% 0 NaN NaN NaN
30 ^MXX IPC MEXICO 49579.90 270.58 +0.55% 212.868M NaN NaN NaN
31 ^IPSA S&P/CLX IPSA 5058.88 0.00 0.00% 0 NaN NaN NaN
32 ^MERV MERVAL 38390.84 233.89 +0.61% 0 NaN NaN NaN
33 ^TA125.TA TA-125 1964.95 23.38 +1.20% 0 NaN NaN NaN
34 ^CASE30 EGX 30 Price Return Index 10642.40 -213.50 -1.97% 36.837M NaN NaN NaN
35 ^JN0U.JO Top 40 USD Net TRI Index 4118.19 65.63 +1.62% 0 NaN NaN NaN

AssertionError when using df.loc in Python

I created a script to load data, check NA values, and fill all NA values. Here is my code:
import pandas as pd

def filter_df(merged_df, var_list):
    ind = merged_df.Name.isin(var_list)
    return merged_df[ind]

def pivot_df(df):
    return df.pivot(index='Date', columns='Name', values=['Open', 'High', 'Low', 'Close'])

def validation_df(input, summary=False):
    df = input.copy()
    # na check
    missing = df.isna().sum().sort_values(ascending=False)
    percent_missing = ((missing / df.isnull().count()) * 100).sort_values(ascending=False)
    missing_df = pd.concat([missing, percent_missing], axis=1, keys=['Total', 'Percent'], sort=False)
    # fill na
    columns = list(missing_df[missing_df['Total'] >= 1].reset_index()['index'])
    for col in columns:
        null_index = df.index[df[col].isnull() == True].tolist()
        null_index.sort()
        for ind in null_index:
            if ind > 0:
                print(df.loc[ind, col])
                print(df.loc[ind - 1, col])
                df.loc[ind, col] = df.loc[ind - 1, col]
            if ind == 0:
                df.loc[ind, col] = 0
    # outliers check
    count = []
    for col in df.columns:
        count.append(sum(df[col] > df[col].mean() + 2 * df[col].std()) + sum(df[col] < df[col].mean() - 2 * df[col].std()))
    outliers_df = pd.DataFrame({'Columns': df.columns, 'Count': count}).sort_values(by='Count')
    if summary == True:
        print('missing value check:\n')
        print(missing_df)
        print('\n outliers check:\n')
        print(outliers_df)
    return df

def join_df(price_df, transaction_df, var_list):
    price_df = filter_df(price_df, var_list)
    price_df = pivot_df(price_df)
    joined_df = transaction_df.merge(price_df, how='left', on='Date')
    #joined_df = validation_df(joined_df)
    return joined_df

token_path = 'https://raw.githubusercontent.com/Carloszone/Cryptocurrency_Research_project/main/datasets/1_token_df.csv'
transaction_path = 'https://raw.githubusercontent.com/Carloszone/Cryptocurrency_Research_project/main/datasets/transaction_df.csv'
var_list = ['Bitcoin', 'Ethereum', 'Golem', 'Solana']

token_df = pd.read_csv(token_path)
transaction_df = pd.read_csv(transaction_path)

df = join_df(token_df, transaction_df, var_list)
df = validation_df(df)
But it did not work. I checked my code and found that the issue came from loc(). For example:
df = join_df(token_df, transaction_df, var_list)
print(df[df.columns[15]])
print(df.loc[1,df.columns[15]])
what I got is:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
2250 NaN
2251 NaN
2252 NaN
2253 NaN
2254 NaN
Name: (High, Solana), Length: 2255, dtype: float64
AssertionError Traceback (most recent call last)
<ipython-input-19-75f01cc22c9c> in <module>()
2
3 print(df[df.columns[15]])
----> 4 print(df.loc[1,df.columns[15]])
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in __getitem__(self, key)
923 with suppress(KeyError, IndexError):
924 return self.obj._get_value(*key, takeable=self._takeable)
--> 925 return self._getitem_tuple(key)
926 else:
927 # we by definition only have the 0th axis
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
1107 return self._multi_take(tup)
1108
-> 1109 return self._getitem_tuple_same_dim(tup)
1110
1111 def _get_label(self, label, axis: int):
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _getitem_tuple_same_dim(self, tup)
807 # We should never have retval.ndim < self.ndim, as that should
808 # be handled by the _getitem_lowerdim call above.
--> 809 assert retval.ndim == self.ndim
810
811 return retval
AssertionError:
I don't know why df[column_name] works, but df.loc[index, column_name] fails.
You can check my code on Colab: https://colab.research.google.com/drive/1Yg280JRwFayW1tdp4OJqTO5-X3dGsItB?usp=sharing
The issue is that you're merging two DataFrames on a column they don't share: because you pivoted price_df, its Date column became the index. Also, the Date columns don't have a uniform format, so you have to make them the same. Replace your join_df function with the one below and it will work as expected.
I added comments on the lines that had to be added.
def join_df(price_df, transaction_df, var_list):
    price_df = filter_df(price_df, var_list)
    price_df = pivot_df(price_df)
    # After the pivot the Date column is the index, and price_df has MultiIndex columns;
    # since we want to merge it with transaction_df, we need to first flatten the columns
    price_df.columns = price_df.columns.map('.'.join)
    # and reset_index so that we have the index as the Date column
    price_df = price_df.reset_index()
    # the Dates are formatted differently across the two DataFrames;
    # one has the format '2016-01-01' and the other '2016/1/1';
    # to get a uniform format, we convert both Date columns to datetime objects
    price_df['Date'] = pd.to_datetime(price_df['Date'])
    transaction_df['Date'] = pd.to_datetime(transaction_df['Date'])
    joined_df = transaction_df.merge(price_df, how='left', on='Date')
    #joined_df = validation_df(joined_df)
    return joined_df
Output:
Date total_transaction_count Volume gas_consumption \
0 2016-01-01 2665 NaN NaN
1 2016-01-02 4217 NaN NaN
2 2016-01-03 4396 NaN NaN
3 2016-01-04 4776 NaN NaN
4 2016-01-05 26649 NaN NaN
... ... ... ... ...
2250 2022-02-28 1980533 1.968686e+06 8.626201e+11
2251 2022-03-01 2013145 2.194055e+06 1.112079e+12
2252 2022-03-02 1987934 2.473327e+06 1.167615e+12
2253 2022-03-03 1973190 3.093248e+06 1.260826e+12
2254 2022-03-04 1861286 4.446204e+06 1.045814e+12
old_ave_gas_fee new_avg_gas_fee new_avg_base_fee \
0 0.000000e+00 0.000000e+00 0.000000e+00
1 0.000000e+00 0.000000e+00 0.000000e+00
2 0.000000e+00 0.000000e+00 0.000000e+00
3 0.000000e+00 0.000000e+00 0.000000e+00
4 0.000000e+00 0.000000e+00 0.000000e+00
... ... ... ...
2250 6.356288e-08 6.356288e-08 5.941877e-08
2251 5.368574e-08 5.368574e-08 4.982823e-08
2252 5.567472e-08 5.567472e-08 4.782055e-08
2253 4.763823e-08 4.763823e-08 4.140883e-08
2254 4.566440e-08 4.566440e-08 3.547666e-08
new_avg_priority_fee Open.Bitcoin Open.Ethereum ... High.Golem \
0 0.000000e+00 430.0 NaN ... NaN
1 0.000000e+00 434.0 NaN ... NaN
2 0.000000e+00 433.7 NaN ... NaN
3 0.000000e+00 430.7 NaN ... NaN
4 0.000000e+00 433.3 NaN ... NaN
... ... ... ... ... ...
2250 4.144109e-09 37707.2 2616.34 ... 0.48904
2251 3.857517e-09 43187.2 2922.44 ... 0.48222
2252 7.854179e-09 44420.3 2975.80 ... 0.47550
2253 6.229401e-09 NaN NaN ... NaN
2254 1.018774e-08 NaN NaN ... NaN
High.Solana Low.Bitcoin Low.Ethereum Low.Golem Low.Solana \
0 NaN 425.9 NaN NaN NaN
1 NaN 430.7 NaN NaN NaN
2 NaN 423.1 NaN NaN NaN
3 NaN 428.6 NaN NaN NaN
4 NaN 428.9 NaN NaN NaN
... ... ... ... ... ...
2250 NaN 37458.9 2574.12 0.41179 NaN
2251 NaN 42876.6 2858.54 0.45093 NaN
2252 NaN 43361.3 2914.70 0.43135 NaN
2253 NaN NaN NaN NaN NaN
2254 NaN NaN NaN NaN NaN
Close.Bitcoin Close.Ethereum Close.Golem Close.Solana
0 434.0 NaN NaN NaN
1 433.7 NaN NaN NaN
2 430.7 NaN NaN NaN
3 433.3 NaN NaN NaN
4 431.2 NaN NaN NaN
... ... ... ... ...
2250 43188.2 2922.50 0.47748 NaN
2251 44420.3 2975.81 0.47447 NaN
2252 43853.2 2952.47 0.43964 NaN
2253 NaN NaN NaN NaN
2254 NaN NaN NaN NaN
[2255 rows x 24 columns]
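For reference, the two fixes in isolation: flatten the pivoted MultiIndex columns to plain strings, and normalize both Date columns to datetime before merging. A self-contained toy sketch (the small frames and their values are invented):
import pandas as pd

price = pd.DataFrame({'Date': ['2016/1/1', '2016/1/2'],
                      'Name': ['Bitcoin', 'Bitcoin'],
                      'Open': [430.0, 434.0],
                      'Close': [434.0, 433.7]})
pivoted = price.pivot(index='Date', columns='Name', values=['Open', 'Close'])
pivoted.columns = pivoted.columns.map('.'.join)   # ('Open', 'Bitcoin') -> 'Open.Bitcoin'
pivoted = pivoted.reset_index()                   # Date: index -> ordinary column
pivoted['Date'] = pd.to_datetime(pivoted['Date'])

tx = pd.DataFrame({'Date': pd.to_datetime(['2016-01-01', '2016-01-02']),
                   'count': [2665, 4217]})
print(tx.merge(pivoted, how='left', on='Date'))   # the merge now finds matching keys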

Need to iterate over row to check conditions and retrieve values from different columns if the conditions are met

I have a daily price data for a stock. Pasting last 31 rows of the data as an example dataset as below:
Date RSI Smooth max min
110 2019-02-13 38.506874 224.006543 NaN NaN
111 2019-02-14 39.567068 227.309923 NaN NaN
112 2019-02-15 43.774479 229.830776 NaN NaN
113 2019-02-18 43.651440 231.690179 NaN NaN
114 2019-02-19 43.467237 232.701976 NaN NaN
115 2019-02-20 44.370123 233.526131 NaN NaN
116 2019-02-21 45.605073 233.834988 233.834988 NaN
117 2019-02-22 46.837518 232.335179 NaN NaN
118 2019-02-25 42.087860 229.570711 NaN NaN
119 2019-02-26 39.008014 226.379526 NaN NaN
120 2019-02-27 39.542339 225.607475 NaN 225.607475
121 2019-02-28 39.051104 228.305615 NaN NaN
122 2019-03-01 48.191687 232.544289 NaN NaN
123 2019-03-05 51.909527 237.063534 NaN NaN
124 2019-03-06 52.988668 240.243201 NaN NaN
125 2019-03-07 54.205990 242.265173 NaN NaN
126 2019-03-08 54.967076 243.912033 NaN NaN
127 2019-03-11 58.080738 244.432163 244.432163 NaN
128 2019-03-12 55.587328 243.573710 NaN NaN
129 2019-03-13 51.714123 241.191933 NaN NaN
130 2019-03-14 48.948075 238.470485 NaN NaN
131 2019-03-15 46.615111 236.144640 NaN NaN
132 2019-03-18 48.219815 233.588265 NaN NaN
133 2019-03-19 41.866898 230.271903 NaN 230.271903
134 2019-03-20 34.818844 239.457110 NaN NaN
135 2019-03-22 42.167870 246.824173 NaN NaN
136 2019-03-25 60.228588 255.294124 NaN NaN
137 2019-03-26 66.896640 267.069173 NaN NaN
138 2019-03-27 68.823285 278.222343 NaN NaN
139 2019-03-28 63.654023 289.042091 289.042091 NaN
I am trying to develop the following logic:
if max > 0, then search for the previous non-zero max value and assign it to max2. Also, assign the corresponding RSI of the previous non-zero max as RSI2.
Desired output:
For line 139 in the data set, max2 will be 244.432163 and RSI2 will be 58.080738
For line 138 in the data set, max2 will be 0 and RSI 2 will be 0 and so on...
I tried different approaches but was unsuccessful at getting any output, so I do not have sample code to paste.
I also tried using if loops but I am unable to make it work. I am very new to programming.
First you will need to iterate the dataframe.
Then you will need to store the previous values that you will need to save on the next hit. Since you are always going back to the previous max, you can reuse that as you loop through.
Something like this (did not test, just for an idea):
last_max = 0
last_rsi = 0
for index, row in df.iterrows():
    if pd.notna(row['max']):      # note: a bare `if row['max']:` would also pass for NaN
        # write through df.at -- assigning into `row` would only change a copy
        df.at[index, 'max2'] = last_max
        df.at[index, 'RSI2'] = last_rsi
        last_max = row['max']     # store this max/RSI for next time
        last_rsi = row['RSI']
The right answer is to add a single line of code, as below:
df[['max2', 'RSI2']] = df[['max', 'RSI']].dropna(subset=['max']).shift(1).fillna(0)
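To see what that line does, a toy sketch with three invented rows: dropna(subset=['max']) keeps only the rows where max is set, shift(1) pulls in each one's predecessor, fillna(0) zeroes out the first, and the index-aligned assignment leaves all other rows as NaN:
import numpy as np
import pandas as pd

df = pd.DataFrame({'RSI': [45.605073, 58.080738, 63.654023],
                   'max': [233.834988, np.nan, 289.042091]})
df[['max2', 'RSI2']] = df[['max', 'RSI']].dropna(subset=['max']).shift(1).fillna(0)
print(df)
#          RSI         max        max2       RSI2
# 0  45.605073  233.834988    0.000000   0.000000
# 1  58.080738         NaN         NaN        NaN
# 2  63.654023  289.042091  233.834988  45.605073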

Reindexing and filling on one level of a hierarchical index in pandas

I have a pandas dataframe with a two level hierarchical index ('item_id' and 'date'). Each row has columns for a variety of metrics for a particular item in a particular month. Here's a sample:
total_annotations unique_tags
date item_id
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2008-07-01 2 81 33
2008-11-01 2 82 34
2009-04-01 2 84 35
2010-03-01 2 90 35
2010-04-01 2 100 36
2010-11-01 2 105 40
2011-05-01 2 106 40
2011-07-01 2 108 42
2005-08-01 3 479 200
2005-09-01 3 707 269
2005-10-01 3 980 327
2005-11-01 3 1176 373
2005-12-01 3 1536 438
2006-01-01 3 1854 497
2006-02-01 3 2206 560
2006-03-01 3 2558 632
2007-02-01 3 5650 1019
As you can see, there are not observations for every consecutive month for each item. What I want to do is reindex the dataframe so that each item has a row for every month in a specified range. Now, this is easy to accomplish for any given item. So, for item_id 99, for example:
baseDateRange = pd.date_range('2005-07-01','2013-01-01',freq='MS')
data.xs(99,level='item_id').reindex(baseDateRange,method='ffill')
But with this method, I'd have to iterate through all the item_ids, then merge everything together, which seems woefully over-complicated.
So how can I apply this to the full dataframe, ffill-ing the observations (but also the item_id index) such that each item_id has properly filled rows for all the dates in baseDateRange?
Essentially, for each group you want to reindex and ffill. The apply gets passed a DataFrame that still has item_id and date in the index, so reset the index, then set date as the index again and reindex with filling.
idx is your baseDateRange from above.
In [33]: df.groupby(level='item_id').apply(
lambda x: x.reset_index().set_index('date').reindex(idx,method='ffill')).head(30)
Out[33]:
item_id annotations tags
item_id
2 2005-07-01 NaN NaN NaN
2005-08-01 NaN NaN NaN
2005-09-01 NaN NaN NaN
2005-10-01 NaN NaN NaN
2005-11-01 NaN NaN NaN
2005-12-01 NaN NaN NaN
2006-01-01 NaN NaN NaN
2006-02-01 NaN NaN NaN
2006-03-01 NaN NaN NaN
2006-04-01 NaN NaN NaN
2006-05-01 NaN NaN NaN
2006-06-01 NaN NaN NaN
2006-07-01 NaN NaN NaN
2006-08-01 NaN NaN NaN
2006-09-01 NaN NaN NaN
2006-10-01 NaN NaN NaN
2006-11-01 NaN NaN NaN
2006-12-01 NaN NaN NaN
2007-01-01 NaN NaN NaN
2007-02-01 NaN NaN NaN
2007-03-01 NaN NaN NaN
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2007-07-01 2 36 19
2007-08-01 2 36 19
2007-09-01 2 36 19
2007-10-01 2 36 19
2007-11-01 2 36 19
2007-12-01 2 36 19
Building on Jeff's answer, I consider this somewhat more readable. It is also considerably more efficient, since only the droplevel and reindex methods are used.
df = df.set_index(['item_id', 'date'])

def fill_missing_dates(x, idx=all_dates):
    # all_dates plays the role of baseDateRange from the question
    x.index = x.index.droplevel('item_id')
    return x.reindex(idx, method='ffill')

filled_df = (df.groupby('item_id')
               .apply(fill_missing_dates))
