Accessing nested columns - python

I have the following DataFrame containing stock data from several financial institutions.
+-------------+-------+-------+-------+-------+------------+-------+-------+-------+-------+------------+
| Bank Ticker |                    BAC                     |                     C                      |
+-------------+-------+-------+-------+-------+------------+-------+-------+-------+-------+------------+
| Stock Info  | Open  | High  | Low   | Close | Volume     | Open  | High  | Low   | Close | Volume     |
+-------------+-------+-------+-------+-------+------------+-------+-------+-------+-------+------------+
| Date        |       |       |       |       |            |       |       |       |       |            |
+-------------+-------+-------+-------+-------+------------+-------+-------+-------+-------+------------+
| 2015-12-31  | 52.07 | 52.39 | 51.75 | 51.75 | 11274831.0 | 17.01 | 17.07 | 16.83 | 16.83 | 47106760.0 |
| 2015-12-30  | 52.84 | 52.94 | 52.25 | 52.30 | 8763137.0  | 17.20 | 17.24 | 17.04 | 17.05 | 35035518.0 |
+-------------+-------+-------+-------+-------+------------+-------+-------+-------+-------+------------+
The values are nested columns within each ticker (I'm not sure whether this is a special structure or some sort of MultiIndex), with Bank Ticker and Stock Info being the names of the column levels.
I need to access a given column, such as 'Close', for all institutions. I managed to get the result with the following for loop, which obtains the maximum close value for each ticker.
for x in tickers:
    print(x, bank_stocks[x, 'Close'].max())
BAC 54.9
C 60.34
GS 247.92
JPM 70.08
MS 89.3
WFM 73.0
Since I'm not an advanced user, I was wondering if there's a better way to get the result, using pandas itself.

Try:
df.loc[:, (slice(None), 'Close')].droplevel(level=1, axis=1).max()
or
df.loc(axis=1)[:, 'Close'].droplevel(level=1, axis=1).max()
or
df.loc[:, pd.IndexSlice[:, 'Close']].droplevel(level=1, axis=1).max()
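If the column levels are named 'Bank Ticker' and 'Stock Info' as in the question, a cross-section is another way to write the same selection (just a sketch, with df being the MultiIndex frame):
# take every second-level 'Close' column, then the column-wise maximum
df.xs('Close', axis=1, level='Stock Info').max()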

Related

How to apply an aggregate function to all columns of a pivot table in Pandas

A pivot table is counting the monthly occurrences of a phenomenon. Here's the simplified sample data followed by the pivot:
+--------+------------+------------+
| ad_id | entreprise | date |
+--------+------------+------------+
| 172788 | A | 2020-01-28 |
| 172931 | A | 2020-01-26 |
| 172793 | B | 2020-01-26 |
| 172768 | C | 2020-01-19 |
| 173219 | C | 2020-01-14 |
| 173213 | D | 2020-01-13 |
+--------+------------+------------+
My pivot_table code is the following:
my_pivot_table = pd.pivot_table(df[(df['date'] >= some_date) & (df['date'] <= some_other_date)],
                                values=['ad_id'], index=['entreprise'],
                                columns=['year', 'month'], aggfunc=['count'])
The resulting table looks like this:
+-------------+---------+----------+-----+----------+
|             |                2018                  |
+-------------+---------+----------+-----+----------+
| entreprise  | january | february | ... | december |
+-------------+---------+----------+-----+----------+
| A           | 12      | 10       | ... | 8        |
| B           | 24      | 12       | ... | 3        |
| ...         | ...     | ...      | ... | ...      |
| D           | 31      | 18       | ... | 24       |
+-------------+---------+----------+-----+----------+
Now, I would like to add a column that gives me the monthly average, and perform other operations such as comparing last month's count to the monthly average of, say, the last 12 months...
I tried to fiddle with the aggfunc parameter of the pivot_table, as well as trying to add an average column to the original dataframe, but without success.
Thanks in advance!
Because you get a MultiIndex table after pivot_table, you can use:
df1 = df.mean(axis=1, level=0)
df1.columns = pd.MultiIndex.from_product([df1.columns, ['mean']])
Or:
df2 = df.mean(axis=1, level=1)
df2.columns = pd.MultiIndex.from_product([['all_years'], df2.columns])
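As a further sketch (assuming the pivot table is built as my_pivot_table above and its columns are in chronological order), the monthly average and the comparison the question asks about can also be computed directly on the pivot:
monthly_avg = my_pivot_table.mean(axis=1)  # average monthly count per entreprise
last_month = my_pivot_table.iloc[:, -1]    # most recent (year, month) column
vs_average = last_month - monthly_avg      # how last month compares to the average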

how to iterate over Pandas data frame and update based on previous rows

I have some code which I got to work but it's rather slow. I need to update a table of trades and quotes. The base table is like this:
+--------+-----------+----------+----------+--------+----------+
| Symbol | Timestamp | BidPrice | AskPrice | Price | Quantity |
+--------+-----------+----------+----------+--------+----------+
| MSFT | 9:00 | | | 46.98 | 140 |
| MSFT | 9:01 | | | 46.99 | 100 |
| MSFT | 9:02 | | | 47 | 400 |
| MSFT | 9:03 | | | 47 | 100 |
| MSFT | 9:04 | 46.87 | 46.99 | | |
| MSFT | 9:05 | | | 46.89 | 100 |
| MSFT | 9:06 | | | 46.95 | 600 |
| MSFT | 9:07 | 46.91 | 46.99 | | |
| MSFT | 9:08 | 46.91 | 46.97 | | |
| MSFT | 9:09 | | | 46.935 | 100 |
| MSFT | 9:10 | 46.89 | 46.96 | | |
| MSFT | 9:11 | | | 46.93 | 100 |
| MSFT | 9:12 | | | 46.91 | 100 |
+--------+-----------+----------+----------+--------+----------+
I need to set the bid and ask for each trade (there is a Price but no bid/ask). So starting with bid = 46.8 and ask = 47, set those values, and when the quotes change, use the new values. Like this:
+--------+-----------+----------+----------+--------+----------+
| Symbol | Timestamp | BidPrice | AskPrice | Price | Quantity |
+--------+-----------+----------+----------+--------+----------+
| MSFT | 9:00 | 46.8 | 47 | 46.98 | 140 |
| MSFT | 9:01 | 46.8 | 47 | 46.99 | 100 |
| MSFT | 9:02 | 46.8 | 47 | 47 | 400 |
| MSFT | 9:03 | 46.8 | 47 | 47 | 100 |
| MSFT | 9:04 | 46.87 | 46.99 | | |
| MSFT | 9:05 | 46.87 | 46.99 | 46.89 | 100 |
| MSFT | 9:06 | 46.87 | 46.99 | 46.95 | 600 |
| MSFT | 9:07 | 46.91 | 46.99 | | |
| MSFT | 9:08 | 46.91 | 46.97 | | |
| MSFT | 9:09 | 46.91 | 46.97 | 46.935 | 100 |
| MSFT | 9:10 | 46.89 | 46.96 | | |
| MSFT | 9:11 | 46.89 | 46.96 | 46.93 | 100 |
| MSFT | 9:12 | 46.89 | 46.96 | 46.91 | 100 |
+--------+-----------+----------+----------+--------+----------+
I worked this out iterating over rows, but for 112k rows, it takes 35 seconds.
bid, ask = 46.8, 47.0                      # starting quote from the example
for i, row in qts_trd.iterrows():
    if np.isnan(row['Price']):             # quote row: remember the latest bid/ask
        bid = row['BidPrice']
        ask = row['AskPrice']
    if np.isnan(row['BidPrice']):          # trade row: fill in the remembered quote
        qts_trd.at[i, 'BidPrice'] = bid
        qts_trd.at[i, 'AskPrice'] = ask
I know the basics of lambda functions and how to apply one to every row, and I think that would be quicker, but as you can see the value being filled changes from row to row. Is there a more efficient/quicker way to do this?
This is Python 3.7 in Spyder.
Try pandas' fillna() function with method='ffill'.
So:
qts_trd.BidPrice.fillna(method='ffill', inplace=True)
qts_trd.AskPrice.fillna(method='ffill', inplace=True)
In my experience it's very quick.
Edit:
I just realised this won't fill your first values; the code below inserts a row at the top to fill from and then deletes it.
qts_trd.loc[-1] = ['', '', 46.8, 47, np.nan, np.nan]  # seed row holding the starting quote
qts_trd.index += 1
qts_trd.sort_index(inplace=True)
qts_trd.BidPrice.fillna(method='ffill', inplace=True)
qts_trd.AskPrice.fillna(method='ffill', inplace=True)
qts_trd.drop(index=0, inplace=True)                   # remove the seed row again
qts_trd.reset_index(drop=True, inplace=True)
Edit 2.0, thanks to #no_body's comment:
qts_trd.BidPrice = qts_trd.BidPrice.fillna(method='ffill').fillna(46.8)
qts_trd.AskPrice = qts_trd.AskPrice.fillna(method='ffill').fillna(47)
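The same idea can also be written in one step for both columns; this is only a sketch, assuming the starting quote of 46.8/47 from the question:
# forward-fill quotes onto the trade rows, then backstop the rows before the first quote
qts_trd[['BidPrice', 'AskPrice']] = (
    qts_trd[['BidPrice', 'AskPrice']]
    .ffill()
    .fillna({'BidPrice': 46.8, 'AskPrice': 47.0})
)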

copy dataframe to postgres table with column that has default value

I have the following PostgreSQL table stock; its structure is shown below, and the column insert_time has a default value of now().
+-------------+-----+-----------+
| column      | pk  | type      |
+-------------+-----+-----------+
| id          | yes | int       |
| type        | yes | enum      |
| c_date      |     | date      |
| qty         |     | int       |
| insert_time |     | timestamp |
+-------------+-----+-----------+
I was trying to copy the following df:
+-----+------+------------+------+
| id  | type | date       | qty  |
+-----+------+------------+------+
| 001 | CB04 | 2015-01-01 | 700  |
| 155 | AB01 | 2015-01-01 | 500  |
| 300 | AB01 | 2015-01-01 | 1500 |
+-----+------+------------+------+
I was using psycopg to upload the df to the table stock
cur.copy_from(df, stock, null='', sep=',')
conn.commit()
I am getting this error:
DataError: missing data for column "insert_time"
CONTEXT: COPY stock, line 1: "001,CB04,2015-01-01,700"
I was expecting that, with the psycopg copy_from function, my PostgreSQL table would auto-populate the insert_time column alongside the copied rows:
+-----+------+------------+------+---------------------+
| id  | type | date       | qty  | insert_time         |
+-----+------+------------+------+---------------------+
| 001 | CB04 | 2015-01-01 | 700  | 2018-07-25 12:00:00 |
| 155 | AB01 | 2015-01-01 | 500  | 2018-07-25 12:00:00 |
| 300 | AB01 | 2015-01-01 | 1500 | 2018-07-25 12:00:00 |
+-----+------+------------+------+---------------------+
You can specify columns like this:
cur.copy_from(df, stock, null='', sep=',', columns=('id', 'type', 'c_date', 'qty'))
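Note that copy_from reads from a file-like object rather than a DataFrame, so the frame usually has to be streamed through an in-memory buffer first. A minimal sketch, assuming the target table is named stock and the DataFrame columns are ordered id, type, date, qty:
import io

buf = io.StringIO()
df.to_csv(buf, index=False, header=False)  # serialize the frame as header-less CSV
buf.seek(0)

cur.copy_from(buf, 'stock', sep=',', null='',
              columns=('id', 'type', 'c_date', 'qty'))  # insert_time falls back to its default
conn.commit()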

Select certain row values and make them columns in pandas

I have a dataset that looks like the below:
+-------------------------+-------------+------+--------+-------------+--------+
|                         | impressions | name | shares | video_views | diff   |
+-------------------------+-------------+------+--------+-------------+--------+
| _ts                     |             |      |        |             |        |
| 2016-09-12 23:15:04.120 | 1           | Vidz | 7      | 10318       | 15mins |
| 2016-09-12 23:16:45.869 | 2           | Vidz | 7      | 10318       | 16mins |
| 2016-09-12 23:30:03.129 | 3           | Vidz | 18     | 29291       | 30mins |
| 2016-09-12 23:32:08.317 | 4           | Vidz | 18     | 29291       | 32mins |
+-------------------------+-------------+------+--------+-------------+--------+
I am trying to build a dataframe to feed to a regression model, and I'd like to parse out specific rows as features. To do this I would like the dataframe to resemble this
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
| | name | 15min_shares | 15min_impressions | 15min_video_views | 30min_shares | 30min_impressions | 30min_video_views |
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
| _ts | | | | | | | |
| 2016-09-12 23:15:04.120 | Vidz | 7 | 1 | 10318 | 18 | 3 | 29291 |
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
What would be the best way to do this? I think this would be easier if I were only trying to select one row (15mins): just parse out the unneeded rows and pivot.
However, I need both 15min and 30min features and am unsure how to proceed given the need for these columns.
You could take subsets of your DF for the 15mins and 30mins rows and concatenate them, backfilling the NaN values of the first row (15mins) with those of its next row (30mins) and then dropping that next row (30mins), as shown:
prefix_15 = "15mins"
prefix_30 = "30mins"
fifteen_mins = (df['diff'] == prefix_15)
thirty_mins = (df['diff'] == prefix_30)
df = df[fifteen_mins | thirty_mins].drop(['diff'], axis=1)
df_ = pd.concat([df[fifteen_mins].add_prefix(prefix_15 + '_'),
                 df[thirty_mins].add_prefix(prefix_30 + '_')], axis=1) \
        .fillna(method='bfill').dropna(how='any')
del df_['30mins_name']
df_.rename(columns={'15mins_name': 'name'}, inplace=True)
df_
Stacking to pivot and collapsing your columns:
df1 = df.set_index('diff', append=True).stack().unstack(0).T
df1.columns = df1.columns.map('_'.join)
To see just the first row:
df1.iloc[[0]].dropna(axis=1)
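Another option, sketched here under the assumption that only the 15mins and 30mins rows are wanted and that each (name, diff) pair occurs once, is to reshape with pivot_table and then flatten the resulting column MultiIndex:
wanted = df[df['diff'].isin(['15mins', '30mins'])]
wide = wanted.pivot_table(index='name', columns='diff',
                          values=['shares', 'impressions', 'video_views'],
                          aggfunc='first')
wide.columns = ['{}_{}'.format(d, col) for col, d in wide.columns]  # e.g. 15mins_shares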

Resampling Pandas time series data with keeping only valid numbers for each row

I have a dataframe which has a list of web pages with summed hourly traffic by unix hour.
Pivoted, it looks like this:
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| unix hour | 394533 | 394534 | 394535 | 394536 | 394537 | 394538 | 394539 | 394540 | 394541 | 394542 | 394543 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| page | | | | | | | | | | | |
| 3530765 | 5791 | 6017 | 5302 | | | | | | | | |
| 3563667 | | | | 3481 | 2840 | 2421 | | | | | |
| 3579922 | | | | | | | 1816 | 1947 | 1878 | 2013 | 1718 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
Instead of having the columns be the actual unix hours, I would like to left-align each row so that it looks like this:
+---------+------+------+------+------+------+
| hour | 1 | 2 | 3 | 4 | 5 |
+---------+------+------+------+------+------+
| page | | | | | |
| 3530765 | 5791 | 6017 | 5302 | | |
| 3563667 | 3481 | 2840 | 2421 | | |
| 3579922 | 1816 | 1947 | 1878 | 2013 | 1718 |
+---------+------+------+------+------+------+
What would be the best way to do this in pandas?
*Note - I realize that hours as columns isn't ideal, but my full data set has 7k pages over a span of only 72 hours, so to me, pages as the index and hours as the columns makes the most sense.
Assuming the data is stored as float:
In [191]:
print df.dtypes
394533 float64
394534 float64
394535 float64
394536 float64
394537 float64
394538 float64
394539 float64
394540 float64
394541 float64
394542 float64
394543 float64
dtype: object
We will just do:
In [192]:
print df.apply(lambda x: pd.Series(data=x[np.isfinite(x)].values), 1)
0 1 2 3 4
page
3530765 5791 6017 5302 NaN NaN
3563667 3481 2840 2421 NaN NaN
3579922 1816 1947 1878 2013 1718
The idea is to get the valid numbers of each row and put them into a Series, but without the original unix times as the index. The index therefore becomes 0, 1, 2, ...; if you must, you can easily turn it into 1, 2, 3, ... with df2.columns = df2.columns + 1, assuming the result is assigned to df2.
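An equivalent sketch with current pandas (assuming the result is assigned to df2 and the hours should be labelled from 1) drops the NaNs per row instead of testing for finite values:
# keep each row's valid counts, left-aligned, then relabel the columns as hours 1, 2, 3, ...
df2 = df.apply(lambda row: pd.Series(row.dropna().values), axis=1)
df2.columns = df2.columns + 1
df2.columns.name = 'hour'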
