Retrieve next row in pandas dataframe / multiple list comprehension outputs - python

I have a pandas dataframe, wt, with a datetime index and three columns, as well as a dataframe t with the same datetime index and three other columns, shown below:
wt
date 0 1 2
2004-11-19 0.2 0.3 0.5
2004-11-22 0.0 0.0 0.0
2004-11-23 0.0 0.0 0.0
2004-11-24 0.0 0.0 0.0
2004-11-26 0.0 0.0 0.0
2004-11-29 0.0 0.0 0.0
2004-11-30 0.0 0.0 0.0
t
date GLD SPY TLT
2004-11-19 0.009013068949977443 -0.011116725618999457 -0.007980218051028332
2004-11-22 0.0037963376507370583 0.004769204564810003 0.005211874008610895
2004-11-23 -0.00444938820912133 0.0015256823190370472 0.0012398557258792575
2004-11-24 0.006703910614525022 0.0023696682464455776 0.0
2004-11-26 0.005327413984461682 -0.0007598784194529085 -0.00652932567826181
2004-11-29 0.002428792227864962 -0.004562737642585524 -0.010651558073654366
2004-11-30 -0.006167400881057272 0.0006790595025889523 -0.004237773450922022
2004-12-01 0.005762411347517871 0.011366528119433505 -0.0015527950310557648
I'm currently using the pandas iterrows method to run through each row for processing, and as a first step I check whether the row entries are non-zero, as below:
for dt, row in t.iterrows():
    if sum(wt.loc[dt]) <= 0:
        ...
Based on this, I'd like to assign values to dataframe wt if non-zero values don't currently exist. How can I retrieve the next row for a given dt entry (e.g. '11/22/2004' for dt = '11/19/2004')?
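For reference, here is a minimal sketch (not from the original post) of two common ways to step to the row after a given dt, assuming wt has a unique, sorted DatetimeIndex:

# Positional lookup: find dt's position in the index, then step one row forward.
pos = wt.index.get_loc(dt)
if pos + 1 < len(wt.index):
    next_dt = wt.index[pos + 1]
    next_row = wt.loc[next_dt]

# Shifted view: wt.shift(-1) aligns each date with the values of the following row,
# so this returns the next row's values labelled with dt itself.
next_row_values = wt.shift(-1).loc[dt]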
Part 2
As an addendum, I'm setting this up using a for loop for testing but would like to use list comprehension once complete. Processing will return the wt dataframe described above, as well as an intermediate, secondary dataframe again with datetime index and a single column (sample below):
r
date r
2004-11-19 0.030202
2004-11-22 -0.01047
2004-11-23 0.002456
2004-11-24 -0.01274
2004-11-26 0.00928
Is there a way to use list comprehensions to return both the wt dataframe above and this r dataframe without simply creating two separate comprehensions?
Edit
I was able to get the desired results by changing my approach, so I'm adding it here for clarification (the referenced dataframes are as described above). I wonder if there's any way to apply list comprehensions to this.
import numpy as np
import pandas as pd

r = pd.DataFrame(columns=['ret'], index=wt.index.copy())
dts = wt.reset_index().date
for i, dt in enumerate(dts):
    row = t.loc[dt]
    dt_1 = dts.shift(-1).iloc[i]
    try:
        wt.loc[dt_1] = ((wt.loc[dt].tolist() * (1 + row)).transpose()
                        / np.dot(wt.loc[dt].tolist(), (1 + row))).tolist()
        r.loc[dt] = np.dot(wt.loc[dt], row)
    except Exception:
        print(f'Error calculating for date {dt}')
        continue
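As an aside (not part of the original post): the wt update above has a sequential dependence, since each row is built from the previous one, so it does not collapse naturally into a list comprehension. The r column, however, only reads wt row by row, so once wt is filled it could be recomputed in a single vectorized pass, roughly as below (column labels are ignored, as np.dot does; this assumes wt.index is a subset of t.index):

# Row-wise dot product of wt and the matching rows of t.
r['ret'] = (wt.to_numpy() * t.loc[wt.index].to_numpy()).sum(axis=1)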

Related

Count All Occurrences of a Specific Value in a Dask Dataframe

I have a dask dataframe with thousands of columns and rows as follows:
pprint(daskdf.head())
grid lat lon ... 2014-12-29 2014-12-30 2014-12-31
0 0 48.125 -124.625 ... 0.0 0.0 -17.034216
1 0 48.625 -124.625 ... 0.0 0.0 -19.904214
4 0 42.375 -124.375 ... 0.0 0.0 -8.380443
5 0 42.625 -124.375 ... 0.0 0.0 -8.796803
6 0 42.875 -124.375 ... 0.0 0.0 -7.683688
I want to count all occurrences in the entire dataframe where a certain value appears. In pandas, this can be done as follows:
pddf[pddf==500].count().sum()
I'm aware that you can't translate all pandas functions/syntax with dask, but how would I do this with a dask dataframe? I tried doing:
daskdf[daskdf==500].count().sum().compute()
but this yielded a "Not Implemented" error.
As in many cases where a row-wise pandas method is not yet explicitly implemented in dask, you can use map_partitions. In this case this might look like:
daskdf.map_partitions(lambda df: df[df == 500].count()).sum().compute()
You can experiment with whether also doing a .sum() within the lambda helps (it would produce smaller intermediaries) and with what the meta= argument to map_partitions should look like.
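For example, a hedged sketch of the variant the answer mentions, with the reduction pushed inside the lambda (per-column counts of the value 500 within each partition, summed at the end); the meta tuple just declares that each partition returns an int64 Series:

# Each partition returns a small Series of per-column counts; .sum() then
# collapses everything into a single scalar on compute.
counts = daskdf.map_partitions(lambda part: (part == 500).sum(), meta=(None, 'int64'))
total = counts.sum().compute()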

Finding the indexes of the global min in pandas

Suppose you have numerical data for some function z = f(x, y) saved in a pandas dataframe, where the x values are the index, the y values are the columns, and the dataframe is populated with the z data. For example:
0.0 0.1 0.2 0.3 0.4 0.5 0.6
1.0 0.0 -0.002961 -0.005921 -0.008883 -0.011845 -0.014808 -0.017772
1.1 0.0 -0.002592 -0.005184 -0.007777 -0.010371 -0.012966 -0.015563
1.2 0.0 -0.002084 -0.004168 -0.006253 -0.008340 -0.010428 -0.012517
Is there a simple pandas command, or maybe a one-liner chaining a few simple commands, that returns the (x, y) values corresponding to some attribute of the data, in my case min(z)? In the example data I'd be looking for (1.0, 0.6).
I'm really just hoping there's an answer that doesn't involve parsing out the data into some other structure, because sure, just linearize the data in a numpy array and correlate the numpy array index with (x,y). But if there's something cleaner/more elegant that I simply am not finding, I'd love to learn about it.
Using pandas.DataFrame.idxmin & pandas.Series.idxmin
import pandas as pd

# df view
#      0.0       0.1       0.2       0.3       0.4       0.5       0.6
# 1.0  0.0 -0.002961 -0.005921 -0.008883 -0.011845 -0.014808 -0.017772
# 1.1  0.0 -0.002592 -0.005184 -0.007777 -0.010371 -0.012966 -0.015563
# 1.2  0.0 -0.002084 -0.004168 -0.006253 -0.008340 -0.010428 -0.012517

# min column name
min_col_name = df.min().idxmin()
# min column index, if needed
min_col_idx = df.columns.get_loc(min_col_name)
# min row index
min_row_idx = df[min_col_name].idxmin()
Another option:
(df.min(axis=1).idxmin(), df.min().idxmin())
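For completeness, a small runnable sketch on the example data; df.stack().idxmin() is yet another one-liner, returning the (x, y) pair directly:

import pandas as pd

# Rebuild the example grid: index = x values, columns = y values, data = z.
df = pd.DataFrame(
    [[0.0, -0.002961, -0.005921, -0.008883, -0.011845, -0.014808, -0.017772],
     [0.0, -0.002592, -0.005184, -0.007777, -0.010371, -0.012966, -0.015563],
     [0.0, -0.002084, -0.004168, -0.006253, -0.008340, -0.010428, -0.012517]],
    index=[1.0, 1.1, 1.2],
    columns=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

print(df.min(axis=1).idxmin(), df.min().idxmin())  # 1.0 0.6
print(df.stack().idxmin())                         # (1.0, 0.6)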

Getting NaNs instead of the correct values inside a dataframe column

I created a dataframe of zeros using this syntax:
ltv = pd.DataFrame(data=np.zeros([actual_df.shape[0], 6]),
                   columns=['customer_id',
                            'actual_total',
                            'predicted_num_purchases',
                            'predicted_value',
                            'predicted_total',
                            'error'], dtype=np.float32)
It comes out perfectly, as expected:
customer_id | actual_total | predicted_num_purchases | predicted_value | predicted_total | error
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
When I run this syntax:
ltv['customer_id'] = actual_df['customer_id']
I get all NaNs in ltv['customer_id']. What is causing this and how can I prevent it from happening?
NB: I also checked actual_df and there are no NaNs inside of it.
You need the same index values in both DataFrames (and also the same length).
So the first solution is to create a default RangeIndex in actual_df; in ltv the index is not specified, so a default one is created:
actual_df = actual_df.reset_index(drop=True)
ltv['customer_id'] = actual_df['customer_id']
Or add an index parameter to the DataFrame constructor:
ltv = pd.DataFrame(data=np.zeros([actual_df.shape[0], 6]),
                   columns=['customer_id',
                            'actual_total',
                            'predicted_num_purchases',
                            'predicted_value',
                            'predicted_total',
                            'error'], dtype=np.float32,
                   index=actual_df.index)
ltv['customer_id'] = actual_df['customer_id']
Another option (more complicated than jezrael's great answer) is using pd.concat() followed by .drop():
ltv = pd.concat([ltv.drop(columns=['customer_id']), actual_df[['customer_id']]], axis=1, ignore_index=True)
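To see why the NaNs appear in the first place, here is a minimal sketch with a made-up actual_df whose index is not the default RangeIndex: column assignment in pandas aligns on index labels, so labels that don't match produce NaN.

import numpy as np
import pandas as pd

actual_df = pd.DataFrame({'customer_id': [101, 102, 103]}, index=[10, 20, 30])
ltv = pd.DataFrame(np.zeros([actual_df.shape[0], 6]),
                   columns=['customer_id', 'actual_total',
                            'predicted_num_purchases', 'predicted_value',
                            'predicted_total', 'error'], dtype=np.float32)

# ltv's index is 0, 1, 2 while actual_df's is 10, 20, 30, so nothing aligns:
ltv['customer_id'] = actual_df['customer_id']              # all NaN
# Bypassing alignment (or applying either fix above) restores the values:
ltv['customer_id'] = actual_df['customer_id'].to_numpy()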

Having difficulty getting multiple columns in HDF5 Table Data

I am new to hdf5 and was trying to store a DataFrame row in HDF5 format. I want to append a row at different locations within the file; however, every time I append, the row shows up as an array in a single column rather than as single values in multiple columns.
I have tried both h5py and pandas and it seems like pandas is the better option for appending. Additionally, I have really been trying a lot of different methods. Truly, any help would be greatly appreciated.
Here is my code sending the same row multiple times into the HDF5 file:
import pandas as pd
import numpy as np

data = np.zeros((1, 48), dtype=float)
columnName = ['Hello' + str(y) for (x, y), item in np.ndenumerate(data)]
df = pd.DataFrame(data=data, columns=columnName)
file = pd.HDFStore('file.hdf5', mode='a', complevel=9, complib='blosc')
for x in range(0, 11):
    file.put('/data', df, column_data=columnName, append=True, format='table')
In [243]: store = pd.HDFStore('test.h5')
This seems to work fine:
In [247]: store.put('foo',df,append=True,format='table')
In [248]: store.put('foo',df,append=True,format='table')
In [249]: store.put('foo',df,append=True,format='table')
In [250]: store['foo']
Out[250]:
Hello0 Hello1 Hello2 Hello3 Hello4 ... Hello43 Hello44 Hello45 Hello46 Hello47
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
[3 rows x 48 columns]
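If it helps, a hedged sketch of the same pattern applied to the original snippet, using HDFStore.append (which always writes table format and is equivalent to put(..., append=True, format='table')) and then reading the rows back to confirm they land in separate columns; the file name and loop count are simply taken from the question:

import numpy as np
import pandas as pd

data = np.zeros((1, 48), dtype=float)
columnName = ['Hello' + str(y) for (x, y), item in np.ndenumerate(data)]
df = pd.DataFrame(data=data, columns=columnName)

with pd.HDFStore('file.hdf5', mode='a', complevel=9, complib='blosc') as store:
    for _ in range(11):
        store.append('data', df)   # one appended row per call

print(pd.read_hdf('file.hdf5', 'data').shape)   # expect (11, 48)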

Maximum Monthly Values whilst retaining the Date on which those Values occurred

I have daily rainfall data that looks like the following:
Date Rainfall (mm)
1922-01-01 0.0
1922-01-02 0.0
1922-01-03 0.0
1922-01-04 0.0
1922-01-05 31.5
1922-01-06 0.0
1922-01-07 0.0
1922-01-08 0.0
1922-01-09 0.0
1922-01-10 0.0
1922-01-11 0.0
1922-01-12 9.1
1922-01-13 6.4
.
.
.
I am trying to work out the maximum value for each month for each year, and also what date the maximum value occurred on. I have been using the code:
rain_data.groupby(pd.Grouper(freq = 'M'))['Rainfall (mm)'].max()
This returns the correct maximum values but labels them with the end date of each month rather than the date on which the maximum event occurred.
1974-11-30 0.0
1974-12-31 0.0
1975-01-31 0.0
1975-02-28 65.0
1975-03-31 129.5
1975-11-30 59.9
1975-12-31 7.1
1976-01-31 10.0
1976-11-30 0.0
1976-12-31 0.0
1977-01-31 4.3
Any suggestions on how I could get the correct date?
I'm new to this, but I think pd.Grouper(freq='M') groups all the values in each month and then labels every group with the same month-end date, which is why your groupby isn't returning the dates you're looking for.
I think your question is answered here. Alexander suggests using:
df.groupby(pd.TimeGrouper('M')).Close.agg({'max date': 'idxmax', 'max rainfall': np.max})
The agg works without the .Close I think, so if it's problematic (as I found) you might want to take it out.
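In newer pandas versions pd.TimeGrouper has been removed, so here is a hedged sketch of the same idea using pd.Grouper and a list aggregation (this assumes rain_data has a DatetimeIndex and the 'Rainfall (mm)' column shown above):

# 'idxmax' returns the date of each month's maximum, 'max' the value itself.
monthly = (rain_data.groupby(pd.Grouper(freq='M'))['Rainfall (mm)']
                    .agg(['idxmax', 'max'])
                    .rename(columns={'idxmax': 'max date', 'max': 'max rainfall'}))
print(monthly.head())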
