I am sure this question pops up often. However, I did not manage to derive an answer for this task after going through similar questions.
I have a dataframe of returns of multiple stocks and would need to run solely univariate regressions to derive rolling beta values.
The ols.PandasRollingOLS by Brad Solomon is very convenient because it handles the rolling window. However, I did not manage to iterate the function over all the stock return columns.
I want this function to loop/iterate/go through all the columns of different stock returns.
In the following I am using code from the project description at https://pypi.org/project/pyfinance/ , as it should illustrate my issue more clearly than code from my own project.
import numpy as np
import pandas as pd
from pyfinance import ols
from pandas_datareader import DataReader
syms = {
    'TWEXBMTH': 'usd',
    'T10Y2YM': 'term_spread',
    'PCOPPUSDM': 'copper'
}
data = DataReader(syms.keys(), data_source='fred',
                  start='2000-01-01', end='2016-12-31')\
    .pct_change()\
    .dropna()\
    .rename(columns=syms)
y = data.pop('usd')
rolling = ols.PandasRollingOLS(y=y, x=data, window=12)
rolling.beta.head()
#DATE term_spread copper
#2001-01-01 0.000093 0.055448
#2001-02-01 0.000477 0.062622
#2001-03-01 0.001468 0.035703
#2001-04-01 0.001610 0.029522
#2001-05-01 0.001584 -0.044956
Instead of a multivariate regression I would like the function to take each column separately and iterate through the entire dataframe.
As my dataframe is over 50 columns wide, I would like to avoid writing the call that many times, as in the following (which does, however, yield the expected results):
rolling = ols.PandasRollingOLS(y=y, x=data[["term_spread"]], window=12).beta
rolling["copper"]= ols.PandasRollingOLS(y=y, x=data[["copper"]], window=12).beta["copper"]
# term_spread copper
#DATE
#2001-01-01 0.000258 0.055856
#2001-02-01 0.000611 0.064094
#2001-03-01 0.001700 0.047485
#2001-04-01 0.001778 0.040413
#2001-05-01 0.001353 -0.032264
#... ... ...
#2016-08-01 -0.100839 -0.176078
#2016-09-01 -0.058668 -0.189192
#2016-10-01 -0.014763 -0.181441
#2016-11-01 0.046531 0.016082
#2016-12-01 0.060192 0.035062
My attempts often ended with "SyntaxError: positional argument follows keyword argument", as with the following:
def beta(r):
    z = ols.PandasRollingOLS(y=y, x=r, window=12)
    return z
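Conceptually, I am after something like the following sketch, which loops over the columns and collects each univariate beta into one DataFrame (assuming, as in the two-column example above, that PandasRollingOLS accepts a single-column DataFrame as x):
import pandas as pd
from pyfinance import ols

# Sketch of the desired behaviour: one univariate rolling OLS per column,
# with the betas collected into a single DataFrame (y and data as defined above).
betas = pd.concat(
    {col: ols.PandasRollingOLS(y=y, x=data[[col]], window=12).beta[col]
     for col in data.columns},
    axis=1,
)
print(betas.head())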
Related
Some background: I'm taking a machine learning class on customer segmentation. My environment is Python with pandas and sklearn. I have two datasets, a general population dataset and a customer demographics dataset with 85 identical columns.
I'm calling a function I created to run preprocessing steps on the 'customers' data, steps that were previously run outside this function on the general population dataset. Within the function is a loop that replaces missing values with np.nan. Here is the loop:
# replacing missing data with NaNs
# feat_sum is a dataframe (feature_summary) of coded values
for i in range(len(feat_sum)):
    mi_unk = feat_sum.iloc[i]['missing_or_unknown']   # locate column and values
    mi_unk = mi_unk.strip('[').strip(']').split(',')  # strip the brackets, then split
    mi_unk = [int(val) if (val != '' and val != 'X' and val != 'XX') else val for val in mi_unk]
    if mi_unk != ['']:
        featsum_attrib = feat_sum.iloc[i]['attribute']
        df = df.replace({featsum_attrib: mi_unk}, np.nan)
Toward the end of the function I'm engineering new variables:
#Investigate "CAMEO_INTL_2015" and engineer two new variables.
df['WEALTH'] = df['CAMEO_INTL_2015']
df['LIFE_STAGE'] = df['CAMEO_INTL_2015']
mf_wealth_dict = {'11':1, '12':1, '13':1, '14':1, '15':1, '21':2, '22':2, '23':2, '24':2, '25':2, '31':3,'32':3, '33':3, '34':3, '35':3, '41':4, '42':4, '43':4, '44':4, '45':4, '51':5, '52':5, '53':5, '54':5, '55':5}
mf_lifestage_dict = {'11':1, '12':2, '13':3, '14':4, '15':5, '21':1, '22':2, '23':3, '24':4, '25':5, '31':1, '32':2, '33':3, '34':4, '35':5, '41':1, '42':2, '43':3, '44':4, '45':5, '51':1, '52':2, '53':3, '54':4, '55':5}
#replacing the 'WEALTH' and 'LIFE_STAGE' columns with values from the dictionaries
df['WEALTH'].replace(mf_wealth_dict, inplace=True)
df['LIFE_STAGE'].replace(mf_lifestage_dict, inplace=True)
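Note that neither dictionary has an 'XX' key, and as far as I understand pandas' replace leaves values that are not in the mapping untouched, so an 'XX' would survive this step. A minimal illustration with made-up values:
import pandas as pd

# Values absent from the mapping are left unchanged by replace().
s = pd.Series(['11', '25', 'XX'])
print(s.replace({'11': 1, '25': 5}).tolist())  # [1, 5, 'XX']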
Near the end of the project code, I'm running an imputer to replace the np.nans which ran successfully on the general population dataset(azdias):
az_imp = Imputer(strategy="most_frequent")
azdias_cleaned_imp = pd.DataFrame(az_imp.fit_transform(azdias_cleaned_encoded))
So when I call the clean_data function on the 'customers' dataframe, clean_data(customers), it gives me the ValueError: could not convert str to float: 'XX' on this line:
customers_imp = Imputer(strategy="most_frequent")
---> 19 customers_cleaned_imputed = pd.DataFrame(customers_imp.fit_transform(customers_cleaned_encoded))
In the data dictionary for the CAMEO_INTL_2015 column of the dataset, the very last category is 'XX': unknown. When I run a value count on the WEALTH and LIFE_STAGE columns, there are 124 occurrences of 'XX' across those two columns. No other columns in the dataset have the 'XX' value. Again, I did not run into this problem with the other dataset. I know this is wordy, but any help is appreciated, and I can provide the project code as well.
A mentor and I tried troubleshooting by looking at all the steps that were performed on both datasets, to no avail. I was expecting the 'XX' values to be dealt with by the loop I mentioned earlier.
I have a large df of coordinates that I'm putting through a function (a reverse geocoder).
How can I run it over the whole df without iterating? Iterating takes very long.
Example df:
Latitude Longitude
0 -25.66026 28.0914
1 -25.67923 28.10525
2 -30.68456 19.21694
3 -30.12345 22.34256
4 -15.12546 17.12365
After running through the function I want (without a for loop...) a df:
City
0 HappyPlace
1 SadPlace
2 AveragePlace
3 CoolPlace
4 BadPlace
Note: I don't need to know how to do reverse geocoding; this is a question about applying a function to a whole df without iteration.
EDIT:
using df.apply() might not work as my code looks like this:
for i in range(len(df)):
    results = g.reverse_geocode(df['LATITUDE'][i], df['LONGITUDE'][i])
    city.append(results.city)
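For reference, I imagine the apply version would look roughly like this sketch (assuming, as above, that g.reverse_geocode returns an object with a .city attribute), although it still calls the function once per row:
# Row-wise apply: no explicit for loop, but still one call per row.
df['City'] = df.apply(
    lambda row: g.reverse_geocode(row['LATITUDE'], row['LONGITUDE']).city,
    axis=1,
)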
Slower approach: iterating through the list of geo points and fetching the city for each geo point.
import pandas as pd
import time

d = {'Latitude': [-25.66026,-25.67923,-30.68456,-30.12345,-15.12546,-25.66026,-25.67923,-30.68456,-30.12345,-15.12546], 'Longitude': [28.0914, 28.10525,19.21694,22.34256,17.12365,28.0914, 28.10525,19.21694,22.34256,17.12365]}
df = pd.DataFrame(data=d)

# example method standing in for g.reverse_geocode() -> geo_reverse
def geo_reverse(lat, long):
    # assuming that your reverse_geocode will take 2 seconds
    time.sleep(2)
    print(lat, long)

for i in range(len(df)):
    results = geo_reverse(df['Latitude'][i], df['Longitude'][i])
Because of time.sleep(2), the above program will take at least 20 seconds to process all ten geo points.
Better approach than above:
import pandas as pd
import time
import threading

d = {'Latitude': [-25.66026,-25.67923,-30.68456,-30.12345,-15.12546,-25.66026,-25.67923,-30.68456,-30.12345,-15.12546], 'Longitude': [28.0914, 28.10525,19.21694,22.34256,17.12365,28.0914, 28.10525,19.21694,22.34256,17.12365]}
df = pd.DataFrame(data=d)

def runnable_method(f, args):
    result_info = [threading.Event(), None]
    def runit():
        result_info[1] = f(args)
        result_info[0].set()
    threading.Thread(target=runit).start()
    return result_info

def gather_results(result_infos):
    results = []
    for i in range(len(result_infos)):
        result_infos[i][0].wait()
        results.append(result_infos[i][1])
    return results

def geo_reverse(args):
    time.sleep(2)
    return "City Name of (" + str(args[0]) + "," + str(args[1]) + ")"

geo_points = []
for i in range(len(df)):
    tuple_i = (df['Latitude'][i], df['Longitude'][i])
    geo_points.append(tuple_i)

result_info = [runnable_method(geo_reverse, geo_point) for geo_point in geo_points]
cities_result = gather_results(result_info)
print(cities_result)
Notice that geo_reverse takes 2 seconds to fetch the data for a geo point. In this second example the code will take only about 2 seconds to process as many points as you want, because all the calls run concurrently.
Note: try both approaches assuming that your geo_reverse takes approx. 2 seconds to fetch its data. The first approach will take 20+1 seconds, and its processing time will keep growing with the number of inputs, but the second approach has an almost constant processing time (approx. 2+1 seconds) no matter how many geo points you want to process.
Assume that the g.reverse_geocode() method is geo_reverse() in the code above. Run both approaches separately and see the difference for yourself.
Explanation:
Take a look at the key part of the code above: it builds a list of tuples and then uses a list comprehension to pass each tuple to a dynamically created thread:
# Converting the df of geo points into a list of tuples
geo_points = []
for i in range(len(df)):
    tuple_i = (df['Latitude'][i], df['Longitude'][i])
    geo_points.append(tuple_i)

# List comprehension with the custom methods, creating runnable threads
result_info = [runnable_method(geo_reverse, geo_point) for geo_point in geo_points]

# gather the result from each thread
cities_result = gather_results(result_info)
print(cities_result)
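The same idea can also be written more compactly with a thread pool from the standard library; a minimal sketch, assuming geo_points and geo_reverse as defined above:
from concurrent.futures import ThreadPoolExecutor

# Run geo_reverse concurrently over all geo points using a pool of threads.
with ThreadPoolExecutor(max_workers=10) as pool:
    cities_result = list(pool.map(geo_reverse, geo_points))
print(cities_result)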
I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the number of atoms, and the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom count and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
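For clarity, with atom_nr = 3 this callable skips line numbers 0, 1, 5, 6, 10, 11, and so on; a quick check:
atom_nr = 3
skipped = [n for n in range(12) if n % (atom_nr + 2) < 2]
print(skipped)  # [0, 1, 5, 6, 10, 11]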
But it looks like the Dask dataframe does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, here is the full code I used.
from io import BytesIO

import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed

atom_nr = ...
filename = ...

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)

bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offer this kind of functionality (to my knowledge)
Long Answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (plus however much is needed to reach the next delimiter). You can control how fine you want your partitions to be with this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data
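Putting the pieces together for the follow-up question about only reading the first n frames, a sketch might look like this (assuming a load_pandas_from_bytes helper along the lines of the pandaread function in the question's Edit 2; the filename is a placeholder):
import dask
import dask.dataframe as dd
from dask.bytes import read_bytes

# read_bytes returns (sample, blocks); blocks holds one list of delayed byte
# chunks per input file. Building the dataframe from a slice of those delayed
# objects means only the selected blocks are read and parsed on compute().
sample, blocks = read_bytes('trajectory.xyz',
                            delimiter=b'3\ntimestep',  # delimiter as in Edit 2 (atom_nr = 3)
                            blocksize=2**27)
delayed_dfs = [dask.delayed(load_pandas_from_bytes)(chunk) for chunk in blocks[0]]
df = dd.from_delayed(delayed_dfs[:5]).compute()  # only the first five blocks are parsed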
I am learning Python's Pandas library using Kaggle's Titanic tutorial. I am trying to create a function which will calculate the percentage of null values in a column.
My attempt below appears to print the entire dataframe, instead of null values in the specified column:
def null_percentage_calculator(df, nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null / testtotal) * 100)
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage, df, nullcolumn))

null_percentage_calculator(train, "Age")
My previous (and very first) Stack Overflow question was about a similar problem, and it was explained to me that the .index method in pandas is undesirable and that I should use other methods like [ ] and .loc to refer to the column explicitly.
So I have tried this:
df_column_null=[df[nullcolumn]].isnull().sum()
I have also tried
df_column_null=df[nullcolumn]df[nullcolumn].isnull().sum()
I am struggling to understand this aspect of Pandas. My non-function approach works fine:
Train_Age_Nulls = train["Age"].isnull().sum()
Train_Age_Nulls_percentage = (Train_Age_Nulls/traintotal)*100
Train_Age_Nulls_percentage_rounded = np.ceil(Train_Age_Nulls_percentage)
print("{} percent of Train's Age are NaN values".format(Train_Age_Nulls_percentage_rounded))
Could anyone let me know where I am going wrong?
def null_percentage_calculator(df, nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null / testtotal) * 100)
    # what is testtotal?
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage, df, nullcolumn))
The reason your version prints the entire dataframe is that you pass df itself into format. I would do this with:
def null_percentage_calculator(df, nullcolumn):
    nulls = df[nullcolumn].isnull().sum()
    pct = float(nulls) / len(df[nullcolumn])  # need float because of Python division
    # if you must, you can * 100
    print("{} percent of column {} are null".format(pct * 100, nullcolumn))
Beware of Python 2 integer division, where 63/180 = 0: if you want a float out, you have to put a float in.
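For example, with a tiny made-up frame (not the Titanic data), the function above behaves like this:
import numpy as np
import pandas as pd

# Toy check of null_percentage_calculator: two of the four Age values are missing.
toy = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan]})
null_percentage_calculator(toy, "Age")  # prints: 50.0 percent of column Age are null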
I am having a problem manipulating my excel file in python.
I have a large excel file with data arranged by date/time.
I would like to be able to average the data for a specific time of day over all the different days, i.e. to create an average profile of the gas concentrations over one day.
Here is a sample of my excel file:
Decimal Day of year Decimal of day Gas concentration
133.6285 0.6285 46.51230
133.6493 0.6493 47.32553
133.6701 0.6701 49.88705
133.691 0.691 51.88382
133.7118 0.7118 49.524
133.7326 0.7326 50.37112
Basically I need a function, like the AVERAGEIF function in excel, that will say something like
"Average the gas_concentrations when decimal_of_day=x"
However I really have no idea how to do this. Currently I have got this far
import xlrd
import numpy as np
book= xlrd.open_workbook('TEST.xlsx')
level_1=book.sheet_by_index(0)
time_1=level_1.col_values(0, start_rowx=1, end_rowx=1088)
dectime_1=level_1.col_values(8, start_rowx=1, end_rowx=1088)
ozone_1=level_1.col_values(2, start_rowx=1, end_rowx=1088)
ozone_1 = [float(i) if i != 'NA' else 'NaN' for i in ozone_1]
Edit
I updated my script to include the following
ozone=np.array(ozone_1, float)
time=np.array(dectime_1)
a=np.column_stack((ozone, time))
b=np.where((a[:,0]<0.0035))
print b
EDIT
For now I have solved the problem by putting both variables into an array, then making a smaller array with just the values I need to average. A bit inefficient, but it works!
ozone=np.array(ozone_1, float)
time=np.array(dectime_1)
a=np.column_stack((ozone, time))
b=a[a[:,1]<0.0036]
c=np.nanmean(b[:,0])
You can use a numpy masked array.
import numpy as np
data_1 = np.ma.arange(10)
data_1 = np.ma.masked_where(<your condition>, data_1)
data_1_mean = np.mean(data_1)
Hope that helps
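Applied to the sample values from the question, a sketch might look like this: mask everything outside the time-of-day bin of interest and average what is left.
import numpy as np

# Gas concentrations and decimal-of-day values from the question's sample rows.
gas = np.array([46.51230, 47.32553, 49.88705, 51.88382, 49.524, 50.37112])
dectime = np.array([0.6285, 0.6493, 0.6701, 0.691, 0.7118, 0.7326])

# Mask every reading whose time of day is outside the bin, then average the rest.
masked = np.ma.masked_where(dectime >= 0.65, gas)
print(masked.mean())  # 46.918915, the mean of the two readings before 0.65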