I have a .fits file with some data, which I have manipulated, and I would like to store the new data (not the entire .fits file) as a pd.DataFrame. The data comes from a file called pabdatazcut.fits.
#Sorted by descending Paschen Beta flux
sortedpab = sorted(pabdatazcut[1].data , key = lambda data: data['PAB_FLUX'] , reverse = True )
unsorteddf = pd.DataFrame(pabdatazcut[1].data)
sortedpabdf = pd.DataFrame({'FIELD' : sortedpab['FIELD'],
                            'ID' : sortedpab['ID'],
                            'Z_50' : sortedpab['Z_50'],
                            'Z_ERR' : ((sortedpab['Z_84'] - sortedpab['Z_50']) + (sortedpab['Z_50'] - sortedpab['Z_16'])) / (2 * sortedpab['Z_50']),
                            '$\lambda Pa\beta$' : 12820 * (1 + sortedpab['Z_50']),
                            '$Pa\beta$ FLUX' : sortedpab['PAB_FLUX'],
                            '$Pa\beta$ FLUX ERR' : sortedpab['PAB_FLUX_ERR']})
I receive the error TypeError: list indices must be integers or slices, not str when I try to run this.
You get this because of accesses like sortedpab['ID']. According to the docs, sorted returns a sorted list, and lists do not accept strings as keys; they can only be accessed by integer positions or slices. That's what the error is trying to tell you.
Unfortunately I can't test this on my machine, because I don't have your data, but I guess, what you really want to do is something like this:
data_dict = dict()
for obj in sortedpab:
    # Z_16 is needed below for the Z_ERR calculation; Z_ERR itself is computed later
    for key in ['FIELD', 'ID', 'Z_16', 'Z_50', 'Z_84', 'PAB_FLUX', 'PAB_FLUX_ERR']:
        data_dict.setdefault(key, list()).append(obj[key])
sortedpabdf = pd.DataFrame(data_dict)
# maybe you don't even need to create the data_dict but
# can pass the sortedpad directly to your data frame
# have you tried that already?
#
# then I would calculate the columns which are not just copied
# in the dataframe directly, as this is more convenient
# like this:
sortedpabdf['Z_ERR'] = ((sortedpabdf['Z_84'] - sortedpabdf['Z_50']) + (sortedpabdf['Z_50'] - sortedpabdf['Z_16'])) / (2 * sortedpabdf['Z_50'])
sortedpabdf[r'$\lambda Pa\beta$'] = 12820 * (1 + sortedpabdf['Z_50'])  # no trailing comma here, and raw strings keep \l and \b from being escape sequences
sortedpabdf.rename({
    'PAB_FLUX': r'$Pa\beta$ FLUX',
    'PAB_FLUX_ERR': r'$Pa\beta$ FLUX ERR'
}, axis='columns', inplace=True)
cols_to_delete = [col for col in sortedpabdf.columns if col not in ['FIELD', 'ID', 'Z_50', 'Z_ERR', r'$\lambda Pa\beta$', r'$Pa\beta$ FLUX', r'$Pa\beta$ FLUX ERR']]
sortedpabdf.drop(cols_to_delete, axis='columns', inplace=True)
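For what it's worth, pandas can often ingest the FITS record array directly, so the whole sort can happen inside the DataFrame with sort_values. A minimal sketch, with made-up numbers standing in for pabdatazcut[1].data (on some NumPy versions, big-endian FITS columns need a byte swap before pd.DataFrame accepts them):

```python
import pandas as pd

# Made-up stand-in for pd.DataFrame(pabdatazcut[1].data)
df = pd.DataFrame({
    'FIELD': ['N', 'S', 'N'],
    'ID': [3, 1, 2],
    'Z_16': [0.9, 1.9, 2.9],
    'Z_50': [1.0, 2.0, 3.0],
    'Z_84': [1.1, 2.1, 3.1],
    'PAB_FLUX': [5.0, 20.0, 10.0],
    'PAB_FLUX_ERR': [0.5, 2.0, 1.0],
})

# Sort by descending Paschen-beta flux, then derive the extra columns
sortedpabdf = df.sort_values('PAB_FLUX', ascending=False).reset_index(drop=True)
sortedpabdf['Z_ERR'] = ((sortedpabdf['Z_84'] - sortedpabdf['Z_50'])
                        + (sortedpabdf['Z_50'] - sortedpabdf['Z_16'])) / (2 * sortedpabdf['Z_50'])
sortedpabdf[r'$\lambda Pa\beta$'] = 12820 * (1 + sortedpabdf['Z_50'])
```

This sidesteps the sorted-list problem entirely, because the string indexing happens on DataFrame columns rather than on a list.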
data = pd.read_csv(r'RE_absentee_one.csv')
data['New_addy'] = str(data['Prop-House Number']) + data['Prop-Street Name'] + data['Prop-Mode'] + str(data['Prop-Apt Unit Number'])
df = pd.DataFrame(data, columns = ['Name','New_addy'])
So this is the code
As you can see, Prop-House Number and Prop-Apt Unit Number are both ints and the rest are strings. I am trying to combine all of these so that the full address is under one column labeled 'New_addy'.
Convert each column to a string with map before concatenating, as shown below:
data = pd.read_csv(r'RE_absentee_one.csv')
data['New_addy'] = data['Prop-House Number'].map(str) + data['Prop-Street Name'].map(str) + data['Prop-Mode'].map(str) + data['Prop-Apt Unit Number'].map(str)
#select the desired columns for further work
data = data[['Name','New_addy']]
One way is using list comprehension:
data['New_addy'] = [str(n) + street + mode + str(apt_n) for n, street, mode, apt_n in zip(
    data['Prop-House Number'], data['Prop-Street Name'], data['Prop-Mode'], data['Prop-Apt Unit Number'])]
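Another option, if you'd rather not spell out each conversion, is to cast the relevant columns to strings in one go and join them row-wise with agg. A sketch with invented sample data in place of the CSV:

```python
import pandas as pd

# Invented stand-in for the contents of RE_absentee_one.csv
data = pd.DataFrame({
    'Name': ['Smith', 'Jones'],
    'Prop-House Number': [12, 34],
    'Prop-Street Name': ['Main', 'Oak'],
    'Prop-Mode': ['St', 'Ave'],
    'Prop-Apt Unit Number': [1, 2],
})

addr_cols = ['Prop-House Number', 'Prop-Street Name', 'Prop-Mode', 'Prop-Apt Unit Number']
# astype(str) converts the int columns, then ''.join glues each row together
data['New_addy'] = data[addr_cols].astype(str).agg(''.join, axis=1)
```

Like the answers above, this concatenates without separators; pass ' '.join instead to put spaces between the address parts.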
I have been trying to use the lambda and apply() method to create a new column based on other columns. The calculation I want to do calculates a range of "Stage Reaction" values based on a range of "Alpha 1" values and "Alpha 2" values, along with being based on a constant "Stage Loading" value. Code below:
import pandas as pd
import numpy as np
data = {
    'Stage Loading': [0.1],
    'Alpha 1': [[0.1, 0.12, 0.14]],
    'Alpha 2': [[0.1, 0.12, 0.14]]
}
pdf = pd.DataFrame(data)
def findstageloading(row):
    stload = row('Stage Loading')
    for alpha1, alpha2 in zip(row('Alpha 1'), row('Alpha 2')):
        streact = 1 - 0.5 * stload * (np.tan(alpha1) - np.tan(alpha2))
    return streact
pdf['Stage Reaction'] = pdf.apply(lambda row: findstageloading, axis = 1)
print(pdf.to_string())
The problem is that this code returns a message
"<function findstageloading at 0x000002272AF5F0D0>"
for the new column.
Can anyone tell me why? I want it to return a list of values
[0.9420088227267556, 1.0061754635552815, 1.0579911772732444]
Your lambda is just returning the function object; use pdf['Stage Reaction'] = pdf.apply(findstageloading, axis=1) instead.
Also, you need square brackets to access columns, not round ones; otherwise Python thinks you're calling the row like a function.
Also, I'm not sure where your output came from, but if you want to do pairwise arithmetic, you can rely on NumPy's elementwise operations and omit the zip.
def findstageloading(row):
    stload = row['Stage Loading']  # not row('Stage Loading')
    alpha1, alpha2 = row['Alpha 1'], row['Alpha 2']
    streact = 1 - 0.5 * stload * (np.tan(alpha1) - np.tan(alpha2))
    return streact
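Putting both fixes together on the sample data from the question (returning a plain list so pandas stores it as a single cell instead of trying to expand it):

```python
import pandas as pd
import numpy as np

data = {
    'Stage Loading': [0.1],
    'Alpha 1': [[0.1, 0.12, 0.14]],
    'Alpha 2': [[0.1, 0.12, 0.14]],
}
pdf = pd.DataFrame(data)

def findstageloading(row):
    stload = row['Stage Loading']
    # np.tan works elementwise across the whole list, so no zip is needed
    streact = 1 - 0.5 * stload * (np.tan(np.array(row['Alpha 1'])) - np.tan(np.array(row['Alpha 2'])))
    return streact.tolist()

pdf['Stage Reaction'] = pdf.apply(findstageloading, axis=1)
```

Note that with Alpha 1 equal to Alpha 2, the tangents cancel and every entry comes out as 1.0, which is also why the output quoted in the question cannot come from these inputs.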
I'm passing a dataframe through the mapInPandas function in PySpark, and I need all the values of the ID column separated by commas, like this: 'H57R6HU87','A1924334','496A4806'
x1['ID'] looks like this
H57R6HU87
A1924334
496A4806
Here is my code to get the unique IDs; I am getting TypeError: string indices must be integers
# batch_iter= cust.toPandas()
for x1 in batch_iter:
    IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
You probably don't need a loop, try:
batch_iter = cust.toPandas()
IDs = ','.join(f"'{i}'" for i in batch_iter['ID'].unique())
Or you can try using Spark functions only (this assumes from pyspark.sql import functions as F):
df2 = df.select(F.concat_ws(',', F.collect_set('ID')).alias('ID'))
If you want to use mapInPandas:
def pandas_func(iter):
    for x1 in iter:
        IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
        yield pd.DataFrame({'ID': IDs}, index=[0])
df.mapInPandas(pandas_func, schema='ID string')  # mapInPandas requires an output schema
# But I suspect you want to do this instead, so that all rows land in one batch:
# df.repartition(1).mapInPandas(pandas_func, schema='ID string')
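The joining step itself is plain pandas, so it can be sanity-checked without a Spark session. A sketch with hypothetical IDs (including a duplicate, to show that unique() deduplicates while preserving first-occurrence order):

```python
import pandas as pd

# Hypothetical batch as mapInPandas would hand it to the function
x1 = pd.DataFrame({'ID': ['H57R6HU87', 'A1924334', '496A4806', 'A1924334']})

# Quote each unique ID and join with commas
IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
```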
I want to run two for loops in which I calculate annualized returns of a hypothetical trading strategy based on moving average crossovers. It's pretty simple: go long as soon as the "faster" MA crosses the "slower" one. Otherwise move to cash.
My data looks like this:
My Code:
rets = {}
ann_rets = {}
#Nested Loop to calculate returns
for short in range(20, 200):
    for long in range(short + 1, 200):
        #Calculate cumulative return
        rets[short,long] = (aapl[short,long][-1] - aapl[short,long][1]) / aapl[short,long][1]
        #calculate annualized return
        ann_rets[short,long] = (( 1 + rets[short,long]) ** (12 / D))-1
The error message i get is the following:
TypeError: list indices must be integers or slices, not tuple
EDIT:
Using a dictionary works fine. The screenshot below shows where I'm stuck at the moment.
I want to have three final columns: (SMA_1,SMA_2,Ann_rets)
SMA_1: First Moving average e.g. 20
SMA_2: Second Moving average e.g. 50
Ann_rets: annualized return which is calculated in the loop above
If I understand your question correctly, this should help. I simplified your ann_rets output to illustrate reformatting it into the expected shape.
rets = {}
ann_rets = {}
#Nested Loop to calculate returns
for short in range(20, 200):
    for long in range(short + 1, 200):
        #Calculate cumulative return
        rets[short,long] = (aapl[short,long][-1] - aapl[short,long][1]) / aapl[short,long][1]
        #calculate annualized return
        ann_rets[short,long] = (( 1 + rets[short,long]) ** (12 / D))-1
# Reformat
Example data:
ann_rets = {(1,2): 0.1, (3,4):0.2, (5,6):0.3}
df1 = pd.DataFrame(ann_rets.values())
df2 = pd.DataFrame(list(ann_rets.keys()))
df = pd.concat([df2, df1], axis=1)
df.columns = ['SMA_1','SMA_2','Ann_rets']
print(df)
Which yields:
SMA_1 SMA_2 Ann_rets
0 1 2 0.1
1 3 4 0.2
2 5 6 0.3
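An equivalent shortcut: a dict with tuple keys becomes a Series with a two-level MultiIndex, so the same table falls out of one chained call:

```python
import pandas as pd

ann_rets = {(1, 2): 0.1, (3, 4): 0.2, (5, 6): 0.3}

# Tuple keys become a two-level MultiIndex; reset_index turns the levels into columns
df = pd.Series(ann_rets).rename_axis(['SMA_1', 'SMA_2']).reset_index(name='Ann_rets')
```

This keeps each (short, long) pair aligned with its return by construction, instead of relying on the dict's keys and values iterating in the same order.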
You're trying to access a list index with a tuple here: rets[short,long].
Try using a dictionary instead. So change
rets = []
ann_rets = []
to
rets = {}
ann_rets = {}
A double index like rets[short, long] will work for NumPy arrays and pandas DataFrames (like, presumably, your aapl variable), but not for a regular Python list. Use rets[short][long] instead. (Which also means you would need to change the initialization of rets at the top of your code.)
To explain briefly the actual error message: a tuple is more or less defined by the separating comma, that is, Python sees short,long and turns that into a tuple (short, long), which is then used inside the list index. Which, of course, fails, and throws this error message.
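A tiny reproduction of the difference:

```python
rets_list = []
try:
    rets_list[20, 21] = 0.1   # Python builds the tuple (20, 21) and a list rejects it as an index
except TypeError as err:
    print(err)                # list indices must be integers or slices, not tuple

rets = {}
rets[20, 21] = 0.1            # a dict happily uses the tuple (20, 21) as a key
```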
I have some trouble processing a big CSV with pandas. The CSV consists of an index and about 450 other columns in groups of 3, something like this:
cola1 colb1 colc1 cola2 colb2 colc2 cola3 colb3 colc3
1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1
2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2
3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3
For each trio of columns I would like to analyze column B (it's a sort of "control" field) and, depending on its value, return a value computed from columns A and C.
Finally I need to return a concatenation of all the resulting columns, from 150 down to 1.
I already tried apply, but it seems too slow (10 minutes to process 50k rows).
df['Path'] = df.apply(lambda x: getFullPath(x), axis=1)
with an example function you can find here:
https://pastebin.com/S9QWTGGV
I tried extracting a list of unique combinations of cola, colb, colc, preprocessing that list, and applying map to generate the results, which speeds things up a little:
for i in range(1,151):
    df['Concat' + str(i)] = df['cola' + str(i)] + '|' + df['colb' + str(i)] + '|' + df['colc' + str(i)]
concats = []
for i in range(1,151):
    concats.append('Concat' + str(i))
ret = df[concats].values.ravel()
uniq = list(set(ret))
lookup = {}  # renamed from `list`, which shadowed the built-in
for member in uniq:  # only the unique combinations need to be processed
    lookup[member] = getPath2(member)
for i in range(1,MAX_COLS + 1):
    df['Res' + str(i)] = df['Concat' + str(i)].map(lookup)
df['Path'] = df.apply(getFullPath2,axis=1)
function getPath and getFullPath2 are defined as example here:
https://pastebin.com/zpFF2wXD
But it seems still a little bit slow (6 min for processing everything)
Do you have any suggestion on how I could speed up csv processing?
I don't even know if the way I'm concatenating columns could be done better :). I tried Series.str.cat but I didn't get how to chain only some columns rather than the full df.
Thanks very much!
Mic
Amended answer: I see from your criteria that you actually have multiple controls on each column group. I think what works is to split these into 3 dataframes, applying your mapping as follows:
import pandas as pd
series = {
'cola1': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb1': pd.Series(['ret1','ret1','ret2'],index=[1,2,3]),
'colc1': pd.Series(['B_1','C_2','B_3'],index=[1,2,3]),
'cola2': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb2': pd.Series(['ret3','ret1','ret2'],index=[1,2,3]),
'colc2': pd.Series(['B_2','A_1','A_3'],index=[1,2,3]),
'cola3': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb3': pd.Series(['ret2','ret2','ret1'],index=[1,2,3]),
'colc3': pd.Series(['A_1','B_2','C_3'],index=[1,2,3]),
}
your_df = pd.DataFrame(series, index=[1,2,3], columns=['cola1','colb1','colc1','cola2','colb2','colc2','cola3','colb3','colc3'])
# Split your dataframe into three frames for each column type
bframes = your_df[[col for col in your_df.columns if 'colb' in col]]
aframes = your_df[[col for col in your_df.columns if 'cola' in col]]
cframes = your_df[[col for col in your_df.columns if 'colc' in col]]
for df in [bframes, aframes, cframes]:
    df.columns = ['col1','col2','col3']
# Mapping criteria
def map_colb(c):
    if c == 'ret1':
        return 'A'
    elif c == 'ret2':
        return None
    else:
        return 'F'
def map_cola(a):
    if a.startswith('D_'):
        return 'D'
    else:
        return 'E'
def map_colc(c):
    if c.startswith('B_'):
        return 'B'
    elif c.startswith('C_'):
        return 'C'
    elif c.startswith('A_'):
        return None
    else:
        return 'F'
# Use it on each frame
aframes = aframes.applymap(map_cola)
bframes = bframes.applymap(map_colb)
cframes = cframes.applymap(map_colc)
# The trick here is filling 'None's from the left to right in order of precedence
final = bframes.fillna(cframes.fillna(aframes))
# Then just combine them using whatever delimiter you like
# df.values.tolist() turns a row into a list
pathlist = ['|'.join(item) for item in final.values.tolist()]
This gives a result of:
In [70]: pathlist
Out[70]: ['A|F|D', 'A|A|B', 'B|E|A']
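If applymap is still too slow at 50k rows, the same per-column rules can be vectorized with numpy.select, which evaluates the conditions as whole-column boolean masks instead of calling a Python function per cell. A sketch for the colb rule only:

```python
import numpy as np
import pandas as pd

colb = pd.Series(['ret1', 'ret2', 'ret3'])

conditions = [colb == 'ret1', colb == 'ret2']
choices = ['A', None]  # None marks "fall through to the next frame", as in the answer above
mapped = pd.Series(np.select(conditions, choices, default='F'), index=colb.index)
```

Frames built this way still contain None for the fall-through cells, so the fillna chain from the answer works unchanged.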