I am using a for loop to reuse existing data frames.
Sample Code:
for i in range(0, 5, 1):
    RGU_TT_TempX = pd.DataFrame()
    RGU_TT_TempX = RGU_TT_Temp
    # Merging Regular Ambulance TT with MSUs TT
    # Updating MSUs TT according to the Formula
    RGU_TT_TempX["MSU_X_DURATION"] = 0.05 + df_temp_MSU1["MSU_X_DURATION"].values + 0.25 + 0.25
    RGU_TT_TempX["MSU_Y_DURATION"] = 0.05 + df_temp_MSU2["MSU_Y_DURATION"].values + 0.25 + 0.25
    RGU_TT_TempX["MSU_Z_DURATION"] = 0.05 + df_temp_MSU3["MSU_Z_DURATION"].values + 0.25 + 0.25
This gives me the error:
---> 44 RGU_TT_TempX["MSU_X_DURATION"] = 0.05 + df_temp_MSU1["MSU_X_DURATION"].values + 0.25 + 0.25
ValueError: Length of values (0) does not match length of index (16622)
In each data frame I have 16622 values, yet this still gives me the length-of-index error.
Full error traceback:
ValueError Traceback (most recent call last)
Input In [21], in <cell line: 16>()
41 RGU_TT_TempX = RGU_TT_Temp
42 #Merging Regular Ambulance TT with MSUs TT
43 #Updating MSUs TT according to the Formula
---> 44 RGU_TT_TempX["MSU_X_DURATION"] = 0.05 + df_temp_MSU1["MSU_X_DURATION"].values + 0.25 + 0.25
45 RGU_TT_TempX["MSU_Y_DURATION"] = 0.05 + df_temp_MSU2["MSU_Y_DURATION"].values + 0.25 + 0.25
46 RGU_TT_TempX["MSU_Z_DURATION"] = 0.05 + df_temp_MSU3["MSU_Z_DURATION"].values + 0.25 + 0.25
File ~/opt/anaconda3/envs/geo_env/lib/python3.10/site-packages/pandas/core/frame.py:3977, in DataFrame.__setitem__(self, key, value)
3974 self._setitem_array([key], value)
3975 else:
3976 # set column
-> 3977 self._set_item(key, value)
File ~/opt/anaconda3/envs/geo_env/lib/python3.10/site-packages/pandas/core/frame.py:4171, in DataFrame._set_item(self, key, value)
4161 def _set_item(self, key, value) -> None:
4162 """
4163 Add series to DataFrame in specified column.
4164
(...)
4169 ensure homogeneity.
4170 """
-> 4171 value = self._sanitize_column(value)
4173 if (
4174 key in self.columns
4175 and value.ndim == 1
4176 and not is_extension_array_dtype(value)
4177 ):
4178 # broadcast across multiple columns if necessary
4179 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File ~/opt/anaconda3/envs/geo_env/lib/python3.10/site-packages/pandas/core/frame.py:4904, in DataFrame._sanitize_column(self, value)
4901 return _reindex_for_setitem(Series(value), self.index)
4903 if is_list_like(value):
-> 4904 com.require_length_match(value, self.index)
4905 return sanitize_array(value, self.index, copy=True, allow_2d=True)
File ~/opt/anaconda3/envs/geo_env/lib/python3.10/site-packages/pandas/core/common.py:561, in require_length_match(data, index)
557 """
558 Check the length of data matches the length of the index.
559 """
560 if len(data) != len(index):
--> 561 raise ValueError(
562 "Length of values "
563 f"({len(data)}) "
564 "does not match length of index "
565 f"({len(index)})"
566 )
ValueError: Length of values (0) does not match length of index (16622)
I am really stuck here. Any suggestions will be highly appreciated.
Data Frame (MSU_TT_Temp) Samples:
FROM_ID TO_ID DURATION_H DIST_KM
1 7 0.528556 38.43980
1 26 0.512511 37.38515
1 71 0.432453 32.57571
1 83 0.599486 39.26188
1 98 0.590517 35.53107
Data Frame (RGU_TT_Temp) Samples:
Ambulance_ID Centroid_ID Hospital_ID Regular_Ambu_TT
37 1 6 1.871885
39 2 13 1.599971
6 3 6 1.307165
42 4 12 1.411554
37 5 14 1.968138
The odd part is that if the loop runs only once, the code works absolutely fine.
Sample Code:
for i in range(0, 1, 1):
    s = my_chrome_list[i]
    MSU_X, MSU_Y, MSU_Z = s
    #print(MSU_X, MSU_Y, MSU_Z)
    # Three scenarios
    df_temp_MSU1 = pd.DataFrame()
    df_temp_MSU2 = pd.DataFrame()
    df_temp_MSU3 = pd.DataFrame()
    df_temp_MSU1 = MSU_TT_Temp.loc[(MSU_TT_Temp['FROM_ID'] == MSU_X)]
    df_temp_MSU1.rename(columns={'DURATION_H': 'MSU_X_DURATION'}, inplace=True)
    #df_temp_MSU1
    df_temp_MSU2 = MSU_TT_Temp.loc[(MSU_TT_Temp['FROM_ID'] == MSU_Y)]
    df_temp_MSU2.rename(columns={'DURATION_H': 'MSU_Y_DURATION'}, inplace=True)
    #df_temp_MSU2
    df_temp_MSU3 = MSU_TT_Temp.loc[(MSU_TT_Temp['FROM_ID'] == MSU_Z)]
    df_temp_MSU3.rename(columns={'DURATION_H': 'MSU_Z_DURATION'}, inplace=True)
    #df_temp_MSU3
    RGU_TT_TempX = pd.DataFrame()
    RGU_TT_TempX = RGU_TT_Temp
    # Merging Regular Ambulance TT with MSUs TT
    # Updating MSUs TT according to the Formula
    RGU_TT_TempX["MSU_X_DURATION"] = 0.05 + df_temp_MSU1["MSU_X_DURATION"].values + 0.25 + 0.25
    RGU_TT_TempX["MSU_Y_DURATION"] = 0.05 + df_temp_MSU2["MSU_Y_DURATION"].values + 0.25 + 0.25
    RGU_TT_TempX["MSU_Z_DURATION"] = 0.05 + df_temp_MSU3["MSU_Z_DURATION"].values + 0.25 + 0.25
    #RGU_TT_TempX
    # MSUs Average Time to Treatment
    MSU1 = RGU_TT_TempX["MSU_X_DURATION"].mean()
    MSU2 = RGU_TT_TempX["MSU_Y_DURATION"].mean()
    MSU3 = RGU_TT_TempX["MSU_Z_DURATION"].mean()
    MSU_AVG_TT = (MSU1 + MSU2 + MSU3) / 3
    parents_chromosomes_list.append(MSU_AVG_TT)
Output:
[2.0241383927258387]
Note: the three data frames have equal lengths; their indexes are the same length.
Loop for multiple iterations (errors):
for i in range(0, 5, 1):
What is the problem?
This error occurs when you assign a NumPy array of values to a new DataFrame column and the array's length does not match the length of the DataFrame's index. Here the reported length is 0, which means df_temp_MSU1 came out empty on that iteration — worth checking that the FROM_ID filter still matches anything after the first pass through the loop.
The easiest way to avoid the error itself is to create the new column from a pandas Series instead of a raw NumPy array.
You can take a closer look here: How to Fix: Length of values does not match the length of index
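As a minimal sketch of the difference (frame and column names mirror the question, but the data here is made up): assigning the raw `.values` array raises when lengths differ, while a Series aligns on the index and pads with NaN. Note that if the real cause is an empty filter result, the NaN padding will only hide it, so checking the filtered frame's length is still wise.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-ins for RGU_TT_TempX and df_temp_MSU1; the second
# frame has fewer rows, reproducing the length mismatch.
rgu = pd.DataFrame({"Regular_Ambu_TT": [1.8, 1.6, 1.3]})
msu = pd.DataFrame({"MSU_X_DURATION": [0.52, 0.51]})

vals = 0.05 + msu["MSU_X_DURATION"].values + 0.25 + 0.25

try:
    rgu["MSU_X_DURATION"] = vals        # raw ndarray: raises ValueError
except ValueError as e:
    print(e)

rgu["MSU_X_DURATION"] = pd.Series(vals)  # Series: aligns on index, pads NaN
print(rgu)
```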
Related
I have a dataset with firms' financial information from 1987 to 1999 across different industries, and I want to run an OLS regression for each industry. There are 137 industries in the dataframe, with 28128 firm-years available. df['Industry'] is the code for each industry; the same code means the same industry. After getting the intercept and coefficients of each industry, I want to put them into a descriptive-statistics table.
I am thinking of writing a for loop but have no idea how.
first I filter out industries that have at least 50 observations
Dechow_1987 = Dechow_1987[Dechow_1987.groupby('Industry')['Industry'].transform('size') >= 50]
print( 'Firm-years available:' ,Dechow_1987.shape[0])
print( 'Number of Industries available:' , Dechow_1987.Industry.nunique())
And I want to get this table at the end:
Intercept b1 b2 b3 Adjusted R2
Mean 0.03 0.19 -0.51 0.15 0.34
(t-statistic) (16.09) (21.10) (-35.77) (15.33)
Lower quartile 0.01 0.11 -0.63 0.08 0.22
Median 0.03 0.18 -0.52 0.15 0.34
Upper quartile 0.04 0.26 -0.40 0.23 0.4
I have tried :
ind = Dechow_1987.Industry.unique()
op = pd.DataFrame()
for i in ind:
    Dechow_1987_i = Dechow_1987[Dechow_1987.Industry == i]
    X_CFOs = Dechow_1987_i[['CFOtm1', 'CFOt', 'CFOtp1']]
    X_CFOs = sm.add_constant(X_CFOs)
    Y_WC = Dechow_1987_i['wcch']
    reg = sm.OLS(Y_WC, X_CFOs).fit()
    #reg.score(Y_WC, X_CFOs)
    intercept = reg.params.const
    coef_CFOtm1 = reg.params.CFOtm1
    coef_CFOt = reg.params.CFOt
    coef_CFOtp1 = reg.params.CFOtp1
    ind = i
    array = np.append(coef_CFOtm1, coef_CFOt, coef_CFOtp1).dtype('int32')
    array = np.append(array, intercept)
    array = np.append(array, ind)
    array = array.reshape(3, len(array))
    df_append = pd.DataFrame(array)
    op = op.append(df_append)
op.columns = ['A' + str(i) for i in range(3, len(op.columns) + 1)]
op.rename(columns={op.columns[-1]: "Industry"}, inplace=True)
op.rename(columns={op.columns[-2]: "Intercept"}, inplace=True)
op = op.reset_index().drop('index', axis=1)
op = op.drop_duplicates()
It comes out:
TypeError Traceback (most recent call last)
<ipython-input-114-60ce6bb71209> in <module>
18
19 ind=i
---> 20 array=np.append(coef_CFOtm1,coef_CFOt,coef_CFOtp1).dtype('int32')
21 array=np.append(array,intercept)
22 array=np.append(array,ind)
<__array_function__ internals> in append(*args, **kwargs)
D:\anaconda3\lib\site-packages\numpy\lib\function_base.py in append(arr, values, axis)
4743 values = ravel(values)
4744 axis = arr.ndim-1
-> 4745 return concatenate((arr, values), axis=axis)
4746
4747
<__array_function__ internals> in concatenate(*args, **kwargs)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
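For what it's worth, the immediate TypeError comes from np.append: its third positional argument is axis, so coef_CFOtp1 (a float) is being interpreted as an integer axis. A simpler pattern is to collect one dict of coefficients per industry and build the whole table with a single pd.DataFrame(rows) call. Here is a rough sketch with made-up data, using NumPy's lstsq in place of sm.OLS so it stands alone; the column names follow the question:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up stand-in for Dechow_1987: two industries, six firm-years each.
df = pd.DataFrame({
    "Industry": [1] * 6 + [2] * 6,
    "CFOtm1": rng.normal(size=12),
    "CFOt": rng.normal(size=12),
    "CFOtp1": rng.normal(size=12),
    "wcch": rng.normal(size=12),
})

rows = []
for industry, grp in df.groupby("Industry"):
    # Design matrix with a constant column, as sm.add_constant would build.
    X = np.column_stack([np.ones(len(grp)),
                         grp[["CFOtm1", "CFOt", "CFOtp1"]].to_numpy()])
    beta, *_ = np.linalg.lstsq(X, grp["wcch"].to_numpy(), rcond=None)
    rows.append({"Industry": industry, "Intercept": beta[0],
                 "b1": beta[1], "b2": beta[2], "b3": beta[3]})

op = pd.DataFrame(rows)          # one row per industry
print(op.round(3))
print(op[["Intercept", "b1", "b2", "b3"]].describe())  # summary table
```

With statsmodels, the same loop would append `reg.params` per group instead of the lstsq solution.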
I have a pandas dataframe and I want to create a list of column names for which the P_BUYER(%) column has one entry greater than or equal to 97 and the others less. For example, below, a list should be created containing TENRCT and ADV_INC. If P_BUYER(%) has a value greater than or equal to 97 in a block, then the name that sits in parallel to T for that block should be saved in the list (the names in parallel to T in the example below are TENRCT, ADVNTG_MARITAL, NEWLSGOLFIN, and ADV_INC).
Input :
T TENRCT P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
N (1,2,3) = Renter N (1,2,3) = Renter 35.88 0.1 33 8 2
Q <0> = Unknown Q <0> = Unknown 3.26 0.1 36 8 2
Q1 <4> = Owner Q <4> = Owner 60.86 99.8 143 5 1
E2
T ADVNTG_MARITAL P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
Q2<1> = 1+Marrd Q<1> = 1+Marrd 52.91 78.98 149 5 2
Q<2> = 1+Sngl Q<2> = 1+Sngl 45.23 17.6 39 8 3
Q1<3> = Mrrd_Sngl Q<3> = Mrrd_Sngl 1.87 3.42 183 4 1
E3
T ADV_INC P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
N1('1','Y') = Yes N('1','Y') = Yes 3.26 1.2 182 4 1
N('0','-1')= No N('0','-1')= No 96.74 98.8 97 7 2
E2
output:
Finallist = ['TENRCT', 'ADV_INC']
You can do it like this:
# In your code you have 3 dataframes E1, E2, E3; iterate over them
output = []
for df in [E1, E2, E3]:
    # Filter your dataframe
    df = df[df['P_BUYER(%)'] >= 97]
    if not df.empty:
        cols = df.columns.values.tolist()
        # Find the index of the 'T' column
        t_index = cols.index('T')
        # Your desired parallel column will be at t_index + 1
        output.append(cols[t_index + 1])
print(output)
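For example, on a toy version of the first block (hypothetical values), this picks out TENRCT because that block has a P_BUYER(%) entry of 99.8:

```python
import pandas as pd

# Toy version of one block from the question; 'T' marks the key column
# and the variable name sits immediately to its right.
E1 = pd.DataFrame({
    "T": ["N", "Q", "Q1"],
    "TENRCT": ["Renter", "Unknown", "Owner"],
    "P_NONBUY(%)": [35.88, 3.26, 60.86],
    "P_BUYER(%)": [0.1, 0.1, 99.8],
})

output = []
for df in [E1]:
    df = df[df["P_BUYER(%)"] >= 97]
    if not df.empty:
        cols = df.columns.tolist()
        output.append(cols[cols.index("T") + 1])

print(output)  # ['TENRCT']
```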
I have a pandas dataframe in which I am storing information about different objects in a video.
For each frame of the video I'm saving the positions of the objects in a dataframe with columns 'x', 'y' 'particle' with the frame number in the index:
x y particle
frame
0 588 840 0
0 260 598 1
0 297 1245 2
0 303 409 3
0 307 517 4
This works fine but I want to save information about each frame of the video, e.g. the temperature at each frame.
I'm currently doing this by creating a series with the values for each frame and the index containing the frame number then adding the series to the dataframe.
prop = pd.Series(temperature_values,
                 index=pd.Index(np.arange(len(temperature_values)), name='frame'))
df['temperature'] = prop
This works but produces duplicates of the data in every row of the column:
x y particle temperature
frame
0 588 840 0 12
0 260 598 1 12
0 297 1245 2 12
0 303 409 3 12
0 307 517 4 12
Is there any way of saving this information without duplicates in the current dataframe, so that when I access the temperature column I get back the original series I created?
If there isn't, my plan is either to deal with the duplicates using drop_duplicates, or to create a second dataframe with just the per-frame data and merge it into my first dataframe, but I'd like to avoid that if possible.
Here is the current code with jupyter outputs formatted as best as I can:
import pandas as pd
import numpy as np
df = pd.DataFrame()
frames = list(range(5))
for f in frames:
    x = np.random.randint(10, 100, size=10)
    y = np.random.randint(10, 100, size=10)
    particle = np.arange(10)
    data = {
        'x': x,
        'y': y,
        'particle': particle,
        'frame': f}
    df_to_append = pd.DataFrame(data)
    df = df.append(df_to_append)
print(df.head())
Output:
x y particle frame
0 61 97 0 0
1 49 73 1 0
2 48 72 2 0
3 59 37 3 0
4 39 64 4 0
Input
df = df.set_index('frame')
print(df.head())
Output
x y particle
frame
0 61 97 0
0 49 73 1
0 48 72 2
0 59 37 3
0 39 64 4
Input:
example_data = [10*f for f in frames]
# Current method
prop = pd.Series(example_data, index=pd.Index(np.arange(len(example_data)), name='frame'))
df['data1'] = prop
print(df.head())
print(df.tail())
Output:
x y particle data1
frame
0 61 97 0 0
0 49 73 1 0
0 48 72 2 0
0 59 37 3 0
0 39 64 4 0
x y particle data1
frame
4 25 93 5 40
4 28 17 6 40
4 39 15 7 40
4 28 47 8 40
4 12 56 9 40
Input:
# Proposed method
df['data2'] = example_data
Output:
ValueError Traceback (most recent call last)
<ipython-input-12-e41b12bbe1cd> in <module>
1 # Proposed method
----> 2 df['data2'] = example_data
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3368 else:
3369 # set column
-> 3370 self._set_item(key, value)
3371
3372 def _setitem_slice(self, key, value):
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
3443
3444 self._ensure_valid_index(value)
-> 3445 value = self._sanitize_column(key, value)
3446 NDFrame._set_item(self, key, value)
3447
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3628
3629 # turn me into an ndarray
-> 3630 value = sanitize_index(value, self.index, copy=False)
3631 if not isinstance(value, (np.ndarray, Index)):
3632 if isinstance(value, list) and len(value) > 0:
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index, copy)
517
518 if len(data) != len(index):
--> 519 raise ValueError('Length of values does not match length of index')
520
521 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
I am afraid you cannot. All columns in a DataFrame share the same index and are therefore required to have the same length. Coming from the database world, I also try as much as possible to avoid indexes with duplicate values.
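A sketch of the two-table approach the question mentions (names are illustrative): keep the per-frame properties in their own Series keyed by frame number, and join only when a flat view is needed, so the duplication exists only in the derived view rather than in the stored data:

```python
import pandas as pd

# Per-particle data: one row per (frame, particle), indexed by frame.
particles = pd.DataFrame({
    "frame": [0, 0, 1, 1],
    "x": [588, 260, 590, 265],
    "y": [840, 598, 842, 600],
    "particle": [0, 1, 0, 1],
}).set_index("frame")

# Per-frame data: one value per frame -- no duplication here.
frame_props = pd.Series([12.0, 12.5],
                        index=pd.Index([0, 1], name="frame"),
                        name="temperature")

# Join on demand; the repeated temperatures exist only in this view.
flat = particles.join(frame_props)
print(flat)
```

Accessing `frame_props` directly always gives back the original one-value-per-frame series.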
I am working on computing a kind of moving average in a dataframe, where the formula changes based on the number of the row it is being computed for. The actual scenario is that I need to compute column Z.
Edit-2:
Below is the actual data I am working with
Date Open High Low Close
0 01-01-2018 1763.95 1763.95 1725.00 1731.35
1 02-01-2018 1736.20 1745.80 1725.00 1743.20
2 03-01-2018 1741.10 1780.00 1740.10 1774.60
3 04-01-2018 1779.95 1808.00 1770.00 1801.35
4 05-01-2018 1801.10 1820.40 1795.60 1809.95
5 08-01-2018 1816.00 1827.95 1800.00 1825.00
6 09-01-2018 1823.00 1835.00 1793.90 1812.05
7 10-01-2018 1812.05 1823.00 1801.40 1816.55
8 11-01-2018 1825.00 1825.05 1798.55 1802.10
9 12-01-2018 1805.00 1820.00 1794.00 1804.95
10 15-01-2018 1809.90 1834.45 1792.45 1830.00
11 16-01-2018 1835.00 1857.45 1826.10 1850.25
12 17-01-2018 1850.00 1852.45 1826.20 1840.50
13 18-01-2018 1840.50 1852.00 1823.50 1839.00
14 19-01-2018 1828.25 1836.35 1811.00 1829.50
15 22-01-2018 1816.50 1832.55 1805.50 1827.20
16 23-01-2018 1825.00 1825.00 1782.25 1790.15
17 24-01-2018 1787.80 1792.70 1732.15 1737.50
18 25-01-2018 1739.90 1753.40 1720.00 1726.40
19 29-01-2018 1735.15 1754.95 1729.80 1738.70
The code snippet I am using is as below:
from datetime import date
from nsepy import get_history
import csv
import pandas as pd
import numpy as np
import requests
from datetime import timedelta
import datetime as dt
import pandas_datareader.data as web
import io
df = pd.read_csv('ACC.CSV')
idx = df.reset_index().index
df['Change'] = df['Close'].diff()
df['Advance'] = np.where(df.Change > 0, df.Change,0)
df['Decline'] = np.where(df.Change < 0, df.Change*-1, 0)
conditions = [idx < 14, idx == 14, idx > 14]
values = [0, (df.Advance.rolling(14).sum())/14, (df.Avg_Gain.shift(1) * 13 + df.Advance)/14]
df['Avg_Gain'] = np.select(conditions, values)
df['Avg_Loss'] = (df.Decline.rolling(14).sum())/14
df['RS'] = df.Avg_Gain / df.Avg_Loss
df['RSI'] = np.where(df['Avg_Loss'] == 0, 100, 100-(100/(1+df.RS)))
df.drop(['Change', 'Advance', 'Decline', 'Avg_Gain', 'Avg_Loss', 'RS'], axis=1)
print(df.head(20))
Below is the error I am getting:
Traceback (most recent call last):
File "C:/Users/Lenovo/Desktop/Python/0.Chart Patterns/Z.Sample Code.py", line 20, in <module>
values = [0, (df.Advance.rolling(14).sum())/14, (df.Avg_Gain.shift(1) * 13 + df.Advance)/14]
File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 3614, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'Avg_Gain'
Edit 3: Below is the expected output and then I will also write down the formula. Original DF consists of columns Date Open High Low & Close.
Advance and Decline –
• If the difference between current & previous is +ve then Advance = difference and Decline = 0
• If the difference between current & previous is –ve then Advance = 0 and Decline = -1 * difference
Avg_Gain:
• If index is < 13 then Avg_Gain = 0
• If index = 13 then Avg_Gain = Average of 14 periods
• If index > 13, then Avg_Gain = (Avg_Gain(previous-row) * 13 + Advance(current-row) )/14
Avg_Loss:
• If index is < 13 then Avg_Loss = 0
• If index = 13 then Avg_Loss = Average of Advance of 14 periods
• If index > 13, then Avg_Loss = (Avg_Loss(previous-row) * 13 + Decline(current-row) )/14
RS:
• If index < 13 then RS = 0
• If Index >= 13 then RS = Avg_Gain/Avg_Loss
RSI = 100-(100/(1 + RS))
I hope this helps.
You have an error in your code because you use df.Avg_Gain while creating df.Avg_Gain:
values = [0, (df.Advance.rolling(14).sum())/14, (df.Avg_Gain.shift(1) * 13 + df.Advance)/14]
df['Avg_Gain'] = np.select(conditions, values)
I changed that part of the code to the following:
up = df.Advance.rolling(14).sum()/14
values = [0, up, (up.shift(1) * 13 + df.Advance)/14]
Output (idx>=14):
Date Open High Low Close RSI
14 2018-01-19 1828.25 1836.35 1811.00 1829.50 75.237850
15 2018-01-22 1816.50 1832.55 1805.50 1827.20 72.920021
16 2018-01-23 1825.00 1825.00 1782.25 1790.15 58.793750
17 2018-01-24 1787.80 1792.70 1732.15 1737.50 40.573938
18 2018-01-25 1739.90 1753.40 1720.00 1726.40 31.900045
19 2018-01-29 1735.15 1754.95 1729.80 1738.70 33.197678
There should be a better way of doing this, though; I'll update with a better solution if I find one. Let me know if this data is correct.
UPDATE:
You also need to correct your calculation for Avg_Loss:
down = df.Decline.rolling(14).sum()/14
down_values = [0, down, (down.shift(1) * 13 + df.Decline)/14]
df['Avg_Loss'] = np.select(conditions, down_values)
http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:relative_strength_index_rsi#calculation
UPDATE 2: After Expected data was provided.
The only way I could do this is by looping; I'm not sure it's possible otherwise, short of some pandas functionality I'm unaware of.
So first do the same as before with setting Avg_Gain and Avg_Loss:
you just need to change the values slightly:
conditions = [idx<13, idx==13, idx>13]
up = df.Advance.rolling(14).sum()/14
values = [0, up, 0]
df['Avg_Gain'] = np.select(conditions, values)
down = df.Decline.rolling(14).sum()/14
down_values = [0, down, 0]
df['Avg_Loss'] = np.select(conditions, down_values)
I have changed your conditions to split on index 13, since this is what I see based on the expected output.
Once that code runs, populate Avg_Gain and Avg_Loss from index 14 onward using the previous values of Avg_Gain and Avg_Loss:
p = 14
for i in range(p, len(df)):
    df.at[i, 'Avg_Gain'] = ((df.loc[i-1, 'Avg_Gain'] * (p-1)) + df.loc[i, 'Advance']) / p
    df.at[i, 'Avg_Loss'] = ((df.loc[i-1, 'Avg_Loss'] * (p-1)) + df.loc[i, 'Decline']) / p
Output:
df[13:][['Date','Avg_Gain', 'Avg_Loss', 'RS', 'RSI']]
Date Avg_Gain Avg_Loss RS RSI
13 2018-01-18 10.450000 2.760714 3.785252 79.102460
14 2018-01-19 9.703571 3.242092 2.992997 74.956155
15 2018-01-22 9.010459 3.174800 2.838119 73.945571
16 2018-01-23 8.366855 5.594457 1.495562 59.928860
17 2018-01-24 7.769222 8.955567 0.867530 46.453335
18 2018-01-25 7.214278 9.108741 0.792017 44.196960
19 2018-01-29 7.577544 8.458116 0.895890 47.254330
You can do it using ewm; also have a look there for a bit more explanation. ewm can be used to calculate y(row) = α*x(row) + (1−α)*y(row−1), where, for example, y is the column Avg_Gain, x is the value of the column Advance, and α is the weight given to x(row).
# define the number for the window
win_n = 14
# create a dataframe df_avg with the non null value of the two columns
# Advance and Decline such as the first is the mean of the 14 first values and the rest as normal
df_avg = (pd.DataFrame({'Avg_Gain': np.append(df.Advance[:win_n].mean(), df.Advance[win_n:]),
'Avg_Loss': np.append(df.Decline[:win_n].mean(), df.Decline[win_n:])},
df.index[win_n-1:])
.ewm(adjust=False, alpha=1./win_n).mean()) # what you need to calculate with your formula
# create the two other columns RS and RSI
df_avg['RS'] = df_avg.Avg_Gain / df_avg.Avg_Loss
df_avg['RSI'] = 100.-(100./(1. + df_avg['RS']))
and df_avg looks like:
Avg_Gain Avg_Loss RS RSI
13 10.450000 2.760714 3.785252 79.102460
14 9.703571 3.242092 2.992997 74.956155
15 9.010459 3.174800 2.838119 73.945571
16 8.366855 5.594457 1.495562 59.928860
17 7.769222 8.955567 0.867530 46.453335
18 7.214278 9.108741 0.792017 44.196960
19 7.577544 8.458116 0.895890 47.254330
You can join it to the original data and fillna with 0:
df = df.join(df_avg).fillna(0)
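To convince yourself that ewm(adjust=False, alpha=a) really implements the recursion y(row) = a*x(row) + (1−a)*y(row−1), seeded with the first value, here is a quick check against an explicit loop on made-up numbers:

```python
import pandas as pd

a = 1.0 / 14
x = pd.Series([10.0, 4.0, 0.0, 7.5, 2.0])  # arbitrary sample values

# pandas' recursive (non-adjusted) exponentially weighted mean
ewm_result = x.ewm(adjust=False, alpha=a).mean()

# The same recursion written out by hand
y = [x.iloc[0]]                 # seeded with the first value
for v in x.iloc[1:]:
    y.append(a * v + (1 - a) * y[-1])

print(ewm_result.tolist())
print(y)  # matches the ewm result
```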
I have a large catalog that I am selecting data from according to the following criteria:
columns = ["System", "rp", "mp", "logg"]
catalog = pd.read_csv('data.txt', skiprows=1, sep=r'\s+', names=columns)
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
new_catalog = pd.DataFrame(catalog[i])
print("{0} targets after cuts".format(len(new_catalog)))
When I perform the above cuts the code works fine. Next, I want to add one more cut: I want to select all the targets that have 4.0 < logg < 5.0. However, some targets have logg = -1 (which means the value is not available). Luckily, I can calculate logg from the other available parameters. So here are my updated cuts:
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
if catalog.logg[i] == -1:
catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
i &= (4 <= catalog.logg) & (catalog.logg <= 5)
However, I am receiving an error:
if catalog.logg[i] == -1:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Can someone please explain what I am doing wrong and how I can fix it? Thank you.
Edit 1
My dataframe looks like the following:
Data columns:
System 477 non-null values
rp 477 non-null values
mp 477 non-null values
logg 477 non-null values
dtypes: float64(37), int64(3), object(3)
Edit 2
System rp mp logg FeH FeHu FeHl Mstar Mstaru Mstarl
0 target-01 5196 24 24 0.31 0.04 0.04 0.905 0.015 0.015
1 target-02 5950 150 150 -0.30 0.25 0.25 0.950 0.110 0.110
2 target-03 5598 50 50 0.04 0.05 0.05 0.997 0.049 0.049
3 target-04 6558 44 -1 0.14 0.04 0.04 1.403 0.061 0.061
4 target-05 6190 60 60 0.05 0.07 0.07 1.194 0.049 0.050
....
[5 rows x 43 columns]
Edit 3
In a form I understand, my code would be:
for row in range(len(catalog)):
    parameter = catalog['logg'][row]
    if parameter == -1:
        parameter = catalog['mp'][row] / catalog['rp'][row]
    if parameter > 4.0 and parameter < 5.0:
        pass  # select this row for further analysis
However, I am trying to write my code in a simpler, more idiomatic way, without the for loop. How can I do it?
EDIT 4
Consider the following small example:
System rp mp logg
target-01 2 -1 2 # will NOT be selected since mp = -1
target-02 -1 3 4 # will NOT be selected since rp = -1
target-03 7 6 4.3 # will be selected since mp != -1, rp != -1, and 4 < logg <5
target-04 3.2 15 -1 # will be selected since mp != -1, rp != -1, logg = mp / rp = 15/3.2 = 4.68 (which is between 4 and 5)
You get the error because catalog.logg[i] is not a scalar but a Series, so you should switch to a vectorized manipulation:
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
which modifies the logg column in place.
As for edit 3:
rows=catalog.loc[(catalog.logg > 4) & (catalog.logg < 5)]
which will select rows that satisfy the condition
Instead of that code:
if catalog.logg[i] == -1:
    catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
you could use the following (.loc, since .ix is deprecated):
i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
For your edit 3 you need to add this line:
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
Full code:
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
EDIT
Probably I still don't understand what you want, but I get your desired output:
import pandas as pd
from io import StringIO
data = """
System rp mp logg
target-01 2 -1 2
target-02 -1 3 4
target-03 7 6 4.3
target-04 3.2 15 -1
"""
catalog = pd.read_csv(StringIO(data), sep=r'\s+')
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
In [7]: your_rows
Out[7]:
System rp mp logg
2 target-03 7.0 6 4.3000
3 target-04 3.2 15 4.6875
Am I still wrong?