I am trying to set colb in a pandas DataFrame depending on the value in cola.
The order in which I pass the row and column indexers seems to determine whether the indexing works. Why is this?
Here is an example of what I mean.
I set up my dataframe:
import numpy as np
import pandas as pd

test = pd.DataFrame(np.random.rand(20, 1))
test['cola'] = [x for x in range(20)]
test['colb'] = 0
If I try to set column b using the following code:
test.loc['colb',test.cola>2]=1
I get the error: `ValueError: setting an array element with a sequence`
If I use the following code instead, the code alters the dataframe as I expect:
test.loc[test.cola>2,'colb']=1
Why is this?
Further, is there a better way to assign a column using a test like this?
I have a dataframe called "nums" and am trying to find the value of the column "angl" by specifying the values of other columns like this:
nums[(nums['frame']==300)&(nums['tad']==6)]['angl']
When I do so, I do not get a single number and cannot do calculations on it. What am I doing wrong?
First of all, in general you should use .loc rather than chained indexing like that:
>>> s = nums.loc[(nums['frame']==300)&(nums['tad']==6), 'angl']
Now, to get the float, you can use the .item() method:
>>> s.item()
-0.466331
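A minimal runnable sketch of the pattern, with made-up data shaped like the question's "nums" frame (only the -0.466331 value comes from the question):

```python
import pandas as pd

# Hypothetical data; column names taken from the question's snippet
nums = pd.DataFrame({
    "frame": [299, 300, 300],
    "tad":   [6, 6, 7],
    "angl":  [0.1, -0.466331, 0.9],
})

s = nums.loc[(nums["frame"] == 300) & (nums["tad"] == 6), "angl"]
value = s.item()  # raises ValueError unless the selection has exactly one element
```

Note that .item() fails loudly when the filter matches zero or several rows, which is usually what you want for a lookup that should be unique.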
I have numbers in a List that should get assigned to certain rows of a dataframe consecutively.
List = [2, 5, 7, 12, …]
In my dataframe that looks similar to the below table, I need to do the following:
A row whose Frame_Index is 1 gets the next element of List as its Sequence_number:
the first time Frame_Index == 1, assign the first element of List as Sequence_number;
the next time Frame_Index == 1, assign the second element of List; and so on.
So my goal is to achieve a new dataframe like this:
I don't know which functions to use. I know I could use a for loop and check where Frame_Index == 1, but my dataset is large and I need a pythonic way to achieve this. I appreciate any help.
EDIT: I tried the following to fill in my List values, intending to use fillna with ffill afterwards:
concatenated_df['Sequence_number'] = [List[i] for i in concatenated_df.index if (concatenated_df['Frame_Index'] == 1).any()]
But of course I'm getting a "list index out of range" error.
I think you can do this in two steps.
Add column and fill with your list where frame_index == 1.
Use df.fillna() with method="ffill" kwarg.
import pandas as pd
df = pd.DataFrame({"frame_index": [1,2,3,4,1,2]})
sequence = [2,5]
df.loc[df["frame_index"] == 1, "sequence_number"] = sequence
df.ffill(inplace=True) # alias for df.fillna(method="ffill")
This leaves sequence_number as float64, which might be acceptable in your use case; if you want it as int64, you can either force the dtype when creating the column or cast it afterwards.
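A runnable sketch of the full recipe, repeating the snippet above with the integer cast added at the end:

```python
import pandas as pd

df = pd.DataFrame({"frame_index": [1, 2, 3, 4, 1, 2]})
sequence = [2, 5]

# Fill only the rows where frame_index == 1, then forward-fill the gaps
df.loc[df["frame_index"] == 1, "sequence_number"] = sequence
df.ffill(inplace=True)

# Cast the filled column back to an integer dtype
df["sequence_number"] = df["sequence_number"].astype(int)
```

After the forward fill, every row between two frame_index == 1 markers carries the sequence number of the preceding marker.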
In pandas I have a dataframe with X by Y dimension containing values.
Then I have an identical pandas dataframe with X by Y dimension (same as df1) containing True/False values.
I want to return only the elements from df1 where the same location on df2 the value = True.
What is the fastest way to do this? Is there a way to do this without converting to numpy array?
Without a reproducible example I may be missing a couple of details here, but I think you can accomplish this with dataframe multiplication:
df1.mul(df2)
This multiplies each element by the corresponding element in the other dataframe, where True returns the original element and False returns 0 (not NaN).
It is also possible to use where:
df1.where(df2)
This is equivalent to df1[df2] and replaces the hidden values with NaN (df1.mask(df2) is the inverse, hiding the values where df2 is True); both let you choose the replacement value via the other argument.
A quick benchmark on a 10x10 dataframe suggests that the df.mul approach is ~5 times faster
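A small runnable sketch contrasting the approaches on made-up frames (note that .where keeps values where the condition is True, while .mask hides them):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.random((3, 3)))
df2 = pd.DataFrame([[True, False, True],
                    [False, True, False],
                    [True, True, False]])

kept_where = df1.where(df2)   # NaN where df2 is False
kept_index = df1[df2]         # equivalent boolean-frame indexing
multiplied = df1.mul(df2)     # 0.0 where df2 is False (no NaN introduced)
```

Whether 0 or NaN is the right "hidden" value depends on what you do with the result afterwards (e.g. sums ignore both, but means treat them differently).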
In pandas this operation creates a Series:
q7.loc[:, list(q7)].max(axis=1) - q7.loc[:, list(q7)].min(axis=1)
I would like to be able to set the index to a list of values from a dataframe column, i.e.
list(df['Colname'])
I've tried creating the series and then updating it with the series generated from the first code snippet. I've also searched the docs and don't see a method that will do this. I would prefer not to iterate over it manually.
Help is appreciated.
You can simply store that series in a variable, say S, and set the index accordingly, as shown below:
S = (q7.loc[:, list(q7)].max(axis=1) - q7.loc[:, list(q7)].min(axis=1))
S.index = df['Colname']
This assumes the series and the dataframe column have equal lengths. Hope this helps!
If you want to reset series s index, you can do:
s.index = new_index_list
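Putting the two answers together, a minimal sketch with made-up frames (q7 and df here are hypothetical stand-ins for the question's data):

```python
import pandas as pd

q7 = pd.DataFrame({"a": [1, 4], "b": [3, 2]})
df = pd.DataFrame({"Colname": ["x", "y"]})

# q7.loc[:, list(q7)] selects every column, so this is the same computation
s = q7.max(axis=1) - q7.min(axis=1)   # per-row range
s.index = df["Colname"]               # lengths must match
```

If the lengths differ, the assignment raises a ValueError rather than silently misaligning, which is a useful sanity check.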
I have a pandas dataframe and I need to add a new column computed from specific columns, selected according to a column 'site'. I have found a way to do this with numpy, but it always gives a warning about chained indexing. I am sure there should be a better solution; please help if you know one.
df_num_bin1['Chip_id_3'] = np.where(df_num_bin1[key_site_num] == 1,
                                    df_num_bin1[WB_89_S1] * 0x100 + df_num_bin1[WB_78_S1],
                                    df_num_bin1[WB_89_S2] * 0x100 + df_num_bin1[WB_78_S2])
df_num_bin1['Chip_id_2'] = np.where(df_num_bin1[key_site_num] == 1,
                                    df_num_bin1[WB_67_S1] * 0x100 + df_num_bin1[WB_56_S1],
                                    df_num_bin1[WB_67_S2] * 0x100 + df_num_bin1[WB_56_S2])
df_num_bin1['Chip_id_1'] = np.where(df_num_bin1[key_site_num] == 1,
                                    df_num_bin1[WB_45_S1] * 0x100 + df_num_bin1[WB_34_S1],
                                    df_num_bin1[WB_45_S2] * 0x100 + df_num_bin1[WB_34_S2])
df_num_bin1['Chip_id_0'] = np.where(df_num_bin1[key_site_num] == 1,
                                    df_num_bin1[WB_23_S1] * 0x100 + df_num_bin1[WB_12_S1],
                                    df_num_bin1[WB_23_S2] * 0x100 + df_num_bin1[WB_12_S2])
df_num_bin1['mac_low'] = (df_num_bin1['Chip_id_1'].map(int) % 0x10000) * 0x100 + df_num_bin1['Chip_id_0'].map(int) // 0x1000000
The code above has two issues:
1: The value of the column [key_site_num] determines which columns I should extract the chip id data from. In this example it is only site 0 or 1, but actually it could be 2 or 3 as well. I would need a general solution.
2: It generates a chained-indexing warning:
C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:35: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Well, I'm not too sure about your first question, but I think this will help you.
import pandas as pd
reader = pd.read_csv(path,engine='python')
reader['new'] = reader['treasury.maturity.rate']+reader['bond.yield']
reader.to_csv('test.csv',index=False)
As you can see, you don't need to get the values before operating on them; just reference the column where they are. To do the same for only specific rows, you can filter the dataframe before creating the new column.
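For the multi-site part of the question, np.select generalizes np.where to more than two cases and, combined with a plain column assignment on the original frame, avoids the chained-indexing warning. A sketch with entirely made-up column names and data (the real frame uses the WB_* column variables from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'site' picks which pair of byte columns forms the id
df = pd.DataFrame({
    "site":  [1, 2, 1, 2],
    "hi_s1": [0x12, 0x34, 0x56, 0x78],
    "lo_s1": [0x01, 0x02, 0x03, 0x04],
    "hi_s2": [0xAA, 0xBB, 0xCC, 0xDD],
    "lo_s2": [0x05, 0x06, 0x07, 0x08],
})

# One condition/choice pair per site; extend the lists for sites 3, 4, ...
conditions = [df["site"] == 1, df["site"] == 2]
choices = [df["hi_s1"] * 0x100 + df["lo_s1"],
           df["hi_s2"] * 0x100 + df["lo_s2"]]

# Assigning via .loc on the original frame avoids SettingWithCopyWarning
df.loc[:, "chip_id"] = np.select(conditions, choices)
```

Each additional site just adds one entry to conditions and one to choices, so the approach stays general.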