I have a problem with how to appropriately code this condition. I'm currently creating a new pandas column in my dataframe, new_column, which performs a subtraction on the values in column test based on which index of the data we are at. I'm currently using this code to subtract a different value on every fourth row:
import pandas as pd
import numpy as np

subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
print(data['new_column'])
[6,1,2,1,-5,0,-1,2,4,6]
However, I now wish to perform the higher subtraction on the first two positions in the column, then 3 subtractions with the original value, another two with the higher subtraction value, 3 small subtractions, and so forth. I thought I could do it this way, with an | condition in my np.where statement:
data['new_column'] = np.where((data.index % 4) | (data.index % 5),
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
However, this didn't work, and I feel my maths may be slightly off. My desired output would look like this:
print(data['new_column'])
[6,-2,2,1,-2,-3,-4,3,7,6]
As you can see, this slightly shifts the pattern. Can I still use numpy.where() here, or do I have to take a new approach? Any help would be greatly appreciated!
As mentioned in the comment section, the output should equal
[6,-2,2,1,-2,-3,-4,2,7,6] instead of [6,-2,2,1,-2,-3,-4,3,7,6] according to your logic. Given that, you can do the following:
import pandas as pd
import numpy as np
from itertools import chain
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
# positions 0,1, 5,6, 10,11, ... (the first two of every block of five) get the larger subtraction
index_pos_large_subtraction = list(chain.from_iterable((data.index[i], data.index[i+1]) for i in range(0, len(data)-1, 5)))
data['new_column'] = np.where(~data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value, data['test']-subtraction_value_2)
# The next line is equivalent to the previous one
# data['new_column'] = np.where(data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value_2, data['test']-subtraction_value)
---------------------------------------------
   test  new_column
0    12           6
1     4          -2
2     5           2
3     4           1
4     1          -2
5     3          -3
6     2          -4
7     5           2
8    10           7
9     9           6
---------------------------------------------
As you can see, np.where works fine. Your masking condition is the problem and needs to be adjusted: you were not selecting rows according to your logic.
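As a side note, since the pattern repeats every five rows (two large subtractions followed by three small ones), a plain modulo condition would express the same mask more compactly. This is a sketch of that alternative, using the variables defined above:
# index % 5 is 0 or 1 exactly on the first two rows of each block of five
data['new_column'] = np.where(data.index % 5 < 2,
                              data['test'] - subtraction_value_2,
                              data['test'] - subtraction_value)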
Let's say that I have some data from a file where some columns are "of the same kind", differing only in the subscript of some mathematical variable, say x:
n  A  B  C  x[0]  x[1]  x[2]
0  1  2  3     4     5     6
1  2  3  4     5     6     7
Is there some way I can load this into a pandas dataframe df and somehow treat the three x-columns as an indexable, array-like entity (I'm new to pandas)? I believe it would be convenient, because I could do operations on the data series contained in x, such as sum(df.x).
Kind regards.
EDIT:
Admittedly, my original post was not clear enough. I'm not just interested in getting the sum of three columns. That was just an example. I'm looking for a generally applicable abstraction that I hope is built into pandas.
I'd like to have multiple columns accessible through (sub-)indices of one entity, e.g. df.x[0], such that I (or any other user of the data) can do whichever operation he/she wants (sum/max/min/avg/standard deviation, you name it). You can consider the x's as an ensemble of time-dependent measurements if you like.
Kind regards.
Consider that you define your dataframe like this:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [2, 3, 4, 5, 6, 7]],
                  columns=['A', 'B', 'C', 'x0', 'x1', 'x2'])
Then with
x = ['x0', 'x1', 'x2']
you can use the following notation, which allows a quite general definition of x:
>>> df[x].sum(axis=1)
0 15
1 18
dtype: int64
Look for the columns whose names start with 'x' and perform the operations you need:
column_num=[col for col in df.columns if col.startswith('x')]
df[column_num].sum(axis=1)
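If you prefer a built-in for this kind of selection, df.filter does the same job; a small sketch with the column names used above:
df.filter(like='x').sum(axis=1)    # columns whose name contains 'x'
df.filter(regex='^x').sum(axis=1)  # columns whose name starts with 'x'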
I'll give you another answer, which differs from your initial data structure in exchange for addressing the values of the dataframe by df.x[0] etc.
Consider that you have defined your dataframe like this:
>>> dv = pd.DataFrame(np.random.randint(10, size=20),
...                   index=pd.MultiIndex.from_product([range(4), range(5)]),
...                   columns=['x'])
>>> dv
     x
0 0  8
  1  3
  2  4
  3  6
  4  1
1 0  8
  1  9
  2  1
  3  8
  4  8
[...]
Then you can do exactly this:
dv.x[1]
0 8
1 9
2 1
3 8
4 8
Name: x, dtype: int64
which is your desired notation. It requires some changes to your initial set-up but gives you exactly what you want.
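If your data already sits in the original wide layout, one hypothetical way to reshape it into this form (column names assumed from the earlier answers) is to stack the x-columns:
wide = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                     [2, 3, 4, 5, 6, 7]],
                    columns=['A', 'B', 'C', 'x0', 'x1', 'x2'])
x = wide[['x0', 'x1', 'x2']].copy()
x.columns = range(3)            # rename x0..x2 to subscripts 0..2
dv = x.T.stack().to_frame('x')  # MultiIndex of (subscript, original row)
dv.x[0]                         # the x[0] series across all rows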
Here is my example:
import pandas as pd
df = pd.DataFrame({'col_1':[1,5,6,77,9],'col_2':[6,2,4,2,5]})
df.index = [8,9,10,11,12]
This sub-setting is by row order:
df.col_1[2:5]
returns
10 6
11 77
12 9
Name: col_1, dtype: int64
while this subsetting is by index label and does not work:
df.col_1[2]
returns:
KeyError: 2
I find it very confusing and am curious what the reason behind it is.
Your statements are ambiguous, so it is best to explicitly define what you want.
df.col_1[2:5] works like df.col_1.iloc[2:5], using integer location.
Whereas df.col_1[2] works like df.col_1.loc[2], using index label location; since there is no index label 2, you get the KeyError.
Hence it is best to state explicitly whether you are using integer location with .iloc or index label location with .loc.
See Pandas Indexing docs.
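For example, with the question's frame (integer index 8 through 12):
df.col_1.iloc[2]    # position 2 -> 6
df.col_1.loc[10]    # label 10  -> 6 (same row, addressed by label)
df.col_1.iloc[2:5]  # positions 2..4, equivalent to df.col_1[2:5]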
Let's assume this is the initial DataFrame:
df = pd.DataFrame(
    {
        'col_1': [1, 5, 6, 77, 9],
        'col_2': [6, 2, 4, 2, 5]
    },
    index=list('abcde')
)
df
Out:
   col_1  col_2
a      1      6
b      5      2
c      6      4
d     77      2
e      9      5
The index consists of strings so it is generally obvious what you are trying to do:
df['col_1']['b'] You passed a string so you are probably trying to access by label. It returns 5.
df['col_1'][1] You passed an integer so you are probably trying to access by position. It returns 5.
Same deal with slices: df['col_1']['b':'d'] uses labels and df['col_1'][1:4] uses positions.
When the index is also integer, nothing is obvious anymore.
df = pd.DataFrame(
    {
        'col_1': [1, 5, 6, 77, 9],
        'col_2': [6, 2, 4, 2, 5]
    },
    index=[8, 9, 10, 11, 12]
)
df
Out:
    col_1  col_2
8       1      6
9       5      2
10      6      4
11     77      2
12      9      5
Let's say you type df['col_1'][8]. Are you trying to access by label or by position? What if it was a slice? Nobody knows.
At this point, pandas chooses one of them based on common usage. It is, in the end, a Series, and what distinguishes a Series from an array is its labels, so the choice for df['col_1'][8] is labels. Slicing with labels is not that common, so pandas is being smart here and uses positions when you pass a slice. Is it inconsistent? Yes. Should you avoid it? Yes. This is the main reason ix was deprecated.
Explicit is better than implicit, so use either iloc or loc when there is room for ambiguity. loc will raise a KeyError if you pass a position that is not an actual label, and iloc will raise an IndexError (or a TypeError for non-integer keys) if you try to access by label.
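A minimal demonstration with the integer-indexed frame above:
df['col_1'].loc[8]     # label 8  -> 1 (first row)
df['col_1'].iloc[0]    # position 0 -> 1 (same row, addressed explicitly)
# df['col_1'].loc[0]   # raises KeyError: 0 is not a label
# df['col_1'].iloc[8]  # raises IndexError: there are only 5 rows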
I'm trying to check if all my expected values are in a pandas dataframe. The expected values are known ahead of time and the dataframe is automatically generated from a database query.
This is an example of what I'm trying to do
import pandas as pd
import StringIO
expected_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
csv = StringIO.StringIO("""ExpectedID,Random Value
1,val1
2,val2
3,val3
8,val8
9,val9
10,val10
""")
df = pd.read_csv(csv, sep=",")
for e in expected_ids:
    if e not in df['ExpectedID']:
        print "Missing: ", e
My problem is that I have to check each value I'm expecting individually and in my real code there are approximately 14000 of these. I'd also like to pull the missing ones into another dataframe that I can manipulate later but don't know how to do that.
The other problem I have is that the above prints this:
Missing: 6
Missing: 7
Missing: 8
Missing: 9
Missing: 10
Those values aren't all correct. I am missing 6 and 7, but 8, 9, and 10 are in the df. It also doesn't say that 4 and 5 are missing.
How can I accurately check if multiple values are in a dataframe column?
df['ExpectedID'] is a Series, and like a dict, a membership test checks its index labels, not its values:
In [5]: df.ExpectedID
Out[5]:
0 1
1 2
2 3
3 8
4 9
5 10
Name: ExpectedID, dtype: int64
In [6]: 0 in df['ExpectedID']
Out[6]: True
You should test for membership in df['ExpectedID'].values instead.
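To check all 14000 expected values at once and collect the missing ones into another dataframe, a vectorized sketch using isin, continuing from the snippet above:
expected = pd.DataFrame({'ExpectedID': expected_ids})
missing = expected[~expected['ExpectedID'].isin(df['ExpectedID'])]
print missing  # the rows with ExpectedID 4, 5, 6, 7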
I'm converting a financial spreadsheet into Pandas, and this is a frequent challenge that comes up.
In excel, suppose you have some calculation that for columns 0:n, the value depends on the previous column [shown in format Cell (row, column)]: Cell(1,n) = (Cell(1,n-1)^2)*5.
Obviously, for n=2, you could create a calculated column in Pandas:
df[2] = (df[1] ** 2) * 5
But for a chain of say 30, that doesn't work. So currently, I am using a for loop.
total_columns_needed = list(range(1, 100))
for i in total_columns_needed:
    df[i] = (df[i - 1] ** 2) * 5  # ** is power in Python; ^ would be bitwise XOR
That loop works fine, but I am trying to see how I could use map and apply to make this look cleaner. From reading, apply is a loop underneath, so I'm not sure whether I will gain any speed by doing this. But it could shrink the code by a lot.
The problem that I've had with:
df.apply()
is that 1) there could be other columns not involved in the calculation (which arguably shouldn't be there if the data is properly normalised), and 2) the columns don't exist yet. Part 2 could possibly be solved by creating the dataframe with all the needed columns, but I'm trying to avoid that for other reasons.
Any help in solving this greatly appreciated!
To automatically generate a bunch of columns, without a loop:
In [433]:
df = pd.DataFrame({'Val': [0,1,2,3,4]})
In [434]:
print df.Val.apply(lambda x: pd.Series(x+np.arange(0,25,5)))
   0  1   2   3   4
0  0  5  10  15  20
1  1  6  11  16  21
2  2  7  12  17  22
3  3  8  13  18  23
4  4  9  14  19  24
numpy.arange(0,25,5) gives you array([ 0, 5, 10, 15, 20]). For each of the values in Val, we will add that value to array([ 0, 5, 10, 15, 20]), creating a new Series.
And finally, apply puts the new Series together into a new DataFrame.
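Note that this works because each column here is a closed-form function of Val. For the genuinely recursive chain in the question, each column depends on the previous one, so the computation is inherently sequential; what you can still do is build the columns in a plain dict and assemble the DataFrame once, which avoids repeated column insertion. A sketch, with a hypothetical seed column:
import pandas as pd
df = pd.DataFrame({0: [0.1, 0.2, 0.3]})  # hypothetical seed; small values avoid overflow
cols = {0: df[0]}
for n in range(1, 30):
    cols[n] = cols[n - 1] ** 2 * 5       # ** is power, not ^ (bitwise XOR)
df = pd.DataFrame(cols)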
Let's say I have a DataFrame like this:
df
     A  B
5    0  1
18   2  3
125  4  5
where 5, 18, 125 are the index
I'd like to get the line before (or after) a certain index. For instance, I have index 18 (e.g. obtained by df[df.A==2].index), and I want to get the line before it, without knowing that this line has 5 as an index.
2 sub-questions:
How can I get the position of index 18? Something like df.loc[18].get_position() which would return 1 so I could reach the line before with df.iloc[df.loc[18].get_position()-1]
Is there another solution, a bit like the -C, -A or -B options of grep?
For your first question:
base = df.index.get_indexer_for(df[df.A == 2].index)
or alternatively
base = df.index.get_loc(18)
To get the surrounding ones:
mask = pd.Index(base).union(pd.Index(base - 1)).union(pd.Index(base + 1))
I used Indexes and unions to remove duplicates. You may want to keep them, in which case you can use np.concatenate instead.
Be careful with matches on the very first or last rows :)
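Putting the pieces together on the question's frame (a sketch; the positional guard handles the first/last-row caveat):
import pandas as pd
df = pd.DataFrame({"A": [0, 2, 4], "B": [1, 3, 5]}, index=[5, 18, 125])
base = df.index.get_indexer_for(df[df.A == 2].index)  # array([1])
mask = pd.Index(base).union(pd.Index(base - 1)).union(pd.Index(base + 1))
mask = mask[(mask >= 0) & (mask < len(df))]           # drop out-of-range positions
df.iloc[mask]                                         # the rows labelled 5, 18, 125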
If you need to convert more than 1 index, you can use np.where.
Example:
# df
     A  B
5    0  1
18   2  3
125  4  5
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [0,2,4], "B": [1,3,5]}, index=[5,18,125])
np.where(df.index.isin([18,125]))
Output:
(array([1, 2]),)
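Those positions can then feed iloc to fetch the neighbouring rows, in the spirit of grep -B/-A; a sketch:
pos = np.where(df.index.isin([18]))[0]  # array([1])
df.iloc[pos - 1]                        # the row before (label 5)
df.iloc[pos + 1]                        # the row after (label 125)
# mind the edges: pos - 1 becomes -1 at the first row and wraps around with iloc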