I am having trouble with conditionals / boolean indexing. I am trying to populate a dataframe (dfp) with logic which is conditional on data from a similarly shaped dataframe (dfs) plus the previous row of itself (dfp).
This is my latest fail...
import pandas as pd
dfs = pd.DataFrame({'a':[1,0,-1,0,1,0,0,-1,0,0],'b':[0,1,0,0,-1,0,1,0,-1,0]})
In [171]: dfs
Out[171]:
a b
0 1 0
1 0 1
2 -1 0
3 0 0
4 1 -1
5 0 0
6 0 1
7 -1 0
8 0 -1
9 0 0
dfp = pd.DataFrame(index=dfs.index,columns=dfs.columns)
dfp[(dfs==1)|((dfp.shift(1)==1)&(dfs!=-1))] = 1
In [166]: dfp.fillna(0)
Out[166]:
a b
0 1.0 0.0
1 0.0 1.0
2 0.0 0.0
3 0.0 0.0
4 1.0 0.0
5 0.0 0.0
6 0.0 1.0
7 0.0 0.0
8 0.0 0.0
9 0.0 0.0
So I would like dfp to have a 1 in row n if either of two conditions is met:
1) dfs in the same row equals 1, or 2) dfp in the previous row equals 1 and dfs in the same row is not -1
I would like my final output to look like this:
a b
0 1 0
1 1 1
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 0 1
8 0 0
9 0 0
UPDATE / EDIT:
Sometimes the visual is more helpful - below is how it would map out in Excel.
Thanks in advance, very grateful for your time.
Let's summarize the invariants:
If the dfs value is 1 then the dfp value is 1.
If the dfs value is -1 then the dfp value is 0.
If the dfs value is 0 then the dfp value is 1 if the previous dfp value is 1 otherwise it's 0.
Or, to formulate it another way:
The dfp column starts with 1 if the first dfs value is 1, otherwise 0.
The dfp values stay 0 until there is a 1 in dfs.
The dfp values stay 1 until there is a -1 in dfs.
This is very easy to formulate in Python:
import numpy as np

def create_new_column(dfs_col):
    newcol = np.zeros_like(dfs_col)
    # seed the state from the first value
    if dfs_col[0] == 1:
        last = 1
    else:
        last = 0
    for idx, val in enumerate(dfs_col):
        # a -1 switches the state off, a 1 switches it back on
        if last == 1 and val == -1:
            last = 0
        if last == 0 and val == 1:
            last = 1
        newcol[idx] = last
    return newcol
And the test:
>>> create_new_column(dfs.a)
array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0], dtype=int64)
>>> create_new_column(dfs.b)
array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=int64)
However, this is very inefficient in plain Python, because iterating over NumPy arrays (and pandas Series/DataFrames) is slow and Python for-loops carry a lot of overhead.
If you have numba or Cython, though, you can compile this and it will probably be faster than any NumPy solution, because NumPy would require several rolling and/or accumulate operations.
For example with numba:
>>> import numba
>>> numba_version = numba.njit(create_new_column) # compilation step
>>> numba_version(np.asarray(dfs.a)) # need cast to np.array
array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0], dtype=int64)
>>> numba_version(np.asarray(dfs.b)) # need cast to np.array
array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=int64)
Even if dfs has millions of rows the numba solution will take only milliseconds:
>>> dfs = pd.DataFrame({'a':np.random.randint(-1, 2, 1000000),'b':np.random.randint(-1, 2, 1000000)})
>>> %timeit numba_version(np.asarray(dfs.b))
100 loops, best of 3: 9.37 ms per loop
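To get back a DataFrame shaped like dfp, the compiled function can be applied column by column. A minimal sketch, assuming the column layout from the question:
dfp = pd.DataFrame(
    {col: numba_version(dfs[col].to_numpy()) for col in dfs.columns},
    index=dfs.index,
)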
Not the best way to do it but something that works.
dfs = pd.DataFrame({'a':[1,0,-1,0,1,0,0,-1,0,0],'b':[0,1,0,0,-1,0,1,0,-1,0]})
dfp = dfs.copy()
Define the function as follows. Usage of 'last' here is a little hacky.
last = [0]

def f(x):
    if x == 1:
        x = 1
    elif x != -1 and last[0] == 1:
        x = 1
    else:
        x = 0
    last[0] = x
    return x
Simply apply the func f on each column.
dfp.a = dfp.a.apply(f)
dfp
a b
0 1 0
1 1 1
2 0 0
3 0 0
4 1 -1
5 1 0
6 1 1
7 0 0
8 0 -1
9 0 0
Similarly for col b. Don't forget to re-initialize 'last'.
last[0] = 0
dfp.b = dfp.b.apply(f)
dfp
a b
0 1 0
1 1 1
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 0 1
8 0 0
9 0 0
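For reference, the same recurrence can be written without the mutable last list by folding the state with itertools.accumulate. A minimal sketch (assuming Python 3.8+ for the initial= keyword):
from itertools import accumulate
import pandas as pd

def new_col(s):
    # state: 1 after a 1 in dfs, 0 after a -1, otherwise carried forward
    step = lambda last, val: 1 if val == 1 else (0 if val == -1 else last)
    return pd.Series(list(accumulate(s, step, initial=0))[1:], index=s.index)

dfp = dfs.apply(new_col)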
I'm a Python newbie and need help with a specific task. My main goal is to identify, within each row, all values greater than 0 together with their column names, and to write these values below each other into another column of the same row.
Here is what I tried:
import pandas as pd
import numpy as np
table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
# create a new column that sums up the row
df['summary'] = 'NoData'
# print the header
print(df.columns.values)
A B C D summary
0 0 4 2 1 NoData
1 2 1 9 0 NoData
2 0 3 0 1 NoData
3 5 0 6 6 NoData
# get length of rows and columns
row = len(df.index)
column = len(df.columns)
# If a value at a specific index is greater than 0, take the column name
# and the value at that index and print it into the column 'summary'.
# Also write all values greater than 0 within a row below each other.
for i in range(row):
    for j in range(column):
        if df.iloc[i][j] > 0:
            df.at[i, 'summary'] = df.columns(df.iloc[i][j]) + '\n'
I hope it is a bit clear what I want to achieve. Here is a picture of how the result should look in the column 'summary'
You don't really need a for loop.
Starting with df:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
You can do:
# Define a helper function
def f(val, col_name):
    # You can modify this function in order to customize the summary string
    return "" if val == 0 else str(val) + col_name + "\n"

# Assign the summary column
df["summary"] = df.apply(lambda x: x.apply(f, args=(x.name,))).sum(axis=1).str[:-1]
Output:
A B C D summary
0 0 4 2 1 4B\n2C\n1D
1 2 1 9 0 2A\n1B\n9C
2 0 3 0 1 3B\n1D
3 5 0 6 6 5A\n6C\n6D
It works for longer column names as well:
one two three four summary
0 0 4 2 1 4two\n2three\n1four
1 2 1 9 0 2one\n1two\n9three
2 0 3 0 1 3two\n1four
3 5 0 6 6 5one\n6three\n6four
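For comparison, the same summary can also be built with a single row-wise join. A minimal sketch, assuming it is applied before the summary column exists:
df["summary"] = df.apply(
    lambda row: "\n".join(f"{v}{c}" for c, v in row.items() if v > 0),
    axis=1,
)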
Try this:
import pandas as pd
import numpy as np
table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
print(f'\n\n-------------BREAK-----------\n\n')
def func(line):
    templist = ''
    list_col = line.index.values.tolist()
    temp = line.values.tolist()
    for x in range(0, len(temp)):
        if temp[x] <= 0:
            pass
        else:
            if templist == '':  # first entry in this row, no leading newline
                templist = f"{temp[x]}{list_col[x]}"
            else:
                templist = f"{templist}\n{temp[x]}{list_col[x]}"
    return templist
df['summary'] = df.apply(func, axis = 1)
print(df)
Output:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
-------------BREAK-----------
A B C D summary
0 0 4 2 1 4B\n2C\n1D
1 2 1 9 0 2A\n1B\n9C
2 0 3 0 1 3B\n1D
3 5 0 6 6 5A\n6C\n6D
If we have a pandas data frame and a mapping dictionary for the values in the data frame, replacing the values in the data frame using the dictionary as a mapping can be done like so:
In: df
Out:
Col1 Col2
0 a c
1 b c
2 b c
In: key
Out: {'a': 1, 'b': 2, 'c': 3}
In: df.replace(key)
Out:
Col1 Col2
0 1 3
1 2 3
2 2 3
How can a similar transformation be accomplished when the mapping dictionary has lists as values? For example:
In: key
Out: {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 1]}
In: df.replace(key)
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 1 output values where the mask is true
In this example, the end goal would be to have a new data frame that has 3 rows and 6 columns:
1 0 0 0 0 1
0 1 0 0 0 1
0 1 0 0 0 1
IIUC, you can applymap+explode+reshape:
df2 = df.applymap(key.get).explode(list(df.columns))
df2 = (df2
       .set_index(df2.groupby(level=0).cumcount(), append=True)
       .unstack(level=1)
       )
output:
Col1 Col2
0 1 2 0 1 2
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
NB. to reset the columns: df2.columns = range(df2.shape[1])
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
You can use a combination of DataFrame.apply and Series.map to perform this substitution. From there, you can use DataFrame.sum to concatenate the lists and then cast the data back into a new DataFrame:
out = pd.DataFrame(
    df.apply(lambda s: s.map(key)).sum(axis=1).tolist()
)
print(out)
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
Semi-related testing of .sum vs .chain:
In [22]: %timeit tmp_df.sum(axis=1)
77.6 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [23]: %timeit tmp_df.apply(lambda row: list(chain.from_iterable(row)), axis=1)
197 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [24]: tmp_df
Out[24]:
Col1 Col2
0 [1, 0, 0] [0, 0, 1]
1 [0, 1, 0] [0, 0, 1]
2 [0, 1, 0] [0, 0, 1]
While I won't say that .sum is the best method for concatenating lists in a Series, .apply with chain.from_iterable doesn't seem to fare much better, at least on a very small sample like this.
Hmm, this is tricky.
One solution I came up with is to convert the lists to their string representation before replacing with them, because pandas treats lists specially. Then you can use itertools.chain.from_iterable on each row to combine all the lists into one big list, and create a dataframe out of that:
import ast
from itertools import chain
n = df.replace({k: str(v) for k, v in key.items()}).applymap(ast.literal_eval)
df = pd.DataFrame(n.apply(lambda x: list(chain.from_iterable(x)), axis=1).tolist())
Output:
>>> df
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
Here's a method of replacing the items with lists without looping or stringifying:
df[:] = pd.Series(key)[df.to_numpy().flatten()].to_numpy().reshape(df.shape)
Output:
>>> df
Col1 Col2
0 [1, 0, 0] [0, 0, 1]
1 [0, 1, 0] [0, 0, 1]
2 [0, 1, 0] [0, 0, 1]
Or, you can use explode and reshape to convert the data directly to a numpy array:
arr = pd.Series(key)[df.to_numpy().flatten()].explode().to_numpy().reshape(-1, 6) # 6 = len of one of the items of `key` * number of columns in df
Output:
>>> arr
array([[1, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 1]], dtype=object)
>>> pd.DataFrame(arr)
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
I need to find where the rows in ABC all have the value 1 and then create a new column that has the result.
My idea is to use np.where() with some condition, but I don't know the correct way of dealing with this problem. From what I have read, I'm not supposed to iterate through a DataFrame, but rather use one of pandas' vectorized methods?
df1 = pd.DataFrame({'A': [0, 1, 1, 0],
                    'B': [1, 1, 0, 1],
                    'C': [0, 1, 1, 1]},
                   index=[0, 1, 2, 4])
print(df1)
what I am after is this:
A B C TRUE
0 0 1 0 0
1 1 1 1 1 <----
2 1 0 1 0
4 0 1 1 0
If the data is always 0/1, you can simply take the product per row:
df1['TRUE'] = df1.prod(axis=1)
output:
A B C TRUE
0 0 1 0 0
1 1 1 1 1
2 1 0 1 0
4 0 1 1 0
This is what you are looking for:
df1["TRUE"] = (df1==1).all(axis=1).astype(int)
I have a DataFrame like below and would like for B to be 1 for n rows after the 1 in column A (where below n = 2)
index A B
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
7 0 0
8 1 0
9 0 1
I think I can do it using .ix similar to this example but I'm not sure how. I'd like to do it in a single pandas-style selection command if possible. (Ideally not using rolling_apply.)
Modifying a subset of rows in a pandas dataframe
EDIT: the application is that the 1 in column A is "ignored" if it falls within n rows of the previous 1. As per the comments, for n = 2 then, and these example:
A = [1, 0, 1, 0, 1], B should be = [0, 1, 1, 0, 0]
A = [1, 1, 0, 0], B should be [0, 1, 1, 0]
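For reference, a minimal loop-based sketch of the rule described in the edit above (mark_after is a hypothetical helper name, and this is a plain loop rather than the single selection command asked for):
import numpy as np

def mark_after(a, n=2):
    # set B to 1 for the n rows after each effective 1 in A;
    # a 1 within n rows of the previous effective 1 is ignored
    b = np.zeros(len(a), dtype=int)
    last = -n - 1  # index of the last effective 1
    for i, val in enumerate(a):
        if val == 1 and i - last > n:
            last = i
            b[i + 1:i + 1 + n] = 1
    return b

df['B'] = mark_after(df['A'].to_numpy())  # df assumed to hold column A as in the question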
I have data of the following format:
Col1 Col2 Col3
1, 1424549456, "3 4"
2, 1424549457, "2 3 4 5"
& have successfully read it into pandas.
How can I turn Col3 to a numpy matrix of the following form:
# each value needs to become a 1 in the index of the col
# i.e. in the above example 3 is the 4th value, thus
# it is [0 0 0 1] [0 indexing is included]
mtx = [0 0 0 1 1 0 # corresponds to first row
0 0 1 1 1 1]; # corresponds to second row
Thanks for any help you can provide!
Since 0.13.1 there's str.get_dummies:
In [11]: s = pd.Series(["3 4", "2 3 4 5"])
In [12]: s.str.get_dummies(sep=" ")
Out[12]:
2 3 4 5
0 0 1 1 0
1 1 1 1 1
You have to ensure the columns are integers (rather than strings) and reindex:
In [13]: df = s.str.get_dummies(sep=" ")
In [14]: df.columns = df.columns.map(int)
In [15]: df.reindex(columns=np.arange(6), fill_value=0)
Out[15]:
0 1 2 3 4 5
0 0 0 0 1 1 0
1 0 0 1 1 1 1
To get the numpy values use .values:
In [16]: df.reindex(columns=np.arange(6), fill_value=0).values
Out[16]:
array([[0, 0, 0, 1, 1, 0],
[0, 0, 1, 1, 1, 1]])
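Applied to the question's frame, the same steps would look like this sketch (raw is a hypothetical name for the original frame with its Col3 column):
dummies = raw['Col3'].str.get_dummies(sep=' ')
dummies.columns = dummies.columns.map(int)
mtx = dummies.reindex(columns=np.arange(6), fill_value=0).values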
If there's not a lot of data you can do something like:
res = []

def f(v):
    r = np.zeros(6, int)  # np.int was removed in recent NumPy; plain int works
    r[[int(i) for i in v.split()]] = 1  # fancy indexing needs a list of ints in Python 3
    res.append(r)

df.Col3.apply(f)
mat = np.array(res)
# if you really want it to be a matrix, you can do
mat = np.matrix(res)
check out this link for more info