I have a DataFrame like below,
df1
col1
0 10
1 [5, 8, 11]
2 15
3 12
4 13
5 33
6 [12, 19]
Code to generate this df1:
df1 = pd.DataFrame({"col1":[10,[5,8,11],15,12,13,33,[12,19]]})
df2
col1 col2
0 12 1
1 10 2
2 5 3
3 11 10
4 7 5
5 13 4
6 8 7
Code to generate this df2:
df2 = pd.DataFrame({"col1":[12,10,5,11,7,13,8],"col2":[1,2,3,10,5,4,7]})
I want to replace the elements of df1 with the corresponding values from df2.
If the series contained only scalar (non-list) values,
I could simply replace them with map:
df1['res'] = df1['col1'].map(df2.set_index('col1')["col2"].to_dict())
But this series contains a mix of lists and scalars.
How can I efficiently replace both the scalar values and the elements inside the lists?
Expected Output
col1 res
0 10 2
1 [5, 8, 11] [3, 7, 10]
2 15 15
3 12 1
4 13 4
5 33 33
6 [12, 19] [1, 19]
Your series is of dtype object, as it contains int and list objects. This is inefficient for Pandas and means a vectorised solution won't be possible.
You can create a mapping dictionary and use pd.Series.apply. To account for list objects, you can catch TypeError. You meet this specific error for lists since they are not hashable, and therefore cannot be used as dictionary keys.
d = df2.set_index('col1')['col2'].to_dict()

def mapvals(x):
    try:
        return d.get(x, x)
    except TypeError:
        # lists are unhashable, so d.get raises TypeError; map each element instead
        return [d.get(i, i) for i in x]

df1['res'] = df1['col1'].apply(mapvals)
print(df1)
col1 res
0 10 2
1 [5, 8, 11] [3, 7, 10]
2 15 15
3 12 1
4 13 4
5 33 33
6 [12, 19] [1, 19]
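If you would rather not rely on catching the exception, an equivalent sketch (assuming the column only ever holds scalars or plain Python lists; mapvals_explicit is a hypothetical name) branches on the type explicitly:

def mapvals_explicit(x):
    # hypothetical variant of mapvals: check the type instead of catching TypeError
    if isinstance(x, list):
        return [d.get(i, i) for i in x]
    return d.get(x, x)

df1['res'] = df1['col1'].apply(mapvals_explicit)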
Related
Suppose I have a heterogeneous dataframe:
a b c d
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
4 13 14 15 16
And I want to stack the rows like so:
a b c d
1 1,5,9,13 2,6,10,14 3,7,11,15 4,8,12,16
Etc...
All the references for groupby etc. seem to require some grouping key; I just want to put x rows into columns, regardless of their content. Each row has a timestamp, and I am looking to group values by sample count, so I want one row containing all the values of x sample rows as columns.
I should end up with a dataframe that has x * (original number of columns) columns and (original number of rows) / x rows.
I'm sure there must be some simple method I'm missing here that avoids a series of loops.
If you need to join all values into strings, use:
df1 = df.astype(str).agg(','.join).to_frame().T
print (df1)
a b c d
0 1,5,9,13 2,6,10,14 3,7,11,15 4,8,12,16
Or if you need to create lists, use:
df2 = pd.DataFrame([[list(df[x]) for x in df]], columns=df.columns)
print (df2)
a b c d
0 [1, 5, 9, 13] [2, 6, 10, 14] [3, 7, 11, 15] [4, 8, 12, 16]
If you need scalars with a MultiIndex (generated from the index and column labels), use:
df3 = df.unstack().to_frame().T
print (df3)
a b c d
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
0 1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16
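If you need this for every block of x rows rather than for the whole frame (as the question describes), a rough sketch of one way to do it, assuming the row count is a multiple of a hypothetical block size x and reusing the question's df:

import numpy as np

x = 2  # hypothetical block size
block = np.arange(len(df)) // x  # which block each row belongs to
pos = np.arange(len(df)) % x     # position of the row inside its block

df4 = df.set_index([block, pos]).unstack()
# df4 has len(df) // x rows and x * df.shape[1] columns (MultiIndex columns)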
So let's say I have a dataframe, which is created like this and has 3 products A, B, C:
df = pd.DataFrame({'type' : ['A','A','B','B','C','C'], 'x' : [1,2,3,4,5,6]})
You can print it and see that it looks like this:
type x
0 A 1
1 A 2
2 B 3
3 B 4
4 C 5
5 C 6
Now I create a function called f, which returns a tuple:
def f(x):
    return x*2, x*3, x*4
And I apply this to the dataframe with a groupby on type:
df.groupby('type').apply(lambda x : f(x.x))
Now the result is a Series of 3-tuples, as below. But how do I merge it back into the dataframe correctly?
type
A ([2, 4], [3, 6], [4, 8])
B ([6, 8], [9, 12], [12, 16])
C ([10, 12], [15, 18], [20, 24])
dtype: object
What I want to see is
type x a b c
A 1 2 3 4
A 2 4 6 8
B 3 6 9 12
B 4 8 12 16
C 5 10 15 20
C 6 12 18 24
EDITED:
Please note that I gave the function f as a very simple example, which makes it look as if I could just create the new columns directly with multiplication. But imagine a more complex f that uses 3 columns and then generates tuples that are not straightforward column multiplications.
That is why I asked this question.
The real function in question is talib.BBANDS.
Assuming that in your real case the groupby is needed and your function takes several columns as input and returns several columns as output, your function could return a DataFrame:
def f(x):
    return pd.DataFrame({'a': x*2, 'b': x*3, 'c': x*4}, index=x.index)

# then assign directly or use join
df[['a','b','c']] = df.groupby('type').apply(lambda x : f(x.x))
print(df)
type x a b c
0 A 1 2 3 4
1 A 2 4 6 8
2 B 3 6 9 12
3 B 4 8 12 16
4 C 5 10 15 20
5 C 6 12 18 24
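As a sketch of the "use join" alternative mentioned in the comment above (assuming the columns have not already been assigned; group_keys=False keeps the original index so the join aligns row for row):

res = df.groupby('type', group_keys=False).apply(lambda x: f(x.x))
df = df.join(res)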
Edit: with the name of the function used (talib.BBANDS), I guess you can create a wrapper:
def f(x):
    upper, middle, lower = talib.BBANDS(x, ...)  # enter the parameters you need
    return pd.DataFrame({'upper': upper, 'middle': middle, 'lower': lower},
                        index=x.index)

df[['upper','middle','lower']] = df.groupby('type').apply(lambda x : f(x.x))
import pandas as pd

df = pd.DataFrame({'type' : ['A','A','B','B','C','C'], 'x' : [1,2,3,4,5,6]})
newcol = df['x']**2
df['x**2'] = newcol
df
output:
  type  x  x**2
0    A  1     1
1    A  2     4
2    B  3     9
3    B  4    16
4    C  5    25
5    C  6    36
I have the following dataframe:
import pandas as pd
import numpy as np
data = {
"index": [1, 2, 3, 4, 5],
"A": [11, 17, 5, 9, 10],
"B": [8, 6, 16, 17, 9],
"C": [10, 17, 12, 13, 15],
"target": [12, 13, 8, 6, 12]
}
df = pd.DataFrame.from_dict(data)
print(df)
I would like to find the value nearest to target among columns A, B and C, and put that value into a result column. As far as I know, I need to use the abs() and argmin() functions.
Here is the output I expected:
index A B C target result
0 1 11 8 10 12 11
1 2 17 6 17 13 17
2 3 5 16 12 8 5
3 4 9 17 13 6 9
4 5 10 9 15 12 10
Here is the attempted solution, with some Stack Overflow links I found that may help:
(df.assign(closest=df.apply(lambda x: x.abs().argmin(), axis='columns'))
.apply(lambda x: x[x['target']], axis='columns'))
Identifying closest value in a column for each filter using Pandas
https://codereview.stackexchange.com/questions/204549/lookup-closest-value-in-pandas-dataframe
Subtract "target" from the other columns, use idxmin to get the column of the minimum absolute difference, followed by a lookup:
idx = df.drop(['index', 'target'], 1).sub(df.target, axis=0).abs().idxmin(1)
df['result'] = df.lookup(df.index, idx)
df
index A B C target result
0 1 11 8 10 12 11
1 2 17 6 17 13 17
2 3 5 16 12 8 5
3 4 9 17 13 6 9
4 5 10 9 15 12 10
General solution handling string columns and NaNs (along with your requirement of replacing NaN values in target with value in "v1"):
df2 = df.select_dtypes(include=[np.number])
idx = df2.drop(['index', 'target'], 1).sub(df2.target, axis=0).abs().idxmin(1)
df['result'] = df2.lookup(df2.index, idx.fillna('v1'))
You can also index into the underlying NumPy array by getting integer indices using df.columns.get_indexer.
# idx = df[['A', 'B', 'C']].sub(df.target, axis=0).abs().idxmin(1)
idx = df.drop(['index', 'target'], 1).sub(df.target, axis=0).abs().idxmin(1)
# df['result'] = df.values[np.arange(len(df)), df.columns.get_indexer(idx)]
df['result'] = df.values[df.index, df.columns.get_indexer(idx)]
df
index A B C target result
0 1 11 8 10 12 11
1 2 17 6 17 13 17
2 3 5 16 12 8 5
3 4 9 17 13 6 9
4 5 10 9 15 12 10
You can use NumPy positional integer indexing with argmin:
col_lst = list('ABC')
col_indices = df[col_lst].sub(df['target'], axis=0).abs().values.argmin(1)
df['result'] = df[col_lst].values[np.arange(len(df.index)), col_indices]
Or you can lookup column labels with idxmin:
col_labels = df[list('ABC')].sub(df['target'], axis=0).abs().idxmin(1)
df['result'] = df.lookup(df.index, col_labels)
print(df)
index A B C target result
0 1 11 8 10 12 11
1 2 17 6 17 13 17
2 3 5 16 12 8 5
3 4 9 17 13 6 9
4 5 10 9 15 12 10
The principle is the same, though for larger dataframes you may find NumPy more efficient:
# Python 3.7, NumPy 1.14.3, Pandas 0.23.0

def np_lookup(df):
    col_indices = df[list('ABC')].sub(df['target'], axis=0).abs().values.argmin(1)
    df['result'] = df[list('ABC')].values[np.arange(len(df.index)), col_indices]
    return df

def pd_lookup(df):
    col_labels = df[list('ABC')].sub(df['target'], axis=0).abs().idxmin(1)
    df['result'] = df.lookup(df.index, col_labels)
    return df

df = pd.concat([df]*10**4, ignore_index=True)
assert df.pipe(pd_lookup).equals(df.pipe(np_lookup))

%timeit df.pipe(np_lookup)  # 7.09 ms
%timeit df.pipe(pd_lookup)  # 67.8 ms
I have the following Dataframe as input:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
df = pd.DataFrame(l)
print(df)
0
0 2
1 2
2 2
3 5
4 5
5 5
6 3
7 3
8 2
9 2
10 4
11 4
12 6
13 5
14 5
15 3
16 5
As output I would like the total count of sequences that meet a certain condition. For example, in this case, I want the number of sequences whose values are greater than 3.
So, the output is 3.
1st Sequence = [555]
2nd Sequence = [44655]
3rd Sequence = [5]
Is there a way to calculate this without a for-loop in pandas?
I have already implemented a solution using a for-loop, and I wonder if there is a better approach using pandas in O(N) time.
Thanks very much!
Related to this question: How to count the number of time intervals that meet a boolean condition within a pandas dataframe?
You can use:
m = df[0] > 3          # mask of values meeting the condition
df[1] = (~m).cumsum()  # group id: increments each time the condition breaks
df = df[m]             # keep only rows that meet the condition
print (df)
0 1
3 5 3
4 5 3
5 5 3
10 4 7
11 4 7
12 6 7
13 5 7
14 5 7
16 5 8
#create tuples
df = df.groupby(1)[0].apply(tuple).value_counts()
print (df)
(5, 5, 5) 1
(4, 4, 6, 5, 5) 1
(5,) 1
Name: 0, dtype: int64
#alternatively create strings
df = df.astype(str).groupby(1)[0].apply(''.join).value_counts()
print (df)
5 1
44655 1
555 1
Name: 0, dtype: int64
If need output as list:
print (df.astype(str).groupby(1)[0].apply(''.join).tolist())
['555', '44655', '5']
Detail:
print (df.astype(str).groupby(1)[0].apply(''.join))
3 555
7 44655
8 5
Name: 0, dtype: object
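If you only need the final count (3 here), a minimal sketch using the same masking idea, run against the original one-column df before it is overwritten above:

m = df[0] > 3
# each False increments the counter, so every surviving run shares a single id
print((~m).cumsum()[m].nunique())
#3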
If you don't need pandas this will suit your needs:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
def consecutive(array, value):
    result = []
    sub = []
    for item in array:
        if item > value:
            sub.append(item)
        else:
            if sub:
                result.append(sub)
                sub = []
    if sub:
        result.append(sub)
    return result
print(consecutive(l,3))
#[[5, 5, 5], [4, 4, 6, 5, 5], [5]]
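The count the question asks for is then simply the length of that list:

print(len(consecutive(l, 3)))
#3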
I have a DataFrame with these columns:
ID int64
Key int64
Reference object
sKey float64
sName float64
fKey int64
cName object
ints int32
I want to create a new DataFrame containing the columns commonName and ints, where ints is greater than 10. I am doing:
df_greater_10 = df[['commonName', df[df.ints >= 1997]]]
I see the problem lies with the expression df[df.ints >= 1997], as it returns a DataFrame - how can I just get the ints column with values greater than 10?
You can use one of many available indexers. I would recommend .ix, because it seems to be faster:
df_greater_10 = df.ix[df.ints >= 1997, ['commonName', 'ints']]
or if you need only ints column
df_greater_10 = df.ix[df.ints >= 1997, 'ints']
Demo:
In [123]: df = pd.DataFrame(np.random.randint(5, 15, (10, 3)), columns=list('abc'))
In [124]: df
Out[124]:
a b c
0 13 11 14
1 14 10 13
2 7 11 6
3 7 13 12
4 9 9 6
5 7 7 7
6 5 7 8
7 5 11 5
8 9 7 9
9 11 13 7
In [125]: df_greater_10 = df.ix[df.c > 10, ['a','c']]
In [126]: df_greater_10
Out[126]:
a c
0 13 14
1 14 13
3 7 12
UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
So use df.loc[...] or df.iloc[...] instead of the deprecated df.ix[...].
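For example, the .ix calls above translate to .loc like this (a sketch; the boolean mask and the column labels are unchanged):

df_greater_10 = df.loc[df.ints >= 1997, ['commonName', 'ints']]
# or only the ints column
df_greater_10 = df.loc[df.ints >= 1997, 'ints']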
Not sure why you haven't tried df[df.ints >= 1997]['ints'] first (maybe I am missing something; is your dataframe very big?). Here's a demo of how it would work:
>>> df = pd.DataFrame({'ints': [1, 2, 3, 10, 11], 'other': ['a', 'b', 'c', 'y', 'z']})
>>> df
   ints other
0     1     a
1     2     b
2     3     c
3    10     y
4    11     z
>>> df[df.ints >= 10]
ints other
3 10 y
4 11 z
>>> df[df.ints >= 10]['ints']
3 10
4 11
You can get the same result with df['ints'][df['ints'] >= 10] too, which makes it more obvious that you're only interested in the ints column.