python pandas isin method? - python

I have a dictionary 'wordfreq' like this:
{'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
and I want to put the keys in a list if the value is more than 5 and also if the key is not in another dataframe 'df', and then adding them to a list called 'stopword':here is a df dataframe:
word freq
1 paradies 1
5 tucuman 1
and here is the code I am using:
stopword = []
for k,v in wordfreq.items():
if v >= 5:
if k not in list_c:
stopword.append((k))
Anybody knows how can I do the same thing with isin() method or more efficiently at least?

I'd load your dict into a df:
In [177]:
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
df = pd.DataFrame({'word':list(wordfreq.keys()), 'freq':list(wordfreq.values())})
df
Out[177]:
freq word
0 1 frogfeet
1 1 tucuman
2 57 paradies
3 1 d8848
4 5000 jobvark
5 100 midgley
6 1 jiaoyuwang
7 30 techsmart
8 2 weisman
9 19 walter
10 2 amdahl
And then filter using isin against the other df (df_1 in my case) like this:
In [181]:
df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]
Out[181]:
freq word
4 5000 jobvark
5 100 midgley
7 30 techsmart
9 19 walter
So the boolean condition looks for freq values greater than 5 and also where the word is not in the other df using isin and invert the boolean mask ~.
You can then now get a list easily:
In [182]:
list(df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]['word'])
Out[182]:
['jobvark', 'midgley', 'techsmart', 'walter']

Related

Collapse multiple timestamp rows into a single one

I have a series like that:
s = pd.DataFrame({'ts': [1, 2, 3, 6, 7, 11, 12, 13]})
s
ts
0 1
1 2
2 3
3 6
4 7
5 11
6 12
7 13
I would like to collapse rows that have difference less than MAX_DIFF (2). That means that the desired output must be:
[{'ts_from': 1, 'ts_to': 3},
{'ts_from': 6, 'ts_to': 7},
{'ts_from': 11, 'ts_to': 13}]
I did some coding:
s['close'] = s.diff().shift(-1)
s['close'] = s[s['close'] > MAX_DIFF].astype('bool')
s['close'].iloc[-1] = True
parts = []
ts_from = None
for _, row in s.iterrows():
if row['close'] is True:
part = {'ts_from': ts_from, 'ts_to': row['ts']}
parts.append(part)
ts_from = None
continue
if not ts_from:
ts_from = row['ts']
This works but does not seem optimal because of iterrows(). I thought about ranks but couldn't figure out how to implement them so as to groupby rank further.
Is there way to optimes algorithm?
You can create groups by checking where the difference is more than your threshold and take a cumsum. Then agg however you'd like, perhaps first and last in this case.
gp = s['ts'].diff().abs().ge(2).cumsum().rename(None)
res = s.groupby(gp).agg(ts_from=('ts', 'first'),
ts_to=('ts', 'last'))
# ts_from ts_to
#0 1 3
#1 6 7
#2 11 13
And if you want the list of dicts then:
res.to_dict('records')
#[{'ts_from': 1, 'ts_to': 3},
# {'ts_from': 6, 'ts_to': 7},
# {'ts_from': 11, 'ts_to': 13}]
For completeness here is how the grouper aligns with the DataFrame:
s['gp'] = gp
print(s)
ts gp
0 1 0 # `1` becomes ts_from for group 0
1 2 0
2 3 0 # `3` becomes ts_to for group 0
3 6 1 # `6` becomes ts_from for group 1
4 7 1 # `7` becomes ts_to for group 1
5 11 2 # `11` becomes ts_from for group 2
6 12 2
7 13 2 # `13` becomes ts_to for group 2

Pandas DataFrame group-by indexes matching list - indexes respectively smaller than list[i+1] and greater than list[i]

I have a DataFrame Times_df with times in a single column and a second DataFrame End_df with specific end times for each group indexed by group name.
Times_df = pd.DataFrame({'time':np.unique(np.cumsum(np.random.randint(5, size=(100,))), axis=0)})
End_df = pd.DataFrame({'end time':np.unique(random.sample(range(Times_df.index.values[0], Times_df.index.values[-1]), 10))})
End_df.index.name = 'group'
I want to add a group index for all times in Times_df smaller or equal than each consequitive end time in End_df but greater than the previous one
I can only do it for now with a loop, which takes forever ;(
lis = []
i = 1
for row in Times_df['time'].values:
while i <= row:
lis.append((End_df['end time']==row).index)
i +1
Then I add the list lis as a new column to Times_df
Times_df['group']=lis
A nother sollution that sadly still uses a loop is this:
test_df = pd.DataFrame()
for group, index in End_df.iterrows():
test = count.loc[count.index<=index['end time]][:]
test['group']=group
test_df = pd.concat([test_df,test], axis=0, ignore_index=True)
I think what you are looking for is pd.cut to bin your values into the groups.
bins = [0, 3, 10, 20, 53, 59, 63, 65, 68, 74, np.inf]
groups = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Times_df["group"] = pd.cut(Times_df["time"], bins, labels=groups)
print(Times_df)
time group
0 2 0
1 3 0
2 7 1
3 11 2
4 15 2
5 16 2
6 18 2
7 22 3
8 25 3
9 28 3

How to conditionally fill new column using for loop in python? [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 3 years ago.
I want to add a new column and fill values based on condition.
df:
indicator, value, a, b
1, 20, 5, 3
0, 30, 6, 8
0, 70, 2, 2
1, 10, 3, 7
I want to add a new column (value_new) based on Indicator. If indicator == 1, value_new = a*b otherwise value_new = value.
df:
indicator, value, a, b, value_new
1, 20, 5, 3, 15
0, 30, 6, 8, 30
0, 70, 2, 2, 70
1, 10, 3, 7, 21
I have tried following:
value_new = []
for in in range(1, len(df)):
if indicator[i] == 1:
value_new.append(df['a'][i]*df['b'][i])
else:
value_new.append(df['value'][i])
df['value_new'] = value_new
Error: 'Length of values does not match length of index'
And I have also tried:
for in in range(1, len(df)):
if indicator[i] == 1:
df['value_new'][i] = df['a'][i]*df['b'][i]
else:
df['value_new'][i] = df['value'][i]
KeyError: 'value_new'
You can use np.where:
df['value_new'] = np.where(df['indicator'], df['a']*df['b'], df['value'])
print(df)
Prints:
indicator value a b value_new
0 1 20 5 3 15
1 0 30 6 8 30
2 0 70 2 2 70
3 1 10 3 7 21

Indexing on DataFrame with MultiIndex

I have a large pandas DataFrame that I need to fill.
Here is my code:
trains = np.arange(1, 101)
#The above are example values, it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)
tuples = []
for i in trains:
for j in tresholds:
tuples.append((i, j))
index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)
metrics = dict()
for i in trains:
m = binary_metric_train(True, i)
#Above function returns a binary array of length 35
#Example: [1, 0, 0, 1, ...]
metrics[i] = m
for i in trains:
for j in tresholds:
trA = binary_metric_train(True, i, tresh=j)
for k in trains:
if k != i:
trB = metrics[k]
corr = abs(pearsonr(trA, trB)[0])
df[k][i][j] = corr
else:
df[k][i][j] = np.nan
My problem is, when this piece of code is finally done computing, my DataFrame df still contains nothing but zeros. Even the NaN are not inserted. I think that my indexing is correct. Also, I have tested my binary_metric_train function separately, it does return an array of length 35.
Can anyone spot what I am missing here?
EDIT: For clarity, this DataFrame looks like this:
1 2 3 4 5 ...
trains tresholds
1 10
20
30
40
50
60
2 10
20
30
40
50
60
...
As #EdChum noted, you should take a lookt at pandas indexing. Here's some test data for the purpose of illustration, which should clear things up.
import numpy as np
import pandas as pd
trains = [ 1, 1, 1, 2, 2, 2]
thresholds = [10, 20, 30, 10, 20, 30]
data = [ 1, 0, 1, 0, 1, 0]
df = pd.DataFrame({
'trains' : trains,
'thresholds' : thresholds,
'C1' : data,
'C2' : data
}).set_index(['trains', 'thresholds'])
print df
df.ix[(2, 30), 0] = 3 # using column index
# or...
df.ix[(2, 30), 'C1'] = 3 # using column name
df.loc[(2, 30), 'C1'] = 3 # using column name
# but not...
df.loc[(2, 30), 1] = 3 # creates a new column
print df
Which outputs the DataFrame before and after modification:
C1 C2
trains thresholds
1 10 1 1
20 0 0
30 1 1
2 10 0 0
20 1 1
30 0 0
C1 C2 1
trains thresholds
1 10 1 1 NaN
20 0 0 NaN
30 1 1 NaN
2 10 0 0 NaN
20 1 1 NaN
30 3 0 3

Panda-Column as index for numpy array

How can I use a panda row as index for a numpy array? Say I have
>>> grid = arange(10,20)
>>> df = pd.DataFrame([0,1,1,5], columns=['i'])
I would like to do
>>> df['j'] = grid[df['i']]
IndexError: unsupported iterator index
What is a short and clean way to actually perform this operation?
Update
To be precise, I want an additional column that has the values that correspond to the indices that the first column contains: df['j'][0] = grid[df['i'][0]] in column 0 etc
expected output:
index i j
0 0 10
1 1 11
2 1 11
3 5 15
Parallel Case: Numpy-to-Numpy
Just to show where the idea comes from, in standard python / numpy, if you have
>>> keys = [0, 1, 1, 5]
>>> grid = arange(10,20)
>>> grid[keys]
Out[30]: array([10, 11, 11, 15])
Which is exactly what I want to do. Only that my keys are not stored in a vector, they are stored in a column.
This is a numpy bug that surfaced with pandas 0.13.0 / numpy 1.8.0.
You can do:
In [5]: grid[df['i'].values]
Out[5]: array([0, 1, 1, 5])
In [6]: Series(grid)[df['i']]
Out[6]:
i
0 0
1 1
1 1
5 5
dtype: int64
This matches your output. You can assign an array to a column, as long as the length of the array/list is the same as the frame (otherwise how would you align it?)
In [14]: grid[keys]
Out[14]: array([10, 11, 11, 15])
In [15]: df['j'] = grid[df['i'].values]
In [17]: df
Out[17]:
i j
0 0 10
1 1 11
2 1 11
3 5 15

Categories