Inserting Counter objects into a pandas DataFrame - python

I am confused about how to insert a Counter (from collections) into a DataFrame.
My DataFrame looks like:
doc_cluster_key_freq=pd.DataFrame(index=[], columns=['doc_parent_id','keyword_id','key_count_in_doc_cluster'])
sim_docs_ids=[3342,3783]
The Counters generated for the sim_docs_ids are given below:
id=3342
Counter({133: 9, 79749: 7})
id=3783
Counter({133: 10, 12072: 5, 79749: 1})
The Counter is generated in a loop for each sim_docs_id.
My code looks like:
for doc_ids in sim_docs_ids:
    # generate counter for doc_ids
    # insert the counter into the dataframe (doc_cluster_key_freq) here
The output I am looking for is as below:
doc_cluster_key_freq=
doc_parent_id Keyword_id key_count_in_doc_cluster
0 3342 133 9
1 3342 79749 7
2 3783 133 10
3 3783 12072 5
4 3783 79749 1
I tried using counter.keys() and counter.values(), but I get something like the output below and have no idea how to separate them into different rows:
doc_parent_id Keyword_id key_count_in_doc_cluster
0 3342 [133, 79749] [9, 7]
1 3783 [12072, 133, 79749] [5, 10, 1]

If you have the same set of keywords for each doc_id, you can pre-allocate the proper row number for each record and use the code below to get one row for each keyword in every doc_id:
keywords = ['key1', 'key2', 'key3', ...]
number_of_keywords = len(keywords)

for i, doc_id in enumerate(sim_docs_ids):
    # generate the keyword Counter (counter) for doc_id
    for j, key in enumerate(keywords):
        doc_cluster_key_freq.loc[i * number_of_keywords + j] = [doc_id, key, counter[key]]
An example:
import random
from collections import Counter

import pandas as pd

a = pd.DataFrame(columns=['id', 'keyword', 'count'])
keywords = ['a', 'b', 'c']
N = len(keywords)
ids = range(5)

for i, idd in enumerate(ids):
    counter = Counter({'a': random.randint(0, 10),
                       'b': random.randint(0, 10),
                       'c': random.randint(0, 10)})
    for j, key in enumerate(keywords):
        a.loc[i * N + j] = [idd, key, counter[key]]
Output:
id keyword count
0 0 a 10
1 0 b 9
2 0 c 9
3 1 a 1
4 1 b 10
5 1 c 10
6 2 a 9
7 2 b 0
8 2 c 5
9 3 a 6
10 3 b 0
11 3 c 8
12 4 a 0
13 4 b 3
14 4 c 8
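If the keyword sets differ between documents, as they do in the question's data, a minimal sketch along these lines may be more direct: it builds one row per (keyword, count) pair from each Counter's items, collects plain dicts, and constructs the DataFrame once at the end rather than growing it with .loc. The counters dict below just stands in for whatever generates the Counter per doc_id in the real loop.
import pandas as pd
from collections import Counter

sim_docs_ids = [3342, 3783]
# example Counters from the question, keyed by doc id
counters = {3342: Counter({133: 9, 79749: 7}),
            3783: Counter({133: 10, 12072: 5, 79749: 1})}

rows = []
for doc_id in sim_docs_ids:
    counter = counters[doc_id]  # in the real loop, the Counter generated for doc_id
    for keyword_id, count in counter.items():
        rows.append({'doc_parent_id': doc_id,
                     'keyword_id': keyword_id,
                     'key_count_in_doc_cluster': count})

doc_cluster_key_freq = pd.DataFrame(
    rows, columns=['doc_parent_id', 'keyword_id', 'key_count_in_doc_cluster'])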


How to get a stratified random sample of indices?

I have an array (pd.Series) of two values (A's and B's, for example).
y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
0 A
1 B
2 A
3 A
4 B
5 B
6 A
7 B
8 A
9 B
10 B
I want to get a random sample of indices from series, but half of the indices must correspond with an A, and the other half must correspond with a B.
For example
get_random_stratified_sample_of_indices(y=y, n=4)
[0, 1, 2, 4]
The indices 0 and 2 correspond with A's, and the indices 1 and 4 correspond with B's.
Another example
get_random_stratified_sample_of_indices(y=y, n=6)
[1, 4, 5, 0, 2, 3]
The order of the returned list of indices doesn't matter, but I need it to be an even split between indices of A's and B's from the y array.
My plan was to first look at the indices of A's, then take a random sample (size=n/2) of the indices. And then repeat for B.
You can use groupby.sample:
N = 4
idx = (y
       .index.to_series()
       .groupby(y)
       .sample(n=N // len(y.unique()))
       .to_list()
)
Output: [3, 8, 10, 1]
Check:
3 A
8 A
10 B
1 B
dtype: object
Here's one way to do it:
def get_random_stratified_sample_of_indices(s, n):
    mask = s == 'A'
    s1 = s[mask]
    s2 = s[~mask]
    m1 = n // 2
    m2 = m1 if n % 2 == 0 else m1 + 1
    i1 = s1.sample(m1).index.to_list()
    i2 = s2.sample(m2).index.to_list()
    return i1 + i2
Which could be used in this way:
y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
i = get_random_stratified_sample_of_indices(y, 5)
print(i)
print()
print(y[i])
Result:
[6, 2, 7, 10, 5]
6 A
2 A
7 B
10 B
5 B
I think you could use train_test_split from scikit-learn, setting its stratify parameter.
from sklearn.model_selection import train_test_split
import pandas as pd
y = (
    pd.Series(["A", "B", "A", "A", "B", "B", "A", "B", "A", "B", "B"])
    .T.to_frame("col")
    .assign(i=lambda xdf: xdf.index)
)
print(y)
# Prints:
#
# col i
# 0 A 0
# 1 B 1
# 2 A 2
# 3 A 3
# 4 B 4
# 5 B 5
# 6 A 6
# 7 B 7
# 8 A 8
# 9 B 9
# 10 B 10
print('\n')
# ===== Actual solution =====================================
a, b = train_test_split(y, test_size=0.5, stratify=y["col"])
# ===========================================================
print(a)
# Prints:
#
# col i
# 10 B 10
# 6 A 6
# 7 B 7
# 8 A 8
# 4 B 4
print('\n')
print(b)
# Prints:
#
# col i
# 3 A 3
# 9 B 9
# 2 A 2
# 1 B 1
# 5 B 5
# 0 A 0
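If only the index values are needed rather than the split frames themselves, the index of either half can be pulled out directly; a small follow-up, assuming the a from the split above:
idx = a.index.to_list()   # the stratified half of the original indices
print(idx)
# e.g. [10, 6, 7, 8, 4] for the run shown above; the members vary per run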

Conflate sets of pandas columns into a single column

I have a dataframe that looks like this:
n objects id x y Vx Vy id.1 x.1 ... Vx.40 Vy.40 ...
0 41 1 2 3 4 5 17 3 ... 5 6 ...
1 21 1 2 3 4 5 17 3 ... 0 0 ...
2 36 1 2 3 4 5 17 3 ... 0 0 ...
My goal is to conflate the contents of every set of id, x, y, Vx, and Vy columns into a single column.
I.e. the end result should look like this:
n objects object_0 object_1 object_40 ...
0 41 [1,2,3,4,5] [17,3,...] ... [...5,6] ...
1 21 [1,2,3,4,5] [17,3,...] ... [...0,0] ...
2 36 [1,2,3,4,5] [17,3,...] ... [...0,0] ...
I am kind of at a loss as to how to achieve that. My only idea was hardcoding it like
df['object_0'] = df[['id', 'x', 'y', 'Vx', 'Vy']].values.tolist()
df.drop(['id', 'x', 'y', 'Vx', 'Vy'], axis=1, inplace=True)

for i in range(1, 41):
    df[f'object_{i}'] = df[[f'id.{i}', f'x.{i}', f'y.{i}', f'Vx.{i}', f'Vy.{i}']].values.tolist()
    df.drop([f'id.{i}', f'x.{i}', f'y.{i}', f'Vx.{i}', f'Vy.{i}'], axis=1, inplace=True)
but that is not a good option, as the number (and names) of repeating columns varies between dataframes. What is consistent is that the number of objects per row is listed, and every object has the same number of elements (i.e. there are no cases of columns going like id.26, y.26, Vx.26, id.27 Vy.27, id.28...)
I suppose I could find the number of objects via something like
last_obj = max([ int(col.split('.')[-1]) for col in df.columns if '.' in col ])
and then dig out the number and names of cols per object by
[ col.split('.')[0] for col in df.columns if col.split('.')[-1] == str(last_obj) ]
but at that point this all starts seeming a bit too cluttered and hacky.
Is there a cleaner way to do that, one that works irrespective of the number of objects, of columns per object, and (ideally) of column names? Any help would be appreciated!
EDIT:
This does work, but is there a more elegant way of doing it?
last_obj = max([ int(col.split('.')[-1]) for col in df.columns if '.' in col ])
obj_col_names = [ col.split('.')[0] for col in df.columns if col.split('.')[-1] == str(last_obj) ]

df['object_0'] = df[obj_col_names].values.tolist()
df.drop(obj_col_names, axis=1, inplace=True)

for i in range(1, last_obj + 1):
    current_col_set = [ "".join([col, f'.{i}']) for col in obj_col_names ]
    df[f'object_{i}'] = df[current_col_set].values.tolist()
    df.drop(current_col_set, axis=1, inplace=True)
This solution renames the columns into same-named groups. Then does a groupby on those columns and converts them into lists.
Starting with
n objects id x y Vx Vy id.1 x.1 y.1 Vx.1 Vy.1
0 0 41 1 2 3 4 5 17 3 3 4 5
1 1 21 1 2 3 4 5 17 3 3 4 5
2 2 36 1 2 3 4 5 17 3 3 4 5
Then
import numpy as np
import pandas as pd

nb_cols = df.shape[1] - 2
nb_groups = int(df.columns[-1].split('.')[1]) + 1
cols_per_group = nb_cols // nb_groups
group_cols = np.arange(nb_cols) // cols_per_group
explode_cols = list(np.arange(nb_groups))

pd.concat([df.loc[:, :'objects'].reset_index(drop=True),
           df.loc[:, 'id':].set_axis(group_cols, axis=1)
             .groupby(level=0, axis=1)
             .apply(lambda x: x.values)
             .to_frame().T
             .explode(explode_cols)
             .reset_index(drop=True)
             .rename(columns=lambda x: 'object_' + str(x))
           ], axis=1)
Result
n objects object_0 object_1
0 0 41 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
1 1 21 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
2 2 36 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
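An alternative sketch (my own variant, not part of the answers above, assuming 'n' and 'objects' are ordinary columns as in the starting frame): derive the per-object column groups from the names once, build every object_<i> column in a dict comprehension, and assemble the result with a single pd.concat, which avoids growing and dropping columns in a loop.
import pandas as pd

# columns of object 0 are the ones without a numeric suffix (besides 'n' and 'objects')
base_cols = [c for c in df.columns if '.' not in c and c not in ('n', 'objects')]
last_obj = max((int(c.split('.')[-1]) for c in df.columns if '.' in c), default=0)

col_groups = [base_cols] + [[f'{c}.{i}' for c in base_cols]
                            for i in range(1, last_obj + 1)]

objects = pd.DataFrame({f'object_{i}': df[cols].values.tolist()
                        for i, cols in enumerate(col_groups)},
                       index=df.index)
result = pd.concat([df[['n', 'objects']], objects], axis=1)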

Collapse multiple timestamp rows into a single one

I have a DataFrame like this:
s = pd.DataFrame({'ts': [1, 2, 3, 6, 7, 11, 12, 13]})
s
ts
0 1
1 2
2 3
3 6
4 7
5 11
6 12
7 13
I would like to collapse rows that have difference less than MAX_DIFF (2). That means that the desired output must be:
[{'ts_from': 1, 'ts_to': 3},
{'ts_from': 6, 'ts_to': 7},
{'ts_from': 11, 'ts_to': 13}]
I did some coding:
s['close'] = s.diff().shift(-1)
s['close'] = s[s['close'] > MAX_DIFF].astype('bool')
s['close'].iloc[-1] = True

parts = []
ts_from = None

for _, row in s.iterrows():
    if row['close'] is True:
        part = {'ts_from': ts_from, 'ts_to': row['ts']}
        parts.append(part)
        ts_from = None
        continue
    if not ts_from:
        ts_from = row['ts']
This works but does not seem optimal because of iterrows(). I thought about ranks but couldn't figure out how to implement them so as to group by rank further.
Is there a way to optimize the algorithm?
You can create groups by checking where the difference reaches your threshold and taking a cumsum. Then agg however you'd like, perhaps first and last in this case.
gp = s['ts'].diff().abs().ge(2).cumsum().rename(None)
res = s.groupby(gp).agg(ts_from=('ts', 'first'),
                        ts_to=('ts', 'last'))
# ts_from ts_to
#0 1 3
#1 6 7
#2 11 13
And if you want the list of dicts then:
res.to_dict('records')
#[{'ts_from': 1, 'ts_to': 3},
# {'ts_from': 6, 'ts_to': 7},
# {'ts_from': 11, 'ts_to': 13}]
For completeness here is how the grouper aligns with the DataFrame:
s['gp'] = gp
print(s)
ts gp
0 1 0 # `1` becomes ts_from for group 0
1 2 0
2 3 0 # `3` becomes ts_to for group 0
3 6 1 # `6` becomes ts_from for group 1
4 7 1 # `7` becomes ts_to for group 1
5 11 2 # `11` becomes ts_from for group 2
6 12 2
7 13 2 # `13` becomes ts_to for group 2
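A small wrapper along the same lines (my own sketch, parameterizing the threshold from the question rather than hardcoding 2) that returns the list of dicts directly:
import pandas as pd

MAX_DIFF = 2

def collapse(s, max_diff=MAX_DIFF):
    # start a new group wherever the gap to the previous timestamp reaches max_diff
    gp = s['ts'].diff().abs().ge(max_diff).cumsum()
    return (s.groupby(gp)
             .agg(ts_from=('ts', 'first'), ts_to=('ts', 'last'))
             .to_dict('records'))

s = pd.DataFrame({'ts': [1, 2, 3, 6, 7, 11, 12, 13]})
print(collapse(s))
# [{'ts_from': 1, 'ts_to': 3}, {'ts_from': 6, 'ts_to': 7}, {'ts_from': 11, 'ts_to': 13}]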

Grouping list values based on max value

I'm working on a k-means algorithm to cluster a list of numbers. If I have an array (X):
import numpy as np

X = np.array([[0.85142858], [0.85566274], [0.85364912], [0.81536489], [0.84929932],
              [0.85042336], [0.84899714], [0.82019115], [0.86112067], [0.8312496]])
then I run the following code
from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
cluster.fit_predict(X)

for i in range(len(X)):
    print("%4d " % cluster.labels_[i], end="")
    print(X[i])
I got the results:
1 1 [0.85142858]
2 3 [0.85566274]
3 3 [0.85364912]
4 0 [0.81536489]
5 1 [0.84929932]
6 1 [0.85042336]
7 1 [0.84899714]
8 0 [0.82019115]
9 4 [0.86112067]
10 2 [0.8312496]
How do I get the max number in each cluster together with its position (i)? Like this:
0: 0.82019115 8
1: 0.85142858 1
2: 0.8312496 10
3: 0.85566274 2
4: 0.86112067 9
First pair them together using zip, then sort by value (the second element of each pair) in increasing order and create a dict out of it; since later entries overwrite earlier ones, each label ends up mapped to its maximum value.
Try:
res = list(zip(cluster.labels_, X))
max_num = dict(sorted(res, key=lambda x: x[1], reverse=False))
max_num:
{0: array([0.82019115]),
2: array([0.8312496]),
1: array([0.85142858]),
3: array([0.85566274]),
4: array([0.86112067])}
Edit:
Do you want this?
elem = list(zip(res, range(1, len(X)+1)))
e = sorted(elem, key=lambda x: x[0][1], reverse=False)
final_dict = {k[0]: (k[1], v) for (k, v) in e}

for key in sorted(final_dict):
    print(f"{key}: {final_dict[key][0][0]} {final_dict[key][1]}")
0: 0.82019115 8
1: 0.85142858 1
2: 0.8312496 10
3: 0.85566274 2
4: 0.86112067 9
OR
import pandas as pd
df = pd.DataFrame(zip(cluster.labels_,X))
df[1] = df[1].str[0]
df = df.sort_values(1).drop_duplicates([0],keep='last')
df.index = df.index+1
df = df.sort_values(0)
df:
0 1
8 0 0.820191
1 1 0.851429
10 2 0.831250
2 3 0.855663
9 4 0.861121
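Another option (my own sketch, not taken from the answers above, assuming cluster.labels_ and X as defined in the question) is groupby plus idxmax, which states the intent directly:
import pandas as pd

res = pd.DataFrame({'label': cluster.labels_, 'value': X.ravel()})
res.index = res.index + 1                       # 1-based position, matching the printout
best = res.loc[res.groupby('label')['value'].idxmax()].sort_values('label')

for pos, label, value in best.itertuples():
    print(f"{label}: {value} {pos}")
# 0: 0.82019115 8
# 1: 0.85142858 1
# 2: 0.8312496 10
# 3: 0.85566274 2
# 4: 0.86112067 9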

For loop iteration with range restarting at index 1

I'm trying to iterate through a list with a step of 2 indexes at a time and, once it reaches the end, restart the same iteration but from index 1 rather than from zero.
I have already read different articles on Stack Overflow, like this one with a while-loop workaround. However, I'm looking for an option that simply uses the elements in my for loop with range, without using itertools or other libraries and without a nested loop:
Here is my code:
j = [0, 0, 1, 1, 2, 2, 3, 3, 9, 11]
count = 0

for i in range(len(j)):
    if i >= len(j)/2:
        print(j[len(j)-i])
        count += 1
    else:
        count += 1
        print(j[i*2], i)
Here is the output:
0 0
1 1
2 2
3 3
9 4
2
2
1
1
0
The loop does not start back from where it is supposed to.
Here is the desired output:
0 0
1 1
2 2
3 3
9 4
0 5
1 6
2 7
3 8
11 9
How can I fix it?
You can do that by combining two range() calls like:
Code:
j = [0, 0, 1, 1, 2, 2, 3, 3, 9, 11]

for i in (j[k] for k in
          list(range(0, len(j), 2)) + list(range(1, len(j), 2))):
    print(i)
and using an itertools solution:
import itertools as it

for i in it.chain.from_iterable((it.islice(j, 0, len(j), 2),
                                 it.islice(j, 1, len(j), 2))):
    print(i)
Results:
0
1
2
3
9
0
1
2
3
11
Another itertools solution:
import itertools as it

lst = [0, 0, 1, 1, 2, 2, 3, 3, 9, 11]
a, b = it.tee(lst)
next(b)

for i, x in enumerate(it.islice(it.chain(a, b), None, None, 2)):
    print(x, i)
Output
0 0
1 1
2 2
3 3
9 4
0 5
1 6
2 7
3 8
11 9
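If you prefer to keep the original structure, a minimal sketch of the fix (my own suggestion, not from the answers above): change only the index arithmetic in the second half so it walks the odd positions, and print the running index i in both branches:
j = [0, 0, 1, 1, 2, 2, 3, 3, 9, 11]

for i in range(len(j)):
    if i >= len(j) // 2:
        # second half: odd indices 1, 3, 5, ...
        print(j[(i - len(j) // 2) * 2 + 1], i)
    else:
        # first half: even indices 0, 2, 4, ...
        print(j[i * 2], i)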
