Count a column's value in another column containing a list of values - python

Column A contains strings. Column B contains lists of strings.
I would like to know how many times the value in A appears in the list in B.
I have:
A    B
k    [m]
c    [k, l, m]
j    [k, l]
e    [e, m]
e    [e, m, c, e]
I would like this output:
C
0
0
0
1
2

You can apply a function row-wise and use list.count, like below.
Try this:
>>> df = pd.DataFrame({'A': ['k', 'c', 'j', 'e', 'e'],'B': [['m'],['k','l','m'],['k','l'],['e','m'],['e','m','c','e']]})
>>> df['C'] = df.apply(lambda row: row['B'].count(row['A']), axis=1)
>>> df
A B C
0 k [m] 0
1 c [k, l, m] 0
2 j [k, l] 0
3 e [e, m] 1
4 e [e, m, c, e] 2
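Equivalently, a plain list comprehension over the two columns avoids the row-wise apply and is usually faster on larger frames; a minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['k', 'c', 'j', 'e', 'e'],
                   'B': [['m'], ['k', 'l', 'm'], ['k', 'l'],
                         ['e', 'm'], ['e', 'm', 'c', 'e']]})

# pair each A value with its B list and count occurrences directly
df['C'] = [b.count(a) for a, b in zip(df['A'], df['B'])]
```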

Related

Grouping DataFrame rows given a row-dependent condition

I have a problem I cannot solve.
I have a DataFrame in which every row is a "person" with one or two connections to other rows. Every person has an ID, and the connections are expressed in the columns COMPANION1 and COMPANION2, which contain the IDs of the connected persons.
I need to bind every "group" with pandas, perhaps by creating a new column with a number associated with the group.
It's easier to look at the DataFrame:
array = np.array([['A', 'B', 0], ['B', 'A', 0], ['C', 'D', 0],
                  ['D', 'C', 0], ['E', 'G', 'F'], ['F', 'E', 'G'],
                  ['G', 0, 0]])
index_values = ['0', '1', '2', '3', '4', '5', '6']
column_values = ['ID', 'COMPANION1', 'COMPANION2']
df = pd.DataFrame(data=array, index=index_values, columns=column_values)
df['GROUP'] = np.zeros(len(df))
df
The original dataset is much bigger than this (circa 1600 rows).
In this example, A-B are bound, as are C-D, and then E-F-G (not every "person" lists its links, but it is sufficient to check whether others link to the ones without any).
How can I assign an "index" to every family? I'm sure there are no unbound people, every family is a "closed system", and no family is bigger than 3.
I hope I've been clear enough!
Thanks a lot,
Samuel
EDIT
Ok, I think this solves the issue of G being in its own group:
# Step 1, create groups
>>> groups = [sorted([j for j in i if j != '0']) for i in list(df['ID'] + df['COMPANION1'] + df['COMPANION2'])]
# Get groups that are actually groups, not just one-offs
>>> groups = [g for g in groups if len(g) > 1]
>>> groups
[['A', 'B'], ['A', 'B'], ['C', 'D'], ['C', 'D'], ['E', 'F', 'G'], ['E', 'F', 'G']]
# Convert groups to strings and get only the unique ones
>>> groups = set(["".join(g) for g in groups])
>>> groups
{'AB', 'CD', 'EFG'}
# Step 2, fill in groups
>>> df['Group'] = df['ID'].apply(lambda x: [i for i in groups if x in i][0])
ID COMPANION1 COMPANION2 Group
0 A B 0 AB
1 B A 0 AB
2 C D 0 CD
3 D C 0 CD
4 E G F EFG
5 F E G EFG
6 G 0 0 EFG
Then you could continue as in the original answer below to assign numbers to the groups.
Original
The simplest way I'm seeing to do this is to create a new column with the group members:
df['Members'] = df['ID'] + df['COMPANION1'] + df['COMPANION2']
df['Members'] = df['Members'].apply(lambda x: "".join(sorted(x)))
ID COMPANION1 COMPANION2 Members
0 A B 0 0AB
1 B A 0 0AB
2 C D 0 0CD
3 D C 0 0CD
4 E G F EFG
5 F E G EFG
6 G 0 0 00G
If you wanted to have numeric group IDs instead, you could do:
df["Group_Id"] = df["Members"].copy().replace({gid: i for i, gid in enumerate(df["Members"].unique())})
ID COMPANION1 COMPANION2 Members Group_Id
0 A B 0 0AB 0
1 B A 0 0AB 0
2 C D 0 0CD 1
3 D C 0 0CD 1
4 E G F EFG 2
5 F E G EFG 2
6 G 0 0 00G 3
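If the groups could ever be larger, or membership is only discoverable by chaining links, a connected-components approach generalizes the string-concatenation trick. A sketch using a small union-find (disjoint-set) over the same data, assuming single-character IDs and '0' marking an empty companion slot:

```python
import pandas as pd

df = pd.DataFrame({'ID': list('ABCDEFG'),
                   'COMPANION1': ['B', 'A', 'D', 'C', 'G', 'E', '0'],
                   'COMPANION2': ['0', '0', '0', '0', 'F', 'G', '0']})

# each ID starts in its own set
parent = {i: i for i in df['ID']}

def find(x):
    # follow parents to the root, halving paths along the way
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# merge every person with each listed companion
for _, row in df.iterrows():
    for comp in (row['COMPANION1'], row['COMPANION2']):
        if comp != '0':
            parent[find(row['ID'])] = find(comp)

# number the components consecutively
roots = df['ID'].map(find)
df['GROUP'] = roots.map({r: i for i, r in enumerate(roots.unique())})
```

This also handles G automatically, since E and F both point at it.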

Putting rows of pandas dataframe into list form

I have a pandas dataframe of the form
T1 T2
0 A B
1 C D
2 B C
3 D E
4 F A
I would like to generate another pandas DataFrame in which each unique item from T1 and T2 has its own row, with one column for the name of that unique item and one column listing the items it shared a row with in the original DataFrame. For example, in this case I would be looking for something of the form:
Name List
0 A [B, F]
1 B [A, C]
2 C [D, B]
3 D [C, E]
4 E [D]
5 F [A]
Can someone suggest a proper pandonic (like pythonic but for pandas :)) way to do this? Thanks in advance!
IIUC, swap the columns of a copy, concatenate, then group and collect lists:
df2 = df.copy()
df2.columns = df.columns[::-1]
new_df = pd.concat([df, df2])
new_df.groupby("T1")["T2"].apply(list).reset_index()
Output:
T1 T2
0 A [B, F]
1 B [C, A]
2 C [D, B]
3 D [E, C]
4 E [D]
5 F [A]
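The same result can be built without doubling the frame via concat, by accumulating adjacency lists in plain Python; a minimal sketch:

```python
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'T1': list('ACBDF'), 'T2': list('BDCEA')})

# record each pairing in both directions
pairs = defaultdict(list)
for a, b in zip(df['T1'], df['T2']):
    pairs[a].append(b)
    pairs[b].append(a)

out = pd.DataFrame({'Name': sorted(pairs),
                    'List': [pairs[k] for k in sorted(pairs)]})
```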

Counting each unique array of an array in each row of a column in a data frame

I am practicing pandas and Python, and I am not so good at for loops. I have a data frame as below; let's say this is df:
Name Value
A [[A,B],[C,D]]
B [[A,B],[D,E]]
C [[D,E],[K,L],[M,L]]
D [[K,L]]
I want to go through each row, find the unique arrays, and count them.
I have tried np.unique(a, return_index=True), but it returns two different lists, and my problem is that I don't know how to go through each array.
Expected result would be:
Value Counts
[A,B] 2
[D,E] 2
[K,L] 2
[C,D] 1
[M,L] 1
Thank you very much.
Use DataFrame.explode in pandas 0.25+:
df.explode('Value')['Value'].value_counts()
Output:
[K, L] 2
[A, B] 2
[D, E] 2
[C, D] 1
[M, L] 1
Name: Value, dtype: int64
Use Series.explode with Series.value_counts:
df = df['Value'].explode().value_counts().rename_axis('Value').reset_index(name='Counts')
print (df)
Value Counts
0 [D, E] 2
1 [A, B] 2
2 [K, L] 2
3 [C, D] 1
4 [M, L] 1
Numpy solution:
a, v = np.unique(np.concatenate(df['Value']),axis=0, return_counts=True)
df = pd.DataFrame({'Value':a.tolist(), 'Counts':v})
print (df)
Value Counts
0 [A, B] 2
1 [C, D] 1
2 [D, E] 2
3 [K, L] 2
4 [M, L] 1
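One caveat with the explode-based answers: if the inner values are Python lists, value_counts may raise a TypeError because lists are unhashable (behavior varies by pandas version). Mapping to tuples first avoids this; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': list('ABCD'),
                   'Value': [[['A', 'B'], ['C', 'D']],
                             [['A', 'B'], ['D', 'E']],
                             [['D', 'E'], ['K', 'L'], ['M', 'L']],
                             [['K', 'L']]]})

# tuples are hashable, so value_counts can bucket them safely
counts = df['Value'].explode().map(tuple).value_counts()
```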

How to group pandas DataFrame if some values are range of integers, while others are pure integer?

I want to group a df by a column col_2, which contains mostly integers, but some cells contain a range of integers. In my real life example, each unique integer represents a specific serial number of an assembled part. Each row in the dataframe represents a single part, which is allocated to the assembled part by col_2. Some parts can only be allocated to the assembled part with a given uncertainty (range).
The expected output would be one single group for each referenced integer (assembled part S/N). For example, the entry col_1 = c should be allocated to both groups where col_2 = 1 and col_2 = 2.
df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})
col_1 col_2
0 a 1
1 b 2
2 c (1, 2)
3 d 3
4 e (2, 3, 4)
5 f 5
print(df.groupby(['col_2']).groups)
The code above gives an error:
TypeError: '<' not supported between instances of 'range' and 'int'
I think this does what you want:
s = df.col_2.apply(pd.Series).set_index(df.col_1).stack().astype(int)
s.reset_index().groupby(0).col_1.apply(list)
The first step gives you:
col_1
a 0 1
b 0 2
c 0 1
1 2
d 0 3
e 0 2
1 3
2 4
f 0 5
And the final result is:
1 [a, c]
2 [b, c, e]
3 [d, e]
4 [e]
5 [f]
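A more recent alternative (pandas 0.25+) is to normalize col_2 to lists and explode, which avoids the apply(pd.Series) round-trip; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})

# wrap scalars in a one-element list so every cell is list-like
normalized = df['col_2'].map(lambda x: list(x) if isinstance(x, range) else [x])
out = (df.assign(col_2=normalized)
         .explode('col_2')
         .groupby('col_2')['col_1']
         .apply(list))
```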
Try this:
df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})
col_1 col_2
0 a 1
1 b 2
2 c (1, 2)
3 d 3
4 e (2, 3, 4)
5 f 5
# wrap plain integers in a one-element range so every key has the same type
df['col_2'] = df.col_2.map(lambda x: x if isinstance(x, range) else range(x, x + 1))
# sort=False avoids comparing range objects, which would raise again
print(df.groupby(['col_2'], sort=False).groups)

Find columns where values are greater than column-wise mean

How do I print the column headers where the row value is greater than the mean (or median) of that column?
For Eg.,
df =
a b c d
0 12 11 13 45
1 6 13 12 23
2 5 12 6 35
the output should be 0: a, c, d. 1: b, c. 2: d.
In [22]: df.gt(df.mean()).T.agg(lambda x: df.columns[x].tolist())
Out[22]:
0 [a, c, d]
1 [b, c]
2 [d]
dtype: object
or:
In [23]: df.gt(df.mean()).T.agg(lambda x: ', '.join(df.columns[x]))
Out[23]:
0 a, c, d
1 b, c
2 d
dtype: object
You can try this using pandas; I'll break down the steps:
df=df.reset_index().melt('index')
df['MEAN']=df.groupby('variable')['value'].transform('mean')
df[df.value>df.MEAN].groupby('index').variable.apply(list)
Out[1016]:
index
0 [a, c, d]
1 [b, c]
2 [d]
Name: variable, dtype: object
Use df.apply to generate a mask, which you can then iterate over and index into df.columns:
mask = df.apply(lambda x: x > x.mean())
out = [(i, ', '.join(df.columns[x])) for i, x in mask.iterrows()]
print(out)
[(0, 'a, c, d'), (1, 'b, c'), (2, 'd')]
Or, with a defaultdict and np.where:
from collections import defaultdict
import numpy as np

d = defaultdict(list)
v = df.values
[d[df.index[r]].append(df.columns[c])
 for r, c in zip(*np.where(v > v.mean(0)))];
dict(d)
{0: ['a', 'c', 'd'], 1: ['b', 'c'], 2: ['d']}
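A compact variant of the same mask idea takes the matrix product of the boolean mask with the column labels, a well-known pandas trick: each True cell contributes its column label, each False cell contributes an empty string. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [12, 6, 5], 'b': [11, 13, 12],
                   'c': [13, 12, 6], 'd': [45, 23, 35]})

# dot-product of the boolean mask with the labels concatenates per row
mask = df.gt(df.mean())
out = mask.dot(df.columns + ', ').str.rstrip(', ')
```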
