How to count unique rows and their number of appearances in pandas - python

How do I count unique rows and their number of appearances in pandas?
Lead ID bank_account_id NO.of account
0 308148.0 12460.0 1
1 310443.0 12654.0 1
2 310443.0 12655.0 1
3 312745.0 12835.0 1
4 312745.0 12836.0 1
5 312745.0 12837.0 1
6 312745.0 12838.0 1
7 312745.0 12839.0 1
8 313082.0 13233.0 1
9 314036.0 13226.0 1
10 314559.0 13271.0 1
11 314559.0 13273.0 1
12 316728.0 13228.0 1
13 316728.0 13230.0 1
14 316728.0 13232.0 1
15 316728.0 13234.0 1
16 316728.0 13235.0 1
17 316728.0 13272.0 1
18 318465.0 13419.0 1
19 318465.0 13420.0 1
20 318465.0 13421.0 1
21 318465.0 13422.0 1
22 318465.0 13423.0 1
23 318465.0 13424.0 1
24 318465.0 13425.0 1
25 321146.0 13970.0 1
26 321146.0 13971.0 1
27 321218.0 14779.0 1
28 321356.0 15142.0 1
29 321356.0 15144.0 1
30 321356.0 15146.0 1
For every unique Lead ID in this dataset, I want the corresponding bank_account_id values and the total number of bank_account_id values that Lead ID has.

You can use df.groupby():
import pandas as pd
df = pd.DataFrame({'Lead ID': ['308148.0', '310443.0', '310443.0', '312745.0', '312745.0', '312745.0', '312745.0', '312745.0', '313082.0', '314036.0', '314559.0', '314559.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '321146.0', '321146.0', '321218.0', '321356.0', '321356.0', '321356.0'],
'bank_account_id': ['12460.0', '12654.0', '12655.0', '12835.0', '12836.0', '12837.0', '12838.0', '12839.0', '13233.0', '13226.0', '13271.0', '13273.0', '13228.0', '13230.0', '13232.0', '13234.0', '13235.0', '13272.0', '13419.0', '13420.0', '13421.0', '13422.0', '13423.0', '13424.0', '13425.0', '13970.0', '13971.0', '14779.0', '15142.0', '15144.0', '15146.0'],
'NO.of account': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']})
df2 = df[df.duplicated('Lead ID', keep=False)].groupby('Lead ID')['bank_account_id'].apply(list).reset_index()
print(df2)
Output:
Lead ID bank_account_id
0 310443.0 [12654.0, 12655.0]
1 312745.0 [12835.0, 12836.0, 12837.0, 12838.0, 12839.0]
2 314559.0 [13271.0, 13273.0]
3 316728.0 [13228.0, 13230.0, 13232.0, 13234.0, 13235.0, ...
4 318465.0 [13419.0, 13420.0, 13421.0, 13422.0, 13423.0, ...
5 321146.0 [13970.0, 13971.0]
6 321356.0 [15142.0, 15144.0, 15146.0]
You can also use a for loop to iterate through the values of your data frame with zip():
import pandas as pd
df = pd.DataFrame({'Lead ID': ['308148.0', '310443.0', '310443.0', '312745.0', '312745.0', '312745.0', '312745.0', '312745.0', '313082.0', '314036.0', '314559.0', '314559.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '321146.0', '321146.0', '321218.0', '321356.0', '321356.0', '321356.0'],
'bank_account_id': ['12460.0', '12654.0', '12655.0', '12835.0', '12836.0', '12837.0', '12838.0', '12839.0', '13233.0', '13226.0', '13271.0', '13273.0', '13228.0', '13230.0', '13232.0', '13234.0', '13235.0', '13272.0', '13419.0', '13420.0', '13421.0', '13422.0', '13423.0', '13424.0', '13425.0', '13970.0', '13971.0', '14779.0', '15142.0', '15144.0', '15146.0'],
'NO.of account': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']})
dct = dict()
for l, b in zip(df['Lead ID'], df['bank_account_id']):
    if l in dct:
        dct[l].append(b)
    else:
        dct[l] = [b]
print(dct)
Output:
{'308148.0': ['12460.0'],
'310443.0': ['12654.0', '12655.0'],
'312745.0': ['12835.0', '12836.0', '12837.0', '12838.0', '12839.0'],
'313082.0': ['13233.0'],
'314036.0': ['13226.0'],
'314559.0': ['13271.0', '13273.0'],
'316728.0': ['13228.0', '13230.0', '13232.0', '13234.0', '13235.0', '13272.0'],
'318465.0': ['13419.0', '13420.0', '13421.0', '13422.0', '13423.0', '13424.0', '13425.0'],
'321146.0': ['13970.0', '13971.0'],
'321218.0': ['14779.0'],
'321356.0': ['15142.0', '15144.0', '15146.0']}

How about using MultiIndex for the count?
import pandas as pd
df = pd.DataFrame({'Lead ID': ['308148.0', '310443.0', '310443.0', '312745.0', '312745.0', '312745.0', '312745.0', '312745.0', '313082.0', '314036.0', '314559.0', '314559.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '321146.0', '321146.0', '321218.0', '321356.0', '321356.0', '321356.0'],
'bank_account_id': ['12460.0', '12654.0', '12655.0', '12835.0', '12836.0', '12837.0', '12838.0', '12839.0', '13233.0', '13226.0', '13271.0', '13273.0', '13228.0', '13230.0', '13232.0', '13234.0', '13235.0', '13272.0', '13419.0', '13420.0', '13421.0', '13422.0', '13423.0', '13424.0', '13425.0', '13970.0', '13971.0', '14779.0', '15142.0', '15144.0', '15146.0'],
'NO.of account': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']})
df2 = df.set_index(["Lead ID", "bank_account_id"])
print(df2.groupby(level="Lead ID").size())
Output:
Lead ID
308148.0 1
310443.0 2
312745.0 5
313082.0 1
314036.0 1
314559.0 2
316728.0 6
318465.0 7
321146.0 2
321218.0 1
321356.0 3
dtype: int64

Hi, try a single df.value_counts() call; it gives you a compact aggregation table with one count per unique row.
Lead ID bank_account_id NO.of account
321356.0 15146.0 1 1
316728.0 13232.0 1 1
310443.0 12654.0 1 1
12655.0 1 1
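The question asks for both the list of accounts and the count per Lead ID in one table. A minimal sketch of one way to get that, assuming numeric columns and a shortened copy of the data; the output column names accounts and n_accounts are made up for illustration. Unlike the duplicated() filter in the first answer, this keeps Lead IDs that have only one account.

```python
import pandas as pd

# Shortened sample of the question's data (assumed numeric here).
df = pd.DataFrame({
    "Lead ID": [308148.0, 310443.0, 310443.0, 312745.0, 312745.0],
    "bank_account_id": [12460.0, 12654.0, 12655.0, 12835.0, 12836.0],
})

# Named aggregation: one output column per (source column, function) pair.
summary = (
    df.groupby("Lead ID")
      .agg(accounts=("bank_account_id", list),
           n_accounts=("bank_account_id", "nunique"))
      .reset_index()
)
print(summary)
```

If duplicate (Lead ID, bank_account_id) rows should each be counted, "size" or "count" could be used in place of "nunique".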

Related

Converting to dataframe, beginner question

I have a piece of data that looks like this
my_data[:5]
returns:
[{'key': ['Aaliyah', '2', '2016'], 'values': ['10']},
{'key': ['Aaliyah', '2', '2017'], 'values': ['26']},
{'key': ['Aaliyah', '2', '2018'], 'values': ['21']},
{'key': ['Aaliyah', '2', '2019'], 'values': ['26']},
{'key': ['Aaliyah', '2', '2020'], 'values': ['15']}]
The key represents Name, Gender, and Year. The value is number.
I have not managed to generate a data frame with the columns name, gender, year, and number.
Can you help me?
Here is one way, using a generator:
import pandas as pd
from itertools import chain
pd.DataFrame.from_records((dict(zip(['name', 'gender', 'year', 'number'],
                                    chain(*e.values())))
                           for e in my_data))
Without itertools:
pd.DataFrame(((E:=list(e.values()))[0]+E[1] for e in my_data),
columns=['name', 'gender', 'year', 'number'])
output:
name gender year number
0 Aaliyah 2 2016 10
1 Aaliyah 2 2017 26
2 Aaliyah 2 2018 21
3 Aaliyah 2 2019 26
4 Aaliyah 2 2020 15
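If the generator expressions above feel dense, the same idea can be sketched with a plain list comprehension that concatenates each record's 'key' and 'values' lists. This assumes every record has exactly those two fields, as in the sample; the cast to int at the end is optional and only an illustration.

```python
import pandas as pd

# First two records from the question's sample.
my_data = [
    {'key': ['Aaliyah', '2', '2016'], 'values': ['10']},
    {'key': ['Aaliyah', '2', '2017'], 'values': ['26']},
]

# Each row is key + values, e.g. ['Aaliyah', '2', '2016', '10'].
rows = [d['key'] + d['values'] for d in my_data]
df = pd.DataFrame(rows, columns=['name', 'gender', 'year', 'number'])

# The source values are strings; convert the numeric columns if needed.
df[['year', 'number']] = df[['year', 'number']].astype(int)
print(df)
```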

Pandas: Flatten Nested Dictionary vertically

I have a list of dictionary as below:
[{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},{'tagId': '20',
'tagName': 'BC'}]},
{'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},{'tagId': '40',
'tagName': 'FG'}]}]
I want to turn this into a dataframe like below:
Name tagList_tagID tagList_tagName
Jack 10 AB
Jack 20 BC
mike 30 DE
mike 40 FG
How can I convert this list of dictionaries to pandas dataframe in an efficient way.
Try pd.json_normalize:
lst = [{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},
{'tagId': '20', 'tagName': 'BC'}]},
{'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},
{'tagId': '40', 'tagName': 'FG'}]}]
df = pd.json_normalize(lst, record_path="tagList", meta=["name"])
#formatting to match expected output
df = df.set_index("name").add_prefix("tagList_")
>>> df
tagList_tagId tagList_tagName
name
jack 10 AB
jack 20 BC
mike 30 DE
mike 40 FG
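As a variation, json_normalize also accepts a record_prefix argument, which prefixes the columns coming from record_path without renaming the meta column. A small sketch, assuming name should stay an ordinary column rather than become the index:

```python
import pandas as pd

lst = [{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},
                                    {'tagId': '20', 'tagName': 'BC'}]},
       {'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},
                                    {'tagId': '40', 'tagName': 'FG'}]}]

# record_prefix is applied only to the fields found under record_path.
df = pd.json_normalize(lst, record_path="tagList", meta=["name"],
                       record_prefix="tagList_")

# Move the meta column to the front to match the requested layout.
df = df[["name", "tagList_tagId", "tagList_tagName"]]
print(df)
```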

How to store for loop iterations to a new data set

My code:
for video in most_disliked:
    df1 = video['id'],video['statistics']
    print(df1)
Output:
('bvyTxpY9qJM', {'viewCount': '145', 'likeCount': '3', 'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'})
('gShHA7BZNCw', {'viewCount': '36', 'likeCount': '3', 'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'})
('q7gxl8RJEv4', {'viewCount': '11', 'likeCount': '2', 'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '1'})
Expected Output:
Videoid viewcount likecount dislikecount favouritecount commentcount
bvyTxpY9qJM 145 3 0 0 0
gShHA7BZNCw 36 3 0 0 0
q7gxl8RJEv4 11 2 0 0 1
df1 = video['id'],video['statistics'] creates a tuple of two elements video['id'] and video['statistics'].
To create a dataframe from the most_disliked list, you can use this example:
df1 = pd.DataFrame([{'Videoid': video['id'], **video['statistics']} for video in most_disliked])
print(df1)
Prints:
Videoid viewCount likeCount dislikeCount favoriteCount commentCount
0 bvyTxpY9qJM 145 3 0 0 0
1 gShHA7BZNCw 36 3 0 0 0
2 q7gxl8RJEv4 11 2 0 0 1
data = [('bvyTxpY9qJM', {'viewCount': '145', 'likeCount': '3',
'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'}),
('gShHA7BZNCw', {'viewCount': '36', 'likeCount': '3',
'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'}),
('q7gxl8RJEv4', {'viewCount': '11', 'likeCount': '2',
'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '1'}),
]
most_liked = pd.DataFrame(data, columns=['id', 'stat'])
df2 = pd.merge(most_liked['id'], most_liked['stat'].apply(pd.Series),
left_index=True, right_index=True)
Output
id viewCount likeCount dislikeCount favoriteCount commentCount
0 bvyTxpY9qJM 145 3 0 0 0
1 gShHA7BZNCw 36 3 0 0 0
2 q7gxl8RJEv4 11 2 0 0 1
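Note that the statistics arrive as strings, so the resulting columns are text rather than numbers. A hedged sketch of converting them after building the frame; the shape of most_disliked below is inferred from the question's code (video['id'], video['statistics']) and is an assumption:

```python
import pandas as pd

# Assumed structure, reconstructed from the question's printed tuples.
most_disliked = [
    {'id': 'bvyTxpY9qJM',
     'statistics': {'viewCount': '145', 'likeCount': '3', 'dislikeCount': '0',
                    'favoriteCount': '0', 'commentCount': '0'}},
    {'id': 'q7gxl8RJEv4',
     'statistics': {'viewCount': '11', 'likeCount': '2', 'dislikeCount': '0',
                    'favoriteCount': '0', 'commentCount': '1'}},
]

df1 = pd.DataFrame([{'Videoid': v['id'], **v['statistics']}
                    for v in most_disliked])

# Convert every column except the id from strings to numbers.
count_cols = df1.columns.drop('Videoid')
df1[count_cols] = df1[count_cols].apply(pd.to_numeric)
print(df1.dtypes)
```

With numeric columns, sorting or summing by viewCount or dislikeCount behaves as expected instead of comparing strings.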

Python, Take Multiple Lists and Putting into pd.Dataframe

I have seen a variety of answers to this question (like this one), and have had no success in getting my lists into one dataframe. I have one header list (meant to be column headers), and then a variable that has multiple records in it:
list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = (['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m']
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m']
['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m'])
When I try something like:
df = pd.DataFrame(list(zip(['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
                           ['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'])))
It lists all the attributes as their own rows, like so:
0 1
0 1 2
1 Jack Jill
2 57.4 km 34.0 km
3 4 2
4 21.7 km 17.9 km
5 5:57 /km 5:27 /km
6 994 m 152 m
How do I get this into a frame that has list1 as the headers, and the rest of the data neatly squared away?
Given
list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = (['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'],
['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m'])
do
pd.DataFrame(list2, columns=list1)
which returns
Rank Athlete Distance Runs Longest Avg. Pace Elev. Gain
0 1 Jack 57.4 km 4 21.7 km 5:57 /km 994 m
1 2 Jill 34.0 km 2 17.9 km 5:27 /km 152 m
2 3 Kelsey 32.6 km 2 21.3 km 5:46 /km 141 m
Change your second list into a list of lists (with commas between the inner lists) and then:
df = pd.DataFrame(columns = list1, data = list2)

Converting Data frame into a dict with columns as key inside key

I have a pandas data frame.
mac_address no. of co_visit no. of random_visit
0 00:02:1a:11:b0:b9 1 2
1 00:02:71:d6:04:84 1 1
2 00:05:33:34:2f:f2 1 3
3 00:08:22:04:c4:fb 1 4
4 00:08:22:06:7b:41 1 1
5 00:08:22:07:48:15 1 1
6 00:08:22:08:a8:54 1 3
7 00:08:22:0e:0a:fc 1 1
I want to convert it into a nested dictionary with mac_address as the key, 'no. of co_visit' and 'no. of random_visit' as subkeys, and each column's value as the value of its subkey. So the output for the first 2 rows would look like this:
00:02:1a:11:b0:b9:{no. of co_visit:1, no. of random_visit: 2}
00:02:71:d6:04:84:{no. of co_visit:1, no. of random_visit: 1}
I am using python2.7. Thank you.
I was able to set mac_address as key but the values were being added as list inside key, not key inside key.
You can use pandas.DataFrame.T and to_dict().
df.set_index('mac_address').T.to_dict()
Output:
{'00:02:1a:11:b0:b9': {'no. of co_visit': '1', 'no. of random_visit': '2'},
'00:02:71:d6:04:84': {'no. of co_visit': '1', 'no. of random_visit': '1'},
'00:05:33:34:2f:f2': {'no. of co_visit': '1', 'no. of random_visit': '3'},
'00:08:22:04:c4:fb': {'no. of co_visit': '1', 'no. of random_visit': '4'},
'00:08:22:06:7b:41': {'no. of co_visit': '1', 'no. of random_visit': '1'},
'00:08:22:07:48:15': {'no. of co_visit': '1', 'no. of random_visit': '1'},
'00:08:22:08:a8:54': {'no. of co_visit': '1', 'no. of random_visit': '3'},
'00:08:22:0e:0a:fc': {'no. of co_visit': '1', 'no. of random_visit': '1'}}
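An equivalent form that skips the transpose is to_dict(orient='index'), which maps each index label to a dict of that row's column values. A minimal sketch on the first two rows, assuming the visit counts are stored as integers:

```python
import pandas as pd

# First two rows of the question's frame (integer counts assumed).
df = pd.DataFrame({
    'mac_address': ['00:02:1a:11:b0:b9', '00:02:71:d6:04:84'],
    'no. of co_visit': [1, 1],
    'no. of random_visit': [2, 1],
})

# orient='index' gives {index_label: {column: value, ...}, ...}.
result = df.set_index('mac_address').to_dict(orient='index')
print(result)
```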
