Python: Take Multiple Lists and Put Them into a pd.DataFrame

I have seen a variety of answers to this question (like this one), and have had no success in getting my lists into one dataframe. I have one header list (meant to be column headers), and then a variable that has multiple records in it:
list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = (['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m']
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m']
['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m'])
When I try something like:
df = pd.DataFrame(list(zip(['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
                           ['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'])))
It lists all the attributes as their own rows, like so:
          0         1
0         1         2
1      Jack      Jill
2   57.4 km   34.0 km
3         4         2
4   21.7 km   17.9 km
5  5:57 /km  5:27 /km
6     994 m     152 m
How do I get this into a frame that has list1 as the headers, and the rest of the data neatly squared away?

Given
list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = (['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'],
['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m'])
do
pd.DataFrame(list2, columns=list1)
which returns
   Rank  Athlete  Distance  Runs  Longest  Avg. Pace  Elev. Gain
0     1     Jack   57.4 km     4  21.7 km   5:57 /km       994 m
1     2     Jill   34.0 km     2  17.9 km   5:27 /km       152 m
2     3   Kelsey   32.6 km     2  21.3 km   5:46 /km       141 m

Change your second list into a list of lists (separate the inner lists with commas), then pass it as data:
df = pd.DataFrame(columns = list1, data = list2)
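To make the difference between the two approaches concrete, here is a minimal sketch using the data from the question. It assumes list2 is written as a proper list of lists; the name transposed is only illustrative.

```python
import pandas as pd

list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = [['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
         ['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'],
         ['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m']]

# Each inner list is treated as one row; list1 supplies the column labels.
df = pd.DataFrame(list2, columns=list1)

# zip(*list2) transposes the records, so every attribute becomes its own row.
# This is the shape the attempt in the question ended up with.
transposed = pd.DataFrame(list(zip(*list2)))

print(df.shape)          # one row per athlete
print(transposed.shape)  # one row per attribute
```

Passing the records directly, without zip, is what keeps one athlete per row.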

Related

Converting to dataframe, beginner question

I have a piece of data that looks like this
my_data[:5]
returns:
[{'key': ['Aaliyah', '2', '2016'], 'values': ['10']},
{'key': ['Aaliyah', '2', '2017'], 'values': ['26']},
{'key': ['Aaliyah', '2', '2018'], 'values': ['21']},
{'key': ['Aaliyah', '2', '2019'], 'values': ['26']},
{'key': ['Aaliyah', '2', '2020'], 'values': ['15']}]
The key holds the name, gender, and year. The value is a number.
I can't manage to generate a data frame with the columns name, gender, year, and number.
Can you help me?
Here is one way, using a generator:
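import pandas as pd
from itertools import chain
# chain(*e.values()) concatenates the 'key' list and the 'values' list of each record
pd.DataFrame.from_records(
    dict(zip(['name', 'gender', 'year', 'number'], chain(*e.values())))
    for e in my_data
)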
Without itertools:
pd.DataFrame(((E:=list(e.values()))[0]+E[1] for e in my_data),
columns=['name', 'gender', 'year', 'number'])
output:
name gender year number
0 Aaliyah 2 2016 10
1 Aaliyah 2 2017 26
2 Aaliyah 2 2018 21
3 Aaliyah 2 2019 26
4 Aaliyah 2 2020 15
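For comparison, the same flattening can be written with a plain list comprehension. This is only a sketch: it assumes every record has exactly the 'key' and 'values' entries shown in the question, and the integer conversion of year and number is an optional extra step that the question did not ask for.

```python
import pandas as pd

my_data = [{'key': ['Aaliyah', '2', '2016'], 'values': ['10']},
           {'key': ['Aaliyah', '2', '2017'], 'values': ['26']},
           {'key': ['Aaliyah', '2', '2018'], 'values': ['21']},
           {'key': ['Aaliyah', '2', '2019'], 'values': ['26']},
           {'key': ['Aaliyah', '2', '2020'], 'values': ['15']}]

# Concatenate the two lists of each record into one flat row.
rows = [d['key'] + d['values'] for d in my_data]
df = pd.DataFrame(rows, columns=['name', 'gender', 'year', 'number'])

# Optional: the source values are strings, so convert the numeric columns.
df = df.astype({'year': int, 'number': int})
print(df)
```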

Pandas: Flatten Nested Dictionary vertically

I have a list of dictionaries as below:
[{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},{'tagId': '20',
'tagName': 'BC'}]},
{'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},{'tagId': '40',
'tagName': 'FG'}]}]
I want to turn this into a dataframe like below:
Name tagList_tagID tagList_tagName
Jack 10 AB
Jack 20 BC
mike 30 DE
mike 40 FG
How can I convert this list of dictionaries to a pandas dataframe in an efficient way?
Try pd.json_normalize:
lst = [{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},
{'tagId': '20', 'tagName': 'BC'}]},
{'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},
{'tagId': '40', 'tagName': 'FG'}]}]
df = pd.json_normalize(lst, record_path="tagList", meta=["name"])
#formatting to match expected output
df = df.set_index("name").add_prefix("tagList_")
>>> df
tagList_tagId tagList_tagName
name
jack 10 AB
jack 20 BC
mike 30 DE
mike 40 FG
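If name should stay an ordinary column rather than become the index, the record_prefix parameter of pd.json_normalize can add the prefix during flattening. A minimal sketch on the same data:

```python
import pandas as pd

lst = [{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},
                                    {'tagId': '20', 'tagName': 'BC'}]},
       {'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},
                                    {'tagId': '40', 'tagName': 'FG'}]}]

# record_prefix is applied to the columns that come from the tagList records;
# the meta column 'name' is left unprefixed.
df = pd.json_normalize(lst, record_path='tagList', meta=['name'],
                       record_prefix='tagList_')

# Put the meta column first to match the layout asked for in the question.
df = df[['name', 'tagList_tagId', 'tagList_tagName']]
print(df)
```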

Pairwise reshaping dataframe

I am trying to build a list of graph edges from a two-column data frame representing one edge per node.
pd.DataFrame({'node': ['100', '100', '200', '200', '200'],
'edge': ['111111', '222222', '123456', '456789', '987654']})
The result should look like this
pd.DataFrame({'node': ['100', '100','200', '200', '200', '200', '200', '200'],
'edge1': ['111111','222222','123456', '123456', '456789', '456789', '987654', '987654'],
'edge2': ['222222', '111111','456789', '987654', '987654', '123456' , '123456','456789']})
I have been wrestling with pivot table and stack for a while but no success.
You can use itertools.permutations to get the permutations of the edges after groupby, then convert the output to a new df to generate the desired output:
import pandas as pd
from itertools import permutations
df = pd.DataFrame({'node': ['100', '100', '200', '200', '200'],'edge': ['111111', '222222', '123456', '456789', '987654']})
df = df.groupby('node')['edge'].apply(list).apply(lambda x:list(permutations(x, 2))).reset_index().explode('edge')
pd.DataFrame(df["edge"].to_list(), index=df['node'], columns=['edge1', 'edge2']).reset_index()
Result:
   node   edge1   edge2
0   100  111111  222222
1   100  222222  111111
2   200  123456  456789
3   200  123456  987654
4   200  456789  123456
5   200  456789  987654
6   200  987654  123456
7   200  987654  456789
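A different technique that yields the same pairs is a self-merge on node followed by dropping the rows where an edge is paired with itself. This is a sketch rather than the approach used in the answer above, and it assumes edges are unique within each node; the name pairs is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'node': ['100', '100', '200', '200', '200'],
                   'edge': ['111111', '222222', '123456', '456789', '987654']})

# Joining the frame with itself on 'node' produces every ordered pair of edges
# per node; the suffixes turn the two 'edge' columns into 'edge1' and 'edge2'.
pairs = df.merge(df, on='node', suffixes=('1', '2'))

# Remove the self-pairs (an edge matched with itself).
pairs = pairs[pairs['edge1'] != pairs['edge2']].reset_index(drop=True)
print(pairs)
```

The self-merge builds all n * n combinations per node before filtering, so for nodes with very many edges the permutations approach may use less memory.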

How to store for loop iterations to a new data set

My code:
for video in most_disliked:
    df1 = video['id'], video['statistics']
    print(df1)
Output:
('bvyTxpY9qJM', {'viewCount': '145', 'likeCount': '3', 'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'})
('gShHA7BZNCw', {'viewCount': '36', 'likeCount': '3', 'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'})
('q7gxl8RJEv4', {'viewCount': '11', 'likeCount': '2', 'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '1'})
Expected Output:
Videoid viewcount likecount dislikecount favouritecount commentcount
bvyTxpY9qJM 145 3 0 0 0
gShHA7BZNCw 36 3 0 0 0
q7gxl8RJEv4 11 2 0 0 1
df1 = video['id'],video['statistics'] creates a tuple of two elements video['id'] and video['statistics'].
To create a dataframe from the most_disliked list, you can use this example:
df1 = pd.DataFrame([{'Videoid': video['id'], **video['statistics']} for video in most_disliked])
print(df1)
Prints:
Videoid viewCount likeCount dislikeCount favoriteCount commentCount
0 bvyTxpY9qJM 145 3 0 0 0
1 gShHA7BZNCw 36 3 0 0 0
2 q7gxl8RJEv4 11 2 0 0 1
data = [('bvyTxpY9qJM', {'viewCount': '145', 'likeCount': '3',
'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'}),
('gShHA7BZNCw', {'viewCount': '36', 'likeCount': '3',
'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'}),
('q7gxl8RJEv4', {'viewCount': '11', 'likeCount': '2',
'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '1'}),
]
most_liked = pd.DataFrame(data, columns=['id', 'stat'])
df2 = pd.merge(most_liked['id'], most_liked['stat'].apply(pd.Series),
left_index=True, right_index=True)
Output
id viewCount likeCount dislikeCount favoriteCount commentCount
0 bvyTxpY9qJM 145 3 0 0 0
1 gShHA7BZNCw 36 3 0 0 0
2 q7gxl8RJEv4 11 2 0 0 1
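Another option is pd.json_normalize, which flattens the nested statistics dictionary directly. The sketch below assumes most_disliked is a list of dicts with 'id' and 'statistics' keys, as the loop in the question suggests; the sample data is rebuilt from the printed output, and the integer conversion is an optional extra.

```python
import pandas as pd

# Assumed shape of most_disliked, reconstructed from the question's output.
most_disliked = [
    {'id': 'bvyTxpY9qJM', 'statistics': {'viewCount': '145', 'likeCount': '3',
     'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'}},
    {'id': 'gShHA7BZNCw', 'statistics': {'viewCount': '36', 'likeCount': '3',
     'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'}},
    {'id': 'q7gxl8RJEv4', 'statistics': {'viewCount': '11', 'likeCount': '2',
     'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '1'}},
]

# Nested keys become columns named like 'statistics.viewCount'.
df = pd.json_normalize(most_disliked)

# Drop the 'statistics.' prefix and rename the id column.
df.columns = [c.split('.')[-1] for c in df.columns]
df = df.rename(columns={'id': 'Videoid'})

# Optional: the counts are strings in the source, so convert them to integers.
stat_cols = [c for c in df.columns if c != 'Videoid']
df = df.astype({c: int for c in stat_cols})
print(df)
```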

Pandas clean messy data [closed]

I am working on some real-time data about people, and the Age column of the DataFrame is really messy.
I want the expected output to be an age_bins column using the bin edges [0,10,20,30,40,50,60,70,80,90,100].
What's the best way to clean messy data like this?
df = pd.DataFrame({'Age':['23', '64', '71', '53', '40', '45', '30-39', '50-59', '60-69',
'30', '65', '44', '8-68', '21-72', '26', '36', '43', '70', '52',
'66', '27', '17', '51', '68', '35', '28', '58', '33', '31', '50',
'24', '88', '29', '21', '78', '60', '63', '37', '32', '49',
'20-29', '47', '18-99', '41', '39', '42', '38', '7', '40-49', '82',
'61', '34-66', '62', '40-89', '80-89', '55', '0.25', '13-19', '69',
'16', '8', '10', '25', '34', '55-74', '75-', '70-79', '79',
'35-54', '55-', '95', '54', '40-50', '46', '48', '57', '56']})
First remove any trailing - with Series.str.strip, then split the values into two columns with Series.str.split, and apply cut to each column:
df1 = df['Age'].str.strip('-').str.split('-', expand=True).astype(float)
bins = [0,10,20,30,40,50,60,70,80,90,100]
labels = ['{}-{}'.format(i, j-1) for i, j in zip(bins[:-1], bins[1:])]
g1 = pd.cut(df1[0], bins=bins, right=False, labels=labels)
g2 = pd.cut(df1[1], bins=bins, right=False, labels=labels)
Then compare the two results. Missing values in g2 (rows with a single age) are first filled from g1, and Series.mask then sets the new column to NaN wherever the two bins differ:
df['age_bins'] = g1.mask(g1.ne(g2.fillna(g1)))
print (df)
Age age_bins
0 23 20-29
1 64 60-69
2 71 70-79
3 53 50-59
4 40 40-49
.. ... ...
72 40-50 NaN
73 46 40-49
74 48 40-49
75 57 50-59
76 56 50-59
[77 rows x 2 columns]
Not matched values:
df1 = df[df['age_bins'].isna()]
print (df1)
Age age_bins
12 8-68 NaN
13 21-72 NaN
42 18-99 NaN
51 34-66 NaN
53 40-89 NaN
64 55-74 NaN
68 35-54 NaN
72 40-50 NaN
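To see how the split, cut, and mask steps interact, here is the same logic run on a handful of values picked from the question's data. It is a small sketch; the names ages, parts, low, and high are illustrative.

```python
import pandas as pd

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = [f'{lo}-{hi - 1}' for lo, hi in zip(bins[:-1], bins[1:])]

ages = pd.Series(['23', '30-39', '40-50', '75-', '0.25'])

# '75-' loses its trailing dash; single ages get NaN in the second column.
parts = ages.str.strip('-').str.split('-', expand=True).astype(float)

# right=False makes each bin left-inclusive, e.g. 40 falls in '40-49'.
low = pd.cut(parts[0], bins=bins, right=False, labels=labels)
high = pd.cut(parts[1], bins=bins, right=False, labels=labels)

# Keep the bin only when both ends of a range land in the same bin.
age_bins = low.mask(low.ne(high.fillna(low)))
print(pd.DataFrame({'Age': ages, 'age_bins': age_bins}))
```

Here '40-50' ends up as NaN because 40 falls in 40-49 while 50 falls in 50-59, which is why that row appears in the unmatched list.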
