Pandas: Flatten Nested Dictionary vertically - python

I have a list of dictionaries as below:
[{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},{'tagId': '20',
'tagName': 'BC'}]},
{'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},{'tagId': '40',
'tagName': 'FG'}]}]
I want to turn this into a dataframe like below:
Name tagList_tagID tagList_tagName
Jack 10 AB
Jack 20 BC
mike 30 DE
mike 40 FG
How can I convert this list of dictionaries to a pandas DataFrame efficiently?

Try with pd.json_normalize:
import pandas as pd
lst = [{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},
{'tagId': '20', 'tagName': 'BC'}]},
{'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},
{'tagId': '40', 'tagName': 'FG'}]}]
df = pd.json_normalize(lst, record_path="tagList", meta=["name"])
#formatting to match expected output
df = df.set_index("name").add_prefix("tagList_")
>>> df
tagList_tagId tagList_tagName
name
jack 10 AB
jack 20 BC
mike 30 DE
mike 40 FG
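For comparison, the same vertical flattening can be sketched in plain Python before building the frame. This is only an illustration of what record_path and meta do here, not a replacement for pd.json_normalize; the names rows and df_flat are made up for this example.

```python
import pandas as pd

lst = [{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},
                                    {'tagId': '20', 'tagName': 'BC'}]},
       {'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},
                                    {'tagId': '40', 'tagName': 'FG'}]}]

# One output row per tag; the parent's name is repeated on every row
# and each tag key gets the "tagList_" prefix.
rows = [
    {"name": person["name"], **{f"tagList_{key}": value for key, value in tag.items()}}
    for person in lst
    for tag in person["tagList"]
]

df_flat = pd.DataFrame(rows)
print(df_flat)
```

Here name stays a regular column instead of becoming the index.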

Related

Converting to dataframe, beginner question

I have a piece of data that looks like this
my_data[:5]
returns:
[{'key': ['Aaliyah', '2', '2016'], 'values': ['10']},
{'key': ['Aaliyah', '2', '2017'], 'values': ['26']},
{'key': ['Aaliyah', '2', '2018'], 'values': ['21']},
{'key': ['Aaliyah', '2', '2019'], 'values': ['26']},
{'key': ['Aaliyah', '2', '2020'], 'values': ['15']}]
The key holds the name, gender, and year. The value is the number.
I cannot manage to build a data frame with the columns name, gender, year, and number.
Can you help me?
Here is one way, using a generator:
from itertools import chain
pd.DataFrame.from_records((dict(zip(['name', 'gender', 'year', 'number'],
chain(*e.values())))
for e in my_data))
Without itertools:
pd.DataFrame(((E:=list(e.values()))[0]+E[1] for e in my_data),
columns=['name', 'gender', 'year', 'number'])
output:
name gender year number
0 Aaliyah 2 2016 10
1 Aaliyah 2 2017 26
2 Aaliyah 2 2018 21
3 Aaliyah 2 2019 26
4 Aaliyah 2 2020 15
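Note that every value in the result is still a string. A minimal sketch of a list-comprehension variant that also converts year and number to integers, assuming each 'key' list always has three items and each 'values' list exactly one (the sample below copies the five records shown in the question):

```python
import pandas as pd

my_data = [{'key': ['Aaliyah', '2', '2016'], 'values': ['10']},
           {'key': ['Aaliyah', '2', '2017'], 'values': ['26']},
           {'key': ['Aaliyah', '2', '2018'], 'values': ['21']},
           {'key': ['Aaliyah', '2', '2019'], 'values': ['26']},
           {'key': ['Aaliyah', '2', '2020'], 'values': ['15']}]

# Concatenate the two lists of each record into one flat row.
rows = [rec['key'] + rec['values'] for rec in my_data]
df = pd.DataFrame(rows, columns=['name', 'gender', 'year', 'number'])

# The source values are strings; convert the numeric columns explicitly.
df[['year', 'number']] = df[['year', 'number']].astype(int)
print(df.dtypes)
```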

Pairwise reshaping dataframe

I am trying to build a list of graph edges from a two-column data frame representing one edge per node.
pd.DataFrame({'node': ['100', '100', '200', '200', '200'],
'edge': ['111111', '222222', '123456', '456789', '987654']})
The result should look like this
pd.DataFrame({'node': ['100', '100','200', '200', '200', '200', '200', '200'],
'edge1': ['111111','222222','123456', '123456', '456789', '456789', '987654', '987654'],
'edge2': ['222222', '111111','456789', '987654', '987654', '123456' , '123456','456789']})
I have been wrestling with pivot table and stack for a while but no success.
You can use itertools.permutations to get the permutations of the edges after groupby, then convert the output to a new df to generate the desired output:
import pandas as pd
from itertools import permutations
df = pd.DataFrame({'node': ['100', '100', '200', '200', '200'],'edge': ['111111', '222222', '123456', '456789', '987654']})
df = df.groupby('node')['edge'].apply(list).apply(lambda x:list(permutations(x, 2))).reset_index().explode('edge')
pd.DataFrame(df["edge"].to_list(), index=df['node'], columns=['edge1', 'edge2']).reset_index()
Result:
  node   edge1   edge2
0  100  111111  222222
1  100  222222  111111
2  200  123456  456789
3  200  123456  987654
4  200  456789  123456
5  200  456789  987654
6  200  987654  123456
7  200  987654  456789
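A self-merge is another way to get the same ordered pairs, swapping itertools.permutations for a join of the frame with itself on node. This is a sketch that assumes edges are unique within each node, because the filter that drops self-pairs would also drop pairs of duplicated edges; the name pairs is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'node': ['100', '100', '200', '200', '200'],
                   'edge': ['111111', '222222', '123456', '456789', '987654']})

# Joining the frame with itself on 'node' yields every (edge, edge)
# combination within a node; the suffixes name the two edge columns.
pairs = df.merge(df, on='node', suffixes=('1', '2'))

# Drop the rows where an edge was paired with itself.
pairs = pairs[pairs['edge1'] != pairs['edge2']].reset_index(drop=True)
print(pairs)
```

The intermediate frame grows with the square of the group size, so this may be costly for nodes with many edges.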

Python, Take Multiple Lists and Putting into pd.Dataframe

I have seen a variety of answers to this question (like this one), and have had no success in getting my lists into one dataframe. I have one header list (meant to be column headers), and then a variable that has multiple records in it:
list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = (['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m']
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m']
['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m'])
When I try something like:
df = pd.DataFrame(list(zip(['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'])))
It lists all the attributes as their own rows, like so:
0 1
0 1 2
1 Jack Jill
2 57.4 km 34.0 km
3 4 2
4 21.7 km 17.9 km
5 5:57 /km 5:27 /km
6 994 m 152 m
How do I get this into a frame that has list1 as the headers, and the rest of the data neatly squared away?
Given
list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = (['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'],
['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m'])
do
pd.DataFrame(list2, columns=list1)
which returns
Rank Athlete Distance Runs Longest Avg. Pace Elev. Gain
0 1 Jack 57.4 km 4 21.7 km 5:57 /km 994 m
1 2 Jill 34.0 km 2 17.9 km 5:27 /km 152 m
2 3 Kelsey 32.6 km 2 21.3 km 5:46 /km 141 m
Alternatively, change your second list into a list of lists and then:
df = pd.DataFrame(columns = list1, data = list2)
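To see why the zip attempt in the question came out sideways, here is a small sketch comparing the two constructions. The variable names transposed, row_wise, and records are made up for this example.

```python
import pandas as pd

list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = [['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
         ['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'],
         ['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m']]

# zip(*list2) regroups the values by position, which transposes the data:
# each athlete becomes a column instead of a row.
transposed = pd.DataFrame(list(zip(*list2)))

# Passing the rows directly keeps one athlete per row.
row_wise = pd.DataFrame(list2, columns=list1)

# Zipping each row with the headers gives the same frame via dicts.
records = pd.DataFrame([dict(zip(list1, row)) for row in list2])

print(transposed.shape, row_wise.shape)
```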

Pandas clean messy data [closed]

I am working on some real-time data about people, and the age column of the DataFrame is really messy.
I want the expected output to be age_bins with the bin edges [0,10,20,30,40,50,60,70,80,90,100].
What's the best way to clean messy data like this?
df = pd.DataFrame({'Age':['23', '64', '71', '53', '40', '45', '30-39', '50-59', '60-69',
'30', '65', '44', '8-68', '21-72', '26', '36', '43', '70', '52',
'66', '27', '17', '51', '68', '35', '28', '58', '33', '31', '50',
'24', '88', '29', '21', '78', '60', '63', '37', '32', '49',
'20-29', '47', '18-99', '41', '39', '42', '38', '7', '40-49', '82',
'61', '34-66', '62', '40-89', '80-89', '55', '0.25', '13-19', '69',
'16', '8', '10', '25', '34', '55-74', '75-', '70-79', '79',
'35-54', '55-', '95', '54', '40-50', '46', '48', '57', '56']})
You can remove a possible trailing - with Series.str.strip, split the values into 2 columns with Series.str.split, and use cut on each column:
df1 = df['Age'].str.strip('-').str.split('-', expand=True).astype(float)
bins = [0,10,20,30,40,50,60,70,80,90,100]
labels = ['{}-{}'.format(i, j-1) for i, j in zip(bins[:-1], bins[1:])]
g1 = pd.cut(df1[0], bins=bins, right=False, labels=labels)
g2 = pd.cut(df1[1], bins=bins, right=False, labels=labels)
Then compare the two results. Missing values in the second Series are first filled from the first one, so single ages always match. Series.mask then creates the new column, leaving NaN wherever the two bins differ:
df['age_bins'] = g1.mask(g1.ne(g2.fillna(g1)))
print (df)
Age age_bins
0 23 20-29
1 64 60-69
2 71 70-79
3 53 50-59
4 40 40-49
.. ... ...
72 40-50 NaN
73 46 40-49
74 48 40-49
75 57 50-59
76 56 50-59
[77 rows x 2 columns]
Not matched values:
df1 = df[df['age_bins'].isna()]
print (df1)
Age age_bins
12 8-68 NaN
13 21-72 NaN
42 18-99 NaN
51 34-66 NaN
53 40-89 NaN
64 55-74 NaN
68 35-54 NaN
72 40-50 NaN
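The same rule (keep a bin only when both ends of a range fall into the same decade) can also be sketched as a plain function applied per value. This swaps pd.cut for integer division and is only an illustration; age_to_bin is a hypothetical helper, and the sample values are taken from the question's data.

```python
import pandas as pd

def age_to_bin(text, width=10):
    """Return a label like '20-29', or None when the range spans several bins."""
    # '75-' -> ['75'], '30-39' -> ['30', '39'], '23' -> ['23']
    parts = [p for p in text.strip('-').split('-') if p]
    decades = {int(float(p) // width) for p in parts}
    if len(decades) != 1:
        return None  # e.g. '8-68' covers more than one bin
    low = decades.pop() * width
    if not 0 <= low < 100:
        return None  # outside the bin edges 0..100 used above
    return f"{low}-{low + width - 1}"

sample = pd.Series(['23', '30-39', '75-', '8-68', '0.25'])
binned = sample.map(age_to_bin)
print(binned)
```

Unlike pd.cut, this returns plain strings rather than an ordered categorical.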

Python error when getting absolute value from a cell while using function

I have the code below. It is part of a bigger program, and I am providing only a snippet to show the problem. When I run it, I get the error AttributeError: 'str' object has no attribute 'values'. df['URL'].values[0] runs fine. I want to copy the text values from the URL field into a new field called pdf_text, one value at a time, so I am using a function. In my real code, I take values from the URL column, open those files, and do further processing.
sales = [{'account': 'credit cards', 'Jan': '150 jones', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf'},
{'account': '1', 'Jan': 'Jones', 'Feb': '210', 'URL': ''},
{'account': '1', 'Jan': '50', 'Feb': '90', 'URL': 'ea2017-104.pdf' }]
df = pd.DataFrame(sales)
def pdf2text(url):
    url = url.values[0]
    return url

abc = df.assign(pdf_text=df['URL'].apply(pdf2text))
You just want the name of the PDF without the file extension?
>>> import pandas as pd
>>> sales = [{'account': 'credit cards', 'Jan': '150 jones', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf'},
... {'account': '1', 'Jan': 'Jones', 'Feb': '210', 'URL': ''},
... {'account': '1', 'Jan': '50', 'Feb': '90', 'URL': 'ea2017-104.pdf' }]
>>>
>>> df = pd.DataFrame(sales)
>>> df.head()
Feb Jan URL account
0 200 .jones 150 jones ea2018-001.pdf credit cards
1 210 Jones 1
2 90 50 ea2017-104.pdf 1
>>> df['your_column'] = df.URL.map(lambda x: x.split(".")[0])
>>> df.head()
Feb Jan URL account your_column
0 200 .jones 150 jones ea2018-001.pdf credit cards ea2018-001
1 210 Jones 1
2 90 50 ea2017-104.pdf 1 ea2017-104
>>>
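One caveat: x.split(".")[0] keeps only the text before the first dot, so a file name that contains extra dots would be cut short. A small sketch using os.path.splitext instead, which removes only the final extension; the third file name below is a made-up example, not from the question.

```python
import os
import pandas as pd

# 'report.v2.final.pdf' is a hypothetical name with dots inside the stem.
urls = pd.Series(['ea2018-001.pdf', '', 'report.v2.final.pdf'])

by_split = urls.map(lambda x: x.split(".")[0])               # text before the first dot
by_splitext = urls.map(lambda x: os.path.splitext(x)[0])     # drops only the last extension

print(by_split.tolist())
print(by_splitext.tolist())
```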
It raises AttributeError because url is a string (not the whole Series), and you are trying to read the values attribute of a str object.
When you use apply on a Series, your function pdf2text receives one PDF file name per call as its argument.
df['URL'] = df['URL'].apply(pdf2text)
is equivalent to
urls = []
for url in df['URL']:
    # `url` equals something like this -> 'ea2018-001.pdf'
    urls.append(pdf2text(url))
df['URL'] = pd.Series(urls)
but the explicit loop is slower and less efficient.
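Since apply hands each cell to the function as a plain string, the function from the question can use url directly. A minimal sketch, where the body of pdf2text is only a placeholder; the real code would open the file and extract its text there.

```python
import pandas as pd

sales = [{'account': 'credit cards', 'Jan': '150 jones', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf'},
         {'account': '1', 'Jan': 'Jones', 'Feb': '210', 'URL': ''},
         {'account': '1', 'Jan': '50', 'Feb': '90', 'URL': 'ea2017-104.pdf'}]
df = pd.DataFrame(sales)

def pdf2text(url):
    # `url` is already a single str such as 'ea2018-001.pdf',
    # so there is no .values attribute to index into.
    # Placeholder: the real processing of the file would go here.
    return url

abc = df.assign(pdf_text=df['URL'].apply(pdf2text))
print(abc[['URL', 'pdf_text']])
```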
