Splitting at specific string from a dataframe column in Python

I have a dataframe with a column called "Spl" containing the values below. I am trying to extract the values that follow each 'name': string (some rows have multiple values), but the new column ends up holding a function object and its memory address instead. I used the code below. Any help with extracting the values after the 'name': string is much appreciated.
Column values:
'name': 'Chirotherapie', 'name': 'Innen Medizin'
'name': 'Manuelle Medizin'
'name': 'Akupunktur', 'name': 'Chirotherapie', 'name': 'Innen Medizin'
Code:
df['Spl'] = lambda x: len(x['Spl'].str.split("'name':"))
Output:
<function <lambda> at 0x0000027BF8F68940>

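Call the string methods on the column directly instead of assigning a lambda (this gives the number of pieces after splitting):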
df['Spl']=df['Spl'].str.split("'name':").str.len()

Or count the occurrences directly:
df['Spl'] = df['Spl'].str.count("'name':")+1
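If the goal is the values themselves rather than a count, one option is a regular expression with str.findall. This is only a minimal sketch: it assumes every value is wrapped in single quotes and contains no embedded quote, and the column name "names" is made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({"Spl": [
    "'name': 'Chirotherapie', 'name': 'Innen Medizin'",
    "'name': 'Manuelle Medizin'",
    "'name': 'Akupunktur', 'name': 'Chirotherapie', 'name': 'Innen Medizin'",
]})

# Capture the quoted text that follows each 'name': marker.
# With a single capture group, findall returns a list of the captured strings per row.
df["names"] = df["Spl"].str.findall(r"'name':\s*'([^']*)'")

print(df["names"].tolist())
```

Each cell of "names" then holds a Python list, which can be expanded further with df.explode("names") if one row per value is preferred.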

Related

(Python) How to only keep specific part of cells in a dataframe

I want to clean up a source column of my dataframe. In the end I only want to keep the part after 'name'.
What is the best way to do this?
For example:
row 1, column 1:
{'id': 'rtl-nieuws', 'name': 'RTL Nieuws'}
row 2, column 1:
{'id': 'none', 'name': 'www.ad.nl'}
Desired outcome:
row 1, column 1:
RTL Nieuws
row 2, column 1:
www.ad.nl

Is this what you are trying to do? In the future, please consider providing a working example to build the solution from.
import pandas as pd

data = pd.DataFrame({
    "id": ["rtl-nieuws", "none"],
    "name": ["RTL Nieuws", "www.ad.nl"]
}, index=[0, 1])
data = data.drop("id", axis=1)  # drop returns a new DataFrame, so assign it back
print(data)
#          name
# 0  RTL Nieuws
# 1   www.ad.nl

Since each cell appears to hold the string form of a dictionary, you can parse it with ast.literal_eval() and read the value at the 'name' key.
import ast
current_cell = "{'id': 'rtl-nieuws', 'name': 'RTL Nieuws'}"
name = ast.literal_eval(current_cell)['name']
print(name)
# RTL Nieuws
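To run the same idea over a whole column, apply can call the parser once per cell. The sketch below assumes the cells are strings (not already dicts) and uses a hypothetical column name "source"; neither detail is confirmed by the question.

```python
import ast

import pandas as pd

# Hypothetical frame: each cell is the string form of a dict
df = pd.DataFrame({"source": [
    "{'id': 'rtl-nieuws', 'name': 'RTL Nieuws'}",
    "{'id': 'none', 'name': 'www.ad.nl'}",
]})

# Parse each cell safely and keep only the value stored under 'name'
df["source"] = df["source"].apply(lambda cell: ast.literal_eval(cell)["name"])

print(df)
```

If the cells already hold real dict objects, the literal_eval step is unnecessary and df["source"].str["name"] would do the lookup directly.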

Parsing nested dictionary to dataframe

I am trying to create a data frame from a JSON file.
Each album_details value holds a nested dict like this:
{'api_path': '/albums/491200',
'artist': {'api_path': '/artists/1421',
'header_image_url': 'https://images.genius.com/f3a1149475f2406582e3531041680a3c.1000x800x1.jpg',
'id': 1421,
'image_url': 'https://images.genius.com/25d8a9c93ab97e9e6d5d1d9d36e64a53.1000x1000x1.jpg',
'iq': 46112,
'is_meme_verified': True,
'is_verified': True,
'name': 'Kendrick Lamar',
'url': 'https://genius.com/artists/Kendrick-lamar'},
'cover_art_url': 'https://images.genius.com/1efc5de2af228d2e49d91bd0dac4dc49.1000x1000x1.jpg',
'full_title': 'good kid, m.A.A.d city (Deluxe Version) by Kendrick Lamar',
'id': 491200,
'name': 'good kid, m.A.A.d city (Deluxe Version)',
'url': 'https://genius.com/albums/Kendrick-lamar/Good-kid-m-a-a-d-city-deluxe-version'}
I want to create another column in the data frame with just the album name, which is one of the entries in the dict above:
'name': 'good kid, m.A.A.d city (Deluxe Version)',
I have been trying to work out how to do this for a long time. Can someone please help me? Thanks.

If that is the case, use the .str accessor to look up the dict key:
df['name'] = df['album_details'].str['name']

If you have the dataframe stored in the df variable you could do:
df['artist_name'] = [x['artist']['name'] for x in df['album_details'].values]

You can use apply with lambda function:
df['album_name'] = df['album_details'].apply(lambda d: d['name'])
Basically you execute the lambda function for each value of the column 'album_details'. Note that the argument 'd' in the function is the album dictionary. Apply returns a series of the function return values and this you can set to a new column.
See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
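As a rough check of the suggestions above, the sketch below builds a one-row frame from a trimmed copy of the dict in the question and applies both the .str lookup and apply. The trimmed dict and the column names "album_name" and "artist_name" are chosen for illustration only.

```python
import pandas as pd

# Trimmed version of the nested dict from the question
album = {
    "id": 491200,
    "name": "good kid, m.A.A.d city (Deluxe Version)",
    "artist": {"id": 1421, "name": "Kendrick Lamar"},
}
df = pd.DataFrame({"album_details": [album]})

# Top-level key: the .str accessor looks the key up in each dict
df["album_name"] = df["album_details"].str["name"]

# Nested key: apply lets each dict be indexed as deeply as needed
df["artist_name"] = df["album_details"].apply(lambda d: d["artist"]["name"])

print(df[["album_name", "artist_name"]])
```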

Pandas json_normalize and JSON flattening error

A pandas newbie here who is struggling to understand why I'm unable to completely flatten a JSON I receive from an API. I need a DataFrame with all the data returned by the API, but I need all nested data to be expanded and given its own columns so that I can use it.
The JSON I receive is as follows:
[
{
"query":{
"id":"1596487766859-3594dfce3973bc19",
"name":"test"
},
"webPage":{
"inLanguages":[
{
"code":"en"
}
]
},
"product":{
"name":"Test",
"description":"Test2",
"mainImage":"image1.jpg",
"images":[
"image2.jpg",
"image3.jpg"
],
"offers":[
{
"price":"45.0",
"currency":"€"
}
],
"probability":0.9552192
}
}
]
Running pd.json_normalize(data) without any additional parameters shows the nested values price and currency in the product.offers column. When I try to separate these out into their own columns with the following:
pd.json_normalize(data,record_path=['product',meta['product',['offers']]])
I end up with the following error:
f"{js} has non list value {result} for path {spec}. "
Any help would be much appreciated.

I've used this technique a few times:
Do an initial pd.json_normalize() to discover the column names.
Build the meta parameter by inspecting those names and the original JSON. NB: an index out of range is possible here.
Only one list can drive the record_path parameter per call.
A few tricks: product/images is a list of plain strings, so its column gets named 0; rename it.
To combine the two data frames produced from the different lists, I did a Cartesian product via merge. It's not very stable.
import pandas as pd

data = [{'query': {'id': '1596487766859-3594dfce3973bc19', 'name': 'test'},
         'webPage': {'inLanguages': [{'code': 'en'}]},
         'product': {'name': 'Test',
                     'description': 'Test2',
                     'mainImage': 'image1.jpg',
                     'images': ['image2.jpg', 'image3.jpg'],
                     'offers': [{'price': '45.0', 'currency': '€'}],
                     'probability': 0.9552192}}]
# build default to get column names
df = pd.json_normalize(data)
# from the column names, build the list of key paths passed to the meta param
mymeta = [c.split(".") for c in df.columns]
# exclude lists from meta - this lookup will fail if a path has fewer than two levels
mymeta = [l for l in mymeta if not isinstance(data[0][l[0]][l[1]], list)]
# you can build df from either of the product lists NOT both
df1 = pd.json_normalize(data, record_path=[["product","offers"]], meta=mymeta)
df2 = pd.json_normalize(data, record_path=[["product","images"]], meta=mymeta).rename(columns={0:"image"})
# want them together - you can merge them. note columns heavily overlap so remove most columns from df2
df1.assign(foo=1).merge(
df2.assign(foo=1).drop(columns=[c for c in df2.columns if c!="image"]), on="foo").drop(columns="foo")
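For comparison, here is a trimmed sketch of the same idea with the meta paths written out by hand and the two frames combined with merge(how="cross"), which assumes pandas 1.2 or later. The names meta, offers, images and combined are illustrative, and the input is cut down to the fields actually used.

```python
import pandas as pd

# Cut-down copy of the JSON from the question
data = [{
    "query": {"id": "1596487766859-3594dfce3973bc19", "name": "test"},
    "product": {
        "name": "Test",
        "images": ["image2.jpg", "image3.jpg"],
        "offers": [{"price": "45.0", "currency": "€"}],
    },
}]

# Hand-written meta paths; each inner list is one path of keys
meta = [["query", "id"], ["product", "name"]]

# One call per list: the whole key path is passed as a single nested entry
offers = pd.json_normalize(data, record_path=[["product", "offers"]], meta=meta)
images = pd.json_normalize(data, record_path=[["product", "images"]]).rename(
    columns={0: "image"}
)

# Cartesian product of the two frames, without a helper key column
combined = offers.merge(images, how="cross")
print(combined)
```

The cross merge pairs every offer with every image, so it is only meaningful when the input holds a single product, as in the question.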

Parse one string column in dataframe column into many other columns

I have a column in a pandas data frame that contains strings in the following format, for example:
fullyRandom=true+mapSizeDividedBy64=51048
mapSizeDividedBy16000=9756+fullyRandom=false
qType=MpmcArrayQueue+qCapacity=822398+burstSize=664
count=11087+mySeed=2+maxLength=9490
capacity=27281
capacity=79882
The first row, for example, holds 2 parameters separated by '+'. Each parameter has a value, and '=' separates the parameter name from its value.
As output, I'm looking for a Python script that extracts the parameters and returns a list of the unique parameter names, like the following:
[fullyRandom,mapSizeDividedBy64,mapSizeDividedBy64,qType,qCapacity,qCapacity, count,mySeed,maxLength,Capacity]
Notice that the list above contains only the unique parameter names, without their values.
Alternatively, if it's not too difficult, I would like an extended pandas data frame: parse the column into many columns, one per parameter, each storing that parameter's value.

Try this, it will store the values in a list.
data = []
with open('<your text file>', 'r') as file:
    content = file.readlines()
for row in content:
    if '+' in row:
        sub_row = row.strip('\n').split('+')
        for r in sub_row:
            data.append(r)
    else:
        data.append(row.strip('\n'))
print(data)
Output:
['fullyRandom=true', 'mapSizeDividedBy64=51048', 'mapSizeDividedBy16000=9756', 'fullyRandom=false', 'qType=MpmcArrayQueue', 'qCapacity=822398', 'burstSize=664', 'count=11087', 'mySeed=2', 'maxLength=9490', 'capacity=27281', 'capacity=79882']
To convert this to a list of dicts that could be used in pandas:
dict_list = []
for item in data:
    # split once on '=' into the parameter name and its value
    key, value = item.split('=', 1)
    dict_list.append({key: value})
print(dict_list)
Output:
[{'fullyRandom': 'true'}, {'mapSizeDividedBy64': '51048'}, {'mapSizeDividedBy16000': '9756'}, {'fullyRandom': 'false'}, {'qType': 'MpmcArrayQueue'}, {'qCapacity': '822398'}, {'burstSize': '664'}, {'count': '11087'}, {'mySeed': '2'}, {'maxLength': '9490'}, {'capacity': '27281'}, {'capacity': '79882'}]
To get just the headers, keep only the part before '=':
headers = [item.split('=')[0] for item in data]
print(headers)
Output:
['fullyRandom', 'mapSizeDividedBy64', 'mapSizeDividedBy16000', 'fullyRandom', 'qType', 'qCapacity', 'burstSize', 'count', 'mySeed', 'maxLength', 'capacity', 'capacity']
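If the strings already sit in a DataFrame column, the same splitting can be done per cell and expanded into one column per parameter. This is a sketch under assumptions: the column name "params" and the helper parse_params are hypothetical, and every pair is assumed to contain exactly one '=' separator between name and value.

```python
import pandas as pd

# Sample rows copied from the question, in a hypothetical column "params"
df = pd.DataFrame({"params": [
    "fullyRandom=true+mapSizeDividedBy64=51048",
    "mapSizeDividedBy16000=9756+fullyRandom=false",
    "qType=MpmcArrayQueue+qCapacity=822398+burstSize=664",
    "capacity=27281",
]})


def parse_params(cell):
    """Turn 'a=1+b=2' into {'a': '1', 'b': '2'}."""
    return dict(pair.split("=", 1) for pair in cell.split("+"))


# One dict per row, then one column per distinct parameter name
parsed = pd.DataFrame(df["params"].map(parse_params).tolist(), index=df.index)

# The column labels are the unique parameter names
unique_params = list(parsed.columns)

print(unique_params)
print(parsed)
```

Rows that lack a parameter get a missing value in that column, and all values stay strings until converted explicitly.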

pandas create new columns from dictionaries

A portion of the 'relatedWorkOrder' column in my dataframe looks like this:
{'number': 2552, 'labor': {'name': 'IA001', 'code': '70M0901003'}...}
{'number': 2552, 'labor': {'name': 'IA001', 'code': '70M0901003'}...}
{'number': 2552, 'labor': {'name': 'IA001', 'code': '70M0901003'}...}
My desired output is to have the columns 'name', 'labor_name' and 'labor_code' with their respective values. I can do this using regex extract and replace:
df['name'] = df['relatedWorkOrder'].str.extract(r'{regex}',expand=False).str.replace('something','')
But I have several dictionaries in this column, so this approach is tedious. I'm also wondering whether it's possible to do this by accessing the keys and values of the dictionary directly.
Any help with that?

You can join the result from pd.json_normalize:
df.join(pd.json_normalize(df['relatedWorkOrder'], sep='_'))
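A small sketch of what that join produces, assuming the cells are real dict objects and the frame has the default integer index so the rows line up. The sample rows drop the elided "..." part of the question's dicts, and the column is converted with .tolist() so that json_normalize receives a plain list of dicts.

```python
import pandas as pd

# Rows modelled on the question; the elided fields are simply left out
df = pd.DataFrame({"relatedWorkOrder": [
    {"number": 2552, "labor": {"name": "IA001", "code": "70M0901003"}},
    {"number": 2552, "labor": {"name": "IA001", "code": "70M0901003"}},
]})

# Flatten each dict; nested keys are joined with '_' (labor_name, labor_code)
flat = pd.json_normalize(df["relatedWorkOrder"].tolist(), sep="_")

# Join on the index to keep the original column next to the new ones
result = df.join(flat)

print(result.columns.tolist())
```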
