extracting values by keywords in a pandas column - python

I have a column that is a list of dictionaries. I extracted only the values under the name key and saved them to a list. Since I need to feed the column to a TfidfVectorizer, I need each cell to be a single string of words. My code is as follows:
def transform(s, to_extract):
    # parse the JSON string and keep only the requested key from each dict
    return [obj[to_extract] for obj in json.loads(s)]

cols = ['genres', 'keywords']
for col in cols:
    lst = df[col]
    df[col] = list(map(lambda x: transform(x, to_extract='name'), lst))
    df[col] = [', '.join(x) for x in df[col]]
For testing, here are two rows:
data = {'genres': [[{"id": 851, "name": "dual identity"}, {"id": 2038, "name": "love of one's life"}],
                   [{"id": 5983, "name": "pizza boy"}, {"id": 8828, "name": "marvel comic"}]],
        'keywords': [[{"id": 9663, "name": "sequel"}, {"id": 9715, "name": "superhero"}],
                     [{"id": 14991, "name": "tentacle"}, {"id": 34079, "name": "death", "id": 163074, "name": "super villain"}]]
        }
df = pd.DataFrame(data)
I'm able to extract the necessary data and save it accordingly. However, I find the code too verbose, and I would like to know if there's a more Pythonic way to achieve the same outcome.
The desired output for one row should be a single string, delimited only by a comma, e.g. 'dual identity,love of one's life'.

Is this what you need?
df.applymap(lambda x : pd.DataFrame(x).name.tolist())
Out[278]:
genres keywords
0 [dual identity, love of one's life] [sequel, superhero]
1 [pizza boy, marvel comic] [tentacle, super villain]
Update
df.applymap(lambda x : pd.DataFrame(x).name.str.cat(sep=','))
Out[280]:
genres keywords
0 dual identity,love of one's life sequel,superhero
1 pizza boy,marvel comic tentacle,super villain
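A plain list comprehension also works and avoids building a DataFrame per cell. This is just a sketch, assuming the cells are already lists of dicts as in the sample data above; if they are JSON strings, run them through json.loads first:
for col in ['genres', 'keywords']:
    # keep only the "name" value of every dict and join them into one string
    df[col] = [','.join(d['name'] for d in cell) for cell in df[col]]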

Related

pandas df explode and implode to remove specific dict from the list

I have a pandas dataframe with multiple columns. One of the columns, called request_headers, is a list of dictionaries, for example:
[{"name": "name1", "value": "value1"}, {"name": "name2", "value": "value2"}]
I would like to remove from that list only those elements which contain a specific name. For example, with:
blacklist = "name2"
I should get the same dataframe, with all the columns including request_headers, but its value (based on the example above) should be:
[{"name": "name1", "value": "value1"}]
How can I achieve this? I tried to explode first, then filter, but was not able to "implode" correctly.
Thanks.
Exploding is expensive, rather use a list comprehension:
blacklist = "name2"
df['request_headers'] = [[d for d in l if 'name' in d and d['name'] != blacklist]
                         for l in df['request_headers']]
Output:
request_headers
0 [{'name': 'name1', 'value': 'value1'}]
You can use an .apply function:
blacklist = 'name2'
df['request_headers'] = df['request_headers'].apply(lambda x: [d for d in x if blacklist not in d.values()])
df1 = pd.DataFrame([{"name": "name1", "value": "value1"}, {"name": "name2", "value": "value2"}])
blacklist = "name2"
col1 = df1.name.eq(blacklist)
df1.loc[col1]
out:
name value
1 name2 value2
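For completeness, here is a sketch of the explode/filter/"implode" route the question mentions. It assumes a unique index, non-empty lists, and that no row loses all of its headers; the list comprehension above is still the cheaper option:
exploded = df.explode('request_headers')
# keep only the dicts whose "name" is not blacklisted
kept = exploded[exploded['request_headers'].apply(lambda d: d.get('name') != blacklist)]
# "implode": collect the surviving dicts back into one list per original row
df['request_headers'] = kept.groupby(level=0)['request_headers'].agg(list)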

How to format a JSON content type?

I need to extract only the content of the operations of the following json:
{"entries":[{"description":"Text transform on 101 cells in column Column 2: value.toLowercase()","operation":{"op":"core/text-transform","engineConfig":{"facets":[],"mode":"row-based"},"columnName":"Column 2","expression":"value.toLowercase()","onError":"keep-original","repeat":false,"repeatCount":10,"description":"Text transform on cells in column Column 2 using expression value.toLowercase()"}},{"description":"Text transform on 101 cells in column Column 6: value.toDate()","operation":{"op":"core/text-transform","engineConfig":{"facets":[],"mode":"row-based"},"columnName":"Column 6","expression":"value.toDate()","onError":"keep-original","repeat":false,"repeatCount":10,"description":"Text transform on cells in column Column 6 using expression value.toDate()"}}]}
It should look like this:
[{"op": "core/text-transform", "engineConfig": {"facets": [], "mode": "row-based"}, "columnName": "Column 2", "expression": "value.toLowercase()", "onError": "keep-original", "repeat": "false", "repeatCount": 10, "description": "Text transform on cells in column Column 2 using expression value.toLowercase()"}, {"op": "core/text-transform", "engineConfig": {"facets": [], "mode": "row-based"}, "columnName": "Column 6", "expression": "value.toDate()", "onError": "keep-original", "repeat": "false", "repeatCount": 10, "description": "Text transform on cells in column Column 6 using expression value.toDate()"}]
I tried to use this code:
import json
operations = [{"description":"Text transform on 101 cells in column Column 2: value.toLowercase()","operation":{"op":"core/text-transform","engineConfig":{"facets":[],"mode":"row-based"},"columnName":"Column 2","expression":"value.toLowercase()","onError":"keep-original","repeat":"false","repeatCount":10,"description":"Text transform on cells in column Column 2 using expression value.toLowercase()"}},{"description":"Text transform on 101 cells in column Column 6: value.toDate()","operation":{"op":"core/text-transform","engineConfig":{"facets":[],"mode":"row-based"},"columnName":"Column 6","expression":"value.toDate()","onError":"keep-original","repeat":"false","repeatCount":10,"description":"Text transform on cells in column Column 6 using expression value.toDate()"}}]
new_operations = []
for operation in operations:
    new_operations.append(operation["operation"])
x = json.dumps(new_operations)
print(x)
However, I must manually place quotes around words like "false" and also remove the leading "entries" part for it to work. Does anyone know how to do this automatically?
IIUC you can do it like this. Read it as json data, extract the parts you want and dump it back to json.
with open('your-json-data.json') as j:
    data = json.load(j)

new_data = []
for dic in data['entries']:
    for key, value in dic.items():
        if key == 'operation':
            dic = {k: v for k, v in value.items()}
            new_data.append(dic)

x = json.dumps(new_data)
print(x)
Output:
[
  {
    "op":"core/text-transform",
    "engineConfig":{
      "facets":[],
      "mode":"row-based"
    },
    "columnName":"Column 2",
    "expression":"value.toLowercase()",
    "onError":"keep-original",
    "repeat":false,
    "repeatCount":10,
    "description":"Text transform on cells in column Column 2 using expression value.toLowercase()"
  },
  {
    "op":"core/text-transform",
    "engineConfig":{
      "facets":[],
      "mode":"row-based"
    },
    "columnName":"Column 6",
    "expression":"value.toDate()",
    "onError":"keep-original",
    "repeat":false,
    "repeatCount":10,
    "description":"Text transform on cells in column Column 6 using expression value.toDate()"
  }
]
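The nested loop can also be collapsed into a single list comprehension, since every entry holds the wanted dict under the 'operation' key (same assumed 'your-json-data.json' file as above):
import json

with open('your-json-data.json') as j:
    data = json.load(j)

# keep only the "operation" part of every entry
x = json.dumps([entry['operation'] for entry in data['entries']])
print(x)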

Is there a way to transform all unique values into a new dataframe using loop and at the same time create additional columns? [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 8 months ago.
My problem is that I have a dataframe like this:
## for demonstration
import pandas as pd

example = {
    "ID": [1, 1, 2, 2, 2, 3],
    "place": ["Maryland", "Maryland", "Washington", "Washington", "Washington", "Los Angeles"],
    "type": ["condition", "symptom", "condition", "condition", "sky", "condition"],
    "name": ["depression", "cough", "fatigue", "depression", "blue", "fever"]
}

# load into df:
example = pd.DataFrame(example)
print(example)
And I want to reorganize it by unique ID so that it looks like this:
# for demonstration
import pandas as pd

result = {
    "ID": [1, 2, 3],
    "place": ["Maryland", "Washington", "Los Angeles"],
    "condition": ["depression", "fatigue", "fever"],
    "condition1": ["no", "depression", "no"],
    "symptom": ["cough", "no", "no"],
    "sky": ["no", "blue", "no"]
}

# load into df:
result = pd.DataFrame(result)
print(result)
I tried to sort it like:
example.nunique()
df_names = dict()
for k, v in example.groupby('ID'):
    df_names[k] = v
However, this gives me back a dictionary, and it is not organized the way it should be.
Is there a way to do it with a loop, so that for each unique ID a new column is created for condition, sky, or the other types? If there is more than one condition, the next one should become condition1. Could you please help me if you know a way to do this?
This should give you the answers you need. It is a combination of cumcount() and pivot:
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3],
    "place": ["Maryland", "Maryland", "Washington", "Washington", "Washington", "Los Angeles"],
    "type": ["condition", "symptom", "condition", "condition", "sky", "condition"],
    "name": ["depression", "cough", "fatigue", "depression", "blue", "fever"]
})

df['type'] = df['type'].astype(str) + '_' + df.groupby(['place', 'type']).cumcount().astype(str)
df = df.pivot(index=['ID', 'place'], columns='type', values='name').reset_index()
df = df.fillna('no')
df.columns = df.columns.str.replace('_0', '')
df = df[['ID', 'place', 'condition', 'condition_1', 'symptom', 'sky']]
df
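With the sample data above, this should produce something like the following (note that the second duplicate keeps the generated name condition_1 rather than the condition1 shown in the question):
   ID        place   condition condition_1 symptom   sky
0   1     Maryland  depression          no   cough    no
1   2   Washington     fatigue  depression      no  blue
2   3  Los Angeles       fever          no      no    no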

Create json files from dataframe

I have a dataframe with 10 rows like this:
id   name        team
1    Employee1   Team1
2    Employee2   Team2
...
How can I generate 10 json files from the dataframe with python?
Here is the format of each json file:
{
    "Company": "Company",
    "id": "1",
    "name": "Employee1",
    "team": "Team1"
}
The field "Company": "Company" is the same in all json files.
The name of each json file is the name of each employee (i.e. Employee1.json).
I do not really like iterrows but as you need a file per row, I cannot imagine how to vectorize the operation:
for _, row in df.iterrows():
    row['Company'] = 'Company'
    row.to_json(row['name'] + '.json')
You could use apply in the following way:
df.apply(lambda x: x.to_json(), axis=1)
And inside the to_json call, pass the employee name; it's available to you in x.
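For example, a sketch of what that looks like (Series.to_json accepts a path, so each row is written to its own file and the return value is ignored):
df.apply(lambda x: x.to_json(x['name'] + '.json'), axis=1)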
Another approach is to iterate over the rows like:
for i in df.index:
    df.loc[i].to_json("Employee{}.json".format(i))
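If the output has to match the requested layout exactly (the extra "Company" field first, id as a string, and one <employee name>.json file per row), a sketch along these lines should work; payload is just a throwaway name used here:
import json

for _, row in df.iterrows():
    # "Company" first, then the row fields, all as strings
    payload = {"Company": "Company", **row.astype(str).to_dict()}
    with open(row['name'] + '.json', 'w') as fh:
        json.dump(payload, fh, indent=4)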

How to create a Pandas DataFrame from row-based list of dictionaries

I have a data structure like this:
data = [{
    "name": "leopard",
    "character": "mean",
    "skills": ["sprinting", "hiding"],
    "pattern": "striped",
},
{
    "name": "antilope",
    "character": "good",
    "skills": ["running"],
},
.
.
.
]
Each key in the dictionaries has values of type integer, string, or list of strings (not all keys are present in all dicts); each dictionary represents a row in a table, and all rows are given as the list of dictionaries.
How can I easily import this into Pandas? I tried
df = pd.DataFrame.from_records(data)
but here I get a "ValueError: arrays must all be same length" error.
The DataFrame constructor takes row-based arrays (amongst other structures) as data input. Therefore the following works:
data = [{
    "name": "leopard",
    "character": "mean",
    "skills": ["sprinting", "hiding"],
    "pattern": "striped",
},
{
    "name": "antilope",
    "character": "good",
    "skills": ["running"],
}]
df = pd.DataFrame(data)
print(df)
Output:
character name pattern skills
0 mean leopard striped [sprinting, hiding]
1 good antilope NaN [running]
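As a small follow-up (this goes beyond the question, so treat it as an assumption about what is wanted): if the NaN placeholders for missing keys are undesirable, they can be replaced right after construction, for example:
df = pd.DataFrame(data).fillna("")  # missing keys become empty strings instead of NaN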
