Clean multiple JSONs in a pandas dataframe


I have a dataframe created like below, with countries in JSON format:
df = pd.DataFrame([['matt', '''[{"c_id": "cn", "c_name": "China"}, {"c_id": "au", "c_name": "Australia"}]'''],
['david', '''[{"c_id": "jp", "c_name": "Japan"}, {"c_id": "cn", "c_name": "China"},{"c_id": "au", "c_name": "Australia"}]'''],
['john', '''[{"c_id": "br", "c_name": "Brazil"}, {"c_id": "ag", "c_name": "Argentina"}]''']],
columns =['person','countries'])
I'd like to have the output as below, with just the country names, separated by a comma and sorted in alphabetical order:
result = pd.DataFrame([['matt', 'Australia, China'],
['david', 'Australia, China, Japan'],
['john', 'Argentina, Brazil']],
columns =['person','countries'])
I tried doing this using a few methods, but none worked successfully. I was hoping the below would split the JSON format appropriately, but it didn't work out - perhaps because the JSONs are in string format in the dataframe?
result = pd.io.json.json_normalize(df, 'c_name')

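One solution is to use ast.literal_eval to parse each string into a list of dictionaries, then join the sorted country names into a comma-separated string:
import ast
# parse each string into a list of dictionaries
df["countries"] = df["countries"].map(ast.literal_eval)
# keep only the country names, sorted alphabetically and joined with ", "
df["countries"] = df["countries"].map(lambda x: ", ".join(sorted(c["c_name"] for c in x)))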
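Since the strings in the question are valid JSON, json.loads from the standard library is an alternative to ast.literal_eval. The following is a minimal sketch under that assumption; the helper name country_names is hypothetical and only two of the question's rows are reproduced.

```python
import json

import pandas as pd

df = pd.DataFrame(
    [
        ["matt", '[{"c_id": "cn", "c_name": "China"}, {"c_id": "au", "c_name": "Australia"}]'],
        ["john", '[{"c_id": "br", "c_name": "Brazil"}, {"c_id": "ag", "c_name": "Argentina"}]'],
    ],
    columns=["person", "countries"],
)


def country_names(raw: str) -> str:
    """Parse one JSON string and return its country names, sorted and comma-separated."""
    records = json.loads(raw)
    return ", ".join(sorted(record["c_name"] for record in records))


df["countries"] = df["countries"].map(country_names)
print(df)
```

json.loads only accepts JSON, so it would reject Python-only literals such as single-quoted strings; in that case ast.literal_eval is the safer choice.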

Related

replacing values in a data frame from a dictionary with multiple keys

I have not seen any posts about this on here. I have a data frame whose values I would like to replace with the values found in a dictionary. This could simply be done with .replace, but I want to keep this dynamic and reference the df column names using a paired dictionary map.
import pandas as pd
data=[['Alphabet', 'Indiana']]
df=pd.DataFrame(data,columns=['letters','State'])
replace_dict={
"states":
{"Illinois": "IL", "Indiana": "IN"},
"abc":
{"Alphabet":"ABC", "Alphabet end":"XYZ"}}
def replace_dict():
    return
df_map={
"letters": [replace_dict],
"State": [replace_dict]
}
#replace the df values with the replace_dict values
I hope this makes sense, but to explain more: I want to replace the data under columns 'letters' and 'State' with the values found in replace_dict, referencing the column names from the keys found in df_map. I know this is overcomplicated for this example, but I wanted to provide an example that is easy to understand.
I need help writing the function 'replace_dict' so that it performs the operations above.
Expected output is:
data=[['ABC', 'IN']]
df=pd.DataFrame(data,columns=['letters','State'])
by creating a function and then running the function with something along these lines
for i in df_map:
    for j in df_map[i]:
        df = j(i, df)
How would I create a function to run these operations? I have not seen anyone try to do this with multiple dictionary keys in the replace_dict.

I'd keep the replace_dict keys the same as the column names.
def map_from_dict(data: pd.DataFrame, cols: list, mapping: dict) -> pd.DataFrame:
    # map each requested column through its own sub-dictionary, then rebuild the frame
    return pd.DataFrame([data[x].map(mapping.get(x)) for x in cols]).transpose()
df = pd.DataFrame({
"letters": ["Alphabet"],
"states": ["Indiana"]
})
replace_dict = {
"states": {"Illinois": "IL", "Indiana": "IN"},
"letters": {"Alphabet": "ABC", "Alphabet end": "XYZ"}
}
final_df = map_from_dict(df, ["letters", "states"], replace_dict)
print(final_df)
#   letters states
# 0     ABC     IN

import pandas as pd
data=[['Alphabet', 'Indiana']]
df=pd.DataFrame(data,columns=['letters','State'])
dict_={
"states":
{"Illinois": "IL", "Indiana": "IN"},
"abc":
{"Alphabet":"ABC", "Alphabet end":"XYZ"}}
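def replace_dict(df, dict_):
    # try every old -> new pair from every sub-dictionary against every column
    for d in dict_.values():
        for val in d:
            for c in df.columns:
                # .loc avoids chained assignment, which may silently write to a copy
                df.loc[df[c] == val, c] = d[val]
    return df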
df = replace_dict(df, dict_)
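If each column should only be matched against its own sub-dictionary, DataFrame.replace accepts a nested dictionary of the form {column: {old: new}}. The sketch below assumes a hypothetical column_to_key lookup that pairs each column name with a key of replace_dict; neither that lookup nor the helper name replace_by_column comes from the question.

```python
import pandas as pd

df = pd.DataFrame([["Alphabet", "Indiana"]], columns=["letters", "State"])

replace_dict = {
    "states": {"Illinois": "IL", "Indiana": "IN"},
    "abc": {"Alphabet": "ABC", "Alphabet end": "XYZ"},
}

# Hypothetical lookup: which replace_dict key applies to which column.
column_to_key = {"letters": "abc", "State": "states"}


def replace_by_column(data, mapping, column_to_key):
    """Replace values column by column, using the sub-dictionary paired with each column."""
    nested = {col: mapping[key] for col, key in column_to_key.items()}
    # A nested dict tells DataFrame.replace to look only in the named column.
    return data.replace(nested)


result = replace_by_column(df, replace_dict, column_to_key)
print(result)
```

Unlike the triple loop, this never applies the "states" mapping to the "letters" column, which matters when the same value appears in more than one sub-dictionary.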

Split column containing list of dicts and match dictionary values to new column based on key

I am in need of some help in python.
I have a column in my dataset that contains a list of dictionaries. An example of the contents of such a cell is as follows:
lyst = [
{"region": "Baden-Wuerttemberg", "percentage":0.176553},
{"region": "Bayern", "percentage":0.142457},
{"region": "Bremen", "percentage":0.008874},
{"region": "Hamburg", "percentage":0.01915},
{"region": "Hessen", "percentage":0.09612},
{"region": "Niedersachsen", "percentage":0.094816},
{"region": "Nordrhein-Westfalen", "percentage":0.244745},
]
There are up to 17 dictionaries per list. Each list describes the percentage of ad targeting in each German state involved in an advertisement. Unfortunately, the order in which the states are mentioned differs from cell to cell, and some cells omit states that others include.
[Image: what the cells look like in the original dataset]
I'd like to split the column based on the name of the region and put the percentage value in the corresponding new column. I have no idea where to start; any help would be greatly appreciated.

We can generate some simple test data as follows:
import pandas as pd
cell1 = [
{"region": "Baden-Wuerttemberg", "percentage":0.176553},
{"region": "Bayern", "percentage":0.142457},
{"region": "Bremen", "percentage":0.008874},
{"region": "Hamburg", "percentage":0.01915},
{"region": "Hessen", "percentage":0.09612},
{"region": "Niedersachsen", "percentage":0.094816},
{"region": "Nordrhein-Westfalen", "percentage":0.244745},
]
cell2 = [
{'region': 'Baden-Wuerttemberg', 'percentage': 0.5882765},
{'region': 'Bremen', 'percentage': 0.504437},
{'region': 'Hessen', 'percentage': 0.54806}
]
old_data = pd.DataFrame({
"Advert ID": ["Ad-" + str(id) for id in range(1, 3)],
"Efficacy": [
cell1,
cell2
],
}
)
print(old_data)
Now that we have some test data, we extract the names of all regions that appear in it.
import itertools as itts
column = old_data["Efficacy"]
# Series.iteritems() was removed in pandas 2.0; tolist() gives the cell values directly
lysts = column.tolist()
regions = set(itts.chain.from_iterable([d["region"] for d in lyst] for lyst in lysts))
print(regions)
regions is a set of strings; something like the following:
{'Nordrhein-Westfalen', 'Baden-Wuerttemberg', 'Bremen', 'Bayern', 'Niedersachsen', 'Hessen', 'Hamburg'}
Let us make a new pandas.DataFrame with one column for each region.
import pandas as pd
new_data = pd.DataFrame(
columns = [old_data.columns[0]] + list(regions)
)
Now, you can populate the new data frame with data from the old data-frame.
EDIT:
To populate the new-data frame with data from your old data-frame, you can write something like this:
new_data = pd.DataFrame(
columns = [old_data.columns[0]] + list(regions)
)
adverts = old_data["Advert ID"].tolist()
for advert in adverts:
    old_row = old_data.loc[old_data["Advert ID"] == advert]
    list_of_dicts = old_row["Efficacy"].to_list()[0]
    rowi = 1 + len(new_data.index)  # label for the next new row
    new_data.at[rowi, "Advert ID"] = advert
    for dyct in list_of_dicts:
        region = dyct["region"]
        percentage = dyct["percentage"]
        new_data.at[rowi, region] = percentage
Let us display what we have:
def pretty_print(df: pd.DataFrame):
    import io
    colw = 30  # width of each printed column, in characters
    strm = io.StringIO()
    d = df.to_dict()
    for key in d.keys():
        # one output line per column: the column name, then each row's value
        print(key.ljust(colw), end="", file=strm)
        print(*[repr(str(v))[1:-1].ljust(colw) for v in d[key].values()], file=strm)
    print(strm.getvalue())
pretty_print(new_data)
We get something like:
Advert ID                     Ad-1                           Ad-2
Bremen                        0.008874                       0.504437
Baden-Wuerttemberg            0.176553                       0.5882765
Bayern                        0.142457                       nan
Hessen                        0.09612                        0.54806
Hamburg                       0.01915                        nan
Nordrhein-Westfalen           0.244745                       nan
Niedersachsen                 0.094816                       nan
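A shorter route to the same wide layout is to turn each cell into one {region: percentage} dictionary and let the DataFrame constructor align the keys into columns. This is a sketch on a reduced, made-up subset of the test data above, not a replacement for the step-by-step version; regions missing from a cell come out as NaN.

```python
import pandas as pd

old_data = pd.DataFrame({
    "Advert ID": ["Ad-1", "Ad-2"],
    "Efficacy": [
        [
            {"region": "Bayern", "percentage": 0.142457},
            {"region": "Bremen", "percentage": 0.008874},
        ],
        [
            {"region": "Bremen", "percentage": 0.504437},
        ],
    ],
})

# One dictionary per row: region name -> percentage.
rows = [{d["region"]: d["percentage"] for d in cell} for cell in old_data["Efficacy"]]

# The constructor creates one column per distinct key and fills gaps with NaN.
wide = pd.DataFrame(rows, index=old_data.index)

# Put the advert identifier back in front of the region columns.
new_data = pd.concat([old_data[["Advert ID"]], wide], axis=1)
print(new_data)
```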

How to get rid of Series heading (column heading) using Pandas Library in Python

Using pandas Library, I made dictionaries that are nested in a list from file “german_words.csv”.
(for Info: “german_words.csv” is file with German words and corresponding English translated words)
german_words.csv (It's just sample, current file contains thousands of words):
Deutsch,English
Gedanken,thought
Stadt,city
Baum,tree
überqueren,cross
Bauernhof,farm
schwer,hard
Beginn,start
Macht,might
Geschichte,story
Säge,saw
weit,far
Meer,sea
Here's the code of that:
import pandas
import random
word_data = pandas.read_csv("./data/german_words.csv")
word_data_list = word_data.to_dict(orient="records")
print(random.choice(word_data_list))
The last line prints a random dictionary from that list.
The list looks like this:
[{'Deutsch': 'Gedanken', 'English': 'thought'}, {'Deutsch': 'Stadt', 'English': 'city'}, {'Deutsch': 'Baum', 'English': 'tree'}, ....]
Here's the sample output:
{'Deutsch': 'Küste', 'English': 'coast'}
But the problem is, I don't want the column heading in the dictionaries.
I want these dictionaries in list as follows:
[{'Gedanken': 'thought'}, {'Stadt': 'city'}, {'Baum': 'tree'} ...]

Set the Deutsch column as the index, select the English column, and convert the resulting Series to a dictionary:
print (word_data.set_index('Deutsch')['English'].to_dict())
Or, if the DataFrame has exactly two columns, use:
print (dict(word_data.to_numpy()))
EDIT: For list of dictionaries use:
print([{x["Deutsch"]: x["English"]} for x in word_data.to_dict(orient="records")])
[{'Gedanken': 'thought'}, {'Stadt': 'city'}, {'Baum': 'tree'},
{'überqueren': 'cross'}, {'Bauernhof': 'farm'}, {'schwer': 'hard'},
{'Beginn': 'start'}, {'Macht': 'might'}, {'Geschichte': 'story'},
{'Säge': 'saw'}, {'weit': 'far'}, {'Meer': 'sea'}]

import pandas as pd
word_data = pd.DataFrame(
data={
"Deutsch": ["Gedanken", "Stadt", "Baum"],
"English": ["thought", "city", "tree"],
}
)
print({d["Deutsch"]: d["English"] for d in word_data.to_dict(orient="records")})
# {'Gedanken': 'thought', 'Stadt': 'city', 'Baum': 'tree'}
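The dict comprehension above produces one flat dictionary. If the list of single-pair dictionaries from the question is what is needed, a minimal sketch is to zip the two columns directly, which skips the intermediate to_dict call; the variable name word_pairs is an assumption for illustration.

```python
import pandas as pd

word_data = pd.DataFrame(
    data={
        "Deutsch": ["Gedanken", "Stadt", "Baum"],
        "English": ["thought", "city", "tree"],
    }
)

# Pair each German word with its translation, one small dictionary per row.
word_pairs = [{de: en} for de, en in zip(word_data["Deutsch"], word_data["English"])]
print(word_pairs)
```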

Using structured queries to geocode records in a pandas dataframe using GeoPy

I would like to use structured queries to do geocoding in GeoPy, and I would like to run this on a large number of observations. I don't know how to do these queries using a pandas dataframe (or something that can be easily transformed to and from a pandas dataframe).
First, some set up:
import pandas
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
Ngeolocator = Nominatim(user_agent="myGeocoder")
Ngeocode = RateLimiter(Ngeolocator.geocode, min_delay_seconds=1)
df = pandas.DataFrame(["Bob", "Joe", "Ed"])
df["CLEANtown"] = ['Harmony', 'Fargo', '']
df["CLEANcounty"] = ['', '', 'Traill']
df["CLEANstate"] = ['Minnesota', 'North Dakota', 'North Dakota']
df["full"]=['Harmony, Minnesota','Fargo, North Dakota','Traill County, North Dakota']
df.columns = ["name"] + list(df.columns[1:])
I know how to run a structured query on a single location by providing a dictionary. I.e.:
q={'city':'Harmony', 'county':'', 'state':'Minnesota'}
testN=Ngeocode(q,addressdetails=True)
And I know how to geocode from the dataframe simply using a single column populated with strings. I.e.:
df['easycode'] = df['full'].apply(lambda x: Ngeocode(x, language='en',addressdetails=True).raw)
But how do I turn the columns CLEANtown, CLEANcounty, and CLEANstate into dictionaries row by row, use those dictionaries as structured queries, and put the results back into the pandas dataframe?
Thank you!

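One way is to use the apply method of the DataFrame instead of a Series. With axis=1, the whole row is passed to the lambda, so you can build the query dictionary from several columns. Note that .raw fails if the geocoder returns None for a row. Example: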
df["easycode"] = df.apply(
lambda row: Ngeocode(
{
"city": row["CLEANtown"],
"county": row["CLEANcounty"],
"state": row["CLEANstate"],
},
language="en",
addressdetails=True,
).raw,
axis=1,
)
Similarly, if you wanted to make a single column of dictionaries first, you could do:
df["full"] = df.apply(
lambda row: {
"city": row["CLEANtown"],
"county": row["CLEANcounty"],
"state": row["CLEANstate"],
},
axis=1,
)
df["easycode"] = df["full"].apply(
lambda x: Ngeocode(
x,
language="en",
addressdetails=True,
).raw
)
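A geocoder can return None when it finds no match, and calling .raw on None raises an AttributeError. The sketch below guards against that. To stay runnable without network access it uses fake_geocode, a hypothetical stand-in for the rate-limited Ngeocode from the question; the helper name geocode_row is also an assumption.

```python
from types import SimpleNamespace

import pandas as pd


def fake_geocode(query, **kwargs):
    """Hypothetical stand-in for Ngeocode: returns None when no city is given."""
    if not query.get("city"):
        return None
    return SimpleNamespace(raw={"display_name": f'{query["city"]}, {query["state"]}'})


def geocode_row(row, geocode):
    """Build a structured query from one row and return the raw result, or None."""
    query = {
        "city": row["CLEANtown"],
        "county": row["CLEANcounty"],
        "state": row["CLEANstate"],
    }
    location = geocode(query, language="en", addressdetails=True)
    return location.raw if location is not None else None


df = pd.DataFrame({
    "name": ["Bob", "Joe", "Ed"],
    "CLEANtown": ["Harmony", "Fargo", ""],
    "CLEANcounty": ["", "", "Traill"],
    "CLEANstate": ["Minnesota", "North Dakota", "North Dakota"],
})

# Extra keyword arguments to DataFrame.apply are forwarded to the function.
df["easycode"] = df.apply(geocode_row, axis=1, geocode=fake_geocode)
print(df[["name", "easycode"]])
```

With the real geocoder, passing geocode=Ngeocode in place of fake_geocode should give the same shape of result, with None in rows that could not be resolved.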

Pandas Dataframe conversion to JSON

I have a Pandas Dataframe of two rows with data that I'd like to pass as a JSON array.
The JSON needs to be formatted as follows:
[{
"Date": "2017-02-03",
"Text": "Sample Text1"
},
{
"Date": "2015-02-04",
"Text": "Sample Text2"
}]
I tried using df.to_json(orient='index'), but the output is not quite right as it seems to be using the index values as keys
{"0":{"Date":"2017-02-03","Text":"Sample Text1"},"1":{"Date":"2017-02-04","Text":"Sample Text2"}}

If you want an array of dictionaries, you can use orient='records':
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'Date': ['2017-02-03', '2015-02-04'],
... 'Text': ['Sample Text 1', 'Sample Text 2']
... })
>>> df.to_json(orient='records')
'[{"Date":"2017-02-03","Text":"Sample Text 1"},{"Date":"2015-02-04","Text":"Sample Text 2"}]'
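If the JSON should be pretty-printed like the layout in the question, one option is to convert the frame to a list of dictionaries and serialize it with the standard json module. This is a sketch that assumes every column holds plain strings or numbers; values such as timestamps would need converting first.

```python
import json

import pandas as pd

df = pd.DataFrame({
    "Date": ["2017-02-03", "2015-02-04"],
    "Text": ["Sample Text 1", "Sample Text 2"],
})

# One dictionary per row, in row order.
records = df.to_dict(orient="records")

# indent=2 spreads the array over several lines for readability.
payload = json.dumps(records, indent=2)
print(payload)
```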
