I haven't seen any posts about this on here. I have a DataFrame whose values I would like to replace with values found in a dictionary. This could simply be done with .replace, but I want to keep it dynamic and reference the DataFrame column names using a paired dictionary map.
import pandas as pd

data = [['Alphabet', 'Indiana']]
df = pd.DataFrame(data, columns=['letters', 'State'])

replace_dict = {
    "states": {"Illinois": "IL", "Indiana": "IN"},
    "abc": {"Alphabet": "ABC", "Alphabet end": "XYZ"}}

def replace_dict():  # placeholder; note this def rebinds the name, shadowing the dictionary above
    return

df_map = {
    "letters": [replace_dict],
    "State": [replace_dict]
}
#replace the df values with the replace_dict values
I hope this makes sense, but to explain further: I want to replace the data under the columns 'letters' and 'State' with the values found in replace_dict, referencing the column names via the keys in df_map. I know this is overcomplicated for this example, but I wanted to provide an easy example to understand. I need help writing the function 'replace_dict' to perform the operations above.
Expected output is:
data = [['ABC', 'IN']]
df = pd.DataFrame(data, columns=['letters', 'State'])
by creating a function and then running it along these lines:

for i in df_map:
    for j in df_map[i]:
        df = j(i, df)

How would I create a function to run these operations? I have not seen anyone try to do this with multiple dictionary keys in replace_dict.
I'd keep the replace_dict keys the same as the column names.
def map_from_dict(data: pd.DataFrame, cols: list, mapping: dict) -> pd.DataFrame:
    return pd.DataFrame([data[x].map(mapping.get(x)) for x in cols]).transpose()
df = pd.DataFrame({
    "letters": ["Alphabet"],
    "states": ["Indiana"]
})

replace_dict = {
    "states": {"Illinois": "IL", "Indiana": "IN"},
    "letters": {"Alphabet": "ABC", "Alphabet end": "XYZ"}
}

final_df = map_from_dict(df, ["letters", "states"], replace_dict)
print(final_df)
  letters states
0     ABC     IN
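One caveat with .map: it returns NaN for any value that is missing from the mapping, whereas .replace leaves unmatched values untouched. If unmapped values should survive, a small variation works (map_from_dict_keep is a hypothetical name, sketched under that assumption):

def map_from_dict_keep(data: pd.DataFrame, cols: list, mapping: dict) -> pd.DataFrame:
    out = data.copy()
    for x in cols:
        # fall back to the original value when the mapping has no entry for it
        out[x] = data[x].map(mapping.get(x)).fillna(data[x])
    return out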
import pandas as pd

data = [['Alphabet', 'Indiana']]
df = pd.DataFrame(data, columns=['letters', 'State'])

dict_ = {
    "states": {"Illinois": "IL", "Indiana": "IN"},
    "abc": {"Alphabet": "ABC", "Alphabet end": "XYZ"}}

def replace_dict(df, dict_):
    for d in dict_.values():
        for val in d:
            for c in df.columns:
                # .loc avoids chained indexing, which raises SettingWithCopyWarning
                # and can silently fail to write
                df.loc[df[c] == val, c] = d[val]
    return df

df = replace_dict(df, dict_)
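If you specifically want the df_map dispatch described in the question, where each column maps to a list of functions called as j(i, df), a sketch along those lines (replace_from_dict is a hypothetical name; it assumes each function takes a column name and the frame and returns the frame):

def replace_from_dict(col, df, mapping=dict_):
    # apply every sub-dictionary of replacements to the one column
    for sub in mapping.values():
        df[col] = df[col].replace(sub)
    return df

df_map = {
    "letters": [replace_from_dict],
    "State": [replace_from_dict]
}

for i in df_map:
    for j in df_map[i]:
        df = j(i, df)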
We have to build nested JSON in PySpark using the structure below, and I have added the data that needs to feed it.

Input data structure:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

a1 = ["DA_STinf", "DA_Stinf_NA", "DA_Stinf_city", "DA_Stinf_NA_ID", "DA_Stinf_NA_ID_GRANT", "DA_country"]
a2 = ["data.studentinfo", "data.studentinfo.name", "data.studentinfo.city", "data.studentinfo.name.id", "data.studentinfo.name.id.grant", "data.country"]
columns = ["data", "action"]
df = spark.createDataFrame(zip(a1, a2), columns)

# Input data for the JSON structure
a1 = ["Pune"]
a2 = ["YES"]
a3 = ["India"]
col = ["DA_Stinf_city", "DA_Stinf_NA_ID_GRANT", "DA_country"]
data = spark.createDataFrame(zip(a1, a2, a3), col)
Expected result based on the above data:

{
    "data": {
        "studentinfo": {
            "city": "Pune",
            "name": {
                "id": {
                    "grant": "YES"
                }
            }
        },
        "country": "India"
    }
}
We have tried using the F.struct function manually, but we need a dynamic way to build this JSON from the df DataFrame, which holds the data and action columns:
from pyspark.sql import functions as F

data.select(
    F.struct(
        F.struct(
            F.struct(F.col("DA_Stinf_city")).alias("city"),
            F.struct(
                F.struct(F.col("DA_Stinf_NA_ID_GRANT")).alias("id")
            ).alias("name"),
        ).alias("studentinfo"),
        F.struct(F.col("DA_country")).alias("country")
    ).alias("data")
)
The approach below should give the correct structure (though with the wrong key names). It doesn't use DataFrame operations but works on the underlying RDD; if you are happy with the approach, I can flesh it out:
def build_json(pairs, running=None):
    # use None instead of a mutable default dict, which would leak state between calls
    if running is None:
        running = {}
    new_input = {}
    for hierarchy, value in pairs:
        key = hierarchy.pop(0)
        if len(hierarchy) == 0:
            running[key] = value
        else:
            new_input[key] = new_input.get(key, []) + [(hierarchy, value)]
    for key in new_input:
        running[key] = build_json(new_input[key])
    return running
data.rdd.map(
    lambda x: build_json(
        [(column.split("_"), value) for column, value in x.asDict().items()]
    )
)
The basic idea is to get a set of tuples from the underlying RDD consisting of the column name broken into its json hierarchy and the value to insert into the hierarchy. Then the function build_json inserts the value into its correct place in the json hierarchy, while building out the json object recursively.
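As a quick sanity check of build_json on plain Python data (no Spark needed; the hierarchies below are illustrative, since in the real pipeline the keys would be the split column-name parts rather than the pretty names from the action column):

rows = [
    (["data", "studentinfo", "city"], "Pune"),
    (["data", "studentinfo", "NA", "ID", "GRANT"], "YES"),
    (["data", "country"], "India"),
]
print(build_json(rows))
# {'data': {'country': 'India', 'studentinfo': {'city': 'Pune', 'NA': {'ID': {'GRANT': 'YES'}}}}}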
I have the following Pandas dataframe:
import numpy as np
import pandas as pd

foo = {
    "first_name": ["John", "Sally", "Mark", "Jane", "Phil"],
    "last_name": ["O'Connor", "Jones P.", "Williams", "Connors", "Lee"],
    "salary": [101000, 50000, 56943, 330532, 92750],
}
df = pd.DataFrame(foo)
I'd like to be able to validate column data using a RegEx pattern, then replace with NaN if the validation fails.
To do this, I use the following hard-coded RegEx patterns in the .replace() method:
df[['first_name']] = df[['first_name']].replace('[^A-Za-z \/\-\.\']', np.nan, regex=True)
df[['last_name']] = df[['last_name']].replace('[^A-Za-z \/\-\.\']', np.nan, regex=True)
df[['salary']] = df[['salary']].replace('[^0-9 ]', np.nan, regex=True)
This approach works, but I have 15-20 columns, so it will be difficult to maintain.
I'd like to set up a dictionary that looks as follows:
regex_patterns = {
    'last_name': '[^A-Za-z \/\-\.\']',
    'first_name': '[^A-Za-z \/\-\.\']',
    'salary': '[^0-9 ]'
}
Then, I'd like to pass a value to the .replace() function based on the name of the column in the df. It would look as follows:
df[['first_name']] = df[['first_name']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)
df[['last_name']] = df[['last_name']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)
df[['salary']] = df[['salary']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)
How would I reference the name of the df column, then use that to look up the key in the dictionary and get its associated value?
For example, look up first_name, then access its dictionary value [^A-Za-z \/\-\.\'] and pass this value into .replace()?
Thanks!
P.S. if there is a more elegant approach, I'm all ears.
One approach would be using the columns attribute:
regex_patterns = {
    'last_name': '[^A-Za-z \/\-\.\']',
    'first_name': '[^A-Za-z \/\-\.\']',
    'salary': '[^0-9 ]'
}

for column in df.columns:
    df[column] = df[column].replace(regex_patterns[column], np.nan, regex=True)
You can actually pass a nested dictionary of the form {'col': {'match': 'replacement'}} to replace
In your case:
d = {k: {v: np.nan} for k, v in regex_patterns.items()}
df.replace(d, regex=True)
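For reference, the comprehension expands to a per-column mapping, and replace returns a new frame, so assign the result back:

# d == {'last_name': {'[^A-Za-z \/\-\.\']': nan},
#       'first_name': {'[^A-Za-z \/\-\.\']': nan},
#       'salary': {'[^0-9 ]': nan}}
df = df.replace(d, regex=True)  # each column only gets its own pattern applied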
This is the flattened version of the column. I still need the keys as column titles for the DataFrame and the values as values in the corresponding column.
reaction
{ "veddra_term_code": "99026", "veddra_version": "3", "veddra_term_name": "Tablets, Abnormal" }
I want my data to look like this so I can add it to the dataframe.
veddra_term_code veddra_version veddra_term_name
99026 3 'Tablets, Abnormal'
Use f-strings. They're made for creating strings formatted the way you want:
d = { "veddra_term_code": "99026", "veddra_version": "3", "veddra_term_name": "Tablets, Abnormal" }
s = f'veddra_term_code veddra_version veddra_term_name {d["veddra_term_code"]} {d["veddra_version"]} \'{d["veddra_term_name"]}\''
print(s) # prints veddra_term_code veddra_version veddra_term_name 99026 3 'Tablets, Abnormal'
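If the end goal is a row in a DataFrame rather than a printed string, a one-row frame may be simpler (a sketch; the dict keys become the column titles automatically):

import pandas as pd

d = {"veddra_term_code": "99026", "veddra_version": "3", "veddra_term_name": "Tablets, Abnormal"}
row_df = pd.DataFrame([d])  # one dict -> one row; keys -> columns
print(row_df)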
I have the below API response. This is a very small subset, pasted here for reference; there can be 80+ columns.
[["name","age","children","city", "info"], ["Richard Walter", "35", ["Simon", "Grace"], {"mobile":"yes","house_owner":"no"}],
["Mary", "43", ["Phil", "Marshall", "Emily"], {"mobile":"yes","house_owner":"yes", "own_stocks": "yes"}],
["Drew", "21", [], {"mobile":"yes","house_owner":"no", "investor":"yes"}]]
Initially I thought pandas could help here and searched accordingly, but as a newbie to Python/coding I was not able to get much out of it. Any help or guidance is appreciated.
I am expecting output in a JSON key-value pair format such as below.
{"name":"Mary", "age":"43", "children":["Phil", "Marshall", "Emily"],"info_mobile":"yes","info_house_owner":"yes", "info_own_stocks": "yes"},
{"name":"Drew", "age":"21", "children":[], "info_mobile":"yes","info_house_owner":"no", "info_investor":"yes"}]```
I assume that the first list always will be the headers (column names)?
If that is the case, maybe something like this could work.
import pandas as pd
data = [["name", "age", "children", "info"], ["Ned", 40, ["Arya", "Rob"], {"dead": "yes", "winter is coming": "yes"}]]
headers = data[0]
data = data[1:]
df = pd.DataFrame(data, columns=headers)
df_json = df.to_json()
print(df)
Assuming that the first list always represents the keys ["name", "age", etc.] and the subsequent lists represent the actual data/API response, you can construct a dictionary (key-value pairs) like this:
keys = ["name", "age", "children", "info"]
api_response = ["Richard Walter", "35", ["Simon", "Grace"], {"mobile":"yes","house_owner":"no"}]
data_dict = {k: v for k, v in zip(keys, api_response)}
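To get the info_* keys shown in the expected output, the nested dictionary still has to be flattened into the top level; a sketch under the assumption that any dict-valued field should be prefixed with its key (flatten_record is a hypothetical name):

def flatten_record(record):
    flat = {}
    for k, v in record.items():
        if isinstance(v, dict):
            # hoist nested entries up as e.g. info_mobile, info_house_owner
            for inner_k, inner_v in v.items():
                flat[f"{k}_{inner_k}"] = inner_v
        else:
            flat[k] = v
    return flat

print(flatten_record(data_dict))
# {'name': 'Richard Walter', 'age': '35', 'children': ['Simon', 'Grace'], 'info_mobile': 'yes', 'info_house_owner': 'no'}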
Here is the example JSON I'm working with:
{
    ":#computed_region_amqz_jbr4": "587",
    ":#computed_region_d3gw_znnf": "18",
    ":#computed_region_nmsq_hqvv": "55",
    ":#computed_region_r6rf_p9et": "36",
    ":#computed_region_rayf_jjgk": "295",
    "arrests": "1",
    "county_code": "44",
    "county_code_text": "44",
    "county_name": "Mifflin",
    "fips_county_code": "087",
    "fips_state_code": "42",
    "incident_count": "1",
    "lat_long": {
        "type": "Point",
        "coordinates": [
            -77.620031,
            40.612749
        ]
    }
}
I have been able to pull out the columns I want, except I'm having trouble with "lat_long". So far my code looks like:
# PRINTS OUT SPECIFIED COLUMNS
col_titles = ['county_name', 'incident_count', 'lat_long']
df = df.reindex(columns=col_titles)
However 'lat_long' is added to the data frame as such: {'type': 'Point', 'coordinates': [-75.71107, 4...
I thought once I figured out how to properly add the coordinates to the data frame, I would then create two separate columns, one for latitude and one for longitude.
Any help with this matter would be appreciated. Thank you.
If I haven't misunderstood your requirements, you can try this with json_normalize. I've added a demo for a single JSON object; you can use apply or a lambda for multiple datasets.
import pandas as pd

record = {":#computed_region_amqz_jbr4": "587", ":#computed_region_d3gw_znnf": "18", ":#computed_region_nmsq_hqvv": "55", ":#computed_region_r6rf_p9et": "36", ":#computed_region_rayf_jjgk": "295", "arrests": "1", "county_code": "44", "county_code_text": "44", "county_name": "Mifflin", "fips_county_code": "087", "fips_state_code": "42", "incident_count": "1", "lat_long": {"type": "Point", "coordinates": [-77.620031, 40.612749]}}

df = pd.json_normalize(record)  # pd.io.json.json_normalize is deprecated in modern pandas
df_modified = df[['county_name', 'incident_count', 'lat_long.type']].copy()  # .copy() avoids SettingWithCopyWarning
# GeoJSON Point coordinates are ordered [longitude, latitude]
df_modified['lng'] = df['lat_long.coordinates'][0][0]
df_modified['lat'] = df['lat_long.coordinates'][0][1]
print(df_modified)
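As an aside on the "apply or lambda for multiple datasets" point: pd.json_normalize also accepts a list of dicts, so several API records can be flattened in one call (a hedged sketch):

records = [record]  # e.g. every response dict collected from the API
df_all = pd.json_normalize(records)  # one row per record; nested keys become dotted columns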
Here is how you can do it as well:
df1 = pd.json_normalize(record)
coords = (df1['lat_long.coordinates'].apply(pd.Series)
          .rename(columns={0: 'lng', 1: 'lat'}))  # GeoJSON order is [longitude, latitude]
pd.concat([df1, coords], axis=1).drop(columns=['lat_long.coordinates', 'lat_long.type'])