Convert dataframe column with Json Array content into separate columns - python

I have a dataframe column containing a JSON array that I want to split into separate columns for every row.
Dataframe
FIRST_NAME CUSTOMFIELDS
0 Maria [{'FIELD_NAME': 'CONTACT_FIELD_1', 'FIELD_VALU...
1 John [{'FIELD_NAME': 'CONTACT_FIELD_1', 'FIELD_VALU...
...
Goal
I need to convert the JSON content in that column into a dataframe:
+------------+-----------------+-------------+-----------------+
| FIRST NAME | FIELD_NAME | FIELD_VALUE | CUSTOM_FIELD_ID |
+------------+-----------------+-------------+-----------------+
| Maria | CONTACT_FIELD_1 | EN | CONTACT_FIELD_1 |
| John | CONTACT_FIELD_1 | false | CONTACT_FIELD_1 |
+------------+-----------------+-------------+-----------------+

The code snippet below should work for you.
import pandas as pd
df = pd.DataFrame()
df['FIELD'] = [[{'FIELD_NAME': 'CONTACT_FIELD_1', 'FIELD_VALUE': 'EN', 'CUSTOM_FIELD_ID': 'CONTACT_FIELD_1'}, {'FIELD_NAME': 'CONTACT_FIELD_10', 'FIELD_VALUE': 'false', 'CUSTOM_FIELD_ID': 'CONTACT_FIELD_10'}]]
temp_dict = {}
counter = 0
for entry in df['FIELD'][0]:
    temp_dict[counter] = entry
    counter += 1
new_dataframe = pd.DataFrame.from_dict(temp_dict, orient='index')
new_dataframe #outputs dataframe
Edited answer to reflect edited question:
Under the assumption that each entry in CUSTOMFIELDS is a list with 1 element (which is different from original question; the entry had 2 elements), the following will work for you and create a dataframe in the requested format.
import pandas as pd
# Need to recreate example problem
df = pd.DataFrame()
df['CUSTOMFIELDS'] = [[{'FIELD_NAME': 'CONTACT_FIELD_1', 'FIELD_VALUE': 'EN', 'CUSTOM_FIELD_ID': 'CONTACT_FIELD_1'}],
[{'FIELD_NAME': 'CONTACT_FIELD_1', 'FIELD_VALUE': 'FR', 'CUSTOM_FIELD_ID': 'CONTACT_FIELD_1'}]]
df['FIRST_NAME'] = ['Maria', 'John']
#begin solution
counter = 0
dataframe_solution = pd.DataFrame()
for index, row in df.iterrows():
    dataframe_solution = pd.concat([dataframe_solution, pd.DataFrame.from_dict(row['CUSTOMFIELDS'][0], orient='index').transpose()], sort=False, ignore_index=True)
    dataframe_solution.loc[counter, 'FIRST_NAME'] = row['FIRST_NAME']
    counter += 1
Your dataframe is in dataframe_solution
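As a side note (not part of the original answers), a more compact route, assuming each CUSTOMFIELDS cell really is a list of dicts as shown above, could combine DataFrame.explode with pd.json_normalize. A minimal sketch:
import pandas as pd
df = pd.DataFrame({
    'FIRST_NAME': ['Maria', 'John'],
    'CUSTOMFIELDS': [
        [{'FIELD_NAME': 'CONTACT_FIELD_1', 'FIELD_VALUE': 'EN', 'CUSTOM_FIELD_ID': 'CONTACT_FIELD_1'}],
        [{'FIELD_NAME': 'CONTACT_FIELD_1', 'FIELD_VALUE': 'false', 'CUSTOM_FIELD_ID': 'CONTACT_FIELD_1'}],
    ],
})
# One row per dict in each list, then expand each dict into its own columns.
exploded = df.explode('CUSTOMFIELDS').reset_index(drop=True)
fields = pd.json_normalize(exploded['CUSTOMFIELDS'].tolist())
result = pd.concat([exploded[['FIRST_NAME']], fields], axis=1)
print(result)
This keeps every element of every row's list, so a list with several entries simply expands into several rows for that person.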

Related

Pandas export data to CSV and make first row headers

I have this table which I export to CSV using this code:
df['time'] = df['time'].astype("datetime64").dt.date
df = df.set_index("time")
df = df.groupby(df.index).agg(['min', 'max', 'mean'])
df = df.reset_index()
df = df.to_csv(r'C:\****\Exports\exportMMA.csv', index=False)
While exporting this, my result is:
| column1 | column2 | column3 |
|:---- |:------: | -----: |
| FT1 | FT2 | FT3 |
| 12 | 8 | 3 |
I want to get rid of column1, column2, column3 and replace the header with FT2 and FT3.
Tried this:
new_header = df.iloc[0] #grab the first row for the header
df = df[1:] #take the data less the header row
df.columns = new_header #set the header row as the df header
And this:
df.columns = df.iloc[0]
df = df[1:]
Somehow it won't work. I don't really need to replace the headers in the dataframe; having the right headers in the CSV is more important.
Thanks!
You can change the columns names in your DataFrame before writing it to a CSV file. Here is the updated code:
df['time'] = df['time'].astype("datetime64").dt.date
df = df.set_index("time")
df = df.groupby(df.index).agg(['min', 'max', 'mean'])
df = df.reset_index()
# Changing the column names
df.columns = ['FT2', 'FT3']
# Writing the DataFrame to a CSV file
df.to_csv(r'C:\****\Exports\exportMMA.csv', index=False, header=True)
The header parameter in the to_csv method determines whether or not to write the column names to the CSV file. In this case, it's set to True so that the column names will be written.
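One caveat: agg(['min', 'max', 'mean']) over several value columns produces a MultiIndex on the columns, so hard-coding exactly two names may not line up with the real data. A hedged sketch for flattening such a MultiIndex before writing the CSV (the data and the FT2/FT3 names below are only illustrative):
import pandas as pd
# Illustrative stand-in for the original data, which the question does not show in full.
df = pd.DataFrame({'time': ['2023-01-01', '2023-01-01', '2023-01-02'],
                   'FT2': [12, 14, 8],
                   'FT3': [3, 5, 7]})
df['time'] = pd.to_datetime(df['time']).dt.date
out = df.groupby('time').agg(['min', 'max', 'mean']).reset_index()
# Flatten ('FT2', 'min') style column tuples into single names like 'FT2_min'.
out.columns = ['_'.join(c).strip('_') for c in out.columns.to_flat_index()]
out.to_csv('exportMMA.csv', index=False)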

How to rename column names according to value of columns

I need to arrange a Pandas DataFrame with values that aren't in the right columns. I would like to rearrange the values in the cells according to a prefix that I have, and push the 'unknown' columns with their values to the end of the dataframe.
I have the following dataframe:
The output I am looking for is:
The 'known' values get a header, while the unknown ones (5, 6) go to the end.
The 'rule': if there is no cell containing '|' in a column, then that column's name is not changed.
Any suggestions that I could try would be really helpful in solving this.
Try this:
import pandas as pd
rename_dict = {}  # dictionary of columns to rename
df = pd.DataFrame({'1': ['name | Steve', 'name | John'],
                   '2': [None, None],
                   '3': [None, 'age | 50']})
for col in df.columns:
    vals = df[col].values                 # look at the values in each column
    vals = [x for x in vals if x]         # remove Nulls
    vals = [x for x in vals if '|' in x]  # keep only values containing |
    if len(vals) > 0:
        new_col_name = vals[0].split('|')[0]  # the new column name
        rename_dict[col] = new_col_name       # add it to the rename dictionary
df.rename(columns=rename_dict, inplace=True)  # rename the columns
df
name 2 age
0 name | Steve None None
1 name | John None age | 50
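To also push the columns that did not get renamed (the 'unknown' ones) to the end, as the question asks, a small follow-up sketch building on df and rename_dict from the snippet above (assuming the renamed labels are unique):
# Columns that received a new name come first; everything else goes last.
known = [c for c in df.columns if c in rename_dict.values()]
unknown = [c for c in df.columns if c not in rename_dict.values()]
df = df[known + unknown]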
It looks a bit tricky and not exactly what you expected, but it might give you an idea of how to solve your task:
df = pd.DataFrame([['email | 1#mail.com','name | name1','surname | surname1','','',''],
['email | 2#mail.com','','name | name2','occupation | student','surname | surname2','abc | 123']])
df.apply(lambda x: pd.Series(dict([tuple(i.split(' | ')) for i in x.tolist() if i])),axis=1)
Output:
abc email name occupation surname
0 NaN 1#mail.com name1 NaN surname1
1 123 2#mail.com name2 student surname2
You can try this solution:
my_dict = {}

def createDict(ss):
    for i in range(1, 7, 1):
        sss = ss[i].split('|')
        if len(sss) > 1:
            if sss[0].strip() in my_dict:
                my_dict[sss[0].strip()].append(ss[i])
            else:
                my_dict[sss[0].strip()] = [ss[i]]

df.apply(lambda x: createDict(x), axis=1)  # apply is used only for its side effect of filling my_dict
dff = pd.DataFrame.from_dict(my_dict, orient='index')
dff = dff.transpose()
print(dff)
Hope this answers your question.

How do you replace the "_" in the name of a list of keys from a dictionary as a result of doing a dict.keys()?

I have the following code.....
@classmethod
def get_instruments_dict(cls):
i_list = cls.get_instruments_list()
i_keys = [x.name for x in i_list]
#print(i_list) ==> Result [{'name': 'ZAR_JPY', 'type': 'CURRENCY', 'displayName': 'ZAR/JPY', 'pipLocation': 0.01, 'marginRate': '0.07'}, {'name': 'EUR_HUF', 'type': 'CURRENCY', 'displayName': 'EUR/HUF', 'pipLocation': 0.01, 'marginRate': '0.05'}....]
#print (i_keys) ==> Results ['ZAR_JPY', 'EUR_HUF', 'EUR_DKK', 'USD_MXN', 'GBP_USD', 'CAD_CHF', 'EUR_GBP'...........]
df = pd.DataFrame(i_keys)
print(df) ==> Result
"""
| | 0 |
|:---|:------|
|0 |ZAR_JPY|
|1 |EUR_HUF|
|2 |EUR_DKK|
|3 |USD_MXN|
|4 |GBP_USD|
|.. | ...|
|63 |USD_PLN|
|64 |CAD_HKD|
|65 |GBP_CAD|
|66 |GBP_PLN|
|67 | |
"""
# I Tried the following with no luck.......................
list = df.astype(str).tolist()
print(list) ==> <bound method DataFrame.count of Empty DataFrame
Columns: []
Index: []>
return {k:v for (k,v) in zip(i_keys, i_list) }
I would like to remove the "_" from each name in the df. Maybe a loop over the DataFrame would work? I can't seem to find an example of this with a list of keys. There doesn't appear to be a way to use the index of this output.
The end result should look like this, as a usable DataFrame when done:
Index  Name
0      ZARJPY
1      EURHUF
2      EURDKK
3      USDMXN
4      GBPUSD
..     ...
63     USDPLN
64     CADHKD
65     GBPCAD
66     GBPPLN
67
It should be pretty easy,
Sample Data:
df = pd.DataFrame({0:['ZAR_JPY', 'EUR_HUF', 'EUR_DKK', 'USD_MXN', 'GBP_USD']})
print(df)
0
0 ZAR_JPY
1 EUR_HUF
2 EUR_DKK
3 USD_MXN
4 GBP_USD
Rename zero to Names:
df = df.rename(columns={0: "Names"})
print(df)
Names
0 ZAR_JPY
1 EUR_HUF
2 EUR_DKK
3 USD_MXN
4 GBP_USD
Use replace:
df['Names'] = df['Names'].str.replace("_", '')
print(df)
Names
0 ZARJPY
1 EURHUF
2 EURDKK
3 USDMXN
4 GBPUSD
In case you are looking for the Index as a column as well:
df.index.names = ['Index']
df = df.reset_index()
print(df)
Index Names
0 0 ZARJPY
1 1 EURHUF
2 2 EURDKK
3 3 USDMXN
4 4 GBPUSD
Or in a single line (DataFrame.replace needs regex=True here so that it replaces substrings):
df = df.rename(columns={0: "Names"}).replace("_", "", regex=True).reset_index()
Another option could be using pd.DataFrame + dict.get + str.replace
Using comprehension-list:
# If name key is not in the dictionary, then 'Not found' will be returned.
df_output = pd.DataFrame(columns=['Name'], data=[d.get('name', 'Not found').replace("_","") for d in i_list]).reset_index()
print(df_output)
Without comprehension-list:
data = []
for d in i_list:
    name = d.get('name', 'Not found').replace("_", "")
    data.append(name)
df_output = pd.DataFrame(columns=['Name'], data=data).reset_index()
print(df_output)
EDIT:
If you want the index as a new column in the final dataframe, add the reset_index() method. I have updated the code above with this change.
Output:
   index    Name
0      0  ZARJPY
1      1  EURHUF

How to create a dictionary and store values of multiple columns w.r.t to one column and reconstruct it back to original dataframe?

Hello I have a dataframe,
---------------------
ID | PARTY | KId
---------------------
1 | IND | 12
2 | IND | 13
3 | CUST | 14
4 | IND | 17
---------------------
I want to create a dict in Python that stores the values of the 'PARTY' and 'KId' columns w.r.t. the value in 'ID'.
So my dictionary should be like:
dict = {
1 : 'IND_12',
2 : 'IND_13'
.
.
}
what I tried:
dict = {}
df = df.reset_index(drop=True)
for idx in df_.index:
    temp = df[df.index == idx]
    dict[temp['ID'].iloc[0]] = f"{temp['PARTY'].iloc[0]}_{temp['KId'].iloc[0]}"
After this, the dictionary is generated. What is possibly the best solution to reconstruct my original df from the dict?
Don't use dict as a variable name, because it shadows a Python built-in.
Solution if ID is column:
You can join both columns and convert the result to a dict:
df['PARTY'] = df['PARTY'] + '_' + df['KId'].astype(str)
d = df.set_index('ID')['PARTY'].to_dict()
Or:
df['PARTY'] = df['PARTY'] + '_' + df['KId'].astype(str)
# fails if dict was already used as a variable name before
d = dict(zip(df['ID'], df['PARTY']))
Solution if ID is index use Series.str.cat:
d = df['PARTY'].str.cat(df['KId'].astype(str), sep='_').to_dict()
print (d)
{1: 'IND_12', 2: 'IND_13', 3: 'CUST_14', 4: 'IND_17'}
To convert back, use:
d = {1: 'IND_12', 2: 'IND_13', 3: 'CUST_14', 4: 'IND_17'}
df = pd.DataFrame.from_dict(d, orient='index', columns=['PARTY'])
df[['PARTY','KId']] = df['PARTY'].str.split('_', expand=True)
df['KId'] = df['KId'].astype(int)
df = df.rename_axis('ID')
print (df)
PARTY KId
ID
1 IND 12
2 IND 13
3 CUST 14
4 IND 17
Just make use of the astype() method and the to_dict() method:
import pandas as pd
mydict = (df['PARTY'] + '_' + df['KId'].astype(str)).to_dict()
To turn it back into a dataframe, use this:
df=pd.DataFrame(mydict.values(),index=mydict.keys())
df=df[0].str.split('_',expand=True).rename(columns={0:'PARTY',1:'KId'})
df['KId']=df['KId'].astype(int)
df.index.name='ID'

Searching if any one of the words is present in another column of a dataframe or in another dataframe using Python

Hi I have two DataFrames like below
DF1
Alpha | Numeric | Special
and | 1 | #
or | 2 | $
| 3 | &
| 4 |
| 5 |
and
DF2 with single column
Content |
boy or girl |
school # morn|
I want to check whether any column of DF1 contains any of the keywords in the Content column of DF2, and the output should be in a new DF.
output_DF
output_column|
Alpha |
Special |
Could someone help me with this?
I have a method that is not very good.
df1 = pd.DataFrame([[['and', 'or'],['1', '2','3','4','5'],['#', '$','&']]],columns=['Alpha','Numeric','Special'])
print(df1)
Alpha Numeric Special
0 [and, or] [1, 2, 3, 4, 5] [#, $, &]
df2 = pd.DataFrame([[['boy', 'or','girl']],[['school', '#','morn']]],columns=['Content'])
print(df2)
Content
0 [boy, or, girl]
1 [school, #, morn]
First, combine the df2 data:
df2list=[x for row in df2['Content'].tolist() for x in row]
print(df2list)
['boy', 'or', 'girl', 'school', '#', 'morn']
Then check which columns of df1 intersect with df2list:
containlistname = []
for i in range(0, df1.shape[1]):
    columnsname = df1.columns[i]
    df1list = [x for row in df1[columnsname].tolist() for x in row]
    intersection = list(set(df1list).intersection(set(df2list)))
    if len(intersection) > 0:
        containlistname.append(columnsname)
output_DF = pd.DataFrame(containlistname, columns=['output_column'])
Final print:
print(output_DF)
output_column
0 Alpha
1 Special
You could apply the Series.isin() method for each column in df1 and then return the column names for which there are any occurrences:
import pandas as pd
d = {'Alpha' :['and', 'or'],'Numeric':[1, 2,3,4,5],'Special':['#', '$','&']}
df1 = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in d.items()]))
df2 = pd.DataFrame({'Content' :['boy or girl','school # morn']})
check = lambda r:[c for c in df1.columns if df1[c].dropna().isin(r).any()]
df3 = pd.DataFrame({'output_column' : df2["Content"].str.split(' ').apply(check)})
This results in:
output_column
0 [Alpha]
1 [Special]
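If you want one column name per row, matching the requested output_DF, the list cells from df3 above can be flattened with DataFrame.explode (a small follow-up sketch, assuming pandas 0.25+):
output_DF = df3.explode('output_column').reset_index(drop=True)
print(output_DF)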
