Pandas merge column after json_normalize - python

I have a list of dicts in a single column, and for each row a different post_id in a separate column. I've gotten the dataframe I'm looking for via pd.concat(json_normalize(d) for d in data['comments']), but I'd like to add another column from the original dataframe so the flattened rows keep their original post_id.
Original
post_id  comments
123456   [{'from': 'Bob', 'present': True}, {'from': 'Jon', 'present': False}]
Current Result (after json_normalize)
comments.from  comments.present
Bob            True
Jon            False
Desired Result
comments.from  comments.present  post_id
Bob            True              123456
Jon            False             123456
Thanks for any help

Consider first outputting the dataframe with to_json, then running json_normalize:
import json
from pandas import DataFrame
from pandas.io.json import json_normalize  # pd.json_normalize in pandas >= 1.0

df = DataFrame({'post_id': 123456,
                'comments': [{'from': 'Bob', 'present': True},
                             {'from': 'Jon', 'present': False}]})
df_json = df.to_json(orient='records')
# each record now carries its own post_id, so flattening the nested
# comments dict keeps the post_id column alongside it
finaldf = json_normalize(json.loads(df_json))
print(finaldf)
#   comments.from  comments.present  post_id
# 0           Bob              True   123456
# 1           Jon             False   123456
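If the original records are still available before they were split apart, json_normalize can do the whole job in one call: record_path expands each comment into its own row, and meta copies post_id onto every expanded row. A minimal sketch, assuming data is a list of records shaped like the question's original dataframe:
import pandas as pd

# hypothetical records matching the question's structure
data = [{'post_id': 123456,
         'comments': [{'from': 'Bob', 'present': True},
                      {'from': 'Jon', 'present': False}]}]

# one row per comment, with post_id repeated on each row
flat = pd.json_normalize(data, record_path='comments', meta=['post_id'])
print(flat)
#   from  present  post_id
# 0  Bob     True   123456
# 1  Jon    False   123456
Note the comment columns come out as from and present here, without the comments. prefix, because record_path descends into the list before flattening.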

Related

How to check if a value of dataframe one exists in dataframe two and join the two dataframes?

I have two csv files like the ones below:
city.csv :
City,Province
aa,b
bb,c
ee,b
customers.csv:
Address, CustomerID
John Smith aa blab blab, 234
Micheal Smith bb blab2 blab2, 123
I want to join the two csv files in a pandas dataframe on the condition that the City appears in the Address.
I tried the code below:
import pandas as pd
df1 = pd.read_csv(r"city.csv")
df2 = pd.read_csv(r"customers.csv")
df1["City"] = df2.drop("Address", 1).isin(df2["Address"]).any(1)
I followed this Q/A but it did not work for me.
How can I join these two csv files in a pandas dataframe?
Use:
# build an alternation pattern from the known city names, then extract the
# first matching city from each address
pat = '|'.join(df1["City"].values)
df2['col to join'] = df2['Address'].str.extract(f'({pat})')
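A sketch of the full join, rebuilding the two frames from the csv samples above (the merge step and the final column name 'City' are assumptions about the desired output):
import pandas as pd

df1 = pd.DataFrame({'City': ['aa', 'bb', 'ee'], 'Province': ['b', 'c', 'b']})
df2 = pd.DataFrame({'Address': ['John Smith aa blab blab',
                                'Micheal Smith bb blab2 blab2'],
                    'CustomerID': [234, 123]})

# extract the first city name that appears in each address
pat = '|'.join(df1['City'].values)
df2['City'] = df2['Address'].str.extract(f'({pat})', expand=False)

# then an ordinary merge joins on the extracted city
merged = df2.merge(df1, on='City', how='left')
print(merged)
#                         Address  CustomerID City Province
# 0       John Smith aa blab blab         234   aa        b
# 1  Micheal Smith bb blab2 blab2         123   bb        c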

Cannot append new data to a pd dataframe

I'm trying to save some data in a dataframe. The first row of the dataframe should be ('Tom', .99, 'tom2'). Suppose I need to add a ('mart', .3, 'mart2') row to the dataframe; I've tried to use append but it adds nothing. This is my code:
import pandas as pd
trackeds = {'Name':['Tom'], 'proba':[.99],'name2':['tom2']}
df_trackeds = pd.DataFrame(trackeds)
df_trackeds.append(pd.DataFrame({'name':['mart'],'proba': [.3],'name2':['mart2']}))
print(df_trackeds)
The output is
  Name  proba name2
0  Tom   0.99  tom2
I also tried to use
df_trackeds.append({'name':['mart'],'proba': [.3],'name2':['mart2']},ignore_index=True)
and
df_trackeds.append(pd.DataFrame({'name':['mart'],'proba': [.3],'name2':['mart2']}))
but nothing changed. I hope you can help me; thanks in advance.
Pandas' DataFrame.append does not work in place like Python's list.append, so it is necessary to assign the result back (note also that the appended frame must use the existing column name 'Name', not 'name', or a new column is created):
df = pd.DataFrame({'Name':['mart'],'proba': [.3],'name2':['mart2']})
df_trackeds = df_trackeds.append(df, ignore_index=True)
print(df_trackeds)
   Name  proba  name2
0   Tom   0.99   tom2
1  mart   0.30  mart2
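On recent pandas versions the same fix applies with pd.concat, since DataFrame.append was deprecated in 1.4 and removed in 2.0; a sketch:
import pandas as pd

df_trackeds = pd.DataFrame({'Name': ['Tom'], 'proba': [.99], 'name2': ['tom2']})
new_row = pd.DataFrame({'Name': ['mart'], 'proba': [.3], 'name2': ['mart2']})

# concat also returns a new DataFrame, so assign the result back
df_trackeds = pd.concat([df_trackeds, new_row], ignore_index=True)
print(df_trackeds)
#    Name  proba  name2
# 0   Tom   0.99   tom2
# 1  mart   0.30  mart2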

How do I extract values from different columns after a groupby in pandas?

I have the following input file in csv:
INPUT
ID,GroupID,Person,Parent
ID_001,A001,John Doe,Yes
ID_002,A001,Mary Jane,No
ID_003,A001,James Smith;John Doe,Yes
ID_004,B003,Nathan Drake,Yes
ID_005,B003,Troy Baker,No
The desired output is the following:
DESIRED OUTPUT
ID,GroupID,Person
ID_001,A001,John Doe;Mary Jane;James Smith
ID_003,A001,John Doe;Mary Jane;James Smith
ID_004,B003,Nathan Drake;Troy Baker
Basically, I want to group by the same GroupID and then concatenate all the values present in Person column that belong to that group. Then, in my output, for each group I want to return the ID(s) of those rows where the Parent column is "Yes", the GroupID, and the concatenated person values for each group.
I am able to concatenate all person values for a particular group and remove any duplicate values from the person column in my output. Here is what I have so far:
import pandas as pd
inputcsv = 'input.csv'    # path to the input csv file
outputcsv = 'output.csv'  # path to the output csv file
colnames = ['ID', 'GroupID', 'Person', 'Parent']
df1 = pd.read_csv(inputcsv, names = colnames, header = None, skiprows = 1)
# First I group by GroupID, concatenate the values in the Person column,
# and remove duplicate person values before saving the df to a csv.
df2 = df1.groupby('GroupID')['Person'].apply(';'.join).str.split(';').apply(set).apply(';'.join).reset_index()
df2.to_csv(outputcsv, sep=',', index=False)
This yields the following output:
GroupID,Person
A001,John Doe;Mary Jane;James Smith
B003,Nathan Drake;Troy Baker
I can't figure out how to include the ID column and include all rows in a group where the Parent is "Yes" (as shown in the desired output above).
IIUC
# 1st split each string into a list
df.Person = df.Person.str.split(';')
# transform broadcasts the group-wise result back onto every row of the group
# (apply vs transform: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object)
df['Person'] = df.groupby(['GroupID']).Person.transform(lambda x: ';'.join(set(sum(x, []))))
# then filter with Parent
df = df.loc[df.Parent.eq('Yes')]
df
Out[239]:
       ID GroupID                          Person Parent
0  ID_001    A001  James Smith;John Doe;Mary Jane    Yes
2  ID_003    A001  James Smith;John Doe;Mary Jane    Yes
3  ID_004    B003         Troy Baker;Nathan Drake    Yes
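To match the desired output exactly, which omits the Parent column, select just the remaining columns before writing (a small follow-up, using the outputcsv path from the question):
df[['ID', 'GroupID', 'Person']].to_csv(outputcsv, index=False)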

How to transform dataframe into "ColumnName1 | Value1 \r\n ColumnName2 | Value2 \r\n ColumnName3 | Value3" etc

I have a pandas dataframe consisting of 11 columns and 1 row. I need the final output to go from:
Type  ID   From  To
XYZ   999  Tony  Andy
To:
Type|XYZ
ID|999
From|Tony
To|Andy
The result will then be exported to a txt file, which I believe I can manage.
Thank you!
Just use transpose:
m = df.transpose()
And then (reset_index turns the transposed index, i.e. the original column names, into a regular column, so each row holds a name/value pair):
[str(list(a)[0]) + '|' + str(list(a)[1]) for a in m.reset_index().values]
You can zip together the column names and the values from the first row to pipe-delimit them, then join the resulting list with a newline character to put each pair on a separate line.
import pandas as pd

df = pd.DataFrame([{"Type": "XYZ", "ID": 999, "From": "Tony", "To": "Andy"}])
print(
    "\n".join(["|".join([col, str(val)]) for col, val in zip(df.columns, df.iloc[0])])
)
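For completeness, a sketch of the export to a txt file that the question mentions (output.txt is a placeholder name; newline='\r\n' forces the \r\n line endings from the title):
lines = "\n".join("|".join([col, str(val)]) for col, val in zip(df.columns, df.iloc[0]))
with open("output.txt", "w", newline="\r\n") as fh:
    fh.write(lines)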

Python reading JSON in dataframe

I have an SQL database with two columns: one holds a timestamp, the other holds data in JSON format.
For example, df:
ts data
'2017-12-18 02:30:20.553' {'name':'bob','age':10, 'location':{'town':'miami','state':'florida'}}
'2017-12-18 02:30:21.101' {'name':'dan','age':15, 'location':{'town':'new york','state':'new york'}}
'2017-12-18 02:30:21.202' {'name':'jay','age':11, 'location':{'town':'tampa','state':'florida'}}
If I do the following:
df = df['data'][0]
print (df['name'].encode('ascii', 'ignore'))
I get:
'bob'
Is there a way I can get all of the data corresponding to a JSON key for the whole column?
(i.e. for the df column 'data', get 'name')
'bob'
'dan'
'jay'
Essentially I would like to be able to make a new df column called 'name'
You can use json_normalize, i.e.
pd.io.json.json_normalize(df['data'])['name']  # pd.json_normalize in pandas >= 1.0
0    bob
1    dan
2    jay
Name: name, dtype: object
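The same call also flattens the nested location dict into dotted column names, so nested keys are reachable directly (assuming the data column holds parsed dicts):
pd.io.json.json_normalize(df['data'])['location.town']
0       miami
1    new york
2       tampa
Name: location.town, dtype: object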
IIUC, let's use apply with a lambda function to select the value from each dictionary by key:
df['data'].apply(lambda x: x['name'])
Output:
0    bob
1    dan
2    jay
Name: data, dtype: object
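To actually add the extracted fields back onto the original frame, as the question asks, join the flattened columns on the index. A minimal sketch, assuming the data column already holds parsed dicts (if the SQL driver returns JSON strings, apply json.loads first):
import json
import pandas as pd

df = pd.DataFrame({
    'ts': ['2017-12-18 02:30:20.553', '2017-12-18 02:30:21.101'],
    'data': [{'name': 'bob', 'age': 10,
              'location': {'town': 'miami', 'state': 'florida'}},
             {'name': 'dan', 'age': 15,
              'location': {'town': 'new york', 'state': 'new york'}}],
})

# df['data'] = df['data'].apply(json.loads)  # only needed for JSON strings

# json_normalize returns a fresh 0..n-1 index, which lines up with the
# default index here, so join attaches the new columns next to ts
df = df.join(pd.json_normalize(df['data']))
print(df[['ts', 'name', 'age', 'location.town']])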
