I have an SQL database with two columns: one holds a timestamp, the other holds data in JSON format.
For example, df:
ts data
'2017-12-18 02:30:20.553' {'name':'bob','age':10, 'location':{'town':'miami','state':'florida'}}
'2017-12-18 02:30:21.101' {'name':'dan','age':15, 'location':{'town':'new york','state':'new york'}}
'2017-12-18 02:30:21.202' {'name':'jay','age':11, 'location':{'town':'tampa','state':'florida'}}
If I do the following:
df = df['data'][0]
print (df['name'].encode('ascii', 'ignore'))
I get :
'bob'
Is there a way I can get all of the data corresponding to a JSON key for the whole column?
(i.e., for the df column 'data', get 'name')
'bob'
'dan'
'jay'
Essentially, I would like to be able to make a new df column called 'name'.
You can use json_normalize, i.e.
pd.io.json.json_normalize(df['data'])['name']
0 bob
1 dan
2 jay
Name: name, dtype: object
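Note that in newer pandas versions (1.0+), json_normalize is exposed at the top level and the pd.io.json path is deprecated. A minimal sketch of the equivalent call, assuming the 'data' column already holds dicts rather than JSON strings:
import pandas as pd
# pandas >= 1.0: json_normalize is available at the top level
names = pd.json_normalize(df['data'].tolist())['name']
print(names)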
IIUC, let's use apply with a lambda function to select the value from each dictionary by key:
df['data'].apply(lambda x: x['name'])
Output:
0 bob
1 dan
2 jay
Name: data, dtype: object
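To get the new 'name' column the question asks for, you can assign the result back, and the same pattern reaches nested keys. A small sketch reusing the example data (the 'town' column name is just illustrative):
df['name'] = df['data'].apply(lambda x: x['name'])
df['town'] = df['data'].apply(lambda x: x['location']['town'])
print(df[['ts', 'name', 'town']])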
Related
I want to remove duplicates in a column via Pandas.
I tried df.drop_duplicates() but no luck.
How to achieve this in Pandas?
Input:
A
team=red, Manager=Travis
team=Blue, Manager=John, team=Blue
Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC
Expected Output:
A
team=red, Manager=Travis
team=Blue, Manager=John
Manager=David, Bank=HDFC, team=XYZ
Code
df = df.drop_duplicates('A', keep='last')
You can use some basic data structures to achieve this result:
1. split the entries
2. convert to a set (or some other structure without duplicates)
3. join back into a string
print(df['A'])
0 team=red, Manager=Travis
1 team=Blue, Manager=John, team=Blue
2 Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC
Name: A, dtype: object
out = (
    df['A'].str.split(r',\s+')
           .map(set)
           .str.join(", ")
)
print(out)
0 Manager=Travis, team=red
1 team=Blue, Manager=John
2 Bank=HDFC, team=XYZ, Manager=David
Name: A, dtype: object
Alternatively, if the order of your string entries is important, you can use dict.fromkeys instead of a set, since dictionaries preserve insertion order as of Python 3.7 (and in CPython 3.6 as an implementation detail):
out = (
    df['A'].str.split(r',\s+')
           .map(dict.fromkeys)
           .str.join(", ")
)
print(out)
0 team=red, Manager=Travis
1 team=Blue, Manager=John
2 Manager=David, Bank=HDFC, team=XYZ
Name: A, dtype: object
Try:
df['A'].str.split(',').explode().str.strip(' ')\
    .drop_duplicates().groupby(level=0).agg(','.join)
Output:
0 team=red,Manager=Travis
1 team=Blue,Manager=John
2 Manager=David,Bank=HDFC,team=XYZ
Name: A, dtype: object
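One caveat with this variant: drop_duplicates runs over the entire exploded Series, so a value repeated across different rows would also be dropped from all but its first row. If per-row deduplication is what you want in the general case, a sketch that dedupes within each original index instead:
out = (
    df['A'].str.split(r',\s*')
           .explode()
           .groupby(level=0)
           .unique()            # deduplicate within each original row, keeping order
           .apply(', '.join)
)
print(out)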
I currently have 2 CSV files and am reading them both in. I need to take the IDs from one CSV and find them in the other so that I can get their rows of data. Currently I have the following code, which I believe goes through the first dataframe but only adds the last match to the new dataframe. I need it to add all of the matching rows, however.
Here is my code:
patientSet = pd.read_csv("794_chips_RMA.csv")
affSet = probeset[probeset['Analysis']==1].reset_index(drop=True)
houseGenes = probeset[probeset['Analysis']==0].reset_index(drop=True)
for x in affSet['Probeset']:
    #patients = patientSet[patientSet['ID']=='1557366_at'].reset_index(drop=True)
    #patients = patientSet[patientSet['ID']=='224851_at'].reset_index(drop=True)
    patients = patientSet[patientSet['ID']==x].reset_index(drop=True)
print(affSet['Probeset'])
print(patientSet['ID'])
print(patients)
The following is the output:
0 1557366_at
1 224851_at
2 1554784_at
3 231578_at
4 1566643_a_at
5 210747_at
6 231124_x_at
7 211737_x_at
Name: Probeset, dtype: object
0 1007_s_at
1 1053_at
2 117_at
3 121_at
4 1255_g_at
...
54670 AFFX-ThrX-5_at
54671 AFFX-ThrX-M_at
54672 AFFX-TrpnX-3_at
54673 AFFX-TrpnX-5_at
54674 AFFX-TrpnX-M_at
Name: ID, Length: 54675, dtype: object
ID phchp003v1 phchp003v2 phchp003v3 ... phchp367v1 phchp367v2 phchp368v1 phchp368v2
0 211737_x_at 12.223453 11.747159 9.941889 ... 14.828389 9.322779 10.609053 10.771162
As you can see, it is only matching the very last ID from the first dataframe, not all of them. How can I get all of them to match and end up in patients? Thank you.
You probably want to use the merge function:
df_inner = pd.merge(df1, df2, on='id', how='inner')
Check here: https://www.datacamp.com/community/tutorials/joining-dataframes-pandas and search for "inner join".
--edit--
You can specify the columns to join on (using the left_on and right_on arguments); see: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
@Rui Lima already posted the correct answer, but since the two frames use different column names ('ID' vs 'Probeset'), you'll need to spell them out to make it work:
df = pd.merge(patientSet, affSet, left_on='ID', right_on='Probeset', how='inner')
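For what it's worth, the loop in the question only ever holds one match because patients is reassigned on every iteration, so only the last ID's rows survive. If you just need the rows of patientSet whose ID appears in affSet (without pulling in affSet's columns), filtering with isin is a lighter-weight sketch:
patients = patientSet[patientSet['ID'].isin(affSet['Probeset'])].reset_index(drop=True)
print(patients)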
I have the following input file in csv:
INPUT
ID,GroupID,Person,Parent
ID_001,A001,John Doe,Yes
ID_002,A001,Mary Jane,No
ID_003,A001,James Smith;John Doe,Yes
ID_004,B003,Nathan Drake,Yes
ID_005,B003,Troy Baker,No
The desired output is the following:
DESIRED OUTPUT
ID,GroupID,Person
ID_001,A001,John Doe;Mary Jane;James Smith
ID_003,A001,John Doe;Mary Jane;James Smith
ID_004,B003,Nathan Drake;Troy Baker
Basically, I want to group by the same GroupID and then concatenate all the values present in Person column that belong to that group. Then, in my output, for each group I want to return the ID(s) of those rows where the Parent column is "Yes", the GroupID, and the concatenated person values for each group.
I am able to concatenate all person values for a particular group and remove any duplicate values from the person column in my output. Here is what I have so far:
import pandas as pd
inputcsv = 'path to the input csv file'
outputcsv = 'path to the output csv file'
colnames = ['ID', 'GroupID', 'Person', 'Parent']
df1 = pd.read_csv(inputcsv, names = colnames, header = None, skiprows = 1)
#First I do a groupby on GroupID, concatenate the values in the Person column, and finally remove the duplicate person values from the output before saving the df to a csv.
df2 = df1.groupby('GroupID')['Person'].apply(';'.join).str.split(';').apply(set).apply(';'.join).reset_index()
df2.to_csv(outputcsv, sep=',', index=False)
This yields the following output:
GroupID,Person
A001,John Doe;Mary Jane;James Smith
B003,Nathan Drake;Troy Baker
I can't figure out how to include the ID column and include all rows in a group where the Parent is "Yes" (as shown in the desired output above).
IIUC
df.Person = df.Person.str.split(';')  # 1st, split each string into a list
df['Person'] = df.groupby(['GroupID']).Person.transform(lambda x: ';'.join(set(sum(x, []))))  # then transform broadcasts the same joined result to every row of the group; see https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object
df = df.loc[df.Parent.eq('Yes')]  # then filter using Parent
df
Out[239]:
ID GroupID Person Parent
0 ID_001 A001 James Smith;John Doe;Mary Jane Yes
2 ID_003 A001 James Smith;John Doe;Mary Jane Yes
3 ID_004 B003 Troy Baker;Nathan Drake Yes
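To match the desired output exactly (only the ID, GroupID, and Person columns, written to a file), you could finish with something along these lines, reusing the outputcsv path from the question:
df.loc[df.Parent.eq('Yes'), ['ID', 'GroupID', 'Person']].to_csv(outputcsv, index=False)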
I have the following dataframe:
id ip
1 219.237.42.155
2 75.74.144.120
3 219.237.42.155
By using maxmindb-geolite2 package, I can find out what city a specific ip is assigned to. The following code:
from geolite2 import geolite2
reader = geolite2.reader()
reader.get('219.237.42.155')
will return a dictionary, and by looking up keys, I can actually get a city name:
reader.get('219.237.42.155')['city']['names']['en']
returns:
'Beijing'
The problem I have is that I do not know how to get the city for each ip in the dataframe and put it in the third column, so the result would be:
id ip city
1 219.237.42.155 Beijing
2 75.74.144.120 Hollywood
3 219.237.42.155 Beijing
The farthest I got was mapping the whole dictionary to a separate column by using the code:
df['city'] = df['ip'].apply(lambda x: reader.get(x))
On the other hand:
df['city'] = df['ip'].apply(lambda x: reader.get(x)['city']['names']['en'])
throws a KeyError. What am I missing?
You can use apply to check if the key exists before trying to access its values:
df.apply(lambda x: reader.get(x.ip,np.nan),axis=1).apply(lambda x: np.nan if pd.isnull(x) else x['city']['names']['en'])
Out[39]:
0 Beijing
1 NaN
2 Beijing
dtype: object
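The KeyError typically means the record exists but has no 'city' block (reader.get returns None only when the IP cannot be resolved at all), so a defensive variant that relies only on reader.get(ip) and dict.get chaining is another option. A minimal sketch; ip_to_city is just an illustrative helper name:
import numpy as np

def ip_to_city(ip):
    record = reader.get(ip)                 # None if the IP is not in the database
    if not record:
        return np.nan
    # some records lack the 'city' block, so walk down with .get() at each level
    return record.get('city', {}).get('names', {}).get('en', np.nan)

df['city'] = df['ip'].apply(ip_to_city)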
I have a list of dicts in a single column, and for each row a different post_id in a separate column. I've gotten the dataframe I am looking for via pd.concat(json_normalize(d) for d in data['comments']), but I'd like to add a column from the original dataframe so each row carries its original post_id.
Original
'post_id' 'comments'
123456 [{'from':'Bob','present':True}, {'from':'Jon', 'present':False}]
Current Result (after json_normalize)
comments.from comments.present
Bob True
Jon False
Desired Result
comments.from comments.present post_id
Bob True 123456
Jon False 123456
Thanks for any help
Consider first outputting the dataframe with to_json, then running json_normalize:
import json
from pandas import DataFrame
from pandas.io.json import json_normalize
df = DataFrame({'post_id': 123456,
                'comments': [{'from':'Bob','present':True},
                             {'from':'Jon', 'present':False}]})
df_json = df.to_json(orient='records')
finaldf = json_normalize(json.loads(df_json), meta=['post_id'])
print(finaldf)
# comments.from comments.present post_id
# 0 Bob True 123456
# 1 Jon False 123456
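As a side note, in pandas 1.0+ json_normalize is available directly as pd.json_normalize (the pandas.io.json import is deprecated), and when the original frame has one row per post with 'comments' holding a list of dicts, record_path plus meta gets the same shape in one call. A sketch assuming that original layout (the frame is called original_df here for illustration):
import pandas as pd

records = original_df.to_dict('records')   # one dict per post row
finaldf = pd.json_normalize(records,
                            record_path='comments',
                            meta=['post_id'],
                            record_prefix='comments.')
# -> columns: comments.from, comments.present, post_id
print(finaldf)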