I have a very large, deeply nested JSON from which I need only some key-value pairs, not all of them. Because it is so deeply nested, it is not convenient to create a pandas DataFrame directly from the JSON, since the values I need would not all end up in columns.
I need to create a pandas DataFrame that looks like:
Groupe Id MotherName FatherName
Advanced 56 Laure James
Middle 11 Ann Nicolas
Advanced 6 Helen Franc
From my complex JSON I extract the values I need as follows (all the values are extracted correctly, so there is no error here):
jfile=json.loads(data)
groupe=jfile['location']['groupe']
id=jfile['id']
MotherName=jfile['Mother']['MotherName']
FatherName=jfile['Father']['FatherName']
jfile looks like:
{"location":{"town":"Rome","groupe":"Advanced",
"school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
"id":"145",
"Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
"Father":{"FatherName":"Peter","FatherAge":"51"},
"Teacher":["MrCrock","MrDaniel"],"Field":"Marketing","season":["summer","spring"]}
Then I create an empty dataframe and try to fill it with these values:
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
for index, row in df.iterrows():
    row['groupe'] = jfile['location']['groupe']
    row['id'] = jfile['id']
    row['MotherName'] = jfile['Mother']['MotherName']
    row['FatherName'] = jfile['Father']['FatherName']
but when I try to print it, it says the dataframe is empty:
Empty DataFrame Columns: [groupe, id, MotherName, FatherName] Index:
[]
Why is it empty with this method, and how can I fill it properly?
The loop never executes: iterrows() on an empty DataFrame yields no rows, so nothing is ever assigned. If you're just trying to add a single row to your dataframe, you can use
df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
               ignore_index=True)
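Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On recent versions you can build the one-row frame directly; a minimal sketch using a cut-down version of the jfile from the question:

```python
import json
import pandas as pd

data = '''{"location": {"town": "Rome", "groupe": "Advanced"},
           "id": "145",
           "Mother": {"MotherName": "Helen", "MotherAge": "46"},
           "Father": {"FatherName": "Peter", "FatherAge": "51"}}'''
jfile = json.loads(data)

# One dict = one row; wrapping it in a list gives a one-row DataFrame.
row = {"groupe": jfile["location"]["groupe"],
       "id": jfile["id"],
       "MotherName": jfile["Mother"]["MotherName"],
       "FatherName": jfile["Father"]["FatherName"]}
df = pd.DataFrame([row])
print(df)
```

To collect many rows, append each dict to a plain Python list and call pd.DataFrame(rows) once at the end, which is also much faster than growing a DataFrame row by row.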
How would I remove rows in a dataset whose values in two columns do not match a dictionary? For example, take this snippet of my data set.
I want to remove the rows where the data doesn't match the dictionary, as in row 6, and apply this to a large data set.
dictionary (SiteID -> Location): {188: 'A', 203: 'B', 206: 'C', 270: 'D', 463: 'E'}
Try doing it with merge:
d = {188:'A', 203:'B', 206:'C', 270:'D', 463:'E'}
mapping = pd.Series(d).rename_axis('SiteID').to_frame('Location').reset_index()
out = df.merge(mapping, how='left')
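Since the goal is to drop the non-matching rows rather than just annotate them, an inner merge on both columns does the filtering in one step. A sketch with made-up data in the question's shape (the SiteID/Location column names and the mismatching row are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: the last row's (SiteID, Location) pair is not in the dictionary.
df = pd.DataFrame({'SiteID': [188, 203, 999],
                   'Location': ['A', 'B', 'X']})

d = {188: 'A', 203: 'B', 206: 'C', 270: 'D', 463: 'E'}
mapping = pd.Series(d).rename_axis('SiteID').to_frame('Location').reset_index()

# An inner merge on the shared columns keeps only rows whose
# (SiteID, Location) pair appears in the dictionary.
out = df.merge(mapping, how='inner')
print(out)
```

The merge joins on all columns the two frames have in common (here SiteID and Location), so any row with a pair absent from the dictionary is dropped.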
I want to read a file 'tos_year.csv' into a Pandas dataframe, such that all values are in one single column. I will later use pd.concat() to add this column to an existing dataframe.
The CSV file holds 80 entries in the form of years, i.e. "... 1966,1966,1966,1966,1967,1967,... "
What I can't figure out is how to read the values into one column with 80 rows, instead of 80 columns with one row.
This is probably quite basic but I'm new to this. Here's my code:
import pandas as pd
tos_year = pd.read_csv('tos_year.csv').T
tos_year.reset_index(inplace=True)
tos_year.columns = ['Year']
As you can see, I tried reading it in and then transposing the dataframe, but when the file is first read, the year numbers are interpreted as column names, and since there cannot be several columns with identical names, I end up with a dataframe that holds str values like
...
1966
1966.1
1966.2
1966.3
1967
1967.1
...
which is not what I want. So clearly, it's preferable to read it in correctly from the start.
Thanks for any advice!
Add header=None to avoid parsing the years as column names, then transpose and rename the column, e.g. with DataFrame.set_axis:
tos_year = pd.read_csv('tos_year.csv', header=None).T.set_axis(['Year'], axis=1)
Or:
tos_year = pd.read_csv('tos_year.csv', header=None).T
tos_year.columns = ['Year']
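A quick way to check the behaviour without the actual file, using an in-memory sample in place of 'tos_year.csv' (the five years here are stand-ins for the 80 in the real file):

```python
import io

import pandas as pd

# Stand-in for the real file: a single line of comma-separated years.
csv_data = io.StringIO("1966,1966,1966,1967,1967")

# header=None reads the years as data (one row), .T turns that row into
# a column, and set_axis names it.
tos_year = pd.read_csv(csv_data, header=None).T.set_axis(['Year'], axis=1)
print(tos_year)
```

Each year becomes one row in the single 'Year' column, which is the shape needed for the later pd.concat().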
I tried to select a subset of the object-type column cells with str.split:
dataset['pictures'].str.split(pat=",")
I want to get the numbers 40092 and 39097 and the two picture dates as two columns, ID and DATE, but as a result I get one column consisting of NaNs.
'pictures' column:
{"col1":"40092","picture_date":"2017-11-06"}
{"col1":"39097","picture_date":"2017-10-31"}
...
Here's what I understood from your question:
You have a pandas DataFrame with one of the columns containing JSON strings (or any other string that needs to be parsed into multiple columns).
E.g.
df = pd.DataFrame({'pictures': [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}']
})
You want to parse the two elements ('col1' and 'picture_date') into two separate columns for further processing (or perhaps just one of them)
Define a function for parsing the row:
import json
def parse_row(r):
    j = json.loads(r['pictures'])
    return j['col1'], j['picture_date']
And use the pandas DataFrame.apply() method as follows:
df1 = df.apply(parse_row, axis=1, result_type='expand')
The result is a new dataframe with two columns - each containing the parsed data:
0 1
0 40092 2017-11-06
1 39097 2017-10-31
If you need just one column you can return a single element from parse_row (instead of a two element tuple in the above example) and just use df.apply(parse_row).
If the values are not in JSON format, just modify parse_row accordingly (split, convert strings to numbers, etc.).
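As an aside, pandas can also do the expansion step for you with json_normalize once the strings are decoded; a sketch using the same sample data (the ID/DATE names are the ones the question asked for):

```python
import json

import pandas as pd

df = pd.DataFrame({'pictures': [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}']
})

# Decode each JSON string to a dict, then let pandas expand the dicts
# into named columns.
parsed = pd.json_normalize(df['pictures'].map(json.loads).tolist())
df2 = parsed.rename(columns={'col1': 'ID', 'picture_date': 'DATE'})
print(df2)
```

This gives the columns proper names directly instead of the 0/1 produced by the tuple-returning parse_row.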
Thanks for the replies but I solved it by loading the 'pictures' column from the dataset into a list:
picturelist= dataset['pictures'].values.tolist()
Afterwards I created a dataframe from that list and concatenated it with the original dataset, without the pictures column:
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)
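For reference, this approach only works if the column already holds dicts rather than raw JSON strings (pd.DataFrame on a list of strings would give a single string column). A small demonstration with assumed data, including a hypothetical 'other' column to show the concat:

```python
import pandas as pd

dataset = pd.DataFrame({
    'other': [1, 2],
    'pictures': [{'col1': '40092', 'picture_date': '2017-11-06'},
                 {'col1': '39097', 'picture_date': '2017-10-31'}],
})

# A list of dicts expands into one column per key.
picturelist = dataset['pictures'].values.tolist()
two_new_columns = pd.DataFrame(picturelist)

# Drop the original column, then concatenate side by side on the row index.
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)
print(new_dataset)
```

Note that concat takes a list of frames and needs axis=1 to join column-wise; it also aligns on the index, so reset_index first if the frames' indexes differ.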
I want to create a dictionary from a dataframe in Python.
In this dataframe, one column contains all the keys and another column contains multiple values for each key.
DATAKEY DATAKEYVALUE
name mayank,deepak,naveen,rajni
empid 1,2,3,4
city delhi,mumbai,pune,noida
I tried this code to first convert it into a simple data frame, but the values do not separate row-wise:
columnnames=finaldata['DATAKEY']
collist=list(columnnames)
dfObj = pd.DataFrame(columns=collist)
collen=len(finaldata['DATAKEY'])
for i in range(collen):
    colname = collist[i]
    keyvalue = finaldata.DATAKEYVALUE[i]
    valuelist2 = keyvalue.split(",")
    dfObj = dfObj.append({colname: valuelist2}, ignore_index=True)
You should modify your question's title; it is misleading because pandas dataframes are "kind of" dictionaries themselves, which is why the first comment you got referred to pandas' built-in .to_dict() method.
What you want to do is actually iterate over your pandas dataframe row-wise, and for each row generate a dictionary key from the first column and a dictionary value (a list) from the second column.
For that you will have to use:
an empty dictionary: dict()
the method for iterating over dataframe rows: dataframe.iterrows()
a method to split a single string of values on a separator, i.e. the str.split() method you suggested.
With all these tools all you have to do is:
output = dict()
for index, row in finaldata.iterrows():
    output[row['DATAKEY']] = row['DATAKEYVALUE'].split(',')
Note that this generates a dictionary whose values are lists of strings, and it will not work if the contents of the 'DATAKEYVALUE' column are not single strings.
Also note that this may not be the most efficient solution if you have a very large dataframe.
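Putting it together with the sample data from the question (the DATAKEYVALUE cells here are the comma-separated strings shown in the table above):

```python
import pandas as pd

finaldata = pd.DataFrame({
    'DATAKEY': ['name', 'empid', 'city'],
    'DATAKEYVALUE': ['mayank,deepak,naveen,rajni',
                     '1,2,3,4',
                     'delhi,mumbai,pune,noida'],
})

output = dict()
for index, row in finaldata.iterrows():
    # Each key maps to its comma-separated values split into a list.
    output[row['DATAKEY']] = row['DATAKEYVALUE'].split(',')

print(output)
# e.g. output['name'] -> ['mayank', 'deepak', 'naveen', 'rajni']
```

The same result can be had without an explicit loop via dict(zip(finaldata['DATAKEY'], finaldata['DATAKEYVALUE'].str.split(','))).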
I have turned a list of dicts into a dataframe using this:
import pandas as pd
d = pd.read_csv('/path/to/file.csv')
res_df = pd.DataFrame(d)
res_df.head()
Pandas did not interpret the format successfully, I'm guessing because there were no quotes around the keys of the dicts. It looked like this:
[{location:'playroom',state:'NY',zip:10011},{..}]
As a workaround, I stripped out "'", "{}", and "[]" to make the file standard CSV. However, when I pass the names argument to pd.read_csv, I have two issues: (1) the named columns are blank, and (2) I end up with a dataframe that is one row with thousands of columns. res_df.transpose() did not work.
If my csv has no header row, and assuming it has the same number of fields for each record, why is it that I can't give pandas my column names, and create new dataframe rows based on these arguments/instructions?
What is the quicker/better way to do this?
*Update: here is a snippet of the csv file:
websitedotcomcom/,Jim,jim#testdotcom,777-444-5555,Long City, NY,1,http://document-url,,another_field,,,true,12 Feb 2015 (18:17),http://website.com/,Jim,jim#test.com,777-444-5555,Long City, NY,1,http://document-url,,another_field,,,true,12 Feb 2015 (18:17)
This looks like JSON rather than CSV. You should use pandas read_json method.
df = pd.read_json('/path/to/file.json')
Note that it is sensitive to valid JSON, so you may have to do some string manipulation (e.g. replacing ' with ").
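Since the keys in the snippet above are unquoted, replacing quote characters alone won't be enough. A rough sketch of the kind of cleanup needed, assuming the only problems are bare keys and single-quoted values (the sample string is a minimal stand-in for the real file):

```python
import json
import re

import pandas as pd

# Sample string in the question's format: bare keys, single-quoted values.
raw = "[{location:'playroom',state:'NY',zip:10011}]"

# Quote bare keys: a word immediately followed by a colon, after { or ,
fixed = re.sub(r'([{,])\s*(\w+)\s*:', r'\1"\2":', raw)
# ...then switch single-quoted values to double quotes.
fixed = fixed.replace("'", '"')

records = json.loads(fixed)
df = pd.DataFrame(records)
print(df)
```

This is fragile (it would mangle values that themselves contain colons or apostrophes), so check a few parsed records against the raw file before trusting it on the full dataset.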