Passing 'records' parameter to Pandas to_dict method - python

I have a dataframe that I'd like to convert to a dictionary. I want each column in the dataframe to be mapped to the values in it.
import numpy as np
import pandas as pd
data = np.random.randn(5,5)
df = pd.DataFrame(data)
dict1 = df.to_dict(outtype = 'records')
Here is the error I get:
ValueError: outtype records not understood
However, when I check the docs for this method, records is listed as a valid parameter. Moreover, I am able to convert my dataframe to all the other dictionary formats mentioned in docs (list, series, and dict). I'm not sure what could be going wrong.

Try this:
dict1 = df.to_dict('records')
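For reference, `'records'` produces one dict per row. The error suggests an older pandas in which the since-removed `outtype` keyword did not yet understand `'records'`; in current pandas the keyword is `orient`, and passing the value positionally works across versions. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# 'records' yields a list with one {column: value} dict per row
records = df.to_dict(orient="records")
print(records)  # → [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
```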

Related

Unable to update Pandas dataframe element with dictionary

I have a Pandas dataframe with just 2 columns: the first being a name, and the second being a dictionary of information relevant to the name. Adding new rows works fine, but if I try to update the dictionary column by assigning a new dictionary in place, I get
ValueError: Incompatible indexer with Series
So, to be exact, this is what I was doing to produce the error:
import pandas as pd
df = pd.DataFrame(data=[['a', {'b':1}]], columns=['name', 'attributes'])
pos = df[df.loc[:,'name']=='a'].index[0]
df.loc[pos, 'attributes'] = {'c':2}
I was able to find another solution that seems to work:
import pandas as pd
df = pd.DataFrame(data=[['a', {'b':1}]], columns=['name', 'attributes'])
pos = df[df.loc[:,'name']=='a'].index[0]
df.loc[:,'attributes'].at[pos] = {'c':2}
but I was hoping to get an answer as to why the first method doesn't work, or if there was something wrong with how I had it initially.
Since you are indexing the row by position, you can reach it with chained indexing:
df.iloc[pos]['attributes'] = {'c':2}
Be aware, though, that this is chained assignment: it may modify a temporary copy rather than the original DataFrame and can trigger a SettingWithCopyWarning, so the DataFrame.at approach in the next answer is more reliable.
DataFrame.at works for me:
df.at[pos, 'attributes'] = {'c':2}
print (df)
  name attributes
0    a   {'c': 2}
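To see the difference in a minimal reproduction: assigning a dict through `.loc` makes pandas try to align the dict against the selection (hence the "Incompatible indexer" error), while `.at` targets exactly one cell and stores the object as-is:

```python
import pandas as pd

df = pd.DataFrame(data=[['a', {'b': 1}]], columns=['name', 'attributes'])
pos = df[df['name'] == 'a'].index[0]

# .at sets a single cell, so the dict is stored as one object
df.at[pos, 'attributes'] = {'c': 2}
print(df.at[pos, 'attributes'])  # → {'c': 2}
```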

Pandas - Duplicate rows while creating a dataframe from dictionary

I have the following code where I talk to a service to get data. The dictionary that is returned from this service looks like below
{'karnatakaCount': 44631, 'bangaloreCount': 20870, 'mangaloreCount': 985, 'kodaguCount': 743, 'udupiCount': 556, 'kasargodCount': 673, 'mangaloreVaccinations': 354095, 'karnatakaVaccinations': 9892349, 'keralaVaccinations': 7508437, 'mangaloreDeath': 4, 'indiaDailyConfirmed': 382602}
I want to store these values in a CSV file, for which I had created a script using Pandas and then saving it to a CSV file. While creating the file, I get multiple duplicate rows, which look as follows
The code looks like below
import pandas as pd
from getData import getData
from datetime import datetime
data = getData()
print(data)
dateToday=str(datetime.now().day)+","+str(datetime.now().strftime('%B'))+""+str(datetime.now().year)
df = pd.DataFrame(data, index=list(data.keys()))
df['Date'] = dateToday
# df.drop_duplicates(subset='Date', inplace=True)
df.to_csv('data1.csv',index=False)
To remove the duplicates, I have added the following code.
df.drop_duplicates(subset='Date', inplace=True)
How do I avoid this line and why is my data repeating in the DataFrame?
The problem is in df = pd.DataFrame(data, index=list(data.keys())).
pandas.DataFrame() accepts either a list of dictionaries or a dictionary whose values are list-like objects.
A list of dictionaries:
[{'karnatakaCount': 44631, 'bangaloreCount': 20870}, {'karnatakaCount': 44631, 'bangaloreCount': 20870}]
A dictionary whose values are list-like objects:
{'karnatakaCount': [44631], 'bangaloreCount': [20870]}
However, your data is a dictionary whose values are all scalars:
{'karnatakaCount': 44631, 'bangaloreCount': 20870}
Pandas cannot determine how many rows to create from that alone. If you do df = pd.DataFrame(data) in your example, pandas gives you an error:
ValueError: If using all scalar values, you must pass an index
The index supplies the row labels, and each scalar value is broadcast to every row. In your example you passed list(data.keys()) as the index, so the single row of data was repeated len(data) times. That's why you got so many duplicated rows.
You can use df = pd.DataFrame([data]) to create one row dataframe.
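A minimal sketch of the one-row fix (using a stand-in for the service response, since getData isn't shown here):

```python
import pandas as pd

# stand-in for the dict returned by the service
data = {'karnatakaCount': 44631, 'bangaloreCount': 20870, 'mangaloreCount': 985}

# wrapping the dict in a list tells pandas to build exactly one row
df = pd.DataFrame([data])
print(len(df))            # → 1
print(list(df.columns))   # → ['karnatakaCount', 'bangaloreCount', 'mangaloreCount']
```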
Just try the from_dict method with orient='index', which uses the keys as the index:
df = pd.DataFrame.from_dict(data, orient='index')
(With the default orient='columns', a dict of plain scalars raises the same all-scalar ValueError.) If you would rather have the keys as a regular column, follow up with df.reset_index().
Use the from_dict method after reformatting your data as follows:
data["Date"] = str(datetime.now().day)+","+str(datetime.now().strftime('%B'))+""+str(datetime.now().year)
data = {k:[v] for k,v in data.items()}
df = pd.DataFrame.from_dict(data)

Python set to array and dataframe

Interpretation by a friendly editor:
I have data in the form of a set.
import numpy as n , pandas as p
s={12,34,78,100}
print(n.array(s))
print(p.DataFrame(s))
The np.array call above runs without error (although NumPy just wraps the set in a 0-dimensional object array rather than converting its elements).
But when I try to create a DataFrame from it I get the following error:
ValueError: DataFrame constructor not properly called!
So is there any way to convert a python set/nested set into a numpy array/dictionary so I can create a DataFrame from it?
Original Question:
I have a data in form of set .
Code
import numpy as n , pandas as p
s={12,34,78,100}
print(n.array(s))
print(p.DataFrame(s))
The above code returns same set for numpyarray and DataFrame constructor not called at o/p . So is there any way to convert python set , nested set into numpy array and dictionary ??
Pandas can't deal with sets (dicts are fine; for those you can use p.DataFrame.from_dict(s)).
What you need to do is to convert your set into a list and then convert to DataFrame:
import pandas as pd
s = {12,34,78,100}
s = list(s)
print(pd.DataFrame(s))
You can use list(s):
import pandas as p
s = {12,34,78,100}
df = p.DataFrame(list(s))
print(df)
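Note that since sets are unordered, the row order of the resulting DataFrame is arbitrary; if a deterministic order matters, sort while converting (a small sketch):

```python
import pandas as pd

s = {12, 34, 78, 100}

# sorted() returns an ordered list, giving reproducible row order
df = pd.DataFrame(sorted(s), columns=['value'])
print(df['value'].tolist())  # → [12, 34, 78, 100]
```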
Why convert to a list first? Although sets are iterable, the DataFrame() constructor rejects them because they are unordered, which is exactly the error in the question; any ordered iterable such as a list or tuple works:
dataFrame = pandas.DataFrame(sorted(yourSet))
This will create a column header "0", which you can rename like so:
dataFrame.columns = ['columnName']
import numpy as n, pandas as p
s = {12, 34, 78, 100}
# Convert the set to a list before creating the DataFrame
df = p.DataFrame(list(s))
# Can also build a list of key-value pairs (dictionaries) and then create the DataFrame;
# this is useful because the key becomes the column header and the values become the data
df1 = p.DataFrame([{'Values': data} for data in s])

Using Pandas to parse a JSON column w/nested values in a huge CSV

I have a huge CSV file (3.5GB and getting bigger everyday) which has normal values and one column called 'Metadata' with nested JSON values. My script is as below and the intention is simply to convert the JSON column into normal columns for each of its key-value pairs. I am using Python3 (Anaconda; Windows).
import pandas as pd
import numpy as np
import csv
import json
import datetime as dt
from pandas.io.json import json_normalize

for df in pd.read_csv("source.csv", engine='c',
                      dayfirst=True,
                      encoding='utf-8',
                      header=0,
                      nrows=10,
                      chunksize=2,
                      converters={'Metadata': json.loads}):
    ## parsing code comes here
    with open("output.csv", 'a', encoding='utf-8') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')
And the column has JSON in the following format:
{
    "content_id": "xxxx",
    "parental": "F",
    "my_custom_data": {
        "GroupId": "NA",
        "group": null,
        "userGuid": "xxxxxxxxxxxxxx",
        "deviceGuid": "xxxxxxxxxxxxx",
        "connType": "WIFI",
        "channelName": "VOD",
        "assetId": "xxxxxxxxxxxxx",
        "GroupName": "NA",
        "playType": "VOD",
        "appVersion": "2.1.0",
        "userEnvironmentContext": "",
        "vodEncode": "H.264",
        "language": "English"
    }
}
The desired output is to have all the above key-value pairs as columns. The dataframe will have other non-JSON columns to which I need to add the columns parsed from the above JSON. I tried json_normalize but I am not sure how to apply json_normalize to a Series object and then convert it (or explode it) into multiple columns.
Just use json_normalize() on the series directly, and then use pandas.concat() along the column axis to merge the new dataframe with the existing dataframe:
pd.concat([df, json_normalize(df['Metadata'])], axis=1)
You can add a .drop('Metadata', axis=1) if you no longer need the old column with the JSON datastructure in it.
The columns produced for the my_custom_data nested dictionary will have my_custom_data. prefixed. If all the names in that nested dictionary are unique, you could drop that prefix with a DataFrame.rename() operation:
json_normalize(df['Metadata']).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)
If you are using some other means to convert each dictionary value to a flattened structure (say, with flatten_json), then you want to use Series.apply() to process each value and return each resulting dictionary as a pandas.Series() object:
def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    return pd.Series(result)
You can then concatenate the result of the Series.apply() call (which will be a dataframe) back onto your original dataframe, again along the column axis:
pd.concat([df, df['Metadata'].apply(some_conversion_function)], axis=1)
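A self-contained sketch of the json_normalize route (in pandas 1.0+ the function is available as pandas.json_normalize rather than via pandas.io.json; the toy column names below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'Metadata': [
        {'content_id': 'a', 'my_custom_data': {'connType': 'WIFI'}},
        {'content_id': 'b', 'my_custom_data': {'connType': 'LAN'}},
    ],
})

# flatten each dict into columns; nested keys get dot-prefixed names
flat = pd.json_normalize(df['Metadata'].tolist())

# strip the 'my_custom_data.' prefix (15 characters) from the nested keys
flat = flat.rename(columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)

# attach the new columns and drop the raw JSON column
out = pd.concat([df.drop('Metadata', axis=1), flat], axis=1)
print(list(out.columns))  # → ['id', 'content_id', 'connType']
```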

Python 3.4, error when creating a DataFrame with panda

I'm attempting to create a DataFrame with the following:
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import sys
# The inital set of baby names and birth rates
names =['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
#Now we wil zip them together
BabyDataSet = zip(names,births)
##we have to add the 'list' for version 3.x
print (list(BabyDataSet))
#create the DataFrame
df = DataFrame(BabyDataSet, columns = ['Names', 'Births'] )
print (df)
when I run the program I get the following error: 'data type can't be an iterator'
I read the following, 'What does the "yield" keyword do in Python?', but I do not understand how that applies to what I'm doing. Any help and further understanding would be greatly appreciated.
In python 3, zip returns an iterator, not a list like it does in python 2. Just convert it to a list as you construct the DataFrame, like this.
df = DataFrame(list(BabyDataSet), columns=['Names', 'Births'])
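One extra pitfall in the posted code: a zip object is a single-use iterator in Python 3, so the earlier print(list(BabyDataSet)) already consumes it, and the DataFrame built afterwards would come out empty. Materialize the list once and reuse it:

```python
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

pairs = zip(names, births)
print(list(pairs))   # consumes the iterator
print(list(pairs))   # → [] (already exhausted)

# materialize once, then reuse freely for printing and DataFrame creation
baby_data = list(zip(names, births))
print(baby_data[0])  # → ('Bob', 968)
```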
You can also create the dataframe using an alternate syntax that avoids the zip/generator issue entirely.
df = DataFrame({'Names': names, 'Births': births})
Read the documentation on initializing dataframes. Pandas simply takes the dictionary, creates one column for each entry with the key as the name and the value as the value.
Dict can contain Series, arrays, constants, or list-like objects
