Pandas - Duplicate rows while creating a dataframe from dictionary


I have the following code where I talk to a service to get data. The dictionary that is returned from this service looks like below
{'karnatakaCount': 44631, 'bangaloreCount': 20870, 'mangaloreCount': 985, 'kodaguCount': 743, 'udupiCount': 556, 'kasargodCount': 673, 'mangaloreVaccinations': 354095, 'karnatakaVaccinations': 9892349, 'keralaVaccinations': 7508437, 'mangaloreDeath': 4, 'indiaDailyConfirmed': 382602}
I want to store these values in a CSV file, so I wrote a script that builds a DataFrame with Pandas and saves it as CSV. The resulting file contains the same row repeated many times.
The code looks like this:
import pandas as pd
from getData import getData
from datetime import datetime
data = getData()
print(data)
dateToday=str(datetime.now().day)+","+str(datetime.now().strftime('%B'))+""+str(datetime.now().year)
df = pd.DataFrame(data, index=list(data.keys()))
df['Date'] = dateToday
# df.drop_duplicates(subset='Date', inplace=True)
df.to_csv('data1.csv',index=False)
To remove the duplicates, I have added the following code.
df.drop_duplicates(subset='Date', inplace=True)
How do I avoid this line and why is my data repeating in the DataFrame?

The problem is in df = pd.DataFrame(data, index=list(data.keys())).
pandas.DataFrame() accepts a list of dictionaries or a dictionary whose values are list-like objects.
A list of dictionaries:
[{'karnatakaCount': 44631, 'bangaloreCount': 20870}, {'karnatakaCount': 44631, 'bangaloreCount': 20870}]
A dictionary whose values are list-like objects:
{'karnatakaCount': [44631], 'bangaloreCount': [20870]}
However, your data is a dictionary whose values are all scalars.
{'karnatakaCount': 44631, 'bangaloreCount': 20870}
Pandas cannot determine how many rows to create. If you do df = pd.DataFrame(data) in your example, pandas raises an error:
ValueError: If using all scalar values, you must pass an index
The index determines how many rows are created, and each scalar is repeated in every row. In your example the index is list(data.keys()), so the data is repeated len(data) times, once per key. That's why you got so many duplicated rows.
You can use df = pd.DataFrame([data]) to create one row dataframe.
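To make the difference concrete, here is a minimal sketch using a shortened version of the dictionary from the question (only three of the keys are kept, purely for brevity). It compares the frame built with an index of keys against the frame built from a one-element list.

```python
import pandas as pd

# Shortened copy of the service response from the question.
data = {'karnatakaCount': 44631, 'bangaloreCount': 20870, 'mangaloreCount': 985}

# An index of length 3 makes pandas broadcast every scalar to 3 rows.
repeated = pd.DataFrame(data, index=list(data.keys()))

# Wrapping the dict in a list makes it a single record, hence one row.
single = pd.DataFrame([data])

print(repeated.shape)  # one row per key
print(single.shape)    # exactly one row
```

The first frame has as many rows as the dictionary has keys, while the second has a single row with the default index 0.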

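Try the from_dict method. Because every value in data is a scalar, the default orientation raises the same ValueError, so pass orient='index' to get one row per key:
df = pd.DataFrame.from_dict(data, orient='index')
The keys then already form the index. If you load the keys into a regular column instead, assign that column as the index (set_index returns a new DataFrame):
df = df.set_index('column_name')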

Use the from_dict method after reformatting your data as follows:
data["Date"] = str(datetime.now().day)+","+str(datetime.now().strftime('%B'))+""+str(datetime.now().year)
data = {k:[v] for k,v in data.items()}
df = pd.DataFrame.from_dict(data)
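The same idea can be sketched end to end. This version reads the clock once instead of three times and writes the one-row frame to an in-memory buffer. The fixed timestamp and the io.StringIO buffer are assumptions made so the sketch is reproducible; real code would use datetime.now() and the 'data1.csv' path from the question.

```python
import io
from datetime import datetime

import pandas as pd

data = {'karnatakaCount': 44631, 'bangaloreCount': 20870}

# Assumption: a fixed timestamp for reproducibility; use datetime.now() in practice.
now = datetime(2021, 5, 5)

# Same "day,MonthYear" layout as the question, built from a single timestamp.
row = dict(data, Date=f"{now.day},{now:%B}{now.year}")

# Wrap each scalar in a one-element list so pandas builds exactly one row.
df = pd.DataFrame.from_dict({k: [v] for k, v in row.items()})

buffer = io.StringIO()  # stands in for 'data1.csv'
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The output has one header line and one data line, so no drop_duplicates call is needed afterwards.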

Related

Unable to update Pandas dataframe element with dictionary

I have a Pandas dataframe with just 2 columns: the first is a name, and the second is a dictionary of information relevant to the name. Adding new rows works fine, but if I try to update the dictionary column by assigning a new dictionary in place, I get
ValueError: Incompatible indexer with Series
So, to be exact, this is what I was doing to produce the error:
import pandas as pd
df = pd.DataFrame(data=[['a', {'b':1}]], columns=['name', 'attributes'])
pos = df[df.loc[:,'name']=='a'].index[0]
df.loc[pos, 'attributes'] = {'c':2}
I was able to find another solution that seems to work:
import pandas as pd
df = pd.DataFrame(data=[['a', {'b':1}]], columns=['name', 'attributes'])
pos = df[df.loc[:,'name']=='a'].index[0]
df.loc[:,'attributes'].at[pos] = {'c':2}
but I was hoping to get an answer as to why the first method doesn't work, or if there was something wrong with how I had it initially.

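Since pos comes from the default integer index, it is also the position of the row, so you can reach the row with iloc:
df.iloc[pos]['attributes'] = {'c':2}
Be aware that this is chained assignment: df.iloc[pos] may return a copy of the row, so the update is not guaranteed to reach df, and with Copy-on-Write enabled it never does. Setting the single cell in one indexing step is more reliable.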

DataFrame.at works for me:
df.at[pos, 'attributes'] = {'c':2}
print (df)
  name attributes
0    a   {'c': 2}
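As for why the .loc version fails, a likely explanation (an assumption based on the error message, not something the thread confirms) is that .loc treats a dict as a Series-like value to be aligned with the selection, whereas .at addresses exactly one cell and stores the object unchanged. A minimal runnable sketch of the .at approach:

```python
import pandas as pd

df = pd.DataFrame(data=[['a', {'b': 1}]], columns=['name', 'attributes'])

# Label of the first row whose name is 'a'.
pos = df[df['name'] == 'a'].index[0]

# .at targets a single cell, so the dict is stored as one object.
df.at[pos, 'attributes'] = {'c': 2}

print(df)
```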

How do I extract the date from a column in a csv file using pandas?

This is the 'aired' column in the csv file:
Link to the csv file:
https://drive.google.com/file/d/1w7kIJ5O6XIStiimowC5TLsOCUEJxuy6x/view?usp=sharing
I want to extract the date and the month (in words) from the date that follows the word 'from', and store them in separate columns in another csv file. The 'from' gets in the way: if the cell held only the date, it could be parsed directly as a timestamp.

You are starting from a string and want to break out the data within it. The single quotes are a clue that this is a dict structure in string form. The Python standard library includes the ast (Abstract Syntax Trees) module, whose literal_eval function can read such a string into a dict, as described in this SO answer: Convert a String representation of a Dictionary to a dictionary?
Apply that to your column to get the dicts, then expand them into separate columns using .apply(pd.Series), based on this SO answer: Splitting dictionary/list inside a Pandas Column into Separate Columns
Try the following
import pandas as pd
import ast
df = pd.read_csv('AnimeList.csv')
# turn the pd.Series of strings into a pd.Series of dicts
aired_dict = df['aired'].apply(ast.literal_eval)
# turn the pd.Series of dicts into a pd.Series of pd.Series objects
aired_df = aired_dict.apply(pd.Series)
# pandas automatically translates that into a pd.DataFrame
# concatenate the remainder of the dataframe with the new data
df_aired = pd.concat([df.drop(['aired'], axis=1), aired_df], axis=1)
# convert the date strings to datetime values
df_aired['aired_from'] = pd.to_datetime(df_aired['from'])
df_aired['aired_to'] = pd.to_datetime(df_aired['to'])
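Here is a small self-contained sketch of the same approach that also produces the month name the question asks for. The two sample rows and the 'title' column are hypothetical stand-ins, since the real file is not reproduced here; the actual 'aired' values may be laid out differently.

```python
import ast

import pandas as pd

# Hypothetical sample rows; the real 'aired' strings may differ.
df = pd.DataFrame({
    'title': ['Show A', 'Show B'],
    'aired': ["{'from': '2012-01-07', 'to': '2012-03-24'}",
              "{'from': '2007-10-04', 'to': None}"],
})

# String -> dict -> one column per key ('from', 'to').
aired = df['aired'].apply(ast.literal_eval).apply(pd.Series)
aired_from = pd.to_datetime(aired['from'])

out = pd.DataFrame({
    'title': df['title'],
    'date': aired_from.dt.date,
    'month': aired_from.dt.month_name(),
})
print(out)
```

Writing the result to another file is then a single out.to_csv(...) call.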

import pandas as pd
file = pd.read_csv('file.csv')
result = []
for cell in file['aired']:
    date = cell[8:22]
    date_ts = pd.to_datetime(date, format='%Y-%m-%d')
    result.append((date_ts.month_name(), date_ts))
df = pd.DataFrame(result, columns=['month', 'date'])
df.to_csv('result_file.csv')

Using Pandas to parse a JSON column w/nested values in a huge CSV

I have a huge CSV file (3.5GB and getting bigger every day) which has normal values and one column called 'Metadata' with nested JSON values. My script is below; the intention is simply to convert the JSON column into normal columns, one for each of its key-value pairs. I am using Python3 (Anaconda; Windows).
import json
import pandas as pd
import numpy as np
import csv
import datetime as dt
from pandas.io.json import json_normalize
for df in pd.read_csv("source.csv", engine='c',
                      dayfirst=True,
                      encoding='utf-8',
                      header=0,
                      nrows=10,
                      chunksize=2,
                      converters={'Metadata':json.loads}):
    ## parsing code comes here
    with open("output.csv", 'a', encoding='utf-8') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')
And the column has JSON in the following format:
{
    "content_id":"xxxx",
    "parental":"F",
    "my_custom_data":{
        "GroupId":"NA",
        "group":null,
        "userGuid":"xxxxxxxxxxxxxx",
        "deviceGuid":"xxxxxxxxxxxxx",
        "connType":"WIFI",
        "channelName":"VOD",
        "assetId":"xxxxxxxxxxxxx",
        "GroupName":"NA",
        "playType":"VOD",
        "appVersion":"2.1.0",
        "userEnvironmentContext":"",
        "vodEncode":"H.264",
        "language":"English"
    }
}
The desired output is to have all the above key-value pairs as columns. The dataframe will have other non-JSON columns to which I need to add the columns parsed from the above JSON. I tried json_normalize but I am not sure how to apply json_normalize to a Series object and then convert it (or explode it) into multiple columns.

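Just use json_normalize() on the series directly, and then use pandas.concat() with axis=1 to add the new columns to the existing dataframe (without axis=1 the frames are stacked as rows):
pd.concat([df, json_normalize(df['Metadata'])], axis=1)
json_normalize() returns a frame with a fresh 0..n-1 index, so when reading in chunks, assign df.index to it before concatenating. You can add a .drop('Metadata', axis=1) if you no longer need the old column with the JSON data structure in it.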
The columns produced for the my_custom_data nested dictionary will have my_custom_data. prefixed. If all the names in that nested dictionary are unique, you could drop that prefix with a DataFrame.rename() operation:
json_normalize(df['Metadata']).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)
If you are using some other means to convert each dictionary value to a flattened structure (say, with flatten_json), then you want to use Series.apply() to process each value and return each resulting dictionary as a pandas.Series() object:
def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    return pd.Series(result)
You can then concatenate the result of the Series.apply() call (which will be a dataframe) back onto your original dataframe, column-wise:
pd.concat([df, df['Metadata'].apply(some_conversion_function)], axis=1)
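The pieces above can be combined into a rough sketch for a single chunk. The two-row chunk, its 'id' column and the index values 2 and 3 are invented for illustration; they mimic a later chunk from read_csv(..., chunksize=2), whose row labels do not start at 0. The sketch uses pd.json_normalize, the top-level name available in pandas 1.0 and later.

```python
import json

import pandas as pd

# Hypothetical chunk: row labels 2 and 3, as a second chunk of size 2 would have.
chunk = pd.DataFrame(
    {'id': [10, 11],
     'Metadata': ['{"content_id": "x1", "my_custom_data": {"connType": "WIFI"}}',
                  '{"content_id": "x2", "my_custom_data": {"connType": "LTE"}}']},
    index=[2, 3],
)

# Parse the JSON strings, then flatten the nested dicts into columns.
parsed = chunk['Metadata'].apply(json.loads)
flat = pd.json_normalize(parsed.tolist())

# json_normalize numbers rows from 0, so restore the chunk's labels before concat.
flat.index = chunk.index

# Strip the 'my_custom_data.' prefix from the nested keys.
prefix = 'my_custom_data.'
flat = flat.rename(columns=lambda n: n[len(prefix):] if n.startswith(prefix) else n)

result = pd.concat([chunk.drop(columns='Metadata'), flat], axis=1)
print(result)
```

Without the index assignment, concat would align on mismatched labels and pad the result with NaN rows.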

Python 3.4, error when creating a DataFrame with panda

I'm attempting to create a DataFrame with the following:
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import sys
# The initial set of baby names and birth rates
names =['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
#Now we will zip them together
BabyDataSet = zip(names,births)
##we have to add the 'list' for version 3.x
print (list(BabyDataSet))
#create the DataFrame
df = DataFrame(BabyDataSet, columns = ['Names', 'Births'] )
print (df)
When I run the program I get the following error: 'data type can't be an iterator'
I read the following, 'What does the "yield" keyword do in Python?', but I do not understand how that applies to what I'm doing. Any help and further understanding would be greatly appreciated.

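In Python 3, zip returns an iterator, not a list like it does in Python 2. Just convert it to a list as you construct the DataFrame, like this.
df = DataFrame(list(BabyDataSet), columns = ['Names', 'Births'] )
Note that an iterator can be consumed only once: if print (list(BabyDataSet)) has already run, the second list() call is empty. To use the pairs more than once, store the list itself, e.g. BabyDataSet = list(zip(names,births)).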

You can also create the dataframe using an alternate syntax that avoids the zip/generator issue entirely.
df = DataFrame({'Names': names, 'Births': births})
Read the documentation on initializing dataframes. Pandas simply takes the dictionary, creates one column for each entry with the key as the name and the value as the value.
Dict can contain Series, arrays, constants, or list-like objects
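A short sketch of both points, using the names and births lists from the question: it shows that a zip iterator is empty after its first use, and that the list-based and dict-based constructions give the same frame.

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

pairs = zip(names, births)
first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing is left on the second pass

# Build the frame from the stored list of tuples...
df_from_list = pd.DataFrame(first_pass, columns=['Names', 'Births'])

# ...or skip zip entirely and pass one list per column.
df_from_dict = pd.DataFrame({'Names': names, 'Births': births})

print(second_pass)
print(df_from_list.equals(df_from_dict))
```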

Passing 'records' parameter to Pandas to_dict method

I have a dataframe that I'd like to convert to a dictionary. I want each column in the dataframe to be mapped to the values in it.
import numpy as np
import pandas as pd
data = np.random.randn(5,5)
df = pd.DataFrame(data)
dict1 = df.to_dict(outtype = 'records')
Here is the error I get:
Value Error: outtype records not understood
However, when I check the docs for this method, records is listed as a valid parameter. Moreover, I am able to convert my dataframe to all the other dictionary formats mentioned in docs (list, series, and dict). I'm not sure what could be going wrong.

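Try this, passing the value positionally. In current pandas versions the keyword is named orient rather than outtype, so df.to_dict(orient='records') also works:
dict1 = df.to_dict('records')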
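For comparison, here is a minimal sketch of two orientations on a small made-up frame. The question asks for each column mapped to its values, which reads more like orient='list' than 'records', though that is an interpretation rather than something the thread settles.

```python
import pandas as pd

# Small made-up frame for illustration.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# One dict per row.
records = df.to_dict(orient='records')

# One list of values per column.
columns = df.to_dict(orient='list')

print(records)
print(columns)
```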
