Finding duplicate values in a DataFrame inside a Python dictionary - python

I want to find the duplicate values inside a certain column of DataFrames stored in a dictionary. The dictionary keys are the file names and the values are DataFrames with the columns ID and Group. I want to read all the IDs and see if there are any duplicates.
import pandas as pd
import os

folderpath = r"path"
filenames = dict()
for file in os.listdir(folderpath):
    if file.endswith(".csv"):
        filenames[file] = pd.read_csv(os.path.join(folderpath, file))

splitfiles = {k: v for k, v in filenames.items() if 'split.csv' in k}
I am reading csv files from a folder, storing the file name as the dictionary key and the parsed csv as the value, so the dictionary values have the DataFrame data type. I only want to keep the file names that contain "split.csv"; that's what the splitfiles variable does.
splitfiles dictionary looks something like this:
splitfiles = {
    'test1_split': {'ID': 123, 'Group': 'whatever'},
    'test2_split': {'ID': 123, 'Group': 'whatever'}
}
I want to be able to find when there are duplicate IDs between the files. I have tried something like this, but it only works for tuples in a dictionary, not a DataFrame:
rev_dict = {}
for key, value in splitfiles.items():
    rev_dict.setdefault(value, set()).add(key)

result = [key for key, values in rev_dict.items()
          if len(values) > 1]
print("duplicate values", str(result))
TypeError: unhashable type: 'DataFrame'
My desired output is something like "duplicate values [2]"

Assuming the dictionary you're trying to get the duplicates from is like the example you provided, this will get you the number of duplicates:
data_list = []
for value in splitfiles.values():
    data_list.append(value["ID"])

data_set = set(data_list)
print("Duplicate values: ", len(data_list) - len(data_set))

Related

How to create a nested dictionary from a csv file

I want to create a nested dictionary from a csv file (see pic), where I want to keep the keys the same, e.g.
[{'name': 'john', 'sname': 'doe', 'address': '120 Jefferson st'},
 {'name': 'jack', 'sname': 'McGinnis', 'address': '202 hobo'}]
all the row data in one dictionary with keys as their column names. I'm stuck here.
I understand you want a list of dictionaries.
import pandas as pd

data = pd.read_csv("dataset.csv")
dict_list = []
for i in range(len(data)):
    row_dict = {}  # avoid shadowing the built-in dict
    for col in data.columns:
        row_dict[col] = data[col].iloc[i]
    dict_list.append(row_dict)
print(dict_list)
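As a side note, pandas can build the same list of row dictionaries in a single call; a sketch, assuming the same dataset.csv:
import pandas as pd

data = pd.read_csv("dataset.csv")
dict_list = data.to_dict(orient="records")  # one dict per row, keyed by column name
print(dict_list)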

Convert list of strings in a column into separate columns - Python

I have a column named (events) that comes from a csv file, which I have loaded into a dataframe.
This column contains the events of each soccer match:
here is an example:
sample of data
I need each key to be a column and the rows to be its values, so it looks like:
event_team    event_time    event_type    ....
home          34            Yellow card
away          14            Goal
....
this is a sample file: Sample from data
How can I do it, please?
Pandas supports building a DataFrame straight from a list of dicts like this.
list1 = [dict1, dict2, dict3]
df = pd.DataFrame(list1)
Using that you can later select a column using:
column = df["column_name"]
If you want a non-pandas way you can do this:
list1 = [dict1, dict2, dict3]
columns = {}
# Initializing the keywords
for d in list1:
    for k in d:
        if k not in columns:
            columns[k] = []
for d in list1:
    for k in columns:
        if k in d:
            columns[k].append(d[k])
        else:
            # because you want all columns to have the same length
            columns[k].append(None)
print(columns)
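If you do go the manual route, the resulting dict of equal-length lists can be handed straight to pandas afterwards; a small sketch under the same assumptions:
import pandas as pd

# columns maps each key to a list with one entry per input dict
df = pd.DataFrame(columns)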
EDIT: This script unpacks the "events_list" column into a new dataframe with the layout the OP described.
import pandas as pd
import ast

df = pd.read_csv("Sampleofuefa.csv")
l = []
for d in df["events_list"]:
    # the values in the column are strings, so they have to be interpreted;
    # since ast.literal_eval returns a list of dicts, we extend the
    # accumulator list with that list of dicts: l = l1 + l2
    l.extend(ast.literal_eval(d))
event_df = pd.DataFrame(l)
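A hedged variant of the same script, in case you also want to know which match each event came from (match_row is a hypothetical helper column, not in the OP's data):
import ast
import pandas as pd

df = pd.read_csv("Sampleofuefa.csv")
l = []
for i, d in enumerate(df["events_list"]):
    for event in ast.literal_eval(d):
        event["match_row"] = i  # row of the match this event belongs to
        l.append(event)
event_df = pd.DataFrame(l)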

Converting a string representation of dicts to an actual dict

I have a CSV file with 100K+ lines of data in this format:
"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}"
"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"
The quotes are there before the curly braces because my data came in a CSV file.
I want to extract the key value pairs in all the lines to create a dataframe like so:
Column Headers: foo, foo1, foo3, foo...
Rows: bar, bar1, bar3, bar...
I've tried implementing something similar to what's explained here ( Python: error parsing strings from text file with Ast module).
I've gotten the ast.literal_eval function to work on my file to convert the contents into a dict but now how do I get the DataFrame function to work? I am very much a beginner so any help would be appreciated.
import pandas as pd
import ast
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        print(parsed)

pd.DataFrame(???)
You can turn a dictionary into a pandas dataframe using pd.DataFrame.from_dict, but it will expect each value in the dictionary to be in a list.
for key, value in parsed.items():
    parsed[key] = [value]
df = pd.DataFrame.from_dict(parsed)
You can do this iteratively by appending to your dataframe.
df = pd.DataFrame()
for string in f:
    parsed = ast.literal_eval(string.rstrip())
    for key, value in parsed.items():
        parsed[key] = [value]
    df = df.append(pd.DataFrame.from_dict(parsed))  # append returns a new frame, so reassign
parsed is a dictionary; you make a dataframe from each one, then join all the frames together:
df = []
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        if not isinstance(parsed, dict):
            continue
        subDF = pd.DataFrame(parsed, index=[0])
        df.append(subDF)
df = pd.concat(df, ignore_index=True, sort=False)
Calling pd.concat on a list of dataframes is faster than calling DataFrame.append repeatedly. sort=False means that pd.concat will not sort the column names when it encounters a new one, like foo4 on the second row.
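A compact variant of the same approach, a sketch assuming every line of the file parses to a dict: pandas aligns the keys itself and fills missing ones with NaN.
import ast
import pandas as pd

with open('file_name.csv') as f:
    rows = [ast.literal_eval(line.rstrip()) for line in f]

df = pd.DataFrame(rows)  # keys become columns; absent keys become NaN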

Nested dictionary from dataframe with loop over list

complete newbie here.
I want to read data from two columns in several excel sheets in a nested dictionary. In the end, I'd like to have a dictionary looking like this:
{SheetName1:{Index1: Value1, Index2: Value2,...}, SheetName2:{Index1: Value1, Index2: Value2} ...}
So far I have read in the data using pandas and figured out how to combine the two columns I need into the inner dictionary {Index: Value}, which afterwards gets the name of the sheet assigned as a key to form the outer dictionary:
#read excel sheets into a dict of dataframes
df = ExcelWorkbook.parse(sheet_name=None, header=1, usecols=16, skiprows=6)
#read the different excel sheet names into a list
SHEETNAMES = ExcelWorkbook.sheet_names
#nested dictionary
for Sheet in SHEETNAMES:
    df[Sheet] = df[Sheet].loc[0:87, :]
    dic = dict(zip(df[Sheet].index, df[Sheet]['ColumnName']))
    dic = {Sheet: dic}
Now when I run this, it only returns the last sheet with its corresponding {Index: Value} pairs:
{'LastSheetName': {Key1: Value1, Key2: Value2, ...}}
Now it seems to me that I've done the "harder" part, but I can't seem to figure out how to fill a new dictionary with the dictionaries generated by this loop.
Any help is greatly appreciated!
Best regards,
Jan
You are assigning dic as a new variable each time you iterate through your for loop. Instead, instantiate dic as an empty list [] outside of the loop and then append the dictionaries you define inside the loop to it, such as:
#read excel sheet into dataframe
df = ExcelWorkbook.parse(sheet_name=None, header=1, usecols=16, skiprows=6)
#nested dictionary
dic = []
for Sheet in ExcelWorkbook.sheet_names:
    df[Sheet] = df[Sheet].iloc[0:87, :]
    out = {Sheet: dict(zip(df[Sheet].index, df[Sheet]['ColumnName']))}
    dic.update(out)
Also, you want to use .iloc in place of .loc, considering you are specifying index locations inside of the dataframe.
I just figured it out after tweaking @rahlf23's response a bit. So for anyone looking this up:
dic.append() does not work for dictionaries; instead I used dic.update():
#nested dictionary
dic1 = {}
for Sheet in SHEETNAMES:
    df[Sheet] = df[Sheet].iloc[0:87, :]
    out = dict(zip(df[Sheet].index, df[Sheet]['ColumnName']))
    out2 = {Sheet: out}
    dic1.update(out2)
Now one can access the values with:
print(dic1[SheetName][Index])
Thanks for your help @rahlf23, without your comment I'd still be trapped in the loop :)
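For what it's worth, the whole loop can be collapsed into one dict comprehension; a sketch assuming the same df, SHEETNAMES and 'ColumnName' as above:
dic1 = {Sheet: df[Sheet].iloc[0:87]['ColumnName'].to_dict()
        for Sheet in SHEETNAMES}
Series.to_dict() returns the same {Index: Value} mapping that dict(zip(...)) builds by hand.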

How to append a dictionary to a pandas dataframe?

I have a set of urls containing json files and an empty pandas dataframe with columns representing the attributes of the json files. Not all json files have all the attributes of the dataframe. What I need to do is create a dictionary out of each json file, then append it to the pandas dataframe as a new row; in case the json file doesn't have an attribute matching a column in the dataframe, that cell has to be left blank.
I managed to create dictionaries as:
import urllib2
import json
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=ULST:7BIS01CF"
data = urllib2.urlopen(url).read()
data = json.loads(data)
and then I tried to create a for loop as follows:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        for column in df.columns:
            if str(column) == str(key):
                df.loc[[str(column)], row] = data[str(key)]
            else:
                df.loc[[str(column)], row] = None
where df is the dataframe and links is the set of urls
However, I get the following error:
raise KeyError('%s not in index' % objarr[mask])
KeyError: "['2_seater_depth_mm'] not in index"
where ['2_seater_depth_mm'] is the first column of the pandas dataframe
For me the code below works:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        df.loc[row, key] = data[key]
You had the order of the arguments in .loc mixed up, and one pair of [] too many.
Assuming that df is empty and has the same columns as the url dictionary keys, i.e.
list(df)
#[u'alternate_product_code',
# u'availability',
# u'boz',
# ...
len(df)
#0
then you can use DataFrame.append:
for url in links:
    url_data = urllib2.urlopen(str(url)).read()
    url_dict = json.loads(url_data)
    a_dict = {k: pandas.Series([str(v)], index=[0]) for k, v in url_dict.iteritems()}
    new_df = pandas.DataFrame.from_dict(a_dict)
    df = df.append(new_df, ignore_index=True)  # append returns a new frame, so reassign
Not too sure why your code won't work, but consider the following few edits which should clean things up, should you still want to use it:
for row, url in enumerate(links):
    data = urllib2.urlopen(str(url)).read()
    data_dict = json.loads(data)
    for key, val in data_dict.items():
        if key in list(df):
            df.ix[row, key] = val
I used enumerate to iterate over the index and value of the links array; this way you don't need an index counter (row in your code). I also used the .items dictionary method so I can iterate over keys and values at once. I believe pandas will automatically handle the empty dataframe entries.
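Note that .ix has since been removed from pandas. A sketch of the same idea on current pandas and Python 3, assuming links holds URLs that return flat JSON objects and df defines the wanted columns:
import json
from urllib.request import urlopen

import pandas as pd

# Build one dict per URL and let pandas align the keys; attributes missing
# from a given JSON file come out as NaN.
records = [json.loads(urlopen(url).read()) for url in links]
df = pd.DataFrame(records).reindex(columns=df.columns)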
