Detecting duplicates in pandas when a column contains lists - python

Is there a reasonable way to detect duplicates in a Pandas dataframe when a column contains lists or numpy ndarrays, like the example below? I know I could convert the lists into strings, but the act of converting back and forth feels... wrong. Plus, lists seem more legible and convenient given how I got here (online code) and where I'm going afterward.
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)
# Traditional find duplicates
# df[df.duplicated(keep=False)]
# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(i.values())) for i in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]
Both methods (the latter from this alternative find duplicates answer) result in
TypeError: unhashable type: 'list'.
They would work, of course, if the dataframe looked like this:
df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "recipe": [
            "recipeC",
            "recipeC",
            "recipeD",
            "recipeE",
            "recipeF",
        ],
    }
)
Which made me wonder if something like integer encoding might be reasonable? It's not that different from converting to/from strings, but at least it's legible. Alternatively, suggestions for converting to a single string of ingredients per row directly from the starting dataframe in the code link above would be appreciated (i.e., avoiding lists altogether).
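For the last idea (one delimited string per row, avoiding lists altogether), here is a minimal sketch, assuming each list holds only strings; `Series.str.join` works element-wise on list-valued columns:

```python
import pandas as pd

# Hypothetical subset of the dataframe above
df = pd.DataFrame({
    "author": ["Jefe98", "Jefe98", "Alex"],
    "ingredients": [
        ["ingredA", "ingredB", "ingredC"],
        ["ingredA", "ingredB", "ingredC"],
        ["ingredA", "ingredB", "ingredD"],
    ],
})

# Join each row's list into one delimited, hashable string
df["ingredients"] = df["ingredients"].str.join("|")
print(df.duplicated(keep=False).tolist())  # first two rows are duplicates
```

The joined strings are hashable, so `df.duplicated` works on them directly; splitting back with `.str.split("|")` restores the lists if needed.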

With map and tuple:
out = df[df.assign(ingredients=df['ingredients'].map(tuple)).duplicated(keep=False)]
Out[295]:
   author        date                  ingredients
0  Jefe98  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]

Related

How to flatten nested json that has dict in a column after using json normalize?

This is the flattened version of the column. I still need the keys as column titles for the dataframe and the values as values for the corresponding column.
reaction
{ "veddra_term_code": "99026", "veddra_version": "3", "veddra_term_name": "Tablets, Abnormal" }
I want my data to look like this so I can add it to the dataframe.
veddra_term_code veddra_version veddra_term_name
99026 3 'Tablets, Abnormal'
Use f-strings. They're made for creating strings formatted the way you want:
d = { "veddra_term_code": "99026", "veddra_version": "3", "veddra_term_name": "Tablets, Abnormal" }
s = f'veddra_term_code veddra_version veddra_term_name {d["veddra_term_code"]} {d["veddra_version"]} \'{d["veddra_term_name"]}\''
print(s) # prints veddra_term_code veddra_version veddra_term_name 99026 3 'Tablets, Abnormal'
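If the goal is actual DataFrame columns rather than a display string, one hedged alternative is to expand the dict column with `pd.json_normalize` (a sketch, assuming the column holds plain dicts):

```python
import pandas as pd

# A one-row stand-in for the dataframe with a dict-valued "reaction" column
df = pd.DataFrame({"reaction": [
    {"veddra_term_code": "99026", "veddra_version": "3",
     "veddra_term_name": "Tablets, Abnormal"},
]})

# Expand each dict into its own columns, then replace the original column
flat = pd.json_normalize(df["reaction"].tolist())
out = df.drop(columns="reaction").join(flat)
print(out.columns.tolist())
# ['veddra_term_code', 'veddra_version', 'veddra_term_name']
```

The `join` relies on both frames sharing the default RangeIndex; with a custom index you would reset it first.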

Convert two CSV tables with one-to-many relation to JSON with embedded list of subdocuments

I have two CSV files which have one-to-many relation between them.
main.csv:
"main_id","name"
"1","foobar"
attributes.csv:
"id","main_id","name","value","updated_at"
"100","1","color","red","2020-10-10"
"101","1","shape","square","2020-10-10"
"102","1","size","small","2020-10-10"
I would like to convert this to JSON of this structure:
[
    {
        "main_id": "1",
        "name": "foobar",
        "attributes": [
            {
                "id": "100",
                "name": "color",
                "value": "red",
                "updated_at": "2020-10-10"
            },
            {
                "id": "101",
                "name": "shape",
                "value": "square",
                "updated_at": "2020-10-10"
            },
            {
                "id": "102",
                "name": "size",
                "value": "small",
                "updated_at": "2020-10-10"
            }
        ]
    }
]
I tried using Python and Pandas like:
import pandas
def transform_group(group):
    group.reset_index(inplace=True)
    group.drop('main_id', axis='columns', inplace=True)
    return group.to_dict(orient='records')

main = pandas.read_csv('main.csv')
attributes = pandas.read_csv('attributes.csv', index_col=0)
attributes = attributes.groupby('main_id').apply(transform_group)
attributes.name = "attributes"
main = main.merge(
    right=attributes,
    on='main_id',
    how='left',
    validate='m:1',
    copy=False,
)
main.to_json('out.json', orient='records', indent=2)
It works. But the issue is that it does not seem to scale. When running on my whole dataset I have, I can load individual CSV files without problems, but when trying to modify data structure before calling to_json, memory usage explodes.
So is there a more efficient way to do this transformation? Maybe there is some Pandas feature I am missing? Or is there some other library to use? Moreover, use of apply seems to be pretty slow here.
This is a tough problem and we have all felt your pain.
There are three ways I would attack this problem. First, groupby is slower if you allow pandas to do the break out.
import pandas as pd
import numpy as np
from collections import defaultdict
df = pd.DataFrame({'id': np.random.randint(0, 100, 5000),
                   'name': np.random.randint(0, 100, 5000)})
Now, if you do the standard groupby:
groups = []
for k, rows in df.groupby('id'):
groups.append(rows)
you will find that
groups = defaultdict(lambda: [])
for id, name in df.values:
    groups[id].append((id, name))
is about 3 times faster.
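As a rough sanity check of that claim, here is a self-contained timing sketch (absolute numbers and the exact speedup will vary with data shape and pandas version):

```python
import time
from collections import defaultdict

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': np.random.randint(0, 100, 5000),
                   'name': np.random.randint(0, 100, 5000)})

# Time the pandas groupby-based break out
t0 = time.perf_counter()
groups_pd = {k: rows for k, rows in df.groupby('id')}
t1 = time.perf_counter()

# Time the plain-dict break out over the raw values
groups_plain = defaultdict(list)
for id_, name in df.values:
    groups_plain[id_].append((id_, name))
t2 = time.perf_counter()

print(f"groupby loop: {t1 - t0:.4f}s, plain dict loop: {t2 - t1:.4f}s")
```

Both loops produce the same grouping; only the mechanism (and the per-group container type) differs.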
The second method: I would change it to use Dask and Dask's parallelization. For background, see the discussion of what Dask is and how it differs from pandas.
The third is algorithmic: load the main file, then go ID by ID, loading only the data for that ID, taking multiple bites at what is in memory versus on disk, and saving out partial results as they become available.
So in my case I was able to load the original tables into memory, but building the embedded structure exploded the size so that it no longer fit. I ended up still using Pandas to load the CSV files, but then I iteratively generate the output row by row, saving each row into a separate JSON document. This means I never hold one large data structure in memory for a single large JSON.
Another important realization was that it is important to make the related column an index, and that it has to be sorted, so that querying it is fast (because generally there are duplicate entries in the related column).
I made the following two helper functions:
def get_related_dict(related_table, label):
    assert related_table.index.is_unique
    if pandas.isna(label):
        return None
    row = related_table.loc[label]
    assert isinstance(row, pandas.Series), label
    result = row.to_dict()
    result[related_table.index.name] = label
    return result

def get_related_list(related_table, label):
    # Important to be more performant when selecting non-unique labels.
    assert related_table.index.is_monotonic_increasing
    try:
        # We use this syntax to always get a DataFrame and not a Series when only one row matches.
        return related_table.loc[[label], :].to_dict(orient='records')
    except KeyError:
        return []
And then I do:
import json
import sys

import pandas

main = pandas.read_csv('main.csv', index_col=0)
attributes = pandas.read_csv('attributes.csv', index_col=1)
# We sort the index to be more performant when selecting non-unique labels. We use a stable sort.
attributes.sort_index(inplace=True, kind='mergesort')
columns = [main.index.name] + list(main.columns)
for row in main.itertuples(index=True, name=None):
    assert len(columns) == len(row)
    data = dict(zip(columns, row))
    data['attributes'] = get_related_list(attributes, data['main_id'])
    json.dump(data, sys.stdout, indent=2)
    sys.stdout.write("\n")

Extract list from dict of lists then append to dataframe

I'm trying to extract a field from a json that contains a list then append that list to a dataframe, but I'm running in to a few different errors.
I think I can write it to a csv then read the csv with Pandas, but I'm trying to avoid writing any files. I know that I can also use StringIO to make a csv, but that has issues with null bytes. Replacing those would be (I think) another line-by-line step that will further extend the time the script takes to complete... I'm running this against a query that returns tens of thousands of results, so keeping it fast and simple is a priority.
First I tried this:
hit_json = json.loads(hit)
for ln in hit_json.get('hits').get('hits'):
    df = df.append(ln['_source'], ignore_index=True)
print(df)
This gives me a result that looks like this:
  1  2  3         4
  a  b  d,e,f...  x
Then I tried this:
df = df.append(ln['_source']['payload'], ignore_index=True)
But that gives me this error:
TypeError: cannot concatenate object of type "<class 'str'>"; only pd.Series,
pd.DataFrame, and pd.Panel (deprecated) objs are valid
What I'm looking for would be something like this:
0 1 2 3 4
d e f g h
On top of this... I need to figure out a way to handle a specific string in this list that contains a comma... which may be a headache that's best handled in a different question... something like:
# Obviously this is incorrect but I think you get the idea :)
str.replace(',', '^')
except if ',' followed by ' '
Greatly appreciate any help!
EDITING TO ADD JSON AS REQUESTED
{
    "_index": "sanitized",
    "_type": "sanitized",
    "_id": "sanitized",
    "_score": sanitized,
    "_source": {
        "sanitized": sanitized,
        "sanitized": "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,\"34,35\",36,37,38,39,40",
        "sanitized": "sanitized",
        "sanitized": ["sanitized"],
        "sanitized": "sanitized",
        "sanitized": "sanitized",
        "sanitized": "sanitized",
        "sanitized": "sanitized"
    }
}
You can maybe write a temporary file with StringIO, like it's done here.
Then for the second part you could do
if ',' in data and ', ' not in data:
    data = data.replace(',', '^')
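As an alternative to character replacement: the JSON in the question escapes the comma-containing value as `\"34,35\"`, so the standard `csv` module may split the payload correctly without any substitution. A minimal sketch with a shortened, hypothetical payload:

```python
import csv
import io

# Hypothetical payload string, quoted the same way as in the question
payload = '1,2,3,"34,35",36'

# csv.reader respects quoted fields, so "34,35" stays one value
row = next(csv.reader(io.StringIO(payload)))
print(row)
# ['1', '2', '3', '34,35', '36']
```

This sidesteps the "comma followed by a space" special-casing entirely, as long as the source consistently quotes such fields.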
You can try the following
hit_json = json.loads(hit)
for ln in hit_json.get('hits').get('hits'):
    data = ln['_source']["payload"].split(",")
    df.loc[len(df)] = pd.Series(data, index=range(len(data)))
print(df)
The benefit of `loc` is that you will not create a new dataframe each time, so it will be fast. You can find the relevant post here.
I would also like to suggest an alternative that can be faster. First create a dictionary with all the data and then dump the dictionary into a dataframe.
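That dictionary-first alternative could look like this sketch, with hypothetical payload rows standing in for the parsed `_source` values:

```python
import pandas as pd

# Hypothetical pre-split payloads, one list per hit
payloads = [["d", "e", "f", "g", "h"],
            ["i", "j", "k", "l", "m"]]

# Collect everything into a dict first, then build the DataFrame once
records = {i: row for i, row in enumerate(payloads)}
df = pd.DataFrame.from_dict(records, orient="index")
print(df.shape)
# (2, 5)
```

Building the frame once at the end avoids the per-row allocation cost of repeated `append` or `loc` assignment.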

Using json_normalize for structured multi level dictionaries with lists

I've successfully transferred the data from a JSON file (structured as per the below example), into a three column ['tag', 'time', 'score'] DataFrame using the following iterative approach:
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        df.loc[len(df)] = [v['tag_id'], v1['time'], v1['value']]
However, while this ultimately achieves the desired result, it takes a huge amount of time to iterate through larger files with the same structure. I'm assuming that an iterative approach is not the ideal way to tackle this sort of problem. Using pandas.io.json.json_normalize instead, I've tried the following:
result = json_normalize(my_request, ['content'], ['data', 'score', ['time', 'value']])
Which returns KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('data',)). I believe I've misinterpreted the pandas documentation on json_normalize, and can't quite figure out how I should pass the parameters.
Can anyone point me in the right direction?
(alternatively using errors='ignore' returns ValueError: Conflicting metadata name data, need distinguishing prefix.)
JSON Structure
{
    'content': [
        {
            'data': {
                'score': [
                    {
                        'time': '2015-03-01 00:00:30',
                        'value': 75.0
                    },
                    {
                        'time': '2015-03-01 23:50:30',
                        'value': 58.0
                    }
                ]
            },
            'tag_id': 320676
        },
        {
            'data': {
                'score': [
                    {
                        'time': '2015-03-01 00:00:25',
                        'value': 78.0
                    },
                    {
                        'time': '2015-03-01 00:05:25',
                        'value': 57.0
                    }
                ]
            },
            'tag_id': 320677
        }
    ],
    'meta': None,
    'requested': '2018-04-15 13:00:00'
}
However, while this ultimately achieves the desired result, it takes a huge amount of time to iterate through larger files with the same structure.
I would suggest the following:
Check whether the problem is with your iterated appends. Pandas is not very good at sequentially adding rows. How about this code:
tups = []
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        tups.append((v['tag_id'], v1['time'], v1['value']))

df = pd.DataFrame(tups, columns=['tag_id', 'time', 'value'])
If the preceding is not fast enough, check if it's the JSON-parsing part with
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        v['tag_id'], v1['time'], v1['value']
It is probable that 1. will be fast enough. If not, however, check if ujson might be faster for this case.
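For completeness, the `json_normalize` route from the question can also be made to work by passing the `content` list directly, with `record_path` walking into the nested score list and `meta` keeping `tag_id`. A sketch against the JSON structure above, using the modern `pd.json_normalize` spelling (the older `pandas.io.json.json_normalize` takes the same arguments):

```python
import pandas as pd

my_request = {
    'content': [
        {'data': {'score': [{'time': '2015-03-01 00:00:30', 'value': 75.0},
                            {'time': '2015-03-01 23:50:30', 'value': 58.0}]},
         'tag_id': 320676},
        {'data': {'score': [{'time': '2015-03-01 00:00:25', 'value': 78.0},
                            {'time': '2015-03-01 00:05:25', 'value': 57.0}]},
         'tag_id': 320677},
    ],
}

# record_path walks content[i]['data']['score']; meta keeps tag_id per record
result = pd.json_normalize(my_request['content'],
                           record_path=['data', 'score'],
                           meta=['tag_id'])
print(result.shape)
# (4, 3) with columns time, value, tag_id
```

The key difference from the call in the question is normalizing `my_request['content']` (the list of records) rather than the whole dict, which avoids the `data` KeyError.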

Extract JSON | API | Pandas DataFrame

I am using the Facebook API (v2.10) to which I've extracted the data I need, 95% of which is perfect. My problem is the 'actions' metric which returns as a dictionary within a list within another dictionary.
At present, all the data is in a DataFrame, however, the 'actions' column is a list of dictionaries that contain each individual action for that day.
{
    "actions": [
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "7"
        },
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "3"
        },
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "144"
        },
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "34"
        }
    ]
}
All this appears in one cell (row) within the DataFrame.
What is the best way to:
Get the action type, create a new column, and use the "action_type" as the column name?
List the correct value under this column
It looks like JSON, but when I look at the type, it's a pandas Series (stored as an object).
For those willing to help (thank you, I greatly appreciate it) - can you either point me in the direction of the right material, which I will read and work out on my own (I'm not entirely sure what to look for), or, if you decide this is an easy problem, explain how and why you solved it this way? I don't just want the answer.
I have tried the following (with help from a friend) and it kind of works, but I have issues with it running in my script. I.e., if it runs within a bigger code block, I get the following error:
for i in range(df.shape[0]):
    line = df.loc[i, 'Conversions']
    L = ast.literal_eval(line)
    for l in L:
        cid = l['action_type']
        value = l['value']
        df.loc[i, cid] = value
If I save the DF as a csv, call it using pd.read_csv...it executes properly, but not within the script. No idea why.
Error:
ValueError: malformed node or string: [{'value': '1', 'action_type': 'offsite_conversion.custom.xxxxx}]
Any help would be greatly appreciated.
Thanks,
Adrian
You can use json_normalize:
In [11]: d # e.g. dict from json.load OR instead pass the json path to json_normalize
Out[11]:
{'actions': [{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx',
'value': '7'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '3'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '144'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '34'}]}
In [12]: pd.io.json.json_normalize(d, record_path="actions")
Out[12]:
action_type value
0 offsite_conversion.custom.xxxxxxxxxxx 7
1 offsite_conversion.custom.xxxxxxxxxxx 3
2 offsite_conversion.custom.xxxxxxxxxxx 144
3 offsite_conversion.custom.xxxxxxxxxxx 34
You can use df.join(pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)).
Explanation:
df['Conversions'].tolist() returns a list of dictionaries. This list is then transformed into a DataFrame using pd.DataFrame. Then, you can use the pivot function to pivot the table into the shape that you want.
Lastly, you can join the table with your original DataFrame. Note that this only works if your DataFrame's index is the default (i.e., integers starting from 0). If this is not the case, you can do this instead:
df2 = pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)
for col in df2.columns:
    df[col] = df2[col]
