remove the duplicate rows in CSV file - python

I have the following Python function that exports JSON data to a CSV file. It works fine - the keys (CSV headers) and values (CSV rows) are populated in the CSV - but how do I remove the duplicate rows from the CSV file in Python, instead of removing them manually in Excel?
import csv

def toCSV(res):
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        for r in res:
            dict_writer.writerow(r)
Thank you
For example, in the CSV the rows with the apmsgfwd.exe information are duplicated. Duplicate data below:
process_hash process_name process_effective_reputation
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['73ca11f2acf1adb7802c2914e1026db899a3c851cd9500378c0045e0'] c:\users\zdr3dds01\documents\sap\sap gui\export.mhtml NOT_LISTED
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['582f018bc7a732d63f624d6f92b3d143', '66505bcb9975d61af14dd09cddd9ac0d11a3e2b5ae41845c65117e7e2b046d37'] c:\users\jij09\appdata\local\kingsoft\power word 2016\2016.3.3.0368\powerword.exe ADAPTIVE_WHITE_LIST
JSON data:
[{'device_name': 'fk6sdc2', 'device_timestamp': '2020-10-27T00:50:46.176Z', 'event_id': '9b1bvf6e17ee11eb81b', 'process_effective_reputation': 'LIST', 'process_hash': ['bfc7dcf5935830f3a9df8e9b6425c37a', 'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'], 'process_name': 'c:\\program files (x86)\\toh122soft\\thcasdf3\\toho34rce.exe', 'process_username': ['JOHN\\user1']}, {'device_name': 'fk6sdc2', 'device_timestamp': '2020-10-27T00:50:46.176Z', 'event_id': '9b151f6e17ee11eb81b', 'process_effective_reputation': 'LIST', 'process_hash': ['bfc7dcf5935f3a9df8e9b6830425c37a', 'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'], 'process_name': 'c:\\program files (x86)\\oft\\tf3\\tootsice.exe', 'process_username': ['JOHN\\user2']}, {'device_name': '6asdsdc2', 'device_timestamp': '2020-10-27T00:50:46.176Z', 'event_id': '9b151f698e11eb81b', 'process_effective_reputation': 'LIST', 'process_hash': ['9df8ebfc7dcf5935830f3a9b6425c37a', 'ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f7124641ad1734e2f19'], 'process_name': 'c:\\program files (x86)\\toht\\th3\\tohce.exe', 'process_username': ['JOHN\\user3']}]

Is it necessary to use the above approach? If not, I usually use the pandas library for reading CSV files and dropping the duplicates:
import pandas as pd
data = pd.read_csv('EnrichedEvents.csv')
data.drop_duplicates(inplace=True)
data.to_csv('output.csv',index=False)
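If the duplicates should only be judged on the columns you actually export, drop_duplicates also takes a subset argument. A small sketch, assuming the three column names from your toCSV() function:

import pandas as pd

data = pd.read_csv('EnrichedEvents.csv')
# Rows count as duplicates when these three columns match; keep the first occurrence.
data.drop_duplicates(subset=['process_hash', 'process_name', 'process_effective_reputation'],
                     keep='first', inplace=True)
data.to_csv('output.csv', index=False)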

Below is a standalone example that shows how to filter out duplicates while writing. The idea is to take the values of each dict and convert them into a tuple (lists are not hashable, so list values are joined into a string first). Keeping those tuples in a set lets us skip rows that have already been written.
import csv

csv_columns = ['No', 'Name', 'Country']
dict_data = [
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 2, 'Name': 'Ben', 'Country': ['USA']},
]
csv_file = "Names.csv"

with open(csv_file, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
    writer.writeheader()
    entries = set()
    for data in dict_data:
        val = tuple(','.join(v) if isinstance(v, list) else v for v in data.values())
        if val not in entries:
            writer.writerow(data)
            entries.add(val)
print('done')
Names.csv
No,Name,Country
1,Alex,['India']
2,Ben,['USA']
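Applied to your toCSV() function, it is worth keying the dedup tuple on the fieldnames only: data.values() also includes per-event fields such as event_id, which differ between otherwise identical-looking rows, so those rows would slip through the filter. A hedged sketch along those lines:

import csv

def toCSV(res):
    fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        seen = set()
        for r in res:
            # Build the key from the written columns only, joining any list values into strings.
            key = tuple(','.join(v) if isinstance(v, list) else v
                        for v in (r.get(f) for f in fieldnames))
            if key not in seen:
                seen.add(key)
                dict_writer.writerow(r)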

Related

Write Python Dictionary List Value into single CSV Cell?

This is probably easy, but I'm not sure how to do it. This is purely an example, but for Algeria, how would I write both "DZ" and "DB" into a single cell? Right now the cell value comes out bracketed like a list, ['DZ', 'DB'], instead of something like DZ DB.
import csv

# csv header
fieldnames = ['name', 'area', 'country_code2', 'country_code3']

# csv data
rows = [
    {'name': 'Albania',
     'area': 28748,
     'country_code2': 'AL',
     'country_code3': 'ALB'},
    {'name': 'Algeria',
     'area': 2381741,
     'country_code2': ['DZ', 'DB'],
     'country_code3': 'DZA'},
    {'name': 'American Samoa',
     'area': 199,
     'country_code2': 'AS',
     'country_code3': 'ASM'}
]

with open('countries.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
Convert the list to a delimited string before writing the row:
for row in rows:
    if isinstance(row['country_code2'], list):
        row['country_code2'] = ' '.join(row['country_code2'])
You need to preprocess the rows before saving them to CSV, so we use isinstance to check whether country_code2 is a list or a string:
import csv

# csv header
fieldnames = ['name', 'area', 'country_code2', 'country_code3']

# csv data
rows = [
    {'name': 'Albania',
     'area': 28748,
     'country_code2': 'AL',
     'country_code3': 'ALB'},
    {'name': 'Algeria',
     'area': 2381741,
     'country_code2': ['DZ', 'DB'],
     'country_code3': 'DZA'},
    {'name': 'American Samoa',
     'area': 199,
     'country_code2': 'AS',
     'country_code3': 'ASM'}
]

for row in rows:
    # check the country_code2 of every row to see if it is a list
    if isinstance(row['country_code2'], list):
        # join the list into a single string with ' ' separating the values
        row['country_code2'] = ' '.join(row['country_code2'])

with open('countries.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
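For reference, the countries.csv produced by the code above should look like this, with the list collapsed to a space-delimited string:

name,area,country_code2,country_code3
Albania,28748,AL,ALB
Algeria,2381741,DZ DB,DZA
American Samoa,199,AS,ASM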

Write csv file (including header) via dataset

I want to create a function that takes a file name and a dataset, and in turn, creates a .csv file with the dictionary keys as a header and the values in the rows beneath the header. I would like this to be independent of the number of columns in the dataset. Also, if any of the columns shows no data I still want the rest of the data to be written in the .csv file.
My current code looks like this:
def write_data_to_csv(csv_name, dataset):
    import csv
    with open(f"{csv_name}.csv", "w") as csv_file:
        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(["iso_country", "continent", "country", "population", "population_density", "date", "total_cases"])
        for i in dataset:
            try:
                writer.writerow([i['iso_country'], i['continent'], i['date'], i['total_cases']])
            except:
                0
My data looks like this:
[{'iso_country': 'ALB',
'continent': 'Europe',
'location': 'Albania',
'population': 2877800.0,
'population_density': 104.871,
'date': '2021-03-05',
'total_cases': 111301.0},
{'iso_country': 'ALB',
'continent': 'Europe',
'location': 'Albania',
'population': 2877800.0,
'population_density': 104.871,
'date': '2021-03-05',
'total_cases': 111301.0}]
Thanks in advance!
You can use this simple write_csv() function to write any list of dicts to a csv:
import csv
from typing import List, Dict

def write_csv(data: List[Dict], path: str) -> None:
    with open(path, 'w', newline='') as file:
        fields = set().union(*data)
        writer = csv.DictWriter(file, fieldnames=fields)
        writer.writeheader()
        writer.writerows(data)
csv_data = [{'iso_country': 'ALB',
'continent': 'Europe',
'location': 'Albania',
'population': 2877800.0,
'population_density': 104.871,
'date': '2021-03-05',
'total_cases': 111301.0},
{'iso_country': 'ALB',
'continent': 'Europe',
'location': 'Albania',
'population': 2877800.0,
'population_density': 104.871,
'date': '2021-03-05',
'total_cases': 111301.0}]
write_csv(csv_data, 'data.csv')
The typing import is just there so that most IDEs pick up the type hints and document the function automatically.
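One caveat: because fields is built from a set, the column order in the header is not guaranteed. If a stable order matters to you, sorting the field names is a one-line change (everything else in the function stays the same):

        fields = sorted(set().union(*data))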

How to split CSV columns?

I am using the following code to export dictionary/JSON data to CSV. I am trying to split the "process_hash" column into two columns, one for MD5 and another for SHA256, along with the other existing columns.
The "process_hash" column currently contains list values, and I am not sure how to split them into MD5 and SHA256 columns.
[{'device_name': 'fk6sdc2',
'device_timestamp': '2020-10-27T00:50:46.176Z',
'event_id': '9b1bvf6e17ee11eb81b',
'process_effective_reputation': 'LIST',
'process_hash': ['bfc7dcf5935830f3a9df8e9b6425c37a',
'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
'process_name': 'c:\\program files '
'(x86)\\toh122soft\\thcasdf3\\toho34rce.exe',
'process_username': ['JOHN\\user1']},
{'device_name': 'fk6sdc2',
'device_timestamp': '2020-10-27T00:50:46.176Z',
'event_id': '9b151f6e17ee11eb81b',
'process_effective_reputation': 'LIST',
'process_hash': ['bfc7dcf5935f3a9df8e9b6830425c37a',
'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
'process_name': 'c:\\program files (x86)\\oft\\tf3\\tootsice.exe',
'process_username': ['JOHN\\user2']},
{'device_name': '6asdsdc2',
'device_timestamp': '2020-10-27T00:50:46.176Z',
'event_id': '9b151f698e11eb81b',
'process_effective_reputation': 'LIST',
'process_hash': ['9df8ebfc7dcf5935830f3a9b6425c37a',
'ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f7124641ad1734e2f19'],
'process_name': 'c:\\program files (x86)\\toht\\th3\\tohce.exe',
'process_username': ['JOHN\\user3']}]
Code to export to csv:
import csv

def toCSV(res):
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        entries = set()
        for data in res:
            val = tuple(','.join(v) if isinstance(v, list) else v for v in data.values())
            if val not in entries:
                dict_writer.writerow(data)
                entries.add(val)
csv data:
process_hash process_name process_effective_reputation
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['73ca11f2acf1adb7802c2914e1026db899a3c851cd9500378c0045e0'] c:\users\zdr3dds01\documents\sap\sap gui\export.mhtml NOT_LISTED
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['582f018bc7a732d63f624d6f92b3d143', '66505bcb9975d61af14dd09cddd9ac0d11a3e2b5ae41845c65117e7e2b046d37'] c:\users\jij09\appdata\local\kingsoft\power word 2016\2016.3.3.0368\powerword.exe ADAPTIVE_WHITE_LIST
What I'm trying to achieve with the CSV file:
md5 sha256 process_name process_effective_reputation
You can do it by processing the process_hash field list separately and just copying the other two fields as shown below:
import csv

data = [{'device_name': 'fk6sdc2',
         rest of your data ...

def toCSV(res):
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = 'md5,sha256,process_name,process_effective_reputation'.split(',')
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        for obj in res:
            md5, sha256 = obj['process_hash']  # Extract values from process_hash list.
            row = {'md5': md5, 'sha256': sha256}  # Initialize a row with them.
            row.update({field: obj[field]  # Copy the last two fields into it.
                        for field in fieldnames[-2:]})
            dict_writer.writerow(row)

toCSV(data)
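One thing to watch: md5, sha256 = obj['process_hash'] assumes every list holds exactly two digests, but the sample CSV above has a row (the export.mhtml one) with only a single hash, which would raise a ValueError. A hedged variant uses a hypothetical split_hashes() helper that classifies each digest by length (32 hex characters for MD5, 64 for SHA-256) and leaves anything missing as an empty string:

def split_hashes(hashes):
    # Pick digests by length; anything not found becomes an empty string.
    md5 = next((h for h in hashes if len(h) == 32), '')
    sha256 = next((h for h in hashes if len(h) == 64), '')
    return md5, sha256

and inside the loop:

            md5, sha256 = split_hashes(obj['process_hash'])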
Here is one way. The function apply() converts one list to multiple columns.
import pandas as pd
data = [{'device_name': 'fk6sdc2',
'device_timestamp': '2020-10-27T00:50:46.176Z',
'event_id': '9b1bvf6e17ee11eb81b',
'process_effective_reputation': 'LIST',
'process_hash': ['bfc7dcf5935830f3a9df8e9b6425c37a',
'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
'process_name': 'c:\\program files '
'(x86)\\toh122soft\\thcasdf3\\toho34rce.exe',
'process_username': ['JOHN\\user1']},
{'device_name': 'fk6sdc2',
'device_timestamp': '2020-10-27T00:50:46.176Z',
'event_id': '9b151f6e17ee11eb81b',
'process_effective_reputation': 'LIST',
'process_hash': ['bfc7dcf5935f3a9df8e9b6830425c37a',
'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
'process_name': 'c:\\program files (x86)\\oft\\tf3\\tootsice.exe',
'process_username': ['JOHN\\user2']},
{'device_name': '6asdsdc2',
'device_timestamp': '2020-10-27T00:50:46.176Z',
'event_id': '9b151f698e11eb81b',
'process_effective_reputation': 'LIST',
'process_hash': ['9df8ebfc7dcf5935830f3a9b6425c37a',
'ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f7124641ad1734e2f19'],
'process_name': 'c:\\program files (x86)\\toht\\th3\\tohce.exe',
'process_username': ['JOHN\\user3']}]
Now process the data:
df = pd.DataFrame(data, columns=['process_hash', 'process_name', 'process_effective_reputation'])
df[['md5', 'sha256']] = df['process_hash'].apply(lambda x: pd.Series(x))
df = df.drop(columns='process_hash')
Finally here are the results:
print(df)
process_name \
0 c:\program files (x86)\toh122soft\thcasdf3\toh...
1 c:\program files (x86)\oft\tf3\tootsice.exe
2 c:\program files (x86)\toht\th3\tohce.exe
process_effective_reputation md5 \
0 LIST bfc7dcf5935830f3a9df8e9b6425c37a
1 LIST bfc7dcf5935f3a9df8e9b6830425c37a
2 LIST 9df8ebfc7dcf5935830f3a9b6425c37a
sha256
0 ca9f3a24506cc518fc939a33c100b2d557f96e040f712f...
1 ca9f3a24506cc518fc939a33c100b2d557f96e040f712f...
2 ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f...
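If you prefer to avoid the per-row pd.Series construction, an equivalent split (assuming, as in the sample data, that every process_hash list holds exactly two digests, MD5 first and SHA-256 second) is:

hashes = pd.DataFrame(df['process_hash'].tolist(),
                      columns=['md5', 'sha256'], index=df.index)
df = df.drop(columns='process_hash').join(hashes)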

Trouble editing a CSV file using python csv module dictwriter

I am making a program that reads data from a form, stores it in a dictionary, and then uses csv.DictWriter to append the data to a CSV file. I run the program but nothing happens to my data.csv file. The main program and the data file are in the same working directory, and the csv module is imported as well.
Here's the code,
def response_to_csv(data):
# append w/ dictionary -> more efficient
with open('data.csv', 'a', newline = '') as csvfile:
    fieldnames = ['date', 'first', 'last', 'age', 'email', 'country',
                  'city/town', 'Uni Student', 'Instagram', 'Followers', 'Affiliate'
                  ]
    writer = csv.DictWriter(csvfile, fieldnames = fieldnames)
    writer.writeheader()
    writer.writerow({
        'date': data['date'],
        'first': data['first'],
        'last': data['last'],
        'age': data['age'],
        'email': data['email'],
        'country': data['country'],
        'city/town': data['city/town'],
        'Uni Student': data['Uni Student'],
        'Instagram': data['Instagram'],
        'Followers': data['Followers'],
        'Affiliate': data['Affiliate']
    })
Here's the data dictionary
data = {
'date' : date,
'first': fname,
'last' : lname,
'age' : age,
'email': email,
'country': country,
'city/town': city_town,
'Uni Student': is_Uni_Student,
'Instagram': insta,
'Followers': ig_followers,
'Affiliate': affiliation
}
response_to_csv(data)
import csv
data = {
'date' : '202001',
'first': 'Bob',
'last' : 'Smith',
'age' : 45,
'email': 'bsmith#gmail.com',
'country': 'USA',
'city/town': 'New York',
'Uni Student': 1,
'Instagram': '#bsmith',
'Followers': 45678,
'Affiliate': 'Red Bull'
}
def response_to_csv(data):
    fieldnames = ['date', 'first', 'last', 'age', 'email', 'country',
                  'city/town', 'Uni Student', 'Instagram', 'Followers', 'Affiliate'
                  ]
    with open('data.csv', 'a', newline = '') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames = fieldnames)
        writer.writeheader()
        writer.writerow(data)

response_to_csv(data)
Your code worked for me, although I had to fix the indentation of the body of your function: with open(...) should not be at the same indent as def response_to_csv(data):
import csv

def response_to_csv(data):
    # append w/ dictionary -> more efficient
    with open('data.csv', 'a', newline = '') as csvfile:
        fieldnames = ['date', 'first', 'last', 'age', 'email', 'country',
                      'city/town', 'Uni Student', 'Instagram', 'Followers', 'Affiliate'
                      ]
        writer = csv.DictWriter(csvfile, fieldnames = fieldnames)
        writer.writeheader()
        writer.writerow({
            'date': data['date'],
            'first': data['first'],
            'last': data['last'],
            'age': data['age'],
            'email': data['email'],
            'country': data['country'],
            'city/town': data['city/town'],
            'Uni Student': data['Uni Student'],
            'Instagram': data['Instagram'],
            'Followers': data['Followers'],
            'Affiliate': data['Affiliate']
        })
data = {
'date' : '2019_01_01',
'first': 'firstname',
'last' : 'lname',
'age' : '99',
'email': 'email#address.com',
'country': 'USA',
'city/town': 'MyTown',
'Uni Student': True,
'Instagram': 'MyInsta',
'Followers': 24,
'Affiliate': 'affiliation'
}
response_to_csv(data)
$ cat data.csv
date,first,last,age,email,country,city/town,Uni Student,Instagram,Followers,Affiliate
2019_01_01,firstname,lname,99,email#address.com,USA,MyTown,True,MyInsta,24,affiliation
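One follow-up, since the file is opened in append mode: writeheader() runs on every call, so calling the function repeatedly will repeat the header row inside data.csv. A small sketch (assuming you call it more than once) that only writes the header when the file is new or empty:

import csv
import os

def response_to_csv(data):
    fieldnames = ['date', 'first', 'last', 'age', 'email', 'country',
                  'city/town', 'Uni Student', 'Instagram', 'Followers', 'Affiliate']
    # Write the header only if data.csv does not exist yet or is still empty.
    need_header = not os.path.exists('data.csv') or os.path.getsize('data.csv') == 0
    with open('data.csv', 'a', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if need_header:
            writer.writeheader()
        writer.writerow(data)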

Better way to create json file from multiple lists?

I have three lists like below and I want to create a JSON file from them:
devices = ['iphone', 'ipad', 'ipod', 'watch'],
cities = ['NY', 'SFO', 'LA', 'NJ'],
companies = ['Apple', 'Samsung', 'Walmart']
This is what I have done so far. First, manually create a dictionary:
data = {
'devices': ['iphone', 'ipad', 'ipod', 'watch'],
'cities': ['NY', 'SFO', 'LA', 'NJ'],
'companies': ['Apple', 'Samsung', 'Walmart']
}
Then convert it to JSON format like this:
import json
with open('abc.json', 'w') as outfile:
    json.dump(data, outfile, indent=4)
Is there a better way of doing this when there are more lists?
Ideally, if I have N lists, I want to create the JSON file with a minimal amount of manual work.
Your question doesn't show the lists coming from an external source such as another .py file, so here's how to do it from their variable names when they've been defined in-line as shown:
import json
devices = ['iphone', 'ipad', 'ipod', 'watch']
cities = ['NY', 'SFO', 'LA', 'NJ']
companies = ['Apple', 'Samsung', 'Walmart']
lists = ['devices', 'cities', 'companies']
data = {listname: globals()[listname] for listname in lists}
with open('abc.json', 'w') as outfile:
    json.dump(data, outfile, indent=4)
Contents of the abc.json file it creates:
{
    "devices": [
        "iphone",
        "ipad",
        "ipod",
        "watch"
    ],
    "cities": [
        "NY",
        "SFO",
        "LA",
        "NJ"
    ],
    "companies": [
        "Apple",
        "Samsung",
        "Walmart"
    ]
}
This method will work for any number of lists providing they have the same format as the ones provided in your question. Hope this helps.
# define the list vars
devices = ['iphone', 'ipad', 'ipod', 'watch'],
cities = ['NY', 'SFO', 'LA', 'NJ'],
companies = ['Apple', 'Samsung', 'Walmart']

# get the variables into a big list
v = locals()['In'][2]
output = {}

# break down the lists and turn them into dict entries
v1 = v.split(',\n')
for each in v1:
    #print(each)
    name = each.split(' = ')[0]
    data = each.split(' = ')[1]
    data = data[2:-2]
    datalist = data.split("', '")
    output[name] = datalist

# show the output
output

# export as JSON
import json
with open('C:\\abc.json', 'w') as outfile:
    json.dump(output, outfile, indent=4)
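Worth noting: locals()['In'] is the IPython/Jupyter input history, so the snippet above only works inside a notebook or IPython session, and it assumes the lists were defined in input cell 2. A plain-Python sketch that avoids both that dependency and the globals() lookup, by naming each list once in a dict literal:

import json

devices = ['iphone', 'ipad', 'ipod', 'watch']
cities = ['NY', 'SFO', 'LA', 'NJ']
companies = ['Apple', 'Samsung', 'Walmart']

# Name each list once; the dict keys become the JSON keys.
data = {'devices': devices, 'cities': cities, 'companies': companies}

with open('abc.json', 'w') as outfile:
    json.dump(data, outfile, indent=4)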
