I am using the following code to export dictionary/JSON data to CSV. I am trying to split the "process_hash" column into two columns, one for MD5 and another for SHA256, alongside the other existing columns.
The "process_hash" column currently contains list values; how do I split them into separate MD5 and SHA256 columns?
[{'device_name': 'fk6sdc2',
  'device_timestamp': '2020-10-27T00:50:46.176Z',
  'event_id': '9b1bvf6e17ee11eb81b',
  'process_effective_reputation': 'LIST',
  'process_hash': ['bfc7dcf5935830f3a9df8e9b6425c37a',
                   'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
  'process_name': 'c:\\program files '
                  '(x86)\\toh122soft\\thcasdf3\\toho34rce.exe',
  'process_username': ['JOHN\\user1']},
 {'device_name': 'fk6sdc2',
  'device_timestamp': '2020-10-27T00:50:46.176Z',
  'event_id': '9b151f6e17ee11eb81b',
  'process_effective_reputation': 'LIST',
  'process_hash': ['bfc7dcf5935f3a9df8e9b6830425c37a',
                   'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
  'process_name': 'c:\\program files (x86)\\oft\\tf3\\tootsice.exe',
  'process_username': ['JOHN\\user2']},
 {'device_name': '6asdsdc2',
  'device_timestamp': '2020-10-27T00:50:46.176Z',
  'event_id': '9b151f698e11eb81b',
  'process_effective_reputation': 'LIST',
  'process_hash': ['9df8ebfc7dcf5935830f3a9b6425c37a',
                   'ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f7124641ad1734e2f19'],
  'process_name': 'c:\\program files (x86)\\toht\\th3\\tohce.exe',
  'process_username': ['JOHN\\user3']}]
Code to export to csv:
def toCSV(res):
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        entries = set()
        for data in res:
            val = tuple(','.join(v) if isinstance(v, list) else v for v in data.values())
            if val not in entries:
                dict_writer.writerow(data)
                entries.add(val)
csv data:
process_hash process_name process_effective_reputation
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['73ca11f2acf1adb7802c2914e1026db899a3c851cd9500378c0045e0'] c:\users\zdr3dds01\documents\sap\sap gui\export.mhtml NOT_LISTED
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['582f018bc7a732d63f624d6f92b3d143', '66505bcb9975d61af14dd09cddd9ac0d11a3e2b5ae41845c65117e7e2b046d37'] c:\users\jij09\appdata\local\kingsoft\power word 2016\2016.3.3.0368\powerword.exe ADAPTIVE_WHITE_LIST
What I'm trying to achieve with the CSV file:
md5 sha256 process_name process_effective_reputation
You can do it by processing the process_hash field list separately and just copying the other two fields as shown below:
import csv

data = [{'device_name': 'fk6sdc2',
         # ... rest of your data ...
        }]

def toCSV(res):
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = 'md5,sha256,process_name,process_effective_reputation'.split(',')
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        for obj in res:
            md5, sha256 = obj['process_hash']  # Extract values from process_hash list.
            row = {'md5': md5, 'sha256': sha256}  # Initialize a row with them.
            row.update({field: obj[field]  # Copy the last two fields into it.
                        for field in fieldnames[-2:]})
            dict_writer.writerow(row)

toCSV(data)
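One caveat: the two-value unpacking above assumes every process_hash list holds exactly two entries. If a record can carry a single hash (the NOT_LISTED row in the question's CSV sample shows a one-element list), the unpacking raises ValueError. A hedged variant, assuming MD5 hex digests are 32 characters and SHA-256 digests are 64, tolerates either digest being absent:

```python
def split_hashes(hashes):
    """Split a list of hash strings into (md5, sha256) by digest length.

    Assumes MD5 hex digests are 32 characters and SHA-256 digests are 64;
    a missing digest comes back as an empty string.
    """
    md5 = sha256 = ''
    for h in hashes:
        if len(h) == 32:
            md5 = h
        elif len(h) == 64:
            sha256 = h
    return md5, sha256
```

Inside the loop you would then write `md5, sha256 = split_hashes(obj['process_hash'])` instead of unpacking directly.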
Here is one way using pandas. apply() converts each list into multiple columns.
import pandas as pd

data = [{'device_name': 'fk6sdc2',
         'device_timestamp': '2020-10-27T00:50:46.176Z',
         'event_id': '9b1bvf6e17ee11eb81b',
         'process_effective_reputation': 'LIST',
         'process_hash': ['bfc7dcf5935830f3a9df8e9b6425c37a',
                          'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
         'process_name': 'c:\\program files '
                         '(x86)\\toh122soft\\thcasdf3\\toho34rce.exe',
         'process_username': ['JOHN\\user1']},
        {'device_name': 'fk6sdc2',
         'device_timestamp': '2020-10-27T00:50:46.176Z',
         'event_id': '9b151f6e17ee11eb81b',
         'process_effective_reputation': 'LIST',
         'process_hash': ['bfc7dcf5935f3a9df8e9b6830425c37a',
                          'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
         'process_name': 'c:\\program files (x86)\\oft\\tf3\\tootsice.exe',
         'process_username': ['JOHN\\user2']},
        {'device_name': '6asdsdc2',
         'device_timestamp': '2020-10-27T00:50:46.176Z',
         'event_id': '9b151f698e11eb81b',
         'process_effective_reputation': 'LIST',
         'process_hash': ['9df8ebfc7dcf5935830f3a9b6425c37a',
                          'ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f7124641ad1734e2f19'],
         'process_name': 'c:\\program files (x86)\\toht\\th3\\tohce.exe',
         'process_username': ['JOHN\\user3']}]
Now process the data:
df = pd.DataFrame(data, columns=['process_hash', 'process_name', 'process_effective_reputation'])
df[['md5', 'sha256']] = df['process_hash'].apply(lambda x: pd.Series(x))
df = df.drop(columns='process_hash')
Finally here are the results:
print(df)
process_name \
0 c:\program files (x86)\toh122soft\thcasdf3\toh...
1 c:\program files (x86)\oft\tf3\tootsice.exe
2 c:\program files (x86)\toht\th3\tohce.exe
process_effective_reputation md5 \
0 LIST bfc7dcf5935830f3a9df8e9b6425c37a
1 LIST bfc7dcf5935f3a9df8e9b6830425c37a
2 LIST 9df8ebfc7dcf5935830f3a9b6425c37a
sha256
0 ca9f3a24506cc518fc939a33c100b2d557f96e040f712f...
1 ca9f3a24506cc518fc939a33c100b2d557f96e040f712f...
2 ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f...
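As an aside, apply(lambda x: pd.Series(x)) builds one Series per row and can be slow on large frames. An equivalent sketch (assuming every list has exactly two elements in md5-first order) builds both columns in one pass from the underlying lists:

```python
import pandas as pd

# Toy frame standing in for the real data above.
df = pd.DataFrame({'process_hash': [['a' * 32, 'b' * 64],
                                    ['c' * 32, 'd' * 64]]})

# Expand the lists into a two-column frame, name the columns explicitly,
# then join them back in place of the original column.
hashes = pd.DataFrame(df['process_hash'].tolist(),
                      index=df.index, columns=['md5', 'sha256'])
df = df.drop(columns='process_hash').join(hashes)
```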
Related
This is probably easy, but I'm not sure how to do it. This is purely an example, but for Algeria, how would I write both "DZ" and "DB" to a single cell? The cell value currently appears bracketed like a list, ['DZ', 'DB'], instead of something like DZ DB.
import csv

# csv header
fieldnames = ['name', 'area', 'country_code2', 'country_code3']

# csv data
rows = [
    {'name': 'Albania',
     'area': 28748,
     'country_code2': 'AL',
     'country_code3': 'ALB'},
    {'name': 'Algeria',
     'area': 2381741,
     'country_code2': ['DZ', 'DB'],
     'country_code3': 'DZA'},
    {'name': 'American Samoa',
     'area': 199,
     'country_code2': 'AS',
     'country_code3': 'ASM'}
]

with open('countries.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
Convert any list value to a delimited string before writing:

for row in rows:
    if isinstance(row['country_code2'], list):
        row['country_code2'] = ' '.join(row['country_code2'])
You need to preprocess the rows before saving them to CSV, using isinstance to check whether country_code2 is a list or a string:
import csv

# csv header
fieldnames = ['name', 'area', 'country_code2', 'country_code3']

# csv data
rows = [
    {'name': 'Albania',
     'area': 28748,
     'country_code2': 'AL',
     'country_code3': 'ALB'},
    {'name': 'Algeria',
     'area': 2381741,
     'country_code2': ['DZ', 'DB'],
     'country_code3': 'DZA'},
    {'name': 'American Samoa',
     'area': 199,
     'country_code2': 'AS',
     'country_code3': 'ASM'}
]

for row in rows:
    # check whether country_code2 of this row is a list
    if isinstance(row['country_code2'], list):
        # join the list into a single string with ' ' separating the values
        row['country_code2'] = ' '.join(row['country_code2'])

with open('countries.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
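The same idea generalizes: if several fields may hold lists, a small helper can flatten every row before writing. This is a sketch, assuming a space-joined string is acceptable in every list-valued cell:

```python
import csv
import io

def flatten_lists(row, sep=' '):
    """Return a copy of row with any list value joined into one string."""
    return {k: sep.join(v) if isinstance(v, list) else v
            for k, v in row.items()}

rows = [{'name': 'Algeria', 'country_code2': ['DZ', 'DB']},
        {'name': 'Albania', 'country_code2': 'AL'}]

buf = io.StringIO()  # stands in for an open file
writer = csv.DictWriter(buf, fieldnames=['name', 'country_code2'])
writer.writeheader()
writer.writerows(flatten_lists(r) for r in rows)
```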
I have the following Python function that exports JSON data to a CSV file. It works fine, in that the keys (CSV headers) and values (CSV rows) are populated in the CSV, but I'm trying to remove the duplicate rows in the CSV file.
Instead of manually removing them in Excel, how do I remove the duplicate values in Python?
def toCSV(res):
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        for r in res:
            dict_writer.writerow(r)
Thank you
For example, in the CSV the rows with the apmsgfwd.exe information are duplicated. Duplicate data below:
process_hash process_name process_effective_reputation
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['73ca11f2acf1adb7802c2914e1026db899a3c851cd9500378c0045e0'] c:\users\zdr3dds01\documents\sap\sap gui\export.mhtml NOT_LISTED
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2'] c:\windows\system32\delltpad\apmsgfwd.exe ADAPTIVE_WHITE_LIST
['582f018bc7a732d63f624d6f92b3d143', '66505bcb9975d61af14dd09cddd9ac0d11a3e2b5ae41845c65117e7e2b046d37'] c:\users\jij09\appdata\local\kingsoft\power word 2016\2016.3.3.0368\powerword.exe ADAPTIVE_WHITE_LIST
json data:
[{'device_name': 'fk6sdc2', 'device_timestamp': '2020-10-27T00:50:46.176Z', 'event_id': '9b1bvf6e17ee11eb81b', 'process_effective_reputation': 'LIST', 'process_hash': ['bfc7dcf5935830f3a9df8e9b6425c37a', 'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'], 'process_name': 'c:\\program files (x86)\\toh122soft\\thcasdf3\\toho34rce.exe', 'process_username': ['JOHN\\user1']}, {'device_name': 'fk6sdc2', 'device_timestamp': '2020-10-27T00:50:46.176Z', 'event_id': '9b151f6e17ee11eb81b', 'process_effective_reputation': 'LIST', 'process_hash': ['bfc7dcf5935f3a9df8e9b6830425c37a', 'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'], 'process_name': 'c:\\program files (x86)\\oft\\tf3\\tootsice.exe', 'process_username': ['JOHN\\user2']}, {'device_name': '6asdsdc2', 'device_timestamp': '2020-10-27T00:50:46.176Z', 'event_id': '9b151f698e11eb81b', 'process_effective_reputation': 'LIST', 'process_hash': ['9df8ebfc7dcf5935830f3a9b6425c37a', 'ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f7124641ad1734e2f19'], 'process_name': 'c:\\program files (x86)\\toht\\th3\\tohce.exe', 'process_username': ['JOHN\\user3']}]
Is it necessary to use the above approach? If not, I usually use the pandas library for reading CSV files:
import pandas as pd
data = pd.read_csv('EnrichedEvents.csv')
data.drop_duplicates(inplace=True)
data.to_csv('output.csv',index=False)
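One caveat worth noting: drop_duplicates requires hashable cell values. That is fine here because read_csv loads the bracketed hash lists back as plain strings, but it would raise TypeError on a frame whose cells hold real Python lists. A minimal sketch of why the round trip works:

```python
import pandas as pd

# Cells read back from a CSV are plain strings (even the bracketed lists),
# so every value is hashable and drop_duplicates works.
df = pd.DataFrame({'process_hash': ["['abc', 'def']", "['abc', 'def']", "['ghi']"],
                   'process_name': ['a.exe', 'a.exe', 'b.exe']})
deduped = df.drop_duplicates()
```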
Below is a standalone example that shows how to filter duplicates. The idea is to convert the values of each dict into a tuple; a set of those tuples then filters out the duplicates.
import csv

csv_columns = ['No', 'Name', 'Country']
dict_data = [
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 2, 'Name': 'Ben', 'Country': ['USA']},
]
csv_file = "Names.csv"

with open(csv_file, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
    writer.writeheader()
    entries = set()
    for data in dict_data:
        val = tuple(','.join(v) if isinstance(v, list) else v for v in data.values())
        if val not in entries:
            writer.writerow(data)
            entries.add(val)

print('done')
Names.csv
No,Name,Country
1,Alex,['India']
2,Ben,['USA']
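Note the Country column still prints with brackets because the original dict, list value and all, is what gets written; only the dedup key was joined. A sketch that writes the joined string too, assuming a comma-joined cell is acceptable:

```python
import csv
import io

dict_data = [{'No': 1, 'Name': 'Alex', 'Country': ['India']},
             {'No': 1, 'Name': 'Alex', 'Country': ['India']},
             {'No': 2, 'Name': 'Ben', 'Country': ['USA']}]

buf = io.StringIO()  # stands in for an open file
writer = csv.DictWriter(buf, fieldnames=['No', 'Name', 'Country'])
writer.writeheader()
seen = set()
for data in dict_data:
    # Flatten list values first, then use the flattened row both as
    # the dedup key and as the row actually written.
    flat = {k: ','.join(v) if isinstance(v, list) else v
            for k, v in data.items()}
    key = tuple(flat.values())
    if key not in seen:
        writer.writerow(flat)
        seen.add(key)
```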
I have three lists like below and I want to create a JSON file from them:
devices = ['iphone', 'ipad', 'ipod', 'watch'],
cities = ['NY', 'SFO', 'LA', 'NJ'],
companies = ['Apple', 'Samsung', 'Walmart']
I have done it like below.
First I manually create a dictionary:
data = {
    'devices': ['iphone', 'ipad', 'ipod', 'watch'],
    'cities': ['NY', 'SFO', 'LA', 'NJ'],
    'companies': ['Apple', 'Samsung', 'Walmart']
}
Then convert it to JSON format like this:
import json

with open('abc.json', 'w') as outfile:
    json.dump(data, outfile, indent=4)
Is there a better way of doing this when we have a larger number of lists?
Ideally, if I have N lists, I want to create a JSON-formatted file with a minimal amount of manual work.
Your question doesn't show the lists coming from an external source such as another .py file, so here's how to do it given their variable names when they've been defined in-line as shown:
import json

devices = ['iphone', 'ipad', 'ipod', 'watch']
cities = ['NY', 'SFO', 'LA', 'NJ']
companies = ['Apple', 'Samsung', 'Walmart']

lists = ['devices', 'cities', 'companies']
data = {listname: globals()[listname] for listname in lists}

with open('abc.json', 'w') as outfile:
    json.dump(data, outfile, indent=4)
Contents of the abc.json file it creates:
{
    "devices": [
        "iphone",
        "ipad",
        "ipod",
        "watch"
    ],
    "cities": [
        "NY",
        "SFO",
        "LA",
        "NJ"
    ],
    "companies": [
        "Apple",
        "Samsung",
        "Walmart"
    ]
}
This method will work for any number of lists, provided they have the same format as the ones in your question. Hope this helps.
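If reaching into globals() feels fragile (renaming a variable silently breaks the export), an alternative sketch passes the lists as keyword arguments, so each name is typed exactly once and no name lookup is needed. The function name lists_to_json is a hypothetical helper for illustration:

```python
import json

def lists_to_json(path, **named_lists):
    """Dump any number of lists; each keyword argument becomes a JSON key."""
    with open(path, 'w') as outfile:
        json.dump(named_lists, outfile, indent=4)

devices = ['iphone', 'ipad', 'ipod', 'watch']
cities = ['NY', 'SFO', 'LA', 'NJ']
lists_to_json('abc.json', devices=devices, cities=cities)
```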
# define the list vars
devices = ['iphone', 'ipad', 'ipod', 'watch'],
cities = ['NY', 'SFO', 'LA', 'NJ'],
companies = ['Apple', 'Samsung', 'Walmart']

# get the source text of the input cell that defined them
# (note: locals()['In'] only exists in IPython/Jupyter)
v = locals()['In'][2]

output = {}

# break down the lists and turn them into dict entries
v1 = v.split(',\n')
for each in v1:
    # print(each)
    name = each.split(' = ')[0]
    data = each.split(' = ')[1]
    data = data[2:-2]
    datalist = data.split("', '")
    output[name] = datalist

# show the output
output

# export as JSON
import json
with open('C:\\abc.json', 'w') as outfile:
    json.dump(output, outfile, indent=4)
I currently have a list of data containing these properties:
properties = {
    'address': address,
    'city': city,
    'state': state,
    'postal_code': postal_code,
    'price': price,
    'facts and features': info,
    'real estate provider': broker,
    'url': property_url,
    'title': title
}
These are populated with about 25 rows.
I am attempting to write them to a csv file using this:
with open("ogtest-%s-%s.csv" % (listValue, zipCodes), 'w') as csvfile:
    fieldnames = ['title', 'address', 'city', 'state', 'postal_code', 'price',
                  'facts and features', 'real estate provider', 'url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in scraped_data:
        writer.writerow(row)
My end CSV result puts all the data in one row: each field, such as address, has a title and below it ALL the values.
['/profile/Colin-Welman/', '/profile/andreagressinger/', '/profile/Regina-Vannicola/', '/profile/Kathryn-Perkins/', etc
The scraped_data appears like this:
{'city': '(844) 292-5128(310) 505-7493(310) 562-8483(310) 422-9001(310) 439-5303(323) 736-4891(310) 383-8111(310) 482-2033(646) 872-4990(310) 963-1648', 'state': None, 'postal_code': None, 'facts and features': u'', 'address': ['/profile/Jonathan-Pearson/', '/profile/margret2/', '/profile/user89580694/', '/profile/Rinde-Philippe/', '/profile/RogerPerryLA/', '/profile/tamaramattox/', '/profile/JFKdogtownrealty/', '/profile/The-Cunningham-Group/', '/profile/TheHeatherGroup/', '/profile/EitanSeanConstine1/'], 'url': None, 'title': None, 'price': None, 'real estate provider': 'Jonathan PearsonMargret EricksonSusan & Rachael RosalesRinde PhilippeRoger PerryTamara Mattoxjeff koneckeThe Cunningham GroupHeather Shawver & Heather RogersEitan Sean Constine'}
My goal is for each item to be on its own row.
I've tried adding newline='' to the open() call (it gives an error),
and adding a delimiter to csv.DictWriter (this added individual columns for each record, not rows).
Any help would be much appreciated.
Your scraped_data should be a list of dictionaries for csv.DictWriter to work. For instance:
import csv

# mock data
scraped_data = list()
row = {
    'title': 'My Sweet Home',
    'address': '1000, Abc St.',
    'city': 'San Diego',
    'facts and features': 'Miniramp in the backyard',
    'postal_code': '000000',
    'price': 1000000,
    'real estate provider': 'Abc Real Estate',
    'state': 'CA',
    'url': 'www.mysweethome.com'
}
scraped_data.append(row)

fieldnames = [
    'title', 'address', 'city', 'state', 'postal_code', 'price',
    'facts and features', 'real estate provider', 'url'
]

# writing CSV file
with open('ogtest.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in scraped_data:
        writer.writerow(row)
Hope it helps.
Cheers
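If your scraper instead hands back one dict of parallel lists (the shape of the scraped_data shown in the question), a sketch using zip can transpose it into the list-of-dicts shape DictWriter expects, assuming every list has the same length:

```python
def transpose(columns):
    """Turn {'field': [v1, v2, ...], ...} into [{'field': v1, ...}, ...]."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]

# Hypothetical two-record scrape result for illustration.
scraped = {'address': ['1 Abc St.', '2 Def Ave.'],
           'city': ['San Diego', 'LA']}
rows = transpose(scraped)
```

Each dict in rows can then be passed straight to writer.writerow().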
I am trying to write dictionary data into a CSV file.
Keys:
['file_name', 'candidate_skills', 'SF_name', 'RB_name', 'mb_number', 'email']
Dictionary
{'file_name': 'Aarti Banarashi.docx', 'candidate_skills': ['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS'], 'SF_name': None, 'RB_name': 'aarti banarashi\t\t\t', 'mb_number': ['+918108493333'], 'email': 'aartisingh271294#gmail.com'}
I was expecting each dictionary to be written on one row, with each value in a new column:
'file_name' 'candidate_skills' 'SF_name' 'RB_name' 'mb_number' 'email'
Instead I am getting results like this, in a single column only:
file_name,Aarti Banarashi.docx
candidate_skills,"['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS']"
SF_name,
RB_name,aarti banarashi
mb_number,['+918108493333']
email,aartisingh271294#gmail.com
Can you please help me write it in the correct manner? Also, when I add new records, they should get appended.
My code:
with open('dict.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for key, value in res.items():
        writer.writerow([key, value])
Expected output: one row per dictionary, with a column for each key.
Once you are working with tables, I recommend pandas. Here is a pandas solution:
d = {'file_name': 'Aarti Banarashi.docx', 'candidate_skills': ['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS'], 'SF_name': None, 'RB_name': 'aarti banarashi\t\t\t', 'mb_number': ['+918108493333'], 'email': 'aartisingh271294#gmail.com'}
import pandas as pd
df = pd.DataFrame.from_dict(d, orient='index').T
df.to_csv("output.csv",index=False)
Output:
file_name,candidate_skills,SF_name,RB_name,mb_number,email
Aarti Banarashi.docx,"['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS']",,aarti banarashi ,['+918108493333'],aartisingh271294#gmail.com
Your script was iterating over each key-value pair in your dictionary and calling writerow() for each pair. writerow() writes a single row, so calling it once per pair gives you one row per pair.
res only contains data for a single row in your CSV file. With csv.DictWriter, a single call to writerow() converts all the dictionary entries into a single output row:
import csv

res = {'file_name': 'Aarti Banarashi.docx', 'candidate_skills': ['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS'], 'SF_name': None, 'RB_name': 'aarti banarashi\t\t\t', 'mb_number': ['+918108493333'], 'email': 'aartisingh271294#gmail.com'}
fieldnames = ['file_name', 'candidate_skills', 'SF_name', 'RB_name', 'mb_number', 'email']

with open('dict.csv', 'w', newline='') as f_file:  # on Python 2, use 'wb' instead
    csv_writer = csv.DictWriter(f_file, fieldnames=fieldnames)
    csv_writer.writeheader()
    csv_writer.writerow(res)
Giving you an output dict.csv file as:
file_name,candidate_skills,SF_name,RB_name,mb_number,email
Aarti Banarashi.docx,"['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS']",,aarti banarashi ,['+918108493333'],aartisingh271294#gmail.com
Explicitly passing fieldnames forces the ordering of the columns in the output to what you provide. If the ordering is not important, you can instead use fieldnames=res.keys()
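For the append requirement mentioned in the question, a sketch (the helper name append_row is hypothetical) that opens the file in append mode and writes the header only when the file is new or empty:

```python
import csv
import os

def append_row(path, fieldnames, row):
    """Append one dict as a CSV row, writing the header only once."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if need_header:
            writer.writeheader()
        writer.writerow(row)
```

Each subsequent call adds a new record below the existing ones without repeating the header.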