Write key to separate csv based on value in dictionary - python

[Using Python 3] I have a CSV file with two columns, an email address and a country code (the script forces the data into two columns if the original file isn't already laid out that way, more or less). I want to split the rows by the value in the second column and write them to separate CSV files.
eppetj#desrfpkwpwmhdc.com us ==> output-us.csv
uheuyvhy#zyetccm.com de ==> output-de.csv
avpxhbdt#reywimmujbwm.com es ==> output-es.csv
gqcottyqmy#romeajpui.com it ==> output-it.csv
qscar#tpcptkfuaiod.com fr ==> output-fr.csv
qshxvlngi#oxnzjbdpvlwaem.com gb ==> output-gb.csv
vztybzbxqq#gahvg.com us ==> output-us.csv
... ... ...
Currently my code almost does this, but instead of adding each email address to the CSV, it overwrites the address written before it. Can someone help me out with this?
I am very new to programming and Python and I might not have written the code in the most pythonic way, so I would really appreciate any feedback on the code in general!
Thanks in advance!
Code:
import csv

def tsv_to_dict(filename):
    """Creates a reader of a specified .tsv file."""
    with open(filename, 'r') as f:
        reader = csv.reader(f, delimiter='\t') # '\t' implies tab
        email_list = []
        # Checks each list in the reader list and removes empty elements
        for lst in reader:
            email_list.append([elem for elem in lst if elem != '']) # List comprehension
        # Stores the list of lists as a dict
        email_dict = dict(email_list)
    return email_dict

def count_keys(dictionary):
    """Counts the number of entries in a dictionary."""
    return len(dictionary.keys())

def clean_dict(dictionary):
    """Removes all whitespace in keys from specified dictionary."""
    return { k.strip():v for k,v in dictionary.items() } # Dictionary comprehension

def split_emails(dictionary):
    """Splits out all email addresses from dictionary into output csv files by country code."""
    # Creating a list of unique country codes
    cc_list = []
    for v in dictionary.values():
        if not v in cc_list:
            cc_list.append(v)
    # Writing the email addresses to a csv based on the cc (value) in dictionary
    for key, value in dictionary.items():
        for c in cc_list:
            if c == value:
                with open('output-' +str(c) +'.csv', 'w') as f_out:
                    writer = csv.writer(f_out, lineterminator='\r\n')
                    writer.writerow([key])

You can simplify this a lot by using a defaultdict:
import csv
from collections import defaultdict

emails = defaultdict(list)
with open('email.tsv','r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if row:
            if '#' in row[0]:
                emails[row[1].strip()].append(row[0].strip()+'\n')

for key,values in emails.items():
    with open('output-{}.csv'.format(key), 'w') as f:
        f.writelines(values)
Since your output files are single columns rather than comma-separated rows, you don't need the csv module for writing and can simply write the lines.
The emails dictionary has a key for each country code, and each value is a list of the matching email addresses. To make sure the email addresses are written correctly, we strip any whitespace and add a line break (so that we can use writelines later).
Once the dictionary is populated, it's simply a matter of stepping through the keys to create the files and writing out each list.
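The grouping step described above can be tried in memory before any files are involved. This is only a minimal sketch: the `group_by_country` helper and the sample rows (with placeholder example.com addresses) are made up for illustration and are not part of the answer.

```python
from collections import defaultdict

def group_by_country(rows):
    """Group email addresses by country code (hypothetical helper)."""
    emails = defaultdict(list)
    for row in rows:
        # Skip empty or incomplete rows instead of failing on them
        if len(row) >= 2:
            emails[row[1].strip()].append(row[0].strip())
    return dict(emails)

# Illustrative rows, shaped like what csv.reader yields for a two-column file
sample_rows = [
    ["a@example.com ", "us"],
    ["b@example.com", "de"],
    [],
    ["c@example.com", " us"],
]
grouped = group_by_country(sample_rows)
print(grouped)
```

Each key ends up holding every address for that country, so each output file only needs to be opened once when the lists are written out.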

The problem with your code is that it reopens the country's output file in 'w' mode every time it writes an entry, which truncates the file and overwrites whatever was already there.
A simple way to avoid that is to open all the output files once for writing and store them in a dictionary keyed by country code. Likewise, you can keep a second dictionary that maps each country code to a csv.writer object for that country's output file.
Update: While I agree that Burhan's approach is probably superior, I suspect my earlier answer looked excessively long because of all the comments it had, so here's another version of essentially the same logic with minimal comments, to let you better judge its reasonably short true length (even with the contextmanager).
import csv
from contextlib import contextmanager

@contextmanager  # to manage simultaneous opening and closing of output files
def open_country_csv_files(countries):
    csv_files = {country: open('output-'+country+'.csv', 'w')
                 for country in countries}
    try:
        yield csv_files
    finally:  # close every file even if the body raises
        for f in csv_files.values():
            f.close()

with open('email.tsv', 'r') as f:
    email_dict = {row[0]: row[1] for row in csv.reader(f, delimiter='\t') if row}

countries = set(email_dict.values())
with open_country_csv_files(countries) as csv_files:
    csv_writers = {country: csv.writer(csv_files[country], lineterminator='\r\n')
                   for country in countries}
    for email_addr, country in email_dict.items():
        csv_writers[country].writerow([email_addr])
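As a possible alternative to the hand-written context manager, the standard library's contextlib.ExitStack can keep several files open and close them all together. The sketch below is a swapped-in variation, not the code from this answer; the `split_by_country` function, the sample rows and the temporary directory are assumptions made so the example can run anywhere.

```python
import csv
import tempfile
from contextlib import ExitStack
from pathlib import Path

def split_by_country(rows, out_dir):
    """Write each email to output-<country>.csv, opening every file only once."""
    out_dir = Path(out_dir)
    with ExitStack() as stack:
        writers = {}
        for email, country in rows:
            if country not in writers:
                # enter_context registers the file so ExitStack closes it on exit
                f = stack.enter_context(
                    open(out_dir / f"output-{country}.csv", "w", newline=""))
                writers[country] = csv.writer(f)
            writers[country].writerow([email])

# Illustrative data; real rows would come from the TSV file
sample_rows = [("a@example.com", "us"), ("b@example.com", "de"), ("c@example.com", "us")]
with tempfile.TemporaryDirectory() as tmp:
    split_by_country(sample_rows, tmp)
    us_lines = (Path(tmp) / "output-us.csv").read_text().splitlines()
    de_lines = (Path(tmp) / "output-de.csv").read_text().splitlines()
print(us_lines, de_lines)
```

Files are opened lazily the first time a country code is seen, so the set of countries does not have to be computed in advance.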

Not a Python answer, but maybe you can use this Bash solution.
$ while read email country
do
    echo "$email" >> "output-$country.csv"
done < in.csv
This reads the lines from in.csv, splits them into two parts email and country, and appends (>>) the email to the file called output-$country.csv.

Related

Same python code block gives different outputs at different time

I want to create a word dictionary. The dictionary looks like
words_meanings = {
    "rekindle": "relight",
    "pesky": "annoying",
    "verge": "border",
    "maneuver": "activity",
    "accountability": "responsibility",
}
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
Output: rekindle , pesky, verge, maneuver, accountability
Here rekindle, pesky, verge, maneuver and accountability are the keys, and relight, annoying, border, activity and responsibility are the values.
Now I want to create a csv file and my code will take input from the file.
The file looks like
rekindle | pesky | verge | maneuver | accountability
relight | annoying| border| activity | responsibility
So far I use this code to load the file and read data from it.
from google.colab import files
uploaded = files.upload()
import pandas as pd
data = pd.read_csv("words.csv")
data.head()
import csv
reader = csv.DictReader(open("words.csv", 'r'))
words_meanings = []
for line in reader:
    words_meanings.append(line)
print(words_meanings)
This is the output of print(words_meanings)
[OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
It looks very odd to me.
keys_letter=[]
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
Next I create an empty list and want to append only the keys. But the output is [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
I am confused. In the first code block the list contained only keys, but now it contains both the keys and their values. How can I fix this?
I would suggest formatting your CSV with each key and its value on the same row, like this:
rekindle,relight
pesky,annoying
verge,border
That way the following code will work:
file_name = "words.csv"
words_meanings = {}
with open(file_name, 'r') as file:
    for line in file:
        key, value = line.split(",")
        words_meanings[key] = value.rstrip("\n")
If you want a list of the keys:
list_of_keys = list(words_meanings.keys())
To add keys and values to the file:
def add_values(key: str, value: str, file_name: str):
    with open(file_name, 'a') as file:
        file.write(f"\n{key},{value}")

key = input("Input the key you want to save: ")
value = input(f"Input the value you want to save to {key}: ")
add_values(key, value, file_name)
You run the same block of code on different objects, and that gives different results.
First you use a normal dictionary (check type(words_meanings)):
words_meanings = {
    "rekindle": "relight",
    "pesky": "annoying",
    "verge": "border",
    "maneuver": "activity",
    "accountability": "responsibility",
}
and the for-loop gives you the keys of this dictionary.
You could get the same with
keys_letter = list(words_meanings.keys())
or even
keys_letter = list(words_meanings)
Later you use a list with a single dictionary inside it (check type(words_meanings)):
words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
and the for-loop gives you the elements of this list, not the keys of the dictionary inside it. So you copy the whole dictionary from one list to another.
You could get the same with
keys_letter = words_meanings.copy()
or equivalently
keys_letter = list(words_meanings)
from collections import OrderedDict

words_meanings = {
    "rekindle": "relight",
    "pesky": "annoying",
    "verge": "border",
    "maneuver": "activity",
    "accountability": "responsibility",
}
print(type(words_meanings))
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
#keys_letter = list(words_meanings.keys())
keys_letter = list(words_meanings)
print(keys_letter)

words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
print(type(words_meanings))
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
#keys_letter = words_meanings.copy()
keys_letter = list(words_meanings)
print(keys_letter)
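If the goal is still a flat list of keys once the data is a list of row dictionaries, one option is to loop over the rows and then over each row's keys. This is just a sketch; the `rows` list below stands in for what csv.DictReader produced in the question (without the BOM character), and the name is illustrative.

```python
# Stand-in for the list of row dictionaries built from csv.DictReader
rows = [{"rekindle": "relight", "pesky": "annoying"}]

# Outer loop walks the list, inner loop walks the keys of each dictionary
keys_letter = [key for row in rows for key in row]
print(keys_letter)

# With exactly one row, asking that row for its keys gives the same list
first_row_keys = list(rows[0])
print(first_row_keys)
```

With several rows that share the same header, the comprehension repeats the keys once per row, so taking the keys of a single row is usually enough.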
The default field separator for the csv module is a comma. Your CSV file uses the pipe or bar symbol |, and the fields also seem to be fixed width. So, you need to specify | as the delimiter to use when creating the CSV reader.
Also, your CSV file is encoded as Big-endian UTF-16 Unicode text (UTF-16-BE). The file contains a byte-order mark (BOM) but Python is not stripping it off, so you will notice the string '\ufeffrekindle' contains the FEFF UTF-16-BE BOM. That can be dealt with by specifying encoding='utf-16' when you open the file.
import csv
with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(f, delimiter='|', skipinitialspace=True)
    for row in reader:
        print(row)
Running this on your CSV file produces this:
{'rekindle ': 'relight ', 'pesky ': 'annoying', 'verge ': 'border', 'maneuver ': 'activity ', 'accountability': 'responsibility'}
Notice that there is trailing whitespace in the key and values. skipinitialspace=True removed the leading whitespace, but there is no option to remove the trailing whitespace. That can be fixed by exporting the CSV file from Excel without specifying a field width. If that can't be done, then it can be fixed by preprocessing the file using a generator:
import csv

def preprocess_csv(f, delimiter=','):
    # assumes that fields can not contain embedded new lines
    for line in f:
        yield delimiter.join(field.strip() for field in line.split(delimiter))

with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(preprocess_csv(f, '|'), delimiter='|', skipinitialspace=True)
    for row in reader:
        print(row)
which now outputs the stripped keys and values:
{'rekindle': 'relight', 'pesky': 'annoying', 'verge': 'border', 'maneuver': 'activity', 'accountability': 'responsibility'}
Since no one was able to give me a working answer, I'm posting my own here. I hope it helps others.
import csv
file_name = "words.csv"
words_meanings = {}
with open(file_name, newline='', encoding='utf-8-sig') as file:
    for line in file.readlines():
        key, value = line.split(",")
        # newline='' keeps "\r\n" endings intact, so strip both characters
        words_meanings[key] = value.rstrip("\r\n")
print(words_meanings)
This code loads a CSV file into a dictionary. Enjoy!!!
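To see why the utf-8-sig encoding matters here, the following sketch builds BOM-prefixed bytes in memory and decodes them both ways. The sample data and variable names are made up for illustration; it assumes one key,value pair per line, as in the code above.

```python
import csv
import io

# Bytes as an editor that writes a UTF-8 BOM would save them (sample data)
raw = "rekindle,relight\npesky,annoying\n".encode("utf-8-sig")

plain = raw.decode("utf-8")      # the BOM survives as a leading '\ufeff'
clean = raw.decode("utf-8-sig")  # the BOM is stripped during decoding

# Each row is a [key, value] pair, so dict() can consume the reader directly
words = dict(csv.reader(io.StringIO(clean)))
print(repr(plain[:9]))
print(words)
```

Letting csv.reader do the splitting also avoids stripping line endings by hand.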

Looping through a dictionary to replace multiple values in text file

I'm trying to change several hex values in a text file. I made a CSV that has the original values in one column and the new values in another.
My goal is to write a simple Python script to find old values in the text file based on the first column and replace them with new values in the second.
I'm attempting to use a dictionary, which I build by looping through the CSV, to drive the replace() calls. Building it was pretty easy, but using it to execute a replace() hasn't been working out. When I print out the values after my script runs, I'm still seeing the original ones.
I've tried reading in the text file using read() and applying the change to the whole file, like this:
import csv

filename = "origin.txt"
csv_file = 'replacements.csv'
conversion_dict = {}

# Create conversion dictionary
with open(csv_file, "r") as replace:
    reader = csv.reader(replace, delimiter=',')
    for rows in reader:
        conversion_dict.update({rows[0]:rows[1]})

#Replace values on text files based on conversion dict
with open(filename, "r") as fileobject:
    txt = str(fileobject.read())
    for keys, values, in conversion_dict.items():
        new_text = txt.replace(keys, values)
I've also tried adding the updated text to a list:
#Replace values on text files based on conversion dict
with open(filename, "r") as fileobject:
    txt = str(fileobject.read())
    for keys, values, in conversion_dict.items():
        new_text.append(txt.replace(keys, values))
Then, I tried using readlines() to replace the old values with new ones one line at a time:
# Replace values on text files based on conversion dict
with open(filename, "r") as reader:
    reader.readlines()
    type(reader)
    for line in reader:
        print(line)
        for keys, values, in conversion_dict.items():
            new_text.append(txt.replace(keys, values))
While troubleshooting, I ran a test to see if I was getting any matches between the keys in my dict and the text in the file:
for keys, values, in conversion_dict.items():
    if keys in txt:
        print("match")
    else:
        print("no match")
My output returned match on every row except the first one. I imagine with some trimming or something I could fix that. However, this proves that there are matches, so there must be some other issue with my code.
Any help is appreciated.
origin.txt:
oldVal9000,oldVal1,oldVal2,oldVal3,oldVal69
test.csv:
oldVal1,newVal1
oldVal2,newVal2
oldVal3,newVal3
oldVal4,newVal4
import csv

filename = "origin.txt"
csv_file = 'test.csv'
conversion_dict = {}
with open(csv_file, "r") as replace:
    reader = csv.reader(replace, delimiter=',')
    for rows in reader:
        conversion_dict.update({rows[0]:rows[1]})

f = open(filename,'r')
txt = str(f.read())
f.close()

txt = txt.split(',')  # not sure what your origin.txt actually looks like, assuming comma-separated values
for i in range(len(txt)):
    if txt[i] in conversion_dict:
        txt[i] = conversion_dict[txt[i]]

with open(filename, "w") as outfile:
    outfile.write(",".join(txt))
modified origin.txt:
oldVal9000,newVal1,newVal2,newVal3,oldVal69
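If origin.txt is not cleanly comma-separated, another option is to replace every key in a single pass with a regular expression. This is a sketch of a different technique from the split-based code above; the `replace_all` helper is a hypothetical name, and it matches keys anywhere in the text, including inside longer tokens.

```python
import re

def replace_all(text, mapping):
    """Replace every key of mapping found in text with its value, in one pass."""
    if not mapping:
        return text
    # Longest keys first, so a short key never shadows a longer one at the same spot
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

# Sample data taken from origin.txt and test.csv above
conversion_dict = {
    "oldVal1": "newVal1",
    "oldVal2": "newVal2",
    "oldVal3": "newVal3",
    "oldVal4": "newVal4",
}
new_text = replace_all("oldVal9000,oldVal1,oldVal2,oldVal3,oldVal69", conversion_dict)
print(new_text)
```

Because every replacement happens in one pass over the original text, the result of one substitution is never substituted again, which is what goes wrong when each replace() call starts from the unmodified txt.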

Need to print CSV output into separate rows in Python, instead of one long string

I am trying to print the output of a webscrape project into a CSV file.
So for example I have this list of supplier names under a list called SUPP_NAME: (just an example, the actual list has 50 items inside it)
['"FULIAN\\u0020\\u0028M\\u0029\\u0020SENDIRIAN\\u0020BERHAD"', '"RISO\\u0020SEKKEN\\u0020SDN.\\u0020BHD."', '"NATURE\\u0020PROFUSION\\u0020SDN.\\u0020BHD."']
and a list of numbers indicating years, called SUPP_YEAR:
['"9"', '"4"', '"1"', '"1"']
My plan is to put them into a CSV, and then read them back in as a pandas dataframe, then perform decoding to get a bunch of values.
Code so far:
import csv
with open('output3.csv' , 'w') as f:
    writer = csv.writer(f)
    headers = "Supplier_name,Years\n"
    f.write(headers)
    supp_names = re.findall(r'("supplierName"):("\w+.+")', results[17].text)
    supp_years = re.findall(r'("supplierYear"):("\d+")', results[17].text)
    SUPP_NAME = []
    for title, name in supp_names:
        print (name)
        SUPP_NAME.append(name)
        #f.write(name + "\n")
    SUPP_YEAR = []
    for year,number in supp_years:
        print (number)
        SUPP_YEAR.append(number)
        #f.write(number + "\n")
    writer.writerow([SUPP_NAME, SUPP_YEAR])
However, what I get is that under the Supplier_name and Years columns, one cell under each of these 2 columns is filled with a LONG list of items still contained in the list, instead of the items separated one by one.
What am I doing wrong? Thanks in advance for answering.
The two re.findall() calls each give you a list of items (hopefully both the same length). The idea is then to take one element from each list and write the pair to your output file. Python has a useful function called zip() to do this: you give it both of your lists, and the loop gives you one item from each on every iteration. Because each pattern has two capture groups, every item is a (label, value) tuple, so the loop unpacks the value from each:
import csv
import re

with open('output3.csv', 'w', newline='') as f_output:
    writer = csv.writer(f_output)
    writer.writerow(["Supplier_name", "Years"])
    # results comes from the scraping code in the question
    supp_names = re.findall(r'("supplierName"):("\w+.+")', results[17].text)
    supp_years = re.findall(r'("supplierYear"):("\d+")', results[17].text)
    # each match is a (label, value) tuple; keep only the value
    for (_, name), (_, year) in zip(supp_names, supp_years):
        writer.writerow([name, year])
The csv.writer() object is designed to take a list of items and write them to your file with the desired (i.e. comma) delimiter automatically added between them.
I assume you are using Python 3.x? If you are on Python 2 instead, open the file like this:
with open('output3.csv', 'wb') as f_output:
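Here is a small self-contained sketch of the zip() idea, using two of the sample values from the question and an in-memory buffer instead of a file. It also decodes the escaped names with json.loads, which assumes the scraped values are valid JSON string literals; that step is an addition for illustration and is not part of the answer above.

```python
import csv
import io
import json

# Sample values copied from the question (already extracted from the page)
supp_names = [
    '"FULIAN\\u0020\\u0028M\\u0029\\u0020SENDIRIAN\\u0020BERHAD"',
    '"RISO\\u0020SEKKEN\\u0020SDN.\\u0020BHD."',
]
supp_years = ['"9"', '"4"']

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer)
writer.writerow(["Supplier_name", "Years"])
for name, year in zip(supp_names, supp_years):
    # json.loads strips the surrounding quotes and decodes the \uXXXX escapes
    writer.writerow([json.loads(name), json.loads(year)])

output = buffer.getvalue()
print(output)
```

Each call to writerow() receives one name and one year, so every supplier ends up on its own row.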

Python merge csv files with matching Index

I want to merge two CSV files based on a field
The 1st one looks like this:
ID, field1, field2
1,a,green
2,b,white
2,b,red
2,b,blue
3,c,black
The second one looks like:
ID, field3
1,value1
2,value2
What I want to have is:
ID, field1, field2,field3
1,a,green,value1
2,b,white,value2
2,b,red,value2
2,b,blue,value2
3,c,black,''
I'm using pydev on eclipse
import csv
endings0=[]
endings1=[]
with open("salaries.csv") as book0:
    for line in book0:
        endings0.append(line.split(',')[-1])
        endings1.append(line.split(',')[0])
linecounter=0
res = open("result.csv","w")
with open('total.csv') as book2:
    for line in book2:
        # if not header line:
        l=line.split(',')[0]
        for linecounter in range(0,endings1.__len__()):
            if( l == endings1[linecounter]):
                res.writelines(line.replace("\n","") +','+str(endings0[linecounter]))
print("done")
There are a number of things wrong with what you're doing:
You should really be using the classes in the csv module to read and write CSV files. Importing the module isn't enough; you actually need to call its functions.
You should never find yourself typing endings1.__len__(). Use len(endings1) instead.
You should never find yourself typing for linecounter in range(0, len(endings1)). Use either for linecounter, _ in enumerate(endings1), or better yet for end0, end1 in zip(endings0, endings1).
A dictionary is a much better data structure for lookup than a pair of parallel lists. To quote Rob Pike:
If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.
Here's how I'd do it:
import csv

with open('second.csv', newline='') as f:
    # look, a builtin to read csv file lines as dictionaries!
    # skipinitialspace handles the space after the comma in "ID, field3"
    reader = csv.DictReader(f, skipinitialspace=True)
    # build a mapping of id to field3
    id_to_field3 = {row['ID']: row['field3'] for row in reader}

# you can put more than one open inside a with statement
with open('first.csv', newline='') as f, open('result.csv', 'w', newline='') as fo:
    # csv even has a class to write files!
    reader = csv.DictReader(f, skipinitialspace=True)
    res = csv.DictWriter(fo, fieldnames=reader.fieldnames + ['field3'])
    res.writeheader()
    for row in reader:
        # .get returns its second argument if there was no match
        row['field3'] = id_to_field3.get(row['ID'], '')
        res.writerow(row)
I have a high-level solution for you.
Deserialize your first CSV into dict1, mapping each ID to a list of rows, where each row is a list containing field1 and field2 (an ID can appear on several rows).
Deserialize your second CSV into dict2, mapping ID to field3.
For each (id, rows) in dict1, append dict2.setdefault(id, '') to every row in rows. Then serialize the result back into CSV using whatever serializer you were using before.
I used the dictionary's setdefault because I noticed that ID 3 is in the first CSV file but not the second.
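The high-level steps above might look roughly like this in Python. It is a sketch under a few assumptions: the sample data is held in strings rather than files, the headers have no spaces after the commas, and dict.get is used for the lookup in place of setdefault so that dict2 is left unmodified.

```python
import csv
import io
from collections import defaultdict

# Sample data from the question, with the header spacing normalised
first_csv = "ID,field1,field2\n1,a,green\n2,b,white\n2,b,red\n2,b,blue\n3,c,black\n"
second_csv = "ID,field3\n1,value1\n2,value2\n"

# dict1: ID -> list of [field1, field2] rows (an ID may repeat)
dict1 = defaultdict(list)
for row in csv.DictReader(io.StringIO(first_csv)):
    dict1[row["ID"]].append([row["field1"], row["field2"]])

# dict2: ID -> field3
dict2 = {row["ID"]: row["field3"] for row in csv.DictReader(io.StringIO(second_csv))}

# Append field3 (or '' when the ID is missing from dict2) to every row
merged = []
for id_, rows in dict1.items():
    for fields in rows:
        merged.append([id_, *fields, dict2.get(id_, "")])

for line in merged:
    print(",".join(line))
```

The rows in `merged` can then be handed to csv.writer to produce the result file.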

How do I add new items to a dictionary while in a loop?

I'm writing a program that reads names and statistics related to those names from a file. Each line of the file is another person and their stats. For each person, I'd like to make their last name a key and everything else linked to that key in the dictionary. The program first stores data from the file in an array and then I'm trying to get those array elements into the dictionary, but I'm not sure how to do that. Plus I'm not sure if each time the for loop iterates, it will overwrite the previous contents of the dictionary. Here's the code I'm using to attempt this:
f = open("people.in", "r")
tmp = None
people
l = f.readline()
while l:
    tmp = l.split(',')
    print tmp
    people = {tmp[2] : tmp[0])
    l = f.readline()
people['Smith']
The error I'm currently getting is that the syntax is incorrect, however I have no idea how to transfer the array elements into the dictionary other than like this.
Use key assignment:
people = {}
for line in f:
    tmp = line.rstrip('\n').split(',')
    people[tmp[2]] = tmp[0]
This loops over the file object directly, no need for .readline() calls here, and removes the newline.
You appear to have CSV data; you could also use the csv module here:
import csv
people = {}
# "rb" is the Python 2 idiom; on Python 3 use open("people.in", newline="")
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        people[row[2]] = row[0]
or even a dict comprehension:
import csv
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    people = {r[2]: r[0] for r in reader}
Here the csv module takes care of the splitting and removing newlines.
The syntax error stems from trying to close the opening { with a ) instead of a }:
people = {tmp[2] : tmp[0]) # should be }
If you need to collect multiple entries per row[2] value, collect these in a list; a collections.defaultdict instance makes that easier:
import csv
from collections import defaultdict

people = defaultdict(list)
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        people[row[2]].append(row[0])
In response to Generalkidd's comment above about multiple people with the same last name, here is an addition to Martijn Pieters' solution, posted as an answer for better formatting:
import csv
people = {}
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        if row[2] not in people:
            people[row[2]] = list()
        people[row[2]].append(row[0])
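The membership check and the append can also be folded into one call with dict.setdefault. This is a Python 3 sketch; the sample lines and their column layout (first name, a statistic, last name) are assumptions, since the real people.in file isn't shown.

```python
import csv
import io

# Hypothetical file contents: first name, a statistic, last name
sample = "John,30,Smith\nJane,25,Smith\nBob,41,Jones\n"

people = {}
for row in csv.reader(io.StringIO(sample)):
    # setdefault returns the existing list, or stores and returns a new one
    people.setdefault(row[2], []).append(row[0])

print(people["Smith"])
```

Every last name maps to a list, so later entries extend the list instead of overwriting the earlier ones.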
