Extracting data from csv - python

I have a csv file with each row containing lists of adjectives.
For example, the first 2 rows are as follows:
["happy","sad","colorful"]
["horrible","sad","cheerful","happy"]
I want to extract all the data from this file to get a list containing each adjective only once.
(Here, it would be the following list:
["happy","sad","colorful","horrible","cheerful"].)
I am doing this using Python.
import csv

with open('adj.csv', 'r') as f:
    reader = csv.reader(f)
    adj_list = list(reader)

filtered_list = []
for l in adj_list:
    if l not in filtered_list:
        filtered_list.append(l)

Supposing that "memory is not important" and that a one-liner is what you are looking for:
from itertools import chain
from csv import reader
print(list(set(chain(*reader(open('file.csv'))))))
having 'file.csv' content like this:
happy, sad, colorful
horrible, sad, cheerful, happy
OUTPUT:
['horrible', ' colorful', ' sad', ' cheerful', ' happy', 'happy']
You can remove the list() part if you don't mind receiving a Python set instead of a list.
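Note that the output above contains both 'happy' and ' happy': the spaces after the commas in the file make them distinct set entries. A minimal variant that strips the whitespace first (same 'file.csv' assumed):
from itertools import chain
from csv import reader

with open('file.csv') as f:
    # strip surrounding whitespace so ' happy' and 'happy' collapse into one entry
    print({word.strip() for word in chain(*reader(f))})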

Assuming you are only interested in a list of unique words where order does not matter:
# Option A1
import csv

with open("adj.csv", "r") as f:
    seen = set()
    reader = csv.reader(f)
    for line in reader:
        for word in line:
            seen.add(word)

list(seen)
# ['cheerful', 'colorful', 'horrible', 'happy', 'sad']
More concisely:
# Option A2
with open("adj.csv", "r") as f:
    reader = csv.reader(f)
    unique_words = {word for line in reader for word in line}

list(unique_words)
The with statement safely opens and closes the file. We simply add every word to a set, then cast the result to a list and get a list of unique (unordered) words.
Alternatives
If order does matter, implement the unique_everseen itertools recipe.
From itertools recipes:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
You can manually implement this or install a third-party library that implements it for you, such as more_itertools, e.g. pip install more_itertools
# Option B
import csv
import more_itertools as mit

with open("adj.csv", "r") as f:
    reader = csv.reader(f)
    words = (word for line in reader for word in line)
    unique_words = list(mit.unique_everseen(words))

unique_words
# ['happy', 'sad', 'colorful', 'horrible', 'cheerful']

Related

How to remove duplicate text from two different files in python

My problem: I have two files, "text1.txt" and "text2.txt"
"Text1.txt" contains the following:
Banana, rotten
Apple, ripe
Cheese, fresh
and "Text2.txt" contains the following:
Banana, good
Dragon, edible
Cheese, nice
What I want is to write code that checks text2.txt against text1.txt and removes every line whose word before the comma repeats. So, in this case, "Text1.txt" would change to the following, and "Text2.txt" would be left unchanged:
Apple, ripe
What I managed to do is check whether the words are duplicates, ignoring the comma, though I struggled even with that. My attempt is below:
New_food = open("text1.txt", "r+")
All_food = open("text2.txt")
food = New_food.readlines()
food2 = All_food.readlines()

# The following calculates how many lines the text file has
def file_len(fname):
    with open(fname) as s:
        for t, l in enumerate(s):
            pass
    return t + 1

# calculates line number
n = file_len("text1.txt")
m = file_len("text2.txt")

for g in range(n):
    food_r = food[g]
    for j in range(m):
        food2_r = food2[j]
        if food_r == food2_r:
            print(5)  # only when they match
I managed to split the line at the comma using this piece of code:
word = "cheese , fresh"
type_, *vals = word.split(',')
print(type_) #this would return cheese
I rewrote some of your code into the following script:
file1 = open("text1.txt", "r+")
file2 = open("text2.txt")

# List from files
food_list_1 = file1.readlines()
food_list_2 = file2.readlines()

# Unique food values in list
file_2_only_foods = list()
for line in food_list_2:
    file_2_only_foods.append(line.split(',')[0])

def determine(x):
    food_type = x.split(',')[0]
    return food_type in file_2_only_foods

result = [x for x in food_list_1 if not determine(x)]

file1.close()
file1 = open("text1.txt", 'w')
file1.writelines(result)
This will put all the first words from file 2 into the file_2_only_foods list, which is used to check whether the values from list 1 are unique or not.
In order to write the file, we have to close the previous file and then reopen it to write the results. The result from my code is exactly what you described.
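As a side note, the same logic can be written with with blocks so that both files are closed automatically; a minimal sketch under the same file names:
# Sketch: same filtering, but with context managers (assumes text1.txt / text2.txt)
with open("text2.txt") as file2:
    file_2_only_foods = {line.split(',')[0] for line in file2}

with open("text1.txt") as file1:
    result = [line for line in file1 if line.split(',')[0] not in file_2_only_foods]

with open("text1.txt", "w") as file1:
    file1.writelines(result)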
If there are no duplicates within each file, you could go through both files and add all the first words to a Counter (https://docs.python.org/2/library/collections.html), and then on a second pass remove all elements that have a count larger than 1.
>>> from collections import Counter
>>> food1 = open("Text1.txt")
>>> food2 = open("Text2.txt")
>>> counter1 = Counter(item.split(",")[0] for item in food1.readlines())
>>> counter2 = Counter(item.split(",")[0] for item in food2.readlines())
>>> counter = counter1 + counter2
>>> counter
Counter({'Cheese': 2, 'Banana': 2, 'Apple': 1, 'Dragon': 1})
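The second pass the answer describes might look like this (a sketch, keeping only the lines of Text1.txt whose first word appears exactly once across both files):
>>> food1 = open("Text1.txt")  # reopen, since readlines() consumed the file
>>> [line for line in food1 if counter[line.split(",")[0]] == 1]
['Apple, ripe\n']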
You can use regular expressions to extract the words from the text. Regular expressions reference: https://docs.python.org/3/library/re.html
You can extract all the first words from a file with this one-liner:
re.findall(r"^\s*(\w+)", file.read(), re.MULTILINE)
Demo:
>>> import re
>>> txt = """
... Banana, rotten
... Apple, ripe
... Cheese, fresh
... """
>>>
>>> re.findall(r"^\s*(\w+)", txt, re.MULTILINE)
['Banana', 'Apple', 'Cheese']
>>>
The function below extracts all the words to filter on, then efficiently filters the target file line by line.
>>> def filter_lines(filter_path, target_path, output_path):
...     with open(filter_path, 'r') as filter_file, \
...          open(target_path, 'r') as target_file, \
...          open(output_path, 'w+') as output_file:
...
...         filter_words = re.findall(r"^\s*(\w+)",
...                                   filter_file.read(),
...                                   re.MULTILINE)
...         filter_words = set(filter_words)
...
...         for line in target_file:
...             m = re.findall(r"^\s*(\w+)", line)
...             if not (m and m[0] in filter_words):
...                 output_file.write(line)
...
>>> filter_lines('text2.txt', 'text1.txt', 'filtered_text1.txt')
>>>
Side note: generally, when you need to maintain a large collection of items that is checked for membership in expressions like if item in long_list:, a set is much better than a list. Set lookups are fast (hash-based), while list lookups iterate over all the items until the target is found.
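A quick illustration of the difference (hypothetical sizes, Python 3):
import timeit

big_list = list(range(100000))
big_set = set(big_list)

# membership test scans the whole list, but hashes straight into the set
print(timeit.timeit('99999 in big_list', globals=globals(), number=100))
print(timeit.timeit('99999 in big_set', globals=globals(), number=100))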

Making python dictionary from a text file with multiple keys

I have a text file named file.txt with some numbers like the following :
1 79 8.106E-08 2.052E-08 3.837E-08
1 80 -4.766E-09 9.003E-08 4.812E-07
1 90 4.914E-08 1.563E-07 5.193E-07
2 2 9.254E-07 5.166E-06 9.723E-06
2 3 1.366E-06 -5.184E-06 7.580E-06
2 4 2.966E-06 5.979E-07 9.702E-08
2 5 5.254E-07 0.166E-02 9.723E-06
3 23 1.366E-06 -5.184E-03 7.580E-06
3 24 3.244E-03 5.239E-04 9.002E-08
I want to build a Python dictionary, where the first number in each row is the key, the second number is always ignored, and the last three numbers are put as values. But in a dictionary, a key cannot be repeated, so when I write my code (attached at the end of the question), what I get is
'1' : [ '90' '4.914E-08' '1.563E-07' '5.193E-07' ]
'2' : [ '5' '5.254E-07' '0.166E-02' '9.723E-06' ]
'3' : [ '24' '3.244E-03' '5.239E-04' '9.002E-08' ]
All the other numbers are removed, and only the last row is kept as the values. What I need is to have all the numbers against a key, say 1, to be appended in the dictionary. For example, what I need is :
'1' : ['8.106E-08' '2.052E-08' '3.837E-08' '-4.766E-09' '9.003E-08' '4.812E-07' '4.914E-08' '1.563E-07' '5.193E-07']
Is it possible to do it elegantly in python? The code I have right now is the following :
diction = {}
with open("file.txt") as f:
    for line in f:
        pa = line.split()
        diction[pa[0]] = pa[1:]
The same overwriting happens with the equivalent dict comprehension:
with open('file.txt') as f:
    diction = {pa[0]: pa[1:] for pa in map(str.split, f)}
You can use a defaultdict.
from collections import defaultdict

data = defaultdict(list)
with open("file.txt", "r") as f:
    for line in f:
        line = line.split()
        data[line[0]].extend(line[2:])
Try this:
from collections import defaultdict

diction = defaultdict(list)
with open("file.txt") as f:
    for line in f:
        key, _, *values = line.strip().split()
        diction[key].extend(values)

print(diction)
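With the sample file above, the entry for key '1' collects all nine values:
print(diction['1'])
# ['8.106E-08', '2.052E-08', '3.837E-08', '-4.766E-09', '9.003E-08', '4.812E-07', '4.914E-08', '1.563E-07', '5.193E-07']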
This is a solution for Python 3, because the statement a, *b = tuple1 is invalid in Python 2. Look at the solution of #cha0site if you are using Python 2.
Make the value of each key in diction a list, and extend that list on each iteration. With your code as written, when you say diction[pa[0]] = pa[1:] you're overwriting the value of diction[pa[0]] each time the key appears, which explains the behavior you're seeing.
with open("file.txt") as f:
for line in f:
pa = line.split()
try:
diction[pa[0]].extend(pa[1:])
except KeyError:
diction[pa[0]] = pa[1:]
In this code each value of diction will be a list. In each iteration if the key exists that list will be extended with new values from pa giving you a list of all the values for each key.
To do this in a very simple for loop:
with open('file.txt') as f:
    return_dict = {}
    for item_list in map(str.split, f):
        if item_list[0] not in return_dict:
            return_dict[item_list[0]] = []
        return_dict[item_list[0]].extend(item_list[1:])
Or, if you wanted to use defaultdict in a one-liner-ish way:
from collections import defaultdict

with open('file.txt') as f:
    return_dict = defaultdict(list)
    [return_dict[item_list[0]].extend(item_list[1:]) for item_list in map(str.split, f)]

Python list write to CSV without the square brackets

I have this main function:
def main():
    subprocess.call("cls", shell=True)
    ipList,hostList,manfList,masterList,temp = [],[],[],[],[]
    ipList,hostList,manfList, = getIPs(),getHosts(),getManfs()
    entries = len(hostList)
    i = 0
    for i in xrange(i, entries):
        temp = [[hostList[i]],[manfList[i]],[ipList[i]]]
        masterList.append(temp)
    with open("output.csv", "wb") as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerows(masterList)
My current output is that it successfully writes to CSV but my objective is to remove the square brackets.
I tried using the .join() method, however I understand that it only takes a single list, not nested lists.
How can I achieve this given that I'm using a 3 dimensional list? Note, I intend to add more columns of data in the future.
Edit:
My current output for 1 row is similar to:
['Name1,'] ['Brand,'] ['1.1.1.1,']
I would like it to be:
Name1, Brand, 1.1.1.1,
Remove the brackets around the values in temp while creating masterList, because otherwise it becomes a nested list. The code should be:
def main():
    subprocess.call("cls", shell=True)
    ipList,hostList,manfList,masterList,temp = [],[],[],[],[]
    ipList,hostList,manfList, = getIPs(),getHosts(),getManfs()
    entries = len(hostList)
    i = 0
    for i in xrange(i, entries):
        temp = [hostList[i], manfList[i], ipList[i]]
        masterList.append(temp)
    with open("output.csv", "wb") as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerows(masterList)
What you could do is strip the brackets out of a string of the data, maybe:
import string
writer.writerows(str(masterList).translate(string.maketrans('', ''), '[]\''))
E.g.
>>> import string
>>> temp = [['1.1.1'], ['Name1'], ['123']]
>>> str(temp).translate(string.maketrans('', ''), '[]\'')
'1.1.1, Name1, 123'
In Python 3.6:
>>> temp = [['1.1.1'], ['Name1'], ['123']]
>>> str(temp).translate({ord('['): '', ord(']'): '', ord('\''): ''})
'1.1.1, Name1, 123'
Try to change this:
temp = [[hostList[i]],[manfList[i]],[ipList[i]]]
to this:
temp = [hostList[i],manfList[i],ipList[i]]
I agree with the answers above about the brackets removal, however if this is crucial to you for some reason, here is a function that takes a list as input and returns a flat list acceptable as a CSV row.
def output_list(masterList):
    output = []
    for item in masterList:
        if isinstance(item, list):  # if item is a list
            for i in output_list(item):  # call this function on it and append each value separately; if it contains more lists, the function calls itself again
                output.append(i)
        else:
            output.append(item)
    return output
You can use it in the line masterList.append(temp) as masterList.append(output_list(temp)), or even like this:
# in the end
with open("output.csv", "wb") as f:
    writer = csv.writer(f, delimiter=',')
    for i in masterList:
        writer.writerow(output_list(i))
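For illustration, the recursive flattening handles arbitrary nesting (hypothetical input):
>>> output_list([['Name1'], ['Brand'], [['1.1.1.1']]])
['Name1', 'Brand', '1.1.1.1']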

Python dictionary created from CSV file should merge the value (integer) whenever the key repeats

I have a file named report_data.csv that contains the following:
user,score
a,10
b,15
c,10
a,10
a,5
b,10
I am creating a dictionary from this file using this code:
with open('report_data.csv') as f:
    f.readline()  # Skip over the column titles
    mydict = dict(csv.reader(f, delimiter=','))
After running this code mydict is:
mydict = {'a':5,'b':10,'c':10}
But I want it to be:
mydict = {'a':25,'b':25,'c':10}
In other words, whenever a key that already exists in mydict is encountered while reading a line of the file, the new value in mydict associated with that key should be the sum of the old value and the integer that appears on that line of the file. How can I do this?
The most straightforward way is to use defaultdict or Counter from the very useful collections module.
from collections import Counter

summary = Counter()
with open('report_data.csv') as f:
    f.readline()
    for line in f:
        lbl, n = line.split(",")
        n = int(n)
        summary[lbl] = summary[lbl] + n
One of the most useful features of the Counter class is the most_common() method, which is absent from plain dictionaries and from defaultdict.
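For example, with the summary built above (in Python 3.7+, ties keep first-seen order):
>>> summary.most_common(2)
[('a', 25), ('b', 25)]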
This should work for you:
import csv

with open('report_data.csv') as f:
    f.readline()
    mydict = {}
    for line in csv.reader(f, delimiter=','):
        mydict[line[0]] = mydict.get(line[0], 0) + int(line[1])
Try this:
import csv

mydict = {}
with open('report_data.csv') as f:
    f.readline()
    x = csv.reader(f, delimiter=',')
    for x1 in x:
        if mydict.get(x1[0]):
            mydict[x1[0]] += int(x1[1])
        else:
            mydict[x1[0]] = int(x1[1])

print(mydict)

Create a dictionary from text file

Alright, I am trying to create a dictionary from a text file, where each key is a single lowercase character and each value is a list of the words from the file that start with that letter.
The text file contains one lowercase word per line, e.g.:
airport
bathroom
boss
bottle
elephant
Output:
words = {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e':['elephant']}
I haven't got a lot done really; I'm just confused about how to get the first character of each line, set it as the key, and append the values. I would really appreciate it if someone could help me get started.
words = {}
for line in infile:
    line = line.strip() # not sure if this line is correct
So let's examine your example:
words = {}
for line in infile:
    line = line.strip()
This looks good for a beginning. Now you want to do something with the line. Probably you'll need the first character, which you can access through line[0]:
first = line[0]
Then you want to check whether the letter is already in the dict. If not, you can add a new, empty list:
if first not in words:
    words[first] = []
Then you can append the word to that list:
words[first].append(line)
And you're done!
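Putting those pieces together, the whole thing is just a few lines (a sketch, assuming infile is an open file object):
words = {}
for line in infile:
    line = line.strip()
    if not line:        # skip blank lines so line[0] cannot fail
        continue
    first = line[0]
    if first not in words:
        words[first] = []
    words[first].append(line)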
If the lines are already sorted like in your example file, you can also make use of itertools.groupby, which is a bit more sophisticated:
from itertools import groupby
from operator import itemgetter
with open('infile.txt', 'r') as f:
    words = {k: map(str.strip, g) for k, g in groupby(f, key=itemgetter(0))}
You can also sort the lines first, which makes this method generally applicable:
groupby(sorted(f), ...)
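One caveat: in Python 3, map() returns an iterator, so the values in that comprehension would be lazy map objects rather than lists. A variant that materializes the lists (same file assumed):
with open('infile.txt') as f:
    words = {k: [w.strip() for w in g] for k, g in groupby(f, key=itemgetter(0))}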
defaultdict from the collections module is a good choice for these kind of tasks:
>>> import collections
>>> words = collections.defaultdict(list)
>>> with open('/tmp/spam.txt') as f:
... lines = [l.strip() for l in f if l.strip()]
...
>>> lines
['airport', 'bathroom', 'boss', 'bottle', 'elephant']
>>> for word in lines:
... words[word[0]].append(word)
...
>>> print words
defaultdict(<type 'list'>, {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e': ['elephant']})
