Comparing two CSV files in Python

I have a program that works with two CSV files. It looks at "testclaims" (one column, many rows) and checks whether any word in "masterlist" (one column, many rows) appears in the rows of "testclaims". If a row in "testclaims" contains any word in "masterlist", the program writes it to a new .csv file called "output". This part of the program works great.
The part that I can't figure out is how to output all the remaining rows in "testclaims" that don't contain ANY words in "masterlist" into another CSV called "output2". I would think that the last two lines of my code should do this, but they're not outputting what I want. I hope I've explained this clearly enough. Here's my code:
import csv
with open("testclaims.csv") as file1, open("masterlist.csv") as file2, \
        open("stopwords.csv") as file3, \
        open("output.csv", "wb+") as file4, open("output2.csv", "wb+") as file5:
    writer = csv.writer(file4)
    writer2 = csv.writer(file5)
    key_words = [word.strip() for word in file2.readlines()]
    stop_words = [word.strip() for word in file3.readlines()]
    internal_stop_words = [' a ', ' an ', ' and ', 'as ', ' at ', ' be ', 'ed ',
        'ers ', ' for ',
        ' he ', ' if ', ' in ', ' is ', ' it ', ' of ', ' on ', ' to ', 'her ', 'hers ',
        ' do ', ' did ', ' a ', ' b ', ' c ', ' d ', ' e ', ' f ', ' g ', ' h ', ' i ',
        ' j ', ' k ', ' l ', ' m ', 'n ', ' n', ' nc ', ' o ', ' p ', ' q ', ' r ', ' s ',
        ' t ', ' u ', ' v ', ' w ', ' x ', ' y ', 'z ', ',', '"', 'ers ', ' th ', ' gc ',
        ' so ', ' ot ', ' ft ', ' ow ', ' ir ', ' ho ', ' er ', ]
    for row in file1:
        row = row.strip()
        row = row.lower()
        for stop in stop_words:
            if stop in row:
                row = row.replace(stop," ")
        for stopword in internal_stop_words:
            if stopword in row:
                row = row.replace(stopword," ")
        for key in key_words:
            if key in row:
                writer.writerow([key, row])
            elif key not in row:
                writer2.writerow([row])
What I get in "output2" is every row of "testclaims" repeated multiple times.
For example, if "testclaims" contains this one column:
Happy
Sad
Angry
Dog
Cat
"output2" is outputting a csv that prints this one column:
Happy
Happy
Happy
Happy
Happy
Sad
Sad
Sad
Sad
Angry
Angry
Angry
Angry
Angry
Dog
Dog
Dog
Dog
Dog
Cat
Cat
Cat
Cat
Cat
And it doesn't output the same number of copies of each row either.

You have a double for loop, and the elif runs once for every keyword, so the row is written to output2 once for each keyword it doesn't contain. You only want to write it at most once per row.
You should adjust your last two lines:
for row in file1:
    ...
    for key in key_words:
        if key in row:
            writer.writerow([key, row])
    if not any(key in row for key in key_words):
        writer2.writerow([row])
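For what it's worth, the same idea can be sketched as a small self-contained function. This is only an illustration under some assumptions: the name split_rows and the sample data are made up, it targets Python 3, and it leaves out the stop-word stripping from the question.

```python
import csv
import io


def split_rows(rows, key_words):
    """Return (matched, unmatched).

    matched holds one (key, row) pair per keyword found in a row;
    unmatched holds each row that contains no keyword, exactly once.
    """
    matched, unmatched = [], []
    for row in rows:
        row = row.strip().lower()
        hits = [key for key in key_words if key in row]
        if hits:
            matched.extend((key, row) for key in hits)
        else:
            unmatched.append(row)
    return matched, unmatched


# Hypothetical sample data standing in for testclaims.csv and masterlist.csv
claims = ["Happy dog", "Sad", "Angry cat", "Dog and cat"]
keys = ["dog", "cat"]
matched, unmatched = split_rows(claims, keys)

# Write the unmatched rows the way output2.csv would be written (in memory here)
buffer = io.StringIO()
csv.writer(buffer).writerows([row] for row in unmatched)

print(matched)
print(unmatched)
```

On Python 3 the real output files would be opened in text mode, e.g. open("output2.csv", "w", newline=""), rather than with "wb+", because csv.writer expects a text-mode file there.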

Related

How to find the longest continuous stretch of matching elements in 2 lists

I have 2 lists:
a = [
'Okay. ',
'Yeah. ',
'So ',
'my ',
'thinking ',
'is, ',
'so ',
'when ',
"it's ",
'set ',
'up ',
'just ',
'one ',
'and ',
"we're ",
'like ',
'next ',
'to ',
'each ',
'other '
]
b = [
'Okay. ',
'Yeah. ',
'Everything ',
'as ',
'normal ',
'as ',
'possible. ',
'Yeah. ',
'Yeah. ',
'Okay. ',
'Is ',
'that ',
'better? ',
'Yeah. ',
'So ',
'my ',
'thinking ',
'is, ',
'so ',
'when '
]
Each list is slightly different. However, there will be moments when a stretch of continuous elements in a will match a stretch of continuous elements in b.
For example:
The first 2 elements in both lists match. The matching list would be ['Okay. ', 'Yeah. ']. This is only 2 elements long.
There is a longer stretch of matching words. You can see that each contains the following continuous set:
['Yeah. ','So ','my ','thinking ','is, ','so ','when ']
This continuous matching sequence has 7 elements. This is the longest sequence.
I want the index of where this sequence starts for each list. For a, this should be 1 and for b this should be 13.
I understand that I could generate every possible contiguous sequence in a, starting with the longest, and check for a match in b, stopping once I find one. However, this seems inefficient.
How I would solve this:
from difflib import SequenceMatcher
match = SequenceMatcher(None, a, b).find_longest_match()
print(a[match.a:match.a + match.size])
print(b[match.b:match.b + match.size])
You get:
['Yeah. ', 'So ', 'my ', 'thinking ', 'is, ', 'so ', 'when ']
['Yeah. ', 'So ', 'my ', 'thinking ', 'is, ', 'so ', 'when ']
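Since the question asks for the start indices rather than the slices, here is a minimal sketch that returns them. The helper name longest_match_indices and the short sample lists are made up for illustration. It passes the ranges explicitly, which also works on Python versions older than 3.9 (where find_longest_match() requires its arguments), and it sets autojunk=False so that the "popular element" heuristic, which only applies to sequences of 200 or more items, cannot hide frequent words in longer transcripts.

```python
from difflib import SequenceMatcher


def longest_match_indices(a, b):
    """Return (start_in_a, start_in_b, length) of the longest contiguous match."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return m.a, m.b, m.size


# Hypothetical shortened lists, in the spirit of the question's data
x = ["Okay. ", "Yeah. ", "So ", "my ", "thinking "]
y = ["Okay. ", "Is ", "Yeah. ", "So ", "my "]

print(longest_match_indices(x, y))  # (1, 2, 3)
```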
So, we start from the top of a and search through b to find the longest match. Since each comparison only continues as long as the elements match, it isn't terribly inefficient.
a = [
'Okay. ',
'Yeah. ',
'So ',
'my ',
'thinking ',
'is, ',
'so ',
'when ',
"it's ",
'set ',
'up ',
'just ',
'one ',
'and ',
"we're ",
'like ',
'next ',
'to ',
'each ',
'other '
]
b = [
'Okay. ',
'Yeah. ',
'Everything ',
'as ',
'normal ',
'as ',
'possible. ',
'Yeah. ',
'Yeah. ',
'Okay. ',
'Is ',
'that ',
'better? ',
'Yeah. ',
'So ',
'my ',
'thinking ',
'is, ',
'so ',
'when '
]
start = None
maxlen = 0
for i in range(len(a)):
    for j in range(len(b)):
        for k in range(min(len(a)-i, len(b)-j)):
            if a[i+k] != b[j+k]:
                break
            # k is a zero-based offset, so the run matched so far has k + 1 elements
            if k + 1 > maxlen:
                start = (i, j)
                maxlen = k + 1
print(start, maxlen)
Output:
(1, 13) 7
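If the lists get long, the triple loop can be swapped for a dynamic-programming pass. The sketch below is an alternative technique, not what the answer above does: it keeps, for every pair of positions, the length of the common run ending there, so it does roughly len(a) * len(b) comparisons. The function name longest_common_run and the sample lists are made up for illustration.

```python
def longest_common_run(a, b):
    """Return (start_in_a, start_in_b, length) of the longest contiguous run in both lists."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)  # prev[j + 1]: run length ending at a[i - 1], b[j]
    for i, x in enumerate(a):
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b):
            if x == y:
                # extend the run that ended one step earlier in both lists
                curr[j + 1] = prev[j] + 1
                if curr[j + 1] > best[2]:
                    length = curr[j + 1]
                    best = (i - length + 1, j - length + 1, length)
        prev = curr
    return best


# Hypothetical sample data
p = ["p", "q", "r", "s"]
q = ["q", "r", "s", "p"]
print(longest_common_run(p, q))  # (1, 0, 3)
```

Ties are resolved in favour of the run found first while scanning a from the start, which matches the behaviour of the brute-force loop.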

Insert 'BCH' into map

I wanted to insert 'BCH' inside a specific location in a list, but it gave me an error message.
Here is my code:
map = [[' ', ' ', ' ', ' '], \
[' ', ' ', ' ', ' '], \
[' ', ' ', ' ', ' '], \
[' ', ' ', ' ', ' ']
]
building = 'BCH'
map[0][1].append(building)
The error message was: AttributeError: 'str' object has no attribute 'append'
Strings are immutable, so you can't call .append() on them. map[0][1] is a string, not a list. If you want to concatenate to the string, use an augmented assignment:
map[0][1] += building
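A minimal sketch of the two options, assuming a 4x4 grid like the one in the question. The list is named grid here (my choice, not the asker's) so that it doesn't shadow Python's built-in map() function.

```python
# Hypothetical 4x4 grid of one-space strings, as in the question
grid = [[" " for _ in range(4)] for _ in range(4)]
building = "BCH"

grid[0][1] = building    # plain assignment replaces the cell's content
grid[2][3] += building   # augmented assignment concatenates onto the existing ' '

print(grid[0])           # [' ', 'BCH', ' ', ' ']
print(repr(grid[2][3]))  # ' BCH'
```

Note that += keeps the space that was already in the cell, so plain assignment is probably what you want if the cell should contain only 'BCH'.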

Python - is it possible to compare two lists using only the first few characters of each item?

I am a new student learning to program with Python, and I have two example lists:
selected_ipc = ['H01L']
df = [[ 'F24J3/02 ', 'A123'], [ 'G01N31/10 ', 'A124'], [ 'H01L27/14 ', 'A125'], ['G21H1/10 ', 'A126'], ['H01L21/36 ', 'A127']]
I have written some simple code like this:
for item in selected_ipc:
    for item1 in df:
        if item == item1:
            print (item)
        else:
            print("No match")
It only prints 'No match', while my expected result is
[[ 'H01L27/14 ', 'A125'], ['H01L21/36 ', 'A127']]
So I would like to ask: is it possible to compare the items of the first list with the first 4 characters of the items in the second list?
Thank you in advance.
You could use startswith:
selected_ipc = ['H01L']
df = ['F24J3/02 ', 'G01N31/10 ', 'H01L27/14 ', 'G21H1/10 ', 'H01L21/36 ']
for item in selected_ipc:
    for item1 in df:
        if item1.startswith(item):
            print(item1)
        else:
            print("No match")
Output
No match
No match
H01L27/14
No match
H01L21/36
UPDATE
For a nested list you could use a list comprehension:
selected_ipc = ['H01L']
df = [['F24J3/02 ', 'A123'], ['G01N31/10 ', 'A124'], ['H01L27/14 ', 'A125'], ['G21H1/10 ', 'A126'],
['H01L21/36 ', 'A127']]
result = [lst for lst in df if any(lst[0].startswith(e) for e in selected_ipc)]
print(result)
Output
[['H01L27/14 ', 'A125'], ['H01L21/36 ', 'A127']]
As an alternative you could use a less pythonic way with two loops:
selected_ipc = ['H01L']
df = [['F24J3/02 ', 'A123'], ['G01N31/10 ', 'A124'], ['H01L27/14 ', 'A125'], ['G21H1/10 ', 'A126'],
['H01L21/36 ', 'A127']]
result = []
for lst in df:
    found = False
    for e in selected_ipc:
        if lst[0].startswith(e):
            found = True
            result.append(lst)
            break
    if not found:
        print("No match")
print(result)
Output
No match
No match
No match
[['H01L27/14 ', 'A125'], ['H01L21/36 ', 'A127']]
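As a side note, str.startswith also accepts a tuple of prefixes, so the inner any(...) can be dropped when there are several codes to check. A minimal sketch, with a second code added to selected_ipc purely for illustration:

```python
selected_ipc = ["H01L", "G01N"]  # "G01N" added here only to show multiple prefixes
df = [["F24J3/02 ", "A123"], ["G01N31/10 ", "A124"], ["H01L27/14 ", "A125"],
      ["G21H1/10 ", "A126"], ["H01L21/36 ", "A127"]]

prefixes = tuple(selected_ipc)  # startswith needs a tuple, not a list
result = [row for row in df if row[0].startswith(prefixes)]
print(result)
```

Unlike a substring test with in, startswith only matches at the beginning of the string, which is what "the first 4 characters" in the question calls for.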
selected_ipc = ['H01L']
df = ['F24J3/02 ', 'G01N31/10 ', 'H01L27/14 ', 'G21H1/10 ', 'H01L21/36 ']
l = []
for i in df:
    if selected_ipc[0] in i:
        l.append(i)
print(l)
You can also do it with a list comprehension like the one below:
selected_ipc = ['H01L']
df = ['F24J3/02 ', 'G01N31/10 ', 'H01L27/14 ', 'G21H1/10 ', 'H01L21/36 ']
for item in selected_ipc:
    match_lst = [item1 for item1 in df if item in item1]
    print(match_lst)
UPDATE
If you want to check the other elements (instead of only the first one) of the inner lists in df, you can use the code below:
selected_ipc = ['H01L', 'G01N', 'A126']
df = [['F24J3/02 ', 'A123'], ['G01N31/10 ', 'A124'], ['H01L27/14 ', 'A125'], ['G21H1/10 ', 'A126'],
['H01L21/36 ', 'A127']]
match_lst = [item1 for item1 in df if any(i.startswith(item) for item in selected_ipc for i in item1)]
print(match_lst)
Output
[['G01N31/10 ', 'A124'], ['H01L27/14 ', 'A125'], ['G21H1/10 ', 'A126'], ['H01L21/36 ', 'A127']]
Use a list comprehension: check whether the key is in the item and, if so, add the item to your list:
res = [i for i in df if selected_ipc[0] in i[0]]
# [['H01L27/14 ', 'A125'], ['H01L21/36 ', 'A127']]

How to escape specific whitespaces when splitting line into words with regex

I want to split a string into a list of words (here "word" means an arbitrary sequence of non-whitespace characters), but also keep the groups of consecutive whitespace characters that were used as separators (because the amount of whitespace is significant in my data). For this simple task, I know that the following regex does the job (I use Python as an illustrative language, but the code can easily be adapted to any language with regex support):
import re
regexA = re.compile(r"(\S+)")
print(regexA.split("aa b+b cc dd! :ee "))
produces the expected output:
['', 'aa', ' ', 'b+b', ' ', 'cc', ' ', 'dd!', ' ', ':ee', ' ']
Now the hard part: when a word includes an opening parenthesis, all the whitespaces encountered until the matching closing parenthesis should not be considered as word separators. In other words:
regexB.split("aa b+b cc(dd! :ee (ff gg) hh) ii ")
should produce:
['', 'aa', ' ', 'b+b', ' ', 'cc(dd! :ee (ff gg) hh)', ' ', 'ii', ' ']
Using
regexB = re.compile(r'([^(\s]*\([^)]*\)|\S+)')
works for a single pair of parentheses, but fails when there are inner parentheses. How could I improve the regex to correctly skip inner parentheses?
And the final question: in my data, only words starting with % should be tested for the "parenthesis rule" (regexB), the other words should be treated by regexA. I have no idea how to combine two regexes in a single split.
Any hint is warmly welcome...
The PCRE regex engine supports subroutine calls, so a recursive pattern can handle balanced, nested parentheses:
(?m)\s+(?=[^()]*(\([^()]*(?1)?[^()]*\))*[^()]*$)
Here (?1) calls subroutine 1, (\([^()]*(?1)?[^()]*\)), which is a recursive pattern because it contains its own caller, (?1).
But Python's built-in re module does not support subroutine calls in a regex.
So, in my Python script, I tried a workaround: first replace every ( and ) with another distinctive character (# in this example), then apply the regex to split, and finally turn each # back into ( or ).
Regex for splitting:
(?m)(\s+)(?=[^#]*(?:(?:#[^#]*){2})*$)
Here I changed your separator pattern \S+ to consecutive whitespace, \s+, because #, ( and ) all belong to the set of characters matched by \S.
The Python script might look like this:
import re
ss = """aa b+b cc(dd! :ee ((ff gg)) hh) ii """
ss = re.sub(r"\(|\)", "#", ss)  # replace every `(` and `)` with `#`
regx = re.compile(r"(?m)(\s+)(?=[^#]*(?:(?:#[^#]*){2})*$)")
m = regx.split(ss)
for i in range(len(m)):  # turn `#` back into `(` or `)` respectively
    n = m[i].count('#')
    if n < 2: continue
    else:
        for j in range(int(n/2)):
            k = m[i].find('#'); m[i] = m[i][:k] + '(' + m[i][k+1:]
        m[i] = m[i].replace("#", ')')
print(m)
Output is
['aa', ' ', 'b+b', ' ', 'cc(dd! :ee ((ff gg)) hh)', ' ', 'ii', ' ', '']
Finally, after testing several ideas based on the answers proposed by @Wiktor Stribiżew and @Thm Lee, I arrived at a set of solutions that handle different levels of complexity. To reduce dependencies, I wanted to stick to the re module from the Python standard library, so here is the code:
import re
text = "aa b%b( %cc(dd! (:ee ff) gg) %hh ii) "
# Solution 1: don't process parentheses at all
regexA = re.compile(r'(\S+)')
print(regexA.split(text))
# Solution 2: works for non-nested parentheses
regexB = re.compile(r'(%[^(\s]*\([^)]*\)|\S+)')
print(regexB.split(text))
# Solution 3: works for one level of nested parentheses
regexC = re.compile(r'(%[^(\s]*\((?:[^()]*\([^)]*\))*[^)]*\)|\S+)')
print(regexC.split(text))
# Solution 4: works for arbitrary levels of nested parentheses
n, words = 0, []
for word in regexA.split(text):
    if n: words[-1] += word
    else: words.append(word)
    if n or (word and word[0] == '%'):
        n += word.count('(') - word.count(')')
print(words)
Here is the generated output:
Solution 1: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 2: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 3: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 4: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
As stated in the question, for my specific data, whitespace inside parentheses only has to be escaped for words starting with %; other parentheses (e.g. the word b%b( in my example) are not considered special. If you want to escape whitespace inside any pair of parentheses, simply remove the % character from the regexes (and, in Solution 4, drop the word[0] == '%' test so that the condition becomes if n or word:). Here is the result with that modification:
Solution 1: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 2: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 3: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 4: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff) gg) %hh ii)', ' ']
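For reuse, Solution 4 could be wrapped in a function along these lines. This is only a sketch: the name split_keep_groups and the prefix parameter are made up, and clamping the depth at zero (so that a stray closing parenthesis cannot leave the counter negative) is an extra safeguard that the loop above does not have.

```python
import re


def split_keep_groups(text, prefix="%"):
    """Split text into alternating whitespace/word tokens.

    Whitespace inside parentheses is kept within the word, but only for
    words starting with `prefix`. Pass prefix="" to apply the rule to
    every word.
    """
    depth, words = 0, []
    for token in re.split(r"(\S+)", text):
        if depth:
            words[-1] += token      # still inside an open parenthesis
        else:
            words.append(token)
        if depth or (token and token.startswith(prefix)):
            # clamp at 0: an unmatched ')' should not start a merge (assumption)
            depth = max(0, depth + token.count("(") - token.count(")"))
    return words


text = "aa b%b( %cc(dd! (:ee ff) gg) %hh ii) "
print(split_keep_groups(text))
print(split_keep_groups(text, prefix=""))
```

With the sample text, the first call reproduces the Solution 4 line of the first output listing and the second call reproduces the Solution 4 line of the modified one.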

Reversing list with strings in python

I have this list,
last_names = [
'Hag ', 'Hag ', 'Basmestad ', 'Grimlavaag ', 'Kleivesund ',
'Fintenes ', 'Svalesand ', 'Molteby ', 'Hegesen ']
and I want to print it reversed, so 'Hegesen' comes first, then 'Molteby', and 'Hag' at the end.
I have tried last_names.reverse(), but that returns None.
Any help?
.reverse returns None because it reverses in-place:
>>> last_names = [
... 'Hag ', 'Hag ', 'Basmestad ', 'Grimlavaag ', 'Kleivesund ',
... 'Fintenes ', 'Svalesand ', 'Molteby ', 'Hegesen ']
>>> last_names.reverse()
>>> last_names
['Hegesen ', 'Molteby ', 'Svalesand ', 'Fintenes ', 'Kleivesund ', 'Grimlavaag ', 'Basmestad ', 'Hag ', 'Hag ']
To do this in an expression, do last_names[::-1].
As stated before, .reverse reverses the list in place. A more Pythonic way to get a reversed copy of a list is to use reversed:
>>> list(reversed([1,2,3]))
[3, 2, 1]
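To put the three options side by side, here is a small sketch with a shortened, made-up version of the list:

```python
last_names = ["Hag", "Basmestad", "Hegesen"]

copy_by_slice = last_names[::-1]               # new list, original untouched
copy_by_reversed = list(reversed(last_names))  # reversed() gives an iterator, so wrap it in list()
result = last_names.reverse()                  # reverses in place and returns None

print(copy_by_slice)     # ['Hegesen', 'Basmestad', 'Hag']
print(copy_by_reversed)  # ['Hegesen', 'Basmestad', 'Hag']
print(result)            # None
print(last_names)        # ['Hegesen', 'Basmestad', 'Hag']
```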
