Using a dictionary as regex in Python

I had a Python question I was hoping for some help on.
Let's start with the important part, here is my current code:
import re  # for regex
import numpy as np  # for matrix

f1 = open('file-to-analyze.txt', 'r')  # file to analyze

# Convert files of words into arrays.
# These words are used to be matched against in the "file-to-analyze"
math = open('sample_math.txt', 'r')
matharray = list(math.read().split())
math.close()

logic = open('sample_logic.txt', 'r')
logicarray = list(logic.read().split())
logic.close()

priv = open('sample_priv.txt', 'r')
privarray = list(priv.read().split())
priv.close()

# ... Read in 5 more files and make associated arrays

# Convert arrays into dictionaries
math_dict = dict()
math_dict.update(dict.fromkeys(matharray, 0))

logic_dict = dict()
logic_dict.update(dict.fromkeys(logicarray, 1))

# ... Make more dictionaries from the arrays (8 total dictionaries - the same number as there are arrays)

# Create big dictionary of all keys
word_set = dict(math_dict.items() + logic_dict.items() + priv_dict.items() ... )

statelist = list()

for line in f1:
    for word in word_set:
        for m in re.finditer(word, line):
            print word.value()
The goal of the program is to take a large text file and perform analysis on it. Essentially, I want the program to loop through the text file, match words it finds against the Python dictionaries, associate each match with a category, and keep track of the categories in a list.
So for example, let's say I was parsing through the file and ran across the word "ADD". ADD is listed under the "math" or '0' category of words. The program should then record in a list that it ran across a 0-category word and continue to parse the file, essentially generating a large list that looks like [0,4,6,7,4,3,4,1,2,7,1,2,2,2,4...], with each number corresponding to a particular state or category of words as illustrated above. For the sake of understanding, we'll call this large list 'statelist'.
As you can tell from my code, so far I can take as input the file to analyze, read the text files that contain the word lists into arrays, and from there build dictionaries with the correct corresponding category value (a numerical value from 0 to 7). However, I'm having trouble with the analysis portion.
As you can tell from my code, I'm trying to go line by line through the text file and regex-match any of the found words against the dictionaries. This is done through a loop, matching against an additional, 9th dictionary that is more or less a "super" dictionary of all the keys, to help simplify the parsing.
However, I'm having trouble matching all the words in the file and, when I find a word, getting the dictionary value rather than the key: that is, when the program runs across an "ADD", it should add 0 to the list because ADD is part of the 0 or "math" category.
Would someone be able to help me figure out how to write this script? I really appreciate it! Sorry for the long post, but the code requires a lot of explanation so you know what's going on. Thank you so much in advance for your help!

The simplest change to your existing code would be to keep track of both the word and the category in the loop:
for line in f1:
    for word, category in word_set.iteritems():
        for m in re.finditer(word, line):
            print word, category
            statelist.append(category)
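One caveat worth adding: re.finditer treats each dictionary key as a regular expression and will also match it inside longer words. A minimal sketch of a safer variant of the same loop (not from the original post; it assumes the same f1, word_set and statelist), escaping each key and anchoring it on word boundaries:
import re

# Pre-compile one escaped, word-bounded pattern per dictionary key,
# so "ADD" no longer matches inside e.g. "ADDRESS".
patterns = {word: re.compile(r'\b' + re.escape(word) + r'\b') for word in word_set}

for line in f1:
    for word, category in word_set.items():
        for m in patterns[word].finditer(line):
            statelist.append(category)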

Related

Create a code in python to get the most frequent tag and value pair from a list

I have a .txt file with 3 columns: word position, word and tag (NN, VB, JJ, etc.).
Example of txt file:
1 i PRP
2 want VBP
3 to TO
4 go VB
I want to find the frequency of the word and tag as a pair in the list in order to find the most frequently assigned tag to a word.
Example of Results:
3 (food, NN), 2 (Brave, ADJ)
My idea is to start by opening the file from the folder, read the file line by line and split, set a counter using dictionary and print with the most common to uncommon in descending order.
My code is extremely rough (I'm almost embarrassed to post it):
file = open("/Users/Desktop/Folder1/trained.txt")
wordcount = {}
for word in file.read().split():
    from collections import Counter
    c = Counter()
    for d in dicts.values():
        c += Counter(d)
print(c.most_common())
file.close()
Obviously, I'm getting no results. Anything will help. Thanks.
UPDATE:
So I got this code posted on here which worked, but my results are kinda funky. Here's the code (the author removed it so I don't know who to credit):
file = open("/Users/Desktop/Folder1/trained.txt").read().split('\n')
d = {}
for i in file:
    if i[1:] in d.keys():
        d[i[1:]] += 1
    else:
        d[i[1:]] = 1
print(sorted(d.items(), key=lambda x: x[1], reverse=True))
here are my results:
[('', 15866), ('\t.\t.', 9479), ('\ti\tPRP', 7234), ('\tto\tTO', 4329), ('\tlike\tVB', 2533), ('\tabout\tIN', 2518), ('\tthe\tDT', 2389), ('\tfood\tNN', 2092), ('\ta\tDT', 2053), ('\tme\tPRP', 1870), ('\twant\tVBP', 1713), ('\twould\tMD', 1507), ('0\t.\t.', 1427), ('\teat\tVB', 1390), ('\trestaurant\tNN', 1371), ('\tuh\tUH', 1356), ('1\t.\t.', 1265), ('\ton\tIN', 1237), ("\t'd\tMD", 1221), ('\tyou\tPRP', 1145), ('\thave\tVB', 1127), ('\tis\tVBZ', 1098), ('\ttell\tVB', 1030), ('\tfor\tIN', 987), ('\tdollars\tNNS', 959), ('\tdo\tVBP', 956), ('\tgo\tVB', 931), ('2\t.\t.', 912), ('\trestaurants\tNNS', 899),
There seems to be a mix of good results with words and other results with spaces or random numbers. Does anyone know a way to remove the entries that aren't real words? Also, I know \t is supposed to signify a tab; is there a way to remove that as well? You guys really helped a lot.
You need to have a separate collections.Counter for each word. This code uses defaultdict to create a dictionary of counters, without checking every word to see if it is known.
from collections import Counter, defaultdict

counts = defaultdict(Counter)
for row in file:  # read one line into `row`
    if not row.strip():
        continue  # ignore empty lines
    pos, word, tag = row.split()
    counts[word.lower()][tag] += 1
That's it, you can now check the most common tag of any word:
print(counts["food"].most_common(1))
# Prints [("NN", 3)] or whatever
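If you want the most common tag for every word at once, a small follow-up sketch (assuming the counts dictionary built above, not part of the original answer):
# Map each word to its single most frequent tag.
best_tag = {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}
print(best_tag.get("food"))  # e.g. "NN"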
If you don't mind using pandas, which is a great library for tabular data, I would do the following:
import pandas as pd

df = pd.read_csv("/Users/Desktop/Folder1/trained.txt", sep=" ", header=None, names=["position", "word", "tag"])
df["word_tag_counts"] = df.groupby(["word", "tag"])["position"].transform("count")
Then if you only want the maximum one from each group you can do:
df.groupby(["word", "tag"]).max()["word_tag_counts"]
which should give you a table with the values you want.
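If what you ultimately want is the single most frequent tag per word, a short follow-up sketch (assuming the same df as above, not part of the original answer) is to count the pairs and keep the top row for each word:
pair_counts = df.groupby(["word", "tag"]).size().reset_index(name="count")
top_tag_per_word = (pair_counts.sort_values("count", ascending=False)
                               .drop_duplicates("word"))
print(top_tag_per_word.head())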

How to read a set of characters from a column, even though they are all different lengths? Python 3

My name is Rhein and I have just started to learn Python, and I'm having a lot of fun :D. I just finished a course on YouTube and I am currently working on a project of mine. Currently, I am trying to separate the columns of a crime-data csv into their own strings.
with open('C:/Users/aferdous/python-works/data-set/crime-data/crime_data-windows-1000.csv') as crime_data:
    for crime in crime_data:
        id = crime_data.readline(8)  # <- prints the first x char of each line
        print(id)
        case_number = crime_data.readline(8)  # <- prints the first x char of each line
        print(case_number)
        date = crime_data.readline(22)  # <- prints the first x char of each line
        print(date)
        block = crime_data.readline(25)  # <- prints the first x char of each line
        print(block)
This was easy for the first two columns, since they all have the same character length. But for 'block', the values in the column have different lengths, so I do not know how to extract the right number of characters from each line. And there are 1000 lines in total.
- Thanks
I assume that your CSV format is "value1, value2, value3". If that is the case, you can use the Python string method split. Example:
...
columns = crime.split(",")
print(columns[0])  # print column 1 (indexes start at 0)
print(columns[2])  # print column 3
...
But for reading CSVs in Python there are a lot of better options; you can search Google for examples. Here are some I found:
https://gist.github.com/ultrakain/79758ff811f87dd11a8c6c80c28397c4
Reading a CSV file using Python
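For instance, a minimal sketch using the standard csv module (the column order is assumed from the question, not checked against the real file):
import csv

with open('C:/Users/aferdous/python-works/data-set/crime-data/crime_data-windows-1000.csv') as crime_data:
    reader = csv.reader(crime_data)
    for row in reader:
        # Each row is already split into fields, however long each field is.
        id, case_number, date, block = row[0], row[1], row[2], row[3]
        print(id, case_number, date, block)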

Trying to read text file and count words within defined groups

I'm a novice Python user. I'm trying to create a program that reads a text file and searches that text for certain words that are grouped (and that I predefine by reading from a csv). For example, if I wanted to create my own definition for "positive" containing the words "excited", "happy", and "optimistic", the csv would contain those terms. I know the below is messy - the txt file I am reading from contains 7 occurrences of the three "positive" tester words I read from the csv, yet the result prints out as 25. I think it's returning a character count, not a word count. Code:
import csv
import string
import re
from collections import Counter

remove = dict.fromkeys(map(ord, '\n' + string.punctuation))

# Read the .txt file to analyze.
with open("test.txt", "r") as f:
    textanalysis = f.read()
    textresult = textanalysis.lower().translate(remove).split()

# Read the CSV list of terms.
with open("positivetest.csv", "r") as senti_file:
    reader = csv.reader(senti_file)
    positivelist = list(reader)

# Convert term list into flat chain.
from itertools import chain
newposlist = list(chain.from_iterable(positivelist))

# Convert chain list into string.
posstring = ' '.join(str(e) for e in newposlist)
posstring2 = posstring.split(' ')
posstring3 = ', '.join('"{}"'.format(word) for word in posstring2)

# Count number of words as defined in list category
def positive(str):
    counts = dict()
    for word in posstring3:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    total = sum(counts.values())
    return total

# Print result; will write to CSV eventually
print("Positive: ", positive(textresult))
I'm a beginner as well but I stumbled upon a process that might help. After you read in the file, split the text at every space, tab, and newline. In your case, I would keep all the words lowercase and include punctuation in your split call. Save this as an array and then parse it with some sort of loop to get the number of instances of each 'positive,' or other, word.
Look at this, specifically the "train" function:
https://github.com/G3Kappa/Adjustable-Markov-Chains/blob/master/markovchain.py
Also, this link, ignore the JSON stuff at the beginning, the article talks about sentiment analysis:
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
Same applies with this link:
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Good luck!
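Picking up the counting idea above, a rough sketch of that last step (it assumes the textresult and newposlist variables from the question and uses Counter, which the question already imports):
from collections import Counter

# Count only the words that appear in the positive-term list.
positive_terms = set(w.lower() for w in newposlist)
word_counts = Counter(w for w in textresult if w in positive_terms)
print("Positive:", sum(word_counts.values()))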
I looked at your code and passed through some of my own as a sample.
I have 2 ideas for you, based on what I think you may want.
First assumption: you want a basic sentiment count?
Getting to 'textresult' is great. Then you did the same with the 'positive lexicon' - to [positivelist], which I thought would be the perfect action. Then you converted [positivelist] into essentially one big sentence.
Would you not just:
1. Pass a 'stop_words' list through [textresult]
2. merge the two dataframes [textresult (less stopwords) and positivelist] for common words - as in an 'inner join'
3. Then basically do your term frequency
4. It is much easier to aggregate the score then
Second assumption: you are focusing on "excited", "happy", and "optimistic"
and you are trying to isolate text themes into those 3 categories?
1. again stop at [textresult]
2. download the 'nrc' and/or 'syuzhet' emotional valence dictionaries
They breakdown emotive words by 8 emotional groups
So if you only want 3 of the 8 emotive groups (subset)
3. Process it like you did to get [positivelist]
4. do another join
Sorry, this is a bit hashed up, but if I was anywhere near what you were thinking let me know and we can make contact.
Second apology: I'm also a novice Python user; I am adapting what I use in R to Python in the above (it's not subtle either :) )

Workaround for index out of range while searching through FASTA file

I'm working on a program that lets the user enter a sequence they want to find inside a FASTA file, after which the program shows the description line and the sequence that belongs to it.
The FASTA can be found at hugheslab.ccbr.utoronto.ca/supplementary-data/IRC/IRC_representative_cdna.fa.gz, it's approx. 87 MB.
The idea is to first create a list with the locations of the description lines, which always start with a >. Once you know which lines are description lines, you can search for the search_term in the lines between two description lines. This is exactly what is done in the fourth paragraph; it results in a list 48425 entries long. Here is an idea of what the results are: http://imgur.com/Lxy8hnI
Now the fifth paragraph is meant to search between two description lines. Let's take lines 0 and 15 as an example: these would be description_list[a] and description_list[a+1], as a = 0 and a+1 = 1, and description_list[0] = 0 and description_list[1] = 15. Between these lines the if-statement searches for the search term; if it finds it, it saves description_list[a] into start_position_list and description_list[a+1] into stop_position_list, which will be used later on.
So, as you can imagine, a simple term like 'ATCG' will occur often, which means start_position_list and stop_position_list will have a lot of duplicates; these are removed using list(set(start_position_list)) and then sorting. That way start_position_list[0] and stop_position_list[0] will be 0 and 15, like this: http://imgur.com/QcOsuhM, which can then be used as a range for which lines to print out to show the sequence.
Now, of course, the big issue is that the line for i in range(description_list[a], description_list[a+1]): will eventually hit [a+1] when a is already at the last index of description_list and will therefore give a list index out of range error, as you can see here as well: http://imgur.com/hi7d4tr
What would be the best solution for this? It's still necessary to go through all the description lines, and I can't come up with a better structure to go through them all.
file = open("IRC_representative_cdna.fa")
file_list = list(file)

search_term = input("Enter your search term: ")

description_list = []
start_position_list = []
stop_position_list = []

for x in range(0, len(file_list)):
    if ">" in file_list[x]:
        description_list.append(x)

for a in range(0, len(description_list)):
    for i in range(description_list[a], description_list[a+1]):
        if search_term in file_list[i]:
            start_position_list.append(description_list[a])
            stop_position_list.append(description_list[a+1])
The way to avoid the subscript out of range error is to shorten the loop. Replace the line
for a in range(0, len(description_list)):
by
for a in range(0, len(description_list)-1):
Also, I think that you can use a list comprehension to build up description_list (note that it has to collect the line numbers, not the lines themselves):
description_list = [x for x, line in enumerate(file_list) if line.startswith('>')]
In addition to being shorter, it is more efficient, since startswith doesn't do a linear search over the entire line when only the starting character is relevant.
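Another sketch of the pairing step (assuming description_list holds the line indices as above): zip each description index with the next one, so the loop can never run past the end of the list. Like the shortened loop, this ignores the final record, whose end you would have to handle separately:
# Pair each description line index with the following one.
for start, stop in zip(description_list, description_list[1:]):
    if any(search_term in file_list[i] for i in range(start, stop)):
        start_position_list.append(start)
        stop_position_list.append(stop)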
Here is a solution that uses the biopython package, thus saving you the headache of parsing interleaved fasta yourself:
from Bio import SeqIO

file = open("IRC_representative_cdna.fa")
search_term = input("Enter your search term: ")

for record in SeqIO.parse(file, "fasta"):
    rec_seq = record.seq
    if search_term in rec_seq:
        print(record.id)
        print(rec_seq)
It wasn't very clear to me what your desired output is, but this code can easily be changed to fit it.

Python : How to optimize comparison between two large sets?

I salute you! I'm new here, and I've got a little problem trying to optimize this part of code.
I'm reading from two files:
Corpus.txt -----> Contains my text (of 1.000.000 words)
Stop_words.txt -----> Contains my stop_list (of 4000 words)
I must compare each word from my corpus with every word in the stop_list, because I want a text without stop words, so I have:
1.000.000 * 4000 comparisons to do with the code below:
import nltk

fich = open("Corpus.txt", "r")
text = fich.readlines()

fich1 = open("stop_words.txt", "r")
stop = fich1.read()

tokens_stop = nltk.wordpunct_tokenize(stop)
tokens_stop = sorted(set(tokens_stop))

for line in text:
    tokens_rm = nltk.wordpunct_tokenize(line)
    z = [val for val in tokens_rm if val not in tokens_stop]
    for i in z:
        print i
My question is: is there any way to do it differently? Any structure to optimize it?
You can create a set of your stop_words, then for every word in your text see if it is in the set.
Actually it looks like you are already using a set. Though I don't know why you are sorting it.
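One detail worth spelling out: sorted(set(...)) returns a list, so the val not in tokens_stop test in the question is back to a linear scan. A minimal sketch keeping the stop words as a set (same file names as the question, not a tested drop-in):
import nltk

# Keep the stop words in a set so each membership test is O(1) on average.
with open("stop_words.txt") as fich1:
    tokens_stop = set(nltk.wordpunct_tokenize(fich1.read()))

with open("Corpus.txt") as fich:
    for line in fich:
        for val in nltk.wordpunct_tokenize(line):
            if val not in tokens_stop:
                print(val)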
