Programmatically merge rows of a huge file for NLP - python

I need to use the Google ngram corpus (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html), which records how often each n-gram appeared in books, year by year.
File format: each file is compressed tab-separated data, and every line has the following format:
ngram TAB year TAB match_count TAB volume_count NEWLINE
I wrote the following code to retrieve the frequency of an input n-gram:
file = r'D:\Chrome Downloads\googlebooks-eng-all-4gram-20120701-aj\googlebooks-eng-all-4gram-20120701-aj'

z = []
counter = 0
freq = 0
with open(file, 'rt', encoding='UTF8') as input:
    for line in input:
        if counter == 150:
            break
        if 'Ajax and Achilles ?' == line.strip().split('\t')[0]:
            print(line.strip().split('\t'))
            freq += int(line.strip().split('\t')[2])
            counter += 1  # stop after 150 matching rows
print('Frequency :', freq)
This works well only because 'Ajax and Achilles' appears near the top of the corpus (the counter stops the loop). When I search for an n-gram that appears later in the file, it takes forever.
The problem with using this corpus to get the frequency of an n-gram is that I have to scan the whole corpus regardless.
So I was thinking of merging the rows, ignoring the year and summing up the frequencies.
Is this a valid idea? If so, how can I do it programmatically?
If not, what is a better way of doing this?

You split each line multiple times, and of course reading the entire file for every ngram you want to check is not ideal. Why not write the total frequency for each ngram out to another file? Since this Google file is presumably enormous, you probably cannot collect all the counts in a single data structure before writing them out. But because the file is already sorted by ngram, you can write the new file without loading the whole corpus at once:
from csv import reader, writer
from itertools import groupby
from operator import itemgetter

get_ngram = itemgetter(0)

with open(file, 'rt', encoding='UTF8') as input, open('freq.txt', 'w', encoding='UTF8') as output:
    r = reader(input, delimiter='\t')
    w = writer(output, delimiter='\t')
    for ngram, rows in groupby(r, key=get_ngram):
    # for i, (ngram, rows) in enumerate(groupby(r, key=get_ngram)):
    # the i and enumerate are just so the loop isn't completely silent ...
        freq = sum(int(row[2]) for row in rows)
        w.writerow((ngram, freq))
        # if not i % 10000:  # ... and give you some idea of what's happening
        #     print('Processing ngram no. {}'.format(i))
The csv classes take over the parsing and writing. csv.reader is a lazy iterator over lists of strings. groupby groups the rows produced by the csv reader by their first field, via the key parameter; the itemgetter is used there just to avoid a clunky key=lambda x: x[0]. groupby yields pairs of the key value and an iterator over the rows that share that value. The loop then sums the frequencies of each group of rows and writes only the ngram and its total frequency to the new file with csv.writer.
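Once freq.txt exists, looking up a single n-gram only means scanning that much smaller totals file (or loading it into a dict once). A minimal follow-up sketch, assuming the freq.txt written above with the same tab-separated two-column layout:
from csv import reader

def ngram_frequency(ngram, path='freq.txt'):
    # scan the pre-aggregated totals file written by the code above
    with open(path, 'rt', encoding='UTF8') as f:
        for row in reader(f, delimiter='\t'):
            if row[0] == ngram:
                return int(row[1])
    return 0  # ngram not present in the corpus

print(ngram_frequency('Ajax and Achilles ?'))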

Related

Joblib too slow using "if not in" loop

I am working with amino acid sequences using the Biopython parser. Regardless of the data format (it is FASTA, so you can think of each record as a string of letters preceded by an id, as shown below), my problem is that I have a huge amount of data, and even after trying to parallelize with joblib, the estimated running time for this simple code is 400 hours.
Basically I have a file that contains a series of ids that I have to remove (ids_to_drop) from the original dataset (original_dataset), to create a new file (new_dataset) that contains all the ids from the original dataset except the ids_to_drop.
I've tried everything I can think of, but I don't know how else to do it and I'm stuck right now. Thanks so much!
def file_without_ids_to_remove(seq):
    with open(new_output, "a") as f, open(ids_to_drop, "r") as r:  # output file, ids to remove
        remove = r.read().split("\n")
        if seq.id not in remove:
            SeqIO.write(seq, f, "fasta")

Parallel(n_jobs=10)(delayed(file_without_ids_to_remove)(seq) for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')))
To be clear this is an example of the data (sequence.id + sequence):
WP_051064487.1
MSSAAQTPEATSDVSDANAKQAEALRVASVNVNGIRASYRKGMAEWLAPRQVDILCLQEVRAPDEVVDGF
LADDWHIVHAEAEAKGRAGVLIASRKDSLAPDATRIGIGEEYFATAGRWVEADYTIGENAKKLTVISAYV
HSGEVGTQRQEDKYRFLDTMLERMAELAEQSDYALIVGDLNVGHTELDIKNWKGNVKNAGFLPEERAYFD
KFFGGGDTPGGLGWKDVQRELAGPVNGPYTWWSQRGQAFDNDTGWRIDYHMATPELFARAGNAVVDRAPS
YAERWSDHAPLLVDYTIR
UPDATE: I tried the following after the suggestion and it works.
with open(new_dataset, "w") as filtered:
    [SeqIO.write(seq, filtered, "fasta") for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')) if seq.id not in ids_to_remove]
This looks like a simple file filter operation. Turn the ids to remove into a set once, and then just read/filter/write the original dataset. Sets are optimized for fast lookup. This operation is I/O bound and would not benefit from parallelization.
with open("ids-to-remove") as f:
    ids_to_remove = {seq_id_line.strip() for seq_id_line in f}

# just in case there are blank lines
if "" in ids_to_remove:
    ids_to_remove.remove("")

with open("original-data-set") as orig, open("filtered-data-set", "w") as filtered:
    filtered.writelines(line for line in orig if line.split()[0] not in ids_to_remove)
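If you prefer to keep working with Biopython records (as in the update above), the same set-lookup idea carries over directly. A minimal sketch, assuming the ids_to_drop, original_dataset and new_dataset paths from the question:
from Bio import SeqIO

# build the lookup set once, outside the loop
with open(ids_to_drop) as f:
    ids_to_remove = {line.strip() for line in f if line.strip()}

with open(new_dataset, "w") as filtered:
    for seq in SeqIO.parse(original_dataset, "fasta"):
        if seq.id not in ids_to_remove:
            SeqIO.write(seq, filtered, "fasta")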

Replace csv column with for loop in python

I'm trying to transform my dataset in a somewhat unusual way. I know the code block below may look illogical, but I have to do it this way. Is there any way to replace the 'Text' values in my csv file with the tokenized and filtered lines, using a for loop?
dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')
counter = 0
for field in dataset['text']:
    tokens = word_tokenize(field.translate(table))
    tokens2 = [w for w in tokens if w not in stop_words]
    tokens3 = [token for token in tokens2 if not all(char.isdigit() or char == '.' or char == '-' for char in token)]
    lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in tokens3]
    stemmed_word = [snowball_stemmer.stem(word) for word in lemmatized_word]
    ##### ANY CODE TO REPLACE ITEMS IN dataset['Text'] WITH stemmed_word
    ##### LIKE:
    # dataset['Text']'s first value = stemmed_word[counter]
    counter = counter + 1
Then I want to save the replaced csv file, because I have features in other columns like age, gender, and experience.
You can just leave the data you don't intend to modify as it is, and write it to the new file along with your modified column of lemmatized words. Whether you write the processed dataset to a new file or overwrite your old one is entirely up to you, though I'd personally write to a new file (adding another CSV file is unlikely to be a problem for your computer's storage nowadays).
Anyway, to write the file you can use the csv module.
import pandas
import csv

dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')

# do your text processing on the desired column of your dataset
# ...
# ...
# ...

dataT = dataset.transpose()
with open('new_dataset', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for r in dataT:
        writer.writerow(dataT[r])
I can't fully test this, since I don't know the exact format of your dataset, but it should be something along these lines (perhaps you should write the processed dataframe directly rather than its transpose; you should be able to figure that out after playing around with it).
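For what it's worth, pandas can also write the processed dataframe back out directly, which avoids transposing altogether. A minimal sketch, where process() is a hypothetical stand-in for the tokenize/filter/lemmatize/stem pipeline from the question and 'processed_1004.csv' is an assumed output name (adjust the column name, 'text' vs 'Text', to whatever your file actually uses):
import pandas

def process(text):
    # stand-in for the word_tokenize / stop-word / lemmatize / stem steps
    return text.lower().split()

dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')
dataset['Text'] = [process(field) for field in dataset['Text']]  # overwrite only this column
dataset.to_csv('processed_1004.csv', index=False, encoding='cp1252')  # age, gender, etc. are kept as-is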

how to extract specific data from a csv file with given parameters?

I want to extract neutral words from the given csv file (to a separate .txt file), but I'm fairly new to Python and don't know much about file handling. I could not find a neutral-words dataset, but after searching here and there, this is what I was able to find.
Here is the GitHub project from which I want to extract the data (in case anyone needs to know): hoffman-prezioso-projects/Amazon_Review_Sentiment_Analysis
Neutral Words
Word Sentiment Score
a 0.0125160264947
the 0.00423728459134
it -0.0294755274737
and 0.0810574365028
an 0.0318918766949
or -0.274298468178
normal -0.0270787859177
So basically I want to extract from the csv only those words (the text) where the numeric value is 0.something.
Even without using any libraries, this is fairly easy with the csv you're using.
First open the file (I'm going to assume you have the path saved in the variable filename), then read it with the readlines() function, and then filter according to the condition you want.
with open(filename, 'r') as csv:  # Open the file for reading
    rows = [line.split(',') for line in csv.readlines()]  # read the lines and split each on commas

filtered = [line[0] for line in rows if abs(float(line[1])) < 1]
# keep only the words whose score has an absolute value below 1, i.e. "0.something"
This is now the accepted answer, so I'm adding a disclaimer. There are numerous reasons why this code should not be applied to other CSVs without thought:
it reads the entire CSV into memory;
it does not account for e.g. quoting.
It is acceptable for very simple CSVs, but the other answers here are better if you cannot be certain that the CSV won't break this code.
Here is one way to do it using only the standard library, without holding the whole file in memory:
import csv

def get_vals(filename):
    with open(filename, 'r', newline='') as fin:
        reader = csv.reader(fin)
        for line in reader:
            if float(line[-1]) <= 0:
                yield line[0]

words = get_vals(filename)
for word in words:
    ...  # do stuff with each word
Use pandas like so:
import pandas
df = pandas.read_csv("yourfile.csv")
df.columns = ['word', 'sentiment']
to choose words by sentiment:
positive = df[df['sentiment'] > 0]['word']
negative = df[df['sentiment'] < 0]['word']
neutral = df[df['sentiment'] == 0]['word']
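Since the stated goal is a separate .txt file of the neutral words, you could then dump that series to disk. A small sketch, assuming the neutral series from above and an output name of your choosing (here neutral_words.txt):
with open('neutral_words.txt', 'w') as out:
    out.write('\n'.join(neutral))  # one word per line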
If you don't want to use any additional libraries, you can try the csv module. Note that delimiter='\t' may need to be different in your case.
import csv

f = open('name.txt', 'r')
reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
for row in reader:
    if float(row[1]) > 0.0:
        print(row[0] + ' ' + row[1])

How to remove rows from a csv file when compared to a list in a txt file using Python?

I have a list of 12,000 dictionary entries (the words only, without their definitions) stored in a .txt file.
I have a complete dictionary with 62,000 entries (the words with their definitions) stored in a .csv file.
I need to compare the small list in the .txt file with the larger list in the .csv file and delete the rows containing entries that don't appear in the smaller list. In other words, I want to trim this dictionary down to just those 12,000 entries.
The .txt file contains one word per line, like this:
word1
word2
word3
The .csv file is ordered like this:
ID (column 1) WORD (column 2) MEANING (column 3)
How do I accomplish this using Python?
Good answers so far. If you want to get minimalistic...
import csv

lookup = set(l.strip().lower() for l in open(path_to_file3))
writer = csv.writer(open(path_to_file2, 'w', newline=''))
writer.writerows(row for row in csv.reader(open(path_to_file))
                 if row[1].lower() in lookup)
The following will not scale well, but should work for the number of records indicated.
import csv

csv_in = csv.reader(open(path_to_file, 'r'))
csv_out = csv.writer(open(path_to_file2, 'w', newline=''))
use_words = open(path_to_file3, 'r').read().splitlines()
lookup = dict([(word, None) for word in use_words])

for line in csv_in:
    if line[1] in lookup:  # the word is in column 2
        csv_out.writerow(line)
One of the lesser-known facts about computers is that when you delete a line from a text file and save the file, most of the time the editor does this:
load the file into memory
write a temporary file with only the rows you want
close the files and move the temporary file over the original
So you have to load your wordlist:
with open('wordlist.txt') as i:
    wordlist = set(word.strip() for word in i)  # you said the file was small
Then you process the input file and write the filtered output:
import csv, os

with open('input.csv') as i:
    with open('output.csv', 'w', newline='') as o:
        output = csv.writer(o)
        for line in csv.reader(i):       # iterate over the CSV line by line
            if line[1] not in wordlist:  # test the value at column 2, the word
                output.writerow(line)

os.rename('output.csv', 'input.csv')  # move the temporary file over the original
This is untested, now go do your homework and comment here if you find any bug... :-)
I would use pandas for this. The dataset is not large, so you can do it in memory with no problem.
import pandas as pd

words = pd.read_csv('words.txt', header=None)
defs = pd.read_csv('defs.csv')

words.set_index(0, inplace=True)
defs.set_index('WORD', inplace=True)

new_defs = words.join(defs)
new_defs.to_csv('new_defs.csv')
You might need to manipulate new_defs to make it look the way you want, but that's the gist of it.
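If the index/join bookkeeping feels awkward, another common pandas idiom for this kind of filtering is isin. A minimal alternative sketch, with column names assumed from the layout described in the question:
import pandas as pd

words = pd.read_csv('words.txt', header=None, names=['WORD'])  # the 12,000-word list, one word per line
defs = pd.read_csv('defs.csv')  # assumed columns: ID, WORD, MEANING

filtered = defs[defs['WORD'].isin(words['WORD'])]  # keep only rows whose word is in the small list
filtered.to_csv('new_defs.csv', index=False)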

Python Extract Word/Token counts from items in a list?

I have a question about the best way to get word counts for items in a list.
I have 400+ items indexed in a list. They are of varying lengths. For example, if I enumerate, then I will get:
for index, items in enumerate(my_list):
    print(index, items)

0 fish, line, catch, hook
1 boat, wave, reel, line, fish, bait
.
.
.
Each item will get written into an individual row of a csv file. I would like the corresponding word counts to accompany this text in the adjacent column. I can find word/token counts just fine using Excel, but I would like to do this in Python so I don't have to keep going back and forth between programs to process my data.
I'm sure there are several ways to do this, but I can't seem to piece together a good solution. Any help would be appreciated.
As was posted in the comments, it's not really clear what your goal is here, but if it is to write a csv file that has one word per row along with each word's length, you can do:
import csv

with open(filename, 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['Word', 'Length'])
    for word in mylist:
        writer.writerow([word, str(len(word))])
If I'm misunderstanding and what you actually have is a list of strings, each containing comma-separated words, then what you'd want instead is:
import csv

with open(filename, 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['Word', 'Length'])
    for line in mylist:
        for word in line.split(", "):
            writer.writerow([word, str(len(word))])
If I understand correctly, you are looking for:
import csv

words = {}
for items in my_list:
    for item in items.split(', '):
        words.setdefault(item, 0)
        words[item] += 1

with open('output.csv', 'w') as fopen:
    writer = csv.writer(fopen)
    for word, count in words.items():
        writer.writerow([word, count])
This will write a CSV with unique words in one column and the number of occurrences of that word in the next column.
Is this what you were asking for?
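For comparison, here is an equivalent sketch using collections.Counter from the standard library in place of the manual setdefault bookkeeping (same assumed my_list of comma-separated strings):
import csv
from collections import Counter

words = Counter(item for items in my_list for item in items.split(', '))

with open('output.csv', 'w', newline='') as fopen:
    writer = csv.writer(fopen)
    writer.writerows(words.items())  # one (word, count) pair per row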
