I'm trying to replace values in my dataset in a somewhat unusual way. I know the code blocks below look illogical, but I have to do it this way. Is there any way to replace the 'Text' values in my CSV file with my tokenized and filtered lines using a for loop?
dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')
counter = 0
for field in dataset['Text']:
    tokens = word_tokenize(field.translate(table))
    tokens2 = [w for w in tokens if not w in stop_words]
    tokens3 = [token for token in tokens2 if not all(char.isdigit() or char == '.' or char == '-' for char in token)]
    lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in tokens3]
    stemmed_word = [snowball_stemmer.stem(word) for word in lemmatized_word]
    ##### ANY CODE TO REPLACE ITEMS IN dataset['Text'] WITH stemmed_word
    ##### LIKE:
    # dataset['Text']'s first value = stemmed_word[counter]
    counter = counter + 1
Then I need to save the replaced CSV file, because I have features in other columns like age, gender, and experience.
You can just leave the data you don't intend to modify as they are, and write them to the new file along with your modified column of lemmatized words. Then whether you write the new processed dataset to a new file or overwrite your old one is entirely up to you. Though I'd personally choose to write to a new file (it's unlikely that adding another CSV file will be a problem to your computer's storage nowadays).
Anyway, to write files, you can use the csv module.
import pandas
import csv

dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')

# do your text processing on the desired column for your dataset
# ...
# ...
# ...

dataT = dataset.transpose()
with open('new_dataset', 'w', newline='') as csvfile:  # use 'wb' instead on Python 2
    writer = csv.writer(csvfile)
    for r in dataT:
        writer.writerow(dataT[r])  # each column of the transpose is one original row
I can't fully test it out, since I don't know the exact format of your dataset. But it should be something along these lines (perhaps you should be writing the processed dataframe directly, and not its transpose; you should be able to figure that out yourself after playing around with it).
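If you do want the processed tokens written back into the 'Text' column itself, one possible sketch (assuming the tokenizer, stop word list, lemmatizer and stemmer objects from the question are already set up, and that the column is really named 'Text') is to collect the results in a list, assign them in one go, and let pandas write the file:
import pandas

dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')

processed = []
for field in dataset['Text']:
    tokens = word_tokenize(field.translate(table))
    tokens = [w for w in tokens if w not in stop_words]
    tokens = [t for t in tokens if not all(c.isdigit() or c in '.-' for c in t)]
    lemmatized = [wordnet_lemmatizer.lemmatize(w) for w in tokens]
    stemmed = [snowball_stemmer.stem(w) for w in lemmatized]
    processed.append(' '.join(stemmed))  # or keep the token list, whichever you need downstream

dataset['Text'] = processed  # the other columns (age, gender, experience, ...) are untouched
dataset.to_csv('new_dataset.csv', index=False, encoding='cp1252')  # output filename is just an example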
I am working with amino acid sequences using the Biopython parser. The data format is FASTA, so you can think of each record as a string of letters preceded by its id, as in the example below. My problem is that I have a huge amount of data, and even after trying to parallelize with joblib, the estimated running time of this simple code is about 400 hours.
Basically, I have a file containing a series of ids that I have to remove (ids_to_drop) from the original dataset (original_dataset), in order to create a new file (new_dataset) that contains all the records of the original dataset except those whose ids are in ids_to_drop.
I've tried everything I can think of, but I don't know how else to do it and I'm stuck right now. Thanks so much!
def file_without_ids_to_remove(seq):
    with open(new_output, "a") as f, open(ids_to_drop, "r") as r:  # output file, file of ids to remove
        remove = r.read().split("\n")
        if seq.id not in remove:
            SeqIO.write(seq, f, "fasta")

Parallel(n_jobs=10)(delayed(file_without_ids_to_remove)(seq) for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')))
To be clear, this is an example of the data (sequence.id + sequence):
WP_051064487.1
MSSAAQTPEATSDVSDANAKQAEALRVASVNVNGIRASYRKGMAEWLAPRQVDILCLQEVRAPDEVVDGF
LADDWHIVHAEAEAKGRAGVLIASRKDSLAPDATRIGIGEEYFATAGRWVEADYTIGENAKKLTVISAYV
HSGEVGTQRQEDKYRFLDTMLERMAELAEQSDYALIVGDLNVGHTELDIKNWKGNVKNAGFLPEERAYFD
KFFGGGDTPGGLGWKDVQRELAGPVNGPYTWWSQRGQAFDNDTGWRIDYHMATPELFARAGNAVVDRAPS
YAERWSDHAPLLVDYTIR
UPDATE: I tried it the following way after the suggestion and it works.
with open(new_dataset, "w") as filtered:
    [SeqIO.write(seq, filtered, "fasta") for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')) if seq.id not in ids_to_remove]
This looks like a simple file filter operation. Turn the ids to remove into a set one time, and then just read/filter/write the original dataset. Sets are optimized for fast lookup. This operation will be I/O bound and would not benefit from parallelization.
with open("ids-to-remove") as f:
ids_to_remove = {seq_id_line.strip() for seq_id_line in f}
# just in case there are blank lines
if "" in ids_to_remove:
ids_to_remove.remove("")
with open("original-data-set") as orig, open("filtered-data-set", "w") as filtered:
filtered.writelines(line for line in orig if line.split()[0] not in ids_to_remove)
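Applied to the FASTA records from the question, a sketch of the same set-based idea using Biopython directly (file names here are placeholders, matching the ones above):
from Bio import SeqIO
import tqdm

# build the lookup set once, skipping blank lines
with open("ids-to-remove") as f:
    ids_to_remove = {line.strip() for line in f if line.strip()}

# stream the records and keep only those whose id is not in the set
with open("filtered-data-set.fasta", "w") as filtered:
    for seq in tqdm.tqdm(SeqIO.parse("original-data-set.fasta", "fasta")):
        if seq.id not in ids_to_remove:
            SeqIO.write(seq, filtered, "fasta")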
I want to extract Neutral words from the given csv file (to a separate .txt file), but I'm fairly new to python and don't know much about file handling. I could not find a neutral words dataset, but after searching here and there, this is what I was able to find.
Here is the GitHub project from which I want to extract the data (just in case anyone needs to know): hoffman-prezioso-projects/Amazon_Review_Sentiment_Analysis
Neutral Words
Word Sentiment Score
a 0.0125160264947
the 0.00423728459134
it -0.0294755274737
and 0.0810574365028
an 0.0318918766949
or -0.274298468178
normal -0.0270787859177
So basically I want to extract only those words (text) from the csv where the numeric value is 0.something.
Even without using any libraries, this is fairly easy with the csv you're using.
First open the file (I'm going to assume you have the path saved in the variable filename), then read it with the readlines() function, and then filter according to the condition you want.
with open(filename, 'r') as csvfile:  # Open the file for reading
    rows = [line.split(',') for line in csvfile.readlines()]  # Read the file in lines, and split each line on commas
    neutral_words = [line[0] for line in rows if abs(float(line[1])) < 1]
    # Keep only the words whose score has an absolute value less than 1
This is now the accepted answer, so I'm adding a disclaimer. There are numerous reasons why this code should not be applied to other CSVs without thought.
It reads the entire CSV in memory
It does not account for e.g. quoting
It is acceptable for very simple CSVs but the other answers here are better if you cannot be certain that the CSV won't break this code.
Here is one way to do it with only vanilla libs and without holding the whole file in memory:
import csv

def get_vals(filename):
    with open(filename, newline='') as fin:
        reader = csv.reader(fin)
        for line in reader:
            # treat scores with an absolute value below 1 as neutral
            if abs(float(line[-1])) < 1:
                yield line[0]

words = get_vals(filename)
for word in words:
    ...  # do stuff with each word
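If you then want the separate .txt file the question mentions, the generator can feed the output file directly (a small sketch; the csv filename is a placeholder):
with open('neutral_words.txt', 'w') as out:
    out.writelines(word + '\n' for word in get_vals('words.csv'))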
Use pandas like so:
import pandas
df = pandas.read_csv("yourfile.csv")
df.columns = ['word', 'sentiment']
to choose words by sentiment:
positive = df[df['sentiment'] > 0]['word']
negative = df[df['sentiment'] < 0]['word']
neutral = df[df['sentiment'] == 0]['word']
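To save those neutral words to a separate .txt file, as the question asks, the Series can be written out directly (one possible line):
neutral.to_csv('neutral_words.txt', index=False, header=False)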
If you don't want to use any additional libraries, you can try the csv module. Note that the delimiter ('\t' here) may be different in your case.
import csv

with open('name.txt', 'r') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if float(row[1]) > 0.0:
            print(row[0] + ' ' + row[1])
I need to use the Google ngram corpus (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html), which has data on how frequently n-grams appeared in books year by year.
File format: Each of the files below is compressed tab-separated data, and each line has the following format:
ngram TAB year TAB match_count TAB volume_count NEWLINE.
I wrote some code to retrieve the frequency of my input n-gram:
file = r'D:\Chrome Downloads\googlebooks-eng-all-4gram-20120701-aj\googlebooks-eng-all-4gram-20120701-aj'

z = []
counter = 0
freq = 0
with open(file, 'rt', encoding='UTF8') as input:
    for line in input:
        if counter == 150:
            break
        if 'Ajax and Achilles ?' == line.strip().split('\t')[0]:
            print(line.strip().split('\t'))
            freq += int(line.strip().split('\t')[2])
            print('Frequency :', freq)
        counter += 1
This worked well only because 'Ajax and Achilles ?' appears near the top of the corpus (the counter stops the loop early). When I try to search for an n-gram that appears later, it takes forever.
The problem with using this corpus to get the frequency of an n-gram is that I have to look through the whole corpus regardless.
So I was thinking of merging the rows, ignoring the year, and summing up the frequencies.
Is this a valid idea? If so, how can I do this programmatically?
If not, what is a better way of doing this?
You do split each line multiple times, and of course reading the entire file for every n-gram you want to check is not ideal. Why don't you write out the total frequency for each n-gram to another file? Since this Google file of yours is presumably enormous, you probably cannot easily collect the counts into a single data structure before writing them out. But relying on the file already being sorted by n-gram, you can write the new file without loading the whole corpus at once:
from csv import reader, writer
from itertools import groupby
from operator import itemgetter

get_ngram = itemgetter(0)

with open(file, 'rt', encoding='UTF8') as input, open('freq.txt', 'w', encoding='UTF8', newline='') as output:
    r = reader(input, delimiter='\t')
    w = writer(output, delimiter='\t')
    for ngram, rows in groupby(r, key=get_ngram):
    # for i, (ngram, rows) in enumerate(groupby(r, key=get_ngram)):
    #     the i and enumerate are just so the loop is not too silent ...
        freq = sum(int(row[2]) for row in rows)
        w.writerow((ngram, freq))
        # if not i % 10000:  # ... and give you some idea what's happening
        #     print('Processing ngram group {}'.format(i))
The csv classes just take over the csv parsing and writing. The csv.reader is a lazy iterator over lists of strings. groupby groups the rows produced by the csv reader by their first field (the n-gram), via the key parameter and an appropriate function; the itemgetter is used just to avoid a clunky key=lambda x: x[0]. groupby produces pairs of the key value and an iterator over the grouped elements that share that value. The loop then sums the frequencies for these grouped rows and writes only the n-gram and its total frequency to the file with the csv.writer.
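Once freq.txt exists, looking up a single n-gram is a quick scan over a much smaller file (a sketch, assuming the aggregated file produced above):
from csv import reader

def lookup_frequency(ngram, path='freq.txt'):
    # return the total frequency of `ngram` from the aggregated file, or 0 if it is absent
    with open(path, 'rt', encoding='UTF8') as f:
        for row in reader(f, delimiter='\t'):
            if row[0] == ngram:
                return int(row[1])
    return 0

print(lookup_frequency('Ajax and Achilles ?'))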
I am trying to write unique rows to a txt file after web scraping certain values. The txt file looks like this:
Current date Amount Gained
15/07/2017 660
16/07/2017 -200
17/07/2017 300
So basically what I want to do is write a script that only allows unique rows; I don't want any duplicates, because values change daily. If a user accidentally runs the script twice in one day, I don't want a duplicate row in my txt file, because it would affect further calculations in my data analysis. This is the function I currently have, and I would like to know what modifications I should make:
def Cost_Revenues_Difference():
    nrevenue = revenue
    ndifference = difference
    dateoftoday = time.strftime('%d/%m/%Y')
    Net_Result.append(nrevenue)
    with open('Net_Result.txt', 'a') as ac:
        for x in Net_Result:
            ac.write('\n' + dateoftoday + ' ' + str(Net_Result))
Cost_Revenues_Difference()
You can read all the data from your file into a list first:
with open('Net_Result.txt') as f:
    content = f.readlines()

# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
Then check whether the line you want to add already exists in your content list; if it does not, add it to the file:
newLine = dateoftoday + ' ' + str(Net_Result)
if newLine not in content:
    ac.write('\n' + newLine)
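Putting the two pieces together, a complete version of the function might look something like this (a sketch that writes one date/amount line per day; `revenue` is assumed to hold the day's scraped amount):
import time

def Cost_Revenues_Difference():
    dateoftoday = time.strftime('%d/%m/%Y')
    new_line = dateoftoday + ' ' + str(revenue)  # `revenue` is assumed to come from your scraping code

    # read what is already in the file (an empty list if the file does not exist yet)
    try:
        with open('Net_Result.txt') as f:
            content = [x.strip() for x in f.readlines()]
    except FileNotFoundError:
        content = []

    # append only if this exact line is not there yet
    if new_line not in content:
        with open('Net_Result.txt', 'a') as ac:
            ac.write('\n' + new_line)

Cost_Revenues_Difference()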
If the file can comfortably be loaded into RAM and has the structure you gave in the example lines, you could dump the data as a Python object into a .pkl file. For example:
import pickle

data = {'15/07/2017': 660,
        '16/07/2017': -200,
        '17/07/2017': 300}

with open('/path/to/the/file.pkl', 'wb') as file:
    pickle.dump(data, file)
Pickle files are friendly to Python objects, and you can use the dictionary's built-in behaviour (keys are unique) to avoid redundant entries or to make updates.
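For instance, a daily update round trip might look like this (a sketch; the `amount_gained` value is assumed to come from your scraper):
import pickle
import time

path = '/path/to/the/file.pkl'
amount_gained = 660  # placeholder; in practice this comes from your scraping code

# load the existing data
with open(path, 'rb') as f:
    data = pickle.load(f)

# dict keys are unique, so re-running this on the same day simply
# overwrites today's value instead of adding a duplicate entry
data[time.strftime('%d/%m/%Y')] = amount_gained

with open(path, 'wb') as f:
    pickle.dump(data, f)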
For more complicated structures, take a look at pandas.DataFrame. If your program has to work with languages other than Python, JSON or XML might be better choices.
There are many ways you can do this. Two alternatives are described below.
1 (this alternative updates the value)
One is to put the entries in a dictionary as key/value pairs and use the json library to import and export the data (benefit: a very common data structure).
import json

with open("test.json") as f:
    data = json.loads(f.read())

data["18-05-17"] = 123

with open("test.json", "w") as f:
    json.dump(data, f, indent=4)
Test.json
{
    "18-05-17": 123,
    "17-05-17": 123
}
As a dictionary can only hold unique keys, you won't have duplicates.
2 (this alternative will not update the value)
Another solution that comes to mind is to put the current date in the filename:
import datetime
import os

today = datetime.datetime.today().strftime("%y%m%d")
filedate = [i for i in os.listdir() if i.startswith("Net_result")][0]

# If today is different from the file's date, continue
if today != os.path.splitext(filedate)[0].split("_")[-1]:
    # code here
    with open(filedate, "a") as f:
        f.write('\n' + dateoftoday + ' ' + str(Net_Result))
    # rename
    os.rename(filedate, "Net_result_{}.csv".format(today))
You could start with a file carrying yesterday's date ("Net_result_170716"), and the code would check whether the date in the filename differs from today's (which it does), add the new value, rename the file and save. Running the code again the same day would not do anything (it would not even open the file).
I have a list of 12,000 dictionary entries (the words only, without their definitions) stored in a .txt file.
I have a complete dictionary with 62,000 entries (the words with their definitions) stored in a .csv file.
I need to compare the small list in the .txt file with the larger list in the .csv file and delete the rows containing the entries that don't appear in the smaller list. In other words, I want to trim this dictionary down to only the 12,000 entries.
The .txt file is ordered in separate lines like this, line by line:
word1
word2
word3
The .csv file is ordered like this:
ID (column 1) WORD (column 2) MEANING (column 3)
How do I accomplish this using Python?
Good answers so far. If you want to get minimalistic...
import csv

lookup = set(l.strip().lower() for l in open(path_to_file3))
# writerows consumes the generator; a bare map() would be lazy on Python 3 and write nothing
csv.writer(open(path_to_file2, 'w', newline='')).writerows(
    row for row in csv.reader(open(path_to_file))
    if row[1].lower() in lookup)
The following will not scale well, but should work for the number of records indicated.
import csv

csv_in = csv.reader(open(path_to_file, 'r'))
out_file = open(path_to_file2, 'w', newline='')
csv_out = csv.writer(out_file)
use_words = open(path_to_file3, 'r').readlines()
lookup = dict([(word.strip(), None) for word in use_words])  # strip the trailing newlines

for line in csv_in:
    if line[1] in lookup:  # the word is in the second column
        csv_out.writerow(line)

out_file.close()
One of the lesser-known facts about modern computers is that when you delete a line from a text file and save the file, most of the time the editor does this:
load the file into memory
write a temporary file with the rows you want
close the files and move the temp over the original
So you have to load your wordlist:
with open('wordlist.txt') as i:
    wordlist = set(word.strip() for word in i)  # you said the file was small
Then you open the input file:
import csv
import os

with open('input.csv') as i:
    with open('output.csv', 'w', newline='') as o:
        output = csv.writer(o)
        for line in csv.reader(i):   # iterate over the CSV line by line
            if line[1] in wordlist:  # keep the row only if the word (column 2) is in your list
                output.writerow(line)

os.replace('output.csv', 'input.csv')  # move the temp file over the original
This is untested, now go do your homework and comment here if you find any bug... :-)
I would use pandas for this. The data set is not large, so you can do it in memory with no problem.
import pandas as pd

words = pd.read_csv('words.txt', header=None)  # one word per line, no header row
defs = pd.read_csv('defs.csv')

words.set_index(0, inplace=True)
defs.set_index('WORD', inplace=True)

new_defs = words.join(defs)
new_defs.to_csv('new_defs.csv')
You might need to manipulate new_defs to make it look the way you want, but that's the gist of it.
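For instance, a possible cleanup step to restore the ID / WORD / MEANING layout from the question might look like this (a sketch; it assumes words.txt was read with header=None as above, so the word column comes back labelled 0, and that the csv header really uses the names ID and MEANING):
new_defs = new_defs.reset_index()                 # turn the word index back into a column
new_defs = new_defs.rename(columns={0: 'WORD'})   # the label is 0 because words.txt had no header
new_defs = new_defs[['ID', 'WORD', 'MEANING']]    # restore the original column order
new_defs.to_csv('new_defs.csv', index=False)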