Hey, I'm writing because I ran into a problem that I can't track down myself.
I'm trying to load data from a huge CSV file (27.3 GB), which can be found here: https://github.com/several27/FakeNewsCorpus. Every time I try to run the code below, I get a KeyError 'content' at row 116454. As far as I understand, this should mean the 'content' field isn't set in the obj variable, but it should be. The row where the fault happens is consistent across runs.
It doesn't only fail at this row, this is just the first row where it fails. It does work correctly on other rows, since the length of words isn't zero. I have tried to raise the maximum CSV field size to 2000000000, since that has also been a problem. I'm running it in a Jupyter notebook; the 'count' variable is only for tracking the error.
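(I raised the field limit with something along these lines, including it here in case it matters:)
import csv

# raise the maximum allowed field size so very long 'content' fields can be read
csv.field_size_limit(2000000000)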
Code snippet:
def get_words(text):
    regex = re.compile(r"\w+\'\w+|\w+|\,|\.")
    return set(re.findall(regex, text))

words = set()
count = 0

with open(source, 'r', encoding='utf-8', newline='') as articles:
    reader = csv.reader(articles)
    hds = next(reader, None)
    print(hds)
    for row in reader:
        obj = {}
        for hd, val in zip(hds, row):
            obj[hd] = val
        ws, _ = find_urls(lowercase(obj['content']))  # <- error here
        ws = get_words(ws)
        words = words | ws
        count = count + 1

try:
    words.remove('URL')
except:
    pass
The find_urls and lowercase functions just take a string as input and return an altered string; they have been tested.
I'm running this on an ASUS laptop with an Intel i7 CPU and 16 GB RAM, just to mention that too. The hard drive the CSV file is on is a Samsung SSD, and it is under a year old, so there should not be any faulty pages on it yet. The CSV file contains articles, and the content field should never be empty, since that would be the same as saying the article has no content.
This is a stab in the dark without getting a look at your data (especially the rows around your infamous #116454), but zip() stops as soon as one of the iterators is exhausted. Try
from itertools import zip_longest
and replace the lines
for hd, val in zip(hds, row):
    obj[hd] = val
with
for hd, val in zip_longest(hds, row, fillvalue=''):
    obj[hd] = val
and see what happens. Also, read the docs.
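To see the difference at a glance, here's a tiny made-up example (the header and row values are invented, not taken from your file):
from itertools import zip_longest

# Hypothetical header and a short row (e.g. a record with a missing last field)
hds = ['id', 'title', 'content']
row = ['42', 'Some title']

print(dict(zip(hds, row)))
# {'id': '42', 'title': 'Some title'}   <- 'content' silently dropped
print(dict(zip_longest(hds, row, fillvalue='')))
# {'id': '42', 'title': 'Some title', 'content': ''}
If row 116454 really is short (or got split oddly), zip_longest at least keeps the 'content' key so you can inspect the value.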
Related
I have a file containing the DBLP dataset, which consists of bibliographic data in computer science. I want to delete some of the records with missing information. For example, I want to delete records with a missing venue. In this dataset, the venue is prefixed with '#c'.
In this code, I am splitting documents on the title marker of manuscripts ("#*"). Now, I am trying to delete records without a venue name.
Input Data:
#*Toward Connectionist Parsing.
##Steven L. Small,Garrison W. Cottrell,Lokendra Shastri
#t1982
#c
#index14997
#*A Framework for Reinforcement Learning on Real Robots.
##William D. Smart,Leslie Pack Kaelbling
#t1998
#cAAAI/IAAI
#index14998
#*Efficient Goal-Directed Exploration.
##Yury V. Smirnov,Sven Koenig,Manuela M. Veloso,Reid G. Simmons
#t1996
#cAAAI/IAAI, Vol. 1
#index14999
My code:
inFile = open('lorem.txt','r')
Data = inFile.read()
data = Data.split("#*")
ouFile = open('testdata.txt','w')
for idx, word in enumerate(data):
    print("i = ", idx)
    if not('#!' in data[idx]):
        del data[idx]
        idx = idx - 1
    else:
        ouFile.write("#*" + data[idx])
ouFile.close()
inFile.close()
Expected Output:
#*A Framework for Reinforcement Learning on Real Robots.
##William D. Smart,Leslie Pack Kaelbling
#t1998
#cAAAI/IAAI
#index14998
#*Efficient Goal-Directed Exploration.
##Yury V. Smirnov,Sven Koenig,Manuela M. Veloso,Reid G. Simmons
#t1996
#cAAAI/IAAI, Vol. 1
#index14999
Actual Output:
An empty output file
str.find will give you the index of a sub-string, or -1 if the sub-string does not exist.
DOCUMENT_SEP = '#*'

with open('lorem.txt') as in_file:
    documents = in_file.read().split(DOCUMENT_SEP)

with open('testdata.txt', 'w') as out_file:
    for document in documents:
        i = document.find('#c')
        if i < 0:  # no "#c"
            continue
        # "#c" exists, but no trailing venue information
        if not document[i+2:i+3].strip():
            continue
        out_file.write(DOCUMENT_SEP)
        out_file.write(document)
Instead of closing manually, I used a with statement.
There is no need to use an index; deleting an item in the middle of a loop makes the index bookkeeping complicated.
Using a regular expression like #c[A-Z].. would make the code even simpler.
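A rough sketch of that regular-expression idea (I'm using #c\S here, i.e. '#c' followed by any non-whitespace character, rather than #c[A-Z], so venues that don't start with a capital letter are also caught; adjust to taste):
import re

DOCUMENT_SEP = '#*'
HAS_VENUE = re.compile(r'#c\S')  # '#c' immediately followed by a non-whitespace character

with open('lorem.txt') as in_file, open('testdata.txt', 'w') as out_file:
    for document in in_file.read().split(DOCUMENT_SEP):
        if HAS_VENUE.search(document):            # keep only records with a venue
            out_file.write(DOCUMENT_SEP + document)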
The reason your code wasn't working is that there's no #! in any of your entries.
If you want to exclude entries with empty #c fields, you can try this:
inFile = open('lorem.txt','r')
Data = inFile.read()
data = Data.split("#*")
ouFile = open('testdata.txt','w')
for idx, word in enumerate(data):
    print("i = ", idx)
    if not '#c\n' in data[idx] and len(word) > 0:
        ouFile.write("#*" + data[idx])
ouFile.close()
inFile.close()
In general, try not to delete elements of a list you're looping through. It can cause a lot of unexpected drama.
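For example, you could filter into a new list instead of calling del inside the loop (the records here are made up):
# Made-up records: one with a venue, one with an empty '#c' line, one empty string
data = ['##Some Author\n#t1998\n#cAAAI/IAAI\n#index1\n',
        '##Another Author\n#t1982\n#c\n#index2\n',
        '']

# Build a new list instead of deleting from `data` while iterating over it
kept = [entry for entry in data if '#c\n' not in entry and len(entry) > 0]
# `kept` now holds only the first record; `data` is left untouched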
I'm new to Python, so I would be thankful for any help...
My problem is the following:
I wrote a program in Python analysing gene sequences from a huge database (more than 600 genes). With the help of the write() function, the program should write the results to a text file, one result per gene. Opening my output file, there are only the first genes, followed by "...", followed by the last gene.
Is there a maximum this function can process? How do I make Python write all results?
Relevant part of the code:
fasta_df3 = pd.read_table(fasta_out3, delim_whitespace=True,
                          names=('qseqid', 'sseqid', 'evalue', 'pident'))
fasta_df3_sorted = fasta_df3.sort_values(by='qseqid', ascending=True)
fasta_df3_grouped = fasta_df3_sorted.groupby('qseqid')

for qseqid, fasta_df3_sorted in fasta_df3_grouped:
    subj3_pident_max = str(fasta_df3_grouped['pident'].max())
    subj3_pident_min = str(fasta_df3_grouped['pident'].min())
    current_gene = str(qseqid)
    with open(dir_output+outputall_file+".txt", "a") as gene_list:
        gene_list.write("\n"+"subj3: {} \t {} \t {}".format(current_gene,
                        subj3_pident_max, subj3_pident_min))
    gene_list.close()
I have genomic data from 16 nuclei. The first column represents the nucleus, the next two columns represent the scaffold (section of genome) and the position on the scaffold respectively, and the last two columns represent the nucleotide and coverage respectively. The same scaffold and position can appear in different nuclei.
Using input for the start and end positions (scaffold and position of each), I'm supposed to output a CSV file which shows the data (nucleotide and coverage) of each nucleus within the range from start to end. I was thinking of doing this by having 16 columns (one for each nucleus) and then showing the data from top to bottom. The leftmost column would be the reference genome in that range, which I access by creating a dictionary for each of its scaffolds.
In my code, I have a defaultdict of lists, so the key is a string which combines the scaffold and the location, while the data is an array of lists, so that for each nucleus, the data can be appended to the same location, and in the end each location has data from every nucleus.
Of course, this is very slow. How should I be doing it instead?
Code:
#let's plan this
#input is start and finish - when you hit first, add it and keep going until you hit next or larger
#dictionary of arrays
#loop through everything, output data for each nucleus
import csv
from collections import defaultdict

inrange = 0
start = 'scaffold_41,51335'
end = 'scaffold_41,51457'
locations = defaultdict(list)
count = 0
genome = defaultdict(lambda: defaultdict(dict))
scaffold = ''

for line in open('Allpaths_SL1_corrected.fasta', 'r'):
    if line[0] == '>':
        scaffold = line[1:].rstrip()
    else:
        genome[scaffold] = line.rstrip()
print('Genome dictionary done.')

with open('automated.csv', 'rt') as read:
    for line in csv.reader(read, delimiter=','):
        if line[1] + ',' + line[2] == start:
            inrange = 1
        if inrange == 1:
            locations[line[1] + ',' + line[2]].append([line[3], line[4]])
        if line[1] + ',' + line[2] == end:
            inrange = 0
        count += 1
        if count % 1000000 == 0:
            print('Checkpoint ' + str(count) + '!')

with open('region.csv', 'w') as fp:
    wr = csv.writer(fp, delimiter=',', lineterminator='\n')
    for key in locations:
        nuclei = []
        for i in range(0, 16):
            try:
                nuclei.append(locations[key][i])
            except IndexError:
                nuclei.append(['', ''])
        wr.writerow([genome[key[0:key.index(',')]][int(key[key.index(',')+1:])-1], key, nuclei])
print('Done!')
Files:
https://drive.google.com/file/d/0Bz7WGValdVR-bTdOcmdfRXpUYUE/view?usp=sharing
https://drive.google.com/file/d/0Bz7WGValdVR-aFdVVUtTbnI2WHM/view?usp=sharing
(Only focusing on the CSV section in the middle of your code)
The example csv file you supplied is over 2 GB and has 77,822,354 lines. Of those lines, you seem to be interested in only 26,804,253, or about 1/3.
As a general suggestion, you can speed things up by:
Avoid processing the data you are not interested in (2/3 of the file);
Speed up identifying the data you are interested in;
Avoid things that are repeated millions of times and tend to be slow (parsing each line as csv, reassembling a string, etc);
Avoid reading all data when you can break it up into blocks or lines (memory will get tight)
Use faster tools like numpy, pandas and pypy
Your data is block oriented, so you can use a FlipFlop type object to sense whether you are in a block or not.
The first column of your csv is numeric, so rather than splitting the line apart and reassembling two columns, you can use the faster Python in operator to find the start and end of the blocks:
start = ',scaffold_41,51335,'
end = ',scaffold_41,51457,'

class FlipFlop:
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False

    def __call__(self, st):
        rtr = True if self.state else False
        if self.patterns[self.state] in st:
            self.state = not self.state
        return self.state or rtr

lines_in_block = 0
with open('automated.csv') as f:
    ff = FlipFlop(start, end)
    for lc, line in enumerate(f):
        if ff(line):
            lines_in_block += 1

print lines_in_block, lc
Prints:
26804256 77822354
That runs in about 9 seconds in PyPy and 46 seconds in Python 2.7.
You can then take the portion that reads the source csv file and turn that into a generator so you only need to deal with one block of data at a time.
(Certainly not correct, since I spent no time trying to understand your files overall..):
def csv_bloc(fn, start_pat, end_pat):
    from itertools import ifilter
    with open(fn) as csv_f:
        ff = FlipFlop(start_pat, end_pat)
        for block in ifilter(ff, csv_f):
            yield block
Or, if you need to combine all the blocks into one dict:
def csv_line(fn, start, end):
    with open(fn) as csv_in:
        ff = FlipFlop(start, end)
        for line in csv_in:
            if ff(line):
                yield line.rstrip().split(",")

di = {}
for row in csv_line('/tmp/automated.csv', start, end):
    di.setdefault((row[2], row[3]), []).append([row[3], row[4]])
That executes in about 1 minute on my (oldish) Mac in PyPy and about 3 minutes in cPython 2.7.
I wrote a script to transform a large 4 MB text file with 40k+ lines of unordered data into a specifically formatted, easier-to-work-with CSV file.
Problem:
Analyzing my file sizes, it appears I've lost over 1 MB of data (20k lines | edit: the original file was 7 MB, so I lost ~4 MB of data), and when I search for specific data points that are present in CommaOnly.txt, I cannot find them in sorted_CSV.csv.
I find this really weird.
What I tried:
I searched for and replaced all Unicode characters present in CommaOnly.txt that might be causing a problem. No luck!
Example: \u0b99 replaced with " "
Here's an example of some data loss
A line from: CommaOnly.txt
name,SJ Photography,category,Professional Services,
state,none,city,none,country,none,about,
Capturing intimate & milestone moment from pregnancy and family portraits to weddings
Sorted_CSV.csv
Not present.
What could be causing this?
Code:
import re
import csv
import time

# Final Sorted Order for all data:
# ['name', 'data',
#  'category', 'data',
#  'about', 'data',
#  'country', 'data',
#  'state', 'data',
#  'city', 'data']

## Receives a string item, splits on the "," delimiter and returns the split list
def split_values(string):
    string = string.strip('\n')
    split_string = re.split(',', string)
    return split_string

## Iterates through the list, reorganizes terms in the desired order at the desired indices
## Adds the field if it is not initially present
def reformo_sort(list_to_sort):
    processed_values=[""]*12
    for i in range(11):
        try:
            ## Terrible code I know, but trying to be explicit for the question
            if(i==0):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="name"):
                        processed_values[0]=(list_to_sort[j])
                        processed_values[1]=(list_to_sort[j+1])
                        ## append its neighbour
                ## if after iterating, name does not appear, add it.
                if(processed_values[0] != "name"):
                    processed_values[0]="name"
                    processed_values[1]="None"
            elif(i==2):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="category"):
                        processed_values[2]=(list_to_sort[j])
                        processed_values[3]=(list_to_sort[j+1])
                if(processed_values[2] != "category"):
                    processed_values[2]="category"
                    processed_values[3]="None"
            elif(i==4):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="about"):
                        processed_values[4]=(list_to_sort[j])
                        processed_values[5]=(list_to_sort[j+1])
                if(processed_values[4] != "about"):
                    processed_values[4]="about"
                    processed_values[5]="None"
            elif(i==6):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="country"):
                        processed_values[6]=(list_to_sort[j])
                        processed_values[7]=(list_to_sort[j+1])
                if(processed_values[6]!= "country"):
                    processed_values[6]="country"
                    processed_values[7]="None"
            elif(i==8):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="state"):
                        processed_values[8]=(list_to_sort[j])
                        processed_values[9]=(list_to_sort[j+1])
                if(processed_values[8] != "state"):
                    processed_values[8]="state"
                    processed_values[9]="None"
            elif(i==10):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="city"):
                        processed_values[10]=(list_to_sort[j])
                        processed_values[11]=(list_to_sort[j+1])
                if(processed_values[10] != "city"):
                    processed_values[10]="city"
                    processed_values[11]="None"
        except:
            print("failed to append!")
    return processed_values

# Converts the desired data fields to a string, delimiting values with ','
def to_CSV(values_to_convert):
    CSV_ENTRY=str(values_to_convert[1])+','+str(values_to_convert[3])+','+str(values_to_convert[5])+','+str(values_to_convert[7])+','+str(values_to_convert[9])+','+str(values_to_convert[11])
    return CSV_ENTRY

with open("CommaOnly.txt", 'r') as c:
    print("Starting.. :)")
    for line in c:
        entry = c.readline()
        to_sort = split_values(entry)
        now_sorted = reformo_sort(to_sort)
        CSV_ROW=to_CSV(now_sorted)
        with open("sorted_CSV.csv", "a+") as file:
            file.write(str(CSV_ROW)+"\n")
print("Finished! :)")
time.sleep(60)
I've rewritten the main loop, which seems fishy to me, using the csv package.
Your reformo_sort routine is incomplete and syntactically incorrect, with empty elif blocks and missing processing, so I got incomplete lines, but this should still work much better than your code. Note the use of csv, the "binary" flag, the single open in write mode instead of opening/closing for each line (much faster), and the 1-out-of-2 filtering of the now_sorted array.
with open("CommaOnly.txt", 'rb') as c:
print("Starting.. :)")
cr = csv.reader(c,delimiter=",",quotechar='"')
with open("sorted_CSV.csv", "wb") as fw:
cw = csv.writer(fw,delimiter=",",quotechar='"')
for to_sort in cr:
now_sorted = reformo_sort(to_sort)
cw.writerow(now_sorted[1::2])
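As a quick illustration of that last point, now_sorted[1::2] takes every second element starting at index 1, i.e. the values without their field names (the values below are just an example):
now_sorted = ['name', 'SJ Photography', 'category', 'Professional Services',
              'about', 'None', 'country', 'none', 'state', 'none', 'city', 'none']
print(now_sorted[1::2])
# ['SJ Photography', 'Professional Services', 'None', 'none', 'none', 'none']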
I wrote a script to read data and plot it into graphs. I have three input files:
wells.csv: a list of observation wells that I want to create graphs for
1201
1202
...
well_summary_table.csv: contains information for each well (e.g. reference elevation, depth to water)
Bore_Name Ref_elev
1201 20
data.csv: contains observation data for each well (e.g. pH, Temp)
RowId Bore_Name Depth pH
1 1201 2 7
Not all wells in wells.csv have data to plot.
My script is as follows:
well_name_list = []
new_depth_list = []
pH_list = []

from pylab import *

infile = open("wells.csv", 'r')
for line in infile:
    line = line.strip('\n')
    well = line
    if not well in well_name_list:
        well_name_list.append(well)
infile.close()

for well in well_name_list:
    infile1 = open("well_summary_table.csv", 'r')
    infile2 = open("data.csv", 'r')
    for line in infile1:
        line = line.rstrip()
        if not line.startswith('Bore_Name'):
            words = line.split(',')
            well_name1 = words[0]
            if well_name1 == well:
                ref_elev = words[1]
                for line in infile2:
                    if not line.startswith("RowId"):
                        line = line.strip('\n')
                        words = line.split(',')
                        well_name2 = words[1]
                        if well_name2 == well:
                            depth = words[2]
                            new_depth = float(ref_elev) - float(depth)
                            pH = words[3]
                            new_depth_list.append(float(new_depth))
                            pH_list.append(float(pH))
    fig = plt.figure(figsize=(2, 2.7), facecolor='white')
    plt.axis([0, 8, 0, 60])
    plt.plot(pH_list, new_depth_list, linestyle='', marker='o')
    plt.savefig(well + '.png')
    new_depth_list = []
    pH_list = []
infile1.close()
infile2.close()
It works on more than half of my well list, then it stops without giving me any error message. I don't know what is going on. Can anyone help me with this problem? Sorry if it is an obvious question, I am a newbie.
Many thanks,
#tcaswell spotted a potential issue: you aren't closing infile1 and infile2 after each time you open them, so at the very least you'll have a lot of open file handles floating around, depending on how many wells there are in wells.csv. In some versions of Python this may cause issues, but it may not be the only problem; it's hard to say without some test data files. There might also be an issue with seeking to the start of the file, i.e. going back to the beginning when you move on to the next well. This could cause the behaviour you've been seeing, but it might also be caused by something else. You should avoid problems like this by using with to manage the scope of your open files.
You should also use a dictionary to marry up the well names with the data, and read all of the data up front before doing your plotting. This will let you see exactly how you've constructed your data set and where any issues are.
I've made a few stylistic suggestions below too. This is obviously incomplete, but hopefully you get the idea!
import csv
from pylab import *  # imports should always go before declarations

well_details = {}  # empty dict
with open('wells.csv', 'r') as well_file:
    well_reader = csv.reader(well_file, delimiter=',')
    for row in well_reader:
        well_name = row[0]
        if not well_details.has_key(well_name):
            well_details[well_name] = {}  # dict to store pH, depth, ref_elev

with open('well_summary_table.csv', 'r') as elev_file:
    elev_reader = csv.reader(elev_file, delimiter=',')
    for row in elev_reader:
        well_name = row[0]
        if well_details.has_key(well_name):
            well_details[well_name]['elev_ref'] = row[1]
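A possible continuation of the same idea, purely as a sketch in the same Python 2 style and assuming data.csv really has the RowId, Bore_Name, Depth and pH columns shown in your question, would be to pull data.csv into the same dictionary before plotting anything:
with open('data.csv', 'r') as data_file:
    data_reader = csv.reader(data_file, delimiter=',')
    next(data_reader)  # skip the "RowId,Bore_Name,Depth,pH" header row
    for row in data_reader:
        well_name = row[1]
        if well_details.has_key(well_name) and 'elev_ref' in well_details[well_name]:
            well_details[well_name].setdefault('depth', []).append(float(row[2]))
            well_details[well_name].setdefault('pH', []).append(float(row[3]))

# At this point every well's data lives in one place, so the plotting loop can
# simply iterate over well_details and skip wells that have no observations.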