I wrote a script that reads data and plots it as graphs. I have three input files:
wells.csv: the list of observation wells that I want to create graphs for
1201
1202
...
well_summary_table.csv: contains information for each well (e.g. reference elevation, depth to water)
Bore_Name Ref_elev
1201 20
data.csv: contains observation data for each well (e.g. pH, Temp)
RowId Bore_Name Depth pH
1 1201 2 7
Not all wells in wells.csv have data to plot.
My script is as follows:
well_name_list = []
new_depth_list =[]
pH_list = []
from pylab import *
infile = open("wells.csv",'r')
for line in infile:
    line=line.strip('\n')
    well=line
    if not well in well_name_list:
        well_name_list.append(well)
infile.close()
for well in well_name_list:
    infile1 = open("well_summary_table.csv",'r')
    infile2 = open("data.csv",'r')
    for line in infile1:
        line = line.rstrip()
        if not line.startswith('Bore_Name'):
            words = line.split(',')
            well_name1 = words[0]
            if well_name1 == well:
                ref_elev = words[1]
    for line in infile2:
        if not line.startswith("RowId"):
            line = line.strip('\n')
            words = line.split(',')
            well_name2 = words[1]
            if well_name2 == well:
                depth = words[2]
                new_depth = float(ref_elev) - float(depth)
                pH = words[3]
                new_depth_list.append(float(new_depth))
                pH_list.append(float(pH))
    fig = plt.figure(figsize = (2,2.7), facecolor='white')
    plt.axis([0,8,0,60])
    plt.plot(pH_list, new_depth_list, linestyle='', marker = 'o')
    plt.savefig(well+'.png')
    new_depth_list = []
    pH_list = []
infile1.close()
infile2.close()
It works on more than half of my well list then it stops without giving me any error message. I don't know what is going on. Can anyone help me with that problem? Sorry if it is an obvious question. I am a newbie.
Many thanks,
@tcaswell spotted a potential issue: you aren't closing infile1 and infile2 each time you open them, so at the very least you'll have a lot of open file handles floating around, depending on how many wells are in wells.csv. In some versions of Python this may cause issues, but it may not be the only problem; it's hard to say without some test data files. There might also be an issue with seeking back to the start of the file when you move on to the next well. Either could produce the behaviour you've been experiencing, but it might also be caused by something else. You can avoid problems like this by using with to manage the scope of your open files.
You should also use a dictionary to marry up the well names with the data, and read all of the data up front before doing your plotting. This will allow you to see exactly how you've constructed your data set and where any issues exist.
I've made a few stylistic suggestions below too. This is obviously incomplete but hopefully you get the idea!
import csv
from pylab import *  # imports should always go before declarations
well_details = {}  # empty dict
with open('wells.csv','r') as well_file:
    well_reader = csv.reader(well_file, delimiter=',')
    for row in well_reader:
        well_name = row[0]
        # dict.has_key() was removed in Python 3; the "in" operator works in both
        if well_name not in well_details:
            well_details[well_name] = {}  # dict to store pH, depth, ref_elev
with open('well_summary_table.csv','r') as elev_file:
    elev_reader = csv.reader(elev_file, delimiter=',')
    for row in elev_reader:
        well_name = row[0]
        if well_name in well_details:
            well_details[well_name]['elev_ref'] = row[1]
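To round out the incomplete snippet, here is one way the remaining steps might look. Treat it as a sketch under assumptions: the files are comma-separated with the header rows shown in the question, and load_wells, add_elevations and add_observations are hypothetical helper names, not anything from the original script. It uses in-memory files so it can be run without the real data.

```python
import csv
import io


def load_wells(well_file):
    """Return {well_name: {'ref_elev': None, 'depth': [], 'pH': []}}."""
    wells = {}
    for row in csv.reader(well_file):
        if row:  # skip blank lines
            wells.setdefault(row[0].strip(),
                             {"ref_elev": None, "depth": [], "pH": []})
    return wells


def add_elevations(wells, summary_file):
    """Attach the reference elevation to every well that is in the list."""
    for row in csv.DictReader(summary_file):
        name = row["Bore_Name"]
        if name in wells:
            wells[name]["ref_elev"] = float(row["Ref_elev"])


def add_observations(wells, data_file):
    """Append converted depth and pH for wells with a known elevation."""
    for row in csv.DictReader(data_file):
        well = wells.get(row["Bore_Name"])
        if well is None or well["ref_elev"] is None:
            continue  # not in the well list, or no reference elevation
        well["depth"].append(well["ref_elev"] - float(row["Depth"]))
        well["pH"].append(float(row["pH"]))


# In-memory stand-ins for the three files described in the question.
wells = load_wells(io.StringIO("1201\n1202\n"))
add_elevations(wells, io.StringIO("Bore_Name,Ref_elev\n1201,20\n"))
add_observations(wells, io.StringIO("RowId,Bore_Name,Depth,pH\n1,1201,2,7\n"))

# Only wells that actually have observations are worth plotting.
plottable = {name: w for name, w in wells.items() if w["pH"]}
print(plottable)
```

Each entry of plottable could then be handed to plt.plot(w['pH'], w['depth'], linestyle='', marker='o'). Calling plt.close() after plt.savefig() keeps figures from piling up in memory across a long well list.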
I'm trying to loop through some unstructured text data in python. End goal is to structure it in a dataframe. For now I'm just trying to get the relevant data in an array and understand the line, readline() functionality in python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.
            2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is permissive about names, and when you write list = [], you're actually rebinding the name list, which shadows the built-in list class and can cause you problems later. You should really treat list, set, and so on like keywords, even though Python technically doesn't, just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
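For comparison, the iterator alternative could be sketched roughly as follows. It sidesteps StopIteration by passing a default to next(), so None marks the end of the input. The function name parse_articles and the choice to join the text lines with spaces are my own assumptions, not part of the question.

```python
import io


def parse_articles(lines):
    """Collect (title, full_text, subject) tuples from an iterable of lines."""
    it = iter(lines)
    articles = []
    line = next(it, None)  # None signals end of input instead of StopIteration
    while line is not None:
        if not line.startswith("Title:"):
            line = next(it, None)  # skip anything before the first title
            continue
        title = line[len("Title:"):].strip()
        text_parts, subject, in_text = [], "", False
        line = next(it, None)
        # Inner loop: consume lines until the next article starts.
        while line is not None and not line.startswith("Title:"):
            if line.startswith("Full text:"):
                text_parts.append(line[len("Full text:"):].strip())
                in_text = True
            elif line.startswith("Subject:"):
                subject = line[len("Subject:"):].strip()
                in_text = False
            elif in_text:
                text_parts.append(line.strip())  # continuation of the full text
            line = next(it, None)
        articles.append((title, " ".join(text_parts), subject))
    return articles


sample = """Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
"""
articles = parse_articles(io.StringIO(sample))
for article in articles:
    print(article)
```

A list of tuples like this can be passed to a DataFrame constructor later; with a real file you would call parse_articles(f) inside the with block.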
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read all file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
Hey I'm writing because I ran into a problem that I can't track down myself.
I'm trying to load some data from a huge CSV file (27.3 GB), which can be found here https://github.com/several27/FakeNewsCorpus, but every time I try to run the code below, I get a KeyError 'content' at row 116454. As far as I understand, this should mean that the 'content' field isn't set in the obj variable, but it should be. The failure happens at the same place on every run.
It doesn't only fail at this row; this is just the first row where it fails. It does work correctly on other rows, since the length of words isn't zero. I have tried raising the maximum CSV field size to 2000000000, since this has also been a problem. I'm running it in a Jupyter notebook; the 'count' variable is only for tracking the error.
Code snippet:
def get_words(text) :
    regex = re.compile(r"\w+\'\w+|\w+|\,|\.")
    return set(re.findall(regex, text))
words = set()
count = 0
with open(source, 'r', encoding='utf-8', newline= '') as articles:
    reader = csv.reader(articles)
    hds = next(reader, None)
    print(hds)
    for row in reader:
        obj = {}
        for hd, val in zip(hds, row):
            obj[hd] = val
        ws, _ = find_urls(lowercase(obj['content']))  # <- error here
        ws = get_words(ws)
        words = words | ws
        count = count + 1
try:
    words.remove('URL')
except:
    pass
The find_urls and lowercase functions just take a string as input and return an altered string. They have been tested.
I'm running this on an Asus laptop with an Intel i7 CPU and 16 GB of RAM, just to mention that too. The hard drive the CSV file is on is a Samsung SSD that is under a year old, so there should not be any faulty pages on it yet. The CSV file contains articles, and the content field should never be empty, since that would be the same as saying that the article has no content.
This is a stab in the dark without getting a look at your data (especially the rows around your infamous #116454), but zip() stops as soon as one of the iterators is exhausted. Try
from itertools import zip_longest
and replace the lines
for hd, val in zip(hds, row):
    obj[hd] = val
with
for hd, val in zip_longest(hds, row, fillvalue=''):
    obj[hd] = val
and see what happens. Also, read the docs.
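To make the difference concrete, here is a tiny illustration. The header list and the short row are made up for the example; I have not looked at the real data, so this only shows the mechanism, not that row 116454 is actually short.

```python
from itertools import zip_longest

headers = ["id", "title", "content"]
short_row = ["116454", "some title"]  # hypothetical row missing its last field

# zip() stops at the shorter input, so the 'content' key is never created.
truncated = dict(zip(headers, short_row))

# zip_longest() pads the shorter input, so every header gets a value.
padded = dict(zip_longest(headers, short_row, fillvalue=""))

print("content" in truncated)
print(repr(padded["content"]))
```

If the rows really are short, csv.DictReader(articles, restval='') is another option, since it fills missing fields with restval. Either way it is worth printing the offending row to see why it has fewer fields than the header.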
I am writing code that reads a large text file line by line and finds the line that starts with UNIQUE-ID (there are many of them in the file) and comes right before a certain line (in this example, the one that starts with 'REACTION-LAYOUT -' and in which the 5th element in the string is OLEANDOMYCIN). The code is the following:
data2 = open('pathways.dat', 'r', errors = 'ignore')
pathways = data2.readlines()
PWY_ID = []
line_cont = []
L_PRMR = [] #Left primary
car = []
#i is the line number (first element of enumerate),
#while line is the line content (2nd elem of enumerate)
for i,line in enumerate(pathways):
    if 'UNIQUE-ID' in line:
        line_cont = line
        PWY_ID_line = line_cont.rstrip()
        PWY_ID_line = PWY_ID_line.split(' ')
        PWY_ID.append(PWY_ID_line[2])
    elif 'REACTION-LAYOUT -' in line:
        L_PWY = line.rstrip()
        L_PWY = L_PWY.split(' ')
        L_PRMR.append(L_PWY[4])
    elif 'OLEANDOMYCIN' in line:
        car.append(PWY_ID)
print(car)
However, the output is instead all the lines that contain PWY_ID (output of the first if statement), like it was ignoring all the rest of the code. Can anybody help?
Edit
Below is a sample of my data (there are like 1000-ish similar "pages" in my textfile):
//
UNIQUE-ID - PWY-741
.
.
.
.
PREDECESSORS - (RXN-663 RXN-662)
REACTION-LAYOUT - (RXN-663 (:LEFT-PRIMARIES CPD-1003) (:DIRECTION :L2R) (:RIGHT-PRIMARIES CPD-1004))
REACTION-LAYOUT - (RXN-662 (:LEFT-PRIMARIES CPD-1002) (:DIRECTION :L2R) (:RIGHT-PRIMARIES CPD-1003))
REACTION-LAYOUT - (RXN-661 (:LEFT-PRIMARIES CPD-1001) (:DIRECTION :L2R) (:RIGHT-PRIMARIES CPD-1002))
REACTION-LIST - RXN-663
REACTION-LIST - RXN-662
REACTION-LIST - RXN-661
SPECIES - TAX-351746
SPECIES - TAX-644631
SPECIES - ORG-6335
SUPER-PATHWAYS - PWY-5266
TAXONOMIC-RANGE - TAX-1224
//
I think it would have been helpful if you'd posted some examples of data. But an approximation to what you're looking for is:
with open('pathways.dat','r', errors='ignore') as infile:
    i = infile.read().find(string_to_search)
    infile.seek(i+number_of_chars_to_read)
I hope this piece of code will help you focus your script on this line.
print(car) prints the whole list built by PWY_ID.append(PWY_ID_line[2]) in the first if, because car.append(PWY_ID) appends the entire PWY_ID list to car. Note also that a REACTION-LAYOUT line containing OLEANDOMYCIN is caught by the second branch, so the third elif never sees it.
So, if you want to print the lines that contain OLEANDOMYCIN, you might want to use car.append(line) instead.
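If the goal is the UNIQUE-ID of each record that has a matching REACTION-LAYOUT line, one option is to remember the most recent ID while scanning. This is a sketch only: pathways_with_compound is a hypothetical name, it assumes the 5th whitespace-separated field is the left primary followed by a closing parenthesis (as in the posted sample), and the second record in the sample below is invented for the demonstration.

```python
def pathways_with_compound(lines, compound):
    """Return UNIQUE-IDs of records whose REACTION-LAYOUT left primary is `compound`."""
    hits = []
    current_id = None  # most recent UNIQUE-ID seen so far
    for line in lines:
        fields = line.split()
        if line.startswith("UNIQUE-ID") and len(fields) >= 3:
            current_id = fields[2]
        elif line.startswith("REACTION-LAYOUT") and len(fields) >= 5:
            left_primary = fields[4].rstrip(")")  # 'CPD-1003)' -> 'CPD-1003'
            if left_primary == compound and current_id not in hits:
                hits.append(current_id)
    return hits


sample = """//
UNIQUE-ID - PWY-741
PREDECESSORS - (RXN-663 RXN-662)
REACTION-LAYOUT - (RXN-663 (:LEFT-PRIMARIES CPD-1003) (:DIRECTION :L2R) (:RIGHT-PRIMARIES CPD-1004))
REACTION-LIST - RXN-663
//
UNIQUE-ID - PWY-999
REACTION-LAYOUT - (RXN-1 (:LEFT-PRIMARIES CPD-1) (:DIRECTION :L2R) (:RIGHT-PRIMARIES CPD-2))
//
""".splitlines()

print(pathways_with_compound(sample, "CPD-1003"))
```

With the real file you would pass the open file object (or the pathways list from readlines()) and "OLEANDOMYCIN" as the compound.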
I am calling on the collective wisdom of Stack Overflow because I am at my wits end trying to figure out how to do this and I'm a newbie self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The files are all formatted in relatively the same way with:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab slates and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre," despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word. (for example the answers posted here how to split single txt file into multiple txt files by Python and here Python read through file until match, read until next pattern). It all seems to not work when I have to adjust it to accept my regex of all capital words.
The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.
import re
thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")
with open (thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line)
totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)","",thefile)
while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
        if pattern.search(line):
            doIHaveToCopyTheLine = True
            ignoreLines = True
    letters += 1
    f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.
I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. This will then write each letter to new files with the same name as the original file (note: it assumes your file has an extension like .txt or such) but with an incremented integer appended. Try it out and see how it works for you.
import string
def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
        else:
            current_letter.append(line)
I tested it on your sample input and it worked fine.
While the other answer is suitable, you may still be curious about using a regex to split up a file.
import re
smallfile = None
buf = ""
with open ('input_file.txt', 'rt') as f:
    for line in f:
        buf += str(line)
        # a line that starts with capitals is treated as the signature that ends a letter
        if re.search(r'^([A-Z\s\.]+\b)' , line) is not None:
            if smallfile:
                smallfile.close()
            match = re.findall(r'^([A-Z\s\.]+\b)' , line)
            smallfile_name = '{}.txt'.format(match[0])
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""
    if smallfile:
        smallfile.close()
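If you want to check the grouping before any files are written, the same idea can be tried in memory first. The sketch below is my own variation, not the code above: group_letters is a hypothetical name, and the pattern assumes a signature starts with at least two all-caps words, optionally with initials such as "L." in between. A paragraph that happens to begin with two all-caps words would be misread as a signature.

```python
import re

# Assumed signature shape, e.g. "MARJORIE L. SCHWARTZ Los Angeles".
SIGNATURE = re.compile(r"^[A-Z]{2,}(?:\s+[A-Z.]+)*\s+[A-Z]{2,}\b")


def group_letters(lines):
    """Group lines into letters; each letter ends with its signature line."""
    letters, current = [], []
    for line in lines:
        current.append(line)
        if SIGNATURE.match(line):
            letters.append(current)
            current = []
    if current:  # trailing text that has no signature
        letters.append(current)
    return letters


sample = [
    "For once, before offering such generous but the unasked for advice...\n",
    "PAUL STONEHILL Los Angeles\n",
    "There you go again. Your editorial again makes groundless criticisms...\n",
    "On Dec. 7 you called proportional representation bizarre...\n",
    "MATTHEW SHUGART Laguna Beach\n",
    "If West Los Angeles is any indication, where do we go from here?\n",
    "MARJORIE L. SCHWARTZ Los Angeles\n",
]
letters = group_letters(sample)
for number, letter in enumerate(letters, start=1):
    print(number, len(letter), letter[-1].strip())
```

Once the grouping looks right, each inner list can be written out with writelines() to its own numbered file.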
If you run on Linux, use csplit.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?
I wrote a script to transform a large 4MB textfile with 40k+ lines of unordered data to a specifically formatted and easier to deal with CSV file.
Problem:
Analyzing my file sizes, it appears I've lost over 1MB of data (20K lines | edit: the original file was 7MB, so I lost ~4MB of data), and when I search sorted_CSV.csv for specific data points that are present in CommaOnly.txt, I cannot find them.
I found this really weird.
What I tried:
I searched for and replaced all unicode chars present in the CommaOnly.txt that might be causing a problem.. No luck!
Example: \u0b99 replaced with " "
Here's an example of some data loss
A line from: CommaOnly.txt
name,SJ Photography,category,Professional Services,
state,none,city,none,country,none,about,
Capturing intimate & milestone moment from pregnancy and family portraits to weddings
Sorted_CSV.csv
Not present.
What could be causing this?
Code:
import re
import csv
import time
# Final Sorted Order for all data:
#['name', 'data',
# 'category','data',
# 'about', 'data',
# 'country', 'data',
# 'state', 'data',
# 'city', 'data']
## Receives a string item, splits it on the "," delimiter, and returns the split list
def split_values(string):
    string = string.strip('\n')
    split_string = re.split(',', string)
    return split_string
## Iterates through the list, reorganizes terms in the desired order at the desired indices
## Adds the field if it does not initially exist
def reformo_sort(list_to_sort):
    processed_values=[""]*12
    for i in range(11):
        try:
            ## Terrible code I know, but trying to be explicit for the question
            if(i==0):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="name"):
                        processed_values[0]=(list_to_sort[j])
                        processed_values[1]=(list_to_sort[j+1])
                        ## append its neighbour
                ## if after iterating, name does not appear, add it.
                if(processed_values[0] != "name"):
                    processed_values[0]="name"
                    processed_values[1]="None"
            elif(i==2):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="category"):
                        processed_values[2]=(list_to_sort[j])
                        processed_values[3]=(list_to_sort[j+1])
                if(processed_values[2] != "category"):
                    processed_values[2]="category"
                    processed_values[3]="None"
            elif(i==4):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="about"):
                        processed_values[4]=(list_to_sort[j])
                        processed_values[5]=(list_to_sort[j+1])
                if(processed_values[4] != "about"):
                    processed_values[4]="about"
                    processed_values[5]="None"
            elif(i==6):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="country"):
                        processed_values[6]=(list_to_sort[j])
                        processed_values[7]=(list_to_sort[j+1])
                if(processed_values[6]!= "country"):
                    processed_values[6]="country"
                    processed_values[7]="None"
            elif(i==8):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="state"):
                        processed_values[8]=(list_to_sort[j])
                        processed_values[9]=(list_to_sort[j+1])
                if(processed_values[8] != "state"):
                    processed_values[8]="state"
                    processed_values[9]="None"
            elif(i==10):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="city"):
                        processed_values[10]=(list_to_sort[j])
                        processed_values[11]=(list_to_sort[j+1])
                if(processed_values[10] != "city"):
                    processed_values[10]="city"
                    processed_values[11]="None"
        except:
            print("failed to append!")
    return processed_values
# Converts desired data fields to a string, delimiting values by ','
def to_CSV(values_to_convert):
    CSV_ENTRY=str(values_to_convert[1])+','+str(values_to_convert[3])+','+str(values_to_convert[5])+','+str(values_to_convert[7])+','+str(values_to_convert[9])+','+str(values_to_convert[11])
    return CSV_ENTRY
with open("CommaOnly.txt", 'r') as c:
    print("Starting.. :)")
    for line in c:
        entry = c.readline()
        to_sort = split_values(entry)
        now_sorted = reformo_sort(to_sort)
        CSV_ROW=to_CSV(now_sorted)
        with open("sorted_CSV.csv", "a+") as file:
            file.write(str(CSV_ROW)+"\n")
print("Finished! :)")
time.sleep(60)
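I've rewritten the main loop, which seems fishy to me, using the csv package. In particular, for line in c already consumes one line per iteration and c.readline() then reads the next one, so on Python 3 only every other line is processed, which would explain losing roughly half of your rows.
Your reformo_sort routine is incomplete and syntactically incorrect, with empty elif blocks and missing processing, so I got incomplete lines, but this should work much better than your code. Note the use of csv, the "binary" flag (this is the Python 2 idiom; on Python 3, open the files in text mode with newline='' instead), the single open in write mode instead of an open/close for each line (much faster), and the 1-out-of-2 filtering of the now_sorted array.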
with open("CommaOnly.txt", 'rb') as c:
    print("Starting.. :)")
    cr = csv.reader(c,delimiter=",",quotechar='"')
    with open("sorted_CSV.csv", "wb") as fw:
        cw = csv.writer(fw,delimiter=",",quotechar='"')
        for to_sort in cr:
            now_sorted = reformo_sort(to_sort)
            cw.writerow(now_sorted[1::2])
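For Python 3, a compact variant of the same idea might look like the sketch below. It replaces reformo_sort with a dictionary lookup, which is my own simplification; reorder_row, convert and FIELD_ORDER are hypothetical names, and the second sample row is invented. It assumes each line alternates key, value, key, value, and that values contain no unquoted commas; an "about" text with commas in it would shift the pairs.

```python
import csv
import io

FIELD_ORDER = ["name", "category", "about", "country", "state", "city"]


def reorder_row(fields):
    """Turn ['name', 'X', 'city', 'Y', ...] into values ordered as FIELD_ORDER."""
    pairs = dict(zip(fields[0::2], fields[1::2]))  # key, value, key, value, ...
    return [pairs.get(key, "None") for key in FIELD_ORDER]


def convert(infile, outfile):
    """Read every input row exactly once and write one reordered row for each."""
    writer = csv.writer(outfile)
    count = 0
    for fields in csv.reader(infile):
        if fields:  # skip blank lines
            writer.writerow(reorder_row(fields))
            count += 1
    return count


# In-memory stand-ins for CommaOnly.txt and sorted_CSV.csv.
source = io.StringIO(
    "name,SJ Photography,category,Professional Services,"
    "state,none,city,none,country,none,about,Capturing moments\n"
    "category,Food,name,Cafe Example\n"
)
target = io.StringIO()
rows_written = convert(source, target)
print(rows_written)
print(target.getvalue())
```

With real files, open both with newline='' (for example open("CommaOnly.txt", newline='') and open("sorted_CSV.csv", "w", newline='')) as the csv module documentation recommends. Comparing rows_written with the number of input lines is a quick check that nothing was dropped.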