I'm fairly new to Python and I've written a scraper that prints the data I scrape exactly the way I need it, but I'm having trouble writing the data to a file. I need it to look exactly the same and be in the same order as it does when it prints in IDLE.
import requests
import re
from bs4 import BeautifulSoup
year_entry = raw_input("Enter year: ")
week_entry = raw_input("Enter week number: ")
week_link = requests.get("http://sports.yahoo.com/nfl/scoreboard/?week=" + week_entry + "&phase=2&season=" + year_entry)
page_content = BeautifulSoup(week_link.content)
a_links = page_content.find_all('tr', {'class': 'game link'})
for link in a_links:
    r = 'http://www.sports.yahoo.com' + str(link.attrs['data-url'])
    r_get = requests.get(r)
    soup = BeautifulSoup(r_get.content)
    stats = soup.find_all("td", {'class':'stat-value'})
    teams = soup.find_all("th", {'class':'stat-value'})
    scores = soup.find_all('dd', {"class": 'score'})
    try:
        game_score = scores[-1]
        game_score = game_score.text
        x = game_score.split(" ")
        away_score = x[1]
        home_score = x[4]
        home_team = teams[1]
        away_team = teams[0]
        away_team_stats = stats[0::2]
        home_team_stats = stats[1::2]
        print away_team.text + ',' + away_score + ',',
        for stats in away_team_stats:
            print stats.text + ',',
        print '\n'
        print home_team.text + ',' + home_score +',',
        for stats in home_team_stats:
            print stats.text + ',',
        print '\n'
    except:
        pass
I am totally confused on how to get this to print to a txt file the same way it prints in IDLE. The code is built to only run on completed weeks of the NFL season. So if you test the code, I recommend year = 2014 and week = 12 (or before)
Thanks,
JT
To write to a file you need to build up the line as a string, then write that line to a file.
You'd use something like:
# Open/create a file for your output
with open('my_output_file.csv', 'wb') as csv_out:
    ...
    # Your BeautifulSoup code and parsing goes here
    ...
    # Then build up your output strings
    for link in a_links:
        # Start each line with the team name and score
        away_line = ",".join([away_team.text, away_score])
        for stat in away_team_stats:
            # Append each stat as another comma-separated field
            away_line += "," + stat.text
        home_line = ",".join([home_team.text, home_score])
        for stat in home_team_stats:
            home_line += "," + stat.text
        # Write your output strings to the file
        csv_out.write(away_line + '\n')
        csv_out.write(home_line + '\n')
This is a quick and dirty fix. To do it properly you probably want to look into the csv module (docs)
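As a rough illustration of the csv module route, here is a minimal sketch. It is written for Python 3 (unlike the Python 2 code in this thread), and the helper name write_team_rows and the sample values are made up for the example; in the scraper, the names, scores and stats would come from the .text of the parsed tags.

```python
import csv
import io


def write_team_rows(out_file, games):
    """Write one CSV row per team: name, score, then each stat value.

    `games` is a hypothetical list of (team_name, score, stats) tuples,
    standing in for the values the scraper pulls out of each page.
    """
    writer = csv.writer(out_file, lineterminator="\n")
    for team_name, score, stats in games:
        writer.writerow([team_name, score] + list(stats))


# Made-up sample data, only to show the shape of the output.
sample_games = [
    ("Away Team", "17", ["20", "310"]),
    ("Home Team", "24", ["22", "355"]),
]

# StringIO stands in for open("my_output_file.csv", "w", newline="")
buffer = io.StringIO()
write_team_rows(buffer, sample_games)
csv_text = buffer.getvalue()
```

The writer takes care of the commas (and of quoting any field that itself contains a comma), so the lines no longer have to be assembled by hand.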
From the structure of your output I agree with Jamie that using CSV is a logical choice.
But since you're using Python 2, it's possible to use an alternate form of the print statement to print to a file.
From https://docs.python.org/2/reference/simple_stmts.html#the-print-statement
print also has an extended form, defined by the second portion of the
syntax described above. This form is sometimes referred to as “print
chevron.” In this form, the first expression after the >> must
evaluate to a “file-like” object, specifically an object that has a
write() method as described above. With this extended form, the
subsequent expressions are printed to this file object. If the first
expression evaluates to None, then sys.stdout is used as the file for
output.
Eg,
outfile = open("myfile.txt", "w")
print >>outfile, "Hello, world"
outfile.close()
However, this syntax is not supported in Python 3, so I guess it's probably not a good idea to use it. :) FWIW, I generally use the file write() method in my code when writing to files, except that I tend to use print >>sys.stderr for error messages.
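For comparison, in Python 3 the same effect comes from the file argument of the print() function. A small sketch, using an in-memory buffer in place of a real file so it can be run anywhere:

```python
import io

# StringIO stands in for open("myfile.txt", "w")
out = io.StringIO()

# Python 3 counterpart of: print >>outfile, "Hello, world"
print("Hello, world", file=out)

# end=" " roughly mirrors a Python 2 trailing comma: the next print
# continues on the same line instead of starting a new one.
print("a,", end=" ", file=out)
print("b", file=out)

written = out.getvalue()
```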
I have tried moving around the strings and variables I am concatenating, using while loops, moving the line where I open the outfile and changing the method I use to open it, etc. No matter what I do, my output prints/writes "curl" + my url variable and then ends in "...", for example: curl "https://examplesite/...
Does this have something to do with a buffer or slicing problem? Thank you for any and all help. Full code below.
import pandas as pd
# file = open("output.txt","wt")
header_list = ["COLA", "COLB"]
df = pd.read_csv("curl_data.csv", names=header_list)
df_length = len(df)
iterator = 0
with open("output.txt", "w") as file:
    for row in df.iterrows():
        url = '"https://examplesite'
        lic = df.COLA # use %20 instead of spaces
        name = df.COLB # use %20 instead of spaces
        group = "example group" # use %20 instead of spaces
        command = "curl " + url + "license=" + lic + "&name=" + name + "&group=" + group + '"'
        print(command)
        file.write(str(command))
        iterator += 1
        if iterator == 1:
            break
file.close()
Solved. As Imre Kerr suggested in the comments, the problem was with the length of the output.
I changed my for loop to for i in range(len(df)): so that it only loops through the dataframe once (as per Barmar's suggestion), and changed the column references in my code from df.COLA to df.loc[i, "COLA"] so that it does not print the whole dataset every time. This fixed the problem of the lines being too long, and I was then able to see the full line for each output string.
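The underlying idea is to build one command per row from that row's own values rather than from a whole column. A minimal sketch of that idea follows; note that it swaps pandas for the standard csv module so it runs without extra packages, and that build_commands, the sample rows and the base URL are placeholders (the real endpoint is not given in the question).

```python
import csv
import io
from urllib.parse import quote


def build_commands(rows, base_url, group):
    """Return one curl command string per (license, name) row."""
    commands = []
    for lic, name in rows:
        # quote() turns spaces into %20, as the original comments ask for
        query = "license={}&name={}&group={}".format(
            quote(lic), quote(name), quote(group)
        )
        commands.append('curl "{}{}"'.format(base_url, query))
    return commands


# Made-up stand-in for curl_data.csv (two columns, no header row).
sample_csv = io.StringIO("AB 12,Jane Doe\nCD 34,John Roe\n")
rows = list(csv.reader(sample_csv))

# Placeholder URL, purely for illustration.
commands = build_commands(rows, "https://example.invalid/add?", "example group")
```

Writing "\n".join(commands) to the output file then gives one complete command per line.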
I have a script of about 300 lines (part of which is pasted below) with a lot of print commands, and I am trying to clean up the output it produces. If I leave it the way it is, all the print commands print bytes, with \r\n escapes, to the console.
I figured out that if I add .decode('utf-8') to the variable I need to print, then the output is what I expect (a Unicode string). For example, compare the print (data1) and print (data3) commands below. What I want to do is go through all of the code and append .decode() to every print statement.
All the print commands are in this format: print (dataxxxx)
import telnetlib
import time
import sys
import random
from xlwt import Workbook
shelfIp = "10.10.10.10"
shelf = "33"
print ("Shelf IP is: " + str(shelfIp))
print ("Shelf number is: " + str(shelf))
def addCard():
    tn = telnetlib.Telnet(shelfIp)
    ### Telnet session
    tn.read_until(b"<",5)
    cmd = "ACT-USER::ADMIN:ONE::ADMIN;"
    tn.write(bytes(cmd,encoding="UTF-8"))
    data1 = tn.read_until(b"ONE COMPLD", 5)
    print (data1.decode('utf-8'))
    ### Entering second network element
    cmd = "ENT-CARD::CARD" + shelf + "-" + shelf + ":TWO:xyz:;"
    tn.write(bytes(cmd,encoding="UTF-8"))
    data3 = tn.read_until(b"TWO COMPLD", 5)
    print (data3)
    ### Entering third network element
    cmd = "ENT-CARD::CARD-%s-%s:ADM:ABC:;" %(shelf,shelf)
    tn.write(bytes(cmd,encoding="UTF-8"))
    dataAmp = tn.read_until(b"ADM COMPLD", 5)
    print (dataAmp)
    tn.close()
addCard()
If you are looking into doing some sort of find-replace on the code, you can try this:
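import re
# Read the script as text so the str pattern below can be applied to it
f = open('script.py', 'r')
script = f.read()
f.close()
# \s* allows for the space in "print (data3)"
newscript = re.sub(r"(print\s*\(.*)\)", r"\g<1>.decode('utf-8'))", script)
f = open('script.py', 'w')
f.write(newscript)
f.close()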
What I did in the regular expression:
Catch text that contains print(......) and save the print(..... part into group 1
Replace the text after the print(.... which is ) with: .decode('utf-8')) using the syntax \g<1> which takes the saved group number 1 and put that as the prefix in the replaced text.
Appending .decode() to the print() call itself will fail, because print() returns None; .decode() has to be called on the value being printed.
>>> x=u"testing"
>>> print(x).decode('utf-8')
testing
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'decode'
You must apply .decode('utf-8') to the variables you wish to decode, which is not easily accomplished using regex based tools.
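One way to avoid editing every call by hand is to route the output through a small helper that decodes only when it is given bytes. This is a sketch rather than a drop-in fix: to_text and show are hypothetical names, and replacing print (dataxxxx) with show(dataxxxx) would still be a manual or search-and-replace step.

```python
def to_text(data, encoding="utf-8"):
    """Return str for bytes input; pass anything else through unchanged."""
    if isinstance(data, bytes):
        # errors="replace" keeps odd bytes from raising UnicodeDecodeError
        return data.decode(encoding, errors="replace")
    return data


def show(data):
    """Print telnet output (bytes) or ordinary strings as readable text."""
    print(to_text(data))
```

Because ordinary strings pass through untouched, calls such as show("Shelf number is: " + str(shelf)) keep working alongside show(data3).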
I have searched the grep answers on here and cannot find an answer. They all seem to search for a string in a file, not a list of strings from a file. I already have a search function that works, but grep does it WAY faster. I have a list of strings in a file sn.txt (with one string on each line, no delimiters). I want to search another file (Merge_EXP.exp) for lines that have a match and write them out to a new file. The file I am searching in has half a million lines, so searching for a few thousand strings in it takes hours without grep.
When I run it from command prompt in windows, it does it in minutes:
grep --file=sn.txt Merge_EXP.exp > Merge_EXP_Out.exp
How can I call this same process from Python? I don't really want alternatives in Python because I already have one that works but takes a while. Unless you think you can significantly improve the performance of that:
def match_SN(serialnumb, Exp_Merge, output_exp):
    fout = open(output_exp,'a')
    f = open(Exp_Merge,'r')
    # skip first line
    f.readline()
    for record in f:
        record = record.strip().rstrip('\n')
        if serialnumb in record:
            fout.write (record + '\n')
    f.close()
    fout.close()

def main(Output_CSV, Exp_Merge, updated_exp):
    # create a blank output
    fout = open(updated_exp,'w')
    # copy header records
    f = open(Exp_Merge,'r')
    header1 = f.readline()
    fout.write(header1)
    header2 = f.readline()
    fout.write(header2)
    fout.close()
    f.close()
    f_csv = open(Output_CSV,'r')
    f_csv.readline()
    for rec in f_csv:
        rec_list = rec.split(",")
        sn = rec_list[2]
        sn = sn.strip().rstrip('\n')
        match_SN(sn,Exp_Merge,updated_exp)
Here is an optimized version in pure Python:
def main(Output_CSV, Exp_Merge, updated_exp):
    output_list = []
    # copy header records
    records = open(Exp_Merge,'r').readlines()
    output_list = records[0:2]
    serials = open(Output_CSV,'r').readlines()
    serials = [x.split(",")[2].strip().rstrip('\n') for x in serials]
    for s in serials:
        items = [x for x in records if s in x]
        output_list.extend(items)
    open(updated_exp, "w").write("".join(output_list))

main("sn.txt", "merge_exp.exp", "outx.txt")
Input
sn.txt:
x,y,0011
x,y,0002
merge_exp.exp:
Header1
Header2
0011abc
0011bcd
5000n
5600m
6530j
0034k
2000lg
0002gg
Output
Header1
Header2
0011abc
0011bcd
0002gg
Try this out and see how much time it takes...
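A possible variation, closer to what grep --file does, is to combine all serial numbers into one compiled pattern and walk the big file only once. The sketch below is untested against the real data, so any speed-up is an expectation rather than a measurement; filter_lines and header_count are names invented for the example. Unlike the version above, matching lines come out in file order rather than grouped by serial number.

```python
import re


def filter_lines(lines, serials, header_count=2):
    """Keep the header lines plus every line containing any serial number."""
    serials = [s for s in serials if s]
    kept = list(lines[:header_count])
    if not serials:
        # An empty pattern would match every line, so stop here instead
        return kept
    # One alternation of escaped literals: sn1|sn2|sn3...
    pattern = re.compile("|".join(re.escape(s) for s in serials))
    for line in lines[header_count:]:
        if pattern.search(line):
            kept.append(line)
    return kept
```

For the sample input above, filter_lines(records, ["0011", "0002"]) keeps the two headers followed by 0011abc, 0011bcd and 0002gg.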
It worked when I used the full path to the grep location (I pass it grep_loc, Serial_List and Export):
import os
Export_Dir = os.path.dirname(Export)
Export_Name = os.path.basename(Export)
Output = Export_Dir + "\Output_" + Export_Name
print "\nOutput: " + Output + "\n"
cmd = grep_loc + " --file=" + Serial_List + " " + Export + " > " + Output
print "grep usage: \n" + cmd + "\n"
os.system(cmd)
print "Output created\n"
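The same call can also be made with the subprocess module, which passes the arguments as a list (so paths containing spaces need no quoting) and lets Python handle the output redirection. A minimal Python 3 sketch, with build_grep_command and run_grep as hypothetical helper names:

```python
import subprocess


def build_grep_command(grep_path, pattern_file, target_file):
    """Return the argument list for: grep --file=PATTERNS TARGET."""
    return [grep_path, "--file=" + pattern_file, target_file]


def run_grep(grep_path, pattern_file, target_file, output_file):
    """Run grep and write its stdout to output_file; return the exit code."""
    with open(output_file, "w") as out:
        completed = subprocess.run(
            build_grep_command(grep_path, pattern_file, target_file),
            stdout=out,
        )
    # grep exits with 0 if lines matched, 1 if none did, 2 on error.
    return completed.returncode
```

A call would look like run_grep(grep_loc, Serial_List, Export, Output), using the same variables as the snippet above.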
I think you have not chosen the right title for your question: What you want to do is the equivalent of a database JOIN. You can use grep for that in this particular instance, because one of your files only has keys and no other information. However, I think it is likely (but of course I don't know your case) that in the future your sn.txt may also contain extra information.
So I would solve the generic case. There are multiple solutions:
import all data into a database, then do a LEFT JOIN (in sql) or equivalent
use a python large data tool
For the latter, you could try numpy or, recommended because you are working with strings, pandas. Pandas has an optimized merge routine, which is very fast in my experience (uses cython under the hood).
Here is pandas PSEUDO code to solve your problem. It is close to real code but I need to know the names of the columns that you want to match on. I assumed here the one column in sn.txt is called key, and the matching column in merge_txt is called sn. I also see you have two header lines in merge_exp, read the docs for that.
# PSEUDO CODE (but close)
import pandas
left = pandas.read_csv('sn.txt')
right = pandas.read_csv('merge_exp.exp')
out = pandas.merge(left, right, left_on="key", right_on="sn", how='left')
out.to_csv("outx.txt")
Hi, I am processing a 600 MB file and have written the code below. It searches for a keyword in the data between <dest> tags and, if the keyword exists, adds a city tag to the <dest> tag. It worked fine for a small set of data, but when I ran the program on the large file it threw a MemoryError. I guess I am getting this error when I use the return statement in the if condition. Can anyone please let me know how to solve this?
import re
def casp ( tx ):
    def tbcnv( st ):
        ct = ''
        prt = re.compile(r"(?i)(Slip Copy,.*?\))", re.DOTALL|re.M)
        val = re.search(prt, st)
        try:
            ct = val.group(1)
            if re.search(r"(?i)alaska", ct):
                jval = "Alaska"
                print jval
                if jval:
                    prt = re.compile(r"(?i)(.*?<dest.*?>)", re.DOTALL|re.M)
                    vl = re.sub(prt, "\\1\n" + "<city>" + jval + "</city>" + "\n" ,st)
                    return vl
                else:
                    return st
            else:
                return st
        except:
            print "Not available"
            return st
    pt = re.compile("(?i)(<dest.*?</dest>)", re.DOTALL|re.M)
    t = re.sub(pt, lambda m: tbcnv(m.group(1)), tx)
    return t
with open('input.txt', 'r') as content_file:
    content = content_file.read()
pt = re.compile(r"(?i)<Lrlevel level='3'>(.*?)</Lrlevel>", re.DOTALL|re.M)
content = re.sub(pt,lambda m: "<Lrlevel level='3'>" + casp(m.group(1) + "</Lrlevel>" ), content)
with open('out.txt', 'w') as out_file:
    out_file.write(content)
If you remove the return statement just before the except, then the string built by re.sub() is much smaller.
I'm getting memory usage that is 3 times the file size, which means that you'd get a MemoryError if you don't have (more than) 2GB. This is reasonable here --- or at least I can guess why. It's how re.sub() works.
This means that you're somehow using the wrong tools, as explained in the comments above. You should either use a full XML-processing tool like lxml or, if you want to stick with regular expressions, find a way to never need the whole result string in memory, or at least to never call re.sub() on it (e.g. only the tx variable ever contains a big string, which is the input; you call pt.search(tx, startpos) in a loop to locate the places to change, and write tx out piece by piece).
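The piece-by-piece idea can be sketched as follows, in Python 3. It uses finditer() as a shorthand for the search-in-a-loop described above, and the transform is a simplified stand-in for the question's tbcnv (it only checks for "Alaska" anywhere in the block); rewrite_blocks and add_city are names invented for the example. The input string still sits in memory, but the full rewritten copy is never built.

```python
import io
import re

DEST_BLOCK = re.compile(r"<dest.*?</dest>", re.IGNORECASE | re.DOTALL)


def rewrite_blocks(text, out, transform):
    """Write `text` to `out`, passing each <dest>...</dest> block through
    `transform`, without ever building the full result string."""
    position = 0
    for match in DEST_BLOCK.finditer(text):
        out.write(text[position:match.start()])  # unchanged text before the block
        out.write(transform(match.group(0)))     # rewritten block
        position = match.end()
    out.write(text[position:])                   # unchanged tail


def add_city(block):
    """Simplified transform: add a city tag to blocks that mention Alaska."""
    if re.search(r"alaska", block, re.IGNORECASE):
        return re.sub(r"(<dest.*?>)", r"\1\n<city>Alaska</city>\n", block,
                      count=1, flags=re.IGNORECASE | re.DOTALL)
    return block


# Tiny made-up input; StringIO stands in for open('out.txt', 'w').
source = "a<dest id='1'>Slip Copy, (Alaska)</dest>b<dest>Other</dest>c"
result = io.StringIO()
rewrite_blocks(source, result, add_city)
rewritten = result.getvalue()
```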
Paul McGuire, the author of pyparsing, was kind enough to help a lot with a problem I'm trying to solve. We're on 1st down with a yard to goal, but I can't even punt it across the goal line. Confucius said if he gave a student 1/4 of the solution, and he did not return with the other 3/4s, then he would not teach that student again. So it is after almost a week of frustration and with great anxiety that I ask this...
How do I open an input file for pyparsing and print the output to another file?
Here is what I've got so far, but it's really all his work
from pyparsing import *
datafile = open( 'test.txt' )
# Backus-Naur Form
num = Word(nums)
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
partMatch = patientData("patientData") | gleason("gleason")
lastPatientData = None
# PARSE ACTIONS
def patientRecord( datafile ):
    for match in partMatch.searchString(datafile):
        if match.patientData:
            lastPatientData = match
        elif match.gleason:
            if lastPatientData is None:
                print "bad!"
                continue
            print "{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
                lastPatientData.patientData, match.gleason
            )
patientData.setParseAction(lastPatientData)
# MAIN PROGRAM
if __name__=="__main__":
    patientRecord()
It looks like you need to call datafile.read() in order to read the contents of the file. Right now you are trying to call searchString on the file object itself, not the text in the file. You should really look at the Python tutorial (particularly this section) to get up to speed on how to read files, etc.
It seems like you need some help putting it together. The advice of @BrenBarn is spot-on: work with a problem of simpler complexity before you put it all together. I can help by giving you a minimal example of what you are trying to do, with a much simpler grammar. You can use this as a template to learn how to read/write a file in Python. Consider the input text file data.txt:
cat 3
dog 5
foo 7
Let's parse this file and output the results. To have some fun, let's multiply the second column by 2:
from pyparsing import *
# Read the input data
filename = "data.txt"
FIN = open(filename)
TEXT = FIN.read()
# Define a simple grammar for the text, multiply the second col by 2
digits = Word(nums)
digits.setParseAction(lambda x:int(x[0]) * 2)
blocks = Group(Word(alphas) + digits)
grammar = OneOrMore(blocks)
# Parse the results
result = grammar.parseString( TEXT )
# This gives a list of lists
# [['cat', 6], ['dog', 10], ['foo', 14]]
# Open up a new file for the output
filename2 = "data2.txt"
FOUT = open(filename2,'w')
# Walk through the results and write to the file
for item in result:
    print item
    FOUT.write("%s %i\n" % (item[0],item[1]))
FOUT.close()
This gives in data2.txt:
cat 6
dog 10
foo 14
Break each piece down until you understand it. From here, you can slowly adapt this minimal example to your more complex problem above. It's OK to read the file in (as long as it is relatively small) since Paul himself notes:
parseFile is really just a simple shortcut around parseString, pretty
much the equivalent of expr.parseString(open(filename).read()).