Permutation List with Variable Dependencies - UnboundLocalError - Python

I was trying to break the code down to its simplest form before adding more variables and such. I'm stuck.
I want it so that when I use itertools, the first element is one of the tricks and the second element is dependent on that trick: a permutation over the trick's corresponding landing from landings(). I want to add additional variables that further branch off from landings() and so on.
The simplest form should print a list that looks like:
Backflip Complete
Backflip Hyper
180 Round Complete
180 Round Mega
Gumbi Complete
My Code:
from re import I
import pandas as pd
import numpy as np
import itertools
from io import StringIO

backflip = "Backflip"
one80round = "180 Round"
gumbi = "Gumbi"
tricks = [backflip, one80round, gumbi]

complete = "Complete"
hyper = "Hyper"
mega = "Mega"
backflip_landing = [complete, hyper]
one80round_landing = [complete, mega]
gumbi_landing = [complete]

def landings(tricks):
    if tricks == backflip:
        landing = backflip_landing
    elif tricks == one80round:
        landing = one80round_landing
    elif tricks == gumbi:
        landing = gumbi_landing
    return landing

for trik, land in itertools.product(tricks, landings(tricks)):
    trick_and_landing = (trik, land)
    result = (' '.join(trick_and_landing))
    tal = StringIO(result)
    tl = (pd.DataFrame((tal)))
    print(tl)
I get the error:
UnboundLocalError: local variable 'landing' referenced before assignment

Add landing = "" right after def landings(tricks): to get rid of the error.
But the if checks in your function are wrong. You check whether tricks, which is a list, is equal to backflip, etc., which are all strings. That's why none of the ifs is ever true and landing never gets a value assigned.
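A minimal sketch of the fix, reusing the definitions above: call landings() with one trick at a time instead of with the whole list. Note that a plain nested loop gives exactly the dependent pairing you describe; itertools.product over the full lists would also pair, say, Gumbi with Hyper.
for trick in tricks:
    for land in landings(trick):
        print(trick, land)
This prints the five lines you listed.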
That question was also about permutation in Python. Maybe it helps.


Python Appending DataFrame, weird for loop error

I'm working on some NFL statistics web scraping; honestly the activity doesn't matter much. I spent a ton of time debugging because I couldn't believe what it was doing: either I'm going crazy or there is some sort of bug in a package or Python itself. Here's the code I'm working with:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import string
import numpy as np

# get player list
players = pd.DataFrame({"name": [], "url": [], "positions": [], "startYear": [], "endYear": []})
letters = list(string.ascii_uppercase)
for letter in letters:
    print(letter)
    players_html = requests.get("https://www.pro-football-reference.com/players/"+letter+"/")
    soup = bs(players_html.content, "html.parser")
    for player in soup.find("div", {"id": "div_players"}).find_all("p"):
        temp_row = {}
        temp_row["url"] = "https://www.pro-football-reference.com"+player.find("a")["href"]
        temp_row["name"] = player.text.split("(")[0].strip()
        years = player.text.split(")")[1].strip()
        temp_row["startYear"] = int(years.split("-")[0])
        temp_row["endYear"] = int(years.split("-")[1])
        temp_row["positions"] = player.text.split("(")[1].split(")")[0]
        players = players.append(temp_row, ignore_index=True)
players = players[players.endYear > 2000]
players.reset_index(inplace=True, drop=True)

game_df = pd.DataFrame()

def apply_test(row):
    #print(row)
    url = row['url']
    #print(list(range(int(row['startYear']),int(row['endYear'])+1)))
    for yr in range(int(row['startYear']), int(row['endYear'])+1):
        print(yr)
        content = requests.get(url.split(".htm")[0]+"/gamelog/"+str(yr)).content
        soup = bs(content, 'html.parser').find("div", {"id": "all_stats"})
        # overheader
        over_headers = []
        for over in soup.find("thead").find("tr").find_all("th"):
            if "colspan" in over.attrs.keys():
                for i in range(0, int(over['colspan'])):
                    over_headers = over_headers + [over.text]
            else:
                over_headers = over_headers + [over.text]
        # headers
        headers = []
        for header in soup.find("thead").find_all("tr")[1].find_all("th"):
            headers = headers + [header.text]
        all_headers = [a+"___"+b for a, b in zip(over_headers, headers)]
        # remove first column, it's meaningless
        all_headers = all_headers[1:len(all_headers)]
        for row in soup.find("tbody").find_all("tr"):
            temp_row = {}
            for i, col in enumerate(row.find_all("td")):
                temp_row[all_headers[i]] = col.text
            game_df = game_df.append(temp_row, ignore_index=True)

players.apply(apply_test, axis=1)
Now again I could get into what I'm trying to do, but there seems to be a much higher-level issue here. startYear and endYear in the for loop are 2013 and 2014, so the loop should be setting the yr variable to 2013 then 2014. But when you look at what prints out due to the print(yr), you realize it's printing out 2013 twice. But if you simply comment out the game_df = game_df.append(temp_row,ignore_index=True) line, the printouts of yr are correct. There is an error shortly after the first two lines, but that is expected and one I am comfortable debugging. But the fact that appending to a global dataframe is causing a for loop to behave differently is blowing my mind right now. Can someone help with this?
Thanks.
I don't really follow what the overall aim is, but I do note two things:
You either need the local game_df to be declared as global game_df before game_df = game_df.append(temp_row, ignore_index=True), or, better still, pass it as an arg in the def signature, though you would then need to amend players.apply(apply_test, axis=1) accordingly (see the sketch below).
You need to handle the cases of find returning None, e.g. with soup.find("thead").find_all("tr")[1].find_all("th") for page https://www.pro-football-reference.com/players/A/AaitIs00/gamelog/2014. Perhaps put in try/except blocks with appropriate default values to be supplied.
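A minimal, self-contained sketch of the first point; without the global declaration, the assignment inside the function makes game_df a local name and raises UnboundLocalError. The helper name append_row and the toy rows are illustrative, not from the question, and pd.concat stands in for the since-deprecated DataFrame.append:
import pandas as pd

game_df = pd.DataFrame()

def append_row(row):
    # Required because the function rebinds the module-level name;
    # without it, game_df below would be treated as an unassigned local.
    global game_df
    game_df = pd.concat([game_df, pd.DataFrame([row])], ignore_index=True)

append_row({"yr": 2013})
append_row({"yr": 2014})
print(game_df)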

How to compare the strings stored in two lists in Python?

The issue is with the loop.
I can't iterate and check the values from the solu list against the dgu list.
It prints the output above up to print(solu).
The loop used later lags and stops there with no output, and I'm clueless here.
Could someone explain how to compare strings if they exist in two different files from different sources?
from pandas import *
import pandas as pd
import csv
import re
import deepdiff
from pprint import pprint
import xlrd
from difflib import SequenceMatcher
import xlsxwriter
import tocamelcase
from spellchecker import SpellChecker
import numpy as np

xlsx = ExcelFile('WrongSpelling.xlsx')
df = xlsx.parse(xlsx.sheet_names[0])
dg = pd.read_csv("pfm.csv", usecols=['Place Id', 'Name', 'Category'])
pla = dg['Place Id'].values.tolist()
nam = dg['Name'].values.tolist()
cat = dg['Category'].values.tolist()
print()
df2 = pd.DataFrame(df, columns=['Spelling'])
bat = df2['Spelling'].values.tolist()
namo = [x.lower() for x in nam]
bato = [x.lower() for x in bat]
sol = set(namo) & set(bato)
solu = list(sol)
dgu = dg.values.tolist()
nam = list(nam)
print(solu)
print()
print("The Count of Matches with the incorrect data is", len(solu))
print(dg[:5])
print()
while i < len(dgu):
    while i < len(solu):
        # a = solu[i]
        # b = dgu[i]
        # c = nam[i]
        if solu[i] in dgu[i]:
            print(dgu[i])
        else:
            pass
    i += 1
Your inner while loop is using the variable i as the conditional to check when it passes the length of solu, but you never increment it within that while loop, so it will loop forever: i < len(solu) will never evaluate to False once it enters the loop the first time. (Note also that i is never initialized before the outer loop.)
As @offeltoffel mentioned, a for loop seems to fit your need better here. Without a verifiable example to run your code against, here is what the for loop could look like:
for i in range(len(dgu)):
    for j in range(len(solu)):
        if solu[j] in dgu[i]:
            print(dgu[i])
# don't need else/pass here, as it serves no purpose
# don't need to increment i/j manually in a for loop, as it iterates
# through the range created from the length of dgu/solu

How to create a lettered list using docx?

I use Python for teaching some of my science courses, generating unique assignments and tests for students. I've run into an issue that I can't sort out on my own.
I'm trying to make a series of nested lists: a numbered question, and then lettered sub-parts to the question underneath. For example:
Use the Henderson-Hasselbalch equation to determine pH of the following solutions:
A. 250 mM Ammonium Chloride
B. 100 mM Acetic Acid
I've used style "List Number" to create the numbered list, but I can't figure out how to create a custom list that starts with the letters.
Here is what I've got so far:
import sys
import os
if os.uname()[1] == 'iMac':
    sys.path.append("/Users/mgreene3/Library/Python/2.7/lib/python/site-packages")
else:
    sys.path.append("/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python")
import numpy as np
import math
import random
import textwrap
from docx import Document
from docx.shared import Pt, Inches
from docx.enum.style import WD_STYLE_TYPE
from docx.text.tabstops import TabStop as ts
from docx.text.parfmt import ParagraphFormat
assignment = Document()
ordered = "a"
style = assignment.styles["Normal"]
font = style.font
font.name = "Calibri"
font.size = Pt(12)
style.paragraph_format.space_after = Pt(0)
LetteredList = style.paragraph_format._NumberingStyle(ordered)
sub_style = assignment.styles["ListBullet"]
sub_font = sub_style.font
sub_font.name = "Calibri"
###sub_style.paragraph_format.style("List")
sub_font.size = Pt(12)
sub_style.paragraph_format.left_indent = Inches(1)
sub_style.paragraph_format.space_before = Pt(0)
sub_style.paragraph_format.space_after = Pt(40)
doc_heading = assignment.add_paragraph("Name:_______________________")
doc_heading.add_run("\t" * 4)
doc_heading.add_run(" " * 12)
doc_heading.add_run("BIOL444: Biochemistry\t\t\t\t\t\t ")
doc_heading.add_run("\n")
doc_heading.add_run("Take Home 1, v.")
doc_heading.add_run((str(1).zfill(2)))
doc_heading.add_run("\n" * 2)
doc_heading.add_run("Instructions: Complete test (")
show_work = doc_heading.add_run("show work")
show_work.bold = True
show_work.underline = True
show_work
doc_heading.add_run("), submit ")
hard_copy = doc_heading.add_run("hard copy")
hard_copy.bold = True
hard_copy.underline = True
hard_copy
doc_heading.add_run(" by ")
doc_heading.add_run("11:59 pm, Friday, February 10").bold =True
doc_heading.add_run(". Late submissions will ")
doc_heading.add_run("NOT").bold=True
doc_heading.add_run(" be accepted.")
question1 = assignment.add_paragraph("Using the data for K", style = "List Number")
question1.add_run("a").font.subscript = True
question1.add_run(" and pK")
question1.add_run("a").font.subscript = True
question1.add_run(" of the following compounds, calculate the concentrations (M) of all ionic species as well as the pH of the following aqueous solutions: ")
question1.add_run("\n")
question1a = assignment.add_paragraph("100 mM Acetic acid", style = sub_style)
question1b = assignment.add_paragraph("250 mM NaOH", style = sub_style)
assignment.save("TestDocx.docx")
The short answer is that it's probably more trouble than it's worth. Creating numbered lists, and especially nested numbered lists, in Word is a complex operation, possibly for legacy reasons (we're on version 14 or something of Word). Partly because of this complexity, API support for this doesn't yet exist in python-docx.
If you really wanted to do it, it would entail manipulating numbering definitions that live in a package part separate from the document part (I believe it's numbering.xml). This would mean using low-level lxml calls.
For myself, I'd be strongly inclined to use reStructuredText for a job like this, rendering to PDF, perhaps using Sphinx. As a side effect, you could easily get an HTML version as well for posting assignments on the web. However, I'm too far away from your actual requirements to say that would really suit; you'll have to check it out and see for yourself :)
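If the visual effect is all you need, rather than true Word list numbering, a minimal workaround sketch is to hard-code the letters into otherwise ordinary paragraphs, reusing the sub_style indentation already configured in the question's code (the sub_parts list here is illustrative):
import string

sub_parts = ["250 mM Ammonium Chloride", "100 mM Acetic Acid"]
for letter, text in zip(string.ascii_uppercase, sub_parts):
    # "A. 250 mM Ammonium Chloride", "B. 100 mM Acetic Acid", ...
    assignment.add_paragraph("{}. {}".format(letter, text), style=sub_style)
The letters won't renumber automatically if you reorder the parts, but for generated documents that is usually acceptable.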

How To Make A Web Crawler More Efficient?

Here is a code:
str_regex = '(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-z-]+)*)?/?'
import urllib.request
from Stacks import Stack
import re
import functools
import operator as op
from nary_tree import *

url = 'http://www.activeingredients.com/'
s = set()
List = []
url_list = []

def f_go(List, s, url):
    try:
        if url in s:
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
            #print(url)
            h = html.decode("utf-8")
            lst0 = prepare_expression(list(h))
            ntr = buildNaryParseTree(lst0)
            lst2 = nary_tree_tolist(ntr)
            lst3 = functools.reduce(op.add, lst2, [])
            str2 = ''.join(lst3)
            List.append(str2)
            f1 = re.finditer(str_regex, h)
            l1 = []
            for tok in f1:
                ind1 = tok.span()
                l1.append(h[ind1[0]:ind1[1]])
            for exp in l1:
                length = len(l1)
                if (exp[-1] == 'g' and exp[length - 2] == 'p' and exp[length - 3] == 'j') or \
                        (exp[-1] == 'p' and exp[length - 2] == 'n' and exp[length - 3] == 'g'):
                    pass
                else:
                    f_go(List, s, exp, iter_cnt + 1, url_list)
    except:
        return
It basically opens URLs recursively in a loop using urllib.request.urlopen, and does this within a certain domain (in this case activeingredients.com); link extraction from a page is done by regular expression. Having opened a page, it parses it and appends it to a list as a string. So what this is supposed to do is go through the given domain, extract information (meaningful text in this case), and add it to a list. The try/except block just returns in the case of all the HTTP errors (and all the rest of the errors too, but this is tested and working).
It works, for example, for this small page, but for bigger ones it is extremely slow and eats memory.
The parsing and preparing of the page more or less does the right job, I believe.
The question is: is there an efficient way to do this? How do web search engines crawl through the network so fast?
First: I don't think Google's web crawler is running on one laptop or one PC. So don't worry if you can't get results like the big companies do.
Points to consider:
You could start with a big list of words you can download from many websites. That sorts out some useless combinations of URLs. After that you could crawl just with letters to get uselessly-named sites into your index as well.
You could start with a list of all registered domains on DNS servers, i.e. something like this: http://www.registered-domains-list.com
Use multiple threads (see the sketch after this list)
Have much bandwidth
Consider buying Google's data center
These points are just ideas to give you a basic idea of how you could improve your crawler.
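A minimal sketch of the multithreading point. It assumes the fetch-and-extract step is split into a fetch_links(url) helper (the name and the simplified link_regex are illustrative; the question's domain-specific regex and parsing would slot in, and domain filtering is omitted for brevity). A ThreadPoolExecutor lets many downloads overlap, since a crawler spends most of its time waiting on the network:
import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor

link_regex = r'https?://[^\s"\'<>]+'  # illustrative; swap in the domain regex from the question

def fetch_links(url):
    # Download one page and return the links found on it.
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            h = response.read().decode("utf-8", errors="replace")
        return re.findall(link_regex, h)
    except Exception:
        return []

seen = {'http://www.activeingredients.com/'}
frontier = list(seen)
with ThreadPoolExecutor(max_workers=8) as pool:
    while frontier and len(seen) < 200:  # crude stop condition for the sketch
        # Fetch the whole frontier in parallel, then queue the unseen links.
        results = list(pool.map(fetch_links, frontier))
        frontier = []
        for links in results:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)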

Print only not null values

I am trying to print only not null values but I am not sure why even the null values are coming up in the output:
Input:
from lxml import html
import requests
import linecache

i = 1
read_url = linecache.getline('stocks_url', 1)
while read_url != '':
    page = requests.get(read_url)
    tree = html.fromstring(page.text)
    percentage = tree.xpath('//span[@class="grnb_20"]/text()')
    if percentage != None:
        print percentage
    i = i + 1
    read_url = linecache.getline('stocks_url', i)
Output:
$ python test_null.py
['76%']
['76%']
['80%']
['92%']
['77%']
['71%']
[]
['50%']
[]
['100%']
['67%']
You are getting empty lists, not None objects. You are testing for the wrong thing here; you see [], while if a Python null was being returned you'd see None instead. The Element.xpath() method will always return a list object, and it can be empty.
Use a boolean test:
percentage = tree.xpath('//span[@class="grnb_20"]/text()')
if percentage:
    print percentage[0]
Empty lists (and None) test as false in a boolean context. I opted to print out the first element from the XPath result; you appear to only ever have one.
Note that linecache is primarily aimed at caching Python source files; it is used to present tracebacks when an error occurs, and when you use inspect.getsource(). It isn't really meant to be used to read a file. You can just use open() and loop over the file without ever having to keep incrementing a counter:
with open('stocks_url') as urlfile:
    for url in urlfile:
        page = requests.get(url)
        tree = html.fromstring(page.content)
        percentage = tree.xpath('//span[@class="grnb_20"]/text()')
        if percentage:
            print percentage[0]
Change this in your code and it should work:
if percentage != []:
