Python scraping from a txt web-page

I have a large txt file from a website
https://en90.tribalwars.net/map/village.txt
These are the first few lines:
1,Barbarian+village,508,538,10342642,4208,0
2,ckouta+village,507,542,11001011,9761,0
3,Bonus+village,464,449,0,1513,1
4,Revenge+Will+Be+Sweet,501,532,9202536,9835,0
5,OFF,515,501,11158923,5644,0
I would now like to extract the first figure from the line whose third and fourth columns match given values. For example, given x = 464 and y = 449, I want my script to return 3.
I tried parsing the page with BeautifulSoup and then matching the correct line using regex, but I cannot make this work.

You can use parentheses (capturing groups) and groups() from the re module.
The following code gives you access to the 1st, 3rd and 4th fields.
import re
pattern = r'(.+),.+,(.+),(.+),.+,.+,.+'
string = '3,Bonus+village,464,449,0,1513,1'
foo = re.match(pattern, string).groups()
print(foo)
which leaves you only to compare the 2nd element of foo to '464' and the 3rd element of foo to '449'.
I saw one of the comments recommending using csv and I believe that is a very rational idea. The equivalent to using csv can be done by using string.split(',')
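As a minimal sketch of the csv idea, assuming the text has already been downloaded into a string: the sample rows are copied from the question, and find_village_id is an illustrative helper name, not something from the original answers.

```python
import csv
import io

# Sample rows copied from the question; in practice this would be the downloaded text.
sample = """1,Barbarian+village,508,538,10342642,4208,0
2,ckouta+village,507,542,11001011,9761,0
3,Bonus+village,464,449,0,1513,1
4,Revenge+Will+Be+Sweet,501,532,9202536,9835,0
5,OFF,515,501,11158923,5644,0
"""


def find_village_id(text, x, y):
    """Return the first column of the row whose 3rd and 4th columns equal x and y."""
    for row in csv.reader(io.StringIO(text)):
        # Fields are strings, so compare against the string form of x and y.
        if len(row) >= 4 and row[2] == str(x) and row[3] == str(y):
            return int(row[0])
    return None  # no matching row


print(find_village_id(sample, 464, 449))  # 3
```

Returning None for a missing row is a design choice of this sketch; raising an exception would work just as well.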

In this particular case, I would not use regex. This data looks like CSV (comma-separated values) and is very consistent.
My suggestion:
from urllib.request import urlopen  # Python 3; in Python 2 this was "from urllib import urlopen"
from collections import namedtuple
url = 'https://en90.tribalwars.net/map/village.txt'
content = urlopen(url).read().decode('utf-8')  # assuming the file is UTF-8 encoded
lines = content.splitlines()  # splitlines() leaves no trailing empty string
# Each line has seven columns; only the first six are kept here.
village = namedtuple('village', ['id', 'name', 'x', 'y', 'z', 'whatever'])
def create_item(line):
    # Split the line once and unpack the first six fields.
    return village(*line.split(',')[:6])
lines = [create_item(line) for line in lines]
sample = lines[0]
print(sample.id)
print(sample.name)
print(sample.x)  # e.g. 508 for the first line shown in the question
print(sample.y)  # e.g. 538 for the first line shown in the question
I added a namedtuple too to make it even cooler. The lines list contains all the data, and you should be able to write a function to filter based on x and y values.
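One way such a filter could look, if many lookups are needed, is a dictionary keyed by the coordinates. This is only a sketch: the Village fields beyond the first four are dropped, the rows are hard-coded from the question, and build_index is a made-up name.

```python
from collections import namedtuple

# Only the columns needed for the lookup are kept in this sketch.
Village = namedtuple('Village', ['id', 'name', 'x', 'y'])

rows = [
    '1,Barbarian+village,508,538,10342642,4208,0',
    '3,Bonus+village,464,449,0,1513,1',
    '5,OFF,515,501,11158923,5644,0',
]


def build_index(lines):
    """Map (x, y) to the Village found at those coordinates."""
    index = {}
    for line in lines:
        fields = line.split(',')
        v = Village(int(fields[0]), fields[1], int(fields[2]), int(fields[3]))
        index[(v.x, v.y)] = v
    return index


index = build_index(rows)
print(index[(464, 449)].id)  # 3
```

Building the index costs one pass over the file, after which each lookup is a single dictionary access.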

Related

How to split a comma-separated line if the chunk contains a comma in Python?

I'm trying to split the current line into 3 chunks.
The title column contains a comma, which is also the delimiter:
1,"Rink, The (1916)",Comedy
Current code is not working
id, title, genres = line.split(',')
Expected result
id = 1
title = 'Rink, The (1916)'
genres = 'Comedy'
Any thoughts how to split it properly?

Ideally, you should use a proper CSV parser and specify the double quote as the quote character. If you must proceed with the current string as the starting point, here is a regex trick which should work:
import re
inp = '1,"Rink, The (1916)",Comedy'
parts = re.findall(r'".*?"|[^,]+', inp)
print(parts) # ['1', '"Rink, The (1916)"', 'Comedy']
The regex pattern works by first trying to find a term "..." in double quotes. That failing, it falls back to finding a CSV term which is defined as a sequence of non comma characters (leading up to the next comma or end of the line).
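A small sketch of how the pattern could be wrapped to also drop the surrounding quotes (split_quoted is an illustrative name). It also shows one caveat of this trick: empty fields are not preserved.

```python
import re

FIELD_RE = re.compile(r'".*?"|[^,]+')


def split_quoted(line):
    """Split on commas, keeping double-quoted chunks together and dropping the quotes."""
    return [part.strip('"') for part in FIELD_RE.findall(line)]


movie_id, title, genres = split_quoted('1,"Rink, The (1916)",Comedy')
print(title)  # Rink, The (1916)

# Caveat: an empty field is skipped entirely, which shifts the columns.
print(split_quoted('1,,Comedy'))  # ['1', 'Comedy']
```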

Let's talk about why your code does not work:
id, title, genres = line.split(',')
Here line.split(',') returns 4 values (since the line contains 3 commas), but you are unpacking into 3 names, hence you get:
ValueError: too many values to unpack (expected 3)
My advice would be to use a delimiter other than a comma, one that does not occur in the data:
"1#\"Rink, The (1916)\"#Comedy"
and then
id, title, genres = line.split('#')

Use the csv package from the standard library:
>>> import csv, io
>>> s = """1,"Rink, The (1916)",Comedy"""
>>> # Load the string into a buffer so that csv reader will accept it.
>>> reader = csv.reader(io.StringIO(s))
>>> next(reader)
['1', 'Rink, The (1916)', 'Comedy']
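The same reader can be looped over to unpack every row. A minimal sketch, where the second row is made up for illustration:

```python
import csv
import io

# The first row is from the question; the second is a made-up example.
text = '''1,"Rink, The (1916)",Comedy
2,"Plain Title",Drama
'''

movies = []
for movie_id, title, genres in csv.reader(io.StringIO(text)):
    # csv.reader has already removed the quotes and kept the embedded comma.
    movies.append((int(movie_id), title, genres))

print(movies[0])  # (1, 'Rink, The (1916)', 'Comedy')
```

When reading from a real file, the csv documentation recommends opening it with newline='' and passing the file object to csv.reader directly.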

Well, you can do it by making it a tuple (note that the strings must be quoted):
line = (1, "Rink, The (1916)", "Comedy")
id, title, genres = line

Extract time values from a list and add to a new list or array

I have a script that reads through a log file that contains hundreds of these logs, and looks for the ones that have a "On, Off, or Switch" type. Then I output each log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'
with open(logfile, 'r') as f:
    text = f.read()
words = ["On", "Off", "Switch"]
text2 = text.split('\n')
for l in text.split('\n'):
    if (words[0] in l or words[1] in l or words[2] in l):
        log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs and put them in an array and convert to a time value to find duration.
Initial log before script: everything after the "In" time is useless for what I'm looking for so I only have the first three indices outputted
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}

Below is a possible solution. The wordmatch line is a bit of a hack, until I find something clearer: it's just a one-liner that creates a set that is either empty or holds the single element True, depending on whether one of the words matches.
(Untested)
import re
logfile = '/path/to/my/logfile'
words = ["On", "Off", "Switch"]
dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)
with open(logfile, 'r') as f:
    for line in f:
        # test each word against the current line (the original used an undefined name s)
        wordmatch = set(filter(None, (word in line for word in words)))
        if wordmatch:
            match = regex.search(line)
            if match:
                intime = match.group('in')
                outtime = match.group('out')
                # whatever to store these strings, e.g., append to list or insert in a dict.
As noted, your log example is very awkward, so this works for the example line, but may not work for every line. Adjust as necessary.
I have also not included a conversion to a datetime.datetime object, in case you want one. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
You also don't need to read and split on newlines yourself: for line in f will do that for you (provided f is indeed a file handle).
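For the conversion step, a rough sketch using datetime.strptime could look like the following. It assumes every timestamp has fractional seconds and a trailing Z, uses a shortened version of the example log line with the quoting repaired, and the names durations, out_time and in_time are illustrative.

```python
import re
from datetime import datetime

WORDS = ["On", "Off", "Switch"]
DATE = r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z'
PATTERN = re.compile(
    rf'Out:\s*\[(?P<out_time>{DATE})\].*?"In":\s*"(?P<in_time>{DATE})"'
)
TIME_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'  # %f accepts 1 to 6 fractional digits


def durations(lines):
    """Return a list of (Out - In) timedeltas for the lines that match."""
    result = []
    for line in lines:
        if not any(word in line for word in WORDS):
            continue
        match = PATTERN.search(line)
        if match:
            out_time = datetime.strptime(match.group('out_time'), TIME_FORMAT)
            in_time = datetime.strptime(match.group('in_time'), TIME_FORMAT)
            result.append(out_time - in_time)
    return result


# Shortened version of the example log line, for illustration only.
sample = [
    '2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] '
    'Id: {"Id":"4-f-4-9-6a","Type":"Switch","In":"2020-01-31T00:30:20.140Z","Path":"interface"}'
]
print(durations(sample))  # [datetime.timedelta(microseconds=10000)]
```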

Regex is probably the way to go (speed, efficiency etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
join all of it into a string
replace things that hinder easy parsing
split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
all_text = " ".join(data)
# this is inefficient and will create throwaway intermediate strings - if you are
# in a hurry or operate on 100s of MB of data, this is NOT the way to go, unless
# you have time
# iterate pairs of ("bad thing", "what to replace it with") (or list of bad things)
for thing in [ (": ",":"), (list('[]{}"'),"") ]:
    whatt = thing[0]
    withh = thing[1]
    # if list, do so for each bad thing
    if isinstance(whatt, list):
        for p in whatt:
            # replace it
            all_text = all_text.replace(p,withh)
    else:
        all_text = all_text.replace(whatt,withh)
# format is now far better suited to splitting/filtering
cleaned = [a for a in all_text.split(" ")
           if any(a.startswith(prefix) or "Switch" in a
                  for prefix in {"In:","Switch:","Out:"})]
print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict( part.split(":",1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use the datetime module to parse the times from your values.
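As one possible way to do that parsing, here is a sketch based on datetime.fromisoformat, starting from the dictionary shown above. The helper name parse_utc is made up, and the replace call is only there because Python versions before 3.11 do not accept the trailing "Z".

```python
from datetime import datetime

d = {'In': '2020-01-31T00:30:20.140Z',
     'Type': 'Switch',
     'Out': '2020-01-31T00:30:20.150Z'}


def parse_utc(stamp):
    # Swap the trailing "Z" for an explicit UTC offset so that
    # fromisoformat() also works on Python versions before 3.11.
    return datetime.fromisoformat(stamp.replace('Z', '+00:00'))


duration = parse_utc(d['Out']) - parse_utc(d['In'])
print(duration.total_seconds())
```

The result is a timedelta, so durations from several logs can be collected in a list and summed or compared directly.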

RegEx for capturing groups using dictionary key

I'm having trouble displaying the right named capture in my dictionary function. My program reads a .txt file and then transforms the text in that file into a dictionary. I already have the right regex formula to capture them.
Here is my File.txt:
file Science/Chemistry/Quantum 444 1
file Marvel/CaptainAmerica 342 0
file DC/JusticeLeague/Superman 300 0
file Math 333 0
file Biology 224 1
Here is the regex link that is able to capture the ones I want:
Looking at the link, the ones I want to display are highlighted in green and orange.
This part of my code works:
rx = re.compile(r'file (?P<path>.*?)( |\/.*?)? (?P<views>\d+).+')
i = rx.match(data) # 'data' is from the .txt file
x = (i.group(1), i.group(3))
print(x)
But since I'm turning the .txt file into a dictionary, I couldn't figure out how to use .group(1) and .group(3) as keys for my display function. I don't know how to get those groups to display when I use print("Title: %s | Number: %s" % (key[1], key[3])). I hope someone can help me implement that in my dictionary function.
Here is my dictionary function:
def create_dict(data):
    dictionary = {}
    for line in data:
        line_pattern = re.findall(r'file (?P<path>.*?)( |\/.*?)? (?P<views>\d+).+', line)
        dictionary[line] = line_pattern
        content = dictionary[line]
        print(content)
    return dictionary
I'm trying to make my output look like this from my text file:
Science 444
Marvel 342
DC 300
Math 333
Biology 224

You may create and populate a dictionary with your file data using
def create_dict(data):
    dictionary = {}
    for line in data:
        m = re.search(r'file\s+([^/\s]*)\D*(\d+)', line)
        if m:
            dictionary[m.group(1)] = m.group(2)
    return dictionary
Basically, it does the following:
Defines a dictionary dictionary
Reads data line by line
Searches for a file\s+([^/\s]*)\D*(\d+) match, and if there is a match, the two capturing group values are used to form a dictionary key-value pair.
The regex I suggest is
file\s+([^/\s]*)\D*(\d+)
Then, you may use it like
res = {}
with open(filepath, 'r') as f:
    res = create_dict(f)
print(res)
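To check the function without a file, it can be fed the sample lines from the question as a list. This is a self-contained sketch, and sample_lines is just an illustrative name.

```python
import re


def create_dict(data):
    dictionary = {}
    for line in data:
        m = re.search(r'file\s+([^/\s]*)\D*(\d+)', line)
        if m:
            dictionary[m.group(1)] = m.group(2)
    return dictionary


# Lines copied from File.txt in the question.
sample_lines = [
    'file Science/Chemistry/Quantum 444 1',
    'file Marvel/CaptainAmerica 342 0',
    'file DC/JusticeLeague/Superman 300 0',
    'file Math 333 0',
    'file Biology 224 1',
]

for title, views in create_dict(sample_lines).items():
    print(title, views)  # first line printed: Science 444
```

One thing to keep in mind: because \D* stops at the first digit, a later path component containing a digit (for example a line such as 'file A/B2 100 0') would yield '2' rather than '100' as the number.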

You already use named groups in your line_pattern; simply put them into your dictionary. re.findall would not work here, because it returns a list of tuples rather than a match object. Also, the escape character '\' before '/' is redundant. Thus your dictionary function would be:
def create_dict(data):
    dictionary = {}
    for line in data:
        line_pattern = re.search(r'file (?P<path>.*?)( |/.*?)? (?P<views>\d+).+', line)
        if line_pattern:  # skip lines that do not match
            path = line_pattern.group('path')
            dictionary[path] = line_pattern.group('views')
            print(path, dictionary[path])
    return dictionary
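For the display format mentioned in the question, the named groups can also be pulled out in one step with groupdict(). A small sketch on a single sample line (LINE_RE and info are illustrative names):

```python
import re

LINE_RE = re.compile(r'file (?P<path>.*?)( |/.*?)? (?P<views>\d+).+')

m = LINE_RE.search('file Science/Chemistry/Quantum 444 1')
if m:
    # groupdict() returns only the named groups, keyed by name.
    info = m.groupdict()
    print("Title: %s | Number: %s" % (info['path'], info['views']))
    # Title: Science | Number: 444
```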

This RegEx might help you divide your inputs into four groups, where group 2 and group 4 are your target groups; they can simply be extracted and joined with a space:
(file\s)([A-Za-z]+(?=\/|\s))(.*)(\d{3})

Formatting form data?

I need to convert the form data below to a slightly different format to be able to submit correctly.
I have this form data.
PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6j
NWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp
CEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY
YMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrc
pDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ
FfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehE
d4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:
Wanted format:
PaReq=eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz%2FlO6en5PrLZCs6j%0D%0ANWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA%2F6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp%0D%0ACEXF7Pg8X9JRgAIICbhCWz9wMY%2Boj%2FEYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY%0D%0AYMu12NOtlOUUgKZp3N%2BikGUsRbF3WeHWO0CAVphXgMdnkFWtiap%2FY5sldBGFjf1Yuzzv0PL8evrc%0D%0ApDMCtMLqk1hyiqCHoT%2F0HIimCE%2FHmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ%0D%0AFfC2LHKuzqg%2B3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ%2BtakbbhOfr6h9sjC65rpSehE%0D%0Ad4Yy1TXkQb9zlNkWEmD%2Br642A6n71A0vHRBwP9j%2F7TDLBQ%3D%3D%0D%0A&TermUrl=https%3A%2F%2Fwww.footpatrol.co.uk%2Fcheckout%2F3d&MD=
I have tried this, but the output is in a different format from the one I need to submit.
Code:
import urllib.parse
print(urllib.parse.quote_plus('''PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6j
NWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp
CEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY
YMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrc
pDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ
FfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehE
d4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:'''))
Is this achievable with Python? And what do I need to do to get the wanted end result?

If your parameters are separated by newlines, you can use the splitlines method to get a list of parameters, and use re.split on each item to get a list with name and value.
Then apply quote_plus on each name and value, '='.join them and '&'.join all parameters.
import urllib.parse
import re
data = '''PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6jNWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNpCEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQYYMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrcpDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQFfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehEd4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:'''
data = [re.split(':(?!//)', line) for line in data.splitlines()]
data = '&'.join('='.join(urllib.parse.quote_plus(i) for i in l) for l in data)
If your data is split by newlines arbitrarily, you could join the lines and split by name. Then zip names and values, quote and join.
data = ''.join(data.splitlines())
data = zip(['PaReq', 'TermUrl', 'MD'], re.split('PaReq:|TermUrl:|MD:', data)[1:])
data = '&'.join('='.join(urllib.parse.quote_plus(i) for i in l) for l in data)
If you want to keep the newline character, use only the last two lines in the second code snippet.
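For comparison, the standard library's urllib.parse.urlencode performs the quote-and-join steps in one call and uses quote_plus by default. A minimal sketch, with short placeholder values standing in for the real PaReq and TermUrl data:

```python
from urllib.parse import urlencode

# Placeholder values, not the real payload from the question.
fields = {
    'PaReq': 'eJxd/abc+def==\r\n',
    'TermUrl': 'https://example.com/checkout/3d',
    'MD': '',
}

# urlencode() quotes each name and value with quote_plus and joins them with "=" and "&".
body = urlencode(fields)
print(body)
# PaReq=eJxd%2Fabc%2Bdef%3D%3D%0D%0A&TermUrl=https%3A%2F%2Fexample.com%2Fcheckout%2F3d&MD=
```

The wanted format in the question contains %0D%0A, which suggests the line breaks inside PaReq are expected to be "\r\n" rather than a bare "\n"; that is an inference from the sample, not something the question states.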

Split stock quote to tokens in Python

I need to read a text file of stock quotes and do some processing with each stock data (i.e. a line in the file).
The stock data looks like this :
[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']
How do I split the line into tokens where each token is what is enclosed in the square brackets, i.e. for the above line, the tokens should be "class, 'STOCK'", "symbol, 'AAII'" etc.

import re
print(re.findall(r"\[(.*?)\]", inputline))
Or perhaps without regex:
print(inputline[1:-1].split("],["))
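If the tokens are then needed as key/value pairs, one option is to split each token on its first comma. The sketch below uses a shortened version of the line and ast.literal_eval to turn the values into Python objects; it assumes every value is a valid Python literal (a quoted string or a number), and parse_quote is an illustrative name.

```python
import ast
import re

# Shortened version of the stock line from the question.
line = "[class,'STOCK'],[open,2.60],[volume,458500],[date,'21-Dec-04'],[closeEqualsLow,'false']"


def parse_quote(line):
    """Turn "[key,value],[key,value],..." into a dict with typed values."""
    record = {}
    for token in re.findall(r"\[(.*?)\]", line):
        key, raw = token.split(",", 1)  # split on the first comma only
        record[key] = ast.literal_eval(raw)  # 'STOCK' -> str, 2.60 -> float
    return record


print(parse_quote(line))
```

Note that 'false' stays a string here, since it is quoted in the data.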

Try this code:
#!/usr/bin/env python3
import re
# named s rather than str, to avoid shadowing the built-in str type
s = "[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']"
s = re.sub(r'^\[', '', s)
s = re.sub(r'\]$', '', s)
array = s.split("],[")
for line in array:
    print(line)

Start with:
re.findall("[^,]+,[^,]+", a)
This would give you a list of:
[class,'STOCK'], [symbol,'AAII'] and such, then you could cut the brackets.
If you want a functional one-liner, use:
list(map(lambda x: x[1:-1], re.findall("[^,]+,[^,]+", a)))
The first part splits at every second comma; the map (which applies the lambda function to each item in the list) cuts the brackets.

import re
s = "[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']"
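# the value pattern accepts anything up to the closing bracket, so 2.60 and '21-Dec-04' are captured whole
m = re.findall(r"\[(\w+),([^\]]+)\]", s)
d = {key: value for key, value in m}
print(d)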
you can run the snippet here : http://liveworkspace.org/code/EZGav$35
