How to fix Arabic unicode in a list - python

I made a database containing Arabic words and when I fetch the data and print it it's OK and works well and prints:
مشاعر‬
مودة
But when I loop over the database rows, collect them into a list, and then print that list to see what's happening, I get this:
['\u202b\u202bمشاعر\u202c', '\u202b\u202bالمودة\u202c']
Here is the code:
cors.execute("SELECT * FROM DictContents")  # select from the database
self.AraList = []  # empty list to hold the Arabic words
for raw in cors.fetchall():  # fetch the rows from the database
    rawAra = raw[1]  # the row holds more columns; index 1 is the Arabic column
    print(rawAra)  # first print: works fine, as I said
    self.AraList.append(rawAra)
print(self.AraList)  # second print: the whole list
I tried more than one way to fix it before I ask but none of them worked for me.

Found ...
import re
cors.execute("SELECT * FROM DictContents")
self.AraList = []
cleanit = re.compile(r'\w+.*')  # start matching at the first word character
for raw in cors.fetchall():
    rawAra = raw[1]
    cleanone = cleanit.search(rawAra)
    if cleanone:
        print(cleanone.group())  # prints the cleaned strings: مشاعر مودة
        self.AraList.append(cleanone.group())  # add the string to the list to see how it looks
print(self.AraList)  # prints a much cleaner list than the first one
['مشاعر\u202c - ', 'المودة\u202c']
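The leftover \u202b and \u202c are Unicode directional formatting characters (right-to-left embedding and pop directional formatting). They are invisible when a string is printed on its own, but printing a list shows the repr of each element, which escapes them. A minimal sketch, assuming the goal is simply to drop such characters before storing the words; `strip_bidi` and `BIDI_CONTROLS` are hypothetical names, not part of the original code:

```python
# Code points of Unicode bidirectional formatting characters:
# LRM/RLM marks, embeddings and overrides (U+202A..U+202E), isolates (U+2066..U+2069).
BIDI_CONTROLS = dict.fromkeys(
    [0x200E, 0x200F] + list(range(0x202A, 0x202F)) + list(range(0x2066, 0x206A))
)


def strip_bidi(text):
    """Return text without bidi formatting characters or surrounding whitespace."""
    # A translation table that maps a code point to None deletes that character.
    return text.translate(BIDI_CONTROLS).strip()


# Sample values copied from the list printed in the question.
rows = ['\u202b\u202bمشاعر\u202c', '\u202b\u202bالمودة\u202c']
cleaned = [strip_bidi(row) for row in rows]
print(cleaned)
```

Unlike the regex approach, this also removes the trailing \u202c, because every listed character is deleted wherever it occurs.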

Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in python. End goal is to structure it in a dataframe. For now I'm just trying to get the relevant data in an array and understand the line, readline() functionality in python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.
            2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords, even though Python technically doesn't, just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
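Since the end goal is a dataframe, the same state-machine idea can collect one dictionary per article instead of three parallel lists. This is only a sketch under assumptions: `parse_articles` and the in-memory `SAMPLE` text are hypothetical stand-ins for the real file, and lines that appear before the first "Title:" are silently skipped.

```python
import io

# Made-up sample in the same layout as the question's file.
SAMPLE = """Title: title of an article
Full text: first line,
second line.
Subject: Python
Title: title of another article
Full text: only line.
Subject: Python
"""


def parse_articles(lines):
    """Group Title / Full text / Subject lines into one dict per article."""
    records = []
    current = None        # the article being built, or None between articles
    in_full_text = False  # True while continuation lines belong to "Full text"
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("Title:"):
            current = {"Title": line[len("Title:"):].strip()}
            in_full_text = False
        elif current is None:
            continue  # ignore anything before the first title
        elif line.startswith("Full text:"):
            current["Full text"] = line[len("Full text:"):].strip()
            in_full_text = True
        elif line.startswith("Subject:"):
            current["Subject"] = line[len("Subject:"):].strip()
            records.append(current)
            current, in_full_text = None, False
        elif in_full_text:
            current["Full text"] += "\n" + line
    return records


articles = parse_articles(io.StringIO(SAMPLE))
for article in articles:
    print(article)
```

A list of dictionaries like this can be passed directly to pandas.DataFrame, and the same function accepts an open file object in place of the StringIO.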
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python

Extract time values from a list and add to a new list or array

I have a script that reads through a log file that contains hundreds of these logs, and looks for the ones that have a "On, Off, or Switch" type. Then I output each log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'
with open(logfile, 'r') as f:
    text = f.read()
    words = ["On", "Off", "Switch"]
    text2 = text.split('\n')
    for l in text.split('\n'):
        if (words[0] in l or words[1] in l or words[2] in l):
            log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs and put them in an array and convert to a time value to find duration.
Initial log before script: everything after the "In" time is useless for what I'm looking for so I only have the first three indices outputted
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}
Below is a possible solution. The wordmatch line is a bit of a hack, until I find something clearer: it's just a one-liner that creates either an empty set or a one-element set containing True if one of the words matches.
(Untested)
import re
logfile = '/path/to/my/logfile'
words = ["On", "Off", "Switch"]
dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)
with open(logfile, 'r') as f:
    for line in f:
        # empty set if no word occurs in the line, {True} otherwise
        wordmatch = set(filter(None, (word in line for word in words)))
        if wordmatch:
            match = regex.search(line)
            if match:
                intime = match.group('in')
                outtime = match.group('out')
                # whatever to store these strings, e.g., append to list or insert in a dict.
As noted, your log example is very awkward, so this works for the example line, but may not work for every line. Adjust as necessary.
I have also not included (if so wanted), a conversion to a datetime.datetime object. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
You also don't need to read and split on newlines yourself: for line in f will do that for you (provided f is indeed a filehandle).
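To illustrate the datetime conversion mentioned above, here is a minimal sketch that extracts both timestamps from one line and subtracts them. It assumes every timestamp has fractional seconds and a trailing Z, as in the example; `LINE` is a shortened, cleaned-up copy of the sample log, and `out_in_duration`, `STAMP`, and `TIME_FORMAT` are hypothetical names.

```python
import re
from datetime import datetime

# Shortened copy of the example log line (assumption: well-formed quoting).
LINE = ('2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] '
        'Id: {"Id":"4-f-4-9-6a","Type":"Switch","In":"2020-01-31T00:30:20.140Z"}')

STAMP = r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z'
PATTERN = re.compile(rf'Out:\s*\[(?P<out>{STAMP})\].*"In":\s*"(?P<in>{STAMP})"')
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # %f accepts one to six fractional digits


def out_in_duration(line):
    """Return Out minus In as a timedelta, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    out_time = datetime.strptime(match.group("out"), TIME_FORMAT)
    in_time = datetime.strptime(match.group("in"), TIME_FORMAT)
    return out_time - in_time


duration = out_in_duration(LINE)
print(duration)
```

The result is a datetime.timedelta, so `duration.total_seconds()` gives the length of each log entry as a plain number that can be appended to a list.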
Regex is probably the way to go (speed, efficiency, etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
join all of it into a string
replace things that hinder easy parsing
split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
all_text = " ".join(data)
# this is inefficient and will create throwaway intermediate strings - if you are
# in a hurry or operate on 100s of MB of data, this is NOT the way to go, unless
# you have time
# iterate pairs of ("bad thing", "what to replace it with") (or list of bad things)
for thing in [ (": ",":"), (list('[]{}"'),"") ]:
    whatt = thing[0]
    withh = thing[1]
    # if list, do so for each bad thing
    if isinstance(whatt, list):
        for p in whatt:
            # replace it
            all_text = all_text.replace(p,withh)
    else:
        all_text = all_text.replace(whatt,withh)
# format is now far better suited to splitting/filtering
cleaned = [a for a in all_text.split(" ")
           if any(a.startswith(prefix) or "Switch" in a
                  for prefix in {"In:","Switch:","Out:"})]
print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict( part.split(":",1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use the datetime module to parse the times from these values.

Can you spot the problem with this REGEX statement?

I'm running .txt files through a for loop which should slice out keywords and .append them into lists. For some reason my REGEX statements are returning really odd results.
My first statement which iterates through the full filenames and slices out the keyword works well.
# Creates a workflow list of file names within target directory for further iteration
stack = os.listdir(
    "/Users/me/Documents/software_development/my_python_code/random/countries"
)
# declares list, to be filled, and their associated regular expression, to be used,
# in the primary loop
names = []
name_pattern = r"-\s(.*)\.txt"
# PRIMARY LOOP
for entry in stack:
    if entry == ".DS_Store":
        continue
    # extraction of country name from file name into `names` list
    name_match = re.search(name_pattern, entry)
    name = name_match.group(1)
    names.append(name)
This works fine and creates the list that I expect.
However, once I move on to a similar process with the actual contents of files, it no longer works.
religions = []
reli_pattern = r"religion\s=\s(.+)."
# PRIMARY LOOP
for entry in stack:
    if entry == ".DS_Store":
        continue
    # opens and reads file within `contents` variable
    file_path = (
        "/Users/me/Documents/software_development/my_python_code/random/countries" + "/" + entry
    )
    selection = open(file_path, "rb")
    contents = str(selection.read())
    # extraction of religion type and placement into `religions` list
    reli_match = re.search(reli_pattern, contents)
    religion = reli_match.group(1)
    religions.append(religion)
The results should be something like: "theravada", "catholic", "sunni", etc.
Instead I'm getting seemingly random pieces of text from the document that have nothing to do with my REGEX, like ruler names and stat values that do not contain the word "religion".
To try and figure this out I isolated some of the code in the following way:
contents = "religion = catholic"
reli_pattern = r"religion\s=\s(.*)\s"
reli_match = re.search(reli_pattern, contents)
print(reli_match)
And None is printed to the console so I am assuming the problem is with my REGEX. What silly mistake am I making which is causing this?
Your regular expression (religion\s=\s(.*)\s) requires that there be a trailing whitespace (the last \s there). Since your string doesn't have one, it doesn't find anything when searching thus re.search returns None.
You should either:
Change your regex to r"religion\s=\s(.*)", or
Change the string you're searching to have trailing whitespace (i.e. 'religion = catholic' to 'religion = catholic ')
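The difference between the two patterns can be checked directly. The sketch below also tries a narrower capture group, `(\w+)`, on a made-up multi-line sample; that variant is an assumption about the data (religion names consisting of word characters only), not something the question confirms.

```python
import re

contents = "religion = catholic"

# Original test pattern: the final \s needs whitespace after the value.
strict = re.search(r"religion\s=\s(.*)\s", contents)

# Without the trailing \s the same string matches.
relaxed = re.search(r"religion\s=\s(.*)", contents)

# Hypothetical multi-line file contents; (\w+) stops at the end of the word.
sample_file = "culture = khmer\nreligion = theravada\nruler = someone\n"
word_only = re.search(r"religion\s=\s(\w+)", sample_file)

print(strict)
print(relaxed.group(1))
print(word_only.group(1))
```

Note that this assumes the file is read as text. Calling str() on bytes read in "rb" mode produces the bytes repr, in which line breaks become the two characters backslash and n, so a greedy `.+` can run far past the intended line.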

How to web scrape all of the batters names?

I would like to scrape all of the MLB batters stats for 2018. Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml import html
#fetch url/html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml")
content = response.read()
tree = html.fromstring( content )
#parse data
comment_html = tree.xpath('//comment()[contains(., "players_standard_batting")]')[0]
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
tree = html.fromstring( comment_html )
for batter_row in tree.xpath('//table[@id="players_standard_batting"]/tbody/tr[contains(@class, "full_table")]'):
    csk = batter_row.xpath('./td[@data-stat="player"]/@csk')[0]
When I scraped all of the batters there is 0.01 attached to each name. I tried to remove attached numbers using the following code:
bat_data = [csk]
string = '0.01'
result = []
for x in bat_data :
    if string in x:
        substring = x.replace(string,'')
        if substring != "":
            result.append(substring)
    else:
        result.append(x)
print(result)
This code removed the number, however, only the last name was printed:
Output:
['Zunino, Mike']
Also, there is a bracket and quotations around the name. The name is also in reverse order.
1) How can I print all of the batters names?
2) How can I remove the quotation marks and brackets?
3) Can I reverse the order of the names so the first name gets printed and then the last name?
The final output I am hoping for would be all of the batters names like so: Mike Zunino.
I am new to this site... I am also new to scraping/coding and will greatly appreciate any help I can get! =)
There are several ways to do this. Here is one approach that doesn't require any post-processing; the names come out in the form you want:
from urllib.request import urlopen
from lxml.html import fromstring
url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml"
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)
for batter_row in tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'):
    csk = batter_row.xpath('.//td[@data-stat="player"]/a')[0].text
    print(csk)
Output you may get like:
Jose Abreu
Ronald Acuna
Jason Adam
Willy Adames
Austin L. Adams
You get only the last batter because you are overwriting the value of csk each time in your first loop. Initialize the empty list bat_data first and then add each batter to it.
bat_data = []
for batter_row in blah:
    csk = blah
    bat_data.append(csk)
This will give you a list of all batters, ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01', ...]
Then loop through this list but you don't have to check if string is in the name. Just do x.replace('0.01', '') and then check if the string is empty.
To reverse the order of the names
substring = substring.split(',')
substring.reverse()
nn = " ".join(substring)
Then append nn to the result.
You are getting the quotes and the brackets because you are printing the list. Instead iterate through the list and print each item.
Your code edited assuming you got bat_data correctly:
for x in bat_data :
    substring = x.replace(string,'')
    if substring != "":
        substring = substring.split(',')
        substring.reverse()
        substring = ' '.join(substring)
        result.append(substring)
for x in result:
    print(x)
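Put together, the cleanup steps above might look like the following sketch. The sample `bat_data` values are made up to mimic the scraped format, and `clean_name` is a hypothetical helper; it strips stray spaces around the comma so both "Abreu,Jose" and "Zunino, Mike" come out the same way.

```python
def clean_name(raw, suffix="0.01"):
    """Turn 'Last,First0.01' into 'First Last'."""
    without_suffix = raw.replace(suffix, "")
    # partition splits on the first comma only: ('Last', ',', 'First')
    last, _, first = without_suffix.partition(",")
    return f"{first.strip()} {last.strip()}".strip()


# Made-up sample values in the format described in the answer.
bat_data = ["Abreu,Jose0.01", "Acuna,Ronald0.01", "Zunino, Mike0.01"]

# Skip entries that are empty once the suffix is removed.
result = [clean_name(x) for x in bat_data if x.replace("0.01", "")]

# Printing items one by one avoids the brackets and quotes of the list repr.
for name in result:
    print(name)
```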
1) Print all batter names
print(result)
This will print everything in the result object. If it’s not printing what you expect then there’s something else wrong going on.
2) Remove quotations
The brackets and quotation marks appear because you are printing a list object rather than a string. Try this...
print(result[0])
This will tell the interpreter to print result at the 0 index.
3) Reverse order of names
Try
name = " ".join(part.strip() for part in reversed(result[0].split(",")))

Extracting a string from html tags in python

Hopefully this isn't a duplicate question that I've overlooked, because I've been scouring this forum for someone who has posted a problem similar to the one below...
Basically, I've created a python script that will scrape the callsigns of each ship from the url shown below and append them into a list. In short it works, however whenever I iterate through the list and display each element there seems to be a '[' and ']' between each of the callsigns. I've shown the output of my script below:
Output
*********************** Contents of 'listOfCallSigns' List ***********************
0 ['311062900']
1 ['235056239']
2 ['305500000']
3 ['311063300']
4 ['236111791']
5 ['245639000']
6 ['235077805']
7 ['235011590']
As you can see, it shows the square brackets for each callsign. I have a feeling that this might be down to an encoding problem within the BeautifulSoup library.
Ideally, I want the output to be without any of the square brackets and just the callsign as a string.
*********************** Contents of 'listOfCallSigns' List ***********************
0 311062900
1 235056239
2 305500000
3 311063300
4 236111791
5 245639000
6 235077805
7 235011590
This script I'm using currently is shown below:
My script
# Importing the modules needed to run the script
from bs4 import BeautifulSoup
import urllib2
import re
import requests
import pprint
# Declaring the url for the port of hull
url = "http://www.fleetmon.com/en/ports/Port_of_Hull_5898"
# Opening and reading the contents of the URL using the module 'urllib2'
# Scanning the entire webpage, finding a <table> tag with the id 'vessels_in_port_table' and finding all <tr> tags
portOfHull = urllib2.urlopen(url).read()
soup = BeautifulSoup(portOfHull)
table = soup.find("table", {'id': 'vessels_in_port_table'}).find_all("tr")
# Declaring a list to hold the call signs of each ship in the table
listOfCallSigns = []
# For each row in the table, using a regular expression to extract the first 9 numbers from each ship call-sign
# Adding each extracted call-sign to the 'listOfCallSigns' list
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4])))
print "\n\n*********************** Contents of 'listOfCallSigns' List ***********************\n"
# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row
Does anyone know how to remove the square brackets surrounding each callsign and just display the string?
Thanks in advance! :)
Change the last lines to:
# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row[0] # <-- added a [0] here
Alternatively, you can also add the [0] here:
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4]))[0]) # <-- added a [0] here
The explanation here is that re.findall(...) returns a list (in your case, with a single element in it). So, listOfCallSigns ends up being a "list of sublists each containing a single string":
>>> listOfCallSigns
[ ['311062900'], ['235056239'], ['311063300'], ['236111791'],
  ['245639000'], ['305500000'], ['235077805'], ['235011590'] ]
When you enumerate your listOfCallSigns, the row variable is basically the re.findall(...) that you appended earlier in the code (that's why you can add the [0] after either of them).
So row and re.findall(...) are both of type "list of string(s)" and look like this:
>>> row
['311062900']
And to get the string inside the list, you need to access its first element, i.e.:
>>> row[0]
'311062900'
Hope this helps!
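If each cell is expected to hold at most one call sign, re.search avoids building the nested list in the first place. This is a Python 3 sketch on made-up cell strings (the original script is Python 2 and reads the cells from BeautifulSoup); `cells` and `call_signs` are hypothetical names.

```python
import re

# Made-up stand-ins for str(row.find_all('td')[4]) in the original script.
cells = ["<td>311062900 extra</td>", "<td>n/a</td>", "<td>235056239</td>"]

call_signs = []
for cell in cells:
    match = re.search(r"\d{9}", cell)  # a single match object, or None
    if match:
        call_signs.append(match.group())  # append the string itself, not a list

for i, sign in enumerate(call_signs):
    print(i, sign)
```

A side benefit is that cells without a nine-digit number are skipped, whereas indexing the result of re.findall with [0] would raise an IndexError for them.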
This can also be done by stripping the unwanted characters from the string like so:
a = "string with bad characters []'] in here"
a = a.translate(None, "[]'")
print a
