I use the following code to extract data. The data in the text comes in two different structures, so I have to test which one I am dealing with. The code works, but I don't think it is a good solution.
I'm a beginner with regular expressions. I have searched for articles, but I haven't found a way to improve it.
How can I refine the following code?
import re
import html
import json
filepath="D:/Response.txt"
data=open(filepath,'r', encoding='utf-16').read()
rex1 = "msgList = '({.*?})'"
rex2='"general_msg_list":"({.*?})"'
def get_art(data,rex):
    pattern = re.compile(pattern=rex, flags=re.S)
    match = pattern.search(data)
    if match:
        data = match.group(1).replace('\\','')
        # there is some difference for data.
        if rex=="msgList = '({.*?})'":
            data = html.unescape(data)
        data = json.loads(data)
        articles = data.get("list")
        for item in articles:
            print('\nthe result is:\n',item)
with open(filepath,'r', encoding='utf-16') as fp:
    line = fp.readline()
    while line:
        try:
            get_art(line.strip(),rex1)
        except:
            pass
        try:
            get_art(line.strip(),rex2)
        except:
            pass
        line = fp.readline()
I need to capture the data in (msgList =....) or (general_msg_list":"...) and convert the string to JSON. For the data in (msgList =....), I found that I need to call "data = html.unescape(data)", whereas calling "data = html.unescape(data)" on the data in (general_msg_list":"...) raises an error.
Currently, I use
try:
    get_art(line.strip(),rex1)
except:
    pass
try:
    get_art(line.strip(),rex2)
except:
    pass
I think there should be a better way to do this.
Maybe a better way is to read the whole file at once instead of line by line. My problem is that I have difficulty handling the whole file's data, which is why I read it line by line.
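One possible way to drop the paired try/except blocks is to read the text once and drive both cases from a small table that records, per pattern, whether the captured text needs html.unescape. This is only a sketch: the names PATTERNS and extract_articles are hypothetical, the sample string is made up and stands in for the content of Response.txt, and it assumes (as in the code above) that stripping backslashes is safe for this data.

```python
import html
import json
import re

# Each entry: (compiled pattern, whether the captured text is HTML-escaped).
PATTERNS = [
    (re.compile(r"msgList = '(\{.*?\})'", re.S), True),
    (re.compile(r'"general_msg_list":"(\{.*?\})"', re.S), False),
]


def extract_articles(text):
    """Return the "list" entries found by any pattern in the whole text."""
    articles = []
    for pattern, needs_unescape in PATTERNS:
        for match in pattern.finditer(text):
            payload = match.group(1).replace("\\", "")
            if needs_unescape:
                payload = html.unescape(payload)
            try:
                parsed = json.loads(payload)
            except json.JSONDecodeError:
                continue  # skip a capture that is not valid JSON
            articles.extend(parsed.get("list", []))
    return articles


# Made-up sample standing in for the real file content.
sample = (
    "var msgList = '{&quot;list&quot;:[{&quot;id&quot;:1}]}';\n"
    '{"general_msg_list":"{\\"list\\":[{\\"id\\":2}]}"}\n'
)
for item in extract_articles(sample):
    print(item)
```

With the real file, the text would come from open(filepath, encoding='utf-16').read(). Catching only json.JSONDecodeError, rather than using a bare except, keeps unrelated errors visible.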
I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a DataFrame. For now I'm just trying to get the relevant data into an array and to understand how line iteration and readline() work in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, and add the concatenated full text to the list array.
            # 2. Continue the for loop within which all of this sits.
            pass
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: when you write list = [], you are rebinding the name list, which shadows the built-in list class and can cause you problems later. You should really treat list, set, and so on like keywords, even though Python technically doesn't, just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
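For comparison, here is a rough sketch of the iterator alternative described above. It passes a default to next(), so the end of input shows up as None instead of a StopIteration exception; that is a small variation on the plain next(i) call, not necessarily what the answer had in mind. The sample text is made up and io.StringIO stands in for the open file.

```python
import io

# Made-up sample in the format described in the question.
sample = io.StringIO(
    "Title: title of an article\n"
    "Full text: first line,\n"
    "second line\n"
    "Subject: Python\n"
)

titles, texts, subjects = [], [], []
lines = iter(sample)
line = next(lines, None)  # None marks end of input, so no StopIteration
while line is not None:
    if line.startswith("Title:"):
        titles.append(line)
    elif line.startswith("Full text:"):
        full_text = line
        line = next(lines, None)
        # Consume continuation lines until the "Subject:" line.
        while line is not None and not line.startswith("Subject:"):
            full_text += line
            line = next(lines, None)
        texts.append(full_text)
        continue  # re-examine the current line (a "Subject:" line or None)
    elif line.startswith("Subject:"):
        subjects.append(line)
    line = next(lines, None)

print(titles, texts, subjects)
```

The nested loop makes the "Full text:" handling explicit, at the cost of advancing the iterator in two places, which is one reason the flag-based loop may be easier to keep correct.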
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
I'm making an API call that pulls the desired fields from ...url/articles.json and writes them to a CSV file. My problem is that the ['label_names'] field holds multiple values (an article might have multiple labels).
How can I pull out the multiple values without getting this error?
  File "articles_labels.py", line 40, in <module>
    decode_3 = unicodedata.normalize('NFKD', article_label)
TypeError: normalize() argument 2 must be str, not list
import requests
import csv
import unicodedata
import getpass
url = 'https://......./articles.json'
user = ' '
pwd = ' '
csvfile = 'articles_labels.csv'
output_1 = []
output_1.append("id")
output_2 = []
output_2.append("title")
output_3 = []
output_3.append("label_names")
output_4 = []
output_4.append("link")
while url:
    response = requests.get(url, auth=(user, pwd))
    data = response.json()
    for article in data['articles']:
        article_id = article['id']
        decode_1 = int(article_id)
        output_1.append(decode_1)
    for article in data['articles']:
        title = article['title']
        decode_2 = unicodedata.normalize('NFKD', title)
        output_2.append(decode_2)
    for article in data['articles']:
        article_label = article['label_names']
        decode_3 = unicodedata.normalize('NFKD', article_label)
        output_3.append(decode_3)
    for article in data['articles']:
        article_url = article['html_url']
        # links are collected in their own list
        decode_4 = unicodedata.normalize('NFKD', article_url)
        output_4.append(decode_4)
    print(data['next_page'])
    url = data['next_page']
print("Number of articles:")
print(len(output_1))
with open(csvfile, 'w') as fp:
    writer = csv.writer(fp,dialect = 'excel')
    writer.writerows([output_1])
    writer.writerows([output_2])
    writer.writerows([output_3])
    writer.writerows([output_4])
My problem here is that the ['labels_name'] endpoint is a string with multiple values.(an article might have multiple labels) How can I pull multiple values of a string
It's a list, not a string, so you don't have "a string with multiple values"; you already have a list of multiple strings, as-is.
The question is what you want to do with them. A CSV cell holds a single value, so you must decide how to serialise a list of strings into a single string, e.g. by joining them together (with some separator such as a space or comma) or by picking just the first one (taking care to handle the case where there is none). Either way, the issue is not really technical.
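As an illustration of the joining option, the sketch below writes one row per article and collapses the labels into a single cell. The article records and the ";" separator are made up for the example, and io.StringIO stands in for the real output file.

```python
import csv
import io

# Made-up records shaped like the API response in the question.
articles = [
    {"id": 1, "title": "First", "label_names": ["python", "csv"]},
    {"id": 2, "title": "Second", "label_names": []},
]

buffer = io.StringIO()  # stands in for open(csvfile, "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["id", "title", "label_names"])
for article in articles:
    # Serialise the list of labels into one cell; an empty list gives "".
    labels = ";".join(article["label_names"])
    writer.writerow([article["id"], article["title"], labels])

print(buffer.getvalue())
```

A separator other than the comma avoids extra quoting in the output, although csv.writer would quote the cell correctly either way.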
unicodedata.normalize takes a Unicode string, not a list, as the error says. The correct way to use unicodedata.normalize is as follows (example taken from "How does unicodedata.normalize(form, unistr) work?"):
from unicodedata import normalize
print(normalize('NFD', u'\u00C7'))
print(normalize('NFC', u'C\u0327'))
#Ç
#Ç
Hence you need to make sure that the second argument to unicodedata.normalize, such as title in unicodedata.normalize('NFKD', title), is a Unicode string.
I am trying to put data from a text file into an array. Below is the array I am trying to create.
[("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w),
("harmonic minor",r,w,s,w,w,s,w+s,s)]
Instead, when I load the data from the text file, I get the output below. It should look like the array above. I realise I have to split it, but I don't really know how for this sort of array. Could anyone help me with this?
['("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w),
("harmonic minor",r,w,s,w,w,s,w+s,s)']
Below is the text file I am trying to load.
("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w), ("harmonic minor",r,w,s,w,w,s,w+s,s)
And this is how I'm loading it:
file = open("slide.txt", "r")
scale = [file.readline()]
If you mean a list instead of an array:
with open(filename) as f:
    list_name = f.readlines()
Some questions come to mind about what the rest of your implementation looks like and how you expect it all to work, but below is an example of how this could be done in a fairly straightforward way:
class W(object):
    pass
class S(object):
    pass
class WS(W, S):
    pass
class R(object):
    pass
def main():
    # separate parts that should become tuples eventually
    text = str()
    with open("data", "r") as fh:
        text = fh.read()
    parts = text.split("),")
    # remove unwanted characters and whitespace
    cleaned = list()
    for part in parts:
        part = part.replace('(', '')
        part = part.replace(')', '')
        cleaned.append(part.strip())
    # convert text parts into tuples with actual data types
    list_of_tuples = list()
    for part in cleaned:
        t = construct_tuple(part)
        list_of_tuples.append(t)
    # now use the data for something
    print(list_of_tuples)
def construct_tuple(data):
    t = tuple()
    content = data.split(',')
    for item in content:
        t = t + (get_type(item),)
    return t
# there needs to be some way to decide what type/object should be used:
def get_type(id):
    type_mapping = {
        '"harmonic minor"': 'harmonic minor',
        '"major"': 'major',
        '"relative minor"': 'relative minor',
        's': S(),
        'w': W(),
        'w+s': WS(),
        'r': R()
    }
    return type_mapping.get(id)
if __name__ == "__main__":
    main()
This code makes some assumptions:
There is a file named data with the content:
("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w), ("harmonic minor",r,w,s,w,w,s,w+s,s)
You want a list of tuples which contains the values.
It's acceptable to have w+s represented by some data type, as it would be difficult to have something like w+s appear inside a tuple without it being evaluated when the tuple is created. Another way to do it would be to have w and s represented by data types that can be used with +.
So even if this works, it might be a good idea to think about the format of the text file (if you have control of that), and see if it can be changed into something which would allow you to use some parsing library in a simple way, e.g. see how it could be more easily represented as csv or even turn it into json.
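To make the JSON suggestion concrete, here is a small sketch of what such a file could look like and how it would load. The layout (one list per scale, name first, step symbols kept as plain strings) is only an assumption for illustration; the JSON is embedded as a string so that the example is self-contained.

```python
import json

# Hypothetical JSON version of the scale file: each scale is a list whose
# first element is the name and whose remaining elements are step symbols.
json_text = """
[
  ["major", "r", "w", "w", "s", "w", "w", "w", "s"],
  ["relative minor", "r", "w", "s", "w", "w", "s", "w", "w"],
  ["harmonic minor", "r", "w", "s", "w", "w", "s", "w+s", "s"]
]
"""

# json.loads does the parsing; tuple() converts each inner list.
scales = [tuple(entry) for entry in json.loads(json_text)]
print(scales[2])
```

Reading from a real file would use json.load(fh) instead of json.loads(json_text). The step symbols stay strings here, so a mapping such as the type_mapping above would still be needed if objects are wanted.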
I want to read a txt file and store it as a list of string. This is a way that I come up with myself. It looks really clumsy. Is there any better way to do this? Thanks.
import re
import urllib2
import numpy as np
url=('http://quant-econ.net/_downloads/graph1.txt')
response= urllib2.urlopen(url)
txt= response.read()
f=open('graph1.txt','w')
f.write(txt)
f.close()
f=open('graph1.txt','r')
nodes=f.readlines()
I tried the solutions provided below, but they all return something different from my previous code.
This is the string produced by split():
'node0, node1 0.04, node8 11.11, node14 72.21'
This is what my code produces:
'node0, node1 0.04, node8 11.11, node14 72.21\n'
The problem is that, without the '\n', I get an index error when I try to process the list of strings:
" row = index[0] IndexError: list index out of range "
for node in nodes:
    index = re.findall('(?<=node)\w+',node)
    index = map(int,index)
    row = index[0]
    del index[0]
According to the documentation, response is already a file-like object: you should be able to do response.readlines().
For those problems where you do need to create an intermediate file like this, though, you want to use io.StringIO
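A minimal sketch of the io.StringIO idea, written for Python 3 and using a made-up string in place of the downloaded text (in Python 3 the bytes from response.read() would first need decoding):

```python
import io

# Text standing in for the decoded download (made-up sample).
txt = "node0, node1 0.04, node8 11.11\nnode1\n"

buffer = io.StringIO(txt)   # in-memory, file-like object
nodes = buffer.readlines()  # same result as reading a real file
print(nodes)
```

Because readlines() keeps the trailing newline on each line, this reproduces what the question's write-then-read approach returns without touching the disk.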
Look at split. So:
nodes = response.read().split("\n")
EDIT: Alternatively if you want to avoid \r\n newlines, use splitlines.
nodes = response.read().splitlines()
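The two calls differ when the text ends with a newline: split("\n") then yields a trailing empty string, while splitlines() does not. An empty string contains no "node" match, which would be consistent with the IndexError mentioned in the question, although that is an assumption since the downloaded file is not shown here. A short Python 3 sketch with made-up data:

```python
import re

# Made-up sample that ends with a newline, as text files usually do.
txt = "node0, node1 0.04, node8 11.11\nnode1, node2 3.5\n"

with_split = txt.split("\n")        # trailing "" because txt ends with "\n"
with_splitlines = txt.splitlines()  # no trailing empty string

for node in with_splitlines:
    # A list is used because map() returns an iterator in Python 3.
    index = [int(n) for n in re.findall(r"(?<=node)\w+", node)]
    row = index[0]

print(with_split)
print(with_splitlines)
```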
Try:
url=('http://quant-econ.net/_downloads/graph1.txt')
response= urllib2.urlopen(url)
txt= response.read()
with open('graph1.txt','w') as f:
    f.write(txt)
nodes=txt.split("\n")
If you don't want the file, this should work:
url=('http://quant-econ.net/_downloads/graph1.txt')
response= urllib2.urlopen(url)
txt= response.read()
nodes=txt.split("\n")
I'm new to python and trying to create a script to modify the output of a JS file to match what is required to send data to an API. The JS file is being read via urllib2.
def getPage():
    url = "http://url:port/min_day.js"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()
# JS Data
# m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"
# m[mi++]="19.12.12 09:25:00|1911;2060;3277;293;59"
# Required format for API
# addbatchstatus.jsp?data=20121219,09:25,3277.0,1911,-1,-1,59.0,293.0;20121219,09:30,3440.0,1964,-1,-1,60.0,293.0
As a breakdown (Required values are bold)
m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"
and I need to insert the values -1,-1 into the string.
I've managed to get the date into the correct format and to replace characters and line breaks so that the output looks as follows, but I have a feeling I'm heading down the wrong track if I need to be able to reorder the values in this string. It also looks like the required order is reversed with regard to time.
20121219,09:30:00,1964,2121,3440,293,60;20121219,09:25:00,1911,2060,3277,293,59
Any help would be greatly appreciated! I'm thinking that regex might be what I need.
Here's a Regex pattern to strip out the bits you don't want
m\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) (?P<time>[\d:]{8})\|(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+).+
and replace with (in Python's re.sub, named groups are referenced as \g<name>)
20\g<year>\g<month>\g<day>,\g<time>,\g<v3>,\g<v1>,-1,-1,\g<v5>,\g<v4>
This pattern assumes that the characters before the date are constant. You can replace m\[mi\+\+\]=" with [^\d]+ if you want more general handling of that bit.
So to put this in practice in python:
import re
import urllib2
def getPage():
    url = "http://url:port/min_day.js"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()
def repl(match):
    return '20%s%s%s,%s,%s,%s,-1,-1,%s,%s'%(match.group('year'),
                                            match.group('month'),
                                            match.group('day'),
                                            match.group('time'),
                                            match.group('v3'),
                                            match.group('v1'),
                                            match.group('v5'),
                                            match.group('v4'))
pattern = re.compile(r'm\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) (?P<time>[\d:]{8})\|(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+).+')
data = [re.sub(pattern, repl, line).split(',') for line in getPage().split('\n')]
# If you want to sort your data
data = sorted(data, key=lambda x:x[0], reverse=True)
# If you want to write your data back to a formatted string
new_string = ';'.join(','.join(x) for x in data)
# If you want to write it back to file
with open('new/file.txt', 'w') as f:
    f.write(new_string)
Hope that helps!
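For reference, a self-contained Python 3 sketch that turns the two sample lines from the question into the exact string shown under "Required format for API". It is a variation on the pattern above, not the answer's own code: it keeps only HH:MM of the time, prints three of the values with one decimal place, and sorts in ascending order, all inferred from the required output in the question. The names LINE, records and hhmm are made up for the example.

```python
import re

# Sample lines copied from the question.
js_text = (
    'm[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"\n'
    'm[mi++]="19.12.12 09:25:00|1911;2060;3277;293;59"\n'
)

LINE = re.compile(
    r'm\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) '
    r'(?P<hhmm>\d{2}:\d{2}):\d{2}\|'
    r'(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+)"'
)

records = []
for m in LINE.finditer(js_text):
    # Two-digit years are assumed to belong to the 2000s.
    date = "20{year}{month}{day}".format(**m.groupdict())
    records.append((
        date,
        m["hhmm"],
        "%.1f" % float(m["v3"]),
        m["v1"],
        "-1",
        "-1",
        "%.1f" % float(m["v5"]),
        "%.1f" % float(m["v4"]),
    ))

records.sort()  # ascending by date, then time
data = ";".join(",".join(r) for r in records)
print(data)
```

Using finditer over the whole text means lines that do not match, such as a trailing empty line, are simply skipped instead of ending up as empty records.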