About file I/O in python - python

I want to read a txt file and store it as a list of strings. This is the approach I came up with myself, and it looks really clumsy. Is there a better way to do this? Thanks.
import re
import urllib2
import numpy as np
url=('http://quant-econ.net/_downloads/graph1.txt')
response= urllib2.urlopen(url)
txt= response.read()
f=open('graph1.txt','w')
f.write(txt)
f.close()
f=open('graph1.txt','r')
nodes=f.readlines()
I tried the solutions provided below, but they all return something different from my original code.
This is the string produced by split():
'node0, node1 0.04, node8 11.11, node14 72.21'
This is what my code produces:
'node0, node1 0.04, node8 11.11, node14 72.21\n'
The problem is that without the '\n', processing the list of strings raises an index error:
" row = index[0] IndexError: list index out of range "
for node in nodes:
    index = re.findall('(?<=node)\w+',node)
    index = map(int,index)
    row = index[0]
    del index[0]

According to the documentation, response is already a file-like object: you should be able to do response.readlines().
For those problems where you do need something that behaves like an intermediate file, though, you want to use io.StringIO (or io.BytesIO for byte strings).
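As a rough illustration of the in-memory file idea, here is a minimal Python 3 sketch. The sample text is an assumption standing in for the downloaded data, not the real contents of graph1.txt. Note that in Python 2, io.StringIO expects unicode text, so the byte string returned by urllib2 would need io.BytesIO or decoding first.

```python
import io

# Assumed sample data; in practice this would be the decoded response body.
txt = "node0, node1 0.04\nnode1, node8 11.11\n"

# io.StringIO wraps a string in a file-like object, so no file on disk is needed.
buffer = io.StringIO(txt)

# readlines() keeps the trailing newline on each line, just like a real file.
nodes = buffer.readlines()
print(nodes)
```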

Look at split. So:
nodes = response.read().split("\n")
EDIT: Alternatively if you want to avoid \r\n newlines, use splitlines.
nodes = response.read().splitlines()
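The difference between these calls matters for the IndexError in the question. A small Python 3 sketch, assuming the text ends with a newline (the sample string is made up for illustration):

```python
# Assumed sample data ending with a newline, as most text files do.
text = "node0, node1 0.04\nnode1, node8 11.11\n"

# split("\n") leaves a trailing empty string after the final newline.
by_split = text.split("\n")
print(by_split)       # ['node0, node1 0.04', 'node1, node8 11.11', '']

# splitlines() drops line endings and produces no trailing empty entry.
by_splitlines = text.splitlines()
print(by_splitlines)  # ['node0, node1 0.04', 'node1, node8 11.11']

# splitlines(True) keeps the line endings, matching what readlines() returns.
with_ends = text.splitlines(True)
print(with_ends)      # ['node0, node1 0.04\n', 'node1, node8 11.11\n']
```

If the data does end with a newline, the empty last element from split("\n") is a likely cause of the error: re.findall returns an empty list for it, so index[0] fails.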

Try:
url=('http://quant-econ.net/_downloads/graph1.txt')
response= urllib2.urlopen(url)
txt= response.read()
with open('graph1.txt','w') as f:
    f.write(txt)
nodes=txt.split("\n")
If you don't want the file, this should work:
url=('http://quant-econ.net/_downloads/graph1.txt')
response= urllib2.urlopen(url)
txt= response.read()
nodes=txt.split("\n")

Related

How do I import an external text file from a website and use a list from inside it

I'm having trouble trying to make this work
import requests
import random
response = requests.get("https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
data = response.text
for line in data:
    print(line)
I am trying to pull a txt file from the internet and use the list inside the text file.
Right now all it does is treat each letter as a separate string(?)
response.text is a string, so if you loop over it you get one character at a time. (Read about how Python handles strings.)
In this case Python doesn't know what a "line" is. So split the data with newlines and try again:
import requests
import random
response = requests.get("https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
data = response.text
for line in data.split("\n"):
    print(line)
The attribute response.text is a string, so iterating over it will give you individual characters. You can split the string by spaces (or maybe by newlines) to get what you need (I also added a few print statements to show the steps):
import requests
response = requests.get(
    "https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
print('response.text type:', type(response.text))
print('response.text len:', len(response.text))
print(response.text)
print()
print('splitting by spaces:')
for i, s in enumerate(response.text.split()):
    print(i, s)
print()
print('splitting by newlines:')
for i, line in enumerate(response.text.split('\n')):
    print(i, line)
The code gives this output:
response.text type: <class 'str'>
response.text len: 21
a = ["please","work"]
splitting by spaces:
0 a
1 =
2 ["please","work"]
splitting by newlines:
0 a = ["please","work"]
@bruno suggested in a comment to use str.splitlines(); this will work even if the response is bytes, since the method bytes.splitlines() also exists.
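Since the downloaded text shown above is a line of the form a = ["please","work"], one way to get at the list itself is to parse the right-hand side with ast.literal_eval. This is only a sketch under that assumption; parse_assignment is a hypothetical helper name, and the literal string below stands in for response.text.

```python
import ast

def parse_assignment(text):
    """Split 'name = <literal>' and safely evaluate the literal part."""
    name, _, value = text.partition("=")
    # literal_eval only accepts Python literals, so it will not run arbitrary code.
    return name.strip(), ast.literal_eval(value.strip())

# Stand-in for response.text, copied from the output shown above.
name, items = parse_assignment('a = ["please","work"]')
print(name, items)
```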

how to refine the regex for conditions

I use the following code to extract data. Because the data in the text comes in two different structures, I need to branch on which one matched. The code below works, but I don't think it's a good solution.
I'm a beginner with regular expressions. I searched some articles, but I haven't found a way to refine it.
How can I refine the following code?
import re
import html
import json
filepath="D:/Response.txt"
data=open(filepath,'r', encoding='utf-16').read()
rex1 = "msgList = '({.*?})'"
rex2='"general_msg_list":"({.*?})"'
def get_art(data,rex):
    pattern = re.compile(pattern=rex, flags=re.S)
    match = pattern.search(data)
    if match:
        data = match.group(1).replace('\\','')
        # there is some difference for data.
        if rex=="msgList = '({.*?})'":
            data = html.unescape(data)
        data = json.loads(data)
        articles = data.get("list")
        for item in articles:
            print('\nthe result is:\n',item)
with open(filepath,'r', encoding='utf-16') as fp:
    line = fp.readline()
    while line:
        try:
            get_art(line.strip(),rex1)
        except:
            pass
        try:
            get_art(line.strip(),rex2)
        except:
            pass
        line = fp.readline()
I need to capture the data in (msgList =....) or (general_msg_list":"...) and convert the string to JSON. For the data in (msgList =....) I found that I need to call "data = html.unescape(data)", but if I call "data = html.unescape(data)" on the data in (general_msg_list":"...), there is an error.
Currently, I use
try:
    get_art(line.strip(),rex1)
except:
    pass
try:
    get_art(line.strip(),rex2)
except:
    pass
I think there should be a better way to write this.
Maybe a better way would be to read the whole file rather than line by line. My problem is that I have difficulty dealing with the whole file's data, which is why I read it line by line.
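One possible way to avoid the pair of bare try/except blocks is to keep each pattern together with the clean-up step its payload needs, and to scan the whole text once per pattern. The following is a sketch, not a tested fix for the real file: EXTRACTORS and extract_articles are hypothetical names, and the sample string is invented to resemble the two structures described above.

```python
import html
import json
import re

# Each entry pairs a compiled pattern with the clean-up its payload needs.
EXTRACTORS = [
    (re.compile(r"msgList = '(\{.*?\})'", re.S), html.unescape),
    (re.compile(r'"general_msg_list":"(\{.*?\})"', re.S), lambda s: s),
]

def extract_articles(text):
    """Return the 'list' entries found by any of the patterns in text."""
    articles = []
    for pattern, cleanup in EXTRACTORS:
        for match in pattern.finditer(text):
            payload = cleanup(match.group(1).replace("\\", ""))
            try:
                parsed = json.loads(payload)
            except ValueError:
                # Skip payloads that are not valid JSON instead of hiding all errors.
                continue
            articles.extend(parsed.get("list", []))
    return articles

# Invented sample covering both structures.
sample = (
    "msgList = '{&quot;list&quot;:[{&quot;id&quot;:1}]}'\n"
    '"general_msg_list":"{\\"list\\":[{\\"id\\":2}]}"\n'
)
print(extract_articles(sample))
```

Because finditer works on the whole string, the same function could be called on the result of reading the entire file at once, and only JSON decoding errors are swallowed.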

Simple forvalues loop in python?

Is there a simple way in Python to loop over a simple list of numbers?
I want to scrape some data from different URLs that only differ in 3 numbers.
I'm quite new to Python and couldn't figure out an easy way to do it.
Thanks a lot!
Here's my code:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.example.com/3322")
bsObj = BeautifulSoup(html)
table = bsObj.findAll("table",{"class":"MainContent"})[0]
rows=table.findAll("td")
csvFile = open("/Users/Max/Desktop/file1.csv", 'wt')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow=[]
        for cell in row.findAll(['tr', 'td']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()
In Stata this would be like:
foreach i of 13 34 55 67{
    html = urlopen("http://www.example.com/`i'")
    ....
}
Thanks a lot!
Max
I've broken your original code into functions simply to make clearer what I think is the answer to your question: use a simple loop, and .format() to construct urls and filenames.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
def scrape_url(url):
    html = urlopen(url)
    bsObj = BeautifulSoup(html)
    table = bsObj.findAll("table",{"class":"MainContent"})[0]
    rows=table.findAll("td")
    return rows
def write_csv_data(path, rows):
    csvFile = open(path, 'wt')
    writer = csv.writer(csvFile)
    try:
        for row in rows:
            csvRow=[]
            for cell in row.findAll(['tr', 'td']):
                csvRow.append(cell.get_text())
            writer.writerow(csvRow)
    finally:
        csvFile.close()
for i in (13, 34, 55, 67):
    url = "http://www.example.com:3322/{}".format(i)
    csv_path = "/Users/MaximilianMandl/Desktop/file-{}.csv".format(i)
    rows = scrape_url(url)
    write_csv_data(csv_path, rows)
I would use set.intersection() for that:
mylist=[1,16,8,32,7,5]
fieldmatch=[5,7,16]
intersection = list(set(mylist).intersection(fieldmatch))
I'm not familiar with Stata, but it looks like the Python equivalent might be simply:
import requests
for i in [13, 34, 55, 67]:
    response = requests.get("http://www.example.com/{}".format(i))
    # ....
The simplest way to do this is to apply the filter inside the loop:
mylist=[1,16,8,32,7,5]
for myitem in mylist:
    if myitem in (5,7,16):
        print myitem # or print(myitem)
This may not, however, be the most elegant way to do it. If you wanted to store a new list of the matching results, you can use a list comprehension:
mylist=[1,16,8,32,7,5]
fieldmatch=[5,7,16]
filteredlist=[ x for x in mylist if x in fieldmatch ]
You can then take filteredlist which contains only the items in mylist that match fieldmatch (in other words your original list filtered by your criteria) and iterate over it like any other list:
for myitem in filteredlist:
    # Perform whatever process you want to each item here
    do_something_with(myitem)
Hope this helps.
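The two filtering approaches above do not behave identically, which a short Python 3 sketch can show. It reuses the mylist and fieldmatch values from the answers; storing fieldmatch as a set is an assumption made here to speed up the membership test.

```python
mylist = [1, 16, 8, 32, 7, 5]
fieldmatch = {5, 7, 16}  # a set, so "x in fieldmatch" is a fast lookup

# The list comprehension keeps the original order (and any duplicates) of mylist.
filteredlist = [x for x in mylist if x in fieldmatch]
print(filteredlist)  # [16, 7, 5]

# A set intersection returns the same members, but as an unordered set.
as_set = set(mylist) & fieldmatch
print(sorted(as_set))  # [5, 7, 16]
```

If the order of the original list matters, the comprehension is the safer choice of the two.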

Python: How to split data of different data types into a 2D array

I’m trying to split downloaded data into a 2D array with different data types. The downloaded data looks like this:
000|17:40
000|17:45
010|17:50
025|17:55
056|18:00
178|18:05
202|18:10
203|18:15
190|18:20
072|18:25
013|18:30
002|18:35
000|18:40
000|18:45
000|18:50
000|18:55
000|19:00
000|19:05
000|19:10
000|19:15
000|19:20
000|19:25
000|19:30
000|19:35
000|19:40
I’m using the following code to parse this into a two dimensional array:
#!/usr/bin/python
import urllib2
response = urllib2.urlopen('http://gps.buienradar.nl/getrr.php?lat=52&lon=4')
html = response.read()
htmlsplit = []
for record in html.split("\r\n"):
    htmlsplit.append(record.split("|"))
print htmlsplit
This works great, but as expected, it treats everything as a string. I’ve found some examples that split into integers. That would be fine if both sides were integers, but in my case it’s an integer | string (or maybe some kind of Python time format).
How can I split this directly into different data types?
Something like this?
for record in html.split("\r\n"): # beware, newlines are treacherous!
    s = record.split("|")
    htmlsplit.append((int(s[0]), s[1]))
Just write a parser for each record if your data is this simple. However, I would add a try/except clause to catch errors from non-conforming lines, empty lines, etc. that may be present in the data; the code above is very fragile. Also, you might want to split on \n only and then clean your strings with strip() (i.e. replace s[1] by s[1].strip()). The integer field needs no such cleaning, because int() ignores surrounding whitespace.
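A per-record parser along those lines might look like the following Python 3 sketch. parse_records is a hypothetical name, and converting the second field to a datetime.time is one option among several for the "Python time format" the question mentions; it assumes the HH:MM layout shown in the sample data.

```python
from datetime import datetime

def parse_records(text):
    """Turn 'NNN|HH:MM' lines into (int, datetime.time) tuples."""
    records = []
    for line in text.splitlines():  # handles both \n and \r\n endings
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        try:
            value, clock = line.split("|")
            records.append((int(value), datetime.strptime(clock, "%H:%M").time()))
        except ValueError:
            # Non-conforming line: wrong field count, bad integer, or bad time.
            continue
    return records

records = parse_records("000|17:40\r\n010|17:50\n\nbad line\n")
print(records)
```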
Use str.splitlines instead of splitting on \r\n
Use the csv module to iterate over the lines:
import csv
txt = '000|17:40\n000|17:45\n000|17:50\n000|17:55\n000|18:00\n000|18:05\n000|18:10\n000|18:15\n000|18:20\n000|18:25\n000|18:30\n000|18:35\n000|18:40\n000|18:45\n000|18:50\n000|18:55\n000|19:00\n000|19:05\n000|19:10\n000|19:15\n000|19:20\n000|19:25\n000|19:30\n000|19:35\n000|19:40\n'
reader = csv.reader(txt.splitlines(), delimiter='|')
column1 = []
column2 = []
for c1, c2 in reader:
    column1.append(c1)
    column2.append(c2)
You can also use the DictReader:
import StringIO
reader2 = csv.DictReader(StringIO.StringIO(txt),
                         fieldnames=['int', 'time'],
                         delimiter='|')
column1 = []
column2 = []
for row in reader2:
    column1.append(row['int'])   # first field, as in the csv.reader example
    column2.append(row['time'])  # second field

Manipulate string data

I'm new to python and trying to create a script to modify the output of a JS file to match what is required to send data to an API. The JS file is being read via urllib2.
def getPage():
    url = "http://url:port/min_day.js"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()
# JS Data
# m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"
# m[mi++]="19.12.12 09:25:00|1911;2060;3277;293;59"
# Required format for API
# addbatchstatus.jsp?data=20121219,09:25,3277.0,1911,-1,-1,59.0,293.0;20121219,09:30,3440.0,1964,-1,-1,60.0,293.0
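As a breakdown (judging by the required format above, the needed values are the date, the time, and the first, third, fourth and fifth numbers):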
m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"
and I need to add the values -1,-1 into the string.
I've managed to get the date into the correct format and to replace characters and line breaks so that the output looks like the line below, but I have a feeling I'm heading down the wrong track if I need to be able to reorder the values in this string. It also looks like the entries are in reverse order with regard to time.
20121219,09:30:00,1964,2121,3440,293,60;20121219,09:25:00,1911,2060,3277,293,59
Any help would be greatly appreciated! I'm thinking a regex might be what I need.
Here's a Regex pattern to strip out the bits you don't want
m\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) (?P<time>[\d:]{8})\|(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+).+
and replace with (written here with Python's \g<name> syntax for named groups)
20\g<year>\g<month>\g<day>,\g<time>,\g<v3>,\g<v1>,-1,-1,\g<v5>,\g<v4>
This pattern assumes that the characters before the date are constant. You can replace m\[mi\+\+\]=" with [^\d]+ if you want more general handling of that bit.
So to put this in practice in python:
import re
import urllib2
def getPage():
    url = "http://url:port/min_day.js"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()
def repl(match):
    return '20%s%s%s,%s,%s,%s,-1,-1,%s,%s'%(match.group('year'),
                                             match.group('month'),
                                             match.group('day'),
                                             match.group('time'),
                                             match.group('v3'),
                                             match.group('v1'),
                                             match.group('v5'),
                                             match.group('v4'))
pattern = re.compile(r'm\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) (?P<time>[\d:]{8})\|(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+).+')
data = [re.sub(pattern, repl, line).split(',') for line in getPage().split('\n')]
# If you want to sort your data
data = sorted(data, key=lambda x:x[0], reverse=True)
# If you want to write your data back to a formatted string
new_string = ';'.join(','.join(x) for x in data)
# If you want to write it back to file
with open('new/file.txt', 'w') as f:
    f.write(new_string)
Hope that helps!
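The required format in the question also drops the seconds, writes three of the values with one decimal place, and lists the entries in ascending time order. A Python 3 sketch that aims at exactly that target is shown below; to_batch and LINE_RE are hypothetical names, the one-decimal formatting is inferred from the example in the question, and the input is the two sample lines from the question rather than a live download.

```python
import re

# Same idea as the pattern above, but capturing only HH:MM of the time.
LINE_RE = re.compile(
    r'm\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) '
    r'(?P<hhmm>\d{2}:\d{2}):\d{2}\|'
    r'(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+)'
)

def one_decimal(text):
    """Format a digit string like '3277' as '3277.0'."""
    return "{:.1f}".format(float(text))

def to_batch(js_text):
    """Build the 'date,time,v3,v1,-1,-1,v5,v4;...' string, oldest entry first."""
    rows = []
    for m in LINE_RE.finditer(js_text):
        date = "20{year}{month}{day}".format(**m.groupdict())
        rows.append((date, m.group("hhmm"), one_decimal(m.group("v3")),
                     m.group("v1"), "-1", "-1",
                     one_decimal(m.group("v5")), one_decimal(m.group("v4"))))
    # Zero-padded date and time strings sort chronologically as plain text.
    rows.sort(key=lambda r: (r[0], r[1]))
    return ";".join(",".join(r) for r in rows)

js = ('m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"\n'
      'm[mi++]="19.12.12 09:25:00|1911;2060;3277;293;59"\n')
print(to_batch(js))
```

Lines that do not match the pattern are simply skipped, because finditer only yields matches.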
