Process special JSON with keys as numbers - python

I want to extract data from a file into a dictionary via json.loads. Example:
{725: 'pitcher, ewer',
726: "plane, carpenter's plane, woodworking plane"}
json.loads can't handle numeric keys.
Some values are wrapped in double quotes (") and others in single quotes (').
Any suggestions?
Code:
import re
import requests
url = url  # placeholder: the real URL was not given
r = requests.get(url)
response = r.text.replace('\n','')
response = re.sub(r':(\d+):*', r'"\1"', response)

The file you supplied looks like a valid Python dict literal, so I suggest an alternative approach using literal_eval.
from ast import literal_eval
data = literal_eval(r.text)
print(data[726])
Output: plane, carpenter's plane, woodworking plane
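As a minimal sketch of the literal_eval route, the snippet below parses the sample from the question and, if JSON is still wanted afterwards, converts the result with json.dumps. The variable text is an assumption standing in for r.text. Note that JSON object keys are always strings, so the integer keys have to be restored by hand after a round trip.

```python
import json
from ast import literal_eval

# Sample copied from the question; in practice this would be r.text.
text = """{725: 'pitcher, ewer',
726: "plane, carpenter's plane, woodworking plane"}"""

# literal_eval accepts only Python literals, so mixed quote styles
# and integer keys are both fine.
data = literal_eval(text)

# JSON has no integer keys: dumps() turns 725 into "725".
as_json = json.dumps(data)

# Restore the integer keys after loading the JSON back.
round_trip = {int(key): value for key, value in json.loads(as_json).items()}

print(data[726])
print(round_trip == data)
```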
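If you would still rather use json, you can try quoting the numeric keys with a regex. Note that json.loads also requires every string to be double-quoted, so this only works when the values use double quotes.
import json
import re
s = re.sub(r"(?m)^(\W*)(\d+)\b", r'\1"\2"', r.text)
data = json.loads(s)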
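Here is a small sketch of the regex route on a sample where every value is already double-quoted (an assumption, since the real file mixes quote styles). It also shows that a single-quoted value makes json.loads fail, which is why literal_eval is the safer choice for the original file.

```python
import json
import re

# Assumed sample: same shape as the question's file, but with
# double-quoted values only.
text = '{725: "pitcher, ewer",\n726: "plane, carpenter\'s plane, woodworking plane"}'

# Wrap the number at the start of each line in double quotes.
quoted = re.sub(r'(?m)^(\W*)(\d+)\b', r'\1"\2"', text)

# Keys come back as strings, so convert them to int if needed.
data = {int(key): value for key, value in json.loads(quoted).items()}
print(data[726])

# A single-quoted value is not valid JSON and raises an error.
try:
    json.loads("{\"725\": 'pitcher, ewer'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
print(single_quotes_ok)
```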

Related

Python requests returns incomprehensible content

I'm trying to parse a site without using Selenium, and requests can fetch the page. But something strange is happening: I can't extract the text I need with a regular expression. The text is there (you can see it with print(data.text)), but re doesn't find it. If I copy the output into Notepad++, it shows all of these characters as a single line.
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
print(data.text)
What is this, and how do I work with it? Pay attention to the line numbers.
You can try to use their Ajax API to load all usernames + thumb images:
import pandas as pd
import requests
url = 'https://ru.runetki3.com/tools/listing_v3.php?livetab=female&offset=0&limit={}'
headers = {'X-Requested-With': 'XMLHttpRequest'}
all_data = []
for p in range(1, 4): # <-- increase number of pages here
    data = requests.get(url.format(p * 144), headers=headers).json()
    for m in data['models']:
        all_data.append((m['username'], m['display_name'], m['thumb_image'].replace('{ext}', 'jpg')))
df = pd.DataFrame(all_data, columns=['username', 'display_name', 'thumb'])
print(df.head())
Prints:
    username    display_name  thumb
0   wetlilu     Little_Lilu   //i.bimbolive.com/live/034/263/131/xbig_lq/c30823.jpg
1   mellannie8  mellannieSEX  //i.bimbolive.com/live/034/24f/209/xbig_lq/314348.jpg
2   mokkoann    mokkoann      //i.bimbolive.com/live/034/270/279/xbig_lq/cb25cb.jpg
3   ogurezzi    CynEp-nuCbka  //i.bimbolive.com/live/034/269/02c/xbig_lq/3ebe2a.jpg
4   Pepetka22   _-Katya-_     //i.bimbolive.com/live/034/24f/36e/xbig_lq/18da8e.jpg
Avoid using . in a regex unless you really want to get any character; here, the usernames (as far as I can see) only contain - and alphanumeric characters, so you can retrieve them with:
re.findall(r'"username":"([\w-]+)"',data.text)
An even simpler way, which will remove the need to deal with special characters by getting all characters except " is:
re.findall(r'"username":"([^"]+)"',data.text)
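To see why the choice of character class matters, here is a rough comparison on a made-up fragment shaped like the data embedded in the page. The string page and the greedy pattern are assumptions for illustration, not the asker's actual input or regex.

```python
import re

# Hypothetical fragment imitating the JSON embedded in the page.
page = ('{"username":"wetlilu","display_name":"Little_Lilu"},'
        '{"username":"Pepetka22","display_name":"_-Katya-_"}')

# A greedy "." runs on to the last double quote in the text,
# so everything collapses into one oversized match.
greedy = re.findall(r'"username":"(.+)"', page)

# Restricting the allowed characters stops at the closing quote.
strict = re.findall(r'"username":"([\w-]+)"', page)

# "Anything except a double quote" works for any name.
negated = re.findall(r'"username":"([^"]+)"', page)

print(len(greedy))
print(strict)
print(negated)
```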
So here's a way of getting the info you seek (I joined them into a dictionary, but you can change that to whatever you prefer):
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
with open("return.txt", 'w', encoding='utf-8') as f:
    f.write(data.text)
names = re.findall(r'"username":"([^"]+)"',data.text)
disp_names = re.findall(r'"display_name":"([^"]+)"',data.text)
thumbs = re.findall(r'"thumb_image":"([^"]+)"',data.text)
names_dict = {name:[disp, thumb.replace('{ext}', 'jpg')] for name, disp, thumb in zip(names, disp_names, thumbs)}
Example
names_dict['JuliaCute']
# ['_Cute',
# '\\/\\/i.bimbolive.com\\/live\\/055\\/2b0\\/15d\\/xbig_lq\\/d89ef4.jpg']
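The backslashes in the thumbnail path appear because the regex captures the raw JSON text, where / is escaped as \/. One possible way to clean this up, sketched below, is to let the json module decode the captured fragment. This assumes the captured text contains no unescaped double quote and does not end in a backslash, which holds for this sample.

```python
import json

# Raw capture as shown in the example above (backslash-escaped slashes).
raw_thumb = r'\/\/i.bimbolive.com\/live\/055\/2b0\/15d\/xbig_lq\/d89ef4.jpg'

# Wrapping the fragment in quotes makes it a JSON string literal,
# so json.loads resolves the \/ escapes.
clean_thumb = json.loads('"' + raw_thumb + '"')

print(clean_thumb)  # //i.bimbolive.com/live/055/2b0/15d/xbig_lq/d89ef4.jpg
```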

How do I import an external text file from a website and use a list from inside it

I'm having trouble trying to make this work
import requests
import random
response = requests.get("https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
data = response.text
for line in data:
    print(line)
I am trying to pull a txt file from the internet, and be able to use the list inside of the text file.
Right now all it does is assume each letter is a different string(?)
response.text is a string, so looping over it yields one character at a time. (Read up on how Python handles strings.)
Python doesn't know what a "line" is here, so split the data on newlines and try again:
import requests
import random
response = requests.get("https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
data = response.text
for line in data.split("\n"):
    print(line)
The attribute response.text is a string, so iterating over it gives you individual characters. You can split the string on spaces (or on newlines) to get what you need (I also added a few print statements to show the steps):
import requests
response = requests.get(
    "https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
print('response.text type:', type(response.text))
print('response.text len:', len(response.text))
print(response.text)
print()
print('splitting by spaces:')
for i, s in enumerate(response.text.split()):
    print(i, s)
print()
print('splitting by newlines:')
for i, line in enumerate(response.text.split('\n')):
    print(i, line)
The code gives this output:
response.text type: <class 'str'>
response.text len: 21
a = ["please","work"]
splitting by spaces:
0 a
1 =
2 ["please","work"]
splitting by newlines:
0 a = ["please","work"]
@bruno suggested in a comment to use str.splitlines(); this works even if the response is bytes, since bytes.splitlines() also exists.
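Splitting gives you lines, but the question also asks how to use the list stored inside the file. A possible sketch, assuming every non-empty line has the form name = <Python literal> as in the output above, is to split each line on the first = and hand the right-hand side to ast.literal_eval. The helper parse_assignments is a hypothetical name, and text stands in for response.text.

```python
from ast import literal_eval

def parse_assignments(text):
    """Turn lines like 'a = ["x", "y"]' into a {name: value} dict."""
    result = {}
    for line in text.splitlines():
        name, sep, value = line.partition('=')
        if not sep:
            continue  # skip lines without an assignment
        # literal_eval only accepts literals, so no code is executed.
        result[name.strip()] = literal_eval(value.strip())
    return result

# Stand-in for response.text, copied from the output shown above.
text = 'a = ["please","work"]\n'
lists = parse_assignments(text)
print(lists['a'])  # ['please', 'work']
```

With the list in hand, random.choice(lists['a']) would pick one entry, which may be what the random import in the question was for.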

Extracting multiple nested JSON keys at a time

How do I extract more than one JSON key at a time with this script? It cycles through a list of message ids and fetches the JSON response for each one. I only want to extract certain keys from the response.
import urllib3
import json
import csv
from progressbar import ProgressBar
import time
pbar = ProgressBar()
base_url = 'https://api.pipedrive.com/v1/mailbox/mailMessages/'
fields = {"include_body": "1", "api_token": "token"}
json_arr = []
http = urllib3.PoolManager()
with open('ten.csv', newline='') as csvfile:
    for x in pbar(csv.reader(csvfile, delimiter=' ', quotechar='|')):
        r = http.request('GET', base_url + "".join(x), fields=fields)
        mails = json.loads(r.data.decode('utf-8'))
        json_arr.append(mails['data']['from'][0]['id'])
print(json_arr)
This works as intended. But I want to do the following.
json_arr.append(mails(['data']['from'][0]['id'],['data']['to'][0]['id'])
Which results in TypeError: list indices must be integers or slices, not str
Did you mean:
json_arr.append(mails['data']['from'][0]['id'])
json_arr.append(mails['data']['to'][0]['id'])
The answer already posted looks good but I'll share the one-liner equivalent, using extend() instead of append():
json_arr.extend([mails['data']['from'][0]['id'], mails['data']['to'][0]['id']])
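If the two ids should stay paired per message rather than end up in one flat list, another option is to append a tuple. The sketch below uses a made-up response dict that contains only the keys used in the question; the helper name sender_and_recipient is hypothetical.

```python
def sender_and_recipient(mails):
    """Return (from_id, to_id) for one decoded mail message."""
    data = mails['data']
    return data['from'][0]['id'], data['to'][0]['id']

# Hypothetical decoded response, limited to the keys from the question.
sample = {'data': {'from': [{'id': 11}], 'to': [{'id': 22}]}}

json_arr = []
json_arr.append(sender_and_recipient(sample))  # one tuple per message
print(json_arr)  # [(11, 22)]
```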

About file I/O in python

I want to read a txt file and store it as a list of strings. This is the approach I came up with myself, but it looks really clumsy. Is there a better way to do this? Thanks.
import re
import urllib2
import numpy as np
url = 'http://quant-econ.net/_downloads/graph1.txt'
response = urllib2.urlopen(url)
txt = response.read()
f = open('graph1.txt', 'w')
f.write(txt)
f.close()
f = open('graph1.txt', 'r')
nodes = f.readlines()
I tried the solutions provided below, but they all return something different from my original code.
This is the string produced by split():
'node0, node1 0.04, node8 11.11, node14 72.21'
This is what my code produces:
'node0, node1 0.04, node8 11.11, node14 72.21\n'
The problem is that without the '\n', processing the list of strings raises an IndexError:
"row = index[0] IndexError: list index out of range"
for node in nodes:
    index = re.findall(r'(?<=node)\w+', node)
    index = map(int, index)
    row = index[0]
    del index[0]
According to the documentation, response is already a file-like object, so you should be able to call response.readlines().
For those problems where you do need to create an intermediate file like this, though, you want to use io.StringIO.
Look at split. So:
nodes = response.read().split("\n")
EDIT: Alternatively if you want to avoid \r\n newlines, use splitlines.
nodes = response.read().splitlines()
Try:
url=('http://quant-econ.net/_downloads/graph1.txt')
response= urllib2.urlopen(url)
txt= response.read()
with open('graph1.txt','w') as f:
    f.write(txt)
nodes=txt.split("\n")
If you don't want the file, this should work:
url=('http://quant-econ.net/_downloads/graph1.txt')
response= urllib2.urlopen(url)
txt= response.read()
nodes=txt.split("\n")
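A likely cause of the IndexError mentioned in the question is that split("\n") leaves an empty string at the end when the text finishes with a newline, and re.findall returns an empty list for that entry. The Python 3 sketch below shows the difference on a made-up two-line sample (the second line is an assumption) and uses splitlines() to avoid the trailing empty item.

```python
import re

# Hypothetical stand-in for the downloaded text; note the trailing newline.
txt = "node0, node1 0.04, node8 11.11, node14 72.21\nnode1, node0 1.5\n"

nodes_split = txt.split("\n")   # keeps an empty string after the last newline
nodes_lines = txt.splitlines()  # no trailing empty string

for node in nodes_lines:
    # Build a real list so that indexing and slicing work in Python 3.
    index = [int(i) for i in re.findall(r"(?<=node)\w+", node)]
    row, targets = index[0], index[1:]
    print(row, targets)
```

Filtering with `if node` inside the loop would be another way to skip empty entries when split("\n") is kept.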

Manipulate string data

I'm new to python and trying to create a script to modify the output of a JS file to match what is required to send data to an API. The JS file is being read via urllib2.
def getPage():
    url = "http://url:port/min_day.js"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()
# JS Data
# m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"
# m[mi++]="19.12.12 09:25:00|1911;2060;3277;293;59"
# Required format for API
# addbatchstatus.jsp?data=20121219,09:25,3277.0,1911,-1,-1,59.0,293.0;20121219,09:30,3440.0,1964,-1,-1,60.0,293.0
As a breakdown (the required values are the date, the time, and every number except 2121):
m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"
I also need to insert the values -1,-1 into the string.
I've managed to get the date into the correct format and to replace characters and line breaks so that the output looks like the line below, but I have a feeling I'm heading down the wrong track if I also need to reorder the values. The entries also appear to be in reverse chronological order.
20121219,09:30:00,1964,2121,3440,293,60;20121219,09:25:00,1911,2060,3277,293,59
Any help would be greatly appreciated! I'm thinking along the lines of regex might be what I need.
Here's a Regex pattern to strip out the bits you don't want
m\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) (?P<time>[\d:]{8})\|(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+).+
and replace with (using Python's \g<name> syntax for named groups)
20\g<year>\g<month>\g<day>,\g<time>,\g<v3>,\g<v1>,-1,-1,\g<v5>,\g<v4>
This pattern assumes that the characters before the date are constant. You can replace m\[mi\+\+\]=" with [^\d]+ if you want more general handling of that bit.
So to put this in practice in python:
import re
import urllib2
def getPage():
    url = "http://url:port/min_day.js"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()
def repl(match):
    # Reorder the captured fields and insert the two constant -1 values
    return '20%s%s%s,%s,%s,%s,-1,-1,%s,%s'%(match.group('year'),
                                             match.group('month'),
                                             match.group('day'),
                                             match.group('time'),
                                             match.group('v3'),
                                             match.group('v1'),
                                             match.group('v5'),
                                             match.group('v4'))
pattern = re.compile(r'm\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) (?P<time>[\d:]{8})\|(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+).+')
# Skip blank lines so they don't end up as empty records
data = [re.sub(pattern, repl, line).split(',') for line in getPage().split('\n') if line.strip()]
# If you want to sort your data (oldest first, by date and then time)
data = sorted(data, key=lambda x: (x[0], x[1]))
# If you want to write your data back to a formatted string
new_string = ';'.join(','.join(x) for x in data)
# If you want to write it back to file
with open('new/file.txt', 'w') as f:
    f.write(new_string)
Hope that helps!
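For completeness, here is a Python 3 sketch that aims at the exact target string from the question: minutes without seconds, a ".0" suffix on three of the values, and oldest entry first. The function names to_api_record and build_batch are hypothetical, the ".0" suffixes are inferred from the required format shown in the question, and the sample is the two lines quoted there rather than a live download.

```python
import re

# Same idea as the pattern above, but the seconds are matched and dropped.
PATTERN = re.compile(
    r'm\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) '
    r'(?P<hour_min>\d{2}:\d{2}):\d{2}\|'
    r'(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+)'
)

TEMPLATE = "20{year}{month}{day},{hour_min},{v3}.0,{v1},-1,-1,{v5}.0,{v4}.0"

def to_api_record(line):
    """Convert one m[mi++]="..." line, or return None if it doesn't match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return TEMPLATE.format(**match.groupdict())

def build_batch(js_text):
    """Join all convertible lines, oldest first, separated by ';'."""
    records = (to_api_record(line) for line in js_text.splitlines())
    # Records start with YYYYMMDD,HH:MM, so a plain string sort is chronological.
    return ";".join(sorted(r for r in records if r is not None))

sample = ('m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"\n'
          'm[mi++]="19.12.12 09:25:00|1911;2060;3277;293;59"\n')
print(build_batch(sample))
```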
