Import raw data from web page as a dataframe - python

I'm trying to import some data from a webpage into a dataframe.
Data: a block of text in the following format
[{"ID":0,"Name":"John","Location":"Chicago","Created":"2017-04-23"}, ... ]
I am successfully making the request to the server and can return the data in text form, but cannot seem to convert this to a DataFrame.
E.g.:
r = requests.get(url)
people = r.text
print(people)
So from this point, I am a bit confused about how to structure this text as a DataFrame. Most tutorials online seem to demonstrate importing CSV, Excel or HTML files.

If people is a list of dicts in string form, you can use json.loads to convert it to an actual list of dicts and then create a DataFrame easily:
>>> import json
>>> import pandas as pd
>>> people='[{"ID":0,"Name":"John","Location":"Chicago","Created":"2017-04-23"}]'
>>> json.loads(people)
[{'ID': 0, 'Name': 'John', 'Location': 'Chicago', 'Created': '2017-04-23'}]
>>>
>>> data=json.loads(people)
>>> pd.DataFrame(data)
      Created  ID Location  Name
0  2017-04-23   0  Chicago  John
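Since the data came from requests, you can also skip the manual json.loads step: Response.json() parses the body for you, and pandas can parse the JSON string itself. A minimal sketch (the requests call is commented out because the URL is a placeholder):

```python
from io import StringIO
import pandas as pd

# With requests, the response can be parsed directly:
# df = pd.DataFrame(requests.get(url).json())  # url is a placeholder here

# pandas can also read the JSON string itself (wrapped in StringIO,
# since passing a literal string to read_json is deprecated in newer pandas):
people = '[{"ID":0,"Name":"John","Location":"Chicago","Created":"2017-04-23"}]'
df = pd.read_json(StringIO(people))
print(df)
```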

Related

ParserError: unable to convert txt file to df due to json format and delimiter being the same

I'm fairly new to dealing with .txt files that have a dictionary within them. I'm trying to use pd.read_csv to create a dataframe in pandas. I get thrown an error of Error tokenizing data. C error: Expected 4 fields in line 2, saw 11. I believe I found the root problem: the file is difficult to read because each row contains a dict whose key-value pairs are separated by commas, which in this case is also the delimiter.
Data (store.txt)
id,name,storeid,report
11,JohnSmith,3221-123-555,{"Source":"online","FileFormat":0,"Isonline":true,"comment":"NAN","itemtrack":"110", "info": {"haircolor":"black", "age":53}, "itemsboughtid":[],"stolenitem":[{"item":"candy","code":1},{"item":"candy","code":1}]}
35,BillyDan,3221-123-555,{"Source":"letter","FileFormat":0,"Isonline":false,"comment":"this is the best store, hands down and i will surely be back...","itemtrack":"110", "info": {"haircolor":"black", "age":21},"itemsboughtid":[1,42,465,5],"stolenitem":[{"item":"shoe","code":2}]}
64,NickWalker,3221-123-555, {"Source":"letter","FileFormat":0,"Isonline":false, "comment":"we need this area to be fixed, so much stuff is everywhere and i do not like this one bit at all, never again...","itemtrack":"110", "info": {"haircolor":"red", "age":22},"itemsboughtid":[1,2],"stolenitem":[{"item":"sweater","code":11},{"item":"mask","code":221},{"item":"jack,jill","code":001}]}
How would I read this csv file and create new columns based on the key-values? In addition, what if there are more key-value pairs in other data... for example, more than 11 keys within the dictionary.
Is there an efficient way to create a df from the example above?
My code when trying to read as csv:
df = pd.read_csv('store.txt', header=None)
I tried to import json and use a converter, but it did not work and converted all the commas to a |:
import json
df = pd.read_csv('store.txt', converters={'report': json.loads}, header=0, sep="|")
I also tried to use:
import pandas as pd
import json
df=pd.read_csv('store.txt', converters={'report':json.loads}, header=0, quotechar="'")
I also was thinking to add a quote at the begining of the dictionary and at the end to make it a string but thought that was too tedious to find the closing brackets.
I think adding quotes around the dictionaries is the right approach. You can use regex to do so and use a different quote character than " (I used § in my example):
from io import StringIO
import re
import json
import pandas as pd

with open("store.txt", "r") as f:
    csv_content = re.sub(r"(\{.*})", r"§\1§", f.read())

df = pd.read_csv(StringIO(csv_content), skipinitialspace=True, quotechar="§", engine="python")
df_out = pd.concat([
    df[["id", "name", "storeid"]],
    pd.DataFrame(df["report"].apply(json.loads).values.tolist())
], axis=1)
print(df_out)
Note: the very last value in your csv isn't valid JSON: "code":001. It should be either "code":"001" or "code":1.
Output:
id name storeid Source ... itemtrack info itemsboughtid stolenitem
0 11 JohnSmith 3221-123-555 online ... 110 {'haircolor': 'black', 'age': 53} [] [{'item': 'candy', 'code': 1}, {'item': 'candy...
1 35 BillyDan 3221-123-555 letter ... 110 {'haircolor': 'black', 'age': 21} [1, 42, 465, 5] [{'item': 'shoe', 'code': 2}]
2 64 NickWalker 3221-123-555 letter ... 110 {'haircolor': 'red', 'age': 22} [1, 2] [{'item': 'sweater', 'code': 11}, {'item': 'ma...
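If you also want to expand the nested info dicts into their own columns, pd.json_normalize (available since pandas 1.0) flattens them into dotted column names. A small sketch on stand-in report dicts shaped like the ones above (only a subset of keys shown):

```python
import pandas as pd

# Stand-in for the parsed "report" dicts from the rows above
reports = [
    {"Source": "online", "info": {"haircolor": "black", "age": 53}},
    {"Source": "letter", "info": {"haircolor": "red", "age": 22}},
]

# Nested dicts become dotted columns such as info.haircolor and info.age
flat = pd.json_normalize(reports)
print(flat.columns.tolist())
```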

Convert string from database to dataframe

My database has a column where all the cells contain a string of data. There are around 15-20 variables, where the information is assigned to the variables with an "=" and separated by a space. The number and names of the variables can differ between cells... The issue I face is that the values are separated by spaces, and so are some of the variable names. The variable names appear in every cell, so I can't just make the headers and add the values to the data frame like a csv. The solution also needs to be able to do this process automatically for all the new data in the database.
Example:
Cell 1: TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520"... RELEASED="1880".
Cell 2: TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655"... MAIN CHARACTER="Ishmael".
I want to convert these strings of data into a structured dataframe like:
TITLE               AUTHOR             PAGES  RELEASED  MAIN CHARACTER
Brothers Karamazov  Fyodor Dostoevsky  520    1880      NaN
Moby Dick           Herman Melville    655    NaN       Ishmael
Any tips on how to move forwards? I have thought about converting it into a JSON format using the replace() function before turning it into a dataframe, but have not yet succeeded. Any tips or ideas are much appreciated.
Thanks,
I guess this sample is what you need.
import pandas as pd

# Helper function: parse a 'KEY="value" ...' string into a dict
def str_to_dict(cell) -> dict:
    normalized_cell = cell.replace('" ', '\n').replace('"', '').split('\n')
    temp = {}
    for x in normalized_cell:
        key, value = x.split('=')
        temp[key] = value
    return temp

list_of_cell = [
    'TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520" RELEASED="1880"',
    'TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655" MAIN CHARACTER="Ishmael"'
]
dataset = [str_to_dict(i) for i in list_of_cell]
print(dataset)
"""
[{'TITLE': 'Brothers Karamazov', 'AUTHOR': 'Fyodor Dostoevsky', 'PAGES': '520', 'RELEASED': '1880'}, {'TITLE': 'Moby Dick', 'AUTHOR': 'Herman Melville', 'PAGES': '655', 'MAIN CHARACTER': 'Ishmael'}]
"""
df = pd.DataFrame(dataset)
df.head()
"""
                TITLE             AUTHOR PAGES RELEASED MAIN CHARACTER
0  Brothers Karamazov  Fyodor Dostoevsky   520     1880            NaN
1           Moby Dick    Herman Melville   655      NaN        Ishmael
"""
The pandas library can read them from a .csv file and make a data frame - try this:
import pandas as pd
file = 'xx.csv'
data = pd.read_csv(file)
print(data)
Create a Python dictionary from your database rows.
Then create Pandas dataframe using the function: pandas.DataFrame.from_dict
Something like this:
import pandas as pd

# Assumed data from DB, structured like this
data = [
    {
        'TITLE': 'Brothers Karamazov',
        'AUTHOR': 'Fyodor Dostoevsky'
    },
    {
        'TITLE': 'Moby Dick',
        'AUTHOR': 'Herman Melville'
    }
]

# Dataframe as per your requirements
dt = pd.DataFrame.from_dict(data)

Identifying partial character encoding/compression in text content

I have a CSV (extracted from BZ2) where only some values are encoded:
hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0
The |, 0 and 1 characters are definitely appearing as intended but the other values are clearly encoded. In fact, they look like text-compression replacements which could mean the CSV had its values compressed and then also compressed as a whole to BZ2.
I get the same results whether extracting the BZ2 with 7zip then opening the CSV in a text editor, or opening with Python bz2 module, or with Pandas and read_csv:
import bz2
with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()

import pandas as pd
contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")
How can I identify which type of encoding type to decode with?
Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main
Source file: test-balanced.csv.bz2
First 100 lines from extracted CSV: https://pastebin.com/mgW8hKdh
I asked the original authors of the CSV/dataset but they didn't respond which is understandable.
From readme.txt:
File Guide:
raw/key.csv: column key for raw/sarc.csv
raw/sarc.csv: contains sarcastic and non-sarcastic comments of authors in authors.json
*/comments.json: dictionary in JSON format containing text and metadata for each comment in {comment_id: data} format
*/*.csv: CSV where each row contains a sequence of comments following a post, a set of responses to the last comment in that sequence, and sarcastic/non-sarcastic labels for those responses. The format is post_id comment_id … comment_id|response_id … response_id|label … label where *_id is a key to */comments.json and label 1 indicates the respective response_id maps to a sarcastic response. Thus each row has three entries (comment chain, responses, labels) delimited by '|', and each of these entries has elements delimited by spaces. The first entry always contains a post_id and 0 or more comment_ids. The second and third entries have the same number of elements, with the first response_id corresponding to the first label and so on.
Converting above to a Python code snippet:
import pandas as pd
import json
from pprint import pprint

file_csv = r"D:\bat\SO\71596864\test-balanced.csv"
data_csv = pd.read_csv(file_csv,
                       sep='|',
                       names=['posts', 'responses', 'labels'],
                       encoding='utf-8')

file_json = r"D:\bat\SO\71596864\comments.json"
with open(file_json, mode='r', encoding='utf-8') as f:
    data_json = json.load(f)

print(f'{chr(0x20)*30} First csv line decoded:')
for post_id in data_csv['posts'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} post_id: {post_id}')
    pprint(data_json[post_id])
for response_id in data_csv['responses'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} response_id: {response_id}')
    pprint(data_json[response_id])
Note that files were (manually) downloaded from the pol directory for their acceptable size (pol: contains subset of main dataset corresponding to comments in /r/politics).
Result: D:\bat\SO\71596864.py
First csv line decoded:
post_id: hqa1x
{'author': 'joshlamb619',
'created_utc': 1307053256,
'date': '2011-06',
'downs': 359,
'score': 274,
'subreddit': 'politics',
'text': 'Wisconsin GOP caught red handed, looking to run fake Democratic '
'candidates during recall elections.',
'ups': 633}
response_id: c1xiujs
{'author': 'Artisane',
'created_utc': 1307077221,
'date': '2011-06',
'downs': 0,
'score': -2,
'subreddit': 'politics',
'text': "And we're upset since the Democrats would *never* try something as "
'sneaky as this, right?',
'ups': -2}
response_id: c1xj4e2
{'author': 'stellarfury',
'created_utc': 1307080843,
'date': '2011-06',
'downs': 0,
'score': -2,
'subreddit': 'politics',
'text': "Oooh baby you caught me red handed Creepin' on the senate floor "
"Picture this we were makin' up candidates Being huge election whores",
'ups': -2}

Convert a dataframe column into a list of object

I am using pandas to read a CSV which contains a phone_number field (string); however, I need to convert this field into the JSON format below
[{'phone_number':'+01 373643222'}] and put it under a new column called phone_numbers. How can I do that?
I searched online, but the examples I found convert all the columns into JSON by using to_json(), which apparently cannot solve my case.
Below is an example
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
'phone_number': ['+1 569-483-2388', '+1 555-555-1212', '+1 432-867-5309']})
Use the map function like this:
df["phone_numbers"] = df["phone_number"].map(lambda x: [{"phone_number": x}] )
display(df)
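A list comprehension is an equivalent spelling of the same transformation; a quick self-contained check using the example frame:

```python
import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
                   'phone_number': ['+1 569-483-2388', '+1 555-555-1212', '+1 432-867-5309']})

# Same result as .map: wrap each number in a one-element list of dicts
df["phone_numbers"] = [[{"phone_number": x}] for x in df["phone_number"]]
print(df.loc[0, "phone_numbers"])  # [{'phone_number': '+1 569-483-2388'}]
```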

Looping multiple values into dictionary

I would like to expand on a previously asked question:
Nested For Loop with Unequal Entities
In that question, I requested a method to extract the location's type (Hospital, Urgent Care, etc) in addition to the location's name (WELLSTAR ATLANTA MEDICAL CENTER, WELLSTAR ATLANTA MEDICAL CENTER SOUTH, etc).
The answer suggested utilizing a for loop and dictionary to collect the values and keys. The code snippet appears below:
from pprint import pprint
import requests
from bs4 import BeautifulSoup
url = "https://www.wellstar.org/locations/pages/default.aspx"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
d = {}
for row in soup.select(".WS_Content > .WS_LeftContent > table > tr"):
    title = row.h3.get_text(strip=True)
    d[title] = [item.get_text(strip=True) for item in row.select(".PurpleBackgroundHeading a")]
pprint(d)
I would like to extend the existing solution to include the entity's address matched with the appropriate key-value combination. If the best solution is to utilize something other than a dictionary, I'm open to that suggestion as well.
Let's say you have a dict my_dict and you want to add 2 with my_key as key. Simply do:
my_dict['my_key'] = 2
Let's say you have a dict d = {'Name': 'Zara', 'Age': 7} and now you want to add another value
'Sex' = 'female'
You can use the built-in update method.
d.update({'Sex': 'female'})
print("Value : %s" % d)
Value : {'Name': 'Zara', 'Age': 7, 'Sex': 'female'}
ref is https://www.tutorialspoint.com/python/dictionary_update.htm
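Applying that to the scraping question: instead of a list of names per location type, store (name, address) tuples under each key. The CSS selector for addresses depends on the page's actual markup, so this sketch just pairs two plain lists (the addresses are placeholders, not real data):

```python
from pprint import pprint

# Names as scraped per row; addresses would come from a second row.select(...)
# call with the appropriate (site-specific) CSS selector
names = ["WELLSTAR ATLANTA MEDICAL CENTER", "WELLSTAR ATLANTA MEDICAL CENTER SOUTH"]
addrs = ["123 Example St", "456 Sample Ave"]  # placeholder addresses

d = {}
d["Hospital"] = list(zip(names, addrs))  # one (name, address) tuple per entity
pprint(d)
```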
