Parsing a json file to pandas dataframe - python

I need to parse some JSON files into a pandas dataframe. I want one column with the words present in the text, and another column with the corresponding entity: the entity is the "Type" from the JSON below when the "Value" matches the word; otherwise I want to assign the label 'O'.
Below is an example.
This is the JSON file:
{"Text": "I currently use a Netgear Nighthawk AC1900. I find it reliable.",
"Entities": [
{
"Type": "ORGANIZATION ",
"Value": "Netgear"
},
{
"Type": "DEVICE ",
"Value": "Nighthawk AC1900"
}]
}
Here is what I want to get:
WORD TAG
I O
currently O
use O
a O
Netgear ORGANIZATION
Nighthawk AC1900 DEVICE
. O
I O
find O
it O
reliable O
. O
Can someone help me with the parsing? I can't simply use split() because sometimes a value consists of two words. Hope this is clear. Thank you!

This is a difficult problem, and the answer depends on data not shown in this example and on the exact output required. Do you have repeating data in the entity values? Is order important? Do you want repetition in the output?
There are a few tools that can be used:
Make a trie out of the Entity values before you search the string. This is useful if you have overlapping versions of the same name, like "Netgear" and "Netgear INC.", and you want to match the longest version (see the sketch below).
nltk.PunktSentenceTokenizer. This one is finicky to work with when it comes to multi-word nouns; a dedicated tutorial does a better job of explaining how to deal with them.
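To make the longest-match idea concrete, here is a minimal sketch (length-sorted matching rather than a real trie; the filename netgear.json is an assumption, borrowed from the answer below). It consumes entity values longest-first and tags every remaining word or punctuation mark 'O', which reproduces the WORD/TAG table from the question:
import json
import re
import pandas as pd

with open('netgear.json') as jfile:
    info = json.load(jfile)

text = info['Text']
# Sort entities by value length, longest first, so "Nighthawk AC1900"
# wins over any shorter overlapping value.
entities = sorted(info['Entities'], key=lambda e: len(e['Value']), reverse=True)

rows = []
pos = 0
while pos < len(text):
    for ent in entities:
        value = ent['Value'].strip()
        if text.startswith(value, pos):
            rows.append((value, ent['Type'].strip()))
            pos += len(value)
            break
    else:
        # No entity starts here: consume one word or punctuation mark
        match = re.match(r"\w+|\S", text[pos:])
        if match:
            rows.append((match.group(), 'O'))
            pos += match.end()
        else:
            pos += 1  # skip whitespace

df = pd.DataFrame(rows, columns=['WORD', 'TAG'])
print(df)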

I don't know if what you need is strictly what you posted as the desired output.
The solution I am giving you is "dirty" (it has extra rows, and the TAG column comes first in the entities frame).
You can clean it up and put it into the format you need. As you didn't provide a piece of code to start from, you can finish it yourself.
Eventually you will find out that the purpose of Stack Overflow is not to get people to write the code for you, but to get help with the code you are trying to write.
import json
import pandas as pd

# Open and read the JSON file
with open('netgear.json', 'r') as jfile:
    data = jfile.read()
info = json.loads(data)

# Pull the two parts out of the JSON content
words, tags = info['Text'].split(), info['Entities']

# List to handle the Entities
prelist = []
for i in tags:
    j = list(i.values())
    # ['ORGANIZATION', 'Netgear']
    # ['DEVICE', 'Nighthawk AC1900']
    prelist.append(j)

# DataFrames to be merged
dft = pd.DataFrame(prelist, columns=['TAG', 'WORD'])
dfw = pd.DataFrame(words, columns=['WORD'])

# Combine the DataFrames and turn NaN into 0
df = dfw.merge(dft, on='WORD', how='outer').fillna(0)
This is the output:
WORD TAG
0 I 0
1 I 0
2 currently 0
3 use 0
4 a 0
5 Netgear ORGANIZATION
6 Nighthawk 0
7 AC1900. 0
8 find 0
9 it 0
10 reliable. 0
11 Nighthawk AC1900 DEVICE
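For the cosmetic part of the cleanup the answer mentions, a minimal sketch continuing from the df above: put WORD first and replace the 0 fill value with the 'O' label. Note this does not fix the deeper issues (the duplicated "I" row from the outer merge, and punctuation such as "AC1900." stuck to words by split()), for which a longest-match walk over the text, as sketched in the previous answer, is a better fit.
# Reorder the columns and use the 'O' label instead of 0
df = df[['WORD', 'TAG']].replace(0, 'O')
print(df)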

Related

How to convert specific conversational files into columns and save them in Python

This is my text file. I want to convert it into columns, such as speaker and comments, and save it as CSV. I have a huge list, so computing it will be helpful.
>bernardo11_5
Have you had quiet guard?
>francisco11_5
Not a mouse stirring.
>bernardo11_6
Well, good night.
If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.
>francisco11_6
I think I hear them.--Stand, ho! Who is there?
>horatio11_1
Friends to this ground.
>marcellus11_1
And liegemen to the Dane.
Something like this?
import re
from pathlib import Path
import pandas as pd

# Read the whole file into one string
text = Path('input.txt').read_text()
# Speaker ids are on the lines that start with '>'
speaker = re.findall(">(.*)", text)
# The text between speaker lines belongs to the preceding speaker
comments = re.split(">.*", text)
comments = [c.strip() for c in comments if c.strip()]
df = pd.DataFrame({'speaker': speaker, 'comments': comments})
This will give you full comments including newline characters.
For saving:
a) replace '\n' before calling to_csv()
df.comments = df.comments.str.replace('\n', '\\n')
b) save to a more suitable format, e.g., to_parquet()
c) split each multi-line comment into multiple rows
df.comments = df.comments.str.split('\n')
df = df.explode('comments')
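Putting option (c) together with the save step (result.csv is an assumed output name):
# Finally, write the exploded frame to CSV, as the question asks
df.to_csv('result.csv', index=False)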
One way is to parse and load manually.
Read the file:
import pandas as pd

with open("test.txt") as fp:
    data = fp.readlines()
Remove empty lines:
data = [x for x in data if x != "\n"]
Separate into speaker and comments:
speaker = []
comments = []
speaker_text = ""
for value in data:
    if ">" in value:
        speaker_text = value
    else:
        speaker.append(speaker_text)
        comments.append(value)
Convert to a dataframe:
df = pd.DataFrame({
    "speaker": speaker,
    "comments": comments
})
Save as CSV:
df.to_csv("result.csv", index=False)
Output:
speaker comments
0 >bernardo11_5\n Have you had quiet guard?\n
1 >francisco11_5\n Not a mouse stirring.\n
2 >bernardo11_6\n Well, good night.\n
3 >bernardo11_6\n If you do meet Horatio and Marcellus,\n
4 >bernardo11_6\n The rivals of my watch, bid them make haste.\n
5 >francisco11_6\n I think I hear them.--Stand, ho! Who is there?\n
6 >horatio11_1\n Friends to this ground.\n
7 >marcellus11_1\n And liegemen to the Dane.\n

Create a dataframe using specific strings in a column from a parent dataframe

I am facing the following problem. My dataframe is as follows,
I want to create 3 datasets from this dataframe:
The response column stays, and I need the context with the first string, so Tweet1, Tweet3, Tweet6, Tweet7 and Tweet11.
The response column stays, and I need the context with the first and second strings, so Tweet1, Tweet2, Tweet3, Tweet4, Tweet6, Tweet7, Tweet8, Tweet11 and Tweet12.
The response column stays, and I need the context with the first, second and third strings, so Tweet1, Tweet2, Tweet3, Tweet4, Tweet5, Tweet6, Tweet7, Tweet8, Tweet9, Tweet11 and Tweet12.
All the tweets in the context column are in a list as shown above, and they are separated by commas.
I appreciate your response and comments.
Based on your new information, I will now mimic reading the JSON file like this:
import pandas as pd
from io import StringIO
file_as_str="""
[
{"label":1, "response" : "resp_exmaple1", "context": ["tweet1,with comma", "tweet2"]},
{"label":0, "response" : "resp_exmaple2", "context": ["tweet3", "tweet4", "tweet5"]},
{"label":1, "response" : "resp_exmaple3", "context": ["tweet6, with comma"]},
{"label":1, "response" : "resp_exmaple4", "context": ["tweet7", "Tweet8", "Tweet9", "Tweet10"]},
{"label":0, "response" : "resp_exmaple5", "context": ["tweet11", "Tweet12"]}
]
"""
tweets_json = StringIO(file_as_str)
The above string is only there to mimic reading from a file, which you would do like this:
tweets = pd.read_json(tweets_json, orient='records')
If the structure is indeed like my example, you should give orient='records'; if it is different, you may need to pick another scheme. The dataframe now looks like:
label response context
0 1 resp_example1 [tweet1,with comma, tweet2]
1 0 resp_example2 [tweet3, tweet4, tweet5]
2 1 resp_example3 [tweet6, with comma]
3 1 resp_example4 [tweet7, Tweet8, Tweet9, Tweet10]
4 0 resp_example5 [tweet11, Tweet12]
The difference is that the context column now contains lists of strings, so the commas don't matter. Now you can easily select a maximum number of tweets like this:
context = tweets["context"]
max_tweets = 2
new_context = list()
for tweet_list in context:
    # Take at most max_tweets tweets from each context list
    n_selection = min(len(tweet_list), max_tweets)
    tweets_selection = tweet_list[:n_selection]
    new_context.append(tweets_selection)
tweets["context"] = new_context
The result looks like:
label response context
0 1 resp_example1 [tweet1,with comma, tweet2]
1 0 resp_example2 [tweet3, tweet4]
2 1 resp_example3 [tweet6, with comma]
3 1 resp_example4 [tweet7, Tweet8]
4 0 resp_example5 [tweet11, Tweet12]
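As a side note, the same truncation can be written more compactly; this one-liner is an equivalent sketch, not part of the original answer:
# Keep at most max_tweets tweets per context list
tweets["context"] = tweets["context"].apply(lambda lst: lst[:max_tweets])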

Python: categorize data in Excel based on keywords from another Excel sheet

I have two Excel sheets; one has four different categories with keywords listed. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas dataframes to compare them, but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way, but I am new to pandas.
Here is an example:
Category sheet
Service     Experience
fast        bad
slow        easy
Data sheet
Review #    Location    Review
1           New York    "The service was fast!"
2           Texas       "Overall it was a bad experience for me"
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code; note I am using a simple example. In the example below I am trying to find the review data that would match the Customer Service list of keywords.
import pandas as pd
# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frame for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
    if cs in reviews:
        print("True")
One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then, to get matches, use str.extractall, aggregate into a summary, and join to add it back to the reviews frame:
Aggregated into List:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! [fast] [easy]
1 2 Texas Overall it was a bad experience for me [] [bad]
Aggregated into String:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! fast easy
1 2 Texas Overall it was a bad experience for me bad
Alternatively, for an existence test, aggregate with any over level=0:
reviews = reviews.join(
    # .notna().groupby(level=0).any() is the current spelling of the old .any(level=0)
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
Or iterate over the columns with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
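Putting it together, here is a minimal self-contained sketch of the str.contains variant; the inline frames are stand-ins for Categories_List.xlsx and Data.xlsx, with contents assumed from the question:
import pandas as pd

# Stand-ins for the two Excel sheets from the question
cat = pd.DataFrame({'Service': ['fast', 'slow'],
                    'Experience': ['bad', 'easy']})
reviews = pd.DataFrame({
    'Review #': [1, 2],
    'Location': ['New York', 'Texas'],
    'Review': ['The service was fast!',
               'Overall it was a bad experience for me'],
})

# One True/False column per category
for col in cat.columns:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))

print(reviews)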

Python Sorting and Organising

I'm trying to sort data from a file and not quite getting what I need. I have a text file with race details (name and placement, i.e. 1, 2, 3). I would like to organize the data by highest placement first, and also alphabetically by name. I can do this if I split the lines, but then the names and scores no longer match up.
Any help and suggestions would be very welcome; I've hit that proverbial wall.
My apologies (first-time user of this site, and a Python noob on a steep learning curve). Thank you for your suggestions; I really do appreciate the help.
comp = []
results = open('d:\\test.txt', 'r')
for line in results:
    line = line.split()
    # (name, score) = line.split()
    comp.append(line)
sorted(comp)  # note: sorted() returns a new list; this result is discarded
results.close()
print(comp)
Test file was in this format:
Jones 2
Ranfel 7
Peterson 5
Smith 1
Simons 9
Roberts 4
McDonald 3
Rogers 6
Elliks 8
Helm 10
I completely agree with everyone who has down-voted this question for being badly posed. However, I'm in a good mood so I'll try and at least steer you in the right direction:
Let's assume your text file looks like this:
Name,Placement
D,1
D,2
C,1
C,3
B,1
B,3
A,1
A,4
I suggest importing the data and sorting it using Pandas http://pandas.pydata.org/
import pandas as pd
# Read in the data
# Replace <FULL_PATH_OF_FILE> with something like C:/Data/RaceDetails.csv
# The first row is automatically used for column names
data = pd.read_csv("<FULL_PATH_OF_FILE>")
# Sort the data (sort_values replaces the long-deprecated DataFrame.sort)
sorted_data = data.sort_values(['Placement', 'Name'])
# Create a re-indexed dataframe if you so desire
sorted_data_new_index = sorted_data.reset_index(drop=True)
This gives me:
Name Placement
A 1
B 1
C 1
D 1
D 2
B 3
C 3
A 4
I'll leave you to figure out the rest..
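Since the question's actual test.txt is space-separated with no header row, a hedged adaptation of the same idea might look like this (the column names Name and Placement are my own choice):
import pandas as pd

# No header row in the file; supply column names and split on whitespace
data = pd.read_csv('d:/test.txt', sep=r'\s+', names=['Name', 'Placement'])
sorted_data = data.sort_values(['Placement', 'Name']).reset_index(drop=True)
print(sorted_data)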
As @Jack said, I am very limited in how I can help if you don't post code or the txt file. However, I've run into a similar problem before, so I know the basics (again, I will need code/files before I can give an exact type-this-stuff answer!).
You can either develop an algorithm yourself or use the built-in sorted() function.
Put the names and scores in a list (or dictionary) such as:
name_scores = [['Matt', 95], ['Bob', 50], ['Ashley', 100]]
and then call sorted(name_scores) and it will sort by names: [['Ashley', 100], ['Bob', 50], ['Matt', 95]]
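If you want to sort by score rather than name, sorted() accepts a key function; a small sketch using the same example list:
name_scores = [['Matt', 95], ['Bob', 50], ['Ashley', 100]]
# Sort by the score (the second element), highest first
by_score = sorted(name_scores, key=lambda pair: pair[1], reverse=True)
print(by_score)  # [['Ashley', 100], ['Matt', 95], ['Bob', 50]]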

Subsetting data using R or Python

I want to subset the following data set. Specifically, I only want to retrieve 1) Id, 2) ASIN, 3) group, 4) salesrank, and 5) categories, in CSV format. I am going to use R or Python.
(R frequently can't read this kind of irregular data format.)
The following data doesn't have the usual format, so I don't know how to subset it. I have two years' experience in R, but mostly use the tool for statistical purposes, so I am not used to this kind of data manipulation with an unusual format. If anyone can give me the answer (or a clue), that would be great.
At the bottom is one set of the data, consisting of "key: value" pairs. The final result should look like this:
Id ASIN group salesrank categories
1 0827229534 Book 396585 2
The original data looks like:
************************************************************************************************
Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
reviews: total: 2 downloaded: 2 avg rating: 5
2000-7-28 cutomer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful: 9
2003-12-14 cutomer: A2VE83MZF98ITY rating: 5 votes: 6 helpful: 5
You could try this in R by:
reading the file using readLines;
creating a pattern with paste0 to subset the lines using grep;
splitting "lines1" into a list whose elements are the prefix groups (before the split, I removed the LHS and RHS of ":" using sub);
cbinding the list elements using do.call(cbind, ...) and converting the result to a data.frame.
This will return columns of class character; it is not clear which ones should be character and which numeric.
NOTE: I created two records just to reproduce the problem.
library(stringr)

lines <- readLines('file.txt')
pat <- paste0(c('Id', 'ASIN', 'group', 'salesrank', 'categories'),
              ':', collapse='|')
lines1 <- lines[grep(pat, lines)]
# Strip everything up to ":" for the values, and after ":" for the keys
val <- str_trim(sub(".*:", "", lines1))
Grp <- sub(":.*", '', lines1)
res <- do.call(cbind, split(val, Grp))
res1 <- as.data.frame(res, stringsAsFactors=FALSE)
res1
# ASIN categories group Id salesrank
#1 0827229534 2 Book 1 396585
#2 0827529534 3 Book2 2 396587
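Since the question allows Python as well, here is a hedged Python sketch of the same extraction, assuming one record per block separated by the asterisk lines shown above (file.txt mirrors the R example's assumed filename):
import re
import pandas as pd

fields = ['Id', 'ASIN', 'group', 'salesrank', 'categories']
records, current = [], {}

with open('file.txt') as fp:
    for line in fp:
        # A line of asterisks separates records
        if line.startswith('***'):
            if current:
                records.append(current)
            current = {}
            continue
        match = re.match(r'(\w+):\s*(.*)', line.strip())
        if match and match.group(1) in fields:
            current[match.group(1)] = match.group(2)
if current:
    records.append(current)

df = pd.DataFrame(records, columns=fields)
print(df)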
