how to define sample in a natural language processing model - python

for doc in sample['documents']:
The error says that 'sample' is undefined (I was trying to reproduce a natural language processing model).

I think you are looking for a way to display natural language processing, and I think this will be helpful to you. I have put the link below, so please check it out:
https://www.tableau.com/learn/articles/natural-language-processing-examples

In this case, the problem is the way you are reading the input. No big deal, no worries!
In the loop:
for doc in sample['documents']:
sample is the input DataFrame (or a dictionary), and 'documents' is the name of a column (or a key).
Suppose I have an input CSV like the following:
documents,label
Being offensive isnt illegal you idiot, negative
Loving the first days of summer! <3, positive
I hate when people put lol when we are having a serious talk ., negative
In Python you would read the CSV into a pandas DataFrame, for example:
import pandas as pd
sample = pd.read_csv('inputdata.csv', header=0)
Here sample['documents'] is the first column of the input file. header=0 means that the column labels are given on the first line of the CSV.
for doc in sample['documents'] will then iterate over the first column, like this:
Being offensive isnt illegal you idiot
Loving the first days of summer! <3
I hate when people put lol when we are having a serious talk
This means that the origin of your error is probably that your input data is stored under some name other than sample (or is never assigned at all), or that the header of the input CSV is not being read.
If the CSV has no header row, you can specify the column names like this:
columns = ['documents', 'labels']
sample = pd.read_csv('inputdata.csv', header=None, names=columns)
sample
(If the file does have a header row but with different names, pass header=0 together with names so that the existing header is replaced instead of being read as data.)
Hope it helps!
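As a self-contained sketch of the steps above: the snippet below reads the example CSV from an in-memory string instead of inputdata.csv (the io.StringIO stand-in is an assumption, used only so the example runs without a file) and then loops over the documents column. The point is that sample has to be assigned before the loop runs.

```python
import io
import pandas as pd

# Stand-in for inputdata.csv (contents taken from the example above).
csv_text = """documents,label
Being offensive isnt illegal you idiot,negative
Loving the first days of summer! <3,positive
I hate when people put lol when we are having a serious talk .,negative
"""

# sample must be assigned before the loop; otherwise Python reports it as undefined.
sample = pd.read_csv(io.StringIO(csv_text), header=0)

docs = []
for doc in sample['documents']:
    docs.append(doc)
    print(doc)
```

With a real file, io.StringIO(csv_text) would simply be replaced by the path to the CSV.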

Related

How would I be able to remove this part of the variable?

I am writing a guessing game. The data for the game is in a CSV file, so I decided to use pandas. I have tried to use pandas to import my CSV file, pick a random row, and put the data into variables so I can use it in the rest of the code, but I can't figure out how to format the data in the variables correctly.
I've tried to split the string with split() but I am quite lost.
import pandas
ar = pandas.read_csv('names.csv')
ar.columns = ["Song Name","Artist","Intials"]
randomsong = ar.sample(1)
songartist = randomsong["Artist"]
songname = randomsong["Song Name"]
songintials = randomsong["Intials"]
print(songname)
My CSV file looks like this.
Song Name,Artist,Intials
Someone you loved,Lewis Capaldi,SYL
Bad Guy,Billie Eilish,BG
Ransom,Lil Tecca,R
Wow,Post Malone, W
I expect the output to be the name of the song from the csv file. For Example
Bad Guy
Instead the output is
1 Bad Guy
Name: Song Name, dtype:object
If anyone knows the solution please let me know. Thanks
You're getting a Series object as output. You can try
randomsong["Song Name"].to_string(index=False)
(index=False leaves out the row index, which a plain to_string() would still print.)
Use df['column'].values to get the values of a column.
In your case, songartist = randomsong["Artist"].values[0], because you want only the first element of the returned array.
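A minimal sketch of the same idea, assuming the CSV rows from the question (read here from an in-memory string rather than names.csv so that it runs on its own). It uses iloc[0] to turn the one-row DataFrame into a single row, which is an alternative to .values[0] rather than what the answers above used.

```python
import io
import pandas as pd

# Stand-in for names.csv, using the rows from the question.
csv_text = """Song Name,Artist,Intials
Someone you loved,Lewis Capaldi,SYL
Bad Guy,Billie Eilish,BG
Ransom,Lil Tecca,R
Wow,Post Malone,W
"""
ar = pd.read_csv(io.StringIO(csv_text))

# sample(1) returns a one-row DataFrame; iloc[0] turns it into a single row
# (a Series indexed by column name), so each lookup gives a plain value.
row = ar.sample(1).iloc[0]
songname = row["Song Name"]
songartist = row["Artist"]
songintials = row["Intials"]

print(songname)
```

Because all three values come from the same row object, the song, artist and initials are guaranteed to belong together.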

Reading a dictionary from within a dictionary

I have a json file for tweet data. The data that I want to look at is the text of the tweet. For some reason, some of the tweets are too long to put into the normal text part of the dictionary.
It seems like there is a dictionary within another dictionary and I can't figure out how to access it very well.
Basically, what I want in the end is one column of a data frame that will have all of the text from each individual tweet. Here is a link to a small sample of the data that contains a problem tweet.
Here is the code I have so far:
import json
import pandas as pd
tweets = []
# This reads the JSON file so that I can work with it. This part works correctly.
with open("filelocation.txt") as source:
    for line in source:
        if line.strip():
            tweets.append(json.loads(line))
print(len(tweets))
df = pd.DataFrame.from_dict(tweets)
df.info()
When looking at the info you can see that there will be a column called extended_tweet that only encompasses one of the two sample tweets. Within this column, there seems to be another dictionary with one of those keys being full_text.
I want to add another column to the dataframe that just has this information along with the normal text column when the full_text is null.
My first thought was to try and read that specific column of the dataframe as a dictionary again using:
d = pd.DataFrame.from_dict(tweets['extended_tweet']['full_text'])
But this doesn't work. I don't really understand why that doesn't work as that is how I read the data the first time.
My guess is that I can't look at the specific names because I am going back to the list and it would have to read all or none. The error it gives me says "KeyError: 'full_text' "
I also tried using the recommendation provided by this website. But this gave me a None value no matter what.
Thanks in advance!
I tried to do what @Dan D. suggested; however, this still gave me errors. But it gave me the idea to try this:
tweets[0]['extended_tweet']['full_text']
This works and gives me the value that I am looking for. But I need to run through the whole thing. So I tried this:
df['full'] = [tweets[i]['extended_tweet']['full_text'] for i in range(len(tweets))]
This gives me "KeyError: 'extended_tweet'".
Does it seem like I am on the right track?
I would suggest flattening out the dictionaries while loading them, like this:
tweet = json.loads(line)
tweet['full_text'] = tweet['extended_tweet']['full_text']
tweets.append(tweet)
I don't know if the answer suggested earlier works; I never got it to run successfully. But I did figure out something else that works well for me.
What I really needed was a way to display the full text of a tweet. I first loaded the tweets from the JSON with the code I posted above. Then I noticed that in the data file there is a field called truncated. If this value is true, the tweet is cut short and the full tweet is placed in
tweets[i]['extended_tweet']['full_text']
In order to access it, I used this:
tweet_list = []
for i in range(len(tweets)):
    # json.loads turns a JSON true into the boolean True; the string 'True' is accepted as well
    if tweets[i]['truncated'] in (True, 'True'):
        tweet_list.append(tweets[i]['extended_tweet']['full_text'])
    else:
        tweet_list.append(tweets[i]['text'])
Then I can work with the data using the whole text from each tweet.
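For what it's worth, the fallback can also be sketched without relying on the truncated flag, by checking whether extended_tweet is present at all. The two tweets below are made-up data and full_text() is a hypothetical helper name; only the keys text, truncated, extended_tweet and full_text come from the question.

```python
import json

# Two made-up tweets (hypothetical data): only the first has extended_tweet.
lines = [
    '{"text": "Short version of a long tw...", "truncated": true, '
    '"extended_tweet": {"full_text": "The complete text of a long tweet"}}',
    '{"text": "A short tweet", "truncated": false}',
]

tweets = [json.loads(line) for line in lines if line.strip()]

def full_text(tweet):
    """Return extended_tweet's full_text when present, else the plain text."""
    extended = tweet.get("extended_tweet") or {}
    return extended.get("full_text", tweet["text"])

tweet_list = [full_text(t) for t in tweets]
print(tweet_list)
```

Using dict.get avoids the KeyError: 'extended_tweet' seen earlier, because tweets without that key simply fall back to text. The resulting list has one entry per tweet, so it could be assigned to a DataFrame column such as df['full'].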

Creating dict from csv file

I am working with a csv file containing tweets which was generated using this project: https://github.com/Jefferson-Henrique/GetOldTweets-python.
The 2 first tweets, and the headings in the csv file can be seen below:
username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink;;
thepsalami;02-04-2014 01:59;0;2;Must be #aprilfools because everyone is
saying #HIMYM is over! Haha it'll never stop as long as we hold fast to the
memories.;;;#aprilfools #HIMYM;
4,51147E+17;https://twitter.com/thepsalami/status/451146992131923968;;
shahanasiddiqui;02-04-2014 01:59;0;0;#promahuq yeah B-R was no surprise -
the ending was just right. My FB turned into #HIMYM blog site! Man that show
had a huge impact!;;#promahuq;#HIMYM;4,51147E+17;https://twitter.com/shahanasiddiqui/status
/451146991955759105;;
I want to save this in a dict such that I can easily access e.g. the username, the time or the text. I tried using csv.DictReader:
input_file = csv.DictReader(open("HIMYM_tweets.csv"))
But that results in something very weird:
{'username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink;;':
"thepsalami;02-04-2014 01:59;0;2;Must be #aprilfools because everyone is
saying #HIMYM is over! Haha it'll never stop as long as we hold fast to the
memories.;;;#aprilfools #HIMYM; 4", None:['51147E+17;https://twitter.com/thepsalami/status/451146992131923968;;']}
{'username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink;;': ' ....
Any help on creating such a dict, or maybe doing something smarter is very appreciated :D
As David's comment says, you need to specify the delimiter when using DictReader.
Just replace your code with this and it should work:
input_file = csv.DictReader(open("HIMYM_tweets.csv"), delimiter=";")
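A small runnable sketch of the delimiter fix, assuming a shortened, made-up row and reading from an in-memory string instead of HIMYM_tweets.csv. The trailing ";;" of the real header is left out here to keep the example simple.

```python
import csv
import io

# Stand-in for HIMYM_tweets.csv: same column names, one shortened, made-up row.
csv_text = (
    "username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink\n"
    "thepsalami;02-04-2014 01:59;0;2;Must be #aprilfools;;;#aprilfools #HIMYM;"
    "451146992131923968;https://twitter.com/thepsalami/status/451146992131923968\n"
)

# delimiter=";" tells the reader to split fields on semicolons, not commas.
reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
rows = list(reader)

print(rows[0]["username"], rows[0]["date"], rows[0]["text"])
```

Each row is then a dict keyed by the header names, so rows[0]["username"] or rows[0]["text"] gives direct access. When reading a real file, the csv module documentation recommends opening it with newline="".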

Search methods and string matching in python

I have a task to search for a group of specific terms(around 138000 terms) in a table made of 4 columns and 187000 rows. The column headers are id, title, scientific_title and synonyms, where each column might contain more than one term inside it.
I should end up with a csv table with the id where a term has been found and the term itself. What could be the best and the fastest way to do so?
In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
    for place_length in range(len(sentence_array)):
        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])
        if phrase in literalhash:
            final_dict.setdefault(trial_id, [])
            if phrase not in final_dict[trial_id]:
                final_dict[trial_id].append(phrase)
How should I be doing this?
The code on the website you link to is case-sensitive: it will only work when the terms in tumorabs.txt and neocl.xml are in exactly the same case. If you can't change your data, then make two changes.
After:
for line in text:
add:
line = line.lower()
(this is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
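As a rough sketch of case-insensitive phrase matching (not the code from the linked website): the terms are stored lowercased in a set, and every run of consecutive words in the title is looked up in it. find_terms, max_words and the three-term list are assumptions made for illustration. Note that in the snippet as posted in the question the slice always starts at index 0, so unless the full script also drops the first word on each outer pass, only phrases beginning at the first word would ever be tried.

```python
def find_terms(text, terms, max_words=6):
    """Return every term that occurs in text as a run of whole words."""
    words = text.lower().split()
    found = []
    for start in range(len(words)):
        # Try phrases of 1..max_words words beginning at this position.
        for end in range(start + 1, min(start + max_words, len(words)) + 1):
            phrase = " ".join(words[start:end])
            if phrase in terms and phrase not in found:
                found.append(phrase)
    return found

# Hypothetical term list; in the question this would hold about 138000 terms.
terms = {t.lower() for t in ["Heart Disease", "Diabetes Mellitus", "Renal Impairment"]}

title = ("A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate "
         "on Ischemic Heart Disease After Drug-eluting Stent Implantation in "
         "Patients With Diabetes Mellitus or Renal Impairment")

matches = find_terms(title, terms)
print(matches)
```

Set lookups take roughly constant time, so the cost depends on the number of words per row rather than on the number of terms. This sketch does not strip punctuation; that would still be the job of something like string_preparation.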
To clarify the problem: we are running a small scientific project in which we need to extract all text parts containing particular keywords. We have used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but it seems that something is not working properly.
For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! We are a biologist and a medical doctor with a little knowledge of Python.
If you need more code, I can post it online.

match hex string with list index

I'm building a de-identify tool. It replaces all names by other names.
We got a report that <name>Peter</name> met <name>Jane</name> yesterday. <name>Peter</name> is suspicious.
Output:
We got a report that <name>Billy</name> met <name>Elsa</name> yesterday. <name>Billy</name> is suspicious.
It can be done on multiple documents, and one name is always replaced by the same counterpart, so you can still understand who the text is talking about. BUT all documents have an ID referring to the person the file is about (I'm working with files in a public service), and only documents with the same person ID will be de-identified the same way, with the same names (the goal is to follow people's history and how it evolves). This is a security measure, so that when I hand over the tool to a third party, I don't hand over the key to my own documents with it.
So the same input, with a different ID, produces :
We got a report that <name>Henry</name> met <name>Alicia</name> yesterday. <name>Henry</name> is suspicious.
Right now, I'm hashing each name with the document ID as a salt, converting the hash to an integer, and then subtracting the length of the name list until I can look up a name with that integer as an index. But I feel like there should be a quicker, more straightforward approach?
It's really more of an algorithmic question, but in case it's relevant, I'm working with Python 2.7. Please ask for more explanation if needed. Thank you!
I hope it's clearer this way ô_o Sorry, when you are neck-deep in your code you forget that others need the bigger picture to understand how you got there.
As @LutzHorn pointed out, you could just use a dict to map real names to false ones.
You could also just do something like:
existing_names = []
for nameoccurrence in original_text:
    if nameoccurrence.name not in existing_names:
        nameoccurrence.id = len(existing_names)
        existing_names.append(nameoccurrence.name)
    else:
        nameoccurrence.id = existing_names.index(nameoccurrence.name)
for idx, _ in enumerate(existing_names):
    existing_names[idx] = gimme_random_name()
Try using a dictionary of names.
import re
s = "We got a report that <name>Peter</name> met <name>Jane</name> yesterday. <name>Peter</name> is suspicious."
names = {"Peter": "Billy", "Jane": "Elsa"}
for name in re.findall("<name>([a-zA-Z]+)</name>", s):
    s = re.sub("<name>" + name + "</name>", "<name>" + names[name] + "</name>", s)
print(s)
Output:
We got a report that <name>Billy</name> met <name>Elsa</name> yesterday. <name>Billy</name> is suspicious.
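On the original question of turning a salted hash into a list index: taking the integer modulo the length of the list does in one step what repeated subtraction does. The sketch below assumes SHA-256, a short made-up replacement list, and made-up document IDs ("person-001", "person-002"); pseudonym and deidentify are hypothetical helper names. It is written for Python 3, whereas the question mentions 2.7.

```python
import hashlib
import re

# Hypothetical pool of replacement names; a real tool would use a much longer list.
REPLACEMENTS = ["Billy", "Elsa", "Henry", "Alicia", "Oscar", "Nina", "Victor", "Clara"]

def pseudonym(name, doc_id):
    """Pick a replacement for name that is stable for a given document ID."""
    digest = hashlib.sha256((doc_id + ":" + name).encode("utf-8")).hexdigest()
    # The modulo maps the large integer straight onto a valid list index.
    return REPLACEMENTS[int(digest, 16) % len(REPLACEMENTS)]

def deidentify(text, doc_id):
    """Replace the content of every <name>...</name> tag in text."""
    return re.sub(
        r"<name>([^<]+)</name>",
        lambda m: "<name>" + pseudonym(m.group(1), doc_id) + "</name>",
        text,
    )

text = ("We got a report that <name>Peter</name> met <name>Jane</name> yesterday. "
        "<name>Peter</name> is suspicious.")
print(deidentify(text, "person-001"))
print(deidentify(text, "person-002"))
```

The same name with the same document ID always gives the same replacement, with no stored mapping. One caveat: two different real names can land on the same replacement, and the shorter the list, the more likely that is. If that matters, a per-ID dict as suggested above avoids it.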
