I am struggling a little to do something like this:
to get this output:
The purpose is to separate a sentence into 3 parts so I can do some manipulation afterwards.
Any help is welcome.
Select from the dataframe only the second line of each pair, which is the line containing the separator, then use astype(str).apply(''.join, axis=1) to collapse the separator word, which can sit in any Value column of the original dataframe, into a single string.
Iterate over each row, splitting the title with the word[i] of the respective row; after the split, reinsert the separator into the list, and build the desired dataframe from the resulting lists.
Input used as data.csv
title,Value,Value,Value,Value,Value
Very nice blue car haha,Very,nice,,car,haha
Very nice blue car haha,,,blue,,
A beautiful green building,A,,green,building,lol
A beautiful green building,,beautiful,,,
import pandas as pd
df = pd.read_csv("data.csv")
# second line of each pair
d1 = df[1::2]
d1 = d1.fillna("").reset_index(drop=True)
# get separators
word = d1.iloc[:,1:].astype(str).apply(''.join, axis=1)
strings = []
for i in range(len(d1.index)):
    word_split = d1.iloc[i, 0].split(word[i])
    word_split.insert(1, word[i])
    strings.append(word_split)
dn = pd.DataFrame(strings)
dn.insert(0, "title", d1["title"])
print(dn)
Output from dn
                        title            0          1                2
0     Very nice blue car haha    Very nice       blue         car haha
1  A beautiful green building            A  beautiful   green building
This is a continuation of my previous thread: Removing Custom-Defined Words from List - Python
I have a df as such:
df = pd.DataFrame({'PageNumber': [175, 162, 576], 'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']]})
<OUT>
PageNumber new_tags
175 flower architecture people...
162 hair red bobbles...
576 sweets chocolate shop...
And another df, which will act as the reference df (see more below):
top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})
<OUT>
ID tag
1 flower
2 people
3 chocolate
I'm trying to remove values in a list in a df based on the values of another df. The output I wish to gain is:
<OUT> df
PageNumber new_tags
175 flower people
576 chocolate
I've tried the inner-join method from Filtering the dataframe based on the column value of another dataframe, but unfortunately had no luck.
So I have resorted to tokenizing all tags in both dataframes and trying to loop through each row, retaining only the values found in the reference df. Currently, it returns empty lists...
df['tokenised_new_tags'] = filtered_new["new_tags"].astype(str).apply(nltk.word_tokenize)
topic_words['tokenised_top_words']= topic_words['tag'].astype(str).apply(nltk.word_tokenize)
df['top_word_tokens'] = [[t for t in tok_sent if t in topic_words['tokenised_top_words']] for tok_sent in df['tokenised_new_tags']]
Any help is much appreciated - thanks!
How about this:
def remove_custom_words(phrase, words_to_remove_list):
    return [elem for elem in phrase.split(' ') if elem not in words_to_remove_list]

df['new_tags'] = df['new_tags'].apply(lambda x: remove_custom_words(x[0], top_words['tag'].to_list()))
Basically I am applying the remove_custom_words function to each row of the dataset, which filters out the words contained in top_words['tag'].
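Note that, as written, remove_custom_words drops the reference tags themselves; to reproduce the desired output above (keep only the reference tags), the membership test can be flipped to in. Here is a minimal, self-contained sketch of that variant on the sample data (the helper name keep_reference_words is just illustrative):
import pandas as pd

df = pd.DataFrame({'PageNumber': [175, 162, 576],
                   'new_tags': [['flower architecture people'],
                                ['hair red bobbles'],
                                ['sweets chocolate shop']]})
top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})

def keep_reference_words(phrase, reference_list):
    # keep only the tokens that appear in the reference list
    return [elem for elem in phrase.split(' ') if elem in reference_list]

df['new_tags'] = df['new_tags'].apply(lambda x: keep_reference_words(x[0], top_words['tag'].to_list()))
print(df)
# PageNumber 175 -> [flower, people], 162 -> [], 576 -> [chocolate]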
I am trying to categorize a dataset based on the string that contains the name of the different objects of the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category'], the Category and Sub_Category columns are empty.
For each row I would like to check, against several lists of words, whether the name of the object contains at least one word from one of the lists. Based on this first check I would like to assign a value to the Category column. If it finds words from 2 different lists I would like to assign 2 values to the object in the Category column.
Moreover, I would like to be able to identify which word was matched, and in which list, in order to assign a value to the Sub_Category column.
Until now, I have been able to do it with only one list, but I am not able to identify which word was matched, and the code takes very long to run.
Here is my code (where I added an example of names found in my dataset as df['Name']):
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['Name'] = ['vitrine murale vintage', 'commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
for idx, row in df.iterrows():
    for c in furniture_check:
        if c in row['Name']:
            df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name":['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
# put categories into a dataframe
dfcat = pd.DataFrame([{"category":"furniture","values":furniture_check},
{"category":"vechile","values":vehicle_check},
{"category":"art","values":art_check}])
# turn space-delimited "name" column into a list of words
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
# explode list so it can be used as join. reset_index() to keep a copy of index of original DF
.explode("name").reset_index()
# merge exploded names on both side
.merge(dfcat.explode("values"), left_on="name", right_on="values")
# where there are multiple categories, make it a list
.groupby("index", as_index=False).agg({"category":lambda s: list(s)})
# put original index back...
.set_index("index")
)
# simple join and have names and list of associated categories
df.join(dfcatlist)
                     name     category
0  vitrine murale vintage          NaN
1        commode ancienne  [furniture]
2          lustre antique          NaN
3                   solex    [vehicle]
4     sculpture médievale          NaN
5           jante voiture    [vehicle]
6          lit et matelas  [furniture]
7          turbine moteur          NaN
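The question also asks which word was matched, for the Sub_Category column. The merge above already carries the matched word in the "values" column, so one possible extension (a sketch along the same lines, not part of the original answer; the matched_words name is just illustrative) is to aggregate that column as well:
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             .explode("name").reset_index()
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # keep both the category and the word that triggered it
             .groupby("index", as_index=False).agg({"category": list, "values": list})
             .rename(columns={"values": "matched_words"})
             .set_index("index")
            )
df.join(dfcatlist)
# e.g. "jante voiture" -> category ['vehicle'], matched_words ['voiture']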
I have a corpus of 400 conversations between two people, stored as strings (or more precisely as plain text files). A small example of this might be:
my_textfiles = ['john: hello \nmary: hi there \njohn: nice weather \nmary: yes',
'nancy: hello \nbill: hi there \nnancy: nice weather \nbill: yes',
'ringo: hello \npaul: hi there \nringo: nice weather \npaul: yes',
'michael: hello \nbubbles: hi there \nmichael: nice weather \nbubbles: yes',
'steve: hello \nsally: hi there \nsteve: nice weather \nsally: yes']
In addition to speaker names, I have also noted each speaker's role in the conversation (leader or follower, depending on whether they are the first or second speaker). I then have a simple script that converts each conversation into a dataframe by separating the speaker ID from the content:
import pandas as pd
import re
import numpy as np
import random
def convo_tokenize(tf):
    turnTokenize = re.split(r'\n(?=.*:)', tf, flags=re.MULTILINE)
    turnTokenize = [turn.split(':', 1) for turn in turnTokenize]
    dataframe = pd.DataFrame(turnTokenize, columns=['speaker', 'turn'])
    return dataframe
df_list = [convo_tokenize(tf) for tf in my_textfiles]
The corresponding dataframes then form the basis of a much longer piece of analysis. However, I would now like to be able to shuffle speakers so that I create entirely random (and likely nonsense) conversations. For instance, John, who is having a conversation with Mary in the first string, might be randomly assigned Paul (the second speaker in the third string). Crucially, I would need to maintain the order of speech within each speaker. It is also important that, when randomly assigning new speakers, I preserve the leader/follower mix, so that I am not creating conversations between two leaders or two followers.
To begin, my thinking was to create a standardized speaker label (where 1 = leader, 2 = follower), and to separate each DF into role-specific sub-DFs stored in two lists:
def speaker_role(dataframe):
    leader = dataframe['speaker'].iat[0]
    dataframe['sp_role'] = np.where(dataframe['speaker'].eq(leader), 1, 2)
    return dataframe
df_list = [speaker_role(df) for df in df_list]
leader_df = []
follower_df = []
for df in df_list:
    is_leader = df['sp_role'] == 1
    is_follower = df['sp_role'] != 1
    leader_df.append(df[is_leader])
    follower_df.append(df[is_follower])
I have worked out that I can now simply shuffle the list of sub-DFs for one of the roles, in this case follower_df:
follower_rand = random.sample(follower_df, len(follower_df))
Having got to this stage I'm not sure where to turn next. I suspect I will need some sort of zip function, but am unsure exactly what. I'm also unsure how I go about merging the turns together such that they form the same dataframe structure I initially had. Assuming Ringo (leader) is randomly assigned to Bubbles (follower) for one of the DFs, I would hope to have something like this...
speaker  | turn         | sp_role
---------------------------------
ringo    | hello        | 1
bubbles  | hi there     | 2
ringo    | nice weather | 1
bubbles  | yes it is    | 2
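One possible way to finish this off (only a sketch, building on the leader_df and follower_df lists above, and assuming each pair has the same number of turns per speaker) is to zip each leader sub-DF with a randomly drawn follower sub-DF and interleave the turns by position:
import random
import pandas as pd

follower_rand = random.sample(follower_df, len(follower_df))

shuffled_convos = []
for leader, follower in zip(leader_df, follower_rand):
    # keep each speaker's own turn order, then interleave leader/follower by position
    leader = leader.reset_index(drop=True)
    follower = follower.reset_index(drop=True)
    combined = pd.concat([leader, follower]).sort_index(kind='mergesort')
    shuffled_convos.append(combined.reset_index(drop=True))
Because every new conversation pairs one leader with one follower, the leader/follower mix is preserved by construction.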
I have 2 dataframes containing text as a list in each row. This one is called df:
Datum File File_type Text
Datum
2000-01-27 2000-01-27 0864820040_000127_04.txt _04 [business, date, jan, heineken, starts, integr..
and I have another one, df_lm, which looks like this:
List_type Words
0 LM_cnstrain. [abide, abiding, bound, bounded, commit, commi...
1 LM_litigius. [abovementioned, abrogate, abrogated, abrogate...
2 LM_modal_me. [can, frequently, generally, likely, often, ou...
3 LM_modal_st. [always, best, clearly, definitely, definitive...
4 LM_modal_wk. [almost, apparently, appeared, appearing, appe...
I want to create new columns in df in which the word matches are counted, so for example how many words from df_lm.Words[0] appear in df.Text[0].
Note: df has ca. 500 rows and df_lm has 6, so I need to create 6 new columns in df so that the updated df looks somewhat like this:
Datum ...LM_cnstrain LM_litigius Lm_modal_me ...
2000-01-27 ... 5 3 4
2000-02-25 ... 7 1 0
I hope I was clear in my question.
Thanks in advance!
EDIT:
I have already done something similar by creating a list and looping over it, but as the lists in df_lm are very long this is not an option.
The code looked like this:
result_list = []
for file in file_list:
    count_growth = 0
    for word in text.split():
        if word in growth:
            count_growth = count_growth + 1
    a = {'Growth': count_growth}
    result_list.append(a)
As per my comments, you can try something like this:
The code below has to run in a loop in which the Text column of the first df is matched against each of the 6 word lists of the second df, creating a column from the length of the result:
desc = df_lm.iloc[0,1]
matches = df.text.isin(desc)
result = df.text[matches]
If this helps you, let me know; otherwise I will update/delete the answer.
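Spelling that loop out (only a sketch, assuming df['Text'] holds a list of tokens per row and df_lm has the List_type and Words columns shown above), one way to build the 6 count columns is:
# one count column per word list in df_lm
for _, lm_row in df_lm.iterrows():
    word_set = set(lm_row['Words'])   # set lookup is much faster than scanning a long list
    df[lm_row['List_type']] = df['Text'].apply(lambda tokens: sum(t in word_set for t in tokens))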
So I've come to the following solution:
for file in file_list:
    count_lm_constraint = 0
    count_lm_litigious = 0
    count_lm_modal_me = 0
    for word in text.split():
        if word in df_lm.iloc[0, 1]:
            count_lm_constraint = count_lm_constraint + 1
        if word in df_lm.iloc[1, 1]:
            count_lm_litigious = count_lm_litigious + 1
        if word in df_lm.iloc[2, 1]:
            count_lm_modal_me = count_lm_modal_me + 1
    a = {"File": name, "Text": text, 'lm_uncertain': count_lm_uncertain, 'lm_positive': count_lm_positive, ...}
    result_list.append(a)
I want to do some text processing on a dataset containing Twitter messages. So far I'm able to load the data (.CSV) in a Pandas dataframe and index that by a (custom) column 'timestamp'.
df = pandas.read_csv(f)
df.index = pandas.to_datetime(df.pop('timestamp'))
Looks a bit like this:
user_name user_handle
timestamp
2015-02-02 23:58:42 Netherlands Startups NLTechStartups
2015-02-02 23:58:42 shareNL share_NL
2015-02-02 23:58:42 BreakngAmsterdamNews iAmsterdamNews
[49570 rows x 8 columns]
I can create a new object (Series) containing just the text like so:
texts = pandas.Series(df['text'])
Which creates this:
2015-06-02 14:50:54 Business Update Meer cruiseschepen dan ooit in...
2015-06-02 14:50:53 RT #ProvincieNH: Provincie maakt Markermeerdij...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat: In ...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat http...
2015-06-02 14:50:53 Lugar secreto em Amsterdam: Begijnhof // Hidde...
Name: text, Length: 49570
1. Is this new object of the same sort of type (dataframe) as my initial df variable, just with different columns/rows?
Now, together with the NLTK toolkit, I'm able to tokenize the strings using this:
for w in words:
    print(nltk.word_tokenize(w))
This iterates the array instead of mapping the 'text' column to a multiple-column 'words' array. 2. How would I do this and moreover how do I then count the occurrences of each word?
I know there is a unique() method which I could use to create a distinct list of words. But then again I'd need an extra column which is a count over the array which I'm unable to produce in the first place. :) 3. Or would the next step towards 'counting' occurrences of those words be grouping?
EDIT. 3: I seem to need "CountVectorizer", thanks EdChum
from sklearn.feature_extraction.text import CountVectorizer

documents = df['text'].values
vectorizer = CountVectorizer(min_df=0, stop_words=[])
X = vectorizer.fit_transform(documents)
print(X.toarray())
My main goal is to count the occurrences of each word and select the top X results. I feel I'm on the right track, but I can't get the final steps just right...
Building on EdChum's comments, here is a way to get the (I assume global) word counts from CountVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
df = pd.DataFrame({'text': ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
                            'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
                   'class': ['a', 'a', 'a', 'a', 'c', 'c', 'b', 'e']})
X = vect.fit_transform(df['text'].values)
y = df['class'].values
Convert the sparse matrix returned by CountVectorizer to a dense matrix, and pass it together with the feature names to the DataFrame constructor. Then transpose the frame and sum along axis=1 to get the total per word:
word_counts = pd.DataFrame(X.todense(), columns=vect.get_feature_names()).T.sum(axis=1)
word_counts.sort(ascending=False)
word_counts[:3]
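On current pandas versions Series.sort() no longer exists (and scikit-learn's get_feature_names() has become get_feature_names_out()), so the last two lines would instead be something like:
# sort_values replaces the removed Series.sort()
word_counts = word_counts.sort_values(ascending=False)
word_counts[:3]   # or simply: word_counts.nlargest(3)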
If all you are interested in is the frequency distribution of the words, consider using FreqDist from NLTK:
import nltk
import itertools
from nltk.probability import FreqDist
texts = ['cat on the cat','angel eyes has','blue red angel','one two blue','blue whales eat','hot tin roof','angel eyes has','have a cat']
texts = [nltk.word_tokenize(text) for text in texts]
# collapse into a single list
tokens = list(itertools.chain(*texts))
FD = FreqDist(tokens)
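Since FreqDist behaves like a collections.Counter, pulling the top X tokens is then, for example:
print(FD.most_common(3))   # the most frequent tokens and their counts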