Count matches in 2 pandas dataframes - python

I have 2 dataframes containing text as a list in each row. The first one is called df:
Datum File File_type Text
Datum
2000-01-27 2000-01-27 0864820040_000127_04.txt _04 [business, date, jan, heineken, starts, integr..
and I have another one, df_lm, which looks like this:
List_type Words
0 LM_cnstrain. [abide, abiding, bound, bounded, commit, commi...
1 LM_litigius. [abovementioned, abrogate, abrogated, abrogate...
2 LM_modal_me. [can, frequently, generally, likely, often, ou...
3 LM_modal_st. [always, best, clearly, definitely, definitive...
4 LM_modal_wk. [almost, apparently, appeared, appearing, appe...
I want to create new columns in df in which the matching words are counted - so, for example, how many of the words from df_lm.Words[0] appear in df.Text[0].
Note: df has ca. 500 rows and df_lm has 6, so I need to create 6 new columns in df so that the updated df looks somewhat like this:
Datum ... LM_cnstrain LM_litigius LM_modal_me ...
2000-01-27 ... 5 3 4
2000-02-25 ... 7 1 0
I hope my question is clear.
Thanks in advance!
EDIT:
I have already done something similar by creating a list and looping over it, but since the lists in df_lm are very long, this is not an option.
The code looked like this:
result_list = []
for file in file_list:
    count_growth = 0
    for word in text.split():
        if word in growth:
            count_growth = count_growth + 1
    a = {'Growth': count_growth}
    result_list.append(a)

According to my comments you can try something like this:
The code below has to run in a loop in which the Text column from the first df is matched against all 6 rows of the second, making a column with the value len(result):
desc = df_lm.iloc[0, 1]
matches = df.Text.isin(desc)
result = df.Text[matches]
If this helps you, let me know; otherwise I will update/delete the answer.

So I've come to the following solution:
result_list = []
for file in file_list:
    count_lm_constraint = 0
    count_lm_litigious = 0
    count_lm_modal_me = 0
    for word in text.split():
        if word in df_lm.iloc[0, 1]:
            count_lm_constraint = count_lm_constraint + 1
        if word in df_lm.iloc[1, 1]:
            count_lm_litigious = count_lm_litigious + 1
        if word in df_lm.iloc[2, 1]:
            count_lm_modal_me = count_lm_modal_me + 1
    a = {"File": name, "Text": text, 'lm_constraint': count_lm_constraint, 'lm_litigious': count_lm_litigious, ....}
    result_list.append(a)
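The per-word membership tests above can be generalized across all rows of df_lm, and sped up, by converting each word list to a set before counting. A minimal sketch with toy stand-ins for the two DataFrames described above (the real word lists and 500 text rows are assumed):

```python
import pandas as pd

# Toy stand-ins shaped like the df and df_lm described in the question
df = pd.DataFrame({"Text": [["business", "can", "always", "likely"],
                            ["abide", "can", "best", "commit"]]})
df_lm = pd.DataFrame({"List_type": ["LM_cnstrain.", "LM_modal_me.", "LM_modal_st."],
                      "Words": [["abide", "abiding", "commit"],
                                ["can", "frequently", "likely"],
                                ["always", "best", "clearly"]]})

# One new count column per word list; set membership is O(1) per word
for _, row in df_lm.iterrows():
    vocab = set(row["Words"])
    df[row["List_type"]] = df["Text"].apply(lambda toks: sum(w in vocab for w in toks))
```

This adds one column per row of df_lm without hard-coding six counters by hand.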

Related

How to iterate through rows which contain text and create bigrams using python

In an Excel file I have 5 columns and 20 rows; one column contains text data as shown below.
The df['Content'] column contains:
0 this is the final call
1 hello how are you doing
2 this is me please say hi
..
.. and so on
I want to create bigrams while the text remains attached to its original table.
I tried applying the function below to iterate through the rows:
def find_bigrams(input_list):
    bigram_list = []
    for i in range(len(input_list)-1):
        bigram_list.append(input_list[1:])
    return bigram_list
And tried applying the function back to the table using:
df['Content'] = df['Content'].apply(find_bigrams)
But I am getting the following result:
0 None
1 None
2 None
I am expecting the output as below
Company Code Content
0 xyz uh-11 (this,is),(is,the),(the,final),(final,call)
1 abc yh-21 (hello,how),(how,are),(are,you),(you,doing)
Your input_list is not actually a list, it's a string.
Try the function below:
def find_bigrams(input_text):
    input_list = input_text.split(" ")
    bigram_list = list(map(tuple, zip(input_list[:-1], input_list[1:])))
    return bigram_list
You can also use itertools.permutations():
import itertools
df['Content'].str.split().map(lambda x: list(itertools.permutations(x, 2))[::len(x)])
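To keep the bigrams attached to the original table, the corrected function can be applied to the Content column. A sketch with made-up Company/Code values matching the expected output above:

```python
import pandas as pd

def find_bigrams(input_text):
    # Split the sentence into words, then pair each word with its successor
    input_list = input_text.split(" ")
    return list(zip(input_list[:-1], input_list[1:]))

df = pd.DataFrame({"Company": ["xyz", "abc"],
                   "Code": ["uh-11", "yh-21"],
                   "Content": ["this is the final call",
                               "hello how are you doing"]})

# Replace each sentence with its list of bigram tuples, in place
df["Content"] = df["Content"].apply(find_bigrams)
```

zip() already yields tuples, so the map(tuple, ...) in the answer above is optional.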

Pandas: How to remove words in string which appear before a certain word from another column

I have a large csv file with a column containing strings. At the beginning of these strings there are a set of id numbers which appear in another column as below.
0 Home /buy /York /Warehouse /P000166770Ou... P000166770
1 Home /buy /York /Plot /P000165923A plot of la... P000165923
2 Home /buy /London /Commercial /P000165504A str... P000165504
...
804 Brand new apartment on the first floor, situat... P000185616
I want to remove all text which appears before the ID number so here we would get:
0 Ou...
1 A plot of la...
2 A str...
...
804 Brand new apartment on the first floor, situat...
I tried something like
df['column_one'].str.split(df['column_two'])
and
df['column_one'].str.replace(df['column_two'],'')
You could replace the pattern using regex as follows:
>>> my_pattern = r"^(Alpha|Beta|QA|Prod)\s[A-Z0-9]{7}"
>>> my_series = pd.Series(['Alpha P17089OText starts here'])
>>> my_series.str.replace(my_pattern, '', regex=True)
0 Text starts here
There is a bit of work to be done to determine the nature of your pattern. I would suggest experimenting a bit with https://regex101.com/
To extend your split() idea:
df.apply(lambda x: x['column_one'].split(x['column_two'])[1], axis=1)
0 Text starts here
I managed to get it to work using:
df.apply(lambda x: x['column1'].split(x['column2'])[1] if x['column2'] in x['column1'] else x['column1'], axis=1)
This also works when the ID is not in the description. Thanks for the help!
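For completeness, that conditional split can be exercised on a tiny frame shaped like the sample above (the column1/column2 names are the question's own placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "column1": ["Home /buy /York /Plot /P000165923A plot of land",
                "Brand new apartment on the first floor"],
    "column2": ["P000165923", "P000185616"],
})

# Keep the text after the ID when present; otherwise keep the full string
df["ext"] = df.apply(
    lambda x: x["column1"].split(x["column2"])[1]
    if x["column2"] in x["column1"] else x["column1"],
    axis=1,
)
```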
Here is one way to do it, by applying a regex to each row based on the code:
import re
def ext(row):
    mch = re.findall(r"{0}(.*)".format(row['code']), row['txt'])
    if len(mch) > 0:
        rtn = mch.pop()
    else:
        rtn = row['txt']
    return rtn
df['ext'] = df.apply(ext, axis=1)
df
0 Ou...
1 A plot of la...
2 A str...
3 Brand new apartment on the first floor situat...
x txt code ext
0 0 Home /buy /York /Warehouse / P000166770 Ou... P000166770 Ou...
1 1 Home /buy /York /Plot /P000165923A plot of la... P000165923 A plot of la...
2 2 Home /buy /London /Commercial /P000165504A str... P000165504 A str...
3 804 Brand new apartment on the first floor situat... P000185616 Brand new apartment on the first floor situat...

Python concatenate values in rows till empty cell and continue

I am struggling a little to do something like this:
to get this output:
The purpose of it, is to separate a sentence into 3 parts to make some manipulations after.
Any help is welcome
Select from the dataframe only the second line of each pair, which is the line
containing the separator, then use astype(str).apply(''.join, axis=1) to reduce the word
that can be in any value column of the original dataframe to a single string.
Iterate over each row using split with the word[i] of the respective row; after the split,
reinsert the separator back into the list, and build the desired dataframe from the
resulting lists.
Input used as data.csv
title,Value,Value,Value,Value,Value
Very nice blue car haha,Very,nice,,car,haha
Very nice blue car haha,,,blue,,
A beautiful green building,A,,green,building,lol
A beautiful green building,,beautiful,,,
import pandas as pd
df = pd.read_csv("data.csv")
# second line of each pair
d1 = df[1::2]
d1 = d1.fillna("").reset_index(drop=True)
# get separators
word = d1.iloc[:,1:].astype(str).apply(''.join, axis=1)
strings = []
for i in range(len(d1.index)):
    word_split = d1.iloc[i, 0].split(word[i])
    word_split.insert(1, word[i])
    strings.append(word_split)
dn = pd.DataFrame(strings)
dn.insert(0, "title", d1["title"])
print(dn)
Output from dn
                        title          0          1               2
0     Very nice blue car haha  Very nice       blue        car haha
1  A beautiful green building          A  beautiful  green building

If text is contained in another dataframe then flag row with a binary designation

I'm working on mining survey data. I was able to flag the rows for certain keywords:
survey['Rude'] = survey['Comment Text'].str.contains('rude', na=False, regex=True).astype(int)
Now, I want to flag any rows containing names. I have another dataframe that contains common US names.
Here's what I thought would work, but it is not flagging any rows, and I have validated that names do exist in the 'Comment Text'
for row in survey:
    for word in survey['Comment Text']:
        survey['Name'] = 0
        if word in names['Name']:
            survey['Name'] = 1
You are not looping through the series correctly. for row in survey: loops through the column names of survey. for word in survey['Comment Text']: loops through the comment strings. survey['Name'] = 0 creates a column of all 0s. Also note that word in names['Name'] tests membership against the series index, not its values.
You could use set intersections and apply(), to avoid all the looping through rows:
survey = pd.DataFrame({'Comment_Text': ['Hi rcriii',
                                        'Hi yourself stranger',
                                        'say hi to Justin for me']})
names = pd.DataFrame({'Name': ['rcriii', 'Justin', 'Susan', 'murgatroyd']})
s2 = set(names['Name'])

def is_there_a_name(s):
    s1 = set(s.split())
    if len(s1.intersection(s2)) > 0:
        return 1
    else:
        return 0

survey['Name'] = survey['Comment_Text'].apply(is_there_a_name)
print(names)
print(survey)
print(names)
print(survey)
Name
0 rcriii
1 Justin
2 Susan
3 murgatroyd
Comment_Text Name
0 Hi rcriii 1
1 Hi yourself stranger 0
2 say hi to Justin for me 1
As a bonus, return len(s1.intersection(s2)) to get the number of matches per line.

How to select uppercase words from a column and separate into a new column?

I have a dataset of genes and drugs all in 1 column, looks like this:
Molecules
3-nitrotyrosine
4-phenylbutyric acid
5-fluorouracil/leucovorin/oxaliplatin
5-hydroxytryptamine
ABCB4
ABCC8
ABCC9
ABCF2
ABHD4
The dispersal of genes and drugs in the column is random, so there is no precise partitioning I can do.
I am looking to move the genes into a new column. I am wondering if I can use isupper() to select the genes, although I know this only works with strings. Is there some way to select the rows with uppercase letters and put them into a new column? Any guidance would be appreciated.
Expected Output:
Column 1 Column 2
3-nitrotyrosine ABCB4
4-phenylbutyric acid ABCC8
5-fluorouracil/leucovorin/oxaliplatin ABCC9
5-hydroxytryptamine ABCF2
Read your file in to a list:
with open('test.txt', 'r') as f:
    lines = [line.strip() for line in f]
Then separate out the all-uppercase entries like so:
mols = [x for x in lines if x.upper() != x]
genes = [x for x in lines if x.upper() == x]
Result:
mols
['3-nitrotyrosine', '4-phenylbutyric acid',
'5-fluorouracil/leucovorin/oxaliplatin', '5-hydroxytryptamine']
genes
['ABCB4', 'ABCC8', 'ABCC9', 'ABCF2', 'ABHD4']
As mentioned, separating the upper case is simple:
df.loc[df['Molecules'].str.isupper()]
Molecules
5 ABCB4
6 ABCC8
7 ABCC9
8 ABCF2
9 ABHD4
df.loc[df['Molecules'].str.isupper() == False]
Molecules
0 3-nitrotyrosine
1 4-phenylbutyric
2 acid
3 5-fluorouracil/leucovorin/oxaliplatin
4 5-hydroxytryptamine
However, how you want to match up the rows is unclear until you can provide additional details.
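If the two groups should simply sit side by side, as in the expected output, one option is to reset the index of each filtered piece and concatenate them. A sketch, assuming the Molecules column shown above (the shorter column is padded with NaN):

```python
import pandas as pd

df = pd.DataFrame({"Molecules": ["3-nitrotyrosine", "4-phenylbutyric acid",
                                 "5-fluorouracil/leucovorin/oxaliplatin",
                                 "5-hydroxytryptamine",
                                 "ABCB4", "ABCC8", "ABCC9", "ABCF2", "ABHD4"]})

is_gene = df["Molecules"].str.isupper()
drugs = df.loc[~is_gene, "Molecules"].reset_index(drop=True)
genes = df.loc[is_gene, "Molecules"].reset_index(drop=True)

# Align the two groups row-by-row in their original order
out = pd.concat([drugs.rename("Column 1"), genes.rename("Column 2")], axis=1)
```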