Searching for a string in an excel column - python

I am trying to find a string in an Excel spreadsheet, but my code only searches the first row and neglects the rest.
In my code I use Tkinter to get input from a user and a link_url() function to match it against each cell of a column in the Excel sheet; if it matches, I capture the value of another column in the same row.
Here is a sample of the Excel sheet:
  Name                Link
0  ABC   www.linkname1.com
1  DEF   www.linkname2.com
2  GHI   www.linkname3.com
3  JKL   www.linkname4.com
4  MNO   www.linkname5.com
5  PQR   www.linkname6.com
6  STU   www.linkname7.com
7  VWX   www.linkname8.com
8  YZZ   www.linkname9.com
9  123  www.linkname10.com
I created the following function to search for the input:
def link_url():
    df = pd.read_excel('Links.xlsx')
    for i in df.index:
        # print(df['Name'])
        # print(e.get())
        if e.get() in df['Name'][i]:
            print(df['Name'][i])
            link_url = df['Link'][i]
            known.append(e.get())
            return link_url
        else:
            unknown.append(e.get())
            unknown_request = "I will search and return back to you"
            return unknown_request
My Question
When I search for ABC it returns www.linkname1.com as requested, but when I search for DEF it returns I will search and return back to you. Why is that happening, and how can I fix it?

I may be misunderstanding the question a bit (Henry Ecker is right about the direct issue you are running into: both branches of the if return on the very first loop iteration, so no row after the first is ever checked), but the solution with Pandas feels a bit weird to me.
I'd personally do something more like the following to filter a data frame down to a specific row; I generally avoid for-looping through data frames as much as I can.
import pandas as pd

my_data = pd.DataFrame(
    {'Name': ['ABC', 'DEF', 'GHI'],
     'Link': ['www.linkname1.com', 'www.linkname2.com', 'www.linkname3.com']}
)
keep = my_data.Name.eq('DEF')
result = my_data[keep]
if len(result) > 0:
    print(result.Link.values)
else:
    print("I will search and return back to you")

Related

How to iterate through rows which contain text and create bigrams using python

In an Excel file I have 5 columns and 20 rows; one of the columns contains text data, as shown below.
The df['Content'] column contains:
0 this is the final call
1 hello how are you doing
2 this is me please say hi
..
.. and so on
I want to create bigrams while the column remains attached to its original table.
I tried applying the function below to iterate through the rows:
def find_bigrams(input_list):
    bigram_list = []
    for i in range(len(input_list)-1):
        bigram_list.append(input_list[1:])
    return bigram_list
And tried applying it back to the table using:
df['Content'] = df['Content'].apply(find_bigrams)
But I am getting the following error:
0 None
1 None
2 None
I am expecting the output as below:
  Company  Code                                      Content
0     xyz  uh-11  (this,is),(is,the),(the,final),(final,call)
1     abc  yh-21  (hello,how),(how,are),(are,you),(you,doing)
Your input_list is not actually a list, it's a string.
Try the function below:
def find_bigrams(input_text):
    input_list = input_text.split(" ")
    bigram_list = list(zip(input_list[:-1], input_list[1:]))  # zip already yields tuples
    return bigram_list
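A quick check of the fixed function against the question's sample rows:
import pandas as pd

df = pd.DataFrame({'Content': ['this is the final call', 'hello how are you doing']})
df['Content'] = df['Content'].apply(find_bigrams)  # find_bigrams as defined above
print(df['Content'][0])
# [('this', 'is'), ('is', 'the'), ('the', 'final'), ('final', 'call')]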
You can also use itertools.permutations() (here s is the df['Content'] column):
import itertools
s.str.split().map(lambda x: list(itertools.permutations(x, 2))[::len(x)])
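The slicing trick works because permutations(x, 2) lists all ordered pairs grouped by first element, and stepping through that list by len(x) lands exactly on the adjacent pairs:
import itertools
import pandas as pd

s = pd.Series(['this is the final call'])
print(s.str.split().map(lambda x: list(itertools.permutations(x, 2))[::len(x)])[0])
# [('this', 'is'), ('is', 'the'), ('the', 'final'), ('final', 'call')]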

How to use pandas to check for a list of values from a csv spreadsheet while filtering out certain keywords?

Hey guys, this is my first post. I am planning on building an anime recommendation engine using Python. I ran into a problem where I made a list called genre_list which stores the genres that I want to filter from the huge data spreadsheet I was given. I am using the Pandas library, which has an isin() function to check whether the values of a list are included in the datasheet and filter on them. I am using that function, but it is not able to detect "Action" in the datasheet although it is there. I have a feeling there's something wrong with the data types and I probably have to work around it somehow, but I'm not sure how.
I downloaded my csv file from this link for anyone interested!
https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?resource=download
import pandas as pd

df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)
print(genre_list)
df_genre = df[df["genre"].isin(genre_list)]
# df_genre = df["genre"]
print(df_genre)
Output (screenshot): https://i.stack.imgur.com/XZzcc.png
You want to check if ANY value in your user input list is in each of the list values in the "genre" column. The isin function checks whether the whole cell value equals one of your inputs, which is not what you want here. Change that line to this:
df_genre = df[df['genre'].apply(lambda x: any([i in x for i in genre_list]))]
Let me know if you need any more help.
import pandas as pd

df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)

# List of all cells and their genre put into a list
col_list = df["genre"].values.tolist()
temp_list = []
# Each val in the list is compared with the genre_list to see if there is a match
for index, val in enumerate(col_list):
    if all(x in val for x in genre_list):
        # If there is a match, the UID of that cell is added to a temp_list
        temp_list.append(df['uid'].iloc[index])
print(temp_list)
# This checks if the UID is contained in the temp_list of UIDs that have these genres
df_genre = df["uid"].isin(temp_list)
new_df = df.loc[df_genre, "title"]
# Prints all Anime with the specified genres
print(new_df)
This is another approach I took and works as well. Thanks for all the help :D
To make a selection from a dataframe, you can write this:
df_genre = df.loc[df['genre'].isin(genre_list)]
I've downloaded the file animes.csv from Kaggle and read it into a dataframe. What I found is that the column genre actually contains strings (of lists), not lists. So one way to get what you want would be:
...
m = df["genre"].str.contains(r"'(?:" + "|".join(genre_list) + r")'")
df_genre = df[m]
Result for genre_list = ["Action"]:
         uid  ...                                               link
3       5114  ...  https://myanimelist.net/anime/5114/Fullmetal_A...
4      31758  ...  https://myanimelist.net/anime/31758/Kizumonoga...
5      37510  ...  https://myanimelist.net/anime/37510/Mob_Psycho...
7      38000  ...  https://myanimelist.net/anime/38000/Kimetsu_no...
9       2904  ...  https://myanimelist.net/anime/2904/Code_Geass_...
...      ...  ...                                                ...
19301  10350  ...  https://myanimelist.net/anime/10350/Hakuouki_S...
19303   1293  ...  https://myanimelist.net/anime/1293/Urusei_Yatsura
19304    150  ...  https://myanimelist.net/anime/150/Blood_
19305   4177  ...  https://myanimelist.net/anime/4177/Bounen_no_X...
19309    450  ...  https://myanimelist.net/anime/450/InuYasha_Mov...

[4215 rows x 12 columns]
If you want to transform the values of the genre column for some reason into lists, then you could do either
df["genre"] = df["genre"].str[1:-1].str.replace("'", "").str.split(r",\s*")
or
df["genre"] = df["genre"].map(eval)
Afterwards
df_genre = df[~df["genre"].map(set(genre_list).isdisjoint)]
would give you the filtered dataframe.
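One caution about df["genre"].map(eval): eval will execute arbitrary code found in the file. Since these cells are plain list literals, ast.literal_eval from the standard library is a safer drop-in; a minimal sketch:
import ast
import pandas as pd

# Two stringified lists, as they appear in animes.csv
df = pd.DataFrame({'genre': ["['Action', 'Adventure']", "['Comedy']"]})
# literal_eval only parses Python literals and refuses to execute code
df['genre'] = df['genre'].map(ast.literal_eval)
print(df['genre'][0])  # ['Action', 'Adventure']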

Need to extract specific word from text

I am trying to run a data cleaning process in Python, and one of the columns, which has too many rows to handle manually, is as follows:
|Website |
|:------------------|
|m.google.com |
|uk.search.yahoo |
|us.search.yahoo.com|
|google.co.in |
|m.youtube |
|youtube.com |
I want to extract the company name from the text.
The output should be as follows:
|Website |Company|
|:------------------|:------|
|m.google.com |google |
|uk.search.yahoo |yahoo |
|us.search.yahoo.com|yahoo |
|google.co.in |google |
|m.youtube |youtube|
|youtube.com |youtube|
The data is too big to do this manually, and being a beginner, I tried everything I had learned. Please help!
Not bullet-proof but maybe a feasible heuristic:
import pandas as pd
d = {'Website': {0: 'm.google.com', 1: 'uk.search.yahoo', 2: 'us.search.yahoo.com', 3: 'google.co.in', 4: 'm.youtube', 5: 'youtube.com'}}
df = pd.DataFrame(data=d)
df['Website'].str.split('.').map(lambda l: [e for e in l if len(e)>3][-1])
0     google
1      yahoo
2      yahoo
3     google
4    youtube
5    youtube
Name: Website, dtype: object
Explanation:
Split the string on ., drop substrings of three characters or fewer, then take the rightmost remaining element.
I applied this trick on a large Kaggle dataset and it works for me. Assuming that you already have a Pandas dataframe named df:
company = df['Website']
ext_list = ['www.', '.com', '.edu', '.net', '.org', '.gov', '.mil', 'm.', 'uk.search.', 'us.search.', '.co.in', '.af']
for extension in ext_list:
    company = company.str.replace(extension, '', regex=False)  # regex=False: treat '.' literally
df['company'] = company
df['company'].head(15)
Now look at your data carefully, either at the head or the tail, and check whether any extension is missing from the list; if you find another, add it to ext_list.
Now you can also verify it using
df['company'].unique()
Here is a way of checking its running time; the complexity is O(n), so it also performs well on large datasets.
import time

def time_it(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(func.__name__ + " took " + str((end - start) * 1000) + " milliseconds")
        return result
    return wrapper

@time_it
def specific_word(col_name, ext_list):
    for extension in ext_list:
        col_name = col_name.str.replace(extension, '', regex=False)
    return col_name

if __name__ == '__main__':
    company = df['Website']
    extensions = ['www.', '.com', '.edu', '.net', '.org', '.gov', '.mil', 'm.', 'uk.search.', 'us.search.', '.co.in', '.af']
    result = specific_word(company, extensions)
    print(result.head())
Here it is applied to an estimated 10,000 values.
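If installing a third-party package is an option, the tldextract library splits hostnames using the Public Suffix List, which avoids hand-maintaining an extension list. A sketch, assuming tldextract is installed (pip install tldextract; note it may download the suffix list on first use):
import pandas as pd
import tldextract

d = {'Website': ['m.google.com', 'uk.search.yahoo', 'us.search.yahoo.com',
                 'google.co.in', 'm.youtube', 'youtube.com']}
df = pd.DataFrame(data=d)

# ExtractResult.domain is the registrable name without subdomains or suffixes;
# for inputs with no known suffix (e.g. 'm.youtube') the last label is used.
df['Company'] = df['Website'].map(lambda host: tldextract.extract(host).domain)
print(df)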

Categorize the dataframe column data into Latin/Non-Latin

I'm trying to categorize Latin/non-Latin data with Python. I want the output to be 'columnname: Latin' if it's Latin, and 'columnname: Non-Latin' if it's not. Here's the data set I'm using:
name|company|address|ssn|creditcardnumber
Gauge J. Wiley|Crown Holdings|1916 Central Park Columbus|697-01-963|4175-0049-9703-9147
Dalia G. Valenzuela|Urs Corporation|8672 Cottage|Cincinnati|056-74-804|3653-0049-5620-71
هاها|Exide Technologies|هاها|Washington|139-09-346|6495-1799-7338-6619
I tried adding the below code. I don't get any error, but I get 'Latin' all the time. Is there any issue with the code?
if any(dataset.name.astype(str).str.contains(u'[U+0000-U+007F]')):
    print('Latin')
else:
    print('Non-Latin')
And also I'd be happy if someone could tell me how to display the output as "column name: Latin", with the column name iterated from the dataframe.
It depends on what you need: to check whether all alphabetic characters in a value are Latin, or only whether the value contains any Latin letters. Both can be turned into labels with numpy.where:
import numpy as np
import pandas as pd
import unicodedata as ud  # approach from https://stackoverflow.com/a/3308844

df = pd.DataFrame({'name': [u"هاها", 'a', u"aهاها"]})

latin_letters = {}

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    return all(is_latin(uchr)
               for uchr in unistr
               if uchr.isalpha())

# check if all alphabetic characters are Latin
df['new1'] = np.where(df['name'].map(only_roman_chars), 'Latin', 'Non-Latin')
# check if any Latin letter is present
df['new2'] = np.where(df.name.str.contains('[a-zA-Z]'), 'Latin', 'Non-Latin')
print(df)
    name       new1       new2
0   هاها  Non-Latin  Non-Latin
1      a      Latin      Latin
2  aهاها  Non-Latin      Latin
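The asker also wanted the output shaped as "column name: Latin". A hypothetical follow-up, reusing only_roman_chars from above on every string column of the dataset:
# Label a whole column Latin only if every value in it passes only_roman_chars
for col in df.select_dtypes(include='object').columns:
    label = 'Latin' if df[col].map(only_roman_chars).all() else 'Non-Latin'
    print(f'{col}: {label}')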

Count match in 2 pandas dataframes

I have 2 dataframes, each containing text as a list in each row. This one is called df:
                 Datum                      File File_type                                               Text
Datum
2000-01-27  2000-01-27  0864820040_000127_04.txt       _04  [business, date, jan, heineken, starts, integr...
and I have another one, df_lm, which looks like this:
      List_type                                              Words
0  LM_cnstrain.  [abide, abiding, bound, bounded, commit, commi...
1  LM_litigius.  [abovementioned, abrogate, abrogated, abrogate...
2  LM_modal_me.  [can, frequently, generally, likely, often, ou...
3  LM_modal_st.  [always, best, clearly, definitely, definitive...
4  LM_modal_wk.  [almost, apparently, appeared, appearing, appe...
I want to create new columns in df where the number of matching words is counted, so for example how many of the words from df_lm.Words[0] appear in df.Text[0].
Note: df has ca. 500 rows and df_lm has 6, so I need to create 6 new columns in df, so that the updated df looks somewhat like this:
Datum       ...  LM_cnstrain  LM_litigius  LM_modal_me  ...
2000-01-27  ...            5            3            4
2000-02-25  ...            7            1            0
I hope I was clear on my question.
Thanks in advance!
EDIT:
I have already done something similar by creating a list and looping over it, but as the lists in df_lm are very long this is not an option. The code looked like this:
result_list = []
for file in file_list:
    count_growth = 0
    for word in text.split():
        if word in growth:
            count_growth = count_growth + 1
    a = {'Growth': count_growth}
    result_list.append(a)
According to my comments, you can try something like this. The code below has to run in a loop where the Text column from the first df is matched against each of the 6 rows of the second, creating one column per row from the length of the result:
desc = df_lm.iloc[0, 1]
matches = df.Text.isin(desc)
result = df.Text[matches]
If this helps you, let me know; otherwise I will update/delete the answer.
So I've come to the following solution:
for file in file_list:
    count_lm_constraint = 0
    count_lm_litigious = 0
    count_lm_modal_me = 0
    for word in text.split():
        if word in df_lm.iloc[0, 1]:
            count_lm_constraint = count_lm_constraint + 1
        if word in df_lm.iloc[1, 1]:
            count_lm_litigious = count_lm_litigious + 1
        if word in df_lm.iloc[2, 1]:
            count_lm_modal_me = count_lm_modal_me + 1
    a = {"File": name, "Text": text, 'lm_constraint': count_lm_constraint, 'lm_litigious': count_lm_litigious, ....}
    result_list.append(a)
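A tidier pattern for all 6 lists is to turn each word list into a set and count membership per row. A sketch with hypothetical stand-in data, since the full frames are not shown:
import pandas as pd

# Stand-ins shaped like the question's df and df_lm
df = pd.DataFrame({'Text': [['business', 'abide', 'can', 'often'],
                            ['always', 'bound', 'likely']]})
df_lm = pd.DataFrame({
    'List_type': ['LM_cnstrain.', 'LM_modal_me.'],
    'Words': [['abide', 'abiding', 'bound'], ['can', 'frequently', 'likely', 'often']],
})

# One new count column per word list; set membership is O(1) per token
for _, row in df_lm.iterrows():
    words = set(row['Words'])
    df[row['List_type']] = df['Text'].map(lambda toks: sum(t in words for t in toks))
print(df)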
