Categorize the dataframe column data into Latin/Non-Latin - python

I'm trying to categorize data as Latin/non-Latin in Python. I want the output to be 'columnname: Latin' if a column's values are Latin and 'columnname: Non-Latin' if they are non-Latin. Here's the data set I'm using:
name|company|address|ssn|creditcardnumber
Gauge J. Wiley|Crown Holdings|1916 Central Park Columbus|697-01-963|4175-0049-9703-9147
Dalia G. Valenzuela|Urs Corporation|8672 Cottage|Cincinnati|056-74-804|3653-0049-5620-71
هاها|Exide Technologies|هاها|Washington|139-09-346|6495-1799-7338-6619
I tried adding the below code. I don't get any error, but I get 'Latin' all the time. Is there any issue with the code?
if any(dataset.name.astype(str).str.contains(u'[U+0000-U+007F]')):
    print('Latin')
else:
    print('Non-Latin')
And also I'd be happy if someone could tell me how to display the output as "column name: Latin", column name being iterated from dataframe

It depends what you need: whether to flag a value as Latin only when all of its alphabetic characters are Latin, or whenever it contains any Latin letter. Either test can be turned into labels with numpy.where:
import pandas as pd
import numpy as np
import unicodedata as ud

df = pd.DataFrame({'name': [u"هاها", 'a', u"aهاها"]})

# https://stackoverflow.com/a/3308844
latin_letters = {}

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    return all(is_latin(uchr) for uchr in unistr if uchr.isalpha())

# Latin only if all alphabetic characters are Latin
df['new1'] = np.where(df['name'].map(only_roman_chars), 'Latin', 'Non-Latin')
# Latin if at least one Latin letter is present
df['new2'] = np.where(df.name.str.contains('[a-zA-Z]'), 'Latin', 'Non-Latin')
print(df)
    name       new1       new2
0   هاها  Non-Latin  Non-Latin
1      a      Latin      Latin
2  aهاها  Non-Latin      Latin
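To get the requested "column name: Latin" output, the same check can be applied column by column, something like this sketch (a small two-column dataset stands in for yours; a column is labelled Latin only if every value passes the all-characters-Latin test):

```python
import unicodedata as ud
import pandas as pd

def only_roman_chars(unistr):
    # True only if every alphabetic character has "LATIN" in its Unicode name
    return all('LATIN' in ud.name(ch) for ch in unistr if ch.isalpha())

dataset = pd.DataFrame({'name': ['Gauge J. Wiley', 'هاها'],
                        'company': ['Crown Holdings', 'Exide Technologies']})

# label each column: Latin only if every value passes the check
labels = {col: ('Latin' if dataset[col].astype(str).map(only_roman_chars).all()
                else 'Non-Latin')
          for col in dataset.columns}
for col, label in labels.items():
    print(f'{col}: {label}')
```

Swap `.all()` for `.any()` if a column should count as Latin when at least one value is Latin.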

Related

Searching for a string in an excel column

I am trying to find a string in an Excel spreadsheet, but it only captures the first row and neglects to search the rest.
In my code I am using Tkinter to get an input from the user, and a link_url() function to match it against each cell of a column in the Excel sheet; if it matches, I capture the value in another column of the same row.
Here is a sample of the Excel sheet index:
Name Link
0 ABC www.linkname1.com
1 DEF www.linkname2.com
2 GHI www.linkname3.com
3 JKL www.linkname4.com
4 MNO www.linkname5.com
5 PQR www.linkname6.com
6 STU www.linkname7.com
7 VWX www.linkname8.com
8 YZZ www.linkname9.com
9 123 www.linkname10.com
I created the following function to search for the input:
def link_url():
    df = pd.read_excel('Links.xlsx')
    for i in df.index:
        # print(df['Name'])
        # print(e.get())
        if e.get() in df['Name'][i]:
            print(df['Name'][i])
            link_url = df['Link'][i]
            known.append(e.get())
            return link_url
        else:
            unknown.append(e.get())
            unknown_request = "I will search and return back to you"
            return unknown_request
My Question
When I search for ABC it returns www.linkname1.com as requested, but when I search for DEF it returns "I will search and return back to you". Why is that happening and how can I fix it?
I may be misunderstanding the question a bit (Henry Ecker is right about the direct issue you are running into), but the solution with Pandas feels a bit weird to me.
I guess I'd personally do something more like this to filter a data frame to a specific row. I generally avoid for looping through data frames as much as I can.
import pandas as pd

my_data = pd.DataFrame(
    {'Name': ['ABC', 'DEF', 'GHI'],
     'Link': ['www.linkname1.com', 'www.linkname2.com', 'www.linkname3.com']}
)

keep = my_data.Name.eq('DEF')
result = my_data[keep]

if len(result) > 0:
    print(result.Link.values)
else:
    print("I will search and return back to you")
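For completeness, the direct bug in the original link_url is that the else branch returns on the first non-matching row, so the loop never gets past row 0. A minimal fix keeping the loop style is to return the fallback only after the whole loop has finished (a sketch: the Tkinter e.get() is replaced by a plain argument, and a small in-memory frame stands in for the Excel file):

```python
import pandas as pd

def link_url(query, df):
    # check every row; only give the "unknown" answer after the loop finishes
    for i in df.index:
        if query in df['Name'][i]:
            return df['Link'][i]
    return "I will search and return back to you"

links = pd.DataFrame({'Name': ['ABC', 'DEF', 'GHI'],
                      'Link': ['www.linkname1.com', 'www.linkname2.com',
                               'www.linkname3.com']})
print(link_url('DEF', links))
```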

Looking for words in a dataframe column that begin with "#" for each row, then adding them to a new column

I have a column called "Tweets". I want to extract all the hashtagged words and put them in a new column.
Here's the code I tried:
for row in df.split(' '):
    for word in row:
        if word.startswith('#'):
            return row
        else:
            return np.nan
Problem is it only returns one hashtag per row. So if a row has "#word1 and #word2" it only returns "#word1"
You might want to have a look at pandas' string functions like extractall() with regex. Example:
import pandas as pd

tweets = ["lorem ipsum #hashtag01 #hashtag02 #another_one",
          "#one ipsum #two lorem #some_more"]
df = pd.DataFrame(tweets, columns=["tweets"])
df.tweets.str.extractall(r"(#\w+)").unstack()
The (#\w+) captures, as groups, all strings that start with # followed by one or more consecutive word characters (\w+).
[OUT]
match           0           1             2
0      #hashtag01  #hashtag02  #another_one
1            #one        #two    #some_more
If you want to extract all hashtags into one single column and are sure that hashtags are always separated by a space (like your example suggests), then you can use this line of code:
df["hashtags"] = df.tweets.apply(lambda x: [x for x in x.split(" ") if x.startswith("#")])
[OUT]
0 [#hashtag01, #hashtag02, #another_one]
1 [#one, #two, #some_more]
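If you'd rather keep everything in one pandas pipeline rather than splitting on spaces yourself, str.findall collects all matches per row as a list, which .str.join can then flatten into one string per row (a variant of the extractall idea above):

```python
import pandas as pd

tweets = ["lorem ipsum #hashtag01 #hashtag02 #another_one",
          "#one ipsum #two lorem #some_more"]
df = pd.DataFrame(tweets, columns=["tweets"])

# findall returns a list of matches per row; join each list into one string
df["hashtags"] = df.tweets.str.findall(r"#\w+").str.join(", ")
print(df["hashtags"])
```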

If text is contained in another dataframe then flag row with a binary designation

I'm working on mining survey data. I was able to flag the rows for certain keywords:
survey['Rude'] = survey['Comment Text'].str.contains('rude', na=False, regex=True).astype(int)
Now, I want to flag any rows containing names. I have another dataframe that contains common US names.
Here's what I thought would work, but it is not flagging any rows, even though I have validated that names do exist in the 'Comment Text':
for row in survey:
    for word in survey['Comment Text']:
        survey['Name'] = 0
        if word in names['Name']:
            survey['Name'] = 1
You are not looping through the series correctly. for row in survey: loops through the column names in survey. for word in survey['Comment Text']: loops though the comment strings. survey['Name'] = 0 creates a column of all 0s.
You could use set intersections and apply(), to avoid all the looping through rows:
import pandas as pd

survey = pd.DataFrame({'Comment_Text': ['Hi rcriii',
                                        'Hi yourself stranger',
                                        'say hi to Justin for me']})
names = pd.DataFrame({'Name': ['rcriii', 'Justin', 'Susan', 'murgatroyd']})
s2 = set(names['Name'])

def is_there_a_name(s):
    s1 = set(s.split())
    if len(s1.intersection(s2)) > 0:
        return 1
    else:
        return 0

survey['Name'] = survey['Comment_Text'].apply(is_there_a_name)
print(names)
print(survey)
         Name
0      rcriii
1      Justin
2       Susan
3  murgatroyd

              Comment_Text  Name
0                Hi rcriii     1
1     Hi yourself stranger     0
2  say hi to Justin for me     1
As a bonus, return len(s1.intersection(s2)) to get the number of matches per line.
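A vectorized alternative is to build one alternation pattern from the name list and let str.contains do all the matching in a single pass (a sketch: re.escape guards against regex metacharacters in names, and the word boundaries stop names from matching inside longer words):

```python
import re
import pandas as pd

survey = pd.DataFrame({'Comment_Text': ['Hi rcriii',
                                        'Hi yourself stranger',
                                        'say hi to Justin for me']})
names = pd.DataFrame({'Name': ['rcriii', 'Justin', 'Susan', 'murgatroyd']})

# one alternation pattern like \b(?:rcriii|Justin|...)\b
pattern = r'\b(?:' + '|'.join(re.escape(n) for n in names['Name']) + r')\b'
survey['Name'] = survey['Comment_Text'].str.contains(pattern).astype(int)
print(survey)
```

Unlike the split-based approach, this also matches names adjacent to punctuation ("Hi rcriii!").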

Python Pandas: Dataframe is not updating using string methods

I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used the .str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, since they work on the same About column; and because the values are converted to lowercase first, the regex can be changed to remove everything that is not a lowercase letter or a space:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print (df)
About
0 aasd
1 sdd aa
import pandas as pd

columns = ['About']
data = ["ALPHA", "OMEGA", "ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example Dataframe:
>>> df
        About
0      JOHN23
1     PINKO22
2   MERRY jen
3  Soojan San
4      Remo55
Solution: another way, using a compiled regex:
>>> import re
>>> regex_pat = re.compile(r'[^a-z]+$')  # strip trailing non-letters; reconstructed from the explanation below
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0          john
1         pinko
2     merry jen
3    soojan san
4          remo
Name: About, dtype: object
Explanation:
[^a-z]+ matches a single character not present in the range a-z
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z: a single character in the range between a (index 97) and z (index 122) (case sensitive)
$ asserts position at the end of a line

Extract alpha letters between a pipe and a Japanese character, and join them with commas

I have some data that looks like this in a dataframe:
Japanese
--------
明日|Adverb の|Case 天気|Weather は|Case なんですか
Using Pandas, I am looking for a way to return this in a new column
Tag
------
Adverb, Case, Weather
So far I have been able to use
df['Tag'] = df.iloc[:, 0].str.replace('[^a-zA-Z]', ' ')
to get
Tag
------
Adverb Case Weather
but when I run
df['Tag'] = df['Tag'].str.replace(' ', ',')
I get
Tag
------
,,,,Adverb,,,Case,,,,Weather,,,Case,,,,,,
I think I'm supposed to use str.extract instead of replace, but I also get an error message in that case.
pandas.Series.str.findall

s = df.Japanese.str.findall('(?i)[a-z]+')
pd.Series([', '.join({*x}) for x in s], s.index)

0    Adverb, Weather, Case
dtype: object

Sorted:

s = df.Japanese.str.findall('(?i)[a-z]+')
pd.Series([', '.join(sorted({*x})) for x in s], s.index)

0    Adverb, Case, Weather
dtype: object
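To land the result back in the Tag column the question asks for, the same findall idea can be written as a single assignment (a sketch that assumes the tags are always runs of ASCII letters, as in your sample):

```python
import pandas as pd

df = pd.DataFrame({'Japanese': ['明日|Adverb の|Case 天気|Weather は|Case なんですか']})

# collect every run of ASCII letters, de-duplicate, sort, and comma-join
df['Tag'] = df['Japanese'].str.findall(r'[A-Za-z]+').apply(
    lambda tags: ', '.join(sorted(set(tags))))
print(df['Tag'])
```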
