How to remove row completely when removing non-ascii characters? - python

I am using the code below to remove all non-English characters:
df.text.replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
where df has a column called text with content like below:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。\n
¡Hola miguel! Lamento mucho la confusión cau
expected output:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
For the rows where my code removes characters, I want to delete those rows from the df completely. That is, if the code replaces any non-English characters in a row, that row should be dropped entirely, to avoid keeping a row with either 0 characters or a few meaningless characters after it has been altered by the code above.

You can use
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
Pandas test:
import pandas as pd
df = pd.DataFrame({'text': ['hi what are you saying?', 'ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。'], 'another_col':['demo 1', 'demo 2']})
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
# text another_col
# 0 hi what are you saying? demo 1
Notes:
df['text'].str.contains(r'[^\x00-\x7F]') finds all values in the text column that contain a character outside the ASCII range (it is our "mask")
df[~...] keeps only the rows that did not match the regex.

str.contains() returns a Series of booleans that we can use to index our frame
patternDel = "[^\x00-\x7F]"
filter = df['text'].str.contains(patternDel)
I tend to keep the things we want rather than deleting rows. Since filter represents the rows we want to delete, we use ~ to get all the rows that don't match and keep them:
df = df[~filter]
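One caveat worth noting: if the text column contains missing values, str.contains returns NaN for those rows, and the resulting mask can no longer be inverted or used for indexing. Passing na=False treats missing values as non-matches; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'text': ['hi what are you saying?', 'ご承知のとおり', None]})

# na=False makes missing values count as "no non-ASCII found", so those
# rows are kept; pass na=True to treat them as matches and drop them
mask = df['text'].str.contains(r'[^\x00-\x7F]', na=False)
print(df[~mask])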

Related

Excel Steps Formula in Python

So I am venturing into Python scripts. I am a beginner, but I have been tasked with turning an Excel formula into Python code. I have a text file that contains 3+ million rows; each row has three columns and is delimited by tabs.
All rows come in as strings, and the first two columns have no problems. The problem with the third column is that, when downloaded, numerical content gets padded with 0s to make 18 characters.
In the same column, there are also values that contain a space in between, like 00 5372. Some are in fully text format, identified by letters or characters, like ABC3400 or 00-ab-fu-99 or 00-33-44-66.
A1    B1    Values                Output
AA00  dd    000000000000056484    56484
AB00  dd    00 564842             00 564842
AC00  dd    00-563554-f           00-563554-f
AD00  dd    STO45642              STO45642
AE00  dd    45632                 45632
I need to clean these kinds of codes so that the output text is clean, while:
Leaving the spaces in between,
Cleaning the leading and trailing spaces,
Cleaning the value if it is padded with 0's in front.
I do in excel for limited amount by using following formula.
=TRIM(IFERROR(IF((FIND(" ";A2))>=1;A2);TEXT(A2;0)))
*Semicolon due to regional language setting.
For large file, I use following power query steps.
= Table.ReplaceValue(#"Trimmed Text", each [Values], each if Text.Contains([Values]," ") then [Values] else if Number.From([Values]) is number then Text.TrimStart([Values],"0") else [Values],Replacer.ReplaceValue,{"Values"})
First trim, then replace values. This does the job in Power Query very well. Now I would like to do it with a Python script, but as a noob I am stuck at the very beginning. Can anyone help me with the library and code?
My end target is to get the data saved in txt/csv with cleaned values.
*Edited to correct point 1 (leaving, not removing) and to add further clarification with data.
Try the code below (replace column1, column2, column3 with the respective column names and assign the file path to the variable file_name; if the Python script and the data file are saved in the same location then the file name alone will be enough):
import pandas as pd
# The source is a tab-delimited text file, so read_csv (not read_excel) is the right reader;
# dtype=str keeps the zero padding intact so the .str methods below work
df = pd.read_csv(file_name, sep='\t', skipinitialspace=True, dtype=str)
df['column1'] = df['column1'].str.replace(' ','')
df['column2'] = df['column2'].str.replace(' ','')
df['column3'] = df['column3'].str.replace(' ','')
df.to_csv('output.csv',index=False)
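Note that the code above removes every space, while the question asks to keep inner spaces and strip zero padding only from purely numeric values. A minimal sketch of that full logic, mirroring the Power Query step (the file name here is an assumption; 'Values' is the column from the example table):
import pandas as pd

df = pd.read_csv('values.txt', sep='\t', dtype=str)  # hypothetical file name

def clean_value(v):
    if not isinstance(v, str):   # leave missing values untouched
        return v
    v = v.strip()                # drop leading/trailing spaces
    if ' ' in v:                 # keep values with inner spaces as-is
        return v
    if v.isdigit():              # purely numeric: strip the zero padding
        return v.lstrip('0') or '0'
    return v                     # mixed/text codes stay untouched

df['Values'] = df['Values'].map(clean_value)
df.to_csv('output.csv', index=False)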

Replace with Python regex in pandas column

a = "Other (please specify) (What were you trying to do on the website when you encountered ^f('l1')^?Â\xa0)"
There are many values starting with '^f' and ending with '^' in a pandas column. And I need to replace them like below :
"Other (please specify) (What were you trying to do on the website when you encountered THIS ISSUE?Â\xa0)"
You don't mention what you've tried already, nor what the rest of your DataFrame looks like, but here is a minimal example:
import pandas as pd
# Create a DataFrame with a single data point
df = pd.DataFrame(["... encountered ^f('l1')^?Â\xa0)"])
# Define a regex pattern
pattern = r'(\^f.+\^)'
# Use the .replace() method to replace
df = df.replace(to_replace=pattern, value='TEST', regex=True)
Output
0
0 ... encountered TEST? )
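One caveat: .+ is greedy, so a value containing more than one ^f(...)^ token would have everything between the first ^f and the last ^ replaced in one go. A non-greedy .+? handles that case, as this small sketch shows:
import re

s = "a ^f('l1')^ b ^f('l2')^ c"
print(re.sub(r'\^f.+\^', 'TEST', s))   # greedy: a TEST c
print(re.sub(r'\^f.+?\^', 'TEST', s))  # non-greedy: a TEST b TEST c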

Removing alphabets and special characters other than decimal point from a column in a pandas dataframe using regex not working as intended

I am trying to remove labels like "kg", "g", "pack", "packs" from a column so I can perform a numerical operation; however, for some reason it's dropping a lot of values.
https://pastebin.com/pRtKsAYL
This is the line which I am using to remove anything other than digits in the df.
df['boxes'] = df['boxes'].str.replace(r'[^\d.]+', '')
Over 600 entries in the boxes column get dropped even though they do not have any letters in them.
Is there something I am missing?
Thanks!
I found that if I added
concatenated['boxes'] = concatenated['boxes'].astype(str)
before
concatenated['boxes_cl'] = concatenated['boxes'].str.extract(r'(\d+)', expand=False)
it kind of works, but it added NaN to all the blank cells.
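The likely cause (not spelled out above): pandas .str methods return NaN for any entry that is not already a string, so rows holding plain numbers or missing values appear to be "dropped". Casting to str first, as found above, sidesteps that. A minimal reproduction with assumed data:
import pandas as pd

df = pd.DataFrame({'boxes': ['5 kg', 12, '3 packs', None]})

# .str methods return NaN for entries that are not strings, so the
# numeric 12 (and the missing value) appear to be "dropped":
print(df['boxes'].str.replace(r'[^\d.]+', '', regex=True))

# Casting to str first keeps every row; extract then pulls out the digits
# (missing values still come back as NaN, as noted above):
print(df['boxes'].astype(str).str.extract(r'(\d+)', expand=False))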

Python pandas replace value in column

When I look at the values in a column in my dataframe, I can see that due to user data entry errors, the same category has been entered incorrectly.
For my dataframe I use this code:
df['column_name'].value_counts()
output:
Targeted 523534
targeted 1
story 25425
story 2
multiple 2524543
For story, I guess there is a space?
I am trying to replace targeted with Targeted.
df['column_name'].replace("targeted","Targeted")
But nothing is happening, I still get the same value count.
Yes, it seems there are leading or trailing white-space(s).
You need str.strip first and then Series.replace or Series.str.replace:
df['column_name'] = df['column_name'].str.strip().replace("targeted","Targeted")
df['column_name'] = df['column_name'].str.strip().str.replace("targeted","Targeted")
Another possible solution is convert all characters to lowercase:
df['column_name'] = df['column_name'].str.strip().str.lower()
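A quick self-contained check of the first variant, using made-up data shaped like the counts in the question:
import pandas as pd

df = pd.DataFrame({'column_name': ['Targeted', 'targeted', 'story ', 'story', 'multiple']})
df['column_name'] = df['column_name'].str.strip().replace('targeted', 'Targeted')
print(df['column_name'].value_counts())
# Targeted    2
# story       2
# multiple    1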

sub string python pandas

I have a pandas dataframe that has a string column in it. The length of the frame is over 2 million rows and looping to extract the elements I need is a poor choice. My current code looks like the following
for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]
where "series_id" is the string containing the multiple information fields I want. To give an example data element:
columns:
[series_id, year, month, value, footnotes]
The data:
[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
['SMS01000000000000001' '2006' 'M02' 1970.4 '']
['SMS01000000000000001' '2006' 'M03' 1976.6 '']
However, series_id is the column of interest that I am struggling with. I have looked at the str.FUNCTION methods for Python and specifically pandas.
http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern
has a section describing each of the string functions; specifically, get & slice are the functions I would like to use. Ideally I could envision a solution like so:
table["state_code"] = table["series_id"].str.get(1:3)
or
table["state_code"] = table["series_id"].str.slice(1:3)
or
table["state_code"] = table["series_id"].str.slice([1:3])
When I try the functions above, I get invalid syntax for the ":".
But alas, I cannot seem to figure out the proper way to perform the vectorized operation for taking a substring of a pandas DataFrame column.
Thank you
I think I would use str.extract with some regex (which you can tweak for your needs):
In [11]: s = pd.Series(["SMU78000009092000001"])
In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]:
state_code area_code supersector_code
0 U78 0000 92
This reads as: starts (^) with any two characters (which are ignored), the next three (any) characters are state_code, followed by any character (ignored), followed by four digits are area_code, ...
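For completeness, positional slicing is itself vectorized in pandas, so the exact slices from the original loop can also be written without a regex; a short sketch using the column names from the question:
import pandas as pd

table = pd.DataFrame({"series_id": ["SMS01000000000000001", "SMU78000009092000001"]})

# .str supports Python slice syntax directly; .str.slice(start, stop) is equivalent
table["state_code"] = table["series_id"].str[2:4]
table["area_code"] = table["series_id"].str.slice(5, 9)
table["supersector_code"] = table["series_id"].str[11:12]
print(table)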
