I am trying to read some data which is supposed to be tab-delimited, but I see a lot of #F0# in it.
How can I clean that text out?
Sample snippet
title=#F0#Sometimes#F0#the#F0#Grave#F0#Is#F0#a#F0#Fine#F0#and#F0#Public#F0#Place.#F0#|url=http://query.nytimes.com/gst/fullpage.html?
res=940DEFD71230F93BA15750C0A9629C8B63#F0#|quote=New#F0#Jersey#F0#is,#F0#indeed,#F0#a#F0#hom
e#F0#of#F0#poets.#F0#Walt#F0#Whitman's#F0#tomb#F0#is#F0#nestled#F0#in#F0#a#F0#wooded#F0#grov
e#F0#in#F0#the#F0#Harleigh#F0#Cemetery#F0#in#F0#Camden.#F0#Joyce#F0#Kilmer#F0#is#F0#buried#F
0#in#F0#Elmwood#F0#Cemetery#F0#in#F0#New#F0#Brunswick,#F0#not#F0#far#F0#from#F0#the#F0#New#F
0#Jersey#F0#Turnpike#F0#rest#F0#stop#F0#named#F0#in#F0#his#F0#honor.#F0#Allen#F0#Ginsberg#F0
#may#F0#not#F0#yet#F0#have#F0#a#F0#rest#F0#stop,#F0#but#F0#the#F0#Beat#F0#Generation#F0#auth
or#F0#of#F0#"Howl"#F0#is#F0#resting#F0#at#F0#B'Nai#F0#Israel#F0#Cemetery#F0#in#F0#Newark.#F0
#|work=The#F0#New#F0#York#F0#Times#F0#|date=March#F0#28,#F0#2004#F0#|accessdate=August#F0#21
Make title and res strings and then use [s.replace(old, new)][1]:
title="#F0#Sometimes#F0#the#F0#Grave#F0#Is#F0#a#F0#Fine#F0#and#F0#Public#F0#Place.#F0#|url=http://query.nytimes.com/gst/fullpage.html?"
res="""940DEFD71230F93BA15750C0A9629C8B63#F0#|quote=New#F0#Jersey#F0#is,#F0#indeed,#F0#a#F0#hom
e#F0#of#F0#poets.#F0#Walt#F0#Whitman's#F0#tomb#F0#is#F0#nestled#F0#in#F0#a#F0#wooded#F0#grov
e#F0#in#F0#the#F0#Harleigh#F0#Cemetery#F0#in#F0#Camden.#F0#Joyce#F0#Kilmer#F0#is#F0#buried#F
0#in#F0#Elmwood#F0#Cemetery#F0#in#F0#New#F0#Brunswick,#F0#not#F0#far#F0#from#F0#the#F0#New#F
0#Jersey#F0#Turnpike#F0#rest#F0#stop#F0#named#F0#in#F0#his#F0#honor.#F0#Allen#F0#Ginsberg#F0
#may#F0#not#F0#yet#F0#have#F0#a#F0#rest#F0#stop,#F0#but#F0#the#F0#Beat#F0#Generation#F0#auth
or#F0#of#F0#"Howl"#F0#is#F0#resting#F0#at#F0#B'Nai#F0#Israel#F0#Cemetery#F0#in#F0#Newark.#F0
#|work=The#F0#New#F0#York#F0#Times#F0#|date=March#F0#28,#F0#2004#F0#|accessdate=August#F0#21"""
title = title.replace('#F0#', '')
res = res.replace('#F0#', '')
Note that the marker is #F0# with a zero, not the letter O; with the wrong character, replace() silently matches nothing.
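Since the markers appear between words, replacing them with a space rather than deleting them outright may be closer to the original text; a quick sketch on a fragment of the sample:

```python
raw = 'New#F0#Jersey#F0#is,#F0#indeed,#F0#a#F0#home#F0#of#F0#poets.'

# The markers sit between words, so substituting a space
# keeps the words separated instead of gluing them together.
cleaned = raw.replace('#F0#', ' ')
print(cleaned)  # New Jersey is, indeed, a home of poets.
```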
Related
How to split a line of text by comma onto separate lines?
Code
text = "ACCOUNTNUMBER=Accountnumber,ACCOUNTSOURCE=Accountsource,ADDRESS_1__C=Address_1__C,ADDRESS_2__C"
fields = text.split(",")
text = "\n".join(fields)
Issue & Expected
But "\n" did not work. The expected result is that it adds new lines like this:
ACCOUNTNUMBER=Accountnumber,
ACCOUNTSOURCE=Accountsource,
ADDRESS_1__C=Address_1__C,
ADDRESS_2__C
Note: I run it on Google Colab
If you want the commas to stay, you can use this code:
text = "ACCOUNTNUMBER=Accountnumber,ACCOUNTSOURCE=Accountsource,ADDRESS_1__C=Address_1__C,ADDRESS_2__C"
fields = text.split(",")
print(",\n".join(fields))
Your original code should give this output:
ACCOUNTNUMBER=Accountnumber
ACCOUNTSOURCE=Accountsource
ADDRESS_1__C=Address_1__C
ADDRESS_2__C
But if you want to keep the commas, you should join with a comma plus "\n": use text = ",\n".join(fields) instead of text = "\n".join(fields).
So the final code should be:
text="ACCOUNTNUMBER=Accountnumber,ACCOUNTSOURCE=Accountsource,ADDRESS_1__C=Address_1__C,ADDRESS_2__C"
fields = text.split(",")
text = ",\n".join(fields)
print (text)
It will give your desired output.
A more cross-compatible way could be to use os.linesep. It's my understanding that this is safer for code that might run on Linux, Windows, and other OSes:
import os
print("hello" + os.linesep + "fren")
I tried using print and then it worked! Thank you all.
You can use replace():
text = "ACCOUNTNUMBER=Accountnumber,ACCOUNTSOURCE=Accountsource,ADDRESS_1__C=Address_1__C,ADDRESS_2__C"
print(text.replace(',',",\n"))
result:
ACCOUNTNUMBER=Accountnumber,
ACCOUNTSOURCE=Accountsource,
ADDRESS_1__C=Address_1__C,
ADDRESS_2__C
How exactly can I delete the characters after .jpg? Is there a way to differentiate between the extension I extract with Python and what follows it?
For example, I have a link like this:
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC
How can I delete everything after .jpg?
I tried replacing but it didn't work. Is there another way, maybe a formula to count characters or something similar?
I tried to get the jpg files with this:
import csv
import os

import requests
from bs4 import BeautifulSoup

for link in links:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'html.parser')
    img_links = []
    for img in soup.select('a.thumbnail img[src]'):
        print(img["src"])
        with open('links.csv', 'a', encoding='utf-8', newline='') as csv_file:
            file_is_empty = os.stat('links.csv').st_size == 0
            fieldname = ['links']
            writer = csv.DictWriter(csv_file, fieldnames=fieldname)
            if file_is_empty:
                writer.writeheader()
            writer.writerow({'links': img["src"]})
        img_links.append(img["src"])
You could use split (assuming the string contains '.jpg'; otherwise the code below just returns the original URL).
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
jpg_removed = string.split('.jpg')[0]+'.jpg'
Example
string = 'www.google.com'
com_removed = string.split('.com')[0]
# com_removed = 'www.google'
You can make use of a regular expression. You just want to ignore the characters after .jpg, so you can use something like this:
import re
new_url = re.findall(r"(.*\.jpg).*", old_url)[0]
(.*\.jpg) is a capturing group that matches any number of characters followed by .jpg. Since . has a special meaning, you need to escape the . in .jpg with a \. The trailing .* matches any remaining characters, but since it is not inside the capturing group () it gets matched without being extracted.
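Run against the URL from the question, that pattern extracts everything up to and including .jpg:

```python
import re

old_url = ('https://s13emagst.akamaized.net/products/29146/29145166/images/'
           'res_cd1fa80f252e88faa70ffd465c516741.jpg'
           '10DCC3DD9E74DC1D10104F623D7E9BDC')

# The capturing group grabs everything through '.jpg';
# the trailing .* consumes, but does not capture, the junk after it.
new_url = re.findall(r"(.*\.jpg).*", old_url)[0]
print(new_url)
```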
You can use the .find method to locate the characters .jpg, then slice the string to keep everything up to and including them. Ex:
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
index = string.find(".jpg")
new_string = string[:index + 4]
You have to add four because that is the length of ".jpg", so the slice does not delete it too.
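An alternative worth mentioning (not part of the answer above) is str.partition, which splits on the first occurrence of the separator and keeps it, so no index arithmetic is needed:

```python
url = ('https://s13emagst.akamaized.net/products/29146/29145166/images/'
       'res_cd1fa80f252e88faa70ffd465c516741.jpg'
       '10DCC3DD9E74DC1D10104F623D7E9BDC')

# partition returns (before, separator, after); keep the first two parts.
before, sep, _junk = url.partition('.jpg')
new_url = before + sep
print(new_url)  # ends with '.jpg', trailing junk removed
```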
The find() method returns the lowest index of the substring if it is found in the given string. If it is not found, it returns -1. (Note: the variable is named url here rather than str, to avoid shadowing the built-in.)
url = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
result = url.find('.jpg')
print(result)
new_url = url[:result]
print(new_url + '.jpg')
See: Extracting extension from filename in Python
Instead of extracting the extension, we extract the filename and add the extension (if we know it's always .jpg, it's fine!)
import os
filename, file_extension = os.path.splitext('/path/to/somefile.jpg_corruptedpath')
result = filename + '.jpg'
Now, outside of the original question, I think there might be something wrong with how you got that piece of information in the first place. There must be a better way to extract that jpeg without messing around with the path. Sadly I can't help you with that, since I am a novice with BeautifulSoup.
You could use a regular expression to replace everything after .jpg with an empty string:
import re
url ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
name = re.sub(r'(?<=\.jpg).*',"",url)
print(name)
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg
I am working on a project for school, but now with online instruction it is much harder to get help. I have a dataset in excel and there are links and emojis that I need to remove.
This is what my data looks like now. I want to get rid of the https://t.co/....... link, the emojis and some of the weird characters.
Does anyone have any suggestions on how to do this in excel? or maybe python?
I'm not sure how to do it in Excel; however, you can easily load the Excel file into a pandas.DataFrame and then use a regex to strip out the non-word characters:
import pandas as pd

file_path = '/some/path/to/file.xlsx'
df = pd.read_excel(file_path, index_col=0)
df = df.replace(r'\W+', '', regex=True)
Here you can find an extra explanation about loading an Excel file into a dataframe
Here you can read about more ways to ignore non-ascii chars in dataframe
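As a rough illustration of what r'\W+' matches (any run of characters that are not letters, digits, or underscores, which covers emojis and most punctuation but also spaces), here is the same substitution on a plain string; note that spaces are removed too, which may be more aggressive than you want:

```python
import re

s = 'Great product! 😀 https://t.co/abc123'
# \W+ matches runs of non-word characters: spaces, punctuation, emojis.
print(re.sub(r'\W+', '', s))  # Greatproducthttpstcoabc123
```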
According to this reference, I believe you could do a function like this:
def checkChars(inputString):
    outputString = ""
    allowedChars = [" ", "/", ":", ".", ",", ";"]  # The characters you want to include
    for l in inputString:
        if l.isalnum() or l in allowedChars:  # Check if the character is alphanumeric or in your allowed character list
            outputString += l
    return outputString
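The same filter can be written more compactly with a generator expression and str.join; a sketch assuming the same allowed-character list as above:

```python
def check_chars(input_string, allowed=" /:.,;"):
    # Keep alphanumeric characters plus the allowed punctuation;
    # everything else (emojis, links' slashes are allowed here) is dropped.
    return "".join(c for c in input_string if c.isalnum() or c in allowed)

print(check_chars("Nice view!😀"))  # Nice view
```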
I need to read a list of strings from a binary file and create a Python list.
I'm using the command below to extract data from the binary file:
tmp = f.read(100)
abc, = struct.unpack('100s', tmp)  # '100s' yields one 100-byte string; '100c' would yield 100 single bytes
The data that I can see in the variable abc is exactly as shown below, but I need to get it into a Python list of strings.
Data that I need as a list: 'UsrVal' 'VdetHC' 'VcupHC' ..... 'Gravity_Axis'
b'UsrVal\x00VdetHC\x00VcupHC\x00VdirHC\x00HdirHC\x00UpFlwHC\x00UxHC\x00UyHC\x00UzHC\x00VresHC\x00UxRP\x00UyRP\x00UzRP\x00VresRP\x00Gravity_Axis'
Here is how I would suggest doing it as a one-liner.
You need to decode the binary string and then split on "\x00", which returns the list you are looking for.
e.g.
my_binary_out = b'UsrVal\x00VdetHC\x00VcupHC\x00VdirHC\x00HdirHC\x00UpFlwHC\x00UxHC\x00UyHC\x00UzHC\x00VresHC\x00UxRP\x00UyRP\x00UzRP\x00VresRP\x00Gravity_Axis'
decoded_list = my_binary_out.decode("latin1", 'ignore').split('\x00')
#or
decoded_list = my_binary_out.decode("cp1252", 'ignore').split('\x00')
Output will look like this:
['UsrVal', 'VdetHC', 'VcupHC', 'VdirHC', 'HdirHC', 'UpFlwHC', 'UxHC', 'UyHC', 'UzHC', 'VresHC', 'UxRP', 'UyRP', 'UzRP', 'VresRP', 'Gravity_Axis']
Hope this helps
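One detail worth watching: if the fixed-size buffer is NUL-padded at the end, the split produces empty strings for the padding. Stripping trailing NULs first avoids that (a sketch with shortened, hypothetical data):

```python
padded = b'UsrVal\x00VdetHC\x00VcupHC\x00\x00\x00'  # NUL-padded buffer

# Drop the trailing padding before decoding and splitting.
names = padded.rstrip(b'\x00').decode('latin1').split('\x00')
print(names)  # ['UsrVal', 'VdetHC', 'VcupHC']
```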
If you're going for a quick and messy way here, AND assuming your string
b'UsrVal\x00VdetHC\x00VcupHC\x00VdirHC\x00HdirHC\x00UpFlwHC\x00UxHC\x00UyHC\x00UzHC\x00VresHC\x00UxRP\x00UyRP\x00UzRP\x00VresRP\x00Gravity_Axis'
is in fact interpreted as
" b'UsrVal\x00VdetHC\x00VcupHC\x00VdirHC\x00HdirHC\x00UpFlwHC\x00UxHC\x00UyHC\x00UzHC\x00VresHC\x00UxRP\x00UyRP\x00UzRP\x00VresRP\x00Gravity_Axis' "
Then the following few lines of code will leave b holding the list you want:
a = {YourStringHere}
b = a[2:-1].split("\x00")
I am trying to use the Google Translate API as below. The translation seems OK except for the apostrophe characters, which are returned as &#39; instead.
Is it possible to fix those? I can of course do post-processing, but I don't know whether any other special characters face the same problem.
This is how I perform translation right now:
import pandas as pd
import six
from google.cloud import translate
# Instantiates a client
#translate_client = translate.Client()
"""Translates text into the target language.
Target must be an ISO 639-1 language code.
See https://g.co/cloud/translate/v2/translate-reference#supported_languages
"""
translate_client_en_de = translate.Client(target_language="de")
translate_client_de_en = translate.Client(target_language="en")
target1="de"
target2="en"
#if isinstance(text, six.binary_type):
# text = text.decode('utf-8')
fname ='fname.tsv'
df = pd.read_table(fname,sep='\t')
for i, row in df.iterrows():
    text = row['Text']
    de1 = translate_client_en_de.translate(
        text, target_language=target1)
    text2 = de1['translatedText']
    en2 = translate_client_de_en.translate(
        text2, target_language=target2)
    text3 = en2['translatedText']
    print(text)
    print(text2)
    print(text3)
    print('----------')
    break
Sample output:
Simon&#39;s advice after he wouldn&#39;t
Simon&#39;s advice after
I solved it as follows:
Problem:
The problem is that you need to specify that you are using plain text and not HTML text.
Look at the documentation here: https://googleapis.dev/python/translation/latest/client.html, look for the 'translate' attribute and the 'format_' parameter.
Solution:
Just add the parameter format_='text'. In my case I wrote it like this:
result = translate_client.translate(text, target_language=target, format_='text')
and it works well, now the api returns the apostrophe correctly:
Before I got: 'Hello, we haven&#39;t seen each other in a long time'.
Now I get: 'Hello, we haven't seen each other in a long time'
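If you ever do need the post-processing route instead, the standard library's html.unescape converts &#39; and every other HTML entity in one call, so you don't have to enumerate special characters yourself:

```python
import html

text = 'Hello, we haven&#39;t seen each other in a long time'
# unescape decodes all HTML entities, named and numeric alike.
print(html.unescape(text))  # Hello, we haven't seen each other in a long time
```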