I am working on a project for school, but now with online instruction it is much harder to get help. I have a dataset in Excel, and there are links and emojis that I need to remove.
This is what my data looks like now. I want to get rid of the https://t.co/....... links, the emojis, and some of the weird characters.
Does anyone have any suggestions on how to do this in Excel? Or maybe Python?
I'm not sure how to do it in Excel; however, you can easily load the Excel file into a pandas.DataFrame and then use a regex to strip the non-ASCII chars:
import pandas as pd

file_path = '/some/path/to/file.xlsx'
df = pd.read_excel(file_path, index_col=0)
# \W+ matches runs of non-word characters (emojis, punctuation, whitespace)
df = df.replace(r'\W+', '', regex=True)
Here you can find an extra explanation about loading an Excel file into a DataFrame.
Here you can read about more ways to strip non-ASCII chars from a DataFrame.
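Since the goal is specifically the t.co links and the emojis, a more targeted variant (a minimal sketch; the path is a placeholder) could strip the URLs first and then anything outside the ASCII range:

import pandas as pd

df = pd.read_excel('/some/path/to/file.xlsx', index_col=0)
# Remove the t.co links first, then any character outside the ASCII range (emojis etc.)
df = df.replace(r'https://t\.co/\S+', '', regex=True)
df = df.replace(r'[^\x00-\x7F]+', '', regex=True)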
According to this reference, I believe you could do a function like this:
def checkChars(inputString):
    outputString = ""
    allowedChars = [" ", "/", ":", ".", ",", ";"]  # The characters you want to include
    for l in inputString:
        if l.isalnum() or l in allowedChars:  # Check if the character is alphanumeric or in your allowed character list
            outputString += l
    return outputString
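For example (note that because "/", ":" and "." are in allowedChars, links survive; remove them from the list if you also want to strip URLs):

print(checkChars("Hello 🌍! Visit https://t.co/abc"))
# Hello  Visit https://t.co/abc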
How to split a line of text by comma onto separate lines?
Code
text = "ACCOUNTNUMBER=Accountnumber,ACCOUNTSOURCE=Accountsource,ADDRESS_1__C=Address_1__C,ADDRESS_2__C"
fields = text.split(",")
text = "\n".join(fields)
Issue & Expected
But "\n" did not work. The result expected is that it adds new lines like:
ACCOUNTNUMBER=Accountnumber,
ACCOUNTSOURCE=Accountsource,
ADDRESS_1__C=Address_1__C,
ADDRESS_2__C
Note: I am running it on Google Colab.
If you want the commas to stay, you can use this code:
text = "ACCOUNTNUMBER=Accountnumber,ACCOUNTSOURCE=Accountsource,ADDRESS_1__C=Address_1__C,ADDRESS_2__C"
fields = text.split(",")
print(",\n".join(fields))
Your code should give this output:
ACCOUNTNUMBER=Accountnumber
ACCOUNTSOURCE=Accountsource
ADDRESS_1__C=Address_1__C
ADDRESS_2__C
But if you want to separate it by commas (,), you should join with a comma plus \n: use text = ",\n".join(fields) instead of text = "\n".join(fields).
So the final code should be:
text="ACCOUNTNUMBER=Accountnumber,ACCOUNTSOURCE=Accountsource,ADDRESS_1__C=Address_1__C,ADDRESS_2__C"
fields = text.split(",")
text = ",\n".join(fields)
print (text)
It will give your desired output.
A more cross-compatible way could be to use os.linesep. It's my understanding that it's safer for code that might run on Linux, Windows, and other OSes (though note that print writes to a text-mode stream, where \n is already translated to the platform's line ending automatically):
import os
print("hello" + os.linesep + "fren")
I tried to use print and then it worked! Thanks, all you guys. (In Colab, merely evaluating text in a cell displays its repr with a literal \n; print(text) actually renders the newlines.)
You can use replace():
text = "ACCOUNTNUMBER=Accountnumber,ACCOUNTSOURCE=Accountsource,ADDRESS_1__C=Address_1__C,ADDRESS_2__C"
print(text.replace(',',",\n"))
result:
ACCOUNTNUMBER=Accountnumber,
ACCOUNTSOURCE=Accountsource,
ADDRESS_1__C=Address_1__C,
ADDRESS_2__C
I am running a script that takes the names from a CSV file and populates them into individual Word documents from a template. I got that part working. But here is where I need a bit of help.
Some cells in the CSV file contain two names, such as "Bobby & Sammy." When I go check the populated Word document, it only has "Bobby Sammy." I know that "&" is a special character, but I am not sure what I have to do for it to populate the Word documents correctly.
Any and all help is appreciated.
Edit: Code
import random
import time
import pandas as pd
from docxtpl import DocxTemplate

csvfn = "Addresses.csv"
df = pd.read_csv('Addresses.csv')

def mkw(n):
    tpl = DocxTemplate('Envelope_Template.docx')
    df_to_doct = df.to_dict()
    x = df.to_dict(orient='records')
    context = x
    tpl.render(context[n])
    tpl.save("%s.docx" % str(n))
    wait = time.sleep(random.randint(1, 3))
~
print ("There will be ", df2, "files")
~
for i in range(0, df2):
    print("Making file: ", f"{i},", "..Please Wait...")
    mkw(i)

print("Done! - Now check your files")
~ denotes a new cell; I am using JupyterLab.
The file is a standard CSV file. (Screenshots: the standard CSV file; without "&" it prints fine; an empty space appears where "&" is supposed to be.)
I'm in need of some knowledge on how to fix an error I have made while collecting data. The collected data has the following structure:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
I normally wouldn't have added "[" or "]" to the .txt file when writing the data to it, line by line. However, the mistake was made, and thus when loading the file pandas separates it the wrong way.
Is there a way to load the data properly into pandas?
On the snippet that I can cut and paste from the question (which I named test.txt), I could successfully read a dataframe via:
Purging the square brackets (with sed on a Linux command line, but this can be done with a text editor, or in Python if need be):
sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
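If sed is not available, a minimal Python equivalent of the two commands (same assumption that the brackets sit at the start and end of each line) might be:

with open('test.txt') as f:
    lines = [line.rstrip('\n').lstrip('[').rstrip(']') for line in f]
with open('test.txt', 'w') as f:
    f.write('\n'.join(lines) + '\n')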
Loading the dataframe (in a Python console):
import pandas as pd
pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
(not sure that this will work for the entirety of your file though).
Consider the code below, which reads the text in myfile.txt, which looks like this:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]
The code below removes [ and ] from the text and then splits every line on , , excluding the first line, which holds the headers. Some Message values contain commas themselves, which would otherwise spill over into an extra column (NaN elsewhere), so the code joins everything after the first comma back into one string, as intended.
Code:
import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()

text = text.replace("[", "")
text = text.replace("]", "")
df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in text.split('\n')[1:]],
    'Message': [''.join(i.split(',')[1:]) for i in text.split('\n')[1:]]
}).applymap(lambda x: x.replace('"', ''))
Output:
Author Message
0 littleblackcat There's a lot of redditors here that live in the area maybe/hopefully someone saw something.
1 Kruse In other words it's basically creating a mini tornado.
Here are a few more options to add to the mix:
You could parse the lines yourself using ast.literal_eval, and then load them into a pd.DataFrame directly using an iterator over the lines:
import pandas as pd
import ast

with open('data', 'r') as f:
    lines = (ast.literal_eval(line) for line in f)
    header = next(lines)
    df = pd.DataFrame(lines, columns=header)

print(df)
Note, however, that calling ast.literal_eval once for each line may not be very fast, especially if your data file has a lot of lines. However, if the data file is not too big, this may be an acceptable, simple solution.
Another option is to wrap an arbitrary iterator (which yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) allows you to manipulate the contents of any file and then re-package it into a file-like object. Thus, you can fix the contents of the file, and yet still pass it to any function which expects a file-like object, such as pd.read_csv. (Note: I've answered a similar question using the same tool, here.)
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def clean(f):
    for line in f:
        yield line.strip()[1:-1] + b'\n'

with open('data', 'rb') as f:
    # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
    df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')

print(df)
A pure pandas option is to change the separator from , to ", " in order to have only 2 columns, and then strip the unwanted characters, which to my understanding are [, ], " and spaces:
import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''
df = pd.read_csv(io.StringIO(string), sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure Python does not interpret it as an end-of-string character
df.columns = [df.columns[0][2:], df.columns[1][:-2]]
print(df)
# Output (note that the space before "There's" is also gone):
# Author Message
# 0 littleblackcat There's a lot of redditors here that live in t...
# 1 Kruse In other words, it's basically creating a mini...
For now, the following solution was found:
sep = '[|"|]'
Using a multi-character separator allowed the brackets to be stored in separate columns of the pandas dataframe, which were then dropped. This avoids having to strip the words line by line.
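A minimal sketch of how that could look (assuming the file is named data.txt; pandas treats a multi-character sep as a regex, so this pattern splits on the quote and pipe characters and leaves the brackets in columns of their own):

import pandas as pd

df = pd.read_csv('data.txt', sep='[|"|]', engine='python')
# Keep only the real columns; the bracket/separator columns are dropped
df = df[['Author', 'Message']]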
Here is my code
import re

with open('newfiles.txt') as f:
    k = f.read()

p = re.compile(r'[\w\:\-\.\,\']+|[^[\w\:\-\.\'\,]\s]')
originaltext = p.findall(k)

uniquelist = []
for word in originaltext:
    if word not in uniquelist:
        uniquelist.append(word)

indexes = ' '.join(str(uniquelist.index(word)+1) for word in originaltext)
n = p.findall(indexes)

file = open("newfiletwo.txt","w")
file.write(' '.join(str(e) for e in n))
file.close()
file = open("newfilethree.txt","w")
file.write(' '.join(uniquelist))
file.close()
with open('newfiletwo.txt') as f:
    indexess = f.read()
with open('newfilethree.txt') as f:
    differentwords = f.read()

differentwords = p.findall(differentwords)
indexess = [uniquelist.index(word) for word in originaltext]

for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
    i = differentwords.index(word)
    indexess.append(i)

s = ""  # the reconstructed sentence
for i in indexess:
    s = s + differentwords[i] + " "
print(s)
The program basically takes an external text file, computes the index of each word's position (if any word repeats, the first position is taken), and then saves the positions to an external file. While doing this, I have split up the text file (including splitting off punctuation) and saved the distinct words and punctuation that occur in the file to an external file too. Now for the hard part: using both of these external files, the indexes and the distinct separated words, I am trying to recreate the original text file, including the punctuation. But the error shown in the title occurs:
Traceback (most recent call last):
  File "E:\Python\Index.py", line 31, in <module>
    s = s + differentwords[i] + " "
IndexError: list index out of range
Not trying to sound rude, but I am a beginner of sorts; please try to change as little as possible, in a simple way, since I created this myself. You may well know a far shorter way to do this, but this is the level of simplicity I can handle, as the length of the code proves. I have tried shortening the original text file, but that was no use. Does anyone know why the error occurs and how to fix it? I am not looking for efficiency right now, maybe after another couple of months of learning, but the simplest answer (I don't mind long) will be the best. Sorry if I have repeated myself a lot. :-)
'newfiles' - A bunch of sentences with punctuation
UPDATE
The code no longer shows the error, but it prints the original sentence twice. The error went away after removing the +1 on line 23. Does anyone know why the output repeats twice, though?
The problem is how you decide what qualifies as a word and what does not. For instance, is a comma part of a word? In your case it is not treated as such, while it is also not a separator, so you end up with the comma, the dot, and so on as separate words. I have no access to your input, so I can only provide a sample:
p = re.compile(r'[\w\:\-\.\,]+|[^[\w\:\-\.\,]\s]')
There is one catch in this case: 'Word', 'word', 'Word.', and 'word,' are all separate words, since the dot and the comma become part of the word. You can't eat your cake and have it. To fix that, you would need to store information about whether there is whitespace before each separator.
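For example (a hypothetical snippet, just to show what the pattern produces):

import re

p = re.compile(r'[\w\:\-\.\,]+|[^[\w\:\-\.\,]\s]')
print(p.findall("Word, word. Word"))
# ['Word,', 'word.', 'Word']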
UPDATE:
Oh, yes: the double output. The files that are stored along the way are OK, so something went wrong after that point. Look at these two lines:
i = differentwords.index(word)
indexess.append(i)
They need to be inside the preceding if statement.
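Applied to the loop from the question, that looks like this (a sketch of the suggested fix):

for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
        i = differentwords.index(word)
        indexess.append(i)

Since differentwords, as read back from newfilethree.txt, already contains every word, the if branch never fires, no duplicate indices are appended, and the sentence prints only once.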
I am trying to read some data which is supposed to be tab-delimited, but I see a lot of #F0# in it.
I was wondering how I can clean that text out?
Sample snippet
title=#F0#Sometimes#F0#the#F0#Grave#F0#Is#F0#a#F0#Fine#F0#and#F0#Public#F0#Place.#F0#|url=http://query.nytimes.com/gst/fullpage.html?
res=940DEFD71230F93BA15750C0A9629C8B63#F0#|quote=New#F0#Jersey#F0#is,#F0#indeed,#F0#a#F0#hom
e#F0#of#F0#poets.#F0#Walt#F0#Whitman's#F0#tomb#F0#is#F0#nestled#F0#in#F0#a#F0#wooded#F0#grov
e#F0#in#F0#the#F0#Harleigh#F0#Cemetery#F0#in#F0#Camden.#F0#Joyce#F0#Kilmer#F0#is#F0#buried#F
0#in#F0#Elmwood#F0#Cemetery#F0#in#F0#New#F0#Brunswick,#F0#not#F0#far#F0#from#F0#the#F0#New#F
0#Jersey#F0#Turnpike#F0#rest#F0#stop#F0#named#F0#in#F0#his#F0#honor.#F0#Allen#F0#Ginsberg#F0
#may#F0#not#F0#yet#F0#have#F0#a#F0#rest#F0#stop,#F0#but#F0#the#F0#Beat#F0#Generation#F0#auth
or#F0#of#F0#"Howl"#F0#is#F0#resting#F0#at#F0#B'Nai#F0#Israel#F0#Cemetery#F0#in#F0#Newark.#F0
#|work=The#F0#New#F0#York#F0#Times#F0#|date=March#F0#28,#F0#2004#F0#|accessdate=August#F0#21
Make title and res strings and then use s.replace(old, new):
title="#F0#Sometimes#F0#the#F0#Grave#F0#Is#F0#a#F0#Fine#F0#and#F0#Public#F0#Place.#F0#|url=http://query.nytimes.com/gst/fullpage.html?"
res="""940DEFD71230F93BA15750C0A9629C8B63#F0#|quote=New#F0#Jersey#F0#is,#F0#indeed,#F0#a#F0#hom
e#F0#of#F0#poets.#F0#Walt#F0#Whitman's#F0#tomb#F0#is#F0#nestled#F0#in#F0#a#F0#wooded#F0#grov
e#F0#in#F0#the#F0#Harleigh#F0#Cemetery#F0#in#F0#Camden.#F0#Joyce#F0#Kilmer#F0#is#F0#buried#F
0#in#F0#Elmwood#F0#Cemetery#F0#in#F0#New#F0#Brunswick,#F0#not#F0#far#F0#from#F0#the#F0#New#F
0#Jersey#F0#Turnpike#F0#rest#F0#stop#F0#named#F0#in#F0#his#F0#honor.#F0#Allen#F0#Ginsberg#F0
#may#F0#not#F0#yet#F0#have#F0#a#F0#rest#F0#stop,#F0#but#F0#the#F0#Beat#F0#Generation#F0#auth
or#F0#of#F0#"Howl"#F0#is#F0#resting#F0#at#F0#B'Nai#F0#Israel#F0#Cemetery#F0#in#F0#Newark.#F0
#|work=The#F0#New#F0#York#F0#Times#F0#|date=March#F0#28,#F0#2004#F0#|accessdate=August#F0#21"""
title = title.replace('#F0#', '')
res = res.replace('#F0#', '')
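Note that replacing #F0# with an empty string fuses adjacent words together ("SometimestheGrave..."). Since the sample suggests #F0# stands in for spaces (an assumption about the encoding artifact), replacing it with a single space reconstructs the text more faithfully:

# Assumption: #F0# is an artifact standing in for a space
title = title.replace('#F0#', ' ').strip()
res = res.replace('#F0#', ' ').strip()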