With pandas in a Jupyter notebook I would like to delete everything that is not a letter or digit: hyphens, special characters, etc.
E.g.:
firstname,birthday_date
joe-down§,02-12-1990
lucash brown_ :),06-09-1980
^antony,11-02-1987
mary|,14-12-2002
change with:
firstname,birthday_date
joe down,02-12-1990
lucash brown,06-09-1980
antony,11-02-1987
mary,14-12-2002
I'm trying with:
df['firstname'] = df['firstname'].str.replace(r'!', '')
df['firstname'] = df['firstname'].str.replace(r'^', '')
df['firstname'] = df['firstname'].str.replace(r'|', '')
df['firstname'] = df['firstname'].str.replace(r'§', '')
df['firstname'] = df['firstname'].str.replace(r':', '')
df['firstname'] = df['firstname'].str.replace(r')', '')
......
......
df
It seems to work, but on more heavily populated columns I always miss some characters.
Is there a way to completely eliminate all non-text characters and keep only a single word (or several words) in the same column? In the example I used firstname to illustrate the idea, but it would also be needed for columns containing whole words!
Thanks!
P.S. It should also handle encoded emoticon text.
You can use regex for this.
df['firstname'] = df['firstname'].str.replace('[^a-zA-Z0-9]', ' ', regex=True).str.strip()
df.firstname.tolist()
>>> ['joe down', 'lucash brown', 'antony', 'mary']
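If runs of several special characters sit side by side, the single-character class above can leave double spaces inside a name. A small variant (a sketch, using the same sample names) quantifies the class with + so each run collapses to a single space:

```python
import pandas as pd

df = pd.DataFrame({'firstname': ['joe-down§', 'lucash brown_ :)', '^antony', 'mary|']})

# '+' makes each run of consecutive non-alphanumeric characters collapse
# into a single space instead of one space per character
df['firstname'] = df['firstname'].str.replace(r'[^a-zA-Z0-9]+', ' ', regex=True).str.strip()
print(df['firstname'].tolist())  # ['joe down', 'lucash brown', 'antony', 'mary']
```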
Try the below; it works on the names you used in your post:
first_names = ['joe-down§', 'lucash brown_', '^antony', 'mary|']
clean_names = []
keep = {'-', ' '}
for name in first_names:
    clean_names.append(''.join(c if c not in keep else ' '
                               for c in name if c.isalnum() or c in keep))
print(clean_names)
output
['joe down', 'lucash brown', 'antony', 'mary']
I have the following sequences of strings within a column in pandas:
SEQ
An empty world
So the word is
So word is
No word is
I can check the similarity using fuzzywuzzy or cosine distance.
However, I would like to know how to get information about the word which changes position from one row to another.
For example:
The similarity between the first row and the second one is 0, but there is similarity between rows 2 and 3.
They contain almost the same words in the same positions. I would like to visualize this change (the missing word) if possible, and similarly between the 3rd row and the 4th.
How can I see the changes between two rows/texts?
Assuming you're using jupyter / ipython and you are just interested in comparisons between a row and that preceding it I would do something like this.
The general concept is:
find shared tokens between the two strings (by splitting on ' ' and finding the intersection of two sets).
apply some html formatting to the tokens shared between the two strings.
apply this to all rows.
output the resulting dataframe as html and render it in ipython.
import pandas as pd
data = ['An empty world',
'So the word is',
'So word is',
'No word is']
df = pd.DataFrame(data, columns=['phrase'])
bold = lambda x: f'<b>{x}</b>'
def highlight_shared(string1, string2, format_func):
    shared_toks = set(string1.split(' ')) & set(string2.split(' '))
    return ' '.join([format_func(tok) if tok in shared_toks else tok
                     for tok in string1.split(' ')])
highlight_shared('the cat sat on the mat', 'the cat is fat', bold)
df['previous_phrase'] = df.phrase.shift(1, fill_value='')
df['tokens_shared_with_previous'] = df.apply(lambda x: highlight_shared(x.phrase, x.previous_phrase, bold), axis=1)
from IPython.core.display import HTML
HTML(df.loc[:, ['phrase', 'tokens_shared_with_previous']].to_html(escape=False))
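If you want to see the missing or added words directly rather than bolding the shared ones, the standard library's difflib can diff the token lists. A sketch (word_diff is a hypothetical helper, not part of the answer above); deleted words are marked [-...] and inserted ones [+...]:

```python
import difflib

def word_diff(a, b):
    """Mark word-level changes between two phrases using difflib."""
    a_toks, b_toks = a.split(), b.split()
    sm = difflib.SequenceMatcher(None, a_toks, b_toks)
    parts = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            parts.extend(a_toks[i1:i2])
        if op in ('delete', 'replace'):
            parts.extend(f'[-{tok}]' for tok in a_toks[i1:i2])
        if op in ('insert', 'replace'):
            parts.extend(f'[+{tok}]' for tok in b_toks[j1:j2])
    return ' '.join(parts)

print(word_diff('So the word is', 'So word is'))  # So [-the] word is
print(word_diff('So word is', 'No word is'))      # [-So] [+No] word is
```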
I need to update a list in Python, which is:
data = [{' Customers ','null,blank '},{' CustomersName ','max=50,null,blank '},{' CustomersAddress ','max=150,blank '},{' CustomersActive ','Active '}]
I want to write a lambda expression to store Customers, CustomersName, etc. in the list with the white space removed.
I am absolutely new to Python and don't have much knowledge!
As I see it, you have declared a dictionary inside a list, but the dict syntax is wrong; it should be {"key": "value"}. So I assume you need to change the items to lists, as such:
data = [[' Customers ','null,blank '],[' CustomersName ','max=50,null,blank '],[' CustomersAddress ','max=150,blank '],[' CustomersActive ','Active ']]
And then the following would get you your desired result:
data_NameExtracted = [x[0].strip() for x in data]
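Since the question specifically asked for a lambda expression, the same strip can also be phrased with map and a lambda (a sketch, using the list-of-lists form above):

```python
data = [[' Customers ', 'null,blank '],
        [' CustomersName ', 'max=50,null,blank '],
        [' CustomersAddress ', 'max=150,blank '],
        [' CustomersActive ', 'Active ']]

# map applies the lambda to every inner list; the lambda strips the first element
names = list(map(lambda row: row[0].strip(), data))
print(names)  # ['Customers', 'CustomersName', 'CustomersAddress', 'CustomersActive']
```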
You cannot do the whole thing inside a single lambda expression, but you can use a comprehension like this:
# please note that i have used tuples instead of sets,
# because sets are unordered
data = [
(' Customers ','null,blank '),
(' CustomersName ','max=50,null,blank '),
(' CustomersAddress ','max=150,blank '),
(' CustomersActive ','Active ')
]
# Indexing is not allowed for set objects
values = [item[0].strip() for item in data]
see:
https://wiki.python.org/moin/Generators
https://docs.python.org/3/tutorial/datastructures.html#sets
EDIT:
If you wan't to use dictionaries you could use something like this:
data = [
{' Customers ': 'null,blank '},
{' CustomersName ': 'max=50,null,blank '},
{' CustomersAddress ': 'max=150,blank '},
{' CustomersActive ': 'Active '}
]
# expecting a single value in each dict; dict_values does not support
# indexing in Python 3, so pull the value out with next(iter(...))
values = [next(iter(item.values())).strip() for item in data]
top_N = 100
words = review_tip['user_tip'].dropna()
words = words.astype(str)
words = words.str.replace('[{}]'.format(string.punctuation), '', regex=True)
words = words.str.lower().apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))
# replace '|'-->' ' and drop all stopwords
words = words.str.lower().replace([r'\|', RE_stopwords], [' ', ''], regex=True).str.cat(sep=' ').split()
# generate DF out of Counter
rslt = pd.DataFrame(Counter(words).most_common(top_N),
columns=['Word', 'Frequency']).set_index('Word')
print(rslt)
plt.clf()
# plot
rslt.plot.bar(rot=90, figsize=(16,10), width=0.8)
plt.show()
Frequency
Word
great 17069
food 16381
good 12502
service 11342
place 10841
best 9280
get 7483
love 7042
amazing 5043
try 4945
time 4810
go 4594
dont 4377
As you can see the words are singular, which is something I can use, but is it possible to count two words that could have been used together a lot?
For example, getting
dont go (this could occur 100 times)
instead of getting them separately:
dont 100
go 100
This will generate bi-grams; is this what you are looking for?
bi_grams = zip(words, words[1:])
It generates tuples, which are fine to use in the Counter, but you could also easily tweak the code to use ' '.join((a, b)).
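Putting it together with the Counter already used above, a sketch of counting the bi-grams (the words list here is a made-up sample, not your data):

```python
from collections import Counter

words = ['dont', 'go', 'there', 'dont', 'go', 'home', 'dont', 'go']

# pair each word with its successor, join the pair into one string, and count
bi_grams = Counter(' '.join(pair) for pair in zip(words, words[1:]))
print(bi_grams.most_common(3))
```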
OK, I've been at this one for hours; I admit defeat and beg for your mercy.
Goal: I have multiple files (bank statement downloads), and I want to merge, sort, and remove duplicates.
the downloads are in this format:
"08/04/2015","Balance","5,804.30","Current Balance for account 123S14"
"08/04/2015","Balance","5,804.30","Available Balance for account 123S14"
"02/03/2015","241.25","Transaction description","2,620.09"
"02/03/2015","-155.49","Transaction description","2,464.60"
"03/03/2015","82.00","Transaction description","2,546.60"
"03/03/2015","243.25","Transaction description","2,789.85"
"03/03/2015","-334.81","Transaction description","2,339.12"
"04/03/2015","-25.05","Transaction description","2,314.07"
One of my prime issues, aside from total ignorance of what I'm doing, is that the numerical values contain commas. I've successfully written code that strips such 'buried' commas out; I then strip the quotes so that I have a CSV line.
so I now have my data in this format
['02/03/2015', ' \t ', '241.25\t ', ' \t ', 'Transaction Details\n', '02/03/2015', ' \t ', ' \t ', '-155.49\t ', 'Transaction Details\n', '03/03/2015', ' \t ', '82.00\t ', ' \t ', 'Transaction Details\n', '03/03/2015', ' \t ', '243.25\t ', ' \t ', 'Transaction Details\n', '02/03/2015', ' \t ', '241.25\t ', ' \t ', 'Transaction Details\n']
which I believe makes it nearly ready to sort on the first element, but I think it's now one long list instead of a list of lists.
I researched sorts and found the lambda function, so I started to implement
new_file_data = sorted(new_file_data, key=lambda item: item[0])
but element [0] was just the " at the beginning of the line.
I also noted that I needed to tell the sort that the date was not necessarily in the correct format, which led me to this construct:
sorted(new_file_data, key=lambda d: datetime.strptime(d, '%d/%m/%Y'))
I loosely get the 'map' construct, but not how to combine things so that I can reference element [0] and compare it as a date.
And now I'm here; hopefully someone can push me over this hurdle?
I think I need to have split the list better to start with, so that each line is an element. I did at one point get a sorted result, but all the fields got globbed together: values (sorted), then dates, then words, etc.
So if anyone could offer some advice on my failed list manipulation and how to structure that sort-lambda.
thanks to those who have the time and know how to respond to such starter queries.
If I understand correctly you want to read the contents of the csv and sort them by date.
Given the contents of data.csv
"08/04/2015","Balance","5,804.30","Current Balance for account 123S14"
"08/04/2015","Balance","5,804.30","Available Balance for account 123S14"
"02/03/2015","241.25","Transaction description","2,620.09"
"02/03/2015","-155.49","Transaction description","2,464.60"
"03/03/2015","82.00","Transaction description","2,546.60"
"03/03/2015","243.25","Transaction description","2,789.85"
"03/03/2015","-334.81","Transaction description","2,339.12"
"04/03/2015","-25.05","Transaction description","2,314.07"
I would use the csv-module to read the data.
import csv
with open('data.csv') as f:
data = [row for row in csv.reader(f)]
Which gives:
>>> data
[['08/04/2015', 'Balance', '5,804.30', 'Current Balance for account 123S14'],
['08/04/2015', 'Balance', '5,804.30', 'Available Balance for account 123S14'],
['02/03/2015', '241.25', 'Transaction description', '2,620.09'],
['02/03/2015', '-155.49', 'Transaction description', '2,464.60'],
['03/03/2015', '82.00', 'Transaction description', '2,546.60'],
['03/03/2015', '243.25', 'Transaction description', '2,789.85'],
['03/03/2015', '-334.81', 'Transaction description', '2,339.12'],
['04/03/2015', '-25.05', 'Transaction description', '2,314.07']]
Then you can use the datetime-module to provide a key for sorting.
import datetime
sorted_data = sorted(data, key=lambda row: datetime.datetime.strptime(row[0], "%d/%m/%Y"))
Which gives:
>>> sorted_data
[['02/03/2015', '241.25', 'Transaction description', '2,620.09'],
['02/03/2015', '-155.49', 'Transaction description', '2,464.60'],
['03/03/2015', '82.00', 'Transaction description', '2,546.60'],
['03/03/2015', '243.25', 'Transaction description', '2,789.85'],
['03/03/2015', '-334.81', 'Transaction description', '2,339.12'],
['04/03/2015', '-25.05', 'Transaction description', '2,314.07'],
['08/04/2015', 'Balance', '5,804.30', 'Current Balance for account 123S14'],
['08/04/2015', 'Balance', '5,804.30', 'Available Balance for account 123S14']]
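For the duplicate-removal part of your goal, one common approach (a sketch, applied to parsed rows like the ones above) is an order-preserving scan with a seen set; the rows are lists, so they are converted to tuples to be hashable:

```python
rows = [
    ['02/03/2015', '241.25', 'Transaction description', '2,620.09'],
    ['02/03/2015', '241.25', 'Transaction description', '2,620.09'],  # exact duplicate
    ['03/03/2015', '82.00', 'Transaction description', '2,546.60'],
]

seen = set()
unique_rows = []
for row in rows:
    key = tuple(row)  # lists are unhashable; tuples can go in a set
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(unique_rows)  # the duplicate second row is dropped
```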
You can define your own sorting function.
Do a mix of these two questions and you'll have what you want (or something close):
Custom Python list sorting
Python date string to date object
In your sorting function, transform the date from string to datetime and compare:
import datetime
from functools import cmp_to_key

def cmp_items(a, b):
    datetime_a = datetime.datetime.strptime(a[0], "%d/%m/%Y").date()
    datetime_b = datetime.datetime.strptime(b[0], "%d/%m/%Y").date()
    if datetime_a > datetime_b:
        return 1
    elif datetime_a == datetime_b:
        return 0
    else:
        return -1
and then you just have to sort the list with it. Note that in Python 3 a comparison function must be wrapped with functools.cmp_to_key, and list.sort sorts in place (it returns None, so don't assign its result):
new_file_data.sort(key=cmp_to_key(cmp_items))
You may still have a small problem after that: elements with the same date will end up in an arbitrary order. You can extend the comparison function to compare more fields to prevent that.
BTW, you have not stripped the buried commas out; it seems you have completely removed the last field instead.
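To avoid the arbitrary ordering of same-date rows without writing a full comparison function, a key that returns a tuple sorts by parsed date first and falls back to the next field for ties. A sketch (note the tie-breaker compares the amounts as strings, not numbers):

```python
import datetime

rows = [
    ['03/03/2015', '243.25', 'Transaction description', '2,789.85'],
    ['03/03/2015', '-334.81', 'Transaction description', '2,339.12'],
    ['02/03/2015', '241.25', 'Transaction description', '2,620.09'],
]

# sort by parsed date first, then by the second field as a tie-breaker
rows.sort(key=lambda r: (datetime.datetime.strptime(r[0], '%d/%m/%Y'), r[1]))
print([r[0] for r in rows])  # ['02/03/2015', '03/03/2015', '03/03/2015']
```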