I am trying to clean the lists stored in a column of my dataframe, removing all the terms that do not make sense. For example:
Col                New_Col
VM                 ['#']
JS                 ['/', '/UTENTI/', '//utilsit/promo', '/notifiche/']
www.facebook.com   ['https://www.facebook.com/', 'https://twitter.com/']
FA                 ['/nordest/venezia/', '/nordest/treviso/']
I would like to remove from each list (row) in the column all the words that:
- do not start with https, http, or //
- contain the row's Col value as a substring (for example, www.facebook.com is contained in https://www.facebook.com/, so that entry should be removed; it does not matter that it starts with https)
I tried to write this code:
prefixes = ['http', 'https', '//']
for word in df['New_Col']:
    if word.startswith(prefixes):
        list.remove(word)
print(df['New_Col'])
however it raises an AttributeError:
'list' object has no attribute 'startswith'
I think in my code above I am treating the column as a flat list of strings, instead of a column of lists.
Can you please help me understand how to do it?
Use DataFrame.apply on axis=1 along with a custom filter function fx:
import re

# keep only words that do not contain the row's Col value
# and that start with http, https, or //
fx = lambda s: [w for w in s['New_Col'] if s['Col'] not in w and re.match(r'^https?|//', w)]
df['New_Col'] = df.apply(fx, axis=1)
# print(df)
Col New_Col
0 VM []
1 JS [//utilsit/promo]
2 www.facebook.com [https://twitter.com/]
3 FA []
Make a function that removes the words you want using a regular expression, then apply it to the dataframe column as shown below:
df['ColName'].apply(lambda x: func(x))
Here func is the function that takes each value of the ColName column and returns the required result. (Passing the function directly, df['ColName'].apply(func), is equivalent and a bit cleaner.)
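For instance, a minimal sketch of such a function (the name clean_urls is made up; the regex mirrors the filter logic from the first answer):
import re

def clean_urls(words):
    # keep only entries that start with http, https, or //
    return [w for w in words if re.match(r'https?|//', w)]

df['New_Col'] = df['New_Col'].apply(clean_urls)
Note that a column-only apply like this can enforce the prefix rule but not the Col-substring rule, which needs the row-wise apply(..., axis=1) from the first answer.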
Create a function that can remove a comma from any given column.
Essentially I need a piece of code that removes the commas from all the values within a column, and in addition, the code should become a function so that the end-user can specify different column names to run this command on.
My code so far:
df = df.apply(lambda x: x.replace(',', ''))
print (df)
The code above is as far as I have gotten. Python accepts this piece of code, however when I print the df, the commas still show.
Once I get this to work, my next battle is understanding how I can target just one specific column rather than the whole dataset, and make this an interchangeable function for the end-user.
Bearing in mind that I am very new to Python coding, any explanations would be much appreciated.
Thanks!
Assuming this toy example:
df = pd.DataFrame([['123,2', 'ab,c', 'd', ',']], columns=list('ABCD'))
A B C D
0 123,2 ab,c d ,
You can use str.replace (plain replace would only substitute a cell whose full content is ,):
df = df.apply(lambda col: col.str.replace(',', ''))
output:
A B C D
0 1232 abc d
To target only one column:
df['A'] = df['A'].str.replace(',', '')
output:
A B C D
0 1232 ab,c d ,
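To make this an interchangeable function as the question asks, a minimal sketch (the helper name remove_commas is made up):
def remove_commas(df, col):
    # strip commas from one user-specified column
    df[col] = df[col].str.replace(',', '', regex=False)
    return df

df = remove_commas(df, 'A')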
I have a dataframe whose text indexes contain a piece of information that I want to copy into a list.
I don't know what the text is exactly (the word always changes), but I know where it is located in the index:
'point.subclase.optimum.R31.done'. R31 is the value I would like to write to a list, so I know that this text, which is always different, sits between point.subclase.optimum. and .done.
I've tried with:
info_list = []
for col in df.columns:
    if ('point.subclase.optimum.' in col) and ('.done' in col):
        info_list.append(col)
But that script just gives me the entire index label in the list.
Does anyone know how to solve it?
Use Series.str.extract, escaping the dots as \. because . is a special regex character, then remove possible missing values (indexes with no match) with Series.dropna, and finally convert the output to a list:
df = pd.DataFrame({'a':range(3)}, index=['point.subclase.optimum.R31.done',
'point.subclase',
'point.subclase.optimum.R98.done'])
print (df)
a
point.subclase.optimum.R31.done 0
point.subclase 1
point.subclase.optimum.R98.done 2
L = (df.index.str.extract(r'point\.subclase\.optimum\.(.*)\.done', expand=False)
.dropna()
.tolist())
print (L)
['R31', 'R98']
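If you prefer to keep the loop from your attempt, here is a sketch of the same extraction with the re module (assuming the labels live on df.index, as in the example above):
import re

pattern = re.compile(r'point\.subclase\.optimum\.(.*)\.done')
info_list = []
for label in df.index:
    m = pattern.search(label)
    if m:
        info_list.append(m.group(1))  # only the part between the markers
print(info_list)  # ['R31', 'R98']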
I've a dataframe with the following structure (3 columns):
DATE,QUOTE,SOURCE
2019-11-21,1ºTEST/2ºTEST DONE, KAGGLE
What I am trying to do is take a substring of the QUOTE column in order to generate a new column containing only the words after the last occurrence of a given word (in this case 'TEST').
My expected result:
DATE,QUOTE,STATUS,SOURCE
2019-11-21,1ºTEST/2ºTEST DONE, DONE, KAGGLE
For that I'm trying with the following code:
import pandas as pd
df = pd.read_excel (filename)
split = lambda x: len(x['QUOTE'].rsplit('TEST',1)[0])
df["STATUS"] = df.apply(split, axis=1)
print(df["STATUS"].unique())
However I'm just getting numbers printed, not 'DONE'.
What am I doing wrong?
Thanks!
In the definition of split you are using len, which returns the length of a sequence (an integer):
len([1, 'Done'])  # returns 2
You need to access the last element instead, for example:
df['STATUS'] = df.QUOTE.str.rsplit('TEST').str[-1]
print(df)
Output
DATE QUOTE SOURCE STATUS
0 2019-11-21 1ºTEST/2ºTEST DONE KAGGLE DONE
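One caveat: rsplit keeps the leading space from ' DONE', so the stored value is ' DONE' rather than 'DONE'. If that matters, a strip can be chained (a small tweak, not part of the original answer):
df['STATUS'] = df.QUOTE.str.rsplit('TEST').str[-1].str.strip()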
Or if you want to use apply, just change the definition of split:
split = lambda x: x['QUOTE'].rsplit('TEST', 1)[-1]
df["STATUS"] = df.apply(split, axis=1)
print(df)
Output
DATE QUOTE SOURCE STATUS
0 2019-11-21 1ºTEST/2ºTEST DONE KAGGLE DONE
Note that assigning a lambda to a name to create a function is considered poor practice; PEP 8 recommends a def statement instead.
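A def equivalent of the lambda above (the name split_status is arbitrary):
def split_status(row):
    # keep only the text after the last occurrence of 'TEST'
    return row['QUOTE'].rsplit('TEST', 1)[-1]

df['STATUS'] = df.apply(split_status, axis=1)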
Let's say I have a dataset, and in some columns of this dataset I have lists. The first key problem is that there are many columns with such lists, where the strings can be separated by ';' or ';;', and the string itself can start with whitespace or even ';'.
For some cases of this problem I implemented this snippet:
g = [';', '']
f = []
for index, row in data_a.iterrows():
    for x in row['column_1']:
        if x in g:
            norm = row['column_1'].split(x)
            f.append(norm)
            print(norm)
        else:
            pass
It actually worked, but the problem is that it returned duplicated rows and couldn't handle the other separators.
Another problem is using dummies after I changed the way the column values are stored:
column_values = data_a['column_1']
data_a.insert(loc=0, column='new_column_8', value=column_values)
dummies_new_win = pd.get_dummies(data_a['column_1'].apply(pd.Series).stack()).sum(level=0)
Instead of getting 40 columns in my case, I get 50 or 60, because I am not able to write a function that removes everything from the lists except alphabetic characters. I would like to understand how to implement such a function, because the same string meaning can be written in different ways:
name-Jack or name(Jack)
Desired output would look like this:
nameJack nameJack
I'm not sure if I understood you well, but to remove all non-alphanumeric characters you can use a simple regex.
Example:
import re
n = '-s;a-d'
re.sub(r'\W+', '', n)
Output: 'sad'
You can use str.replace for a pandas Series (pass regex=True explicitly, since newer pandas versions treat the pattern as a literal string by default).
df = pd.DataFrame({'names': ['name-Jack','name(Jack)']})
df
# names
# 0 name-Jack
# 1 name(Jack)
df['names'] = df['names'].str.replace(r'\W+', '', regex=True)
df
# names
# 0 nameJack
# 1 nameJack
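To tie this back to the get_dummies problem from the question, a sketch that normalizes each list entry before building the dummies (assuming column_1 holds lists of strings, as in the question; explode/groupby is a modern equivalent of the stack/sum approach above):
import re
import pandas as pd

data_a['column_1'] = data_a['column_1'].apply(
    lambda lst: [re.sub(r'\W+', '', s) for s in lst]
)
dummies = pd.get_dummies(data_a['column_1'].explode()).groupby(level=0).sum()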
I am currently working on an analysis of URLs and want to find URLs which match a specific word. Those URLs are in a pandas DataFrame column and I want to filter for specific words in the text of the URL.
What I did so far:
data['new'] = data['SOURCEURL'].str.extract("(" + "|".join(filter3) +")", expand=False)
The problem with this is that the filter I apply is an abbreviation ('ecb') which often also appears at the end of a link:
http://www.ntnews.com.au/news/national/senate-president-stephen-parry-believes-he-is-a-british-citizen/news-story/b2d3a3442544937f85508135401a3f84?nk=f19e52d2acd9588ecb494c03f21fed8c-1509598074
in the last '/'-section. How can I filter only for occurrences of 'ecb' that appear in text-like surroundings, something like www.xyz.com/news/national/ecb-press-realease/b2dse332313, and not extract the occurrence of ecb inside a hash as above? Is this possible in an easy way?
Thanks a lot!
Perhaps you could split the URL into words and filter out all words that are not in an English dictionary? For example, using PyEnchant:
import enchant
import re

d = enchant.Dict("en_US")
words = re.split(r'\W+', url)  # url: the URL string to check
filtered_words = [x for x in words if x and d.check(x)]
One easy solution is to check only the string before the last /:
df = pd.DataFrame({'SOURCEURL':['http://au/news/nat/cit/news-story/b2ecb',
'http://au/news/nat/cit/news-story/b2d88ecb494']})
print (df)
SOURCEURL
0 http://au/news/nat/cit/news-story/b2ecb
1 http://au/news/nat/cit/news-story/b2d88ecb494
filter3 = ['ecb']
df['new'] = (df['SOURCEURL'].str.rsplit('/', n=1).str[0]
               .str.extract("(" + "|".join(filter3) + ")", expand=False))
Another similar solution:
filter3 = ['ecb']
df['new'] = (df['SOURCEURL'].str.extract('(.*)/', expand=False)
.str.extract("(" + "|".join(filter3) +")", expand=False))
print (df)
SOURCEURL new
0 http://au/news/nat/cit/news-story/b2ecb NaN
1 http://au/news/nat/cit/news-story/b2d88ecb494 NaN
Another possible approach here: you're probably looking to exclude parameters passed at the end of the URL, which is likely the only place you'd see either a ? or an =.
In this case you can check each /-separated section of the URL and flag it if any section contains the term without a ? or =:
validation = any(sub in part and '?' not in part and '=' not in part
                 for part in url.split('/'))
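A sketch of applying this check to the question's DataFrame (the column name SOURCEURL comes from the question; the lambda wrapper is an assumption):
df['match'] = df['SOURCEURL'].apply(
    lambda url: any('ecb' in part and '?' not in part and '=' not in part
                    for part in url.split('/'))
)
This yields a boolean column; rows where it is True contain 'ecb' in a path segment rather than in a query parameter or hash.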