How to drop rows by condition on string value in pandas dataframe? - python

Consider a Pandas Dataframe like:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com', 'http://www.url2.com','http://www.url3.com','http://www.url1.com']))
>>> df
Giving:
url
0 http://url1.com
1 http://www.url1.com
2 http://www.url2.com
3 http://www.url3.com
4 http://www.url1.com
I want to remove all rows whose url contains url1.com or url2.com, to obtain a dataframe like:
url
0 http://www.url3.com
I do this
domainToCheck = ('url1.com', 'url2.com')
goodUrl = df['url'].apply(lambda x : any(domain in x for domain in domainToCheck))
But this gives me no result.
Any idea how to solve the above problem?
Edit: Solution
import pandas as pd
import tldextract
df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com','http://www.url2.com','http://www.url3.com','http://www.url1.com']))
domainToCheck = ['url1', 'url2']
s = df.url.map(lambda x : tldextract.extract(x).domain).isin(domainToCheck)
df = df[~s].reset_index(drop=True)
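Note that tldextract is a third-party package and has to be installed separately (e.g. pip install tldextract); it splits a URL into subdomain, domain and suffix, which is what makes the exact domain comparison above possible.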

If we are checking the domain, we should match the domain exactly rather than use a string-contains check, since a subdomain may contain the same keyword as the domain.
import tldextract
s=df.url.map(lambda x : tldextract.extract(x).domain).isin(['url1','url2'])
Out[594]:
0 True
1 True
2 True
3 False
4 True
Name: url, dtype: bool
df=df[~s]

Use Series.str.contains to create a boolean mask m, then filter the dataframe df using this mask:
m = df['url'].str.contains('|'.join(domainToCheck))
df = df[~m].reset_index(drop=True)
Result:
url
0 http://www.url3.com
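One caveat worth adding (my own note, not part of the answer above): str.contains treats the pattern as a regular expression by default, so the . in url1.com matches any character. If a literal match is wanted, escaping each domain is safer; a minimal sketch:
import re
import pandas as pd

df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url2.com', 'http://www.url3.com']))
domainToCheck = ('url1.com', 'url2.com')

# re.escape makes the '.' in each domain match a literal dot instead of any character
pattern = '|'.join(re.escape(domain) for domain in domainToCheck)
m = df['url'].str.contains(pattern)
df = df[~m].reset_index(drop=True)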

You can use pd.Series.str.contains here.
df[~df.url.str.contains('|'.join(domainToCheck))]
url
3 http://www.url3.com
If you want to reset the index, use this:
df[~df.url.str.contains('|'.join(domainToCheck))].reset_index(drop=True)
url
0 http://www.url3.com

Related

Pandas isin() returning all false

I'm using pandas 1.1.3, the latest available with Anaconda.
I have two DataFrames, imported from a .txt and a .xlsx file. They have a column called "ID" which is an int64 (verified with df.info()) on both DataFrames.
df1:
ID Name
0 1234564567 Last, First
1 1234564569 Last, First
...
df2:
ID Amount
0 1234564567 59.99
1 5678995545 19.99
I want to check whether all of the IDs in df1 are also in df2. For this I create a series:
foo = df1["ID"].isin(df2["ID"])
And I get that all values are False, even though manually I checked and the values do match.
0 False
1 False
2 False
3 False
4 False
...
I'm not sure if I'm missing something, if there is something wrong with the environment, or if it is a known bug.
You must be doing something wrong. Try to reproduce the error with a toy example, as I did here; the code below works for me.
Reproducing and sharing a minimal example not only lets you challenge your own error, it also makes it much easier for others to help.
import pandas as pd
import numpy as np
data = {'Name':['Tom', 'nick'], 'ID':[1234564567, 1234564569]}
data2 = {'Name':['Tom', 'nick'], 'ID':[1234564567, 5678995545]}
# Create DataFrame
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
df["ID"].isin(df2["ID"])
0 True
1 False
Name: ID, dtype: bool
EDIT: with Paul's data I don't get any error. See the importance of providing examples?
import pandas as pd
data = {'ID':['1234564567', '1234564569'],'Name':['Last, First', 'Last, First']}
data2 = {'ID':['1234564567', '5678995545'],'Amount': [59.99, 19.99]}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
df1["ID"].isin(df2["ID"])
0 True
1 False
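If the toy examples work but the real files still come back all False, one thing worth double-checking (an assumption on my part, not something shown in the question) is that the two ID columns really end up with the same dtype and no stray whitespace after loading from the .txt and .xlsx files:
# quick sanity check on the real data: dtypes and a sample of raw values
print(df1["ID"].dtype, df2["ID"].dtype)
print(df1["ID"].head().tolist(), df2["ID"].head().tolist())

# normalising both sides to stripped strings rules out dtype/whitespace mismatches
foo = df1["ID"].astype(str).str.strip().isin(df2["ID"].astype(str).str.strip())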
import pandas as pd
data = {'ID':['1234564567', '1234564569'],'Name':['Last, First', 'Last, First']}
data2 = {'ID':['1234564567', '5678995545'],'Amount': [59.99, 19.99]}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
Now that we have that set up, we get to the meat...
df1["ID"].apply(lambda x: df2['ID'].isin([x]))
Which shows
0 1
0 True False
1 False False
This shows that ID 0 of df1 is present at ID 0 of df2, while ID 1 of df1 matches nothing in df2.

Python, extracting a string between two specific characters for all rows in a dataframe

I am currently trying to write a function that will extract the string between 2 specific characters.
My data set contains emails only, which look like this: pstroulgerrn#time.com.
I am trying to extract everything after the # and before the ., so that the email above would output time.
Here is my code so far :
new = df_personal['email'] # 1000x1 dataframe of emails
def extract_company(x):
    y = [ ]
    y = x[x.find('#')+1 : x.find('.')]
    return y
extract_company(new)
Note : If I change new to df_personal['email'][0] the correct output is displayed for that row.
However, when trying to do it for the entire dataframe, I get an error saying :
AttributeError: 'Series' object has no attribute 'find'
You can extract a series of all matching texts using regex:
import pandas as pd
df = pd.DataFrame( ['kabawonga#something.whereever','kabawonga#omg.whatever'])
df.columns = ['email']
print(df)
k = df["email"].str.extract(r"#(.+)\.")
print(k)
Output:
# df
email
0 kabawonga#something.whereever
1 kabawonga#omg.whatever
# extraction
0
0 something
1 omg
See pandas.Series.str.extract
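A small follow-up (my own addition, not from the answer above): giving the capture group a name makes str.extract label the resulting column, which avoids the anonymous 0 column in the output:
# same extraction, but the named group becomes the column name "domain"
k = df["email"].str.extract(r"#(?P<domain>.+)\.")
print(k)
#       domain
# 0  something
# 1        omg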
Try:
df_personal["domain"]=df_personal["email"].str.extract(r"\#([^\.]+)\.")
Outputs (for the sample data):
import pandas as pd
df_personal=pd.DataFrame({"email": ["abc#yahoo.com", "xyz.abc#gmail.com", "john.doe#aol.co.uk"]})
df_personal["domain"]=df_personal["email"].str.extract(r"\#([^\.]+)\.")
>>> df_personal
email domain
0 abc#yahoo.com yahoo
1 xyz.abc#gmail.com gmail
2 john.doe#aol.co.uk aol
You can do it with an apply function, by first splitting on the . and then on the # for each row:
Snippet:
import pandas as pd
df = pd.DataFrame( ['abc#xyz.dot','def#qwe.dot','def#ert.dot.dot'])
df.columns = ['email']
df["domain"] = df["email"].apply(lambda x: x.split(".")[0].split("#")[1])
Output:
df
Out[37]:
email domain
0 abc#xyz.dot xyz
1 def#qwe.dot qwe
2 def#ert.dot.dot ert
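If some rows might not contain a # at all, the lambda above raises an IndexError. A hedged alternative (my sketch, not part of the answer) is to stay with the vectorised .str accessor, which yields NaN for such rows instead:
# vectorised version of the same split; rows without a '#' become NaN instead of erroring
df["domain"] = df["email"].str.split("#").str[1].str.split(".").str[0]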

Dropping duplicate values in a column

I have a frame like:
df = pd.DataFrame({'America':["24,23,24,24","10","AA,AA, XY"]})
I tried to convert it to a list, a set, etc., but couldn't handle it.
How can I drop the duplicates?
Use custom function with split and set:
df['America'] = df['America'].apply(lambda x: set(x.split(',')))
Another solution is use list comprehension:
df['America'] = [set(x.split(',')) for x in df['America']]
print (df)
America
0 {23, 24}
1 {10}
2 {AA, XY}
This is one approach using str.split.
Ex:
import pandas as pd
df = pd.DataFrame({'America':["24,23,24,24","10","AA,AA, XY"]})
print(df["America"].str.split(",").apply(set))
Output:
0 {24, 23}
1 {10}
2 {AA, XY}
Name: America, dtype: object
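Both answers turn each cell into a set, which discards the original order and leaves set objects in the column. If the order should be preserved and a plain string is wanted back, dict.fromkeys can be used instead; a sketch under that assumption:
# dict.fromkeys keeps first-seen order while dropping duplicates,
# and the keys can be joined straight back into a comma-separated string
df["America"] = df["America"].apply(
    lambda x: ",".join(dict.fromkeys(s.strip() for s in x.split(",")))
)
print(df)
#   America
# 0   24,23
# 1      10
# 2   AA,XY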

How to convert JSON data inside a pandas column into new columns

I have this short version of ADSB JSON data and would like to convert it into DataFrame columns such as Icao, Alt, Lat, Long, Spd, Cou...
After Alperen told me to do this
df = pd.read_json('2016-06-20-2359Z.json', lines=True),
I can load it into a DataFrame. However, df.acList is
[{'Id': 10537990, 'Rcvr': 1, 'HasSig': False, ...
Name: acList, dtype: object
How can I get the Icao, Alt, Lat, Long, Spd, Cou data?
"src":1,
"feeds":[
{
"id":1,
"name":"ADSBexchange.com",
"polarPlot":false
}
],
"srcFeed":1,
"showSil":true,
"showFlg":true,
"showPic":true,
"flgH":20,
"flgW":85,
"acList":[
{
"Id":11281748,
"Rcvr":1,
"HasSig":false,
"Icao":"AC2554",
"Bad":false,
"Reg":"N882AS",
"FSeen":"\/Date(1466467166951)\/",
"TSecs":3,
"CMsgs":1,
"AltT":0,
"Tisb":false,
"TrkH":false,
"Type":"CRJ2",
"Mdl":"2001
BOMBARDIER INC
CL-600-2B19",
"Man":"Bombardier",
"CNum":"7503",
"Op":"EXPRESSJET AIRLINES INC - ATLANTA, GA",
"OpIcao":"ASQ",
"Sqk":"",
"VsiT":0,
"WTC":2,
"Species":1,
"Engines":"2",
"EngType":3,
"EngMount":1,
"Mil":false,
"Cou":"United States",
"HasPic":false,
"Interested":false,
"FlightsCount":0,
"Gnd":false,
"SpdTyp":0,
"CallSus":false,
"TT":"a",
"Trt":1,
"Year":"2001"
},
{
"Id":11402205,
"Rcvr":1,
"HasSig":true,
"Sig":110,
"Icao":"ADFBDD",
"Bad":false,
"FSeen":"\/Date(1466391940977)\/",
"TSecs":75229,
"CMsgs":35445,
"Alt":8025,
"GAlt":8025,
"AltT":0,
"Call":"TEST1234",
"Tisb":false,
"TrkH":false,
"Sqk":"0262",
"Help":false,
"VsiT":0,
"WTC":0,
"Species":0,
"EngType":0,
"EngMount":0,
"Mil":true,
"Cou":"United States",
"HasPic":false,
"Interested":false,
"FlightsCount":0,
"Gnd":true,
"SpdTyp":0,
"CallSus":false,
"TT":"a",
"Trt":1
}
],
"totalAc":4231,
"lastDv":"636019887431643594",
"shtTrlSec":61,
"stm":1466467170029
}
If you already have your data in acList column in a pandas DataFrame, simply do:
import pandas as pd
pd.io.json.json_normalize(df.acList[0])
Alt AltT Bad CMsgs CNum Call CallSus Cou EngMount EngType ... Sqk TSecs TT Tisb TrkH Trt Type VsiT WTC Year
0 NaN 0 False 1 7503 NaN False United States 1 3 ... 3 a False False 1 CRJ2 0 2 2001
1 8025.0 0 False 35445 NaN TEST1234 False United States 0 0 ... 0262 75229 a False False 1 NaN 0 0 NaN
Since pandas 1.0 the imports should be:
import pandas as pd
pd.json_normalize(df.acList[0])
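If the raw file is a single JSON document rather than line-delimited (an assumption on my side; the question loads it with lines=True), the acList records can also be flattened directly with the record_path argument, skipping the intermediate acList column entirely:
import json
import pandas as pd

# load the whole document as a plain dict (filename taken from the question)
with open('2016-06-20-2359Z.json') as f:
    data = json.load(f)

# one row per aircraft record under "acList"
flights = pd.json_normalize(data, record_path='acList')

# reindex keeps the wanted columns and fills any missing from a record with NaN
print(flights.reindex(columns=['Icao', 'Alt', 'Lat', 'Long', 'Spd', 'Cou']))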
Sergey's answer solved the issue for me, but I ran into trouble because the JSON in my dataframe column was stored as a string rather than as an object. I had to add the extra step of mapping the column:
import json
import pandas as pd
pd.io.json.json_normalize(df.acList.apply(json.loads))
Since pandas 1.0, json_normalize is available in the top-level namespace.
Therefore use:
import pandas as pd
pd.json_normalize(df.acList[0])
I can't comment yet on ThinkBonobo's answer, but in case the JSON in the column isn't exactly a dictionary you can keep applying .apply until it is. So in my case:
import json
import pandas as pd

pd.json_normalize(
    df
    .theColumnWithJson
    .apply(json.loads)
    .apply(lambda x: x[0])  # the inner JSON is a list with the dictionary as its only item
)
In my case I had some missing values (None), so I created a more specific version that also drops the original column after creating the new ones:
for prefix in ['column1', 'column2']:
    df_temp = df[prefix].apply(lambda x: {} if pd.isna(x) else x)
    df_temp = pd.io.json.json_normalize(df_temp)
    df_temp = df_temp.add_prefix(prefix + '_')
    df.drop([prefix], axis=1, inplace=True)
    df = pd.concat([df, df_temp], axis=1, sort=False)

Python: test if a value in a pandas dataframe is a member of a set denoted by another column

if I have the following csv file test.csv:
C01,45,A,R
C02,123,H,I
where I have define sets R and I as
R=set(['R','E','D','N','P','H','K'])
I=set(['I','H','G','F','A','C','L','M','P','Q','S','T','V','W','Y'])
I want to be able to test if the string A is a member of set R (which is false) and if string H is a member of set I (which is true). I have tried to do this with the following script:
#!/usr/bin/env python
import pandas as pd
I=set(['I','H','G','F','A','C','L','M','P','Q','S','T','V','W','Y'])
R=set(['R','E','D','N','P','H','K'])
with open('test.csv') as f:
    table = pd.read_table(f, sep=',', header=None, lineterminator='\n')

table[table.columns[3]].astype(str).isin(table[table.columns[4]].astype(str))
i.e. I am trying to do the equivalent of A in R or rather table.columns[3] in table.columns[4] and return TRUE or FALSE for each row of data.
The only problem is that, with that final line, both rows return TRUE. If I change the final line to
table[table.columns[3]].astype(str).isin(R)
Then I get
0 FALSE
1 TRUE
which is correct. It seems that I am not referencing the set name correctly when doing .isin(table[table.columns[3]].astype(str))
any ideas?
Starting with the following:
In [21]: df
Out[21]:
0 1 2 3
0 C01 45 A R
1 C02 123 H I
In [22]: R=set(['R','E','D','N','P','H','K'])
...: I=set(['I','H','G','F','A','C','L','M','P','Q','S','T','V','W','Y'])
...:
You could do something like this:
In [23]: sets = {"R":R,"I":I}
In [24]: df.apply(lambda S: S[2] in sets[S[3]],axis=1)
Out[24]:
0 False
1 True
dtype: bool
Fair warning: .apply is slow and doesn't scale well to larger data. It is there for convenience and as a last resort.
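For larger frames, the same check can be done without .apply by building one isin() mask per set and combining them; a sketch using the df and sets defined above:
# one boolean mask per set, switched on by the set name in column 3
result = pd.Series(False, index=df.index)
for name, members in sets.items():
    result |= (df[3] == name) & df[2].isin(members)
print(result)
# 0    False
# 1     True
# dtype: bool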
