Check if value in a column exists in URL using lambda function - python

I have a dataframe that has 2 columns. One is the URL and other is the username.
+----------------------------------------+---------------+
| URL                                    | Username      |
+----------------------------------------+---------------+
| johnsmith/stackoverflow.com/?=abc      | johnsmith     |
| michealrod/stackoverflow.com/?=payment | michealrod    |
| stephaniejean/stackoverflow.com/?=abc  | stephaniejean |
+----------------------------------------+---------------+
I want to write a lambda function that checks if the username exists in the URL. I am trying to write something like this, but I get an error:
df['exists'] = df.apply(lambda x : df['Username'] in df['URL']).any()
So basically I am trying to get True if the username is part of the URL and False if the username does not exist in the URL.

Assuming your data is clean, a list comprehension is relatively efficient:
df['exists'] = [x in y for x, y in zip(df['Username'], df['URL'])]
You can use apply but with worse performance:
df['exists'] = df.apply(lambda row: row['Username'] in row['URL'], axis=1)

Check with numpy's core.defchararray.find:
import numpy as np

df['exists'] = np.core.defchararray.find(df.URL.values.astype(str), df.Username.values) != -1
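For reference, here is a quick end-to-end check with the sample data from the question (a minimal sketch; the values are copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'URL': ['johnsmith/stackoverflow.com/?=abc',
            'michealrod/stackoverflow.com/?=payment',
            'stephaniejean/stackoverflow.com/?=abc'],
    'Username': ['johnsmith', 'michealrod', 'stephaniejean'],
})

# row-wise substring test: True when the username appears inside the URL
df['exists'] = [user in url for user, url in zip(df['Username'], df['URL'])]
print(df['exists'].tolist())  # [True, True, True]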


Object to dictionary to use get() python pandas

I'm having some issues with a column in my csv whose type is 'object', but it should be a series of dicts (one dict per row).
The point is to treat each row as a dict and call get('id') on it, returning the id value for each row of the 'Conta' column.
This is how it looks as an 'object' column:
| Conta |
| ---------------------------------------------|
| {'name':'joe','id':'4347176000574713087'} |
| {'name':'mary','id':'4347176000115055151'} |
| {'name':'fred','id':'4347176000574610147'} |
| {'name':'Marcos','id':'4347176000555566806'} |
| {'name':'marcos','id':'4347176000536834310'} |
This is how it should look in the end:
| Conta |
| ------------------- |
| 4347176000574713087 |
| 4347176000115055151 |
| 4347176000574610147 |
| 4347176000555566806 |
| 4347176000536834310 |
I tried to use:
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df['Conta'] = df['Conta'].to_dict()
df['Conta'] = [x.get('id', 0) for x in df['Conta']]
#return: AttributeError: 'str' object has no attribute 'get'
I also tried to use ast.literal_eval(), but it doesn't work either:
import ast
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df = df[['Conta','ID_CS']]
df['Conta'] = df['Conta'].apply(ast.literal_eval)
#return: ValueError: malformed node or string: nan
Can someone help me?
Consider replacing the following line:
df['Conta'] = df['Conta'].apply(ast.literal_eval)
If it's being correctly detected as a dictionary, then:
df['Conta'] = df['Conta'].map(lambda x: x['id'])
If each row is a string:
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(x)['id'])
However, if you are getting a malformed node or string error, consider first casting to str and then applying ast.literal_eval():
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(str(x))['id'])
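Since the traceback mentions nan, the malformed values are most likely missing rows in the csv. A minimal sketch, assuming the NaN rows can simply be skipped, is to apply the conversion only to non-null entries:
import ast
import pandas as pd

df = pd.read_csv('csv/Modulo_CS.csv')

# ast.literal_eval fails on NaN, so only convert rows where 'Conta' is present
mask = df['Conta'].notna()
df.loc[mask, 'Conta'] = df.loc[mask, 'Conta'].map(lambda x: ast.literal_eval(x)['id'])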

pandas - create true/false column if column header is substring of another column

I am following this post to create a number of columns that are true/false based on whether a substring is present in another column.
Prior to using the code in the above post, I look at a field called LANGUAGES which has values such as "ENG, SPA, CZE" or "ENG, SPA". Unfortunately, the data is a comma-delimited string instead of a list, but no problem: in one line I can get the list of 25 unique values.
Once I get the list of unique values, I want to make a new column for each value, so df['ENG'], df['SPA'], etc. I want these columns to be true/false based on whether the header is a substring of the original LANGUAGES column.
Following the post, I use df.apply(lambda x: language in df.LANGUAGES, axis = 1). However, when I check the values of the columns (the value counts in the last for loop), all the values come up false.
How can I create a true/false column based on the column's header being a substring of another column?
My Code:
import json
import pandas as pd
import requests

url = r"https://data.hud.gov/Housing_Counselor/search?AgencyName=&City=&State=&RowLimit=&Services=&Languages="
response = requests.get(url)
if response.status_code == 200:
    res = response.json()
    df = pd.DataFrame(res)
    df.columns = [str(h).upper() for h in list(df)]
    #
    # the below line is confusing but it creates a sorted list of all unique languages
    #
    languages = [str(s) for s in sorted(list(set((",".join(list(df["LANGUAGES"].unique()))).split(","))))]
    for language in languages:
        print(language)
        df[language] = df.apply(lambda x: language in df.LANGUAGES, axis=1)
    for language in languages:
        print(df[language].value_counts())
        print("\n")
else:
    print("\nConnection was unsuccessful: {0}".format(response.status_code))
Edit: There was a request for a raw data input and expected output. Here is what a column looks like:
+-------+-----------------+
| Index | LANGUAGES       |
+-------+-----------------+
| 0     | 'ENG, OTH, RUS' |
| 1     | 'ENG'           |
| 2     | 'ENG, CZE, SPA' |
+-------+-----------------+
This is the expected output:
+-------+-----------------+------+-------+-------+-------+-------+
| Index | LANGUAGES       | ENG  | CZE   | OTH   | RUS   | SPA   |
+-------+-----------------+------+-------+-------+-------+-------+
| 0     | 'ENG, OTH, RUS' | TRUE | FALSE | TRUE  | TRUE  | FALSE |
| 1     | 'ENG'           | TRUE | FALSE | FALSE | FALSE | FALSE |
| 2     | 'ENG, CZE, SPA' | TRUE | TRUE  | FALSE | FALSE | TRUE  |
+-------+-----------------+------+-------+-------+-------+-------+
Two steps: first we explode your list, then we build a crosstab and concat it back to your original df based on the index.
s = df['LANGUAGES'].str.replace("'",'').str.split(',').explode().to_frame()
cols = s['LANGUAGES'].drop_duplicates(keep='first').tolist()
df2 = pd.concat([df, pd.crosstab(s.index, s["LANGUAGES"])[cols]], axis=1).replace(
    {1: True, 0: False}
)
print(df2)
         LANGUAGES   ENG    OTH    RUS    CZE    SPA
0  'ENG, OTH, RUS'  True   True   True  False  False
1            'ENG'  True  False  False  False  False
2  'ENG, CZE, SPA'  True  False  False   True   True
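A more compact variant of the same idea (a sketch, assuming the values are quoted, comma-plus-space separated strings as in the sample, so the quotes are stripped first) is pandas' str.get_dummies:
# one indicator column per language; astype(bool) turns the 0/1 dummies into True/False
dummies = df['LANGUAGES'].str.strip("'").str.get_dummies(sep=', ').astype(bool)
df2 = pd.concat([df, dummies], axis=1)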
As found in this post, I swapped out the following line of code:
df[language] = df.apply(lambda x: language in df.LANGUAGES, axis = 1)
for the following two lines:
criteria = lambda row : language in row["LANGUAGES"]
df[language] = df.apply(criteria, axis =1)
And it works. The original version failed because language in df.LANGUAGES tests membership against the Series' index rather than each row's string (so it is always False here), whereas language in row["LANGUAGES"] checks the row's own LANGUAGES value.
import json
import pandas as pd
import requests

url = r"https://data.hud.gov/Housing_Counselor/search?AgencyName=&City=&State=&RowLimit=&Services=&Languages="
response = requests.get(url)
if response.status_code == 200:
    res = response.json()
    df = pd.DataFrame(res)
    df.columns = [str(h).upper() for h in list(df)]
    #
    # the below line is confusing but it creates a sorted list of all unique languages
    #
    languages = [str(s) for s in sorted(list(set((",".join(list(df["LANGUAGES"].unique()))).split(","))))]
    for language in languages:
        criteria = lambda row: language in row["LANGUAGES"]
        df[language] = df.apply(criteria, axis=1)
    for language in languages:
        print(df[language].value_counts())
        print("\n")
else:
    print("\nConnection was unsuccessful: {0}".format(response.status_code))
This line swap could also work:
for language in languages:
    df[language] = df.LANGUAGES.apply(lambda x: 'True' if language in x else 'False')
    print("{}:{}".format(language, df[df[language] == 'True'].shape[0]))

ETL table selection by Variable

I'm trying to select rows within a table and create a new table with the information from the original table using PETL.
My code right now is:
import petl as etl

table_all = (
    etl.fromcsv("practice_locations.csv")
    .convert('Practice_Name', 'upper')
    .convert('Suburb', str)
    .convert('State', str)
    .convert('Postcode', int)
    .convert('Lat', str)
    .convert('Long', str)
)

def selection(post_code):
    table_selected = etl.select(table_all, "{Postcode} == 'post_code'")
    print(post_code)
    etl.tojson(table_selected, 'location.json', sort_keys=True)
But I cannot seem to populate table_selected by using the selection function as it is. The etl.select call does work if I replace post_code with a literal value:
table_selected = etl.select(table_all, "{Postcode} == 4510")
Which outputs the correct table shown as:
+--------------------------------+--------------+-------+----------+--------------+--------------+
| Practice_Name | Suburb | State | Postcode | Lat | Long |
+================================+==============+=======+==========+==============+==============+
| 'CABOOLTURE COMBINED PRACTICE' | 'Caboolture' | 'QLD' | 4510 | '-27.085007' | '152.951707' |
+--------------------------------+--------------+-------+----------+--------------+--------------+
I'm sure I am just referencing post_code in the wrong way, but I have tried everything in the PETL documentation and can't seem to figure it out.
Any help is much appreciated.
"{Postcode} == 'post_code'" will not replace post_code with the value passed to your selection function.
You need to format your select string (and escape {Postcode} by doubling the braces when using format):
table_selected = etl.select(table_all, "{{Postcode}} == {post_code}".format(post_code=post_code))
Testing this in the console:
>>> "{{Postcode}} == {post_code}".format(post_code=1234)
'{Postcode} == 1234'
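On Python 3.6+ an f-string works the same way, as long as the literal braces around Postcode are doubled (a small sketch of the same idea):
def selection(post_code):
    # {{Postcode}} renders as {Postcode}; post_code is interpolated as its numeric value
    table_selected = etl.select(table_all, f"{{Postcode}} == {post_code}")
    etl.tojson(table_selected, 'location.json', sort_keys=True)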

Convert a value using a value from a different row with petl?

I have the following table:
+---------+------------+----------------+
| IRR     | Price List | Cambridge Data |
+=========+============+================+
| '1.56%' | '0'        | '6/30/1989'    |
+---------+------------+----------------+
| '5.17%' | '100'      | '9/30/1989'    |
+---------+------------+----------------+
| '4.44%' | '0'        | '12/31/1990'   |
+---------+------------+----------------+
I'm trying to write a calculator that updates the Price List field by making a simple calculation. The logic is basically this:
previous price * ( 1 + IRR%)
So for the last row, the calculation would be: 100 * (1 + 4.44%) = 104.44
Since I'm using petl, I'm trying to figure out how to update a field with its above value and a value from the same row and then populate this across the whole Price List column. I can't seem to find a useful petl utility for this. Should I just manually write a method? What do you guys think?
Try this:
# conversion can access other values from the same row
table = etl.convert(table, 'Price List',
                    lambda v, row: 100 * (1 + row.IRR),
                    pass_row=True)
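That said, the question asks for previous price * (1 + IRR%), which needs the price computed for the row above, and convert only sees one row at a time. One option is a small helper that walks the rows and carries the running price forward (a rough sketch, not from the original answer: the helper name apply_irr is made up, IRR strings such as '4.44%' are parsed into fractions, and the first row keeps its existing price as the seed):
import petl as etl

def apply_irr(table):
    # walk the rows in order, carrying the previously computed price forward
    rows = list(etl.dicts(table))
    prev = None
    for row in rows:
        irr = float(str(row['IRR']).strip("'").rstrip('%')) / 100
        price = float(str(row['Price List']).strip("'")) if prev is None else prev * (1 + irr)
        row['Price List'] = price
        prev = price
    return etl.fromdicts(rows)

table = apply_irr(table)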

Using regex to extract information from a large SFrame or dataframe without using a loop

I have the following code in which I use a loop to extract some information and use that information to create a new matrix. However, because I am using a loop, this code takes forever to finish.
I wonder if there is a better way of doing this by using GraphLab's SFrame or pandas dataframe. I appreciate any help!
import graphlab as gl
import regex

# This is the regex pattern
pattern_topic_entry_read = r"\d{15}/discussion_topics/(?P<topic>\d{9})/entries/(?P<entry>\d{9})/read"

# Using the pattern, I filter my records
requests_topic_entry_read = requests[requests['url'].apply(lambda x: False if regex.match(pattern_topic_entry_read, x) == None else True)]

# Then for each record in the final set,
# I need to extract topic and entry info using match.group
for request in requests_topic_entry_read:
    for match in regex.finditer(pattern_topic_entry_read, request['url']):
        topic, entry = match.group('topic'), match.group('entry')
        # Then, I need to create a new SFrame (or dataframe, or anything suitable)
        newRow = gl.SFrame({'user_id': [request['user_id']],
                            'url': [request['url']],
                            'topic': [topic], 'entry': [entry]})
        # And, append it to my existing SFrame (or dataframe)
        entry_read_matrix = entry_read_matrix.append(newRow)
Some sample data:
user_id | url
1000 | /123456832960900/discussion_topics/770000832912345/read
1001 | /123456832960900/discussion_topics/770000832923456/view?per_page=832945307
1002 | /123456832960900/discussion_topics/770000834562343/entries/832350330/read
1003 | /123456832960900/discussion_topics/770000534344444/entries/832350367/read
I want to obtain this:
user_id | topic | entry
1002 | 770000834562343 | 832350330
1003 | 770000534344444 | 832350367
Pandas' Series has string functions for that. E.g., with your data in df:
import io
import re
import pandas as pd

# `data` holds the sample table above as a string
pattern = re.compile(r'.*/discussion_topics/(?P<topic>\d+)(?:/entries/(?P<entry>\d+))?')
df = pd.read_table(io.StringIO(data), sep=r'\s*\|\s*', index_col='user_id')
df.url.str.extract(pattern, expand=True)
yields
                   topic      entry
user_id
1000     770000832912345        NaN
1001     770000832923456        NaN
1002     770000834562343  832350330
1003     770000534344444  832350367
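To keep only the rows that actually have an entry, as in the desired output, the extracted columns can be joined back onto df and the NaN rows dropped (a short follow-up sketch under the same assumptions):
out = df.join(df.url.str.extract(pattern, expand=True)).dropna(subset=['entry'])
print(out[['topic', 'entry']])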
Here, let me reproduce it:
>>> import pandas as pd
>>> df = pd.DataFrame(columns=["user_id","url"])
>>> df.user_id = [1000,1001,1002,1003]
>>> df.url = ['/123456832960900/discussion_topics/770000832912345/read', '/123456832960900/discussion_topics/770000832923456/view?per_page=832945307', '/123456832960900/discussion_topics/770000834562343/entries/832350330/read','/123456832960900/discussion_topics/770000534344444/entries/832350367/read']
>>> df["entry"] = df.url.apply(lambda x: x.split("/")[-2] if "entries" in x.split("/") else "---")
>>> df["topic"] = df.url.apply(lambda x: x.split("/")[-4] if "entries" in x.split("/") else "---")
>>> df[df.entry!="---"]
gives you the desired DataFrame.
