I have the following line
open.loc[(open['Genesis'] == 'Incomplete Name') & (open['Date'] >= last_date),'Result'] = 'Missing information'
'last_date' is a variable holding a specific year-week ('2022-25')
'open' is my dataframe
'Result' is a column of my dataframe, as well as 'Date' and 'Genesis'
What I'm trying to do is transform that line of code so that, instead of specifying the exact value 'Incomplete Name' in ['Genesis'], I could use something like the LIKE operator in SQL to match more rows, since the ['Genesis'] column has values like:
'Incomplete Name'
'Incomplete color'
'Incomplete material'
And to replace the ['Result'] value with the ['Genesis'] value itself, since I do not want to manually specify every possible outcome of that column.
I've tried something like the following but I'm struggling to make it work:
for word in open['Genesis']:
    if word.startswith('Incomplete') and open['Date'] >= last_date:
        open['Result'] = open['Genesis']
Thanks in advance!
The `in` operator won't do what you want here: `'Incomplete Name' in open['Genesis']` checks the Series *index*, not its values, and a Python-level loop can't combine a per-row string test with the date condition. Pandas exposes vectorized string methods through the `.str` accessor; `Series.str.startswith` (or `str.contains` for a general LIKE '%...%') gives you a boolean mask.
So you can do:
mask = open['Genesis'].str.startswith('Incomplete') & (open['Date'] >= last_date)
open.loc[mask, 'Result'] = open.loc[mask, 'Genesis']
Using `open.loc[mask, 'Genesis']` on the right-hand side also covers your second requirement: each matching row's 'Result' is filled with that row's own 'Genesis' value, so you never have to enumerate every possible outcome.
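As a runnable sketch of the vectorized approach on toy data (the frame is named open_df here to avoid shadowing Python's built-in open; the values only mirror the question):

```python
import pandas as pd

open_df = pd.DataFrame({
    'Genesis': ['Incomplete Name', 'Incomplete color', 'Complete', 'Incomplete material'],
    'Date':    ['2022-26',         '2022-24',          '2022-30',  '2022-27'],
    'Result':  ['',                '',                 '',         ''],
})
last_date = '2022-25'  # year-week strings in the same format compare correctly as strings

# Boolean mask: Genesis starts with 'Incomplete' AND Date >= last_date
mask = open_df['Genesis'].str.startswith('Incomplete') & (open_df['Date'] >= last_date)
# Copy each matching row's own Genesis value into Result
open_df.loc[mask, 'Result'] = open_df.loc[mask, 'Genesis']
print(open_df['Result'].tolist())
# ['Incomplete Name', '', '', 'Incomplete material']
```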
Related
I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My Code looks like this :
with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(map(str, i.split('$'))) for i in fileDrug20Q4]

drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))

drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])
Currently the code will pull only entries with the exact name "OLMESARTAN" out, how do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL" etc? I can't simply list all the varieties as there's an infinite amount of variations, so I would need something that captures anything with the term "OLMESARTAN" within it.
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using some string I found in the documentation):
import pandas as pd
df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
test
4 Index
22 Index.
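Applied to the question's data, the same method would look something like this (a small hand-built frame stands in for the real file; the IDs are illustrative):

```python
import pandas as pd

# Toy stand-in for drugNameDataFrame20Q4 from the question
drugNameDataFrame20Q4 = pd.DataFrame(
    [('100324822',  'PS', 'OLMESARTAN MEDOXOMIL'),
     ('1014687010', 'SS', 'HYDROCHLOROTHIAZIDE\\OLMESARTAN MEDOXOMIL'),
     ('115700162',  'C',  'OLMESARTAN'),
     ('900000001',  'PS', 'BENICAR')],
    columns=['PrimaryID', 'Role', 'Drug Name'])

# Keep every row whose Drug Name mentions OLMESARTAN anywhere in the string
olmesartan = drugNameDataFrame20Q4[
    drugNameDataFrame20Q4['Drug Name'].str.contains('OLMESARTAN')
]
print(olmesartan['PrimaryID'].tolist())
# ['100324822', '1014687010', '115700162']
```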
I'm trying to filter out the null values in a column and count if its greater than 1.
badRows = df.filter($"_corrupt_record".isNotNull)
if badRows.count > 0:
    logger.error("throwing bad rows exception...")
    schema_mismatch_exception(None, "cdc", item)
I'm getting a syntax error. Also tried to check using :
badRows = df.filter(col("_corrupt_record").isNotNull),
badRows = df.filter(None, col("_corrupt_record")),
badRows = df.filter("_corrupt_record isNotnull")
What is the correct way to filter rows where there is data in the _corrupt_record column?
Try, e.g.
import pyspark.sql.functions as F

badRows = df.where(F.col("_corrupt_record").isNotNull())
if badRows.count() > 0:
    logger.error("throwing bad rows exception...")
    schema_mismatch_exception(None, "cdc", item)
Note that count is a method, so it needs parentheses, and the $"..." column syntax is Scala, not PySpark. Many of the options you tried are not valid Python syntax, as you note.
Kaggle dataset (working on): New York Airbnb
Raw-data loading code, included to give a reproducible explanation of the issue:
airbnb = pd.read_csv("https://raw.githubusercontent.com/rafagarciac/Airbnb_NYC-Data-Science_Project/master/input/new-york-city-airbnb-open-data/AB_NYC_2019.csv")
airbnb[airbnb["host_name"].isnull()][["host_name", "neighbourhood_group"]]
I would like to fill the null values of "host_name" based on the "neighbourhood_group" column entities.
like
if airbnb['host_name'].isnull():
    if airbnb["neighbourhood_group"] == "Bronx":
        airbnb["host_name"] = "Vie"
    elif airbnb["neighbourhood_group"] == "Manhattan":
        airbnb["host_name"] = "Sonder (NYC)"
    else:
        airbnb["host_name"] = "Michael"
(this is wrong,just to represent the output format i want)
I've tried using an if statement but I couldn't apply it correctly. Could you please help me solve this?
Thanks
You could try this -
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Bronx"), "host_name"] = "Vie"
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Manhattan"), "host_name"] = "Sonder (NYC)"
airbnb.loc[airbnb['host_name'].isnull(), "host_name"] = "Michael"
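On a tiny made-up frame, those three lines produce:

```python
import pandas as pd

airbnb = pd.DataFrame({
    'neighbourhood_group': ['Bronx', 'Manhattan', 'Queens'],
    'host_name': [None, None, None],
})

# Fill NA host names per neighbourhood group, then a catch-all default
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"] == "Bronx"), "host_name"] = "Vie"
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"] == "Manhattan"), "host_name"] = "Sonder (NYC)"
airbnb.loc[airbnb['host_name'].isnull(), "host_name"] = "Michael"
print(airbnb['host_name'].tolist())
# ['Vie', 'Sonder (NYC)', 'Michael']
```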
Pandas has a special method to fill NA values:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You may create a dict with values for "host_name" field using "neighbourhood_group" values as keys and do this:
host_dict = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}
airbnb['host_name'] = airbnb['host_name'].fillna(value=airbnb[airbnb['host_name'].isna()]['neighbourhood_group'].map(host_dict))
airbnb['host_name'] = airbnb['host_name'].fillna("Michael")
"value" argument here may be a Series of values.
So, first of all, we create a Series with "neighbourhood_group" values which correspond to our missing values by using this part:
neighbourhood_group_series = airbnb[airbnb['host_name'].isna()]['neighbourhood_group']
Then using map function together with "host_dict" we get a Series with values that we want to impute:
neighbourhood_group_series.map(host_dict)
Finally we just impute in all other NA cells some default value, in our case "Michael".
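A minimal, self-contained sketch of the same idea (toy data; only the dict values come from the question):

```python
import pandas as pd

airbnb = pd.DataFrame({
    'neighbourhood_group': ['Bronx', 'Manhattan', 'Queens', 'Bronx'],
    'host_name': [None, None, None, 'Existing'],
})

host_dict = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}

# Map neighbourhood_group -> default host name; fillna only touches NA cells,
# so mapping the whole column (not just the NA rows) is safe here.
airbnb['host_name'] = airbnb['host_name'].fillna(
    airbnb['neighbourhood_group'].map(host_dict)
)
# Any neighbourhood not in the dict falls back to the global default
airbnb['host_name'] = airbnb['host_name'].fillna('Michael')
print(airbnb['host_name'].tolist())
# ['Vie', 'Sonder (NYC)', 'Michael', 'Existing']
```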
You can do it with:
ornek = pd.DataFrame({'samp1': [None, None, None],
                      'samp2': ["sezer", "bozkir", "farkli"]})

def filter_by_col(row):
    if row["samp2"] == "sezer":
        return "ping"
    if row["samp2"] == "bozkir":
        return "pong"
    return None

ornek['samp1'] = ornek.apply(filter_by_col, axis=1)
This code works for a single string (inputx), but I can't get it to work when I replace the string with a column from my dataframe. What I want is to split the string in column DESC so that the capitalized words at the beginning of the string go into column break2 and the remainder of the description goes into column break3. Any assistance is appreciated. Thanks.
Example:
What I want the output to look like (but with the different DESC from each row):
Code that works for hardcoded string:
inputx= "STOCK RECORD INQUIRY This is a system that keeps track of the positions, location and ownership of the securities that the broker holds"
pos = re.search("[a-z]", inputx[::1]).start()
Before_df['break1'] = pos
Before_df['break2'] = inputx[:(pos-1)]
Before_df['break3'] = inputx[(pos-1):]
But if I replace with dataframe column, I get error message: TypeError: expected string or bytes-like object
inputx = Before_df['DESC']
pos = re.search("[a-z]", inputx[::1]).start()
Before_df['break1'] = pos
Before_df['break2'] = inputx[:(pos-1)]
Before_df['break3'] = inputx[(pos-1):]
You can use a regex in the Series.str.split method. Note that splitting on the capture group "([a-z])" would break the string at every lowercase letter and produce a variable number of columns; instead, split exactly once, at the point where the sentence-case description begins, using a zero-width lookahead and n=1:
df[['break2', 'break3']] = df['DESC'].str.split(r'(?=[A-Z][a-z])', n=1, expand=True, regex=True)
If you absolutely must use re.search (which sounds a little like homework...)
for i in df.index:
    df.at[i, 'columnName'] = re.search("[a-z]", df.at[i, 'inputColumn'][::1]).start()
The reason for looping instead of using df.apply() is that dataframes do not like to be changed during an apply.
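A runnable sketch of the single-split approach (column names from the question; the sample rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'DESC': [
        "STOCK RECORD INQUIRY This is a system that keeps track of the positions",
        "ORDER ENTRY This module accepts new orders",
    ]
})

# Split once, just before the first capitalized word that is followed by a
# lowercase letter, i.e. where the sentence-case description begins.
df[['break2', 'break3']] = df['DESC'].str.split(
    r'(?=[A-Z][a-z])', n=1, expand=True, regex=True
)
print(df.loc[0, 'break2'])  # 'STOCK RECORD INQUIRY '
print(df.loc[0, 'break3'])  # 'This is a system that keeps track of the positions'
```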
I have the following df and function (see below). I might be over complicating this. A new set of fresh eyes would be deeply appreciated.
df:
Site Name Plan Unique ID Atlas Placement ID
Affectv we11080301 11087207850894
Mashable we14880202 11087208009031
Alphr uk10790301 11087208005229
Alphr uk19350201 11087208005228
The goal is to:
Iterate first through df['Plan Unique ID'] and search for a specific pattern (we_match or uk_match); if there is a match,
check whether the string value is greater than a certain value in that group (we12720203 or uk11350200).
If the value is greater, add that we or uk value to a new column df['Consolidated ID'].
If the value is lower or there is no match, then search df['Atlas Placement ID'] with new_id_search.
If there is a match, add that to df['Consolidated ID'].
If not, return 0 to df['Consolidated ID'].
The current problem is that it returns an empty column.
def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search="(\d{14})"):
    if type(df['Plan Unique ID']) is str:
        we_match = re.search(we_search, df['Plan Unique ID'])
        if we_match:
            if we_match > "we12720203":
                return we_match.group(0)
        else:
            uk_match = re.search(uk_search, df['Plan Unique ID'])
            if uk_match:
                if uk_match > "uk11350200":
                    return uk_match.group(0)
    else:
        match_new = re.search(new_id_search, df['Atlas Placement ID'])
        if match_new:
            return match_new.group(0)
    return 0
mediaplan_df['Consolidated ID'] = mediaplan_df.apply(placement_extract, axis=1)
Edit: Cleaned the formula
I modified gzl's function in the following way (see below): first, check whether df1 contains 14 digits; if so, add that.
The next step, ideally would be to grab a column MediaPlanUnique from df2 and turn it into a series filtered_placements:
we11080301
we12880304
we14880202
uk19350201
uk11560205
uk11560305
And see if any of the values in filtered_placements are present in df['Plan Unique ID']. If there is a match, then add df['Plan Unique ID'] to our end column df['ConsolidatedID'].
The current problem is that it results in all 0. I think it's because the comparison is being done 1-to-1 (first result of new_match vs first result of filtered_placements) rather than 1-to-many (first result of new_match vs all results of filtered_placements).
Any ideas?
def placement_extract(df="mediaplan_df", new_id_search="[a-zA-Z]{2}\d{8}", old_id_search="(\d{14})"):
    if type(df['PlacementID']) is str:
        old_match = re.search(old_id_search, df['PlacementID'])
        if old_match:
            return old_match.group(0)
    else:
        if type(df['Plan Unique ID']) is str:
            if type(filtered_placements) is str:
                new_match = re.search(new_id_search, df['Plan Unique ID'])
                if new_match:
                    if filtered_placements.str.contains(new_match.group(0)):
                        return new_match.group(0)
    return 0
mediaplan_df['ConsolidatedID'] = mediaplan_df.apply(placement_extract, axis=1)
I would recommend not using such complicated nested if statements. As Phil pointed out, each check is mutually exclusive, so you can check 'we' and 'uk' at the same indentation level, then fall back to the default process.
def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search="(\d{14})"):
    if type(df['Plan Unique ID']) is str:
        we_match = re.search(we_search, df['Plan Unique ID'])
        if we_match:
            if we_match.group(0) > "we12720203":
                return we_match.group(0)
        uk_match = re.search(uk_search, df['Plan Unique ID'])
        if uk_match:
            if uk_match.group(0) > "uk11350200":
                return uk_match.group(0)
    match_new = re.search(new_id_search, df['Atlas Placement ID'])
    if match_new:
        return match_new.group(0)
    return 0
Test:
In [37]: df.apply(placement_extract, axis=1)
Out[37]:
0 11087207850894
1 we14880202
2 11087208005229
3 uk19350201
dtype: object
I've reorganised the logic and also simplified the regex operations to show another way to approach it. The reorganisation wasn't strictly necessary for the answer, but as you asked for another opinion / way of approaching it, I thought this might help you in future:
# Inline comments to explain the main changes.
def placement_extract(row, we_search="we12720203", uk_search="uk11350200"):
    # Extracted to shorter temp variable
    plan_id = row["Plan Unique ID"]

    # Using parentheses to get two separate groups - code and numeric.
    # Means you can do the match just once.
    result = re.match("(we|uk)(.+)", plan_id)
    if result:
        code, numeric = result.groups()
        # We can get away with these simple tests as the earlier regex
        # guarantees that the string starts with either "we" or "uk"
        if code == "we" and plan_id > we_search:
            return_val = plan_id
        elif code == "uk" and plan_id > uk_search:
            return_val = plan_id
        else:
            # The Atlas Placement is the default option if the ID fails
            # the "greater than" test
            return_val = row["Atlas Placement ID"]
    else:
        # No we/uk prefix at all: fall back to the Atlas Placement ID too
        # (without this branch, return_val would be unbound here)
        return_val = row["Atlas Placement ID"]
    # A single return statement is often easier to debug
    return return_val
Then use it in an apply statement (also look into assign):
mediaplan_df["Consolidated ID"] = mediaplan_df.apply(placement_extract, axis=1)
mediaplan_df

  Site Name Plan Unique ID Atlas Placement ID Consolidated ID
0   Affectv     we11080301     11087207850894  11087207850894
1  Mashable     we14880202     11087208009031      we14880202
2     Alphr     uk10790301     11087208005229  11087208005229
3     Alphr     uk19350201     11087208005228      uk19350201
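For reference, the whole thing as a self-contained script (sample data from the question; the function condenses the same we/uk-threshold-then-fallback logic):

```python
import re
import pandas as pd

def placement_extract(row, we_search="we12720203", uk_search="uk11350200"):
    plan_id = row["Plan Unique ID"]
    result = re.match(r"(we|uk)(.+)", plan_id)
    if result:
        code, _numeric = result.groups()
        # Keep the plan ID only if it passes its group's threshold
        if (code == "we" and plan_id > we_search) or \
           (code == "uk" and plan_id > uk_search):
            return plan_id
    # Fall back to the Atlas Placement ID on a failed prefix or threshold test
    return row["Atlas Placement ID"]

mediaplan_df = pd.DataFrame({
    "Site Name": ["Affectv", "Mashable", "Alphr", "Alphr"],
    "Plan Unique ID": ["we11080301", "we14880202", "uk10790301", "uk19350201"],
    "Atlas Placement ID": ["11087207850894", "11087208009031",
                           "11087208005229", "11087208005228"],
})
mediaplan_df["Consolidated ID"] = mediaplan_df.apply(placement_extract, axis=1)
print(mediaplan_df["Consolidated ID"].tolist())
# ['11087207850894', 'we14880202', '11087208005229', 'uk19350201']
```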