Pandas - Filter DataFrame with Logics - python

I have a Pandas DataFrame like this,
Employee ID  ActionCode  ActionReason  ConcatenatedOutput
1            TER         DEA           TER_DEA
1            RET         ABC           RET_ABC
1            RET         DEF           RET_DEF
2            TER         DEA           TER_DEA
2            ABC         ABC           ABC_ABC
2            DEF         DEF           DEF_DEF
3            RET         FGH           RET_FGH
3            RET         EFG           RET_EFG
4            PLA         ABC           PLA_ABC
4            TER         DEA           TER_DEA
And I want to filter it with the logic below and change it to something like this,
Employee ID  ConcatenatedOutput  Context
1            RET_ABC             RET or TER Found
2            TER_DEA             RET or TER Found
3            RET_FGH             RET or TER Found
4            PLA_ABC             RET or TER Not Found
Logic:
1) If the first record of an employee is TER_DEA, we look at that employee's other records: if the employee has another RET record, we pick the first available RET record; otherwise we stick with the TER_DEA record.
2) If the first record of an employee is anything other than TER_DEA, we stick with that record.
3) Context is conditional: if the chosen record has a RET or TER, we say "RET or TER Found"; otherwise "RET or TER Not Found".
Note: the final output will have only one record per employee ID.
The data:
employee_id = [1,1,1,2,2,2,3,3,4,4]
action_code = ['TER','RET','RET','TER','ABC','DEF','RET','RET','PLA','TER']
action_reason = ['DEA','ABC','DEF','DEA','ABC','DEF','FGH','EFG','ABC','DEA']
concatenated_output = ['TER_DEA', 'RET_ABC', 'RET_DEF', 'TER_DEA', 'ABC_ABC', 'DEF_DEF', 'RET_FGH', 'RET_EFG', 'PLA_ABC', 'TER_DEA']
df = pd.DataFrame({
    'Employee ID': employee_id,
    'ActionCode': action_code,
    'ActionReason': action_reason,
    'ConcatenatedOutput': concatenated_output,
})

I'd recommend you go with a Bool in that field instead. To get the test data I used this:
import pandas as pd

employee_id = [1,1,1,2,2,2,3,3,4,4]
action_code = ['TER','RET','RET','TER','ABC','DEF','RET','RET','PLA','TER']
action_reason = ['DEA','ABC','DEF','DEA','ABC','DEF','FGH','EFG','ABC','DEA']
concatenated_output = ['TER_DEA', 'RET_ABC', 'RET_DEF', 'TER_DEA', 'ABC_ABC', 'DEF_DEF', 'RET_FGH', 'RET_EFG', 'PLA_ABC', 'TER_DEA']
df = pd.DataFrame({
    'Employee ID': employee_id,
    'ActionCode': action_code,
    'ActionReason': action_reason,
    'ConcatenatedOutput': concatenated_output,
})
You can then do a group by on the Employee ID and apply a function to perform your specific program logic in there.
def myfunc(data):
    if data.iloc[0]['ConcatenatedOutput'] == 'TER_DEA':
        if len(data.loc[data['ActionCode'] == 'RET']) > 0:
            located_record = data.loc[data['ActionCode'] == 'RET'].iloc[[0]]
        else:
            located_record = data.iloc[[0]]
    else:
        located_record = data.iloc[[0]]
    located_record['RET or TER Context'] = data['ActionCode'].str.contains('|'.join(['RET', 'TER']))
    return located_record

df.groupby(['Employee ID']).apply(myfunc)
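Putting the pieces together, the same idea can produce exactly the one-row-per-employee table from the question. This is a sketch: `pick_record` is my own name for the grouped function, and the Context strings are copied from the desired output (note that the expected row for employee 4 implies Context is judged on the record you keep, not on the whole group):

```python
import pandas as pd

df = pd.DataFrame({
    'Employee ID': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'ActionCode': ['TER', 'RET', 'RET', 'TER', 'ABC', 'DEF', 'RET', 'RET', 'PLA', 'TER'],
    'ActionReason': ['DEA', 'ABC', 'DEF', 'DEA', 'ABC', 'DEF', 'FGH', 'EFG', 'ABC', 'DEA'],
    'ConcatenatedOutput': ['TER_DEA', 'RET_ABC', 'RET_DEF', 'TER_DEA', 'ABC_ABC',
                           'DEF_DEF', 'RET_FGH', 'RET_EFG', 'PLA_ABC', 'TER_DEA'],
})

def pick_record(data):
    # Rule 1: if the first record is TER_DEA, prefer the first RET record, if any
    if data.iloc[0]['ConcatenatedOutput'] == 'TER_DEA' and (data['ActionCode'] == 'RET').any():
        row = data.loc[data['ActionCode'] == 'RET'].iloc[[0]].copy()
    else:
        # Rule 2: otherwise stick with the first record
        row = data.iloc[[0]].copy()
    # Rule 3: Context is based on the record we actually kept
    found = row['ActionCode'].iloc[0] in ('RET', 'TER')
    row['Context'] = 'RET or TER Found' if found else 'RET or TER Not Found'
    return row

out = (df.groupby('Employee ID', group_keys=False)
         .apply(pick_record)
         .reset_index(drop=True)[['Employee ID', 'ConcatenatedOutput', 'Context']])
print(out)
```

With `group_keys=False` the result keeps one flat row per group, so after `reset_index` it matches the desired table directly.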

Related

Trying to create a streamlit app that uses user-provided URLs to scrape and return a downloadable df

I'm trying to use this create_df() function in Streamlit to gather a list of user-provided URLs called "recipes" and loop through each URL to return a df I've labeled "res" towards the end of the function. I've tried several approaches with the Streamlit syntax but I just cannot get this to work as I'm getting this error message:
recipe_scrapers._exceptions.WebsiteNotImplementedError: recipe-scrapers exception: Website (h) not supported.
Have a look at my entire repo here. The main.py script works just fine once you've installed all requirements locally, but when I try running the same script with Streamlit syntax in the streamlit.py script I get the above error. Once you run streamlit run streamlit.py in your terminal and have a look at the UI I've created, it should be quite clear what I'm aiming at: providing the user with a csv of all ingredients in the recipe URLs they provided, for a convenient grocery shopping list.
Any help would be greatly appreciated!
def create_df(recipes):
    """
    Description:
        Creates one df with all recipes and their ingredients
    Arguments:
        * recipes: list of recipe URLs provided by user
    Comments:
        Note that ingredients with qualitative amounts e.g., "scheutje melk",
        "snufje zout" have been omitted from the ingredient list
    """
    df_list = []
    for recipe in recipes:
        scraper = scrape_me(recipe)
        recipe_details = replace_measurement_symbols(scraper.ingredients())
        recipe_name = recipe.split("https://www.hellofresh.nl/recipes/", 1)[1]
        recipe_name = recipe_name.rsplit('-', 1)[0]
        print("Processing data for " + recipe_name + " recipe.")
        for ingredient in recipe_details:
            try:
                df_temp = pd.DataFrame(columns=['Ingredients', 'Measurement'])
                df_temp[str(recipe_name)] = recipe_name
                ing_1 = ingredient.split("2 * ", 1)[1]
                ing_1 = ing_1.split(" ", 2)
                item = ing_1[2]
                measurement = ing_1[1]
                quantity = float(ing_1[0]) * 2
                df_temp.loc[len(df_temp)] = [item, measurement, quantity]
                df_list.append(df_temp)
            except (ValueError, IndexError):
                pass
    df = pd.concat(df_list)
    print("Renaming duplicate ingredients e.g., Kruimige aardappelen, Voorgekookte halve kriel met schil -> Aardappelen")
    # Single-element values need a trailing comma: ('Rode ui',) is a 1-tuple,
    # while ('Rode ui') is just a string and would be iterated character by character
    ingredient_dict = {
        'Aardappelen': ('Dunne frieten', 'Half kruimige aardappelen', 'Voorgekookte halve kriel met schil',
                        'Kruimige aardappelen', 'Roodschillige aardappelen', 'Opperdoezer Ronde aardappelen'),
        'Ui': ('Rode ui',),
        'Kipfilet': ('Kipfilet met tuinkruiden en knoflook',),
        'Kipworst': ('Gekruide kipworst',),
        'Kipgehakt': ('Gemengd gekruid gehakt', 'Kipgehakt met Mexicaanse kruiden', 'Half-om-halfgehakt met Italiaanse kruiden',
                      'Kipgehakt met tuinkruiden'),
        'Kipshoarma': ('Kalkoenshoarma',)
    }
    reverse_label_ing = {x: k for k, v in ingredient_dict.items() for x in v}
    df["Ingredients"].replace(reverse_label_ing, inplace=True)
    print("Assigning ingredient categories")
    category_dict = {
        'brood': ('Biologisch wit rozenbroodje', 'Bladerdeeg', 'Briochebroodje', 'Wit platbrood'),
        'granen': ('Basmatirijst', 'Bulgur', 'Casarecce', 'Cashewstukjes',
                   'Gesneden snijbonen', 'Jasmijnrijst', 'Linzen', 'Maïs in blik',
                   'Parelcouscous', 'Penne', 'Rigatoni', 'Rode kidneybonen',
                   'Spaghetti', 'Witte tortilla'),
        'groenten': ('Aardappelen', 'Aubergine', 'Bosui', 'Broccoli',
                     'Champignons', 'Citroen', 'Gele wortel', 'Gesneden rodekool',
                     'Groene paprika', 'Groentemix van paprika, prei, gele wortel en courgette',
                     'IJsbergsla', 'Kumato tomaat', 'Limoen', 'Little gem',
                     'Paprika', 'Portobello', 'Prei', 'Pruimtomaat',
                     'Radicchio en ijsbergsla', 'Rode cherrytomaten', 'Rode paprika', 'Rode peper',
                     'Rode puntpaprika', 'Rode ui', 'Rucola', 'Rucola en veldsla', 'Rucolamelange',
                     'Semi-gedroogde tomatenmix', 'Sjalot', 'Sperziebonen', 'Spinazie', 'Tomaat',
                     'Turkse groene peper', 'Veldsla', 'Vers basilicum', 'Verse bieslook',
                     'Verse bladpeterselie', 'Verse koriander', 'Verse krulpeterselie', 'Wortel', 'Zoete aardappel'),
        'kruiden': ('Aïoli', 'Bloem', 'Bruine suiker', 'Cranberrychutney', 'Extra vierge olijfolie',
                    'Extra vierge olijfolie met truffelaroma', 'Fles olijfolie', 'Gedroogde laos',
                    'Gedroogde oregano', 'Gemalen kaneel', 'Gemalen komijnzaad', 'Gemalen korianderzaad',
                    'Gemalen kurkuma', 'Gerookt paprikapoeder', 'Groene currykruiden', 'Groentebouillon',
                    'Groentebouillonblokje', 'Honing', 'Italiaanse kruiden', 'Kippenbouillonblokje', 'Knoflookteen',
                    'Kokosmelk', 'Koreaanse kruidenmix', 'Mayonaise', 'Mexicaanse kruiden', 'Midden-Oosterse kruidenmix',
                    'Mosterd', 'Nootmuskaat', 'Olijfolie', 'Panko paneermeel', 'Paprikapoeder', 'Passata',
                    'Pikante uienchutney', 'Runderbouillonblokje', 'Sambal', 'Sesamzaad', 'Siciliaanse kruidenmix',
                    'Sojasaus', 'Suiker', 'Sumak', 'Surinaamse kruiden', 'Tomatenblokjes', 'Tomatenblokjes met ui',
                    'Truffeltapenade', 'Ui', 'Verse gember', 'Visbouillon', 'Witte balsamicoazijn', 'Wittewijnazijn',
                    'Zonnebloemolie', 'Zwarte balsamicoazijn'),
        'vlees': ('Gekruide runderburger', 'Half-om-half gehaktballetjes met Spaanse kruiden', 'Kipfilethaasjes', 'Kipfiletstukjes',
                  'Kipgehaktballetjes met Italiaanse kruiden', 'Kippendijreepjes', 'Kipshoarma', 'Kipworst', 'Spekblokjes',
                  'Vegetarische döner kebab', 'Vegetarische kaasschnitzel', 'Vegetarische schnitzel'),
        'zuivel': ('Ei', 'Geraspte belegen kaas', 'Geraspte cheddar', 'Geraspte grana padano', 'Geraspte oude kaas',
                   'Geraspte pecorino', 'Karnemelk', 'Kruidenroomkaas', 'Labne', 'Melk', 'Mozzarella',
                   'Parmigiano reggiano', 'Roomboter', 'Slagroom', 'Volle yoghurt')
    }
    reverse_label_cat = {x: k for k, v in category_dict.items() for x in v}
    df["Category"] = df["Ingredients"].map(reverse_label_cat)
    col = "Category"
    first_col = df.pop(col)
    df.insert(0, col, first_col)
    df = df.sort_values(['Category', 'Ingredients'], ascending=[True, True])
    print("Merging ingredients by row across all recipe columns using justify()")
    gp_cols = ['Ingredients', 'Measurement']
    oth_cols = df.columns.difference(gp_cols)
    arr = np.vstack(df.groupby(gp_cols, sort=False, dropna=False).apply(
        lambda gp: justify(gp.to_numpy(), invalid_val=np.NaN, axis=0, side='up')))
    # Reconstruct DataFrame
    # Remove entirely NaN rows based on the non-grouping columns
    res = (pd.DataFrame(arr, columns=df.columns)
           .dropna(how='all', subset=oth_cols, axis=0))
    res = res.fillna(0)
    res['Total'] = res.drop(['Ingredients', 'Measurement'], axis=1).sum(axis=1)
    res = res[res['Total'] != 0]  # drop rows that are being duplicated with 0 for some reason; will check later
    print("Processing complete!")
    return res
Your function create_df needs a list as an argument, but st.text_input always returns a string.
In your streamlit.py, replace df_download = create_df(recs) with df_download = create_df([recs]). If you need to handle multiple URLs, you should use str.split like this:
def create_df(recipes):
    recipes = recipes.split(",")  # <--- add this line to make a list from the user input
    ### rest of the code ###

if download:
    df_download = create_df(recs)
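The splitting step can be sketched on its own, without Streamlit. Here `parse_recipe_urls` is a hypothetical helper name and the URLs are made up; it also strips whitespace and drops empty pieces left by trailing commas, which raw str.split would keep:

```python
def parse_recipe_urls(recs):
    # st.text_input always returns one string; split on commas and
    # strip whitespace, dropping empty pieces from trailing commas
    return [url.strip() for url in recs.split(",") if url.strip()]

urls = parse_recipe_urls(
    "https://www.hellofresh.nl/recipes/a-1, https://www.hellofresh.nl/recipes/b-2,")
print(urls)  # ['https://www.hellofresh.nl/recipes/a-1', 'https://www.hellofresh.nl/recipes/b-2']
```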

How to use values from a PANDAS data frame as filter params in regex?

I would like to use the values from a Pandas df as filter params in a SPARQL query.
By reading the data from an excel file I'm creating the pandas dataframe:
xls = pd.ExcelFile ('excel/dataset_nuovo.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)
Here the resulting dataframe:
oggetto descrizione lenght label tipologia
0 #iccd4580759# Figure: putto. Oggetti: ghirlanda di fiori 6 Bad OpereArteVisiva
1 #iccd3636719# Decorazione plastica. 2 Bad OpereArteVisiva
2 #iccd3641475# Scultura.. Figure: angelo 3 Bad OpereArteVisiva
3 #iccd8282504# Custodia di reliquiario in legno intagliato e ... 8 Good OpereArteVisiva
4 #iccd3019633# Portale. 1 Bad OpereArteVisiva
... ... ... ... ... ...
59995 #iccd2274873# Ciotola media a larga tesa. Decorazione in cob... 35 Good OpereArteVisiva
59996 #iccd11189887# Il medaglione bronzeo, sormontato da un'aquila... 85 Good OpereArteVisiva
59997 #iccd4545324# Tessuto di fondo rosaceo. Disegno a fiori e fo... 49 Good OpereArteVisiva
59998 #iccd2934870# Sculture a tutto tondo in legno dipinto di bia... 28 Good OpereArteVisiva
59999 #iccd2685205# Calice con piede a base circolare e nodo ovoid... 14 Bad OpereArteVisiva
Then, for each record, I need to use the value from the oggetto column as a filter to retrieve the related subject from a SPARQL endpoint.
By using this SPARQL query:
SELECT ?object ?description (group_concat(?subject;separator="|") as ?subjects)
WHERE { ?object a crm:E22_Man-Made_Object;
            crm:P3_has_note ?description;
            crm:P129_is_about ?concept;
            crm:P2_has_type ?type.
        ?concept a crm:E28_Conceptual_Object;
            rdfs:label ?subject.
        filter( regex(str(?object), "#iccd4580759#" ))
}
I'm able to filter one single record.
object.type object.value ... subjects.type subjects.value
0 uri http://dati.culturaitalia.it/resource/oai-oaic... ... literal Putto con ghirlanda di fiori|Putto con ghirlan..
Since the dataset is 60k records, I would like to automate the process by looping through the dataframe and using each value as a filter, to get a new df with a relative subject column.
oggetto descrizione subject lenght label tipologia
0 #iccd4580759# Figure: putto. Oggetti: ghirlanda di fiori Putto con ghirlanda di fiori|Putto con ghirlan.. 6 Bad OpereArteVisiva
Here is the entire script I wrote:
import xlrd
import pandas as pd
from pandas import json_normalize
from SPARQLWrapper import SPARQLWrapper, JSON

xls = pd.ExcelFile('excel/dataset_nuovo.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)

def query_ci(sparql_query, sparql_service_url):
    sparql = SPARQLWrapper(sparql_service_url)
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(JSON)
    # ask for the result
    result = sparql.query().convert()
    return json_normalize(result["results"]["bindings"])

sparql_query = """SELECT ?object ?description (group_concat(?subject;separator="|") as ?subjects)
WHERE { ?object a crm:E22_Man-Made_Object;
            crm:P3_has_note ?description;
            crm:P129_is_about ?concept;
            crm:P2_has_type ?type.
        ?concept a crm:E28_Conceptual_Object;
            rdfs:label ?subject.
        filter( regex(str(?object), "#iccd4580759#" ))
}
"""
sparql_service_url = "http://dati.culturaitalia.it/sparql"
result_table = query_ci(sparql_query, sparql_service_url)
print(result_table)
result_table.to_excel("output.xlsx")
Is it possible to do that?
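Yes. One way is to substitute each oggetto value into the query's filter and run the queries in a loop. Below is a sketch: `QUERY_TEMPLATE` and `build_queries` are names I made up, it only builds the query strings (untested against the live endpoint), and the doubled braces keep str.format from eating the SPARQL braces:

```python
import pandas as pd

QUERY_TEMPLATE = """SELECT ?object ?description (group_concat(?subject;separator="|") as ?subjects)
WHERE {{ ?object a crm:E22_Man-Made_Object;
            crm:P3_has_note ?description;
            crm:P129_is_about ?concept;
            crm:P2_has_type ?type.
        ?concept a crm:E28_Conceptual_Object;
            rdfs:label ?subject.
        filter( regex(str(?object), "{oggetto}" ))
}}"""

def build_queries(df):
    # One formatted query string per row of the dataframe
    return [QUERY_TEMPLATE.format(oggetto=o) for o in df['oggetto']]

queries = build_queries(pd.DataFrame({'oggetto': ['#iccd4580759#', '#iccd3636719#']}))
```

Each query can then go through query_ci as in your script, the resulting frames concatenated with pd.concat, and the subjects merged back onto df1 via the oggetto value. Be aware that 60k separate HTTP requests will be slow; batching several ids into one regex alternation (e.g. "iccd4580759|iccd3636719") per query would cut the round trips considerably.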

Pandas Dataframe - Conditional Column Creation

I am attempting to create a new column based on conditional logic from another column. I've tried searching and haven't been able to find anything that addresses my issue.
I have imported a CSV to a pandas dataframe, it is structured like this. I edited a few of the descriptions for this post, but other than that everything is the same:
#code used to load dataframe:
df = pd.read_csv(r"C:\filepath\filename.csv")
#output from print(type(df)):
#class 'pandas.core.frame.DataFrame'
#output from print(df.columns.values):
#['Type' 'Trans Date' 'Post Date' 'Description' 'Amount']
#output from print(df.columns):
Index(['Type', 'Trans Date', 'Post Date', 'Description', 'Amount'], dtype='object')
#output from print
Type Trans Date Post Date Description Amount
0 Sale 01/25/2018 01/25/2018 DESC1 -13.95
1 Sale 01/25/2018 01/26/2018 AMAZON MKTPLACE PMTS -6.99
2 Sale 01/24/2018 01/25/2018 SUMMIT BISTRO -5.85
3 Sale 01/24/2018 01/25/2018 DESC3 -9.13
4 Sale 01/24/2018 01/26/2018 DYNAMIC VENDING INC -1.60
I then write the following code:
def criteria(row):
    if row.Description.find('SUMMIT BISTRO')>0:
        return 'Lunch'
    elif row.Description.find('AMAZON MKTPLACE PMTS')>0:
        return 'Amazon'
    elif row.Description.find('Aldi')>0:
        return 'Groceries'
    else:
        return 'NotWorking'

df['Category'] = df.apply(criteria, axis=0)
Errors:
Traceback (most recent call last):
  File "C:\Users\Test_BankReconcile2.py", line 44, in <module>
    df['Category'] = df.apply(criteria, axis=0)
  File "C:\Users\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4262, in apply
    ignore_failures=ignore_failures)
  File "C:\Users\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4358, in _apply_standard
    results[i] = func(v)
  File "C:\Users\OneDrive\Documents\finance\Test_BankReconcile2.py", line 35, in criteria
    if row.Description.find('SUMMIT BISTRO')>0:
  File "C:\Users\Anaconda3\lib\site-packages\pandas\core\generic.py", line 3081, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: ("'Series' object has no attribute 'Description'", 'occurred at index Type')
I'm able to successfully execute the same sort of command on a very similar csv file from a different bank (this example is from my credit card), so I don't know what is going on. Possibly I need to define the dataframe in some way that I'm not doing, or I'm missing something else that is very obvious. Thank you all in advance for helping me solve this.
Yes, your problem is that you need to pass axis=1 to .apply:
In [52]: df
Out[52]:
Type Trans Date Post Date Description Amount
0 Sale 01/25/2018 01/25/2018 DESC1 -13.95
1 Sale 01/25/2018 01/26/2018 AMAZON MKTPLACE PMTS -6.99
2 Sale 01/24/2018 01/25/2018 SUMMIT BISTRO -5.85
3 Sale 01/24/2018 01/25/2018 DESC3 -9.13
4 Sale 01/24/2018 01/26/2018 DYNAMIC VENDING INC -1.60
In [53]: def criteria(row):
...: if row.Description.find('SUMMIT BISTRO')>0:
...: return 'Lunch'
...: elif row.Description.find('AMAZON MKTPLACE PMTS')>0:
...: return 'Amazon'
...: elif row.Description.find('Aldi')>0:
...: return 'Groceries'
...: else:
...: return 'NotWorking'
...:
In [54]: df.apply(criteria, axis=1)
Out[54]:
0 NotWorking
1 NotWorking
2 NotWorking
3 NotWorking
4 NotWorking
dtype: object
The second problem is a logic error: instead of .find(x) > 0 you want .find(x) >= 0, or better yet, some_string in some_other_string.
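A quick demonstration of the off-by-one: str.find returns the match position, which is 0 when the match sits at the start of the string, so > 0 wrongly rejects exactly the rows that match best:

```python
desc = 'SUMMIT BISTRO'
print(desc.find('SUMMIT BISTRO'))      # 0: the match starts at index 0...
print(desc.find('SUMMIT BISTRO') > 0)  # False: ...so the > 0 test rejects it
print('SUMMIT BISTRO' in desc)         # True: the in operator says what you mean
```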
For a more general solution, omit Description inside the function and instead use df['Description'].apply(criteria) with Series.apply.
Also, to check for a substring in a string, use in.
def criteria(row):
    if 'SUMMIT BISTRO' in row:
        return 'Lunch'
    elif 'AMAZON MKTPLACE PMTS' in row:
        return 'Amazon'
    elif 'Aldi' in row:
        return 'Groceries'
    else:
        return 'NotWorking'

df['Category'] = df['Description'].apply(criteria)
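Since all the rules are plain substring checks, a vectorized alternative to apply is numpy.select over str.contains masks. A sketch on the sample rows from the question (regex=False keeps the patterns literal):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Description': ['DESC1', 'AMAZON MKTPLACE PMTS',
                                   'SUMMIT BISTRO', 'DESC3', 'DYNAMIC VENDING INC']})

# One boolean mask per rule, checked in order, with a default fallback
conditions = [
    df['Description'].str.contains('SUMMIT BISTRO', regex=False),
    df['Description'].str.contains('AMAZON MKTPLACE PMTS', regex=False),
    df['Description'].str.contains('Aldi', regex=False),
]
df['Category'] = np.select(conditions, ['Lunch', 'Amazon', 'Groceries'],
                           default='NotWorking')
print(df)
```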

pandas dataframe apply function over column creating multiple columns

I have the pandas df below, with a few columns, one of which is ip_address
df.head()
my_id someother_id created_at ip_address state
308074 309115 2859690 2014-09-26 22:55:20 67.000.000.000 rejected
308757 309798 2859690 2014-09-30 04:16:56 173.000.000.000 approved
309576 310619 2859690 2014-10-02 20:13:12 173.000.000.000 approved
310347 311390 2859690 2014-10-05 04:16:01 173.000.000.000 approved
311784 312827 2859690 2014-10-10 06:38:39 69.000.000.000 approved
For each ip_address I'm trying to return the description, city, and country.
I wrote the function below and tried to apply it:
from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return [description, city, country]

suspect['ipwhois'] = suspect['ip_address'].apply(returnIP)
My problem is that this returns a list, I want three separate columns.
Any help is greatly appreciated. I'm new to Pandas/Python so if there's a better way to write the function and use Pandas would be very helpful.
from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return (description, city, country)

# zip(*...) turns the Series of tuples into three parallel sequences
suspect['description'], suspect['city'], suspect['country'] = \
    zip(*suspect['ip_address'].apply(returnIP))
I was able to solve it with another Stack Overflow solution:
cols = ['description', 'city', 'country']
for n, col in enumerate(cols):
    suspect[col] = suspect['ipwhois'].apply(lambda ipwhois: ipwhois[n])
If there's a more elegant way to solve this, please share!
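A slightly tidier pattern is to let pandas expand the tuples into columns for you. This is a sketch: the returnIP body here is a stub standing in for the real IPWhois lookup so it runs offline, but the shape of the return value matches the function above:

```python
import pandas as pd

def returnIP(ip):
    # Stub for the real IPWhois lookup; the real function returns
    # (description, city, country) for the given address
    return ('EXAMPLE-NET', 'Example City', 'US')

suspect = pd.DataFrame({'ip_address': ['67.0.0.0', '173.0.0.0']})

# apply(pd.Series) expands each returned tuple into its own row of a frame,
# which join() then attaches as three named columns
expanded = suspect['ip_address'].apply(returnIP).apply(pd.Series)
expanded.columns = ['description', 'city', 'country']
suspect = suspect.join(expanded)
```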

parse string into list based on input list

I would like to write a function in python3 to parse a string based on the input list element. The following function works but is there a better way to do it?
def func(oStr, s_s):
    if not oStr:
        return s_s
    elif '' in s_s:
        return [oStr]
    else:
        res = []
        for x in s_s:
            st = oStr.find(x)
            end = st + len(x)
            res.append(oStr[st:end])
            oStr = oStr.replace(x, '')
        if oStr:
            res.append(oStr)
        return res
case 1
o_str = 'ABCNew York - Address'
s_str = ['ABC']
return ['ABC', 'New York - Address']
case 2
o_str = 'New York Friend Add | NumberABCNewYork Name | FirstName Last Name | time : Jan-31-2017'
s_str = ['New York Friend Add | Number', 'ABC', 'NewYork Name | FirstName Last Name | time: Jan-31-2017']
return ['New York Friend Add | Number', 'ABC', 'NewYork Name | FirstName Last Name | time: Jan-31-2017']
case 3
o_str = '-'
s_str = ['']
return ['-']
case 4
o_str = '1'
s_str = ['']
return ['1']
case 5
o_str = '1234Family-Name'
s_str = ['1234']
return ['1234', 'Family-Name']
case 6
o_str = ''
s_str = ['12345667', 'name']
return ['12345667', 'name']
Python strings are immutable, so you can't insert into them in place the way you would with a list; build up a list of pieces instead and return or join it. For your purposes res.append() works exactly the way you're using it, just make sure res is initialized to [] inside the function rather than relying on a global, or a second call will keep appending to stale results. Hope I helped.
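One alternative worth sketching (func_split is a name I made up) slices left-to-right with str.partition, which avoids the find/replace bookkeeping and degrades gracefully when a piece doesn't match exactly, since partition returns the whole remainder unconsumed on a miss:

```python
def func_split(oStr, s_s):
    # Same contract as func above: cut oStr into the pieces listed in s_s,
    # keeping any leftover text as its own element.
    if not oStr:
        return s_s            # empty input string (case 6)
    if '' in s_s:
        return [oStr]         # no real separators given (cases 3 and 4)
    res = []
    for x in s_s:
        before, match, oStr = oStr.partition(x)
        if before:            # text sitting in front of the matched piece
            res.append(before)
        if match:             # the matched piece itself
            res.append(match)
    if oStr:
        res.append(oStr)      # whatever is left over goes last
    return res

print(func_split('ABCNew York - Address', ['ABC']))  # ['ABC', 'New York - Address']
```

Because partition consumes the string left to right, the output pieces stay in the order they appear in the original, which replace-based removal only preserves by accident.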
