I have stumbled upon a trivial problem in pandas. I have two dataframes. The first one, df_1, is as follows:
vendor_name  date           company_name    state
PERTH        is june 2019   Abc enterprise  Kentucky
Megan Ent    25-april-2019  Xyz Fincorp     Texas
The second one, df_2, contains the correct values for each column in df_1:
df_2
Field         wrong value     correct value
vendor_name   PERTH           Perth Enterprise
date          is              15    ## i.e. "is" should be read as "15"
company_name  Abc enterprise  ABC International Enterprise Inc.
In order to replace the values in df_1 with the correct ones (except for the date field), I am using the pandas .loc method. Below is the code snippet:
vend = df_1['vendor_name'].tolist()
comp = df_1['company_name'].tolist()
state = df_1['state'].tolist()

for i in vend:
    if df_2['wrong value'].str.contains(i):
        crct = df_2.loc[df_2['wrong value'] == i, 'correct value'].tolist()
I have followed the same approach for company and state.
However, crct is returning a blank list. Ideally it should return:
['Perth Enterprise','Abc International Enterprise Inc']
The next step would be to replace the respective field values with the above list.
With the above, I have three questions:
1. Why is the above code generating a blank list? What am I missing here?
2. How can I replace the respective fields using the df_1.replace method?
3. What should be the correct approach to replace the portion of the date in df_1 with the correct one from df_2?
Edit: when the data has looping replacements (i.e. overlapping keys and values), replacement on the whole dataframe will fail. In this case, do it column by column and concat the results together. Finally, use join to add back any missing columns from df1:
df_replace = pd.concat([df1[k].replace(val, regex=True) for k, val in d.items()], axis=1).join(df1.state)
Original:
I tried your code in my interactive session and it gives the error ValueError: The truth value of a Series is ambiguous on df_2['wrong value'].str.contains(i), because str.contains returns a boolean Series rather than a single True/False.
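For reference, one minimal way to make that condition unambiguous is to collapse it to a single boolean with .any(), though the dictionary approach below avoids the per-vendor loop entirely:

for i in vend:
    # .any() collapses the boolean Series into one True/False
    if (df_2['wrong value'] == i).any():
        crct = df_2.loc[df_2['wrong value'] == i, 'correct value'].tolist()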
Assuming you have multiple vendor names, the simple way is to construct a dictionary from a groupby of df2 and use it with df.replace on df1:
d = {k: gp.set_index('wrong value')['correct value'].to_dict()
     for k, gp in df2.groupby('Field')}
Out[64]:
{'company_name': {'Abc enterprise': 'ABC International Enterprise Inc. '},
'date': {'is': '15'},
'vendor_name': {'PERTH': 'Perth Enterprise'}}
df_replace = df1.replace(d, regex=True)
print(df_replace)
        vendor_name           date                       company_name     state
0  Perth Enterprise   15 june 2019  ABC International Enterprise Inc.  Kentucky
1         Megan Ent  25-april-2019                        Xyz Fincorp     Texas
Note: your sample df2 only has a value for vendor PERTH, so only the first row is replaced. When df2 contains all the vendor names, they will all be replaced in df1.
A simple way to do that is to iterate over the first dataframe and then replace the wrong values:
rows = []
for i in range(len(df1)):
    vendor_name = df1.iloc[i]['vendor_name']
    date = df1.iloc[i]['date']
    company_name = df1.iloc[i]['company_name']
    if vendor_name in df2['wrong value'].values:
        vendor_name = df2.loc[df2['wrong value'] == vendor_name, 'correct value'].values[0]
    if company_name in df2['wrong value'].values:
        company_name = df2.loc[df2['wrong value'] == company_name, 'correct value'].values[0]
    rows.append({'vendor_name': vendor_name, 'date': date, 'company_name': company_name})

# build the frame once at the end (DataFrame.append was removed in pandas 2.0)
Result = pd.DataFrame(rows, columns=['vendor_name', 'date', 'company_name'])
Define the following replace function:
import re

def repl(row):
    fld = row.Field
    v1 = row['wrong value']
    v2 = row['correct value']
    updInd = df_1[df_1[fld].str.contains(v1)].index
    df_1.loc[updInd, fld] = df_1.loc[updInd, fld] \
        .str.replace(re.escape(v1), v2, regex=True)
Then call it for each row in df_2:
for _, row in df_2.iterrows():
    repl(row)
Note that str.replace alone does not require importing re (pandas imports it under the hood). But in the above function re.escape is called explicitly, from our code, hence import re is required.
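A minimal illustration of why the escaping matters, using hypothetical values that contain regex metacharacters:

import re
import pandas as pd

s = pd.Series(["Price (USD)"])
# Unescaped, "(USD)" is parsed as a regex group that matches only "USD";
# re.escape makes the parentheses literal so the whole token is replaced
print(s.str.replace(re.escape("(USD)"), "(EUR)", regex=True))
# 0    Price (EUR)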
First time asking. Is there a way to get a new df column including all three conditions (or, isnull-like, isin-like) without iterating over a for loop, keeping the code within the spirit of Pandas? I've tried advice from several threads dealing with individual aspects of common conditional problems, but every iteration I've tried usually leads me to "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." or produces incorrect results.

Below are example data and code from several attempts. My goal is to produce a dataframe (printed below) where the last (new) column concatenates all words from the company and unit columns, (1) without any NaNs (note: 'unit_desc' contains no null values irl, so NaNs in 'comp_unit' mean my function isn't working properly) and (2) without repeating the company name (because sometimes 'unit_desc' already [incorrectly] contains the company name, as with example row 2).
Desired dataframe:

       company                 unit_desc     comp_new                       comp_unit
0      Generic          Some description          NaN                Some description
1          NaN        Unit with features          NaN              Unit with features
2     Some LLC  Some LLC Xtra cool space     Some LLC        Some LLC Xtra cool space
3  Another LLC        Unit with features  Another LLC  Another LLC Unit with features
4  Another LLC                Basic unit  Another LLC          Another LLC Basic unit
5     Some LLC                basic unit     Some LLC             Some LLC basic unit
Imports and initial example df
import pandas as pd
import numpy as np
df = pd.DataFrame({
'company': ['Generic', np.nan, 'Some LLC', 'Another LLC', 'Another LLC', 'Some LLC'],
'unit_desc': ['Some description', 'Unit with features', 'Some LLC Xtra cool space', 'Unit with features', 'Basic unit', 'basic unit'],
})
ATTEMPT 0: Uses np.where
ATTEMPT 0 Results: ValueError as above
def my_func(df, unit, comp, bad_info_list):
    """Return new dataframe with new column combining company and unit descriptions

    Args:
        df (DataFrame): Pandas dataframe with product and brand info
        unit (str): name of unit description column
        comp (str): name of company name column
        bad_info_list (list): list of unwanted terms
    """
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)

    # (!!!START OF NOT WORKING!!!)
    # Make new column with brand and product descriptions
    df["comp_unit"] = np.where(
        (df["comp_new"].isnull().all() or df["comp_new"].isin(df[unit])),
        df[unit],
        (df["comp_new"] + " " + df[unit]),
    )
    # (!!!END OF NOT WORKING!!!)
    return df
df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
ATTEMPT 1: Uses np.where with ValueError suggestions as indicated with inline comments
ATTEMPT 1 Results:
Using .all(): Seems to consider all matches over entire Series, so produces wrong results
Using .any(): Seems to consider any matches over entire Series, so produces wrong results
Using .item(): Seems to check size of entire Series, so produces ValueError: can only convert an array of size 1 to a Python scalar
Using .bool(): Returns same ValueError as before
def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)

    # (!!!START OF NOT WORKING!!!)
    # Make new column with brand and product descriptions
    df["comp_unit"] = np.where(
        ((df["comp_new"].isnull().all()) | (df["comp_new"].isin(df[unit]))),  # Swap .all() with other options
        df[unit],
        (df["comp_new"] + " " + df[unit]),
    )
    # (!!!END OF NOT WORKING!!!)
    return df
df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
ATTEMPT 1.5: Same as 1 except .isnull().all() is swapped with == np.nan
ATTEMPT 1.5 Results: Incorrect results
I found it odd that I got no ambiguity errors with the isin statement---perhaps it's not working as intended?
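Side note (an explanatory aside, not from the original post): Series.isin tests each element against the entire other column and always returns a boolean Series, so it never triggers the ambiguity error; but it is not a row-by-row comparison. A minimal sketch:

import pandas as pd

s = pd.Series(['a', 'b'])
t = pd.Series(['b', 'c'])
# element-wise membership in the whole of t, not aligned row-by-row
print(s.isin(t))
# 0    False
# 1     True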
ATTEMPT 2: Uses if/elif/else and different suggestions from ValueError
Seems fixable using for loops for each conditional, but shouldn't there be another way?
ATTEMPT 2 Results: see bullet points from ATTEMPT 1
def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)

    # (!!!START OF NOT WORKING!!!)
    if df["comp_new"].isnull():  # Tried .all(), .any(), .item(), etc. just before ":"
        df["comp_unit"] = df[unit]
    elif df["comp_new"].isin(df[unit]):  # Tried .all(), etc. just before ":"
        df["comp_unit"] = df[unit]
    else:
        df["comp_unit"] = df["comp_new"] + " " + df[unit]
    # (!!!END OF NOT WORKING!!!)
    return df
df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
ATTEMPT 3: Uses if/elif/else combined with apply
ATTEMPT 3 Results: AttributeError: 'float' object has no attribute 'isin'
bad_info_list = ["Generic", "Name"]
df["comp_new"] = df["company"].apply(lambda x: x if x not in bad_info_list else np.nan)

def comp_unit_merge(df):
    if df["comp_new"] == np.nan:  # .isnull().item():
        return df["unit_desc"]
    elif df["comp_new"].isin(df["unit_desc"]):  # AttributeError: 'float' object has no attribute 'isin'
        return df["unit_desc"]
    else:
        return df["comp_new"] + " " + df["unit_desc"]

df["comp_unit"] = df.apply(comp_unit_merge, axis=1)
print(df)
ATTEMPT 4: Uses np.select(conditions, values)
ATTEMPT 4 Result: Incorrect results
Company name not included in last few rows
def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)

    # (!!!START OF NOT WORKING!!!)
    conditions = [
        ((df["comp_new"] == np.nan) | (df["comp_new"].isin(df[comp]))),
        (df["comp_new"] != np.nan),
    ]
    values = [
        (df[unit]),
        (df["comp_new"] + " " + df[unit]),
    ]
    df["comp_unit"] = np.select(conditions, values)
    # (!!!END OF NOT WORKING!!!)
    return df
df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
When using axis=1, the applied function receives a single row as an argument. Indexing into the row gives you string objects, in most cases -- except where a NaN is encountered.
Numpy NaNs are actually floats. So when you attempt to perform string operations on the company column, like checking whether the unit_desc contains the company, this throws errors for rows that contain NaNs.
Numpy has a function isnan, but calling this function on a string also throws an error. So any rows that have an actual company value will cause problems for that check.
You could check the type of the data with isinstance, or you could just remove the NaNs from your data ahead of time.
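As a side note (an addition here, not from the original answer), pd.isna is a type-safe alternative that handles strings and NaNs alike:

import numpy as np
import pandas as pd

# np.isnan("Some LLC") raises TypeError, but pd.isna accepts any type
print(pd.isna(np.nan))      # True
print(pd.isna("Some LLC"))  # False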
This example removes the NaNs ahead of time.
badlist = ["Generic", "Name"]

def merge(row):
    if row['company'] in badlist:
        return row['unit_desc']
    if row['company'] in row['unit_desc']:
        return row['unit_desc']
    return f"{row['company']} {row['unit_desc']}".strip()

df['company'] = df['company'].fillna('')
df['comp_unit'] = df.apply(merge, axis=1)
print(df)
Here's an alternative that safely detects the NaNs:
badlist = ["Generic", "Name"]

def merge(row):
    if isinstance(row['company'], float) and np.isnan(row['company']):
        return row['unit_desc']
    if row['company'] in badlist:
        return row['unit_desc']
    if row['company'] in row['unit_desc']:
        return row['unit_desc']
    return f"{row['company']} {row['unit_desc']}".strip()

df['comp_unit'] = df.apply(merge, axis=1)
print(df)
def my_func(dataframe, unit, comp, bad_info_list):
    df = dataframe.copy()
    df['comp_new'] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    idx = df[df.apply(lambda x: str(x['comp_new']) in str(x[unit]), axis=1) | df['comp_new'].isnull()].index
    df['comp_unit'] = np.where(
        df.index.isin(idx),
        df[unit],
        df['comp_new'] + ' ' + df[unit]
    )
    return df
new_df = my_func(df, 'unit_desc', 'company', ['Generic', 'Name'])
Pandas is not good at vectorizing operations between Series (or columns) containing strings, so you will have to use apply(..., axis=1) at some point. I would just use it once:
bad_info_list = ["Generic", "Name"]

df_new = df.assign(comp_new=df.apply(
    lambda row: row['unit_desc'] if pd.isna(row['company'])
                or row['company'] in bad_info_list
                or row['unit_desc'].startswith(row['company'])
                else ' '.join(row), axis=1))
It makes no change to the original df and produces the expected result:
company unit_desc comp_new
0 Generic Some description Some description
1 NaN Unit with features Unit with features
2 Some LLC Some LLC Xtra cool space Some LLC Xtra cool space
3 Another LLC Unit with features Another LLC Unit with features
4 Another LLC Basic unit Another LLC Basic unit
5 Some LLC basic unit Some LLC basic unit
First try to fill in the NaN values and then just add the two columns:
df = pd.DataFrame({
'company': ['Generic', np.nan, 'Some LLC', 'Another LLC', 'Another LLC', 'Some LLC'],
'unit_desc': ['Some description', 'Unit with features', 'Some LLC Xtra cool space', 'Unit with features', 'Basic unit', 'basic unit'],
})
df = df.fillna('')
df['new_col'] = df['company'] + ' ' + df['unit_desc']
>>> df
company unit_desc new_col
0 Generic Some description Generic Some description
1 Unit with features Unit with features
2 Some LLC Some LLC Xtra cool space Some LLC Some LLC Xtra cool space
3 Another LLC Unit with features Another LLC Unit with features
4 Another LLC Basic unit Another LLC Basic unit
5 Some LLC basic unit Some LLC basic unit
If I understand correctly, "Attempt 0" was close, but the condition was incorrect. Try this instead:
df["comp_unit"] = np.where(
((df["comp_new"].isnull()) | (df["comp_new"].apply(lambda row: row['comp_new'] in row[unit], axis='columns'))),
df[unit],
(df["comp_new"] + " " + df[unit]),
)
I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are stuck together into two different words.
For example, if the misspelled name is trujillohernandez, then it should be separated into trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checker libraries do not work, given that these are first names and they are Hispanic names.
I would be really grateful if you could help develop some sort of function to make this happen.
As noted in the comments above, not having a list of possible names will cause a problem. However, while perhaps not perfect, to offer something, try...
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required but perhaps it gets the majority of names split.
I have this dataframe data with roughly 10,000 records of sold items for 201 authors.
I want to add a column to this dataframe with the average price for each author.
First I create this new column average_price, and then I create another dataframe df
with 201 rows of authors and their average price (at least I think this is the right way to do this).
data["average_price"] = 0
df = data.groupby('Author Name', as_index=False)['price'].mean()
df looks like this
Author Name price
0 Agnes Cleve 107444.444444
1 Akseli Gallen-Kallela 32100.384615
2 Albert Edelfelt 207859.302326
3 Albert Johansson 30012.000000
4 Albin Amelin 44400.000000
... ... ...
196 Waldemar Lorentzon 152730.000000
197 Wilhelm von Gegerfelt 25808.510638
198 Yrjö Edelmann 53268.928571
199 Åke Göransson 87333.333333
200 Öyvind Fahlström 351345.454545
Now I want to use this df to populate the average_price column in the larger dataframe data.
I could not come up with how to do this, so I tried a for loop, which is not working. (And I know you should avoid for loops when working with dataframes.)
for index, row in data.iterrows():
    for ind, r in df.iterrows():
        if row["Author Name"] == r["Author Name"]:
            row["average_price"] = r["price"]
So I wonder, how should this be done?
You can use transform and groupby to add a new column:
data['average price'] = data.groupby('Author Name')['price'].transform('mean')
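For illustration, a minimal sketch on toy data (hypothetical authors and prices) showing that transform returns a result aligned with the original rows:

import pandas as pd

data = pd.DataFrame({
    'Author Name': ['Agnes Cleve', 'Agnes Cleve', 'Albin Amelin'],
    'price': [100000, 114889, 44400],
})
# each row receives its own group's mean, index-aligned with data
data['average price'] = data.groupby('Author Name')['price'].transform('mean')
print(data)
#     Author Name   price  average price
# 0   Agnes Cleve  100000       107444.5
# 1   Agnes Cleve  114889       107444.5
# 2  Albin Amelin   44400        44400.0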
I think, based on what you described, you should use the .join method on a Pandas dataframe. You don't need to create the 'average_price' column manually. This should simply work for your case:
# keep 'Author Name' as the index so join can align on it
df = data.groupby('Author Name')['price'].mean().rename('average_price')
data = data.join(df, on='Author Name')
Now you can get the average price from data['average_price'] column.
Hope this helps!
I think the easiest way to do that would be using a join (aka pandas.merge):
df_data = pd.DataFrame([...])  # your data here
# rename so the merged column is called average_price
# (and the merge doesn't produce price_x / price_y)
df_agg_data = df_data.groupby('Author Name', as_index=False)['price'].mean().rename(columns={'price': 'average_price'})
df_data = df_data.merge(df_agg_data, on="Author Name")
print(df_data)
I'm trying to take the info from a dataframe and break it out into columns with the following header names. The info is all crammed into one cell.
New to Python, so be gentle.
Thanks for the help.
My code:

import requests
import pandas as pd
from bs4 import BeautifulSoup as soup

r = requests.get('https://nclbgc.org/search/licenseDetails?licenseNumber=80479')
page_data = soup(r.text, 'html.parser')
company_info = [' '.join(' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('tr'))]
df = pd.DataFrame(company_info, columns = ['ic_number, status, renewal_date, company_name, address, county, telephon, limitation, residential_qualifiers'])
print(df)
The result I get:

['License Number, 80479 Status, Valid Renewal Date, n/a Name, DLR Construction, LLC Address, 3217 Vagabond Dr Monroe, NC 28110 County, Union Telephone, (980) 245-0867 Limitation, Limited Classifications, Residential Qualifiers, Arteaga, Vicky Rodriguez']
You can use read_html with some post processing:
url = 'https://nclbgc.org/search/licenseDetails?licenseNumber=80479'
# select the first table from the list of tables, drop rows that are all NaN
df = pd.read_html(url)[0].dropna(how='all')
# forward fill NaNs in the first column
df[0] = df[0].ffill()
# merge values in the second column
df = df.groupby(0)[1].apply(lambda x: ' '.join(x.dropna())).to_frame().rename_axis(None).T
print (df)
Address Classifications County License Number \
1 3217 Vagabond Dr Monroe, NC 28110 Residential Union 80479
Limitation Name Qualifiers Renewal Date \
1 Limited DLR Construction, LLC Arteaga, Vicky Rodriguez
Status Telephone
1 Valid (980) 245-0867
Replace the df line like below:
df = pd.DataFrame(company_info, columns = ['ic_number', 'status', 'renewal_date', 'company_name', 'address', 'county', 'telephon', 'limitation', 'residential_qualifiers'])
Each column name mentioned under columns should be within its own quotes; otherwise the whole string is treated as one single column name.
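A minimal illustration of the difference, on toy data:

import pandas as pd

# one string: a single column whose name is the entire string
print(pd.DataFrame([['x']], columns=['a, b']).columns.tolist())         # ['a, b']
# separate strings: two distinct columns
print(pd.DataFrame([['x', 'y']], columns=['a', 'b']).columns.tolist())  # ['a', 'b']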
I'm using Pandas as a way to write data from Selenium.
Two example results from a search box ac_results on a webpage:
#Search for product_id = "01"
ac_results = "Orange (10)"
#Search for product_id = "02"
ac_result = ["Banana (10)", "Banana (20)", "Banana (30)"]
Orange returns only one price ($10) while Banana returns a variable number of prices from different vendors, in this example three prices ($10), ($20), ($30).
The code uses regex via re.findall to grab each price and put them into a list. The code works fine as long as re.findall finds only one list item, as for Oranges.
The problem is when there is a variable number of prices, as when searching for Bananas. I would like to create a new row for each listed price, and the rows should also include product_id and item_name.
Current output:
product_id prices item_name
01 10 Orange
02 [u'10', u'20', u'30'] Banana
Desired output:
product_id prices item_name
01 10 Orange
02 10 Banana
02 20 Banana
02 30 Banana
Current code:
df = pd.read_csv("product_id.csv")

def crawl(product_id):
    # Enter search input here, omitted
    # Getting results:
    search_result = driver.find_element_by_class_name("ac_results")
    item_name = re.match("^.*(?=(\())", search_result.text).group().encode("utf-8")
    prices = re.findall("((?<=\()[0-9]*)", search_result.text)
    return pd.Series([prices, item_name])

df[["prices", "item_name"]] = df["product_id"].apply(crawl)
df.to_csv("write.csv", index=False)
FYI: Workable solution with csv module, but I want to use Pandas.
with open("write.csv", "a") as data_write:
wr_data = csv.writer(data_write, delimiter = ",")
for price in prices: #<-- This is the important part!
wr_insref.writerow([product_id, price, item_name])
# initializing here for reproducibility
pids = ['01','02']
prices = [10, [u'10', u'20', u'30']]
names = ['Orange','Banana']
df = pd.DataFrame({"product_id": pids, "prices": prices, "item_name": names})
The following snippet should work after your apply(crawl).
# convert all of the prices to lists (even if they only have one element)
df.prices = df.prices.apply(lambda x: x if isinstance(x, list) else [x])
# Create a new dataframe which splits the lists into separate columns.
# Then flatten using stack. The explicit MultiIndex allows us to keep
# the item_name and product_id associated with each price.
idx = pd.MultiIndex.from_tuples(list(zip(df['item_name'], df['product_id'])),
                                names=['item_name', 'product_id'])
df2 = pd.DataFrame(df.prices.tolist(), index=idx).stack()
# drop the hierarchical index and select columns of interest
df2 = df2.reset_index()[['product_id', 0, 'item_name']]
# rename back to prices
df2.columns = ['product_id', 'prices', 'item_name']
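On newer pandas versions (0.25+), DataFrame.explode offers a shorter route to the same reshaping; a sketch, assuming prices has already been normalized to lists as above:

# explode repeats product_id and item_name for each element of each list
df2 = df.explode('prices').reset_index(drop=True)
df2 = df2[['product_id', 'prices', 'item_name']]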
I was not able to run your code (probably missing inputs), but you can probably transform your prices list into a list of dicts and then build a DataFrame from there:
d = [{"price":10, "product_id":2, "item_name":"banana"},
{"price":20, "product_id":2, "item_name":"banana"},
{"price":10, "product_id":1, "item_name":"orange"}]
df = pd.DataFrame(d)
Then df is:
item_name price product_id
0 banana 10 2
1 banana 20 2
2 orange 10 1