Adding a value to a new column based on whether two other columns match - python

I have a dataframe (df1) that looks like this:
title                                             score  id      timestamp            Stock_name
Biocryst ($BCRX) continues to remain undervalued  120    mfuz84  2021-01-28 21:32:10
...and then continues for some 44,000 more rows. I have another dataframe (df2) that looks like this:
Company name                    Symbol
BioCryst Pharmaceuticals, Inc.  BCRX
GameStop                        GME
Apple Inc.                      AAPL
...containing all NASDAQ- and NYSE-listed stocks. What I want to do now is fill the "Stock_name" column in df1. In order to do this, I want to match df1['title'] against df2['Symbol'] and, based on which symbol has a match in the title, add the corresponding company name (df2['Company name']) to the df1['Stock_name'] column. If more than one stock is mentioned in the title, I want to use the first one mentioned.
Is there any easy way to do this?

I tried this with a small dataset and it works; let me know if you have any problems.
df1 = pd.DataFrame({"title" : ["Biocryst ($BCRX) continues to remain undervalued", "AAPL is good, buy it"], 'score' : [120,420] , 'Stock_name' : ["",""] })
df2 = pd.DataFrame({'Company name' : ['BioCryst Pharmaceuticals, Inc.','GameStop','Apple Inc.'], 'Symbol' : ["BCRX","GME","AAPL"]})
df1
title score Stock_name
0 Biocryst ($BCRX) continues to remain undervalued 120
1 AAPL is good, buy it 420
df2
Company name Symbol
0 BioCryst Pharmaceuticals, Inc. BCRX
1 GameStop GME
2 Apple Inc. AAPL
for j in range(len(df2)):
    for i in range(len(df1)):
        if df2['Symbol'][j] in df1['title'][i]:
            # .loc avoids chained assignment, which may silently fail to write
            df1.loc[i, 'Stock_name'] = df2['Symbol'][j]
df1
title score Stock_name
0 Biocryst ($BCRX) continues to remain undervalued 120 BCRX
1 AAPL is good, buy it 420 AAPL
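Note that the nested loop keeps whichever matching symbol comes last in df2's order, while the question asks for the first symbol mentioned in the title. A minimal position-based sketch of that variant (the \b word-boundary regex is an assumption about how symbols appear in titles):

import re

# one alternation of all symbols, longest first so a short symbol cannot
# shadow a longer one starting at the same position
symbols = sorted(df2['Symbol'], key=len, reverse=True)
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, symbols)) + r')\b')

def first_symbol(title):
    # re.search returns the leftmost match, i.e. the first symbol mentioned
    m = pattern.search(title)
    return m.group(1) if m else ''

df1['Stock_name'] = df1['title'].apply(first_symbol)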

First, I think you should create a dictionary based on df2.
symbol_lookup = dict(zip(df2['Symbol'],df2['Company name']))
Then you need a function that will parse the title column. If you can rely on stock symbols being preceded by a dollar sign, you can use the following:
def find_name(input_string):
    for symbol in input_string.split('$'):
        # if the first four characters form a stock symbol, return the name
        if symbol_lookup.get(symbol[:4]):
            return symbol_lookup.get(symbol[:4])
        # otherwise check the first three characters
        if symbol_lookup.get(symbol[:3]):
            return symbol_lookup.get(symbol[:3])
You could also write a function based on expecting the symbols to be in parentheses. If you can't rely on either, it would be more complicated.
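For instance, here is a minimal sketch of the parenthesized case, assuming titles write symbols like "($BCRX)" (the exact regex is an assumption about the title format):

import re

# 1-5 capital letters inside parentheses, with an optional leading dollar sign
paren_re = re.compile(r'\(\$?([A-Z]{1,5})\)')

def find_name_parens(title):
    m = paren_re.search(title)
    return symbol_lookup.get(m.group(1)) if m else None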
Finally, you can apply your function to the title column:
df1['Stock_name'] = df1['title'].apply(find_name)
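As a quick check, building symbol_lookup from the small df2 in the answer above:

print(find_name("Biocryst ($BCRX) continues to remain undervalued"))
# BioCryst Pharmaceuticals, Inc.  -- the chunk after '$' starts with 'BCRX'
print(find_name("AAPL is good, buy it"))
# Apple Inc.  -- no '$', but the whole string's first four chars are 'AAPL'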

Related

Python Pandas problem while assigning a value to a column cell using either "loc" or "at" method inside a loop

I have a data-frame (df) with the following columns:
source company category header content published_date sentiment
0 Forbes General Electric None Is New England Baking The Books On Oil-Fired C... The rise of natural gas as the primary fuel fo... 2014-01-01 0
1 Forbes General Electric None DARPA Is Building A Vanishing Battery: This Po... Considering that batteries are typically desig... 2014-01-02 0
2 Forbes General Electric None Four High-Yielding ETFs For Growth And Income Growth & income exchange-traded funds typicall... 2014-01-02 0
3 Forbes Citigroup None Analyst Moves: BAC, DUK, PZZA This morning, Citigroup upgraded shares of Ban... 2014-01-02 0
4 WSJ JPMorgan MARKETS JPMorgan Broker Barred for Role in Insider Tra... Finra says information about merger, acquisiti... 2014-01-02 0
The expected result that I should get, after assigning the new values to the sentiment column, is the following (over 23,000 rows in total):
source company category header content published_date sentiment
0 Forbes General Electric None Is New England Baking The Books On Oil-Fired C... The rise of natural gas as the primary fuel fo... 2014-01-01 -1
1 Forbes General Electric None DARPA Is Building A Vanishing Battery: This Po... Considering that batteries are typically desig... 2014-01-02 1
2 Forbes General Electric None Four High-Yielding ETFs For Growth And Income Growth & income exchange-traded funds typicall... 2014-01-02 0
3 Forbes Citigroup None Analyst Moves: BAC, DUK, PZZA This morning, Citigroup upgraded shares of Ban... 2014-01-02 -1
4 WSJ JPMorgan MARKETS JPMorgan Broker Barred for Role in Insider Tra... Finra says information about merger, acquisiti... 2014-01-02 1
The algorithm that I'm using to update the sentiment column cell values is shown below.
Note: I verified the updated values before and after using 'at' or 'loc' inside the loop 'for e in ch_ix:'. The cell values do change, but only inside that loop.
If I try to verify by printing dfd['sentiment'], the resulting values are still the same 0s:
dfd = db_data.copy()
for index in range(len(stock_list_company_name)):
    cont = dfd.loc[dfd["company"] == stock_list_company_name[index]]
    # stock_data is another df which contains the columns
    # 'ticker, date, closed, volume, sentiment' and has more rows than dfd
    cont2 = stock_data.loc[stock_data["ticker"] == ticker_list[index]]
    dates = cont2["date"].values
    for ix in range(len(dates)):
        if not cont.loc[cont["published_date"] == dates[ix]].empty:
            ch_ix = cont.loc[cont["published_date"] == dates[ix], "sentiment"].index
            for e in ch_ix:
                cont.at[e, "sentiment"] = cont2["sentiment"].values[ix]

print(dfd['sentiment'])  # values are still 0s
Can someone please help me figure out whether this is a memory issue or a chained-indexing problem? I still can't figure out why the values are not being updated.
Testing and running the code in this Google Colab => url
Inside the loop, you have this assignment:
cont = dfd.loc[dfd["company"] == stock_list_company_name[index]]
Note that this makes cont a new dataframe, not just a reference to a slice of the original dataframe. That's why when you change cont, it has no effect on dfd.
The easiest way to fix this is to assign the cont value back to the dfd slice, after you have made the changes to cont, still inside the loop:
dfd.loc[dfd["company"] == stock_list_company_name[index]] = cont
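Alternatively, you can avoid the intermediate frame entirely and write through dfd with a combined boolean mask; a minimal sketch under the same column names as above:

# update dfd in place: no intermediate copy is made, so the writes stick
for index in range(len(stock_list_company_name)):
    company_mask = dfd["company"] == stock_list_company_name[index]
    cont2 = stock_data.loc[stock_data["ticker"] == ticker_list[index]]
    for date, sent in zip(cont2["date"].values, cont2["sentiment"].values):
        dfd.loc[company_mask & (dfd["published_date"] == date), "sentiment"] = sent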

Pandas map two dataframes using regex

I have two dataframes, one with text information and another with regexes and patterns. What I need to do is map a column from the second dataframe using the regexes.
edit: What I need to do is apply each regex to all rows of df['text'], and if there is a match, add the Pattern to a new column.
Sample data
text_dict = {'text':['customer and increased repair and remodel activity as well as from other sales',
'sales for the overseas customers',
'marketing approach is driving strong play from top tier customers',
'employees in India have been the continuance of remote work will impact productivity',
'sales due to higher customer']}
regex_dict = {'Pattern':['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
'(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
'(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}
df
text
0 customer and increased repair and remodel acti...
1 sales for the overseas customers
2 marketing approach is driving strong play from...
3 employees in India have been the continuance o...
4 sales due to higher customer
regex
Pattern regex
0 Sales + customer (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1 Marketing + customer (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2 Employee * Productivity (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...
Desired output
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
I tried the following: I created a function that returns the Pattern in case there is a match, then I iterate over all the rows in the regex dataframe:
def finding_keywords(regex, match, keyword):
    if re.search(regex, match):
        return keyword
    else:
        pass

for index, row in regex.iterrows():
    df['Pattern'] = df['text'].apply(lambda x: finding_keywords(regex['regex'][index], x, regex['Pattern'][index]))
The problem with this is that every iteration erases the previous mappings, as you can see below. Since "I'm foo foo" was matched in the last iteration, it is the only row left with a pattern:
text Pattern
0 foo None
1 bar None
2 foo foo I'm foo foo
3 foo bar None
4 bar bar None
One solution could be to iterate over the regex dataframe and then over df; that way I avoid losing information, but I'm looking for a faster solution.
You can loop through the unique values of the regex dataframe, apply each one to the text column of df, and return the pattern in a new regex column. Then merge in the Pattern column and drop the regex column.
The key to my approach was to first create the column as NaN and then fillna with each iteration so the columns didn't get overwritten.
import re
import numpy as np

srs = regex['regex'].unique()
df['regex'] = np.nan

for reg in srs:
    df['regex'] = df['regex'].fillna(
        df['text'].apply(lambda x: reg if re.search(reg, x) else np.nan))

df = pd.merge(df, regex, how='left', on='regex').drop('regex', axis=1)
df
Out[1]:
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
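If you'd rather skip the helper column, an equivalent sketch makes one pass per row and stops at the first matching regex (same column names as above):

import re
import numpy as np

pairs = list(zip(regex['regex'], regex['Pattern']))

# next() yields the first Pattern whose regex matches, or NaN if none do
df['Pattern'] = df['text'].apply(
    lambda t: next((pat for rx, pat in pairs if re.search(rx, t)), np.nan))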

Merging two DataFrames (Datasets) on a specific ID column but with Date condition

I have two datasets:
One contains house energy certificates issued over the last 10 years, with an ID for the house and the date the certificate was issued. One house can have several certificates, since they can be renewed.
The other contains all house transactions over the last 10 years, with the same house ID as in the first dataset.
My problem is then to find the energy certificate value of the house on the date it was sold. I am able to merge the datasets on the house ID, but I'm not quite sure how to deal with the date column.
The energy certificates have the column "DateIssued" and the transactions have the column "OfficialDateSold". The condition would then be to find the energy certificate with the right house ID and the date closest to the sold date, but not after it.
Snippet of the dataframes:
Transactions:
address_id sold_date
0 1223632151 NaN
1 160073875 2013-09-24
2 160073875 2010-06-16
3 160073875 2009-08-05
4 160073875 2006-12-18
... ... ...
2792726 2147477357 2011-11-03
2792727 2147477357 2014-02-26
2792728 2147477579 2017-05-24
2792729 2147479054 2013-02-04
2792730 2147482539 1993-08-10
Energy Certificate
id certificate_number date_issued
0 1785963944 A2012-274656 27.11.2012 10:32:35
1 512265039 A2010-6435 30.06.2010 13:19:18
2 2003824679 A2014-459214 17.06.2014 11:00:47
3 1902877247 A2011-133593 14.10.2011 12:57:08
4 1620713314 A2009-266 25.12.2009 13:18:32
... ... ... ...
307846 753123775 A2019-1078357 30.11.2019 17:23:59
307847 1927124560 A2019-1078363 30.11.2019 20:44:22
307848 1122610963 A2019-1078371 30.11.2019 22:44:45
307849 28668673 A2019-1078373 30.11.2019 22:56:23
307850 1100393780 A2019-1078377 30.11.2019 23:38:42
I want output with the columns
id certificate_number date_issued sold_date
where id = address_id and date_issued <= sold_date, using the certificate closest to sold_date (the newest one issued before the sale).
(I know the dates must be in the same format.)
I am using Python with Jupyter Notebook.
I think you need merge_asof, but first it is necessary to convert the columns to datetimes with to_datetime and to remove rows with missing values in sold_date with DataFrame.dropna:
df1['sold_date'] = pd.to_datetime(df1['sold_date'])
df2['date_issued'] = pd.to_datetime(df2['date_issued'], dayfirst=True)

df1 = df1.dropna(subset=['sold_date'])

# transactions on the left, so each sale is matched to the newest certificate
# issued on or before its sold_date (direction='backward' is the default)
df = pd.merge_asof(df1.sort_values('sold_date'),
                   df2.sort_values('date_issued'),
                   left_on='sold_date',
                   right_on='date_issued',
                   left_by='address_id',
                   right_by='id')
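A quick check on made-up toy data (the IDs and dates below are hypothetical; merge_asof's default direction='backward' is what enforces date_issued <= sold_date):

import pandas as pd

df1 = pd.DataFrame({'address_id': [1, 1],
                    'sold_date': pd.to_datetime(['2010-06-16', '2013-09-24'])})
df2 = pd.DataFrame({'id': [1, 1],
                    'certificate_number': ['A2010-1', 'A2012-2'],
                    'date_issued': pd.to_datetime(['2010-01-05', '2012-03-01'])})

out = pd.merge_asof(df1.sort_values('sold_date'),
                    df2.sort_values('date_issued'),
                    left_on='sold_date', right_on='date_issued',
                    left_by='address_id', right_by='id')
# the 2010 sale picks up A2010-1, the 2013 sale picks up A2012-2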

Pandas: Replacing column values with ones as retrieved from other dataframe

I have stumbled upon a trivial problem in pandas. I have two dataframes. The first one, df_1, is as follows:
vendor_name  date           company_name    state
PERTH        is june 2019   Abc enterprise  Kentucky
Megan Ent    25-april-2019  Xyz Fincorp     Texas
The second one, df_2, contains the correct values for each column in df_1:
df_2
Field         wrong value     correct value
vendor_name   PERTH           Perth Enterprise
date          is              15    ## this means that "is" should be read as "15"
company_name  Abc enterprise  ABC International Enterprise Inc.
In order to replace the values in df_1 with the correct ones (except for the date field), I am using the pandas .loc method. Below is the code snippet:
vend = df_1['vendor_name'].tolist()
comp = df_1['company_name'].tolist()
state = df_1['state'].tolist()

for i in vend:
    if df_2['wrong value'].str.contains(i):
        crct = df_2.loc[df_2['wrong value'] == i, 'correct value'].tolist()
Similarly, for company and state I have followed the above approach.
However, crct is coming back as an empty list. Ideally it should return
['Perth Enterprise','Abc International Enterprise Inc']
The next step would be to replace the respective field values by the above list.
With the above, I have three questions:
Why is the above code generating a blank list? What am I missing here?
How can I replace the respective fields using the df_1.replace method?
What would be a correct approach to replace the portion of the date in df_1 with the correct one from df_2?
Edit: when the data has looping replacements (i.e. overlapping keys and values), replacement on the whole dataframe will fail. In that case, do it column by column and concat the results together. Finally, use join to add back any missing columns from df1:
df_replace = pd.concat([df1[k].replace(val, regex=True) for k, val in d.items()], axis=1).join(df1.state)
Original:
I tried your code in my interactive session and it raises ValueError: The truth value of a Series is ambiguous on df_2['wrong value'].str.contains(i).
Assuming you have multiple vendor names, the simple way is to construct a dictionary from a groupby of df2 and use it with df.replace on df1:
d = {k: gp.set_index('wrong value')['correct value'].to_dict()
     for k, gp in df2.groupby('Field')}
Out[64]:
{'company_name': {'Abc enterprise': 'ABC International Enterprise Inc. '},
'date': {'is': '15'},
'vendor_name': {'PERTH': 'Perth Enterprise'}}
df_replace = df1.replace(d, regex=True)
print(df_replace)
vendor_name date company_name \
0 Perth Enterprise 15 june 2019 ABC International Enterprise Inc.
1 Megan Ent 25-april-2019 Xyz Fincorp
state
0 Kentucky
1 Texas
Note: your sample df2 only has a value for vendor PERTH, so it only replaces the first row. When you have all the vendor_names in df2, it will replace them all in df1.
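One caveat with regex=True: a short key like 'is' is replaced wherever it occurs as a substring, not just as a whole word. If that bites, here is a sketch that anchors each key with word boundaries (the \b anchors and re.escape are my addition, not part of the original approach):

import re

d = {k: {r'\b{}\b'.format(re.escape(w)): c
         for w, c in gp.set_index('wrong value')['correct value'].items()}
     for k, gp in df2.groupby('Field')}

df_replace = df1.replace(d, regex=True)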
A simple way to do this is to iterate over the first dataframe and replace the wrong values:
Result = pd.DataFrame()

for i in range(len(df1)):
    vendor_name = df1.iloc[i]['vendor_name']
    date = df1.iloc[i]['date']
    company_name = df1.iloc[i]['company_name']
    if vendor_name in df2['wrong value'].values:
        vendor_name = df2.loc[df2['wrong value'] == vendor_name]['correct value'].values[0]
    if company_name in df2['wrong value'].values:
        company_name = df2.loc[df2['wrong value'] == company_name]['correct value'].values[0]
    new_row = {'vendor_name': [vendor_name], 'date': [date], 'company_name': [company_name]}
    new_row = pd.DataFrame(new_row, columns=['vendor_name', 'date', 'company_name'])
    Result = Result.append(new_row, ignore_index=True)
Result (note that this approach drops the state column):
        vendor_name           date                       company_name
0  Perth Enterprise   is june 2019  ABC International Enterprise Inc.
1         Megan Ent  25-april-2019                        Xyz Fincorp
Define the following replace function:
import re

def repl(row):
    fld = row.Field
    v1 = row['wrong value']
    v2 = row['correct value']
    updInd = df_1[df_1[fld].str.contains(v1)].index
    df_1.loc[updInd, fld] = df_1.loc[updInd, fld]\
        .str.replace(re.escape(v1), v2)
Then call it for each row in df_2:
for _, row in df_2.iterrows():
    repl(row)
Note that str.replace alone does not require importing re (pandas imports it under the hood). But the function above calls re.escape explicitly, from our code, hence the import re is required.

Removing characters from a string according to another column

I have a df like :
products price
abc|abc|abc|abc|abc 1|2|10|20|30
abc|abc|deg| 3|8|5
abc|abc|abc|abc|abc|abc 10|11|12|13|14|15|16|17|18
Explanation: each row is a basket bought by a customer.
All products are separated by '|', so for example the first customer (row) bought 5 products for $63.
So normally, a row contains the same number of '|'-separated items in both columns.
But as you can see, in the last row there are 6 products and 9 prices.
The problem comes from a 256-character limit: some products were not saved to the file, but we have all the prices for the products bought (provided, of course, that the price column itself doesn't exceed 256 characters!).
I would like to trim the price column to the number of '|'-separated items in the products column, and obtain a df like:
products price
abc|abc|abc|abc|abc 1|2|10|20|30
abc|abc|deg| 3|8|5
abc|abc|abc|abc|abc|abc 10|11|12|13|14|15
I tried this:
def count_fx(s):
    return s.count('|')

max_prod = max(df['products'].apply(count_fx))
df.ix[np.logical_and(df.products.str.len()==255), ['products']] = df['products'].str.rpartition('|',max_prod)[0]
But it doesn't work.
Do you know any solution?
Thanks
Use str.split('|', expand=True) to create a mask on price, then recombine:
have_products = df.products.str.split('|', expand=True).notnull()
get_price = df.price.str.split('|', expand=True)[have_products]
df.price = get_price.apply(lambda x: '|'.join(x.dropna().astype(str)), axis=1)
print(df)
products price
0 abc|abc|abc|abc|abc 1|2|10|20|30
1 abc|abc|deg| 3|8|5
2 abc|abc|abc|abc|abc|abc 10|11|12|13|14|15
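An equivalent sketch without the intermediate mask frame, assuming plain truncation is what's wanted (extra prices are dropped; shorter price lists are left alone):

# keep at most as many prices per row as there are products
n_items = df['products'].str.count(r'\|') + 1
df['price'] = ['|'.join(p.split('|')[:n])
               for p, n in zip(df['price'], n_items)]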
