I have a df like:
products                 price
abc|abc|abc|abc|abc      1|2|10|20|30
abc|abc|deg|             3|8|5
abc|abc|abc|abc|abc|abc  10|11|12|13|14|15|16|17|18
Explanation: each row is a basket bought by one customer.
All products are separated by '|', so for example
the first customer (row) took 5 products for $63.
So normally, both columns of a row contain the same number of '|'.
But as you can see, the last row has 6 products and 9 prices.
The problem comes from a 256-character limit: some products are not saved to the file, but we have every price for the products bought (provided, of course, that the price column itself doesn't exceed 256 characters!).
I would like to cut the price column down to the number of '|'-separated entries in the products column and obtain a df like:
products                 price
abc|abc|abc|abc|abc      1|2|10|20|30
abc|abc|deg|             3|8|5
abc|abc|abc|abc|abc|abc  10|11|12|13|14|15
I tried this:
def count_fx(s):
    return s.count('|')

max_prod = max(df['products'].apply(count_fx))
df.ix[np.logical_and(df.products.str.len()==255), ['products']] = df['products'].str.rpartition('|', max_prod)[0]
But it doesn't work.
Do you know any solution?
Thanks
Use str.split('|', expand=True) to create a mask from products, select the matching prices, then recombine:
have_products = df.products.str.split('|', expand=True).notnull()
get_price = df.price.str.split('|', expand=True)[have_products]
df.price = get_price.apply(lambda x: '|'.join(x.dropna().astype(str)), axis=1)
print(df)
products price
0 abc|abc|abc|abc|abc 1|2|10|20|30
1 abc|abc|deg| 3|8|5
2 abc|abc|abc|abc|abc|abc 10|11|12|13|14|15
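In case it helps to see it end to end, here is a self-contained sketch of the same idea; the one change is that the mask is explicitly reindexed to the price columns so the two shapes always line up (column names are taken from the question):
import pandas as pd

df = pd.DataFrame({
    'products': ['abc|abc|abc|abc|abc', 'abc|abc|deg|', 'abc|abc|abc|abc|abc|abc'],
    'price': ['1|2|10|20|30', '3|8|5', '10|11|12|13|14|15|16|17|18'],
})

# one column per price; rows with fewer prices get NaN
prices_wide = df['price'].str.split('|', expand=True)
# True wherever a product exists, aligned to the price columns
have_products = (df['products'].str.split('|', expand=True)
                 .notnull()
                 .reindex(columns=prices_wide.columns, fill_value=False))
# drop prices with no matching product, then re-join per row
df['price'] = prices_wide[have_products].apply(
    lambda row: '|'.join(row.dropna()), axis=1)
print(df)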
I created a dict called applesdf and built a dataframe from it with columns named ID, Item, Price and Location. My goal is to find the max price per item and then indicate where the larger price was.
For example:
applesdf = {'ID': [1, 1],
            'Item': ['Apple', 'Apple'],
            'Price': [2, 1],
            'Location': [1001, 1002]}
df = pd.DataFrame(applesdf, columns=['ID', 'Item', 'Price', 'Location'])
df
My dataframe:
   ID   Item  Price  Location
0   1  Apple      2      1001
1   1  Apple      1      1002
So I run a sort_values function to group the apples together and find the max price.
My code:
applesmax = df.sort_values('Price').drop_duplicates(['ID', 'Item'], keep='last').sort_index().reset_index(drop=True)
applesmax
Result:
   ID   Item  Price  Location
0   1  Apple      2      1001
The problem is when we have the same price.
Same price table:
   ID   Item  Price  Location
0   1  Apple      2      1001
1   1  Apple      2      1002
The code would return the last record, obviously, as instructed. I was wondering if anybody had tips/documentation on how I could instruct the program to return both locations, to indicate that neither location had a larger price.
My expected output:
   ID   Item  Price   Location
0   1  Apple      2  1001,1002
So I created a small function that I think does what you seek. Bear with me, as it might not be optimised:
def max_price(dataframe, item):
    df_item = dataframe[dataframe['Item'] == item]  # get a dataframe that only has the item you want
    max_price = df_item['Price'].max()
    if len(df_item[df_item['Price'] == max_price]) > 1:  # more than one row carries the max price
        return pd.DataFrame([[item, max_price, 'TIE']], columns=['Item', 'Price', 'Location'])  # return the tie
    else:
        return df_item[df_item['Price'] == max_price]  # else return the row where the max price happens
Here is an example:
df = pd.DataFrame(list(zip(['Apple', 'Apple', 'Banana'], [2, 2, 4], [1001, 1002, 1003])),
                  columns=['Item', 'Price', 'Location'])
max_price(df, 'Apple')
This way you also don't need to sort the original dataframe.
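If you want the exact expected output (both locations joined with a comma rather than a 'TIE' marker), one possible sketch keeps every row at the per-item max price via groupby/transform and then joins the tying locations; this assumes the two-row tie df from the question:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1], 'Item': ['Apple', 'Apple'],
                   'Price': [2, 2], 'Location': [1001, 1002]})

# keep all rows whose Price equals the per-(ID, Item) maximum
at_max = df[df['Price'] == df.groupby(['ID', 'Item'])['Price'].transform('max')]
# join the tying locations into one comma-separated string
applesmax = (at_max.groupby(['ID', 'Item', 'Price'])['Location']
             .apply(lambda s: ','.join(s.astype(str)))
             .reset_index())
print(applesmax)  # one row: 1, Apple, 2, '1001,1002'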
I have a dataframe (df1) that looks like this;
title                                             score  id      timestamp            Stock_name
Biocryst ($BCRX) continues to remain undervalued  120    mfuz84  2021-01-28 21:32:10
...and it continues for some 44,000 more rows. I have another dataframe (df2) that looks like this;
Company name                    Symbol
BioCryst Pharmaceuticals, Inc.  BCRX
GameStop                        GME
Apple Inc.                      AAPL
...containing all NASDAQ- and NYSE-listed stocks. What I want to do now, however, is add the stock's symbol to the "Stock_name" column in df1. To do this, I want to match df1['title'] against df2['Symbol'] and then, based on which symbol has a match in the title, add the corresponding stock name (df2['Company name']) to the df1['Stock_name'] column. If there is more than one stock name in the title, I want to use the first one mentioned.
Is there any easy way to do this?
I tried this with a small dataset and it's working; let me know if you have any problems:
df1 = pd.DataFrame({"title" : ["Biocryst ($BCRX) continues to remain undervalued", "AAPL is good, buy it"], 'score' : [120,420] , 'Stock_name' : ["",""] })
df2 = pd.DataFrame({'Company name' : ['BioCryst Pharmaceuticals, Inc.','GameStop','Apple Inc.'], 'Symbol' : ["BCRX","GME","AAPL"]})
df1
title score Stock_name
0 Biocryst ($BCRX) continues to remain undervalued 120
1 AAPL is good, buy it 420
df2
Company name Symbol
0 BioCryst Pharmaceuticals, Inc. BCRX
1 GameStop GME
2 Apple Inc. AAPL
for j in range(len(df2)):
    for i in range(len(df1)):
        if df2['Symbol'][j] in df1['title'][i]:
            df1.loc[i, 'Stock_name'] = df2['Symbol'][j]  # .loc avoids chained-assignment warnings
df1
title score Stock_name
0 Biocryst ($BCRX) continues to remain undervalued 120 BCRX
1 AAPL is good, buy it 420 AAPL
First, I think you should create a dictionary based on df2.
symbol_lookup = dict(zip(df2['Symbol'],df2['Company name']))
Then you need a function that will parse the title column. If you can rely on stock symbols being preceded by a dollar sign, you can use the following:
def find_name(input_string):
    for symbol in input_string.split('$'):
        # if the first four characters form
        # a stock symbol, return the name
        if symbol_lookup.get(symbol[:4]):
            return symbol_lookup.get(symbol[:4])
        # otherwise check the first three characters
        if symbol_lookup.get(symbol[:3]):
            return symbol_lookup.get(symbol[:3])
You could also write a function based on expecting the symbols to be in parentheses. If you can't rely on either, it would be more complicated.
Finally, you can apply your function to the title column:
df1['Stock_name'] = df1['title'].apply(find_name)
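A possible vectorized variant of the same dollar-sign idea (a sketch, reusing df1, df2 and symbol_lookup from above): build one alternation regex from the symbols and let str.extract take the first match per title, which also satisfies the "use the first one mentioned" requirement. Like the function above, it relies on symbols being preceded by '$':
import re

# e.g. r'\$(BCRX|GME|AAPL)'
pattern = r'\$(' + '|'.join(map(re.escape, df2['Symbol'])) + r')'
# str.extract returns the first captured match per row (NaN if none)
df1['Stock_name'] = df1['title'].str.extract(pattern, expand=False).map(symbol_lookup)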
I have two dataframes, one with text information and another with regexes and patterns. What I need to do is map a column from the second dataframe onto the first using the regexes.
Edit: I need to apply each regex to all rows of df['text'] and, if there is a match, add the Pattern to a new column.
Sample data
text_dict = {'text': ['customer and increased repair and remodel activity as well as from other sales',
                      'sales for the overseas customers',
                      'marketing approach is driving strong play from top tier customers',
                      'employees in India have been the continuance of remote work will impact productivity',
                      'sales due to higher customer']}
regex_dict = {'Pattern': ['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
              'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
                        '(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
                        '(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}
df
text
0 customer and increased repair and remodel acti...
1 sales for the overseas customers
2 marketing approach is driving strong play from...
3 employees in India have been the continuance o...
4 sales due to higher customer
regex
Pattern regex
0 Sales + customer (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1 Marketing + customer (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2 Employee * Productivity (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...
Desired output
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
I tried the following: I created a function that returns the Pattern in case there is a match, then I iterate over all the rows in the regex dataframe:
def finding_keywords(regex, match, keyword):
    if re.search(regex, match):
        return keyword
    else:
        pass

for index, row in regex.iterrows():
    df['Pattern'] = df['text'].apply(lambda x: finding_keywords(regex['regex'][index], x, regex['Pattern'][index]))
The problem with this is that every iteration erases the previous mappings, as you can see below. As "I'm foo foo" was matched in the last iteration, it is the only row left with a pattern:
text Pattern
0 foo None
1 bar None
2 foo foo I'm foo foo
3 foo bar None
4 bar bar None
One solution could be to run the iteration over the regex dataframe and then iterate over df; this way I avoid losing information, but I'm looking for a faster solution.
You can loop through the unique values of the regex dataframe, apply each one to the text of df, and record the matching regex in a new regex column. Then merge in the Pattern column and drop the regex column.
The key to my approach was to first create the column as NaN and then fillna with each iteration so the columns didn't get overwritten.
import re
import numpy as np

srs = regex['regex'].unique()
df['regex'] = np.nan
for reg in srs:
    df['regex'] = df['regex'].fillna(
        df['text'].apply(lambda x: reg if re.search(reg, x) else np.nan))
df = pd.merge(df, regex, how='left', on='regex').drop('regex', axis=1)
df
Out[1]:
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
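A slightly different sketch, in case it is useful: skip the merge entirely and take the first matching Pattern per row (this assumes the same df and regex frames as above):
import re
import numpy as np

def first_pattern(text):
    # return the Pattern of the first regex that matches, else NaN
    for pat, rgx in zip(regex['Pattern'], regex['regex']):
        if re.search(rgx, text):
            return pat
    return np.nan

df['Pattern'] = df['text'].apply(first_pattern)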
Let's say we have ratings and books tables.
RATINGS
User-ID ISBN Book-Rating
244662 0373630689 7
19378 0812515595 10
238625 0441892604 9
180315 0140439072 0
242471 3548248950 0
BOOKS
ISBN Book-Title Book-Author Year-Of-Publication Publisher
0393000753 A Reckoning May Sarton 1981 W W Norton
Since many of the books have the same name and author but different publishers and years of publication, I want to group them by title and replace each ISBN in the ratings table with the ISBN of the first row of its group.
More concretely, if the grouping looks like this
Book-Name  ISBN
Name1      A
           B
           C
Name2      D
           E
Name3      F
           G
and the ratings like
User-ID ISBN Book-Rating
X B 3
X E 6
Y D 1
Z F 8
I want ratings to look like
User-ID ISBN Book-Rating
X A 3
X D 6
Y D 1
Z G 8
to save memory needed for pivot_table. The data set can be found here.
My attempt was along the lines of
book_rating_view = ratings.merge(books, how='left', on='ISBN').groupby(['Book-Title'])['ISBN']
ratings['ISBN'].replace(ratings['ISBN'], pd.Series([book_rating_view.get_group(key).min() for key,_ in book_rating_view]))
which doesn't seem to work. Another attempt was to construct the pivot_table directly as
isbn_vector = books.groupby(['Book-Title']).first()
utility = pd.DataFrame(0, index=explicit_ratings['User-ID'], columns=users['User-ID'])
for name, group in explicit_ratings.groupby('User-ID'):
    user_vector = pd.DataFrame(0, index=isbn_vector, columns=[name])
    for row, index in group:
        user_vector[books.groupby(['Book-Title']).get_group(row['ISBN']).first()] = row['Book-Rating']
    utility.join(user_vector)
which leads to a MemoryError, even though reduced table should fit into the memory.
Thanks for any advice!
I'd like you to show us the BOOKS dataframe a little more, and above all the desired output, but how about the below? (Even though I usually don't recommend storing data as lists in a dataframe...)
Say df1 = RATINGS, df2 = BOOKS,
dfm = df2.merge(df1, on='ISBN').groupby('Book-Title').agg(list)
dfm['Book-Rating'] = dfm['Book-Rating'].map(sum)
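For the replacement you actually describe (mapping every ISBN to the first ISBN of its title group), a minimal sketch could look like this, assuming ratings and books as above:
# first ISBN per title, aligned to every row of books
first_isbn = books.groupby('Book-Title')['ISBN'].transform('first')
# ISBN -> representative ISBN of its title group
isbn_map = dict(zip(books['ISBN'], first_isbn))
# remap; ISBNs missing from books are left untouched
ratings['ISBN'] = ratings['ISBN'].map(isbn_map).fillna(ratings['ISBN'])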
I'm using Pandas as a way to write data from Selenium.
Two example results from a search box ac_results on a webpage:
#Search for product_id = "01"
ac_results = "Orange (10)"
#Search for product_id = "02"
ac_results = ["Banana (10)", "Banana (20)", "Banana (30)"]
Orange returns only one price ($10) while Banana returns a variable number of prices from different vendors, in this example three prices ($10), ($20), ($30).
The code uses regex via re.findall to grab each price and put them into a list. The code works fine as long as re.findall finds only one list item, as for Oranges.
The problem is when there is a variable number of prices, as when searching for Bananas. I would like to create a new row for each stated price, and the rows should also include product_id and item_name.
Current output:
product_id prices item_name
01 10 Orange
02 [u'10', u'20', u'30'] Banana
Desired output:
product_id prices item_name
01 10 Orange
02 10 Banana
02 20 Banana
02 30 Banana
Current code:
df = pd.read_csv("product_id.csv")
def crawl(product_id):
    #Enter search input here, omitted
    #Getting results:
    search_result = driver.find_element_by_class_name("ac_results")
    item_name = re.match("^.*(?=(\())", search_result.text).group().encode("utf-8")
    prices = re.findall("((?<=\()[0-9]*)", search_result.text)
    return pd.Series([prices, item_name])
df[["prices", "item_name"]] = df["product_id"].apply(crawl)
df.to_csv("write.csv", index=False)
FYI: Workable solution with csv module, but I want to use Pandas.
with open("write.csv", "a") as data_write:
    wr_data = csv.writer(data_write, delimiter=",")
    for price in prices:  # <-- This is the important part!
        wr_data.writerow([product_id, price, item_name])
# initializing here for reproducibility
pids = ['01','02']
prices = [10, [u'10', u'20', u'30']]
names = ['Orange','Banana']
df = pd.DataFrame({"product_id": pids, "prices": prices, "item_name": names})
The following snippet should work after your apply(crawl).
# convert all of the prices to lists (even if they only have one element)
df.prices = df.prices.apply(lambda x: x if isinstance(x, list) else [x])
# Create a new dataframe which splits the lists into separate columns.
# Then flatten using stack. The explicit MultiIndex allows us to keep
# the item_name and product_id associated with each price.
idx = pd.MultiIndex.from_tuples(list(zip(df['item_name'], df['product_id'])),
                                names=['item_name', 'product_id'])
df2 = pd.DataFrame(df.prices.tolist(), index=idx).stack()
# drop the hierarchical index and select columns of interest
df2 = df2.reset_index()[['product_id', 0, 'item_name']]
# rename back to prices
df2.columns = ['product_id', 'prices', 'item_name']
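On newer pandas (0.25+), DataFrame.explode offers a shorter route to the same result; a sketch, reusing the df built above:
# make every prices entry a list, then give each list element its own row
df2 = (df.assign(prices=df['prices'].apply(lambda x: x if isinstance(x, list) else [x]))
         .explode('prices')
         .reset_index(drop=True)
         [['product_id', 'prices', 'item_name']])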
I was not able to run your code (probably missing inputs), but you can probably transform your prices list into a list of dicts and then build a DataFrame from there:
d = [{"price":10, "product_id":2, "item_name":"banana"},
{"price":20, "product_id":2, "item_name":"banana"},
{"price":10, "product_id":1, "item_name":"orange"}]
df = pd.DataFrame(d)
Then df is:
item_name price product_id
0 banana 10 2
1 banana 20 2
2 orange 10 1
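For completeness, a hypothetical sketch of how such a list of dicts could be assembled inside the crawl step (product_id, item_name and prices stand in for the values scraped above):
rows = []
for price in prices:  # one dict per scraped price
    rows.append({"price": price, "product_id": product_id, "item_name": item_name})
df = pd.DataFrame(rows)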