I create a dataframe called applesdf with columns named ID, Item, Price and Location. My goal is to find the max price per Item and then indicate where the higher price was.
For example:
import pandas as pd

applesdf = {'ID': [1, 1],
            'Item': ['Apple', 'Apple'],
            'Price': [2, 1],
            'Location': [1001, 1002]}
df = pd.DataFrame(applesdf, columns=['ID', 'Item', 'Price', 'Location'])
df
My dataframe:
   ID   Item  Price  Location
0   1  Apple      2      1001
1   1  Apple      1      1002
So I run sort_values together with drop_duplicates to group the apples and keep the max price.
My code:
applesmax = df.sort_values('Price').drop_duplicates(['ID', 'Item'], keep='last').sort_index().reset_index(drop=True)
applesmax
Result:
   ID   Item  Price  Location
0   1  Apple      2      1001
The problem arises when two rows share the same max price.
Same price table:
   ID   Item  Price  Location
0   1  Apple      2      1001
1   1  Apple      2      1002
The code returns the last record, obviously, as instructed. I was wondering if anybody had tips or documentation on how I could instruct the program to return both locations, to indicate that neither location had a higher price.
My expected output:
   ID   Item  Price   Location
0   1  Apple      2  1001,1002
So I created a small function that I think does what you seek. Bear with me, as it might not be optimised:
def max_price(dataframe, item):
    df_item = dataframe[dataframe['Item'] == item]  # get a dataframe that only has the item you want
    max_price = df_item['Price'].max()
    if len(df_item[df_item['Price'] == max_price]) > 1:  # more than one row has a price equal to the max
        return pd.DataFrame([[item, max_price, 'TIE']], columns=['Item', 'Price', 'Location'])  # return the tie
    else:
        return df_item[df_item['Price'] == max_price]  # else return the row where the max price happens
Here is an example:
df = pd.DataFrame(list(zip(['Apple', 'Apple', 'Banana'], [2, 2, 4], [1001, 1002, 1003])),
                  columns=['Item', 'Price', 'Location'])
max_price(df, 'Apple')
This way you also don't need to sort the original dataframe.
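If you want the literal 1001,1002 output from the question rather than a TIE marker, here is a sketch of a groupby-based alternative (assuming the same column names as above): keep every row that ties for its group's max price, then join the tied locations into one string.

import pandas as pd

df = pd.DataFrame({'ID': [1, 1], 'Item': ['Apple', 'Apple'],
                   'Price': [2, 2], 'Location': [1001, 1002]})

# Keep every row whose price equals its (ID, Item) group's max price.
max_rows = df[df['Price'] == df.groupby(['ID', 'Item'])['Price'].transform('max')]

# Collapse the tied locations into a single comma-separated string.
out = (max_rows.groupby(['ID', 'Item', 'Price'], as_index=False)['Location']
               .agg(lambda s: ','.join(s.astype(str))))
print(out)
#    ID   Item  Price   Location
# 0   1  Apple      2  1001,1002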
I have a dataframe (df1) that looks like this:

title                                              score  id      timestamp            Stock_name
Biocryst ($BCRX) continues to remain undervalued   120    mfuz84  2021-01-28 21:32:10
...and it continues for some 44,000 more rows. I have another dataframe (df2) that looks like this:

Company name                    Symbol
BioCryst Pharmaceuticals, Inc.  BCRX
GameStop                        GME
Apple Inc.                      AAPL

...containing all Nasdaq- and NYSE-listed stocks. What I want to do now, however, is add the stock's symbol to the "Stock_name" column in df1. To do this, I want to match df1['title'] against df2['Symbol'] and, based on which symbol matches in the title, add the corresponding stock name (df2['Company name']) to the df1['Stock_name'] column. If more than one stock name appears in the title, I want to use the first one mentioned.
Is there any easy way to do this?
I tried this with a little dataset and it's working; let me know if you have any problems.
df1 = pd.DataFrame({"title" : ["Biocryst ($BCRX) continues to remain undervalued", "AAPL is good, buy it"], 'score' : [120,420] , 'Stock_name' : ["",""] })
df2 = pd.DataFrame({'Company name' : ['BioCryst Pharmaceuticals, Inc.','GameStop','Apple Inc.'], 'Symbol' : ["BCRX","GME","AAPL"]})
df1
title score Stock_name
0 Biocryst ($BCRX) continues to remain undervalued 120
1 AAPL is good, buy it 420
df2
Company name Symbol
0 BioCryst Pharmaceuticals, Inc. BCRX
1 GameStop GME
2 Apple Inc. AAPL
for j in range(len(df2)):
    for i in range(len(df1)):
        if df2['Symbol'][j] in df1['title'][i]:
            df1.loc[i, 'Stock_name'] = df2['Symbol'][j]  # .loc avoids chained-assignment warnings
df1
title score Stock_name
0 Biocryst ($BCRX) continues to remain undervalued 120 BCRX
1 AAPL is good, buy it 420 AAPL
First, I think you should create a dictionary based on df2.
symbol_lookup = dict(zip(df2['Symbol'],df2['Company name']))
Then you need a function that will parse the title column. If you can rely on stock symbols being preceded by a dollar sign, you can use the following:
def find_name(input_string):
    for symbol in input_string.split('$'):
        # if the first four characters form a stock symbol, return the name
        if symbol_lookup.get(symbol[:4]):
            return symbol_lookup.get(symbol[:4])
        # otherwise check the first three characters
        if symbol_lookup.get(symbol[:3]):
            return symbol_lookup.get(symbol[:3])
You could also write a function based on expecting the symbols to be in parentheses. If you can't rely on either, it would be more complicated.
Finally, you can apply your function to the title column:
df1['Stock_name'] = df1['title'].apply(find_name)
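As a quick check with the df1 and df2 defined in the earlier answer (output hand-traced, not from a live session):

print(df1[['title', 'Stock_name']])
#                                                title                      Stock_name
# 0  Biocryst ($BCRX) continues to remain undervalued  BioCryst Pharmaceuticals, Inc.
# 1                               AAPL is good, buy it                      Apple Inc.

Note that the second row only matches because the title happens to start with "AAPL"; with no dollar sign in the text, the function only inspects the leading characters of each '$'-separated chunk.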
Say I had a csv file with 3 columns, 'name', 'price' and 'color'.
How could I go about getting the name of, say, the most expensive blue item, the most expensive red, and the most expensive yellow?
Would really appreciate any help :)
Our plan is to find the class we want (say, "blue" items) and then find the most expensive one (the maximum in the price column).
Let's define an example DataFrame:
import pandas as pd
df = pd.DataFrame({
    'name': [a for a in "abcdef"],
    'price': [1.5, 3.8, 1.4, 5.9, 3.5, 1.9],
    'color': ['blue', 'red', 'yellow', 'blue', 'red', 'yellow']
}).set_index('name')
And here is our DataFrame:
price color
name
a 1.5 blue
b 3.8 red
c 1.4 yellow
d 5.9 blue
e 3.5 red
f 1.9 yellow
To do the first part (find the items of a specific color), we can use boolean indexing. So the following will select the blue items and save them to blue_items.
blue_items = df[df.color == "blue"]  # selects the slice of df in which df.color equals "blue"
Then we can get the index of the maximum price (as I've defined name to be the index column, it'll return the name):
blue_items["price"].idxmax()
The complete code (now considering you're importing a CSV file):
import pandas as pd
df = pd.read_csv("filename.csv", index_col="name")
most_exp_blue = df[df.color == "blue"]["price"].idxmax() # the most expensive blue
most_exp_red = df[df.color == "red"]["price"].idxmax() # the most expensive red
most_exp_yellow = df[df.color == "yellow"]["price"].idxmax() # the most expensive yellow
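If there are many colors, a dict comprehension (a sketch, assuming every color occurs at least once in the file) avoids repeating that line per color:

most_expensive = {color: df[df.color == color]["price"].idxmax()
                  for color in df["color"].unique()}
# with the example data above: {'blue': 'd', 'red': 'b', 'yellow': 'f'}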
Use pandas. You need to filter by color and sort by price:
df[df.color == 'color2'].sort_values(by='price', ascending=False).iloc[0]
Here is a sample:
d = [dict(name='nm1', price=100, color='color1'),
     dict(name='nm2', price=200, color='color2'),
     dict(name='nm3', price=300, color='color3'),
     dict(name='nm4', price=400, color='color2')]
df = pd.DataFrame.from_dict(d)
Dataframe example:
   name  price   color
0   nm1    100  color1
1   nm2    200  color2
2   nm3    300  color3
3   nm4    400  color2
Check each item one at a time: look at its colour, then compare its price against the most expensive price seen so far for that colour. If the price is bigger, record the new price and name; if not, move on to the next item.
import csv

largest = {}
with open('names.csv', newline='') as csvfile:
    data = csv.DictReader(csvfile)
    for row in data:
        colour = row['colour']
        price = float(row['price'])  # compare numerically, not as strings
        if colour in largest:
            if price > largest[colour]['price']:  # new largest price for this colour
                largest[colour]['price'] = price
                largest[colour]['name'] = row['name']
        else:  # colour not seen before, record it as the current largest
            largest[colour] = {'price': price, 'name': row['name']}
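After the loop, largest maps each colour to its most expensive item, so the names can be read back directly (assuming the columns are named colour, price and name as above):

for colour, info in largest.items():
    print(colour, info['name'], info['price'])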
Example:
if your data is like this:
data={"name":['A-Blue','B-Blue','C-Blue','A-Red','B-Red','C-Red','A-Yellow','B-Yellow','C-Yellow'],
"price":[100,200,300,200,100,300,300,300,100],
"color":['Blue','Blue','Blue','Red','Red','Red','Yellow','Yellow','Yellow']}
then create the pandas DataFrame first with the command below:
pdf = pd.DataFrame(data, columns=['name', 'price', 'color'])
Now get the index of the records with the below command:
pdf.groupby("color")["price"].idxmax()
(Remember to use argmax instead of idxmax on older pandas versions.)
Now wrap pdf.iloc[] around it to get the complete row with the max value for each color:
pdf.iloc[pdf.groupby("color")["price"].idxmax()]
To reset the index, add reset_index to the command. So the final answer is:
pdf.iloc[pdf.groupby("color")["price"].idxmax()].reset_index(drop=True)
Final output:
       name  price   color
0    C-Blue    300    Blue
1     C-Red    300     Red
2  A-Yellow    300  Yellow
(Even if you have a duplicate max price, only the first record will appear, like A-Yellow here.)
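If you instead want to keep all tied rows rather than only the first, one possible sketch compares each price with its group's max via transform:

pdf[pdf["price"] == pdf.groupby("color")["price"].transform("max")]
#        name  price   color
# 2    C-Blue    300    Blue
# 5     C-Red    300     Red
# 6  A-Yellow    300  Yellow
# 7  B-Yellow    300  Yellow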
Given I have the following csv data.csv:
id,category,price,source_id
1,food,1.00,4
2,drink,1.00,4
3,food,5.00,10
4,food,6.00,10
5,other,2.00,7
6,other,1.00,4
I want to group the data by (price, source_id), and I am doing it with the following code:
import pandas as pd

df = pd.read_csv('data.csv', names=['id', 'category', 'price', 'source_id'])
grouped = df.groupby(['price', 'source_id'])
valid_categories = ['food', 'drink']

for price_source, group in grouped:
    if group.category.size < 2:
        continue
    categories = group.category.tolist()
    if 'other' in categories and len(set(categories).intersection(valid_categories)) > 0:
        pass
        """
        Valid data in this case is:
        1,food,1.00,4
        2,drink,1.00,4
        6,other,1.00,4
        I will need all of the above data including the id for other purposes
        """
Is there an alternative way to perform the above filtering in pandas before the for loop, and if so, will it be any faster?
The criteria for filtering are:

- the size of the group is greater than 1
- the group should contain the category other and at least one of either food or drink
You could directly apply a custom filter to the GroupBy object, something like
crit = lambda x: all((len(x) > 1,  # number of rows; x.size would count rows * columns here
                      'other' in x.category.values,
                      set(x.category) & {'food', 'drink'}))
df.groupby(['price', 'source_id']).filter(crit)
Outputs
category id price source_id
0 food 1 1.0 4
1 drink 2 1.0 4
5 other 6 1.0 4
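Since you said you also need the ids afterwards, they are right there on the filtered frame. For example (assuming you store the result in a variable named valid):

valid = df.groupby(['price', 'source_id']).filter(crit)
print(valid['id'].tolist())  # [1, 2, 6]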
I have a df like :
products price
abc|abc|abc|abc|abc 1|2|10|20|30
abc|abc|deg| 3|8|5
abc|abc|abc|abc|abc|abc 10|11|12|13|14|15|16|17|18
Explanation: each row is a basket bought by a customer.
All products are separated by '|', so for example
the first customer (row) bought 5 products for $63.
Normally, both columns of a row contain the same number of '|'-separated entries.
But as you can see, in the last row there are 6 products and 9 prices.
The problem comes from a 256-character limit: some products were not saved to the file, but we have all the prices for the products bought (provided, of course, that the price column doesn't exceed 256 characters itself!).
I would like to truncate Price to the number of '|'-separated entries in the products column and obtain a df like:
products price
abc|abc|abc|abc|abc 1|2|10|20|30
abc|abc|deg| 3|8|5
abc|abc|abc|abc|abc|abc 10|11|12|13|14|15
I tried this:

def count_fx(s):
    return s.count('|')

max_prod = max(df['products'].apply(count_fx))
df.ix[np.logical_and(df.products.str.len() == 255), ['products']] = df['products'].str.rpartition('|', max_prod)[0]

But it doesn't work. Do you know any solution?
Thanks
Use str.split('|', expand=True) to create a mask on price then recombine:
have_products = df.products.str.split('|', expand=True).notnull()
get_price = df.price.str.split('|', expand=True)[have_products]
df.price = get_price.apply(lambda x: '|'.join(x.dropna().astype(str)), axis=1)
print(df)
products price
0 abc|abc|abc|abc|abc 1|2|10|20|30
1 abc|abc|deg| 3|8|5
2 abc|abc|abc|abc|abc|abc 10|11|12|13|14|15
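An alternative sketch of the same idea, without building the intermediate mask (assuming the products/price columns as above): count the '|' separators per row and truncate each price string to that many entries.

n_products = df["products"].str.count(r"\|") + 1  # number of product slots per row
df["price"] = ["|".join(p.split("|")[:n])
               for p, n in zip(df["price"], n_products)]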
I'm using Pandas as a way to write data from Selenium.
Two example results from a search box ac_results on a webpage:
# Search for product_id = "01"
ac_results = "Orange (10)"

# Search for product_id = "02"
ac_results = ["Banana (10)", "Banana (20)", "Banana (30)"]
Orange returns only one price ($10) while Banana returns a variable number of prices from different vendors, in this example three: ($10), ($20), ($30).
The code uses regex via re.findall to grab each price and put it into a list. The code works fine as long as re.findall finds only one list item, as for Oranges.
The problem is when there is a variable number of prices, as when searching for Bananas. I would like to create a new row for each price found, and each row should also include product_id and item_name.
Current output:
product_id prices item_name
01 10 Orange
02 [u'10', u'20', u'30'] Banana
Desired output:
product_id prices item_name
01 10 Orange
02 10 Banana
02 20 Banana
02 30 Banana
Current code:
df = pd.read_csv("product_id.csv")
def crawl(product_id):
#Enter search input here, omitted
#Getting results:
search_result = driver.find_element_by_class_name("ac_results")
item_name = re.match("^.*(?=(\())", search_result.text).group().encode("utf-8")
prices = re.findall("((?<=\()[0-9]*)", search_reply.text)
return pd.Series([prices, item_name])
df[["prices", "item_name"]] = df["product_id"].apply(crawl)
df.to_csv("write.csv", index=False)
FYI: a workable solution with the csv module, but I want to use Pandas.

with open("write.csv", "a") as data_write:
    wr_data = csv.writer(data_write, delimiter=",")
    for price in prices:  # <-- This is the important part!
        wr_data.writerow([product_id, price, item_name])
# initializing here for reproducibility
pids = ['01','02']
prices = [10, [u'10', u'20', u'30']]
names = ['Orange','Banana']
df = pd.DataFrame({"product_id": pids, "prices": prices, "item_name": names})
The following snippet should work after your apply(crawl).
# convert all of the prices to lists (even if they only have one element)
df.prices = df.prices.apply(lambda x: x if isinstance(x, list) else [x])

# Create a new dataframe which splits the lists into separate columns.
# Then flatten using stack. The explicit MultiIndex allows us to keep
# the item_name and product_id associated with each price.
idx = pd.MultiIndex.from_tuples(list(zip(df['item_name'], df['product_id'])),
                                names=['item_name', 'product_id'])
df2 = pd.DataFrame(df.prices.tolist(), index=idx).stack()

# drop the hierarchical index and select columns of interest
df2 = df2.reset_index()[['product_id', 0, 'item_name']]

# rename back to prices
df2.columns = ['product_id', 'prices', 'item_name']
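On newer pandas (0.25 and up), DataFrame.explode makes the flattening step much shorter once prices holds lists:

df2 = df.explode('prices').reset_index(drop=True)
df2 = df2[['product_id', 'prices', 'item_name']]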
I was not able to run your code (probably missing inputs), but you can probably transform your prices list into a list of dicts and then build a DataFrame from there:
d = [{"price":10, "product_id":2, "item_name":"banana"},
{"price":20, "product_id":2, "item_name":"banana"},
{"price":10, "product_id":1, "item_name":"orange"}]
df = pd.DataFrame(d)
Then df is:
item_name price product_id
0 banana 10 2
1 banana 20 2
2 orange 10 1