I am new to pandas, so I am not sure if I am doing this the best way possible, and one part does not seem to be working properly.
In my database I have a table that records all the sales of the products on my website, and I would like to create a CSV report with every product that has been sold, its min price, max price, and other information. The table that records each sale has the following columns:
product_id
sell_price
created_by
From my research I found out how to load the CSV from a database export into a dataframe, like below.
import pandas as pd

sellsdb = pd.read_csv('sellsdb.csv', delimiter=',')
Now I make a copy of that dataframe without duplicates.
sells = sellsdb.copy().drop_duplicates(subset='product_id', keep=False)
Now I loop over each row (one per unique product_id) of the copied dataframe:
for index, row in sells.iterrows():
    countSells = sellsdb.loc[sellsdb['product_id'] == str(row['product_id'])].count()['product_id']
    if countSells > 1:
        print(countSells)
When I run this, all the counts come back as 1 even when there is a duplicate in the dataframe, but when I hard-code a product id I get the right number for that id. What is going on?
In the loop I was just going to append the columns that I need for the report to the dataframe with no duplicates.
Assume that your DataFrame contains:
product_id sell_price created_by
0 Aaa 20.35 Xxxx1
1 Aaa 20.15 Xxxx2
2 Aaa 22.00 Xxxx3
3 Bbb 10.13 Xxxx4
4 Ccc 16.00 Xxxx5
5 Ccc 16.50 Xxxx6
6 Ccc 17.00 Xxxx7
The counts all come back as 1 because drop_duplicates(subset='product_id', keep=False) removes every row whose product_id occurs more than once, so sells only contains products that were sold exactly once. To compute the number of sales per product it is much easier (and more
pandasonic) to run:
result = df.groupby('product_id').sell_price.count().rename('cnt')
I added rename('cnt') to give the result a meaningful name.
Otherwise the name would have been inherited from the original column
(sell_price), but the values are numbers of sales, not prices.
The result, for the above sample input, is:
product_id
Aaa 3
Bbb 1
Ccc 3
Name: cnt, dtype: int64
It is a Series with index named product_id and the value under each
index is the count of sales of this product.
And a final remark: I named the result cnt (not count) because
count is the name of a pandas function (used here), and it is bad
practice to shadow the names of existing functions or attributes.
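Since the original goal was a report with the count plus the minimum and maximum sell price per product, the same groupby can be extended with named aggregation — a sketch on the same df as above (product_report.csv is just a placeholder file name):

# count, min and max sell price per product, written out as the report
report = (df.groupby('product_id')['sell_price']
            .agg(cnt='count', min_price='min', max_price='max')
            .reset_index())
report.to_csv('product_report.csv', index=False)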
Update pandas cell in one Dataframe from looked up value in second Dataframe
I have a case where I need to update a cell in one Dataframe, 'Stock', which holds records of stock on-hand, looking up its value in a second Dataframe, 'Items', which is the table of all items. This is a simplified example of the Dataframes with relevant fields.
Stock
item_no qty
0 9H.111 101
1 9H.222 230
2 MODEL_B 136
3 9H.444 344
4 MODEL_E 505
5 9H.666 332
Items
item_no model_no
0 9H.111 MODEL_A
1 9H.222 MODEL_B
2 9H.333 MODEL_B
3 9H.444 MODEL_C
4 9H.555 MODEL_D
5 9H.666 MODEL_E
6 9H.777 MODEL_D
7 9H.888 MODEL_F
The challenge
I have previously done this in PostgreSQL but would like to see if I can do all the processing in Pandas (I plan to link to the PostgreSQL table of items). If we look at the Stock table, the item_no column should only contain item numbers (see the Items dataframe above), but sometimes users enter the model_no instead of the item number. So in the Stock dataframe, row 2 incorrectly has the value MODEL_B.
What's needed
What needs to be done is to:
get the value MODEL_B from the item_no column in the Stock dataframe
find that in the model_no column of the Items dataframe
then get the value from the item_no field of the Items dataframe
use that value to replace the (incorrect) model number value in the item_no column of the Stock dataframe
It gets a little more challenging... a model may have more than one part number:
1 9H.222 MODEL_B
2 9H.333 MODEL_B
In that case the 'highest' part number, here 9H.333, needs to be used. In SQL I would use the MAX() operator.
I would like to perform this using 'set' operations in pandas (not looping), similar to running a query in SQL. So this would mean (?) joining the two dataframes on the fields stock.item_no <-> items.model_no (?) - I'm not sure how to go about it hence the question marks.
Generate Dataframes
This code will generate the dataframes discussed above.
import pandas as pd

stock = pd.DataFrame({
    'item_no': ['9H.111', '9H.222', 'MODEL_B', '9H.444', 'MODEL_E', '9H.666'],
    'qty': [101, 230, 136, 344, 505, 332],
})
items = pd.DataFrame({
    'item_no': ['9H.111', '9H.222', '9H.333', '9H.444', '9H.555', '9H.666', '9H.777', '9H.888'],
    'model_no': ['MODEL_A', 'MODEL_B', 'MODEL_B', 'MODEL_C', 'MODEL_D', 'MODEL_E', 'MODEL_D', 'MODEL_F'],
})
display(stock)
display(items)
You can use:
# keep max item_no per model_no
# convert to mapping Series
mapper = (items.sort_values(by='item_no')
               .drop_duplicates(subset='model_no', keep='last')
               .set_index('model_no')['item_no']
          )
# identify rows with model_no
m = stock['item_no'].isin(mapper.index)
# replace the values in place
stock.loc[m, 'item_no'] = stock.loc[m, 'item_no'].map(mapper)
updated stock:
item_no qty
0 9H.111 101
1 9H.222 230
2 9H.333 136
3 9H.444 344
4 9H.666 505
5 9H.666 332
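If you would rather do the SQL-style join hinted at in the question, an equivalent sketch on the same stock/items frames is a left merge followed by fillna (the names lookup, key and mapped_item_no are just labels I made up here):

# max item_no per model_no, reshaped into a lookup table for the merge
lookup = (items.groupby('model_no', as_index=False)['item_no'].max()
               .rename(columns={'model_no': 'key', 'item_no': 'mapped_item_no'}))

# left-join stock's item_no against the model numbers; unmatched rows keep their value
merged = stock.merge(lookup, left_on='item_no', right_on='key', how='left')
stock['item_no'] = merged['mapped_item_no'].fillna(stock['item_no'])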
So basically I have this dataframe called df, whose columns hold the user ids, the genre they played, and the total number of streams. How can I extract the top 10 genres with the most streams, while showing the total number of users who streamed them?
What I thought of doing is to sort the column values like this:
df_genre.sort_values(by="total_streams",ascending=False)
and then take the top genres, but the result was not what I want. How can I fix it?
I think this is what you are looking for:
Data:
ID,genre,plays
12345,pop,23
12345,pop,576
12345,dance,18
12345,world,45
12345,dance,23
12345,pop,456
Input:
df = df.groupby(['ID','genre'])['plays'].sum().reset_index()
df.sort_values(by=['plays'], ascending=False)
Output:
ID genre plays
1 12345 pop 1055
2 12345 world 45
0 12345 dance 41
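To get the top 10 genres across all users, together with how many distinct users streamed each one, one possible extension is to aggregate per genre and take nlargest — a sketch on the raw ID/genre/plays data from the Data block above (total_streams and users are just labels I chose):

# total streams and number of distinct listeners per genre, top 10 by streams
top10 = (df.groupby('genre')
           .agg(total_streams=('plays', 'sum'),
                users=('ID', 'nunique'))
           .nlargest(10, 'total_streams')
           .reset_index())
print(top10)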
I have a dataset of stations
map_id longitude latitude zip_code
0 40830 -87.669147 41.857908 60608
1 40830 -87.669147 41.857908 60608
2 40120 -87.680622 41.829353 60609
3 40120 -87.680622 41.829353 60609
4 41120 -87.625826 41.831677 60616
As you can see, the first four rows are duplicates (two pairs), and that is not an accident: they are the same stations, which are treated as separate stations of different lines.
I would like to eliminate such duplicates (there can be 2 or even 5 rows for some stations) and treat each group as one station.
Moreover, I would like to create a new column "hub", where aggregated rows are treated as a hub station, for example as a boolean (0 for a regular station, 1 for a hub).
The desired output for the sample above, with two cases of duplication, is 3 rows with 2 hubs:
map_id longitude latitude zip_code hub
0 40830 -87.669147 41.857908 60608 1
1 40120 -87.680622 41.829353 60609 1
2 41120 -87.625826 41.831677 60616 0
I appreciate any tips!
Looks to me like you want to drop duplicates and assign certain zip codes as hubs. If so, I would drop the duplicates and use np.where to assign the hub flag. I included a non-existent zip code in the condition to demonstrate how to handle more than one zip code (the listed zip codes get hub = 0, the rest get 1):
import numpy as np
df2 = df.drop_duplicates(subset=['map_id', 'longitude', 'latitude', 'zip_code'], keep='first')
conditions = df2['zip_code'].isin(['60616', '60619'])
df2['hub'] = np.where(conditions, 0, 1)
Alternatively, a groupby can collapse the duplicates and derive the hub flag from the group size in one pass:
df = df.groupby(['map_id','longitude','latitude','zip_code']).size().reset_index(name='hub')
df['hub'] = df['hub'].replace(1,0).apply(lambda x: min(x, 1))
Output
map_id longitude latitude zip_code hub
0 40120 -87.680622 41.829353 60609 1
1 40830 -87.669147 41.857908 60608 1
2 41120 -87.625826 41.831677 60616 0
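If the hub flag should instead be derived from the duplication itself rather than a hard-coded list of zip codes, one sketch (starting from the original df) is to mark the duplicated rows before dropping them:

cols = ['map_id', 'longitude', 'latitude', 'zip_code']
# a station is a hub (1) when it appears more than once, otherwise 0
df['hub'] = df.duplicated(subset=cols, keep=False).astype(int)
df = df.drop_duplicates(subset=cols, keep='first').reset_index(drop=True)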
I have a dataset like this:
Policy | Customer | Employee | CoveragDate | LapseDate
123 | 1234 | 1234 | 2011-06-01 | 2015-12-31
124 | 1234 | 1234 | 2016-01-01 | ?
125 | 1234 | 1234 | 2011-06-01 | 2012-01-01
124 | 5678 | 5555 | 2014-01-01 | ?
I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.
So, expected output would be:
Policy | Customer | Employee
123 | 1234 | 1234
because policy 123's lapse date was within 5 days of policy 124's covered date.
I'm running into a problem while trying to iterate through each grouping of Customer/Employee numbers. I'm able to identify how many rows of data are in each EmployeeID/Customer number (EBCN below) group, but I need to reference specific data within those rows to assign variables for comparison.
So far, I've been able to write this code:
import pandas
import datetime
wd = pandas.read_csv(DATASOURCE)
l = 0
for row, i in wd.groupby(['EMPID', 'EBCN']).size().iteritems():
    Covdt = pandas.to_datetime(wd.loc[l, 'CoverageEffDate'])
    for each in range(i):
        LapseDt = wd.loc[l, 'LapseDate']
        if LapseDt != '?':
            LapseDt = pandas.to_datetime(LapseDt) + datetime.timedelta(days=5)
            if Covdt < LapseDt:
                print('got one!')
        l = l + 1
This code is not working because I'm trying to reference the coverage date/lapse dates on a particular row with the loc function, with my row number stored in the 'l' variable. I initially thought that Pandas would iterate through groups in the order they appear in my dataset, so that I could simply start with l=0 (i.e. the first row in the data), assign the coverage date and lapse date variables based on that, and then move on, but it appears that Pandas starts iterating through groups randomly. As a result, I do indeed get a comparison of lapse/coverage dates, but they're not associated with the groups that end up getting output by the code.
The best solution I can figure is to determine what the row number is for the first row of each group and then iterate forward by the number of rows in that group.
I've read through a question regarding finding the first row of a group, and am able to do so by using
wd.groupby(['EMPID','EBCN']).first()
but I haven't been able to figure out what row number the results are stored on in a way that I can reference with the loc function. Is there a way to store the row number for the first row of a group in a variable or something so I can iterate my coverage date and lapse date comparison forward from there?
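For reference, the groupby object exposes the integer positions of each group's rows through its indices attribute, which is one way to get that first row number — a rough sketch on the same wd frame, shown only as an illustration:

groups = wd.groupby(['EMPID', 'EBCN'])
# indices maps each (EMPID, EBCN) key to a numpy array of integer row positions
for key, rows in groups.indices.items():
    first_pos = rows[0]  # integer position of the group's first row
    Covdt = pandas.to_datetime(wd.iloc[first_pos]['CoverageEffDate'])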
Regarding my general method, I've read through the question here, which is very close to what I need:
pandas computation in each group
however, I need to compare each policy in the group against each other policy in the group - the question above just compares the last row in each group against the others.
Is there a way to do what I'm attempting in Pandas/Python?
For anyone needing this information in the future - I was able to implement Boud's suggestion to use the pandas.merge_asof() function to replace my code above. I had to do some data manipulation to get the desired result:
Splitting the dataframe into two separate frames - one with CoverageDate and one with LapseDate.
Replacing the '?' (null values) in my data with a numpy.nan datatype
Sorting the left and right dataframes by the Date columns
Once the data was in the correct format, I implemented the merge:
pandas.merge_asof(cov, term,
                  on='Date',
                  by='EMP|EBCN',
                  tolerance=pandas.Timedelta('5 days'))
Note 'cov' is my dataframe containing coverage dates, term is the dataframe with lapses. The 'EMP|EBCN' column is a concatenated column of the employee ID and Customer # fields, to allow easy use of the 'by' field.
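For reference, here is roughly the shape of that preparation, using the column names from the sample table and my code above (the real data needed a bit more cleanup, and the variable names here are just illustrative):

import numpy as np
import pandas

wd = pandas.read_csv(DATASOURCE)

# single key column so merge_asof's `by` argument can use it
wd['EMP|EBCN'] = wd['EMPID'].astype(str) + '|' + wd['EBCN'].astype(str)
wd['LapseDate'] = wd['LapseDate'].replace('?', np.nan)

# left frame: coverage dates; right frame: lapse dates
cov = wd[['Policy', 'EMP|EBCN', 'CoverageEffDate']].rename(columns={'CoverageEffDate': 'Date'})
term = wd[['Policy', 'EMP|EBCN', 'LapseDate']].rename(columns={'LapseDate': 'Date'})

cov['Date'] = pandas.to_datetime(cov['Date'])
term['Date'] = pandas.to_datetime(term['Date'])
cov = cov.sort_values('Date')
term = term.dropna(subset=['Date']).sort_values('Date')

# for each coverage date, find a lapse for the same employee/customer
# at most 5 days earlier (merge_asof matches backward by default)
result = pandas.merge_asof(cov, term, on='Date', by='EMP|EBCN',
                           tolerance=pandas.Timedelta('5 days'),
                           suffixes=('_new', '_lapsed'))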
Thanks so much to Boud for sending me down the correct path!
I have a number of small dataframes with a date and stock price for a given stock. Someone else showed me how to loop through them so they are contained in a list called all_dfs. So all_dfs[0] would be a dataframe with Date and IBM US equity, all_dfs[1] would be Date and MMM US Equity, etc. (example shown below). The Date column in the dataframes is always the same, but the stock names are all different and the numbers associated with each stock column are always different. So when you call all_dfs[1], this is the dataframe you would see (i.e., all_dfs[1].head()):
IDX Date MMM US equity
0 1/3/2000 47.19
1 1/4/2000 45.31
2 1/5/2000 46.63
3 1/6/2000 50.38
I want to add the same additional columns to EVERY dataframe. So I was trying to loop through them and add the columns. The numbers in the stock name columns are the basis for the calculations that make the other columns.
There are more columns to add, but I think they will all loop through the same way, so this is a sample of the columns I want to add:
Column 1 to add >>> df['P_CHG1D'] = df['Stock name #1'].pct_change(1) * 100
Column 2 to add >>> df['PCHG_SIG'] = P_CHG1D > 3
Column 3 to add>>> df['PCHG_SIG']= df['PCHG_SIG'].map({True:1,False:0})
This is the code that I have so far, but it is returning a syntax error for the all_dfs[i].
for i in range (len(df.columns)):
    for all_dfs[i]:
        df['P_CHG1D'] = df.loc[:,0].pct_change(1) * 100
So I also have two problems that I cannot figure out:
I don't know how to add columns to every dataframe in the loop. So I would have to do something like all_dfs[i]['ADD COLUMN NAME'] = df['Stock Name 1'].pct_change(1) * 100
The second part after the =, df['Stock Name 1'], keeps changing (in this example it is MMM US Equity, but for the next dataframe it would be that dataframe's column header, e.g. IBM US Equity), since each dataframe has a different stock name, so I don't know how to reference it properly in the loop.
I am new to python/pandas so if I am thinking about this the wrong way let me know if there is a better solution.
Consider iterating through the length of all_dfs and referencing each element in the loop by its index. For the first new column, use the .iloc indexer to select the stock column by its column position of 2 (third column):
for i in range(len(all_dfs)):
    all_dfs[i] = all_dfs[i].copy()  # avoids SettingWithCopyWarning if the frames are views
    all_dfs[i]['P_CHG1D'] = all_dfs[i].iloc[:, 2].pct_change(1) * 100
    all_dfs[i]['PCHG_SIG'] = all_dfs[i]['P_CHG1D'] > 3
    all_dfs[i]['PCHG_SIG_VAL'] = all_dfs[i]['PCHG_SIG'].map({True: 1, False: 0})
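If you would rather reference the stock column by name instead of by position (the name differs per dataframe, but in this sample it always sits at position 2), a small variation of the same loop:

for df in all_dfs:
    stock_col = df.columns[2]  # e.g. 'MMM US equity'
    df['P_CHG1D'] = df[stock_col].pct_change(1) * 100
    df['PCHG_SIG'] = df['P_CHG1D'] > 3
    df['PCHG_SIG_VAL'] = df['PCHG_SIG'].map({True: 1, False: 0})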