Pandas - Get value from second dataframe based on combination of multiple columns - python

I'm struggling with a problem at the moment.
Basically I have two DataFrames.
One is an export from my ERP system that gives me the current physical stock level, which should be enhanced with stock reservations per sales channel, e.g.
Stock = pd.DataFrame(data={'SKU': [1,2,3], 'PhysicalStock': [100,1,2], 'FirstSeenInStock': [2,5,200], 'SafetyStock_Platform1': [np.nan,np.nan,np.nan], 'SafetyStock_Platform2': [np.nan,np.nan,np.nan]})
The columns SKU, Physical Stock and First Seen in Stock (which is days since this product was first seen with stock) come from the ERP system. The columns for Safety stock should be derived from another DataFrame, which is maintained by someone for all marketplaces and looks like this:
SafetyStock = pd.DataFrame(data={'FromAgeDays': [0,2,9], 'ToAgeDays': [3,10,999], 'SafetyStock_Platform1': [10,1,0], 'SafetyStock_Platform2': [5,3,0]})
What I tried (with iloc) was to look up the values from the SafetyStock dataframe and copy them into the Stock dataframe, based on the following logic:
Stock['FirstSeenInStock'] >= SafetyStock['FromAgeDays']
Stock['FirstSeenInStock'] <= SafetyStock['ToAgeDays']
The right column per platform, which is why I named the safety-stock columns the same in both dataframes.
The desired outcome would be the following:
DesiredOutcome = pd.DataFrame(data={'SKU': [1,2,3], 'PhysicalStock': [100, 1, 2], 'FirstSeenInStock': [2,5,200], 'SafetyStock_Platform1': [10,1,0], 'SafetyStock_Platform2': [5,3,0]})

You should use the merge function in pandas, which is essentially what is known as a "join" in the database world.
Merging based on conditions is not well developed in pandas; it is known as a "non-equi join".
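One way to emulate the non-equi join is a cross join followed by a filter. A minimal sketch, assuming pandas >= 1.2 (for how='cross') and starting from the ERP columns only, so the safety-stock values come solely from the lookup table:

import pandas as pd

Stock = pd.DataFrame({'SKU': [1, 2, 3],
                      'PhysicalStock': [100, 1, 2],
                      'FirstSeenInStock': [2, 5, 200]})
SafetyStock = pd.DataFrame({'FromAgeDays': [0, 2, 9],
                            'ToAgeDays': [3, 10, 999],
                            'SafetyStock_Platform1': [10, 1, 0],
                            'SafetyStock_Platform2': [5, 3, 0]})

# Pair every SKU with every age band, then keep only bands containing the age.
merged = Stock.merge(SafetyStock, how='cross')
in_band = ((merged['FirstSeenInStock'] >= merged['FromAgeDays']) &
           (merged['FirstSeenInStock'] <= merged['ToAgeDays']))
# The sample bands overlap at their edges (age 2 falls in both 0-3 and 2-10),
# so keep the first matching band per SKU, which reproduces the desired outcome.
result = (merged[in_band]
          .drop_duplicates(subset='SKU', keep='first')
          .drop(columns=['FromAgeDays', 'ToAgeDays']))

Note that a cross join materializes len(Stock) * len(SafetyStock) rows before filtering, which is fine for a small lookup table but worth keeping in mind for large ones.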

Related

How to find the mean of subseries in DataFrames?

My personal side project right now is to analyze GDP growth rates per capita. More specifically, I want to find the average growth rate for each decade since 1960, and then analyze it.
I pulled data from the World Bank API ("wbgapi") as a DataFrame:
import pandas as pd
import wbgapi as wb
gdp = wb.data.DataFrame('NY.GDP.PCAP.KD.ZG')
gdp.head()
Output: [image of gdp.head()]
I then used nested for loops to calculate the mean for every decade and added it to a new dataframe.
row, col = gdp.shape
meandata = pd.DataFrame(columns=['Country', 'Decade', 'MeanGDP', 'Region'])
for r in range(0, row, 1):
    countrydata = gdp.iloc[r]
    for c in range(0, col - 9, 10):
        decade = 1960 + c
        tenyeargdp = countrydata.array[c:c + 10].mean()
        meandata = meandata.append({'Country': gdp.iloc[r].name, 'Decade': decade, 'MeanGDP': tenyeargdp}, ignore_index=True)
meandata.head(10)
The code works and generates the following output: [image of meandata]
However, I have a few questions about this step:
Is there a more efficient way to access the subseries of the dataframe? I read that for loops should never be used on dataframes and that one should vectorize operations instead.
Is the complexity O(n^2) since there are 2 for loops?
The second step is to group the individual countries by region, for future analysis. To do so I rely on the World Bank API, which defines its own regions, each with a list of member economies/countries.
I iterated through the regions and the member list of each region. If a country is part of the region's list, I added that region to its row.
Since an economy/country can be part of multiple regions (e.g. 'USA' can be part of both NA and HIC (high-income country)), I concatenated the region code onto the previously added regions.
for rg in wb.region.list():
    for co in wb.region.members(rg['code']):
        str1 = '-' + meandata.loc[meandata['Country'] == co, ['Region']].astype(str)
        meandata.loc[meandata['Country'] == co, ['Region']] = rg['code'] + str1
The code mostly works; however, sometimes it gives the error message that 'meandata' is not defined. I use JupyterLab.
Additionally, Is there a simpler/more efficient way of doing the second step?
Thanks for reading and helping. Also, this is my first python/pandas coding experience, and as such general feedback is appreciated.
Consider using groupby:
The aggregation will be based on the columns you pass as a list to the groupby function.
In the sample below I get the mean of 'MeanGDP' per 'Country' and 'Region'.
meandata = meandata.groupby(['Country', 'Region']).agg({'MeanGDP': 'mean'}).reset_index()
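On the vectorization question: the nested loops can be replaced by one grouped mean over the year columns. A minimal sketch, assuming the wbgapi frame has one row per economy and year columns labelled like 'YR1960' (the wbgapi default):

years = gdp.columns.str.lstrip('YR').astype(int)       # 'YR1960' -> 1960
decades = (years // 10) * 10                           # map each year to its decade
decade_means = gdp.T.groupby(decades.values).mean().T  # one column per decade

From there, stack()/reset_index() can reshape the result into the Country/Decade/MeanGDP layout, and the work stays vectorized instead of looping row by row in Python.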

Pandas dataframe: summing cell data from a group of rows, storing in a new column

As a part of a treatment for a health related issue, I need to measure my liquid intake (along with some other parameters), registring the amount of liquid every time I drink. I have a dataframe, of several months of such registration.
I want to sum my daily amount in an additional column (in red in the image below).
As you may see, I would like to store it in the first row of each day's slice returned by df.groupby(df['Date']), for all the days.
I tried the following:
df.groupby(df.Date).first()['Total']= df.groupby(df.Date)['Drank'].fillna(0).sum()
But that seems not to be the way to do it.
Grateful for any advice.
Thanks
Michael
Use the fact that False == 0.
The first row of each date is the one where Date is not equal to shift() of Date.
merge() the per-date sums back.
import numpy as np
import pandas as pd

## construct a data set
d = pd.date_range("1-jan-2021", "1-mar-2021", freq="2H")
A = np.random.randint(20, 300, len(d)).astype(float)
A.ravel()[np.random.choice(A.size, A.size // 2, replace=False)] = np.nan
df = pd.DataFrame({"datetime": d, "Drank": A})
df = df.assign(Date=df.datetime.dt.date, Time=df.datetime.dt.time).drop(columns=["datetime"]).loc[:, ["Date", "Time", "Drank"]]
## construction done

# the first row of a date is the one whose Date differs from the previous row's
# merge the per-date Total back onto exactly those rows
df.assign(row=df.Date.eq(df.Date.shift())).merge(
    df.groupby("Date", as_index=False).agg(Total=("Drank", "sum")).assign(row=0),
    on=["Date", "row"], how="left"
).drop(columns="row")
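A shorter alternative (my own sketch, not part of the original answer): compute the per-date total with groupby().transform() and keep it only on each day's first row.

# per-date total broadcast to every row of that date
totals = df.groupby("Date")["Drank"].transform("sum")
# keep it only where Date differs from the previous row, i.e. the day's first row
df["Total"] = totals.where(df["Date"].ne(df["Date"].shift()))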

groupby.mean function dividing by pre-group count rather than post-group count

So I have the following dataset of trade flows that tracks imports and exports by reporting country and partner country. After I remove some unwanted columns, I edit my data frame so that the trade flows between country A and country B are shown. I'm left with something like this:
[image of my data frame]
My issue is that I want to be able to take the average of imports and exports for every partner country ('partner_code') per year, but when I run the following:
x = df[(df.location_code.isin(['IRN'])) &
       (df.partner_code.isin(['TCD']))]
grouped = x.groupby(['partner_code']).mean()
I end up getting an average of all exports divided by all instances where there is a 'product_id' (so a much higher count), rather than the average of imports or exports per year across all the years.
Taking the average of the following 5 export values gives an incorrect average: [images of the 5 export values and the wrong average]
In pandas, we can group by multiple columns; based on my understanding, you want to group by partner, country, and year.
The following line would work:
df = df.groupby(['partner_code', 'location_code', 'year'])[['import_value', 'export_value']].mean()
Please note that the resulting dataframe has a MultiIndex.
For reference, the official documentation: DataFrame.groupby documentation
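If you prefer flat columns, a reset_index() call turns the group keys back into regular columns (a small usage sketch):

df = df.reset_index()  # partner_code, location_code and year become columns again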

I am not able to correctly assign a value to a df row based on 3 conditions (checking values in 3 other columns)

I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month  ProductID(SKU)  Family  Sales    ProporcionVenta
1      1234            FISH    10000.0  0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month relative to the sum of sales of the family-month. For example, the family FISH has sold 100,000 in month 1, so in this specific case it would be calculated as 10,000/100,000 (productid-month sales / family-month sales).
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family'] == family) & (testingAgain['Month'] == month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family'] == family) & (testingAgain['Month'] == month) & (testingAgain['SKU'] == sku)]['Qty'].sum()
            proporcion = salesSKUMonth / salesFamilyMonth
            testingAgain[(testingAgain['SKU'] == sku) & (testingAgain['Family'] == family) & (testingAgain['Month'] == month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel and they are correct, but the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new one.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (and even NumPy), unlike general-purpose Python, analysts should avoid using for loops, as there are many vectorized options for conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs indicate, broadcast to match the shape of the input array.
Currently, your code is attempting to assign a value to a subsetted slice of a data frame column, which should raise SettingWithCopyWarning. Such an operation does not affect the original data frame. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU'] == sku) &
                 (testingAgain['Family'] == family) &
                 (testingAgain['Month'] == month), 'ProporcionVenta'] = proporcion
However, avoid looping altogether, since transform works nicely for assigning new data frame columns. Also, below, div is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum')))
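As a quick sanity check (a sketch; it assumes each SKU-month combination occupies a single row), the proportions within each family-month should sum to roughly 1:

testingAgain.groupby(['Family', 'Month'])['ProporcionVenta'].sum()  # each group should be ~1.0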

Counting Frequency of an Aggregate result using pandas

Broadly, I have the Smart Meters dataset from Kaggle, and I'm trying to get a count of the first and last measure by house, then aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different from the line I pursue below.
In SQL, when exploring data I often used something like the following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
    SELECT House_ID, MAX(Date_Time) AS Max_DT
    FROM ElectricGrid
    GROUP BY House_ID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However I'm failing to get the outer query. Specifically I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
In the rename example, my second query fails to find Date_Time or Max_Date_Time. In the latter case (the ravel code), it appears not to find House_Id when I run it.
That seems weird; I would think your code would not be able to find the House_Id field. After you perform your groupby on House_Id, it becomes an index, which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively you can just drop the multilevel column:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
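Alternatively (my own sketch, assuming pandas >= 0.25), named aggregation sidesteps the two-level columns from the start:

house_max = house_info.groupby('House_Id').agg(Max_Date_Time=('Date_Time', 'max'))
start_end_collate = house_max.groupby('Max_Date_Time').size()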
