I have dataframe in the following general format:
customer_id,transaction_dt,product,price,units
1,2004-01-02 00:00:00,thing1,25,47
1,2004-01-17 00:00:00,thing2,150,8
2,2004-01-29 00:00:00,thing2,150,25
3,2017-07-15 00:00:00,thing3,55,17
3,2016-05-12 00:00:00,thing3,55,47
4,2012-02-23 00:00:00,thing2,150,22
4,2009-10-10 00:00:00,thing1,25,12
4,2014-04-04 00:00:00,thing2,150,2
5,2008-07-09 00:00:00,thing2,150,43
5,2004-01-30 00:00:00,thing1,25,40
5,2004-01-31 00:00:00,thing1,25,22
5,2004-02-01 00:00:00,thing1,25,2
I have it sorted by the relevant fields in ascending order. Now I am trying to figure out how to check for a criterion inside a group and create a new indicator flag for only the first time it occurs. As a toy example, I am starting with something like this:
conditions = (df['units'] > 20) | (df['price'] > 50)
df['flag'] = df[conditions].groupby(['customer_id']).transform()
Any help on how best to formulate this properly would be most welcome!
Assuming you want the first chronological appearance of a customer_id, within the grouping you defined, you can use query, groupby, and first:
(
    df.sort_values("transaction_dt")
    .query("units > 20 & price > 50")
    .groupby("customer_id")
    .first()
)
Note: The example data you provided doesn't actually have multiple customer_id entries for the filters you specified, but the syntax will work in either case.
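If you also want the indicator column on the original dataframe (flagging only the first qualifying row per customer), here is one possible sketch, assuming the OR condition from your attempt rather than the AND used above:
cond = (df['units'] > 20) | (df['price'] > 50)
df = df.sort_values('transaction_dt')
# index labels of the first qualifying row per customer
first_idx = df[cond].groupby('customer_id').head(1).index
df['flag'] = 0
df.loc[first_idx, 'flag'] = 1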
I have a data frame with months (by year) and an ID number. I am trying to calculate the attrition rate, but I am getting stuck on obtaining unique ID counts when a month equals a certain month in pandas.
ID.  Month
1    Sept. 2022
2    Oct. 2022
etc... with possible duplicates in ID and 1.75 years worth of data.
import pandas as pd

path = some path on my computer
data = pd.read_excel(path)

if data["Month"] == "Sept. 2022":
    ID_SEPT = data["ID."].unique()
    return ID_SEPT
I am trying to discover what I am doing incorrectly in this if-then statement. Ideally I am trying to collect all the unique ID values per month, per year, to then calculate the attrition rate. Is there something obvious I am doing wrong here?
Thank you.
I tried an if-then statement and was expecting unique value counts of ID per month.
You need to use one of the iterator functions, like items().
for (columnName, columnData) in data.items():
    if columnName == 'Month':
        ...  # [code]
The way you do this with a dataframe, conceptually, is to filter the entire dataframe to be just the rows where your comparison is true, and then do whatever (get uniques) from there.
That would look like this:
filtered_df = df[df['Month'] == 'Sept. 2022']
ids_sept = list(filtered_df['ID.'].unique())
The first line there can look a little strange, but what it is doing is:
df['Month'] == 'Sept. 2022' will return an array/column/series (it actually returns a series) of True/False values indicating whether or not the comparison is, well, true or false.
You then run that series of bools through df[series_of_bools] that filters the dataframe to return only the rows where it is True.
Thus, you have a filter.
If you are looking for the number of unique items, rather than the list of unique items, you can also use filtered_df['ID.'].nunique() and save yourself the step later of getting the length of the list.
You are looking for pandas.groupby.
Use it like this to get the unique values of each group (Month):
data.groupby("Month")["ID."].unique()  # You have a . after ID in your example, check if that's correct
Try this:
data[data.Month=='Sept. 2022']['ID.'].unique()
I have grouped the number of customers by region and year joined using groupby in Python. However I want to remove several regions from the region group.
I know in order to exclude one group from a groupby you can use the following code:
grouped = df.groupby(['Region'])
df1 = df.drop(grouped.get_group(('Southwest')).index)
Therefore I initially tried the following:
grouped = df.groupby(['Region'])
df1 = df.drop(grouped.get_group(('Southwest','Northwest')).index)
However, that gave me an error referencing ('Southwest', 'Northwest').
Now I am wondering if there is a way to drop several groups at once instead of me having to type out the above code repeatedly for each region I want to remove.
I expect the output of the final query to be similar to the image shown below however information regarding the Northwest and Southwest regions should be removed.
It's not df1 = df.drop(grouped.get_group(('Southwest','Northwest')).index). grouped.get_group takes a single group name as its argument. If you want to drop more than one group, you can combine the two groups' index labels into one list, since drop can take a list as input: df1 = df.drop(grouped.get_group('Southwest').index.tolist() + grouped.get_group('Northwest').index.tolist())
As a side note, ('Southwest') evaluates to 'Southwest' (i.e. it's not a tuple). If you want to make a tuple of size 1, it's ('Southwest', )
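As an alternative (a sketch of my own, not part of the answer above), you can skip groupby entirely and filter with isin, assuming Region is a regular column:
df1 = df[~df['Region'].isin(['Southwest', 'Northwest'])]  # keep only rows outside the excluded regions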
I'm a beginner in pandas and Python, trying to learn.
I would like to iterate over pandas rows to apply simple coded logic.
Instead of fancy mapping functions, I just want simple coded logic,
so I can easily adapt it later for other coded logic rules as well.
In my dataframe dc,
I'd like to check if column AgeUnknown == 1 (or > 0).
If so, it should move the value of column Age to AgeUnknown,
and then set Age equal to 0.0.
I tried various combinations of my below code but it won't work.
# using a row reference #########
for index, row in dc.iterrows():
    r = row['AgeUnknown']
    if (r > 0):
        w = dc.at[index, 'Age']
        dc.at[index, 'AgeUnknown'] = w
        dc.at[index, 'Age'] = 0
Another attempt
for index in dc.index:
    r = dc.at[index, 'AgeUnknown'].[0]  # also tried .sum here
    if (r > 0):
        w = dc.at[index, 'Age']
        dc.at[index, 'AgeUnknown'] = w
        dc.at[index, 'Age'] = 0
Also tried
if(dc[index,'Age']>0 #wasnt allowed either
Why isn't this working? As far as I understood, a dataframe should be able to be addressed like the above.
I realize you requested a solution involving iterating the df, but I thought I'd provide one that I think is more traditional.
A non-iterating solution to your problem is something like this: 1) get all the indexes that meet your criteria, then 2) set those indexes of the df to what you want.
# indexes where column AgeUnknown is >0
inds = dc[dc['AgeUnknown'] > 0].index.tolist()
# copy the Age column into AgeUnknown at those indexes
dc.loc[inds, 'AgeUnknown'] = dc.loc[inds, 'Age']
# change the Age to 0 at those indexes
dc.loc[inds, 'Age'] = 0
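For completeness, a slightly shorter sketch of the same idea, assuming the same column names, uses a boolean mask with .loc directly instead of an index list:
mask = dc['AgeUnknown'] > 0
dc.loc[mask, 'AgeUnknown'] = dc.loc[mask, 'Age']  # copy Age into AgeUnknown where the flag is set
dc.loc[mask, 'Age'] = 0                           # then zero out Age on those rows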
Broadly I have the Smart Meters dataset from Kaggle and I'm trying to get a count of the first and last measure by house, then trying to aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different than the line I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
    SELECT House_ID, MAX(Date_Time) AS Max_DT
    FROM ElectricGrid
    GROUP BY House_ID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However I'm failing to get the outer query. Specifically I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
In the rename example, my second query fails to find Date_Time or Max_Date_Time. In the latter case (the ravel code), it appears not to find House_Id when I run it.
That seems weird; I would think your code would not be able to find the House_Id field. After you perform your groupby on House_Id it becomes an index, which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively you can just drop the multilevel column:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
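Another option (my suggestion, not part of the answer above) is named aggregation, which avoids the multi-level columns entirely; keeping House_Id as a regular column with as_index=False also sidesteps the index issue:
house_max = house_info.groupby('House_Id', as_index=False).agg(Max_Date_Time=('Date_Time', 'max'))
start_end_collate = house_max.groupby('Max_Date_Time').size()  # houses whose last reading falls on each date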
I have a CSV file which has 3 columns.
Here is what I have to do:
I want to write an if condition (or something similar) like: if Divi == 'core', then I need the distinct count of tags, without redundancy (i.e., two 'sand1' values in the tag column for the core division should be counted only once).
One more condition: if Div == 'saturn' or 'core' and type == 'dev', then again count the number of distinct tags.
Can anyone help me out with this? This was just my idea; any new approach that satisfies the requirement is welcome.
First, load up your data with pandas.
import pandas as pd
dataframe = pd.read_csv(path_to_csv)
Second, format your data properly (you might have mixed lower/upper-case values, as in the 'Division' column from your example). For the string columns, you can normalize the case:
for column in dataframe.select_dtypes(include='object').columns:
    dataframe[column] = dataframe[column].str.lower()
If you want to count frequency just by one column you can:
dataframe['Division'].value_counts()
If you want to count by two columns you can:
dataframe.groupby(['Division','tag']).count()
Hope that helps
Edit:
While this won't give you just the count for when the two conditions are met, which is what you asked for, it gives a more 'complete' answer, showing the counts for all combinations of the two columns.
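If you do want exactly the two conditional distinct counts described in the question, here is a minimal sketch; the column names Division, type, and tag are my assumptions based on the question, so adjust them to your actual headers:
# distinct tag count where Division == 'core'
core_tags = dataframe.loc[dataframe['Division'] == 'core', 'tag'].nunique()
# distinct tag count where Division is 'saturn' or 'core' and type == 'dev'
mask = dataframe['Division'].isin(['saturn', 'core']) & (dataframe['type'] == 'dev')
dev_tags = dataframe.loc[mask, 'tag'].nunique()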