How to expand a groupby df in pandas (Python)

I have filtered a pandas DataFrame by grouping and taking the sum, but now I want all the details and no longer need the sum.
For example, what I have looks like the image below.
What I want is for each of the individual transactions to be shown: currently the Amount column is the sum of all transactions done by an individual on a specific date, and I want to see all the individual amounts. Is this possible?
I don't know how to filter the larger df by the groupby one. I have also tried using isin() with multiple &s, but it does not work: for example, "David" could be in my groupby df on Sept 15, but in the larger df he has made transactions on other days as well, and those slip through when using isin().

Hello there and welcome,
First of all, as I've learned myself, always try to:
give some data (in text or code form) as your input
share your expected output, to avoid follow-up questions
have fun :-)
I'm new as well, and I did my best to cover as many possibilities as I could; at least people can use my code to reproduce your df.
# From the picture
import pandas as pd

data = {'Date': ['2014-06-30','2014-07-02','2014-07-02','2014-07-03','2014-07-09','2014-07-14','2014-07-17','2014-07-25','2014-07-29','2014-07-29','2014-08-06','2014-08-11','2014-08-22'],
        'LastName': ['Cow','Kind','Lion','Steel','Torn','White','Goth','Hin','Hin','Torn','Goth','Hin','Hin'],
        'FirstName': ['C','J','K','J','M','D','M','G','G','M','M','G','G'],
        'Vendor': ['Jail','Vet','TGI','Dept','Show','Still','Turf','Glass','Sup','Ref','Turf','Lock','Brenn'],
        'Amount': [5015.70,6293.27,7043.00,7600,9887.08,5131.74,5037.55,5273.55,9455.48,5003.71,6675,7670.5,8698.18]}
df = pd.DataFrame(data)
# what I believe you did to get Date grouped
incoming = df.groupby(['Date','LastName','FirstName','Vendor','Amount']).count()
incoming
Now here is my answer.
First, I merged FirstName and LastName:
df['CompleteName'] = df[['FirstName','LastName']].agg('.'.join, axis=1)  # build a full-name column
Then I computed some per-group statistics for the amount:
# new columns with group-level sums (by CompleteName, Date, Vendor)
df['AmountSumName'] = df['Amount'].groupby(df['CompleteName']).transform('sum')
df['AmountSumDate'] = df['Amount'].groupby(df['Date']).transform('sum')
df['AmountSumVendor'] = df['Amount'].groupby(df['Vendor']).transform('sum')
df
Now just group by whatever you wish.
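To get back to your original question of filtering the full df by the rows of the summed groupby frame (where isin() fails because it matches each column independently), here is a minimal sketch of a merge-based filter; grouped is assumed to be your summed frame with Date, LastName and FirstName as its group keys:
# Hypothetical: 'grouped' is your summed groupby result; reset_index()
# turns its group keys back into columns.
keys = ['Date', 'LastName', 'FirstName']
group_keys = grouped.reset_index()[keys].drop_duplicates()

# An inner merge keeps only the df rows whose full key combination appears
# in grouped -- unlike isin(), which matches each column independently.
detail = df.merge(group_keys, on=keys, how='inner')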
Hope I could answer your question.

Related

Pandas extract time of certain entry from rows with same date

I'm fairly new to pandas and I'm trying to find out how to extract a time value from a DataFrame whose rows contain the same and different dates, different sellers, different states, and a timestamp for each state.
I need to extract the rows where State = "received" shows up for the first time on each date and for each seller.
As you can see in the screenshot above, the State "received" shows up twice on the same date for the same seller, but with different timestamps; I only need the rows where it appears for the first time.
PS: I haven't tried any code unfortunately; I searched a lot and came up empty. Any help is appreciated.
IIUC, you need pandas.groupby:
>>> df[df["State"]=="Received"].groupby(["Date", "Seller"])["Time"].first()
Date      Seller
6/1/2021  Seller_1    8:00:36
          Seller_2    9:45:30
6/2/2021  Seller_1    8:00:36
          Seller_2    9:33:02
If you need the earliest occurrence rather than simply the first row in group order, use min() instead.
I would use:
df[df['State'] == 'Received'].sort_values(by=['Seller', 'Date', 'Time']) \
    .drop_duplicates(subset=['Seller', 'Date'])
This would be mildly more performant than grouping.
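One caveat, assuming Time is stored as strings: both the sort and min() then compare lexicographically, so '10:00:00' sorts before '8:00:00'. A minimal sketch of converting first:
# Assumes 'Time' holds strings like '8:00:36'; convert so comparisons
# are chronological rather than alphabetical.
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S').dt.time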

Can I aggregate a dataframe monthly in Python while also taking the other variables into consideration?

Sorry if the title is not clear enough. The dataset I have, df, contains daily data for two ids (1 and 2) in the month of January; each id on each day of January is associated with a Value (a or b). The problem is that, starting from a dataset like df (links below), I want to arrive at df1. The goal is to group the data monthly, but per id rather than over all the values together. The Value column should be the sum of all a and b values of a given id in a given month.
I don't know if I have been clear explaining the problem; I hope the links below help. I am a very beginner in Python and am facing many difficulties.
Thank you very much in advance.
Dataframe df head
Dataframe df end
Dataframe df1: the output I would like to obtain
From what I see, this should do:
df = df.set_index(pd.to_datetime(df['Date']))
df1 = df.groupby([pd.Grouper(freq='M'), 'id']).agg({'Value': 'sum'})
But, as stated, you should post a reproducible example if you want to get better help.
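For illustration, a minimal runnable sketch with made-up data (the column names Date, id and Value are assumed from the screenshots):
import pandas as pd

df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-02'],
    'id': [1, 1, 2, 2],
    'Value': [10, 20, 5, 7],
})

df = df.set_index(pd.to_datetime(df['Date']))
# One row per (month, id), with Value summed within each group.
df1 = df.groupby([pd.Grouper(freq='M'), 'id']).agg({'Value': 'sum'}).reset_index()
print(df1)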

How to update/apply validation to pandas columns

I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, depending on the workbook, the data will be missing the leading zeros or will have more than 11 characters (but those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
import pandas as pd

dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns=['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
    print(x.zfill(13)[:13])
I would like to know the best way to apply that format to the existing values, and only if those values are present (i.e. not touching them if there are null values).
Also, I need to ensure that the date columns are truly date values. Sometimes the dates are formatted as MM-DD-YYYY and sometimes MM/DD/YY, etc., and any of those are fine; what is not fine is if the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire DataFrame column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there, you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
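The two steps also chain, and the .str accessor passes missing values through untouched, which matches the "only if those values are present" requirement; a one-line sketch:
# NaN rows pass through unchanged: .str operations propagate missing values.
df['STOCK_NUM'] = df['STOCK_NUM'].str.zfill(13).str[:13]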
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
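If you would rather flag bad values than raise, errors='coerce' turns anything pandas cannot parse into NaT so you can inspect it; a sketch (note that serials stored as plain integers may parse oddly rather than fail, so check the result):
# Entries pandas can't parse as dates become NaT; inspect those rows.
parsed = pd.to_datetime(df['Date'], errors='coerce')
bad_rows = df[parsed.isna() & df['Date'].notna()]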
For your STOCK_NUM question, you could potentially apply a function to the column, but the way I approach this is with a list comprehension. The first thing I would do is replace all the NAs in your STOCK_NUM column with a unique string and then apply the list comprehension, as you can see in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
#replace NAs with a string
df['STOCK_NUM'] = df['STOCK_NUM'].fillna('IS_NA')
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
Then for your question about converting an Excel serial number to a date, I looked at an already-answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime

def xldate_to_datetime(xldate):
    # Excel's epoch plus its spurious 1900 leap day make 1899-12-30 the
    # effective origin; the 2-day offset accounts for this.
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp + delta)

df['Date'] = [xldate_to_datetime(i) if type(i) == int else pd.to_datetime(i) for i in df.Date]
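As an aside (not part of the original answer), pandas can also convert Excel serials directly: to_datetime accepts unit='D' with origin='1899-12-30', giving the same result vectorized. A sketch:
# Vectorized alternative for the integer entries only.
serial_mask = df['Date'].apply(lambda v: isinstance(v, int))
df.loc[serial_mask, 'Date'] = pd.to_datetime(
    df.loc[serial_mask, 'Date'].astype(int), unit='D', origin='1899-12-30'
)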
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.

Cryptocurrency correlation in python, working with dictionaries

I'm working with a cryptocurrency data sample; each cell contains a dictionary holding the open price, close price, highest price, lowest price, volume, and market cap. The columns are the corresponding dates and the index is the name of each cryptocurrency.
I don't know how to prepare the data in order to find the correlation between different currencies, or between highest price and volume, for example. How can this be done in Python (pandas)? Also, how would I define a date range in such a situation?
Here's a link to the data sample, my coding and a printout of the data (access is open to the public): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I
To begin with, I would suggest rearranging your data so that each currency's OHLCV values are their own columns (e.g. "btc_open | btc_high" etc.). This makes generating correlation matrices far easier. I'd also suggest beginning with only one metric (e.g. close price) and perhaps period movement (e.g. close-open) in your analysis. To answer your question:
Pandas can return a correlation matrix of all columns with:
df.corr()
If you want to use only specific columns, select those from the DataFrame:
df[["col1", "col2"]].corr()
You can return a single correlation value between two columns with the form:
df["col1"].corr(df["col2"])
If you'd like to specify a specific date range, I'd refer you to this question. I believe this will require your date column or index to be of the type datetime. If you don't know how to work with or convert to this type, I would suggest consulting the pandas documentation (perhaps begin with pandas.to_datetime).
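As a sketch of what that looks like once your dates are a DatetimeIndex (for this data, after the transpose described below; the dates here are just for illustration):
# With a sorted DatetimeIndex, label slicing selects a date range directly.
df.index = pd.to_datetime(df.index)
df = df.sort_index()
subset = df.loc["2017-01-01":"2017-06-30"]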
In future, I would suggest including a data snippet in your post. I don't believe Google Drive is an appropriate form to share data, and it definitely is not appropriate to set the data to "request access".
EDIT: I checked your data and created a smaller subset to test this method on. If there are imperfections in the data you may find problems, but I had none when I tested it on a sample of your first 100 days and 10 coins (after transposing, df.iloc[:100, :10]).
Firstly, transpose the DataFrame so columns are organised by coin and rows are dates.
df = df.T
Following this, we concatenate to a new DataFrame (result). Alternatively, concatenate to the original and drop columns after. Unfortunately I can't think of a non-iterative method. This method goes column by column, creates a DataFrame for each coin, adds the coin name prefix to the column names, then concatenates each DataFrame to the end.
result = pd.DataFrame()
coins = df.columns.tolist()
for coin in coins:
    coin_data = df[coin]
    # Expand each cell's dict into its own columns, prefixed with the coin name.
    split_coin = coin_data.apply(pd.Series).add_prefix(coin + "_")
    result = pd.concat([result, split_coin], axis=1)
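From there, the correlations the question asks about come from the same calls as above; a sketch, with column names that depend on your coin names and dict keys (btc_close etc. are assumptions):
# Correlation between two coins' close prices (column names assumed).
result['btc_close'].corr(result['eth_close'])
# Correlation between one coin's high price and its volume.
result['btc_high'].corr(result['btc_volume'])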

Pandas: get average over certain rows and return as a dataframe

I have a df like this.
It contains speed and dir at a given date's hour and minute. For example, the first row records that at 7:11 on 20060101, dir=87 and speed=5.
Now, I think the data might be too precise, and I want to use the average for each hour in later computation. How can I do that?
I can do it by groupby:
df['Hr'] = df['HrMn'].apply(lambda x: str(x)[:-2])
df.groupby(['date', 'Hr'])['speed'].mean()
which would return what I want.
But it is not a DataFrame, so how can I use it in later computation? Specifically, I want to know:
whether the groupby approach I'm using is the right one for this problem, and if so, how to use its result as a DataFrame later (I also need to get dir, dir_max and other attributes as well);
the result groupby returns is not well-ordered (in date and Hr); is there any way to re-order it?
Update:
If I do df.groupby(['date', 'Hr'])['speed'].mean().unstack(), it returns
The data is certainly correct, but I still want it to follow the form of the initial dataframe, as
except that HrMn -> Hr
What you are getting is a Series with a MultiIndex. You can try:
df.groupby(['date', 'Hr'])['speed'].mean().reset_index()
If you want the mean for the rest of the columns, try
df.groupby(['date', 'Hr'])[['speed', 'dir_max', 'speed_max']].mean().reset_index()
EDIT:
Applying mean on the speed column and max on dir_max and speed_max:
df.groupby(['date', 'Hr']).agg({'speed': 'mean', 'dir_max': 'max', 'speed_max': 'max'}).reset_index()
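On the ordering part of the question: groupby sorts by the group keys by default, but since Hr was built as a string it sorts lexicographically ('10' before '2'). A sketch of one fix, assuming every Hr value can be cast to int:
out = df.groupby(['date', 'Hr']).agg({'speed': 'mean', 'dir_max': 'max', 'speed_max': 'max'}).reset_index()
# Cast Hr to int so 2 sorts before 10, then order chronologically.
out['Hr'] = out['Hr'].astype(int)
out = out.sort_values(['date', 'Hr']).reset_index(drop=True)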
