I'm stuck trying to figure out how to sum one of the columns in my dataframe based on day/month/year, etc. I don't want to apply the same aggregation to the other columns; since the resampled dataframe will be shorter, I would like to take the minimum value from the other columns instead.
This is what I have, but it does not produce what I want. It only sums the first and last part and then gives me NaN values for the rest.
import numpy as np
import pandas as pd

df = pd.DataFrame(zip(points, data, junk), columns=['Dates', 'Data', 'Junk'])
df.set_index('Dates', inplace=True)
# sum Data per day, but take the minimum of Junk (how= only exists in old pandas)
_add = {'Data': np.sum, 'Junk': np.min}
newdf = df.resample('D', how=_add)
Thanks
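For reference, a minimal sketch of the same per-column aggregation on current pandas versions, where the how= keyword has been removed in favour of .agg:
# assuming df is indexed by a DatetimeIndex called 'Dates', as above
newdf = df.resample('D').agg({'Data': 'sum', 'Junk': 'min'})
Note that resample emits one row for every period in the date range, so days with no observations come out as NaN; newdf.dropna() removes them if they are not wanted.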
I have a .csv file with many rows and columns. For analysis purposes, I want to select a row number from the dataset and pass it as a dataframe in pandas.
Instead of writing the column names and input values into a dict by hand, how can I make this faster?
Right now I have:
df= pd.read_csv('filename.csv')
# scalar values need to be wrapped in lists (or given an index) to build a one-row frame
df2 = pd.DataFrame({'var1': [5], 'var2': [10], 'var3': [15]})
var1, var2, var3 are df columns. I want to make a separate dataframe with data from df; selecting either a random row or a given row number would work.
Thank you for your help.
# slicing keeps the result a one-row DataFrame (df.iloc[rownum] would return a Series)
df2 = df.iloc[rownum:rownum + 1, :]
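For the random-row option mentioned in the question, a sketch using DataFrame.sample:
# pick one row at random; the result is again a one-row DataFrame
df2 = df.sample(n=1)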
If you want to filter data out of an existing dataframe into a new one, you can use something like this.
Based on the particular rows required:
# positional slice: row 4 only (the slice end is exclusive)
df2 = df.iloc[4:5, :]
Or using some condition:
# boolean mask: keep the rows where var1 is less than 10
df3 = df[df['var1'] < 10]
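The same condition can also be written with DataFrame.query, if you prefer that style:
# equivalent boolean filter using query syntax
df3 = df.query('var1 < 10')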
I have a CSV of data for different stock options and their close/high/low/open values, etc., but the format of the data is a bit difficult to work with. When calculating returns using the adjusted close value, I have to create a new df each time to drop the null values.
Original Data
How do I convert it to the following format instead?
Converted data
The best way I could think of is to pivot the data:
# Drop date column (as it is already in the index), and pivot on Feature
df2 = df.drop(columns="Date").pivot(columns="Feature")
# Swap the column levels, so that Feature is first, then Ticker
df2 = df2.swaplevel(0, 1, 1)
# Stack the columns, so Ticker is one column, increasing the number of rows
df2 = df2.stack()
# Reset the index, but keep it (as the Date column)
df2.reset_index(inplace=True)
# Sort the rows on the Ticker and Date
df2.sort_values(["level_1", "Date"], inplace=True)
# Rename the Ticker column
df2.rename(columns={"level_1": "Ticker"}, inplace=True)
# Reset the index
df2.reset_index(drop=True, inplace=True)
This could all be run as one chained expression, rather than reassigning df2 at each step:
df2 = (df.drop(columns="Date")
         .pivot(columns="Feature")
         .swaplevel(0, 1, 1)
         .stack()
         .reset_index()
         .sort_values(["level_1", "Date"])
         .rename(columns={"level_1": "Ticker"})
         .reset_index(drop=True))
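To see the pipeline end to end, here is a minimal, self-contained sketch on made-up data (the tickers, dates, and prices below are hypothetical stand-ins for the screenshots):
import pandas as pd

# One row per (Date, Feature), one column per ticker, with Date both as
# the index and as a column, mirroring the layout described above.
df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "Feature": ["Open", "Close", "Open", "Close"],
    "AAPL": [130.0, 131.0, 131.5, 132.0],
    "MSFT": [215.0, 216.0, 216.5, 217.0],
}).set_index("Date", drop=False)

df2 = (df.drop(columns="Date")
         .pivot(columns="Feature")   # columns become (Ticker, Feature) pairs
         .swaplevel(0, 1, axis=1)    # reorder to (Feature, Ticker)
         .stack()                    # Ticker moves into the row index
         .reset_index()
         .sort_values(["level_1", "Date"])
         .rename(columns={"level_1": "Ticker"})
         .reset_index(drop=True))
print(df2)  # one row per (Ticker, Date), one column per Feature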
Hopefully this works for you!
I created a pandas DataFrame that holds various summary statistics for several variables in my dataset. I want to name the columns of the dataframe, but every time I try, it deletes all my data. Here is what it looks like without column names:
MIN = df.min(axis=0, numeric_only=True)
MAX = df.max(axis=0, numeric_only=True)
RANGE = MAX-MIN
MEAN = df.mean(axis=0, numeric_only=True)
MED = df.median(axis=0, numeric_only=True)
sum_stats = pd.concat([MIN, MAX, RANGE, MEAN, MED], axis=1)
sum_stats = pd.DataFrame(data=sum_stats)
sum_stats
My output looks like this:
But for some reason when I add column names:
sum_stats = pd.concat([MIN, MAX, RANGE, MEAN, MED], axis=1)
columns = ['MIN', 'MAX', 'RANGE', 'MEAN', 'MED']
sum_stats = pd.DataFrame(data=sum_stats, columns=columns)
sum_stats
My output becomes this:
Any idea why this is happening?
From the documentation for the columns parameter of the pd.DataFrame constructor:
[...] If data contains column labels, will perform column selection instead.
That means that if the data passed in is already a dataframe, the columns parameter acts as a list of columns to select from that data.
If you set columns to a list of columns that already exist in the dataframe you're passing, e.g. columns=[1, 4], you'll see that the resulting dataframe contains only those two columns, copied from the original dataframe.
Instead, you can assign the columns after you create the dataframe:
sum_stats.columns = ['MIN', 'MAX', 'RANGE', 'MEAN', 'MED']
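Alternatively (a sketch using the same variable names as above), the columns can be named at concat time via the keys argument, which avoids the second DataFrame constructor altogether:
sum_stats = pd.concat([MIN, MAX, RANGE, MEAN, MED], axis=1,
                      keys=['MIN', 'MAX', 'RANGE', 'MEAN', 'MED'])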
Here is my data
threats = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-18/threats.csv', index_col = 0)
And here is my code:
df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened': 'size'}))
However, df.columns returns only Index(['threatened'], dtype='object'). That is, only the threatened column is there, not the columns I actually grouped by (continent and threat_type), even though they are present in my data frame.
I would like to perform operations on the continent column of my data frame, but it is not showing up as one of the columns. For example, continents = df.continent.unique() gives a KeyError saying continent is not found.
After a groupby, pandas puts the grouping columns into the index. Reset the index after doing a groupby in pandas, and don't pass drop=True (that would discard the grouping columns instead of restoring them).
After your code:
df = df.reset_index()
And then you will get required columns.
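Alternatively, the grouping columns can be kept as regular columns from the start by passing as_index=False to groupby. A sketch of the same query:
df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'], as_index=False)
      .agg({'threatened': 'size'}))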
I have a CSV file with multiple columns.
As an example, here is the header and the first 2 rows of the file:
ACC;SYM;SumRealPNL;Count;MinAVG;PerLotPNL;SumOneLotPNL;ProfitOnly;ProfitOnlyCount;ProfitOnlyMinAVG;LossOnly;LossOnlyCount;LossOnlyMinAVG;Period;-;P;Q;R;S;Total;U;AS;W;YEAH;Y
31942;EURUSD;4.593,00;17;730;336,47;5.720,00;5.720,00;17;730;0,00;0;0;4;;1;2;0;1;4;A;31942EURUSD1;12;16;18
34887;XAUUSD;16.150,00;7;276;588,43;4.119,00;4.119,00;7;276;0,00;0;0;4;;1;2;0;1;4;A;34887XAUUSD1;12;16;18
I load the csv file to a dataframe:
df = pd.read_csv('aaaa.csv', header=0, sep=';')
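One caveat, judging from the sample rows: columns such as SumRealPNL use European number formatting (4.593,00), so they will be read as strings unless read_csv is told about the separators. A sketch:
# decimal comma and thousands dot, as in the sample rows above
df = pd.read_csv('aaaa.csv', header=0, sep=';', decimal=',', thousands='.')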
I grouped the dataframe by the AS column:
byAS = df.groupby('AS')
Now I want to create a new dataframe having the following columns using the DataFrameGroupBy object (byAS):
AS column
First value of ACC column
First value of U column
Average of PerLotPNL column
Sum of SumOneLotPNL column
Sum of Y column
How can I do that?
Once you have your dataframe df and have grouped on the AS column as in your post, you can use the agg function to obtain the desired output.
byAS = df.groupby('AS')
# reset_index is called without inplace=True: with inplace=True it returns
# None, which would leave result empty
result = byAS.agg({'ACC': 'first',
                   'U': 'first',
                   'PerLotPNL': 'mean',
                   'SumOneLotPNL': 'sum',
                   'Y': 'sum'}).reset_index()
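A sketch of the same aggregation using the named-aggregation syntax available since pandas 0.25, which makes the output column names explicit and keeps AS as a regular column:
result = (df.groupby('AS', as_index=False)
            .agg(ACC=('ACC', 'first'),
                 U=('U', 'first'),
                 PerLotPNL=('PerLotPNL', 'mean'),
                 SumOneLotPNL=('SumOneLotPNL', 'sum'),
                 Y=('Y', 'sum')))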