This question already has an answer here:
Renaming columns when using resample
(1 answer)
Closed 5 years ago.
The line of code below takes columns that represent each month's total sales and averages the sales by quarter.
mdf = tdf[sel_cols].resample('3M',axis=1).mean()
What I need to do is label the columns with a string (I cannot use the pandas .Period function).
I am attempting to use the following code, but I cannot get it to work.
mdf = tdf[sel_cols].resample('3M',axis=1).mean().rename(columns=lambda x: '{:}q{:}'.format(x.year, [1, 2, 3, 4][x.quarter==1]))
I want the columns to read 2000q1, 2000q2, 2000q3, 2000q4, 2001q1, ... etc., but I keep getting incorrect labels like 2000q1, 2000q1, 2000q1, 2000q2, 2001q1.
How can I use the .format function to make this work properly?
The easiest way is to use the .quarter attribute of each datetime column label directly, like so:
mdf = tdf[sel_cols].resample('3M',axis=1).mean().rename(columns=lambda x: '{:}q{:}'.format(x.year,x.quarter))
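A minimal sketch with hypothetical data: monthly columns are averaged into three-month bins, then renamed to strings built from each label's .year and .quarter (note that axis=1 resampling is deprecated in recent pandas versions):
import numpy as np
import pandas as pd

cols = pd.date_range('2000-01-31', periods=12, freq='M')   # monthly column labels
tdf = pd.DataFrame(np.random.rand(3, 12), columns=cols)

mdf = (tdf.resample('3M', axis=1)
          .mean()
          .rename(columns=lambda x: '{}q{}'.format(x.year, x.quarter)))
print(mdf.columns.tolist())   # string labels such as '2000q1', '2000q2', ...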
This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 2 years ago.
I have used the following code to make a point plot.
data_agg = data.groupby('HourOfDay')['travel_time'].aggregate(np.median).reset_index()
plt.figure(figsize=(12,3))
sns.pointplot(data.HourOfDay.values, data.travel_time.values)
plt.show()
However I want to choose hours above 8 only and not 0-7. How do I proceed with that?
What about filtering first?
data_filtered = data[data['HourOfDay'] > 7]
# adjust the comparison depending on the dtype of the HourOfDay column
data_agg = data_filtered.groupby('HourOfDay')['travel_time'].aggregate(np.median).reset_index()
plt.figure(figsize=(12,3))
sns.pointplot(data_filtered.HourOfDay.values, data_filtered.travel_time.values)
plt.show()
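A self-contained sketch with made-up data (it assumes columns named 'HourOfDay' and 'travel_time', as in the question):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'HourOfDay': np.random.randint(0, 24, 500),
    'travel_time': np.random.gamma(2.0, 10.0, 500),
})

data_filtered = data[data['HourOfDay'] > 7]      # keep hours 8-23 only

plt.figure(figsize=(12, 3))
sns.pointplot(x=data_filtered['HourOfDay'], y=data_filtered['travel_time'])
plt.show()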
This question already has an answer here:
Hash each row of pandas dataframe column using apply
(1 answer)
Closed 2 years ago.
I have a dataframe df with the following columns:
Index(['learner_assignment_xid', 'assignment_xid', 'assignment_attempt_xid',
'learner_xid', 'section_xid', 'final_score_unweighted',
'attempt_score_unweighted', 'points_possible_unweighted',
'scored_datetime', 'gradebook_category_weight', 'status', 'is_deleted',
'is_scorable', 'drop_state', 'is_manual', 'created_datetime',
'updated_datetime'],
dtype='object')
I want to add a new column to this df, called checksum, which will concatenate some of these columns and take the MD5 hash of the result.
I am trying this:
df_gradebook['updated_checksum']=df_gradebook['final_score_unweighted'].astype(str)+df_gradebook['attempt_score_unweighted'].astype(str)+df_gradebook['points_possible_unweighted'].astype(str)+df_gradebook['scored_datetime'].astype(str)+df_gradebook['status'].astype(str)+df_gradebook['is_deleted'].astype(str)+df_gradebook['is_scorable'].astype(str)+df_gradebook['drop_state'].astype(str)+df_gradebook['updated_datetime'].astype(str)
The part I am struggling with is the hash: how do I apply MD5 after the concatenation is done?
I can do this in Spark Scala like this:
.withColumn("update_checksum",md5(concat(
$"final_score_unweighted",
$"attempt_score_unweighted",
$"points_possible_unweighted",
$"scored_datetime",
$"status",
$"is_deleted",
$"is_scorable",
$"drop_state",
$"updated_datetime"
)))
I wanted to know how I can do MD5 in Python.
import hashlib

df_gradebook['concat']=df_gradebook['final_score_unweighted'].astype(str)+df_gradebook['attempt_score_unweighted'].astype(str)+df_gradebook['points_possible_unweighted'].astype(str)+df_gradebook['scored_datetime'].astype(str)+df_gradebook['status'].astype(str)+df_gradebook['is_deleted'].astype(str)+df_gradebook['is_scorable'].astype(str)+df_gradebook['drop_state'].astype(str)+df_gradebook['updated_datetime'].astype(str)
df_gradebook['digest'] = df_gradebook['concat'].apply(lambda x: hashlib.md5(x.encode()).hexdigest())
Don't do everything in a single line; it makes it harder to read.
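For readability, here is a sketch that builds the same checksum from a list of column names (it assumes df_gradebook exists as in the question; the column list is taken from the question):
import hashlib

cols_to_hash = [
    'final_score_unweighted', 'attempt_score_unweighted',
    'points_possible_unweighted', 'scored_datetime', 'status',
    'is_deleted', 'is_scorable', 'drop_state', 'updated_datetime',
]

def md5_of_row(row):
    # concatenate the selected values as strings, then hash the result
    joined = ''.join(str(v) for v in row)
    return hashlib.md5(joined.encode()).hexdigest()

df_gradebook['updated_checksum'] = df_gradebook[cols_to_hash].apply(md5_of_row, axis=1)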
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
So here's my daily challenge:
I have an Excel file containing a list of streets, and some of those streets will be doubled (or tripled) based on their road type. For instance:
In another Excel file, I have the street names (without duplicates) and their mean distances between features, such as this:
Both Excel files have been converted to pandas dataframes as so :
duplicates_df = pd.DataFrame()
duplicates_df['Street_names'] = street_names
dist_df=pd.DataFrame()
dist_df['Street_names'] = names_dist_values
dist_df['Mean_Dist'] = dist_values
dist_df['STD'] = std_values
I would like to find a way to repeat the mean distance and STD values in duplicates_df whenever a street has more than one occurrence, but I am struggling with the proper syntax. This is probably an easy fix, but I've never done this before.
The desired output would be:
Any help would be greatly appreciated!
Thanks again!
pd.merge(duplicates_df, dist_df, on="Street_names")
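A tiny illustration with made-up street data (the column names follow the question):
import pandas as pd

duplicates_df = pd.DataFrame({'Street_names': ['Main St', 'Main St', 'Oak Ave']})
dist_df = pd.DataFrame({'Street_names': ['Main St', 'Oak Ave'],
                        'Mean_Dist': [12.3, 45.6],
                        'STD': [1.1, 2.2]})

merged = pd.merge(duplicates_df, dist_df, on='Street_names')
print(merged)
# each duplicated street keeps its own row, with Mean_Dist and STD repeated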
This question already has answers here:
Pandas: Selecting DataFrame rows between two dates (Datetime Index)
(3 answers)
Select rows between two DatetimeIndex dates
(2 answers)
Closed 4 years ago.
I've got a data frame of weekly stock price returns that are indexed by date, as follows.
FTSE_350 SP_500
2005-01-14 -0.004498 -0.001408
2005-01-21 0.001287 -0.014056
2005-01-28 0.011469 0.002988
2005-02-04 0.016406 0.027037
2005-02-11 0.015315 0.001887
I would like to return a data frame of rows where the index is in some interval, let's say all dates in January 2005. I'm aware that I could do this by turning the index into a "Date" column, but I was wondering if there's any way to do this directly.
Yes, there is, and it's even simpler than adding a column!
Use the .loc accessor and slice by date, like:
print(df.loc['2005-01-01':'2005-01-31'])
Output:
FTSE_350 SP_500
2005-01-14 -0.004498 -0.001408
2005-01-21 0.001287 -0.014056
2005-01-28 0.011469 0.002988
By the way, if the index contains strings (object dtype) rather than datetimes, first convert it:
df.index = pd.to_datetime(df.index)
As @Peter mentioned, the simplest way is:
print(df.loc['2005-01'])
Also outputs:
FTSE_350 SP_500
2005-01-14 -0.004498 -0.001408
2005-01-21 0.001287 -0.014056
2005-01-28 0.011469 0.002988
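A minimal sketch, reusing the returns shown in the question:
import pandas as pd

df = pd.DataFrame(
    {'FTSE_350': [-0.004498, 0.001287, 0.011469, 0.016406, 0.015315],
     'SP_500':   [-0.001408, -0.014056, 0.002988, 0.027037, 0.001887]},
    index=pd.to_datetime(['2005-01-14', '2005-01-21', '2005-01-28',
                          '2005-02-04', '2005-02-11']))

print(df.loc['2005-01-01':'2005-01-31'])   # explicit date-range slice
print(df.loc['2005-01'])                   # partial-string indexing for the month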
This question already has answers here:
How to access pandas groupby dataframe by key
(6 answers)
Closed 8 years ago.
I want to group a dataframe by a column, called 'A', and inspect a particular group.
grouped = df.groupby('A', sort=False)
However, I don't know how to access a group. For example, I expected that
grouped.first()
would give me the first group, or that
grouped['foo']
would give me the group where A == 'foo'.
However, Pandas doesn't work like that.
I couldn't find a similar example online.
Try grouped.get_group('foo'); that is what you need.
from io import StringIO  # in Python 2, use: from StringIO import StringIO
import pandas
data = pandas.read_csv(StringIO("""\
area,core,stratum,conc,qual
A,1,a,8.40,=
A,1,b,3.65,=
A,2,a,10.00,=
A,2,b,4.00,ND
A,3,a,6.64,=
A,3,b,4.96,=
"""), index_col=[0,1,2])
groups = data.groupby(level=['area', 'stratum'])
groups.get_group(('A', 'a')) # make sure it's a tuple
conc qual
area core stratum
A 1 a 8.40 =
2 a 10.00 =
3 a 6.64 =