Creating a table for time series analysis including counts - python

I have a dataframe on which I would like to perform some analysis. A simple example of what I would like to achieve: given the dataframe
data = ['2017-02-13', '2017-02-13', '2017-02-13', '2017-02-15', '2017-02-16']
df = pd.DataFrame(data = data, columns = ['date'])
I would like to create a new dataframe from this with two columns: every date in the full date span (so it should also include 2017-02-14) and the number of times each date appears in the original data.
I managed to construct a dataframe that includes all the dates as so:
dates = pd.to_datetime(df['date'], format = "%Y-%m-%d")
dateRange = pd.date_range(start = dates.min(), end = dates.max()).tolist()
df2 = pd.DataFrame(data = dateRange, columns = ['datum'])
My question is: how would I add the counts of each date from df to df2? I've been messing around trying to write my own functions but have not managed to achieve it. I assume this is a common task and that I am overthinking it...

Try this:
df2['counts'] = df2['datum'].map(pd.to_datetime(df['date']).value_counts()).fillna(0).astype(int)
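For reference, here is a self-contained sketch of the same idea using the sample data from the question. It uses value_counts followed by reindex rather than map, which is an alternative route and not the code from the answer above; the names full_range and counts are only illustrative.

```python
import pandas as pd

data = ['2017-02-13', '2017-02-13', '2017-02-13', '2017-02-15', '2017-02-16']
df = pd.DataFrame(data=data, columns=['date'])

# Parse the strings and build the full daily range between first and last date
dates = pd.to_datetime(df['date'], format='%Y-%m-%d')
full_range = pd.date_range(start=dates.min(), end=dates.max())

# Count occurrences per date, then reindex so missing dates get a count of 0
counts = dates.value_counts().reindex(full_range, fill_value=0)

df2 = pd.DataFrame({'datum': full_range, 'counts': counts.to_numpy()})
print(df2)
```

Reindexing with fill_value=0 keeps the counts as integers, so no fillna/astype step is needed.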

Related

Pandas DataFrame sorting issues, grouping for no reason?

I have a data frame containing stats about an NBA season. I'm simply trying to sort by date, but for some reason it's grouping all games that have the same date and changing the values for that date to the same values.
df = pd.read_csv("gamedata.csv")
df["Total"] = df["Tm"] + df["Opp.1"]
teams = df['Team']
df = df.drop(columns=['Team'])
df.insert(loc=4, column='Team', value=teams)
df["W/L"] = df["W/L"]=="W"
df["W/L"] = df["W/L"].astype(int)
df = df.sort_values("Date")
df.to_csv("gamedata_clean.csv")
Before
After
I expected the df to be unchanged except for the order to be in ascending date, but it's changing values in other columns for reasons I do not know.
sort_values returns a new, sorted dataframe rather than sorting in place, so assign the result back:
df = df.sort_values(by='Date')
I hope this gives you the desired output.
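A small sketch of what sorting by date does may help here. The rows below are made up (only the column names Date, Team, Tm and Opp.1 come from the question), and converting Date with pd.to_datetime is an assumption about how the CSV stores it. sort_values only reorders whole rows; it does not alter values in other columns.

```python
import pandas as pd

# Hypothetical stand-in for gamedata.csv; the values are invented for illustration
df = pd.DataFrame({
    'Date': ['2019-11-02', '2019-10-25', '2019-10-30'],
    'Team': ['BOS', 'LAL', 'BOS'],
    'Tm': [104, 95, 116],
    'Opp.1': [102, 86, 105],
})

# Parse dates so the sort is chronological rather than alphabetical
df['Date'] = pd.to_datetime(df['Date'])

# sort_values returns a new frame; each row moves as a unit
df_sorted = df.sort_values('Date').reset_index(drop=True)
print(df_sorted)
```

If the values still look changed after sorting, comparing a few rows by a unique key before and after the sort is a quick way to check whether the data or only the order differs.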

python pandas: how to modify column header name and modify the date format

Using python pandas, how can we change the data frame as follows?
First, how do I copy the column name down to another cell (blue)?
Second, how do I delete the row and the index column (orange)?
Third, how do I modify the date format (green)?
I would appreciate any feedback.
Update
df.iloc[1,1] = df.columns[0]
df = df.iloc[1:].reset_index(drop=True)
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df = df.set_index('Date')
print(df.columns)
Question 1 - How to copy column name to a column (Edit- Rename column)
To rename a column, assign a new list to df.columns or use pandas.DataFrame.rename:
df.columns = ['Date','Asia Pacific Equity Fund']
# The list must have 2 entries because you have 2 columns
# Alternatively, rename using pandas.DataFrame.rename
df.rename(columns = {'Asia Pacific Equity Fund':'Date',"Unnamed: 1":"Asia Pacific Equity Fund"}, inplace = True)
df.columns returns all the column names of the dataframe, and you can access each name by its position.
Please refer to "Rename unnamed column pandas dataframe" for how to change unnamed columns.
Question 2 - Delete a row
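# Keep the rows from position 1 onward (this drops the first row)
df = df.iloc[1:].reset_index(drop=True)
# Or remove specific rows by index label; assign the result back
df = df.drop([0,1]).reset_index(drop=True)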
Question 3 - Modify the date format
current_format = '%Y-%m-%d %H:%M:%S'
desired_format = "%Y-%m-%d"
df['Date'] = pd.to_datetime(df['Date']).dt.strftime(desired_format)
# Alternatively (instead of the line above), pass the existing format explicitly via format=
df['Date'] = pd.to_datetime(df['Date'], format=current_format).dt.strftime(desired_format)
# To update the date format of the index
df.index = pd.to_datetime(df.index, format=current_format).strftime(desired_format)
Please refer to the pandas.to_datetime documentation for more details.
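As a minimal sketch of the date step, assuming the Date column holds strings such as '2019-01-31 00:00:00' (the sample values below are invented; only the column names come from the answer above):

```python
import pandas as pd

# Hypothetical sample rows for illustration
df = pd.DataFrame({
    'Date': ['2019-01-31 00:00:00', '2019-02-28 00:00:00'],
    'Asia Pacific Equity Fund': [10.5, 10.7],
})

current_format = '%Y-%m-%d %H:%M:%S'
desired_format = '%Y-%m-%d'

# Parse with the known input format, then render as date-only strings
df['Date'] = pd.to_datetime(df['Date'], format=current_format).dt.strftime(desired_format)

# Using Date as the index also removes the default integer index column
df = df.set_index('Date')
print(df)
```

Note that strftime produces strings; if you need real datetime values for later calculations, keep the result of pd.to_datetime and only format it when displaying or exporting.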
I'm not sure I understand your questions. I mean, do you actually want to change the dataframe or how it is printed/displayed?
Indexes can be changed using the methods .set_index() or .reset_index(), or dropped entirely. If you just want to remove the first digit from each index (that's what I understood from the orange column), you should create a list with the new index values and assign it as a column of your dataframe.
Regarding the date format, it depends on what you want the new format to be. Take a look at Python's datetime module.
I would strongly suggest taking a closer look at the pandas features and documentation, and at how to handle a dataframe with this library. There are plenty of great sources a Google search away :)
Delete the first two rows using this.
Rename the second column using this.
Work with datetime format using the datetime package. Read about it here

Match 2 data frames by date and column name to get values

I have two data frames (they are already in a data frame format but for illustration, I created them as a dictionary first):
first = {
'Date':['2013-02-14','2013-03-03','2013-05-02','2014-10-31'],
'Name':['Felix','Felix','Peter','Paul']}
df1 = pd.DataFrame(first)
And
second = {
'Date':['2013-02-28','2013-03-31','2013-05-30','2014-10-31'],
'Felix':['Value1_x','Value2_x','Value3_x','Value4_x'],
'Peter':['Value1_y','Value2_y','Value3_y','Value4_y']}
df2 = pd.DataFrame(second)
Now, I'd like to add an additional column to df1 containing the values of df2 if df1.Date matches df2.Date by year and month (the day does not usually match since df2 contains end-of-month dates) AND if the column name of df2 matches the corresponding df1.Name value.
So the result should look like this:
df_new = {
'Date':['2013-02-14','2013-03-03','2013-05-02','2014-10-31'],
'Name':['Felix','Felix','Peter','Paul'],
'Values':['Value1_x','Value2_x','Value3_y','NaN']}
df_new = pd.DataFrame(df_new)
Do you have any suggestions how to solve this problem?
I considered creating additional columns for year and month (df1['year']= df1['Date'].dt.year) and then matching df1[(df1['year'] == df2['year']) & (df1['month'] == df2['month'])] and calling the df2 column, but I can't figure out how to put everything together.
In general, try not to post your data sets as images, b/c it's hard to help you out then.
I think the easiest thing to do would be to create a column in each data frame where the Date is rounded to the first day of each month.
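# Convert to datetime first; subtracting an offset does not work on plain strings
# Note: a date that already falls on the 1st rolls back to the previous month's start
df1['Date_round'] = pd.to_datetime(df1['Date']) - pd.offsets.MonthBegin(1)
df2['Date_round'] = pd.to_datetime(df2['Date']) - pd.offsets.MonthBegin(1)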
Then reshape df2 using melt.
df2_reshaped = df2.melt(id_vars=['Date','Date_round'], var_name='Name', value_name='Values')
And then you can join the data frames on Date_round and Name using pd.merge.
df = pd.merge(df1, df2_reshaped.drop('Date', axis=1), how='left', on=['Date_round', 'Name'])
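Put together, the approach could look like the sketch below, run on the sample data from the question. It builds the month key with strftime('%Y-%m') instead of subtracting MonthBegin, which is a substitution on my part so that dates falling on the 1st stay in their own month; the column name Month and the variable df2_long are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2013-02-14', '2013-03-03', '2013-05-02', '2014-10-31'],
    'Name': ['Felix', 'Felix', 'Peter', 'Paul'],
})
df2 = pd.DataFrame({
    'Date': ['2013-02-28', '2013-03-31', '2013-05-30', '2014-10-31'],
    'Felix': ['Value1_x', 'Value2_x', 'Value3_x', 'Value4_x'],
    'Peter': ['Value1_y', 'Value2_y', 'Value3_y', 'Value4_y'],
})

# Year-month key shared by both frames
df1['Month'] = pd.to_datetime(df1['Date']).dt.strftime('%Y-%m')
df2['Month'] = pd.to_datetime(df2['Date']).dt.strftime('%Y-%m')

# Wide to long: one row per (Month, Name) pair
df2_long = df2.drop(columns='Date').melt(
    id_vars='Month', var_name='Name', value_name='Values'
)

# Left join keeps every row of df1; names without a column in df2 get NaN
df_new = pd.merge(df1, df2_long, how='left', on=['Month', 'Name']).drop(columns='Month')
print(df_new)
```

The left join is what produces NaN for Paul, who has no matching column in df2.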

Is there a way to rename multiple df columns in Python?

I'm trying to rename multiple columns in a dataframe to certain dates with Python.
Currently, the columns are as such: 2016-04, 2016-05, 2016-06....
I would like the columns to read: April2016, May2016, June2016...
There are around 40 columns. I am guessing a for loop would be the most efficient way to do this, but I'm relatively new to Python and not sure how to concatenate the column names correctly.
You can use loops or comprehensions along with a month dictionary to split, reorder, and replace the string column names
# in your case this would be cols = df.columns
cols = ['2016-04', '2016-05', '2016-06']
rpl = {'04': 'April', '05': 'May', '06': 'June'}
# split each name into year and month, then join month name + year
cols = [rpl[c.split('-')[1]] + c.split('-')[0] for c in cols]
cols
['April2016', 'May2016', 'June2016']
# then you would assign it back with df.columns = cols
You didn't share your dataframe, so I used a basic dataframe to show how to get the month name from a given date. I assumed your dataframe looks like this:
d = {'dates': ['2016-04', '2016-05','2016-06']} #just 3 of them
So the full code is:
import datetime
import pandas as pd
d = {'dates': ['2016-04', '2016-05','2016-06']}
df = pd.DataFrame(d)
for index, row in df.iterrows():
    get_date = row['dates'].split('-')
    get_month = get_date[1]
    # Any year/day works here; only the month is used to look up its name
    month = datetime.date(1900, int(get_month), 1).strftime('%B')
    print(month + get_date[0])
OUTPUT :
April2016
May2016
June2016
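Since the question is about column names rather than cell values, here is a short sketch that renames the columns directly, without a loop or a hand-written month dictionary. It assumes every column name has the form YYYY-MM; the sample values are invented, and the month names come out in English under the default locale.

```python
import pandas as pd

# Hypothetical frame whose columns are year-month strings
df = pd.DataFrame([[1, 2, 3]], columns=['2016-04', '2016-05', '2016-06'])

# Parse the column labels as dates, then format as full month name + year
df.columns = pd.to_datetime(df.columns, format='%Y-%m').strftime('%B%Y')
print(df.columns.tolist())
```

If only some of the columns are dates, a dictionary built from those names and passed to df.rename(columns=...) would leave the others untouched.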

how can I use reset_index with the multi grouped values(Hierarchical format) in Pandas Python

This is my data format. I want to reset the index and turn it into a single flat table, so that I can take the count of all the IDs (the 2nd row) and plot them as a histogram by date and count.
Any simple idea?
If reset_index() is not working, you can also convert the table manually.
Assume df1 is your existing data frame; we'll create df2, the new one that you want.
df2 = pd.DataFrame()
df2['DateTime'] = df1.index.get_level_values(0).tolist()
df2['ID1'] = df1.index.get_level_values(1).tolist()
df2['ID2'] = df1['ID2'].values.tolist()
df2['Count'] = df1['Count'].values.tolist()
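For comparison, here is a sketch of the reset_index() route. The original data was only posted as an image, so the frame below is invented; it assumes the hierarchical index levels are named DateTime and ID1 and the columns are ID2 and Count, as in the manual conversion above.

```python
import pandas as pd

# Hypothetical raw events, grouped to produce a two-level index
raw = pd.DataFrame({
    'DateTime': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-02'],
    'ID1': [1, 1, 2, 1],
    'ID2': [10, 10, 20, 10],
})
df1 = raw.groupby(['DateTime', 'ID1']).agg(
    ID2=('ID2', 'first'), Count=('ID2', 'size')
)

# Move both index levels back into ordinary columns
df2 = df1.reset_index()

# Total count per date, ready for a bar chart (e.g. per_day.plot(kind='bar'))
per_day = df2.groupby('DateTime')['Count'].sum()
print(df2)
print(per_day)
```

When the index levels have names, reset_index() gives the same flat table as the manual approach in one call.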
