Get unique column value by date in Python

I've generated this DataFrame:
np.random.seed(123)
len_df = 10
groups_list = ['A','B']
dates_list = pd.date_range(start='1/1/2020', periods=10, freq='D').to_list()
df2 = pd.DataFrame()
df2['date'] = np.random.choice(dates_list, size=len_df)
df2['value'] = np.random.randint(232, 1532, size=len_df)
df2['group'] = np.random.choice(groups_list, size=len_df)
df2 = df2.sort_values(by=['date'])
df2.reset_index(drop=True, inplace=True)
date group value
0 2020-01-01 A 652
1 2020-01-02 B 1174
2 2020-01-02 B 1509
3 2020-01-02 A 840
4 2020-01-03 A 870
5 2020-01-03 A 279
6 2020-01-04 B 456
7 2020-01-07 B 305
8 2020-01-07 A 1078
9 2020-01-10 A 343
I need to get rid of duplicated groups on the same date: each group should appear only once per date.
Result
date group value
0 2020-01-01 A 652
1 2020-01-02 B 1174
2 2020-01-02 A 840
3 2020-01-03 A 870
4 2020-01-04 B 456
5 2020-01-07 B 305
6 2020-01-07 A 1078
7 2020-01-10 A 343

.drop_duplicates() in pandas lets you do exactly that. Read more in the documentation.
df2.drop_duplicates(subset=["date", "group"], keep="first")
Out[9]:
date group value
0 2020-01-01 A 652
1 2020-01-02 B 1174
3 2020-01-02 A 840
4 2020-01-03 A 870
6 2020-01-04 B 456
7 2020-01-07 B 305
8 2020-01-07 A 1078
9 2020-01-10 A 343

You can use drop_duplicates() to drop rows based on a subset of columns. However, you need to specify which row to keep, e.g. the first or last one.
df2 = df2.drop_duplicates(subset=['date', 'group'], keep='first')
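For completeness, a minimal sketch of the other keep options, run against the df2 from the question (before deduplication):

# keep='last' keeps the final occurrence of each (date, group) pair
df2.drop_duplicates(subset=['date', 'group'], keep='last')
# keep=False drops every row whose (date, group) pair occurs more than once
df2.drop_duplicates(subset=['date', 'group'], keep=False)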

You are looking for the drop_duplicates method on a DataFrame.
df2 = df2.drop_duplicates(subset=['date', 'group'], keep='first').reset_index(drop=True)
date value group
0 2020-01-01 652 A
1 2020-01-02 1174 B
2 2020-01-02 840 A
3 2020-01-03 870 A
4 2020-01-04 456 B
5 2020-01-07 305 B
6 2020-01-07 1078 A
7 2020-01-10 343 A
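If you need to control which row survives rather than taking the first, e.g. keeping the row with the largest value per (date, group), a sketch using the original df2 is to sort so the desired row comes first and then deduplicate:

result = (df2.sort_values('value', ascending=False)
             .drop_duplicates(subset=['date', 'group'])
             .sort_values('date')
             .reset_index(drop=True))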

Related

Duplicate rows for dates between interval

I have a frame like this:
ID  Start       Stop
1   2020-01-01  2020-01-05
2   2020-01-01  2020-01-10
And I want to duplicate the rows so I end up with a table like this:
ID  Start       Stop        Date
1   2020-01-01  2020-01-05  2020-01-01
1   2020-01-01  2020-01-05  2020-01-02
1   2020-01-01  2020-01-05  2020-01-03
1   2020-01-01  2020-01-05  2020-01-04
1   2020-01-01  2020-01-05  2020-01-05
2   2020-01-01  2020-01-10  2020-01-01
2   2020-01-01  2020-01-10  2020-01-02
2   2020-01-01  2020-01-10  2020-01-03
2   2020-01-01  2020-01-10  2020-01-04
2   2020-01-01  2020-01-10  2020-01-05
2   2020-01-01  2020-01-10  2020-01-06
2   2020-01-01  2020-01-10  2020-01-07
2   2020-01-01  2020-01-10  2020-01-08
2   2020-01-01  2020-01-10  2020-01-09
2   2020-01-01  2020-01-10  2020-01-10
I am, however, lost on how to achieve this. Any pointers?
Generate a list of dates using date_range(), then expand it using explode():
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""ID Start Stop
1 2020-01-01 2020-01-05
2 2020-01-01 2020-01-10
"""), sep=r"\s+", index_col=0)
df.Start = pd.to_datetime(df.Start)
df.Stop = pd.to_datetime(df.Stop)
df.assign(Date=lambda dfa: dfa.apply(lambda r: pd.date_range(r["Start"], r["Stop"]), axis=1)).explode("Date")
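Since ID was read in as the index, appending reset_index() brings it back as a column to match the target table, e.g.:

(df.assign(Date=lambda dfa: dfa.apply(lambda r: pd.date_range(r["Start"], r["Stop"]), axis=1))
   .explode("Date")
   .reset_index())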
You should look into the DataFrame method iterrows to iterate over the rows of your original DataFrame, and the function date_range to create the dates between each Start and Stop date.
Create a new DataFrame for each row of your original DataFrame (df), then combine all the created DataFrames into one big DataFrame.
import pandas as pd

expanded_dfs = []
for idx, row in df.iterrows():
    # all dates between Start and Stop, inclusive
    dates = pd.date_range(row["Start"], row["Stop"], freq="D")
    expanded = pd.DataFrame({
        "Start": row["Start"],
        "End": row["Stop"],
        "Date": dates,
        "ID": row["ID"]
    })
    expanded_dfs.append(expanded)
pd.concat(expanded_dfs)
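For larger frames, a vectorized sketch that avoids the Python-level loop (assuming ID is a regular column, as in the question): repeat each row once per day in its interval, then add a running day offset to Start.

import numpy as np
import pandas as pd

n_days = (df["Stop"] - df["Start"]).dt.days + 1           # days per interval, inclusive
out = df.loc[df.index.repeat(n_days)].copy()              # repeat each row n_days times
offsets = np.concatenate([np.arange(n) for n in n_days])  # 0..n_days-1 within each block
out["Date"] = out["Start"] + pd.to_timedelta(offsets, unit="D")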

Python difference with previous row by group

I am trying to compute the difference from the previous row's value in a DataFrame, grouped by the column "group". There are several similar questions, but I can't get this working.
date group value
0 2020-01-01 A 808
1 2020-01-01 B 331
2 2020-01-02 A 612
3 2020-01-02 B 1391
4 2020-01-03 A 234
5 2020-01-04 A 828
6 2020-01-04 B 820
7 2020-01-05 A 1075
8 2020-01-07 B 572
9 2020-01-10 B 736
10 2020-01-10 A 1436
df.sort_values(['group','date'], inplace=True)
df['diff'] = df['value'].diff()
print(df)
date value group diff
1 2020-01-03 234 A NaN
8 2020-01-01 331 B 97.0
2 2020-01-07 572 B 241.0
9 2020-01-02 612 A 40.0
5 2020-01-10 736 B 124.0
17 2020-01-01 808 A 72.0
14 2020-01-04 820 B 12.0
4 2020-01-04 828 A 8.0
18 2020-01-05 1075 A 247.0
7 2020-01-02 1391 B 316.0
10 2020-01-10 1436 A 45.0
This is the result that I need:
date group value diff
0 2020-01-01 A 808 NaN
2 2020-01-02 A 612 -196
4 2020-01-03 A 234 -378
5 2020-01-04 A 828 594
7 2020-01-05 A 1075 247
10 2020-01-10 A 1436 361
1 2020-01-01 B 331 NaN
3 2020-01-02 B 1391 1060
6 2020-01-04 B 820 -571
8 2020-01-07 B 572 -248
9 2020-01-10 B 736 164
Shift the value column within each group to create a calculated column, then subtract that column from the original value column to get the difference.
df.sort_values(['group','date'], ascending=[True,True], inplace=True)
df['shift'] = df.groupby('group')['value'].shift()
df['diff'] = df['value'] - df['shift']
df = df[['date','group','value','diff']]
df
date group value diff
0 2020-01-01 A 808 NaN
2 2020-01-02 A 612 -196.0
4 2020-01-03 A 234 -378.0
5 2020-01-04 A 828 594.0
7 2020-01-05 A 1075 247.0
10 2020-01-10 A 1436 361.0
1 2020-01-01 B 331 NaN
3 2020-01-02 B 1391 1060.0
6 2020-01-04 B 820 -571.0
8 2020-01-07 B 572 -248.0
9 2020-01-10 B 736 164.0
You can use groupby() together with diff():
df = df.sort_values('date')
df['diff'] = df.groupby(['group'])['value'].diff()
gives
date group value diff
0 2020-01-01 A 808 NaN
1 2020-01-01 B 331 NaN
2 2020-01-02 A 612 -196.0
3 2020-01-02 B 1391 1060.0
4 2020-01-03 A 234 -378.0
5 2020-01-04 A 828 594.0
6 2020-01-04 B 820 -571.0
7 2020-01-05 A 1075 247.0
8 2020-01-07 B 572 -248.0
10 2020-01-10 A 1436 361.0
9 2020-01-10 B 736 164.0
If you want the dataset ordered as you have it, you can add group to the sort, but it's not necessary for the operation and can be done before or after you compute the differences.
df.sort_values(['group','date'])
date group value diff
0 2020-01-01 A 808 NaN
2 2020-01-02 A 612 -196.0
4 2020-01-03 A 234 -378.0
5 2020-01-04 A 828 594.0
7 2020-01-05 A 1075 247.0
10 2020-01-10 A 1436 361.0
1 2020-01-01 B 331 NaN
3 2020-01-02 B 1391 1060.0
6 2020-01-04 B 820 -571.0
8 2020-01-07 B 572 -248.0
9 2020-01-10 B 736 164.0
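Note that diff() runs over the rows in their current order within each group, so sort by date before differencing. A compact sketch combining both steps, using the same column names:

df = (df.sort_values(['group', 'date'])
        .assign(diff=lambda d: d.groupby('group')['value'].diff()))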

How do I add a column to a dataframe using date as a filter

I'm trying to add a column from one pandas DataFrame to another, using the date columns to match them up. As you can see, one DataFrame has one extra date I'd like to just skip. I figured an if statement would work for this, but I'm getting: ValueError: Can only compare identically-labeled Series objects
Date Adj Close
0 2020-01-02 9.903333
1 2020-01-03 9.883333
2 2020-01-06 9.883333
3 2020-01-07 9.883333
4 2020-01-08 9.913333
.. ... ...
163 2020-08-25 10.133333
164 2020-08-26 10.173333
165 2020-08-27 10.183333
166 2020-08-28 10.206667
167 2020-08-31 10.203334
[168 rows x 2 columns]
Date Capital Stock
0 2020-01-02 7251.39
1 2020-01-03 47200.86
2 2020-01-06 119020.28
3 2020-01-07 11751250.39
4 2020-01-08 4790267.25
.. ... ...
162 2020-08-25 6348.29
163 2020-08-26 -11870.05
164 2020-08-27 73210.22
165 2020-08-28 120581.32
166 2020-08-31 134085.86
[167 rows x 2 columns]
Here's what I've got:
if dshares_df['Date'] == price3['Date']:
    dshares_df['close'] = price3['Adj Close']
For this case you want to merge the two DataFrames.
Try:
merged_df = dshares_df.merge(price3, on='Date')
# the default inner join already drops the date missing from either frame;
# dropna() is only needed if the data itself contains NaNs
merged_df = merged_df.dropna()
see the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
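If you would rather keep every row from one side and see which date failed to match, a left-join sketch with the same frame names leaves NaN in the unmatched row instead of dropping it:

merged_df = price3.merge(dshares_df, on='Date', how='left')
missing = merged_df[merged_df['Capital Stock'].isna()]  # the extra date with no match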

Pandas: time column addition and repeating all rows for a month

I'd like to change my DataFrame by adding a time column with hourly intervals covering a whole month.
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number-1 = month_days_number*hours_per_day*orig_rows_number - 1
What is the proper way to perform it?
Use a cross join via DataFrame.merge with a helper DataFrame holding every hour of the month, created with date_range:
df1 = pd.DataFrame({'a': 1,
                    'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
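On pandas 1.2+ the helper column is unnecessary, since merge supports a cross join directly; a minimal sketch starting from the original df:

times = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.merge(times, how='cross')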

Pandas DataFrame Pivot Using Dates and Counts

I've taken a large data file and managed to use groupby and value_counts to get the DataFrame below. However, I want to format it so the company is on the left, with the months on top, and each cell holding the number of calls that month (the third column below).
Here is my code to sort:
data = pd.read_csv('MYDATA.csv')
data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count)
df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
Here is my output df=
recvd_dttm CompanyName
1/1/2015 11:42 Company 1 1
1/1/2015 14:29 Company 2 1
1/1/2015 8:12 Company 4 1
1/1/2015 9:53 Company 1 1
1/10/2015 11:38 Company 3 1
1/10/2015 11:31 Company 5 1
1/10/2015 12:04 Company 2 1
I want
Company Jan Feb Mar Apr May
Company 1 10 4 45 40 34
Company 2 2 5 56 5 57
Company 3 3 7 71 6 53
Company 4 4 4 38 32 2
Company 5 20 3 3 3 29
I know that there is a nifty pivot function for dataframes from this documentation http://pandas.pydata.org/pandas-docs/stable/reshaping.html for pandas, so I've been trying to use df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
One problem is that the third column doesn't have a name, so I can't use it for values = 'NumberCalls'. The second problem is figuring out how to take the datetime format in my dataframe and make it display by month only.
Edit:
CompanyName is the first column, recvd_dttm is the 15th column. This is my code after some more attempts:
data = pd.read_csv('MYDATA.csv')
data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
RatedCustomerCallers = data['CompanyName'].value_counts()
count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count).set_index('recvd_dttm').sort_index()
df.index = pd.to_datetime(df.index, format='%m/%d/%Y %H:%M')
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
result.columns = ['Month', 'CompanyName', 'NumberCalls']
result.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
It throws KeyError: 'recvd_dttm' and never reaches the result line.
You need to aggregate the data before creating the pivot table. If a column has no name, you can either refer to it as df.iloc[:, 1] (the 2nd column) or simply rename it.
import pandas as pd
import numpy as np
# just simulate your data
np.random.seed(0)
dates = np.random.choice(pd.date_range('2015-01-01 00:00:00', '2015-06-30 00:00:00', freq='1h'), 10000)
company = np.random.choice(['company' + x for x in '1 2 3 4 5'.split()], 10000)
df = pd.DataFrame(dict(recvd_dttm=dates, CompanyName=company)).set_index('recvd_dttm').sort_index()
df['C'] = 1                       # one call per row
df.columns = ['CompanyName', '']  # blank the count column's name to mimic the question
Out[34]:
CompanyName
recvd_dttm
2015-01-01 00:00:00 company2 1
2015-01-01 00:00:00 company2 1
2015-01-01 00:00:00 company1 1
2015-01-01 00:00:00 company2 1
2015-01-01 01:00:00 company4 1
2015-01-01 01:00:00 company2 1
2015-01-01 01:00:00 company5 1
2015-01-01 03:00:00 company3 1
2015-01-01 03:00:00 company2 1
2015-01-01 03:00:00 company3 1
2015-01-01 04:00:00 company4 1
2015-01-01 04:00:00 company1 1
2015-01-01 04:00:00 company3 1
2015-01-01 05:00:00 company2 1
2015-01-01 06:00:00 company5 1
... ... ..
2015-06-29 19:00:00 company2 1
2015-06-29 19:00:00 company2 1
2015-06-29 19:00:00 company3 1
2015-06-29 19:00:00 company3 1
2015-06-29 19:00:00 company5 1
2015-06-29 19:00:00 company5 1
2015-06-29 20:00:00 company1 1
2015-06-29 20:00:00 company4 1
2015-06-29 22:00:00 company1 1
2015-06-29 22:00:00 company2 1
2015-06-29 22:00:00 company4 1
2015-06-30 00:00:00 company1 1
2015-06-30 00:00:00 company2 1
2015-06-30 00:00:00 company1 1
2015-06-30 00:00:00 company4 1
[10000 rows x 2 columns]
# first group by month and company name, compute the number of calls, and reset the index
# since that column has no name, simply tell pandas it is the 2nd column we count on
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
# rename the columns
result.columns = ['Month', 'CompanyName', 'counts']
Out[41]:
Month CompanyName counts
0 1 company1 328
1 1 company2 337
2 1 company3 342
3 1 company4 345
4 1 company5 331
5 2 company1 295
6 2 company2 300
7 2 company3 328
8 2 company4 304
9 2 company5 329
10 3 company1 366
11 3 company2 398
12 3 company3 339
13 3 company4 336
14 3 company5 345
15 4 company1 322
16 4 company2 348
17 4 company3 351
18 4 company4 340
19 4 company5 312
20 5 company1 347
21 5 company2 354
22 5 company3 347
23 5 company4 363
24 5 company5 312
25 6 company1 316
26 6 company2 311
27 6 company3 331
28 6 company4 307
29 6 company5 316
# create pivot table
result.pivot(index='CompanyName', columns='Month', values='counts')
Out[44]:
Month 1 2 3 4 5 6
CompanyName
company1 326 297 339 337 344 308
company2 310 318 342 328 355 296
company3 347 315 350 343 347 329
company4 339 314 367 353 343 311
company5 370 331 370 320 357 294
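The count-and-pivot can also be collapsed into a single step with pd.crosstab, sketched here against the simulated df above (whose index is recvd_dttm):

pd.crosstab(df['CompanyName'], df.index.month, rownames=['CompanyName'], colnames=['Month'])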
