I want to add all the data from charts.zip from https://doi.org/10.5281/zenodo.4778562 into a single DataFrame. The data consists of a folder per year that contains multiple CSVs. I made the following code:
import glob
import re
from datetime import datetime

import pandas as pd

header = 0
dfs = []
for file in glob.glob('Charts/*/201?/*.csv'):
    region = file.split('/')[1]
    dates = re.findall(r'\d{4}-\d{2}-\d{2}', file.split('/')[-1])
    weekly_chart = pd.read_csv(file, header=header, sep='\t')
    weekly_chart['week_start'] = datetime.strptime(dates[0], '%Y-%m-%d')
    weekly_chart['week_end'] = datetime.strptime(dates[1], '%Y-%m-%d')
    weekly_chart['region'] = region
    dfs.append(weekly_chart)
all_charts = pd.concat(dfs)
But when I run it, Python returns:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_12886/3473678833.py in <module>
9 weekly_chart['region'] = region
10 dfs.append(weekly_chart)
---> 11 all_charts = pd.concat(dfs)
~/Downloads/enter/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
~/Downloads/enter/lib/python3.9/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
344 ValueError: Indexes have overlapping values: ['a']
345 """
--> 346 op = _Concatenator(
347 objs,
348 axis=axis,
~/Downloads/enter/lib/python3.9/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
401
402 if len(objs) == 0:
--> 403 raise ValueError("No objects to concatenate")
404
405 if keys is None:
ValueError: No objects to concatenate
How can I fix it?
I think the glob.glob might just be overcomplicating things... This works perfectly for me.
import os
import re

import pandas as pd

# Gives you a list of EVERY file in the Charts directory
# and sub directories that is a CSV
file_list = []
for path, subdirs, files in os.walk("Charts"):
    file_list.extend([os.path.join(path, x) for x in files if x.endswith('.csv')])

dfs = []
for file in file_list:
    region = file.split('/')[1]
    dates = re.findall(r'\d{4}-\d{2}-\d{2}', file.split('/')[-1])
    df = pd.read_csv(file, sep='\t')
    df['week_start'] = dates[0]
    df['week_end'] = dates[1]
    df['region'] = region
    dfs.append(df)

all_charts = pd.concat(dfs, ignore_index=True)
print(all_charts)
Output:
position song_id song_name artist streams ... peak_position position_status week_start week_end region
0 1 7wGoVu4Dady5GV0Sv4UIsx rockstar Post Malone 17532665 ... 1 0 2017-10-20 2017-10-27 us
1 2 75ZvA4QfFiZvzhj2xkaWAh I Fall Apart Post Malone 8350785 ... 2 0 2017-10-20 2017-10-27 us
2 3 2fQrGHiQOvpL9UgPvtYy6G Bank Account 21 Savage 7589124 ... 3 1 2017-10-20 2017-10-27 us
3 4 43ZyHQITOjhciSUUNPVRHc Gucci Gang Lil Pump 7584237 ... 4 1 2017-10-20 2017-10-27 us
4 5 5tz69p7tJuGPeMGwNTxYuV 1-800-273-8255 Logic 7527770 ... 1 -2 2017-10-20 2017-10-27 us
... ... ... ... ... ... ... ... ... ... ... ...
273595 196 6kex4EBAj0WHXDKZMEJaaF Swalla (feat. Nicki Minaj & Ty Dolla $ign) Jason Derulo 3747830 ... 8 -5 2018-03-02 2018-03-09 global
273596 197 0CokSRCu5hZgPxcZBaEzVE Glorious (feat. Skylar Grey) Macklemore 3725286 ... 14 -8 2018-03-02 2018-03-09 global
273597 198 7oK9VyNzrYvRFo7nQEYkWN Mr. Brightside The Killers 3717326 ... 148 -3 2018-03-02 2018-03-09 global
273598 199 7EUfNvyCVxQV3oN5ScA2Lb Next To Me Imagine Dragons 3681739 ... 122 -77 2018-03-02 2018-03-09 global
273599 200 6u0EAxf1OJTLS7CvInuNd7 Vai malandra (feat. Tropkillaz & DJ Yuri Martins) Anitta 3676542 ... 30 -23 2018-03-02 2018-03-09 global
If you really want the dates to be dates, you can run this on those two columns at the end.
all_charts['week_start'] = pd.to_datetime(all_charts['week_start'])
Personally, I'd also do the following:
all_charts['week_start'] = pd.to_datetime(all_charts['week_start'])
all_charts['week_end'] = pd.to_datetime(all_charts['week_end'])
all_charts['region'] = all_charts['region'].astype('category')
all_charts['artist'] = all_charts['artist'].astype('category')
all_charts['song_name'] = all_charts['song_name'].astype('category')
all_charts['song_id'] = all_charts['song_id'].astype('category')
all_charts.set_index(['region', 'week_start', 'week_end', 'position'], inplace=True)
all_charts.position_status = pd.to_numeric(all_charts.position_status, errors='coerce')
print(all_charts.head(10))
Giving:
song_id song_name artist streams last_week_position weeks_on_chart peak_position position_status
region week_start week_end position
us 2017-10-20 2017-10-27 1 7wGoVu4Dady5GV0Sv4UIsx rockstar Post Malone 17532665 1.0 3 1 0.0
2 75ZvA4QfFiZvzhj2xkaWAh I Fall Apart Post Malone 8350785 2.0 6 2 0.0
3 2fQrGHiQOvpL9UgPvtYy6G Bank Account 21 Savage 7589124 4.0 5 3 1.0
4 43ZyHQITOjhciSUUNPVRHc Gucci Gang Lil Pump 7584237 5.0 3 4 1.0
5 5tz69p7tJuGPeMGwNTxYuV 1-800-273-8255 Logic 7527770 3.0 26 1 -2.0
6 5Gd19NupVe5X8bAqxf9Iaz Gorgeous Taylor Swift 6940802 NaN 1 6 NaN
7 0ofbQMrRDsUaVKq2mGLEAb Havana Camila Cabello 6623184 10.0 12 7 3.0
8 2771LMNxwf62FTAdpJMQfM Bodak Yellow Cardi B 6472727 6.0 14 3 -2.0
9 5Z3GHaZ6ec9bsiI5BenrbY Young Dumb & Broke Khalid 5982108 9.0 29 6 0.0
10 7GX5flRQZVHRAGd6B4TmDO XO Tour Llif3 Lil Uzi Vert 5822583 8.0 9 2 -2.0
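For reference, the original error simply means the glob matched no files at all (dfs stayed empty), most likely because 'Charts/*/201?/*.csv' does not match the actual folder layout of the extracted zip. A quick way to check that before looping is a recursive glob (this assumes the same Charts directory):
import glob

# Match CSVs at any depth below Charts; an empty list here explains "No objects to concatenate".
files = glob.glob('Charts/**/*.csv', recursive=True)
print(len(files))
print(files[:5])  # inspect a few paths to see the real layout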
Problem
So I am trying to aggregate my dataset using groupby, but I've run into this error which I can't really decipher.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/Users/hamza.ahmed/Coding Projects/Client_Works/EPS/EPS_SproutSocial_GoogleSheets_Snowflake_v1.ipynb Cell 25 in <cell line: 1>()
1 out = post_data_2.groupby(['index', 'content_category', 'post_category', 'post_type',
2 'customer_profile_id', 'profile_guid', 'text', 'perma_link',
3 'network', 'sent', 'created_time',
4 'metrics.lifetime.likes', 'metrics.lifetime.comments_count',
5 'metrics.lifetime.impressions',
6 'metrics.lifetime.impressions_organic',
7 'metrics.lifetime.post_content_clicks',
8 'metrics.lifetime.shares_count', 'metrics.lifetime.reactions',
9 'metrics.lifetime.video_views', 'from.guid', 'from_profile',
10 'from.profile_picture',
11 'metrics.lifetime.impressions_organic_unique',
12 'metrics.lifetime.impressions_paid_unique',
13 'metrics.lifetime.impressions_paid',
14 'metrics.lifetime.impressions_unique',
15 'metrics.lifetime.impressions_follower_unique',
16 'metrics.lifetime.impressions_nonviral_unique',
17 'metrics.lifetime.impressions_viral_unique', 'from.screen_name',
---> 18 'title', 'tags', 'Campaigns'], as_index=False)['Campaigns'].agg(list)
19 # or out.fillna('').groupby(['contents', 'posts', 'impressions', 'clicks', 'reactions'], as_index=False)['Campaigns'].agg(', '.join)
20 print(out)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/groupby/generic.py:883, in DataFrameGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
878 if result is None:
879
880 # grouper specific aggregations
881 if self.grouper.nkeys > 1:
882 # test_groupby_as_index_series_scalar gets here with 'not self.as_index'
--> 883 return self._python_agg_general(func, *args, **kwargs)
884 elif args or kwargs:
885 # test_pass_args_kwargs gets here (with and without as_index)
886 # can't return early
887 result = self._aggregate_frame(func, *args, **kwargs)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1477, in GroupBy._python_agg_general(self, func, *args, **kwargs)
1474 return self._python_apply_general(f, self._selected_obj)
1476 for idx, obj in enumerate(self._iterate_slices()):
-> 1477 name = obj.name
1479 try:
1480 # if this function is invalid for this dtype, we will ignore it.
1481 result = self.grouper.agg_series(obj, f)
AttributeError: 'str' object has no attribute 'name'
My Dataset
(shown transposed: one row per column, one column per record)
column | row 0 | row 1 | row 2 | row 3 | row 4
index_1 | 0 | 1 | 2 | 3 | 4
index | 0 | 0 | 1 | 1 | 2
content_category | VIDEO | VIDEO | VIDEO | VIDEO | VIDEO
post_category | POST | POST | POST | POST | POST
post_type | LINKEDIN_COMPANY_UPDATE | LINKEDIN_COMPANY_UPDATE | FACEBOOK_POST | FACEBOOK_POST | INSTAGRAM_MEDIA
customer_profile_id | 4526462 | 4526462 | 4530996 | 4530996 | 4530999
profile_guid | licp:596893 | licp:596893 | fbpr:259614427411446 | fbpr:259614427411446 | ibpr:17841401555616258
text | Let's celebrate International Beer Day by givi... | Let's celebrate International Beer Day by givi... | The best way to celebrate International Beer D... | The best way to celebrate International Beer D... | The best way to celebrate #internationalbeerda...
perma_link | https://linkedin.com/feed/update/urn:li:ugcPos... | https://linkedin.com/feed/update/urn:li:ugcPos... | https://www.facebook.com/259614427411446/posts... | https://www.facebook.com/259614427411446/posts... | https://www.instagram.com/p/Cg5YLn9sOhA/
network | LINKEDIN | LINKEDIN | FACEBOOK | FACEBOOK | INSTAGRAM
sent | TRUE | TRUE | TRUE | TRUE | TRUE
created_time | 2022-08-06T02:10:17Z | 2022-08-06T02:10:17Z | 2022-08-06T00:10:19Z | 2022-08-06T00:10:19Z | 2022-08-05T23:38:16Z
metrics.lifetime.likes | 27 | 27 | 4 | 4 | 10
metrics.lifetime.comments_count | 0 | 0 | 0 | 0 | 2
metrics.lifetime.impressions | 1118 | 1118 | 67 | 67 | 152
metrics.lifetime.impressions_organic | 1118 | 1118 | 67 | 67 | 152
metrics.lifetime.post_content_clicks | 5 | 5 | 0 | 0 | NaN
metrics.lifetime.shares_count | 2 | 2 | 0 | 0 | NaN
metrics.lifetime.reactions | 27 | 27 | 4 | 4 | 10
metrics.lifetime.video_views | 221 | 221 | 18 | 18 | 47
from.guid | licp:596893 | licp:596893 | fbpr:259614427411446 | fbpr:259614427411446 | ibpr:17841401555616258
from_profile | Ernest Packaging Solutions | Ernest Packaging Solutions | Ernest Packaging Solutions | Ernest Packaging Solutions | Ernest Packaging Solutions
from.profile_picture | https://media-exp2.licdn.com/dms/image/C560BAQ... | https://media-exp2.licdn.com/dms/image/C560BAQ... | https://scontent-iad3-1.xx.fbcdn.net/v/t1.6435... | https://scontent-iad3-1.xx.fbcdn.net/v/t1.6435... | https://scontent-iad3-1.xx.fbcdn.net/v/t51.288...
metrics.lifetime.impressions_organic_unique | NaN | NaN | 67 | 67 | 138
metrics.lifetime.impressions_paid_unique | NaN | NaN | 0 | 0 | NaN
metrics.lifetime.impressions_paid | NaN | NaN | 0 | 0 | NaN
metrics.lifetime.impressions_unique | NaN | NaN | 67 | 67 | 138
metrics.lifetime.impressions_follower_unique | NaN | NaN | 50 | 50 | NaN
metrics.lifetime.impressions_nonviral_unique | NaN | NaN | 67 | 67 | NaN
metrics.lifetime.impressions_viral_unique | NaN | NaN | 0 | 0 | NaN
from.screen_name | NaN | NaN | ErnestPackaging | ErnestPackaging | ernest_packaging
title | NaN | NaN | NaN | NaN | NaN
tags | 1141696 | 1141676 | 1141696 | 1141676 | 1141696
Campaigns | Fun Extra | Video | Fun Extra | Video | Fun Extra
My Code
out = post_data_2.groupby(['index', 'content_category', 'post_category', 'post_type',
'customer_profile_id', 'profile_guid', 'text', 'perma_link',
'network', 'sent', 'created_time',
'metrics.lifetime.likes', 'metrics.lifetime.comments_count',
'metrics.lifetime.impressions',
'metrics.lifetime.impressions_organic',
'metrics.lifetime.post_content_clicks',
'metrics.lifetime.shares_count', 'metrics.lifetime.reactions',
'metrics.lifetime.video_views', 'from.guid', 'from_profile',
'from.profile_picture',
'metrics.lifetime.impressions_organic_unique',
'metrics.lifetime.impressions_paid_unique',
'metrics.lifetime.impressions_paid',
'metrics.lifetime.impressions_unique',
'metrics.lifetime.impressions_follower_unique',
'metrics.lifetime.impressions_nonviral_unique',
'metrics.lifetime.impressions_viral_unique', 'from.screen_name',
'title', 'tags', 'Campaigns'], as_index=False)['Campaigns'].agg(list)
My Target Result
(shown transposed: one row per column, one column per record)
column | row 0 | row 1 | row 2
index_1 | 0 | 1 | 2
index | 0 | 1 | 2
content_category | VIDEO | VIDEO | VIDEO
post_category | POST | POST | POST
post_type | LINKEDIN_COMPANY_UPDATE | FACEBOOK_POST | INSTAGRAM_MEDIA
customer_profile_id | 4526462 | 4530996 | 4530999
profile_guid | licp:596893 | fbpr:259614427411446 | ibpr:17841401555616258
text | Let's celebrate International Beer Day by givi... | The best way to celebrate International Beer D... | The best way to celebrate #internationalbeerda...
perma_link | https://linkedin.com/feed/update/urn:li:ugcPos... | https://www.facebook.com/259614427411446/posts... | https://www.instagram.com/p/Cg5YLn9sOhA/
network | LINKEDIN | FACEBOOK | INSTAGRAM
sent | TRUE | TRUE | TRUE
created_time | 2022-08-06T02:10:17Z | 2022-08-06T00:10:19Z | 2022-08-05T23:38:16Z
metrics.lifetime.likes | 27 | 4 | 10
metrics.lifetime.comments_count | 0 | 0 | 2
metrics.lifetime.impressions | 1118 | 67 | 152
metrics.lifetime.impressions_organic | 1118 | 67 | 152
metrics.lifetime.post_content_clicks | 5 | 0 | NaN
metrics.lifetime.shares_count | 2 | 0 | NaN
metrics.lifetime.reactions | 27 | 4 | 10
metrics.lifetime.video_views | 221 | 18 | 47
from.guid | licp:596893 | fbpr:259614427411446 | ibpr:17841401555616258
from_profile | Ernest Packaging Solutions | Ernest Packaging Solutions | Ernest Packaging Solutions
from.profile_picture | https://media-exp2.licdn.com/dms/image/C560BAQ... | https://scontent-iad3-1.xx.fbcdn.net/v/t1.6435... | https://scontent-iad3-1.xx.fbcdn.net/v/t51.288...
metrics.lifetime.impressions_organic_unique | NaN | 67 | 138
metrics.lifetime.impressions_paid_unique | NaN | 0 | NaN
metrics.lifetime.impressions_paid | NaN | 0 | NaN
metrics.lifetime.impressions_unique | NaN | 67 | 138
metrics.lifetime.impressions_follower_unique | NaN | 50 | NaN
metrics.lifetime.impressions_nonviral_unique | NaN | 67 | NaN
metrics.lifetime.impressions_viral_unique | NaN | 0 | NaN
from.screen_name | NaN | ErnestPackaging | ernest_packaging
title | NaN | NaN | NaN
tags | 1141696 | 1141696 | 1141696
Campaigns | [Fun Extra, video] | [Fun Extra, video] | Fun Extra
Any help would be greatly appreciated!!
Thank you so much!!
Try removing 'Campaigns' from the list of grouping columns inside the groupby method: since you want to collect 'Campaigns' into a list, it should not also be a grouping key (otherwise each group holds only a single 'Campaigns' value, so there is nothing to combine).
Your code:
out = post_data_2.groupby(['index', 'content_category', 'post_category', 'post_type',
'customer_profile_id', 'profile_guid', 'text', 'perma_link',
'network', 'sent', 'created_time',
'metrics.lifetime.likes', 'metrics.lifetime.comments_count',
'metrics.lifetime.impressions',
'metrics.lifetime.impressions_organic',
'metrics.lifetime.post_content_clicks',
'metrics.lifetime.shares_count', 'metrics.lifetime.reactions',
'metrics.lifetime.video_views', 'from.guid', 'from_profile',
'from.profile_picture',
'metrics.lifetime.impressions_organic_unique',
'metrics.lifetime.impressions_paid_unique',
'metrics.lifetime.impressions_paid',
'metrics.lifetime.impressions_unique',
'metrics.lifetime.impressions_follower_unique',
'metrics.lifetime.impressions_nonviral_unique',
'metrics.lifetime.impressions_viral_unique', 'from.screen_name',
'title', 'tags', 'Campaigns'], as_index=False)['Campaigns'].agg(list)
My suggestion:
out = post_data_2.groupby(['index', 'content_category', 'post_category', 'post_type',
'customer_profile_id', 'profile_guid', 'text', 'perma_link',
'network', 'sent', 'created_time',
'metrics.lifetime.likes', 'metrics.lifetime.comments_count',
'metrics.lifetime.impressions',
'metrics.lifetime.impressions_organic',
'metrics.lifetime.post_content_clicks',
'metrics.lifetime.shares_count', 'metrics.lifetime.reactions',
'metrics.lifetime.video_views', 'from.guid', 'from_profile',
'from.profile_picture',
'metrics.lifetime.impressions_organic_unique',
'metrics.lifetime.impressions_paid_unique',
'metrics.lifetime.impressions_paid',
'metrics.lifetime.impressions_unique',
'metrics.lifetime.impressions_follower_unique',
'metrics.lifetime.impressions_nonviral_unique',
'metrics.lifetime.impressions_viral_unique', 'from.screen_name',
'title', 'tags'], as_index=False)['Campaigns'].agg(list)
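If the goal is just to gather every Campaigns value per post, a shorter variant is to group by only the column(s) that identify a post and merge the rest back afterwards. A minimal sketch, assuming 'index' uniquely identifies a post in post_data_2 (adjust the key to your data); note also that pandas drops groups whose keys contain NaN unless you pass dropna=False (pandas 1.1+), which matters when NaN-heavy columns such as 'title' are used as keys:
out = post_data_2.groupby('index', as_index=False)['Campaigns'].agg(list)

# Optionally bring the other columns back by merging onto deduplicated rows:
# out = post_data_2.drop(columns='Campaigns').drop_duplicates('index').merge(out, on='index')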
I'm very new to Python and was hoping to get some help. I am following an online example where the author creates a dictionary, adds some data to it and then appends this to his original dataframe.
When I follow the code the data in the dictionary doesn't get appended to the dataframe and as such I can't continue with the example.
The author's code is as follows:
from collections import defaultdict
won_last = defaultdict(int)

for index, row in data.iterrows():
    home_team = row['HomeTeam']
    visitor_team = row['AwayTeam']
    row['HomeLastWin'] = won_last[home_team]
    row['VisitorLastWin'] = won_last[visitor_team]
    results.ix[index] = row
    won_last[home_team] = row['HomeWin']
    won_last[visitor_team] = not row['HomeWin']
When I run this code I get the error message (note that the name of the dataframe is different but apart from that nothing has changed)
AttributeError Traceback (most recent call last)
<ipython-input-46-d31706a5f745> in <module>
4 row['HomeLastWin'] = won_last[home_team]
5 row['VisitorLastWin'] = won_last[visitor_team]
----> 6 data.ix[index]=row
7 won_last[home_team] = row['HomeWin']
8 won_last[visitor_team] = not row['HomeWin']
~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5137 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5138 return self[name]
-> 5139 return object.__getattribute__(self, name)
5140
5141 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'ix'
If I change the line data.ix[index]=row to data.loc[index]=row, the code runs OK but nothing happens to my dataframe.
Below is an example of the dataset I am working with
Div Date Time HomeTeam AwayTeam FTHG FTAG FTR HomeWin
E0 12/09/2020 12:30 Fulham Arsenal 0 3 A FALSE
E0 12/09/2020 15:00 Crystal Palace Southampton 1 0 H FALSE
E0 12/09/2020 17:30 Liverpool Leeds 4 3 H TRUE
E0 12/09/2020 20:00 West Ham Newcastle 0 2 A TRUE
E0 13/09/2020 14:00 West Brom Leicester 0 3 A FALSE
and below is the dataset of the example I am working through with the columns added
    Date        Visitor Team  VisitorPts  Home Team  HomePts  HomeWin  HomeLastWin  VisitorLastWin
20  01/11/2013  Milwaukee     105         Boston     98       FALSE    FALSE        FALSE
21  01/11/2013  Miami Heat    100         Brooklyn   101      TRUE     FALSE        FALSE
22  01/11/2013  Clevland      84          Charlotte  90       TRUE     FALSE        TRUE
23  01/11/2013  Portland      113         Denver     98       FALSE    FALSE        FALSE
24  01/11/2013  Dallas        91          Houston    113      TRUE     TRUE         TRUE
Thanks
Jon
Could you please try this? The example data above, saved as dataset_stack.csv, is used as the input.
from collections import defaultdict
won_last = defaultdict(int)

# Load the Pandas libraries with alias 'pd'
import pandas as pd

# Read data from file 'dataset_stack.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
data = pd.read_csv("dataset_stack.csv")
results = pd.DataFrame(data=data)
#print(results)

# Preview the first 5 lines of the loaded data
#data.head()

for index, row in data.iterrows():
    home_team = row['HomeTeam']
    visitor_team = row['VisitorTeam']
    row['HomeLastWin'] = won_last[home_team]
    row['VisitorLastWin'] = won_last[visitor_team]
    #results.ix[index]=row
    #results.loc[index]=row
    # add new column directly to dataframe instead of adding it to row & appending to dataframe
    results['HomeLastWin'] = won_last[home_team]
    results['VisitorLastWin'] = won_last[visitor_team]
    results.append(row, ignore_index=True)
    won_last[home_team] = row['HomeWin']
    won_last[visitor_team] = not row['HomeWin']

print(results)
Output:
Date VisitorTeam VisitorPts HomeTeam HomePts HomeWin \
0 1/11/2013 Milwaukee 105 Boston 98 False
1 1/11/2013 Miami Heat 100 Brooklyn 101 True
2 1/11/2013 Clevland 84 Charlotte 90 True
3 1/11/2013 Portland 113 Denver 98 False
4 1/11/2013 Dallas 91 Houston 113 True
HomeLastWin VisitorLastWin
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
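If the aim is the per-row HomeLastWin / VisitorLastWin values from the tutorial (rather than one value for the whole column), a minimal sketch using the same won_last logic is to collect the values in plain lists and assign them once at the end; this assumes the same dataset_stack.csv columns as above:
from collections import defaultdict
import pandas as pd

data = pd.read_csv("dataset_stack.csv")
won_last = defaultdict(int)

home_last, visitor_last = [], []
for _, row in data.iterrows():
    home_team = row['HomeTeam']
    visitor_team = row['VisitorTeam']
    # Record what was known *before* this game, then update with its result.
    home_last.append(won_last[home_team])
    visitor_last.append(won_last[visitor_team])
    won_last[home_team] = row['HomeWin']
    won_last[visitor_team] = not row['HomeWin']

data['HomeLastWin'] = home_last
data['VisitorLastWin'] = visitor_last
print(data)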
I am trying to assign each game in the NFL a value for the week in which it occurs.
For example, for the 2008 season, all the games that occur in the range between the 4th and 10th of September occur in week 1.
i = 0
week = 1
start_date = df2008['date'].iloc[0]
end_date = df2008['date'].iloc[-1]
week_range = pd.interval_range(start=start_date, end=end_date, freq='7D', closed='left')

for row in df2008['date']:
    row = row.date()
    if row in week_range[i]:
        df2008['week'] = week
    else:
        week += 1
However, this is updating all of the games to week 1
date week
1601 2008-09-04 1
1602 2008-09-07 1
1603 2008-09-07 1
1604 2008-09-07 1
1605 2008-09-07 1
... ... ...
1863 2009-01-11 1
1864 2009-01-11 1
1865 2009-01-18 1
1866 2009-01-18 1
1867 2009-02-01 1
I have tried using print statements to debug and these are my results. "In Range" are games that occur in week 1 and are returning as expected.
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
Not In Range
Not In Range
Not In Range
Not In Range
Not In Range
Not In Range
df_sample:
display(df2008[['date', 'home', 'away', 'week']])
date home away week
1601 2008-09-04 Giants Redskins 1
1602 2008-09-07 Falcons Lions 1
1603 2008-09-07 Bills Seahawks 1
1604 2008-09-07 Titans Jaguars 1
1605 2008-09-07 Dolphins Jets 1
... ... ... ... ...
1863 2009-01-11 Giants Eagles 1
1864 2009-01-11 Steelers Chargers 1
1865 2009-01-18 Cardinals Eagles 1
1866 2009-01-18 Steelers Ravens 1
1867 2009-02-01 Cardinals Steelers 1
Can anyone point out where I am going wrong?
OP's original question was: "Can anyone point out where I am going wrong?",
so - though, as Parfait pointed out, using pandas.Series.dt.week is a fine pandas solution - to help him find the answer to it, I followed OP's original code logic, with some fixes:
import pandas as pd

i = 0
week = 1
df2008 = pd.DataFrame({"date": [pd.Timestamp("2008-09-04"), pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"), pd.Timestamp("2009-01-11"), pd.Timestamp("2009-01-11"), pd.Timestamp("2009-01-18"), pd.Timestamp("2009-01-18"), pd.Timestamp("2009-02-01")],
                       "home": ["Giants", "Falcon", "Bills", "Titans", "Dolphins", "Giants", "Steelers", "Cardinals", "Steelers", "Cardinals"],
                       "away": ["Falcon", "Bills", "Titans", "Dolphins", "Giants", "Steelers", "Cardinals", "Steelers", "Cardinals", "Ravens"]
                       })

i = 0
week = 1
start_date = df2008['date'].iloc[0]
#end_date = df2008['date'].iloc[-1]
end_date = pd.Timestamp("2009-03-01")
week_range = pd.interval_range(start=start_date, end=end_date, freq='7D', closed='left')
df2008['week'] = None

for i in range(len(df2008['date'])):
    rd = df2008.loc[i, 'date'].date()
    while True:
        if week == len(week_range):
            break
        if rd in week_range[week - 1]:
            df2008.loc[i, 'week'] = week
            break
        else:
            week += 1

print(df2008)
Out:
date home away week
0 2008-09-04 Giants Falcon 1
1 2008-09-07 Falcon Bills 1
2 2008-09-07 Bills Titans 1
3 2008-09-07 Titans Dolphins 1
4 2008-09-07 Dolphins Giants 1
5 2009-01-11 Giants Steelers 19
6 2009-01-11 Steelers Cardinals 19
7 2009-01-18 Cardinals Steelers 20
8 2009-01-18 Steelers Cardinals 20
9 2009-02-01 Cardinals Ravens 22
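Since week_range is nothing more than consecutive 7-day intervals starting at the first game, the same week numbers can also be computed without any loop. A minimal sketch, assuming df2008['date'] is already a datetime column:
# Week = number of whole 7-day periods elapsed since the earliest date, plus 1.
df2008['week'] = (df2008['date'] - df2008['date'].min()).dt.days // 7 + 1
For the sample frame above this gives the same 1, 1, 1, 1, 1, 19, 19, 20, 20, 22.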
Consider avoiding any looping and use pandas.Series.dt.week on datetime fields, which returns the week of the year. Then subtract the first week. However, a wrinkle occurs when crossing into the new year, so it must be handled conditionally by adding the difference to the end of the year and then the weeks of the new year. Fortunately, weeks start on Monday (so Thursday - Sunday maintain the same week number).
first_week = pd.Series(pd.to_datetime(['2008-09-04'])).dt.week.values
# FIND LAST SUNDAY OF YEAR (NOT NECESSARILY DEC 31)
end_year_week = pd.Series(pd.to_datetime(['2008-12-28'])).dt.week.values
new_year_week = pd.Series(pd.to_datetime(['2009-01-01'])).dt.week.values
# CONDITIONALLY ASSIGN
df2008['week'] = np.where(df2008['date'] < '2009-01-01',
                          (df2008['date'].dt.week - first_week) + 1,
                          ((end_year_week - first_week) + ((df2008['date'].dt.week - new_year_week) + 1))
                          )
To demonstrate, random seeded data is used below (including new year dates); replace it with OP's reproducible sample.
Data
import numpy as np
import pandas as pd
### DATA BUILD
np.random.seed(120619)
df2008 = pd.DataFrame({'group': np.random.choice(['sas', 'stata', 'spss', 'python', 'r', 'julia'], 500),
                       'int': np.random.randint(1, 10, 500),
                       'num': np.random.randn(500),
                       'char': [''.join(np.random.choice(list('ABC123'), 3)) for _ in range(500)],
                       'bool': np.random.choice([True, False], 500),
                       'date': np.random.choice(pd.date_range('2008-09-04', '2009-01-06'), 500)
                       })
Calculation
first_week = pd.Series(pd.to_datetime(['2008-09-04'])).dt.week.values
end_year_week = pd.Series(pd.to_datetime(['2008-12-28'])).dt.week.values
new_year_week = pd.Series(pd.to_datetime(['2009-01-01'])).dt.week.values
df2008['week'] = np.where(df2008['date'] < '2008-12-28',
                          (df2008['date'].dt.week - first_week) + 1,
                          ((end_year_week - first_week) + ((df2008['date'].dt.week - new_year_week) + 1))
                          )
df2008 = df2008.sort_values('date').reset_index(drop=True)
print(df2008.head(10))
# group int num char bool date week
# 0 sas 2 0.099927 A2C False 2008-09-04 1
# 1 python 3 0.241393 2CB False 2008-09-04 1
# 2 python 8 0.516716 ABC False 2008-09-04 1
# 3 spss 2 0.974715 3CB False 2008-09-04 1
# 4 stata 9 -1.582096 CAA True 2008-09-04 1
# 5 sas 3 0.070347 1BB False 2008-09-04 1
# 6 r 5 -0.419936 1CA True 2008-09-05 1
# 7 python 6 0.628749 1AB True 2008-09-05 1
# 8 python 3 0.713695 CA1 False 2008-09-05 1
# 9 python 1 -0.686137 3AA False 2008-09-05 1
print(df2008.tail(10))
# group int num char bool date week
# 490 spss 5 -0.548257 3CC True 2009-01-04 17
# 491 julia 8 -0.176858 AA2 False 2009-01-05 18
# 492 julia 5 -1.422237 A1B True 2009-01-05 18
# 493 stata 2 -1.710138 BB2 True 2009-01-05 18
# 494 python 4 -0.285249 1B1 True 2009-01-05 18
# 495 spss 3 0.918428 C23 True 2009-01-06 18
# 496 r 5 -1.347936 1AC False 2009-01-06 18
# 497 stata 3 0.883093 1C3 False 2009-01-06 18
# 498 python 9 0.448237 12A True 2009-01-06 18
# 499 spss 3 1.459097 2A1 False 2009-01-06 18
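One version note: in pandas 1.1 and later, Series.dt.week / dt.weekofyear are deprecated in favour of dt.isocalendar().week, so on newer pandas the week-of-year lookups above would be written roughly as:
# dt.isocalendar() returns a DataFrame with ISO 'year', 'week' and 'day' columns.
df2008['week_of_year'] = df2008['date'].dt.isocalendar().week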
Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column name are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code my output differs from the authors:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is, why are the column names gone? All other output matches the authors output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No column names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the column names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
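If mutating the intermediate array in place feels awkward, np.sort (which returns a sorted copy) does the same thing in one step; a small sketch of that variant:
import numpy as np

# np.sort returns a sorted copy of each row instead of sorting the array in place.
df[["A", "B"]] = np.sort(df[["A", "B"]].values, axis=1)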
A final note. I just applied @AndyHayden's numpy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is … Wow. What an enormous performance difference. I get the exact same correct answer, and I get it as soon as I click the mouse, compared to the pandas lambda solution also provided by @AndyHayden, which takes about 20 seconds to perform the sort. That dataset is 58,000+ rows. The numpy solution returns the sort instantly.
Here is my dataframe:
Date cell tumor_size(mm)
25/10/2015 113 51
22/10/2015 222 50
22/10/2015 883 45
20/10/2015 334 35
19/10/2015 564 47
19/10/2015 123 56
22/10/2014 345 36
13/12/2013 456 44
What I want to do is compare the size of the tumors detected on the different days. Let's consider the cell 222 as an example; I want to compare its size to different cells but detected on earlier days e.g. I will not compare its size with cell 883, because they were detected on the same day. Or I will not compare it with cell 113, because it was detected later on.
As my dataset is too large, I have to iterate over the rows. If I explain it in a non-pythonic way:
for the cell 222:
    get_size_distance (absolute value):
        (|50 - 35| = 15), (|50 - 47| = 3), (|50 - 56| = 6), (|50 - 36| = 14), (|50 - 44| = 6)
    get_minimum = 3, I got this value when I compared it with 564, so I will name it as a pair for the cell 222
Then do it for the cell 883
The resulting output should look like this:
Date cell tumor_size(mm) pair size_difference
25/10/2015 113 51 222 1
22/10/2015 222 50 123 6
22/10/2015 883 45 456 1
20/10/2015 334 35 345 1
19/10/2015 564 47 456 3
19/10/2015 123 56 456 12
22/10/2014 345 36 456 8
13/12/2013 456 44 NaN NaN
I will really appreciate your help
It's not pretty, but I believe it does the trick
import pandas as pd
from datetime import datetime

a = pd.read_clipboard()

# Cut off last row since it was a faulty date. You can skip this.
df = a.copy().iloc[:-1]

# Convert to dates and order just in case (not really needed I guess).
df['Date'] = df.Date.apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))
df.sort_values('Date', ascending=False)

# Rename column
df = df.rename(columns={"tumor_size(mm)": 'tumor_size'})

# These will be our lists of pairs and size differences.
pairs = []
diffs = []

# Loop over all unique dates
for date in df.Date.unique():
    # Only take dates earlier than the current date.
    compare_df = df.loc[df.Date < date].copy()
    # Loop over each cell for this date and find the minimum
    for row in df.loc[df.Date == date].itertuples():
        # If no cells earlier are available use NaNs.
        if compare_df.empty:
            pairs.append(float('nan'))
            diffs.append(float('nan'))
        # Take lowest absolute value and fill in otherwise
        else:
            compare_df['size_diff'] = abs(compare_df.tumor_size - row.tumor_size)
            row_of_interest = compare_df.loc[compare_df.size_diff == compare_df.size_diff.min()]
            pairs.append(row_of_interest.cell.values[0])
            diffs.append(row_of_interest.size_diff.values[0])

df['pair'] = pairs
df['size_difference'] = diffs
returns:
Date cell tumor_size pair size_difference
0 2015-10-25 113 51 222.0 1.0
1 2015-10-22 222 50 564.0 3.0
2 2015-10-22 883 45 564.0 2.0
3 2015-10-20 334 35 345.0 1.0
4 2015-10-19 564 47 345.0 11.0
5 2015-10-19 123 56 345.0 20.0
6 2014-10-22 345 36 NaN NaN
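The same pairing logic can also be packaged as a row-wise helper, which some may find easier to read even though it still loops under the hood. A rough sketch, assuming the renamed df from above (columns Date, cell, tumor_size):
def closest_earlier(row, frame):
    # Only cells detected strictly earlier than this row's date qualify as pairs.
    earlier = frame[frame['Date'] < row['Date']]
    if earlier.empty:
        return pd.Series({'pair': float('nan'), 'size_difference': float('nan')})
    diffs = (earlier['tumor_size'] - row['tumor_size']).abs()
    best = diffs.idxmin()
    return pd.Series({'pair': earlier.loc[best, 'cell'], 'size_difference': diffs.loc[best]})

df[['pair', 'size_difference']] = df.apply(closest_earlier, axis=1, args=(df,))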