I have the code below, on toy data, which works the way I want. The last two columns report how many times the value in column Jan was found in column URL, and in how many distinct rows the value in column Jan was found in column URL.
import pandas as pd

sales = [{'account': '3', 'Jan': 'xxx', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf try bbbbb why try'},
{'account': '1', 'Jan': 'try', 'Feb': '210', 'URL': ''},
{'account': '2', 'Jan': 'bbbbb', 'Feb': '90', 'URL': 'ea2017-104.pdf bbbbb cc for why try' }]
df = pd.DataFrame(sales)
df
df['found_in_column'] = df['Jan'].apply(lambda x: ''.join(df['URL'].tolist()).count(x))
df['distinct_finds'] = df['Jan'].apply(lambda x: sum(df['URL'].str.contains(x)))
Why does the same code fail in the last case below, and how could I change my code to avoid the error? In my last example there are special characters in the first column, and I suspected they were causing the problem. But the rows at index 3 and 4 have special characters too, and the code runs fine for them.
answer2=answer[['Value','non_repeat_pdf']].iloc[0:11]
print(answer2)
Value non_repeat_pdf
0 effect\nive Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 closing ####
2 executing ####
3 order, ####
4 waives: ####
5 right ####
6 notice ####
7 intention ####
8 prohibit ####
9 further ####
10 participation ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Out[220]:
0 1
1 0
2 1
3 0
4 1
5 1
6 0
7 0
8 1
9 0
10 0
Name: Value, dtype: int64
answer2=answer[['Value','non_repeat_pdf']].iloc[10:11]
print(answer2)
Value non_repeat_pdf
10 participation ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Out[212]:
10 0
Name: Value, dtype: int64
answer2=answer[['Value','non_repeat_pdf']].iloc[11:12]
print(answer2)
Value non_repeat_pdf
11 1818(e); ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Traceback (most recent call last):
File "<ipython-input-215-2df7f4b2de41>", line 1, in <module>
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 2355, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src\inference.pyx", line 1574, in pandas._libs.lib.map_infer
File "<ipython-input-215-2df7f4b2de41>", line 1, in <lambda>
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 1562, in contains
regex=regex)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 254, in str_contains
stacklevel=3)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range
Update
I modified my code and removed all special characters from the Value column, but I am still getting the error. What could be wrong?
Even with the error, the new column does get added to my answer2 dataframe.
answer2=answer[['Value','non_repeat_pdf']]
print(answer2)
Value non_repeat_pdf
0 law Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 concerned
2 rights
3 c
4 violate
5 8
6 agreement
7 voting
8 previously
9 supervisory
10 its
11 exercise
12 occs
13 entities
14 those
15 approved
16 1818h2
17 9
18 are
19 manner
20 their
21 affairs
22 b
23 solicit
24 procure
25 transfer
26 attempt
27 extraneous
28 modification
29 vote
... ...
1552 closing
1553 heavily
1554 pm
1555 throughout
1556 half
1557 window
1558 sixtysecond
1559 activity
1560 sampling
1561 using
1562 hour
1563 violated
1564 euro
1565 rates
1566 derivatives
1567 portfolios
1568 valuation
1569 parties
1570 numerous
1571 they
1572 reference
1573 because
1574 us
1575 important
1576 moment
1577 snapshot
1578 cet
1579 215
1580 finance
1581 supervision
[1582 rows x 2 columns]
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
Traceback (most recent call last):
File "<ipython-input-298-4dc80361895c>", line 1, in <module>
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2331, in __setitem__
self._set_item(key, value)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2404, in _set_item
self._check_setitem_copy()
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1873, in _check_setitem_copy
warnings.warn(t, SettingWithCopyWarning, stacklevel=stacklevel)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range
Update 2
The following works:
answer2=answer[['Value','non_repeat_pdf']]
xyz= answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
xyz=xyz.to_frame()
xyz.columns=['found_in_all_PDF']
pd.concat([answer2, xyz], axis=1)
Out[305]:
Value non_repeat_pdf \
0 law Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 concerned
2 rights
3 c
4 violate
5 8
6 agreement
7 voting
8 previously
9 supervisory
10 its
11 exercise
12 occs
13 entities
14 those
15 approved
16 1818h2
17 9
18 are
19 manner
20 their
21 affairs
22 b
23 solicit
24 procure
25 transfer
26 attempt
27 extraneous
28 modification
29 vote
... ...
1552 closing
1553 heavily
1554 pm
1555 throughout
1556 half
1557 window
1558 sixtysecond
1559 activity
1560 sampling
1561 using
1562 hour
1563 violated
1564 euro
1565 rates
1566 derivatives
1567 portfolios
1568 valuation
1569 parties
1570 numerous
1571 they
1572 reference
1573 because
1574 us
1575 important
1576 moment
1577 snapshot
1578 cet
1579 215
1580 finance
1581 supervision
found_in_all_PDF
0 6
1 1
2 4
3 1036
4 9
5 93
6 4
7 2
8 1
9 2
10 6
11 1
12 0
13 1
14 3
15 1
16 0
17 25
18 20
19 3
20 14
21 4
22 358
23 2
24 1
25 2
26 6
27 1
28 1
29 3
...
1552 3
1553 2
1554 0
1555 5
1556 2
1557 3
1558 0
1559 2
1560 1
1561 5
1562 2
1563 7
1564 8
1565 3
1566 0
1567 1
1568 1
1569 4
1570 1
1571 9
1572 2
1573 2
1574 96
1575 1
1576 1
1577 1
1578 0
1579 0
1580 1
1581 0
[1582 rows x 3 columns]
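For what it's worth, the second traceback is raised while pandas is only trying to emit a SettingWithCopyWarning; PyPDF2 has installed a warning formatter that then crashes with the IndexError shown. The concat workaround above avoids it because it never assigns into the sliced frame. Another option along the same lines, assuming answer holds the two columns shown, is to take an explicit copy before adding the new column:
answer2 = answer[['Value', 'non_repeat_pdf']].copy()  # independent copy, so no SettingWithCopyWarning is emitted
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))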
Unfortunately I can't reproduce exactly the same error in my environment, but what I do see is a warning about incorrect regex usage. Your string was interpreted as a capturing regular expression because of the brackets in "1818(e);". Try using str.contains with regex=False.
answer2 =pd.DataFrame({'Value': {11: '1818(e);'}, 'non_repeat_pdf': {11: '####'}})
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x,regex=False)))
Output:
11 0
Name: Value, dtype: int64
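Alternatively, if regex matching is still wanted for the other rows, the value can be escaped so that characters like ( and ; are treated literally. A minimal sketch of that variant:
import re
answer2['Value'].apply(lambda x: answer2['non_repeat_pdf'].str.contains(re.escape(x)).sum())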
Problem
So I am trying to aggregate my dataset using groupby, but I've run into this error which I can't really decipher.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/Users/hamza.ahmed/Coding Projects/Client_Works/EPS/EPS_SproutSocial_GoogleSheets_Snowflake_v1.ipynb Cell 25 in <cell line: 1>()
1 out = post_data_2.groupby(['index', 'content_category', 'post_category', 'post_type',
2 'customer_profile_id', 'profile_guid', 'text', 'perma_link',
3 'network', 'sent', 'created_time',
4 'metrics.lifetime.likes', 'metrics.lifetime.comments_count',
5 'metrics.lifetime.impressions',
6 'metrics.lifetime.impressions_organic',
7 'metrics.lifetime.post_content_clicks',
8 'metrics.lifetime.shares_count', 'metrics.lifetime.reactions',
9 'metrics.lifetime.video_views', 'from.guid', 'from_profile',
10 'from.profile_picture',
11 'metrics.lifetime.impressions_organic_unique',
12 'metrics.lifetime.impressions_paid_unique',
13 'metrics.lifetime.impressions_paid',
14 'metrics.lifetime.impressions_unique',
15 'metrics.lifetime.impressions_follower_unique',
16 'metrics.lifetime.impressions_nonviral_unique',
17 'metrics.lifetime.impressions_viral_unique', 'from.screen_name',
---> 18 'title', 'tags', 'Campaigns'], as_index=False)['Campaigns'].agg(list)
19 # or out.fillna('').groupby(['contents', 'posts', 'impressions', 'clicks', 'reactions'], as_index=False)['Campaigns'].agg(', '.join)
20 print(out)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/groupby/generic.py:883, in DataFrameGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
878 if result is None:
879
880 # grouper specific aggregations
881 if self.grouper.nkeys > 1:
882 # test_groupby_as_index_series_scalar gets here with 'not self.as_index'
--> 883 return self._python_agg_general(func, *args, **kwargs)
884 elif args or kwargs:
885 # test_pass_args_kwargs gets here (with and without as_index)
886 # can't return early
887 result = self._aggregate_frame(func, *args, **kwargs)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1477, in GroupBy._python_agg_general(self, func, *args, **kwargs)
1474 return self._python_apply_general(f, self._selected_obj)
1476 for idx, obj in enumerate(self._iterate_slices()):
-> 1477 name = obj.name
1479 try:
1480 # if this function is invalid for this dtype, we will ignore it.
1481 result = self.grouper.agg_series(obj, f)
AttributeError: 'str' object has no attribute 'name'
My Dataset
index_1
index
content_category
post_category
post_type
customer_profile_id
profile_guid
text
perma_link
network
sent
created_time
metrics.lifetime.likes
metrics.lifetime.comments_count
metrics.lifetime.impressions
metrics.lifetime.impressions_organic
metrics.lifetime.post_content_clicks
metrics.lifetime.shares_count
metrics.lifetime.reactions
metrics.lifetime.video_views
from.guid
from_profile
from.profile_picture
metrics.lifetime.impressions_organic_unique
metrics.lifetime.impressions_paid_unique
metrics.lifetime.impressions_paid
metrics.lifetime.impressions_unique
metrics.lifetime.impressions_follower_unique
metrics.lifetime.impressions_nonviral_unique
metrics.lifetime.impressions_viral_unique
from.screen_name
title
tags
Campaigns
0
0
VIDEO
POST
LINKEDIN_COMPANY_UPDATE
4526462
licp:596893
Let's celebrate International Beer Day by givi...
https://linkedin.com/feed/update/urn:li:ugcPos...
LINKEDIN
TRUE
2022-08-06T02:10:17Z
27
0
1118
1118
5
2
27
221
licp:596893
Ernest Packaging Solutions
https://media-exp2.licdn.com/dms/image/C560BAQ...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1141696
Fun Extra
1
0
VIDEO
POST
LINKEDIN_COMPANY_UPDATE
4526462
licp:596893
Let's celebrate International Beer Day by givi...
https://linkedin.com/feed/update/urn:li:ugcPos...
LINKEDIN
TRUE
2022-08-06T02:10:17Z
27
0
1118
1118
5
2
27
221
licp:596893
Ernest Packaging Solutions
https://media-exp2.licdn.com/dms/image/C560BAQ...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1141676
Video
2
1
VIDEO
POST
FACEBOOK_POST
4530996
fbpr:259614427411446
The best way to celebrate International Beer D...
https://www.facebook.com/259614427411446/posts...
FACEBOOK
TRUE
2022-08-06T00:10:19Z
4
0
67
67
0
0
4
18
fbpr:259614427411446
Ernest Packaging Solutions
https://scontent-iad3-1.xx.fbcdn.net/v/t1.6435...
67
0
0
67
50
67
0
ErnestPackaging
NaN
1141696
Fun Extra
3
1
VIDEO
POST
FACEBOOK_POST
4530996
fbpr:259614427411446
The best way to celebrate International Beer D...
https://www.facebook.com/259614427411446/posts...
FACEBOOK
TRUE
2022-08-06T00:10:19Z
4
0
67
67
0
0
4
18
fbpr:259614427411446
Ernest Packaging Solutions
https://scontent-iad3-1.xx.fbcdn.net/v/t1.6435...
67
0
0
67
50
67
0
ErnestPackaging
NaN
1141676
Video
4
2
VIDEO
POST
INSTAGRAM_MEDIA
4530999
ibpr:17841401555616258
The best way to celebrate #internationalbeerda...
https://www.instagram.com/p/Cg5YLn9sOhA/
INSTAGRAM
TRUE
2022-08-05T23:38:16Z
10
2
152
152
NaN
NaN
10
47
ibpr:17841401555616258
Ernest Packaging Solutions
https://scontent-iad3-1.xx.fbcdn.net/v/t51.288...
138
NaN
NaN
138
NaN
NaN
NaN
ernest_packaging
NaN
1141696
Fun Extra
My Code
out = post_data_2.groupby(['index', 'content_category', 'post_category', 'post_type',
'customer_profile_id', 'profile_guid', 'text', 'perma_link',
'network', 'sent', 'created_time',
'metrics.lifetime.likes', 'metrics.lifetime.comments_count',
'metrics.lifetime.impressions',
'metrics.lifetime.impressions_organic',
'metrics.lifetime.post_content_clicks',
'metrics.lifetime.shares_count', 'metrics.lifetime.reactions',
'metrics.lifetime.video_views', 'from.guid', 'from_profile',
'from.profile_picture',
'metrics.lifetime.impressions_organic_unique',
'metrics.lifetime.impressions_paid_unique',
'metrics.lifetime.impressions_paid',
'metrics.lifetime.impressions_unique',
'metrics.lifetime.impressions_follower_unique',
'metrics.lifetime.impressions_nonviral_unique',
'metrics.lifetime.impressions_viral_unique', 'from.screen_name',
'title', 'tags', 'Campaigns'], as_index=False)['Campaigns'].agg(list)
My Target Result
index_1
index
content_category
post_category
post_type
customer_profile_id
profile_guid
text
perma_link
network
sent
created_time
metrics.lifetime.likes
metrics.lifetime.comments_count
metrics.lifetime.impressions
metrics.lifetime.impressions_organic
metrics.lifetime.post_content_clicks
metrics.lifetime.shares_count
metrics.lifetime.reactions
metrics.lifetime.video_views
from.guid
from_profile
from.profile_picture
metrics.lifetime.impressions_organic_unique
metrics.lifetime.impressions_paid_unique
metrics.lifetime.impressions_paid
metrics.lifetime.impressions_unique
metrics.lifetime.impressions_follower_unique
metrics.lifetime.impressions_nonviral_unique
metrics.lifetime.impressions_viral_unique
from.screen_name
title
tags
Campaigns
0
0
VIDEO
POST
LINKEDIN_COMPANY_UPDATE
4526462
licp:596893
Let's celebrate International Beer Day by givi...
https://linkedin.com/feed/update/urn:li:ugcPos...
LINKEDIN
TRUE
2022-08-06T02:10:17Z
27
0
1118
1118
5
2
27
221
licp:596893
Ernest Packaging Solutions
https://media-exp2.licdn.com/dms/image/C560BAQ...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1141696
[Fun Extra, video]
1
1
VIDEO
POST
FACEBOOK_POST
4530996
fbpr:259614427411446
The best way to celebrate International Beer D...
https://www.facebook.com/259614427411446/posts...
FACEBOOK
TRUE
2022-08-06T00:10:19Z
4
0
67
67
0
0
4
18
fbpr:259614427411446
Ernest Packaging Solutions
https://scontent-iad3-1.xx.fbcdn.net/v/t1.6435...
67
0
0
67
50
67
0
ErnestPackaging
NaN
1141696
[Fun Extra, video]
2
2
VIDEO
POST
INSTAGRAM_MEDIA
4530999
ibpr:17841401555616258
The best way to celebrate #internationalbeerda...
https://www.instagram.com/p/Cg5YLn9sOhA/
INSTAGRAM
TRUE
2022-08-05T23:38:16Z
10
2
152
152
NaN
NaN
10
47
ibpr:17841401555616258
Ernest Packaging Solutions
https://scontent-iad3-1.xx.fbcdn.net/v/t51.288...
138
NaN
NaN
138
NaN
NaN
NaN
ernest_packaging
NaN
1141696
Fun Extra
Any help would be greatly appreciated!!
Thank you so much!!
Try removing 'Campaigns' from the list of columns passed to groupby, since you also select and aggregate that same column afterwards.
Your code:
out = post_data_2.groupby(['index', 'content_category', 'post_category', 'post_type',
'customer_profile_id', 'profile_guid', 'text', 'perma_link',
'network', 'sent', 'created_time',
'metrics.lifetime.likes', 'metrics.lifetime.comments_count',
'metrics.lifetime.impressions',
'metrics.lifetime.impressions_organic',
'metrics.lifetime.post_content_clicks',
'metrics.lifetime.shares_count', 'metrics.lifetime.reactions',
'metrics.lifetime.video_views', 'from.guid', 'from_profile',
'from.profile_picture',
'metrics.lifetime.impressions_organic_unique',
'metrics.lifetime.impressions_paid_unique',
'metrics.lifetime.impressions_paid',
'metrics.lifetime.impressions_unique',
'metrics.lifetime.impressions_follower_unique',
'metrics.lifetime.impressions_nonviral_unique',
'metrics.lifetime.impressions_viral_unique', 'from.screen_name',
'title', 'tags', 'Campaigns'], as_index=False)['Campaigns'].agg(list)
My suggestion:
out = post_data_2.groupby(['index', 'content_category', 'post_category', 'post_type',
'customer_profile_id', 'profile_guid', 'text', 'perma_link',
'network', 'sent', 'created_time',
'metrics.lifetime.likes', 'metrics.lifetime.comments_count',
'metrics.lifetime.impressions',
'metrics.lifetime.impressions_organic',
'metrics.lifetime.post_content_clicks',
'metrics.lifetime.shares_count', 'metrics.lifetime.reactions',
'metrics.lifetime.video_views', 'from.guid', 'from_profile',
'from.profile_picture',
'metrics.lifetime.impressions_organic_unique',
'metrics.lifetime.impressions_paid_unique',
'metrics.lifetime.impressions_paid',
'metrics.lifetime.impressions_unique',
'metrics.lifetime.impressions_follower_unique',
'metrics.lifetime.impressions_nonviral_unique',
'metrics.lifetime.impressions_viral_unique', 'from.screen_name',
'title', 'tags'], as_index=False)['Campaigns'].agg(list)
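A minimal sketch of that idea on toy data (the frame and column names below are illustrative stand-ins, not the real dataset):
import pandas as pd

# Two rows describe the same post with different Campaigns values
post_data_2 = pd.DataFrame({
    'text': ['post A', 'post A', 'post B'],
    'network': ['LINKEDIN', 'LINKEDIN', 'FACEBOOK'],
    'Campaigns': ['Fun Extra', 'Video', 'Fun Extra'],
})

# Group only by the identifying columns; 'Campaigns' is aggregated, not grouped on
out = post_data_2.groupby(['text', 'network'], as_index=False)['Campaigns'].agg(list)
print(out)  # each post now appears once, with its Campaigns collected into a list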
I want to add all the data from charts.zip at https://doi.org/10.5281/zenodo.4778562 into a single DataFrame. The data consists of one file per year that contains multiple CSVs. I wrote the following code:
import glob
import re
from datetime import datetime

import pandas as pd

header = 0
dfs = []
for file in glob.glob('Charts/*/201?/*.csv'):
    region = file.split('/')[1]
    dates = re.findall(r'\d{4}-\d{2}-\d{2}', file.split('/')[-1])
    weekly_chart = pd.read_csv(file, header=header, sep='\t')
    weekly_chart['week_start'] = datetime.strptime(dates[0], '%Y-%m-%d')
    weekly_chart['week_end'] = datetime.strptime(dates[1], '%Y-%m-%d')
    weekly_chart['region'] = region
    dfs.append(weekly_chart)
all_charts = pd.concat(dfs)
But when I run it, Python returns:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_12886/3473678833.py in <module>
9 weekly_chart['region'] = region
10 dfs.append(weekly_chart)
---> 11 all_charts = pd.concat(dfs)
~/Downloads/enter/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
~/Downloads/enter/lib/python3.9/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
344 ValueError: Indexes have overlapping values: ['a']
345 """
--> 346 op = _Concatenator(
347 objs,
348 axis=axis,
~/Downloads/enter/lib/python3.9/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
401
402 if len(objs) == 0:
--> 403 raise ValueError("No objects to concatenate")
404
405 if keys is None:
ValueError: No objects to concatenate
How can I fix it?
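For what it's worth, that ValueError means the dfs list stayed empty: the glob pattern never matched any files, so there was nothing to concatenate. A quick sanity check, assuming the same directory layout, is to print what the pattern actually matches:
import glob
print(glob.glob('Charts/*/201?/*.csv'))  # an empty list means the pattern or the working directory is wrong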
I think the glob.glob might just be overcomplicating things... This works perfectly for me.
import os
import re

import pandas as pd

# Gives you a list of EVERY file in the Charts directory
# and sub directories that is a CSV
file_list = []
for path, subdirs, files in os.walk("Charts"):
    file_list.extend([os.path.join(path, x) for x in files if x.endswith('.csv')])

dfs = []
for file in file_list:
    region = file.split('/')[1]
    dates = re.findall(r'\d{4}-\d{2}-\d{2}', file.split('/')[-1])
    df = pd.read_csv(file, sep='\t')
    df['week_start'] = dates[0]
    df['week_end'] = dates[1]
    df['region'] = region
    dfs.append(df)
all_charts = pd.concat(dfs, ignore_index=True)
print(all_charts)
Output:
position song_id song_name artist streams ... peak_position position_status week_start week_end region
0 1 7wGoVu4Dady5GV0Sv4UIsx rockstar Post Malone 17532665 ... 1 0 2017-10-20 2017-10-27 us
1 2 75ZvA4QfFiZvzhj2xkaWAh I Fall Apart Post Malone 8350785 ... 2 0 2017-10-20 2017-10-27 us
2 3 2fQrGHiQOvpL9UgPvtYy6G Bank Account 21 Savage 7589124 ... 3 1 2017-10-20 2017-10-27 us
3 4 43ZyHQITOjhciSUUNPVRHc Gucci Gang Lil Pump 7584237 ... 4 1 2017-10-20 2017-10-27 us
4 5 5tz69p7tJuGPeMGwNTxYuV 1-800-273-8255 Logic 7527770 ... 1 -2 2017-10-20 2017-10-27 us
... ... ... ... ... ... ... ... ... ... ... ...
273595 196 6kex4EBAj0WHXDKZMEJaaF Swalla (feat. Nicki Minaj & Ty Dolla $ign) Jason Derulo 3747830 ... 8 -5 2018-03-02 2018-03-09 global
273596 197 0CokSRCu5hZgPxcZBaEzVE Glorious (feat. Skylar Grey) Macklemore 3725286 ... 14 -8 2018-03-02 2018-03-09 global
273597 198 7oK9VyNzrYvRFo7nQEYkWN Mr. Brightside The Killers 3717326 ... 148 -3 2018-03-02 2018-03-09 global
273598 199 7EUfNvyCVxQV3oN5ScA2Lb Next To Me Imagine Dragons 3681739 ... 122 -77 2018-03-02 2018-03-09 global
273599 200 6u0EAxf1OJTLS7CvInuNd7 Vai malandra (feat. Tropkillaz & DJ Yuri Martins) Anitta 3676542 ... 30 -23 2018-03-02 2018-03-09 global
If you really want the dates to be dates, you can run this on those two columns at the end.
all_charts['week_start'] = pd.to_datetime(all_charts['week_start'])
Personally, I'd also do the following:
all_charts['week_start'] = pd.to_datetime(all_charts['week_start'])
all_charts['week_end'] = pd.to_datetime(all_charts['week_end'])
all_charts['region'] = all_charts['region'].astype('category')
all_charts['artist'] = all_charts['artist'].astype('category')
all_charts['song_name'] = all_charts['song_name'].astype('category')
all_charts['song_id'] = all_charts['song_id'].astype('category')
all_charts.set_index(['region', 'week_start', 'week_end', 'position'], inplace=True)
all_charts.position_status = pd.to_numeric(all_charts.position_status, errors='coerce')
print(all_charts.head(10))
Giving:
song_id song_name artist streams last_week_position weeks_on_chart peak_position position_status
region week_start week_end position
us 2017-10-20 2017-10-27 1 7wGoVu4Dady5GV0Sv4UIsx rockstar Post Malone 17532665 1.0 3 1 0.0
2 75ZvA4QfFiZvzhj2xkaWAh I Fall Apart Post Malone 8350785 2.0 6 2 0.0
3 2fQrGHiQOvpL9UgPvtYy6G Bank Account 21 Savage 7589124 4.0 5 3 1.0
4 43ZyHQITOjhciSUUNPVRHc Gucci Gang Lil Pump 7584237 5.0 3 4 1.0
5 5tz69p7tJuGPeMGwNTxYuV 1-800-273-8255 Logic 7527770 3.0 26 1 -2.0
6 5Gd19NupVe5X8bAqxf9Iaz Gorgeous Taylor Swift 6940802 NaN 1 6 NaN
7 0ofbQMrRDsUaVKq2mGLEAb Havana Camila Cabello 6623184 10.0 12 7 3.0
8 2771LMNxwf62FTAdpJMQfM Bodak Yellow Cardi B 6472727 6.0 14 3 -2.0
9 5Z3GHaZ6ec9bsiI5BenrbY Young Dumb & Broke Khalid 5982108 9.0 29 6 0.0
10 7GX5flRQZVHRAGd6B4TmDO XO Tour Llif3 Lil Uzi Vert 5822583 8.0 9 2 -2.0
The complete dataset, along with supplementary information and variable descriptions, can be downloaded from the Harvard Dataverse at https://doi.org/10.7910/DVN/HG7NV7
Code that I have used:
import pandas as pd
import numpy as np
df_2005=pd.read_csv('2005.csv.bz2')
df_2006=pd.read_csv('2006.csv.bz2')
airport_df=pd.read_csv('airports.csv')
carrier_df=pd.read_csv('carriers.csv')
planes_df=pd.read_csv('plane-data.csv')
variables_df=pd.read_csv('variable-descriptions.csv')
main_df=pd.concat([df_2005,df_2006],ignore_index=True)
grouped_df = main_df.groupby("Month")
grouped_df = grouped_df.agg({"Dest": "unique"})
grouped_df = grouped_df.reset_index()
print(grouped_df)
Output for grouped_df:
Month Dest
0 1 [ORD, BOS, SAT, DAY, MSP, SLC, SFO, DEN, PDX, ...
1 2 [SAT, SJC, BOS, ORD, DAY, SLC, PDX, SFO, PIT, ...
2 3 [ORD, SJC, DEN, MSY, IND, SAN, OAK, SFO, SEA, ...
3 4 [MIA, DEN, LAX, MSP, SJC, CLT, ORD, HDN, MCI, ...
4 5 [DEN, MSP, SNA, SAN, ORD, EWR, SFO, ABQ, BWI, ...
5 6 [SNA, OMA, DEN, LGA, SAN, SJC, CLT, ORD, STL, ...
6 7 [ORD, SEA, DEN, DFW, PVD, SNA, MIA, OMA, SFO, ...
7 8 [STL, DEN, PDX, ORD, SFO, MSY, MIA, PHL, IAH, ...
8 9 [DEN, ORD, EWR, SFO, ABQ, MSP, IAH, PDX, CLT, ...
9 10 [DEN, ORD, ATL, LAX, MSP, DFW, IAD, SAN, DTW, ...
10 11 [PDX, DEN, MCI, SFO, CLT, ORD, ATL, MSP, DFW, ...
11 12 [PDX, LAX, ORD, BDL, DEN, SLC, SFO, BOS, MCI, ...
The question was: how does the number of people flying between different locations change over time?
Hence, I was planning to find, for each month, the destination (under the Dest column) with the maximum number of flights.
Expected output:
Month HighestDest CountOfHighestDest
0 1 ORD 55
1 2 SAT 54
2 3 ORD 33
3 4 MIA 45
4 5 DEN 66
5 6 SNA 73
6 7 ORD 54
7 8 STL 23
8 9 DEN 11
9 10 DEN 44
10 11 PDX 45
11 12 PDX 47
That is, it shows the most frequent destination per month and its count.
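A hedged sketch of one way to get that shape, assuming main_df has the Month and Dest columns used above: count flights per (Month, Dest) pair, then keep the largest count within each month.
counts = main_df.groupby(['Month', 'Dest']).size().reset_index(name='CountOfHighestDest')
top = counts.loc[counts.groupby('Month')['CountOfHighestDest'].idxmax()]
top = top.rename(columns={'Dest': 'HighestDest'}).reset_index(drop=True)
print(top)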
This is the first time I am using pandas and I do not really know how to approach my problem.
I have two data frames:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an example of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
and
[78793 rows x 2 columns]
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
If you focus on the cluster dataframe, the first column corresponds to the cluster ID, and inside each cluster there are several sequence IDs.
What I need to do first is split all my clusters (in R it would be something like: liste=split(x = data$V2, f = data$V1)).
Then, create a function which displays the most similar pair of sequences within each cluster.
Here is an example:
Let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
In the blast dataframe, the third column gives the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
And what I need to get is:
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account sequences paired with themselves.
If someone could give me some clues it would be fantastic.
Thank you all.
Firstly, I assume that there are no pairings in 'blast' between sequences from two different clusters. In other words: in this solution, the cluster ID of a pairing is determined from only one of the two sequence IDs.
Including cluster information and pairing information into one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings whose sequence IDs carry the same tag (the token after the first underscore), the most readable way is to add columns holding that tag:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now equal spec-values can be filtered the same way like it was done with equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here, apart from the assumption mentioned above, is that if there are several exactly equal maximum values, only one of them is taken into account.
Note: if you don't want the spec columns to be of type string, you can easily turn them into integers on the fly:
data['qspec'] = [int(seqid.split('_')[1]) for seqid in data['qseqid'].values]
This filters out any rows with a 100% match, merges the dataframes first on sseqid and then on qseqid, and returns results_df. Let me know if this works. You can then order by cluster name.
blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names',right_on='sseqid')
results_df = results_df.append(cluster.merge(blast, left_on='seq_names',right_on='qseqid'))
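To then keep only the best pair per cluster with this approach, one possible follow-up (assuming results_df carries the cluster_name, qseqid, sseqid and pident columns shown above) is:
best_pairs = results_df.sort_values('pident', ascending=False).drop_duplicates('cluster_name')
print(best_pairs[['cluster_name', 'qseqid', 'sseqid', 'pident']])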
I have a dataframe, grouped, with multiindex columns as below:
import random

import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({ 'id' : range(1,N+1),
'weeks_elapsed' : [random.choice(range(1,25)) for i in range(1,N+1)],
'code' : [random.choice(codes) for i in range(1,N+1)],
'colour': [random.choice(colours) for i in range(1,N+1)],
'texture': [random.choice(textures) for i in range(1,N+1)],
'size': [random.randint(1,100) for i in range(1,N+1)],
'scaled_size': [random.randint(100,1000) for i in range(1,N+1)]
}, columns= ['id', 'weeks_elapsed', 'code','colour', 'texture', 'size', 'scaled_size'])
grouped = df.groupby(['code', 'colour']).agg( {'size': [np.sum, np.average, np.size, pd.Series.idxmax],'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]}).reset_index()
>> grouped
code colour size scaled_size
sum average size idxmax sum average size idxmax
0 one black 1031 60.647059 17 81 185.153944 10.891408 17 47
1 one white 481 37.000000 13 53 204.139249 15.703019 13 53
2 three black 822 48.352941 17 6 123.269405 7.251141 17 31
3 three white 1614 57.642857 28 50 285.638337 10.201369 28 37
4 two black 523 58.111111 9 85 80.908912 8.989879 9 88
5 two white 669 41.812500 16 78 82.098870 5.131179 16 78
[6 rows x 10 columns]
How can I flatten/merge the column index levels, e.g. as "Level1|Level2" (size|sum, scaled_size|sum, etc.)? If this is not possible, is there a way to groupby() as I did above without creating multi-index columns?
There are potentially better, more Pythonic ways to flatten MultiIndex columns.
1. Use map and join with string column headers:
grouped.columns = grouped.columns.map('|'.join).str.strip('|')
print(grouped)
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 862 53.875000 16 14
1 one white 554 46.166667 12 18
2 three black 842 49.529412 17 90
3 three white 740 56.923077 13 97
4 two black 1541 61.640000 25 50
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 6980 436.250000 16 77
1 6101 508.416667 12 13
2 7889 464.058824 17 64
3 6329 486.846154 13 73
4 12809 512.360000 25 23
2. Use map with format for column headers that have numeric data types.
grouped.columns = grouped.columns.map('{0[0]}|{0[1]}'.format)
Output:
code| colour| size|sum size|average size|size size|idxmax \
0 one black 734 52.428571 14 30
1 one white 1110 65.294118 17 88
2 three black 930 51.666667 18 3
3 three white 1140 51.818182 22 20
4 two black 656 38.588235 17 77
5 two white 704 58.666667 12 17
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 8229 587.785714 14 57
1 8781 516.529412 17 73
2 10743 596.833333 18 21
3 10240 465.454545 22 26
4 9982 587.176471 17 16
5 6537 544.750000 12 49
3. Use list comprehension with f-string for Python 3.6+:
grouped.columns = [f'{i}|{j}' if j != '' else f'{i}' for i,j in grouped.columns]
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 1003 43.608696 23 76
1 one white 1255 59.761905 21 66
2 three black 777 45.705882 17 39
3 three white 630 52.500000 12 23
4 two black 823 54.866667 15 33
5 two white 491 40.916667 12 64
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 12532 544.869565 23 27
1 13223 629.666667 21 13
2 8615 506.764706 17 92
3 6101 508.416667 12 43
4 7661 510.733333 15 42
5 6143 511.916667 12 49
You could always change the columns directly:
grouped.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in grouped.columns]
Based on Scott Boston's answer, here is a small update (it works for columns with 2 or more levels):
temp.columns.map(lambda x: '|'.join([str(i) for i in x]))
Thank you, Boston!
Full credit to suraj's concise answer: https://stackoverflow.com/a/72616083/317797
df.columns = df.columns.map('_'.join)
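A tiny self-contained check of that one-liner, on illustrative data only:
import pandas as pd

cols = pd.MultiIndex.from_tuples([('size', 'sum'), ('size', 'mean'), ('scaled_size', 'sum')])
df = pd.DataFrame([[1, 2.5, 10]], columns=cols)
df.columns = df.columns.map('_'.join)
print(list(df.columns))  # ['size_sum', 'size_mean', 'scaled_size_sum']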