Pandas: problem merging DataFrames together - Python

I'm trying to merge two DataFrames with .merge and how='inner' on their common columns. Here are the two DataFrames. The first:
Year Month Brunei Darussalam ... Australia New Zealand Africa
0 1978 Jan na ... 28421 3612 587
1 1978 Feb na ... 13982 2521 354
2 1978 Mar na ... 16536 2727 405
3 1978 Apr na ... 16499 3197 736
4 1978 May na ... 20690 5130 514
.. ... ... ... ... ... ... ...
474 2017 Jul 5625 ... 104873 15358 6964
475 2017 Aug 4610 ... 75171 11197 6987
476 2017 Sep 5387 ... 100987 12021 5458
477 2017 Oct 4202 ... 90940 11834 5635
478 2017 Nov 5258 ... 81821 9348 6717
The second:
Year Month
0 1980 Jul
1 1980 Aug
2 1980 Sep
3 1980 Oct
4 1980 Nov
This is the command I tried:
merge = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
print(merge)
but I keep getting this error:
Traceback (most recent call last):
File "main.py", line 52, in <module>
merge = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 74, in merge
op = _MergeOperation(
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 672, in __init__
self._maybe_coerce_merge_keys()
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1193, in _maybe_coerce_merge_keys
raise ValueError(msg)
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

This means the Year column is numeric (int64) in one DataFrame and holds strings (object) in the other. Give the merge keys the same type, either both int:
dataframe['Year'] = dataframe['Year'].astype(int)
df['Year'] = df['Year'].astype(int)
df1 = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
Or:
dataframe['Year'] = dataframe['Year'].astype(str)
df['Year'] = df['Year'].astype(str)
df1 = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
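Before converting, a quick way to confirm which side has the mismatch is to compare the key dtypes (a sketch, assuming the DataFrame names above):
print(dataframe[['Year', 'Month']].dtypes)  # e.g. Year int64, Month object
print(df[['Year', 'Month']].dtypes)         # e.g. Year object, Month object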


Web scraping with Python and Pandas - Pagination

With this short code I can get data from the table:
import pandas as pd
df=pd.read_html('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular',parse_dates=True)
df[0].to_csv('2023_I_M_800.csv')
I am trying to get data from all pages, or a set number of them, but since this website doesn't use ul or li elements I don't know exactly how to build it.
Any help or idea would be appreciated.
Since the url contains the page number, why not just make a loop and concat?
import pandas as pd

F, L = 1, 4  # first and last pages

dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    sub_df.insert(0, "page_number", page)
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
# out.to_csv('2023_I_M_800.csv')  # <- uncomment this line to make a .csv
NB: You can access each sub_df separately using key-indexing notation, dico[page] (see the example after the output below).
Output:
print(out)
page_number Rank ... Date Results Score
0 1 1 ... 22 JAN 2023 1230
1 1 2 ... 22 JAN 2023 1204
2 1 3 ... 29 JAN 2023 1204
3 1 4 ... 27 JAN 2023 1192
4 1 5 ... 28 JAN 2023 1189
.. ... ... ... ... ...
395 4 394 ... 21 JAN 2023 977
396 4 394 ... 28 JAN 2023 977
397 4 398 ... 27 JAN 2023 977
398 4 399 ... 28 JAN 2023 977
399 4 399 ... 29 JAN 2023 977
[400 rows x 11 columns]
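For instance, to inspect a single page's table after the loop (page 2 here, assuming it was among the fetched pages):
print(dico[2].head())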
Try this:
for page in range(1, 10):
    df = pd.read_html(f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular', parse_dates=True)
    df[0].to_csv(f'2023_I_M_800_page_{page}.csv')

How to transform multiple columns of days into week columns

I would like to know how I can transform the day columns into week columns.
I tried groupby.sum(), but there is no column-name pattern, so I don't know what to group by.
The result should have column names like 'weekX': week1 (sum of the first 7 days), week2, week3, and so on.
Thanks in advance.
You can try:
idx = pd.RangeIndex(len(df.columns[4:])) // 7
out = df.iloc[:, 4:].groupby(idx, axis=1).sum().rename(columns=lambda x:f'Week{x+1}')
out = pd.concat([df.iloc[:, :4], out], axis=1)
print(out)
# Output
Province/State Country/Region Lat ... Week26 Week27 Week28
0 NaN Afghanistan 3.393.911 ... 247210 252460 219855
1 NaN Albania 411.533 ... 28068 32671 32113
2 NaN Algeria 280.339 ... 157675 187224 183841
3 NaN Andorra 425.063 ... 6147 6283 5552
4 NaN Angola -112.027 ... 4741 6341 6978
.. ... ... ... ... ... ... ...
261 NaN Sao Tome and Principe 1.864 ... 5199 5813 5231
262 NaN Yemen 15.552.727 ... 11089 11717 10363
263 NaN Comoros -116.455 ... 2310 2419 2292
264 NaN Tajikistan 38.861 ... 47822 50032 44579
265 NaN Lesotho -29.61 ... 2259 3011 3922
[266 rows x 32 columns]
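Note: DataFrame.groupby(..., axis=1) is deprecated in pandas 2.1+. If you hit that warning, a sketch of an equivalent (same idx as above) is to transpose, group the former columns, and transpose back:
idx = pd.RangeIndex(len(df.columns[4:])) // 7
out = df.iloc[:, 4:].T.groupby(idx).sum().T.rename(columns=lambda x: f'Week{x+1}')
out = pd.concat([df.iloc[:, :4], out], axis=1)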
You can use the melt method to combine all your date columns into a single 'Date' column:
df = df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='Date', value_name='Value')
From this point it should be straightforward to group the 'Date' column by week, then unstack it if you want it as multiple columns again, as sketched below.
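A minimal sketch of that follow-up, assuming the date column labels parse with pd.to_datetime and that ISO calendar weeks are an acceptable grouping (grouping only on 'Country/Region' here, since 'Province/State' contains NaN):
df_long = df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
                  var_name='Date', value_name='Value')
df_long['Week'] = pd.to_datetime(df_long['Date']).dt.isocalendar().week
weekly = (df_long.groupby(['Country/Region', 'Week'])['Value'].sum()
                 .unstack('Week')
                 .add_prefix('Week'))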

DataFrame from a string that looks like a table

I have a problem creating a DataFrame from a string that looks like a table. I want to end up with the same table as my data. This is my data, and below is my code:
0 2017 IX 2018 X 2018 X 2018 X 2018
0 2017 IX 2018 0 2017 IX 2018
UKUPNO 1.053 1.075 1.093 103,8 101,7 1.633 1.669 1.701 104,2 101,9
A Poljoprivreda, šumarstvo i ribolov 907 888 925 102,0 104,2 1.394 1.356 1.420 101,9 104,7
B Vađenje ruda i kamena 913 919 839 91,9 91,3 1.395 1.406 1.297 93,0 92,2
C Prerađivačka industrija 769 764 775 100,8 101,4 1.176 1.169 1.187 100,9 101,5
D Proizvodnja i snabdijevanje 1.574 1.570 1.647 104,6 104,9 2.459 2.455 2.579 104,9 105,1
električnom energijom, plinom,
parom i klimatizacija
E Snabdijevanje vodom; uklanjanje 956 973 954 99,8 98,0 1.462 1.491 1.462 100,0 98,1
otpadnih voda, upravljanje otpadom
import io
import pandas as pd

TESTDATA = io.StringIO(''' ''')  # the table text above goes between the triple quotes
df = pd.read_csv(TESTDATA, sep='delimiter', header=None, engine='python')
When I run my code, I get this DataFrame:
0 Prosječna neto plaća ...
1 u KM ...
2 Index Index ...
3 0 2017 IX 2018 X 2018 X 2018 ...
4 0 2017 IX 2018 ...
5 UKUPNO ...
6 A Poljoprivreda, šumarstvo i ribolov ...
7 B Vađenje ruda i kamena ...
8 C Prerađivačka industrija ...
9 D Proizvodnja i snabdijevanje ...
10 električnom energijom, plinom,
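A hedged suggestion: sep='delimiter' is not a real separator; with engine='python' a multi-character sep is treated as a regex, and since the literal string "delimiter" never occurs, every line becomes a single field, which is exactly the one-column result shown above. If the raw text separates columns with runs of two or more spaces (single spaces occur inside labels like "Vađenje ruda i kamena"), a regex separator may recover the columns; raw_text below is a hypothetical stand-in for the table string:
import io
import pandas as pd

# sketch: split on 2+ spaces, assuming the raw text keeps its column alignment;
# raw_text is the table string shown above (hypothetical name)
df = pd.read_csv(io.StringIO(raw_text), sep=r'\s{2,}', header=None, engine='python')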

Python error when finding the count of cells where a value was found

I have the below code on toy data, and it works the way I want: the last two columns give how many times the value in column Jan was found in column URL, and in how many distinct rows it was found.
sales = [{'account': '3', 'Jan': 'xxx', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf try bbbbb why try'},
{'account': '1', 'Jan': 'try', 'Feb': '210', 'URL': ''},
{'account': '2', 'Jan': 'bbbbb', 'Feb': '90', 'URL': 'ea2017-104.pdf bbbbb cc for why try' }]
df = pd.DataFrame(sales)
df
df['found_in_column'] = df['Jan'].apply(lambda x: ''.join(df['URL'].tolist()).count(x))
df['distinct_finds'] = df['Jan'].apply(lambda x: sum(df['URL'].str.contains(x)))
Why does the same code fail in the last case below? How could I change my code to avoid the error? In my last example there are special characters in the first column, and I suspected they were causing the problem. But the rows at index 3 and 4 have special characters too, and the code runs fine there.
answer2=answer[['Value','non_repeat_pdf']].iloc[0:11]
print(answer2)
Value non_repeat_pdf
0 effect\nive Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 closing ####
2 executing ####
3 order, ####
4 waives: ####
5 right ####
6 notice ####
7 intention ####
8 prohibit ####
9 further ####
10 participation ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Out[220]:
0 1
1 0
2 1
3 0
4 1
5 1
6 0
7 0
8 1
9 0
10 0
Name: Value, dtype: int64
answer2=answer[['Value','non_repeat_pdf']].iloc[10:11]
print(answer2)
Value non_repeat_pdf
10 participation ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Out[212]:
10 0
Name: Value, dtype: int64
answer2=answer[['Value','non_repeat_pdf']].iloc[11:12]
print(answer2)
Value non_repeat_pdf
11 1818(e); ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Traceback (most recent call last):
File "<ipython-input-215-2df7f4b2de41>", line 1, in <module>
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 2355, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src\inference.pyx", line 1574, in pandas._libs.lib.map_infer
File "<ipython-input-215-2df7f4b2de41>", line 1, in <lambda>
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 1562, in contains
regex=regex)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 254, in str_contains
stacklevel=3)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range
update
I modified my code and removed all special characters from the Value column, but I am still getting the error... what could be wrong?
Even with the error, the new column gets added to my answer2 dataframe:
answer2=answer[['Value','non_repeat_pdf']]
print(answer2)
Value non_repeat_pdf
0 law Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 concerned
2 rights
3 c
4 violate
5 8
6 agreement
7 voting
8 previously
9 supervisory
10 its
11 exercise
12 occs
13 entities
14 those
15 approved
16 1818h2
17 9
18 are
19 manner
20 their
21 affairs
22 b
23 solicit
24 procure
25 transfer
26 attempt
27 extraneous
28 modification
29 vote
... ...
1552 closing
1553 heavily
1554 pm
1555 throughout
1556 half
1557 window
1558 sixtysecond
1559 activity
1560 sampling
1561 using
1562 hour
1563 violated
1564 euro
1565 rates
1566 derivatives
1567 portfolios
1568 valuation
1569 parties
1570 numerous
1571 they
1572 reference
1573 because
1574 us
1575 important
1576 moment
1577 snapshot
1578 cet
1579 215
1580 finance
1581 supervision
[1582 rows x 2 columns]
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
Traceback (most recent call last):
File "<ipython-input-298-4dc80361895c>", line 1, in <module>
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2331, in __setitem__
self._set_item(key, value)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2404, in _set_item
self._check_setitem_copy()
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1873, in _check_setitem_copy
warnings.warn(t, SettingWithCopyWarning, stacklevel=stacklevel)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range
update 2
The following works:
answer2=answer[['Value','non_repeat_pdf']]
xyz= answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
xyz=xyz.to_frame()
xyz.columns=['found_in_all_PDF']
pd.concat([answer2, xyz], axis=1)
Out[305]:
Value non_repeat_pdf \
0 law Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 concerned
2 rights
3 c
4 violate
5 8
6 agreement
7 voting
8 previously
9 supervisory
10 its
11 exercise
12 occs
13 entities
14 those
15 approved
16 1818h2
17 9
18 are
19 manner
20 their
21 affairs
22 b
23 solicit
24 procure
25 transfer
26 attempt
27 extraneous
28 modification
29 vote
... ...
1552 closing
1553 heavily
1554 pm
1555 throughout
1556 half
1557 window
1558 sixtysecond
1559 activity
1560 sampling
1561 using
1562 hour
1563 violated
1564 euro
1565 rates
1566 derivatives
1567 portfolios
1568 valuation
1569 parties
1570 numerous
1571 they
1572 reference
1573 because
1574 us
1575 important
1576 moment
1577 snapshot
1578 cet
1579 215
1580 finance
1581 supervision
found_in_all_PDF
0 6
1 1
2 4
3 1036
4 9
5 93
6 4
7 2
8 1
9 2
10 6
11 1
12 0
13 1
14 3
15 1
16 0
17 25
18 20
19 3
20 14
21 4
22 358
23 2
24 1
25 2
26 6
27 1
28 1
29 3
...
1552 3
1553 2
1554 0
1555 5
1556 2
1557 3
1558 0
1559 2
1560 1
1561 5
1562 2
1563 7
1564 8
1565 3
1566 0
1567 1
1568 1
1569 4
1570 1
1571 9
1572 2
1573 2
1574 96
1575 1
1576 1
1577 1
1578 0
1579 0
1580 1
1581 0
[1582 rows x 3 columns]
Unfortunately I can't reproduce exactly the same error in my environment, but what I see is a warning about incorrect regex usage. Your string was interpreted as a regular expression with a capturing group because of the parentheses in "1818(e);". Try str.contains with regex=False.
answer2 =pd.DataFrame({'Value': {11: '1818(e);'}, 'non_repeat_pdf': {11: '####'}})
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x,regex=False)))
Output:
11 0
Name: Value, dtype: int64
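Applied back to the toy code from the question (a sketch): regex=False treats each Value as a literal string, so entries like '1818(e);' no longer trigger the match-group warning. Note that both tracebacks actually die inside PyPDF2's patched warning handler (formatWarning crashes on filenames without a path separator); the .copy() below also avoids the SettingWithCopyWarning that set off the second crash.
answer2 = answer[['Value', 'non_repeat_pdf']].copy()  # .copy() avoids SettingWithCopyWarning
answer2['found_in_all_PDF'] = answer2['Value'].apply(
    lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
answer2['distinct_finds'] = answer2['Value'].apply(
    lambda x: sum(answer2['non_repeat_pdf'].str.contains(x, regex=False)))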

Getting a particular value in a pandas data frame

I have a data frame named df:
season seed team
1609 2010 W01 1246
1610 2010 W02 1452
1611 2010 W03 1307
1612 2010 W04 1458
1613 2010 W05 1396
I need a new data frame in the following format:
team frequency
1246 01
1452 02
1307 03
1458 04
1396 05
The frequency value comes from the seed column of df:
W01 -> 01
W02 -> 02
W03 -> 03
How do I do this in pandas?
The solution below uses a lambda function to apply a regex to remove non-digit characters.
http://pythex.org/?regex=%5CD&test_string=L16a&ignorecase=0&multiline=0&dotall=0&verbose=0
import pandas as pd
import re

index = [1609, 1610, 1611, 1612, 1613, 1700]
data = {'season': [2010, 2010, 2010, 2010, 2010, 2010],
        'seed': ['W01', 'W02', 'W03', 'W04', 'W05', 'L16a'],
        'team': [1246, 1452, 1307, 1458, 1396, 0]}
df = pd.DataFrame(data, index=index)
df['frequency'] = df['seed'].apply(lambda x: int(re.sub(r'\D', '', x)))  # raw string avoids invalid-escape warnings
df2 = df[['team', 'frequency']].set_index('team')
# Setup your DataFrame
df = pd.DataFrame({'season': [2010]*5, 'seed': ['W0' + str(i) for i in range(1,6)], 'team': [1246, 1452, 1307, 1458, 1396]}, index=range(1609, 1614))
s = pd.Series(df['seed'].str[1:].values, index=df['team'], name='frequency')
print(s)
yields
team
1246 01
1452 02
1307 03
1458 04
1396 05
Name: frequency, dtype: object
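A vectorized alternative to the apply-based solutions (a sketch, assuming the same df as above): str.replace with regex=True strips the non-digits in one pass and keeps the leading zeros, since the result stays a string.
df['frequency'] = df['seed'].str.replace(r'\D', '', regex=True)
df2 = df[['team', 'frequency']].set_index('team')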
