getting a particular value in pandas data frame - python

I have a data frame named df
season seed team
1609 2010 W01 1246
1610 2010 W02 1452
1611 2010 W03 1307
1612 2010 W04 1458
1613 2010 W05 1396
I need to create a new data frame in the following format:
team frequency
1246 01
1452 02
1307 03
1458 04
1396 05
The frequency value comes from the seed column of df, with the leading letter dropped:
W01 -> 01
W02 -> 02
W03 -> 03
How do I do this in pandas?

The solution below uses a lambda function to apply a regex to remove non-digit characters.
http://pythex.org/?regex=%5CD&test_string=L16a&ignorecase=0&multiline=0&dotall=0&verbose=0
import pandas as pd
import re

index = [1609, 1610, 1611, 1612, 1613, 1700]
data = {'season': [2010, 2010, 2010, 2010, 2010, 2010],
        'seed': ['W01', 'W02', 'W03', 'W04', 'W05', 'L16a'],
        'team': [1246, 1452, 1307, 1458, 1396, 0]}
df = pd.DataFrame(data, index=index)

# Strip every non-digit character from the seed, then cast to int
df['frequency'] = df['seed'].apply(lambda x: int(re.sub(r'\D', '', x)))
df2 = df[['team', 'frequency']].set_index('team')
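
As an aside, a vectorized alternative (a sketch, not part of the original answer): pandas' own string methods do the same digit extraction without a Python-level lambda, which is faster on large frames:
# str.extract pulls out the digit run; expand=False returns a Series
df['frequency'] = df['seed'].str.extract(r'(\d+)', expand=False).astype(int)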

# Set up the example DataFrame
df = pd.DataFrame({'season': [2010] * 5,
                   'seed': ['W0' + str(i) for i in range(1, 6)],
                   'team': [1246, 1452, 1307, 1458, 1396]},
                  index=range(1609, 1614))

# Drop the leading 'W' and index the result by team
s = pd.Series(df['seed'].str[1:].values, index=df['team'], name='frequency')
print(s)
yields
team
1246 01
1452 02
1307 03
1458 04
1396 05
Name: frequency, dtype: object


Web scraping with Python and Pandas - Pagination

With this short code I can get data from the table:
import pandas as pd
df=pd.read_html('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular',parse_dates=True)
df[0].to_csv('2023_I_M_800.csv')
I am trying to get data from all pages, or a set number of them, but since this website doesn't use ul or li elements I don't know exactly how to build it.
Any help or idea would be appreciated.
Since the url contains the page number (the &page=1 part), why not just make a loop and concat?
import pandas as pd

F, L = 1, 4  # first and last pages

dico = {}
for page in range(F, L + 1):
    url = f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    sub_df.insert(0, "page_number", page)
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
# out.to_csv('2023_I_M_800.csv')  # <- uncomment this line to make a .csv
NB: you can access each sub_df separately by using key-indexing notation: dico[page].
Output :
print(out)
page_number Rank ... Date Results Score
0 1 1 ... 22 JAN 2023 1230
1 1 2 ... 22 JAN 2023 1204
2 1 3 ... 29 JAN 2023 1204
3 1 4 ... 27 JAN 2023 1192
4 1 5 ... 28 JAN 2023 1189
.. ... ... ... ... ...
395 4 394 ... 21 JAN 2023 977
396 4 394 ... 28 JAN 2023 977
397 4 398 ... 27 JAN 2023 977
398 4 399 ... 28 JAN 2023 977
399 4 399 ... 29 JAN 2023 977
[400 rows x 11 columns]
Try this:
for page in range(1, 10):
    df = pd.read_html(f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular', parse_dates=True)
    df[0].to_csv(f'2023_I_M_800_page_{page}.csv')
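If the total number of pages isn't known up front, one option (a sketch, assuming the site simply stops returning a results table past the last page) is to loop until pd.read_html fails; it raises ValueError when it finds no tables:
import time
import pandas as pd

frames = []
page = 1
while True:
    url = f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular'
    try:
        tables = pd.read_html(url, parse_dates=True)
    except ValueError:  # read_html raises ValueError when no tables are found
        break
    frames.append(tables[0])
    page += 1
    time.sleep(1)  # small pause to be polite to the server

out = pd.concat(frames, ignore_index=True)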

Make dictionary keys into rows and dict values as columns with one value as column name and one as column value

I have the following data:
print(tables['T10101'].keys())
dict_keys(['Y033RL', 'A007RL', 'A253RL', 'A646RL', 'A829RL', 'A008RL', 'A191RP', 'DGDSRL', 'A822RL', 'A824RL', 'A006RL', 'A825RL', 'A656RL', 'A823RL', 'Y001RL', 'DNDGRL', 'DDURRL', 'A021RL', 'A009RL', 'A020RL', 'DSERRL', 'A011RL', 'DPCERL', 'A255RL', 'A191RL'])
Each key has the following value, a list of tuples:
The dates are the same but the values change.
[('-20.7', '1930'), ('-33.3', '1931'), ('-41.4', '1932'), ('2.5', '1933'), ('38.3', '1934'), ('36.3', '1935'), ('37.2', '1936'), ('16.2', '1937'), ('-30.4', '1938'), ('15.4', '1939'), ('29.6', '1940'), ('17.2', '1941'), ('-42.6', '1942'), ('-10.2', '1943'), ('33.7', '1944'), ('43.2', '1945'), ('24.7', '1946'), ('36.0', '1947'), ('-8.3', '1947Q2'),
I would like to create the following dataframe:
1930 | 1931 | 1932 | 1933 | 1934 |
Y033RL|-20.7| -33.3| -41.4| 2.5 | 38.3 |
A007RL| data| data | data | data | data |
What's the best way to do this? I came up with this roundabout way of joining the dataframes one by one, but it is very inefficient as I have lots of data. I would like to have everything in a dictionary first and then convert it to one dataframe.
def dframeCreator(dataname):
    dframeList = list(tables[dataname].keys())
    df = tables[dataname][dframeList[0]]
    for x in range(len(dframeList[1:])):
        df = df.join(tables[dataname][dframeList[x + 1]])
    return df
We can use a dict comprehension to normalize the given dictionary into a standard format suitable for creating a dataframe:
d = tables['T10101']
df = pd.DataFrame({k: dict(map(reversed, v)) for k, v in d.items()}).T
print(df)
1930 1931 1932 1933 ... 1945 1946 1947 1947Q2
Y033RL -20.7 -33.3 -41.4 2.5 ... 43.2 24.7 36.0 -8.3
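To see what the inner dict(map(reversed, v)) is doing, a quick check on a couple of tuples; note the cell values stay strings, so cast afterwards if you need numbers:
v = [('-20.7', '1930'), ('-33.3', '1931')]
print(dict(map(reversed, v)))  # {'1930': '-20.7', '1931': '-33.3'}

# df = df.astype(float)  # optional cast, assuming every cell parses as a number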
If tables is your dictionary, you can do:
df = pd.DataFrame(
    [{"idx": k, **dict([(b, a) for a, b in v])} for k, v in tables.items()],
).set_index("idx")
df.index.name = None
print(df)
Prints:
1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1947Q2
T10101 -20.7 -33.3 -41.4 2.5 38.3 36.3 37.2 16.2 -30.4 15.4 29.6 17.2 -42.6 -10.2 33.7 43.2 24.7 36.0 -8.3
A007RL xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx

Pandas having a problem merging dataframes together

I'm trying to merge two dataframes together using .merge with how='inner' to join on the common columns. Here are the two dataframes. 1st one:
Year Month Brunei Darussalam ... Australia New Zealand Africa
0 1978 Jan na ... 28421 3612 587
1 1978 Feb na ... 13982 2521 354
2 1978 Mar na ... 16536 2727 405
3 1978 Apr na ... 16499 3197 736
4 1978 May na ... 20690 5130 514
.. ... ... ... ... ... ... ...
474 2017 Jul 5625 ... 104873 15358 6964
475 2017 Aug 4610 ... 75171 11197 6987
476 2017 Sep 5387 ... 100987 12021 5458
477 2017 Oct 4202 ... 90940 11834 5635
478 2017 Nov 5258 ... 81821 9348 6717
2nd one:
Year Month
0 1980 Jul
1 1980 Aug
2 1980 Sep
3 1980 Oct
4 1980 Nov
I tried running this:
merge = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
print(merge)
but I keep getting this ERROR,
Traceback (most recent call last):
File "main.py", line 52, in <module>
merge = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 74, in merge
op = _MergeOperation(
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 672, in __init__
self._maybe_coerce_merge_keys()
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1193, in _maybe_coerce_merge_keys
raise ValueError(msg)
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
It means the Year column is numeric (int64) in one dataframe and filled with strings (object) in the other.
So both need the same type, like:
dataframe['Year'] = dataframe['Year'].astype(int)
df['Year'] = df['Year'].astype(int)
df1 = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
Or:
dataframe['Year'] = dataframe['Year'].astype(str)
df['Year'] = df['Year'].astype(str)
df1 = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
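To check which side holds the object dtype before converting, a quick diagnostic:
print(dataframe[['Year', 'Month']].dtypes)
print(df[['Year', 'Month']].dtypes)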

Fisher's Exact Test from a Pandas Dataframe

I'm trying to work out the best way to create a p-value using Fisher's Exact test from four columns in a dataframe. I have already extracted the four parts of a contingency table: 'a' is top-left, 'b' top-right, 'c' bottom-left and 'd' bottom-right. I have started including additional calculated columns via simple pandas calculations, but these aren't necessary if there's an easier way to just use the four initial columns. I have over 1 million rows when including an additional set (x.type = high), so I want to use an efficient method. So far this is my code:
import pandas as pd
import glob
import math

path = r'directory_path'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
frame['a+b'] = frame['a'] + frame['b']
frame['c+d'] = frame['c'] + frame['d']
frame['a+c'] = frame['a'] + frame['c']
frame['b+d'] = frame['b'] + frame['d']
As an example of this data, 'frame' currently shows:
ID(n) a b c d i x.name x.type a+b c+d a+c b+d
0 1258065 5 28 31 1690 1754 Albumin low 33 1721 36 1718
1 1132105 4 19 32 1699 1754 Albumin low 23 1731 36 1718
2 898621 4 30 32 1688 1754 Albumin low 34 1720 36 1718
3 573158 4 30 32 1688 1754 Albumin low 34 1720 36 1718
4 572975 4 23 32 1695 1754 Albumin low 27 1727 36 1718
... ... ... ... ... ... ... ... ... ... ... ... ...
666646 12435 1 0 27 1726 1754 WHR low 1 1753 28 1726
666647 15119 1 0 27 1726 1754 WHR low 1 1753 28 1726
666648 17053 1 2 27 1724 1754 WHR low 3 1751 28 1726
666649 24765 1 3 27 1723 1754 WHR low 4 1750 28 1726
666650 8733 1 1 27 1725 1754 WHR low 2 1752 28 1726
Is the best way to convert these to a numpy array and process them through iteration, or to keep them in pandas? I assume that I can't use math functions within a dataframe (I've tried math.comb(), which didn't work in a dataframe). I've also tried using pyranges for its fisher method, but it seems it doesn't work in my environment (Python 3.8).
Any help would be much appreciated!
Following the answer here, which came from the author of pyranges (I think), let's say your data is something like:
import pandas as pd
import scipy.stats as stats
import numpy as np

np.random.seed(111)
df = pd.DataFrame(np.random.randint(1, 100, (1000000, 4)))
df.columns = ['a', 'b', 'c', 'd']
df['ID'] = range(1000000)
df.head()
a b c d ID
0 85 85 85 87 0
1 20 42 67 83 1
2 41 72 58 8 2
3 13 11 66 89 3
4 29 15 35 22 4
Convert it into a numpy array and do it like in that post:
from fisher import pvalue_npy

c = df[['a', 'b', 'c', 'd']].to_numpy(dtype='uint64')
_, _, twosided = pvalue_npy(c[:, 0], c[:, 1], c[:, 2], c[:, 3])
df['odds'] = (c[:, 0] * c[:, 3]) / (c[:, 1] * c[:, 2])
df['pvalue'] = twosided
Or you can feed the columns in directly:
_, _, twosided = pvalue_npy(df['a'].to_numpy(np.uint), df['b'].to_numpy(np.uint),
                            df['c'].to_numpy(np.uint), df['d'].to_numpy(np.uint))
df['odds'] = (df['a'] * df['d']) / (df['b'] * df['c'])
df['pvalue'] = twosided
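If you'd rather avoid the extra fisher dependency, scipy's stats.fisher_exact (imported above as stats) computes the same test, but it handles one 2x2 table at a time, so a row-wise apply will be much slower over a million rows (a sketch):
# stats.fisher_exact returns (odds ratio, p-value) for a single 2x2 table
df['pvalue_scipy'] = df.apply(
    lambda r: stats.fisher_exact([[r['a'], r['b']], [r['c'], r['d']]])[1],
    axis=1,
)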

how to create a DF from a DF based on a condition

My current DF looks like this
Combinations Count
1 ('IDLY', 'VADA') 3734
6 ('DOSA', 'IDLY') 2020
9 ('CHAPPATHI', 'DOSA') 1297
10 ('IDLY', 'POORI') 1297
11 ('COFFEE', 'TEA') 1179
13 ('DOSA', 'VADA') 1141
15 ('CHAPPATHI', 'IDLY') 1070
16 ('COFFEE', 'SAMOSA') 1061
17 ('COFFEE', 'IDLY') 1016
18 ('POORI', 'VADA') 1008
Let's say I filter by the keyword 'DOSA' from the above data frame; I get the output below:
Combinations Count
6 ('DOSA', 'IDLY') 2020
9 ('CHAPPATHI', 'DOSA') 1297
13 ('DOSA', 'VADA') 1141
But I would like the output to be like the df below, which drops the filter keyword since it's common to every row:
Combinations Count
6 IDLY 2020
9 CHAPPATHI 1297
13 VADA 1141
What concept of pandas needs to be used here? How can this be achieved?
In general, it's not ideal to have lists, tuples, sets, etc. inside a dataframe. It's better to have multiple records for each instance when needed.
You can use explode to turn Combinations into that form and filter on it:
keyword = 'DOSA'
s = df.explode('Combinations')
s.loc[s.Combinations.eq(keyword).groupby(level=0).transform('any') & s.Combinations.ne(keyword)]
Or chain the two conditions with .loc[lambda x: ...]:
(df.explode('Combinations')
   .loc[lambda x: x.Combinations.ne(keyword) &
                  x.Combinations.eq(keyword).groupby(level=0).transform('any')]
)
Output:
Combinations Count
6 IDLY 2020
9 CHAPPATHI 1297
13 VADA 1141
What I would do:
x = df.explode('Combinations')
x = x.loc[x.index[x.Combinations == 'DOSA']].query('Combinations != "DOSA"')
x
Combinations Count
6 IDLY 2020
9 CHAPPATHI 1297
13 VADA 1141
You can also try creating a dataframe as a reference, then mask where the keyword matches, with stack to drop the NaNs:
keyword = 'DOSA'
m = pd.DataFrame(df['Combinations'].tolist(), index=df.index)
c = m.eq(keyword).any(axis=1)
df[c].assign(Combinations=m[c].where(m[c].ne(keyword)).stack().droplevel(1))
Combinations Count
6 IDLY 2020
9 CHAPPATHI 1297
13 VADA 1141
If the Combinations column holds strings rather than tuples, you can convert them first:
import ast

df['Combinations'] = df['Combinations'].apply(ast.literal_eval)
d = df[df['Combinations'].transform(lambda x: 'DOSA' in x)].copy()
d['Combinations'] = d['Combinations'].apply(lambda x: set(x).difference(['DOSA']).pop())
print(d)
Prints:
ID Combinations Count
1 6 IDLY 2020
2 9 CHAPPATHI 1297
5 13 VADA 1141
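
A compact alternative (a sketch, assuming each Combinations entry is a tuple of exactly two items): keep the rows containing the keyword, then take the other element of each pair:
keyword = 'DOSA'
mask = df['Combinations'].apply(lambda t: keyword in t)
result = df.loc[mask].copy()
result['Combinations'] = result['Combinations'].apply(
    lambda t: next(x for x in t if x != keyword)
)
print(result)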
