Give scores to dataframe based on id - python

I have a dataframe indexed by date. I am trying to assign a score to each accountid based on its category, if that category value exists on the index date. The dataframe should look like this:
accountid category Smooth Hard Sharp Narrow
timestamp
2018-03-29 101 Smooth 1 NaN NaN NaN
2018-03-29 102 Hard NaN 1 NaN NaN
2018-03-30 103 Narrow NaN NaN NaN 1
2018-04-30 104 Sharp NaN NaN 1 NaN
2018-04-21 105 Narrow NaN NaN NaN 1
What is the best way to loop through the dataframe per accountid and assign scores for each category, unstacked into its own column?
Here is the dataframe creation script:
import pandas as pd
import datetime
idx = pd.date_range('02-28-2018', '04-29-2018')
df = pd.DataFrame(
[[ '101', '2018-03-29', 'Smooth','NaN','NaN','NaN','NaN'], [
'102', '2018-03-29', 'Hard','NaN','NaN','NaN','NaN'
], [ '103', '2018-03-30', 'Narrow','NaN','NaN','NaN','NaN'], [
'104', '2018-04-30', 'Sharp','NaN','NaN','NaN','NaN'
], [ '105', '2018-04-21', 'Narrow','NaN','NaN','NaN','NaN']],
columns=[ 'accountid', 'timestamp', 'category','Smooth','Hard','Sharp','Narrow'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df=df.set_index(['timestamp'])
print(df)

You can use the str accessor with get_dummies:
df[['accountid','category']].assign(**df['category'].str.get_dummies())
Output:
accountid category Hard Narrow Sharp Smooth
timestamp
2018-03-29 101 Smooth 0 0 0 1
2018-03-29 102 Hard 1 0 0 0
2018-03-30 103 Narrow 0 1 0 0
2018-04-30 104 Sharp 0 0 1 0
2018-04-21 105 Narrow 0 1 0 0
Then replace 0 with NaN (this needs import numpy as np):
df[['accountid','category']].assign(**df['category'].str.get_dummies())\
.replace(0,np.nan)
Output:
accountid category Hard Narrow Sharp Smooth
timestamp
2018-03-29 101 Smooth NaN NaN NaN 1.0
2018-03-29 102 Hard 1.0 NaN NaN NaN
2018-03-30 103 Narrow NaN 1.0 NaN NaN
2018-04-30 104 Sharp NaN NaN 1.0 NaN
2018-04-21 105 Narrow NaN 1.0 NaN NaN
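For reference, here is the same idea as a self-contained sketch. The rebuilt sample frame, the order list (used to keep the columns in the order the question asked for), and the use of where in place of replace are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Rebuild the sample data from the question (score columns left out on purpose).
df = pd.DataFrame(
    {
        "accountid": ["101", "102", "103", "104", "105"],
        "timestamp": pd.to_datetime(
            ["2018-03-29", "2018-03-29", "2018-03-30", "2018-04-30", "2018-04-21"]
        ),
        "category": ["Smooth", "Hard", "Narrow", "Sharp", "Narrow"],
    }
).set_index("timestamp")

# Assumed column order, taken from the layout shown in the question.
order = ["Smooth", "Hard", "Sharp", "Narrow"]

# One indicator column per category, reordered to match the question.
dummies = df["category"].str.get_dummies().reindex(columns=order, fill_value=0)

# Keep the 1s and turn everything else into NaN.
scores = df.assign(**dummies.where(dummies.eq(1)))
print(scores)
```

No explicit loop over accountid is needed, since get_dummies works on the whole column at once.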

Related

How to create a combined wide and long format dataframe from a long input dataset?

I have a dataframe in long format which can be created using the code below:
import pandas as pd
import numpy as np
long_dict = {'Period':['2021-01-01','2021-02-01','2021-01-03','2021-02-04','2022-02-01','2022-03-01','2021-03-01'],
'Indicator':['number of tracks','number of tracks','defects','defects','gears and bobs','gears and bobs','staff'],
'indicator_status':['active','active','active','active','active','active','active'],
'mech_code':['73','73','44','44','106','107','106'],
'Value':[100,120,7,17,25,99,81]}
df = pd.DataFrame(long_dict)
The client wants a unified long and wide view built from the Indicator field (for visualisation in proprietary software). What is the best way to produce the output below?
Try:
df = pd.concat([df, df.pivot(columns="Indicator", values="Value")], axis=1)
print(df)
Prints:
Period Indicator indicator_status mech_code Value defects gears and bobs number of tracks staff
0 2021-01-01 number of tracks active 73 100 NaN NaN 100.0 NaN
1 2021-02-01 number of tracks active 73 120 NaN NaN 120.0 NaN
2 2021-01-03 defects active 44 7 7.0 NaN NaN NaN
3 2021-02-04 defects active 44 17 17.0 NaN NaN NaN
4 2022-02-01 gears and bobs active 106 25 NaN 25.0 NaN NaN
5 2022-03-01 gears and bobs active 107 99 NaN 99.0 NaN NaN
6 2021-03-01 staff active 106 81 NaN NaN NaN 81.0
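The one-liner works because pivot, when given no index argument, keeps the existing row index, so each row carries its Value only under its own Indicator column. A reduced sketch of that behaviour follows; the three-row sample and the names small, wide and combined are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row sample in the same long layout as the question.
small = pd.DataFrame(
    {
        "Period": ["2021-01-01", "2021-01-03", "2021-03-01"],
        "Indicator": ["number of tracks", "defects", "staff"],
        "Value": [100, 7, 81],
    }
)

# No index= argument: the existing RangeIndex is reused as the row index.
wide = small.pivot(columns="Indicator", values="Value")

# Glue the wide columns onto the original long rows.
combined = pd.concat([small, wide], axis=1)
print(combined)
```

This relies on the row index being unique; pivot raises a ValueError if the same index/column pair occurs twice.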

pandas.read_html tables not found

I'm trying to get a list of the major world indices in Yahoo Finance at this URL: https://finance.yahoo.com/world-indices.
I tried first to get the indices in a table by just running
major_indices=pd.read_html("https://finance.yahoo.com/world-indices")[0]
In this case the error was:
ValueError: No tables found
So I tried a Selenium-based solution from the question "pandas read_html - no tables found".
The solution given there (with some adjustment) is:
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.keys import Keys
from webdrivermanager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().download_and_install())
driver.get("https://finance.yahoo.com/world-indices")
html = driver.page_source
tables = pd.read_html(html)
data = tables[1]
Again this code gave me another error:
ValueError: No tables found
I don't know whether to keep using selenium or whether pd.read_html alone is enough. Either way, I'm trying to get this data and don't know how to proceed. Can anyone help me?
You don't need Selenium here, you just have to set the euConsentId cookie:
import pandas as pd
import requests
import uuid
url = 'https://finance.yahoo.com/world-indices'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
df = pd.read_html(html)[0]
Output:
>>> df
Symbol Name Last Price Change % Change Volume Intraday High/Low 52 Week Range Day Chart
0 ^GSPC S&P 500 4023.89 93.81 +2.39% 2.545B NaN NaN NaN
1 ^DJI Dow 30 32196.66 466.36 +1.47% 388.524M NaN NaN NaN
2 ^IXIC Nasdaq 11805.00 434.04 +3.82% 5.15B NaN NaN NaN
3 ^NYA NYSE COMPOSITE (DJ) 15257.36 326.26 +2.19% 0 NaN NaN NaN
4 ^XAX NYSE AMEX COMPOSITE INDEX 4025.81 122.66 +3.14% 0 NaN NaN NaN
5 ^BUK100P Cboe UK 100 739.68 17.83 +2.47% 0 NaN NaN NaN
6 ^RUT Russell 2000 1792.67 53.28 +3.06% 0 NaN NaN NaN
7 ^VIX CBOE Volatility Index 28.87 -2.90 -9.13% 0 NaN NaN NaN
8 ^FTSE FTSE 100 7418.15 184.81 +2.55% 0 NaN NaN NaN
9 ^GDAXI DAX PERFORMANCE-INDEX 14027.93 288.29 +2.10% 0 NaN NaN NaN
10 ^FCHI CAC 40 6362.68 156.42 +2.52% 0 NaN NaN NaN
11 ^STOXX50E ESTX 50 PR.EUR 3703.42 89.99 +2.49% 0 NaN NaN NaN
12 ^N100 Euronext 100 Index 1211.74 28.89 +2.44% 0 NaN NaN NaN
13 ^BFX BEL 20 3944.56 14.35 +0.37% 0 NaN NaN NaN
14 IMOEX.ME MOEX Russia Index 2307.50 9.61 +0.42% 0 NaN NaN NaN
15 ^N225 Nikkei 225 26427.65 678.93 +2.64% 0 NaN NaN NaN
16 ^HSI HANG SENG INDEX 19898.77 518.43 +2.68% 0 NaN NaN NaN
17 000001.SS SSE Composite Index 3084.28 29.29 +0.96% 3.109B NaN NaN NaN
18 399001.SZ Shenzhen Component 11159.79 64.92 +0.59% 3.16B NaN NaN NaN
19 ^STI STI Index 3191.16 25.98 +0.82% 0 NaN NaN NaN
20 ^AXJO S&P/ASX 200 7075.10 134.10 +1.93% 0 NaN NaN NaN
21 ^AORD ALL ORDINARIES 7307.70 141.10 +1.97% 0 NaN NaN NaN
22 ^BSESN S&P BSE SENSEX 52793.62 -136.69 -0.26% 0 NaN NaN NaN
23 ^JKSE Jakarta Composite Index 6597.99 -1.85 -0.03% 0 NaN NaN NaN
24 ^KLSE FTSE Bursa Malaysia KLCI 1544.41 5.61 +0.36% 0 NaN NaN NaN
25 ^NZ50 S&P/NZX 50 INDEX GROSS 11168.18 -9.18 -0.08% 0 NaN NaN NaN
26 ^KS11 KOSPI Composite Index 2604.24 54.16 +2.12% 788539 NaN NaN NaN
27 ^TWII TSEC weighted index 15832.54 215.86 +1.38% 0 NaN NaN NaN
28 ^GSPTSE S&P/TSX Composite index 20099.81 400.76 +2.03% 294.637M NaN NaN NaN
29 ^BVSP IBOVESPA 106924.18 1236.54 +1.17% 0 NaN NaN NaN
30 ^MXX IPC MEXICO 49579.90 270.58 +0.55% 212.868M NaN NaN NaN
31 ^IPSA S&P/CLX IPSA 5058.88 0.00 0.00% 0 NaN NaN NaN
32 ^MERV MERVAL 38390.84 233.89 +0.61% 0 NaN NaN NaN
33 ^TA125.TA TA-125 1964.95 23.38 +1.20% 0 NaN NaN NaN
34 ^CASE30 EGX 30 Price Return Index 10642.40 -213.50 -1.97% 36.837M NaN NaN NaN
35 ^JN0U.JO Top 40 USD Net TRI Index 4118.19 65.63 +1.62% 0 NaN NaN NaN

Select a (non-indexed) column based on text content of a cell in a python/pandas dataframe

TL:DR - how do I create a dataframe/series from one or more columns in an existing non-indexed dataframe based on the column(s) containing a specific piece of text?
I'm relatively new to Python and data analysis. This is my first time posting a question on Stack Overflow, but I've been hunting for an answer for a long time (and used to code regularly) without any success.
I have a dataframe imported from an Excel file that doesn't have named/indexed columns. I am trying to extract data from nearly 2000 of these files, which all have slightly different columns of data (of course - why make it simple... or follow a template... or simply use something other than poorly formatted Excel spreadsheets...).
The original dataframe (from a poorly structured XLS file) looks a bit like this:
0 NaN RIGHT NaN
1 Date UCVA Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
3 4 5 6 7 8 9 \
0 NaN NaN NaN NaN NaN NaN NaN
1 Cyl Axis BSCVA Pentacam remarks K1 K2 K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
... 17 18 19 20 21 22 \
0 ... NaN NaN NaN NaN NaN NaN
1 ... BSCVA Pentacam remarks K1 K2 K2 back K max
2 ... 6/5 NaN NaN NaN NaN NaN
3 ... NaN NaN NaN NaN NaN NaN
4 ... NaN Pentacam 44.3 43.7 -6.2 45.5
5 ... 6/4-4 NaN NaN NaN NaN NaN
6 ... 6/5 NaN NaN NaN NaN NaN
I want to extract a set of dataframes/series that I can then combine back together to get a 'tidy' dataframe e.g.:
1 Date R-UCVA R-Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
1 R-Cyl R-Axis R-BSCVA R-Penta R-K1 R-K2 R-K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
And so on. I'm trying to write code that pulls out a set of columns that I define by looking for the words "Date", "UCVA", etc. I then plan to stitch them back together into a single dataframe with the patient identifier as an extra column, cycle through all the XLS files, and append everything to a single CSV file that I can do useful things with (such as load it into an Access database - yes, I know, but it has to be easy to use and already installed on an NHS computer - and run statistical analysis).
Any suggestions? I hope that's enough information.
Thanks very much in advance.
Kind regards
Vicky
Here is something that will hopefully get you started.
I have prepared a text.xlsx file:
and I can read it as follows
import pandas as pd
path = 'text.xlsx'
df = pd.read_excel(path, header=[0,1])
# Deal with two levels of headers, here I just join them together crudely
df.columns = df.columns.map(lambda h: ' '.join(h))
# Slight hack because I messed with the column names
# I create two dataframes, one with the first column, one with the second column
df1 = df[[df.columns[0],df.columns[1]]]
df2 = df[[df.columns[0], df.columns[2]]]
# Stacking them on top of each other
result = pd.concat([df1, df2])
print(result)
#Merging them on the Date column
result = pd.merge(left=df1, right=df2, on=df1.columns[0])
print(result)
This gives the output
RIGHT Sph RIGHT UCVA Unnamed: 0_level_0 Date
0 NaN 6/38 2007-01-13 00:00:00
1 NaN 6/37 2009-11-05 00:00:00
2 NaN 9/56 2009-11-18 00:00:00
0 [-2.00] NaN 2007-01-13 00:00:00
1 NaN NaN 2009-11-05 00:00:00
2 NaN NaN 2009-11-18 00:00:00
and
Unnamed: 0_level_0 Date RIGHT UCVA RIGHT Sph
0 2007-01-13 00:00:00 6/38 [-2.00]
1 2009-11-05 00:00:00 6/37 NaN
2 2009-11-18 00:00:00 9/56 NaN
Some pointers:
How to merge two header rows? See this question and answer.
How to select pandas columns conditionally? See e.g. this or this
How to merge dataframes? There is a very good guide in the pandas doc
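To make the second pointer concrete, here is a minimal sketch of picking columns by the text in a header row. It assumes, as in the question's printout, that the labels sit in row 1 of a frame with no named columns; the small raw frame and the wanted list are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for one imported sheet: no column names,
# a side marker in row 0 and the real labels in row 1.
raw = pd.DataFrame(
    [
        [None, "RIGHT", None, None],
        ["Date", "UCVA", "Sph", "Cyl"],
        ["2007-01-13", "6/38", "[-2.00]", "[-2.75]"],
        ["2009-11-05", "6/9", None, None],
    ]
)

# The row that holds the labels, as text.
labels = raw.iloc[1].astype(str)

# Assumed list of labels to keep.
wanted = ["Date", "UCVA"]
mask = labels.isin(wanted)

# Keep matching columns, drop the two header rows, and name the columns.
tidy = raw.loc[:, mask.to_numpy()].iloc[2:].reset_index(drop=True)
tidy.columns = labels[mask].tolist()
print(tidy)
```

If a label such as BSCVA appears more than once in a sheet, isin keeps every matching column, so the resulting names would need a prefix (for example R- or L-) to stay unique.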

Pivoting DataFrame with multiple columns for the index

I have a dataframe and I want to pivot the values of one column into separate columns.
This is what I have now.
Entity Name Date Value
0 111 Name1 2018-03-31 100
1 111 Name2 2018-02-28 200
2 222 Name3 2018-02-28 1000
3 333 Name1 2018-01-31 2000
I want to create date as the column and then add value. Something like this:
Entity Name 2018-01-31 2018-02-28 2018-03-31
0 111 Name1 NaN NaN 100.0
1 111 Name2 NaN 200.0 NaN
2 222 Name3 NaN 1000.0 NaN
3 333 Name1 2000.0 NaN NaN
The same Name can appear under two different Entity values. Here is an updated dataset.
Code:
import pandas as pd
import datetime
data1 = {
'Entity': [111,111,222,333],
'Name': ['Name1','Name2', 'Name3','Name1'],
'Date': [datetime.date(2018,3, 31), datetime.date(2018,2,28), datetime.date(2018,2,28), datetime.date(2018,1,31)],
'Value': [100,200,1000,2000]
}
df1 = pd.DataFrame(data1, columns= ['Entity','Name','Date', 'Value'])
How do I achieve this? Any pointers? Thanks all.
Based on your update, you'd need pivot_table with two index columns -
v = df1.pivot_table(
index=['Entity', 'Name'],
columns='Date',
values='Value'
).reset_index()
v.index.name = v.columns.name = None
v
Entity Name 2018-01-31 2018-02-28 2018-03-31
0 111 Name1 NaN NaN 100.0
1 111 Name2 NaN 200.0 NaN
2 222 Name3 NaN 1000.0 NaN
3 333 Name1 2000.0 NaN NaN
Alternatively, using unstack:
df1.set_index(['Entity','Name','Date']).Value.unstack().reset_index()
Date Entity Name 2018-01-31 00:00:00 2018-02-28 00:00:00 \
0 111 Name1 NaN NaN
1 111 Name2 NaN 200.0
2 222 Name3 NaN 1000.0
3 333 Name1 2000.0 NaN
Date 2018-03-31 00:00:00
0 100.0
1 NaN
2 NaN
3 NaN
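A self-contained version of the pivot_table route, for anyone who wants to run it end to end. Converting the date column labels to strings at the end is an extra step added here for easier column access; it is not part of either answer.

```python
import datetime

import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame(
    {
        "Entity": [111, 111, 222, 333],
        "Name": ["Name1", "Name2", "Name3", "Name1"],
        "Date": [
            datetime.date(2018, 3, 31),
            datetime.date(2018, 2, 28),
            datetime.date(2018, 2, 28),
            datetime.date(2018, 1, 31),
        ],
        "Value": [100, 200, 1000, 2000],
    }
)

# Two index columns keep (Entity, Name) pairs distinct; dates become columns.
wide = df1.pivot_table(
    index=["Entity", "Name"], columns="Date", values="Value"
).reset_index()

# Assumed convenience step: plain string labels instead of date objects.
wide.columns = [str(c) for c in wide.columns]
print(wide)
```

One difference between the two answers: pivot_table averages duplicate (Entity, Name, Date) combinations by default (aggfunc="mean"), whereas the unstack route raises an error when duplicates are present.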

Merge pandas df based on 2 keys

I have 2 df and I would like to merge them based on 2 keys - ID and date:
The following is just a small slice of the entire df:
df_pw6
ID date pw10_0 pw50_0 pw90_0
0 153 2018-01-08 27.88590 43.2872 58.2024
0 2 2018-01-05 11.03610 21.4879 31.6997
0 506 2018-01-08 6.98468 25.3899 45.9486
df_ex
date ID measure f188 f187 f186 f185
0 2017-07-03 501 NaN 1 0.5 7 4.0
1 2017-07-03 502 NaN 0 2.5 5 3.0
2 2018-01-08 506 NaN 5 9.0 9 1.2
As you can see, only the third row has a match.
When I type:
#check date
df_ex.iloc[2,0]== df_pw6.iloc[1,1]
True
#check ID
df_ex.iloc[2,1] == df_pw6.iloc[2,0]
True
Now I try to merge them:
df19 = pd.merge(df_pw6,df_ex,on=['date','ID'])
I get an empty df
When I try:
df19 = pd.merge(df_pw6,df_ex,how ='left',on=['date','ID'])
I get:
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 153 2018-01-08 00:00:00 27.88590 43.2872 58.2024 NaN NaN NaN NaN NaN
1 2 2018-01-05 00:00:00 11.03610 21.4879 31.6997 NaN NaN NaN NaN NaN
2 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN NaN NaN NaN NaN
My desired result should be:
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2
I ran your code after your edit and got the desired result.
import pandas as pd
# copy paste your first df by hand
pw = pd.read_clipboard()
# copy paste your second df by hand
ex = pd.read_clipboard()
pd.merge(pw,ex,on=['date','ID'])
# output [edited. now it is the correct result OP wanted.]
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 506 2018-01-08 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2
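If the merge still comes back empty on the real data, one common cause is that the key columns have different dtypes in the two frames, for example dates stored as text in one and as datetimes in the other. The thread does not confirm that this was the cause here, so the sketch below is only an illustration under that assumption; the trimmed sample frames are made up from the slices shown in the question.

```python
import pandas as pd

# Hypothetical slices: df_ex stores its keys as text, df_pw6 does not.
df_pw6 = pd.DataFrame(
    {
        "ID": [153, 2, 506],
        "date": pd.to_datetime(["2018-01-08", "2018-01-05", "2018-01-08"]),
        "pw10_0": [27.88590, 11.03610, 6.98468],
    }
)
df_ex = pd.DataFrame(
    {
        "date": ["2017-07-03", "2017-07-03", "2018-01-08"],
        "ID": ["501", "502", "506"],
        "f188": [1, 0, 5],
    }
)

# Bring both key columns to the same dtype in both frames before merging.
df_ex["date"] = pd.to_datetime(df_ex["date"])
df_ex["ID"] = df_ex["ID"].astype(int)

merged = pd.merge(df_pw6, df_ex, on=["date", "ID"])
print(merged)
```

Comparing df_pw6.dtypes with df_ex.dtypes before merging is a quick way to check whether this applies.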
