I have a df which looks like this:
Label PG & PCF EE-LV Centre UMP FN
Label
Très favorable 2.68 3.95 4.20 3.33 3.05
Plutôt favorable 12.45 42.10 19.43 2.05 1.77
Plutôt défavorable 43.95 41.93 34.93 20.15 15.97
Très défavorable 37.28 9.11 41.44 70.26 75.99
Je ne sais pas 3.63 2.91 0.10 4.21 3.22
I would simply like to replace "&" with "and" wherever it appears, in either the column labels or the index labels.
I would've imagined that something like this would've worked, but it didn't:
dataframe.columns.replace("&","and")
Any ideas?
What you tried won't work, as Index has no such attribute. However, what we can do is convert the columns into a Series and then use str.replace to do what you want; you should be able to do the analogous operation on the index too:
df.columns = pd.Series(df.columns).str.replace('&', 'and')
df
Out[307]:
PG and PCF EE-LV Centre UMP
Label
Très favorable 2.68 3.95 4.20 3.33
Plutôt favorable 12.45 42.10 19.43 2.05
Plutôt défavorable 43.95 41.93 34.93 20.15
Très défavorable 37.28 9.11 41.44 70.26
Je ne sais pas 3.63 2.91 0.00 4.21
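If you want to handle the column labels and the index labels in one go, a minimal sketch using df.rename with callables (applied label by label on each axis) should also work:
# rename applies the callable to every label on the given axis
df = df.rename(columns=lambda label: label.replace('&', 'and'),
               index=lambda label: label.replace('&', 'and'))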
Given a dataframe like:
COUNTRY_CODE  PDP_SOURCE  TREAT_SLIT  ATC_METRIC
AR            Aisles      ex_1        12.78
AR            Aisles      ex_2        11.28
AR            Aisles      ex_3        11.96
AR            Favorites   ex_1        12.78
AR            Favorites   ex_2        12.28
AR            Favorites   ex_3        13.96
BR            Favorites   ex_1        11.2
BR            Favorites   ex_2        10.28
BR            Favorites   ex_3        10.96
I need to group this by COUNTRY_CODE and PDP_SOURCE and find the TREAT_SLIT which has the max ATC_METRIC value. Is there a way to do this with .groupby in pandas? The result should be something like:
COUNTRY_CODE  PDP_SOURCE  TREAT_SLIT  ATC_METRIC
AR            Aisles      ex_1        12.78
AR            Favorites   ex_3        13.96
BR            Favorites   ex_1        11.2
Here is one way to do it:
# use groupby to get the index where ATC_METRIC is max, then
# use loc to return those rows
df.loc[df.groupby(['COUNTRY_CODE', 'PDP_SOURCE'])['ATC_METRIC'].idxmax()]
Result
COUNTRY_CODE PDP_SOURCE TREAT_SLIT ATC_METRIC
0 AR Aisles ex_1 12.78
5 AR Favorites ex_3 13.96
6 BR Favorites ex_1 11.20
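A roughly equivalent alternative (a sketch, not from the original answer): sort by the metric descending, then keep the first row per (COUNTRY_CODE, PDP_SOURCE) pair with drop_duplicates:
out = (df.sort_values('ATC_METRIC', ascending=False)
         .drop_duplicates(['COUNTRY_CODE', 'PDP_SOURCE'])
         .sort_index())  # restore the original row order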
Here's how I parse the XML response from this URL:
import requests
from xml.etree import ElementTree as ET
response = requests.get('https://www.lb.lt/webservices/FxRates/FxRates.asmx/getCurrencyList')
tree = ET.fromstring(response.content)
The response contains (truncated):
b'<?xml version="1.0" encoding="utf-8"?>\r\n<CcyTbl xmlns="http://www.lb.lt/WebServices/FxRates">\r\n <CcyNtry>\r\n <Ccy>ADP</Ccy>\r\n <CcyNm lang="LT">Andoros peseta</CcyNm>\r\n <CcyNm lang="EN">Andorran peseta</CcyNm>\r\n <CcyNbr>020</CcyNbr>\r\n <CcyMnrUnts>0</CcyMnrUnts>\r\n </CcyNtry>\r\n <CcyNtry>\r\n <Ccy>AED</Ccy>\r\n <CcyNm lang="LT">Jungtini\xc5\xb3 Arab\xc5\xb3 Emirat\xc5\xb3 dirhamas</CcyNm>\r\n <CcyNm lang="EN">UAE dirham</CcyNm>\r\n <CcyNbr>784</CcyNbr>\r\n <CcyMnrUnts>2</CcyMnrUnts>\r\n </CcyNtry>\r\n <CcyNtry>\r\n <Ccy>AFN</Ccy>\r\n <CcyNm lang="LT">Afganistano afganis</CcyNm>\r\n <CcyNm lang="EN">Afghani</CcyNm>\r\n <CcyNbr>971</CcyNbr>\r\n <CcyMnrUnts>2</CcyMnrUnts>\r\n </CcyNtry>\r\n <CcyNtry>\r\n <Ccy>ALL</Ccy>\r\n <CcyNm lang="LT">Albanijos lekas</CcyNm>\r\n <CcyNm lang="EN">Albanian lek</CcyNm>\r\n <CcyNbr>008</CcyNbr>\r\n <CcyMnrUnts>2</CcyMnrUnts>\r\n </CcyNtry>\r\n <CcyNtry>\r\n <Ccy>AMD</Ccy>\r\n ...'
I need to extract the available currencies from the <Ccy> tags. However, it's not clear to me why the tags are not found:
>>> tree.findall('CcyNtry')
[]
>>> tree.findall('Ccy')
[]
What I'm trying to do is access the results printed by the loop below:
>>> for element in tree:
...     print(element[0].text)
ADP
AED
AFN
ALL
AMD
ANG
AOA
ARS
ATS
AUD
AWG
AZN
BAM
BBD
BDT
BEF
BGN
BHD
BIF
BYN
BYR
BMD
BND
BOB
BOV
BRL
BSD
BTN
BWP
BZD
CAD
CDF
CHF
CYP
CLF
CLP
CNY
COP
CRC
CUP
CVE
CZK
DEM
DJF
DKK
DOP
DZD
ECS
ECV
EEK
EGP
ERN
ESP
ETB
EUR
FIM
FJD
FKP
FRF
GBP
GEL
GHS
GYD
GIP
GMD
GNF
GRD
GTQ
GWP
HKD
HNL
HRK
HTG
HUF
IDR
IEP
YER
ILS
INR
IQD
IRR
ISK
ITL
JMD
JOD
JPY
KES
KGS
KHR
KYD
KMF
KPW
KRW
KWD
KZT
LAK
LBP
LYD
LKR
LRD
LSL
LTL
LUF
LVL
MAD
MDL
MDR
MGA
MYR
MKD
MMK
MNT
MOP
MRO
MTL
MUR
MVR
MWK
MXN
MXV
MZN
NAD
NGN
NIO
NLG
NOK
NPR
NZD
OMR
PAB
PEN
PGK
PHP
PYG
PKR
PLN
PTE
QAR
RON
RSD
RUB
RWF
SAR
SBD
SCR
SDD
SEK
SGD
SHP
SYP
SIT
SKK
SLL
SOS
SRD
SSP
STD
SVC
SZL
THB
TJR
TJS
TMT
TND
TOP
TPE
TRY
TTD
TWD
TZS
UAH
UGX
UYU
USD
USN
UZS
VEB
VES
VND
VUV
WST
XAF
XAG
XAU
XBA
XBB
XBC
XBD
XCD
XDR
XFO
XFU
XOF
XPD
XPF
XPT
XXX
ZAR
ZMW
ZRN
ZWL
Unfortunately, you have to deal with the namespace in the file. So try it this way:
namespaces = {'xx': 'http://www.lb.lt/WebServices/FxRates'}
for element in tree.findall('.//xx:CcyNtry/xx:Ccy', namespaces):
    print(element.text)
The output should be what you're looking for.
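If you'd rather not maintain a prefix map, ElementTree also accepts fully qualified tag names in Clark notation, so the following sketch should be equivalent:
# '{namespace-uri}local-name' is ElementTree's fully qualified tag syntax
ns = '{http://www.lb.lt/WebServices/FxRates}'
currencies = [ccy.text for ccy in tree.iter(ns + 'Ccy')]
print(currencies[:5])  # ['ADP', 'AED', 'AFN', 'ALL', 'AMD']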
How can I extract just the pin code and city from the address in a particular column and assign them to two new pandas columns named 'city' and 'address'? The regex approach below works in Python/pandas, but is there a quicker way to run it, since it takes more than 6 minutes for 10,000 rows?
Example address: 87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi
pincoderegex = re.compile(r'([\w]*)[\s]([\d]{6})')
pincoderegex.search(ref).group()                                   # output: 'Delhi 110059'
pincoderegex.search(data_rnr['BORROWER ADDRESS'][80]).groups()[1]  # output: '700105'
data_rnr['BORROWER CITY_NAME'] = 'default value'
data_rnr['BORROWER CITY_PINCODE'] = 'default value'
for i in range(0, len(data_rnr['BORROWER ADDRESS'])):
    try:
        data_rnr['BORROWER CITY_NAME'][i] = pincoderegex.search(data_rnr['BORROWER ADDRESS'][i]).groups()[0]
        data_rnr['BORROWER CITY_PINCODE'][i] = pincoderegex.search(data_rnr['BORROWER ADDRESS'][i]).groups()[1]
    except TypeError:
        print('TypeError')
    except NameError:
        print('NameError')
    except AttributeError:
        print('AttributeError')
    except:
        pass
The output is added to the new DataFrame columns data_rnr['BORROWER CITY_NAME'] and data_rnr['BORROWER CITY_PINCODE'].
([\w]*)[\s]([\d]{6}) needs 398 steps
([\w]+)\s([\d]{6}) needs 290 steps
\b([\w]+)\s([\d]{6}) needs 174 steps
\s([\w]+)\s([\d]{6}) needs 131 steps
So you can use \s([\w]+)\s([\d]{6}) to improve efficiency:
https://regex101.com/r/iLIXDI/1
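As a quick sanity check of that pattern against the example address (an illustrative snippet, not part of the original answer):
import re

address = "87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi"
m = re.search(r'\s([\w]+)\s([\d]{6})', address)
print(m.groups())  # ('Delhi', '110059')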
Based on @Olivier Hao's answer, which gives you the best pattern, \s([\w]+)\s([\d]{6}), you could have faster one-line code using only pandas:
pd.concat([data_rnr, data_rnr['BORROWER ADDRESS'].str.extract(r'\s(?P<BORROWER_CITY_NAME>[\w]+)\s(?P<BORROWER_CITY_PINCODE>[\d]{6})')], axis=1)
Notice that I directly named the groups in the regex pattern to create the new columns.
The only difference from your code is that instead of 'default value' in the new columns created, you would have NaN values where the pattern was not found.
I used this sample of Data:
data = [
"87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi",
"87 F/F Place Opp. C-2, Uttam Nagar NA Paris 930000 Paris",
"87 F/F Place Opp. C-2, Uttam Nagar NA Somewhere 115800 Somewhere",
"Wrong stuff",
"87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 148444 Bombay",
]
Using your code and after changing the pattern and removing the prints that take a lot of computation time I got this result:
def regex():
    data_rnr = pd.DataFrame(data, columns=["BORROWER ADDRESS"])
    pincoderegex = re.compile(r'\s([\w]+)\s([\d]{6})')
    data_rnr['BORROWER CITY_NAME'] = 'default value'
    data_rnr['BORROWER CITY_PINCODE'] = 'default value'
    for i in range(0, len(data_rnr['BORROWER ADDRESS'])):
        try:
            data_rnr['BORROWER CITY_NAME'][i] = pincoderegex.search(data_rnr['BORROWER ADDRESS'][i]).groups()[0]
            data_rnr['BORROWER CITY_PINCODE'][i] = pincoderegex.search(data_rnr['BORROWER ADDRESS'][i]).groups()[1]
        except:
            pass
    return data_rnr
%timeit regex()
2.1 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
BORROWER ADDRESS BORROWER CITY_NAME BORROWER CITY_PINCODE
0 87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 11... Delhi 110059
1 87 F/F Place Opp. C-2, Uttam Nagar NA Paris 93... Paris 930000
2 87 F/F Place Opp. C-2, Uttam Nagar NA Somewher... Somewhere 115800
3 Wrong stuff default value default value
4 87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 1... Bombay 148444
Using the one-line code I got this result:
def pandasExtract():
    data_rnr = pd.DataFrame(data, columns=["BORROWER ADDRESS"])
    return pd.concat([data_rnr,
                      data_rnr['BORROWER ADDRESS'].str.extract(r'\s(?P<BORROWER_CITY_NAME>[\w]+)\s(?P<BORROWER_CITY_PINCODE>[\d]{6})')],
                     axis=1)
%timeit pandasExtract()
1.1 ms ± 6.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
BORROWER ADDRESS BORROWER_CITY_NAME BORROWER_CITY_PINCODE
0 87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 11... Delhi 110059
1 87 F/F Place Opp. C-2, Uttam Nagar NA Paris 93... Paris 930000
2 87 F/F Place Opp. C-2, Uttam Nagar NA Somewher... Somewhere 115800
3 Wrong stuff NaN NaN
4 87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 1... Bombay 148444
But if you absolutely want to fill the NaN values it takes more time (still faster than your code):
def pandasExtractWithoutNan():
    data_rnr = pd.DataFrame(data, columns=["BORROWER ADDRESS"])
    return pd.concat([data_rnr,
                      data_rnr['BORROWER ADDRESS'].str.extract(r'\s(?P<BORROWER_CITY_NAME>[\w]+)\s(?P<BORROWER_CITY_PINCODE>[\d]{6})').fillna('default value')],
                     axis=1)
%timeit pandasExtractWithoutNan()
1.57 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
BORROWER ADDRESS BORROWER_CITY_NAME BORROWER_CITY_PINCODE
0 87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 11... Delhi 110059
1 87 F/F Place Opp. C-2, Uttam Nagar NA Paris 93... Paris 930000
2 87 F/F Place Opp. C-2, Uttam Nagar NA Somewher... Somewhere 115800
3 Wrong stuff default value default value
4 87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 1... Bombay 148444
The documentation of the pandas functions I used:
str.extract: extract the regex capture groups in the Series as columns.
fillna: fill missing values with the given value.
concat: concatenate a list of DataFrames along the given axis.
I'm trying to create a dataframe from a .data file that isn't well formatted.
Here's the raw text data:
FICHE CLIMATOLOGIQUE;
;
Statistiques 1981-2010 et records;
PARIS-MONTSOURIS (75) Indicatif : 75114001, alt : 75m, lat : 48°49'18"N, lon : 02°20'12"E;
Edité le : 18/12/2017 dans l'état de la base;
; Janv.; Févr.; Mars; Avril; Mai; Juin; Juil.; Août; Sept.; Oct.; Nov.; Déc.; Année;
La température la plus élevée (°C);
(Records établis sur la période du 01-06-1872 au 03-12-2017);
; 16.1; 21.4; 25.7; 30.2; 34.8; 37.6; 40.4; 39.5; 36.2; 28.9; 21.6; 17.1; 40.4;
Date ; 05-1999; 28-1960; 25-1955; 18-1949; 29-1944; 26-1947; 28-1947; 11-2003; 07-1895; 01-2011; 07-2015; 16-1989; 1947;
Température maximale (Moyenne en °C);
; 7.2; 8.3; 12.2; 15.6; 19.6; 22.7; 25.2; 25; 21.1; 16.3; 10.8; 7.5; 16;
Température moyenne (Moyenne en °C);
; 4.9; 5.6; 8.8; 11.5; 15.2; 18.3; 20.5; 20.3; 16.9; 13; 8.3; 5.5; 12.4;
Température minimale (Moyenne en °C);
; 2.7; 2.8; 5.3; 7.3; 10.9; 13.8; 15.8; 15.7; 12.7; 9.6; 5.8; 3.4; 8.9;
My first attempt didn't consider delimiters other than ';'. I used pd.read_table():
df = pd.read_table("./file.data", sep=';', index_col=0, skiprows=7, header=0, skip_blank_lines=True, skipinitialspace=True)
This is the result I got:
As you can see, nearly all indexes are shifted, creating empty rows and putting NaN as the index for the rows that actually contain the data I want.
I figured this is due to some delimiters looking like "; ;".
So I tried giving the sep parameter a regex that matches both cases, ensuring the use of the python engine:
df = pd.read_table("./file.data", sep=';(\s+;)?', index_col=0, skiprows=7, header=0, skip_blank_lines=True, skipinitialspace=True, engine='python')
But the result is unsatisfying, as you can see below (I took only a part of the dataframe, but the idea stays the same).
I've tried other slightly different regexes with similar results.
So I would basically like the labels indexing the empty rows to be shifted one row below. I didn't try directly modifying the files, for efficiency reasons, because I have around a thousand similar files to get into dataframes. For the same reason, I can't just rename the index, as some files won't have the same number of rows.
Is there a way to do this using pandas? Thanks a lot.
You could do your manipulations after import:
from io import StringIO
import numpy as np
import pandas as pd
datafile = StringIO("""FICHE CLIMATOLOGIQUE;
;
Statistiques 1981-2010 et records;
PARIS-MONTSOURIS (75) Indicatif : 75114001, alt : 75m, lat : 48°49'18"N, lon : 02°20'12"E;
Edité le : 18/12/2017 dans l'état de la base;
; Janv.; Févr.; Mars; Avril; Mai; Juin; Juil.; Août; Sept.; Oct.; Nov.; Déc.; Année;
La température la plus élevée (°C);
(Records établis sur la période du 01-06-1872 au 03-12-2017);
; 16.1; 21.4; 25.7; 30.2; 34.8; 37.6; 40.4; 39.5; 36.2; 28.9; 21.6; 17.1; 40.4;
Date ; 05-1999; 28-1960; 25-1955; 18-1949; 29-1944; 26-1947; 28-1947; 11-2003; 07-1895; 01-2011; 07-2015; 16-1989; 1947;
Température maximale (Moyenne en °C);
; 7.2; 8.3; 12.2; 15.6; 19.6; 22.7; 25.2; 25; 21.1; 16.3; 10.8; 7.5; 16;
Température moyenne (Moyenne en °C);
; 4.9; 5.6; 8.8; 11.5; 15.2; 18.3; 20.5; 20.3; 16.9; 13; 8.3; 5.5; 12.4;
Température minimale (Moyenne en °C);
; 2.7; 2.8; 5.3; 7.3; 10.9; 13.8; 15.8; 15.7; 12.7; 9.6; 5.8; 3.4; 8.9;""")
df = pd.read_table(datafile, sep=';', index_col=0, skiprows=7, header=0, skip_blank_lines=True, skipinitialspace=True)
# keep only the rows that actually contain data, and relabel them with the
# non-NaN index entries (position 1, the "(Records établis ...)" note, is skipped)
df1 = pd.DataFrame(df.values[~df.isnull().all(axis=1), :],
                   index=df.index.dropna()[np.r_[0, 2:6]],
                   columns=df.columns)
# drop the empty trailing column produced by the line-ending ';'
df_out = df1.dropna(how='all', axis=1)
print(df_out)
Output:
Janv. Févr. Mars Avril \
La température la plus élevée (°C) 16.1 21.4 25.7 30.2
Date 05-1999 28-1960 25-1955 18-1949
Température maximale (Moyenne en °C) 7.2 8.3 12.2 15.6
Température moyenne (Moyenne en °C) 4.9 5.6 8.8 11.5
Température minimale (Moyenne en °C) 2.7 2.8 5.3 7.3
Mai Juin Juil. Août \
La température la plus élevée (°C) 34.8 37.6 40.4 39.5
Date 29-1944 26-1947 28-1947 11-2003
Température maximale (Moyenne en °C) 19.6 22.7 25.2 25
Température moyenne (Moyenne en °C) 15.2 18.3 20.5 20.3
Température minimale (Moyenne en °C) 10.9 13.8 15.8 15.7
Sept. Oct. Nov. Déc. Année
La température la plus élevée (°C) 36.2 28.9 21.6 17.1 40.4
Date 07-1895 01-2011 07-2015 16-1989 1947
Température maximale (Moyenne en °C) 21.1 16.3 10.8 7.5 16
Température moyenne (Moyenne en °C) 16.9 13 8.3 5.5 12.4
Température minimale (Moyenne en °C) 12.7 9.6 5.8 3.4 8.9
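Since the question mentions around a thousand similar files, here is a hypothetical wrapper in the same spirit (the glob pattern and the load_fiche name are illustrative, not from the original answer, and the label filtering may need adapting per file):
import glob

def load_fiche(path):
    df = pd.read_table(path, sep=';', index_col=0, skiprows=7, header=0,
                       skip_blank_lines=True, skipinitialspace=True)
    keep = ~df.isnull().all(axis=1)        # rows that actually hold data
    labels = df.index.dropna()             # candidate row labels
    # drop parenthetical notes such as "(Records établis ...)" so labels
    # and data rows line up; this heuristic may vary from file to file
    labels = labels[~labels.str.startswith('(')]
    df1 = pd.DataFrame(df.values[keep, :], index=labels, columns=df.columns)
    return df1.dropna(how='all', axis=1)   # drop the empty trailing column

frames = {path: load_fiche(path) for path in glob.glob('./*.data')}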
The following piece of code...
data = np.array([['','state','zip_code','collection_status'],
['42394','CA','92637-2854', 'NaN'],
['58955','IL','60654', 'NaN'],
['108365','MI','48021-1319', 'NaN'],
['109116','MI','48228', 'NaN'],
['110833','IL','60008-4227', 'NaN']])
print(pd.DataFrame(data=data[1:,1:],
index=data[1:,0],
columns=data[0,1:]))
... gives the following data frame:
state zip_code collection_status
42394 CA 92637-2854 NaN
58955 IL 60654 NaN
108365 MI 48021-1319 NaN
109116 MI 48228 NaN
110833 IL 60008-4227 NaN
The goal is to homogenise the "zip_code" column into a 5-digit format, i.e. I want to remove the last four digits from zip_code when that particular data point has 9 digits instead of 5. BTW, zip_code's dtype is "object".
Any idea?
Use indexing with str only (thanks, John Galt):
df['collection_status'] = df['zip_code'].str[:5]
print (df)
state zip_code collection_status
42394 CA 92637-2854 92637
58955 IL 60654 60654
108365 MI 48021-1319 48021
109116 MI 48228 48228
110833 IL 60008-4227 60008
If you need to add conditions, use where or numpy.where:
df['collection_status'] = df['zip_code'].where(df['zip_code'].str.len() == 5,
                                               df['zip_code'].str[:5])
print (df)
state zip_code collection_status
42394 CA 92637-2854 92637
58955 IL 60654 60654
108365 MI 48021-1319 48021
109116 MI 48228 48228
110833 IL 60008-4227 60008
df['collection_status'] = np.where(df['zip_code'].str.len() == 5,
                                   df['zip_code'],
                                   df['zip_code'].str[:5])
print (df)
state zip_code collection_status
42394 CA 92637-2854 92637
58955 IL 60654 60654
108365 MI 48021-1319 48021
109116 MI 48228 48228
110833 IL 60008-4227 60008
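If the longer form is always five digits, a hyphen, and four digits, an equivalent sketch that splits on the hyphen should work too (my addition, not part of the original answer):
# keep everything before the first hyphen; plain 5-digit codes pass through unchanged
df['collection_status'] = df['zip_code'].str.split('-').str[0]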