Splitting columns in a DataFrame based on multiple conditions - python

I have a df with multiple columns. I need to separate one of those columns into two: one column holding the ID and another holding the description.
For example, on row 34:
df['cpv'][34] = '45232460-4 - Obras de saneamento'
I would need to obtain a column cpvid equal to 45232460-4 and a column cpvdescription equal to Obras de saneamento.
This would be fairly easy to do with a string split.
However, there are some cases where
df['cpv'][45] = '45112500-0 - Movimento de terras | 45232411-6 - Construção de condutas para águas residuais | 45232423-3 - Construção de estações de bombagem de águas residuais'
meaning there are multiple IDs and multiple descriptions in the same row. I was wondering whether there is an efficient way to split the column on more than one delimiter: the first delimiter would be ' - ' and the second would be '|'.
Could anyone please assist? I'm still a newbie; I tried to find similar posts, but none seem to fit my desired output.
Thanks!

If you want the long format, you can make use of a string split combined with the explode method (I've created a dummy df based on your data):
df = pd.DataFrame({
    'cpv': ['45232460-4 - Obras de saneamento',
            '45112500-0 - Movimento de terras | 45232411-6 - Construção de condutas para águas residuais | 45232423-3 - Construção de estações de bombagem de águas residuais'],
    'val': [1, 2]
})
df = df.assign(cpv=df.cpv.str.split(r' \| ')).explode('cpv')
df = pd.concat([df, df.cpv.str.split(r' - ', expand=True).rename(columns={0:'cpvid', 1:'cpvdescription'})], axis=1).drop('cpv', axis=1)
print(df)
val cpvid cpvdescription
0 1 45232460-4 Obras de saneamento
1 2 45112500-0 Movimento de terras
1 2 45232411-6 Construção de condutas para águas residuais
1 2 45232423-3 Construção de estações de bombagem de águas re...
If you want the wide format you can try:
df = pd.DataFrame({
    'cpv': ['45232460-4 - Obras de saneamento',
            '45112500-0 - Movimento de terras | 45232411-6 - Construção de condutas para águas residuais | 45232423-3 - Construção de estações de bombagem de águas residuais'],
    'val': [1, 2]
})
cpv_df = pd.DataFrame(df.assign(cpv=df.cpv.str.split(r' \| ')).cpv.to_list())
df = pd.concat([df]+[cpv_df[col].str.split(r' - ', expand=True).rename(columns={0:f'cpvid_{col}', 1:f'cpvdescription_{col}'}) for col in cpv_df], axis=1).drop('cpv', axis=1)
print(df)
val cpvid_0 cpvdescription_0 cpvid_1 \
0 1 45232460-4 Obras de saneamento None
1 2 45112500-0 Movimento de terras 45232411-6
cpvdescription_1 cpvid_2 \
0 None None
1 Construção de condutas para águas residuais 45232423-3
cpvdescription_2
0 None
1 Construção de estações de bombagem de águas re...
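For the long format, an equivalent sketch (using the same dummy data, trimmed to two entries) lets str.extract with named groups create both columns in one pass; the regex assumes the ID contains no spaces:

```python
import pandas as pd

df = pd.DataFrame({
    'cpv': ['45232460-4 - Obras de saneamento',
            '45112500-0 - Movimento de terras | 45232411-6 - Construção de condutas para águas residuais'],
    'val': [1, 2]
})

# Split on " | " and explode to one CPV entry per row, then pull the ID and
# the description apart with one regex (the ID contains no spaces).
long = df.assign(cpv=df['cpv'].str.split(r' \| ')).explode('cpv').reset_index(drop=True)
parts = long['cpv'].str.extract(r'(?P<cpvid>\S+) - (?P<cpvdescription>.+)')
long = pd.concat([long.drop(columns='cpv'), parts], axis=1)
print(long)
```

The named groups in the pattern become the new column names, so no rename step is needed.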

Related

Transpose rows using groupby with multiple values

I have this below DataFrame:
df = pd.DataFrame({'DC': ['Alexandre', 'Alexandre', 'Afonso de Sousa', 'Afonso de Sousa', 'Afonso de Sousa'],
                   'PN': ['Farmacia Ekolelo', 'Farmacia Havemos De Voltar', 'Farmacia Gloria', 'Farmacia Mambofar', 'Farmacia Metamorfose'],
                   'PC': ['C-HO-002815', 'C-HO-005192', 'C-HO-002719', 'C-HO-003030', 'C-SCC-012430'],
                   'KCP': ['NA', 'DOMINGAS PAULO', 'HELDER', 'Mambueno', 'RITA'],
                   'MBN': ['NA', 'NA', 29295486345, 9.40407E+11, 2.92955E+11]})
I'm trying to convert the data into a wide format: grouping by the DC column, the other columns need to be transposed into one row per DC value.
You can group by DC and then aggregate with list. From there you can concat the dataframes created from the aggregated lists:
import pandas as pd

df = pd.DataFrame({'DC': ['Alexandre', 'Alexandre', 'Afonso de Sousa', 'Afonso de Sousa', 'Afonso de Sousa'],
                   'PN': ['Farmacia Ekolelo', 'Farmacia Havemos De Voltar', 'Farmacia Gloria', 'Farmacia Mambofar', 'Farmacia Metamorfose'],
                   'PC': ['C-HO-002815', 'C-HO-005192', 'C-HO-002719', 'C-HO-003030', 'C-SCC-012430'],
                   'KCP': ['NA', 'DOMINGAS PAULO', 'HELDER', 'Mambueno', 'RITA'],
                   'MBN': ['NA', 'NA', 29295486345, 9.40407E+11, 2.92955E+11]})
df = df.groupby('DC', as_index=False).agg(list)
#print(df)
df_out = pd.concat(
    [df[['DC']]] +
    [pd.DataFrame(l := df[col].to_list(),
                  columns=[f'{col}_{i}' for i in range(1, max(len(s) for s in l) + 1)])
     for col in df.columns[1:]],
    axis=1)
Note: the assignment in the comprehension l := df[col].to_list() only works for Python versions >= 3.8.
This will give you:
DC PN_1 PN_2 PN_3 ... KCP_3 MBN_1 MBN_2 MBN_3
0 Afonso de Sousa Farmacia Gloria Farmacia Mambofar Farmacia Metamorfose ... RITA 29295486345 940407000000.0 2.929550e+11
1 Alexandre Farmacia Ekolelo Farmacia Havemos De Voltar None ... None NA NA NaN
You can then sort the columns with your own function:
def sort_columns(col_lbl):
    col, ind = col_lbl.split('_')
    return (int(ind), df.columns.to_list().index(col))

df_out = df_out[['DC'] + sorted(df_out.columns[1:].to_list(), key=sort_columns)]
Selecting the columns in the sorted order moves the data along with the labels (assigning the sorted list to df_out.columns would only relabel the columns in place).
Output:
                DC              PN_1         PC_1   KCP_1  ...
0  Afonso de Sousa   Farmacia Gloria  C-HO-002719  HELDER  ...
1        Alexandre  Farmacia Ekolelo  C-HO-002815      NA  ...
(the remaining columns continue as MBN_1, PN_2, PC_2, ...)
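An alternative sketch for the same reshaping uses groupby().cumcount() to number the rows within each DC group and pivot to spread them into columns. Note the column order comes out grouped by field (PN_1, PN_2, PC_1, ...) rather than interleaved, so a sorting step like the one above would still apply. Shown on a trimmed version of the data:

```python
import pandas as pd

df = pd.DataFrame({'DC': ['Alexandre', 'Alexandre', 'Afonso de Sousa'],
                   'PN': ['Farmacia Ekolelo', 'Farmacia Havemos De Voltar', 'Farmacia Gloria'],
                   'PC': ['C-HO-002815', 'C-HO-005192', 'C-HO-002719']})

# Number the rows within each DC group, pivot so each occurrence becomes
# its own set of columns, then flatten the resulting MultiIndex.
df['n'] = df.groupby('DC').cumcount() + 1
wide = df.pivot(index='DC', columns='n', values=['PN', 'PC'])
wide.columns = [f'{col}_{i}' for col, i in wide.columns]
wide = wide.reset_index()
print(wide)
```

Because cumcount makes each (DC, n) pair unique, pivot never hits its duplicate-entry error here.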

Counting the characters of dictionary values inside a dataframe column

After downloading Facebook data, they provide json files with your post information. I read the json and dataframe with pandas. Now I want to count the characters of every post I made. The posts are in: df['data'] like: [{'post': 'Happy bday Raul'}].
I want the output to be the count of characters of: "Happy bday Raul" which will be 15 in this case or 7 in the case of "Morning" from [{'post': 'Morning'}].
df=pd.read_json('posts_1.json')
The columns are Date and Data with this format:
Date Data
01-01-2020 [{'post': 'Morning'}]
10-03-2020 [{'post': 'Happy bday Raul'}]
17-03-2020 [{'post': 'This lockdown is sad'}]
I tried to count the characters of [{'post': 'Morning'}] by doing
df['count'] = df['data'].str.len()
but it doesn't work: the result is 1, because it counts the elements of the outer list instead of the characters of the post.
I need to extract the value of the dictionary and apply len to count the characters. The output should be:
Date Data COUNT
01-01-2020 [{'post': 'Morning'}] 7
10-03-2020 [{'post': 'Happy bday Raul'}] 15
17-03-2020 [{'post': 'This lockdown is sad'}] 20
EDIT:
I used to_dict():
df11 = df_post['data'].to_dict()
Output
{0: [{'post': 'Feliz cumpleaños Raul'}],
1: [{'post': 'Muchas felicidades Tere!!! Espero que todo vaya genial y siga aún mejor! Un beso desde la Escandinavia profunda'}],
2: [{'post': 'Hola!\nUna investigadora vendrá a finales de mayo, ¿Alguien tiene una habitación libre en su piso para ella? Many Thanks!'}],
3: [{'post': '¿Cómo va todo? Se que muchos estáis o estábais por Galicia :D\n\nOs recuerdo, el proceso de Matriculación tiene unos plazos concretos: desde el lunes 13 febrero hasta el viernes 24 de febrero.'}]
}
You can access the value of the post key for each row using a list comprehension and count the length with str.len(). In one line of code, it would look like this:
df[1] = pd.Series([x['post'] for x in df[0]]).str.len()
This would also work, but I think it would be slower to execute:
df[1] = df[0].apply(lambda x: x['post']).str.len()
Full reproducible code below:
df = pd.DataFrame({
    0: [{'post': 'Feliz cumpleaños Raul'}],
    1: [{'post': 'Muchas felicidades Tere!!! Espero que todo vaya genial y siga aún mejor! Un beso desde la Escandinavia profunda'}],
    2: [{'post': 'Hola!\nUna investigadora vendrá a finales de mayo, ¿Alguien tiene una habitación libre en su piso para ella? Many Thanks!'}],
    3: [{'post': '¿Cómo va todo? Se que muchos estáis o estábais por Galicia :D\n\nOs recuerdo, el proceso de Matriculación tiene unos plazos concretos: desde el lunes 13 febrero hasta el viernes 24 de febrero.'}]
})
df = df.T
df[1] = [x['post'] for x in df[0]]
df[2] = df[1].str.len()
df
Out[1]:
0 \
0 {'post': 'Feliz cumpleaños Raul'}
1 {'post': 'Muchas felicidades Tere!!! Espero qu...
2 {'post': 'Hola!
Una investigadora vendrá a fi...
3 {'post': '¿Cómo va todo? Se que muchos está...
1 2
0 Feliz cumpleaños Raul 22
1 Muchas felicidades Tere!!! Espero que todo vay... 112
2 Hola!\nUna investigadora vendrá a finales de ... 123
3 ¿Cómo va todo? Se que muchos estáis o está... 195
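As a side note, the .str accessor also indexes lists and dicts element-wise, so the original column of [{'post': ...}] lists from the question can be handled without a comprehension; a sketch on a reconstructed frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['01-01-2020', '10-03-2020', '17-03-2020'],
    'data': [[{'post': 'Morning'}],
             [{'post': 'Happy bday Raul'}],
             [{'post': 'This lockdown is sad'}]],
})

# .str[0] takes the first element of each list, .str.get('post') looks up
# the dict key, and the final .str.len() counts the characters.
df['count'] = df['data'].str[0].str.get('post').str.len()
print(df)
```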

How to compare two dataframes by a key and create a new one, keeping only the keys of the first that are not in the second?

In python 3 and pandas I have two dataframes with the same structure:
data_1 = {
    'numero_cnj': ['0700488-61.2018.8.07.0017', '0003557-92.2008.4.01.3801', '1009486-37.2017.8.26.0053', '5005742-49.2017.4.04.9999', '0700488-61.2018.8.07.0017'],
    'nome_normalizado': ['MARIA DOS REIS DE OLIVEIRA SILVA', 'MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA', 'SAO PAULO PREVIDENCIA - SPPREV', 'INSTITUTO NACIONAL DO SEGURO SOCIAL', 'GERALDO CAVALCANTE DA SILVEIRA']
}
df_1 = pd.DataFrame(data_1)

data_2 = {
    'numero_cnj': ['0700488-61.2018.8.07.0017', '5005742-49.2017.4.04.9999', '1009486-37.2017.8.26.0053', '0700488-61.2018.8.07.0017'],
    'nome_normalizado': ['MARIA DOS REIS DE OLIVEIRA SILVA', 'INSTITUTO NACIONAL DO SEGURO SOCIAL', 'SAO PAULO PREVIDENCIA - SPPREV', 'GERALDO CAVALCANTE DA SILVEIRA']
}
df_2 = pd.DataFrame(data_2)
The "numero_cnj" column is an identifying key for the same item, but it can be repeated because more than one person/name can refer to that item.
I want to compare the two dataframes by the key numero_cnj and create a new dataframe from df_1, keeping only the rows whose keys are in df_1 but not in df_2, i.e. keep all keys from df_1 that were not found in df_2.
For example
df_1
numero_cnj nome_normalizado
0 0700488-61.2018.8.07.0017 MARIA DOS REIS DE OLIVEIRA SILVA
1 0003557-92.2008.4.01.3801 MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
2 1009486-37.2017.8.26.0053 SAO PAULO PREVIDENCIA - SPPREV
3 5005742-49.2017.4.04.9999 INSTITUTO NACIONAL DO SEGURO SOCIAL
4 0700488-61.2018.8.07.0017 GERALDO CAVALCANTE DA SILVEIRA
df_2
numero_cnj nome_normalizado
0 0700488-61.2018.8.07.0017 MARIA DOS REIS DE OLIVEIRA SILVA
1 5005742-49.2017.4.04.9999 INSTITUTO NACIONAL DO SEGURO SOCIAL
2 1009486-37.2017.8.26.0053 SAO PAULO PREVIDENCIA - SPPREV
3 0700488-61.2018.8.07.0017 GERALDO CAVALCANTE DA SILVEIRA
In this case, the new dataframe would have only the line:
0003557-92.2008.4.01.3801 MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
Please, does anyone know the best strategy to do this?
If I'm reading your question correctly, you should use merge with how="outer" and indicator=True, then keep the left-only rows:
merge = pd.merge(df_1, df_2, on = "numero_cnj", suffixes = ["", "_y"], how = "outer", indicator=True)
merge[merge._merge == "left_only"][["numero_cnj", "nome_normalizado"]]
The output is:
numero_cnj nome_normalizado
4 0003557-92.2008.4.01.3801 MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
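If the extra _y columns and indicator bookkeeping aren't needed, the same left-only filter can be sketched with isin on the key column (shown on a trimmed version of the data):

```python
import pandas as pd

df_1 = pd.DataFrame({
    'numero_cnj': ['0700488-61.2018.8.07.0017', '0003557-92.2008.4.01.3801'],
    'nome_normalizado': ['MARIA DOS REIS DE OLIVEIRA SILVA',
                         'MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA'],
})
df_2 = pd.DataFrame({
    'numero_cnj': ['0700488-61.2018.8.07.0017'],
    'nome_normalizado': ['MARIA DOS REIS DE OLIVEIRA SILVA'],
})

# Keep only the df_1 rows whose key never occurs in df_2.
result = df_1[~df_1['numero_cnj'].isin(df_2['numero_cnj'])]
print(result)
```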

TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid

My data, which is called car_A:
Source
0 CAULAINCOURT
1 MARCHE DE L'EUROPE
2 AU MAIRE
I would like to find all paths from sources to destinations, something like:
Source Destination
0 CAULAINCOURT MARCHE DE L'EUROPE
2 CAULAINCOURT AU MAIRE
3 MARCHE DE L'EUROPE AU MAIRE
.
.
.
I have already tried
for i in car_A['Names']:
    for j in range(len(car_A)-1):
        car_A = car_A.append(car_A.iloc[j+1, 0])
But I got:
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
How can I obtain the dataset described above?
A small variation on the fine answer from @James: itertools.permutations skips the self-pairs for you.
import pandas as pd
from itertools import permutations
df = pd.DataFrame({'sources': [
"CAULAINCOURT",
"MARCHE DE L'EUROPE",
"AU MAIRE"
]})
df_pairs = pd.DataFrame(
[x for x in permutations(df.sources, 2)],
columns=['source', 'dest'])
df_pairs
# returns
source dest
0 CAULAINCOURT MARCHE DE L'EUROPE
1 CAULAINCOURT AU MAIRE
2 MARCHE DE L'EUROPE CAULAINCOURT
3 MARCHE DE L'EUROPE AU MAIRE
4 AU MAIRE CAULAINCOURT
5 AU MAIRE MARCHE DE L'EUROPE
Another solution, using DataFrame.merge():
import pandas as pd
df = pd.DataFrame({'Source': [
"CAULAINCOURT",
"MARCHE DE L'EUROPE",
"AU MAIRE"
]})
df = (df.assign(key=1)
        .merge(df.assign(key=1), on='key')
        .drop('key', axis=1)
        .rename(columns={'Source_x': 'Source', 'Source_y': 'Destination'}))
df = df[df.Source != df.Destination]
print(df)
Prints:
Source Destination
1 CAULAINCOURT MARCHE DE L'EUROPE
2 CAULAINCOURT AU MAIRE
3 MARCHE DE L'EUROPE CAULAINCOURT
5 MARCHE DE L'EUROPE AU MAIRE
6 AU MAIRE CAULAINCOURT
7 AU MAIRE MARCHE DE L'EUROPE
You can use itertools.product to build a set of all of the pairs, filter to remove when the source and destination are the same location, and then construct a new data frame.
import pandas as pd
from itertools import product
df = pd.DataFrame({'sources': [
"CAULAINCOURT",
"MARCHE DE L'EUROPE",
"AU MAIRE"
]})
df_pairs = pd.DataFrame(
filter(lambda x: x[0]!=x[1], product(df.sources, df.sources)),
columns=['source', 'dest']
)
df_pairs
# returns:
source dest
0 CAULAINCOURT MARCHE DE L'EUROPE
1 CAULAINCOURT AU MAIRE
2 MARCHE DE L'EUROPE CAULAINCOURT
3 MARCHE DE L'EUROPE AU MAIRE
4 AU MAIRE CAULAINCOURT
5 AU MAIRE MARCHE DE L'EUROPE
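As a side note, on pandas 1.2 or newer the key=1 trick in the merge-based answer isn't needed: merge accepts how='cross' directly. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Source': ["CAULAINCOURT", "MARCHE DE L'EUROPE", "AU MAIRE"]})

# Cross-join the column with itself (pandas >= 1.2), then drop self-pairs.
pairs = df.merge(df.rename(columns={'Source': 'Destination'}), how='cross')
pairs = pairs[pairs['Source'] != pairs['Destination']].reset_index(drop=True)
print(pairs)
```

Renaming the right-hand copy before the merge avoids the _x/_y suffix cleanup.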

How to fix this regex in order to preserve a given id order?

I have this large string:
s = '''Vaya ir VMM03S0 0.427083
mañanita mañana RG 0.796611
, , Fc 1
buscando buscar VMG0000 1
la lo PP3FSA00 0.0277039
encontramos encontrar VMIP1P0 0.65
. . Fp 1
Pero pero CC 0.999764
vamos ir VMIP1P0 0.655914
a a SPS00 0.996023
lo el DA0NS0 0.457533
que que PR0CN000 0.562517
interesa interesar VMIP3S0 0.994868
LO_QUE_INTERESA_La lo_que_interesa_la NP00000 1
lavadora lavador AQ0FS0 0.585262
tiene tener VMIP3S0 1
una uno DI0FS0 0.951575
clasificación clasificación NCFS000 1
A+ a+ NP00000 1
, , Fc 1
de de SPS00 0.999984
las el DA0FP0 0.970954
que que PR0CN000 0.562517
ahorran ahorrar VMIP3P0 1
energía energía NCFS000 1
, , Fc 1
si si CS 0.99954
me me PP1CS000 0.89124
no no RN 0.998134
equivoco equivocar VMIP1S0 1
. . Fp 1
Lava lavar VMIP3S0 0.397388
hasta hasta SPS00 0.957698
7 7 Z 1
kg kilogramo NCMN000 1
, , Fc 1
no no RN 0.998134
está estar VAIP3S0 0.999201
, , Fc 1
se se P00CN000 0.465639
le le PP3CSD00 1
veía ver VMII3S0 0.62272
un uno DI0MS0 0.987295
gran gran AQ0CS0 1
tambor tambor NCMS000 1
( ( Fpa 1
de de SPS00 0.999984
acero acero NCMS000 0.973481
inoxidable inoxidable AQ0CS0 1
) ) Fpt 1
y y CC 0.999962
un uno DI0MS0 0.987295
error error NCFSD23 0.234930
error error VMDFG34 0.98763
consumo consumo NCMS000 0.948927
máximo máximo AQ0MS0 0.986111
de de SPS00 0.999984
49 49 Z 1
litros litro NCMP000 1
error error DI0S3DF 1
Mandos mandos NP00000 1
intuitivos intuitivo AQ0MP0 1
, , Fc 1
todo todo PI0MS000 0.43165
muy muy RG 1
bien bien RG 0.902728
explicado explicar VMP00SM 1
, , Fc 1
jamas jamas RG 0.343443
nada nada PI0CS000 0.850279
que que PR0CN000 0.562517
de de SPS00 0.999984
nunca nunca RG 0.903
casa casa NCFS000 0.979058
de de SPS00 0.999984
mis mi DP1CPS 0.995868
error error VM9032 0.234323
string string VMWEOO 0.03444
padres padre NCMP000 1
Además además NP00000 1
incluye incluir VMIP3S0 0.994868
la el DA0FS0 0.972269
tecnología tecnología NCFS000 1
error errpr RG2303 1
Textileprotec textileprotec NP00000 1
que que PR0CN000 0.562517
protege proteger VMIP3S0 0.994868
nuestras nuestro DP1FPP 0.994186
ninguna ninguno DI0S3DF 0.345344
falla falla NCFSD23 1
prendas prenda NCFP000 0.95625
más más RG 1
preciadas preciar VMP00PF 1
jamas jamas RG2303 1
string string VM9032 0.234323
nunca nunca RG 0.293030
string string VM 0.902333
no no RN
le le PP004DF 0.390230
falla fallar VM0FD00 0.99033
. . Fp 1'''
I would like to extract into a list the second word of each line together with its ID, for lines whose IDs follow the pattern RN..., PP..., VM.... These IDs must appear on consecutive lines. For example:
no no RN 0.90383
le le PPSDF23 0.902339
falla fallar VM00DKE 0.9045
This is the pattern I would like to match. Since these lines are consecutive and the IDs appear in the order RN, PP, VM, this should be the output given the string s:
[('no RN', 'le PP004DF', 'fallar VM0FD00')]
This is what I tried:
together__ = re.findall(r'(?s)(\w+\s+RN)(?:(?!\s(?:RN|PP|VM)).)*?(\w+\s+PP\w+)(?:(?!\s(?:RN|PP|VM)).)*?(\w+\s+VM\w+)', s)
but I get this with the above regex:
print(together__)
output:
[('no RN', 'le PP3CSD00', 'ver VMII3S0'), ('no RN', 'le PP004DF', 'fallar VM0FD00')]
This is wrong, since in the first tuple the matched IDs (RN, PP, VM) are not on consecutive lines of the string s. How can I fix this regex? Thanks in advance, guys.
You can do this simply with:
matches = re.findall(r'\n?\s*\S+\s+(\w+\W+RN\w*)[^\n]*[^\n]*?\n\s*\S+\s+(\w+\W+PP\w*)[^\n]*[^\n]*?\n\s*\S+\s+(\w+\W+VM\w*)[^\n]*', s)
Resulting in:
[('no RN', 'le PP004DF', 'fallar VM0FD00')]
The earlier candidate ('no RN', 'le PP3CSD00', 'ver VMII3S0') is excluded because those lines are not consecutive.
In case you don't get a decent answer soon enough, I think that I may be close to what you want with this:
re.findall(r'\n[^\n]*?\s(.*?\sRN[^\n]*)\n[^\n]*?\s(.*?\sPP[^\n]*)\n[^\n]*?\s(.*?\sVM[^\n]*)\n', s)
I don't know why I get both words in the first item... it might be because of the "#", and I did not bother cutting after RN, PP and VM. I guess that if the first step works decently enough, the rest can easily be fixed in post-processing.
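If the multiline regex stays stubborn, a non-regex sketch is to split the text into token lines and scan every window of three consecutive lines for tags starting with RN, PP and VM; shown here on a hypothetical three-line excerpt of s:

```python
s = '''no no RN 0.998134
le le PP004DF 0.390230
falla fallar VM0FD00 0.99033'''  # hypothetical three-line excerpt of the full string

# Tokenize each line into (word, lemma, tag, ...) and collect every run of
# three consecutive lines whose tags start with RN, PP and VM respectively.
rows = [line.split() for line in s.splitlines() if line.strip()]
triples = []
for a, b, c in zip(rows, rows[1:], rows[2:]):
    if (a[2].startswith('RN') and b[2].startswith('PP')
            and c[2].startswith('VM')):
        triples.append((f'{a[1]} {a[2]}', f'{b[1]} {b[2]}', f'{c[1]} {c[2]}'))
print(triples)
```

Because the scan only ever looks at adjacent lines, non-consecutive RN/PP/VM runs can never slip into the result.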
