Python pandas 'reverse' split in a dataframe - python

I have a dataframe with a column called details that contains data like this:
130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -
To get the first value, 130, I did this:
df['superficie'] = df['details'].str.split('m²').str[0]
This gives me 130 in a new column called "superficie".
For the second value I did this:
df['nbPieces']= (df['details'].str.split('-').str[1].str.split('Pièces').str[0])
This gives me 3 in a new column called "nbPieces".
My problem is getting the 2 from "Chambres", the 2 from "Salles de bains", and the 20-30 next to "ans". How can I do that? I need to add them as new columns (nbChambre, nbSalleDeBain, NbAnnee).
Thanks in advance.

I suggest using regular expressions in pandas for this kind of operation:
import pandas as pd
df = pd.DataFrame()
df['details'] = ["130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -"]
df['nb_chbr'] = df['details'].str.split(" - ").str[2].str.findall(r'\d+').str[0].astype('int64')
df['nb_sdb'] = df['details'].str.split(" - ").str[3].str.findall(r'\d+').str[0].astype('int64')
df['nb_annee'] = df['details'].str.split(" - ").str[5].str.findall(r'\d+').str[0].astype('int64')
print(df)
Output:
details nb_chbr nb_sdb nb_annee
0 130 m² - 3 Pièces - 2 Chambres - 2 Salles de b... 2 2 20
Note that I used " - " as the split string, which returns a cleaner list in your case. For the "Nombre d'années" case I simply took the first integer that appears in that piece (20 from "20-30"); I don't know if that suits you.
Finally, there may be a problem in your data: 2 chambres and 2 salles de bains should make a 4 pièces flat ^^
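If the fields do not always appear in the same position, splitting by index can pick the wrong piece. A possible alternative, sketched below, is to anchor each pattern on its label with str.extract. The column names follow the ones requested in the question; the patterns assume the labels are spelled exactly as in the sample row, and keeping the age range as the string "20-30" is a choice made here, not something the question specifies.

```python
import pandas as pd

df = pd.DataFrame({
    "details": [
        "130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -"
    ]
})

# Each pattern captures the digits that sit right before a known label,
# so the order of the fields inside the string does not matter.
chambres = df["details"].str.extract(r"(\d+)\s*Chambres?", expand=False)
sdb = df["details"].str.extract(r"(\d+)\s*Salles? de bains?", expand=False)

# pd.to_numeric converts the captured strings to numbers
# (rows without a match would become NaN).
df["nbChambre"] = pd.to_numeric(chambres)
df["nbSalleDeBain"] = pd.to_numeric(sdb)

# Assumption: the age range is kept as text, e.g. "20-30".
df["NbAnnee"] = df["details"].str.extract(r"(\d+-\d+)\s*ans", expand=False)

print(df[["nbChambre", "nbSalleDeBain", "NbAnnee"]])
```

Rows where a label is missing simply get a missing value instead of raising an error, which is usually easier to handle than an index that points at the wrong field.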

Related

group by a string column and datetime64[ns] column

I have the following data and I would like to know: who were the first and last customers that each driver picked up on each day?
Data
This is how far I have got:
#Import libraries
import pandas as pd
import numpy as np
#Open and clean the data
df = pd.read_csv('Data.csv')
df = df.drop(['Cod'], axis=1)
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
#The code below tries to answer the following question:
#Who was the first and last customer that each Driver picked-up for each day?
#link to access the data: https://drive.google.com/file/d/194byxNkgr2e9r-IOEmSuu9gpZyw27G7j/view?usp=sharing
#unique() returns the driver names (value_counts() would iterate over the counts)
unique_drivers = df['Driver'].unique()
for driver in unique_drivers:
    d = df.groupby('Driver').get_group(driver)
    #use positional access; the group's index labels do not start at 0
    time = d['Start'].iloc[0]
    first_customer = d['Customer'].iloc[0]
    end = d['End'].iloc[0]
    last_customer = d['Customer'].iloc[-1]
First sort by the Start column, which includes the hour and minutes, so that multiple events on the same day are ordered correctly for the next step. Then group the frame by Driver to find each driver's pick-ups for each day.
Use drop_duplicates on the date with keep="first" to preserve only the first row for each day, and with keep="last" to preserve only the last row (from each cluster of repeated dates). This gives, for each driver, the first and the last pick-up of each day; then use the index of those rows on the Customer column to get the customers' names.
import pandas as pd
df = pd.read_csv("data.csv")
print(df)
# sort including HH:MM ("Start" is still a string here, so this assumes a format that sorts correctly as text)
df = df.sort_values("Start")
drivers_df = []
for gname, group in df.groupby("Driver"):
    dn = pd.DataFrame()
    # split on whitespace to get date and time in two columns
    ts = group["Start"].str.split(expand=True)
    # remove duplicate days, keeping the first occurrence
    t_first = ts.drop_duplicates(subset=[0], keep="first")
    # remove duplicate days, keeping the last occurrence
    t_last = ts.drop_duplicates(subset=[0], keep="last")
    dn["Date"] = t_first[0]
    dn["Driver"] = gname
    dn["Num_Customers"] = ts[0].groupby(ts[0]).count().values
    # use the previously obtained indices on the "Customer" column
    dn["First_Customer"] = df.loc[t_first.index, "Customer"].values
    dn["Last_Customer"] = df.loc[t_last.index, "Customer"].values
    drivers_df.append(dn)
dn = pd.concat(drivers_df)
# sort by date; remove this line to keep the rows ordered by driver's name
dn = dn.sort_values("Date")
dn = dn.reset_index(drop=True)
print(dn)
Output from dn
Date Driver Num_Customers First_Customer Last_Customer
0 5/10/2020 Javier Pulgar 1 100998 - MARA MIRIAN BEATRIZ 100998 - MARA MIRIAN BEATRIZ
1 5/10/2020 Santiago Muruaga 1 103055 - ZANOTTO VALERIA 103055 - ZANOTTO VALERIA
2 5/10/2020 Martín Gutierrez 1 105645 - PAWLIW MARTA SOFI... 105645 - PAWLIW MARTA SOFI...
3 5/10/2020 Pablo Aguilar 2 102737 - GONZALVE DE ACEVEDO 102737 - GONZALVE DE ACEVEDO
4 5/10/2020 Carlos Medina 1 102750 - COOP.DE TRABAJO 102750 - COOP.DE TRABAJO
5 5/11/2020 Facundo Papaleo 6 101209 - FARMACIA NAZCA 2602 105093 - BIO HERLPER
6 5/11/2020 Franco Chiarrappa 15 100288 - SAVINI LUCIANA MARIA 102690 - GIOIA ELIZABETH
7 5/11/2020 Hernán Navarro 14 106367 - FARMACIA BERAPHAR... 102631 - SPALVIERI MARINA
8 5/11/2020 Pablo Aguilar 9 102510 - CAZADORFARM SCS 101482 - JOAQUIN MARCIAL
9 5/11/2020 Daniel Godino 7 103572 - GIRALDEZ ALICIA OLGA 103363 - CADELLI ROBERTO JOSE
10 5/11/2020 Hernán Urquiza 1 105323 - GARCIA GERMAN REI... 105323 - GARCIA GERMAN REI...
11 5/11/2020 Héctor Naselli 19 103545 - FARMACIA DESANTI 102257 - FARMA NUOVA S.C.S.
12 5/11/2020 Santiago Muruaga 12 101735 - ALEGRE LEONARDO 500014 - Drogueria DIMEC
13 5/11/2020 Javier Pulgar 2 101009 - MIGUEL ANGEL MARA 103462 - DRAGONE CARLOS AL...
14 5/11/2020 Atilano Aguilera 1 104003 - FARMACIA SANTA 104003 - FARMACIA SANTA
15 5/11/2020 Muletto 3 101359 - FARMACIA COSENTINO 105886 - NEGRI GREGORIO
16 5/11/2020 Martín Venturino 8 102587 - JANISZEWSKI MATIL... 102672 - BORSOTTI GUSTAVO
17 5/11/2020 Martín Gutierrez 1 105645 - PAWLIW MARTA SOFI... 105645 - PAWLIW MARTA SOFI...
18 5/11/2020 José Vallejos 13 102229 - LANDRIEL MARIA LO... 105721 - SOSA NANCY EDITH ...
19 5/11/2020 Edgardo Andrade 9 101524 - FARMACIA M Y A 101217 - MARISA TESORO
20 5/11/2020 Carlos Medina 14 105126 - QUISPE MURILLO RODY 100538 - MAXIMILIANO CAMPO...
21 5/11/2020 Javier Torales 1 200666 - CLINICA BOEDO SRL 200666 - CLINICA BOEDO SRL
22 5/12/2020 Hernán Urquiza 8 105293 - BENSAK MARIA EUGENIA 103005 - BONVISSUTO SANDRA
23 5/12/2020 Miguel Quilici 17 102918 - BRITO NICOLAS 102533 - SAMPEDRO PURA
...
...
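The same summary can probably be obtained without an explicit loop, by parsing Start as a datetime and grouping on driver and calendar date. The sketch below uses a small made-up frame (the driver and customer names are invented for illustration, not taken from the linked file) and named aggregation with "first" and "last".

```python
import pandas as pd

# Made-up sample data standing in for the real CSV.
df = pd.DataFrame({
    "Driver": ["Ana", "Ana", "Ana", "Luis", "Luis"],
    "Customer": ["C1", "C2", "C3", "C4", "C5"],
    "Start": pd.to_datetime([
        "2020-05-11 09:30", "2020-05-11 08:00", "2020-05-11 17:00",
        "2020-05-11 10:00", "2020-05-12 07:45",
    ]),
})

summary = (
    df.sort_values("Start")                        # chronological order
      .assign(Date=lambda d: d["Start"].dt.date)   # calendar day of each pick-up
      .groupby(["Driver", "Date"])
      .agg(
          Num_Customers=("Customer", "count"),
          First_Customer=("Customer", "first"),    # earliest pick-up of the day
          Last_Customer=("Customer", "last"),      # latest pick-up of the day
      )
      .reset_index()
)
print(summary)
```

Sorting on a real datetime column avoids depending on how the timestamp text happens to be formatted.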

How to create a duplicate flag (column) that counts duplicate rows based on two columns?

I have the following dataframe and would like to create a column at the end called "dup" showing the number of times the row shows up based on the "Seasons" and "Actor" columns. Ideally the dup column would look like this:
Name Seasons Actor dup
0 Stranger Things 3 Millie 1
1 Game of Thrones 8 Emilia 1
2 La Casa De Papel 4 Sergio 1
3 Westworld 3 Evan Rachel 1
4 Stranger Things 3 Millie 2
5 La Casa De Papel 4 Sergio 1
This should do what you need:
df['dup'] = df.groupby(['Seasons', 'Actor']).cumcount() + 1
Output:
Name Seasons Actor dup
0 Stranger Things 3 Millie 1
1 Game of Thrones 8 Emilia 1
2 La Casa De Papel 4 Sergio 1
3 Westworld 3 Evan Rachel 1
4 Stranger Things 3 Millie 2
5 La Casa De Papel 4 Sergio 2
As Scott Boston mentioned, according to your criteria the last row should also be 2 in the dup column.
Here is a similar post that can provide more information: SQL-like window functions in PANDAS
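Note that cumcount gives a running count (1 for the first occurrence, 2 for the second, and so on). If the goal is instead the total number of times each (Seasons, Actor) pair appears, transform("size") is one way to get it. The sketch below rebuilds the frame from the question and shows both; the dup_total column name is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Stranger Things", "Game of Thrones", "La Casa De Papel",
             "Westworld", "Stranger Things", "La Casa De Papel"],
    "Seasons": [3, 8, 4, 3, 3, 4],
    "Actor": ["Millie", "Emilia", "Sergio", "Evan Rachel", "Millie", "Sergio"],
})

keys = ["Seasons", "Actor"]

# Running count within each group: 1 for the first occurrence, 2 for the second, ...
df["dup"] = df.groupby(keys).cumcount() + 1

# Total size of each group, repeated on every row of that group.
df["dup_total"] = df.groupby(keys)["Name"].transform("size")

print(df)
```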

Merge 2 columns in python

I need to do the same thing as this line: df_g['Bidfloor'] = df_g[['Sitio', 'Country']].merge(df_seg, how='left').Precio but matching Country on only its first 2 characters instead of the whole value, because I can't change the language of the data. So I want to compare only the first 2 characters of the Country column instead of the full string.
df_g:
Sitio,Country
Los Andes Online,HN - Honduras
Guarda14,US - Estados Unidos
Guarda14,PE - Peru
df_seg:
Sitio,Country,Precio
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1
What I need:
Sitio,Country,Bidfloor
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1
Guarda14,PE - Peru,NULL
You need an additional key to help the merge; I am using cumcount to distinguish the repeated values (df1 and df2 here are df_g and df_seg):
df1.assign(key=df1.groupby('Sitio').cumcount()).\
    merge(df2.assign(key=df2.groupby('Sitio').cumcount()).
          drop(columns='Country'),
          how='left',
          on=['Sitio','key'])
Out[1491]:
Sitio Country key Precio
0 Los Andes Online HN - Honduras 0 0.5
1 Guarda14 US - Estados Unidos 0 2.1
2 Guarda14 PE - Peru 1 NaN
Just add and drop a merge column and you are done:
df_seg['merge_col'] = df_seg.Country.apply(lambda x: x.split('-')[0])
df_g['merge_col'] = df_g.Country.apply(lambda x: x.split('-')[0])
then do:
df = pd.merge(df_g, df_seg[['merge_col', 'Precio']], on='merge_col', how='left').drop(columns='merge_col')
returns
Sitio Country Precio
0 Los Andes Online HN - Honduras 0.5
1 Guarda14 US - Estados Unidos 2.1
2 Guarda14 PE - Peru NaN
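A variant that may be closer to the original request is to merge on both Sitio and the two-character country code, so that the same code under a different site does not match. The sketch below rebuilds the two frames from the question; the helper column name code is made up for this illustration, and it assumes the code is always the first two characters of Country.

```python
import pandas as pd

df_g = pd.DataFrame({
    "Sitio": ["Los Andes Online", "Guarda14", "Guarda14"],
    "Country": ["HN - Honduras", "US - Estados Unidos", "PE - Peru"],
})
df_seg = pd.DataFrame({
    "Sitio": ["Los Andes Online", "Guarda14"],
    "Country": ["HN - Honduras", "US - United States"],
    "Precio": [0.5, 2.1],
})

# Temporary key: first two characters of Country (assumed to be the country code).
left = df_g.assign(code=df_g["Country"].str[:2])
right = df_seg.assign(code=df_seg["Country"].str[:2])[["Sitio", "code", "Precio"]]

result = (
    left.merge(right, on=["Sitio", "code"], how="left")
        .drop(columns="code")
        .rename(columns={"Precio": "Bidfloor"})
)
print(result)
```

The Country column in the result keeps the spelling from df_g; only the code is used for matching.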

Selecting rows by last 3 characters in a column with strings

I have this dataframe
name year ...
0 Carlos - xyz 2019
1 Marcos - yws 2031
3 Fran - xxz 2431
4 Matt - yre 1985
...
I want to create a new column, called type.
If the name of the person ends with "xyz" or "xxz" I want type to be "big"
So, it should look like this:
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
...
Any suggestions?
Option 1
Use str.contains to generate a mask:
m = df.name.str.contains(r'x[yx]z$')
Or,
sub_str = ['xyz', 'xxz']
m = df.name.str.contains(r'(?:{})$'.format('|'.join(sub_str)))
Now, you may either create your column with np.where,
df['type'] = np.where(m, 'big', '')
Or use loc in place of np.where:
df['type'] = ''
df.loc[m, 'type'] = 'big'
df
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
Option 2
As an alternative, consider str.endswith + np.logical_or.reduce
sub_str = ['xyz', 'xxz']
m = np.logical_or.reduce([df.name.str.endswith(s) for s in sub_str])
df['type'] = ''
df.loc[m, 'type'] = 'big'
df
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
Here is one way using pandas.Series.str.
df = pd.DataFrame([['Carlos - xyz', 2019], ['Marcos - yws', 2031],
['Fran - xxz', 2431], ['Matt - yre', 1985]],
columns=['name', 'year'])
df['type'] = np.where(df['name'].str[-3:].isin({'xyz', 'xxz'}), 'big', '')
Alternatively, you can use .loc accessor instead of numpy.where:
df['type'] = ''
df.loc[df['name'].str[-3:].isin({'xyz', 'xxz'}), 'type'] = 'big'
Result
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
2 Fran - xxz 2431 big
3 Matt - yre 1985
Explanation
Extract last 3 letters using pd.Series.str.
Compare to a specified set of values for O(1) complexity lookup.
Use numpy.where to perform conditional assignment for new series.
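One detail worth checking when the regex is built from a list of suffixes: the alternation has to be grouped, otherwise the end-of-string anchor applies only to the last alternative. The sketch below uses made-up sample rows (the last row is invented to show the difference) and re.escape in case a suffix contains regex metacharacters.

```python
import re

import numpy as np
import pandas as pd

# The last row is a made-up case where "xyz" appears at the start, not the end.
df = pd.DataFrame({
    "name": ["Carlos - xyz", "Marcos - yws", "Fran - xxz", "xyz - Matt"],
    "year": [2019, 2031, 2431, 1985],
})

sub_str = ["xyz", "xxz"]
alternatives = "|".join(map(re.escape, sub_str))

# Grouped: "$" applies to every alternative.
grouped = df["name"].str.contains(r"(?:{})$".format(alternatives))

# Ungrouped: parsed as "xyz" anywhere OR "xxz" at the end.
ungrouped = df["name"].str.contains(r"{}$".format(alternatives))

df["type"] = np.where(grouped, "big", "")
print(df)
print(ungrouped.tolist())
```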

How to merge two data frames in pandas?

I have two pandas dataframes
Unnamed: 0 sentiment numberagreed tweetid tweet
0 0 2 6 219584 Apple processa a Samsung no Japão - Notícias -...
1 1 1 3 399249 É O JACKI CHAN !!! RT #user ESSE É DOS MEUS!!!...
2 2 3 3 387155 Eras o samsung galaxy tab e muito lerdo para t...
3 3 3 3 205458 Dizem que a coisa mais triste que o homem enfr...
4 4 3 3 2054404 RAIVA vou ter que ir com meu nike dinovo pra e...
tweetid sent
219584 0.494428
399249 0.789241
387155 0.351972
205458 0.396907
2054404 0.000000
They are not the same length, and there are some missing values in the second data frame.
I want to merge the two data frames on tweetid and drop the missing values.
Use pd.merge
pd.merge(left=df1, right=df2, on='tweetid', how='inner')
Because this is an inner join, rows whose tweetid does not appear in both frames are thrown away. on='tweetid' makes tweetid the merge key.
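As a small illustration, the sketch below uses shortened versions of the two frames, with one tweetid missing from each side and one NaN score added on purpose (those gaps are made up for the example). The inner merge drops the unmatched ids, and dropna removes rows whose sent value is missing.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "tweetid": [219584, 399249, 387155, 205458],
    "sentiment": [2, 1, 3, 3],
})
df2 = pd.DataFrame({
    "tweetid": [219584, 399249, 205458, 2054404],
    "sent": [0.494428, np.nan, 0.396907, 0.0],  # NaN added for illustration
})

# Inner join keeps only tweetids present in both frames;
# dropna then removes rows with a missing score.
merged = (
    pd.merge(left=df1, right=df2, on="tweetid", how="inner")
      .dropna(subset=["sent"])
      .reset_index(drop=True)
)
print(merged)
```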
There is probably an extra character somewhere at the beginning of your file. Are you reading the data from a csv file? Post the source code of how you are reading the data.
Or name the columns explicitly on both dataframes:
df_tweets.columns = ("tweetid", "sent")
