I need to do the same as my line df_g['Bidfloor'] = df_g[['Sitio', 'Country']].merge(df_seg, how='left').Precio, but matching Country on only its first 2 characters (the country code) instead of the exact value, because I can't change the language of the data. So I want the merge to read only the first 2 characters of the Country column instead of the whole string.
df_g:
Sitio,Country
Los Andes Online,HN - Honduras
Guarda14,US - Estados Unidos
Guarda14,PE - Peru
df_seg:
Sitio,Country,Precio
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1
What I need:
Sitio,Country,Bidfloor
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1
Guarda14,PE - Peru,NULL
You need an additional key to help the merge. I am using cumcount to distinguish the repeated values: it numbers repeated Sitio rows, so the nth occurrence in df1 lines up with the nth occurrence in df2.
(df1.assign(key=df1.groupby('Sitio').cumcount())
    .merge(df2.assign(key=df2.groupby('Sitio').cumcount())
              .drop('Country', axis=1),
           how='left',
           on=['Sitio', 'key']))
Out[1491]:
Sitio Country key Precio
0 Los Andes Online HN - Honduras 0 0.5
1 Guarda14 US - Estados Unidos 0 2.1
2 Guarda14 PE - Peru 1 NaN
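Note that the helper key is purely positional: cumcount numbers repeated Sitio values, so matching happens by occurrence order within each Sitio rather than by country. A quick illustration of what the key looks like, using df1 to hold the df_g sample above:

# 0 for the first occurrence of each Sitio, 1 for the next, and so on
print(df1.groupby('Sitio').cumcount().tolist())  # -> [0, 0, 1] for the sample data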
Just add and drop a merge column and you are done:
df_seg['merge_col'] = df_seg.Country.apply(lambda x: x.split('-')[0])
df_g['merge_col'] = df_g.Country.apply(lambda x: x.split('-')[0])
then do:
df = pd.merge(df_g, df_seg[['merge_col', 'Precio']], on='merge_col', how='left').drop('merge_col', axis=1)
returns
Sitio Country Precio
0 Los Andes Online HN - Honduras 0.5
1 Guarda14 US - Estados Unidos 2.1
2 Guarda14 PE - Peru NaN
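If you want to match strictly on the two-letter country code rather than on everything before the first dash, a minimal sketch of the same idea, assuming the code is always exactly the first two characters:

# key on the 2-letter country code only, e.g. "HN", "US", "PE"
df_g['key'] = df_g.Country.str[:2]
df_seg['key'] = df_seg.Country.str[:2]
df = (pd.merge(df_g, df_seg[['key', 'Precio']], on='key', how='left')
        .drop('key', axis=1)
        .rename(columns={'Precio': 'Bidfloor'}))

The rename gives you the Bidfloor column name from the desired output.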
I have a dataframe with a column called details that has this data:
130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -
When I want to get the first value, 130, I did this:
df['superficie'] = df['details'].str.split('m²').str[0]
It gives me 130 in a new column called "superficie".
For the second value I did this:
df['nbPieces']= (df['details'].str.split('-').str[1].str.split('Pièces').str[0])
It gives me 3 in a new column called "nbPieces".
But my problem is: if I want to get the 2 of the Chambres, the 2 of the Salles de bains, and the 20-30 near "ans", how can I do that? I need to add them to new columns (nbChambre, nbSalleDeBain, NbAnnee).
Thanks in advance.
I suggest you use regular expressions in pandas for this kind of operation:
import pandas as pd
df = pd.DataFrame()
df['details'] = ["130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -"]
df['nb_chbr'] = df['details'].str.split(" - ").str[2].str.findall(r'\d+').str[0].astype('int64')
df['nb_sdb'] = df['details'].str.split(" - ").str[3].str.findall(r'\d+').str[0].astype('int64')
df['nb_annee'] = df['details'].str.split(" - ").str[5].str.findall(r'\d+').str[0].astype('int64')
print(df)
Output:
details nb_chbr nb_sdb nb_annee
0 130 m² - 3 Pièces - 2 Chambres - 2 Salles de b... 2 2 20
Moreover, I used " - " as the split string; it returns a cleaner list in your case. For the "Nombre d'années" I simply took the first integer that appears ("20" out of "20-30"); I don't know if that suits you.
Finally, there may be a problem in your dataframe: 2 chambres and 2 salles de bains should make at least a 4 pièces flat ^^
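If the field order is fixed, you can also pull all five numbers in one pass with str.extract and named groups. A sketch, assuming the layout shown above (the extracted columns are strings, so cast with astype(int) if needed):

# one regex, one pass; each named group becomes a column
pattern = (r'(?P<superficie>\d+)\s*m²\s*-\s*'
           r'(?P<nbPieces>\d+)\s*Pièces\s*-\s*'
           r'(?P<nbChambre>\d+)\s*Chambres\s*-\s*'
           r'(?P<nbSalleDeBain>\d+)\s*Salles de bains.*?'
           r'(?P<NbAnnee>\d+)-\d+\s*ans')
df = df.join(df['details'].str.extract(pattern))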
I have a pandas dataframe with a column called "Bio Location", and I would like to filter it so that I only keep the rows whose location contains one of the city names in my list. I have written the following code, which works except for one problem.
For example, if the location is "Paris France" and I have Paris in my list, it returns the row. However, if it is "France Paris", it does not return "Paris". Do you have a solution? Maybe use regex? Thank u a lot!!!
df = pd.read_csv(path_to_file, encoding='utf-8', sep=',')
cities = ['Paris', 'Bruxelles', 'Madrid']
values = df[df['Bio Location'].isin(cities)]
values.to_csv(r'results.csv', index=False)
What you want here is .str.contains():
1. The DF I used to test:
df = {
'col1':['Paris France','France Paris Test','France Paris','Madrid Spain','Spain Madrid Test','Spain Madrid'] #so tested with 1x at start, 1x in the middle and 1x at the end of a str
}
df = pd.DataFrame(df)
df
Result:
                col1
0       Paris France
1  France Paris Test
2       France Paris
3       Madrid Spain
4  Spain Madrid Test
5       Spain Madrid
2. Then applying the code below:
reg = 'Paris|Madrid'
df = df[df.col1.str.contains(reg)]
df
Result:
                col1
0       Paris France
1  France Paris Test
2       France Paris
3       Madrid Spain
4  Spain Madrid Test
5       Spain Madrid
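To tie this back to your list, you can build the pattern from cities instead of hardcoding it. A sketch, with word boundaries added so substrings such as "Parisian" don't match, and na=False to skip missing values:

cities = ['Paris', 'Bruxelles', 'Madrid']
reg = '|'.join(rf'\b{c}\b' for c in cities)   # -> '\bParis\b|\bBruxelles\b|\bMadrid\b'
values = df[df['Bio Location'].str.contains(reg, na=False)]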
I want to split a column into two columns by using the value of a second column from the same row, hence the second column value serves as the split delimiter.
I'm receiving the error TypeError: 'Series' objects are mutable, thus they cannot be hashed, which makes sense: split is receiving a Series, not a single value. But I'm unsure how to isolate the single-row value of that second column.
Sample data:
title_location delimiter
0 Doctor - ABC - Los Angeles, CA - ABC -
1 Lawyer - ABC - Atlanta, GA - ABC -
2 Athlete - XYZ - Jacksonville, FL - XYZ -
Code:
bigdata[['title', 'location']] = bigdata['title_location'].str.split(bigdata['delimiter'], expand=True)
Desired output:
title_location delimiter title location
0 Doctor - ABC - Los Angeles, CA - ABC - Doctor Los Angeles, CA
1 Lawyer - ABC - Atlanta, GA - ABC - Lawyer Atlanta, GA
2 Athlete - XYZ - Jacksonville, FL - XYZ - Athlete Jacksonville, FL
Try apply with axis=1, so the lambda receives one row at a time and both fields are plain strings:
bigdata[['title', 'location']] = bigdata.apply(
    lambda row: row['title_location'].split(row['delimiter']),
    axis=1, result_type="expand")
Let us try zip, then join back:
df = df.join(pd.DataFrame([x.split(y) for x, y in zip(df.title_location, df.delimiter)],
                          index=df.index, columns=['Title', 'Location']))
df
Out[200]:
title_location delimiter Title Location
0 Doctor - ABC - Los Angeles, CA - ABC - Doctor Los Angeles, CA
1 Lawyer - ABC - Atlanta, GA - ABC - Lawyer Atlanta, GA
2 Athlete - XYZ - Jacksonville, FL - XYZ - Athlete Jacksonville, FL
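Either way, the split pieces keep the spaces that surrounded the delimiter. A small follow-up, assuming you want them trimmed (the same works for the Title/Location columns of the zip version):

# strip leading/trailing whitespace from both new columns
bigdata[['title', 'location']] = bigdata[['title', 'location']].apply(lambda s: s.str.strip())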
I have the following data and I would like to know: who was the first and last customer that each Driver picked up on each day?
Data
This is how far I have got:
#Import libraries
import pandas as pd
import numpy as np
#Open and clean the data
df = pd.read_csv('Data.csv')
df = df.drop(['Cod'], axis=1)
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
#the following code is to respond the following question:
#Who was the first and last customer that each Driver picked-up for each day?
#link to access the data: https://drive.google.com/file/d/194byxNkgr2e9r-IOEmSuu9gpZyw27G7j/view?usp=sharing
unique_drivers = df['Driver'].unique()
for driver in unique_drivers:
    d = df.groupby('Driver').get_group(driver)
    time = d['Start'].iloc[0]
    first_customer = d['Customer'].iloc[0]
    end = d['End'].iloc[0]
    last_customer = d['Customer'].iloc[-1]
You can first sort by the Start column, which includes the hour and minutes, ensuring that multiple same-day events are ordered correctly for the next step. Then group the frame by Driver to find each driver's pick-ups per day.
Use drop_duplicates to drop the repeated dates: the flag keep="first" preserves only the first row of each cluster of repeated values, and keep="last" preserves only the last. This yields, for each driver, the first and the last pick-up of each day; then use the index of those rows over the Customer column to get the customer's name.
import pandas as pd
df = pd.read_csv("data.csv")
print(df)
# sort including HH:MM
df = df.sort_values("Start")
drivers_df = []
for gname, group in df.groupby("Driver"):
    dn = pd.DataFrame()
    # split to get date and time in two columns
    ts = group["Start"].str.split(expand=True)
    # remove duplicate days, keeping the first occurrence
    t_first = ts.drop_duplicates(subset=[0], keep="first")
    # remove duplicate days, keeping the last occurrence
    t_last = ts.drop_duplicates(subset=[0], keep="last")
    dn["Date"] = t_first[0]
    dn["Driver"] = gname
    dn["Num_Customers"] = ts[0].groupby(ts[0]).count().values
    # use the previously obtained indices over the "Customer" column
    dn["First_Customer"] = df.loc[t_first.index, "Customer"].values
    dn["Last_Customer"] = df.loc[t_last.index, "Customer"].values
    drivers_df.append(dn)
dn = pd.concat(drivers_df)
# sort by date; remove this line to keep the rows sorted by driver's name
dn = dn.sort_values("Date")
dn = dn.reset_index(drop=True)
print(dn)
Output from dn
Date Driver Num_Customers First_Customer Last_Customer
0 5/10/2020 Javier Pulgar 1 100998 - MARA MIRIAN BEATRIZ 100998 - MARA MIRIAN BEATRIZ
1 5/10/2020 Santiago Muruaga 1 103055 - ZANOTTO VALERIA 103055 - ZANOTTO VALERIA
2 5/10/2020 Martín Gutierrez 1 105645 - PAWLIW MARTA SOFI... 105645 - PAWLIW MARTA SOFI...
3 5/10/2020 Pablo Aguilar 2 102737 - GONZALVE DE ACEVEDO 102737 - GONZALVE DE ACEVEDO
4 5/10/2020 Carlos Medina 1 102750 - COOP.DE TRABAJO 102750 - COOP.DE TRABAJO
5 5/11/2020 Facundo Papaleo 6 101209 - FARMACIA NAZCA 2602 105093 - BIO HERLPER
6 5/11/2020 Franco Chiarrappa 15 100288 - SAVINI LUCIANA MARIA 102690 - GIOIA ELIZABETH
7 5/11/2020 Hernán Navarro 14 106367 - FARMACIA BERAPHAR... 102631 - SPALVIERI MARINA
8 5/11/2020 Pablo Aguilar 9 102510 - CAZADORFARM SCS 101482 - JOAQUIN MARCIAL
9 5/11/2020 Daniel Godino 7 103572 - GIRALDEZ ALICIA OLGA 103363 - CADELLI ROBERTO JOSE
10 5/11/2020 Hernán Urquiza 1 105323 - GARCIA GERMAN REI... 105323 - GARCIA GERMAN REI...
11 5/11/2020 Héctor Naselli 19 103545 - FARMACIA DESANTI 102257 - FARMA NUOVA S.C.S.
12 5/11/2020 Santiago Muruaga 12 101735 - ALEGRE LEONARDO 500014 - Drogueria DIMEC
13 5/11/2020 Javier Pulgar 2 101009 - MIGUEL ANGEL MARA 103462 - DRAGONE CARLOS AL...
14 5/11/2020 Atilano Aguilera 1 104003 - FARMACIA SANTA 104003 - FARMACIA SANTA
15 5/11/2020 Muletto 3 101359 - FARMACIA COSENTINO 105886 - NEGRI GREGORIO
16 5/11/2020 Martín Venturino 8 102587 - JANISZEWSKI MATIL... 102672 - BORSOTTI GUSTAVO
17 5/11/2020 Martín Gutierrez 1 105645 - PAWLIW MARTA SOFI... 105645 - PAWLIW MARTA SOFI...
18 5/11/2020 José Vallejos 13 102229 - LANDRIEL MARIA LO... 105721 - SOSA NANCY EDITH ...
19 5/11/2020 Edgardo Andrade 9 101524 - FARMACIA M Y A 101217 - MARISA TESORO
20 5/11/2020 Carlos Medina 14 105126 - QUISPE MURILLO RODY 100538 - MAXIMILIANO CAMPO...
21 5/11/2020 Javier Torales 1 200666 - CLINICA BOEDO SRL 200666 - CLINICA BOEDO SRL
22 5/12/2020 Hernán Urquiza 8 105293 - BENSAK MARIA EUGENIA 103005 - BONVISSUTO SANDRA
23 5/12/2020 Miguel Quilici 17 102918 - BRITO NICOLAS 102533 - SAMPEDRO PURA
...
...
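A more compact route, if you parse Start as a datetime first (as in the question's own code), is named aggregation over a (Date, Driver) groupby. A sketch under that assumption:

import pandas as pd

df = pd.read_csv("data.csv")
df["Start"] = pd.to_datetime(df["Start"])

# sort by timestamp, then take first/last customer per driver per day
daily = (df.sort_values("Start")
           .assign(Date=lambda d: d["Start"].dt.date)
           .groupby(["Date", "Driver"])["Customer"]
           .agg(Num_Customers="count",
                First_Customer="first",
                Last_Customer="last")
           .reset_index())
print(daily)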
The dataframe has 122,145 rows.
Following is a snippet of the data:
country_name,subdivision_1_name,subdivision_2_name,city_name
Spain,Madrid,Madrid,Sevilla La Nueva
Spain,Principality of Asturias,Asturias,Sevares
Spain,Catalonia,Barcelona,Seva
Spain,Cantabria,Cantabria,Setien
Spain,Basque Country,Biscay,Sestao
Spain,Navarre,Navarre,Sesma
Spain,Catalonia,Barcelona,Barcelona
I want to substitute city_name with subdivision_2_name whenever both of the following conditions are satisfied:
1. subdivision_2_name and city_name share the same country_name and the same subdivision_1_name, and
2. subdivision_2_name is present as a city_name in that group.
Ex: for the city_name "Seva", the subdivision_2_name "Barcelona" is also present as a city_name in the dataframe, with the same country_name "Spain" and the same subdivision_1_name "Catalonia", so I will replace "Seva" with "Barcelona".
I am not able to create a proper apply func, so I have prepared a loop:
for i in range(df.shape[0]):
    if df.subdivision_2_name[i] in set(df.city_name[(df.country_name == df.country_name[i]) & (df.subdivision_1_name == df.subdivision_1_name[i])]):
        df.city_name[i] = df.subdivision_2_name[i]
Edit: this loop took 1637 seconds (~28 min) to run.
Please suggest a better method.
Use:
def f(x):
    if x['subdivision_2_name'].isin(x['city_name']).any():
        x['city_name'] = x['subdivision_2_name']
    return x
df1 = df.groupby(['country_name','subdivision_1_name','subdivision_2_name']).apply(f)
print (df1)
country_name subdivision_1_name subdivision_2_name city_name
0 Spain Madrid Madrid Sevilla La Nueva
1 Spain Principality of Asturias Asturias Sevares
2 Spain Catalonia Barcelona Barcelona
3 Spain Cantabria Cantabria Setien
4 Spain Basque Country Biscay Sestao
5 Spain Navarre Navarre Sesma
6 Spain Catalonia Barcelona Barcelona
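If groupby/apply is still slow at 122k rows, the same rule can also be written as a plain set lookup. A sketch that mirrors the loop's membership test over the (country_name, subdivision_1_name) slice:

# all (country, subdivision_1, city) triples, for O(1) membership tests
existing = set(zip(df.country_name, df.subdivision_1_name, df.city_name))
# True where subdivision_2_name occurs as a city in the same country/subdivision_1
mask = [(c, s1, s2) in existing
        for c, s1, s2 in zip(df.country_name, df.subdivision_1_name, df.subdivision_2_name)]
df.loc[mask, 'city_name'] = df.loc[mask, 'subdivision_2_name']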