Wildcard search in a Python string and then updating the string - python

I have a column named City. I want to bring the city names to one standard format, e.g.
Column sample data:
City
Sydney
Sydney-EZ
Bangalore
Bengalore SEZ
Delhi
New Delhi
Sydney and Sydney-EZ (or any other row containing the word Sydney) should be replaced by Sydney. Bangalore and Bangalore SEZ (or any other row containing the word Bangalore) should be replaced by Bangalore. Delhi and New Delhi (or any other row containing the word Delhi) should be replaced by Delhi.

Using apply with lambda
Ex:
import pandas as pd
df = pd.DataFrame({"City": ["Sydney", "Sydney-EZ", "Bangalore", "Bengalore SEZ"]})
toUpdate = "Sydney"
df["City"] = df["City"].apply(lambda x:toUpdate if toUpdate in x else x )
print(df)
Output:
City
0 Sydney
1 Sydney
2 Bangalore
3 Bengalore SEZ
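To normalize every city rather than just one, the same apply idea can loop over a list of canonical names. A minimal sketch (the canonical list below is an assumption; extend it to match your data):
import pandas as pd

df = pd.DataFrame({"City": ["Sydney", "Sydney-EZ", "Bangalore", "Bangalore SEZ", "New Delhi"]})
canonical = ["Sydney", "Bangalore", "Delhi"]

def normalize(city):
    # return the first canonical name contained in the raw value, else keep it unchanged
    for name in canonical:
        if name in city:
            return name
    return city

df["City"] = df["City"].apply(normalize)
print(df)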

Related

How to convert a pandas DataFrame to multiple DataFrames?

My DataFrame
import pandas
df = pandas.DataFrame({
    "City":   ["Chennai", "Banglore", "Mumbai", "Delhi", "Chennai", "Banglore", "Mumbai", "Delhi"],
    "Name":   ["Praveen", "Dhansekar", "Naveen", "Kumar", "SelvaRani", "Nithya", "Suji", "Konsy"],
    "Gender": ["M", "M", "M", "M", "F", "F", "F", "F"]})
when printed, df appears like this:
City      Name       Gender
Chennai   Praveen    M
Banglore  Dhansekar  M
Mumbai    Naveen     M
Delhi     Kumar      M
Chennai   SelvaRani  F
Banglore  Nithya     F
Mumbai    Suji       F
Delhi     Konsy      F
I want to save the data in separate DataFrames as follows:
Chennai =
City     Name       Gender
Chennai  Praveen    M
Chennai  SelvaRani  F

Banglore =
City      Name       Gender
Banglore  Dhansekar  M
Banglore  Nithya     F

Mumbai =
City    Name    Gender
Mumbai  Naveen  M
Mumbai  Suji    F

Delhi =
City   Name   Gender
Delhi  Kumar  M
Delhi  Konsy  F
My code is:
D_name = sorted(df['City'].unique())
for i in D_name:
    f"{i}" = df[df['City'] == i]
The dataset has more than 100 cities. How do I write a for loop in Python to get the output as multiple data frames?
You can groupby and create a dictionary like so:
dict_dfs = dict(iter(df.groupby("City")))
Then you can directly access individual cities:
Delhi = dict_dfs["Delhi"]
print(Delhi)
# result:
City Name Gender
3 Delhi Kumar M
7 Delhi Konsy F
You could do something like this:
groups = df.groupby(by='City')
Banglore = groups.get_group('Banglore')
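With more than 100 cities, creating one variable per city is hard to manage; iterating the groups (or the dictionary above) handles them generically. A small sketch that writes each group to its own CSV (the per-city file naming is an assumption):
for city, group in df.groupby('City'):
    # one file per city, e.g. Chennai.csv, Delhi.csv, ...
    group.to_csv(f"{city}.csv", index=False)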

How to replace a list with first element of list in pandas dataframe column?

I have a pandas dataframe df, which looks like this:
df = pd.DataFrame({'Name': ['Harry', 'Sam', 'Raj', 'Jamie', 'Rupert'],
                   'Country': ['USA', "['USA', 'UK', 'India']", "['India', 'USA']", 'Russia', 'China']})
Name Country
Harry USA
Sam ['USA', 'UK', 'India']
Raj ['India', 'USA']
Jamie Russia
Rupert China
Some values in the Country column are lists, and I want to replace those with the first element of the list, so that it looks like this:
Name Country
Harry USA
Sam USA
Raj India
Jamie Russia
Rupert China
As you have strings, you could use a regex here:
df['Country'] = df['Country'].str.extract(r'((?<=\[["\'])[^"\']*|^[^"\']+$)')
output (as a new column for clarity):
Name Country Country2
0 Harry USA USA
1 Sam ['USA', 'UK', 'India'] USA
2 Raj ['India', 'USA'] India
3 Jamie Russia Russia
4 Rupert China China
regex:
( # start capturing
(?<=\[["\']) # if preceded by [" or ['
[^"\']* # get all text until " or '
| # OR
^[^"\']+$ # get whole string if it doesn't contain " or '
) # stop capturing
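To see what the pattern matches outside pandas, a quick sketch with the re module on sample strings from the question:
import re

pattern = r'((?<=\[["\'])[^"\']*|^[^"\']+$)'
for s in ['USA', "['USA', 'UK', 'India']", "['India', 'USA']"]:
    print(s, '->', re.search(pattern, s).group(1))
# USA -> USA
# ['USA', 'UK', 'India'] -> USA
# ['India', 'USA'] -> India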
Try something like:
import ast

def changeStringList(value):
    try:
        # parse strings like "['USA', 'UK', 'India']" into a real list and return its first element
        myList = ast.literal_eval(value)
        return myList[0]
    except (ValueError, SyntaxError):
        # plain country names are not valid Python literals, so return them unchanged
        return value

df["Country"] = df["Country"].apply(changeStringList)
df
Output
     Name Country
0   Harry     USA
1     Sam     USA
2     Raj   India
3   Jamie  Russia
4  Rupert   China
Note that the changeStringList function tries to parse each string into an actual list of strings and returns its first value. If the value is not a list, it is returned unchanged.
Try this:
import ast
df['Country'] = (
    df['Country']
      .where(df['Country'].str.contains('[', regex=False), "['" + df['Country'] + "']")
      .apply(ast.literal_eval)
      .str[0]
)
Output:
>>> df
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
A regex solution:
import re

tempArr = []
for val in df["Country"]:
    if val.startswith("["):
        # take the first word inside the bracketed, list-like string
        val = re.findall(r"[A-Za-z]+", val)[0]
        tempArr.append(val)
    else:
        tempArr.append(val)
df["Country"] = tempArr
df
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
If you have strings, you could use Series.str.strip to remove the '[' and ']', then Series.str.split to convert each row to a list, and after that use the .str accessor:
df['Country'] = df['Country'].str.strip('[|]').str.split(',')\
                             .str[0].str.replace("'", "")
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
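If the column ever holds real Python lists rather than their string representation, the same replacement is a one-liner with a type check (a minimal sketch assuming list cells):
df['Country'] = df['Country'].apply(lambda v: v[0] if isinstance(v, list) else v)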

Creating new dataframe column using string filter of other column

Below is the dataframe with a column named 'Address'. I want to create a separate column 'City' by filtering specific strings from the Address column.
df1
Serial_No Address
1 India Gate Delhi
2 Delhi Redcross Hospital
3 Tolleyganj Bus Stand Kolkata
4 Kolkata Howrah
5 Katra Jammu
Below is the script that I am using
descr = []
col = 'City'
for col in df:
    if np.series(df[col]= df[df[col].str.contains('Delhi ', na=False)]:
        desc = 'Delhi'
    elif np.series(df[col]= df[df[col].str.contains('Kolkata ', na=False)]:
        desc = 'Kolkata'
    else:
        desc = 'None'
Below is the intended output
df1
Serial_No Address City
1 India Gate Delhi Delhi
2 Delhi Redcross Hospital Delhi
3 Tolleyganj Bus Stand Kolkata Kolkata
4 Kolkata Howrah Kolkata
5 Katra Jammu None
Let us try str.extract
df['new'] = df.Address.str.extract('(Delhi|Kolkata)')[0]
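With a longer list of cities, the alternation can be built from a list, and non-matches filled with 'None' to match the intended output (a sketch; the city list is an assumption to be extended):
cities = ['Delhi', 'Kolkata']
pattern = '(' + '|'.join(cities) + ')'
df1['City'] = df1.Address.str.extract(pattern)[0].fillna('None')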
Try this:
import pandas as pd

df1 = pd.DataFrame([[1, 'India Gate Delhi'],
                    [2, 'Delhi Redcross Hospital'],
                    [3, 'Tolleyganj Bus Stand Kolkata'],
                    [4, 'Kolkata Howrah'],
                    [5, 'Katra Jammu']],
                   columns=['Serial_No', 'Address'])
print(df1)

def f(df1):
    if 'Delhi' in df1['Address']:
        val = 'Delhi'
    elif 'Kolkata' in df1['Address']:
        val = 'Kolkata'
    else:
        val = 'None'
    return val

df1['City'] = df1.apply(f, axis=1)
print(df1)

converting print statement output in loop into dataframe

I am trying to adapt the following code from a print statement to DataFrame output.
places = ['England UK', 'Paris FRANCE', 'ITALY,gh ROME', 'New']
location = ['UK', 'FRANCE', 'ITALY']

def on_occurence(pos, location):
    print(i, ':', location)

root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)
the print output for the above code is
England UK : UK
Paris FRANCE : FRANCE
ITALY,gh ROME : ITALY
I would like it so the df looked like:
message        country
England UK     UK
Paris FRANCE   FRANCE
ITALY,gh ROME  ITALY
I have tried the following with no luck
places = ['England UK', 'Paris FRANCE', 'ITALY,gh ROME', 'New']
location = ['UK', 'FRANCE', 'ITALY']
df = pd.DataFrame(columns=["message", "location"])

def on_occurence(pos, location):
    print(i, ':', location)
    df = df.append({"message": i, "location": location}, ignore_index=True)

root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)
However the df looks like the following:
message  country
NEW      UK FRANCE ITALY
df = pd.DataFrame(list(zip(places, location)), columns = ["Message", "Country"])
print(df)
My output:
Message Country
0 England UK UK
1 Paris FRANCE FRANCE
2 ITALY,gh ROME ITALY
If you want to print it without Row Index:
print(df.to_string(index=False))
Output in this case is:
Message Country
England UK UK
Paris FRANCE FRANCE
ITALY,gh ROME ITALY
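If you want to keep the Aho-Corasick matching from the question rather than relying on the two lists lining up, one option is to collect rows in a plain list inside the callback and build the DataFrame afterwards (a sketch; aho_create_statemachine and aho_find_all are the functions from the question):
import pandas as pd

rows = []

def on_occurence(pos, location):
    # i is the place currently being scanned in the loop below
    rows.append({"message": i, "country": location})

root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)

df = pd.DataFrame(rows, columns=["message", "country"])
print(df)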
I would recommend using dictionaries instead of 2 separate lists, e.g.:
placeAndLocation = {
    "england UK": "UK",
    "Paris France": "france"
}
and so on.
Then to loop through this use:
for place, location in placeAndLocation.items():
    print("place: " + place)
    print("location: " + location)
I find this easier as you can easily see which data field lines up with which value, and the data is contained within one variable, making it easier to read down the line.
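From such a dictionary, the DataFrame from the zip-based answer can be built directly (a small sketch using the placeAndLocation dict above):
import pandas as pd

df = pd.DataFrame(list(placeAndLocation.items()), columns=["Message", "Country"])
print(df)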

String mode aggregation with group by function

I have a dataframe which looks like below:
Country City
UK London
USA Washington
UK London
UK Manchester
USA Washington
USA Chicago
I want to group by country and aggregate on the most repeated city in each country.
My desired output should be like
Country City
UK London
USA Washington
Because London and Washington appear 2 times whereas Manchester and Chicago appear only 1 time.
I tried
from scipy.stats import mode
df_summary = df.groupby('Country')['City'].\
apply(lambda x: mode(x)[0][0]).reset_index()
But it seems it won't work on strings
I can't replicate your error, but you can use pd.Series.mode, which accepts strings and returns a series, using iat to extract the first value:
res = df.groupby('Country')['City'].apply(lambda x: x.mode().iat[0]).reset_index()
print(res)
Country City
0 UK London
1 USA Washington
try like below:
>>> df.City.mode()
0 London
1 Washington
dtype: object
OR
Can use scipy with stats + lambda:
import pandas as pd
from scipy import stats

df.groupby('Country').agg({'City': lambda x: stats.mode(x)[0]})
City
Country
UK London
USA Washington
# df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]}).reset_index()
However, it also gives a nice count if you don't return only the first value:
>>> df.groupby('Country').agg({'City': lambda x:stats.mode(x)})
City
Country
UK ([London], [2])
USA ([Washington], [2])
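A scipy-free alternative is to count occurrences per group and keep the most frequent city (a sketch using value_counts):
res = (df.groupby('Country')['City']
         .agg(lambda x: x.value_counts().idxmax())
         .reset_index())
print(res)
#   Country        City
# 0      UK      London
# 1     USA  Washington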
