converting print statement output in loop into dataframe - python

I am trying to adapt the following code from print statement to dataframe output.
places = ['England UK','Paris FRANCE','ITALY,gh ROME','New']
location=['UK','FRANCE','ITALY']
def on_occurence(pos,location):
    print (i,':',location)
root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)
The print output for the above code is:
England UK : UK
Paris FRANCE : FRANCE
ITALY,gh ROME : ITALY
I would like the df to look like this:
message        country
England UK     UK
Paris FRANCE   FRANCE
ITALY,gh ROME  ITALY
I have tried the following with no luck:
places = ['England UK','Paris FRANCE','ITALY,gh ROME','New']
location=['UK','FRANCE','ITALY']
df = pd.DataFrame(columns=["message","location"])
def on_occurence(pos,location):
    print (i,':',location)
    df = df.append({"message":i,"location":location},ignore_index=True)
root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)
However, the df looks like the following:
message  country
NEW      UK FRANCE ITALY

df = pd.DataFrame(list(zip(places, location)), columns = ["Message", "Country"])
print(df)
My output:
   Message        Country
0  England UK     UK
1  Paris FRANCE   FRANCE
2  ITALY,gh ROME  ITALY
If you want to print it without Row Index:
print(df.to_string(index=False))
Output in this case is:
Message        Country
England UK     UK
Paris FRANCE   FRANCE
ITALY,gh ROME  ITALY

I would recommend using a dictionary instead of two separate lists, e.g.:
placeAndLocation = {
    "england UK" : "UK",
    "Paris France" : "france"
}
and so on.
Then, to loop through it, use:
for place, location in placeAndLocation.items():
    print("place: " + place)
    print("location: " + location)
I find this easier because you can see at a glance which key lines up with which value, and the data is contained within one variable, which makes it easier to read down the line.
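If you want to keep the matching loop from the question and still end up with a DataFrame, one option is to collect each match as a row in a plain list and build the frame once at the end. This is only a sketch: aho_create_statemachine and aho_find_all are not shown in the question, so find_all below is a hypothetical stand-in that uses a simple substring test. Appending to a list also avoids DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd

places = ['England UK', 'Paris FRANCE', 'ITALY,gh ROME', 'New']
locations = ['UK', 'FRANCE', 'ITALY']

def find_all(text, keywords):
    """Hypothetical stand-in for aho_find_all: yield each keyword found in text."""
    for keyword in keywords:
        if keyword in text:
            yield keyword

# Collect one dict per match instead of printing it.
rows = []
for place in places:
    for match in find_all(place, locations):
        rows.append({'message': place, 'country': match})

# Build the DataFrame once, after the loop.
df = pd.DataFrame(rows, columns=['message', 'country'])
print(df.to_string(index=False))
```

Places with no match (here 'New') simply produce no row.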

Related

Dataframe - filter the values of a particular column with isin()

I have a pandas dataframe with a column "Bio Location". I would like to filter it so that I keep only the rows whose location contains one of the city names in my list. The following code works, but with one problem.
For example, if the location is "Paris France" and Paris is in my list, the row is returned. However, if the location is "France Paris", the row is not returned. Do you have a solution? Maybe use a regex? Thanks a lot!
df = pd.read_csv(path_to_file, encoding='utf-8', sep=',')
cities = ['Paris', 'Bruxelles', 'Madrid']
values = df[df['Bio Location'].isin(cities)]
values.to_csv(r'results.csv', index = False)
What you want here is .str.contains():
1. The DF I used to test:
df = {
    # tested with the city at the start, in the middle, and at the end of the string
    'col1': ['Paris France', 'France Paris Test', 'France Paris', 'Madrid Spain', 'Spain Madrid Test', 'Spain Madrid']
}
df = pd.DataFrame(df)
df
Result:
index  col1
0      Paris France
1      France Paris Test
2      France Paris
3      Madrid Spain
4      Spain Madrid Test
5      Spain Madrid
2. Then applying the code below:
Updated following comment
#so tested with 1x at start, 1x in the middle and 1x at the end of a str
reg = ('Paris|Madrid')
df = df[df.col1.str.contains(reg)]
df
Result:
index  col1
0      Paris France
1      France Paris Test
2      France Paris
3      Madrid Spain
4      Spain Madrid Test
5      Spain Madrid
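When the city names come from a list rather than a hand-written pattern, the regex can be built from that list. The sketch below is one possible way to do it, using made-up sample rows: re.escape guards against names containing regex metacharacters, and the \b word boundaries (my addition, not part of the answer above) keep "Paris" from matching inside a longer word such as "Parisian".

```python
import re
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({'Bio Location': ['Paris France', 'France Paris', 'Madrid Spain',
                                    'Berlin Germany', 'Parisian Cafe']})
cities = ['Paris', 'Bruxelles', 'Madrid']

# Non-capturing group of escaped names, wrapped in word boundaries.
pattern = r'\b(?:' + '|'.join(re.escape(city) for city in cities) + r')\b'

# na=False treats missing locations as "no match" instead of propagating NaN.
values = df[df['Bio Location'].str.contains(pattern, case=False, na=False)]
print(values)
```

Drop case=False if the match should be case-sensitive.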

How to replace a list with first element of list in pandas dataframe column?

I have a pandas dataframe df, which look like this:
df = pd.DataFrame({'Name':['Harry', 'Sam', 'Raj', 'Jamie', 'Rupert'],
'Country':['USA', "['USA', 'UK', 'India']", "['India', 'USA']", 'Russia', 'China']})
Name    Country
Harry   USA
Sam     ['USA', 'UK', 'India']
Raj     ['India', 'USA']
Jamie   Russia
Rupert  China
Some values in Country column are list, and I want to replace those list with the first element in the list, so that it will look like this:
Name    Country
Harry   USA
Sam     USA
Raj     India
Jamie   Russia
Rupert  China
As you have strings, you could use a regex here:
df['Country'] = df['Country'].str.extract(r'((?<=\[["\'])[^"\']*|^[^"\']+$)')
output (as a new column for clarity):
   Name    Country                 Country2
0  Harry   USA                     USA
1  Sam     ['USA', 'UK', 'India']  USA
2  Raj     ['India', 'USA']        India
3  Jamie   Russia                  Russia
4  Rupert  China                   China
regex:
( # start capturing
(?<=\[["\']) # if preceded by [" or ['
[^"\']* # get all text until " or '
| # OR
^[^"\']+$ # get whole string if it doesn't contain " or '
) # stop capturing
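Try something like:
import ast
def changeStringList(value):
    try:
        # parse strings such as "['USA', 'UK']" into a real list
        myList = ast.literal_eval(value)
        return myList[0]
    except Exception:
        # not a list literal (e.g. 'USA'): keep the value unchanged
        return value
df["Country"] = df["Country"].apply(changeStringList)
df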
Output
   Name    Country
0  Harry   USA
1  Sam     USA
2  Raj     India
3  Jamie   Russia
4  Rupert  China
Note that, by using the changeStringList function, we try to reform the string list to an interpretable list of strings and return the first value. If it is not a list, then it returns the value itself.
Try this:
import ast
df['Country'] = df['Country'].where(df['Country'].str.contains('[', regex=False), '[\'' + df['Country'] + '\']').apply(ast.literal_eval).str[0]
Output:
>>> df
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
A regex solution.
import re
tempArr = []
for val in df["Country"]:
    if val.startswith("["):
        val = re.findall(r"[A-Za-z]+",val)[0]
        tempArr.append(val)
    else:
        tempArr.append(val)
df["Country"] = tempArr
df
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
If the values are strings, you could use Series.str.strip to remove the '[' and ']' characters, then Series.str.split to turn each row into a list, and finally the .str accessor to take the first element:
df['Country'] = df['Country'].str.strip('[|]').str.split(',')\
                             .str[0].str.replace("'", "")
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
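The answers above all treat the lists as strings, which matches the constructor shown in the question. If the column instead held real Python list objects (an assumption, not what the question's code creates), a plain isinstance check would be enough. A minimal sketch, with first_if_list as a hypothetical helper name:

```python
import pandas as pd

# Same data as the question, but with real list objects instead of strings.
df = pd.DataFrame({
    'Name': ['Harry', 'Sam', 'Raj', 'Jamie', 'Rupert'],
    'Country': ['USA', ['USA', 'UK', 'India'], ['India', 'USA'], 'Russia', 'China'],
})

def first_if_list(value):
    """Return the first element of a non-empty list, otherwise the value itself."""
    if isinstance(value, list) and value:
        return value[0]
    return value

df['Country'] = df['Country'].apply(first_if_list)
print(df)
```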

What is the fastest way to modify my pandas dataframe?

The dataframe has 122,145 rows.
Following is snippet of data :
country_name,subdivision_1_name,subdivision_2_name,city_name
Spain,Madrid,Madrid,Sevilla La Nueva
Spain,Principality of Asturias,Asturias,Sevares
Spain,Catalonia,Barcelona,Seva
Spain,Cantabria,Cantabria,Setien
Spain,Basque Country,Biscay,Sestao
Spain,Navarre,Navarre,Sesma
Spain,Catalonia,Barcelona,Barcelona
I want to replace city_name with subdivision_2_name whenever both of the following conditions are satisfied:
1. subdivision_2_name and city_name have the same country_name and the same subdivision_1_name, and
2. subdivision_2_name is present in city_name.
Example: for city_name "Seva", the subdivision_2_name "Barcelona" also appears as a city_name in the dataframe with the same country_name "Spain" and the same subdivision_1_name "Catalonia", so I will replace "Seva" with "Barcelona".
I am not able to create a proper apply function, so I have prepared a loop:
for i in range(df.shape[0]):
    if df.subdivision_2_name[i] in set(df.city_name[(df.country_name == df.country_name[i]) & (df.subdivision_1_name == df.subdivision_1_name[i])]):
        df.city_name[i] = df.subdivision_2_name[i]
Edit: this loop took 1637 seconds (~27 min) to run.
Please suggest a better method.
Use:
def f(x):
    if x['subdivision_2_name'].isin(x['city_name']).any():
        x['city_name'] = x['subdivision_2_name']
    return (x)
df1 = df.groupby(['country_name','subdivision_1_name','subdivision_2_name']).apply(f)
print (df1)
   country_name  subdivision_1_name        subdivision_2_name  city_name
0  Spain         Madrid                    Madrid              Sevilla La Nueva
1  Spain         Principality of Asturias  Asturias            Sevares
2  Spain         Catalonia                 Barcelona           Barcelona
3  Spain         Cantabria                 Cantabria           Setien
4  Spain         Basque Country            Biscay              Sestao
5  Spain         Navarre                   Navarre             Sesma
6  Spain         Catalonia                 Barcelona           Barcelona
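Another possible approach, sketched here as an alternative to the groupby answer rather than a benchmarked replacement, is to precompute the set of (country, subdivision_1, city) triples once and then test each row's (country, subdivision_1, subdivision_2) triple against it. That is a single pass over the rows instead of a filter per row. Note one difference from the loop in the question: the lookup set is taken from the original city names, not from values updated along the way.

```python
import pandas as pd

# The sample rows from the question.
df = pd.DataFrame({
    'country_name': ['Spain'] * 7,
    'subdivision_1_name': ['Madrid', 'Principality of Asturias', 'Catalonia',
                           'Cantabria', 'Basque Country', 'Navarre', 'Catalonia'],
    'subdivision_2_name': ['Madrid', 'Asturias', 'Barcelona', 'Cantabria',
                           'Biscay', 'Navarre', 'Barcelona'],
    'city_name': ['Sevilla La Nueva', 'Sevares', 'Seva', 'Setien',
                  'Sestao', 'Sesma', 'Barcelona'],
})

# Every (country, subdivision_1, city) combination that exists in the data.
existing_cities = set(zip(df['country_name'], df['subdivision_1_name'], df['city_name']))

# For each row, does its subdivision_2_name exist as a city in the same
# country and subdivision_1?
candidates = zip(df['country_name'], df['subdivision_1_name'], df['subdivision_2_name'])
mask = pd.Series([key in existing_cities for key in candidates], index=df.index)

# Overwrite city_name only where the condition holds.
df.loc[mask, 'city_name'] = df.loc[mask, 'subdivision_2_name']
print(df)
```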

How to test string contains elements in list and assign the target element to another column via Pandas

I have a single column of company names. Some of those names contain a country name (e.g., "China" in "China A1", "Finland" in "C1 in Finland"). I want to extract the country each company belongs to, based on the company name and a pre-defined list of country names.
The original dataframe df looks like this:
Company name Country
0 China A1
1 Australia-A2
2 Belgium_C1
3 C1 in Finland
4 D1 of Greece
5 E2 for Pakistan
For now, I can only come up with an inefficient method. Here is my code:
country_list = ['China','America','Greece','Pakistan','Finland','Belgium','Japan','British','Australia']
for t in country_list:
    df.loc[df['Company name'].str.contains(t), 'Country'] = t
The result shows like
Company name Country
0 China A1 China
1 Australia-A2 Australia
2 Belgium_C1 Belgium
3 C1 in Finland Finland
4 D1 of Greece Greece
5 E2 for Pakistan Pakistan
I expect that when country_list contains a large number of countries, the loop will be time-consuming. Is there a simpler way to tackle this problem?
Here's one way using str.extract:
df['Country'] = df['Company name'].str.extract('('+'|'.join(country_list)+')')
Company name Country
0 China A1 China
1 Australia-A2 Australia
2 Belgium_C1 Belgium
3 C1 in Finland Finland
4 D1 of Greece Greece
5 E2 for Pakistan Pakistan
You need series.str.extract() here:
pat = r'({})'.format('|'.join(country_list))
# pat-->'(China|America|Greece|Pakistan|Finland|Belgium|Japan|British|Australia)'
df['Country']=df['Company name'].str.extract(pat, expand=False)
Maybe using findall in case you have more than one country name in one cell
df["Company name"].str.findall('|'.join(country_list)).str[0]
Out[758]:
0 China
1 Australia
2 Belgium
3 Finland
4 Greece
5 Pakistan
Name: Company name, dtype: object
Using str.extract with Regex
Ex:
import pandas as pd
country_list = ['China','America','Greece','Pakistan','Finland','Belgium','Japan','British','Australia']
df = pd.read_csv(filename)
df["Country"] = df["Company_name"].str.extract("("+"|".join(country_list)+ ")")
print(df)
Output:
Company_name Country
0 China A1 China
1 Australia-A2 Australia
2 Belgium_C1 Belgium
3 C1 in Finland Finland
4 D1 of Greece Greece
5 E2 for Pakistan Pakistan
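The joined patterns above assume that no country name contains a regex metacharacter and that no name is a prefix of another. A slightly more defensive sketch, where the escaping, the longest-first ordering, and the extra 'F3 Ltd' row are my additions for illustration, could look like this:

```python
import re
import pandas as pd

country_list = ['China', 'America', 'Greece', 'Pakistan', 'Finland',
                'Belgium', 'Japan', 'British', 'Australia']
df = pd.DataFrame({'Company name': ['China A1', 'Australia-A2', 'Belgium_C1',
                                    'C1 in Finland', 'D1 of Greece',
                                    'E2 for Pakistan', 'F3 Ltd']})

# Longest names first, so a longer name wins over a shorter one it starts with.
alternatives = sorted(country_list, key=len, reverse=True)

# Escape each name in case it contains characters such as '.' or '('.
pat = '(' + '|'.join(re.escape(name) for name in alternatives) + ')'

# expand=False returns a Series; rows without a match become NaN.
df['Country'] = df['Company name'].str.extract(pat, expand=False)
print(df)
```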

String mode aggregation with group by function

I have dataframe which looks like below
Country City
UK London
USA Washington
UK London
UK Manchester
USA Washington
USA Chicago
I want to group country and aggregate on the most repeated city in a country
My desired output should be like
Country City
UK London
USA Washington
Because London and Washington appears 2 times whereas Manchester and Chicago appears only 1 time.
I tried
from scipy.stats import mode
df_summary = df.groupby('Country')['City'].\
    apply(lambda x: mode(x)[0][0]).reset_index()
But it seems it won't work on strings
I can't replicate your error, but you can use pd.Series.mode, which accepts strings and returns a series, using iat to extract the first value:
res = df.groupby('Country')['City'].apply(lambda x: x.mode().iat[0]).reset_index()
print(res)
Country City
0 UK London
1 USA Washington
try like below:
>>> df.City.mode()
0 London
1 Washington
dtype: object
Alternatively, you can use scipy.stats with a lambda:
import pandas as pd
from scipy import stats
df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]})
City
Country
UK London
USA Washington
# df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]}).reset_index()
However, it also gives the count if you don't want to return only the first value:
>>> df.groupby('Country').agg({'City': lambda x:stats.mode(x)})
City
Country
UK ([London], [2])
USA ([Washington], [2])
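If you want both the most frequent city and its count without depending on SciPy (whose stats.mode may reject string input in recent releases), a pandas-only sketch is to count each (Country, City) pair and keep the top row per country. The column name 'count' is my choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['UK', 'USA', 'UK', 'UK', 'USA', 'USA'],
    'City': ['London', 'Washington', 'London', 'Manchester', 'Washington', 'Chicago'],
})

# Number of occurrences of each city within each country.
counts = df.groupby(['Country', 'City']).size().reset_index(name='count')

# Sort so the most frequent city comes first per country, then keep that row.
top = (counts.sort_values(['Country', 'count'], ascending=[True, False])
             .drop_duplicates('Country')
             .reset_index(drop=True))
print(top)
```

When two cities tie within a country, this keeps whichever comes first alphabetically, because groupby sorts its keys.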
