Creating a column identifier based on another column - python

I have a df as below:
NAME
German Rural
1990 german
Mexican 1998
Mexican City
How can I create a new column based on the values of this column, i.e. if the value contains the term "german" in any capitalization (a case-insensitive match)?
Desired output
NAME         | Identifier
German Rural | Euro
1990 german  | Euro
Mexican 1998 | South American
Mexican City | South American

You could do that with something like the following.
import numpy as np

conditions = [df["NAME"].str.lower().str.contains("german"),
              df["NAME"].str.lower().str.contains("mexican")]
values = ["Euro", "South American"]
df["Identifier"] = np.select(conditions, values, default=np.nan)
print(df)
           NAME      Identifier
0  German Rural            Euro
1   1990 german            Euro
2  Mexican 1998  South American
3  Mexican City  South American
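As a side note, str.contains can handle the case-insensitive match itself via case=False, which saves the .str.lower() call. Here is a minimal self-contained sketch of the same idea; the "Other" fallback label is an assumption (the answer above used np.nan as the default).
import numpy as np
import pandas as pd

df = pd.DataFrame({"NAME": ["German Rural", "1990 german",
                            "Mexican 1998", "Mexican City"]})

# case=False makes the substring match case-insensitive
conditions = [df["NAME"].str.contains("german", case=False),
              df["NAME"].str.contains("mexican", case=False)]
values = ["Euro", "South American"]
# "Other" is an assumed fallback label for rows matching neither pattern
df["Identifier"] = np.select(conditions, values, default="Other")
print(df)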

Columns must be same length as key Error python

I have 2 CSV files. In file1 I have a list of research group names; in file2 I have a list of the research groups' full names, with their locations as well. I want to join these 2 CSV files wherever the names match on words.
I am using JupyterLab, and the line below raises Pandas ValueError: "Columns must be same length as key".
df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
file1.csv has about 5,000 rows of data and file2.csv about 15,000.
file1.csv
research_groups_names_f1
Chinese Academy of Sciences (CAS)
CAS
U-M
UQ
University of California, Los Angeles
Harvard University
file2.csv
research_groups_names_f2           Locatio_f2
Chinese Academy of Sciences (CAS)  China
University of Michigan (U-M)       USA
The University of Queensland (UQ)  USA
University of California           USA
file_output.csv
research_groups_names_f1               research_groups_names_f2           Locatio_f2
Chinese Academy of Sciences            Chinese Academy of Sciences (CAS)  China
CAS                                    Chinese Academy of Sciences (CAS)  China
U-M                                    University of Michigan (U-M)       USA
UQ                                     The University of Queensland (UQ)  Australia
Harvard University                     Not found                          USA
University of California, Los Angeles  University of California           USA
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df1 = df1.add_prefix('f1_')
df2 = df2.add_prefix('f2_')

def fn(row):
    for _, n in df2.iterrows():
        if (
            n["research_groups_names_f1"] == row["research_groups_names_f2"]
            or row["research_groups_names_f1"] in n["research_groups_names_f2"]
        ):
            return n

df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)
The issue here is that you're trying to merge on some very different values. Fuzzy matching may not help, because the distance between CAS and Chinese Academy of Sciences (CAS) is quite large; the two have very little in common. You'll have to develop some custom approach based on your understanding of what the possible group names could be. Here is one approach that gets you most of the way there.
The idea is to match on the university name OR the abbreviation. So in df2 we can split off the abbreviation and explode it into a new row, removing the parentheses, and in df we remove any abbreviation surrounded by parentheses.
The only leftover value is UCLA, which is the only sample that doesn't follow the same structure as the others. In that case the fuzzy matching mentioned above probably would help; see the sketch after the output below.
import pandas as pd

df = pd.DataFrame({'research_groups_names_f1': [
    'Chinese Academy of Sciences (CAS)',
    'CAS',
    'U-M',
    'UQ',
    'University of California, Los Angeles',
    'Harvard University']})
df2 = pd.DataFrame({'research_groups_names_f2': ['Chinese Academy of Sciences (CAS)',
                                                 'University of Michigan (U-M)',
                                                 'The University of Queensland (UQ)',
                                                 'University of California'],
                    'Locatio_f2': ['China', 'USA', 'USA', 'USA']})

# split "Name (ABBR)" into ["Name ", "ABBR)"] and explode so each part gets its own row
df2['key'] = df2['research_groups_names_f2'].str.split(r'\(')
df2 = df2.explode('key')
# strip any leftover parentheses from the key
df2['key'] = df2['key'].str.replace(r'\(|\)', '', regex=True)
# in df, drop any "(ABBR)" suffix so the full names line up
df['key'] = df['research_groups_names_f1'].str.replace(r'\(.*\)', '', regex=True)
df.merge(df2, on='key', how='left').drop(columns='key')
Output
research_groups_names_f1 research_groups_names_f2 Locatio_f2
0 Chinese Academy of Sciences (CAS) Chinese Academy of Sciences (CAS) China
1 CAS Chinese Academy of Sciences (CAS) China
2 U-M University of Michigan (U-M) USA
3 UQ The University of Queensland (UQ) USA
4 University of California, Los Angeles NaN NaN
5 Harvard University NaN NaN
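For that leftover UCLA-style row, here is a hedged sketch of the fuzzy-matching fallback using only the standard library's difflib; the 0.6 cutoff is an assumption you would tune against your real data.
import difflib

# candidate full names from df2 (unique, so the explode above doesn't matter)
candidates = df2['research_groups_names_f2'].unique().tolist()

def fuzzy_key(name, cutoff=0.6):
    # return the closest df2 name above the cutoff, else None
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_key('University of California, Los Angeles'))
# -> 'University of California' with these sample values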

How to find the full name of the athlete in this case?

Let's say this is my data frame:
country Edition sports Athletes Medal Firstname Score
Germany 1990 Aquatics HAJOS, Alfred gold Alfred 3
Germany 1990 Aquatics HIRSCHMANN, Otto silver Otto 2
Germany 1990 Aquatics DRIVAS, Dimitrios silver Dimitrios 2
US 2008 Athletics MALOKINIS, Ioannis gold Ioannis 1
US 2008 Athletics HAJOS, Alfred silver Alfred 2
US 2009 Athletics CHASAPIS, Spiridon gold Spiridon 3
France 2010 Athletics CHOROPHAS, Efstathios gold Efstathios 3
France 2010 Athletics CHOROPHAS, Efstathios gold Efstathios 3
France 2010 golf HAJOS, Alfred Bronze Alfred 1
France 2011 golf ANDREOU, Joannis silver Joannis 2
Spain 2011 golf BURKE, Thomas gold Thomas 3
I am trying to find out which athlete's first name has the largest sum of scores.
I have tried the following:
df.groupby('Firstname')['Score'].sum().idxmax()
This returns the first name of the athlete, but I want to display the athlete's full name. Can anyone help me with this?
For example: I am getting 'Otto' as output, but I want to display 'HIRSCHMANN, Otto' instead.
Note: what I have noticed in my original data set is that when I groupby('Athlete') the answer is different.
idxmax will only give you the index of the first row with the maximal value. If multiple Firstnames share the max score, it will fail to find all of them.
Try this instead:
# total score per first name
sum_score = df.groupby('Firstname')['Score'].sum()
max_score = sum_score.max()
# all first names that tie for the maximum
names = sum_score[sum_score == max_score].index
df[df['Firstname'].isin(names)]
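That last line returns the full matching rows. If you only want the full names themselves, a small follow-on sketch (using the Athletes column from the sample data):
# unique full names of the top-scoring athletes
df.loc[df['Firstname'].isin(names), 'Athletes'].unique()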

Reading excel file with line breaks and tabs preserved using xlrd

I am trying to read Excel file cells that contain multi-line text. I am using xlrd 1.2.0, but when I print the cell text, or even write it to a .txt file, it doesn't preserve the line breaks or tabs, i.e. \n or \t.
Input:
File URL:
Excel file
Code:
import xlrd

filenamedotxlsx = '16.xlsx'
gall_artists = xlrd.open_workbook(filenamedotxlsx)
sheet = gall_artists.sheet_by_index(0)
bio = sheet.cell_value(0, 1)  # cell at row 0, column 1
print(bio)
Output:
"Biography 2018-2019 Manoeuvre Textiles Atelier, Gent, Belgium 2017-2018 Thalielab, Brussels, Belgium 2017 Laboratoires d'Aubervilliers, Paris 2014-2015 Galveston Artist Residency (GAR), Texas 2014 MACBA, Barcelona & L'appartment 22, Morocco - Residency 2013 International Residence Recollets, Paris 2007 Gulbenkian & RSA Residency, BBC Natural History Dept, UK 2004-2006 Delfina Studios, UK Studio Award, London 1998-2000 De Ateliers, Post-grad Residency, Amsterdam 1995-1998 BA (Hons) Textile Art, Winchester School of Art UK "
Expected Output:
1975 Born in Hangzhou, Zhejiang, China
1980 Started to learn Chinese ink painting
2000 BA, Major in Oil Painting, China Academy of Art, Hangzhou, China
Curator, Hangzhou group exhibition for 6 female artists Untitled, 2000 Present
2007 MA, New Media, China Academy of Art, Hangzhou, China, studied under Jiao Jian
Lecturer, Department of Art, Zhejiang University, Hangzhou, China
2015 PhD, Calligraphy, China Academy of Art, Hangzhou, China, studied under Wang Dongling
Jury, 25th National Photographic Art Exhibition, China Millennium Monument, Beijing, China
2016 Guest professor, Faculty of Humanities, Zhejiang University, Hangzhou, China
Associate professor, Research Centre of Modern Calligraphy, China Academy of Art, Hangzhou, China
Researcher, Lanting Calligraphy Commune, Zhejiang, China
2017 Christie's produced a video about Chu Chu's art
2018 Featured by Poetry Calligraphy Painting Quarterly No.2, Beijing, China
Present Vice Secretary, Lanting Calligraphy Society, Hangzhou, China
Vice President, Zhejiang Female Calligraphers Association, Hangzhou, China
I have also used repr() to see if there are \n characters or not, but there aren't any.
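One way to narrow this down (a diagnostic sketch, not a confirmed fix) is to read the same cell with openpyxl and inspect its repr. If the newlines are missing there too, they were never stored in the cell value, and the visual line breaks in Excel come from wrap formatting rather than \n characters. Note also that xlrd 2.0+ dropped .xlsx support entirely, so openpyxl is the usual library for .xlsx files anyway.
from openpyxl import load_workbook

wb = load_workbook('16.xlsx')
ws = wb.worksheets[0]
# openpyxl is 1-indexed: row 1, column 2 corresponds to xlrd's cell (0, 1)
print(repr(ws.cell(row=1, column=2).value))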

Group a dataframe by a column and concatenate strings in another

I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront
3 M5A Downtown Toronto Regent Park
4 M6A North York Lawrence Heights
5 M6A North York Lawrence Manor
6 M7A Queen's Park Not assigned
7 M9A Etobicoke Islington Avenue
8 M1B Scarborough Rouge
9 M1B Scarborough Malvern
10 M3B North York Don Mills North
...
I want to make a grouped dataframe where Neighbourhood is grouped by Postcode, so that all neighbourhoods sharing a Postcode become one concatenated string of neighbourhood names...
something like:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not return a new dataframe: df is unchanged when I inspect it after running.
if I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into a Series object rather than a dataframe?
Use this code
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood':lambda x:', '.join(x)}).reset_index()
reset_index() will take your group-by columns out of the index, return them as regular columns of the dataframe, and create a new integer index.
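A slightly shorter equivalent (same result, assuming each Postcode maps to a single Borough, as in the sample) is to pass as_index=False and aggregate the column directly:
# join the Neighbourhood strings within each (Postcode, Borough) group
new_df = df.groupby(['Postcode', 'Borough'], as_index=False)['Neighbourhood'].agg(', '.join)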

Filter and drop rows by proportion python

I have a dataframe called wine that contains a bunch of rows I need to drop.
How do I drop all rows whose 'country' value makes up less than 1% of the whole?
Here are the proportions:
#proportion of wine countries in the data set
wine.country.value_counts() / len(wine.country)
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
New Zealand 0.009069
Israel 0.006133
Greece 0.004493
Canada 0.002526
Hungary 0.001755
Romania 0.001558
...
I got lazy and didn't include all of the results, but I think you catch my drift. I need to drop all rows with proportions less than .01.
Here is the head of my dataframe:
country designation points price province taster_name variety year price_category
Portugal Avidagos 87 15.0 Douro Roger Voss Portuguese Red 2011.0 low
You can use something like this (assuming the per-country proportion has been joined back onto the dataframe as a proportion column):
df = df[df.proportion >= .01]
From that dataset it should give you something like this:
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
Figured it out:
# share of each country among all rows
country_filter = wine.country.value_counts(normalize=True) > 0.01
# keep only countries whose share exceeds 1%
country_index = country_filter[country_filter].index
wine = wine[wine.country.isin(country_index)]
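For what it's worth, the same filter fits in two lines by mapping each row's country to its proportion (a sketch against the same wine dataframe):
# map each row's country to that country's share of the dataset, then filter
proportions = wine['country'].map(wine['country'].value_counts(normalize=True))
wine = wine[proportions >= 0.01]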
