Extract year from column with string of movie names

I have the following data in a DataFrame called train_df, with two columns, "gross" and "title name":
gross title name
760507625.0 Avatar (2009)
658672302.0 Titanic (1997)
652270625.0 Jurassic World (2015)
623357910.0 The Avengers (2012)
534858444.0 The Dark Knight (2008)
532177324.0 Rogue One (2016)
474544677.0 Star Wars: Episode I - The Phantom Menace (1999)
459005868.0 Avengers: Age of Ultron (2015)
448139099.0 The Dark Knight Rises (2012)
436471036.0 Shrek 2 (2004)
424668047.0 The Hunger Games: Catching Fire (2013)
423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006)
415004880.0 Toy Story 3 (2010)
409013994.0 Iron Man 3 (2013)
408084349.0 Captain America: Civil War (2016)
408010692.0 The Hunger Games (2012)
403706375.0 Spider-Man (2002)
402453882.0 Jurassic Park (1993)
402111870.0 Transformers: Revenge of the Fallen (2009)
400738009.0 Frozen (2013)
381011219.0 Harry Potter and the Deathly Hallows: Part 2 (2011)
380843261.0 Finding Nemo (2003)
380262555.0 Star Wars: Episode III - Revenge of the Sith (2005)
373585825.0 Spider-Man 2 (2004)
370782930.0 The Passion of the Christ (2004)
I would like to remove the year from "title name". The output should look as follows:
gross title name
760507625.0 Avatar
658672302.0 Titanic
652270625.0 Jurassic World
623357910.0 The Avengers
534858444.0 The Dark Knight
Ignore the gross column as it needs no changing.

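Using str.replace with a regular expression that matches the trailing whitespace and the four-digit year in parentheses, we can try: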
train_df["title name"] = train_df["title name"].str.replace(r'\s+\(\d{4}\)$', '', regex=True)
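As a minimal, self-contained sketch of the str.replace approach: the three rows below are copied from the question, while the extra "year" column is an assumption added for illustration, in case the year should be kept rather than discarded.

```python
import pandas as pd

# Small sample mirroring the question's layout (rows copied from the post).
train_df = pd.DataFrame({
    "gross": [760507625.0, 658672302.0, 436471036.0],
    "title name": ["Avatar (2009)", "Titanic (1997)", "Shrek 2 (2004)"],
})

# Optional: capture the four-digit year first ("year" is a hypothetical column name).
train_df["year"] = train_df["title name"].str.extract(r"\((\d{4})\)$", expand=False)

# Strip the whitespace and the trailing "(YYYY)" from each title.
train_df["title name"] = train_df["title name"].str.replace(
    r"\s+\(\d{4}\)$", "", regex=True
)

print(train_df)
```

Because the pattern is anchored with $, only a year at the very end of the title is removed, so digits inside the title itself (as in "Shrek 2") are left alone.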

Another solution, without re, using only .str.rsplit():
train_df['title name'] = train_df['title name'].str.rsplit(' (', n=1).str[0]
print(train_df)
Prints:
gross title name
0 760507625.0 Avatar
1 658672302.0 Titanic
2 652270625.0 Jurassic World
3 623357910.0 The Avengers
4 534858444.0 The Dark Knight
5 532177324.0 Rogue One
6 474544677.0 Star Wars: Episode I - The Phantom Menace
7 459005868.0 Avengers: Age of Ultron
8 448139099.0 The Dark Knight Rises
9 436471036.0 Shrek 2
10 424668047.0 The Hunger Games: Catching Fire
11 423315812.0 Pirates of the Caribbean: Dead Man's Chest
12 415004880.0 Toy Story 3
13 409013994.0 Iron Man 3
14 408084349.0 Captain America: Civil War
15 408010692.0 The Hunger Games
16 403706375.0 Spider-Man
17 402453882.0 Jurassic Park
18 402111870.0 Transformers: Revenge of the Fallen
19 400738009.0 Frozen
20 381011219.0 Harry Potter and the Deathly Hallows: Part 2
21 380843261.0 Finding Nemo
22 380262555.0 Star Wars: Episode III - Revenge of the Sith
23 373585825.0 Spider-Man 2
24 370782930.0 The Passion of the Christ
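The two approaches agree on the sample data, but they may differ on titles that end in a parenthesised suffix that is not a year. The sketch below compares them; the second and third titles are hypothetical and do not come from the question.

```python
import pandas as pd

# First title is from the question; the other two are made-up edge cases.
titles = pd.Series(["Avatar (2009)", "Untitled (Director's Cut)", "No Year"])

# Regex: removes only a trailing four-digit year in parentheses.
by_regex = titles.str.replace(r"\s+\(\d{4}\)$", "", regex=True)

# rsplit: cuts at the last " (", whatever follows it.
by_rsplit = titles.str.rsplit(" (", n=1).str[0]

print(pd.DataFrame({"original": titles, "regex": by_regex, "rsplit": by_rsplit}))
```

If every title is known to end with a year, rsplit is simpler; if the data may contain other parenthesised suffixes, the regex version is the more conservative choice.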

Related

re.IGNORECASE flag not working with .str.extract

I have the DataFrame below and have created a column to categorise rows based on specific text within a string.
However, when I pass the re.IGNORECASE flag, it is still case sensitive. Why?
DataFrame
import re
import pandas as pd
test_data = {
    "first_name": ['Bruce', 'Clark', 'Bruce', 'James', 'Nanny', 'Dot'],
    "last_name": ['Lee', 'Kent', 'Banner', 'Bond', 'Mc Phee', 'Cotton'],
    "title": ['mr', 'mr', 'mr', 'mr', 'mrs', 'mrs'],
    "text": ["He is a Kung Fu master", "Wears capes and tight Pants", "Cocktails shaken not stirred", "angry Green man", "suspect scottish accent", "East end legend"],
    "age": [32, 33, 28, 30, 42, 80]
}
df = pd.DataFrame(test_data)
Code
category_dict = {
    "Kung Fu": "Martial Art",
    "capes": "Clothing",
    "cocktails": "Drink",
    "green": "Colour",
    "scottish": "Scotland",
    "East": "Direction"
}
df['category'] = (
    df['text'].str.extract(
        fr"\b({'|'.join(category_dict.keys())})\b",
        flags=re.IGNORECASE)[0].map(category_dict))
Expected output
first_name last_name title text age category
0 Bruce Lee Mr He is a Kung Fu master 32 Martial Art
1 Clark Kent Mr Wears capes and tight Pants 33 Clothing
2 Bruce Banner Mr Cocktails shaken not stirred 28 Drink
3 James Bond Mr angry Green man 30 Colour
4 Nanny Mc Phee Mrs suspect scottish accent 42 Scotland
5 Dot Cotton Mrs East end legend 80 Direction
I have searched the docs and have found no pointers, so any help would be appreciated!

Here is one way to do it.
The issue is that, while extract ignores case, the extracted string keeps the casing found in the text, and mapping it through the dictionary is still case sensitive.
# create a dictionary with lowercase keys
# (a copy, in case you need to keep the original keys;
# alternatively, convert the category_dict keys to lowercase in place)
cd = {k.lower(): v for k, v in category_dict.items()}
# convert the extracted word to lowercase, then map it with the lowercase dict
df['category'] = (
    df['text'].str.extract(
        fr"\b({'|'.join(category_dict.keys())})\b",
        flags=re.IGNORECASE)[0].str.lower().map(cd))
df
df
first_name last_name title text age category
0 Bruce Lee mr He is a Kung Fu master 32 Martial Art
1 Clark Kent mr Wears capes and tight Pants 33 Clothing
2 Bruce Banner mr Cocktails shaken not stirred 28 Drink
3 James Bond mr angry Green man 30 Colour
4 Nanny Mc Phee mrs suspect scottish accent 42 Scotland
5 Dot Cotton mrs East end legend 80 Direction
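For reference, a self-contained sketch of the same idea on a reduced version of the data. Two things here are assumptions rather than part of the answer above: the keys are passed through re.escape in case a key ever contains a regex metacharacter, and a fourth row with no matching keyword is added to show that it maps to NaN.

```python
import re
import pandas as pd

# Reduced sample; the last row is a hypothetical non-matching text.
df = pd.DataFrame({
    "text": [
        "He is a Kung Fu master",
        "Cocktails shaken not stirred",
        "angry Green man",
        "nothing relevant",
    ]
})
category_dict = {"Kung Fu": "Martial Art", "cocktails": "Drink", "green": "Colour"}

# Lowercase lookup so the casing of the matched text no longer matters.
lookup = {key.lower(): value for key, value in category_dict.items()}

# Escape each key so it is matched literally inside the alternation.
pattern = r"\b(" + "|".join(re.escape(key) for key in category_dict) + r")\b"

df["category"] = (
    df["text"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # Series of matches
    .str.lower()                                               # normalise casing
    .map(lookup)                                               # look up category
)
print(df)
```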

Pandas filtering to get names of coaches who coach both men's and women's teams

I have a DataFrame like this:
Name Country Discipline Event
5 AIKMAN Siegfried Gottlieb Japan Hockey Men
6 AL SAADI Kais Germany Hockey Men
8 ALEKNO Vladimir Islamic Republic of Iran Volleyball Men
9 ALEKSEEV Alexey ROC Handball Women
11 ALSHEHRI Saad Saudi Arabia Football Men
.
.
.
I want to get the names of the coaches (Name) who coach both the Men and Women teams of a particular sport (Discipline).
Please help me with this.

You can use groupby and check for groups that have Event count >= 2:
filtered = df.groupby(['Discipline', 'Name']).filter(lambda x: x['Event'].count() >= 2)
If you want a list of unique names, then simply:
>>> filtered.Name.unique()
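Note that count() counts rows, so a coach listed twice for the same event would also pass the check. A possible variant, sketched here on made-up data (the coach names and rows are hypothetical), counts distinct Event values with nunique() instead:

```python
import pandas as pd

# Hypothetical rows: COACH A has Men and Women, COACH B has Men twice.
df = pd.DataFrame({
    "Name": ["COACH A", "COACH A", "COACH B", "COACH B", "COACH C"],
    "Discipline": ["Hockey", "Hockey", "Hockey", "Hockey", "Football"],
    "Event": ["Men", "Women", "Men", "Men", "Women"],
})

# Number of distinct events per (Discipline, Name) pair.
events_per_coach = df.groupby(["Discipline", "Name"])["Event"].nunique()

# Keep only pairs that cover at least two different events.
both = (
    events_per_coach[events_per_coach >= 2]
    .index.get_level_values("Name")
    .unique()
    .tolist()
)
print(both)
```

With this sample, COACH B is excluded even though the group has two rows, because both rows are for the Men event.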

When can `re.finditer` not return anything but string.index can?

Simply,
In [9]: [m.start() for m in re.finditer(answer_text, context)]
Out[9]: []
In [10]: context.index(answer_text)
Out[10]: 384
As you can see, re.finditer does not return a match, but the index method does. Is this expected?
In [18]: context
Out[18]: 'Fight for My Way (; lit. "Third-Rate My Way") is a South Korean television series starring Park Seo-joon and Kim Ji-won, with Ahn Jae-hong and Song Ha-yoon. It premiered on May 22, 2017 every Monday and Tuesday at 22:00 (KST) on KBS2. Kim Ji-won (Hangul: 김지원 ; Hanja: 金智媛 ; born October 19, 1992) is a South Korean actress. She gained attention through her roles in television series "The Heirs" (2013), "Descendants of the Sun" (2016) and "Fight for My Way" (2017). Yellow Hair 2 () is a 2001 South Korean film, written, produced, and directed by Kim Yu-min. It is the sequel to Kim\'s 1999 film "Yellow Hair", though it does not continue the same story or feature any of the same characters. The original film gained attention when it was refused a rating due to its sexual content, requiring some footage to be cut before it was allowed a public release. "Yellow Hair 2" attracted no less attention from the casting of transsexual actress Harisu in her first major film role. Ko Joo-yeon (born February 22, 1994) is a South Korean actress who has gained attention in the Korean film industry for her roles in "Blue Swallow" (2005) and "The Fox Family" (2006). In 2007 she appeared in the horror film "Epitaph" as Asako, a young girl suffering from overbearing nightmares and aphasia, becoming so immersed in the role that she had to deal with sudden nosebleeds while on set. Kyu Hyun Kim of "Koreanfilm.org" highlighted her performance in the film, saying, "[The cast\'s] acting thunder is stolen by the ridiculously pretty Ko Joo-yeon, another Korean child actress who we dearly hope continues her film career." Kim Ji-won (Hangul:\xa0김지원 ; born December 21, 1995), better known by his stage name Bobby (Hangul:\xa0바비 ) is a Korean-American rapper and singer. He is known as a member of the popular South Korean boy group iKON, signed under YG Entertainment. 
Descendants of the Sun () is a 2016 South Korean television series starring Song Joong-ki, Song Hye-kyo, Jin Goo, and Kim Ji-won. It aired on KBS2 from February 24 to April 14, 2016, on Wednesdays and Thursdays at 22:00 for 16 episodes. KBS then aired three additional special episodes from April 20 to April 22, 2016 containing highlights and the best scenes from the series, the drama\'s production process, behind-the-scenes footage, commentaries from cast members and the final epilogue. What\'s Up () is a 2011 South Korean television series starring Lim Ju-hwan, Daesung, Lim Ju-eun, Oh Man-seok, Jang Hee-jin, Lee Soo-hyuk, Kim Ji-won and Jo Jung-suk. It aired on MBN on Saturdays to Sundays at 23:00 for 20 episodes beginning December 3, 2011. The 2016 KBS Drama Awards (), presented by Korean Broadcasting System (KBS), was held on December 31, 2016 at KBS Hall in Yeouido, Seoul. It was hosted by Jun Hyun-moo, Park Bo-gum and Kim Ji-won. Gap-dong () is a 2014 South Korean television series starring Yoon Sang-hyun, Sung Dong-il, Kim Min-jung, Kim Ji-won and Lee Joon. It aired on cable channel tvN from April 11 to June 14, 2014 on Fridays and Saturdays at 20:40 for 20 episodes. Kim Ji-won (Hangul: 김지원; born 26 February 1995) is a South Korean female badminton player. In 2013, Kim and her national teammates won the Suhadinata Cup after beat Indonesian junior team in the final round of the mixed team event. She also won the girls\' doubles title partnered with Chae Yoo-jung.'
In [19]: answer_text
Out[19]: '"The Heirs" (2013)'
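A likely explanation: re.finditer treats answer_text as a regular expression, and the parentheses in it are read as a capturing group rather than as literal characters, so the pattern looks for "The Heirs" 2013 without parentheses. str.index does a plain substring search. The sketch below uses a one-sentence excerpt of the context above and re.escape to make the pattern literal.

```python
import re

# One sentence taken from the context shown in the question.
context = (
    'She gained attention through her roles in television series '
    '"The Heirs" (2013), "Descendants of the Sun" (2016) and '
    '"Fight for My Way" (2017).'
)
answer_text = '"The Heirs" (2013)'

# Unescaped: "(2013)" is a group, so the literal parentheses never match.
unescaped = [m.start() for m in re.finditer(answer_text, context)]

# Escaped: every metacharacter is matched literally.
escaped = [m.start() for m in re.finditer(re.escape(answer_text), context)]

print(unescaped, escaped, context.index(answer_text))
```

So the behaviour is expected whenever the search string contains regex metacharacters such as ( ) . * + ? [ ] and is not escaped.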

How to find the full name of the athlete in this case?

Let's say this is my data frame:
country Edition sports Athletes Medal Firstname Score
Germany 1990 Aquatics HAJOS, Alfred gold Alfred 3
Germany 1990 Aquatics HIRSCHMANN, Otto silver Otto 2
Germany 1990 Aquatics DRIVAS, Dimitrios silver Dimitrios 2
US 2008 Athletics MALOKINIS, Ioannis gold Ioannis 1
US 2008 Athletics HAJOS, Alfred silver Alfred 2
US 2009 Athletics CHASAPIS, Spiridon gold Spiridon 3
France 2010 Athletics CHOROPHAS, Efstathios gold Efstathios 3
France 2010 Athletics CHOROPHAS, Efstathios gold Efstathios 3
France 2010 golf HAJOS, Alfred Bronze Alfred 1
France 2011 golf ANDREOU, Joannis silver Joannis 2
Spain 2011 golf BURKE, Thomas gold Thomas 3
I am trying to find out which athlete's first name has the largest sum of scores.
I have tried the following:
df.groupby('Firstname')['Score'].sum().idxmax()
This returns the first name of the athlete, but I want to display the athlete's full name. Can anyone help me with this?
For example: I am getting 'Otto' as output, but I want to display HIRSCHMANN, Otto as output!
Note: I have noticed in my original data set that when I use groupby('Athlete') the answer is different.

idxmax will only give you the index label of the first maximal value. If multiple Firstname values share the max score, it will fail to find the others.
Try this instead:
sum_score = df.groupby('Firstname')['Score'].sum()
max_score = sum_score.max()
names = sum_score[sum_score == max_score].index
df[df['Firstname'].isin(names)]
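To get just the full names rather than the matching rows, one option is to select the Athletes column and deduplicate it. This is a sketch on a subset of the rows from the question, not a definitive solution:

```python
import pandas as pd

# Subset of the question's rows, keeping only the columns used here.
df = pd.DataFrame({
    "Athletes": [
        "HAJOS, Alfred", "HIRSCHMANN, Otto", "HAJOS, Alfred",
        "CHOROPHAS, Efstathios", "CHOROPHAS, Efstathios", "HAJOS, Alfred",
    ],
    "Firstname": ["Alfred", "Otto", "Alfred", "Efstathios", "Efstathios", "Alfred"],
    "Score": [3, 2, 2, 3, 3, 1],
})

# Total score per first name, then every first name tied for the maximum.
sum_score = df.groupby("Firstname")["Score"].sum()
names = sum_score[sum_score == sum_score.max()].index

# Full names of the athletes whose first name is among the winners.
full_names = df.loc[df["Firstname"].isin(names), "Athletes"].unique().tolist()
print(full_names)
```

If two different athletes share a first name, their scores are summed together when grouping by Firstname, which is one reason grouping by the full name can give a different answer.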

Reading excel file with line breaks and tabs preserved using xlrd

I am trying to read Excel cells that contain multi-line text. I am using xlrd 1.2.0. But when I print the cell text, or even write it to a .txt file, the line breaks and tabs (i.e. \n or \t) are not preserved.
Input:
File URL:
Excel file
Code:
import xlrd
filenamedotxlsx = '16.xlsx'
gall_artists = xlrd.open_workbook(filenamedotxlsx)
sheet = gall_artists.sheet_by_index(0)
bio = sheet.cell_value(0,1)
print(bio)
Output:
"Biography 2018-2019 Manoeuvre Textiles Atelier, Gent, Belgium 2017-2018 Thalielab, Brussels, Belgium 2017 Laboratoires d'Aubervilliers, Paris 2014-2015 Galveston Artist Residency (GAR), Texas 2014 MACBA, Barcelona & L'appartment 22, Morocco - Residency 2013 International Residence Recollets, Paris 2007 Gulbenkian & RSA Residency, BBC Natural History Dept, UK 2004-2006 Delfina Studios, UK Studio Award, London 1998-2000 De Ateliers, Post-grad Residency, Amsterdam 1995-1998 BA (Hons) Textile Art, Winchester School of Art UK "
Expected Output:
1975 Born in Hangzhou, Zhejiang, China
1980 Started to learn Chinese ink painting
2000 BA, Major in Oil Painting, China Academy of Art, Hangzhou, China
Curator, Hangzhou group exhibition for 6 female artists Untitled, 2000 Present
2007 MA, New Media, China Academy of Art, Hangzhou, China, studied under Jiao Jian
Lecturer, Department of Art, Zhejiang University, Hangzhou, China
2015 PhD, Calligraphy, China Academy of Art, Hangzhou, China, studied under Wang Dongling
Jury, 25th National Photographic Art Exhibition, China Millennium Monument, Beijing, China
2016 Guest professor, Faculty of Humanities, Zhejiang University, Hangzhou, China
Associate professor, Research Centre of Modern Calligraphy, China Academy of Art, Hangzhou, China
Researcher, Lanting Calligraphy Commune, Zhejiang, China
2017 Christie's produced a video about Chu Chu's art
2018 Featured by Poetry Calligraphy Painting Quarterly No.2, Beijing, China
Present Vice Secretary, Lanting Calligraphy Society, Hangzhou, China
Vice President, Zhejiang Female Calligraphers Association, Hangzhou, China
I have also used repr() to see if there are \n characters or not, but there aren't any.
