Adding column of values to pandas DataFrame - python

I'm doing a simple sentiment analysis and am stuck on something that I feel is very simple. I'm trying to add a new column with a set of values, in this example compound values. But after the for loop runs, every row holds the same value rather than one value per iteration. The compound values are the last column in the DataFrame. There should be a quick fix. Thanks!
for i, row in real.iterrows():
    real['compound'] = sid.polarity_scores(real['title'][i])['compound']
title text subject date compound
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews December 31, 2017 0.2263
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews December 29, 2017 0.2263
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews December 31, 2017 0.2263
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews December 30, 2017 0.2263
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews December 29, 2017 0.2263

IIUC:
real['compound'] = real.apply(lambda row: sid.polarity_scores(row['title'])['compound'], axis=1)
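The loop in the question fails because assigning a scalar to real['compound'] broadcasts it to every row, so each iteration overwrites the whole column and only the last row's score survives. Below is a minimal sketch of that behaviour and two ways around it. It assumes a made-up toy_score function standing in for sid.polarity_scores(...)['compound'], and a small invented DataFrame, so that it runs without the VADER lexicon.

```python
import pandas as pd


def toy_score(text):
    # Hypothetical stand-in for sid.polarity_scores(text)['compound'].
    return len(text) / 100


# Invented sample data, not the asker's DataFrame.
real = pd.DataFrame(
    {"title": ["short", "a longer title", "the longest title of all"]}
)

# Pattern from the question: a scalar assigned to a column is broadcast
# to every row, so each pass overwrites the entire column.
for i, row in real.iterrows():
    real["compound_bug"] = toy_score(row["title"])

# Writing one cell per iteration with .at keeps the per-row values.
real["compound_loop"] = float("nan")
for i, row in real.iterrows():
    real.at[i, "compound_loop"] = toy_score(row["title"])

# Applying the function to the column avoids the explicit loop.
real["compound"] = real["title"].apply(toy_score)

print(real)
```

Applying the scorer to the single 'title' column is equivalent to the row-wise apply in the answer above, and it avoids building a full row object for every call.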

Related

Creating a column identifier based on another column

I have a df below as
NAME
German Rural
1990 german
Mexican 1998
Mexican City
How can I create a new column based on the values of this column, for example when a value contains the term "German" or "german" (a case-insensitive match)?
Desired output
NAME | Identifier
German Rural Euro
1990 german Euro
Mexican 1998 South American
Mexican City South American
You could do that with something like the following.
import numpy as np

# Lower-case the names first so the substring match is case-insensitive
conditions = [df["NAME"].str.lower().str.contains("german"),
              df["NAME"].str.lower().str.contains("mexican")]
values = ["Euro", "South American"]
df["Identifier"] = np.select(conditions, values, default=np.nan)
print(df)
NAME Identifier
0 German Rural Euro
1 1990 german Euro
2 Mexican 1998 South American
3 Mexican City South American
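A variant of the same idea, as a rough sketch: str.contains accepts case=False, which makes the explicit lower-casing unnecessary. The extra "Tokyo 2001" row and the "Unknown" label are made up here to show what happens to names that match neither condition; a string default is used so that the choices and the default share one dtype.

```python
import numpy as np
import pandas as pd

# Sample data from the question plus one invented row that matches nothing.
df = pd.DataFrame(
    {"NAME": ["German Rural", "1990 german", "Mexican 1998",
              "Mexican City", "Tokyo 2001"]}
)

# case=False makes the substring test case-insensitive.
conditions = [
    df["NAME"].str.contains("german", case=False),
    df["NAME"].str.contains("mexican", case=False),
]
values = ["Euro", "South American"]

# "Unknown" is a hypothetical label for rows where no condition holds.
df["Identifier"] = np.select(conditions, values, default="Unknown")
print(df)
```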

When can `re.finditer` not return anything but string.index can?

Simply,
In [9]: [m.start() for m in re.finditer(answer_text, context)]
Out[9]: []
In [10]: context.index(answer_text)
Out[10]: 384
As you can see, re.finditer does not return a match, but the index method does. Is this expected?
In [18]: context
Out[18]: 'Fight for My Way (; lit. "Third-Rate My Way") is a South Korean television series starring Park Seo-joon and Kim Ji-won, with Ahn Jae-hong and Song Ha-yoon. It premiered on May 22, 2017 every Monday and Tuesday at 22:00 (KST) on KBS2. Kim Ji-won (Hangul: 김지원 ; Hanja: 金智媛 ; born October 19, 1992) is a South Korean actress. She gained attention through her roles in television series "The Heirs" (2013), "Descendants of the Sun" (2016) and "Fight for My Way" (2017). Yellow Hair 2 () is a 2001 South Korean film, written, produced, and directed by Kim Yu-min. It is the sequel to Kim\'s 1999 film "Yellow Hair", though it does not continue the same story or feature any of the same characters. The original film gained attention when it was refused a rating due to its sexual content, requiring some footage to be cut before it was allowed a public release. "Yellow Hair 2" attracted no less attention from the casting of transsexual actress Harisu in her first major film role. Ko Joo-yeon (born February 22, 1994) is a South Korean actress who has gained attention in the Korean film industry for her roles in "Blue Swallow" (2005) and "The Fox Family" (2006). In 2007 she appeared in the horror film "Epitaph" as Asako, a young girl suffering from overbearing nightmares and aphasia, becoming so immersed in the role that she had to deal with sudden nosebleeds while on set. Kyu Hyun Kim of "Koreanfilm.org" highlighted her performance in the film, saying, "[The cast\'s] acting thunder is stolen by the ridiculously pretty Ko Joo-yeon, another Korean child actress who we dearly hope continues her film career." Kim Ji-won (Hangul:\xa0김지원 ; born December 21, 1995), better known by his stage name Bobby (Hangul:\xa0바비 ) is a Korean-American rapper and singer. He is known as a member of the popular South Korean boy group iKON, signed under YG Entertainment. 
Descendants of the Sun () is a 2016 South Korean television series starring Song Joong-ki, Song Hye-kyo, Jin Goo, and Kim Ji-won. It aired on KBS2 from February 24 to April 14, 2016, on Wednesdays and Thursdays at 22:00 for 16 episodes. KBS then aired three additional special episodes from April 20 to April 22, 2016 containing highlights and the best scenes from the series, the drama\'s production process, behind-the-scenes footage, commentaries from cast members and the final epilogue. What\'s Up () is a 2011 South Korean television series starring Lim Ju-hwan, Daesung, Lim Ju-eun, Oh Man-seok, Jang Hee-jin, Lee Soo-hyuk, Kim Ji-won and Jo Jung-suk. It aired on MBN on Saturdays to Sundays at 23:00 for 20 episodes beginning December 3, 2011. The 2016 KBS Drama Awards (), presented by Korean Broadcasting System (KBS), was held on December 31, 2016 at KBS Hall in Yeouido, Seoul. It was hosted by Jun Hyun-moo, Park Bo-gum and Kim Ji-won. Gap-dong () is a 2014 South Korean television series starring Yoon Sang-hyun, Sung Dong-il, Kim Min-jung, Kim Ji-won and Lee Joon. It aired on cable channel tvN from April 11 to June 14, 2014 on Fridays and Saturdays at 20:40 for 20 episodes. Kim Ji-won (Hangul: 김지원; born 26 February 1995) is a South Korean female badminton player. In 2013, Kim and her national teammates won the Suhadinata Cup after beat Indonesian junior team in the final round of the mixed team event. She also won the girls\' doubles title partnered with Chae Yoo-jung.'
In [19]: answer_text
Out[19]: '"The Heirs" (2013)'
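One likely explanation, sketched below on a shortened copy of the context: re.finditer treats answer_text as a regular expression, and the parentheses in "(2013)" form a capturing group that matches the bare text 2013, not the literal characters ( and ). str.index does a plain substring search, so it still finds the text. Escaping the pattern with re.escape makes the regex search literal as well.

```python
import re

# Shortened excerpt of the context from the question.
context = (
    'roles in television series "The Heirs" (2013), '
    '"Descendants of the Sun" (2016)'
)
answer_text = '"The Heirs" (2013)'

# Unescaped: "(2013)" is a group, so the pattern looks for
# '"The Heirs" 2013' (no parentheses), which is not in the text.
raw = [m.start() for m in re.finditer(answer_text, context)]

# Escaped: every metacharacter is matched literally.
escaped = [m.start() for m in re.finditer(re.escape(answer_text), context)]

print(raw)
print(escaped)
print(context.index(answer_text))
```

With the escaped pattern, finditer reports the same start position as context.index(answer_text).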

Reading excel file with line breaks and tabs preserved using xlrd

I am trying to read cells of an Excel file that contain multi-line text. I am using xlrd 1.2.0. But when I print the cell text, or even write it to a .txt file, the line breaks and tabs (i.e. \n or \t) are not preserved.
Input:
File URL:
Excel file
Code:
import xlrd
filenamedotxlsx = '16.xlsx'
gall_artists = xlrd.open_workbook(filenamedotxlsx)
sheet = gall_artists.sheet_by_index(0)
bio = sheet.cell_value(0,1)
print(bio)
Output:
"Biography 2018-2019 Manoeuvre Textiles Atelier, Gent, Belgium 2017-2018 Thalielab, Brussels, Belgium 2017 Laboratoires d'Aubervilliers, Paris 2014-2015 Galveston Artist Residency (GAR), Texas 2014 MACBA, Barcelona & L'appartment 22, Morocco - Residency 2013 International Residence Recollets, Paris 2007 Gulbenkian & RSA Residency, BBC Natural History Dept, UK 2004-2006 Delfina Studios, UK Studio Award, London 1998-2000 De Ateliers, Post-grad Residency, Amsterdam 1995-1998 BA (Hons) Textile Art, Winchester School of Art UK "
Expected Output:
1975 Born in Hangzhou, Zhejiang, China
1980 Started to learn Chinese ink painting
2000 BA, Major in Oil Painting, China Academy of Art, Hangzhou, China
Curator, Hangzhou group exhibition for 6 female artists Untitled, 2000 Present
2007 MA, New Media, China Academy of Art, Hangzhou, China, studied under Jiao Jian
Lecturer, Department of Art, Zhejiang University, Hangzhou, China
2015 PhD, Calligraphy, China Academy of Art, Hangzhou, China, studied under Wang Dongling
Jury, 25th National Photographic Art Exhibition, China Millennium Monument, Beijing, China
2016 Guest professor, Faculty of Humanities, Zhejiang University, Hangzhou, China
Associate professor, Research Centre of Modern Calligraphy, China Academy of Art, Hangzhou, China
Researcher, Lanting Calligraphy Commune, Zhejiang, China
2017 Christie's produced a video about Chu Chu's art
2018 Featured by Poetry Calligraphy Painting Quarterly No.2, Beijing, China
Present Vice Secretary, Lanting Calligraphy Society, Hangzhou, China
Vice President, Zhejiang Female Calligraphers Association, Hangzhou, China
I have also used repr() to see if there are \n characters or not, but there aren't any.

Dataframes' subtraction and assignment gives back NAs

Let's suppose that I have a dataset (df_data) such as the following:
Time Geography Population
2016 England and Wales 58381200
2017 England and Wales 58744600
2016 Northern Ireland 1862100
2017 Northern Ireland 1870800
2016 Scotland 5404700
2017 Scotland 5424800
2016 Wales 3113200
2017 Wales 3125200
If I do the following:
df_nireland = df_data[df_data['Geography']=='Northern Ireland']
df_wales = df_data[df_data['Geography']=='Wales']
df_scotland = df_data[df_data['Geography']=='Scotland']
df_engl_n_wales = df_data[df_data['Geography']=='England and Wales']
df_england = df_engl_n_wales
df_england['Population'] = df_engl_n_wales['Population'] - df_wales['Population']
then the df_england has NA values at the column Population.
How can I fix this?
By the way, I have read the relevant posts, but none of the suggestions (.loc, .copy, etc.) worked for me.
This is really an organization problem. If you pivot, you can do the subtraction easily and ensure alignment on Time:
df_pop = df_data.pivot(index='Time', columns='Geography', values='Population')
df_pop['England'] = df_pop['England and Wales'] - df_pop['Wales']
Output df_pop:
Geography England and Wales Northern Ireland Scotland Wales England
Time
2016 58381200 1862100 5404700 3113200 55268000
2017 58744600 1870800 5424800 3125200 55619400
If you need to get back to your original format, then you can do:
df_pop.stack().to_frame('Population').reset_index()
# Time Geography Population
#0 2016 England and Wales 58381200
#1 2016 Northern Ireland 1862100
#2 2016 Scotland 5404700
#3 2016 Wales 3113200
#4 2016 England 55268000
#5 2017 England and Wales 58744600
#6 2017 Northern Ireland 1870800
#7 2017 Scotland 5424800
#8 2017 Wales 3125200
#9 2017 England 55619400
I simply had to reset the index of each subset so that the rows line up:
df_nireland = df_data[df_data['Geography']=='Northern Ireland'].reset_index(drop=True)
df_wales = df_data[df_data['Geography']=='Wales'].reset_index(drop=True)
df_scotland = df_data[df_data['Geography']=='Scotland'].reset_index(drop=True)
df_engl_n_wales = df_data[df_data['Geography']=='England and Wales'].reset_index(drop=True)
df_england = df_engl_n_wales
df_england['Population'] = df_engl_n_wales['Population'] - df_wales['Population']
Or, better in principle, since it retains the indices of the initial DataFrame, subtract the underlying values instead:
df_nireland = df_data[df_data['Geography']=='Northern Ireland']
df_wales = df_data[df_data['Geography']=='Wales']
df_scotland = df_data[df_data['Geography']=='Scotland']
df_engl_n_wales = df_data[df_data['Geography']=='England and Wales']
df_england = df_engl_n_wales
df_england['Population'] = df_engl_n_wales['Population'] - df_wales['Population'].values
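The NaN values come from index alignment: pandas subtracts Series label by label, and the "England and Wales" rows carry different row labels from the "Wales" rows, so no pair of labels matches. A minimal sketch, using only the four relevant rows from the question, shows the misaligned result next to a subtraction aligned on Time:

```python
import pandas as pd

# Subset of the data in the question (England and Wales, and Wales only).
df_data = pd.DataFrame({
    "Time": [2016, 2017, 2016, 2017],
    "Geography": ["England and Wales", "England and Wales", "Wales", "Wales"],
    "Population": [58381200, 58744600, 3113200, 3125200],
})

is_ew = df_data["Geography"] == "England and Wales"
is_wales = df_data["Geography"] == "Wales"

# Row labels are 0, 1 on one side and 2, 3 on the other, so nothing
# lines up and every result is NaN.
misaligned = df_data.loc[is_ew, "Population"] - df_data.loc[is_wales, "Population"]

# Indexing both sides by Time gives matching labels (2016, 2017).
england = (
    df_data[is_ew].set_index("Time")["Population"]
    - df_data[is_wales].set_index("Time")["Population"]
)

print(misaligned)
print(england)
```

Subtracting .values or resetting the index works only while both subsets are in the same Time order; aligning on Time explicitly does not depend on row order.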

(bs4) trying to differentiate different containers in a HTML page

I have a web page from the Houses of Parliament. It has information on MPs' declared interests, and I would like to store all MP interests for a project that I am thinking of.
root = 'https://publications.parliament.uk/pa/cm/cmregmem/160606/abbott_diane.htm'
root is an example webpage. I want my output to be a dictionary, as there are interests under different sub headings and the entry could be a list.
Problem: if you look at the page, the first interest (employment and earnings) is not wrapped in a container. Instead, the heading is its own tag and is not connected to the text underneath it. I could call soup.find_all('p', {'xmlns': 'http://www.w3.org/1999/xhtml'}),
but it would return the expenses headings and a few other headings, such as her name, and not the text under them,
which makes it difficult to iterate through the headings and store the information.
What would be the best way of iterating through the page, storing each heading and the information under each heading?
Something like this may work:
import urllib.request
from bs4 import BeautifulSoup

ret = {}
page = urllib.request.urlopen("https://publications.parliament.uk/pa/cm/cmregmem/160606/abbott_diane.htm")
content = page.read().decode('utf-8')
soup = BeautifulSoup(content, 'lxml')
valid = False
key = None  # stays None if the page has no heading at all
value = ""
for i in soup.find_all('p'):
    # a <p> that contains <strong> is treated as a heading
    if i.find('strong') and i.text is not None:
        # ignore first pass: nothing has been collected yet
        if valid:
            ret[key] = value
            value = ""
        valid = True
        key = i.text
    elif i.text is not None:
        # any other <p> is appended to the current heading's text
        value = value + " " + i.text
# get last entry
if key is not None:
    ret[key] = value
for x in ret:
    print(x)
    print(ret[x])
Outputs
4. Visits outside the UK
Name of donor: (1) Stop Aids (2) Aids Alliance Address of donor: (1) Grayston Centre, 28 Charles St, London N1 6HT (2) Preece House, 91-101 Davigdor Rd, Hove BN3 1RE Amount of donation (or estimate of the probable value): for myself and a member of staff, flights £2,784, accommodation £380.52, other travel costs £172, per diems £183; total £3,519.52. These costs were divided equally between both donors. Destination of visit: Uganda Date of visit: 11-14 November 2015 Purpose of visit: to visit the different organisations and charities (development) in regards to AIDS and HIV. (Registered 09 December 2015)Name of donor: Muslim Charities Forum Address of donor: 6 Whitehorse Mews, 37 Westminster Bridge Road, London SE1 7QD Amount of donation (or estimate of the probable value): for a member of staff and myself, return flights to Nairobi £5,170; one night's accommodation in Hargeisa £107.57; one night's accommodation in Borama £36.21; total £5,313.78 Destination of visit: Somaliland Date of visit: 7-10 April 2016 Purpose of visit: to visit the different refugee camps and charities (development) in regards to the severe drought in Somaliland. (Registered 18 May 2016)Name of donor: British-Swiss Chamber of Commerce Address of donor: Bleicherweg, 128002, Zurich, Switzerland Amount of donation (or estimate of the probable value): flights £200.14; one night's accommodation £177, train fare Geneva to Zurich £110; total £487.14 Destination of visit: Geneva and Zurich, Switzerland Date of visit: 28-29 April 2016 Purpose of visit: to participate in a public panel discussion in Geneva in front of British-Swiss Chamber of Commerce, its members and guests. (Registered 18 May 2016) 
2. (b) Any other support not included in Category 2(a)
Name of donor: Ann Pettifor Address of donor: private Amount of donation or nature and value if donation in kind: £1,651.07 towards rent of an office for my mayoral campaign Date received: 28 August 2015 Date accepted: 30 September 2015 Donor status: individual (Registered 08 October 2015)
1. Employment and earnings
Fees received for co-presenting BBC’s ‘This Week’ TV programme. Address: BBC Broadcasting House, Portland Place, London W1A 1AA. (Registered 04 November 2013)14 May 2015, received £700. Hours: 3 hrs. (Registered 03 June 2015)4 June 2015, received £700. Hours: 3 hrs. (Registered 01 July 2015)18 June 2015, received £700. Hours: 3 hrs. (Registered 01 July 2015)16 July 2015, received £700. Hours: 3 hrs. (Registered 07 August 2015)8 January 2016, received £700 for an appearance on 17 December 2015. Hours: 3 hrs. (Registered 14 January 2016)28 July 2015, received £4,000 for taking part in Grant Thornton’s panel at the JLA/FD Intelligence Post-election event. Address: JLA, 14 Berners Street, London W1T 3LJ. Hours: 5 hrs. (Registered 07 August 2015)23rd October 2015, received £1,500 for co-presenting BBC’s "Have I Got News for You" TV programme. Address: Hat Trick Productions, 33 Oval Road Camden, London NW1 7EA. Hours: 5 hrs. (Registered 26 October 2015)10 October 2015, received £1,400 for taking part in a talk at the New Wolsey Theatre in Ipswich. Address: Clive Conway Productions, 32 Grove St, Oxford OX2 7JT. Hours: 5 hrs. (Registered 26 October 2015)21 March 2016, received £4,000 via Speakers Corner (London) Ltd, Unit 31, Highbury Studios, 10 Hornsey Street, London N7 8EL, from Thompson Reuters, Canary Wharf, London E14 5EP, for speaking and consulting on a panel. Hours: 10 hrs. (Registered 06 April 2016)
Abbott, Ms Diane (Hackney North and Stoke Newington)
House of Commons
Session 2016-17
Publications on the internet
