Edition Reviews Ratings BookCategory Price Edition_year
165 Paperback,– Import, 5 Jul 1996 4.5 out of 5 stars 2 customer reviews Sports 270.00 1996
166 Hardcover,– 18 Aug 2009 4.5 out of 5 stars 2 customer reviews Language, Linguistics & Writing 61.00 2009
167 Paperback,– 26 Jul 2018 3.7 out of 5 stars 23 customer reviews Crime, Thriller & Mystery 184.00 2018
168 Paperback,– Import, 22 Mar 2018 4.2 out of 5 stars 50 customer reviews Romance 70.00 2018
169 Paperback,– Abridged, Import 5.0 out of 5 stars 2 customer reviews Action & Adventure 418.00 port
170 Paperback,– 10 Jan 2018 4.7 out of 5 stars 4 customer reviews Sports 395.00 2018
171 Paperback,– Apr 2011 4.0 out of 5 stars 197 customer reviews Language, Linguistics & Writing 179.00 2011
172 Paperback,– 17 Feb 2009 5.0 out of 5 stars 2 customer reviews Comics & Mangas 782.00 2009
173 Paperback,– 22 Aug 2000 3.5 out of 5 stars 4 customer reviews Language, Linguistics & Writing 475.44 2000
174 Paperback,– 5 Jan 2012 4.0 out of 5 stars 30 customer reviews Humour 403.00 2012
Suppose that in this dataframe I want to delete the rows whose Edition_year value is NOT numeric, i.e. some of the values are strings. I have tried the .drop() method but cannot get the required output.
This is what I tried:
df = df.drop(df[df['Edition_year'].apply(lambda x: str(x).isalpha())].index, inplace = True)
You can determine which Edition_year values are numeric using
numeric_filter = df.Edition_year.astype(str).str.isnumeric()
and then use the filter to select only the desired rows
df = df.loc[numeric_filter]
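As a minimal sketch of the filter above, assuming a small made-up frame (the three rows and the names `sample`, `cleaned` and `cleaned_alt` are illustrative, not the asker's data). The original attempt most likely failed because drop(..., inplace=True) returns None, so assigning its result back to df throws the frame away. The second mask, built with pd.to_numeric(errors="coerce"), is an alternative that the answer does not mention; it also accepts values such as "2018.0".

```python
import pandas as pd

# Hypothetical sample: one Edition_year value is the stray string "port".
sample = pd.DataFrame({
    "Price": [270.0, 418.0, 395.0],
    "Edition_year": ["1996", "port", "2018"],
})

# Boolean mask: True where the value consists only of digits.
numeric_filter = sample["Edition_year"].astype(str).str.isnumeric()
cleaned = sample.loc[numeric_filter]

# Alternative mask: coerce to numbers and keep whatever did not become NaN.
coerced = pd.to_numeric(sample["Edition_year"], errors="coerce")
cleaned_alt = sample.loc[coerced.notna()]

print(cleaned)
```

Either mask keeps the rows with index 0 and 2 here; which one fits better depends on how messy the real column is.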
I have created a pivot table with 2 different types of values i) Number of apples from 2017-2020, ii) Number of people from 2017-2020. I want to create additional columns to calculate iii) Apples per person from 2017-2020. How can I do so?
Current code for pivot table:
tdf = df.pivot_table(index="States",
                     columns="Year",
                     values=["Number of Apples","Number of People"],
                     aggfunc= lambda x: len(x.unique()),
                     margins=True)
tdf
Here is my current pivot table:
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
...
I want my pivot table to look like this, where I add additional columns to divide Number of Apples by Number of People.
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5 6 5 5
West Virginia 8 35 25 12 2 5 5 4 4 7 5 3
I've tried a few things, such as:
Creating a new column by assigning to new column names, which does not work with a multi-level column index: tdf["Number of Apples per Person"][2017] = tdf["Number of Apples"][2017] / tdf["Number of People"][2017]
Using the other assignment method, tdf.assign(tdf["Number of Apples per Person"][2017] = tdf["Enrollment ID"][2017] / tdf["Student ID"][2017]), which gave this error: SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Appreciate any help! Thanks
What you can do here is stack() the year level into the index, compute the ratio on the resulting flat columns, and then unstack():
s = df.stack()
s['Number of Apples per Person'] = s['Number of Apples'] / s['Number of People']
df = s.unstack()
Output:
>>> df
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
One-liner:
df = df.stack().pipe(lambda x: x.assign(**{'Number of Apples per Person': x['Number of Apples'] / x['Number of People']})).unstack()
Given
df
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
You can index on the first level to get sub-frames and then divide. The division will be auto-aligned on the columns.
df['Number of Apples'] / df['Number of People']
2017 2018 2019 2020
California 5.0 6.0 5.0 5.0
West Virginia 4.0 7.0 5.0 3.0
Append this back to your DataFrame:
pd.concat([df, pd.concat([df['Number of Apples'] / df['Number of People']], keys=['Result'], axis=1)], axis=1)
Number of Apples Number of People Result
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
This is fast since it is completely vectorized.
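The division-and-concat approach can be tried end to end with the sketch below. It assumes a cut-down version of the pivot table with only the 2017 and 2018 columns, built by hand; the names `tdf`, `ratio` and `out` are illustrative. Instead of the nested pd.concat with keys, it relabels the ratio columns with a MultiIndex, which is just one way to give the new block a top-level name.

```python
import pandas as pd

# Hand-built stand-in for the pivot table (two years only, values from the question).
columns = pd.MultiIndex.from_product(
    [["Number of Apples", "Number of People"], [2017, 2018]]
)
tdf = pd.DataFrame(
    [[10, 18, 2, 3], [8, 35, 2, 5]],
    index=["California", "West Virginia"],
    columns=columns,
)

# Selecting a top-level label returns a sub-frame whose columns are the years,
# so the division aligns year by year.
ratio = tdf["Number of Apples"] / tdf["Number of People"]

# Put the ratio under its own top-level label before appending it.
ratio.columns = pd.MultiIndex.from_product(
    [["Number of Apples per Person"], ratio.columns]
)
out = pd.concat([tdf, ratio], axis=1)
print(out)
```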
I have a table from different companies' sales.
company_name sales year
A 200 2019
A 100 2018
A 30 2017
B 15 2019
B 30 2018
B 45 2017
Now I want to add the previous year's sales to the same row, like this:
company_name sales year previous_sales
A 200 2019 100
A 100 2018 30
A 30 2017 NaN
B 15 2019 30
B 30 2018 45
B 45 2017 NaN
I tried the following code, but it does not give the right result:
df["previous_sales"] = df.groupby(['company_name', 'year'])['sales'].shift()
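The attempt above groups by both company_name and year, so every group holds a single row and shift() has nothing to shift from; every value comes back as NaN. One possible fix, sketched here under the assumption that each company has exactly one row per year with no gaps, is to group by company only and shift within the year-sorted rows.

```python
import pandas as pd

# The table from the question.
df = pd.DataFrame({
    "company_name": ["A", "A", "A", "B", "B", "B"],
    "sales": [200, 100, 30, 15, 30, 45],
    "year": [2019, 2018, 2017, 2019, 2018, 2017],
})

# Sort so each company's rows run oldest to newest, then shift within the
# company only. The result is aligned back to df by its index on assignment.
df["previous_sales"] = (
    df.sort_values(["company_name", "year"])
      .groupby("company_name")["sales"]
      .shift(1)
)
print(df)
```

If some years are missing for a company, shift() would pull in the value from the nearest earlier row rather than from the year before; in that case a self-merge on company_name and year + 1 may be the safer route.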
I am trying to scrape a table and its contents from a Wikipedia page about Apple products. I am using Beautiful Soup to extract the data. I have the following code:
from bs4 import BeautifulSoup
appleurl="https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products"
import requests
import pandas as pad
import lxml.html as html
_content = requests.get(appleurl)
soup = BeautifulSoup(_content.content)
_table = soup.findChildren('table')
rows = _table[0].findChildren(['th','tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print ("The value in this cell is %s"% value)
I get the following values:
The value in this cell is 1976
The value in this cell is April 11
The value in this cell is Apple I
The value in this cell is Apple I
The value in this cell is September 1, 1977
The value in this cell is 1977
The value in this cell is April 1
The value in this cell is Apple II
The value in this cell is Apple II
The value in this cell is June 1, 1979
The value in this cell is 1978
The value in this cell is June 1
The value in this cell is Disk II
The value in this cell is Drives
The value in this cell is May 1, 1984
The value in this cell is 1979
The value in this cell is June 1
The value in this cell is Apple II Plus
The value in this cell is Apple II series
The value in this cell is December 1, 1982
The value in this cell is None
The value in this cell is None
The value in this cell is None
The value in this cell is Bell & Howell Disk II
The value in this cell is None
The value in this cell is Apple SilenType
The value in this cell is Printers
The value in this cell is October 1, 1982
The problem is that the year 1979 has multiple models, and they are not being extracted. I need all the models for 1979. My code extracts the data fine when there is a single row per year. What should I do when there are multiple rows for a single year, as in the first table of the page I linked?
The values I need are Year, Release Date and Model. The other two columns can be dropped.
I would really appreciate the help.
You can simply use pandas to do that: use pad.read_html().
import pandas as pad
df = pad.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')[0]
print(pad.concat([df['Year'], df['Release Date'], df['Model']], axis=1, sort=False))
Output:
Year Release Date Model
0 1976 April 11 Apple I
1 1977 April 1 Apple II
2 1978 June 1 Disk II
3 1979 June 1 Apple II Plus
4 1979 June 1 Apple II EuroPlus
5 1979 June 1 Apple II J-Plus
6 1979 June 1 Bell & Howell
7 1979 June 1 Bell & Howell Disk II
8 1979 June 1 Apple SilenType
Update for all tables.
import pandas as pad
dfs = pad.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
for df in dfs:
    print(pad.concat([df['Year'], df['Release Date'], df['Model']], axis=1, sort=False))
If you would like everything in a single DataFrame, use this code.
import pandas as pad
dfs = pad.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
# Keep the three wanted columns from every table, then concatenate once
# (DataFrame.append was removed in pandas 2.0).
frames = [df[['Year', 'Release Date', 'Model']] for df in dfs]
dffinal = pad.concat(frames, ignore_index=True, sort=False)
print(dffinal)
Output:
Year Release Date Model
0 1976 April 11 Apple I
1 1977 April 1 Apple II
2 1978 June 1 Disk II
3 1979 June 1 Apple II Plus
4 1979 June 1 Apple II EuroPlus
5 1979 June 1 Apple II J-Plus
6 1979 June 1 Bell & Howell
7 1979 June 1 Bell & Howell Disk II
8 1979 June 1 Apple SilenType
9 1980 September 1 Apple III
10 1980 September 1 Modem IIB (Novation CAT)
11 1980 September 1 Printer IIA (Centronics 779)
12 1980 September 1 Monitor III
13 1980 September 1 Monitor II (various third party)
14 1980 September 1 Disk III
15 1981 September 1 Apple ProFile
16 1981 December 1 Apple III Revised[1]
17 1982 October 1 Apple Dot Matrix Printer
18 1982 October 1 Apple Daisy Wheel Printer
19 1983 January 1 Apple IIe
20 1983 January 1 Apple Lisa[2]
21 1983 December 1 Apple III Plus
22 1983 December 1 Apple ImageWriter
23 1984 January 1 Apple Lisa 2
24 1984 January 24 Macintosh (128K)
25 1984 January 24 Macintosh External Disk Drive (400K)
26 1984 January 24 Apple Modem 300
27 1984 January 24 Apple Modem 1200
28 1984 April 1 Apple IIc
29 1984 April 1 Apple Scribe Printer
.. ... ... ...
606 2019 March 18 iPad Mini (5th gen)
607 2019 March 19 iMac with Retina 4K display (21.5") (Early 2019)
608 2019 March 19 iMac with Retina 5K display (27") (Early 2019)
609 2019 March 20 AirPods (2nd gen)
610 2019 May 21 MacBook Pro with Touch Bar (4th gen) (13") (Mi...
611 2019 May 21 MacBook Pro with Touch Bar (4th gen) (15") (Mi...
612 2019 May 28 iPod Touch (7th gen)
613 2019 July 9 MacBook Air (13") (2019)
614 2019 July 9 Macbook Pro with Touch Bar (4th gen) (13") (Mi...
615 2019 September 20 Apple Watch Series 5
616 2019 September 20 Apple Watch Hermès Series 5
617 2019 September 20 Apple Watch Nike Series 5
618 2019 September 20 Apple Watch Edition Series 5
619 2019 September 20 iPhone 8 (128 GB)
620 2019 September 20 iPhone 8 Plus (128 GB)
621 2019 September 20 iPhone 11
622 2019 September 20 iPhone 11 Pro
623 2019 September 20 iPhone 11 Pro Max
624 2019 September 25 iPad (2019)
625 2019 October 30 AirPods Pro
626 2019 November 13 MacBook Pro with Touch Bar (16")
627 2019 December 10 Mac Pro (Late 2019)
628 2019 December 10 Pro Display XDR
629 2020 March 18 NaN
630 2020 March 18 iPad Pro (11") (2nd gen)
631 2020 March 18 iPad Pro (12.9") (4th gen)
632 2020 March 18 Magic Keyboard for iPad Pro
633 2020 March 18 MacBook Air (Early 2020)
634 2020 April 24 iPhone SE (2nd gen)
635 2020 May 4 MacBook Pro with Magic Keyboard (Mid 2020)
[636 rows x 3 columns]
I have a pandas dataframe like the following:
Customer Id year
0 1510220024 2017
1 1510270013 2017
2 1511160047 2017
3 1512100014 2017
4 1603180006 2017
5 1605030030 2017
6 1605160013 2017
7 1606060008 2017
8 1510220024 2018
9 1606270014 2017
10 1608080011 2017
11 1608090002 2017
12 1511160047 2018
13 1606270014 2018
And I want to build the following matrix from the above dataframe:
2017 2018
2017 11 3
2018 3 3
This matrix says that there were 11 customers in total in 2017, that three of them also appeared in 2018, and so on. In reality I have 7 years of data, so it would be a 7x7 matrix. I have been struggling with this for a while but can't get it right.
merge + crosstab:
m = df.merge(df, left_on='Customer Id', right_on='Customer Id')
pd.crosstab(m.year_x, m.year_y)
year_y 2017 2018
year_x
2017 11 3
2018 3 3
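The reason this works: merging the frame with itself on Customer Id produces one row for every pair of years in which the same customer appears, and crosstab then counts those pairs. A self-contained sketch with the data from the question follows; the drop_duplicates() call is an added assumption, guarding against a customer being listed twice in the same year, which would otherwise inflate the counts.

```python
import pandas as pd

# The 14 rows from the question.
df = pd.DataFrame({
    "Customer Id": [
        1510220024, 1510270013, 1511160047, 1512100014, 1603180006,
        1605030030, 1605160013, 1606060008, 1510220024, 1606270014,
        1608080011, 1608090002, 1511160047, 1606270014,
    ],
    "year": [2017] * 8 + [2018] + [2017] * 3 + [2018] * 2,
})

# One row per distinct (customer, year) pair.
unique_pairs = df.drop_duplicates()

# Self-merge: each customer contributes one row per (year_x, year_y) combination.
m = unique_pairs.merge(unique_pairs, on="Customer Id")

# Count the combinations; the diagonal is the number of customers per year.
matrix = pd.crosstab(m["year_x"], m["year_y"])
print(matrix)
```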
I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
As you can see, this results in columns of different lengths. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add columns of varying length to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep=r"\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
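Since the stated goal is a boxplot, the difference between leaving the NaNs and filling them with 0 matters for the statistics. The sketch below rebuilds the frame from the question inline rather than from money.txt (the names `wide`, `medians_nan` and `medians_zero` are illustrative) and compares the per-year medians both ways. pandas skips NaN by default when computing the median, while zero-filled cells count as real awards of 0.

```python
import pandas as pd

# The data from the question, built inline instead of read from a file.
df = pd.DataFrame({
    "Name": ["Paul", "Susan", "Gary", "Paul", "Andrea", "Albert", "Hal", "Paul"],
    "Money": [57.0, 67.0, 54.0, 77.0, 20.0, 23.0, 26.0, 23.0],
    "Year": [2012, 2012, 2011, 2011, 2011, 2011, 2010, 2010],
})

# Position of each row within its year, then pivot years into columns.
df["yindex"] = df.groupby("Year").cumcount()
wide = df.pivot(index="yindex", columns="Year", values="Money")

# NaN padding is ignored by the median; zero padding is not.
medians_nan = wide.median()
medians_zero = wide.fillna(0).median()
print(medians_nan)
print(medians_zero)
```

With the NaNs left in place the 2010 median is 24.5; after fillna(0) it drops to 11.5, so for plotting distributions the NaN-padded frame is probably the one to use.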