Replicate entire dataframe 'x' times in Python

I have a sample dataframe:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
How can I replicate the above dataframe without changing the order?
Expected outcome:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC

How about:
pd.concat([df]*3, ignore_index=True)
Output:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC

You can use pd.concat:
result = pd.concat([df] * x).reset_index(drop=True)
print(result)
Output (for x=3):
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
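An alternative sketch using numpy's tile, which repeats the whole block of row positions and so also preserves the original order (df and x as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["Los Angles", "New York", "Texas", "Washington DC"]})
x = 3

# Tile the row positions [0..3] x times, then select them in that order
result = df.iloc[np.tile(np.arange(len(df)), x)].reset_index(drop=True)
print(result)
```

This avoids building x intermediate frames, though for small frames pd.concat is just as good.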

Related

Modify duplicated rows in dataframe (Python)

I am working with a dataframe in Pandas and need a solution that automatically modifies one of the columns, which has duplicate values. The column is of type 'object', and I need to rename the duplicated values. The dataframe is the following:
City Year Restaurants
0 New York 2001 20
1 Paris 2000 40
2 New York 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 1998 33
6 Barcelona 2001 15
As you can see, New York is repeated 3 times. I would like to create a new dataframe in which this value is automatically modified, so the result would be the following:
City Year Restaurants
0 New York 2001 2001 20
1 Paris 2000 40
2 New York 1999 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 1998 1998 33
6 Barcelona 2001 15
I would also be happy with "New York 1", "New York 2" and "New York 3". Any option would be good.
Use np.where to modify column City where it is duplicated:
import numpy as np
df['City'] = np.where(df['City'].duplicated(keep=False), df['City'] + ' ' + df['Year'].astype(str), df['City'])
A different approach, without numpy, is groupby.cumcount(), which gives you the alternative "New York 1", "New York 2" form, but applied to all values:
df['City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)
City Year Restaurants
0 New York 1 2001 20
1 Paris 1 2000 40
2 New York 2 1999 41
3 Los Angeles 1 2004 35
4 Madrid 1 2001 22
5 New York 3 1998 33
6 Barcelona 1 2001 15
To increment only the duplicated rows, use loc with a boolean mask:
df.loc[df['City'].duplicated(keep=False), 'City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)
City Year Restaurants
0 New York 1 2001 20
1 Paris 2000 40
2 New York 2 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 3 1998 33
6 Barcelona 2001 15
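A minimal end-to-end sketch of the np.where approach on the sample data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["New York", "Paris", "New York", "Los Angeles", "Madrid", "New York", "Barcelona"],
    "Year": [2001, 2000, 1999, 2004, 2001, 1998, 2001],
    "Restaurants": [20, 40, 41, 35, 22, 33, 15],
})

# Append the year only to cities that appear more than once
dup = df["City"].duplicated(keep=False)
df["City"] = np.where(dup, df["City"] + " " + df["Year"].astype(str), df["City"])
print(df)
```

Rows with a unique city ("Paris", "Madrid", ...) pass through unchanged; the three "New York" rows become "New York 2001", "New York 1999" and "New York 1998".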

Not getting all information in web scraping from website

I am scraping property details from a website, but it only gives me the information for 20 properties while there are 100. There is no timeout.
INPUT:
import requests
import pandas
from bs4 import BeautifulSoup

r = requests.get('https://www.century21.com/real-estate/new-york-ny/LCNYNEWYORK/')
c = r.content
soup = BeautifulSoup(c, 'html.parser')
cards = soup.find_all('div', {'class': 'property-card-primary-info'})
#print(soup.prettify())
#len(cards)
l = []
for item in cards:
    d = {}
    d['price'] = item.find('a', {'class': 'listing-price'}).text.replace('\n', '').replace(' ', '')
    add = item.find('div', {'class': 'property-address-info'})
    try:
        d['address'] = add.text.replace('\n', ' ').replace(' ', '')
    except AttributeError:
        d['address'] = 'None'
    try:
        d['beds'] = item.find('div', {'class': 'property-beds'}).find('strong').text.replace('\n', '')
    except AttributeError:
        d['beds'] = 'None'
    try:
        d['baths'] = item.find('div', {'class': 'property-baths'}).find('strong').text.replace('\n', '')
    except AttributeError:
        d['baths'] = 'None'
    try:
        d['area'] = item.find('div', {'class': 'property-sqft'}).find('strong').text
    except AttributeError:
        d['area'] = 'None'
    l.append(d)
df = pandas.DataFrame(l)
print(df)
OUTPUT:
price address beds baths area
0 $1,680,000 161 West 61st Street 3-F New York NY 10023 2 2 None
1 $1,225,000 350 East 82nd Street 2-J New York NY 10028 2 2 None
2 $2,550,000 845 United Nations Plaza 39-E New York NY 10017 2 2 None
3 $1,850,000 57 Reade Street 17-C New York NY 10007 1 1 None
4 $828,000 80 Park Avenue 4-E New York NY 10016 1 1 None
5 $850,000 635 West 42nd Street 19L New York NY 10036 1 1 None
6 $1,749,000 635 West 42nd Street 45D New York NY 10036 2 2 None
7 $1,175,000 340 East 64th Street 11-P New York NY 10065 2 1 None
8 $5,450,000 450 East 83rd Street 24-BC New York NY 10028 5 5 None
9 $4,500,000 524 East 72nd Street 32-CDE New York NY 10021 3 3 None
10 $1,700,000 635 West 42nd Street 42E New York NY 10036 1 1 None
11 $850,000 635 West 42nd Street 15JJ New York NY 10036 1 1 None
12 $800,000 635 West 42nd Street 16JJ New York NY 10036 1 1 None
13 $22,500,000 635 West 42nd Street 28K New York NY 10036 6 6 6,000
14 $1,125,000 635 West 42nd Street 15G New York NY 10036 1 1 None
15 $1,085,000 635 West 42nd Street 14N New York NY 10036 1 1 800
16 $900,000 635 West 42nd Street 18E New York NY 10036 1 1 None
17 $1,600,000 635 West 42nd Street 23K New York NY 10036 2 2 1,070
18 $1,250,000 635 West 42nd Street 24H New York NY 10036 2 1 800
19 $995,000 635 West 42nd Street 4F New York NY 10036 1 1 800
But there are 100 property listings on the website; why am I getting only 20? Is there any way to get all the property details?
That page shows only the first 20 items initially; the next 20 are loaded as you scroll. That is why you are getting only the first 20 items.
You could instead scrape from this URL.
https://www.century21.com/propsearch-async?lid=CNYNEWYORK&t=0&s=0&r=20&searchKey=244edb9b-0c67-41cc-aa75-1125928b7c87&p=1&o=comingsoon-asc
Change the s value in multiples of 20.
Example,
s=0: The first 20 items will be fetched
s=20: The next 20 items will be fetched
s=40: .....
and so on...
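A sketch of how the paging could be driven, assuming the async URL and its s (start offset) parameter behave as described above. The searchKey in the example URL looks session-specific, so it is replaced with a placeholder here; each generated URL would then be fetched with requests and parsed as in the original code:

```python
# Build one async-search URL per page of 20 results. The lid, searchKey and
# other query parameters are taken from the answer above; <your-search-key>
# is a placeholder for the session-specific key.
base = ('https://www.century21.com/propsearch-async'
        '?lid=CNYNEWYORK&t=0&s={start}&r=20'
        '&searchKey=<your-search-key>&p=1&o=comingsoon-asc')

urls = [base.format(start=s) for s in range(0, 100, 20)]  # s = 0, 20, 40, 60, 80
for url in urls:
    print(url)
```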

Keep rows in a pandas dataframe based on a column in another dataframe

I have this df
nhl_df=pd.read_csv("assets/nhl.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]
cities = cities.rename(columns={'Population (2016 est.)[8]': 'Population'})
cities = cities[['Metropolitan area','Population']]
print(cities)
Metropolitan area Population
0 New York City 20153634
1 Los Angeles 13310447
2 San Francisco Bay Area 6657982
3 Chicago 9512999
4 Dallas–Fort Worth 7233323
5 Washington, D.C. 6131977
6 Philadelphia 6070500
7 Boston 4794447
8 Minneapolis–Saint Paul 3551036
9 Denver 2853077
10 Miami–Fort Lauderdale 6066387
11 Phoenix 4661537
12 Detroit 4297617
13 Toronto 5928040
14 Houston 6772470
15 Atlanta 5789700
16 Tampa Bay Area 3032171
17 Pittsburgh 2342299
18 Cleveland 2055612
19 Seattle 3798902
20 Cincinnati 2165139
21 Kansas City 2104509
22 St. Louis 2807002
23 Baltimore 2798886
24 Charlotte 2474314
25 Indianapolis 2004230
26 Nashville 1865298
27 Milwaukee 1572482
28 New Orleans 1268883
29 Buffalo 1132804
30 Montreal 4098927
31 Vancouver 2463431
32 Orlando 2441257
33 Portland 2424955
34 Columbus 2041520
35 Calgary 1392609
36 Ottawa 1323783
37 Edmonton 1321426
38 Salt Lake City 1186187
39 Winnipeg 778489
40 San Diego 3317749
41 San Antonio 2429609
42 Sacramento 2296418
43 Las Vegas 2155664
44 Jacksonville 1478212
45 Oklahoma City 1373211
46 Memphis 1342842
47 Raleigh 1302946
48 Green Bay 318236
49 Hamilton 747545
50 Regina 236481
It has 50 rows.
My second df has 28 rows
W/L Ratio
city
Arizona 0.707317
Boston 2.500000
Buffalo 0.555556
Calgary 1.057143
Carolina 1.028571
Chicago 0.846154
Colorado 1.433333
Columbus 1.500000
Dallas–Fort Worth 1.312500
Detroit 0.769231
Edmonton 0.900000
Florida 1.466667
Los Angeles 1.655862
Minnesota 1.730769
Montreal 0.725000
Nashville 2.944444
New York City 1.111661
Ottawa 0.651163
Philadelphia 1.615385
Pittsburgh 1.620690
San Jose 1.666667
St. Louis 1.375000
Tampa Bay 2.347826
Toronto 1.884615
Vancouver 0.775000
Vegas 2.125000
Washington 1.884615
Winnipeg 2.600000
I need to remove from the first dataframe the rows where the metropolitan area is not in the city column of the 2nd data frame.
I tried this:
cond = nhl_df['city'].isin(cities['Metropolitan Area'])
But I got this error which makes no sense
KeyError: 'city'
The KeyError occurs because city is the index of your second dataframe, not a column. Select the Metropolitan area column (note the exact capitalization in your data) and test it against that index; since you want to keep the rows whose metropolitan area does appear there, filter with the mask directly:
cond = cities['Metropolitan area'].isin(nhl_df.index)
df = cities[cond]
If the first column is not the index in the second dataframe:
cond = cities['Metropolitan area'].isin(nhl_df['city'])
df = cities[cond]
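A minimal runnable sketch of the isin filter, with toy stand-ins for the two frames (shapes and names assumed from the question):

```python
import pandas as pd

cities = pd.DataFrame({
    "Metropolitan area": ["New York City", "Green Bay", "Boston", "Regina"],
    "Population": [20153634, 318236, 4794447, 236481],
})
wl = pd.DataFrame({"W/L Ratio": [1.111661, 2.5]},
                  index=pd.Index(["New York City", "Boston"], name="city"))

# Keep only the metropolitan areas that appear in wl's index
kept = cities[cities["Metropolitan area"].isin(wl.index)]
print(kept)
```

Green Bay and Regina are dropped because they have no matching entry in the second frame's index.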

How to explode string by uppercase and generate more rows [Pandas]

Here is my question.
If I have dataframe like:
Metropolitan area Population NHL
0 New York City 20153634 RangersIslandersDevils
1 Los Angeles 13310447 KingsDucks
2 Washington 23131112 New London
3 Alabama 11111112 Lighting
I want to get a new dataframe like:
Metropolitan area Population NHL
0 New York City 20153634 Rangers
1 New York City 20153634 Islanders
2 New York City 20153634 Devils
3 Los Angeles 13310447 Kings
4 Los Angeles 13310447 Ducks
5 Washington 23131112 New London
6 Alabama 11111112 Lighting
So, as you can see, I need to split the NHL team names on uppercase letters, but if a name contains a space it should be left intact.
You can use a combination of findall and explode:
out = (
df.assign(NHL=df["NHL"].str.findall(r"[A-Z](?:\s[A-Z]|[^A-Z])+"))
.explode("NHL")
.reset_index(drop=True)
)
print(out)
Metropolitan area Population NHL
0 New York City 20153634 Rangers
1 New York City 20153634 Islanders
2 New York City 20153634 Devils
3 Los Angeles 13310447 Kings
4 Los Angeles 13310447 Ducks
5 Washington 23131112 New London
6 Alabama 11111112 Lighting
Here is one way:
df.drop('NHL', axis=1).merge(df['NHL'].str.extractall(r'([A-Z](?:\s[A-Z]|[^A-Z])+)')
                                      .reset_index(level=1, drop=True)
                                      .rename(columns={0: 'NHL'}),
                             left_index=True, right_index=True)
Output:
Metropolitan area Population NHL
0 New York City 20153634 Rangers
0 New York City 20153634 Islanders
0 New York City 20153634 Devils
1 Los Angeles 13310447 Kings
1 Los Angeles 13310447 Ducks
2 Washington 23131112 New London
3 Alabama 11111112 Lighting
Borrowed @CameronRiddell's regex to parse the teams correctly.
Another option is to split the frame first and only explode the rows without spaces:
import re

df1 = df[df["NHL"].str.contains(" ")]
df2 = df[~df["NHL"].str.contains(" ")].copy()
df2["NHL"] = df2.apply(lambda x: re.findall(r"[A-Z][^A-Z]*", x["NHL"]), axis=1)
df2 = df2.explode("NHL")
pd.concat([df2, df1])
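A quick check of the regex the first two answers rely on. The alternation treats an uppercase letter preceded by a space as a continuation of the current match rather than the start of a new team:

```python
import re

pattern = r"[A-Z](?:\s[A-Z]|[^A-Z])+"

# Concatenated names split at each new uppercase letter
print(re.findall(pattern, "RangersIslandersDevils"))  # ['Rangers', 'Islanders', 'Devils']

# A space inside a name keeps it as one match
print(re.findall(pattern, "New London"))  # ['New London']
```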

Formatting pandas output data

I have a dataframe and want the output to be formatted to save paper for printing.
GameA GameB
Country
London 5 20
London 5 10
London 3 5
London 3 6
London 8
London 40
France 2 20
France 2 22
France 3
France 3
France 3
USA 10
Is there a way to format the dataframe to look like this:
GameA GameB
Country
London 5 London 20
London 5 London 10
London 3 London 5
London 3 London 6
London London 8
London London 40
GameA GameB
France 2 France 20
France 2 France 22
France 3
France 3
France 3
GameA
USA 10
The formatting above is a bit off because of how the text results copy-pasted (due to the missing values), but this should work with your actual data.
countries = df.index.unique()
for country in countries:
    print(df.loc[df.index == country])
    print(' ')
GameA GameB
Country
London 5 20
London 5 10
London 3 5
London 3 6
London 8 NaN
London 40 NaN
GameA GameB
Country
France 2 20
France 2 22
France 3 NaN
France 3 NaN
France 3 NaN
GameA GameB
Country
USA 10 NaN
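An equivalent, and arguably more idiomatic, way is to group on the index; a minimal sketch with made-up data (column and index names assumed from the question):

```python
import pandas as pd

df = pd.DataFrame(
    {"GameA": [5, 5, 2, 10], "GameB": [20, 10, 22, None]},
    index=pd.Index(["London", "London", "France", "USA"], name="Country"),
)

# groupby on the index yields one sub-frame per country; sort=False keeps
# the countries in their original order
for country, group in df.groupby(level=0, sort=False):
    print(group)
    print()
```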
