I have a Pandas dataframe that looks like this:
Location       Number  Position
New York          111  1 through 12
Dallas            222  1 through 12
San Francisco     333  1 through 4
I would like to basically explode the dataframe based on the Position column so that the output looks like this:
Location  Number  Position
New York     111         1
New York     111         2
New York     111         3
New York     111         4
New York     111         5
New York     111         6
New York     111         7
New York     111         8
New York     111         9
New York     111        10
New York     111        11
New York     111        12
Dallas       222         1
etc. etc.
I would like to do this for every Location/Number pair, based on the values in Position. Is there a quick and easy way to do this?
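For reference, the example frame can be rebuilt like this (a quick sketch using the values from the table above):

import pandas as pd

df = pd.DataFrame({
    'Location': ['New York', 'Dallas', 'San Francisco'],
    'Number': [111, 222, 333],
    'Position': ['1 through 12', '1 through 12', '1 through 4'],
})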
One option using explode:
out = (df
       # parse 'a through b' into a per-row range, then explode it
       .assign(Position=[range(a, b + 1) for x in df['Position']
                         for a, b in [map(int, x.split(' through '))]])
       .explode('Position')
       )
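A variant of the same idea that some may find easier to read: split the bounds into their own frame first, then build the ranges and explode (a sketch under the same assumptions):

bounds = df['Position'].str.split(' through ', expand=True).astype(int)
out = (df
       .assign(Position=[range(a, b + 1) for a, b in zip(bounds[0], bounds[1])])
       .explode('Position')
       )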
Another approach using reindexing:
df2 = df['Position'].str.extract(r'(\d+) through (\d+)').astype(int)
#    0   1
# 0  1  12
# 1  1  12
# 2  1   4

rep = df2[1].sub(df2[0]).add(1)   # number of rows each original row expands to

out = (df
       .loc[df.index.repeat(rep)]
       .assign(Position=lambda d: d.groupby(level=0).cumcount().add(df2[0]))
       )
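For the example data, rep evaluates to 12, 12 and 4, so df.index.repeat(rep) duplicates each row that many times before the group-wise cumcount rebuilds the positions:

print(rep)
# 0    12
# 1    12
# 2     4
# dtype: int64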
output:
Location Number Position
0 New York 111 1
0 New York 111 2
0 New York 111 3
0 New York 111 4
0 New York 111 5
0 New York 111 6
0 New York 111 7
0 New York 111 8
0 New York 111 9
0 New York 111 10
0 New York 111 11
0 New York 111 12
1 Dallas 222 1
1 Dallas 222 2
1 Dallas 222 3
1 Dallas 222 4
1 Dallas 222 5
1 Dallas 222 6
1 Dallas 222 7
1 Dallas 222 8
1 Dallas 222 9
1 Dallas 222 10
1 Dallas 222 11
1 Dallas 222 12
2 San Francisco 333 1
2 San Francisco 333 2
2 San Francisco 333 3
2 San Francisco 333 4
I am working with a dataframe in Pandas and I need to automatically modify one of the columns, which has duplicate values. It is a column of type 'object', and I need to modify the names of the duplicate values. The dataframe is the following:
City Year Restaurants
0 New York 2001 20
1 Paris 2000 40
2 New York 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 1998 33
6 Barcelona 2001 15
As you can see, New York is repeated 3 times. I would like to create a new dataframe in which this value would be automatically modified and the result would be the following:
City Year Restaurants
0 New York 2001 2001 20
1 Paris 2000 40
2 New York 1999 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 1998 1998 33
6 Barcelona 2001 15
I would also be happy with "New York 1", "New York 2" and "New York 3". Any option would be good.
Use np.where to modify the City column where it is duplicated:
import numpy as np

df['City'] = np.where(df['City'].duplicated(keep=False),
                      df['City'] + ' ' + df['Year'].astype(str),
                      df['City'])
A different approach, without numpy, is groupby.cumcount(), which gives you the alternative "New York 1", "New York 2" labels, but applied to all values:
df['City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)
City Year Restaurants
0 New York 1 2001 20
1 Paris 1 2000 40
2 New York 2 1999 41
3 Los Angeles 1 2004 35
4 Madrid 1 2001 22
5 New York 3 1998 33
6 Barcelona 1 2001 15
To increment only the duplicated cases you can use loc with a boolean mask:
mask = df['City'].duplicated(keep=False)
df.loc[mask, 'City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)
City Year Restaurants
0 New York 1 2001 20
1 Paris 2000 40
2 New York 2 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 3 1998 33
6 Barcelona 2001 15
I am scraping property details from a website, but I only get 20 properties when there are 100. There is no timeout error.
INPUT:
import requests
import pandas
from bs4 import BeautifulSoup

r = requests.get('https://www.century21.com/real-estate/new-york-ny/LCNYNEWYORK/')
soup = BeautifulSoup(r.content, 'html.parser')

# one div per property card on the page
cards = soup.find_all('div', {'class': 'property-card-primary-info'})

l = []
for item in cards:
    d = {}
    d['price'] = item.find('a', {'class': 'listing-price'}).text.replace('\n', '').replace(' ', '')
    add = item.find('div', {'class': 'property-address-info'})
    try:
        d['address'] = add.text.replace('\n', ' ').replace(' ', '')
    except AttributeError:
        d['address'] = 'None'
    try:
        d['beds'] = item.find('div', {'class': 'property-beds'}).find('strong').text.replace('\n', '')
    except AttributeError:
        d['beds'] = 'None'
    try:
        d['baths'] = item.find('div', {'class': 'property-baths'}).find('strong').text.replace('\n', '')
    except AttributeError:
        d['baths'] = 'None'
    try:
        d['area'] = item.find('div', {'class': 'property-sqft'}).find('strong').text
    except AttributeError:
        d['area'] = 'None'
    l.append(d)

df = pandas.DataFrame(l)
print(df)
OUTPUT:
price address beds baths area
0 $1,680,000 161 West 61st Street 3-F New York NY 10023 2 2 None
1 $1,225,000 350 East 82nd Street 2-J New York NY 10028 2 2 None
2 $2,550,000 845 United Nations Plaza 39-E New York NY 10017 2 2 None
3 $1,850,000 57 Reade Street 17-C New York NY 10007 1 1 None
4 $828,000 80 Park Avenue 4-E New York NY 10016 1 1 None
5 $850,000 635 West 42nd Street 19L New York NY 10036 1 1 None
6 $1,749,000 635 West 42nd Street 45D New York NY 10036 2 2 None
7 $1,175,000 340 East 64th Street 11-P New York NY 10065 2 1 None
8 $5,450,000 450 East 83rd Street 24-BC New York NY 10028 5 5 None
9 $4,500,000 524 East 72nd Street 32-CDE New York NY 10021 3 3 None
10 $1,700,000 635 West 42nd Street 42E New York NY 10036 1 1 None
11 $850,000 635 West 42nd Street 15JJ New York NY 10036 1 1 None
12 $800,000 635 West 42nd Street 16JJ New York NY 10036 1 1 None
13 $22,500,000 635 West 42nd Street 28K New York NY 10036 6 6 6,000
14 $1,125,000 635 West 42nd Street 15G New York NY 10036 1 1 None
15 $1,085,000 635 West 42nd Street 14N New York NY 10036 1 1 800
16 $900,000 635 West 42nd Street 18E New York NY 10036 1 1 None
17 $1,600,000 635 West 42nd Street 23K New York NY 10036 2 2 1,070
18 $1,250,000 635 West 42nd Street 24H New York NY 10036 2 1 800
19 $995,000 635 West 42nd Street 4F New York NY 10036 1 1 800
But there are 100 property listings on the website, so why am I getting only 20? Is there any way to get all of the property details?
That page initially shows only the first 20 items; the next 20 are loaded as you scroll. That is why you are getting only the first 20.
You could instead scrape from this URL.
https://www.century21.com/propsearch-async?lid=CNYNEWYORK&t=0&s=0&r=20&searchKey=244edb9b-0c67-41cc-aa75-1125928b7c87&p=1&o=comingsoon-asc
Change the s value in multiples of 20. For example:
s=0: fetches the first 20 items
s=20: fetches the next 20 items
s=40: the 20 after that, and so on...
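A minimal paging sketch (the searchKey below is the one from your URL; it is presumably session-bound and may expire, in which case grab a fresh one from the page's network requests):

import requests
import pandas
from bs4 import BeautifulSoup

url = ('https://www.century21.com/propsearch-async?lid=CNYNEWYORK&t=0&s={}'
       '&r=20&searchKey=244edb9b-0c67-41cc-aa75-1125928b7c87&p=1&o=comingsoon-asc')

l = []
for s in range(0, 100, 20):  # s = 0, 20, 40, 60, 80
    r = requests.get(url.format(s))
    soup = BeautifulSoup(r.content, 'html.parser')
    for item in soup.find_all('div', {'class': 'property-card-primary-info'}):
        # ... same per-card parsing as in your original loop ...
        l.append({'price': item.find('a', {'class': 'listing-price'})
                               .text.replace('\n', '').replace(' ', '')})

df = pandas.DataFrame(l)
print(df)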
I have a sample dataframe:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
How can I replicate the above dataframe without changing the order?
Expected outcome:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
How about:
pd.concat([df]*3, ignore_index=True)
Output:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
You can also use pd.concat with reset_index, where x is the number of copies:
result = pd.concat([df] * x).reset_index(drop=True)
print(result)
Output (for x=3):
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
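If you prefer numpy, np.tile gives the same repetition (a sketch; going through the raw array drops the original dtypes on mixed frames, so pd.concat is usually preferable):

import numpy as np
import pandas as pd

out = pd.DataFrame(np.tile(df.to_numpy(), (3, 1)), columns=df.columns)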
I have a huge dataframe like this:
country1 import1 export1 country2 import2 export2
0 USA 12 82 Germany 12 82
1 Germany 65 31 France 65 31
2 England 74 47 Japan 74 47
3 Japan 23 55 England 23 55
4 France 48 12 Usa 48 12
export1 and import1 belong to country1, and export2 and import2 belong to country2. I want to total the export and import values by country.
Output may be like:
country | total_export | total_import
______________________________________________
USA | 12211221 | 212121
France | 4545 | 5454
...
...
Use wide_to_long first:
# reset_index supplies the unique id column that wide_to_long requires
df = (pd.wide_to_long(data.reset_index(), ['country', 'import', 'export'],
                      i='index', j='tmp')
        .reset_index(drop=True))
print(df)
country import export
0 USA 12 82
1 Germany 65 31
2 England 74 47
3 Japan 23 55
4 France 48 12
5 Germany 12 82
6 France 65 31
7 Japan 74 47
8 England 23 55
9 Usa 48 12
And then aggregate with sum:
df = df.groupby('country', as_index=False).sum()
print(df)
country import export
0 England 97 102
1 France 113 43
2 Germany 77 113
3 Japan 97 102
4 USA 12 82
5 Usa 48 12
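Note that USA and Usa end up as separate groups because groupby matches strings exactly. If Usa is a typo for USA (an assumption about your data), fix the labels and aggregate again:

df['country'] = df['country'].replace({'Usa': 'USA'})  # assumes the two spellings are the same country
df = df.groupby('country', as_index=False).sum()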
You can slice the table into two parts and concatenate them (DataFrame.append was removed in pandas 2.0, so use pd.concat):
func = lambda x: x.rstrip('0123456789')  # or lambda x: x[:-1] for single-digit suffixes

(pd.concat([data.iloc[:, :3].rename(func, axis=1),
            data.iloc[:, 3:].rename(func, axis=1)])
   .groupby('country').sum())
Output:
import export
country
England 97 102
France 113 43
Germany 77 113
Japan 97 102
USA 12 82
Usa 48 12
I have a dataframe like this:
user = pd.DataFrame({'User':['101','101','101','102','102','101','101','102','102','102'],'Country':['India','Japan','India','Brazil','Japan','UK','Austria','Japan','Singapore','UK']})
I want to apply a custom sort on Country so that Japan is at the top for both users. I have tried this, but it does not give my expected output:
user.sort_values(['User','Country'], ascending=[True, False], inplace=True)
My expected output:
expected_output = pd.DataFrame({'User':['101','101','101','101','101','102','102','102','102','102'],'Country':['Japan','India','India','UK','Austria','Japan','Japan','Brazil','Singapore','UK']})
I tried casting the column as a category and passing the categories with Japan at the top, but is there another approach? I don't want to pass the full list of countries every time; I just want to specify user 101 -> Japan or user 102 -> UK, and have the remaining rows keep their original order.
Thanks
Create a new helper key to sort by using map:
(user.assign(New=user.Country.map({'Japan': 1}).fillna(0))  # Japan -> 1, everything else -> 0
     .sort_values(['User', 'New'], ascending=[True, False], kind='stable')
     .drop(columns='New'))  # drop('New', 1) no longer works in pandas 2.0
Out[80]:
Country User
1 Japan 101
0 India 101
2 India 101
5 UK 101
6 Austria 101
4 Japan 102
7 Japan 102
3 Brazil 102
8 Singapore 102
9 UK 102
Update based on the comment:
mapdf = pd.DataFrame({'Country': ['Japan', 'UK'], 'User': ['101', '102'], 'New': [1, 1]})

(user.merge(mapdf, how='left').fillna(0)  # only (101, Japan) and (102, UK) get New=1
     .sort_values(['User', 'New'], ascending=[True, False], kind='stable')
     .drop(columns='New'))
Out[106]:
Country User
1 Japan 101
0 India 101
2 India 101
5 UK 101
6 Austria 101
9 UK 102
3 Brazil 102
4 Japan 102
7 Japan 102
8 Singapore 102
Use boolean indexing with pd.concat (DataFrame.append was removed in pandas 2.0), then a stable sort by column User:
user = (pd.concat([user[user['Country'] == 'Japan'],
                   user[user['Country'] != 'Japan']])
          .sort_values('User', kind='stable'))  # stable sort keeps Japan first within each user
Alternative solution using query:
user = (pd.concat([user.query('Country == "Japan"'),
                   user.query('Country != "Japan"')])
          .sort_values('User', kind='stable'))
print(user)
  User    Country
1  101      Japan
0  101      India
2  101      India
5  101         UK
6  101    Austria
4  102      Japan
7  102      Japan
3  102     Brazil
8  102  Singapore
9  102         UK