How to convert a pandas DataFrame into multiple DataFrames? - python

My DataFrame:
import pandas

df = pandas.DataFrame({
    "City":   ["Chennai", "Banglore", "Mumbai", "Delhi", "Chennai", "Banglore", "Mumbai", "Delhi"],
    "Name":   ["Praveen", "Dhansekar", "Naveen", "Kumar", "SelvaRani", "Nithya", "Suji", "Konsy"],
    "Gender": ["M", "M", "M", "M", "F", "F", "F", "F"]})
when printed, df looks like this:
       City       Name Gender
0   Chennai    Praveen      M
1  Banglore  Dhansekar      M
2    Mumbai     Naveen      M
3     Delhi      Kumar      M
4   Chennai  SelvaRani      F
5  Banglore     Nithya      F
6    Mumbai       Suji      F
7     Delhi      Konsy      F
I want to save the data in separate DataFrames as follows:
Chennai =
      City       Name Gender
0  Chennai    Praveen      M
4  Chennai  SelvaRani      F

Banglore =
       City       Name Gender
1  Banglore  Dhansekar      M
5  Banglore     Nithya      F

Mumbai =
     City    Name Gender
2  Mumbai  Naveen      M
6  Mumbai    Suji      F

Delhi =
    City   Name Gender
3  Delhi  Kumar      M
7  Delhi  Konsy      F
My code is:
D_name = sorted(df['City'].unique())
for i in D_name:
    f"{i}" = df[df['City'] == i]  # SyntaxError: cannot assign to an f-string literal
The dataset has more than 100 cities. How do I write a for loop in Python that gives me one DataFrame per city?

You can groupby and create a dictionary like so:
dict_dfs = dict(iter(df.groupby("City")))
Then you can directly access individual cities:
Delhi = dict_dfs["Delhi"]
print(Delhi)
# result:
    City   Name Gender
3  Delhi  Kumar      M
7  Delhi  Konsy      F
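If you need to act on every city (the question mentions 100+), you can loop over the same groupby directly. A minimal sketch, assuming you want one CSV file per city (the file-name pattern is purely illustrative, not part of the original answer):
for city, city_df in df.groupby("City"):
    # each iteration yields the group key and its sub-DataFrame
    city_df.to_csv(f"{city}.csv", index=False)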

You could do something like this:
groups = df.groupby(by='City')
Banglore = groups.get_group('Banglore')  # the key must match the spelling in the data ('Banglore')
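Note that get_group raises a KeyError if the requested label does not exist in the grouped column, so the spelling must match the data exactly.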

Related

Filtering duplicate values using groupby

I'm reading the documentation to understand the filter method when it is used with groupby. To understand it, I've set up the scenario below:
I'm trying to get the duplicate names, grouped by city, from my DataFrame df.
Below is my attempt:
import pandas as pd

df = pd.DataFrame({
    'city': ['LA', 'LA', 'LA', 'LA', 'NY', 'NY'],
    'name': ['Ana', 'Pedro', 'Maria', 'Maria', 'Peter', 'Peter'],
    'age':  [24, 27, 19, 34, 31, 20],
    'sex':  ['F', 'M', 'F', 'F', 'M', 'M']})
df_filtered = df.groupby('city').filter(lambda x: len(x['name']) >= 2)
df_filtered
The output I'm getting is:
  city   name  age sex
0   LA    Ana   24   F
1   LA  Pedro   27   M
2   LA  Maria   19   F
3   LA  Maria   34   F
4   NY  Peter   31   M
5   NY  Peter   20   M
The output I'm expecting is:
  city   name  age sex
2   LA  Maria   19   F
3   LA  Maria   34   F
4   NY  Peter   31   M
5   NY  Peter   20   M
It's not clear to me in which cases I should use different column names in the groupby call and in the len() inside the filter method.
Thank you
How about just duplicated:
df[df.duplicated(['city', 'name'], keep=False)]
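keep=False marks every occurrence of a duplicated ('city', 'name') pair as True, so both Maria rows and both Peter rows are returned.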
You should group by the two columns 'city' and 'name':
Yourdf = df.groupby(['city', 'name']).filter(lambda x: len(x) >= 2)
Yourdf
Out[234]:
city name age sex
2 LA Maria 19 F
3 LA Maria 34 F
4 NY Peter 31 M
5 NY Peter 20 M
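For the record, the original attempt returned every row because grouping by 'city' alone makes each group an entire city, and both LA and NY contain at least two rows. Grouping by ['city', 'name'] makes len(x) count the rows per name within a city, which is what the question actually asks for.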

Concatenate with new observations - DataFrame

I'm trying to concatenate with new observations. I got an answer that I think is right, but the system still comes back with "ValueError: Can only compare identically-labeled DataFrame objects". Can anyone tell me why there's a ValueError when I think I got the right result?
Here is the question:
Assume the data frame Employee is as below:
Department Title Year Education Sex
Name
Bob IT analyst 1 Bachelor M
Sam Trade associate 3 PHD M
Peter HR VP 8 Master M
Jake IT analyst 2 Master M
and another data frame new_observations is:
Department Education Sex Title Year
Mary IT F VP 9.0
Amy ? PHD F associate 5.0
Jennifer Trade Master F associate NaN
John HR Master M analyst 2.0
Judy HR Bachelor F analyst 2.0
Update Employee with these new observations.
Here is my code:
import pandas as pd

Employee = pd.DataFrame({"Name": ["Bob", "Sam", "Peter", "Jake"],
                         "Education": ["Bachelor", "PHD", "Master", "Master"],
                         "Sex": ["M", "M", "M", "M"],
                         "Year": [1, 3, 8, 2],
                         "Department": ["IT", "Trade", "HR", "IT"],
                         "Title": ["analyst", "associate", "VP", "analyst"]})
Employee = Employee.set_index('Name')
new_observations = pd.DataFrame({
    "Name": ["Mary", "Amy", "Jennifer", "John", "Judy"],
    "Department": ["IT", "?", "Trade", "HR", "HR"],
    "Education": ["", "PHD", "Master", "Master", "Bachelor"],
    "Sex": ["F", "F", "F", "M", "F"],
    "Title": ["VP", "associate", "associate", "analyst", "analyst"],
    "Year": [9.0, 5.0, float("nan"), 2.0, 2.0]},  # a real NaN, not the string "NaN"
    columns=["Name", "Department", "Education", "Sex", "Title", "Year"])
new_observations = new_observations.set_index('Name')
Employee = Employee.append(new_observations, sort=False)
Here is my result: (screenshot of the output omitted)
I also tried
Employee = pd.concat([Employee, new_observations], axis=1, sort=False)
Use pd.concat on axis=0, which is default, so you don't need to include axis:
pd.concat([Employee, new_observations], sort=False)
Output:
Education Sex Year Department Title
Name
Bob Bachelor M 1 IT analyst
Sam PHD M 3 Trade associate
Peter Master M 8 HR VP
Jake Master M 2 IT analyst
Mary F 9 IT VP
Amy PHD F 5 ? associate
Jennifer Master F NaN Trade associate
John Master M 2 HR analyst
Judy Bachelor F 2 HR analyst
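Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat along the default axis=0 is the way to go in current pandas.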

Wildcard search in python string and then updating the string

I have a column named city. I want to bring the city names to one format, for example:
Column sample data:
City
Sydney
Sydney-EZ
Bangalore
Bengalore SEZ
Delhi
New Delhi
Sydney and Sydney-EZ (or any other row containing the word Sydney) should be replaced by Sydney. Bangalore and Bangalore SEZ (or any other row containing the word Bangalore) should be replaced by Bangalore. Delhi and New Delhi (or any other row containing the word Delhi) should be replaced by Delhi.
Use apply with a lambda.
Ex:
import pandas as pd
df = pd.DataFrame({"City": ["Sydney", "Sydney-EZ", "Bangalore", "Bengalore SEZ"]})
toUpdate = "Sydney"
df["City"] = df["City"].apply(lambda x:toUpdate if toUpdate in x else x )
print(df)
Output:
City
0 Sydney
1 Sydney
2 Bangalore
3 Bengalore SEZ
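To normalize several cities at once rather than one toUpdate value at a time, you can loop over a list of canonical names. A minimal sketch (the canonical list is an assumption; note the misspelled "Bengalore SEZ" in the sample will not match "Bangalore" without fuzzy matching):
import pandas as pd

df = pd.DataFrame({"City": ["Sydney", "Sydney-EZ", "Bangalore",
                            "Bengalore SEZ", "Delhi", "New Delhi"]})

# Rewrite every row containing a canonical name (case-insensitive)
# to that canonical name; rows matching nothing are left unchanged.
canonical = ["Sydney", "Bangalore", "Delhi"]
for name in canonical:
    mask = df["City"].str.contains(name, case=False)
    df.loc[mask, "City"] = name

print(df)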

Obtaining number of occurrences of a variable from a column in python using pandas

I am in the learning phase of analyzing data using Python, and have stumbled upon a doubt.
Consider the following data set:
print (df)
CITY OCCUPATION
0 BANGALORE MECHANICAL ENGINEER
1 BANGALORE COMPUTER SCIENCE ENGINEER
2 BANGALORE MECHANICAL ENGINEER
3 BANGALORE COMPUTER SCIENCE ENGINEER
4 BANGALORE COMPUTER SCIENCE ENGINEER
5 MUMBAI ACTOR
6 MUMBAI ACTOR
7 MUMBAI SHARE BROKER
8 MUMBAI SHARE BROKER
9 MUMBAI ACTOR
10 CHENNAI RETIRED
11 CHENNAI LAND DEVELOPER
12 CHENNAI MECHANICAL ENGINEER
13 CHENNAI MECHANICAL ENGINEER
14 CHENNAI MECHANICAL ENGINEER
15 DELHI PHYSICIAN
16 DELHI PHYSICIAN
17 DELHI JOURNALIST
18 DELHI JOURNALIST
19 DELHI ACTOR
20 PUNE MANAGER
21 PUNE MANAGER
22 PUNE MANAGER
How do I get the most frequent occupation for each city using pandas?
e.g.:
CITY - OCCUPATION
BANGALORE - COMPUTER SCIENCE ENGINEER
MUMBAI - ACTOR
The first solution uses groupby with Counter and most_common. (For DELHI the count is 2 for both JOURNALIST and PHYSICIAN, hence the difference in the output of the two solutions below.)
from collections import Counter

df1 = (df.groupby('CITY').OCCUPATION
         .apply(lambda x: Counter(x).most_common(1)[0][0])
         .reset_index())
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI PHYSICIAN
3 MUMBAI ACTOR
4 PUNE MANAGER
Another solution with groupby, size and nlargest:
df1 = (df.groupby(['CITY', 'OCCUPATION'])
         .size()
         .groupby(level=0)
         .nlargest(1)
         .reset_index(level=0, drop=True)
         .reset_index(name='a')
         .drop('a', axis=1))
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI JOURNALIST
3 MUMBAI ACTOR
4 PUNE MANAGER
EDIT:
For debugging, here is a custom function that does the same as the lambda:
from collections import Counter

def f(x):
    # print the Series for this group
    print(x)
    # count values with Counter
    print(Counter(x).most_common())
    # get the top value - a list of one tuple
    print(Counter(x).most_common(1))
    # select the tuple by indexing [0]
    print(Counter(x).most_common(1)[0])
    # select the first value of the tuple with another [0]
    # (to select the count, use [1] instead of [0])
    print(Counter(x).most_common(1)[0][0])
    return Counter(x).most_common(1)[0][0]

df1 = df.groupby('CITY').OCCUPATION.apply(f).reset_index()
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI JOURNALIST
3 MUMBAI ACTOR
4 PUNE MANAGER
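For reference only (not part of the original answer): pandas' own value_counts can replace Counter here, since value_counts sorts by count in descending order, so .index[0] is the most common occupation per city:
df1 = (df.groupby('CITY').OCCUPATION
         .agg(lambda x: x.value_counts().index[0])  # top of the sorted counts
         .reset_index())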

PySpark groupby and max value selection

I have a PySpark dataframe like
name city date
satya Mumbai 13/10/2016
satya Pune 02/11/2016
satya Mumbai 22/11/2016
satya Pune 29/11/2016
satya Delhi 30/11/2016
panda Delhi 29/11/2016
brata BBSR 28/11/2016
brata Goa 30/10/2016
brata Goa 30/10/2016
I need to find out the most preferred CITY for each name, and the logic is: "take a city as fav_city if that city has the max number of occurrences for the aggregated 'name'+'city' pair". If multiple cities have the same max occurrence count, consider the city with the latest date. I will explain:
d = df.groupby('name', 'city').count()
#name  city    count
brata  Goa     2   #clear favourite
brata  BBSR    1
panda  Delhi   1   #as single so clear favourite
satya  Pune    2   ##confusion
satya  Mumbai  2   ##confusion
satya  Delhi   1   ##should be discarded, as other cities have a higher count
#So get cities having the max count
dd = d.groupby('name').agg(F.max('count').alias('count'))
ddd = dd.join(d, ['name', 'count'], 'left')
#name  count  city
brata  2      Goa     #fav found
panda  1      Delhi   #fav found
satya  2      Mumbai  #can't say
satya  2      Pune    #can't say
In the case of user 'satya' I need to go back to the transaction history and get the latest date among the cities having the equal max count, i.e. whichever of 'Mumbai' or 'Pune' was last transacted in (max date), and take that city as fav_city. In this case it is 'Pune', as '29/11/2016' is the latest/max date.
But I am not able to work out how to proceed.
Please help me with the logic, or if there is a better solution (faster/more compact), please suggest. Thanks.
First convert date to the DateType:
import pyspark.sql.functions as F
df_with_date = df.withColumn(
    "date",
    F.to_date("date", "dd/MM/yyyy")
    # For Spark < 2.2:
    # F.unix_timestamp("date", "dd/MM/yyyy").cast("timestamp").cast("date")
)
Next, groupBy name and city, but extend the aggregation like this:
df_agg = (df_with_date
    .groupBy("name", "city")
    .agg(F.count("city").alias("count"), F.max("date").alias("max_date")))
Define a window:
from pyspark.sql.window import Window
w = Window().partitionBy("name").orderBy(F.desc("count"), F.desc("max_date"))
Add rank:
df_with_rank = (df_agg
    .withColumn("rank", F.dense_rank().over(w)))
And filter:
result = df_with_rank.where(F.col("rank") == 1)
You can detect remaining duplicates using code like this:
import sys
final_w = Window().partitionBy("name").rowsBetween(-sys.maxsize, sys.maxsize)
result.withColumn("tie", F.count("*").over(final_w) != 1)
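A side note, not from the original answer: dense_rank keeps ties, so a name can still return two rows when both count and max_date are equal. If you need exactly one row per name, F.row_number() over the same window picks a single (arbitrary among ties) winner:
df_single = (df_agg
    .withColumn("rn", F.row_number().over(w))  # unique rank, no ties
    .where(F.col("rn") == 1))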
Starting from the per-pair count DataFrame built in the question:
d = df.groupby('name', 'city').count()
# sort by count, descending, and inspect the top row
dd = d.sort(F.col('count').desc())
display(dd.take(1))
# Note: this only surfaces the single most frequent (name, city) pair overall;
# it does not pick a favourite per name, nor break ties by date.
