given input data frame
Required output
I am able to achieve this using groupby, as df_ = (df.groupby("entity_label", sort=True)["entity_text"].apply(tuple).reset_index(name="entity_text")), but duplicates are still present in the output tuples.
You can use SeriesGroupBy.unique() to get the unique values of entity_text in each group before converting to a tuple, as follows:
(df.groupby("entity_label", sort=False)["entity_text"]
.unique()
.apply(tuple)
.reset_index(name="entity_text")
)
Result:
entity_label entity_text
0 job_title (Full Stack Developer, Senior Data Scientist, Python Developer)
1 country (India, Malaysia, Australia)
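Equivalently, the dedup and the tuple conversion can be combined in one step, mirroring the pattern from the question; a minimal sketch using the same df:
(df.groupby("entity_label", sort=False)["entity_text"]
 .apply(lambda s: tuple(s.unique()))
 .reset_index(name="entity_text")
)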
Try this:
import pandas as pd
df = pd.DataFrame({
    'entity_label': ["job_title", "job_title", "job_title", "job_title",
                     "country", "country", "country", "country", "country"],
    'entity_text': ["full stack developer", "senior data scientiest", "python developer",
                    "python developer", "Inida", "Malaysia", "India", "Australia", "Australia"],
})
df.drop_duplicates(inplace=True)
df['entity_text'] = df.groupby('entity_label')['entity_text'].transform(lambda x: ','.join(x))
df.drop_duplicates().reset_index(drop=True)
Output:
entity_label entity_text
0 job_title full stack developer,senior data scientiest,py...
1 country Inida,Malaysia,India,Australia
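Note that this variant produces comma-joined strings rather than tuples. If tuples are required, the joined column can be split again afterwards; a small sketch on the deduplicated frame:
out = df.drop_duplicates().reset_index(drop=True)
out['entity_text'] = out['entity_text'].str.split(',').apply(tuple)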
Let's say that I have the following dataframe:
data = {"Names": ["Ray", "John", "Mole", "Smith", "Jay", "Marc", "Tom", "Rick"],
"Sports": ["Soccer", "Judo", "Tenis", "Judo", "Tenis","Soccer","Judo","Tenis"]}
I want a for loop such that for each unique Sport I can retrieve a numpy array containing the Names of the people playing that sport. In pseudocode:
for unique sport in sports:
    nArray = numpy array of names of people practicing sport
    # do something with nArray
Use GroupBy.apply with np.array:
import numpy as np
import pandas as pd

df = pd.DataFrame(data)
s = df.groupby('Sports')['Names'].apply(np.array)
print(s)
Sports
Judo [John, Smith, Tom]
Soccer [Ray, Marc]
Tenis [Mole, Jay, Rick]
Name: Names, dtype: object
for sport, name in s.items():
    print(name)
['John' 'Smith' 'Tom']
['Ray' 'Marc']
['Mole' 'Jay' 'Rick']
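If you need random access by sport afterwards, the Series converts cleanly to a plain dict; a small usage sketch:
lookup = s.to_dict()
print(lookup['Judo'])  # ['John' 'Smith' 'Tom']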
One way to go:
df = pd.DataFrame(data)
for sport in df.Sports.unique():
    names = np.array(df.loc[df.Sports == sport, 'Names'])
    # names holds the players for this sport; do something with it
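If the arrays need to be kept after the loop, collecting them in a dict comprehension avoids rebinding a single variable each iteration; a sketch on the same df:
arrays = {sport: np.array(df.loc[df.Sports == sport, 'Names'])
          for sport in df.Sports.unique()}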
You can use the pandas library to get an array of names for each sport:
import numpy as np
import pandas as pd
data = {"Names": ["Ray", "John", "Mole", "Smith", "Jay", "Marc", "Tom", "Rick"],
"Sports": ["Soccer", "Judo", "Tenis", "Judo", "Tenis","Soccer","Judo","Tenis"]}
df = pd.DataFrame(data)
unique_sports = df['Sports'].unique()
for sport in unique_sports:
uniqueNames = np.array(df[df['Sports'] == sport]['Names'])
print(uniqueNames)
Result:
['Ray' 'Marc']
['John' 'Smith' 'Tom']
['Mole' 'Jay' 'Rick']
I have a data frame of different stores with multiple departments (1-99, but varying). I want to sum up the revenue of all departments for each shop for each week.
Is there a more elegant way than using for loops and if statements? I'm using Python with pandas.
(photo of the table omitted)
merged = walmart.merge(stores, how='left').merge(features, how='left')
testing_merged = testing.merge(stores, how='left').merge(features, how='left')
df = pd.DataFrame(data={"Store": merged.Store, "Dept": merged.Dept, "Date": merged.Date, "Weekly_Sales": merged.Weekly_Sales, "IsHoliday": merged.IsHoliday,
"Type": merged.Type, "Size": merged.Size, "Temperatur": merged.Temperature, "Fuel_Price": merged.Fuel_Price,
"MarkDown1": merged.MarkDown1, "MarkDown2": merged.MarkDown2, "MarkDown3": merged.MarkDown3, "MarkDown4": merged.MarkDown4,
"MarkDown5": merged.MarkDown5, "CPI": merged.CPI, "Unemployment": merged.Unemployment})
Use a groupby with an aggregation, like the following:
df.groupby(['Store', 'Date'])['Weekly_Sales'].sum()
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
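A minimal sketch on synthetic data (the original table isn't included here), summing revenue across all departments per store and week:
import pandas as pd

# tiny stand-in for the merged Walmart data
df = pd.DataFrame({
    "Store": [1, 1, 1, 2],
    "Dept": [1, 2, 1, 1],
    "Date": ["2012-02-03", "2012-02-03", "2012-02-10", "2012-02-03"],
    "Weekly_Sales": [100.0, 50.0, 80.0, 200.0],
})

# sum Weekly_Sales over departments for each (Store, Date) pair
weekly_totals = df.groupby(["Store", "Date"], as_index=False)["Weekly_Sales"].sum()
print(weekly_totals)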
Suppose I have a dataframe whose columns are as shown below:
print(df.columns)
Index([ "CustomerID", "CompanyName", "ContactName", "ContactTitle",
"Address", "City", "Region", "PostalCode",
"Country", "Phone", "Fax"],
dtype='object')
I want to remove quotes from the values.
This is what I have tried:
for cols in list(df.columns):
new_col = str(cols).replace('"','').replace('"','')
cols.rename(index={cols: new_col})
Any solution will be helpful.
The quotes are just part of how the strings in .columns are displayed; the names themselves are plain strings, so there are no quote characters to remove.
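If the names really did contain literal quote characters (for example, from a malformed CSV header), a vectorized strip would work without a loop; a sketch:
df.columns = df.columns.str.replace('"', '', regex=False)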
I am trying to parse through housing data I scraped off of the web. For each house, I am storing the characteristics within a list. I want to put each house's characteristics (e.g., bedrooms, bathrooms, square feet, etc.) into a row of the same pandas dataframe. However, when I try the code below, the headers of my dataframe appear, but not any of the contents. What am I missing here?
def processData(url):
#Rest of function omitted as it is not relevant to the question at hand.
entry = [location, price, beds, baths, sqft, lotSize, neighborhoodMed, dom, built, hs,
garage, neighborhood, hType, basementSize]
df = pd.DataFrame(columns = ["Address", "Price", "Beds", "Baths", "SqFt", "LotSize",
"NeighborhoodMedian", "DOM", "Built", "HighSchool", "Garage",
"Neighborhood", "Type", "BasementSize"])
df.append(entry) #This line doesn't work
return df
Output:
Empty DataFrame
Columns: [Address, Price, Beds, Baths, SqFt, LotSize, NeighborhoodMedian, DOM, Built, HighSchool, Garage, Neighborhood, Type, BasementSize]
Index: []
Just guessing as to your requirements, but try the following. DataFrame.append does not modify the frame in place; it returns a new DataFrame that must be assigned back:
import pandas as pd

# placeholder values standing in for the scraped fields
location, price, beds, baths, sqft, lotSize, neighborhoodMed, dom, built, hs, garage, neighborhood, hType, basementSize = range(14)

entry = [
    location, price, beds, baths, sqft, lotSize, neighborhoodMed, dom, built,
    hs, garage, neighborhood, hType, basementSize
]
columns = [
    "Address", "Price", "Beds", "Baths", "SqFt", "LotSize",
    "NeighborhoodMedian", "DOM", "Built", "HighSchool", "Garage",
    "Neighborhood", "Type", "BasementSize"
]
df = pd.DataFrame(columns=columns)
# append the row as a dict and assign the result back,
# since DataFrame.append returns a new frame
df = df.append(dict(zip(columns, entry)), ignore_index=True)
print(df)
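Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current pandas the same row-append can be written with pd.concat, e.g.:
row = pd.DataFrame([dict(zip(columns, entry))])
df = pd.concat([df, row], ignore_index=True)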
I would like to use structured queries to do geocoding in GeoPy, and I would like to run this on a large number of observations. I don't know how to do these queries using a pandas dataframe (or something that can be easily transformed to and from a pandas dataframe).
First, some setup:
import pandas
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

Ngeolocator = Nominatim(user_agent="myGeocoder")
Ngeocode = RateLimiter(Ngeolocator.geocode, min_delay_seconds=1)
df = pandas.DataFrame(["Bob", "Joe", "Ed"])
df["CLEANtown"] = ['Harmony', 'Fargo', '']
df["CLEANcounty"] = ['', '', 'Traill']
df["CLEANstate"] = ['Minnesota', 'North Dakota', 'North Dakota']
df["full"]=['Harmony, Minnesota','Fargo, North Dakota','Traill County, North Dakota']
df.columns = ["name"] + list(df.columns[1:])
I know how to run a structured query on a single location by providing a dictionary. I.e.:
q = {'city': 'Harmony', 'county': '', 'state': 'Minnesota'}
testN = Ngeocode(q, addressdetails=True)
And I know how to geocode from the dataframe simply using a single column populated with strings. I.e.:
df['easycode'] = df['full'].apply(lambda x: Ngeocode(x, language='en',addressdetails=True).raw)
But how do I turn the columns CLEANtown, CLEANcounty, and CLEANstate into dictionaries row by row, use those dictionaries as structured queries, and put the results back into the pandas dataframe?
Thank you!
One way is to use the apply method of a DataFrame instead of a Series. This passes the whole row to the lambda. Example:
df["easycode"] = df.apply(
lambda row: Ngeocode(
{
"city": row["CLEANtown"],
"county": row["CLEANcounty"],
"state": row["CLEANstate"],
},
language="en",
addressdetails=True,
).raw,
axis=1,
)
Similarly, if you wanted to build a single column of the dictionaries first, you could do:
df["full"] = df.apply(
lambda row: {
"city": row["CLEANtown"],
"county": row["CLEANcounty"],
"state": row["CLEANstate"],
},
axis=1,
)
df["easycode"] = df["full"].apply(
lambda x: Ngeocode(
x,
language="en",
addressdetails=True,
).raw
)
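One caveat: some of the CLEAN columns contain empty strings. If the geocoder mishandles empty components, a dict comprehension can drop them before querying; a sketch (untested against Nominatim):
df["easycode"] = df.apply(
    lambda row: Ngeocode(
        {k: v for k, v in {
            "city": row["CLEANtown"],
            "county": row["CLEANcounty"],
            "state": row["CLEANstate"],
        }.items() if v},  # keep only non-empty components
        language="en",
        addressdetails=True,
    ).raw,
    axis=1,
)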