I have a data frame like:
Company Country
ABC USA
ABC USA
BCD USA
BCD USA
ABC USA
The output should be:
Company Country
ABC USA
BCD USA
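For reference, the sample frame from the question can be built like this:
import pandas as pd

df = pd.DataFrame({'Company': ['ABC', 'ABC', 'BCD', 'BCD', 'ABC'],
                   'Country': ['USA'] * 5})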
I think you need drop_duplicates if you want unique rows across all columns:
df = df.drop_duplicates()
print (df)
Company Country
0 ABC USA
2 BCD USA
Or, if you need to check duplicates only in specific column(s), add the subset parameter:
df = df.drop_duplicates(subset=['Company'])
print (df)
Company Country
0 ABC USA
2 BCD USA
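If you instead want to keep the last occurrence per Company, drop_duplicates also accepts keep='last' (a small sketch on the sample data above):
df = df.drop_duplicates(subset=['Company'], keep='last')
print (df)
Company Country
3 BCD USA
4 ABC USA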
And a solution with groupby, aggregating first:
df = df.groupby('Company', as_index=False).first()
print (df)
Company Country
0 ABC USA
1 BCD USA
Just for the sake of completeness, you can also use:
df.groupby('Company').head(1)
Out:
Company Country
0 ABC USA
2 BCD USA
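Note that head(1) keeps the original index; if you want a fresh 0..n-1 index, chain reset_index:
df.groupby('Company').head(1).reset_index(drop=True)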
I have data on births that looks like this:
Date Country Sex
1.1.20 USA M
1.1.20 USA M
1.1.20 Italy F
1.1.20 England M
2.1.20 Italy F
2.1.20 Italy M
3.1.20 USA F
3.1.20 USA F
My goal is to get a new dataframe in which each row is a date and country combination, together with the total number of births, the number of male births and the number of female births. It's supposed to look like this:
Date Country Births Males Females
1.1.20 USA 2 2 0
1.1.20 Italy 1 0 1
1.1.20 England 1 1 0
2.1.20 Italy 2 1 1
3.1.20 USA 2 0 2
I tried using this code:
df.groupby(by=['Date', 'Country', 'Sex']).size()
but it only gave me a new column of total births, with different rows for each sex in every date+country combination.
Any help will be appreciated.
Thanks,
Eran
You can group the dataframe on columns Date and Country, then aggregate column Sex using value_counts followed by unstack to reshape; finally assign the Births column by summing the frequencies along axis=1:
out = df.groupby(['Date', 'Country'], sort=False)['Sex']\
        .value_counts().unstack(fill_value=0)
out.assign(Births=out.sum(axis=1)).reset_index()\
   .rename(columns={'M': 'Male', 'F': 'Female'})
Or you can use a very similar approach with pd.crosstab instead of groupby + value_counts:
out = pd.crosstab([df['Date'], df['Country']], df['Sex'], colnames=[None])
out.assign(Births=out.sum(axis=1)).reset_index()\
   .rename(columns={'M': 'Male', 'F': 'Female'})
Date Country Female Male Births
0 1.1.20 USA 0 2 2
1 1.1.20 Italy 1 0 1
2 1.1.20 England 0 1 1
3 2.1.20 Italy 1 1 2
4 3.1.20 USA 2 0 2
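If you want exactly the Births/Males/Females layout from the question, a possible alternative sketch uses named aggregation (pandas 0.25+); the lambdas here are just one way to count each sex:
out = (df.groupby(['Date', 'Country'], sort=False)
         .agg(Births=('Sex', 'size'),
              Males=('Sex', lambda s: s.eq('M').sum()),
              Females=('Sex', lambda s: s.eq('F').sum()))
         .reset_index())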
I have a DataFrame like below:
data = pd.DataFrame({"Country": ["Brazil", "Brazil", "Germany", "Germany", "UK"],
                     "Order method": ["Phone", "Retail", "Web", "Web", "Retail"]})
And I would like to create a new DataFrame based on the above, where every Country / Order method combination gets a count, including combinations that do not appear in the data.
Use GroupBy.size with Series.unstack and DataFrame.stack to add the missing categories:
s = data.groupby(['Country','Order method']).size().unstack(fill_value=0).stack()
print (s)
Country Order method
Brazil Phone 1
Retail 1
Web 0
Germany Phone 0
Retail 0
Web 2
UK Phone 0
Retail 1
Web 0
dtype: int64
For DataFrame add DataFrame.reset_index:
df = (data.groupby(['Country','Order method'])
          .size()
          .unstack(fill_value=0)
          .stack()
          .reset_index(name='Count'))
print (df)
Country Order method Count
0 Brazil Phone 1
1 Brazil Retail 1
2 Brazil Web 0
3 Germany Phone 0
4 Germany Retail 0
5 Germany Web 2
6 UK Phone 0
7 UK Retail 1
8 UK Web 0
Last, if necessary, replace the duplicated values with empty strings using Series.mask with Series.duplicated:
df['Country'] = df['Country'].mask(df['Country'].duplicated(), '')
print (df)
Country Order method Count
0 Brazil Phone 1
1 Retail 1
2 Web 0
3 Germany Phone 0
4 Retail 0
5 Web 2
6 UK Phone 0
7 Retail 1
8 Web 0
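An alternative sketch for materializing the missing categories is to reindex the counts against the full cartesian product of both columns; this gives the same result as the unstack/stack trick above:
idx = pd.MultiIndex.from_product([data['Country'].unique(),
                                  data['Order method'].unique()],
                                 names=['Country', 'Order method'])
df = (data.groupby(['Country', 'Order method'])
          .size()
          .reindex(idx, fill_value=0)
          .reset_index(name='Count'))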
There are 2 DataFrames with the same datatypes:
df1 =
ID city name value
1 LA John 111
2 NY Sam 222
3 SF Foo 333
4 Berlin Bar 444
df2 =
ID city name value
1 NY Sam 223
2 LA John 111
3 SF Foo 335
4 London Foo1 999
5 Berlin Bar 444
I need to compare them and produce a new df containing only the rows which are in df2, but not in df1.
For some reason, the results after applying different methods are wrong.
So far I've tried
pd.concat([df1, df2], join='inner', ignore_index=True)
but it returns all values together
pd.merge(df1, df2, how='inner')
it returns df1
then this one
df1[~(df1.iloc[:, 0].isin(list(df2.iloc[:, 0])))]
it returns df1
The desired output is
ID city name value
1 NY Sam 223
2 SF Foo 335
3 London Foo1 999
Use DataFrame.merge on all columns except the first, together with the indicator parameter:
c = df1.columns[1:].tolist()
Or:
c = ['city', 'name', 'value']
df = (df2.merge(df1, on=c, indicator=True, how='left', suffixes=('', '_'))
         .query("_merge == 'left_only'")[df1.columns])
print (df)
ID city name value
0 1 NY Sam 223
2 3 SF Foo 335
3 4 London Foo1 999
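A merge-free sketch for the same result: build an index from the comparison columns c and filter df2 with Index.isin:
c = ['city', 'name', 'value']
mask = df2.set_index(c).index.isin(df1.set_index(c).index)
print (df2[~mask])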
Try this:
print("------------------------------")
print(df1)
print("------------------------------")
print(df2)
# merge on city/name, keep df2's ID and value for matched rows
common = (df1.merge(df2, on=["city", "name"])
             .rename(columns={"value_y": "value", "ID_y": "ID"})
             .drop(columns=["value_x", "ID_x"]))
print("------------------------------")
print(common)
OUTPUT:
------------------------------
ID city name value
0 1 LA John 111
1 2 NY Sam 222
2 3 SF Foo 333
3 4 Berlin Bar 444
------------------------------
ID city name value
0 1 NY Sam 223
1 2 LA John 111
2 3 SF Foo 335
3 4 London Foo1 999
4 5 Berlin Bar 444
------------------------------
city name ID value
0 LA John 2 111
1 NY Sam 1 223
2 SF Foo 3 335
3 Berlin Bar 5 444
I have some customer data such as this in a data frame:
S No Country Sex
1 Spain M
2 Norway F
3 Mexico M
...
I want to have an output such as this:
Spain
M = 1207
F = 230
Norway
M = 33
F = 102
...
I have a basic notion that I want to group my rows by country with something like df.groupby(df.Country), and then, on each group, run something like df.Sex.value_counts().
Thanks!
I think you need crosstab:
df = pd.crosstab(df.Sex, df.Country)
Or, if you want to use your own solution, add unstack to move the first level of the MultiIndex to the columns:
df = df.groupby(df.Country).Sex.value_counts().unstack(level=0, fill_value=0)
print (df)
Country Mexico Norway Spain
Sex
F 0 1 0
M 1 0 1
EDIT:
If you want to add more columns, you can choose with the level parameter which level is converted to columns:
df1 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=0, fill_value=0).reset_index()
print (df1)
No Country Sex 1 2 3
0 Mexico M 0 0 1
1 Norway F 0 1 0
2 Spain M 1 0 0
df2 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=1, fill_value=0).reset_index()
print (df2)
Country No Sex Mexico Norway Spain
0 1 M 0 0 1
1 2 F 0 1 0
2 3 M 1 0 0
df2 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=2, fill_value=0).reset_index()
print (df2)
Sex No Country F M
0 1 Spain 0 1
1 2 Norway 1 0
2 3 Mexico 0 1
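As a side note, crosstab can also append row and column totals via the margins parameter, if totals per country or per sex are wanted:
df = pd.crosstab(df.Sex, df.Country, margins=True, margins_name='Total')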
You can also use pandas.pivot_table:
res = df.pivot_table(index='Country', columns='Sex', aggfunc='count', fill_value=0)
print(res)
S No
Sex F M
Country
Mexico 0 1
Norway 1 0
Spain 0 1
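And if the goal is literally the printed layout from the question, a small sketch loops over the groups on the original frame:
for country, g in df.groupby('Country'):
    print(country)
    for sex, count in g['Sex'].value_counts().items():
        print(f'{sex} = {count}')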
I have a dataframe containing about 300 000 rows with a structure like this:
name Jack
gender M
year 1993
country USA
city Odessa
name John
gender M
year 1992
name Sam
country Canada
city Toronto
Is there a way to make the dataframe look like this using Pandas?
name gender year country city
Jack M 1993 USA Odessa
John M 1992
Sam Canada Toronto
The row with "name" is always there, but the others could be absent. I tried to use iterrows with no success.
In [17]:
g = np.cumsum(df.iloc[:, 0] == 'name')
In [15]:
df.groupby(g).apply(lambda x: pd.DataFrame(x.set_index([0]).T,
                    columns=['name', 'gender', 'year', 'country', 'city']))
Out[15]:
name gender year country city
0
1 1 Jack M 1993 USA Odessa
2 1 John M 1992 NaN NaN
3 1 Sam NaN NaN Canada Toronto
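Assuming the raw frame has the key in column 0 and the value in column 1, as in the answer above, an equivalent sketch with pivot avoids the per-group apply:
g = (df[0] == 'name').cumsum()
out = (df.assign(row=g)
         .pivot(index='row', columns=0, values=1)
         .reindex(columns=['name', 'gender', 'year', 'country', 'city']))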