I have an original df with a column "average" that holds the average value computed per country. Now I have new_df, to which I want to add those average values from df, matched by country.
df:
id  country  value  average
1   USA      3      2
2   UK       5      5
3   France   2      2
4   USA      1      2

new_df:
country  average
USA      2
Italy    NaN
I had a solution that worked, but there is a problem when new_df contains a country for which I have not computed an average yet. In that case I just want to fill in NaN.
Can you please recommend a solution?
Thanks
If you need to add the average column to df2 (the question's new_df; df1 is the original df), use DataFrame.merge with DataFrame.drop_duplicates:
df2.merge(df1.drop_duplicates('country')[['country','average']], on='country', how='left')
If you need to aggregate the mean instead:
df2.join(df1.groupby('country')['average'].mean(), on='country')
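For reference, a minimal sketch wiring the question's sample data into the merge above (here df1 plays the role of the question's df and df2 the role of new_df):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'country': ['USA', 'UK', 'France', 'USA'],
                    'value': [3, 5, 2, 1],
                    'average': [2, 5, 2, 2]})
df2 = pd.DataFrame({'country': ['USA', 'Italy']})

# how='left' keeps every row of df2; a country with no computed average
# (Italy) simply gets NaN
out = df2.merge(df1.drop_duplicates('country')[['country', 'average']],
                on='country', how='left')
print(out)
#   country  average
# 0     USA      2.0
# 1   Italy      NaN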
I have two dataframes, df1 and df2, which have a common column heading, Name. The Name values are unique within df1 and df2. df1's Name values are a subset of those in df2; df2 has more rows -- about 17,300 -- than df1 -- about 6,900 -- but each Name value in df1 is in df2. I would like to create a list of Name values in df1 that meet certain criteria in other columns of the corresponding rows in df2.
Example:
df1:
   Name  Age    Hair
0  Jim   25     black
1  Mary  58     brown
3  Sue   15     purple
df2:
   Name   Country  phoneOS
0  Shari  GB       Android
1  Jim    US       Android
2  Alain  TZ       iOS
3  Sue    PE       iOS
4  Mary   US       Android
I would like a list of only those Name values in df1 that have df2 Country and phoneOS values of US and Android. The example result would be [Jim, Mary].
I have successfully selected rows within one dataframe that meet multiple criteria in order to copy those rows to a new dataframe. In that case pandas/Python does the iteration over rows internally. I guess I could write a "manual" iteration over the rows of df1 and access df2 on each iteration. I was hoping for a more efficient solution whereby the iteration was handled internally as in the single-dataframe case. But my searches for such a solution have been fruitless.
Try:
df1.loc[df1.Name.isin(df2.loc[df2.Country.eq('US') &
                              df2.phoneOS.eq('Android'), 'Name']), 'Name']
Result:
0 Jim
1 Mary
Name: Name, dtype: object
If you want the result as a list, just add .to_list() at the end.
# alternative: merge first, then filter
data = df1.merge(df2, on='Name')
data.loc[(data.phoneOS == 'Android') & (data.Country == 'US'), 'Name'].values.tolist()
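For completeness, a self-contained sketch of the first approach using the sample data from the question (the index values differ slightly from the post but do not affect the result):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Jim', 'Mary', 'Sue'],
                    'Age': [25, 58, 15],
                    'Hair': ['black', 'brown', 'purple']})
df2 = pd.DataFrame({'Name': ['Shari', 'Jim', 'Alain', 'Sue', 'Mary'],
                    'Country': ['GB', 'US', 'TZ', 'PE', 'US'],
                    'phoneOS': ['Android', 'Android', 'iOS', 'iOS', 'Android']})

# Name values in df2 that satisfy both criteria
wanted = df2.loc[df2.Country.eq('US') & df2.phoneOS.eq('Android'), 'Name']
# keep only the df1 names that appear in that set
print(df1.loc[df1.Name.isin(wanted), 'Name'].to_list())   # ['Jim', 'Mary']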
In Python, I have a df that looks like this
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
And a df that looks like this
Name ID
Dan 1
Hallie 2
Cam 2
Lacy 2
Ryan 3
Colt 4
Tia 4
How can I merge the dfs so that the ID column looks like this:
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
Dan 4
Hallie 5
Cam 5
Lacy 5
Ryan 6
Colt 7
Tia 7
This is just a minimal reproducible example; my actual data set has thousands of values. I'm basically combining data frames and want the IDs to continue in numerical order from the previous data frame instead of restarting from one each time. I know that I can reset the index if ID is a unique identifier, but in this case more than one person can have the same ID. So how can I account for that?
From the example you have provided, you can see that the final dataframe is obtained by adding the maximum ID of the first dataframe to the IDs of the second and then concatenating them. To explain this better:
Name   ID in second df   ID in final_df
Dan    1                 4
The value in final_df is obtained as 1 + (the max ID from the first df, i.e. 3), and the same shift is applied to every entry of the second dataframe.
Code:
import pandas as pd

df = pd.DataFrame({'Name': ['Anna', 'Polly', 'Sarah', 'Max', 'Kate', 'Ally', 'Steve'],
                   'ID': [1, 1, 2, 3, 3, 3, 3]})
df1 = pd.DataFrame({'Name': ['Dan', 'Hallie', 'Cam', 'Lacy', 'Ryan', 'Colt', 'Tia'],
                    'ID': [1, 2, 2, 2, 3, 4, 4]})

max_df = df['ID'].max()          # 3
df1['ID'] = df1['ID'] + max_df   # shift the second frame's IDs past the first frame's maximum
final_df = pd.concat([df, df1])
print(final_df)
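If more than two frames need to be combined this way, the same idea can be wrapped in a small helper. This is only a sketch (the helper name is made up, and it also resets the index with ignore_index, unlike the snippet above):
import pandas as pd

def concat_with_offset_ids(frames):
    # shift each frame's IDs by the running maximum so the numbering
    # keeps increasing instead of restarting at 1
    out, offset = [], 0
    for f in frames:
        f = f.copy()
        f['ID'] = f['ID'] + offset
        offset = f['ID'].max()
        out.append(f)
    return pd.concat(out, ignore_index=True)

final_df = concat_with_offset_ids([df, df1])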
I have multiple categorical columns like Marital Status, Education, Gender, and City, and I want to check the unique values and their counts in all of these columns at once instead of writing this code every time:
df['Education'].value_counts()
I can only give an example with a few features, but I need a solution for when there are many categorical features and it is not practical to write the same code again and again to examine them.
Maritial_Status  Education  City
Married          UG         LA
Single           PHD        CA
Single           UG         CA

Expected output:

Maritial_Status    Education    City
Married  1         UG   2       LA  1
Single   2         PHD  1       CA  2
Is there any kind of method to do this in Python?
Thanks
Yes, you can get what you're looking for with the following approach (and you don't need to worry if your df has more columns than the four you specified):
Get (only) all your categorical columns from your df in a list:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
Then loop over the categorical columns, calling .size() on each grouped object, and store each result (a small DataFrame) in a list:
li = []
for col in cat_cols:
    li.append(df.groupby([col]).size().reset_index(name=col + '_count'))
Lastly, concatenate the dataframes in the list into one:
dat = pd.concat(li, axis=1)
All in one block:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
li = []
for col in cat_cols:
    li.append(df.groupby([col]).size().reset_index(name=col + '_count'))
dat = pd.concat(li, axis=1)  # axis=1 so that the concatenation is column-wise
Marital Status Marital Status_count ... City City_count
0 Divorced 4.0 ... Athens 4
1 Married 3.0 ... Berlin 2
2 Single 3.0 ... London 2
3 Widowed 2.0 ... New York 2
4 NaN NaN ... Singapore 2
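As a side note (not part of the original answer), the list of object-dtype columns built above with the list comprehension can also be obtained with select_dtypes:
cat_cols = df.select_dtypes(include='object').columns.tolist()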
Using value_counts, you can do the following:
res = (df
       .apply(lambda x: x.value_counts())   # value_counts is applied column by column
       .stack()
       .reset_index(level=0)
       .sort_index(axis=0)
       .rename(columns={'level_0': 'Value', 0: 'value_counts'}))
Another format of the output:
res['Id'] = res.groupby(level=0).cumcount()
res = res.set_index('Id', append=True)
Explanation:
After applying value_counts, you will get the following:
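(The original answer showed a screenshot here; the frame below is reconstructed for the three-row sample above, and the exact row order may differ.)
         Maritial_Status  Education  City
CA                   NaN        NaN   2.0
LA                   NaN        NaN   1.0
Married              1.0        NaN   NaN
PHD                  NaN        1.0   NaN
Single               2.0        NaN   NaN
UG                   NaN        2.0   NaN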
Then, using stack, you can remove the NaNs, get everything "stacked up", and do the formatting/ordering of the output.
To reduce the frame to its distinct rows (dropping repeated value combinations), you can try the drop_duplicates() method:
dataset.drop_duplicates()
I want to group by ID and get the three most frequent cities. For example, I have this original dataframe:
ID City
1 London
1 London
1 New York
1 London
1 New York
1 Berlin
2 Shanghai
2 Shanghai
and the result I want is like this:
ID first_frequent_city second_frequent_city third_frequent_city
1 London New York Berlin
2 Shanghai NaN NaN
The first step is to use SeriesGroupBy.value_counts to count the City values per ID (with the advantage that the values are already sorted). Then get a counter with GroupBy.cumcount, filter the first 3 values with loc, pivot with DataFrame.pivot, change the column names, and finally convert ID back to a column with DataFrame.reset_index:
df = (df.groupby('ID')['City'].value_counts()
        .groupby(level=0).cumcount()
        .loc[lambda x: x < 3]
        .reset_index(name='c')
        .pivot(index='ID', columns='c', values='City')
        .rename(columns={0: 'first_', 1: 'second_', 2: 'third_'})
        .add_suffix('frequent_city')
        .rename_axis(None, axis=1)
        .reset_index())
print(df)
ID first_frequent_city second_frequent_city third_frequent_city
0 1 London New York Berlin
1 2 Shanghai NaN NaN
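For reference, the frame used above is just the sample data from the question:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 2, 2],
                   'City': ['London', 'London', 'New York', 'London',
                            'New York', 'Berlin', 'Shanghai', 'Shanghai']})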
Another way: use the count as a reference to sort, then recreate the dataframe by looping through the groupby object:
import numpy as np

df = (df.assign(count=df.groupby(["ID", "City"])["City"].transform("count"))
        .drop_duplicates(["ID", "City"])
        .sort_values(["ID", "count"], ascending=False))
print(pd.DataFrame([i["City"].unique()[:3] for _, i in df.groupby("ID")]).fillna(np.nan))
0 1 2
0 London New York Berlin
1 Shanghai NaN NaN
A bit long; essentially you group by twice. The first part works on the idea that grouping keeps the data in order, and the second part splits the data into individual columns:
(df
 .groupby("ID")
 .tail(3)
 .drop_duplicates()
 .groupby("ID")
 .agg(",".join)
 .City.str.split(",", expand=True)
 .set_axis(["first_frequent_city",
            "second_frequent_city",
            "third_frequent_city"],
           axis="columns")
)
first_frequent_city second_frequent_city third_frequent_city
ID
1 London New York Berlin
2 Shanghai None None
Get the count by ID and City, then use np.where() with .groupby() transforms of max, median, and min. Then set the index and unstack the max column from rows to columns.
import numpy as np

df = df.assign(count=df.groupby(['ID', 'City'])['City'].transform('count')).drop_duplicates()
df['max'] = np.where(df['count'] == df.groupby('ID')['count'].transform('min'), 'third_frequent_city', np.nan)
df['max'] = np.where(df['count'] == df.groupby('ID')['count'].transform('median'), 'second_frequent_city', df['max'])
df['max'] = np.where(df['count'] == df.groupby('ID')['count'].transform('max'), 'first_frequent_city', df['max'])
df = df.drop('count', axis=1).set_index(['ID', 'max']).unstack(1)
output:
City
max first_frequent_city second_frequent_city third_frequent_city
ID
1 London New York Berlin
2 Shanghai NaN NaN
I am trying to get from my starting DataFrame (given in code form below) to my desired results.
I am trying to do a groupby on two columns (Name, Month), and I have a column (Category) that has either the value 'Score1' or 'Score2'. I want to create two columns named after the values in the Category column and fill them with the corresponding values from the Value column.
pd.crosstab([df.Name, df.Month], df.Category)
is the closest I've got to creating the desired data frame, but I can't figure out how to get the values from my "Value" column to populate the dataframe.
The DataFrame in code form:
df = pd.DataFrame(columns=['Name', 'Month', 'Category', 'Value'])
df['Name'] = ['Jack','Jack','Sarah','Sarah','Zack']
df['Month'] = ['Jan.','Jan.','Feb.','Feb.','Feb.']
df['Category'] = ['Score1','Score2','Score1','Score2','Score1']
df['Value'] = [1,2,3,4,5]
Thanks!
You can use pivot_table:
(df.pivot_table(index=['Name', 'Month'], values='Value', columns='Category')
   .rename_axis(None, axis=1)
   .reset_index())
Out[1]:
Name Month Score1 Score2
0 Jack Jan. 1.0 2.0
1 Sarah Feb. 3.0 4.0
2 Zack Feb. 5.0 NaN
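A note on this choice (not from the original answer): pivot_table aggregates duplicate Name/Month/Category combinations (mean by default), whereas DataFrame.pivot raises an error if such duplicates exist. Since every combination is unique in this sample, recent pandas versions (which accept a list-like index in pivot) would also allow:
df.pivot(index=['Name', 'Month'], columns='Category', values='Value').reset_index()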
One way is with groupby and unstack:
new_df = (df.groupby(['Name', 'Month', 'Category'])['Value']
            .first()
            .unstack()
            .reset_index())
print(new_df)
Category Name Month Score1 Score2
0 Jack Jan. 1.0 2.0
1 Sarah Feb. 3.0 4.0
2 Zack Feb. 5.0 NaN