I have a DataFrame like this:
df = pd.DataFrame({'User':['101','101','101','102','102','101','101','102','102','102'],'Country':['India','Japan','India','Brazil','Japan','UK','Austria','Japan','Singapore','UK'],
'Name':['RN','TN','AP','AP','TN','TN','TS','RN','TN','AP']})
If User and Country are the same, I want to combine the Name column values into another column, like below.
You could use:
df['Name_E'] = df.groupby(['User', 'Country']).Name.transform(lambda x: str.join(', ', x))
Use groupby with transform (note that set drops duplicates but does not guarantee order):
df['all_names'] = df.groupby(['Country', 'User']).Name.transform(lambda x: ','.join(set(x)))
Country Name User all_names
0 India RN 101 AP,RN
1 Japan TN 101 TN
2 India AP 101 AP,RN
3 Brazil AP 102 AP
4 Japan TN 102 TN,RN
5 UK TN 101 TN
6 Austria TS 101 TS
7 Japan RN 102 TN,RN
8 Singapore TN 102 TN
9 UK AP 102 AP
You need:
res = df.merge(df.groupby(['User', 'Country'])['Name'].unique().reset_index().rename(columns={'Name':'Name_E'}), on=['Country', 'User'])
res['Name_E'] = res['Name_E'].apply(lambda x: ",".join(x))
Output:
User Country Name Name_E
0 101 India RN RN,AP
1 101 India AP RN,AP
2 101 Japan TN TN
3 102 Brazil AP AP
4 102 Japan TN TN,RN
5 102 Japan RN TN,RN
6 101 UK TN TN
7 101 Austria TS TS
8 102 Singapore TN TN
9 102 UK AP AP
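If the order of first appearance matters, set() is not a safe choice because its iteration order is arbitrary. A minimal sketch of an order-preserving variant, assuming the same df as in the question; the use of dict.fromkeys for deduplication is my own substitution, not something from the answers above:

```python
import pandas as pd

df = pd.DataFrame({
    'User': ['101', '101', '101', '102', '102', '101', '101', '102', '102', '102'],
    'Country': ['India', 'Japan', 'India', 'Brazil', 'Japan', 'UK', 'Austria',
                'Japan', 'Singapore', 'UK'],
    'Name': ['RN', 'TN', 'AP', 'AP', 'TN', 'TN', 'TS', 'RN', 'TN', 'AP'],
})

# dict.fromkeys removes duplicates while keeping first-seen order (unlike set).
df['Name_E'] = (
    df.groupby(['User', 'Country'])['Name']
      .transform(lambda names: ','.join(dict.fromkeys(names)))
)
print(df)
```

Every row keeps its original position, and rows sharing the same User and Country get the same joined string.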
Related
I have a huge DataFrame like this:
country1 import1 export1 country2 import2 export2
0 USA 12 82 Germany 12 82
1 Germany 65 31 France 65 31
2 England 74 47 Japan 74 47
3 Japan 23 55 England 23 55
4 France 48 12 Usa 48 12
export1 and import1 belong to country1.
export2 and import2 belong to country2.
I want to sum the export and import values by country.
The output might look like:
country | total_export | total_import
______________________________________________
USA | 12211221 | 212121
France | 4545 | 5454
...
...
Use wide_to_long first:
df = (pd.wide_to_long(data.reset_index(), ['country','import','export'], i='index', j='tmp')
.reset_index(drop=True))
print (df)
country import export
0 USA 12 82
1 Germany 65 31
2 England 74 47
3 Japan 23 55
4 France 48 12
5 Germany 12 82
6 France 65 31
7 Japan 74 47
8 England 23 55
9 Usa 48 12
Then aggregate with sum:
df = df.groupby('country', as_index=False).sum()
print (df)
country import export
0 England 97 102
1 France 113 43
2 Germany 77 113
3 Japan 97 102
4 USA 12 82
5 Usa 48 12
You can slice the table into two parts and concatenate them:
func = lambda x: x[:-1] # or lambda x: x.rstrip('0123456789')
# DataFrame.append was removed in pandas 2.0, so stack the two halves with pd.concat
pd.concat([data.iloc[:, :3].rename(func, axis=1),
           data.iloc[:, 3:].rename(func, axis=1)]).\
groupby('country').sum()
Output:
import export
country
England 97 102
France 113 43
Germany 77 113
Japan 97 102
USA 12 82
Usa 48 12
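Both outputs above keep "USA" and "Usa" as separate groups. If those are meant to be the same country (an assumption on my part, the question does not say), one option is to normalize the case before grouping. A rough sketch, with the total_import / total_export column names taken from the desired output in the question:

```python
import pandas as pd

data = pd.DataFrame({
    'country1': ['USA', 'Germany', 'England', 'Japan', 'France'],
    'import1': [12, 65, 74, 23, 48],
    'export1': [82, 31, 47, 55, 12],
    'country2': ['Germany', 'France', 'Japan', 'England', 'Usa'],
    'import2': [12, 65, 74, 23, 48],
    'export2': [82, 31, 47, 55, 12],
})


def strip_digit(col):
    """Turn 'country1' / 'import2' into 'country' / 'import'."""
    return col.rstrip('0123456789')


first = data[['country1', 'import1', 'export1']].rename(columns=strip_digit)
second = data[['country2', 'import2', 'export2']].rename(columns=strip_digit)
stacked = pd.concat([first, second], ignore_index=True)

# Assumption: spellings that differ only by case refer to the same country.
stacked['country'] = stacked['country'].str.upper()

totals = (
    stacked.groupby('country', as_index=False)[['import', 'export']].sum()
           .rename(columns={'import': 'total_import', 'export': 'total_export'})
)
print(totals)
```

Upper-casing is the crudest possible normalization; a lookup table of canonical names would be safer on real data.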
I have a pandas data frame df like this.
In [1]: df
Out[1]:
country count
0 Japan 78
1 Japan 80
2 USA 45
3 France 34
4 France 90
5 UK 45
6 UK 34
7 China 32
8 China 87
9 Russia 20
10 Russia 67
I want to remove rows with the maximum value in each group. So the result should look like:
country count
0 Japan 78
3 France 34
6 UK 34
7 China 32
9 Russia 20
My first attempt:
idx = df.groupby(['country'], sort=False).max()['count'].index
df_new = df.drop(list(idx))
My second attempt:
idx = df.groupby(['country'])['count'].transform(max).index
df_new = df.drop(list(idx))
But it didn't work. Any ideas?
groupby / transform('max')
You can first calculate a series of maximums by group, then filter out rows where count equals that series. Note that this also removes every row tied for the maximum, not just one of them.
g = df.groupby(['country'])['count'].transform('max')
df = df[~(df['count'] == g)]
The series g holds, for each row, the maximum of that row's group. Where it equals df['count'] (aligned by index), the row holds the maximum for its group. The ~ operator then negates the condition, so those rows are dropped.
print(df.groupby(['country'])['count'].transform('max'))
0 80
1 80
2 45
3 90
4 90
5 45
6 45
7 87
8 87
9 67
10 67
Name: count, dtype: int64
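For reference, here is the transform approach as one self-contained script, using the data from the question. The names group_max and result are mine:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Japan', 'Japan', 'USA', 'France', 'France', 'UK', 'UK',
                'China', 'China', 'Russia', 'Russia'],
    'count': [78, 80, 45, 34, 90, 45, 34, 32, 87, 20, 67],
})

# One value per row: the maximum count within that row's country.
group_max = df.groupby('country')['count'].transform('max')

# Keep only the rows that are NOT their group's maximum.
result = df[df['count'] != group_max]
print(result)
```

USA has a single row, which is its own maximum, so it disappears from the result entirely. That matches the expected output in the question.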
sort + drop
Alternatively, you can sort and drop the final occurrence:
res = df.sort_values('count')
res = res.drop(res.groupby('country').tail(1).index)
print(res)
country count
9 Russia 20
7 China 32
3 France 34
6 UK 34
0 Japan 78
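If tied maximums should not all be removed, another option (not from the answers above) is groupby with idxmax, which returns the index label of the first maximum in each group. A small sketch on made-up data with a tie:

```python
import pandas as pd

# Hypothetical data: group A has two rows tied for the maximum.
df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'count': [5, 9, 9, 3, 1],
})

# idxmax gives one index label per group: the first row holding the maximum.
first_max_idx = df.groupby('country')['count'].idxmax()

# Dropping those labels removes exactly one row per group.
result = df.drop(first_max_idx)
print(result)
```

Here only one of the two 9s in group A is removed, whereas the transform('max') filter would remove both.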
I have a DataFrame like this:
user = pd.DataFrame({'User':['101','101','101','102','102','101','101','102','102','102'],'Country':['India','Japan','India','Brazil','Japan','UK','Austria','Japan','Singapore','UK']})
I want to apply a custom sort on Country so that Japan comes first for both users.
I tried this, but it does not give my expected output:
user.sort_values(['User','Country'], ascending=[True, False], inplace=True)
My expected output:
expected_output = pd.DataFrame({'User':['101','101','101','101','101','102','102','102','102','102'],'Country':['Japan','India','India','UK','Austria','Japan','Japan','Brazil','Singapore','UK']})
I tried casting the column to a category and passing the categories with Japan at the top. Is there any other approach? I don't want to pass the full list of countries every time. I just want to specify something like user 101 - Japan or user 102 - UK, and have the remaining rows follow in their existing order.
Thanks
Create a helper sort key with map:
user.assign(New=user.Country.map({'Japan':1}).fillna(0)).sort_values(['User','New'], ascending=[True, False]).drop(columns='New')
Out[80]:
Country User
1 Japan 101
0 India 101
2 India 101
5 UK 101
6 Austria 101
4 Japan 102
7 Japan 102
3 Brazil 102
8 Singapore 102
9 UK 102
Update based on the comment:
mapdf=pd.DataFrame({'Country':['Japan','UK'],'User':['101','102'],'New':[1,1]})
user.merge(mapdf,how='left').fillna(0).sort_values(['User','New'], ascending=[True, False]).drop(columns='New')
Out[106]:
Country User
1 Japan 101
0 India 101
2 India 101
5 UK 101
6 Austria 101
9 UK 102
3 Brazil 102
4 Japan 102
7 Japan 102
8 Singapore 102
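Use boolean indexing with concat (DataFrame.append was removed in pandas 2.0), then sort by column User with a stable sort so Japan stays on top within each user:
user = (pd.concat([user[user['Country'] == 'Japan'],
                   user[user['Country'] != 'Japan']])
          .sort_values('User', kind='mergesort'))
Alternative solution:
user = (pd.concat([user.query('Country == "Japan"'),
                   user.query('Country != "Japan"')])
          .sort_values('User', kind='mergesort'))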
print (user)
User Country
1 101 Japan
0 101 India
2 101 India
5 101 UK
6 101 Austria
4 102 Japan
7 102 Japan
3 102 Brazil
8 102 Singapore
9 102 UK
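For the per-user case (user 101 - Japan, user 102 - UK), a plain dict can stand in for the helper DataFrame. This is only a sketch: the first_country mapping and the temporary _first column are names I made up for illustration.

```python
import pandas as pd

user = pd.DataFrame({
    'User': ['101', '101', '101', '102', '102', '101', '101', '102', '102', '102'],
    'Country': ['India', 'Japan', 'India', 'Brazil', 'Japan', 'UK', 'Austria',
                'Japan', 'Singapore', 'UK'],
})

# Hypothetical mapping: which country should come first for each user.
first_country = {'101': 'Japan', '102': 'UK'}

# True where the row's country is the preferred one for that row's user.
is_first = user['Country'].eq(user['User'].map(first_country))

result = (
    user.assign(_first=is_first)
        # Flagged rows first within each user; ties keep their original order.
        .sort_values(['User', '_first'], ascending=[True, False])
        .drop(columns='_first')
)
print(result)
```

Users missing from the mapping simply get no flagged rows, so their rows keep their original relative order.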
I have a dataframe:
df1=pd.DataFrame({
'ID':[101,102],
'Name':['Axel','Bob'],
'US':['GrA','GrC'],
'Europe':['GrB','GrD'],
'AsiaPac':['GrZ','GrF']
})
Which I want to change to this:
df2=pd.DataFrame({
'ID':[101,101,101,102,102,102],
'Name':['Axel','Axel','Axel','Bob','Bob','Bob'],
'Region':['US','Europe','AsiaPac','US','Europe','AsiaPac'],
'Group':['GrA','GrB','GrZ','GrC','GrD','GrF']
})
How do I do it? There is a crosstab function in pandas but it doesn't do this. In Qlik I would simply do
Crosstable(Region,Group,2)
LOAD
ID,
Name,
US,
Europe,
AsiaPac
And I would go from df1 to df2. How can I do this in python (pandas or otherwise)?
This is essentially reshaping your data from a wide format to a long format, as it's known in R parlance. In pandas, you can do this with pd.melt:
pd.melt(df1, id_vars=['ID', 'Name'], var_name='Region', value_name='Group')
# ID Name Region Group
# 0 101 Axel AsiaPac GrZ
# 1 102 Bob AsiaPac GrF
# 2 101 Axel Europe GrB
# 3 102 Bob Europe GrD
# 4 101 Axel US GrA
# 5 102 Bob US GrC
If you need the rows sorted by ID and Group, as in your example output, you can add .sort_values() to the expression:
pd.melt(df1, id_vars=['ID', 'Name'], var_name='Region', value_name='Group').sort_values(['ID', 'Group'])
# ID Name Region Group
# 4 101 Axel US GrA
# 2 101 Axel Europe GrB
# 0 101 Axel AsiaPac GrZ
# 5 102 Bob US GrC
# 3 102 Bob Europe GrD
# 1 102 Bob AsiaPac GrF
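To reproduce df2 exactly, including the US, Europe, AsiaPac order within each ID, sorting on Group only works by coincidence of the group names. A sketch that relies instead on a stable sort by ID, assuming a pandas/Python combination that keeps dict insertion order for the columns (Python 3.7+):

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID': [101, 102],
    'Name': ['Axel', 'Bob'],
    'US': ['GrA', 'GrC'],
    'Europe': ['GrB', 'GrD'],
    'AsiaPac': ['GrZ', 'GrF'],
})

# melt emits one block of rows per region column, in column order.
long_df = pd.melt(df1, id_vars=['ID', 'Name'],
                  var_name='Region', value_name='Group')

# A stable sort on ID keeps that region order inside each ID.
long_df = long_df.sort_values('ID', kind='mergesort').reset_index(drop=True)
print(long_df)
```

The region order in the result follows the column order of df1, so reordering the columns beforehand changes the output order.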
You can try
1st
stack()
df1.set_index(['ID','Name']).stack().reset_index().rename(columns={'level_2':'Region',0:'Group'})
Out[890]:
ID Name Region Group
0 101 Axel AsiaPac GrZ
1 101 Axel Europe GrB
2 101 Axel US GrA
3 102 Bob AsiaPac GrF
4 102 Bob Europe GrD
5 102 Bob US GrC
2nd
pd.wide_to_long, even though it is overkill. :)
df1=df1.rename(columns={'AsiaPac':'Group_AsiaPac','Europe':'Group_Europe','US':'Group_US'})
# suffix is a regex matched against the whole suffix; r'\w+' covers names like 'AsiaPac'
pd.wide_to_long(df1,['Group'], i=['ID','Name'], j='Region',sep='_',suffix=r'\w+').reset_index()
Out[918]:
ID Name Region Group
0 101 Axel AsiaPac GrZ
1 101 Axel Europe GrB
2 101 Axel US GrA
3 102 Bob AsiaPac GrF
4 102 Bob Europe GrD
5 102 Bob US GrC
I am trying to read a tab delimited text file into a dataframe.
This is how the file looks in Excel:
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
Import into a df:
df = pd.read_table("E:/FileLoc/ThisIsAFile.txt", encoding = "iso-8859-1")
Now it doesn't see the first 3 columns as part of the column index (df[0] = Transaction Type), and all of the headers shift over to reflect this:
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
I am trying to manipulate the text file and then import it to a mysql database as an end result.
You can use read_csv with a separator of two or more whitespace characters:
import pandas as pd
import io
temp=u"""CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=r'\s{2,}', engine='python', encoding = "iso-8859-1")
print (df)
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE \
0 5/13/2016 0:00 13867666 6892372 S
CUSTOMER_NUMBER CUSTOMER_NAME
0 2026 CUSTOMER 1
If the separator is a tab, use sep='\t'.
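One possible cause of the shifted headers is that each data line ends with an extra tab, so rows have more fields than the header and pandas turns the leading column into the index. That is a guess about your file, not something I can see from the question. A small sketch with an inline sample (the raw string is invented to mimic that situation) showing index_col=False, which tells read_csv not to use the first column as the index:

```python
import io

import pandas as pd

# Invented sample: the data line ends with one extra tab, the header does not.
raw = (
    "CALENDAR_DATE\tORDER_NUMBER\tINVOICE_NUMBER\t"
    "TRANSACTION_TYPE\tCUSTOMER_NUMBER\tCUSTOMER_NAME\n"
    "5/13/2016 0:00\t13867666\t6892372\tS\t2026\tCUSTOMER 1\t\n"
)

# index_col=False keeps every header name attached to its own column.
df = pd.read_csv(io.StringIO(raw), sep='\t', index_col=False)
print(df)
```

If your file has no trailing tabs, plain sep='\t' should be enough and index_col=False is harmless.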
EDIT:
I tested it with your data and it works:
import pandas as pd
df = pd.read_csv('test/AnonymizedData.txt', sep='\t')
print (df)
CUSTOMER_NUMBER CUSTOMER_NAME CUSTOMER_BRANCH_CODE CUSTOMER_BRANCH_NAME \
0 2026 CUSTOMER 1 83 SALES BRANCH 1
1 2359 CUSTOMER 2 76 SALES BRANCH 2
2 100662 CUSTOMER 3 28 SALES BRANCH 3
3 3245 CUSTOMER 4 84 SALES BRANCH 4
4 3179 CUSTOMER 5 28 SALES BRANCH 5
5 39881 CUSTOMER 6 67 SALES BRANCH 6
6 37020 CUSTOMER 7 58 SALES BRANCH 7
7 1239 CUSTOMER 8 50 SALES BRANCH 8
8 2379 CUSTOMER 9 76 SALES BRANCH 9
CUSTOMER_CITY CUSTOMER_STATE ... PRICING_PRODUCT_TYPE_CODE \
0 TOWN 1 CO ... 11
1 TOWN 2 OH ... 11
2 TOWN 3 ME ... 11
3 TOWN 4 IL ... 11
4 TOWN 5 NH ... 11
5 TOWN 6 TX ... 11
6 TOWN 7 NC ... 11
7 TOWN 8 NY ... 11
8 TOWN 9 OH ... 11
PRICING_PRODUCT_TYPE ORGANIZATION_ID ORGANIZATION_NAME PRODUCT_LINE_CODE \
0 DISPOSABLES 83 ORGANIZATIONNAME 891
1 DISPOSABLES 83 ORGANIZATIONNAME 891
2 DISPOSABLES 83 ORGANIZATIONNAME 891
3 DISPOSABLES 83 ORGANIZATIONNAME 891
4 DISPOSABLES 83 ORGANIZATIONNAME 891
5 DISPOSABLES 83 ORGANIZATIONNAME 891
6 DISPOSABLES 83 ORGANIZATIONNAME 891
7 DISPOSABLES 83 ORGANIZATIONNAME 891
8 DISPOSABLES 83 ORGANIZATIONNAME 891
PRODUCT_LINE ROBOTIC_FLAG Unnamed: 52 Unnamed: 53 Unnamed: 54
0 PRODUCTNAME N N NaN 3
1 PRODUCTNAME N N NaN 3
2 PRODUCTNAME N N NaN 2
3 PRODUCTNAME N N NaN 7
4 PRODUCTNAME N N NaN 1
5 PRODUCTNAME N N NaN 4
6 PRODUCTNAME N N NaN 3
7 PRODUCTNAME N N NaN 5
8 PRODUCTNAME N N NaN 3
[9 rows x 55 columns]