I have a dataframe df as below:
Cus_id Card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now I want two dataframes, as below:
df_1 :
card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1:
1. Total_Amount: for each unique combination of card and Code, get the sum of Amount (e.g. card 1, Code 543: 100 + 100 + 150 = 350).
2. Avg_Amount: divide Total_Amount by the number of unique yearmonth values for that card and Code (e.g. Total_Amount = 350 and there are 2 unique yearmonth values, so 350 / 2 = 175).
df_2 :
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2:
1. Avg_Amount: for each Code, sum the Avg_Amount values in df_1 and divide by the number of rows for that Code (e.g. for Code 543 the sum of Avg_Amount is 175 + 100 + 200 + 150 = 625 over 4 rows, so 625 / 4 = 156.25).
Code to create the dataframe df:
import pandas as pd
df=pd.DataFrame({'Cus_id': (111,111,111,111,111,222,222,222,333,333,333,333),
                 'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
                 'Nation':('India','India','India','India','India','India','India','India','India','India','India','India'),
                 'Gender': ('M','M','M','M','M','M','M','M','M','M','M','M'),
                 'Age':('Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult'),
                 'Code':(543,543,543,612,715,715,543,543,543,543,543,612),
                 'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
                 'yearmonth':(201601,201601,201602,201602,201603,201601,201604,201603,201601,201601,201602,201607)})
Code to get the required df_1 and df_2:
# select the two columns with a list (double brackets); tuple-style selection is not supported by recent pandas
df1 = df.groupby(['Card','Code'])[['yearmonth','Amount']].apply(
    lambda x: [sum(x.Amount),sum(x.Amount)/len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns= ['Card','Code','Total_Amount','Avg_Amount']
df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x)/len(x)).reset_index(
    name='Avg_Amount')
The code works, but my dataset is huge and it takes a long time to run. I suspect the apply calls are the bottleneck. Is there a faster way to do this?
For DataFrame 1 you can do this:
tmp = df.groupby(['Card', 'Code'], as_index=False) \
        .agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
         .drop(columns=['yearmonth'])
Card Code Amount Avg_Amount
0 1 543 350 175.0
1 1 612 100 100.0
2 1 715 200 200.0
3 2 543 200 100.0
4 2 715 200 200.0
5 3 543 200 200.0
6 4 543 150 150.0
7 4 612 100 100.0
For DataFrame 2 you can do this:
df1.groupby('Code', as_index=False) \
   .agg({'Avg_Amount': 'mean'})
Code Avg_Amount
0 543 156.25
1 612 100.00
2 715 200.00
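As a variation on the same idea, named aggregation (available in pandas 0.25 and later) can produce a Total_Amount column directly, matching the column names asked for in the question. This is only a sketch on the relevant columns of the sample data; n_months is a hypothetical helper column name, not something from the question.

```python
import pandas as pd

# Only the columns needed for the aggregation, copied from the question's sample data.
df = pd.DataFrame({
    'Card': (1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
    'Code': (543, 543, 543, 612, 715, 715, 543, 543, 543, 543, 543, 612),
    'Amount': (100, 100, 150, 100, 200, 200, 100, 100, 100, 100, 150, 100),
    'yearmonth': (201601, 201601, 201602, 201602, 201603, 201601,
                  201604, 201603, 201601, 201601, 201602, 201607),
})

# One pass over the groups: sum of Amount and number of distinct months.
# 'n_months' is a hypothetical helper column, dropped again below.
df1 = (
    df.groupby(['Card', 'Code'], as_index=False)
      .agg(Total_Amount=('Amount', 'sum'), n_months=('yearmonth', 'nunique'))
)
df1['Avg_Amount'] = df1['Total_Amount'] / df1['n_months']
df1 = df1.drop(columns='n_months')

# Mean of the per-card averages for each Code.
df2 = df1.groupby('Code', as_index=False)['Avg_Amount'].mean()

print(df1)
print(df2)
```

Both steps use built-in aggregations ('sum', 'nunique', 'mean') rather than a Python lambda, which is usually where the speed-up over apply comes from.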
I have data as below. How can I convert it into a dataframe? I need the country name (some country names contain a comma) as the first column and each of the other values in its own column.
Input is a txt file with many countries:
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
Output should be a dataframe with country name as first column
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14 13 12 11 11 10 9
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485
Congo, Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666
You can first use read_csv (it is no problem that the file is .txt) with a separator that does not occur in the values, such as |, to read each line into a Series. Then extract and strip the country names into one column and split the remaining values by , into the other columns:
import pandas as pd
# pandas.compat.StringIO was removed from pandas; use the standard library instead
from io import StringIO
temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
# the squeeze=True argument was removed in pandas 2.0; squeeze the one-column frame instead
s = pd.read_csv(StringIO(temp), sep="|", header=None).squeeze("columns")
print (s)
0 Czech Republic,22,22,22,21,21,21,21,21,19,18,1...
1 Congo,Dem.Rep.,275,306,327,352,376,411,420,466...
2 Congo,Rep.,209,222,231,243,255,269,424,457,367...
Name: 0, dtype: object
df = s.str.extract('([A-Za-z ,.]+)([0-9,]+)', expand=True)
df[0] = df[0].str.strip(',')
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None).reset_index()
# reset column names to 0, 1, 2, ...
df.columns = range(len(df.columns))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
If you need the countries as the index, use this line in place of the set_index line above (it leaves out reset_index):
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 \
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
12 13 14 15 16 17
Czech Republic 13 12 11 11 10 9
Congo,Dem.Rep. 697 708 710 702 692 666
Congo,Rep. 402 509 477 482 511 485
Another solution uses the regex from another answer. It can be passed as the sep parameter; engine='python' is needed because a regex separator otherwise triggers a parser warning:
import pandas as pd
# pandas.compat.StringIO was removed from pandas; use the standard library instead
from io import StringIO
temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
# split only on commas that are followed by a digit (raw string avoids escape warnings)
df = pd.read_csv(StringIO(temp), sep=r",(?=\d)", header=None, engine='python')
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
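Since the question asks for the country name as a labelled first column, the frame read this way can be renamed afterwards. Below is a minimal sketch on a shortened sample (only the first three values of each row above), using io.StringIO from the standard library; the column name Country is an assumption chosen for illustration.

```python
import io
import pandas as pd

# Shortened sample: the first three values of each row from the question.
sample = (
    "Czech Republic,22,22,22\n"
    "Congo,Dem.Rep.,275,306,327\n"
    "Congo,Rep.,209,222,231\n"
)

# Split only on commas followed by a digit, so commas inside names are kept.
df = pd.read_csv(io.StringIO(sample), sep=r",(?=\d)", header=None, engine="python")

# 'Country' is a hypothetical label for the first column.
df = df.rename(columns={0: "Country"}).set_index("Country")

print(df)
print(df.dtypes)
```

Note that the lookahead relies on country names never containing a comma directly followed by a digit; a name like that would be split in the wrong place.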
jezrael's answer is the way to go if you want the complete output asap.
If you want to really understand some simpler code, try doing the following:
Split the string into some lists like this:
data = "Czech Republic..."
lines = data.split('\n')
rows = []
then iterate over the lines, and append them to a list of lists:
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False
for line in lines:
    temp = line.split(',')
    if is_number(temp[1]):
        # the second field is already a number, so the country name has no comma
        rows.append(temp)
    else:
        # the country name contains a comma: rejoin its two parts into one field
        rows.append([','.join(temp[:2])] + temp[2:])
Then take this list of lists and read the pandas DataFrame documentation on how to pretty-print it. (Hint: make the list of lists a dict first)
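For reference, here is a self-contained sketch of this manual approach on a shortened sample (the first three values of each row from the question). It assumes that at most one comma appears inside a country name, and it passes the list of lists straight to pd.DataFrame rather than building a dict first, which is one possible choice rather than the only one.

```python
import pandas as pd


def is_number(s):
    """Return True if the string can be parsed as a number."""
    try:
        float(s)
        return True
    except ValueError:
        return False


# Shortened sample rows taken from the question.
lines = [
    "Czech Republic,22,22,22",
    "Congo,Dem.Rep.,275,306,327",
    "Congo,Rep.,209,222,231",
]

rows = []
for line in lines:
    parts = line.split(',')
    if is_number(parts[1]):
        # Name has no comma: the fields are already correct.
        rows.append(parts)
    else:
        # Name was split in two: glue the first two fields back together.
        rows.append([','.join(parts[:2])] + parts[2:])

# A list of equal-length lists can be passed to DataFrame directly.
df = pd.DataFrame(rows)
print(df)
```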
A solution using the re.split() function and a DataFrame with labeled columns:
import pandas as pd, re
s = '''
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
'''
data = []
for l in s.split('\n'):
if l: data.append(re.split(r',(?=\d)', l))
# setting output options
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 1000)
df = pd.DataFrame(data, columns=['Country name'] + list(range(len(data[0][1:]))))
print(df)
The output:
Country name 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14 13 12 11 11 10 9
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485
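One thing to be aware of with re.split() is that every field it returns is a string, so the value columns hold text rather than numbers. If numeric columns are needed, they can be converted afterwards. A small sketch on a shortened sample (first three values of two rows from the question):

```python
import re
import pandas as pd

# Shortened sample rows taken from the question.
lines = [
    "Czech Republic,22,22,22",
    "Congo,Dem.Rep.,275,306,327",
]

# Split on commas followed by a digit; every resulting field is a string.
data = [re.split(r",(?=\d)", line) for line in lines]
df = pd.DataFrame(data, columns=["Country name", 0, 1, 2])

# Convert each value column from strings to numbers, leaving the name column alone.
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```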