Pivoting data with date as a row in Python

I have data in a format that I thought would let me pivot on the dates. It looks like:
Region 0 1 2 3
Date 2005-01-01 2005-02-01 2005-03-01 ....
East South Central 400 500 600
Pacific 100 200 150
.
.
Mountain 500 600 450
I need to pivot this table so it looks like:
  Date       Region             value
0 2005-01-01 East South Central 400
1 2005-02-01 East South Central 500
2 2005-03-01 East South Central 600
.
.
3 2005-01-01 Pacific 100
4 2005-02-01 Pacific 200
5 2005-03-01 Pacific 150
.
.
Since Date and Region are stacked one under the other, I'm not sure how to melt or pivot around those labels to get my desired output.
How can I go about this?

I think this is the solution you are looking for, shown by example.
import pandas as pd
import numpy as np

N = 100
regions = list('abcdef')
df = pd.DataFrame([[i for i in range(N)],
                   ['2016-{}'.format(i) for i in range(N)],
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N))])
df.index = ['Region', 'Date', 'a', 'b', 'c', 'd']
print(df)
This gives
0 1 2 3 4 5 6 7 \
Region 0 1 2 3 4 5 6 7
Date 2016-0 2016-1 2016-2 2016-3 2016-4 2016-5 2016-6 2016-7
a 96 432 181 64 87 355 339 314
b 360 23 162 98 450 78 114 109
c 143 375 420 493 321 277 208 317
d 371 144 207 108 163 67 465 130
And the solution to pivot this into the form you want is
df.transpose().melt(id_vars=['Date'], value_vars=['a', 'b', 'c', 'd'])
which gives
Date variable value
0 2016-0 a 96
1 2016-1 a 432
2 2016-2 a 181
3 2016-3 a 64
4 2016-4 a 87
5 2016-5 a 355
6 2016-6 a 339
7 2016-7 a 314
8 2016-8 a 111
9 2016-9 a 121
10 2016-10 a 124
11 2016-11 a 383
12 2016-12 a 424
13 2016-13 a 453
...
393 2016-93 d 176
394 2016-94 d 277
395 2016-95 d 256
396 2016-96 d 174
397 2016-97 d 349
398 2016-98 d 414
399 2016-99 d 132
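Applied to the frame in the question, the same transpose-then-melt pattern would look roughly like this (a sketch: it assumes the raw frame carries 'Date' and the region names as row labels, exactly as printed in the question, and uses var_name to relabel melt's default 'variable' column as Region):
# a sketch under the assumptions above; melt ignores rows not listed in
# id_vars/value_vars, so the leftover 'Region' header row drops out
long_df = (df.transpose()
             .melt(id_vars=['Date'],
                   value_vars=['East South Central', 'Pacific', 'Mountain'],
                   var_name='Region', value_name='value'))
print(long_df)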

Related

How could I count the rating for each item_id?

From the u.data file, which has [100000 rows x 4 columns],
I have to find out which are the best movies.
For each unique item_id (there are 1682 of them), I try to find the overall rating separately.
import pandas as pd

ratings = pd.read_csv("erg3/files/u.data", encoding="utf-8", delim_whitespace=True,
                      names=["user_id", "item_id", "rating", "timestamp"])
The data has this form:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
....
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
My expected output :
item_id
1 1753
2 420
3 273
4 742
...
1570 1
1486 1
1626 1
1580 1
I used:
best_m = ratings.groupby("item_id")["rating"].sum()
best_m = best_m.sort_values(ascending=False)
And the output looks like :
50 2541
100 2111
181 2032
258 1936
174 1786
...
1581 1
1570 1
1486 1
1626 1
1580 1
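The expected and actual listings differ only in ordering: sort_values(ascending=False) orders by the summed rating, whereas the expected output is in item_id order. A minimal sketch of both readings of the question, assuming the column names from the read_csv call above:
# summed rating per item, left in item_id order (matches the expected listing)
totals = ratings.groupby("item_id")["rating"].sum()
# number of ratings per item, if a literal count is what is wanted
counts = ratings.groupby("item_id")["rating"].count()
# equivalently: ratings["item_id"].value_counts().sort_index()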

pad rows on a pandas dataframe with zeros till N count

I am loading data via pandas read_csv like so:
data = pd.read_csv(file_name_item, sep=" ", header=None, usecols=[0,1,2])
which looks like so:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
I would like to pad this data with zeros till a row count of 256, meaning:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
11 0 0 0
.. .. .. ..
256 0 0 0
How do I go about doing this? The file could have anything from 1 row to 200-odd rows, and I am looking for something generic that pads this dataframe with 0's till 256 rows.
I am quite new to pandas and could not find any function to do this.
reindex with fill_value
df_final = data.reindex(range(257), fill_value=0)
Out[1845]:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
.. ... ... ..
252 0 0 0
253 0 0 0
254 0 0 0
255 0 0 0
256 0 0 0
[257 rows x 3 columns]
We can do
new_df = df.reindex(range(257)).fillna(0, downcast='infer')
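The same idea as a generic helper for any target length (a sketch; pad_to is a made-up name, not from either answer):
import pandas as pd

def pad_to(df, n):
    # reindex to 0..n-1; rows beyond the current length are filled with zeros
    return df.reindex(range(n), fill_value=0)

padded = pad_to(data, 257)  # 257 rows, indices 0..256, as in the question's example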

Is there a way to optimize pandas apply function during groupby?

I have a dataframe df as below:
Stud_id card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now, I want two dataframes as below :
df_1 :
card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1 :
1. Total_Amount : For each unique card and unique Code, get the sum of Amount ( For eg : card : 1 , Code : 543 = 350 )
2. Avg_Amount : Divide the Total_Amount by the no. of unique yearmonth for each unique card and unique Code ( For eg : Total_Amount = 350, no. of unique yearmonth is 2, so 350/2 = 175 )
df_2 :
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2 :
1. Avg_Amount : Mean of the Avg_Amount values for each Code in df_1 ( For eg : for Code 543 the sum of Avg_Amount is 175+100+200+150 = 625; divide it by the no. of rows, 4, so 625/4 = 156.25 )
Code to create the data frame - df :
df = pd.DataFrame({'Cus_id': (111,111,111,111,111,222,222,222,333,333,333,333),
                   'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
                   'Nation': ('India','India','India','India','India','India','India','India','India','India','India','India'),
                   'Gender': ('M','M','M','M','M','M','M','M','M','M','M','M'),
                   'Age': ('Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult'),
                   'Code': (543,543,543,612,715,715,543,543,543,543,543,612),
                   'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
                   'yearmonth': (201601,201601,201602,201602,201603,201601,201604,201603,201601,201601,201602,201607)})
Code to get the required df_1 and df_2 :
df1 = df.groupby(['Card','Code'])['yearmonth','Amount'].apply(
    lambda x: [sum(x.Amount), sum(x.Amount)/len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns = ['Card','Code','Total_Amount','Avg_Amount']
df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x)/len(x)).reset_index(
    name='Avg_Amount')
Though the code works fine, my dataset is huge and it is taking time. I suspect the apply function is the bottleneck. Is there a more optimized approach?
For DataFrame 1 you can do this:
tmp = df.groupby(['Card', 'Code'], as_index=False) \
        .agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
         .drop(columns=['yearmonth'])
Card Code Amount Avg_Amount
0 1 543 350 175.0
1 1 612 100 100.0
2 1 715 200 200.0
3 2 543 200 100.0
4 2 715 200 200.0
5 3 543 200 200.0
6 4 543 150 150.0
7 4 612 100 100.0
For DataFrame 2 you can do this:
df1.groupby('Code', as_index=False) \
.agg({'Avg_Amount': 'mean'})
Code Avg_Amount
0 543 156.25
1 612 100.00
2 715 200.00
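On pandas 0.25 or later, named aggregation lets the output columns get their final names directly in agg; a sketch of the equivalent logic (not part of the original answer):
df1 = (df.groupby(['Card', 'Code'], as_index=False)
         .agg(Total_Amount=('Amount', 'sum'),
              n_months=('yearmonth', 'nunique')))
df1['Avg_Amount'] = df1['Total_Amount'] / df1['n_months']
df1 = df1.drop(columns='n_months')
df2 = df1.groupby('Code', as_index=False).agg(Avg_Amount=('Avg_Amount', 'mean'))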

Extract complete country name from a string and make it as a dataframe column

I have data as below. How do I convert it into a dataframe? I need the country name (some country names have a comma in between) as the first column and the other values as separate columns.
The input is a txt file with many countries:
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
Output should be a dataframe with country name as first column
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14 13 12 11 11 10 9
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485
Congo, Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666
You can first use read_csv (it handles a .txt file just as well) with a separator that does not occur in the values, such as |, so each line becomes one entry of a Series; then extract and strip the country names into one column and split the remaining values by ,:
import pandas as pd
import numpy as np
from io import StringIO

temp = u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
s = pd.read_csv(StringIO(temp), sep="|", header=None).squeeze("columns")
print (s)
0 Czech Republic,22,22,22,21,21,21,21,21,19,18,1...
1 Congo,Dem.Rep.,275,306,327,352,376,411,420,466...
2 Congo,Rep.,209,222,231,243,255,269,424,457,367...
Name: 0, dtype: object
df = s.str.extract('([A-Za-z ,.]+)([0-9,]+)', expand=True)
df[0] = df[0].str.strip(',')
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None).reset_index()
#reset column names by 0,1,2...
df.columns = np.arange(len(df.columns))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
If need index with countries:
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 \
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
12 13 14 15 16 17
Czech Republic 13 12 11 11 10 9
Congo,Dem.Rep. 697 708 710 702 692 666
Congo,Rep. 402 509 477 482 511 485
The regex from another answer can also be used directly as the sep parameter; engine='python' is necessary because the default C engine does not support regex separators and would emit a warning:
import pandas as pd
from io import StringIO

temp = u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=r",(?=\d)", header=None, engine='python')
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
jezrael's answer is the way to go if you want the complete output as quickly as possible.
If you want to really understand some simpler code, try the following:
Split the string into some lists like this:
data = "Czech Republic..."
lines = data.split('\n')
rows = []
then iterate over the lines and append them to a list of lists:
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

for line in lines:
    temp = line.split(',')
    if is_number(temp[1]):
        # the second field is a number, so the first field alone is the country name
        rows.append(temp)
    else:
        # the country name itself contains a comma: re-join the first two
        # fields, then keep the remaining numeric fields
        rows.append([','.join(temp[:2])] + temp[2:])
then use this list of lists with the pandas DataFrame constructor, and see the DataFrame documentation on how to pretty-print it (hint: you can make the list of lists a dict first), as in the sketch below.
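For completeness, a minimal sketch of that last step (the column labels are mine, not from the answer):
import pandas as pd

# first column is the country name; the value columns are just numbered
df = pd.DataFrame(rows, columns=['Country'] + list(range(len(rows[0]) - 1)))
print(df)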
A solution using the re.split() function and a labeled data structure with columns:
import pandas as pd, re

s = '''
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
'''
data = []
for l in s.split('\n'):
    if l:
        data.append(re.split(r',(?=\d)', l))

# setting output options
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 1000)

df = pd.DataFrame(data, columns=['Country name'] + list(range(len(data[0][1:]))))
print(df)
The output:
Country name 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14 13 12 11 11 10 9
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485

Is there any way to keep a PeriodIndex as a series of Periods with a reset_index()?

I've noticed that for a DataFrame with a PeriodIndex, the month reverts to its native Int64 type upon a reset_index(), losing its freq attribute in the process. Is there any way to keep it as a Series of Periods?
For example:
In [42]: monthly
Out[42]:
qunits expend
month store upc
1992-12 1 21 83 248.17
72 3 13.95
78 2 6.28
79 1 5.82
85 5 28.10
87 1 1.87
88 6 11.76
...
1994-12 151 857 12 81.48
858 23 116.15
880 7 44.73
881 13 25.05
883 21 67.25
884 44 190.56
885 13 83.57
887 1 4.55
becomes:
In [43]: monthly.reset_index()
Out[43]:
month store upc qunits expend
0 275 1 21 83 248.17
1 275 1 72 3 13.95
2 275 1 78 2 6.28
3 275 1 79 1 5.82
4 275 1 85 5 28.10
5 275 1 87 1 1.87
6 275 1 88 6 11.76
7 275 1 89 21 41.16
...
500099 299 151 857 12 81.48
500100 299 151 858 23 116.15
500101 299 151 880 7 44.73
500102 299 151 881 13 25.05
500103 299 151 883 21 67.25
500104 299 151 884 44 190.56
500105 299 151 885 13 83.57
500106 299 151 887 1 4.55
Update 6/13/2014
It worked beautifully but the end result I need is the PeriodIndex values to be passed onto a grouped DataFrame. I got it to work but it seems to me that it could be done more compactly. I.e., my code is:
periods_index = monthly.index.get_level_values('month')
monthly.reset_index(inplace=True)
monthly.month = periods_index
grouped = monthly.groupby('month')
moments = pd.DataFrame(monthly.month.unique(), columns=['month'])
for month, group in grouped:
    moments.loc[moments.month == month, 'meanNo0'] = wmean(
        group[group.relative != 1].avExpend,
        np.log(group[group.relative != 1].relative))
Any further suggestions?
How about this:
periods_index = monthly.index.get_level_values('month')
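Combining that with the update above into a compact pattern (a sketch; it assumes the MultiIndex level is named 'month' as in the question, and recent pandas versions should preserve the period dtype through reset_index on their own):
# pull the Period values off the index before resetting, then write them
# back so the 'month' column is a Series of Periods that keeps its freq
periods_index = monthly.index.get_level_values('month')
flat = monthly.reset_index()
flat['month'] = periods_index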
