Pandas: transposing one column in a multiple column df - python

This is the data that I have:
              cmte_id trans entity  st  amount     fec_id
date
2007-08-15  C00112250   24K    ORG  DC    2000  C00431569
2007-09-26  C00119040   24K    CCM  FL    1000  C00367680
2007-09-26  C00119040   24K    CCM  MD    1000  C00140715
2007-07-20  C00346296   24K    CCM  CA    1000  C00434571
2007-09-24  C00346296   24K    CCM  MA    1000  C00433136
There are other descriptive columns that I have left out for the sake of brevity.
I would like to transform it so that the values in [cmte_id] become column headers and the values in [amount] become the respective values in the new columns. I know that this is probably a simple pivot operation. I have tried the following:
dfy.pivot('cmte_id', 'amount')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-203-e5d2cb89e880> in <module>()
----> 1 dfy.pivot('cmte_id', 'amount')
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in pivot(self, index, columns, values)
3761 """
3762 from pandas.core.reshape import pivot
-> 3763 return pivot(self, index=index, columns=columns, values=values)
3764
3765 def stack(self, level=-1, dropna=True):
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in pivot(self, index, columns, values)
323 append = index is None
324 indexed = self.set_index(cols, append=append)
--> 325 return indexed.unstack(columns)
326 else:
327 if index is None:
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in unstack(self, level)
3857 """
3858 from pandas.core.reshape import unstack
-> 3859 return unstack(self, level)
3860
3861 #----------------------------------------------------------------------
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in unstack(obj, level)
402 if isinstance(obj, DataFrame):
403 if isinstance(obj.index, MultiIndex):
--> 404 return _unstack_frame(obj, level)
405 else:
406 return obj.T.stack(dropna=False)
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in _unstack_frame(obj, level)
442 else:
443 unstacker = _Unstacker(obj.values, obj.index, level=level,
--> 444 value_columns=obj.columns)
445 return unstacker.get_result()
446
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in __init__(self, values, index, level, value_columns)
96
97 self._make_sorted_values_labels()
---> 98 self._make_selectors()
99
100 def _make_sorted_values_labels(self):
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in _make_selectors(self)
134
135 if mask.sum() < len(self.index):
--> 136 raise ValueError('Index contains duplicate entries, '
137 'cannot reshape')
138
ValueError: Index contains duplicate entries, cannot reshape
Desired end result (except with additional columns, e.g. 'trans', fec_id, 'st', etc.) would look something like this:
date         C00112250  C00119040  C00119040  C00346296  C00346296
2007-Aug-15       2000
2007-Sep-26                  1000
2007-Sep-26                             1000
2007-Jul-20                                        1000
2007-Sep-24                                                   1000
Does anyone have an idea of how I can get closer to the end product?

Try this:
pvt = pd.pivot_table(df, index=df.index, columns='cmte_id',
                     values='amount', aggfunc='sum', fill_value=0)
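This works where pivot failed because DataFrame.pivot cannot aggregate: the (date, cmte_id) pair 2007-09-26/C00119040 occurs twice, which is exactly the "Index contains duplicate entries" error, while pivot_table collapses duplicates with aggfunc. A minimal sketch reproducing this on the sample rows from the question (reduced to the two relevant columns):
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    'cmte_id': ['C00112250', 'C00119040', 'C00119040', 'C00346296', 'C00346296'],
    'amount':  [2000, 1000, 1000, 1000, 1000]},
    index=pd.to_datetime(['2007-08-15', '2007-09-26', '2007-09-26',
                          '2007-07-20', '2007-09-24']))
df.index.name = 'date'

# df.pivot(columns='cmte_id', values='amount') raises
# "ValueError: Index contains duplicate entries, cannot reshape"
# because 2007-09-26 / C00119040 appears twice; pivot_table sums them.
pvt = pd.pivot_table(df, index=df.index, columns='cmte_id',
                     values='amount', aggfunc='sum', fill_value=0)
print(pvt)
Note that summing collapses the two 2007-09-26 rows into one; keeping them separate is what the variant below achieves by adding more columns to the index.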
Preserving other columns:
In [213]: pvt = pd.pivot_table(df.reset_index(), index=['index','trans','entity','st', 'fec_id'],
.....: columns='cmte_id', values='amount', aggfunc='sum', fill_value=0) \
.....: .reset_index()
In [214]: pvt
Out[214]:
cmte_id index trans entity st fec_id C00112250 C00119040 \
0 2007-07-20 24K CCM CA C00434571 0 0
1 2007-08-15 24K ORG DC C00431569 2000 0
2 2007-09-24 24K CCM MA C00433136 0 0
3 2007-09-26 24K CCM FL C00367680 0 1000
4 2007-09-26 24K CCM MD C00140715 0 1000
cmte_id C00346296
0 1000
1 0
2 1000
3 0
4 0
In [215]: pvt.head()['st']
Out[215]:
0 CA
1 DC
2 MA
3 FL
4 MD
Name: st, dtype: object
UPDATE:
import pandas as pd
import glob

# if you don't use the ['cand_id'] column, remove it from the `usecols` parameter
dfy = pd.concat([pd.read_csv(f, sep='|', low_memory=False, header=None,
                             names=['cmte_id', '2', '3', '4', '5', 'trans_typ',
                                    'entity_typ', '8', '9', 'state', '11',
                                    'employer', 'occupation', 'date', 'amount',
                                    'fec_id', 'cand_id', '18', '19', '20', '21', '22'],
                             usecols=['date', 'cmte_id', 'trans_typ', 'entity_typ',
                                      'state', 'amount', 'fec_id', 'cand_id'],
                             dtype={'date': str})
                 for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')],
                ignore_index=True)
dfy['date'] = pd.to_datetime(dfy['date'], format='%m%d%Y')
# remove the column we don't need ASAP in order to save memory
del dfy['cand_id']
dfy = dfy[(dfy['date'].notnull()) & (dfy['date'] > '2007-01-01') & (dfy['date'] < '2014-12-31')]
#df = dfy.set_index(['date'])
pvt = pd.pivot_table(dfy, index=['date', 'trans_typ', 'entity_typ', 'state', 'fec_id'],
                     columns='cmte_id', values='amount', aggfunc='sum', fill_value=0) \
        .reset_index()
print(pvt.info())
pvt.to_excel('out.xlsx', index=False)

Related

problem merging list and dataframe in Python

I have CSV files that I want to merge with a list of structs (a class I made).
In the CSV I have a field 'sector' and another field with information about this sector.
The list holds objects of a class I made with the fields: name, x, y, where (x, y) is the location that belongs to this name.
This is how I defined the list (I generated it from a CSV file as well, in which each antenna appears many times with different parameters, so I extracted only the ones I need):
# ant_file is the CSV with all the antennas, ant_list_name is the list with
# only antenna names, and ant_list_tot is the list with the name and also the
# x, y fields
for rowA in range(size_ant_file):
    rec = ant_file.iloc[rowA]['name']
    if rec not in ant_list_name:
        ant_list_name.append(rec)
        A = Antenna(ant_file.iloc[rowA]['name'], ant_file.iloc[rowA]['x'],
                    ant_file.iloc[rowA]['y'])
        ant_list_tot.append(A)

print(antenna_list)
[Antenna(name='UWE33', x=34.9, y=31.9), Antenna(name='UTN00', x=34.8, y=32.1),
 Antenna(name='UWE02', x=34.8, y=32.1)]
I tried to do it with a double for loop:
from dataclasses import dataclass

@dataclass
class Antenna:
    name: str
    x: float
    y: float

# records is the csv file and antenna_list is the list of type Antenna
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
The resulting CSV file is partly right: at the end there are rows with all fields correct except the x and y fields (which are 0), and some rows with x and y values but without the information from the original fields.
It seems like there is a big shift of rows, but I can't understand why.
I checked that there are no missing values.
Example:
records.csv at the beginning (date, hour and user_id are random numbers and not important):
sector date hour user_id x y
abc 1.1.19 20:00 123 0 0
dfs 5.8.17 12:40 876 0 0
ngh 6.9.19 08:12 962 0 0
yjt 10.10.16 17:18 492 0 0
abc 6.8.16 22:10 985 0 0
dfs 7.1.15 19:15 542 0 0
antenna_list in the form of (name, x, y) (again, x and y are random numbers right now and not important):
antenna_list[0] = (abc,12,16)
antenna_list[1] = (dfs,6,20)
antenna_list[2] = (ngh,13,98)
antenna_list[3] = (yjt,18,41)
the result I want to see is:
sector date hour user_id x y
abc 1.1.19 20:00 123 12 16
dfs 5.8.17 12:40 876 6 20
ngh 6.9.19 08:12 962 13 98
yjt 10.10.16 17:18 492 18 41
abc 6.8.16 22:10 985 12 16
dfs 7.1.15 19:15 542 6 20
but the real result is:
sector date hour user_id x y
abc 1.1.19 20:00 123 12 16
dfs 5.8.17 12:40 876 6 20
ngh 6.9.19 08:12 962 0 0
yjt 10.10.16 17:18 492 0 0
abc 6.8.16 22:10 985 0 0
dfs 7.1.15 19:15 542 0 0
13 98
18 41
12 16
6 20
TIA
If you save antenna_list as two dicts,
antenna_dict_x = {'abc':12, 'dfs':6, 'ngh':13, 'yjt':18}
antenna_dict_y = {'abc':16, 'dfs':20, 'ngh':98, 'yjt':41}
then creating two columns should be an easy map,
data['x']=data['sector'].map(antenna_dict_x)
data['y']=data['sector'].map(antenna_dict_y)
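If the antennas already live in a list of Antenna objects, the two dicts don't have to be written by hand; a minimal sketch, assuming the @dataclass Antenna from the question:
from dataclasses import dataclass
import pandas as pd

@dataclass
class Antenna:
    name: str
    x: float
    y: float

antenna_list = [Antenna('abc', 12, 16), Antenna('dfs', 6, 20),
                Antenna('ngh', 13, 98), Antenna('yjt', 18, 41)]

# Build the lookup dicts once, then map them onto the 'sector' column.
antenna_dict_x = {a.name: a.x for a in antenna_list}
antenna_dict_y = {a.name: a.y for a in antenna_list}

records = pd.DataFrame({'sector': ['abc', 'dfs', 'ngh', 'yjt', 'abc', 'dfs']})
records['x'] = records['sector'].map(antenna_dict_x)
records['y'] = records['sector'].map(antenna_dict_y)
print(records)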
So if you do:
import pandas as pd

class Antenna():
    def __init__(self, name, x, y):
        self.name = name
        self.x = x
        self.y = y

antenna_list = [Antenna('abc', 12, 16), Antenna('dfs', 6, 20),
                Antenna('ngh', 13, 98), Antenna('yjt', 18, 41)]
records = pd.read_csv('something.csv')

for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break

print(records)
you get:
sector date hour user_id x y
0 abc 1.1.19 20:00 123 12 16
1 dfs 5.8.17 12:40 876 6 20
2 ngh 6.9.19 8:12 962 13 98
3 yjt 10.10.16 17:18 492 18 41
4 abc 6.8.16 22:10 985 12 16
5 dfs 7.1.15 19:15 542 6 20
Which is what you were expecting. Also, if you do:
import pandas as pd
from dataclasses import dataclass

@dataclass
class Antenna:
    name: str
    x: float
    y: float

antenna_list = [Antenna('abc', 12, 16), Antenna('dfs', 6, 20),
                Antenna('ngh', 13, 98), Antenna('yjt', 18, 41)]
records = pd.read_csv('something.csv')

for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break

print(records)
you get:
sector date hour user_id x y
0 abc 1.1.19 20:00 123 12 16
1 dfs 5.8.17 12:40 876 6 20
2 ngh 6.9.19 8:12 962 13 98
3 yjt 10.10.16 17:18 492 18 41
4 abc 6.8.16 22:10 985 12 16
5 dfs 7.1.15 19:15 542 6 20
Which is, again, what you were expecting. You did not post how you created the antenna list, but I assume that is where your error is.
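Incidentally, the symptom in the question (extra rows at the end holding only x and y, while some original rows keep x = y = 0) is the classic result of looping over positions but indexing by label on a DataFrame whose index is not a clean RangeIndex, e.g. after rows were dropped: .at setting with a missing label enlarges the frame instead of failing. A hedged sketch of that failure mode, with made-up data:
import pandas as pd

# Index is NOT a clean RangeIndex (as if rows had been dropped earlier)
records = pd.DataFrame({'sector': ['abc', 'dfs', 'ngh'], 'x': 0, 'y': 0},
                       index=[0, 2, 5])

for index in range(len(records)):   # positions 0, 1, 2 ...
    records.at[index, 'x'] = 99     # ... used as *labels*

# Label 1 does not exist, so .at appends a new row with only x set,
# while the row labelled 5 is never touched and keeps x = 0.
print(records)

# records.reset_index(drop=True) before the loop (or purely positional
# access via .iloc) avoids the mismatch.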

DataFrame.values on selected column

I get the following error when I try to get not all the values but only a specified column. I think the error comes from the column I specify after .values.
Any help would be appreciated.
supp_bal dataframe:
circulating_supply total_supply
currency
0xBTC 4758600 20999984
1337 26456031141 29258384256
1SG 2187147 22227000
1ST 85558370 93468691
1WO 20981450 37219452
1X2 0 3051868
2GIVE 521605983 521605983
42 41 41
611 478519 478519
777 0 10000000000
A 26842657 278273649
AAA 15090818 397000000
pos_bal dataframe:
2019-07-23 2019-07-24
app_vendor_id currency
3 1WO 2604 2304
ABX 44 44
ADH 822 82
ALX 25 200
AMLT 3673 367
BCH -41 -26
my code:
f = pos_bal.index.get_level_values('currency')
supp_bal['circulating_supply'].loc[f].values['circulating_supply']
error:
pos_bal['circulating_supply'] = supp_bal['circulating_supply'].loc[f].values['circulating_supply']
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
You don't need to use the column name after values,
this should work:
import pandas as pd
import numpy as np
supp_bal = pd.read_csv('D:\\supp_bal.csv', header=0)
pos_bal = pd.read_csv('D:\\pos_bal.csv', header=0)
supp_bal = supp_bal.set_index('currency')
pos_bal = pos_bal.set_index(['app_vendor_id', 'currency'])
display(supp_bal)
display(pos_bal)
f = pos_bal.index.get_level_values('currency')
pos_bal['circulating_supply']= supp_bal['circulating_supply'].loc[f].values
display(pos_bal)
The output
circulating_supply total_supply
currency
0xBTC 4758600 20999984
1337 26456031141 29258384256
1SG 2187147 22227000
1ST 85558370 93468691
1WO 20981450 37219452
1X2 0 3051868
2GIVE 521605983 521605983
42 41 41
611 478519 478519
777 0 10000000000
A 26842657 278273649
AAA 15090818 397000000
7/23/2019 7/24/2019
app_vendor_id currency
3 1WO 2604 2304
ABX 44 44
ADH 822 82
ALX 25 200
AMLT 3673 367
Final pos_bal
7/23/2019 7/24/2019 circulating_supply
app_vendor_id currency
3 1WO 2604 2304 20981450.0
ABX 44 44 NaN
ADH 822 82 NaN
ALX 25 200 NaN
AMLT 3673 367 NaN
Note, in the data you provided, only 1WO appears in both DataFrames, that's why the other rows are all NaN.
btw, I have pandas 0.24.2.
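For the record, the IndexError happens because .values returns a plain NumPy ndarray, which carries no column labels; a tiny sketch of the failure mode (hypothetical values):
import pandas as pd

s = pd.Series([4758600, 26456031141], name='circulating_supply')

arr = s.values      # plain numpy.ndarray, labels are gone
print(type(arr))    # <class 'numpy.ndarray'>

# arr['circulating_supply']  # IndexError: only integers, slices (`:`), ...
# An ndarray only accepts integer/boolean indexing, so select the column
# on the Series/DataFrame before calling .values, not after.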
Or do you simply mean:
f = pos_bal.index.get_level_values('currency')
supp_bal['circulating_supply'].loc[f]

Dataset selective picking and transformation

I have a dataset in .xlsx with hundreds of thousands of rows as follows:
slug symbol name date ranknow open high low close volume market close_ratio spread
companyA AAA companyA 28/04/2013 1 135,3 135,98 132,1 134,21 0 1500520000 0,5438 3,88
companyA AAA companyA 29/04/2013 1 134,44 147,49 134 144,54 0 1491160000 0,7813 13,49
companyA AAA companyA 30/04/2013 1 144 146,93 134,05 139 0 1597780000 0,3843 12,88
....
companyA AAA companyA 17/04/2018 1 8071,66 8285,96 7881,72 7902,09 6900880000 1,3707E+11 0,0504 404,24
....
lancer LA Lancer 09/01/2018 731 0,347111 0,422736 0,345451 0,422736 3536710 0 1 0,08
lancer LA Lancer 10/01/2018 731 0,435794 0,512958 0,331123 0,487106 2586980 0 0,8578 0,18
lancer LA Lancer 11/01/2018 731 0,479738 0,499482 0,309485 0,331977 950410 0 0,1184 0,19
....
lancer LA Lancer 17/04/2018 731 0,027279 0,041106 0,02558 0,031017 9936 1927680 0,3502 0,02
....
yocomin YC Yocomin 21/01/2016 732 0,008135 0,010833 0,002853 0,002876 63 139008 0,0029 0,01
yocomin YC Yocomin 22/01/2016 732 0,002872 0,008174 0,001192 0,005737 69 49086 0,651 0,01
yocomin YC Yocomin 23/01/2016 732 0,005737 0,005918 0,001357 0,00136 67 98050 0,0007 0
....
yocomin YC Yocomin 17/04/2018 732 0,020425 0,021194 0,017635 0,01764 12862 2291610 0,0014 0
....
Let's say I have a .txt file with a list of symbol of that time series I want to extract. For example:
AAA
LA
YC
I would like to get a dataset that would look as follow:
date AAA LA YC
28/04/2013 134,21 NaN NaN
29/04/2013 144,54 NaN NaN
30/04/2013 139 NaN NaN
....
....
....
17/04/2018 7902,09 0,031017 0,01764
where under the stock symbol (like AAA, etc.) I get the "close" price. I'm open to both Python and R. Any help would be great!
In Python, using pandas, this should work.
import pandas as pd
df = pd.read_excel("/path/to/file/Book1.xlsx")
df = df.loc[:, ['symbol', 'name', 'date', 'close']]
df = df.set_index(['symbol', 'name', 'date'])
df = df.unstack(level=[0,1])
df = df['close']
To read the symbols file and then filter out symbols not in the dataframe:
symbols = pd.read_csv('/path/to/file/symbols.txt', sep=" ", header=None)
symbols = symbols[0].tolist()
symbols = pd.Index(symbols).unique()
symbols = symbols.intersection(df.columns.get_level_values(0))
And the output will look like:
print(df[symbols])
symbol AAA LA YC
name companyA Lancer Yocomin
date
2018-09-01 00:00:00 None 0,422736 None
2018-10-01 00:00:00 None 0,487106 None
2018-11-01 00:00:00 None 0,331977 None
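An equivalent one-step alternative is pivot_table; a sketch assuming the same hypothetical file paths, and that 'close' has already been converted from comma-decimal strings to floats:
import pandas as pd

df = pd.read_excel('/path/to/file/Book1.xlsx')
# Dates like 28/04/2013 are day-first, so parse them explicitly.
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

# One close-price column per symbol, indexed by date; missing
# symbol/date combinations come out as NaN automatically.
wide = df.pivot_table(index='date', columns='symbol', values='close')

symbols = pd.read_csv('/path/to/file/symbols.txt', header=None)[0].tolist()
print(wide.reindex(columns=symbols))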

How to slice pandas DataFrame based on values from another Dataframe without using for-loop?

I have a DataFrame df1:
df1.head() =
id type position
dates
2000-01-03 17378 600 400
2000-01-03 4203 600 150
2000-01-03 18321 600 5000
2000-01-03 6158 600 1000
2000-01-03 886 600 10000
2000-01-03 17127 600 800
2000-01-03 18317 1300 110
2000-01-03 5536 600 207
2000-01-03 5132 600 20000
2000-01-03 18191 600 2000
And a second DataFrame df2:
df2.head() =
dt_f dt_l
id_y id_x
670 715 2000-02-14 2003-09-30
704 2963 2000-02-11 2004-01-13
886 18350 2000-02-09 2001-09-24
1451 18159 2005-11-14 2007-03-06
2175 8648 2007-02-28 2007-09-19
2236 18321 2001-04-05 2002-07-02
2283 2352 2007-03-07 2007-09-19
6694 2007-03-07 2007-09-17
13865 2007-04-19 2007-09-19
14348 2007-08-10 2007-09-19
15415 2007-03-07 2007-09-19
2300 2963 2001-05-30 2007-09-26
I need to slice df1 for each value of id_x and count the number of rows within the interval dt_f:dt_l. This has to be done again for the values of id_y. Finally, the result should be merged on df2, giving the following DataFrame as output:
df_result.head() =
dt_f dt_l n_x n_y
id_y id_x
670 715 2000-02-14 2003-09-30 8 10
704 2963 2000-02-11 2004-01-13 13 25
886 18350 2000-02-09 2001-09-24 32 75
1451 18159 2005-11-14 2007-03-06 48 6
where n_x(n_y) corresponds to the number of rows contained in the interval dt_f:dt_l for each value of id_x(id_y).
Here is the for-loop I have used:
idx_list = df2.index.tolist()
k = 1
for j in idx_list:
    n_y = df1[df1.id == j[0]][df2['dt_f'].iloc[k]:df2['dt_l'].iloc[k]]['id'].count()
    n_x = df1[df1.id == j[1]][df2['dt_f'].iloc[k]:df2['dt_l'].iloc[k]]['id'].count()
Would it be possible to do it without using a for-loop? DataFrame df1 contains around 30000 rows and I am afraid a loop will slow down the process too much, since this is a small part of the whole script.
You want something like this:
#Merge the tables together - making sure we keep the index column
mg = df1.reset_index().merge(df2, left_on = 'id', right_on = 'id_x')
#Select only the rows that are within the start and end
mg = mg[(mg['index'] > mg['dt_f']) & (mg['index'] < mg['dt_l'])]
#Finally count by id_x
mg.groupby('id_x').count()
You'll need to tidy up the columns afterwards and repeat for id_y.
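A sketch of that tidy-up, wrapped in a helper so the same step runs for both id columns (the n_x/n_y names come from the desired result; 'dates' is df1's index name):
import pandas as pd

def interval_counts(df1, df2, id_col, out_name):
    # Merge each (id, interval) pair onto the matching df1 rows, keep the
    # rows whose date falls inside (dt_f, dt_l), and count per id.
    mg = df1.reset_index().merge(df2.reset_index(), left_on='id', right_on=id_col)
    mg = mg[(mg['dates'] > mg['dt_f']) & (mg['dates'] < mg['dt_l'])]
    return mg.groupby(id_col).size().rename(out_name)

n_x = interval_counts(df1, df2, 'id_x', 'n_x')
n_y = interval_counts(df1, df2, 'id_y', 'n_y')

# Attach both counts back onto df2's (id_y, id_x) MultiIndex.
result = df2.join(n_x, on='id_x').join(n_y, on='id_y')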

How to filter pivot tables on python

How do I filter pivot tables to return specific columns? Currently my dataframe is this:
print table
sum
Sex Female Male All
Date (Intervals)
April 166 191 357
August 212 263 475
December 173 263 436
February 192 298 490
January 148 195 343
July 189 260 449
June 165 238 403
March 165 278 443
May 236 253 489
November 167 247 414
October 185 287 472
September 175 306 481
All 2173 3079 5252
I want to display results of only the male column. I tried the following code:
table.query('Sex == "Male"')
However I got this error
TypeError: Expected tuple, got str
How would I be able to filter my table with specified rows or columns?
It looks like table has a column MultiIndex:
sum
Sex Female Male All
One way to check if your table has a column MultiIndex is to inspect table.columns:
In [178]: table.columns
Out[178]:
MultiIndex(levels=[['sum'], ['All', 'Female', 'Male']],
labels=[[0, 0, 0], [1, 2, 0]],
names=[None, 'sex'])
To access a column of table you need to specify a value for each level of the MultiIndex:
In [179]: list(table.columns)
Out[179]: [('sum', 'Female'), ('sum', 'Male'), ('sum', 'All')]
Thus, to select the Male column, you would use
In [176]: table[('sum', 'Male')]
Out[176]:
date
April 42.0
August 34.0
December 32.0
...
Since the sum level is unnecessary, you could get rid of it by specifying the values parameter when calling df.pivot or df.pivot_table.
table2 = df.pivot_table(index='date', columns='sex', aggfunc='sum', margins=True,
                        values='sum')
# sex Female Male All
# date
# April 40.0 40.0 80.0
# August 48.0 32.0 80.0
# December 48.0 44.0 92.0
For example,
import numpy as np
import pandas as pd
import calendar
np.random.seed(2016)
N = 1000
sex = np.random.choice(['Male', 'Female'], size=N)
date = np.random.choice(calendar.month_name[1:13], size=N)
df = pd.DataFrame({'sex':sex, 'date':date, 'sum':1})
# This reproduces a table similar to yours
table = df.pivot_table(index='date', columns='sex', aggfunc='sum', margins=True)
print(table[('sum', 'Male')])
# table2 has a single level Index
table2 = df.pivot_table(index='date', columns='sex', aggfunc='sum', margins=True,
                        values='sum')
print(table2['Male'])
Another way to remove the sum level would be to use table = table['sum'],
or table.columns = table.columns.droplevel(0).
