Combine MultiIndex columns to a single index in a pandas dataframe - python

With my code I combine 2 data files into 1. The problem is that when I add one more column to the files, the result is not as expected. I am using Python 2.7.
Code:
import pandas as pd
import pandas.io.formats.excel
import numpy as np

# Read both files and load them into DataFrames
df1 = pd.read_excel("archivo1.xlsx")
df2 = pd.read_excel("archivo2.xlsx")

df = (pd.concat([df1, df2])
        .set_index(["Cliente", "Fecha"])
        .stack()
        .unstack(1)
        .sort_index(ascending=(True, False)))

m = df.index.get_level_values(1) == 'Impresiones'
df.index = np.where(m, 'Impresiones', df.index.get_level_values(0))

# Create the output xlsx
pandas.io.formats.excel.header_style = None
with pd.ExcelWriter("Data.xlsx",
                    engine='xlsxwriter',
                    date_format='dd/mm/yyyy',
                    datetime_format='dd/mm/yyyy') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
archivo1:
Fecha Cliente Impresiones Impresiones 2 Revenue
20/12/17 Jose 1312 35 $12
20/12/17 Martin 12 56 $146
20/12/17 Pedro 5443 124 $1,256
20/12/17 Esteban 667 1235 $1
archivo2:
Fecha Cliente Impresiones Impresiones 2 Revenue
21/12/17 Jose 25 5 $2
21/12/17 Martin 6347 523 $123
21/12/17 Pedro 2368 898 $22
21/12/17 Esteban 235 99 $7,890
Expected results:
I tried with
m1 = df.index.get_level_values(1) == 'Impresiones 2'
df.index = np.where(m1, 'Impresiones 2', df.index.get_level_values(0))
but I get this error: IndexError: Too many levels: Index has only 1 level, not 2

The first bit of the solution is similar to jezrael's answer to your previous question, using concat + set_index + stack + unstack + sort_index.
df = (pd.concat([df1, df2])                  # stack both files vertically
        .set_index(['Cliente', 'Fecha'])     # index by client, then date
        .stack()                             # metrics move into the rows
        .unstack(-2)                         # dates become the columns
        .sort_index(ascending=[True, False]))
Now comes the challenging part: we have to incorporate the names from the 0th level into the 1st level, and then reset the index.
I use np.insert to insert each name just above its 'Revenue' entry in the index.
i, j = df.index.get_level_values(0), df.index.get_level_values(1)
k = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
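To see what that np.insert call does, here is a minimal sketch on a hard-coded two-client slice of the sample data:
import numpy as np

# level-1 labels after the descending sort, for two clients
j = np.array(['Revenue', 'Impresiones 2', 'Impresiones',
              'Revenue', 'Impresiones 2', 'Impresiones'])
names = np.array(['Esteban', 'Jose'])

pos = np.flatnonzero(j == 'Revenue')  # positions of each 'Revenue': [0, 3]
print(np.insert(j, pos, names))
# ['Esteban' 'Revenue' 'Impresiones 2' 'Impresiones'
#  'Jose' 'Revenue' 'Impresiones 2' 'Impresiones']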
Now, I create a new MultiIndex which I then use to reindex df -
idx = pd.MultiIndex.from_arrays([i.unique().repeat(len(df.index.levels[1]) + 1), k])
df = df.reindex(idx).fillna('')
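For the sample data, len(df.index.levels[1]) is 3, so each client name is repeated 4 times in level 0, and the first tuples of idx look like this:
# ('Esteban', 'Esteban')       <- missing from df, so reindex creates a NaN row
# ('Esteban', 'Revenue')
# ('Esteban', 'Impresiones 2')
# ('Esteban', 'Impresiones')
# ('Jose', 'Jose')
# ...
The ('Esteban', 'Esteban')-style pairs don't exist in df, so reindex introduces them as all-NaN rows and fillna('') blanks them out; after dropping level 0, those rows become the client-name header rows in the output below.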
Now, drop the extra level -
df.index = df.index.droplevel()
df
Fecha          20/12/17  21/12/17
Esteban
Revenue        $1        $7,890
Impresiones 2  1235      99
Impresiones    667       235
Jose
Revenue        $12       $2
Impresiones 2  35        5
Impresiones    1312      25
Martin
Revenue        $146      $123
Impresiones 2  56        523
Impresiones    12        6347
Pedro
Revenue        $1,256    $22
Impresiones 2  124       898
Impresiones    5443      2368


How to reorder multi-index columns in Pandas?

import numpy as np
import pandas as pd

df = pd.DataFrame({'Country': ["AU", "GB", "KR", "US", "GB", "US", "KR", "AU", "US"],
                   'Region Manager': ['TL', 'JS', 'HN', 'AL', 'JS', 'AL', 'HN', 'TL', 'AL'],
                   'Curr_Sales': [453, 562, 236, 636, 893, 542, 125, 561, 371],
                   'Curr_Revenue': [4530, 7668, 5975, 3568, 2349, 6776, 3046, 1111, 4852],
                   'Prior_Sales': [235, 789, 132, 220, 569, 521, 131, 777, 898],
                   'Prior_Revenue': [1530, 2668, 3975, 5668, 6349, 7776, 8046, 2111, 9852]})
pd.pivot_table(df, values=['Curr_Sales', 'Curr_Revenue', 'Prior_Sales', 'Prior_Revenue'],
               index=['Country', 'Region Manager'], aggfunc=np.sum, margins=True)
Hi folks,
I have the following dataframe and I'd like to re-order the multi-index columns as
['Prior_Sales','Prior_Revenue','Curr_Sales', 'Curr_Revenue']
How can I do that in pandas?
The code is shown above.
Thanks in advance for all the help!
Slice the resulting dataframe:
pd.pivot_table(
    df,
    values=['Curr_Sales', 'Curr_Revenue', 'Prior_Sales', 'Prior_Revenue'],
    index=['Country', 'Region Manager'],
    aggfunc='sum',
    margins=True
)[['Prior_Sales', 'Prior_Revenue', 'Curr_Sales', 'Curr_Revenue']]
Prior_Sales Prior_Revenue Curr_Sales Curr_Revenue
Country Region Manager
AU TL 1012 3641 1014 5641
GB JS 1358 9017 1455 10017
KR HN 263 12021 361 9021
US AL 1639 23296 1549 15196
All 4272 47975 4379 39875
Or select the columns directly on an existing frame:
cols = ['Prior_Sales', 'Prior_Revenue', 'Curr_Sales', 'Curr_Revenue']
df = df[cols]
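An alternative sketch (not from the original answer): DataFrame.reindex reorders columns the same way and, unlike plain slicing, fills any label that does not exist with a NaN column instead of raising a KeyError:
cols = ['Prior_Sales', 'Prior_Revenue', 'Curr_Sales', 'Curr_Revenue']
# reindex reorders the existing columns; unknown labels become all-NaN columns
df = df.reindex(columns=cols)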

Dataset selective picking and transformation

I have a dataset in .xlsx with hundreds of thousands of rows, as follows:
slug symbol name date ranknow open high low close volume market close_ratio spread
companyA AAA companyA 28/04/2013 1 135,3 135,98 132,1 134,21 0 1500520000 0,5438 3,88
companyA AAA companyA 29/04/2013 1 134,44 147,49 134 144,54 0 1491160000 0,7813 13,49
companyA AAA companyA 30/04/2013 1 144 146,93 134,05 139 0 1597780000 0,3843 12,88
....
companyA AAA companyA 17/04/2018 1 8071,66 8285,96 7881,72 7902,09 6900880000 1,3707E+11 0,0504 404,24
....
lancer LA Lancer 09/01/2018 731 0,347111 0,422736 0,345451 0,422736 3536710 0 1 0,08
lancer LA Lancer 10/01/2018 731 0,435794 0,512958 0,331123 0,487106 2586980 0 0,8578 0,18
lancer LA Lancer 11/01/2018 731 0,479738 0,499482 0,309485 0,331977 950410 0 0,1184 0,19
....
lancer LA Lancer 17/04/2018 731 0,027279 0,041106 0,02558 0,031017 9936 1927680 0,3502 0,02
....
yocomin YC Yocomin 21/01/2016 732 0,008135 0,010833 0,002853 0,002876 63 139008 0,0029 0,01
yocomin YC Yocomin 22/01/2016 732 0,002872 0,008174 0,001192 0,005737 69 49086 0,651 0,01
yocomin YC Yocomin 23/01/2016 732 0,005737 0,005918 0,001357 0,00136 67 98050 0,0007 0
....
yocomin YC Yocomin 17/04/2018 732 0,020425 0,021194 0,017635 0,01764 12862 2291610 0,0014 0
....
Let's say I have a .txt file with a list of the symbols whose time series I want to extract. For example:
AAA
LA
YC
I would like to get a dataset that looks as follows:
date AAA LA YC
28/04/2013 134,21 NaN NaN
29/04/2013 144,54 NaN NaN
30/04/2013 139 NaN NaN
....
....
....
17/04/2018 7902,09 0,031017 0,01764
where under each stock symbol (AAA, etc.) I get the "close" price. I'm open to both Python and R. Any help would be great!
In Python, using pandas, this should work:
import pandas as pd

df = pd.read_excel("/path/to/file/Book1.xlsx")
df = df.loc[:, ['symbol', 'name', 'date', 'close']]  # keep only the needed columns
df = df.set_index(['symbol', 'name', 'date'])
df = df.unstack(level=[0, 1])                        # move symbol and name into the columns
df = df['close']                                     # keep just the 'close' block
To read the symbols file and then filter out symbols not in the dataframe:
symbols = pd.read_csv('/path/to/file/symbols.txt', sep=" ", header=None)
symbols = symbols[0].tolist()
symbols = pd.Index(symbols).unique()
symbols = symbols.intersection(df.columns.get_level_values(0))
And the output will look like:
print(df[symbols])
symbol AAA LA YC
name companyA Lancer Yocomin
date
2018-09-01 00:00:00 None 0,422736 None
2018-10-01 00:00:00 None 0,487106 None
2018-11-01 00:00:00 None 0,331977 None
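For reference, a more compact route (an alternative sketch, not part of the original answer, and assuming each (date, symbol) pair occurs once, as in the sample) goes straight to a date-by-symbol table of close prices with pivot_table; the file paths are the same hypothetical ones used above:
import pandas as pd

df = pd.read_excel("/path/to/file/Book1.xlsx")
# one row per date, one column per symbol, each cell holding the close price
wide = df.pivot_table(index='date', columns='symbol', values='close')

symbols = pd.read_csv('/path/to/file/symbols.txt', header=None)[0].tolist()
# keep only the requested symbols that actually occur in the data
wanted = [s for s in symbols if s in wide.columns]
print(wide[wanted])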

How to add unprocessed strings to a DataFrame in python?

I have a string object that looks like this:
Numărul de camere
3 camere
Suprafaţa totală
77 m²
Suprafaţa bucătăriei
11 m²
Tipul clădirii
Dat în exploatare
Etaj
3
Locul de amplasare în casă
In mijlocul casei
Grup sanitar
separat
Balcon/lojă
2
Parcare
acoperită
Încălzire autonomă
✔
This is data parsed from a web site. I want to add the data to a DataFrame:
df = pd.DataFrame(columns=['ID', 'Numarul de camere', 'Suprafata totala',
                           'Suprafata bucatariei', 'Tipul cladirii', 'Etaj',
                           'Amplasarea in bloc', 'Grup sanitar', 'Balcon/loja',
                           'Parcare', 'Incalzire autonoma'])
Every second row is a value for the characteristic above it, and I want to add each value to its place in my DataFrame. How can I do this?
text = """Numărul de camere
3 camere
Suprafaţa totală
77 m²
Suprafaţa bucătăriei
11 m²
Tipul clădirii
Dat în exploatare
Etaj
3
Locul de amplasare în casă
In mijlocul casei
Grup sanitar
separat
Balcon/lojă
2
Parcare
acoperită
Încălzire autonomă
✔ """
# split the string into lines
s = text.split('\n')

import pandas as pd

# even lines are labels, odd lines are the matching values
d = {k: v for k, v in zip(s[0::2], s[1::2])}
df = pd.DataFrame([d])
print(df.head())

# if you want to preserve the order of the columns
# (pd.DataFrame.from_items was removed in pandas 1.0; building the frame
# from the value list directly keeps the column order)
df = pd.DataFrame([s[1::2]], columns=s[0::2], index=['Values'])
print(df.head())
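If several listings are scraped, the same split can be wrapped in a helper so that each parsed string becomes one row; a sketch assuming every blob alternates label lines and value lines like the text above:
import pandas as pd

def parse_listing(text):
    # every even line is a label, every odd line is the matching value
    lines = [ln.strip() for ln in text.split('\n') if ln.strip()]
    return dict(zip(lines[0::2], lines[1::2]))

blobs = [text]  # e.g. one parsed page per listing
df = pd.DataFrame([parse_listing(b) for b in blobs])
print(df.head())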

Python Pandas compare CSV keyerror

I am using Python Pandas to try and match the references from CSV2 to the data in CSV1 and create a new output file.
CSV1
reference,name,house
234 8A,john,37
564 68R,bill,3
RT4 VV8,kate,88
76AA,harry ,433
CSV2
reference
234 8A
RT4 VV8
CODE
import pandas as pd
df1 = pd.read_csv(r'd:\temp\data1.csv')
df2 = pd.read_csv(r'd:\temp\data2.csv')
df3 = pd.merge(df1,df2, on= 'reference', how='inner')
df3.to_csv('outpt.csv')
I am getting a KeyError for 'reference' when I run it. Could it be the spaces in the data that are causing the issue? The data is comma delimited.
Most probably you have leading or trailing whitespace in the reference column after reading your CSV files.
You can check it this way:
print(df1.columns.tolist())
print(df2.columns.tolist())
you can "fix" it by adding sep=r'\s*,\s*' parameter to your pd.read_csv() calls
Example:
In [74]: df1
Out[74]:
reference name house
0 234 8A john 37
1 564 68R bill 3
2 RT4 VV8 kate 88
3 76AA harry 433
In [75]: df2
Out[75]:
reference
0 234 8A
1 RT4 VV8
In [76]: df2.columns.tolist()
Out[76]: ['reference ']
In [77]: df1.columns.tolist()
Out[77]: ['reference', 'name', 'house']
In [78]: df1.merge(df2, on='reference')
...
KeyError: 'reference'
Fixing df2:
import io

data = """\
reference
234 8A
RT4 VV8"""
df2 = pd.read_csv(io.StringIO(data), sep=r'\s*,\s*', engine='python')
Now it works:
In [80]: df1.merge(df2, on='reference')
Out[80]:
reference name house
0 234 8A john 37
1 RT4 VV8 kate 88
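Another way to fix it (an alternative sketch, not from the original answer) is to normalize the headers and key values after reading; this also helps when the stray whitespace sits inside the cells rather than around the separators:
import pandas as pd

df1 = pd.read_csv(r'd:\temp\data1.csv')
df2 = pd.read_csv(r'd:\temp\data2.csv')

# strip stray whitespace from the column names...
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()
# ...and from the join key values themselves
df1['reference'] = df1['reference'].str.strip()
df2['reference'] = df2['reference'].str.strip()

df3 = df1.merge(df2, on='reference', how='inner')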

Pandas: transposing one column in a multiple column df

This is the data that I have:
cmte_id trans entity st amount fec_id
date
2007-08-15 C00112250 24K ORG DC 2000 C00431569
2007-09-26 C00119040 24K CCM FL 1000 C00367680
2007-09-26 C00119040 24K CCM MD 1000 C00140715
2007-07-20 C00346296 24K CCM CA 1000 C00434571
2007-09-24 C00346296 24K CCM MA 1000 C00433136
There are other descriptive columns that I have left out for the sake of brevity.
I would like to transform it so that the values in [cmte_id] become column headers and the values in [amount] become the respective values in the new columns. I know that this is probably a simple pivot operation. I have tried the following:
dfy.pivot('cmte_id', 'amount')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-203-e5d2cb89e880> in <module>()
----> 1 dfy.pivot('cmte_id', 'amount')
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in pivot(self, index, columns, values)
3761 """
3762 from pandas.core.reshape import pivot
-> 3763 return pivot(self, index=index, columns=columns, values=values)
3764
3765 def stack(self, level=-1, dropna=True):
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in pivot(self, index, columns, values)
323 append = index is None
324 indexed = self.set_index(cols, append=append)
--> 325 return indexed.unstack(columns)
326 else:
327 if index is None:
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in unstack(self, level)
3857 """
3858 from pandas.core.reshape import unstack
-> 3859 return unstack(self, level)
3860
3861 #----------------------------------------------------------------------
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in unstack(obj, level)
402 if isinstance(obj, DataFrame):
403 if isinstance(obj.index, MultiIndex):
--> 404 return _unstack_frame(obj, level)
405 else:
406 return obj.T.stack(dropna=False)
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in _unstack_frame(obj, level)
442 else:
443 unstacker = _Unstacker(obj.values, obj.index, level=level,
--> 444 value_columns=obj.columns)
445 return unstacker.get_result()
446
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in __init__(self, values, index, level, value_columns)
96
97 self._make_sorted_values_labels()
---> 98 self._make_selectors()
99
100 def _make_sorted_values_labels(self):
/home/jayaramdas/anaconda3/lib/python3.5/site-packages/pandas/core/reshape.py in _make_selectors(self)
134
135 if mask.sum() < len(self.index):
--> 136 raise ValueError('Index contains duplicate entries, '
137 'cannot reshape')
138
ValueError: Index contains duplicate entries, cannot reshape
Desired end result (except with additional columns, e.g. 'trans', fec_id, 'st', etc.) would look something like this:
date C00112250 C00119040 C00119040 C00346296 C00346296
2007-ago-15 2000
2007-set-26 1000
2007-set-26 1000
2007-lug-20 1000
2007-set-24 1000
Does anyone have an idea of how I can get closer to the end product?
Plain pivot fails here because the pivoted pairs contain duplicates (for example, cmte_id C00119040 appears twice on 2007-09-26), and pivot cannot reshape duplicate entries; pivot_table resolves them by aggregating. Try this:
pvt = pd.pivot_table(df, index=df.index, columns='cmte_id',
                     values='amount', aggfunc='sum', fill_value=0)
Preserving other columns:
In [213]: pvt = pd.pivot_table(df.reset_index(), index=['index','trans','entity','st', 'fec_id'],
.....: columns='cmte_id', values='amount', aggfunc='sum', fill_value=0) \
.....: .reset_index()
In [214]: pvt
Out[214]:
cmte_id index trans entity st fec_id C00112250 C00119040 \
0 2007-07-20 24K CCM CA C00434571 0 0
1 2007-08-15 24K ORG DC C00431569 2000 0
2 2007-09-24 24K CCM MA C00433136 0 0
3 2007-09-26 24K CCM FL C00367680 0 1000
4 2007-09-26 24K CCM MD C00140715 0 1000
cmte_id C00346296
0 1000
1 0
2 1000
3 0
4 0
In [215]: pvt.head()['st']
Out[215]:
0 CA
1 DC
2 MA
3 FL
4 MD
Name: st, dtype: object
UPDATE:
import pandas as pd
import glob

# if you don't use the ['cand_id'] column, remove it from the `usecols` parameter
dfy = pd.concat([pd.read_csv(f, sep='|', low_memory=False, header=None,
                             names=['cmte_id', '2', '3', '4', '5', 'trans_typ',
                                    'entity_typ', '8', '9', 'state', '11',
                                    'employer', 'occupation', 'date', 'amount',
                                    'fec_id', 'cand_id', '18', '19', '20', '21', '22'],
                             usecols=['date', 'cmte_id', 'trans_typ', 'entity_typ',
                                      'state', 'amount', 'fec_id', 'cand_id'],
                             dtype={'date': str})
                 for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')],
                ignore_index=True)

dfy['date'] = pd.to_datetime(dfy['date'], format='%m%d%Y')

# remove the unneeded column as soon as possible in order to save memory
del dfy['cand_id']

dfy = dfy[(dfy['date'].notnull()) & (dfy['date'] > '2007-01-01') & (dfy['date'] < '2014-12-31')]

#df = dfy.set_index(['date'])

pvt = pd.pivot_table(dfy, index=['date', 'trans_typ', 'entity_typ', 'state', 'fec_id'],
                     columns='cmte_id', values='amount', aggfunc='sum', fill_value=0) \
        .reset_index()

print(pvt.info())
pvt.to_excel('out.xlsx', index=False)
