pandas combining dataframe - python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
java = pickle.load(open('JavaSafe.p','rb')) ##import 2d array
python = pickle.load(open('PythonSafe.p','rb')) ##import 2d array
javaFrame = pd.DataFrame(java,columns=['Town','Java Jobs'])
pythonFrame = pd.DataFrame(python,columns=['Town','Python Jobs'])
javaFrame = javaFrame.sort_values(by='Java Jobs',ascending=False)
pythonFrame = pythonFrame.sort_values(by='Python Jobs',ascending=False)
print(javaFrame,"\n",pythonFrame)
This code comes out with the following:
Town Java Jobs
435 York,NY 3593
212 NewYork,NY 3585
584 Seattle,WA 2080
624 Chicago,IL 1920
301 Boston,MA 1571
...
79 Holland,MI 5
38 Manhattan,KS 5
497 Vernon,IL 5
30 Clayton,MO 5
90 Waukegan,IL 5
[653 rows x 2 columns]
Town Python Jobs
160 NewYork,NY 2949
11 York,NY 2938
349 Seattle,WA 1321
91 Chicago,IL 1312
167 Boston,MA 1117
383 Hanover,NH 5
209 Bulverde,TX 5
203 Salisbury,NC 5
67 Rockford,IL 5
256 Ventura,CA 5
[416 rows x 2 columns]
I want to make a new dataframe that uses the town names as the index and has one column for Java and one for Python. However, some of the towns only have results for one of the languages.

import pandas as pd
javaFrame = pd.DataFrame({'Java Jobs': [3593, 3585, 2080, 1920, 1571, 5, 5, 5, 5, 5],
'Town': ['York,NY', 'NewYork,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Holland,MI', 'Manhattan,KS', 'Vernon,IL', 'Clayton,MO', 'Waukegan,IL']}, index=[435, 212, 584, 624, 301, 79, 38, 497, 30, 90])
pythonFrame = pd.DataFrame({'Python Jobs': [2949, 2938, 1321, 1312, 1117, 5, 5, 5, 5, 5],
'Town': ['NewYork,NY', 'York,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Hanover,NH', 'Bulverde,TX', 'Salisbury,NC', 'Rockford,IL', 'Ventura,CA']}, index=[160, 11, 349, 91, 167, 383, 209, 203, 67, 256])
result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town')
# Java Jobs Python Jobs
# Town
# York,NY 3593.0 2938.0
# NewYork,NY 3585.0 2949.0
# Seattle,WA 2080.0 1321.0
# Chicago,IL 1920.0 1312.0
# Boston,MA 1571.0 1117.0
# Holland,MI 5.0 NaN
# Manhattan,KS 5.0 NaN
# Vernon,IL 5.0 NaN
# Clayton,MO 5.0 NaN
# Waukegan,IL 5.0 NaN
# Hanover,NH NaN 5.0
# Bulverde,TX NaN 5.0
# Salisbury,NC NaN 5.0
# Rockford,IL NaN 5.0
# Ventura,CA NaN 5.0
By default, pd.merge joins two DataFrames on all the columns they have in common. Here javaFrame and pythonFrame share only the Town column, so pd.merge joins them on Town.
how='outer' causes pd.merge to use the union of the keys from both frames. In other words, it returns rows whose data come from either javaFrame or pythonFrame, even if only one DataFrame contains the Town. Missing data is filled with NaN.
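To check which frame each town came from, pd.merge also accepts indicator=True, which adds a _merge column. The following is a minimal sketch on a cut-down, made-up subset of the data above (the variable names java, python and merged are only for illustration):

```python
import pandas as pd

# Small illustrative frames; only a few towns from the question are used
java = pd.DataFrame({"Town": ["NewYork,NY", "Seattle,WA", "Holland,MI"],
                     "Java Jobs": [3585, 2080, 5]})
python = pd.DataFrame({"Town": ["NewYork,NY", "Seattle,WA", "Hanover,NH"],
                       "Python Jobs": [2949, 1321, 5]})

# Outer join on Town; _merge records 'left_only', 'right_only' or 'both'
merged = pd.merge(java, python, on="Town", how="outer", indicator=True)
merged = merged.set_index("Town")
print(merged)
```

Passing on="Town" explicitly is optional here, but it keeps the join key stable if the frames later gain other columns with the same name.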

Use pd.concat
df = pd.concat([df.set_index('Town') for df in [javaFrame, pythonFrame]], axis=1)
Java Jobs Python Jobs
Boston,MA 1571.0 1117.0
Bulverde,TX NaN 5.0
Chicago,IL 1920.0 1312.0
Clayton,MO 5.0 NaN
Hanover,NH NaN 5.0
Holland,MI 5.0 NaN
Manhattan,KS 5.0 NaN
NewYork,NY 3585.0 2949.0
Rockford,IL NaN 5.0
Salisbury,NC NaN 5.0
Seattle,WA 2080.0 1321.0
Ventura,CA NaN 5.0
Vernon,IL 5.0 NaN
Waukegan,IL 5.0 NaN
York,NY 3593.0 2938.0
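Because NaN is a float, the job counts come back as floats. If a missing town should simply count as zero jobs (an assumption about what you want, not something the question states), you could fill the gaps and convert back to integers. A minimal sketch on a made-up subset of the data:

```python
import pandas as pd

# Small illustrative frames; only a few towns from the question are used
java = pd.DataFrame({"Town": ["NewYork,NY", "Holland,MI"], "Java Jobs": [3585, 5]})
python = pd.DataFrame({"Town": ["NewYork,NY", "Hanover,NH"], "Python Jobs": [2949, 5]})

# Align both frames on Town; towns missing from one side get NaN
combined = pd.concat([java.set_index("Town"), python.set_index("Town")], axis=1)

# Treat "no listing" as zero jobs and restore an integer dtype
counts = combined.fillna(0).astype(int)
print(counts)
```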

Related

delete redundant rows in a dataframe with set in columns

I have a dataframe df:
Cluster OsId BrowserId PageId VolumePred ConversionPred
0 11 11 {789615, 955761, 1149586, 955764, 955767, 1187... 147.0 71.0
1 0 11 12 {1184903, 955761, 1149586, 1158132, 955764, 10... 73.0 38.0
2 0 11 15 {1184903, 1109643, 955761, 955764, 1074581, 95... 72.0 40.0
3 0 11 16 {1123200, 1184903, 1109643, 1018637, 1005581, ... 7815.0 5077.0
4 0 11 17 {1184903, 789615, 1016529, 955761, 955764, 955... 52.0 47.0
... ... ... ... ... ... ...
307 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1154705 220.0 182.0
308 {18} 99 16 1155314 12.0 6.0
309 {9} 99 16 1158132 4.0 4.0
310 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1184903 966.0 539.0
This dataframe contains redundant rows that I need to delete, so I tried this:
df.drop_duplicates()
But I got this error: TypeError: unhashable type: 'set'
Any idea how to fix this error? Thanks!
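Convert the sets to frozensets, which are hashable, so that DataFrame.duplicated can compare them, then filter with boolean indexing using the mask inverted by ~ (in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map):
# sets may appear in any column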
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
df[~df1.duplicated()]
If no rows are removed, the DataFrame has no duplicate rows (all columns are compared together).
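A small self-contained check of this approach, on made-up data shaped like the tail of the dataframe in the question. The helper name freeze is hypothetical, and it is applied per column with Series.map so that the sketch does not depend on applymap being available:

```python
import pandas as pd

# Made-up rows: the first two are duplicates, with the set written in a different order
df = pd.DataFrame({
    "Cluster": [{0, 4}, {4, 0}, {18}],
    "OsId": [99, 99, 99],
    "PageId": [1154705, 1154705, 1155314],
})

def freeze(value):
    """Turn sets into hashable frozensets and leave other values untouched."""
    return frozenset(value) if isinstance(value, set) else value

# Build a hashable copy only for the duplicate test; df itself keeps its sets
hashable = df.apply(lambda col: col.map(freeze))
deduped = df[~hashable.duplicated()]
print(deduped)
```

Only the mask is computed from the frozenset copy, so the rows that survive still hold ordinary sets.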

How to convert data from DataFrame to form

I'm trying to make a report and then convert it to the prescribed form but I don't know how. Below is my code:
data = pd.read_csv('https://raw.githubusercontent.com/hoatranobita/reports/main/Loan_list_test.csv')
data_pivot = pd.pivot_table(data,('CLOC_CUR_XC_BL'),index=['BIZ_TYPE_SBV_CODE'],columns=['TERM_CODE','CURRENCY_CD'],aggfunc=np.sum).reset_index
print(data_pivot)
Pivot table shows as below:
<bound method DataFrame.reset_index of TERM_CODE Ng?n h?n Trung h?n
CURRENCY_CD 1. VND 2. USD 1. VND 2. USD
BIZ_TYPE_SBV_CODE
201 170000.00 NaN 43533.42 NaN
202 2485441.64 5188792.76 2682463.04 1497309.06
204 35999.99 NaN NaN NaN
301 1120940.65 NaN 190915.62 453608.72
401 347929.88 182908.01 239123.29 NaN
402 545532.99 NaN 506964.23 NaN
403 21735.74 NaN 1855.92 NaN
501 10346.45 NaN NaN NaN
601 881974.40 NaN 50000.00 NaN
602 377216.09 NaN 828868.61 NaN
702 9798.74 NaN 23616.39 NaN
802 155099.66 NaN 762294.95 NaN
803 23456.79 NaN 97266.84 NaN
804 151590.00 NaN 378000.00 NaN
805 182925.30 54206.52 4290216.37 NaN>
Here is the prescribed form:
form = pd.read_excel('https://github.com/hoatranobita/reports/blob/main/Form%20A00034.xlsx?raw=true')
form.head()
Mã ngành kinh tế Dư nợ tín dụng (không bao gồm mua, đầu tư trái phiếu doanh nghiệp) Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 NaN Ngắn hạn NaN Trung và dài hạn NaN Tổng cộng
1 NaN Bằng VND Bằng ngoại tệ Bằng VND Bằng ngoại tệ NaN
2 101.0 NaN NaN NaN NaN NaN
3 201.0 NaN NaN NaN NaN NaN
4 202.0 NaN NaN NaN NaN NaN
As you can see, the pivot table has no row for 101, but the form does. What do I have to do to transfer the DataFrame into the form while skipping 101?
Thank you.
Hi. First create a worksheet using xlsxwriter:
import xlsxwriter
#start workbook
workbook = xlsxwriter.Workbook('merge1.xlsx')
#Introduce formatting
format = workbook.add_format({'border': 1,'bold': True})
#Adding a worksheet
worksheet = workbook.add_worksheet()
merge_format = workbook.add_format({
'bold':1,
'border': 1,
'align': 'center',
'valign': 'vcenter'})
#Starting the Headers
worksheet.merge_range('A1:A3', 'Mã ngành kinh tế', merge_format)
worksheet.merge_range('B1:F1', 'Dư nợ tín dụng (không bao gồm mua, đầu tư trái phiếu doanh nghiệp)', merge_format)
worksheet.merge_range('B2:C2', 'Ngắn hạn', merge_format)
worksheet.merge_range('D2:E2', 'Trung và dài hạn', merge_format)
worksheet.merge_range('F2:F3', 'Tổng cộng', merge_format)
worksheet.write(2, 1, 'Bằng VND',format)
worksheet.write(2, 2, 'Bằng ngoại tệ',format)
worksheet.write(2, 3, 'Bằng VND',format)
worksheet.write(2, 4, 'Bằng ngoại tệ',format)
After setting up the headers, you can write the data to the sheet by looping over it and calling worksheet.write(). Below is a sample:
expenses = (
    ['Rent', 1000],
    ['Gas', 100],
    ['Food', 300],
    ['Gym', 50],
)
# Data starts on the fourth row (index 3), just below the three header rows
row = 3
col = 0
for item, cost in expenses:
    worksheet.write(row, col, item)
    worksheet.write(row, col + 1, cost)
    row += 1
The row and col arguments are zero-based integers that identify the cell, like indices into a matrix.
And finally close the workbook
workbook.close()
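For the missing 101 row, one option is to reindex the pivot table to the codes listed in the form before writing anything, so that codes without data become empty rows. This is only a sketch on a made-up miniature of the pivot table: the column names short_vnd and short_usd and the list form_codes are hypothetical stand-ins for the real columns and the form's first column.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the pivot table, indexed by BIZ_TYPE_SBV_CODE
pivot = pd.DataFrame(
    {"short_vnd": [170000.00, 2485441.64], "short_usd": [np.nan, 5188792.76]},
    index=pd.Index([201, 202], name="BIZ_TYPE_SBV_CODE"),
)

# Codes in the order the form lists them (101 has no data in the pivot)
form_codes = [101, 201, 202]

# Reindex so every form code gets a row; missing codes are filled with NaN
aligned = pivot.reindex(form_codes)

# Row total; min_count=1 keeps the total empty when the whole row is empty
aligned["total"] = aligned.sum(axis=1, min_count=1)
print(aligned)
```

The rows of aligned then line up one-to-one with the rows of the form, so they can be written in order with worksheet.write(), leaving NaN cells blank.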

How to Drop Rows with NaN Values so i can zip and range

I have code that expands the range of values between two columns. It works when there are no empty cells, but fails when some are empty. I tried df.isnull and dropna, but always hit the same problem.
import pandas as pd
import numpy as np
path = [('SC200', 100, 102),
('Unified', 210, 210),
('Clé',np.nan,np.nan),
('samsung', 155, 158),
]
df_l = pd.DataFrame(path, columns=['Désignation', 'First', 'Last'])
zipped_l = zip(df_l['Désignation'], df_l['First'], df_l['Last'])
df_l = pd.DataFrame([(k, y) for k, s, e in zipped_l for y in range(s, e+1) ], columns=['Désignation', 'KITCODE'])
print(df_l)
Is this what you are trying to do?
import pandas as pd
import numpy as np
path = [('SC200', 100, 102),
('Unified', 210, 210),
('Clé',np.nan,np.nan),
('samsung', 155, 158),
]
df_l = pd.DataFrame(path, columns=['Désignation', 'First', 'Last'])
print (df_l)
def kitcd(d):
    first = int(d.First)
    last = int(d.Last) + 1
    return [i for i in range(first, last)]
df_l['KITCODE'] = df_l.apply(lambda x: kitcd(x) if pd.notnull(x.First) else x.First, axis = 1)
df_l = df_l.explode('KITCODE')
print (df_l)
The output of this will be:
Original dataframe:
Désignation First Last
0 SC200 100.0 102.0
1 Unified 210.0 210.0
2 Clé NaN NaN
3 samsung 155.0 158.0
Updated dataframe with KITCODE:
Désignation First Last KITCODE
0 SC200 100.0 102.0 100
0 SC200 100.0 102.0 101
0 SC200 100.0 102.0 102
1 Unified 210.0 210.0 210
2 Clé NaN NaN NaN
3 samsung 155.0 158.0 155
3 samsung 155.0 158.0 156
3 samsung 155.0 158.0 157
3 samsung 155.0 158.0 158
If you want to ignore the rows that have NaN, then you can change the code to the following:
def kitcd(d):
    first = int(d.First)
    last = int(d.Last) + 1
    return [i for i in range(first, last)]
df_l = df_l.dropna(axis=0, subset=['First', 'Last'])
df_l['KITCODE'] = df_l.apply(lambda x: kitcd(x), axis = 1)
df_l = df_l.explode('KITCODE')
print (df_l)
This removes the rows with NaN from df_l so the remaining data can be processed as normal. The output is the same as before, minus the 'Clé' row.
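If you would rather keep the zip and range approach from the question, note that a NaN in a column forces the whole column to float, and range() rejects floats even for the complete rows. A minimal sketch, assuming it is acceptable to drop the incomplete rows and cast the bounds back to int (the names clean and expanded are only for illustration):

```python
import numpy as np
import pandas as pd

path = [('SC200', 100, 102),
        ('Unified', 210, 210),
        ('Clé', np.nan, np.nan),
        ('samsung', 155, 158)]
df_l = pd.DataFrame(path, columns=['Désignation', 'First', 'Last'])

# Drop rows with a missing bound, then restore integer bounds for range()
clean = df_l.dropna(subset=['First', 'Last']).astype({'First': int, 'Last': int})

# One output row per code in the inclusive range First..Last
zipped = zip(clean['Désignation'], clean['First'], clean['Last'])
expanded = pd.DataFrame(
    [(name, code) for name, first, last in zipped for code in range(first, last + 1)],
    columns=['Désignation', 'KITCODE'],
)
print(expanded)
```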

How to generate a bar chart with data from a csv?

I have a CSV with several columns, one of which is the city column. It contains several cities, and the same city can appear several times.
I would like to make a bar chart showing how many times each city appears in the CSV.
Example:
Y X
5 Belo Horizonte
1 Vespasiano
4 São Paulo
I wrote the following code, but I get an error, which is shown right after the code.
Code:
import matplotlib.pyplot as plt; plt.rcdefaults()
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# reading the file
tb_usuarios = 'tb_usuarios.csv'
usuarios = pd.read_csv(tb_usuarios,
header=0,
index_col=False
)
print(usuarios.head())
usuarios["vc_municipio"] = usuarios["vc_municipio"].dropna()
usuarios["vc_municipio"] = usuarios["vc_municipio"].str.upper()
municipio = usuarios.groupby(['vc_municipio'])
print(municipio)
y_pos = usuarios.groupby(['vc_municipio'])['vc_municipio'].count()
print(y_pos)
plt.bar(y_pos, municipio, align='center', alpha=0.5)
plt.xticks(y_pos, municipio)
plt.ylabel('Qtd')
plt.title('Municipio')
plt.show()
Error:
Traceback (most recent call last):
File "C:/Users/Henrique Mendes/PycharmProjects/emprestimo/venv1/emprestimo.py", line 20, in <module>
plt.bar(y_pos, municipio, align='center', alpha=0.5)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\pyplot.py", line 2440, in bar
**({"data": data} if data is not None else {}), **kwargs)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\__init__.py", line 1601, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\axes\_axes.py", line 2348, in bar
self._process_unit_info(xdata=x, ydata=height, kwargs=kwargs)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\axes\_base.py", line 2126, in _process_unit_info
kwargs = _process_single_axis(ydata, self.yaxis, 'yunits', kwargs)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\axes\_base.py", line 2108, in _process_single_axis
axis.update_units(data)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\axis.py", line 1493, in update_units
default = self.converter.default_units(data, self)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\category.py", line 115, in default_units
axis.set_units(UnitData(data))
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\category.py", line 181, in __init__
self.update(data)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\category.py", line 215, in update
for val in OrderedDict.fromkeys(data):
TypeError: unhashable type: 'numpy.ndarray'
My outputs:
"C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\Scripts\python.exe" "C:/Users/Henrique Mendes/PycharmProjects/emprestimo/venv1/emprestimo.py"
pr_usuario bl_administrador dt_nascimento ... dt_cheque es_anexo dt_anexo
0 2 0 24/02/1980 ... NaN NaN NaN
1 3 0 05/09/1985 ... NaN NaN NaN
2 4 1 20/03/1984 ... NaN NaN NaN
3 5 1 20/01/1982 ... NaN NaN NaN
4 6 0 25/05/1985 ... NaN NaN NaN
[5 rows x 30 columns]
{'BELO HORIZONTE': Int64Index([0, 1, 2, 3, 6, 9, 10, 14, 17, 20, 22, 25], dtype='int64'), 'BRASILIA': Int64Index([4], dtype='int64'), 'CONTAGEM': Int64Index([23], dtype='int64'), 'CURITIBA': Int64Index([5, 7, 15, 18, 19], dtype='int64'), 'SANTA LUZIA': Int64Index([21], dtype='int64'), 'VESPASIANO': Int64Index([24], dtype='int64')}
vc_municipio
BELO HORIZONTE 12
BRASILIA 1
CONTAGEM 1
CURITIBA 5
SANTA LUZIA 1
VESPASIANO 1
Name: vc_municipio, dtype: int64
How can I do this chart?
Use pandas:
Your data:
assuming your data is in a .csv with the following form
0.0,BELO HORIZONTE
1.0,BELO HORIZONTE
2.0,BELO HORIZONTE
3.0,BELO HORIZONTE
6.0,BELO HORIZONTE
9.0,BELO HORIZONTE
10.0,BELO HORIZONTE
14.0,BELO HORIZONTE
17.0,BELO HORIZONTE
20.0,BELO HORIZONTE
22.0,BELO HORIZONTE
25.0,BELO HORIZONTE
4.0,BRASILIA
23.0,CONTAGEM
5.0,CURITIBA
7.0,CURITIBA
15.0,CURITIBA
18.0,CURITIBA
19.0,CURITIBA
21.0,SANTA LUZIA
24.0,VESPASIANO
Create the dataframe:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('test.csv', header=None)
df.columns = ['value', 'city']
value city
0 0.0 BELO HORIZONTE
1 1.0 BELO HORIZONTE
2 2.0 BELO HORIZONTE
3 3.0 BELO HORIZONTE
4 6.0 BELO HORIZONTE
5 9.0 BELO HORIZONTE
6 10.0 BELO HORIZONTE
7 14.0 BELO HORIZONTE
8 17.0 BELO HORIZONTE
9 20.0 BELO HORIZONTE
10 22.0 BELO HORIZONTE
11 25.0 BELO HORIZONTE
12 4.0 BRASILIA
13 23.0 CONTAGEM
14 5.0 CURITIBA
15 7.0 CURITIBA
16 15.0 CURITIBA
17 18.0 CURITIBA
18 19.0 CURITIBA
19 21.0 SANTA LUZIA
20 24.0 VESPASIANO
Group by city, count, and plot the data, using groupby, count, and plot.bar:
# groupby & count
city_count = df.groupby('city').count()
value
city
BELO HORIZONTE 12
BRASILIA 1
CONTAGEM 1
CURITIBA 5
SANTA LUZIA 1
VESPASIANO 1
# plot
city_count.plot.bar()
plt.ylabel('Qtd')
plt.title('Municipio')
plt.show()
Plot with seaborn:
import seaborn as sns
sns.barplot(x=city_count.index, y='value', data=city_count)
plt.xticks(rotation=45)
plt.show()
municipio = usuarios.groupby(['vc_municipio']) returns a groupby object in pandas which is causing your error as matplotlib doesn't handle that.
plt.bar takes x values followed by y values (see docs).
matplotlib.pyplot.bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)
Luckily for you, when you do a groupby in pandas it automatically consolidates x values (or categories) as indices for you.
Assuming that municipio is meant to be a list of categories (you want the count by city?) then the following should work.
Replacing your code
plt.bar(y_pos, municipio, align='center', alpha=0.5)
with
plt.bar(y_pos.index, y_pos, align='center', alpha=0.5)
Alternatively, you can use the pandas wrapper around plt.bar (Series.plot.bar, which is built on matplotlib) to handle some of the DataFrame quirks natively.
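The counting step can also be done with value_counts, which gives the labels and the heights in one Series. A minimal sketch on made-up city names, where the column name vc_municipio is taken from the question and everything else is illustrative:

```python
import pandas as pd

# Made-up stand-in for the city column, with mixed case and a missing value
usuarios = pd.DataFrame({
    "vc_municipio": ["Belo Horizonte", "belo horizonte", "Vespasiano",
                     None, "Curitiba", "Curitiba"],
})

# Drop missing cities, normalise the case, count each city, sort by name
counts = (
    usuarios["vc_municipio"]
    .dropna()
    .str.upper()
    .value_counts()
    .sort_index()
)
print(counts)
```

Here counts.index holds the city names and counts.values holds the counts, which are the x and height arguments that plt.bar expects.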

pandas rolling_apply TypeError: int object is not iterable

I have a function defined in a separate script called TechAnalisys.py. This function just outputs a scalar, so I plan to use pd.rolling_apply() to generate a new column in the original dataframe (df).
The function works fine when called directly, but I have problems when using it with rolling_apply(). The question "Passing arguments to rolling_apply" shows how it should be done, and I think my code follows it, but I still get the error "TypeError: int object is not iterable".
This is the function (located in the script TechAnalisys.py):
def hurst(df,days):
    import pandas as pd
    import numpy as np
    df2 = pd.DataFrame()
    df2 = df[-days:]
    rango = lambda x: x.max() - x.min()
    df2['ret'] = 1 - df.PX_LAST/df.PX_LAST.shift(1)
    df2 = df2.dropna()
    ave = pd.expanding_mean(df2.ret)
    df2['desvdeprom'] = df2.ret - ave
    df2['acum'] = df2['desvdeprom'].cumsum()
    df2['rangorolled'] = pd.expanding_apply(df2.acum, rango)
    df2['datastd'] = pd.expanding_std(df2.ret)
    df2['rango_rangostd'] = np.log(df2.rangorolled/df2.datastd)
    df2['tiempo1'] = np.log(range(1,len(df2.index)+1))
    df2 = df2.dropna()
    model1 = pd.ols(y=df2['rango_rangostd'], x=df2['tiempo1'], intercept=False)
    return model1.beta
and now this is the main script:
import pandas as pd
import numpy as np
import TechAnalysis as ta
df = pd.DataFrame(np.log(np.cumsum(np.random.randn(100000)+1)+1000),columns =['PX_LAST'])
The following works:
print ta.hurst(df,50)
This doesn't work:
df['hurst_roll'] = pd.rolling_apply(df, 15 , ta.hurst, args=(50))
What's wrong with the code?
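There are two problems. First, args=(50) is not a tuple: the parentheses are only grouping, so rolling_apply receives the integer 50 and fails with "int object is not iterable" when it tries to unpack it. Write args=(50, ) instead. Second, if you check the type of df within the hurst function, you'll see that rolling_apply passes each window as a numpy.array, not a DataFrame.
If you create a DataFrame from this numpy.array inside hurst, it works. I also used a longer window, because a window of 15 gives only 15 values per array, while you seemed to be planning on using the last 50 days.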
def hurst(df, days):
    df = pd.DataFrame(df, columns=['PX_LAST'])
    df2 = pd.DataFrame()
    df2 = df.loc[-days:, :]
    rango = lambda x: x.max() - x.min()
    df2['ret'] = 1 - df.loc[:, 'PX_LAST']/df.loc[:, 'PX_LAST'].shift(1)
    df2 = df2.dropna()
    ave = pd.expanding_mean(df2.ret)
    df2['desvdeprom'] = df2.ret - ave
    df2['acum'] = df2['desvdeprom'].cumsum()
    df2['rangorolled'] = pd.expanding_apply(df2.acum, rango)
    df2['datastd'] = pd.expanding_std(df2.ret)
    df2['rango_rangostd'] = np.log(df2.rangorolled/df2.datastd)
    df2['tiempo1'] = np.log(range(1, len(df2.index)+1))
    df2 = df2.dropna()
    model1 = pd.ols(y=df2['rango_rangostd'], x=df2['tiempo1'], intercept=False)
    return model1.beta
def rol_apply():
    df = pd.DataFrame(np.log(np.cumsum(np.random.randn(1000)+1)+1000), columns=['PX_LAST'])
    df['hurst_roll'] = pd.rolling_apply(df, 100, hurst, args=(50, ))
PX_LAST hurst_roll
0 6.907911 NaN
1 6.907808 NaN
2 6.907520 NaN
3 6.908048 NaN
4 6.907622 NaN
5 6.909895 NaN
6 6.911281 NaN
7 6.911998 NaN
8 6.912245 NaN
9 6.912457 NaN
10 6.913794 NaN
11 6.914294 NaN
12 6.915157 NaN
13 6.916172 NaN
14 6.916838 NaN
15 6.917235 NaN
16 6.918061 NaN
17 6.918717 NaN
18 6.920109 NaN
19 6.919867 NaN
20 6.921309 NaN
21 6.922786 NaN
22 6.924173 NaN
23 6.925523 NaN
24 6.926517 NaN
25 6.928552 NaN
26 6.930198 NaN
27 6.931738 NaN
28 6.931959 NaN
29 6.932111 NaN
.. ... ...
970 7.562284 0.653381
971 7.563388 0.630455
972 7.563499 0.577746
973 7.563686 0.552758
974 7.564105 0.540144
975 7.564428 0.541411
976 7.564351 0.532154
977 7.564408 0.530999
978 7.564681 0.532376
979 7.565192 0.536758
980 7.565359 0.538629
981 7.566112 0.555789
982 7.566678 0.553163
983 7.566364 0.577953
984 7.567587 0.634843
985 7.568583 0.679807
986 7.569268 0.662653
987 7.570018 0.630447
988 7.570375 0.659497
989 7.570704 0.622190
990 7.571009 0.485458
991 7.571886 0.551147
992 7.573148 0.459912
993 7.574134 0.463146
994 7.574478 0.463158
995 7.574671 0.535014
996 7.575177 0.467705
997 7.575374 0.531098
998 7.575620 0.540611
999 7.576727 0.465572
[1000 rows x 2 columns]
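Note that pd.rolling_apply was deprecated in pandas 0.18 and later removed; in current pandas the same idea is written with the .rolling(...).apply(...) method. The sketch below shows only the calling convention, not the Hurst calculation: tail_range is a hypothetical toy statistic standing in for hurst, and the five prices are made up.

```python
import numpy as np
import pandas as pd

def tail_range(window, days):
    """Toy statistic: max minus min over the last `days` values of the window."""
    tail = window[-days:]
    return tail.max() - tail.min()

prices = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0], name="PX_LAST")

# (2) is just the int 2, while (2,) is a one-element tuple; args needs the tuple
extra_args = (2,)

# raw=True hands each window to the function as a NumPy array
rolled = prices.rolling(3).apply(tail_range, raw=True, args=extra_args)
print(rolled)
```

The first two entries are NaN because a window of 3 is not yet full, which is the same reason the output above starts with a run of NaN values.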
