How to generate a bar chart with data from a csv? - python

I have a csv with several columns, one of which is the city column. It contains several different cities, and the same city can be repeated several times.
I would like to build a bar chart showing how many times each city appears in the CSV.
Example:
Y X
5 Belo Horizonte
1 Vespasiano
4 São Paulo
I wrote the following code, but I get the error shown right after it.
Code:
import matplotlib.pyplot as plt; plt.rcdefaults()
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# reading the file
tb_usuarios = 'tb_usuarios.csv'
usuarios = pd.read_csv(tb_usuarios,
                       header=0,
                       index_col=False
                       )
print(usuarios.head())
usuarios["vc_municipio"] = usuarios["vc_municipio"].dropna()
usuarios["vc_municipio"] = usuarios["vc_municipio"].str.upper()
municipio = usuarios.groupby(['vc_municipio'])
print(municipio)
y_pos = usuarios.groupby(['vc_municipio'])['vc_municipio'].count()
print(y_pos)
plt.bar(y_pos, municipio, align='center', alpha=0.5)
plt.xticks(y_pos, municipio)
plt.ylabel('Qtd')
plt.title('Municipio')
plt.show()
Error:
Traceback (most recent call last):
File "C:/Users/Henrique Mendes/PycharmProjects/emprestimo/venv1/emprestimo.py", line 20, in <module>
plt.bar(y_pos, municipio, align='center', alpha=0.5)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\pyplot.py", line 2440, in bar
**({"data": data} if data is not None else {}), **kwargs)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\__init__.py", line 1601, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\axes\_axes.py", line 2348, in bar
self._process_unit_info(xdata=x, ydata=height, kwargs=kwargs)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\axes\_base.py", line 2126, in _process_unit_info
kwargs = _process_single_axis(ydata, self.yaxis, 'yunits', kwargs)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\axes\_base.py", line 2108, in _process_single_axis
axis.update_units(data)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\axis.py", line 1493, in update_units
default = self.converter.default_units(data, self)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\category.py", line 115, in default_units
axis.set_units(UnitData(data))
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\category.py", line 181, in __init__
self.update(data)
File "C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\lib\site-packages\matplotlib\category.py", line 215, in update
for val in OrderedDict.fromkeys(data):
TypeError: unhashable type: 'numpy.ndarray'
My outputs:
"C:\Users\Henrique Mendes\PycharmProjects\emprestimo\venv1\Scripts\python.exe" "C:/Users/Henrique Mendes/PycharmProjects/emprestimo/venv1/emprestimo.py"
pr_usuario bl_administrador dt_nascimento ... dt_cheque es_anexo dt_anexo
0 2 0 24/02/1980 ... NaN NaN NaN
1 3 0 05/09/1985 ... NaN NaN NaN
2 4 1 20/03/1984 ... NaN NaN NaN
3 5 1 20/01/1982 ... NaN NaN NaN
4 6 0 25/05/1985 ... NaN NaN NaN
[5 rows x 30 columns]
{'BELO HORIZONTE': Int64Index([0, 1, 2, 3, 6, 9, 10, 14, 17, 20, 22, 25], dtype='int64'), 'BRASILIA': Int64Index([4], dtype='int64'), 'CONTAGEM': Int64Index([23], dtype='int64'), 'CURITIBA': Int64Index([5, 7, 15, 18, 19], dtype='int64'), 'SANTA LUZIA': Int64Index([21], dtype='int64'), 'VESPASIANO': Int64Index([24], dtype='int64')}
vc_municipio
BELO HORIZONTE 12
BRASILIA 1
CONTAGEM 1
CURITIBA 5
SANTA LUZIA 1
VESPASIANO 1
Name: vc_municipio, dtype: int64
How can I do this chart?

Use pandas:
Your data:
assuming your data is in a .csv with the following form
0.0,BELO HORIZONTE
1.0,BELO HORIZONTE
2.0,BELO HORIZONTE
3.0,BELO HORIZONTE
6.0,BELO HORIZONTE
9.0,BELO HORIZONTE
10.0,BELO HORIZONTE
14.0,BELO HORIZONTE
17.0,BELO HORIZONTE
20.0,BELO HORIZONTE
22.0,BELO HORIZONTE
25.0,BELO HORIZONTE
4.0,BRASILIA
23.0,CONTAGEM
5.0,CURITIBA
7.0,CURITIBA
15.0,CURITIBA
18.0,CURITIBA
19.0,CURITIBA
21.0,SANTA LUZIA
24.0,VESPASIANO
Create the dataframe:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('test.csv', header=None)
df.columns = ['value', 'city']
value city
0 0.0 BELO HORIZONTE
1 1.0 BELO HORIZONTE
2 2.0 BELO HORIZONTE
3 3.0 BELO HORIZONTE
4 6.0 BELO HORIZONTE
5 9.0 BELO HORIZONTE
6 10.0 BELO HORIZONTE
7 14.0 BELO HORIZONTE
8 17.0 BELO HORIZONTE
9 20.0 BELO HORIZONTE
10 22.0 BELO HORIZONTE
11 25.0 BELO HORIZONTE
12 4.0 BRASILIA
13 23.0 CONTAGEM
14 5.0 CURITIBA
15 7.0 CURITIBA
16 15.0 CURITIBA
17 18.0 CURITIBA
18 19.0 CURITIBA
19 21.0 SANTA LUZIA
20 24.0 VESPASIANO
Groupby and plot the data:
groupby
count
plot.bar
# groupby & count
city_count = df.groupby('city').count()
value
city
BELO HORIZONTE 12
BRASILIA 1
CONTAGEM 1
CURITIBA 5
SANTA LUZIA 1
VESPASIANO 1
# plot
city_count.plot.bar()
plt.ylabel('Qtd')
plt.title('Municipio')
plt.show()
Plot with seaborn:
import seaborn as sns
sns.barplot(x=city_count.index, y='value', data=city_count)
plt.xticks(rotation=45)
plt.show()

municipio = usuarios.groupby(['vc_municipio']) returns a groupby object in pandas which is causing your error as matplotlib doesn't handle that.
plt.bar takes x values followed by y values (see docs).
matplotlib.pyplot.bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)
Luckily for you, when you do a groupby in pandas it automatically consolidates x values (or categories) as indices for you.
Assuming that municipio is meant to be a list of categories (you want the count by city?) then the following should work.
Replacing your code
plt.bar(y_pos, municipio, align='center', alpha=0.5)
with
plt.bar(y_pos.index, y_pos, align='center', alpha=0.5)
Alternatively, you can use pandas' own bar plotting method, plot.bar (which wraps matplotlib), to handle some of the dataframe quirks natively.
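A minimal sketch of that pandas route, assuming the column is named vc_municipio as in the question. The sample rows are made up for illustration and stand in for the real CSV:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample standing in for tb_usuarios.csv
usuarios = pd.DataFrame(
    {"vc_municipio": ["Belo Horizonte", "belo horizonte", "Curitiba", None, "Vespasiano"]}
)

# Normalise case, then count occurrences of each city (missing values are dropped by default)
counts = usuarios["vc_municipio"].str.upper().value_counts()

# Series.plot.bar uses the index (city names) as x labels and the counts as bar heights
ax = counts.plot.bar(alpha=0.5)
ax.set_ylabel("Qtd")
ax.set_title("Municipio")
plt.tight_layout()
plt.show()
```

Because the counts Series already carries the city names in its index, there is no need to pass separate x positions or tick labels.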

Related

delete redundant rows in a dataframe with set in columns

I have a dataframe df:
Cluster OsId BrowserId PageId VolumePred ConversionPred
0 11 11 {789615, 955761, 1149586, 955764, 955767, 1187... 147.0 71.0
1 0 11 12 {1184903, 955761, 1149586, 1158132, 955764, 10... 73.0 38.0
2 0 11 15 {1184903, 1109643, 955761, 955764, 1074581, 95... 72.0 40.0
3 0 11 16 {1123200, 1184903, 1109643, 1018637, 1005581, ... 7815.0 5077.0
4 0 11 17 {1184903, 789615, 1016529, 955761, 955764, 955... 52.0 47.0
... ... ... ... ... ... ...
307 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1154705 220.0 182.0
308 {18} 99 16 1155314 12.0 6.0
309 {9} 99 16 1158132 4.0 4.0
310 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1184903 966.0 539.0
This dataframe contains redundant rows that I need to delete, so I tried this:
df.drop_duplicates()
But I got this error : TypeError: unhashable type: 'set'
Any idea to help me to fix this error? Thanks!
Use frozensets to avoid the unhashable set type: convert the sets, detect duplicates with DataFrame.duplicated, and filter by boolean indexing with the mask inverted by ~:
#sets are in any column
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
df[~df1.duplicated()]
If no row is removed, no row is a duplicate (all columns are compared together).
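The same idea on a small made-up frame, as a sketch. The column names follow the question, but the values are invented, and the helper name freeze is only for illustration. It converts column by column with Series.map rather than applymap, which newer pandas versions deprecate:

```python
import pandas as pd

# Made-up rows; rows 0 and 1 hold the same set written in a different order
df = pd.DataFrame({
    "Cluster": [{0, 4, 7}, {7, 4, 0}, {18}, 0],
    "OsId": [99, 99, 99, 11],
})


def freeze(value):
    """Return a hashable frozenset for sets, leave other values untouched."""
    return frozenset(value) if isinstance(value, set) else value


# Build a hashable copy column by column, then keep rows not flagged as duplicates
hashable = df.apply(lambda col: col.map(freeze))
deduplicated = df[~hashable.duplicated()]
print(deduplicated)
```

The original frame keeps its sets; only the temporary copy used for the duplicate check is converted.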

Convert list to dataframe

I am running a loop that appends three fields. Predictfinal is a list, though it is not necessary that it should be a list.
predictfinal.append(y_hat_orig[0])
predictfinal.append(mape)
predictfinal.append(length)
At the end, predictfinal is one long list. What I really want is a Dataframe in which each row has 3 columns. However, the list does not distinguish between the 3 columns; it is just a long flat list of values. How can I slice predictfinal into 3 columns and build a Dataframe from the current unstructured list?
predictfinal
Out[88]:
[1433.0459967608983,
1.6407741379111223,
23,
1433.6389125340916,
1.6474721044455922,
22,
1433.867408791692,
1.6756763089082383,
21,
1433.8484984008207,
1.6457581105556003,
20,
1433.6340460965778,
1.6380908467895527,
19,
1437.0294365907992,
1.6147672264908473,
18,
1439.7485102740507,
1.5010415925555876,
17,
1440.950406295299,
1.433891246672529,
16,
1434.837060644701,
1.5252803314930383,
15,
1434.9716303636983,
1.6125952442799232,
14,
1441.3153523102953,
3.2633984339696185,
13,
1435.6932462859334,
3.2703435261200497,
12,
1419.9057834496082,
1.9100005818319687,
11,
1426.0739741342488,
1.947684057178654,
10]
Based on https://stackoverflow.com/a/48347320/6926444
We can achieve it by using zip() and iter(). The code below consumes three elements of the list at a time.
res = pd.DataFrame(list(zip(*([iter(predictfinal)] * 3))), columns=['a', 'b', 'c'])
Result:
a b c
0 1433.045997 1.640774 23
1 1433.638913 1.647472 22
2 1433.867409 1.675676 21
3 1433.848498 1.645758 20
4 1433.634046 1.638091 19
5 1437.029437 1.614767 18
6 1439.748510 1.501042 17
7 1440.950406 1.433891 16
8 1434.837061 1.525280 15
9 1434.971630 1.612595 14
10 1441.315352 3.263398 13
11 1435.693246 3.270344 12
12 1419.905783 1.910001 11
13 1426.073974 1.947684 10
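To see why the zip() and iter() idiom groups the values in threes, here is a small sketch on a shortened, rounded version of the list (the values are illustrative):

```python
flat = [1433.05, 1.64, 23, 1433.64, 1.65, 22]

# One iterator, passed to zip three times: zip pulls from it in turn,
# so each output tuple takes the next three items of the list.
it = iter(flat)
rows = list(zip(it, it, it))
print(rows)
```

Note that zip stops at the shortest input, so if the list length is not a multiple of 3, the leftover items at the end are silently dropped.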
You could do:
pd.DataFrame(np.array(predictfinal).reshape(-1,3), columns=['origin', 'mape', 'length'])
Output:
origin mape length
0 1433.045997 1.640774 23.0
1 1433.638913 1.647472 22.0
2 1433.867409 1.675676 21.0
3 1433.848498 1.645758 20.0
4 1433.634046 1.638091 19.0
5 1437.029437 1.614767 18.0
6 1439.748510 1.501042 17.0
7 1440.950406 1.433891 16.0
8 1434.837061 1.525280 15.0
9 1434.971630 1.612595 14.0
10 1441.315352 3.263398 13.0
11 1435.693246 3.270344 12.0
12 1419.905783 1.910001 11.0
13 1426.073974 1.947684 10.0
Or you can also modify your loop:
predictfinal = []
for i in some_list:
    predictfinal.append([y_hat_orig[0], mape, length])
# output dataframe
pd.DataFrame(predictfinal, columns=['origin', 'mape', 'length'])
Here is a pandas solution (l is the flat list, i.e. predictfinal):
s=pd.Series(l)
s.index=pd.MultiIndex.from_product([range(len(l)//3),['origin','map','len']])
s=s.unstack()
Out[268]:
len map origin
0 23.0 1.640774 1433.045997
1 22.0 1.647472 1433.638913
2 21.0 1.675676 1433.867409
3 20.0 1.645758 1433.848498
4 19.0 1.638091 1433.634046
5 18.0 1.614767 1437.029437
6 17.0 1.501042 1439.748510
7 16.0 1.433891 1440.950406
8 15.0 1.525280 1434.837061
9 14.0 1.612595 1434.971630
10 13.0 3.263398 1441.315352
11 12.0 3.270344 1435.693246
12 11.0 1.910001 1419.905783
13 10.0 1.947684 1426.073974

How do I separate arrays and add them based on their index in the array?

I am trying to make a wage calculator where a user inserts a .txt file and the program calculates the number of hours worked.
So far I am able to separate the names, wage value, and hours, but I can't figure out how to add the hours together.
So my desired result would be:
Names of Employees
Wage (how much they make)
Added number of hours per employee
Here is the data set (file name of txt is -> empwages.txt):
(Edit: the formatting got messed up, so here is the text:)
Spencer 12.75 8 8 8 8 10
Ruiz 18 8 8 9.5 8 8
Weiss 14.80 7 5 8 8 10
Choi 15 4 7 5 3.3 2.2
Miller 18 6.5 9 1 4 1
Barnes 15 7.5 9 4 0 2
Desired Outcome:
'Spencer', 'Ruiz', 'Weiss', 'Choi', 'Miller', 'Barnes'
'12.75', '18', '14.80', '15', '18', '15'
'42', '41.5', ... and so on
Current code:
infile = open("empwages.txt","r")
masterList = infile.readlines()
nameList = []
hourList = []
plushourList = []
for master in masterList:
    nameList.append(master.split()[0])
    hourList.append(master.split()[1])
    x = 2
    while x <= 6:
        plushourList.append(master.split()[x])
        x += 1
print(nameList)
print(hourList)
print(plushourList)
It is useful to get familiar with the concept of unpacking a list in Python. You can use the following code to solve your problem:
names = []
hours = []
more_hours = []
with open('empwages.txt') as f:
    for line in f:
        name, hour, *more_hs = line.split()
        names.append(name)
        hours.append(hour)
        more_hours.append(more_hs)
print(*names, sep=', ')
print(*hours, sep=', ')
print(*[sum(float(q) for q in e) for e in more_hours])
In case you need the strings as you have requested:
names = []
hours = []
more_hours = []
with open('empwages.txt') as f:
    for line in f:
        name, hour, *more_hs = line.split()
        names.append(name)
        hours.append(hour)
        more_hours.append(more_hs)
print(more_hours)
names = ', '.join(names)
hours = ', '.join(hours)
more_hours = ', '.join(str(s) for s in [sum(float(q) for q in e) for e in more_hours])
print(names)
print(hours)
print(more_hours)
Output
Spencer, Ruiz, Weiss, Choi, Miller, Barnes
12.75, 18, 14.80, 15, 18, 15
42.0 41.5 38.0 21.5 21.5 22.5
Try using zip:
with open("empwages.txt") as f:
    lines = [line.split() for line in f]
names, hours, *more_hours = zip(*lines)
print(names)
print(hours)
print([sum(map(float, i)) for i in zip(*more_hours)])
('Spencer', 'Ruiz', 'Weiss', 'Choi', 'Miller', 'Barnes')
('12.75', '18', '14.80', '15', '18', '15')
[42.0, 41.5, 38.0, 21.5, 21.5, 22.5]
This will:
Split the file up by line, and split the lines up by word
Put the first word of each line in names, the second in hours, and the rest in more_hours
You can add more variables before the *more_hours as needed.
(Edited to correctly sum hours).
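Putting the pieces together for the wage calculator, here is a sketch that keeps name, wage and total hours per employee. It reads from an inline string instead of empwages.txt so it runs on its own, and the pay calculation (wage times total hours) is an assumption about what the calculator should eventually do:

```python
sample = """Spencer 12.75 8 8 8 8 10
Ruiz 18 8 8 9.5 8 8
Weiss 14.80 7 5 8 8 10"""

totals = {}
for line in sample.splitlines():
    # First field is the name, second the wage, the rest are daily hours
    name, wage, *hours = line.split()
    total_hours = sum(float(h) for h in hours)
    # Assumed calculation: pay = hourly wage * total hours
    totals[name] = (float(wage), total_hours, float(wage) * total_hours)

for name, (wage, hours, pay) in totals.items():
    print(f"{name}: wage={wage}, hours={hours}, pay={pay:.2f}")
```

A dictionary keyed by name keeps each employee's values together, which avoids having to line up three parallel lists by index.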
Well if you're not opposed to using pandas:
import pandas as pd
from io import StringIO
import re
initial_data = '''Spencer 12.75 8 8 8 8 10
Ruiz 18 8 8 9.5 8 8
Weiss 14.80 7 5 8 8 10
Choi 15 4 7 5 3.3 2.2
Miller 18 6.5 9 1 4 1
Barnes 15 7.5 9 4 0 2'''
df = pd.read_csv(StringIO(re.sub(r'[ ]+', ',', initial_data, flags=re.M)), header=None)
print(df)
0 1 2 3 4 5 6
0 Spencer 12.75 8.0 8 8.0 8.0 10.0
1 Ruiz 18.00 8.0 8 9.5 8.0 8.0
2 Weiss 14.80 7.0 5 8.0 8.0 10.0
3 Choi 15.00 4.0 7 5.0 3.3 2.2
4 Miller 18.00 6.5 9 1.0 4.0 1.0
5 Barnes 15.00 7.5 9 4.0 0.0 2.0
Then you can quickly sum the hours per row like so (columns 2 onward; column 1 is the wage, so df.loc[:, 1:] would add the wage to the total):
df.loc[:, 2:].sum(axis=1)
0 42.0
1 41.5
2 38.0
3 21.5
4 21.5
5 22.5
dtype: float64

How can I plot this data? Its file extension is .xvg

I am new to Python. I have tried this script but it does not work.
It gives me this error:
Traceback (most recent call last):
File "temp.py", line 11, in <module>
y = [row.split(' ')[1] for row in data]
File "temp.py", line 11, in <listcomp>
y = [row.split(' ')[1] for row in data]
IndexError: list index out of range
The script is:
import numpy as np
import matplotlib.pyplot as plt
with open("data.xvg") as f:
    data = f.read()
data = data.split('\n')
x = [row.split(' ')[0] for row in data]
y = [row.split(' ')[1] for row in data]
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.set_title("Plot title...")
ax1.set_xlabel('your x label..')
ax1.set_ylabel('your y label...')
ax1.plot(x,y, c='r', label='the data')
leg = ax1.legend()
plt.show()
The data is:
0.000000 299.526978
1.000000 4.849206
2.000000 0.975336
3.000000 0.853160
4.000000 0.767092
5.000000 0.995595
6.000000 0.976332
7.000000 1.111898
8.000000 1.251045
9.000000 1.346720
10.000000 1.522089
11.000000 1.705517
12.000000 1.822599
13.000000 1.988752
14.000000 2.073061
15.000000 2.242703
16.000000 2.370366
17.000000 2.530256
18.000000 2.714863
19.000000 2.849218
20.000000 3.033373
21.000000 3.185251
22.000000 3.282328
23.000000 3.431681
24.000000 3.668798
25.000000 3.788214
26.000000 3.877117
27.000000 4.032224
28.000000 4.138007
29.000000 4.315784
30.000000 4.504521
31.000000 4.668567
32.000000 4.787213
33.000000 4.973860
34.000000 5.128736
35.000000 5.240545
36.000000 5.392560
37.000000 5.556009
38.000000 5.709351
39.000000 5.793169
40.000000 5.987224
41.000000 6.096015
42.000000 6.158622
43.000000 6.402116
44.000000 6.533816
45.000000 6.711002
46.000000 6.876793
47.000000 7.104519
48.000000 7.237456
49.000000 7.299352
50.000000 7.471975
51.000000 7.691428
52.000000 7.792002
53.000000 7.928269
54.000000 8.014977
55.000000 8.211984
56.000000 8.330894
57.000000 8.530197
58.000000 8.690166
59.000000 8.808934
60.000000 8.996209
61.000000 9.104818
62.000000 9.325309
63.000000 9.389288
64.000000 9.576900
65.000000 9.761865
66.000000 9.807437
67.000000 10.027261
68.000000 10.129250
69.000000 10.392891
70.000000 10.497618
71.000000 10.627769
72.000000 10.811770
73.000000 11.119184
74.000000 11.181286
75.000000 11.156842
76.000000 11.350290
77.000000 11.493779
78.000000 11.720265
79.000000 11.700112
80.000000 11.939404
81.000000 12.293530
82.000000 12.267791
83.000000 12.394929
84.000000 12.545286
85.000000 12.784669
86.000000 12.754122
87.000000 13.129798
88.000000 13.166340
89.000000 13.389514
90.000000 13.436648
91.000000 13.647285
92.000000 13.722875
93.000000 13.992217
94.000000 14.167837
95.000000 14.320843
96.000000 14.450310
97.000000 14.515556
98.000000 14.598526
99.000000 14.807360
100.000000 14.982592
101.000000 15.312892
102.000000 15.280009
If it is an xvg file from GROMACS it probably has some comments starting with # so without editing that file you can:
x,y = np.loadtxt("file.xvg",comments="#",unpack=True)
plt.plot(x,y)
unpack=True makes the columns come out as individual arrays that are set to x and y on the left-hand side. Of course you could also parse the comments to get the labels and legends.
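GROMACS .xvg files usually also contain directive lines starting with @ (title, axis labels, legends). As a sketch, np.loadtxt can skip both kinds of line, because comments accepts a sequence of prefixes. The header lines below are made up and are read from an in-memory string rather than a real file:

```python
import io
import numpy as np

# Made-up stand-in for a GROMACS .xvg file: "#" comment lines and "@" directive lines
sample = io.StringIO(
    "# example header comment\n"
    '@    title "example"\n'
    "0.000000 299.526978\n"
    "1.000000 4.849206\n"
    "2.000000 0.975336\n"
)

# comments accepts a sequence of prefixes, so both kinds of header line are skipped
x, y = np.loadtxt(sample, comments=["#", "@"], unpack=True)
print(x, y)
```

With a real file, pass the file name instead of the StringIO object.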
Try the following; you need to convert each value to a float before appending it:
import numpy as np
import matplotlib.pyplot as plt
x, y = [], []
with open("data.xvg") as f:
    for line in f:
        cols = line.split()
        if len(cols) == 2:
            x.append(float(cols[0]))
            y.append(float(cols[1]))
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.set_title("Plot title...")
ax1.set_xlabel('your x label..')
ax1.set_ylabel('your y label...')
ax1.plot(x,y, c='r', label='the data')
leg = ax1.legend()
plt.show()
The error most likely occurs because the file contains an empty line somewhere (for example, a trailing newline at the end). Checking that the split produced exactly 2 entries ensures you do not get an index out of range error.
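A short sketch of how the empty entry arises, using two rows of the data as an in-memory string with a trailing newline (an assumption about what the real file looks like):

```python
text = "0.000000 299.526978\n1.000000 4.849206\n"  # note the trailing newline

rows = text.split("\n")
print(rows)  # the last element is an empty string

# "".split(" ") returns [""], which has no index 1, hence the IndexError
print(rows[-1].split(" "))

# Keeping only rows that split into exactly two fields avoids the error
pairs = [r.split() for r in rows if len(r.split()) == 2]
x = [float(p[0]) for p in pairs]
y = [float(p[1]) for p in pairs]
print(x, y)
```

Calling split() with no argument also collapses repeated spaces, so it is more forgiving than split(' ') for whitespace-aligned columns.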
You can also plot XVG files with the GMXvg package, which is available as a Python library or as a Windows/Linux executable.
It discovers XVG files and converts them to JPG or any other format supported by Python's matplotlib.

pandas combining dataframe

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
java = pickle.load(open('JavaSafe.p','rb')) ##import 2d array
python = pickle.load(open('PythonSafe.p','rb')) ##import 2d array
javaFrame = pd.DataFrame(java,columns=['Town','Java Jobs'])
pythonFrame = pd.DataFrame(python,columns=['Town','Python Jobs'])
javaFrame = javaFrame.sort_values(by='Java Jobs',ascending=False)
pythonFrame = pythonFrame.sort_values(by='Python Jobs',ascending=False)
print(javaFrame,"\n",pythonFrame)
This code comes out with the following:
Town Java Jobs
435 York,NY 3593
212 NewYork,NY 3585
584 Seattle,WA 2080
624 Chicago,IL 1920
301 Boston,MA 1571
...
79 Holland,MI 5
38 Manhattan,KS 5
497 Vernon,IL 5
30 Clayton,MO 5
90 Waukegan,IL 5
[653 rows x 2 columns]
Town Python Jobs
160 NewYork,NY 2949
11 York,NY 2938
349 Seattle,WA 1321
91 Chicago,IL 1312
167 Boston,MA 1117
383 Hanover,NH 5
209 Bulverde,TX 5
203 Salisbury,NC 5
67 Rockford,IL 5
256 Ventura,CA 5
[416 rows x 2 columns]
I want to make a new dataframe that uses the town names as an index and has one column for Java and one for Python. However, some of the towns will only have results for one of the languages.
import pandas as pd
javaFrame = pd.DataFrame({'Java Jobs': [3593, 3585, 2080, 1920, 1571, 5, 5, 5, 5, 5],
'Town': ['York,NY', 'NewYork,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Holland,MI', 'Manhattan,KS', 'Vernon,IL', 'Clayton,MO', 'Waukegan,IL']}, index=[435, 212, 584, 624, 301, 79, 38, 497, 30, 90])
pythonFrame = pd.DataFrame({'Python Jobs': [2949, 2938, 1321, 1312, 1117, 5, 5, 5, 5, 5],
'Town': ['NewYork,NY', 'York,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Hanover,NH', 'Bulverde,TX', 'Salisbury,NC', 'Rockford,IL', 'Ventura,CA']}, index=[160, 11, 349, 91, 167, 383, 209, 203, 67, 256])
result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town')
# Java Jobs Python Jobs
# Town
# York,NY 3593.0 2938.0
# NewYork,NY 3585.0 2949.0
# Seattle,WA 2080.0 1321.0
# Chicago,IL 1920.0 1312.0
# Boston,MA 1571.0 1117.0
# Holland,MI 5.0 NaN
# Manhattan,KS 5.0 NaN
# Vernon,IL 5.0 NaN
# Clayton,MO 5.0 NaN
# Waukegan,IL 5.0 NaN
# Hanover,NH NaN 5.0
# Bulverde,TX NaN 5.0
# Salisbury,NC NaN 5.0
# Rockford,IL NaN 5.0
# Ventura,CA NaN 5.0
pd.merge will by default join two DataFrames on all columns shared in common. In this case, javaFrame and pythonFrame share only the Town column in common. So by default pd.merge would join the two DataFrames on the Town column.
how='outer' causes pd.merge to use the union of the keys from both frames. In other words, it causes pd.merge to return rows whose data come from either javaFrame or pythonFrame, even if only one DataFrame contains the Town. Missing data is filled with NaN.
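A small sketch of the same merge with the join column named explicitly through on=, using made-up two-row frames. Filling the gaps with 0 is an optional extra step, assuming a missing town should count as zero jobs:

```python
import pandas as pd

# Small made-up frames; "Holland,MI" and "Hanover,NH" each appear in only one of them
java = pd.DataFrame({"Town": ["Boston,MA", "Holland,MI"], "Java Jobs": [1571, 5]})
python = pd.DataFrame({"Town": ["Boston,MA", "Hanover,NH"], "Python Jobs": [1117, 5]})

# Naming the key with on= makes the join column explicit instead of relying on the default
result = pd.merge(java, python, on="Town", how="outer").set_index("Town")

# Optionally treat "no listing" as zero jobs and restore the integer dtype
result_filled = result.fillna(0).astype(int)
print(result_filled)
```

Stating on="Town" guards against an accidental join on extra columns if the two frames later gain another column with the same name.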
Use pd.concat
df = pd.concat([df.set_index('Town') for df in [javaFrame, pythonFrame]], axis=1)
Java Jobs Python Jobs
Boston,MA 1571.0 1117.0
Bulverde,TX NaN 5.0
Chicago,IL 1920.0 1312.0
Clayton,MO 5.0 NaN
Hanover,NH NaN 5.0
Holland,MI 5.0 NaN
Manhattan,KS 5.0 NaN
NewYork,NY 3585.0 2949.0
Rockford,IL NaN 5.0
Salisbury,NC NaN 5.0
Seattle,WA 2080.0 1321.0
Ventura,CA NaN 5.0
Vernon,IL 5.0 NaN
Waukegan,IL 5.0 NaN
York,NY 3593.0 2938.0
