I'm trying to make a report and then convert it to the prescribed form, but I don't know how. Below is my code:
data = pd.read_csv('https://raw.githubusercontent.com/hoatranobita/reports/main/Loan_list_test.csv')
data_pivot = pd.pivot_table(data, values='CLOC_CUR_XC_BL', index=['BIZ_TYPE_SBV_CODE'], columns=['TERM_CODE', 'CURRENCY_CD'], aggfunc=np.sum)
print(data_pivot)
The pivot table shows as below:
TERM_CODE            Ngắn hạn                Trung hạn
CURRENCY_CD 1. VND 2. USD 1. VND 2. USD
BIZ_TYPE_SBV_CODE
201 170000.00 NaN 43533.42 NaN
202 2485441.64 5188792.76 2682463.04 1497309.06
204 35999.99 NaN NaN NaN
301 1120940.65 NaN 190915.62 453608.72
401 347929.88 182908.01 239123.29 NaN
402 545532.99 NaN 506964.23 NaN
403 21735.74 NaN 1855.92 NaN
501 10346.45 NaN NaN NaN
601 881974.40 NaN 50000.00 NaN
602 377216.09 NaN 828868.61 NaN
702 9798.74 NaN 23616.39 NaN
802 155099.66 NaN 762294.95 NaN
803 23456.79 NaN 97266.84 NaN
804 151590.00 NaN 378000.00 NaN
805 182925.30 54206.52 4290216.37 NaN
Here is the prescribed form:
form = pd.read_excel('https://github.com/hoatranobita/reports/blob/main/Form%20A00034.xlsx?raw=true')
form.head()
Mã ngành kinh tế Dư nợ tín dụng (không bao gồm mua, đầu tư trái phiếu doanh nghiệp) Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 NaN Ngắn hạn NaN Trung và dài hạn NaN Tổng cộng
1 NaN Bằng VND Bằng ngoại tệ Bằng VND Bằng ngoại tệ NaN
2 101.0 NaN NaN NaN NaN NaN
3 201.0 NaN NaN NaN NaN NaN
4 202.0 NaN NaN NaN NaN NaN
As you can see, the pivot table has no code 101, but the form does. So what do I have to do to convert from the DataFrame to the form while accounting for codes like 101 that the data skips?
Thank you.
Hi. First, create a worksheet using xlsxwriter:
import xlsxwriter
# Start the workbook
workbook = xlsxwriter.Workbook('merge1.xlsx')
# Introduce formatting (header_format avoids shadowing the built-in format)
header_format = workbook.add_format({'border': 1, 'bold': True})
# Add a worksheet
worksheet = workbook.add_worksheet()
merge_format = workbook.add_format({
    'bold': 1,
    'border': 1,
    'align': 'center',
    'valign': 'vcenter'})
#Starting the Headers
worksheet.merge_range('A1:A3', 'Mã ngành kinh tế', merge_format)
worksheet.merge_range('B1:F1', 'Dư nợ tín dụng (không bao gồm mua, đầu tư trái phiếu doanh nghiệp)', merge_format)
worksheet.merge_range('B2:C2', 'Ngắn hạn', merge_format)
worksheet.merge_range('D2:E2', 'Trung và dài hạn', merge_format)
worksheet.merge_range('F2:F3', 'Tổng cộng', merge_format)
worksheet.write(2, 1, 'Bằng VND', header_format)
worksheet.write(2, 2, 'Bằng ngoại tệ', header_format)
worksheet.write(2, 3, 'Bằng VND', header_format)
worksheet.write(2, 4, 'Bằng ngoại tệ', header_format)
After this formatting you can start writing to the sheet by looping through your data with worksheet.write(). Below I have included a sample:
expenses = (
    ['Rent', 1000],
    ['Gas', 100],
    ['Food', 300],
    ['Gym', 50],
)
row = 3  # start below the three header rows
col = 0
for item, cost in expenses:
    worksheet.write(row, col, item)
    worksheet.write(row, col + 1, cost)
    row += 1
In row and col you specify the cell's row and column as zero-based numbers, like matrix indices; for example, worksheet.write(3, 0, 'Rent') writes to cell A4.
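Building on that, here is a minimal sketch of filling the form's body from your pivot table. Two things are assumptions on my side: the code list (I take it from your form, which is why 101 appears even though the pivot has no data for it), and that the pivot's four value columns line up with the form's columns B to E.
# Sketch: reindex the pivot on the form's full code list so codes missing
# from the data (e.g. 101) become all-NaN rows and are left blank.
form_codes = [101, 201, 202, 204, 301, 401, 402, 403, 501, 601, 602, 702, 802, 803, 804, 805]  # assumed from the form
body = data_pivot.reindex(form_codes)
row = 3  # first data row, below the three header rows
for code, values in body.iterrows():
    worksheet.write(row, 0, int(code))
    for col, value in enumerate(values, start=1):
        if pd.notna(value):  # leave cells empty where the pivot has no data
            worksheet.write(row, col, float(value))
    row += 1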
And finally, close the workbook:
workbook.close()
I'm trying to extract name, brand, price, and stock microdata from pages listed in a sitemap.xml.
But I'm blocked at the step below. Thank you for helping me; as I'm a newbie, I can't understand the blocking element.
Scrape the sitemap.xml to have list of urls : OK
Extract the metadata : OK
Extract the product schema : OK
Extract the products : not OK
Crawl the site and store the products : not OK
Scrape the sitemap.xml to have list of urls : OK
import pandas as pd
import requests
import extruct
from w3lib.html import get_base_url
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import advertools as adv
proximus_sitemap = adv.sitemap_to_df('https://www.proximus.be/iportal/sitemap.xml')
proximus_sitemap = proximus_sitemap[proximus_sitemap['loc'].str.contains('boutique')]
proximus_sitemap = proximus_sitemap[proximus_sitemap['loc'].str.contains('/fr/')]
Extract the metadata : OK
def extract_metadata(url):
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text,
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph'])
    return metadata
metadata = extract_metadata('https://www.proximus.be/fr/id_cr_apple-iphone-13-128gb-blue/particuliers/equipement/boutique/apple-iphone-13-128gb-blue.html')
metadata
Extract the product schema : OK
def get_dictionary_by_key_value(dictionary, target_key, target_value):
    for key in dictionary:
        if len(dictionary[key]) > 0:
            for item in dictionary[key]:
                if item[target_key] == target_value:
                    return item
Product = get_dictionary_by_key_value(metadata, "#type", "Product")
Product
Extract the products : not OK => error message = KeyError: 'offers'
def get_products(metadata):
    Product = get_dictionary_by_key_value(metadata, "#type", "Product")
    if Product:
        products = []
        for offer in Product['offers']['offers']:
            product = {
                'product_name': Product.get('name', ''),
                'brand': offer.get('description', ''),
                'availability': offer.get('availability', ''),
                'lowprice': offer.get('lowPrice', ''),
                'highprice': offer.get('highPrice', ''),
                'price': offer.get('price', ''),
                'priceCurrency': offer.get('priceCurrency', ''),
            }
            products.append(product)
        return products
Crawl the site and store the products : not OK, as I'm blocked at the previous step
def scrape_products(proximus_sitemap, url='url'):
    df_products = pd.DataFrame(columns=['product_name', 'brand', 'name', 'availability',
                                        'lowprice', 'highprice', 'price', 'priceCurrency'])
    for index, row in proximus_sitemap.iterrows():
        metadata = extract_metadata(row[url])
        products = get_products(metadata)
        if products is not None:
            for product in products:
                df_products = df_products.append(product, ignore_index=True)
    return df_products
df_products = scrape_products(proximus_sitemap, url='loc')
df_products.to_csv('patch.csv', index=False)
df_products.head()
You can simply continue by using the advertools SEO crawler. It has a crawl function that also extracts structured data by default (JSON-LD, OpenGraph, and Twitter).
I tried to crawl a sample of ten pages, and this is what the output looks like:
adv.crawl(proximus_sitemap['loc'], 'proximums.jl')
proximus_crawl = pd.read_json('proximums.jl', lines=True)
proximus_crawl.filter(regex='jsonld').columns
Index(['jsonld_#context', 'jsonld_#type', 'jsonld_name', 'jsonld_url',
'jsonld_potentialAction.#type', 'jsonld_potentialAction.target',
'jsonld_potentialAction.query-input', 'jsonld_1_#context',
'jsonld_1_#type', 'jsonld_1_name', 'jsonld_1_url', 'jsonld_1_logo',
'jsonld_1_sameAs', 'jsonld_2_#context', 'jsonld_2_#type',
'jsonld_2_itemListElement', 'jsonld_2_name', 'jsonld_2_image',
'jsonld_2_description', 'jsonld_2_sku', 'jsonld_2_review',
'jsonld_2_brand.#type', 'jsonld_2_brand.name',
'jsonld_2_aggregateRating.#type',
'jsonld_2_aggregateRating.ratingValue',
'jsonld_2_aggregateRating.reviewCount', 'jsonld_2_offers.#type',
'jsonld_2_offers.priceCurrency', 'jsonld_2_offers.availability',
'jsonld_2_offers.price', 'jsonld_3_#context', 'jsonld_3_#type',
'jsonld_3_itemListElement', 'jsonld_image', 'jsonld_description',
'jsonld_sku', 'jsonld_review', 'jsonld_brand.#type',
'jsonld_brand.name', 'jsonld_aggregateRating.#type',
'jsonld_aggregateRating.ratingValue',
'jsonld_aggregateRating.reviewCount', 'jsonld_offers.#type',
'jsonld_offers.lowPrice', 'jsonld_offers.highPrice',
'jsonld_offers.priceCurrency', 'jsonld_offers.availability',
'jsonld_offers.price', 'jsonld_offers.offerCount',
'jsonld_1_itemListElement', 'jsonld_2_offers.lowPrice',
'jsonld_2_offers.highPrice', 'jsonld_2_offers.offerCount',
'jsonld_itemListElement'],
dtype='object')
These are some of the columns you might be interested in (containing price, currency, availability, etc.):
  jsonld_2_description  jsonld_2_offers.priceCurrency  jsonld_2_offers.availability  jsonld_2_offers.price  jsonld_description  jsonld_offers.lowPrice  jsonld_offers.priceCurrency  jsonld_offers.availability  jsonld_offers.price  jsonld_2_offers.lowPrice
0                  nan                            nan                           nan                    nan                 nan                     nan                          nan                         nan                  nan                       nan
1             Numéro 7                            EUR                    OutOfStock                 369.99                 nan                     nan                          nan                         nan                  nan                       nan
2                  nan                            nan                           nan                    nan               Apple                   81.82                          EUR                     InStock                487.6                       nan
3                  nan                            nan                           nan                    nan                 nan                     nan                          nan                         nan                  nan                       nan
4                  nan                            nan                           nan                    nan              Huawei                     nan                          EUR                  OutOfStock               330.57                       nan
5                  nan                            nan                           nan                    nan               Apple                   81.82                          EUR         LimitedAvailability                487.6                       nan
6                Apple                            EUR                       InStock                 589.99                 nan                     nan                          nan                         nan                  nan                        99
7                Apple                            EUR           LimitedAvailability                 589.99                 nan                     nan                          nan                         nan                  nan                        99
8                  nan                            nan                           nan                    nan                 nan                     nan                          nan                         nan                  nan                       nan
9                  nan                            nan                           nan                    nan                 nan                     nan                          nan                         nan                  nan                       nan
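If you then want one tidy product table, a quick sketch (assuming the column names above, which are specific to this crawl) is to coalesce the duplicated offer columns, since the same product data lands in jsonld_offers.* on some pages and in jsonld_2_offers.* on others:
# Sketch: coalesce the duplicated offer columns into one set.
# Column names are taken from this crawl's output and are site-specific.
products = pd.DataFrame({
    'description': proximus_crawl['jsonld_description'].fillna(proximus_crawl['jsonld_2_description']),
    'price': proximus_crawl['jsonld_offers.price'].fillna(proximus_crawl['jsonld_2_offers.price']),
    'currency': proximus_crawl['jsonld_offers.priceCurrency'].fillna(proximus_crawl['jsonld_2_offers.priceCurrency']),
    'availability': proximus_crawl['jsonld_offers.availability'].fillna(proximus_crawl['jsonld_2_offers.availability']),
})
products = products.dropna(how='all')  # keep rows where some product data was found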
Let's say that I have this dataframe with three columns: "Name", "Account" and "Ccy".
import pandas as pd
Name = ['Dan', 'Mike', 'Dan', 'Dan', 'Sara', 'Charles', 'Mike', 'Karl']
Account = ['100', '30', '50', '200', '90', '20', '65', '230']
Ccy = ['EUR','EUR','USD','USD','','CHF', '','DKN']
df = pd.DataFrame({'Name':Name, 'Account' : Account, 'Ccy' : Ccy})
Name Account Ccy
0 Dan 100 EUR
1 Mike 30 EUR
2 Dan 50 USD
3 Dan 200 USD
4 Sara 90
5 Charles 20 CHF
6 Mike 65
7 Karl 230 DKN
I would like to represent this data differently. I would like to write a script that finds all the duplicates in the Name column, regroups them with their different accounts, and, if there is a currency in "Ccy", adds a new column next to it with the associated currencies.
So something like this:
Dan Ccy1 Mike Ccy2 Sara Charles Ccy3 Karl Ccy4
0 100 EUR 30 EUR 90 20 CHF 230 DKN
1 50 USD 65
2 200 USD
I don't really know how to start, so I simplified the problem to proceed step by step. I tried to regroup the duplicates by name with a list, however it did not identify the duplicates.
x_len, y_len = df.shape
new_data = []
for i in range(x_len):
    if df.iloc[i, 0] not in new_data:
        print(str(df.iloc[i, 0]) + '\t' + str(df.iloc[i, 1]) + '\t' + str(bool(df.iloc[i, 0] not in new_data)))
        new_data.append([df.iloc[i, 0], df.iloc[i, 1]])
    else:
        new_data[str(df.iloc[i, 0])].append(df.iloc[i, 1])
Then I thought that it would be easier to use a dictionary. So I tried this loop, but there is an error, and maybe it is not the best way to get to the expected final result:
from collections import defaultdict
dico = defaultdict(list)
x_len, y_len = df.shape
for i in range(x_len):
    if df.iloc[i, 0] not in dico:
        print(str(df.iloc[i, 0]) + '\t' + str(df.iloc[i, 1]) + '\t' + str(bool(df.iloc[i, 0] not in dico)))
        dico[str(df.iloc[i, 0])] = df.iloc[i, 1]
        print(dico)
    else:
        dico[df.iloc[i, 0]].append(df.iloc[i, 1])
Does anyone have an idea of how to start, or how to write the code if it is simple?
Thank you.
Use GroupBy.cumcount for a counter, reshape by DataFrame.set_index and DataFrame.unstack, and last flatten the column names:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df)
Account_Charles Ccy_Charles Account_Dan Ccy_Dan Account_Karl Ccy_Karl \
0 20 CHF 100 EUR 230 DKN
1 NaN NaN 50 USD NaN NaN
2 NaN NaN 200 USD NaN NaN
Account_Mike Ccy_Mike Account_Sara Ccy_Sara
0 30 EUR 90
1 65 NaN NaN
2 NaN NaN NaN NaN
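To see what the counter contributes, run the cumcount on the original df from the question; it numbers repeated names from 0 upward, and that number becomes the new row index within each name's group:
g = df.groupby(['Name']).cumcount()
print(g.tolist())
# [0, 0, 1, 2, 0, 0, 1, 0] -> Dan occurs three times (0, 1, 2), Mike twice (0, 1), the rest once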
If you need custom column names, use if-else in a list comprehension:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
L = [b if a == 'Account' else f'{a}{i // 2}' for i, (a, b) in enumerate(df.columns)]
df.columns = L
print(df)
Charles Ccy0 Dan Ccy1 Karl Ccy2 Mike Ccy3 Sara Ccy4
0 20 CHF 100 EUR 230 DKN 30 EUR 90
1 NaN NaN 50 USD NaN NaN 65 NaN NaN
2 NaN NaN 200 USD NaN NaN NaN NaN NaN NaN
I have a list of shares that make up an ETF. I have formatted the tickers into a list and have named this variable assets:
print(assets)
['AUD', 'CRWD', 'SPLK', 'OKTA', 'AVGO', 'CSCO', 'NET', 'ZS', 'AKAM', 'FTNT', 'BAH', 'CYBR', 'CHKP', 'BA/', 'VMW', 'PFPT', 'PANW', 'VRSN', 'FFIV', 'JNPR', 'LDOS', '4704', 'FEYE', 'QLYS', 'SAIC', 'RPD', 'HO', 'MIME', 'SAIL', 'VRNS', 'ITRI', 'AVST', 'MANT', 'TENB', '053800', 'ZIXI', 'OSPN', 'RDWR', 'ULE', 'MOBL', 'ATEN', 'TUFN', 'RBBN', 'NCC', 'KRW', 'EUR', 'JPY', 'GBP', 'USD']
I use the following for loop to iterate through the list and pull historical data from Yahoo:
for i in assets:
    try:
        df[i] = web.DataReader(i, data_source='yahoo', start=start, end=end)['Adj Close']
    except RemoteDataError:
        print(f'{i}')
        continue
I am returned with:
BA/
4704
HO
053800
KRW
JPY
This suggests these assets cannot be found on Yahoo Finance; I understand that this is the case and accept it.
But when I look at the stocks that have theoretically been found (e.g. df['FEYE']), I get the following:
0 NaN 1 NaN 2 NaN 3 NaN 4 NaN 5 NaN 6 NaN 7 NaN 8 NaN 9 NaN 10 NaN 11 NaN 12 NaN 13 NaN 14 NaN 15 NaN 16 NaN 17 NaN 18 NaN 19 NaN 20 NaN 21 NaN 22 NaN 23 NaN 24 NaN 25 NaN 26 NaN 27 NaN 28 NaN 29 NaN 30 NaN 31 NaN 32 NaN 33 NaN 34 NaN 35 NaN 36 NaN 37 NaN 38 NaN 39 NaN 40 NaN 41 NaN 42 NaN 43 NaN 44 NaN 45 NaN 46 NaN 47 NaN 48 NaN
Name: FEYE, dtype: float64
When I proceed normally with just one share, e.g.
CSCO = web.DataReader(assets[5], data_source='yahoo', start=start, end=end)['Adj Close']
it is all OK.
Any help is greatly appreciated,
Thank you!
Here is a reproducible testing example of code and output.
If you have an existing dataframe named df, then the new data is incompatible with it in terms of index and maybe column names.
You need to create a new dataframe, but outside the loop; each iteration then creates a new column with that ticker's data.
import pandas as pd
import pandas_datareader.data as web
from pandas_datareader._utils import RemoteDataError
assets=['AUD', 'CRWD', 'SPLK', 'OKTA', 'AVGO', 'CSCO', 'NET', 'ZS', 'AKAM', 'FTNT', 'BAH', 'CYBR', 'CHKP', 'BA/', 'VMW', 'PFPT', 'PANW', 'VRSN', 'FFIV', 'JNPR', 'LDOS', '4704', 'FEYE', 'QLYS', 'SAIC', 'RPD', 'HO', 'MIME', 'SAIL', 'VRNS', 'ITRI', 'AVST', 'MANT', 'TENB', '053800', 'ZIXI', 'OSPN', 'RDWR', 'ULE', 'MOBL', 'ATEN', 'TUFN', 'RBBN', 'NCC', 'KRW', 'EUR', 'JPY', 'GBP', 'USD']
df = pd.DataFrame()
df = pd.DataFrame()
for i in assets:
    try:
        print(f'Try: {i}')
        df[i] = web.DataReader(i, data_source='yahoo')['Adj Close']
    except RemoteDataError as r:
        print(f'Try: {i}: {r}')
        continue
result:
Try: AUD
Try: CRWD
Try: SPLK
Try: OKTA
Try: AVGO
Try: CSCO
Try: NET
Try: ZS
Try: AKAM
Try: FTNT
Try: BAH
Try: CYBR
Try: CHKP
Try: BA/
Try: BA/: Unable to read URL: https://finance.yahoo.com/quote/BA//history?period1=1435975200&period2=1593741599&interval=1d&frequency=1d&filter=history
Response Text:
b'<html>\n<meta charset=\'utf-8\'>\n<script>\nvar u=\'https://www.yahoo.com/?err=404&err_url=https%3a%2f%2ffinance.yahoo.com%2fquote%2fBA%2f%2fhistory%3fperiod1%3d1435975200%26period2%3d1593741599%26interval%3d1d%26frequency%3d1d%26filter%3dhistory\';\nif(window!=window.top){\n document.write(\'<p>Content is currently unavailable.</p><img src="//geo.yahoo.com/p?s=1197757039&t=\'+new Date().getTime()+\'&_R=\'+encodeURIComponent(document.referrer)+\'&err=404&err_url=\'+u+\'" width="0px" height="0px"/>\');\n}else{\n window.location.replace(u);\n}\n</script>\n<noscript><META http-equiv="refresh" content="0;URL=\'https://www.yahoo.com/?err=404&err_url=https%3a%2f%2ffinance.yahoo.com%2fquote%2fBA%2f%2fhistory%3fperiod1%3d1435975200%26period2%3d1593741599%26interval%3d1d%26frequency%3d1d%26filter%3dhistory\'"></noscript>\n</html>\n'
Try: VMW
Try: PFPT
Try: PANW
Try: VRSN
Try: FFIV
Try: JNPR
Try: LDOS
Try: 4704
Try: 4704: No data fetched for symbol 4704 using YahooDailyReader
Try: FEYE
Try: QLYS
Try: SAIC
Try: RPD
Try: HO
Try: HO: No data fetched for symbol HO using YahooDailyReader
Try: MIME
Try: SAIL
Try: VRNS
Try: ITRI
Try: AVST
Try: MANT
Try: TENB
Try: 053800
Try: 053800: No data fetched for symbol 053800 using YahooDailyReader
Try: ZIXI
Try: OSPN
Try: RDWR
Try: ULE
Try: MOBL
Try: ATEN
Try: TUFN
Try: RBBN
Try: NCC
Try: KRW
Try: KRW: No data fetched for symbol KRW using YahooDailyReader
Try: EUR
Try: JPY
Try: JPY: No data fetched for symbol JPY using YahooDailyReader
Try: GBP
Please note there are two types of errors:
when the ticker does not exist, for example "HO"
when the resulting URL is wrong due to the "/" in "BA/"
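One way to avoid the second kind of error, as a small sketch, is to filter out tickers that would break the URL before the loop (whether "BA/" should instead be mapped to some valid Yahoo symbol is left open here):
# Drop tickers containing "/" since they produce a malformed Yahoo URL
clean_assets = [a for a in assets if '/' not in a]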
Head of the resulting dataframe, df.head():
AUD CRWD SPLK OKTA ... NCC EUR GBP USD
Date ...
2015-11-03 51.500000 NaN 57.139999 NaN ... 3.45 NaN 154.220001 13.608685
2015-12-22 55.189999 NaN 54.369999 NaN ... 3.48 NaN 148.279999 13.924644
2015-12-23 55.560001 NaN 56.509998 NaN ... 3.48 NaN 148.699997 14.146811
2015-12-24 55.560001 NaN 56.779999 NaN ... 3.48 NaN 149.119995 14.324224
2015-12-28 56.270000 NaN 57.660000 NaN ... 3.48 NaN 148.800003 14.057305
[5 rows x 43 columns]
Hope this helps.
I have a pandas data frame that looks like:
High Low ... Volume OpenInterest
2018-01-02 983.25 975.50 ... 8387 67556
2018-01-03 986.75 981.00 ... 7447 67525
2018-01-04 985.25 977.00 ... 8725 67687
2018-01-05 990.75 984.00 ... 7948 67975
I calculate the Average True Range and save it into a series:
n = 14  # ATR period (see below)
i = 0
TR_l = [0]
while i < (df.shape[0] - 1):
    # TR = max(df.loc[i + 1, 'High'], df.loc[i, 'Close']) - min(df.loc[i + 1, 'Low'], df.loc[i, 'Close'])
    TR = max(df['High'][i+1], df['Close'][i]) - min(df['Low'][i+1], df['Close'][i])
    TR_l.append(TR)
    i = i + 1
TR_s = pd.Series(TR_l)
ATR = pd.Series(TR_s.ewm(span=n, min_periods=n).mean(), name='ATR_' + str(n))
With a 14-period rolling window (n = 14), ATR looks like:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 8.096064
14 7.968324
15 8.455205
16 9.046418
17 8.895405
18 9.088769
19 9.641879
20 9.516764
But when I do:
df = df.join(ATR)
The ATR column in df is all NaN. It's because the indexes are different between the data frame and ATR. Is there any way to add the ATR column into the data frame?
Consider shift to avoid the while loop across rows and list building. Below uses Union Pacific (UNP) railroad stock data to demonstrate:
import pandas as pd
import pandas_datareader as pdr
stock_df = pdr.get_data_yahoo('UNP').loc['2019-01-01':'2019-03-29']
# SHIFT DATA ONE DAY BACK AND JOIN TO ORIGINAL DATA
stock_df = stock_df.join(stock_df.shift(-1), rsuffix='_future')
# CALCULATE TR DIFFERENCE BY ROW
stock_df['TR'] = stock_df.apply(lambda x: max(x['High_future'], x['Close']) - min(x['Low_future'], x['Close']), axis=1)
# CALCULATE EWM MEAN
n = 14
stock_df['ATR'] = stock_df['TR'].ewm(span=n, min_periods=n).mean()
Output
print(stock_df.head(20))
# High Low Open Close Volume Adj Close High_future Low_future Open_future Close_future Volume_future Adj Close_future TR ATR
# Date
# 2019-01-02 138.320007 134.770004 135.649994 137.779999 3606300.0 137.067413 136.750000 132.169998 136.039993 132.679993 5684500.0 131.993790 5.610001 NaN
# 2019-01-03 136.750000 132.169998 136.039993 132.679993 5684500.0 131.993790 138.580002 134.520004 134.820007 137.789993 5649900.0 137.077362 5.900009 NaN
# 2019-01-04 138.580002 134.520004 134.820007 137.789993 5649900.0 137.077362 139.229996 136.259995 137.330002 138.649994 4034200.0 137.932907 2.970001 NaN
# 2019-01-07 139.229996 136.259995 137.330002 138.649994 4034200.0 137.932907 152.889999 149.039993 151.059998 150.750000 10558800.0 149.970337 14.240005 NaN
# 2019-01-08 152.889999 149.039993 151.059998 150.750000 10558800.0 149.970337 151.059998 148.610001 150.289993 150.360001 4284600.0 149.582352 2.449997 NaN
# 2019-01-09 151.059998 148.610001 150.289993 150.360001 4284600.0 149.582352 155.289993 149.009995 149.899994 154.660004 6444600.0 153.860123 6.279999 NaN
# 2019-01-10 155.289993 149.009995 149.899994 154.660004 6444600.0 153.860123 155.029999 153.089996 153.639999 153.210007 3845200.0 152.417618 1.940002 NaN
# 2019-01-11 155.029999 153.089996 153.639999 153.210007 3845200.0 152.417618 154.240005 151.649994 152.229996 153.889999 3507100.0 153.094101 2.590012 NaN
# 2019-01-14 154.240005 151.649994 152.229996 153.889999 3507100.0 153.094101 154.360001 151.740005 153.789993 152.479996 4685100.0 151.691391 2.619995 NaN
# 2019-01-15 154.360001 151.740005 153.789993 152.479996 4685100.0 151.691391 153.729996 150.910004 152.910004 151.970001 4053200.0 151.184021 2.819992 NaN
# 2019-01-16 153.729996 150.910004 152.910004 151.970001 4053200.0 151.184021 154.919998 150.929993 151.110001 154.639999 4075400.0 153.840210 3.990005 NaN
# 2019-01-17 154.919998 150.929993 151.110001 154.639999 4075400.0 153.840210 158.800003 155.009995 155.539993 158.339996 5003900.0 157.521072 4.160004 NaN
# 2019-01-18 158.800003 155.009995 155.539993 158.339996 5003900.0 157.521072 157.199997 154.410004 156.929993 155.020004 6052900.0 154.218262 3.929993 NaN
# 2019-01-22 157.199997 154.410004 156.929993 155.020004 6052900.0 154.218262 156.020004 152.429993 155.449997 154.330002 4858000.0 153.531830 3.590012 4.011254
# 2019-01-23 156.020004 152.429993 155.449997 154.330002 4858000.0 153.531830 160.759995 156.009995 160.039993 160.339996 9222400.0 159.510742 6.429993 4.376440
# 2019-01-24 160.759995 156.009995 160.039993 160.339996 9222400.0 159.510742 162.000000 160.220001 161.460007 160.949997 7770700.0 160.117584 1.779999 3.991223
# 2019-01-25 162.000000 160.220001 161.460007 160.949997 7770700.0 160.117584 160.789993 159.339996 160.000000 159.899994 3733800.0 159.073013 1.610001 3.643168
# 2019-01-28 160.789993 159.339996 160.000000 159.899994 3733800.0 159.073013 160.929993 158.750000 160.039993 160.169998 3436900.0 159.341614 2.179993 3.432011
# 2019-01-29 160.929993 158.750000 160.039993 160.169998 3436900.0 159.341614 161.889999 159.440002 161.089996 160.820007 4112200.0 159.988266 2.449997 3.291831
# 2019-01-30 161.889999 159.440002 161.089996 160.820007 4112200.0 159.988266 160.990005 157.020004 160.750000 159.070007 7438600.0 158.247314 3.970001 3.387735
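Alternatively, if you just want to attach the ATR series you already computed despite the index mismatch, a minimal sketch (relying on the fact that the question's ATR has exactly one value per row of df) is to realign its index before joining:
# The question's ATR has a 0..N-1 RangeIndex while df is indexed by date,
# so join() finds no matching labels; realigning the index fixes that.
ATR.index = df.index   # assumes len(ATR) == len(df)
df = df.join(ATR)
# or, bypassing index alignment entirely:
# df['ATR_14'] = ATR.values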
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
java = pickle.load(open('JavaSafe.p','rb')) ##import 2d array
python = pickle.load(open('PythonSafe.p','rb')) ##import 2d array
javaFrame = pd.DataFrame(java,columns=['Town','Java Jobs'])
pythonFrame = pd.DataFrame(python,columns=['Town','Python Jobs'])
javaFrame = javaFrame.sort_values(by='Java Jobs',ascending=False)
pythonFrame = pythonFrame.sort_values(by='Python Jobs',ascending=False)
print(javaFrame,"\n",pythonFrame)
This code comes out with the following:
Town Java Jobs
435 York,NY 3593
212 NewYork,NY 3585
584 Seattle,WA 2080
624 Chicago,IL 1920
301 Boston,MA 1571
...
79 Holland,MI 5
38 Manhattan,KS 5
497 Vernon,IL 5
30 Clayton,MO 5
90 Waukegan,IL 5
[653 rows x 2 columns]
Town Python Jobs
160 NewYork,NY 2949
11 York,NY 2938
349 Seattle,WA 1321
91 Chicago,IL 1312
167 Boston,MA 1117
383 Hanover,NH 5
209 Bulverde,TX 5
203 Salisbury,NC 5
67 Rockford,IL 5
256 Ventura,CA 5
[416 rows x 2 columns]
I want to make a new dataframe that uses the town names as an index and has a column each for Java and Python. However, some of the towns will only have results for one of the languages.
import pandas as pd
javaFrame = pd.DataFrame({'Java Jobs': [3593, 3585, 2080, 1920, 1571, 5, 5, 5, 5, 5],
'Town': ['York,NY', 'NewYork,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Holland,MI', 'Manhattan,KS', 'Vernon,IL', 'Clayton,MO', 'Waukegan,IL']}, index=[435, 212, 584, 624, 301, 79, 38, 497, 30, 90])
pythonFrame = pd.DataFrame({'Python Jobs': [2949, 2938, 1321, 1312, 1117, 5, 5, 5, 5, 5],
'Town': ['NewYork,NY', 'York,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Hanover,NH', 'Bulverde,TX', 'Salisbury,NC', 'Rockford,IL', 'Ventura,CA']}, index=[160, 11, 349, 91, 167, 383, 209, 203, 67, 256])
result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town')
# Java Jobs Python Jobs
# Town
# York,NY 3593.0 2938.0
# NewYork,NY 3585.0 2949.0
# Seattle,WA 2080.0 1321.0
# Chicago,IL 1920.0 1312.0
# Boston,MA 1571.0 1117.0
# Holland,MI 5.0 NaN
# Manhattan,KS 5.0 NaN
# Vernon,IL 5.0 NaN
# Clayton,MO 5.0 NaN
# Waukegan,IL 5.0 NaN
# Hanover,NH NaN 5.0
# Bulverde,TX NaN 5.0
# Salisbury,NC NaN 5.0
# Rockford,IL NaN 5.0
# Ventura,CA NaN 5.0
pd.merge will by default join two DataFrames on all columns shared in common. In this case, javaFrame and pythonFrame share only the Town column, so by default pd.merge joins the two DataFrames on the Town column.
how='outer' causes pd.merge to use the union of the keys from both frames. In other words, it causes pd.merge to return rows whose data come from either javaFrame or pythonFrame, even if only one DataFrame contains the Town. Missing data is filled with NaNs.
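For clarity, you can spell out those defaults explicitly; this call is equivalent to the one above:
result = pd.merge(javaFrame, pythonFrame, on='Town', how='outer').set_index('Town')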
Use pd.concat
df = pd.concat([df.set_index('Town') for df in [javaFrame, pythonFrame]], axis=1)
Java Jobs Python Jobs
Boston,MA 1571.0 1117.0
Bulverde,TX NaN 5.0
Chicago,IL 1920.0 1312.0
Clayton,MO 5.0 NaN
Hanover,NH NaN 5.0
Holland,MI 5.0 NaN
Manhattan,KS 5.0 NaN
NewYork,NY 3585.0 2949.0
Rockford,IL NaN 5.0
Salisbury,NC NaN 5.0
Seattle,WA 2080.0 1321.0
Ventura,CA NaN 5.0
Vernon,IL 5.0 NaN
Waukegan,IL 5.0 NaN
York,NY 3593.0 2938.0
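Note that pd.concat aligns the frames on their shared Town index, which is why the combined rows come out sorted alphabetically by town rather than in the original order.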