If I have a list of headers and I am using pandas:
[u'GAME_ID', u'TEAM_ID', u'TEAM_ABBREVIATION', u'TEAM_CITY', u'PLAYER_ID', u'PLAYER_NAME', u'START_POSITION', u'COMMENT', u'MIN', u'SPD', u'DIST', u'ORBC', u'DRBC', u'RBC', u'TCHS', u'SAST', u'FTAST', u'PASS', u'AST', u'CFGM', u'CFGA', u'CFG_PCT', u'UFGM', u'UFGA', u'UFG_PCT', u'FG_PCT', u'DFGM', u'DFGA', u'DFG_PCT']
Why is the printed output truncated, like the following?
PLAYER_NAME START_POSITION COMMENT MIN SPD ... CFGM CFGA \
0 Billy Bob G 37:42 4.12 5 ... 5 12
Why does pandas skip the other stats, even though my code states:
output= pd.DataFrame(data, columns=stts)
print output
It's done on purpose; the truncation is controlled by pandas' Options and Settings.
You can change it through display.max_columns, which defaults to 20; display.max_colwidth additionally controls how much of each value is printed. The pandas documentation lists all of the display options and their defaults.
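For example, a quick sketch of lifting the limit before printing (None removes the column cap entirely; the commented lines stand in for your own data and stts variables):
import pandas as pd

pd.set_option('display.max_columns', None)   # show all columns instead of the default 20
pd.set_option('display.max_colwidth', 100)   # optionally widen each printed value

# output = pd.DataFrame(data, columns=stts)
# print(output)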
I am working on a Python script that automates some phone calls for me. I have a tool to test with that I can interact with via a REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my Excel document; I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the second column?
Obviously this can get a little tricky, since I would need to match up to 3-4 digits at the front of a phone number. My plan is to write a function that takes the initial number and returns the carrier that should be used for that country.
Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?
Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.
testlist.xlsx
| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os
FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')
FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')
for index, row in num_df.iterrows():
Any idea how I could extract this data from the table? How would I
make it so that if you entered Barbados (1246), then Lime is selected
instead of AT&T?
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
script.py
import pandas as pd
FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}
for row in rows_list:
    code_carrier_map[row["countryCode"]] = row["Carrier"]
print(type(code_carrier_map), code_carrier_map)
print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")
Output
$ python3 script.py
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'
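If you then want to apply that dictionary to the raw numbers from testlist.xlsx, here is a minimal prefix-matching sketch (my own addition, not part of script.py above): it tries the longest codes first so that 1246 wins over 1.
codes_desc = sorted(code_carrier_map, key=lambda c: len(str(c)), reverse=True)

def carrier_for(number):
    digits = str(number)
    for code in codes_desc:
        if digits.startswith(str(code)):
            return code_carrier_map[code]
    return None

print(carrier_for(12465555555))  # LIME
print(carrier_for(12135555555))  # AT&T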
Then if you want to parse phone numbers, don't reinvent the wheel; just use the phonenumbers library.
Code
import phonenumbers
num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")
Output
num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'
Let's assume the following input:
>>> df1
Number
0 8155555555
1 12465555555
2 12135555555
3 96655555555
4 525555555555
>>> df2
countryCode Carrier
0 1246 LIME
1 1 AT&T
2 81 Softbank
3 52 Telmex
4 966 Zain
First we need to rework df2 a bit: sort countryCode in descending order, convert it to string, and set it as the index.
The trick is that sorting countryCode in descending order ensures a longer country code, such as "1246", is matched before a shorter one like "1".
>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
Carrier
countryCode
1246 LIME
966 Zain
81 Softbank
52 Telmex
1 AT&T
Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index)) made from the country codes in descending order to extract the longest code, and we map it to the carrier:
(df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
)
output:
0 Softbank
1 LIME
2 AT&T
3 Zain
4 Telmex
Name: 0, dtype: object
NB. to add it to the initial dataframe:
df1['carrier'] = (df1.astype(str)['Number']
                  .str.extract('^(%s)'%'|'.join(df2.index))[0]
                  .map(df2['Carrier'])
                  )
output:
Number carrier
0 8155555555 Softbank
1 12465555555 LIME
2 12135555555 AT&T
3 96655555555 Zain
4 525555555555 Telmex
If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.
Extract the first characters of the Number column. Hint: the nbr_of_chars variable should be based on the maximum character length of the countryCode column in carriers.xlsx:
nbr_of_chars = 4
df.loc[df['Number'].notnull(), 'FirstCharsColumn'] = df['Number'].astype(str).str[:nbr_of_chars]
Then the matching should be fairly easy with dataframe joins.
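For instance, a rough sketch of that prefix-and-lookup idea (the toy frames below stand in for the two spreadsheets; with the real files you would use the frames read from testlist.xlsx and carriers.xlsx):
import pandas as pd

# toy stand-ins for the two spreadsheets; column names taken from the question
num_df = pd.DataFrame({'Number': ['8155555555', '12465555555', '12135555555']})
car_df = pd.DataFrame({'countryCode': ['1246', '1', '81'], 'Carrier': ['LIME', 'AT&T', 'Softbank']})

# try every possible prefix length, longest first, and keep the first hit
carrier_by_code = car_df.set_index('countryCode')['Carrier']
max_len = car_df['countryCode'].str.len().max()

result = pd.Series(index=num_df.index, dtype=object)
for length in range(max_len, 0, -1):
    prefix = num_df['Number'].str[:length]
    result = result.fillna(prefix.map(carrier_by_code))

num_df['Carrier'] = result
print(num_df)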
I can think only of an inefficient solution.
First, sort the carriers data frame (car_df from your code) by countryCode in reverse alphabetical order, treating the codes as strings. That way, longer prefixes such as "1246" come before shorter ones like "1".
codes = car_df.astype(str).sort_values('countryCode', ascending=False)
Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).
import numpy as np

def cc2carrier(num):
    matches = codes['countryCode'].apply(lambda x: num.startswith(x))
    if not matches.any():  # not found
        return np.nan
    return codes.loc[matches.idxmax()]['Carrier']
Now, apply the function to the numbers dataframe (num_df from your code), converting the numbers to strings first:
num_df['Number'].astype(str).apply(cc2carrier)
#0    Softbank
#1        LIME
#2        AT&T
#3        Zain
#4      Telmex
#Name: Number, dtype: object
This is my first time trying to use a lambda function; please help me figure out what I'm doing incorrectly. I wrote a function that returns a time zone based on a zip code. The function works, but I'm not sure how to use it in a lambda to create a new column in my dataframe.
import pandas as pd
from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
def find_tz(zip_code):
    try:
        tz = zcdb[zip_code].timezone
        return tz
    except:
        return '?'
data = [['Jane','92804'], ['Bob','75014'], ['Ashley','07650']]
df = pd.DataFrame(data, columns=['Contact','Zip'])
in: df
out:
Contact Zip
0 Jane 92804
1 Bob 75014
2 Ashley 07650
Do note that the zip code column data are strings, since US zip codes can have leading 0s.
Me testing that the function I wrote works on values from df:
in: print(find_tz(df.loc[0,'Zip']))
print(find_tz(df.loc[1,'Zip']))
print(find_tz(df.loc[2,'Zip']))
out:
-8
-6
-5
My attempt at using a lambda function to create a new Timezone column, and the incorrect result I am getting:
in: df = df.assign(Timezone = lambda x: find_tz(x.Zip))
df
out:
Contact Zip Timezone
0 Jane 92804 ?
1 Bob 75014 ?
2 Ashley 07650 ?
My desired resulting dataframe would look like:
Contact Zip Timezone
0 Jane 92804 -8
1 Bob 75014 -6
2 Ashley 07650 -5
ETA: when I changed my find_tz() function to something like concatenating the input with another string of text, the lambda worked as I expected, so I'm not sure what I've done wrong.
You can use:
df['Timezone'] = df.Zip.apply(find_tz)
When you call lambda x: find_tz(x.Zip), the find_tz function is passed a pandas Series, not the individual zip codes.
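If you would like to keep the assign style from your attempt, the same idea works there too; the lambda still receives the whole frame, so apply find_tz element-wise inside it (a small sketch, equivalent to the line above):
df = df.assign(Timezone=lambda x: x['Zip'].apply(find_tz))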
First of all, I have no background in computer languages and I am learning Python.
I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
I want to create a new data frame that shows 'city_number', 'city' (which is a name), and the number of cafes in that city. So it should have 3 columns: 'city_number', 'city' and 'number_of_cafe'.
However, when I tried to use groupby, the result did not come out as I expected.
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np
# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']]*5, [['bean_dream', 'boston', '3456']]*4, \
[['coffee_today', 'jersey', '7643']]*3, [['coffee_today', 'DC', '8902']]*3, \
[['starbucks', 'nowwhere', '2674']]*2]
places = [p for sub in places for p in sub]
# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop','city', 'id'])
# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city','id']].drop_duplicates()
# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:,0]
# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()
# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
city id cafe_count
0 new_york 1234 5
1 boston 3456 4
2 jersey 7643 3
3 DC 8902 3
4 nowwhere 2674 2
And part of the example dataframe, city_directory.tail(), looks like this:
shop city id
12 coffee_today DC 8902
13 coffee_today DC 8902
14 coffee_today DC 8902
15 starbucks nowwhere 2674
16 starbucks nowwhere 2674
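If all you need are the counts per city, a shorter route is groupby followed by size. This is sketched on the example city_directory above; with your real data the same pattern would be cafe_df_merged grouped on 'city_number' and 'city' with the count named 'number_of_cafe':
city_counts = (city_directory
               .groupby(['city', 'id'])
               .size()
               .reset_index(name='cafe_count'))
print(city_counts)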
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.
I am new to Python. I am trying to scrape the data on the page:
For example:
Category: grains
Organic: No
Commodity: Coarse
SubCommodity: Corn
Publications: Daily
Location: All
Refine Commodity: All
Dates: 07/31/2018 - 08/01/2019
Is there a way for Python to select these options on the webpage, click Run, and then click Download as Excel and store the Excel file?
Is it possible? I am new to coding and need some guidance here.
Currently, I enter the data manually, and on the resulting page I use Beautiful Soup to scrape the table. However, it takes a lot of time since the table is spread over more than 200 pages.
Using the query you defined as an example, I input the query manually and found the following URL for the Excel (really HTML) format:
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain=08%2F01%2F2019&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain=07%2F31%2F2018&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate=07%2F31%2F2018&endDate=08%2F01%2F2019&format=excel'
In the URL are parameters we can set in Python, and we could easily make a loop to change the parameters. For now, let me just get into the example of actually getting this data. I use the pandas.read_html to read this HTML result and populate a DataFrame, which could be thought of as a table with columns and rows.
import pandas as pd
# use URL defined earlier
# url = '...'
df_lst = pd.read_html(url, header=1)
Now df_lst is a list of DataFrame objects containing the desired data. For your particular example, this results in 30674 rows and 11 columns:
>>> df_lst[0].columns
Index([u'Report Date', u'Location', u'Class', u'Variety', u'Grade Description',
u'Units', u'Transmode', u'Low', u'High', u'Pricing Point',
u'Delivery Period'],
dtype='object')
>>> df_lst[0].head()
Report Date Location Class Variety Grade Description Units Transmode Low High Pricing Point Delivery Period
0 07/31/2018 Blytheville, AR YELLOW NaN US NO 2 Bushel Truck 3.84 3.84 Country Elevators Cash
1 07/31/2018 Helena, AR YELLOW NaN US NO 2 Bushel Truck 3.76 3.76 Country Elevators Cash
2 07/31/2018 Little Rock, AR YELLOW NaN US NO 2 Bushel Truck 3.74 3.74 Mills and Processors Cash
3 07/31/2018 Pine Bluff, AR YELLOW NaN US NO 2 Bushel Truck 3.67 3.67 Country Elevators Cash
4 07/31/2018 Denver, CO YELLOW NaN US NO 2 Bushel Truck-Rail 3.72 3.72 Terminal Elevators Cash
>>> df_lst[0].shape
(30674, 11)
Now, back to the point I made about the URL parameters--using Python, we can run through lists and format the URL string to our liking. For instance, iterating through 20 years of the given query can be done by modifying the URL to have numbers corresponding to positional arguments in Python's str.format() method. Here's a full example below:
import datetime
import pandas as pd
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={1}&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={0}&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate={0}&endDate={1}&format=excel'
start = [datetime.date(2018-i, 7, 31) for i in range(20)]
end = [datetime.date(2019-i, 8, 1) for i in range(20)]
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    df_lst = pd.read_html(url_get, header=1)
    #print(df_lst[0].head())  # uncomment to see first five rows
    #print(df_lst[0].shape)   # uncomment to see DataFrame shape
Be careful with pd.read_html. I've modified my answer with a header keyword argument to pd.read_html() because the multi-indexing made it a pain to get results. By giving a single row index as the header, it's no longer a multi-index, and data indexing is easy. For instance, I can get corn class using this:
df_lst[0]['Class']
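Since df_lst[0] is an ordinary DataFrame, the usual pandas operations apply; for example, a quick summary of the price columns (column names taken from the output above):
print(df_lst[0][['Low', 'High']].describe())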
Compiling all the reports into one large file is also easy with Pandas. Since we have a DataFrame, we can use the pandas.to_csv function to export our data as a CSV (or any other file type you want, but I chose CSV for this example). Here's a modified version with the additional capability of outputting a CSV:
import datetime
import pandas as pd
# URL
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={1}&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={0}&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate={0}&endDate={1}&format=excel'
# CSV output file and flag
csv_out = 'myreports.csv'
flag = True
# Start and end dates
start = [datetime.date(2018-i, 7, 31) for i in range(20)]
end = [datetime.date(2019-i, 8, 1) for i in range(20)]
# Iterate through dates and get report from URL
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    df_lst = pd.read_html(url_get, header=1)
    print(df_lst[0].head())   # show first five rows
    print(df_lst[0].shape)    # show DataFrame shape
    # Save to big CSV
    if flag is True:
        # 0th iteration, so write header and overwrite existing file
        df_lst[0].to_csv(csv_out, header=True, mode='w')  # change mode to 'wb' if Python 2.7
        flag = False
    else:
        # Subsequent iterations should append to file and not add a new header
        df_lst[0].to_csv(csv_out, header=False, mode='a')  # change mode to 'ab' if Python 2.7
Your particular query generates at least 1227 pages of data, so I trimmed it down to one location, Arizona (from 07/31/2018 to 08/01/2019), which now generates 47 pages of data. The XML size was about 500 KB.
You can semi automate like this:
>>> end_day='01'
>>> start_day='31'
>>> start_month='07'
>>> end_month='08'
>>> start_year='2018'
>>> end_year='2019'
>>> link = f"https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={end_month}%2F{end_day}%2F{end_year}&commDetail=All&endDateWeekly={end_month}%2F{end_day}%2F{end_year}&repMonth=1&repType=Daily&rtype=&use=&_use=1&fsize=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={start_month}%2F{start_day}%2F{start_year}+&runReport=true&grade=&regionsDesc=All+AR&subprimals=&mscore=&endYear={end_year}&repDateWeekly={start_month}%2F{start_day}%2F{start_year}&_wrange=1&endDateWeeklyGrain=&repYear={end_year}&loc=All+AR&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&category=Grain&organic=NO&commodity=Coarse&subComm=Corn&_mscore=0&_subprimals=1&_commDetail=1&cut=&endMonth=1&repDate={start_month}%2F{start_day}%2F{start_year}&endDate={end_month}%2F{end_day}%2F{end_year}&format=xml"
>>> link
'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain=08%2F01%2F2019&commDetail=All&endDateWeekly=08%2F01%2F2019&repMonth=1&repType=Daily&rtype=&use=&_use=1&fsize=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain=07%2F31%2F2018+&runReport=true&grade=&regionsDesc=All+AR&subprimals=&mscore=&endYear=2019&repDateWeekly=07%2F31%2F2018&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All+AR&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&category=Grain&organic=NO&commodity=Coarse&subComm=Corn&_mscore=0&_subprimals=1&_commDetail=1&cut=&endMonth=1&repDate=07%2F31%2F2018&endDate=08%2F01%2F2019&format=xml'
>>> import urllib.request
>>> with urllib.request.urlopen(link) as response:
...     html = response.read()
...
Loading the response can take a hot minute with large queries.
If you for some reason wished to process the entire data set, you could repeat this process, but you may want to look into techniques better suited to large data, perhaps a solution involving pandas together with numexpr (for fast, multi-threaded evaluation of array expressions).
You can find the data used in this answer here - which you can download as an xml.
First import your xml:
>>> import xml.etree.ElementTree as ET
you can either parse the bytes you downloaded in Python
>>> report = ET.fromstring(html)
or parse a manually downloaded file
>>> tree = ET.parse('report.xml')
>>> report = tree.getroot()
you can then do stuff like this:
>>> report[0][0]
<Element 'reportDate' at 0x7f902adcf368>
>>> report[0][0].text
'07/31/2018'
>>> for el in report[0]:
...     print(el)
...
<Element 'reportDate' at 0x7f902adcf368>
<Element 'location' at 0x7f902ac814f8>
<Element 'classStr' at 0x7f902ac81548>
<Element 'variety' at 0x7f902ac81b88>
<Element 'grade' at 0x7f902ac29cc8>
<Element 'units' at 0x7f902ac29d18>
<Element 'transMode' at 0x7f902ac29d68>
<Element 'bidLevel' at 0x7f902ac29db8>
<Element 'deliveryPoint' at 0x7f902ac29ea8>
<Element 'deliveryPeriod' at 0x7f902ac29ef8>
More info on parsing xml is here.
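If you eventually want these records in pandas, here is a small sketch (my own addition, not part of the original walkthrough) that flattens the XML into a DataFrame using the child tags shown above; nested elements such as bidLevel only keep their direct text this way:
>>> import pandas as pd
>>> rows = [{child.tag: child.text for child in record} for record in report]
>>> df = pd.DataFrame(rows)
From there you can filter, aggregate or export it just like the read_html result in the other answer.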
You're going to want to learn some Python, but hopefully you can make sense of the following. Luckily, there are many free Python tutorials online; here is a quick snippet to get you started.
#lets find the lowest bid on a certain day
>>> report[0][0]
<Element 'reportDate' at 0x7f902adcf368>
>>> report[0][0].text
'07/31/2018'
>>> report[0][7]
<Element 'bidLevel' at 0x7f902ac29db8>
>>> report[0][7][0]
<Element 'lowPrice' at 0x7f902ac29e08>
>>> report[0][7][0].text
'3.84'
#how many low bids are there?
>>> len(report)
1216
#get an average of the lowest bids...
>>> low_bid_list = [float(bid[7][0].text) for bid in report]
>>> low_bid_list
[3.84, 3.76, 3.74, 3.67, 3.65, 3.7, 3.5, 3.7, 3.61, ...]
>>> sum = 0
>>> for el in low_bid_list:
...     sum = sum + el
...
>>> sum
4602.599999999992
>>> sum/len(report)
3.7850328947368355
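As a side note, once the low bids are in a plain list, the standard library can do the averaging for you; a small sketch reusing low_bid_list from above:
>>> import statistics
>>> avg_low = statistics.mean(low_bid_list)   # essentially the same value as sum/len above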
Suppose I have the following DataFrames:
Containers:
Key ContainerCode Quantity
1 P-A1-2097-05-B01 0
2 P-A1-1073-13-B04 0
3 P-A1-2024-09-H05 0
5 P-A1-2018-08-C05 0
6 P-A1-2089-03-C08 0
7 P-A1-3033-16-H07 0
8 P-A1-3035-18-C02 0
9 P-A1-4008-09-G01 0
Inventory:
Key SKU ContainerCode Quantity
1 22-3-1 P-A1-4008-09-G01 1
2 2132-12 P-A1-3033-16-H07 55
3 222-12 P-A1-4008-09-G01 3
4 4561-3 P-A1-3083-12-H01 126
How do I update the Quantity values in Containers to reflect the number of units in each container based on the information in Inventory? Note that multiple SKUs can reside in a single ContainerCode, so we need to add to the quantity, rather than just replace it, and there may be multiple entries in Containers for a particular ContainerCode.
What are the possible ways to accomplish this, and what are their relative pros and cons?
EDIT
The following code seems to serve as a good test case:
import itertools
import pandas as pd
import numpy as np
inventory = pd.DataFrame({'Container Code':['A1','A2','A2','A4'],
'Quantity':[10,87,2,44],
'SKU':['123-456','234-567','345-678','456-567']})
containers = pd.DataFrame({'Container Code':['A1','A2','A3','A4'],
'Quantity':[2,0,8,4],
'Path Order':[1,2,3,4]})
summedInventory = inventory.groupby('Container Code')['Quantity'].sum()
print('Containers Data Frame')
print(containers)
print('\nInventory Data Frame')
print(inventory)
print('\nSummed Inventory List')
print(summedInventory)
print('\n')
newContainers = containers.drop('Quantity', axis=1). \
join(inventory.groupby('Container Code').sum(), on='Container Code')
print(newContainers)
This seems to produce the desired output.
I also tried using a regular merge:
pd.merge(containers.drop('Quantity', axis=1), \
summedInventory,how='inner',left_on='Container Code', right_index=True)
But that produces an 'IndexError: list index out of range'
Any ideas?
I hope I got your scenario correctly. I think you can use:
containers.drop('Quantity', axis = 1).\
join(inventory.groupby('ContainerCode').sum(), \
on = 'ContainerCode')
I'm first dropping Quantity from containers because you don't need it; we'll recreate it from inventory.
Then we group inventory by the container code, to sum the quantity for each container.
We then join the two, and each ContainerCode present in containers receives the summed quantity from inventory.
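Another option, which keeps the Quantity column in place, is to map the summed inventory onto containers. This is a sketch against the 'Container Code' test-case frames from your edit (not the join approach above); containers with no inventory fall back to 0:
summed = inventory.groupby('Container Code')['Quantity'].sum()
containers['Quantity'] = containers['Container Code'].map(summed).fillna(0).astype(int)
print(containers)
Compared with the drop/join approach, this leaves the other columns untouched and makes the fill value for empty containers explicit.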