dataframe.empty gives "True" result, which should be "False" imo - python

Using pandas, I search for an ID in a column of a CSV file, and I want the code to check whether the resulting dataframe is empty.
When I try the standalone code in Python 3 it works fine. However, incorporated into the main code, the value returned is always "True". This is the strangest thing I have encountered so far.
def lookupID(ID):
    ID_df = df[df['ID'] == ID]
    ID_test = ID_df.empty
    return ID_test

df = pd.read_csv(mainFile, usecols=range(0, 19))

for container in containers:
    ID = container.a["data-id"]
    if not lookupID(ID):
        code
    else:
        code
I hope someone knows what the hell is up with this one.
Some example rows with the headers included:
ID,merk,soort,prijs,bouwjaar,km-stand,transmissie,brandstof,inhoud,vermogen,plaatsnaam,distance,datum_begin,datum,prijs_oud,carroserie,kleur,titel,URL
124471366,BMW,2 Serie,19900,2015,25089,Handgeschakeld,Benzine,1499,100,Nieuwerkerk Ad Ijssel (ZH),116240,08/07/2019,08/07/2019,19900,MPV,Grijs,Active Tourer 218i High Executive Sport Line
124538604,BMW,1-serie,8900,2006,66914,Automaat,Benzine,1995,110,Heerlen (LI),134876,08/07/2019,08/07/2019,8900,Hatchback,Blauw,120i Anniversary / Automaat / Navigatie
124538344,BMW,3-serie,2900,2012,108390,Automaat,Diesel,1995,121,Haarlem (NH),131073,08/07/2019,08/07/2019,2900,Sedan,Grijs,320D Sedan High Exe-LEER-FINANCIAL LEASE VA 269| - PMND
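One plausible cause (an assumption, since the surrounding scraping code isn't shown): pd.read_csv infers the ID column as integers, while container.a["data-id"] comes out of the HTML as a string, so the comparison never matches and .empty is always True. A minimal sketch:

```python
import pandas as pd

# IDs as read_csv would infer them from the sample rows: int64
df = pd.DataFrame({'ID': [124471366, 124538604, 124538344]})

ID = "124471366"                      # HTML attribute values are strings
print(df[df['ID'] == ID].empty)       # True: an int column never equals a str

print(df[df['ID'] == int(ID)].empty)  # False once the types agree
```

If this is the cause, casting with int(ID) (or reading the column with dtype=str) before the lookup should fix it.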

Related

How do I filter out elements in a column of a data frame based on whether they are in a list?

I'm trying to filter bogus locations out of a column in a data frame. The column is filled with locations taken from tweets, and some of them aren't real; I am trying to separate them from the valid locations. Below is the code I have. However, it is not producing the right output: it only ever returns France. I'm hoping someone can identify what I'm doing wrong, or suggest another way to try. Let me know if I didn't explain it well enough. Also, I assign variables both outside and inside the function for testing purposes.
import pandas as pd

cn_csv = pd.read_csv("~/Downloads/cntry_list.csv")  # a list of every country with its alpha-2 and alpha-3 codes; see the link below to download the csv
country_names = cn_csv['country']
results = pd.read_csv("~/Downloads/results.csv")  # a dataframe with multiple columns, one being "Source Location"; see the edit below for its data
src_locs = results["Source Location"]
locs_to_list = list(src_locs)
new_list = [entry.split(', ') for entry in locs_to_list]

def country_name_check(input_country_list):
    cn_csv = pd.read_csv("~/Downloads/cntrylst.csv")
    country_names = cn_csv['country']
    results = pd.read_csv("~/Downloads/results.csv")
    src_locs = results["Source Location"]
    locs_to_list = list(src_locs)
    new_list = [entry.split(', ') for entry in locs_to_list]
    valid_names = []
    tobe_checked = []
    for i in new_list:
        if i in country_names.values:
            valid_names.append(i)
        else:
            tobe_checked.append(i)
    return valid_names, tobe_checked

print(country_name_check(src_locs))
EDIT 1: Adding the link for the cntry_list.csv file. I downloaded the csv of the table data. https://worldpopulationreview.com/country-rankings/country-codes
Since I am unable to share a file on here, here is the "Source Location" column data:
Source Location
She/her
South Carolina, USA
Torino
England, UK
trying to get by
Bemidiji, MN
St. Paul, MN
Stockport, England
Liverpool, England
EH7
DLR - LAX - PDX - SEA - GEG
Barcelona
Curitiba
kent
Paris, France
Moon
Denver, CO
France
If your goal is to find and list country names, both valid and not, you may filter the initial results DataFrame:
# make a list of the unique Source Location values that match country_names
valid_names = list(results[results['Source Location']
                           .isin(country_names)]['Source Location']
                   .unique())

# with ~, select the unique values that don't match country_names
tobe_checked = list(results[~results['Source Location']
                            .isin(country_names)]['Source Location']
                    .unique())
This simpler approach should also get rid of the unwanted result of only France being returned. However, the problem in your code may be the reading of cntrylst.csv inside the function (note the filename differs from the cntry_list.csv read outside it), as indicated by ScottC.
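A self-contained version of that approach, using a few of the sample locations above and a hypothetical three-country country_names:

```python
import pandas as pd

# hypothetical stand-ins for results.csv and cntry_list.csv
results = pd.DataFrame({'Source Location': [
    'She/her', 'South Carolina, USA', 'Paris, France',
    'Barcelona', 'France', 'France']})
country_names = pd.Series(['France', 'Spain', 'Brazil'], name='country')

valid_names = list(results[results['Source Location']
                           .isin(country_names)]['Source Location'].unique())
tobe_checked = list(results[~results['Source Location']
                            .isin(country_names)]['Source Location'].unique())

print(valid_names)   # ['France']
print(tobe_checked)  # ['She/her', 'South Carolina, USA', 'Paris, France', 'Barcelona']
```

Note that isin matches whole values, so 'Paris, France' stays in tobe_checked; splitting on ', ' and checking the last token would be a separate step.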

Python: Extract unique sentences from string and place them in new column concatenated with ;

Afternoon Stackoverflowers,
I have been challenged with some data extraction, as I am trying to prepare data for some users.
As I was advised, it is very hard to do in SQL because there is no clear pattern, so I tried some things in Python, but without success (I am still learning Python).
Problem statement:
My SQL query output is either an Excel or a text file (depending on how I publish it; I can do it both ways). I have a field (the fourth column in the Excel or text file) which contains either one or multiple rejection reasons (see the example below), separated by commas. At the same time, commas are sometimes also used within the errors themselves.
Field example without any modification
INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]
Desired output:
Invalid Address information; Company Id on the Invoice does not match with the Company Id registered for the Code in System; Incorrect Purchase order
Python code:
import pandas
excel_data_df = pandas.read_excel('rejections.xlsx')
# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())
excel_data_df['Invoice_Issues'].split(":", 1)
Output:
INVOICE_CREATION_FAILED[Invalid Address information:
I tried splitting the string, but it doesn't work properly. It deletes everything after the colon, which is acceptable because it is coded that way; however, I would need to trim the string of the unnecessary data and keep only the desired output for each line.
I would be very thankful for any code suggestions on how to trim the output so that I extract only what is needed from that string, if the substring is available.
In Excel I would normally use a list of errors and nested IF functions with FIND and MATCH, but I am not sure how to do it in Python...
Many thanks,
Greg
This isn't the fastest way to do this, but in Python, speed is rarely the most important thing.
Here, we manually create a dictionary of the errors and what you would want them to map to, then we iterate over the values in Invoice_Issues and use the dictionary to check whether each error description is present.
import pandas

# create a standard dataframe with some errors
excel_data_df = pandas.DataFrame({
    'Invoice_Issues': [
        'INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]']
})

# build a dictionary
# (just like "words" to their "meanings", this maps
# "error keys" to their "error descriptions")
errors_dictionary = {
    'E01': 'Invalid Address information',
    'E02': 'Incorrect Purchase order',
    'E03': 'Invalid VAT ID',
    # ....
    'E39': 'No tax line available'
}

def extract_errors(invoice_issue):
    # take the entry in the Invoice_Issues column and
    # loop over the dictionary to see if each error is in there;
    # if so, add it to list_of_errors
    list_of_errors = []
    for error_number, error_description in errors_dictionary.items():
        if error_description in invoice_issue:
            list_of_errors.append(error_description)
    return ';'.join(list_of_errors)

# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())

# for every row in the Invoice_Issues column, run the extract_errors
# function and store the value in the 'Errors' column
excel_data_df['Errors'] = excel_data_df['Invoice_Issues'].apply(extract_errors)

# display the dataframe with the extracted errors
print(excel_data_df.to_string())

excel_data_df.to_excel('extracted_errors.xlsx')
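Since the desired output above separates reasons with "; ", the same idea can also be written as a compact variant (known_errors here is an illustrative subset, not the real error catalogue):

```python
import pandas as pd

# hypothetical sample mirroring the question's field
df = pd.DataFrame({'Invoice_Issues': [
    'INVOICE_CREATION_FAILED[Invalid Address information: ... at line 1 , '
    'Incorrect Purchase order: ... at line 1 ]']})

known_errors = ['Invalid Address information',
                'Incorrect Purchase order',
                'No tax line available']

# keep every known error description that occurs in the field
df['Errors'] = ['; '.join(e for e in known_errors if e in issue)
                for issue in df['Invoice_Issues']]

print(df['Errors'][0])  # Invalid Address information; Incorrect Purchase order
```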

YFinance - tickerData.info not working for some stocks

import yfinance as yf
#define the ticker symbol
tickerSymbol = "AFT.NZ"
#get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
print(tickerData.info)
This doesn't seem to work; it raises IndexError: list index out of range.
Replace "AFT.NZ" with "MSFT" or "FPH.NZ" and it works fine. Going to the Yahoo website, I can't see why it wouldn't have data on it.
What's more confusing is that replacing print(tickerData.info) with tickerDf = tickerData.history(period='max') does print some of the data.
I need the info because I want the full name of the company along with the currency the shares are traded in, which is why just having the price data isn't the solution.
AFT.NZ is just an example; most others on the NZX50 seem to have the same problem.
There's a pull request from williamsiuhang to fix this that's currently 9 days old.
https://github.com/ranaroussi/yfinance/pull/371/commits/7e137357296a1df177399d26543e889848efc021
I just made the change manually myself, go into base.py (in your_py_dir\Lib\site-packages\yfinance) and change line 286:
old line:
self._institutional_holders = holders[1]
new line:
self._institutional_holders = holders[1] if len(holders) > 1 else []
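Until a fix lands upstream, another option is a call-site workaround that catches the IndexError instead of patching base.py. A sketch, illustrated with a stand-in object so it runs without a network call:

```python
def safe_info(ticker_obj):
    """Return ticker_obj.info, or {} when the lookup fails with IndexError."""
    try:
        return ticker_obj.info
    except IndexError:
        return {}

# stand-in for yf.Ticker("AFT.NZ"), whose .info raises IndexError
class BrokenTicker:
    @property
    def info(self):
        raise IndexError("list index out of range")

print(safe_info(BrokenTicker()))  # {}
```

This keeps the site-packages install untouched, at the cost of silently returning no info for affected tickers.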
I've had the same problem, and see there's a lot of posts on github with the same error.
I've fixed the error with a try & except in the base.py file for yfinance
Line 282:
# holders
try:
    url = "{}/{}/holders".format(self._scrape_url, self.ticker)
    holders = _pd.read_html(url)
    self._major_holders = holders[0]
    self._institutional_holders = holders[1]
    if 'Date Reported' in self._institutional_holders:
        self._institutional_holders['Date Reported'] = _pd.to_datetime(
            self._institutional_holders['Date Reported'])
    if '% Out' in self._institutional_holders:
        self._institutional_holders['% Out'] = self._institutional_holders[
            '% Out'].str.replace('%', '').astype(float)/100
except:
    print("institutional_holders error")
Not a great solution, but it makes it run for me. I'm not a great programmer, so I hope the problem gets fixed in a more delicate way by the developer(s).

Error in For Loop Logic for a Column in Pandas Python 3

I have a list of locations. For every location in the location column, a function finds its coordinates if they are not there already; this operation is performed for all of them. The loop copies the last latitude and longitude into all the rows, which it shouldn't do. Where am I making the mistake?
What I have
location gpslat gpslong
Brandenburger Tor na na
India Gate na na
Gateway of India na na
What I want
location gpslat gpslong
Brandenburger Tor 52.16 13.37
India Gate 28.61 77.22
Gateway of India 18.92 72.81
What I get
location gpslat gpslong
Brandenburger Tor 18.92 72.81
India Gate 18.92 72.81
Gateway of India 18.92 72.81
My Code
i = 0
for location in df.location_clean:
    try:
        if np.isnan(float(df.gpslat.iloc[i])) == True:
            df.gpslat.iloc[i], df.gpslong.iloc[i] = find_coordinates(location)
            print(i, "Coordinates Found for --->", df.location_clean.iloc[i])
        else:
            print(i, 'Coordinates Already Present')
    except:
        print('Spelling Mistake Encountered at', df.location_clean.iloc[i], 'Moving to NEXT row')
        continue
    i = i + 1
I guess I am making a logical error either with the index i or with the statement df.gpslat.iloc[i], df.gpslong.iloc[i] = find_coordinates(location). I tried changing them and rerunning the loop, but the result is the same. It's also a time-consuming process, as there are thousands of locations.
It is hard to help without seeing the data, but this might help you.
In the future, please provide a minimal example of your data so we can work with it and help you better.
Furthermore, you should never use a bare except without supplying the exact error: in this case your except catches ALL errors, so even when something else goes wrong besides your "spelling error", you won't notice it!
When iterating over a dataframe, use iterrows(): it makes the loop a lot more readable and you don't have to track an extra index variable.
Using iloc for assignment also opens you up to pandas' SettingWithCopyWarning (see here: https://www.dataquest.io/blog/settingwithcopywarning/); try to avoid it.
Here is the code:
# ____ Preparation ____
import pandas as pd
import numpy as np

lst = [['name1', 1, 3],
       ['name2', 1, 3],
       ['name3', None, None],
       ['name4', 1, 3]]
df = pd.DataFrame(lst, columns=['location', 'gpslat', 'gpslong'])
print(df.head())

# ____ YOUR CODE ____
for index, row in df.iterrows():
    try:
        if np.isnan(float(row['gpslat'])) == True:
            lat, long = find_coordinates(row['location'])
            print(lat, long)
            df.at[index, 'gpslat'] = lat
            df.at[index, 'gpslong'] = long
    except TypeError:  # exchange this with the exact error which you want to catch
        print('Spelling Mistake Encountered at', row['location'], ' in row ', index, 'Moving to NEXT row')
        continue
print(df.head())
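To run that sketch end to end, find_coordinates can be stubbed with a hypothetical lookup table (the real function is the asker's geocoder):

```python
import pandas as pd
import numpy as np

# hypothetical stand-in for the asker's geocoding function
known_coords = {'India Gate': (28.61, 77.22), 'Gateway of India': (18.92, 72.81)}

def find_coordinates(location):
    return known_coords[location]   # raises KeyError for unknown locations

df = pd.DataFrame([['Brandenburger Tor', 52.16, 13.37],
                   ['India Gate', None, None],
                   ['Gateway of India', None, None]],
                  columns=['location', 'gpslat', 'gpslong'])

for index, row in df.iterrows():
    try:
        if np.isnan(float(row['gpslat'])):
            df.at[index, 'gpslat'], df.at[index, 'gpslong'] = find_coordinates(row['location'])
    except KeyError:
        print('Unknown location', row['location'], 'in row', index)
        continue

print(df)   # each row keeps its own coordinates, not the last pair
```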

How to see what get_fundamentals is adding to universe

I have a simple Python algorithm that searches the market for companies meeting specific financial criteria.
Code:
num_stocks = 2
fundamental_df = get_fundamentals(
    query(
        fundamentals.valuation.market_cap,
        fundamentals.income_statement.net_income
    )
    .filter(fundamentals.income_statement.net_income > 1)
    .order_by(fundamentals.valuation.market_cap.desc())
    .limit(num_stocks)
)
update_universe(fundamental_df.columns.values)
context.stocks = [stock for stock in fundamental_df]
for stock in context.stocks:
    print stock
I want to see whether it is working and what companies it is querying.
I tried having my algo print what's inside context.stocks, but it's not working.
Could someone help me out?
Thanks
Mike
