Python Pandas - Automating Data Gathering from a Website

I'm trying to make a script that reads a table of company names from one website and inserts each company's name into a URL (the URL exists and contains more data specific to each company; that data is what I want to analyze).
However, I cannot get the names into the URL without Python also inserting parts of the table, which gives me the error below:
import numpy as np
import pandas as pd
import requests

url1 = "http://openinsider.com/latest-penny-stock-buys"
df1 = pd.read_html(url1)
table = df1[11]

# sorting
n = np.quantile(table["Qty"], [0.99])
print("20th percentile: ", n)
q = table.sort_values("Qty", ascending=False)
name = q["Ticker"].str.replace(r"\d+", "")
page = requests.get(url1)
name = table["Ticker"]

# Buyers for the company
url = "http://openinsider.com/"
for entry in name:  # <- Question starts here
    name = entry + 1
    table2 = pd.read_html(url + str(name))
    df2 = table2[11]
    print(df2)
Error:
InvalidURL: URL can't contain control characters. '/0 OPK\n1 VEII\n2 NGM\n3 STRR\n4 IMRA\n ... \n95 NaN\n96 CDXC\n97 PED\n98 FOA\n99 CAMP\nName: Ticker, Length: 100, dtype: object' (found at least ' ')
Thanks!

In your for-loop:
remove name = entry + 1
replace url+str(name) with url + entry
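For reference, a minimal sketch of the corrected loop (same variables as in your snippet):

url = "http://openinsider.com/"
for entry in name:
    # entry is already the ticker string, so append it to the base URL directly
    table2 = pd.read_html(url + entry)
    df2 = table2[11]
    print(df2)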
And so, you get the expected output printed:
X Filing Date Trade Date Ticker Insider Name Title \
0 NaN 2022-10-21 19:03:38 2022-10-19 MIST Wills Robert James Dir
1 NaN 2022-10-21 19:02:50 2022-10-20 MIST Pasternak Richard C Dir
2 M 2022-10-21 19:02:01 2022-10-20 MIST Liebert Debra K. Dir
3 NaN 2022-10-21 19:01:16 2022-10-19 MIST Tomsicek Michael John Dir
4 NaN 2022-09-09 16:15:34 2022-09-07 MIST Rtw Investments, LP 10%
5 NaN 2022-09-09 16:15:34 2022-09-07 MIST Rtw Investments, LP 10%
6 D 2022-06-01 21:32:38 2022-05-31 MIST Truex Paul F Dir
Trade Type Price Qty Owned ΔOwn Value 1d 1w 1m 6m
0 P - Purchase $4.93 15000 15000 New +$73,950 NaN NaN NaN NaN
1 P - Purchase $5.20 10000 10000 New +$52,000 NaN NaN NaN NaN
2 P - Purchase $5.28 14000 14127 >999% +$73,940 NaN NaN NaN NaN
3 P - Purchase $5.32 15000 15000 New +$79,800 NaN NaN NaN NaN
...

Related

Why is pd.crosstab not giving the expected output in python pandas?

I have 2 dataframes, which I am calling df1 and df2.
df1 has columns KPI and Context and looks like this:
KPI Context
0 Does the company have a policy in place to man... Anti-Bribery Policy\nBroadridge does not toler...
1 Does the company have a supplier code of conduct? Vendor Code of Conduct Our vendors play an imp...
2 Does the company have a grievance/complaint ha... If you ever have a question or wish to report ...
3 Does the company have a human rights policy ? Human Rights Statement of Commitment Broadridg...
4 Does the company have a policies consistent wi... Anti-Bribery Policy\nBroadridge does not toler...
df2 has a single column, 'Keyword':
Keyword
0 1.5 degree
1 1.5°
2 2 degree
3 2°
4 accident
I want to create another dataframe out of these two dataframes wherein, if a particular value from the 'Keyword' column of df2 is present in the 'Context' of df1, I simply write its count.
For this I have used pd.crosstab(), however I suspect it's not giving me the expected output.
Here's what I have tried so far:
new_df = df1.explode('Context')
new_df1 = df2.explode('Keyword')
new_df = pd.crosstab(new_df['KPI'], new_df1['Keyword'], values=new_df['Context'], aggfunc='count').reset_index().rename_axis(columns=None)
print(new_df.head())
The new_df looks like this:
KPI 1.5 degree 1.5° \
0 Does the Supplier code of conduct cover one or... NaN NaN
1 Does the companies have sites/operations locat... NaN NaN
2 Does the company have a due diligence process ... NaN NaN
3 Does the company have a grievance/complaint ha... NaN NaN
4 Does the company have a grievance/complaint ha... NaN NaN
2 degree 2° accident
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 1.0 NaN NaN
4 NaN NaN NaN
The expected output I want is something like this:
0 KPI 1.5 degree 1.5° 2 degree 2° accident
1 Does the company have a policy in place to man 44 2 3 5 9
What exactly am I missing? Please let me know, thanks!
There are multiple problems. First, explode works with list-like values that have already been split, not with plain strings. Then, to extract the keywords from Context you need Series.str.findall, and for crosstab you should use columns of the same DataFrame, not of 2 different ones:
import re

# build one regex that matches any keyword as a whole word
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
# find every keyword occurrence in each Context cell (a list per row), case-insensitively
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
# one row per matched keyword
new_df = df1.explode('new')
# count matches per KPI/keyword pair
out = pd.crosstab(new_df['KPI'], new_df['new'])
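As a small end-to-end illustration of the same pipeline (a sketch on made-up frames, not the asker's real data):

import re
import pandas as pd

df1 = pd.DataFrame({
    'KPI': ['Does the company have a climate policy?',
            'Does the company track accidents?'],
    'Context': ['We target 1.5 degree warming and model 2 degree scenarios.',
                'Every accident is logged and every accident is reviewed.'],
})
df2 = pd.DataFrame({'Keyword': ['1.5 degree', '2 degree', 'accident']})

pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
new_df = df1.explode('new')
out = pd.crosstab(new_df['KPI'], new_df['new'])
print(out)
# first KPI counts '1.5 degree' and '2 degree' once each; second KPI counts 'accident' twice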

How can we iterate within a particular row with a known index in a pandas data frame?

I have a data frame named df_cp which has the data as below.
I need to insert a new project name for CompanyID 'LCM' at the first empty cell in the row with index 1. I have found the index of the row I am interested in using this:
index_row = df_cp[df_cp['CompanyID']=='LCM'].index
How can I iterate within the row with index_row as 1? The task is to replace the first NaN at index 1 with "Healthcare".
Please help with this.
IIUC, you can use isna and idxmax:
df.loc[1, df.loc[1].isna().idxmax()] = 'Healthcare'
Output:
CompanyID Project01 Project02 Project03 Project04 Project05
0 134 oil furniture NaN NaN NaN
1 LCM oil furniture car Healthcare NaN
2 Z01 oil furniture NaN NaN NaN
3 453 oil furniture agro meat NaN
Note: idxmax returns the index of the first occurrence of the maximum value; on the boolean mask produced by isna(), that is the label of the first NaN column.
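As a tiny illustration of that behaviour (made-up values, not the actual frame):

import pandas as pd

row = pd.Series({'Project01': 'oil', 'Project02': 'car', 'Project03': None, 'Project04': None})
print(row.isna().idxmax())   # -> 'Project03', the label of the first missing cell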
More generalized:
m = df['CompanyID'] == 'LCM'
df.loc[m, df[m].isna().idxmax(axis=1)] = 'Healthcare'
df
Output:
CompanyID Project01 Project02 Project03 Project04 Project05
0 134 oil furniture NaN NaN NaN
1 LCM oil furniture car Healthcare NaN
2 Z01 oil furniture NaN NaN NaN
3 453 oil furniture agro meat NaN

How to remove rows in a Python data frame with a condition?

I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains strings, then that row should be deleted from the data frame.
I have tried:
df[['Leaves','Salary','Performance']].apply(pd.to_numeric, errors='coerce')
but this will convert those values to NaN.
Let's start with a note concerning your sample data:
It contains 'Nan' strings, which are not among the strings automatically recognized as NaN.
To treat them as NaN, I read the source text with read_fwf, passing na_values=['Nan'].
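A minimal sketch of that loading step, assuming the sample table is saved in a plain-text file called data.txt (a hypothetical file name):

import pandas as pd

# read the fixed-width sample, treating the literal string 'Nan' as a missing value
df = pd.read_fwf('data.txt', na_values=['Nan'])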
And now let's get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values.
You also accept a cell if it contains only the string unknown, but you don't accept a cell if that word is enclosed in, e.g., quotes.
If you change your mind about what is / is not acceptable, change the
above function accordingly.
Then, to leave only rows with all acceptable values in all 3 mentioned
columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
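Putting it together, a short usage sketch (assuming df was loaded as above):

mask = df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)
clean = df[mask].reset_index(drop=True)   # reset the index to match the expected output
print(clean)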

How to shift the values of a certain group by different amounts

I have a DataFrame that looks like this:
user data
0 Kevin 1
1 Kevin 3
2 Sara 5
3 Kevin 23
...
And I want to get the historical values (looking let's say 2 entries forward) as rows:
user data data_1 data_2
0 Kevin 1 3 23
1 Sara 5 24 NaN
2 Kim ...
...
Right now I'm able to do this through the following command:
_temp = df.groupby(['user'], as_index=False)['data']
for i in range(1, 2):
    data['data_{0}'.format(i)] = _temp.shift(-1)
I feel like my approach is very inefficient and that there is a much faster way to do this (esp. when the number of lookahead/lookback values goes up)!
You can use groupby.cumcount() with set_index() and unstack():
m = (df.assign(k=df.groupby('user').cumcount().astype(str))
       .set_index(['user', 'k'])
       .unstack())
m.columns = m.columns.map('_'.join)
print(m)
data_0 data_1 data_2
user
Kevin 1.0 3.0 23.0
Sara 5.0 NaN NaN
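If you want the first column to be named data rather than data_0, as in the desired output, one option (a small addition, not part of the original answer) is to rename it and bring user back as a regular column:

m = m.rename(columns={'data_0': 'data'}).reset_index()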

Writing tabular data to a csv file from a webpage

I've written a script in Python to parse some data from a webpage and write it to a csv file via pandas. So far what I've written can parse all the tables available on that page, but when writing to a csv file it only writes the last table from that page. The data are definitely being overwritten because of the loop. How can I fix this flaw so that my scraper writes the data from all the tables instead of only the last one? Thanks in advance.
import csv
import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get('http://www.espn.com/nba/schedule/_/date/20171001').text
soup = BeautifulSoup(res, "lxml")
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]
    df.to_csv("table_item.csv")
    print(df)
Btw, I expect to write data to a csv file using pandas only. Thanks again.
You can use read_html, which returns a list of the DataFrames found in the webpage, and then concat them into one df:
dfs = pd.read_html('http://www.espn.com/nba/schedule/_/date/20171001')
df = pd.concat(dfs, ignore_index=True)
#if necessary rename columns
d = {'Unnamed: 1':'a', 'Unnamed: 7':'b'}
df = df.rename(columns=d)
print(df.head())
matchup a time (ET) nat tv away tv home tv \
0 Atlanta ATL Miami MIA NaN NaN NaN NaN
1 LA LAC Toronto TOR NaN NaN NaN NaN
2 Guangzhou Guangzhou Washington WSH NaN NaN NaN NaN
3 Charlotte CHA Boston BOS NaN NaN NaN NaN
4 Orlando ORL Memphis MEM NaN NaN NaN NaN
tickets b
0 2,401 tickets available from $6 NaN
1 284 tickets available from $29 NaN
2 2,792 tickets available from $2 NaN
3 2,908 tickets available from $6 NaN
4 1,508 tickets available from $3 NaN
And finally to_csv to write to a file:
df.to_csv("table_item.csv", index=False)
EDIT:
For learning purposes, it is possible to append each DataFrame to a list and then concat:
res = requests.get('http://www.espn.com/nba/schedule/_/date/20171001').text
soup = BeautifulSoup(res,"lxml")
dfs = []
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
print(df)
df.to_csv("table_item.csv")
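If you would rather keep one csv per table instead of concatenating them, a small variation (a sketch using a hypothetical file-name pattern) is to number the output files inside the loop so nothing gets overwritten:

for i, table in enumerate(soup.find_all("table")):
    df = pd.read_html(str(table))[0]
    df.to_csv("table_item_{}.csv".format(i), index=False)  # one file per table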
