Scraping an EDGAR HTML file and converting it into a dataframe - Python

I'm new to scraping websites.
import pandas as pd

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'
df = pd.read_html(url, parse_dates=[0])[0]
print(df.head())
This is my code. I want to extract all the data from this page, but the result is always just the first table:
                              0    1    2           3    4
0                           NaN  NaN  NaN         NaN  NaN
1  Collection Period Beginning:  NaN  NaN  08/01/2020  NaN
2     Collection Period Ending:  NaN  NaN  08/31/2020  NaN
3  Previous Payment/Close Date:  NaN  NaN  08/17/2020  NaN
4                  Payment Date  NaN  NaN  09/15/2020  NaN
How can I get all of the remaining tables?

pd.read_html returns a list of all the tables on the page. You are reading only the first one, so you get a single df.
Try:
df = pd.read_html(url, parse_dates=[0])
df1 = df[0]
df2 = df[1]
... and so on to read the df at each index. df holds the list, and you can access its elements at each index.
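For instance, here is a minimal sketch that fetches the page once and walks every table it contains. The User-Agent value is an assumption on my part; EDGAR is known to reject clients that do not identify themselves, so substitute your own details:

import io

import pandas as pd
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'

# Fetch the HTML ourselves so we can send a User-Agent header.
# The value below is a placeholder; use your own name/email.
html = requests.get(url, headers={'User-Agent': 'Sample Name sample@example.com'}).text

tables = pd.read_html(io.StringIO(html))  # one DataFrame per <table> on the page
print(f'found {len(tables)} tables')
for i, table in enumerate(tables):
    print(f'--- table {i} ---')
    print(table.head())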

Related

Python - Pandas - dropna(subset) deleting values for no apparent reason?

I'm cleaning some data and I've been struggling with one thing.
I have a dataframe with 7740 rows and 68 columns.
Most of the columns contain NaN values.
What I'm interested in is removing rows where the value is NaN in both of these two columns: SERIAL_ID and NUMBER_ID.
Example:
SERIAL_ID  NUMBER_ID
8RY68U4R   NaN
8756ERT5   8759321
NaN        NaN
NaN        7896521
7EY68U4R   NaN
95856ERT5  988888
NaN        NaN
NaN        4555555
Results:
SERIAL_ID  NUMBER_ID
8RY68U4R   NaN
8756ERT5   8759321
NaN        7896521
7EY68U4R   NaN
95856ERT5  988888
NaN        4555555
That is, removing rows where NaN is in both of the two columns.
I've used the following to do so:
df.dropna(subset=['SERIAL_ID', 'NUMBER_ID'], how='all', inplace=True)
When I use this on my dataframe with 68 columns, this is the result I get:
SERIAL_ID  NUMBER_ID
NaN        NaN
NaN        NaN
NaN        NaN
NaN        7896521
NaN        NaN
95856ERT5  NaN
NaN        NaN
NaN        4555555
I tried with a copy of the dataframe with only 3 columns, and it works fine.
It is somehow working (I can tell because I have an identical ID in another column) but it removes some of the values, and I have no idea why.
Please help, I've been struggling with this the whole day.
Thanks again.
I don't know why it works for 3 columns and not for the original 68.
However, we can obtain the desired output another way.
Use boolean indexing:
df[df[['SERIAL_ID', 'NUMBER_ID']].notnull().any(axis=1)]
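For example, here is a minimal runnable sketch built on sample data reconstructed from the example above (the values are an assumption, not your actual data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'SERIAL_ID': ['8RY68U4R', '8756ERT5', np.nan, np.nan],
                   'NUMBER_ID': [np.nan, 8759321, np.nan, 7896521]})

# keep rows where at least one of the two ID columns is non-null
filtered = df[df[['SERIAL_ID', 'NUMBER_ID']].notnull().any(axis=1)]
print(filtered)
#   SERIAL_ID  NUMBER_ID
# 0  8RY68U4R        NaN
# 1  8756ERT5  8759321.0
# 3       NaN  7896521.0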
You can use boolean logic, or simply do something like this for any given pair of columns:
import numpy as np
import pandas as pd

# sample dataframe
d = {'SERIAL_ID': ['8RY68U4R', '8756ERT5', np.nan, np.nan],
     'NUMBER_ID': [np.nan, 8759321, np.nan, 7896521]}
df = pd.DataFrame(d)

# flag rows where both columns are null (boolean multiplication acts as AND)
df['nans'] = df['NUMBER_ID'].isnull() * df['SERIAL_ID'].isnull()

# keep only the rows that are not flagged
df_filtered = df[df['nans'] == False]
print(df_filtered)
which returns this:
  SERIAL_ID  NUMBER_ID   nans
0  8RY68U4R        NaN  False
1  8756ERT5  8759321.0  False
3       NaN  7896521.0  False

Extracting values into a new column

I have a column in a dataframe as follows:
Data
[special_request=nowhiterice, waiter=Janice]
[allegic=no, waiter=Janice, tip=20]
[allergic=no, tip=20]
[special_request=nogreens]
May I know how I could make it such that each key becomes its own column?
special_request allegic waiter tip
You can make a dictionary by splitting the elements of your Series and build your DataFrame from it (s being your column here):
import pandas as pd

s = pd.Series([['special_request=nowhiterice', 'waiter=Janice'],
               ['allegic=no', 'waiter=Janice', 'tip=20'],
               ['allergic=no', 'tip=20'],
               ['special_request=nogreens']])
df = pd.DataFrame([dict(e.split('=') for e in row) for row in s])
print(df)
Output:
  special_request  waiter allegic  tip allergic
0     nowhiterice  Janice     NaN  NaN      NaN
1             NaN  Janice      no   20      NaN
2             NaN     NaN     NaN   20       no
3        nogreens     NaN     NaN  NaN      NaN
Edit: if the column values are actual strings, you first should split each string (also stripping [, ], and whitespace):
s = pd.Series(['[special_request=nowhiterice, waiter=Janice]',
               '[allegic=no, waiter=Janice, tip=20]',
               '[allergic=no, tip=20]',
               '[special_request=nogreens]'])
df = pd.DataFrame([dict(map(str.strip, e.split('=')) for e in row.strip('[]').split(','))
                   for row in s])
print(df)
You can split each string value into a dict, then use pd.json_normalize to convert the dicts to columns.
df_ = pd.json_normalize(df['Data'].apply(
    lambda x: dict([map(str.strip, i.split('=')) for i in x.strip("[]").split(',')])))
print(df_)
  special_request  waiter allegic  tip allergic
0     nowhiterice  Janice     NaN  NaN      NaN
1             NaN  Janice      no   20      NaN
2             NaN     NaN     NaN   20       no
3        nogreens     NaN     NaN  NaN      NaN

Check whether a dataframe cell contains a value that is in another dataframe's cell

I'm trying to do the following:
Given a row in df1, if str(row['code']) appears in any row of df2['code'], then I would like all those rows in df2['lamer_url_1'] and df2['shopee_url_1'] to take the corresponding values from df1.
Then carry on with the next row of df1['code']...
'''
==============
Initial tables:

df1
         code lamer_url_1 shopee_url_1
0  L61B18H089           b            a
1  L61S19H014           e            d
2  L61S19H015           z            y

df2
               code lamer_url_1 shopee_url_1 lamer_url_2 shopee_url_2
0  L61B18H089-F1424         NaN          NaN         NaN          NaN
1  L61S19H014-S1500         NaN          NaN         NaN          NaN
2  L61B18H089-F1424         NaN          NaN         NaN          NaN

==============
Expected output:

df2
               code lamer_url_1 shopee_url_1 lamer_url_2 shopee_url_2
0  L61B18H089-F1424           b            a         NaN          NaN
1  L61S19H014-S1500           e            d         NaN          NaN
2  L61B18H089-F1424           b            a         NaN          NaN
'''
I assumed that the common part of "code" in df2 is the characters before the "-". I also assumed that from df1 we want 'lamer_url_1' and 'shopee_url_1', and from df2 we want 'lamer_url_2' and 'shopee_url_2' (correct me in a comment if I am wrong so I can polish the code):
df1.set_index(df1['code'], inplace=True)
df2.set_index(df2['code'].apply(lambda x: x.split('-')[0]), inplace=True)
df2.index.names = ['code_join']
df3 = pd.merge(df2[['code', 'lamer_url_2', 'shopee_url_2']],
               df1[['lamer_url_1', 'shopee_url_1']],
               left_index=True, right_index=True)
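As an alternative sketch (built on the same before-the-dash assumption, and using the question's sample values), you could fill df2 in place with Series.map instead of a merge:

import pandas as pd

df1 = pd.DataFrame({'code': ['L61B18H089', 'L61S19H014', 'L61S19H015'],
                    'lamer_url_1': ['b', 'e', 'z'],
                    'shopee_url_1': ['a', 'd', 'y']})
df2 = pd.DataFrame({'code': ['L61B18H089-F1424', 'L61S19H014-S1500', 'L61B18H089-F1424']})

key = df2['code'].str.split('-').str[0]   # join key: everything before "-"
lookup = df1.set_index('code')
for col in ['lamer_url_1', 'shopee_url_1']:
    df2[col] = key.map(lookup[col])       # pull the matching value from df1
print(df2)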

When I convert my numpy array to a DataFrame it updates values to NaN

import pandas as pd
import impyute.imputation.cs as imp

print(Data)
Data = pd.DataFrame(data=imp.em(Data), columns=columns)
print(Data)
When I run the above code, all my values get converted to NaN as below. Can someone help me figure out where I am going wrong?
Before
Time LymphNodeStatus ... MeanPerimeter TumorSize
0 31 5.0 ... 117.50 5.0
1 61 2.0 ... 122.80 3.0
2 116 0.0 ... 137.50 2.5
3 123 0.0 ... 77.58 2.0
4 27 0.0 ... 135.10 3.5
5 77 0.0 ... 84.60 2.5
After
Time LymphNodeStatus ... MeanPerimeter TumorSize
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
Edited
Solution first
Instead of passing columns to pd.DataFrame, just assign the column names manually:
data = pd.DataFrame(imp.em(data))
data.columns = columns
Cause
The error lies in Data = pd.DataFrame(data=imp.em(Data), columns=columns).
imp.em has a decorator @preprocess which converts the input into a numpy.array if it is a pandas.DataFrame:
...
if pd_DataFrame and isinstance(args[0], pd_DataFrame):
    args[0] = args[0].as_matrix()
return pd_DataFrame(fn(*args, **kwargs))
It therefore returns a dataframe reconstructed from a matrix, with range(data.shape[1]) as the column names.
And, as I point out below, when pd.DataFrame is instantiated from another pd.DataFrame with mismatching columns, all the contents become NaN.
You can test this with:
from impyute.util import preprocess

@preprocess
def test(data):
    return data

data = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})
columns = data.columns
data = pd.DataFrame(test(data), columns=columns)
   size  time
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
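Applying the fix from the top of this answer to the same test case restores the values (a sketch reusing the test helper defined above, so it assumes impyute is installed):

data = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})
columns = data.columns

fixed = pd.DataFrame(test(data))  # keep the default integer column labels
fixed.columns = columns           # then relabel manually instead of reindexing
print(fixed)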
When you instantiate a pd.DataFrame from an existing pd.DataFrame, the columns argument specifies which of the original dataframe's columns you want to use. It does not re-label the dataframe. This is not odd, just the way pandas intends reindexing to work:
By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.
# make a new pseudo dataset
data = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})
data
   size  time
0     3     1
1     2     2
2     1     3

# make a new dataset from the original `data`
data = pd.DataFrame(data, columns=["a", "b"])
data
    a   b
0 NaN NaN
1 NaN NaN
2 NaN NaN
There may be some bug in the impyute library. You are using the em function, which is just a way to fill missing values via the expectation-maximization algorithm. You can try without that function:
df = pd.DataFrame(data=Data, columns=columns)
You can raise this issue here after confirming. To confirm, first load the data as in the example above and check whether null data is present using the df.isnull() method.
Data = pd.DataFrame(data=np.array(imp.em(Data)), columns=columns)
Doing this solved the issue I was facing; I guess the em function doesn't return a plain numpy array.

Python/Pandas - drop_duplicates selecting the most complete row

I have a dataframe with information about people. However, sometimes these people get repeated, and some rows have more info about the same person than others. Is there a way to drop the duplicates using the 'Name' column as reference, but keep only the most complete rows?
If you have a dataframe like
df = pd.DataFrame([['a', np.nan, np.nan, 'M'],
                   ['a', 12, np.nan, 'M'],
                   ['c', np.nan, np.nan, 'M'],
                   ['d', np.nan, np.nan, 'M']],
                  columns=['Name', 'Age', 'Region', 'Gender'])
then sorting the rows by NaN count and dropping duplicates on subset 'Name', keeping the first, might help, i.e.
df['count'] = pd.isnull(df).sum(axis=1)
df = df.sort_values(['count']).drop_duplicates(subset=['Name'], keep='first').drop(columns='count')
Output:
Before:
Name Age Region Gender
0 a NaN NaN M
1 a 12.0 NaN M
2 c NaN NaN M
3 d NaN NaN M
After:
Name Age Region Gender
1 a 12.0 NaN M
2 c NaN NaN M
3 d NaN NaN M
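If the goal is instead to combine information across the duplicated rows rather than keep the single most complete one, one alternative worth noting is GroupBy.first(), which takes the first non-null value per column within each group. Note this is a different technique from the answer above, and it can mix values from different rows of the same person:

import numpy as np
import pandas as pd

df = pd.DataFrame([['a', np.nan, np.nan, 'M'],
                   ['a', 12, np.nan, 'M'],
                   ['c', np.nan, np.nan, 'M'],
                   ['d', np.nan, np.nan, 'M']],
                  columns=['Name', 'Age', 'Region', 'Gender'])

# first non-null value per column within each Name group
combined = df.groupby('Name', as_index=False).first()
print(combined)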
