Instantiating a pandas DataFrame in a for loop - python

I am trying to create a condition using a for loop and an if statement for a pandas DataFrame. To specify exactly which rows to extract from the data table under a given condition, I looked up the row indices and defined a query before the for loop. The specifics look something like this:
import pandas as pd
input_csv_file = "./CSV/Officers_and_Shareholders.csv"
df = pd.read_csv(input_csv_file, skiprows=10, on_bad_lines='skip')
df.fillna('', inplace=True)
# df.drop([0, 3], inplace=True)
df.columns = ['Nama', 'Jabatan', 'Alamat', 'Klasifikasi Saham', 'Jumlah Lembar Saham', 'Total']
# print(df.shape)
# print(df.columns)
# print(df.iloc[:53])
# shareholders = df.iloc[24:42]
# print(shareholders)
# officers = df.iloc[0:23]
# print(officers)
dataframe = df.query("Total.ne('-')")
def get_shareholder_by_row_index():
    for column in df.columns:
        if object(df.iloc[column][:53]) == dataframe:
            shareholders = df.iloc[24:42]
            print(shareholders)
        # elif object(df[:53][column]) != dataframe:
        #     officers = df.iloc[0:23]
        #     print(officers)
Because the format of the CSV file is not proper, I forced the DataFrame to re-create a header on top of the original CSV file, which I assign via df.columns. The slices df.iloc[24:42] and df.iloc[0:23] locate the desired data ranges in the DataFrame, but nothing is returned when they are used inside the for loop. In short, I want a function that returns the officers if the row under the column Total is empty (-), and the shareholders if the row under Total is not empty. How should I modify the for loop and the if statement?
The desired output for shareholders will be:
24 PT CTCORP INFRASTRUKTUR D INDONESIA, ... Rp. 3.200.000.000
25 Nomor SK :- I ...
26 JalanKaptenPierreTendeanKavling12-14A ...
27 PT INTRERPORT PATIMBAN AGUNG, ... Rp. 2.900.000.000
28 Nomor SK :- ...
29 ...
30 ...
31 ...
32 ...
33 ...
34 PT PATIMBAN MAJU BERSAMA, ... Rp. 2.900.000.000
35 Nomor SK :AHU- ...
36 0061318.AH.01.01.TAHUN 2021 ...
37 Tanggal SK :30 September 2021 ...
38 ...
39 ...
40 PT TERMINAL PETIKEMAS ... Rp. 1.000.000.000
41 SURABAYA, ...
42 Nomor SK :- ...
and for the officers, it will return:
Nama ... Total
1 NIK: 3171060201830005 ...
2 NPWP: 246383541071000 ...
3 TTL: Jakarta, 02 Januari 1983 ...
5 NIK: 1271121011700003 ...
6 NPWP: 070970173112000 ...
7 TTL: Bogor, 10 November 1970 ...
8 ARLAN SEPTIA ANANDA ...
9 RASAM, ...
10 NIK: 3174051209620003 ...
11 NPWP: 080878200013000 ...
12 TTL: Jakarta, 12 September ...
13 1962 ...
15 NIK: 3171011605660004 ...
16 NPWP: 070141650093000 ...
17 TTL: Jakarta, 16 Mei 1966 ...
18 FUAD RIZAL, ...
21 PURNOMO, UTAMA RASRINIK: 3578032408610001 ...
22 NPWP: 097468813615000 ...
23 TTL: SLEMAN, 24 Agustus 1961 ...

Shareholders and officers will be printed with respect to the index (row number). If this is not the desired answer, please add a little more detail.
def get_shareholder_by_row_index():
    for i in range(len(df)):
        # shareholders if the row under Total is empty, otherwise officers
        if df["Total"][i] == '':
            print(i, "shareholders")
            print(df.iloc[i])
            # whatever your code is, it will be here
        else:
            print(i, "officers")
            print(df.iloc[i])
            # whatever your code is, it will be here

# this will give you the indices where the row under Total is empty
print(df["Total"].iloc[:53][df["Total"] == ''])

Related

Web scraping with Python and Pandas - Pagination

With this short code I can get data from the table:
import pandas as pd
df=pd.read_html('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular',parse_dates=True)
df[0].to_csv('2023_I_M_800.csv')
I am trying to get data from all pages, or a determined number of them, but since this website doesn't use ul or li elements, I don't know exactly how to build it.
Any help or idea would be appreciated.
Since the url contains the page number, why not just make a loop and concat?
import pandas as pd

F, L = 1, 4  # first and last pages

dico = {}
for page in range(F, L + 1):
    url = f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    sub_df.insert(0, "page_number", page)
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
# out.to_csv('2023_I_M_800.csv') # <- uncomment this line to make a .csv
NB: You can access each sub_df separately by using key-indexing notation: dico[num_page].
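For instance, to preview a single page's table once the loop above has run:
print(dico[2].head())  # first rows of the page-2 table only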
Output:
print(out)
page_number Rank ... Date Results Score
0 1 1 ... 22 JAN 2023 1230
1 1 2 ... 22 JAN 2023 1204
2 1 3 ... 29 JAN 2023 1204
3 1 4 ... 27 JAN 2023 1192
4 1 5 ... 28 JAN 2023 1189
.. ... ... ... ... ...
395 4 394 ... 21 JAN 2023 977
396 4 394 ... 28 JAN 2023 977
397 4 398 ... 27 JAN 2023 977
398 4 399 ... 28 JAN 2023 977
399 4 399 ... 29 JAN 2023 977
[400 rows x 11 columns]
Try this:
for page in range(1, 10):
    df = pd.read_html(f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular', parse_dates=True)
    df[0].to_csv(f'2023_I_M_800_page_{page}.csv')

Is there any example of code in Python with which I can get a table of numbers from the ranges in the first table?

In my first table I have columns: indeks, il, start and stop. The last two define a range. I need to list (in a new table) all numbers in the range from start to stop, but also save indeks and the other values belonging to the range.
This table shows what kind of data I have (sample):
ID   Indeks   Start   Stop   il
0    A1       1       3      25
1    B1       31      55     5
2    C1       36      900    865
3    D1       900     2500   20
...  ...      ...     ...    ...
And this is the table I want to get:
Indeks   Start   Stop   il    kod
A1       1       3      25    1
A1       1       3      25    2
A1       1       3      25    3
B1       31      55     5     31
B1       31      55     5     32
B1       31      55     5     33
...      ...     ...    ...   ...
B1       31      55     5     53
B1       31      55     5     54
B1       31      55     5     55
C1       36      900    865   36
C1       36      900    865   37
C1       36      900    865   38
...      ...     ...    ...   ...
C1       36      900    865   898
C1       36      900    865   899
C1       36      900    865   900
...      ...     ...    ...   ...
EDITED:
lidy = pd.read_excel('path')
lid = pd.DataFrame(lidy)
output = []
for i in range(0, len(lid)):
    for j in range(lid.iloc[i, 1], lid.iloc[i, 2] + 1):
        y = (lid.iloc[i, 0], j)
        output.append(y)
print(output)
OR
lidy = pd.read_excel('path')
lid = pd.DataFrame(lidy)
for i in range(0, len(lid)):
    for j in range(lid.iloc[i, 1], lid.iloc[i, 2] + 1):
        y = (lid.iloc[i, 0], j)
        print(y)
Two options:
(1 - preferred) Use Pandas (in combination with openpyxl as engine): The Excel file I'm using is named data.xlsx, and sheet Sheet1 contains your data. Then this
import pandas as pd

df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
df["kod"] = df[["Start", "Stop"]].apply(
    lambda row: range(row.iat[0], row.iat[1] + 1), axis=1
)
df = df.iloc[:, 1:].explode("kod", ignore_index=True)
with pd.ExcelWriter("data.xlsx", mode="a", if_sheet_exists="replace") as writer:
    df.to_excel(writer, sheet_name="Sheet2", index=False)
should produce the required output in sheet Sheet2. The work is done by putting the required range()s in the new column kod, and then .explode()-ing it.
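To see the range-plus-explode mechanics in isolation, here is a minimal sketch on inline data (column names taken from the question):
import pandas as pd

demo = pd.DataFrame({"Indeks": ["A1", "B1"], "Start": [1, 31], "Stop": [3, 33]})
# one range() object per row, spanning Start..Stop inclusive
demo["kod"] = demo.apply(lambda r: range(r["Start"], r["Stop"] + 1), axis=1)
# explode repeats each input row once per number in its range
print(demo.explode("kod", ignore_index=True))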
(2) Use only openpyxl:
from openpyxl import load_workbook

wb = load_workbook(filename="data.xlsx")
ws = wb["Sheet1"]
rows = ws.iter_rows(values_only=True)
# Reading the required column names
data = [list(next(rows)[1:]) + ["kod"]]
for row in rows:
    # Read the input data (a row)
    base = list(row[1:])
    # Create the new data via iterating over the given range
    data.extend(base + [n] for n in range(base[1], base[2] + 1))
if "Sheet2" in wb.sheetnames:
    del wb["Sheet2"]
ws_new = wb.create_sheet(title="Sheet2")
for row in data:
    ws_new.append(row)
wb.save("data.xlsx")

Appending a dictionary to a dataframe as a new column

I'm very new to Python and was hoping to get some help. I am following an online example where the author creates a dictionary, adds some data to it and then appends this to his original dataframe.
When I follow the code the data in the dictionary doesn't get appended to the dataframe and as such I can't continue with the example.
The authors code is as follows:
from collections import defaultdict
won_last = defaultdict(int)

for index, row in data.iterrows():
    home_team = row['HomeTeam']
    visitor_team = row['AwayTeam']
    row['HomeLastWin'] = won_last[home_team]
    row['VisitorLastWin'] = won_last[visitor_team]
    results.ix[index] = row
    won_last[home_team] = row['HomeWin']
    won_last[visitor_team] = not row['HomeWin']
When I run this code I get the following error message (note that the name of the dataframe is different, but apart from that nothing has changed):
AttributeError Traceback (most recent call last)
<ipython-input-46-d31706a5f745> in <module>
4 row['HomeLastWin'] = won_last[home_team]
5 row['VisitorLastWin'] = won_last[visitor_team]
----> 6 data.ix[index]=row
7 won_last[home_team] = row['HomeWin']
8 won_last[visitor_team] = not row['HomeWin']
~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5137 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5138 return self[name]
-> 5139 return object.__getattribute__(self, name)
5140
5141 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'ix'
If I change the line data.ix[index]=row to data.loc[index]=row, the code runs OK but nothing happens to my dataframe.
Below is an example of the dataset I am working with
Div Date Time HomeTeam AwayTeam FTHG FTAG FTR HomeWIn
E0 12/09/2020 12:30 Fulham Arsenal 0 3 A FALSE
E0 12/09/2020 15:00 Crystal Palace Southampton 1 0 H FALSE
E0 12/09/2020 17:30 Liverpool Leeds 4 3 H TRUE
E0 12/09/2020 20:00 West Ham Newcastle 0 2 A TRUE
E0 13/09/2020 14:00 West Brom Leicester 0 3 A FALSE
and below is the dataset of the example I am working through with the columns added
Date Visitor Team VisitorPts Home Team HomePts HomeWin
20 01/11/2013 Milwaukee 105 Boston 98 FALSE
21 01/11/2013 Miami Heat 100 Brooklyn 101 TRUE
22 01/11/2013 Clevland 84 Charlotte 90 TRUE
23 01/11/2013 Portland 113 Denver 98 FALSE
24 01/11/2013 Dallas 91 Houston 113 TRUE
HomeLastWin VisitorLastWIn
FALSE FALSE
FALSE FALSE
FALSE TRUE
FALSE FALSE
TRUE TRUE
Thanks
Jon
Could you please try this:
Data used as dataset_stack.csv:
from collections import defaultdict
won_last = defaultdict(int)

# Load the Pandas library with alias 'pd'
import pandas as pd

# Read data from file 'dataset_stack.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
data = pd.read_csv("dataset_stack.csv")
results = pd.DataFrame(data=data)
# print(results)
# Preview the first 5 lines of the loaded data
# data.head()

# pre-create the new columns, then fill them row by row
results['HomeLastWin'] = 0
results['VisitorLastWin'] = 0
for index, row in data.iterrows():
    home_team = row['HomeTeam']
    visitor_team = row['VisitorTeam']
    # results.ix[index] = row
    # results.loc[index] = row
    # write the new values directly into the dataframe by index instead of
    # modifying the row copy and appending it to the dataframe
    results.at[index, 'HomeLastWin'] = won_last[home_team]
    results.at[index, 'VisitorLastWin'] = won_last[visitor_team]
    won_last[home_team] = row['HomeWin']
    won_last[visitor_team] = not row['HomeWin']
print(results)
Output:
Date VisitorTeam VisitorPts HomeTeam HomePts HomeWin \
0 1/11/2013 Milwaukee 105 Boston 98 False
1 1/11/2013 Miami Heat 100 Brooklyn 101 True
2 1/11/2013 Clevland 84 Charlotte 90 True
3 1/11/2013 Portland 113 Denver 98 False
4 1/11/2013 Dallas 91 Houston 113 True
HomeLastWin VisitorLastWin
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
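For larger datasets, the same "won last game" flags can be computed without iterrows at all. A minimal sketch of that alternative, assuming a boolean HomeWin column and the HomeTeam/VisitorTeam names used above:
# one row per team appearance, in game order
home = pd.DataFrame({'game': data.index, 'team': data['HomeTeam'],
                     'won': data['HomeWin'], 'role': 'home'})
away = pd.DataFrame({'game': data.index, 'team': data['VisitorTeam'],
                     'won': ~data['HomeWin'], 'role': 'away'})
long = pd.concat([home, away]).sort_values('game', kind='stable')

# for each team, look one appearance back: did it win its previous game?
long['last_win'] = long.groupby('team')['won'].shift(fill_value=False)

# map the shifted flags back onto the original rows
results['HomeLastWin'] = long[long['role'] == 'home'].set_index('game')['last_win']
results['VisitorLastWin'] = long[long['role'] == 'away'].set_index('game')['last_win']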

Converting a List of DataFrames to Tuples, Maintaining DataFrame Structure

After reading in 18 CSV files and appending them all to a list, so that (first two are displayed):
In [175] li
Out[175]:
[ ABC_IncomeStatement_Annual_As_Originally_Reported ... TTM
0 Gross Profit ... 903,400,000
1 Total Revenue ... 903,400,000
2 Business Revenue ... 902,700,000
3 Other Revenue ... NaN
4 Operating Income/Expenses ... -280,200,000
.. ... ... ...
57 Basic EPS ... 2.56
58 Diluted EPS ... 2.56
59 Basic WASO ... 193,576,187
60 Diluted WASO ... 193,576,187
61 Fiscal year ends in Jun 30 | AUD ... NaN
[62 rows x 12 columns],
DEF_IncomeStatement_Annual_As_Originally_Reported ... TTM
0 Gross Profit ... 1,321,800,000
1 Total Revenue ... 1,347,600,000
2 Business Revenue ... 1,347,600,000
3 Other Revenue ... NaN
4 Cost of Revenue ... -25,800,000
.. ... ... ...
63 Basic EPS ... 0.07
64 Diluted EPS ... 0.07
65 Basic WASO ... 2,316,707,932
66 Diluted WASO ... 2,316,707,932
67 Fiscal year ends in Dec 31 | AUD ... NaN
Where len(li) = 18. I have then taken the list of tickers using:
tickers = []
for code in range(0, len(li)):
    tickers.append(li[code].columns[0][:3])
['ABC',
'DEF',
etc.]
test_tuple = list(zip(tickers,li))
def tuple_to_dict(tup, di):
    for a, b in tup:
        di.setdefault(a, []).append(b)
    return di
I have then created a list of tuples
di = {}
ASX = tuple_to_dict(test_tuple,di)
Now when I call ASX['ABC'], the corresponding data is returned, but it appears as a list object, not the DataFrame it was initially in, which I was hoping to keep.
Is there a way to maintain the DataFrame structure? There have been similar questions asked, none however related to a list of DataFrames.
Initial read in as follows:
import pandas as pd
import glob

path = '/Users/.../.../.../.../...'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
Thanks a lot!
I don't understand why you transform it to a tuple and then to a dict. You can just add each DataFrame to a dict, as in the following code:
import pandas as pd

li = [pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']}),
      pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})]
keys = ["df1", "df2"]
d = {}
for i, k in enumerate(keys):
    d[k] = li[i]
Where print(type(d["df1"])) returns: <class 'pandas.core.frame.DataFrame'>
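Applied to the question's setup, the dict can even be built in one pass, deriving each key from the DataFrame itself (a sketch, assuming each ticker appears in exactly one file and prefixes the first column name, as in the question):
# one DataFrame per ticker; values stay DataFrames, no lists involved
ASX = {df.columns[0][:3]: df for df in li}
print(type(ASX['ABC']))  # <class 'pandas.core.frame.DataFrame'>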

Get new dataframe by multiple conditions with pd.DataFrame.isin()

I am trying to write a function that takes a df and a dictionary that maps columns to values. The function slices rows (indexes) such that it returns only the rows whose values match the 'criteria' dictionary's values.
for example:
df_isr13 = filterby_criteria(df, {"Area": ["USA"], "Year": [2013]})
Only rows with "Year"=2013 and "Area"="USA" are included in the output.
I tried:
def filterby_criteria(df, criteria):
    for key, values in criteria.items():
        return df[df[key].isin(values)]
but I get only the first criterion
How can I get a new dataframe that satisfies all criteria using pd.DataFrame.isin()?
You can use a for loop and apply every criterion with the pandas merge function:
def filterby_criteria(df, criteria):
    for key, values in criteria.items():
        df = pd.merge(df[df[key].isin(values)], df, how='inner')
    return df
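A plain boolean-mask loop achieves the same without merging; a minimal sketch of that alternative (assuming pandas is imported as pd):
def filterby_criteria(df, criteria):
    # start from an all-True mask and AND in one isin() test per column
    mask = pd.Series(True, index=df.index)
    for key, values in criteria.items():
        mask &= df[key].isin(values)
    return df[mask]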
Consider a simple merge of two data frames since by default merge uses all matching names:
from itertools import product
import pandas as pd

def filterby_criteria(df, criteria):
    # EXTRACT DICT ITEMS
    k, v = criteria.keys(), criteria.values()
    # BUILD DF OF ALL POSSIBLE MATCHES
    all_matches = (pd.DataFrame(product(*v))
                   .set_axis(list(k), axis='columns', inplace=False)
                   )
    # RETURN MERGED DF
    return df.merge(all_matches)
To demonstrate with random, seeded data:
Data
import numpy as np
import pandas as pd

np.random.seed(61219)
tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
years = list(range(2013, 2019))
random_df = pd.DataFrame({'Tool': np.random.choice(tools, 500),
                          'Int': np.random.randint(1, 10, 500),
                          'Num': np.random.uniform(1, 100, 500),
                          'Year': np.random.choice(years, 500)
                          })
print(random_df.head(10))
# Tool Int Num Year
# 0 spss 4 96.465327 2016
# 1 sas 7 23.455771 2016
# 2 r 5 87.349825 2014
# 3 julia 4 18.214028 2017
# 4 julia 7 17.977237 2016
# 5 stata 3 41.196579 2013
# 6 stata 8 84.943676 2014
# 7 python 4 60.576030 2017
# 8 spss 4 47.024075 2018
# 9 stata 3 87.271072 2017
Function call
criteria = {"Tool": ["python", "r"], "Year": [2013, 2015]}

def filterby_criteria(df, criteria):
    k, v = criteria.keys(), criteria.values()
    all_matches = (pd.DataFrame(product(*v))
                   .set_axis(list(k), axis='columns', inplace=False)
                   )
    return df.merge(all_matches)

final_df = filterby_criteria(random_df, criteria)
Output
print(final_df)
# Tool Int Num Year
# 0 python 8 96.611384 2015
# 1 python 7 66.782828 2015
# 2 python 9 73.638629 2015
# 3 python 4 70.763264 2015
# 4 python 2 28.311917 2015
# 5 python 3 69.888967 2015
# 6 python 8 97.609694 2015
# 7 python 3 59.198276 2015
# 8 python 3 64.497017 2015
# 9 python 8 87.672138 2015
# 10 python 9 33.605467 2015
# 11 python 8 25.225665 2015
# 12 r 3 72.202364 2013
# 13 r 1 62.192478 2013
# 14 r 7 39.264766 2013
# 15 r 3 14.599786 2013
# 16 r 4 22.963723 2013
# 17 r 1 97.647922 2013
# 18 r 5 60.457344 2013
# 19 r 5 15.711207 2013
# 20 r 7 80.273330 2013
# 21 r 7 74.190107 2013
# 22 r 7 37.923396 2013
# 23 r 2 91.970678 2013
# 24 r 4 31.489810 2013
# 25 r 1 37.580665 2013
# 26 r 2 9.686955 2013
# 27 r 6 56.238919 2013
# 28 r 6 72.820625 2015
# 29 r 3 61.255351 2015
# 30 r 4 45.690621 2015
# 31 r 5 71.143601 2015
# 32 r 6 54.744846 2015
# 33 r 1 68.171978 2015
# 34 r 5 8.521637 2015
# 35 r 7 87.027681 2015
# 36 r 3 93.614377 2015
# 37 r 7 37.918881 2015
# 38 r 3 7.715963 2015
# 39 python 1 42.681928 2013
# 40 python 6 57.354726 2013
# 41 python 1 48.189897 2013
# 42 python 4 12.201131 2013
# 43 python 9 1.078999 2013
# 44 python 9 75.615457 2013
# 45 python 8 12.631277 2013
# 46 python 9 82.227578 2013
# 47 python 7 97.802213 2013
# 48 python 1 57.103964 2013
# 49 python 1 1.941839 2013
# 50 python 3 81.981437 2013
# 51 python 1 56.869551 2013