Looping through pandas DataFrames from web HTML - Python

I am new to Python. I am working on finance data and want to loop through multiple datasets.
I have following code to read the data.
df1_url = pd.read_html("https:url1")
df2_url = pd.read_html("https:url2")
df3_url = pd.read_html("https:url3")
df4_url = pd.read_html("https:url4")
Each dataset has 9 different tables in it, but every dataset is of the same format.
E.g. the resulting output should look like:
bs_sheet = df1_url[1]
ps_sheet = df1_url[3]
cf_sheet = df1_url[5]
This process is the same for all dataframes. I want to loop through the 4 different dataframes like this.
So I tried putting all 4 datasets into a dictionary:
dfs= {'df1':df1_url,'df2':df2_url,'df3':df3_url,'df4':df4_url}
I tried to loop through different datasets,
def trans(frame):
    for i in dfs:
        bs_sheet = i[1]
        ps_sheet = i[3]
        cf_sheet = i[5]
        data = pd.concat([bs_sheet, pl_sheet, cf_sheet], axis=0)
        data = data.transpose
These operations should be performed for all 4 datasets. When I ran this, I received a "string index out of range" error. After this, how do I access each dataset?
My solution was this:
d = {}
for key, data in dfs.items():
    bs_sheets = data[1]
    pl_sheets = data[3]
    cs_flows = data[5]
    data = pd.concat([bs_sheets, pl_sheets, cs_flows], axis=0)
    data = data.transpose()
    d[key] = data
Thanks for helping me out, @Zeinab and @lucas.

Your function would not work as written; you need to change frame in your function to dfs, and iterate over the dictionary's values (otherwise i is just the key string, which is what produces the "string index out of range" error):
def trans(dfs):
    for i in dfs.values():
        bs_sheet = i[1]
        ps_sheet = i[3]
        cf_sheet = i[5]
        data = pd.concat([bs_sheet, ps_sheet, cf_sheet], axis=0)
        data = data.transpose()

import pandas as pd

dfs = {'df1': ['a', 'a', 'a', 'a'], 'df2': ['b', 'b', 'b', 'b'],
       'df3': ['c', 'c', 'c', 'c'], 'df4': ['d', 'd', 'd', 'd']}
d = []
for i in dfs.values():
    d.append(pd.Series(i))
final_pd = pd.concat(d, axis=1)
print(final_pd)
   0  1  2  3
0  a  b  c  d
1  a  b  c  d
2  a  b  c  d
3  a  b  c  d
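Putting the pieces together for the original read_html case, here is a minimal sketch. It assumes, as in the question, that each page yields at least six tables and that tables 1, 3 and 5 are the balance sheet, P&L and cash-flow statements; the URLs are placeholders copied from the question.
import pandas as pd

# Placeholder URLs from the question.
urls = {'df1': 'https://url1', 'df2': 'https://url2',
        'df3': 'https://url3', 'df4': 'https://url4'}

def trans(tables):
    # Concatenate the balance sheet, P&L and cash-flow tables, then transpose.
    bs_sheet, pl_sheet, cf_sheet = tables[1], tables[3], tables[5]
    return pd.concat([bs_sheet, pl_sheet, cf_sheet], axis=0).transpose()

# d maps each dataset name to its combined, transposed statement table.
d = {name: trans(pd.read_html(url)) for name, url in urls.items()}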

Related

Python: concat data frames then save them to one csv

I have multiple data frames. I want to get some rows from each data frame based on a certain condition, add them to one data frame, and then save them to one csv file.
I tried multiple methods; append on data frames is deprecated.
Here is the simple code. I want to retrieve the rows above and below every row whose value is larger than 2.
result = pd.concat(frames) returns the required rows with the headers, so with every iteration of the for loop it prints the required rows. However, when I save them to csv, only the last three are saved. How do I save/append the rows before adding them to the csv? What am I missing here?
df_sorted = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6],
                          "User": ['a', 'b', 'c', 'd', 'e', 'f']})
Max = pd.DataFrame()
above = pd.DataFrame()
below = pd.DataFrame()
for i in range(len(df_sorted)):
    if df_sorted.ID[i] > 2:
        Max = df_sorted.iloc[[i]]          # first df
        if i < len(df_sorted) - 1:
            above = df_sorted.iloc[[i+1]]  # second df
        if i > 0:
            below = df_sorted.iloc[[i-1]]  # third df
        frames = [above, Max, below]
        result = pd.concat(frames)
result.to_csv('new_df.csv')
The desired result should be,
ID User
2 b
3 c
4 d
3 c
4 d
5 e
4 d
5 e
6 f
5 e
6 f
What I get from result is:
ID User
5 e
6 f
6 f
Here it is:
columns = ['id', 'user']
Max = pd.DataFrame(columns=columns)
above = pd.DataFrame(columns=columns)
below = pd.DataFrame(columns=columns)
for i in range(len(df_sorted)):
    if df_sorted.ID[i] > 2:
        Max.loc[i, 'id'] = df_sorted.iloc[i, 0]
        Max.loc[i, 'user'] = df_sorted.iloc[i, 1]
        if i < len(df_sorted) - 1:
            above.loc[i, 'id'] = df_sorted.iloc[i+1, 0]
            above.loc[i, 'user'] = df_sorted.iloc[i+1, 1]
        if i > 0:
            below.loc[i, 'id'] = df_sorted.iloc[i-1, 0]
            below.loc[i, 'user'] = df_sorted.iloc[i-1, 1]
result = pd.concat([above, Max, below], axis=0)
result
The problem is that Max, above, and below each hold only one row and are overwritten on every iteration.
You should define Max = pd.DataFrame(columns=columns) (and likewise above and below), or use lists, so the data accumulates across iterations. With this, you keep every row and the final concat does not lose any data.
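Another way to address the original problem, as a minimal sketch built on the df_sorted example from the question: collect each three-row block in a list inside the loop and concatenate once after the loop, so earlier results are never overwritten.
import pandas as pd

df_sorted = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6],
                          "User": ['a', 'b', 'c', 'd', 'e', 'f']})

blocks = []
for i in range(len(df_sorted)):
    if df_sorted.ID[i] > 2:
        block = []
        if i > 0:
            block.append(df_sorted.iloc[[i - 1]])   # previous row
        block.append(df_sorted.iloc[[i]])           # the matching row
        if i < len(df_sorted) - 1:
            block.append(df_sorted.iloc[[i + 1]])   # next row
        blocks.append(pd.concat(block))

result = pd.concat(blocks)
result.to_csv('new_df.csv', index=False)
This reproduces the desired output shown in the question, one group per matching row, and writes the csv in a single call.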

How to loop over big dataframe in batches

I have a pretty big dataframe of about 1.5 million rows, and I am trying to execute the code below in batches of 10,000, then append the results to the dataset dataframe. One of the columns, subjects, is structured really weirdly, so I had to clean it up, but it takes a long time to process. That's why I want to use batches of k = 10,000. Thoughts on the best way to accomplish this?
reuters_set = reuters_set.loc[reuters_set['subjects'].str.contains('P:')]
reuters_set.shape[0]
1590478
reuters_set.subjects.iloc[33] #Example of data in column that needs to be processed
['B:1092', 'B:12', 'B:19', 'B:20', 'B:22', 'B:227', 'B:228', 'B:229', 'B:24', 'G:1', 'G:6', 'G:B1', 'G:K', 'G:S', 'M:1QD', 'M:AV', 'M:B6', 'M:Z', 'R:600058.SS', 'N2:ASIA', 'N2:ASXPAC', 'N2:BMAT', 'N2:BMAT08', 'N2:CMPNY', 'N2:CN', 'N2:EASIA', 'N2:EMRG', 'N2:EQTY', 'N2:IRNST', 'N2:LEN', 'N2:MEMI', 'N2:METWHL', 'N2:MIN', 'N2:MINE', 'N2:MINE08', 'N2:MTAL', 'N2:MTAL08', 'N2:STEE', 'P:4295865030']
dataset = []
k = 10000
ct=0
# Testing the first 10,000. It takes really long after this value...
bk = reuters_set.iloc[0:k]
bk.reset_index(inplace = True)
bk['id'] = np.arange(bk.shape[0])
bk['N2'] = ''
bk['P'] = ''
bk['R'] = ''
for index, row in bk.iterrows():
    a = [i.split(':') for i in ast.literal_eval(row['subjects'])]
    b = pd.DataFrame(a)
    b = b.groupby(0, as_index=False).agg({1: 'unique'})
    dict_code = dict(zip(b[0], b[1]))
    if 'N2' in dict_code.keys():
        bk.loc[bk['id'] == index, 'N2'] = str(dict_code['N2'].tolist())
    if 'R' in dict_code.keys():
        bk.loc[bk['id'] == index, 'R'] = str(dict_code['R'].tolist())
    if 'P' in dict_code.keys():
        bk.loc[bk['id'] == index, 'P'] = str(dict_code['P'].tolist())
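No answer is recorded here, but as a minimal sketch of one way to run the cleanup in chunks of 10,000 rows over the reuters_set frame from the question: process_batch is a hypothetical helper that repeats the per-row parsing above, using a plain dict instead of the groupby/unique step.
import ast
import pandas as pd

def process_batch(bk):
    # Work on a copy so the original slice is left untouched.
    bk = bk.copy()
    for col in ('N2', 'P', 'R'):
        bk[col] = ''
    for index, row in bk.iterrows():
        codes = {}
        for s in ast.literal_eval(row['subjects']):
            prefix, value = s.split(':', 1)
            codes.setdefault(prefix, []).append(value)
        for col in ('N2', 'P', 'R'):
            if col in codes:
                bk.loc[index, col] = str(codes[col])
    return bk

k = 10000
batches = []
for start in range(0, len(reuters_set), k):
    batches.append(process_batch(reuters_set.iloc[start:start + k]))
dataset = pd.concat(batches, ignore_index=True)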

Trouble importing Excel fields into Python via Pandas - index out of bounds error

I'm not sure what happened, but my code worked earlier today; now it won't. I have an Excel spreadsheet of projects I want to individually import and put into lists. However, I'm getting an "IndexError: index 8 is out of bounds for axis 0 with size 8" error, and Google searches have not resolved this for me. Any help is appreciated. I have the following fields in my Excel sheet: id, funding_end, keywords, pi, summaryurl, htmlabstract, abstract, project_num, title. Not sure what I'm missing...
import pandas as pd

dataset = pd.read_excel('new_ahrq_projects_current.xlsx', encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
cols = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df = df[df.columns[cols]]

tt = df['funding_end'] = df['funding_end'].astype(str)
tt = df.funding_end.tolist()
for t in tt:
    allenddates.append(t)

bb = df['keywords'] = df['keywords'].astype(str)
bb = df.keywords.tolist()
for b in bb:
    allkeywords.append(b)

uu = df['pi'] = df['pi'].astype(str)
uu = df.pi.tolist()
for u in uu:
    allpis.append(u)

vv = df['summaryurl'] = df['summaryurl'].astype(str)
vv = df.summaryurl.tolist()
for v in vv:
    allsummaryurls.append(v)

ww = df['htmlabstract'] = df['htmlabstract'].astype(str)
ww = df.htmlabstract.tolist()
for w in ww:
    allhtmlabstracts.append(w)

xx = df['abstract'] = df['abstract'].astype(str)
xx = df.abstract.tolist()
for x in xx:
    allabstracts.append(x)

yy = df['project_num'] = df['project_num'].astype(str)
yy = df.project_num.tolist()
for y in yy:
    allprojectnums.append(y)

zz = df['title'] = df['title'].astype(str)
zz = df.title.tolist()
for z in zz:
    alltitles.append(z)
"IndexError: index 8 is out of bounds for axis 0 with size 8"
cols = [0,1,2,3,4,5,6,7,8]
should be cols = [0,1,2,3,4,5,6,7].
I think you have 8 columns, but your cols list contains 9 indices.
An IndexError: index out of bounds means you're trying to insert or access something beyond its limit or range.
Whenever you load a file such as test.xls, test.csv, or test.xlsx with pandas, for example:
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
it is worth first checking how many columns the DataFrame has; this helps you move forward when working with large datasets. E.g.:
import pandas as pd
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
data_frames = pd.DataFrame(data_set)
print("Length of Columns:", len(data_frames.columns))
This gives you the exact number of columns in the spreadsheet. Then you can specify cols accordingly:
Length of Columns: 8
cols = [0, 1, 2, 3, 4, 5, 6, 7]
I agree with @Bill CX that it sounds like you're trying to access a column that doesn't exist. Although I cannot reproduce your error, I have some ideas that may help you move forward.
First, double check the shape of your data frame:
import pandas as pd
dataset = pd.read_excel('new_ahrq_projects_current.xlsx',encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
print(df.shape) # print shape of data read in to python
The output should be
(X, 9) # "X" is the number of rows
If the data frame has 8 columns, then df.shape will be (X, 8). This could be why you are getting the error.
Another check for you is to print out the first few rows of your data frame.
print(df.head())
This will let you double-check to see if you have read in the data in the correct form. I'm not sure, but it might be possible that your .xlsx file has 9 columns, but pandas is reading in only 8 of them.
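As a complement to the answers above, here is a minimal sketch that selects the columns by name instead of by position, so a missing column fails with a readable message. It assumes the field names listed in the question; whether read_excel accepts the encoding argument depends on the pandas version, so it is omitted here.
import pandas as pd

df = pd.read_excel('new_ahrq_projects_current.xlsx')

# Select columns by name, so a typo or missing field raises a clear error.
wanted = ['id', 'funding_end', 'keywords', 'pi', 'summaryurl',
          'htmlabstract', 'abstract', 'project_num', 'title']
missing = [c for c in wanted if c not in df.columns]
if missing:
    raise KeyError(f"Columns not found in the spreadsheet: {missing}")
df = df[wanted]

# Building the lists directly avoids the per-element append loops.
allenddates = df['funding_end'].astype(str).tolist()
allkeywords = df['keywords'].astype(str).tolist()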

Output unique values from a pandas dataframe without reordering the output

I know that a few posts have been made regarding how to output the unique values of a dataframe without reordering the data.
I have tried many times to implement these methods; however, I believe the problem relates to how the dataframe in question has been defined.
Basically, I want to look into the dataframe named "C", and output the unique values into a new dataframe named "C1", without changing the order in which they are stored at the moment.
The line that I use currently is:
C1 = pd.DataFrame(np.unique(C))
However, this returns a list in ascending order, while I simply want the original order preserved, only with duplicates removed.
Once again, I apologise to the advanced users who will look at my code and shake their heads -- I'm still learning! And yes, I have tried numerous methods to solve this problem (redefining the C dataframe, converting the output to a list, etc.), to no avail unfortunately, so this is my cry for help to the Python gods. I defined both C and C1 as dataframes, as I understand these are pretty much the best data structures to house data in, so that it can be recalled and used later, and it is quite useful to be able to name the columns without affecting the data contained in the dataframe.
Once again, your help would be much appreciated.
F0 = ('08/02/2018','08/02/2018',50)
F1 = ('08/02/2018','09/02/2018',52)
F2 = ('10/02/2018','11/02/2018',46)
F3 = ('12/02/2018','16/02/2018',55)
F4 = ('09/02/2018','28/02/2018',48)
F_mat = [[F0,F1,F2,F3,F4]]
F_test = pd.DataFrame(np.array(F_mat).reshape(5,3),columns=('startdate','enddate','price'))
#convert string dates into DateTime data type
F_test['startdate'] = pd.to_datetime(F_test['startdate'])
F_test['enddate'] = pd.to_datetime(F_test['enddate'])
#convert datetype to be datetime type for columns startdate and enddate
F['startdate'] = pd.to_datetime(F['startdate'])
F['enddate'] = pd.to_datetime(F['enddate'])
#create contract duration column
F['duration'] = (F['enddate'] - F['startdate']).dt.days + 1
#re-order the F matrix by column 'duration', ensure that the bootstrapping
#prioritises the shorter term contracts
F.sort_values(by=['duration'], ascending=[True])
# create prices P
P = pd.DataFrame()
for index, row in F.iterrows():
    new_P_row = pd.Series()
    for date in pd.date_range(row['startdate'], row['enddate']):
        new_P_row[date] = row['price']
    P = P.append(new_P_row, ignore_index=True)
P.fillna(0, inplace=True)
#create C matrix, which records the unique day prices across the observation interval
C = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
C.columns = tempDateRange
#create the Repatriation matrix, which records the order in which contracts will be
#stored in the A matrix, which means that once results are generated
#from the linear solver, we know exactly which CalendarDays map to
#which columns in the results array
#this array contains numbers from 1 to NbContracts
R = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
R.columns = tempDateRange
#define a zero filled matrix, P1, which will house the dominant daily prices
P1 = pd.DataFrame(np.zeros((intNbContracts, intNbCalendarDays)))
#rename columns of P1 to be the dates contained in matrix array D
P1.columns = tempDateRange
#create prices in correct rows in P
for i in list(range(0, intNbContracts)):
    for j in list(range(0, intNbCalendarDays)):
        if (P.iloc[i, j] != 0 and C.iloc[0, j] == 0):
            flUniqueCalendarMarker = P.iloc[i, j]
            C.iloc[0, j] = flUniqueCalendarMarker
            P1.iloc[i, j] = flUniqueCalendarMarker
            R.iloc[0, j] = i
            for k in list(range(j+1, intNbCalendarDays)):
                if (C.iloc[0, k] == 0 and P.iloc[i, k] != 0):
                    C.iloc[0, k] = flUniqueCalendarMarker
                    P1.iloc[i, k] = flUniqueCalendarMarker
                    R.iloc[0, k] = i
        elif (C.iloc[0, j] != 0 and P.iloc[i, j] != 0):
            P1.iloc[i, j] = C.iloc[0, j]
#convert C dataframe into C_list, in preparation for converting C_list
#into a unique, order-preserved list
C_list = C.values.tolist()
#create C1 matrix, which records the unique day prices across unique days in the observation period
C1 = pd.DataFrame(np.unique(C))
Use DataFrame.duplicated() to check whether your dataframe contains any duplicates.
If yes, then you can try DataFrame.drop_duplicates(), which keeps the first occurrence of each value and therefore preserves the original order.
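For the final step specifically, here is a minimal sketch, assuming C is the single-row frame built above: np.unique sorts its output, whereas pd.unique returns values in order of first appearance.
import pandas as pd

# Flatten the single-row C frame and keep unique values in order of appearance.
C1 = pd.DataFrame(pd.unique(C.values.ravel()), columns=['price'])

# Equivalent route staying with DataFrames: transpose so each price is a row,
# then drop duplicate rows (drop_duplicates keeps the first occurrence).
# C1 = C.T.drop_duplicates().reset_index(drop=True)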

Python Pandas Panel counting value occurence

I have a large dataset stored as a pandas Panel. I would like to count the occurrences of values < 1.0 on the minor_axis for each item in the panel. What I have so far:
#%% Creating the first DataFrame
dates1 = pd.date_range('2014-10-19', '2014-10-20', freq='H')
df1 = pd.DataFrame(index=dates1)
n1 = len(dates1)
df1.loc[:, 'a'] = np.random.uniform(3, 10, n1)
df1.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n1)
#%% Creating the second DataFrame
dates2 = pd.date_range('2014-10-18','2014-10-20',freq='H')
df2 = pd.DataFrame(index = dates2)
n2 = len(dates2)
df2.loc[:,'a'] = np.random.uniform(3,10,n2)
df2.loc[:,'b'] = np.random.uniform(0.9,1.2,n2)
#%% Creating the panel from both DataFrames
dictionary = {}
dictionary['First_dataset'] = df1
dictionary['Second dataset'] = df2
P = pd.Panel.from_dict(dictionary)
#%% I want to count the number of values < 1.0 for all datasets in the panel
## Only for minor axis b, not minor axis a, stored separately for each dataset
for dataset in P:
    P.loc[dataset, :, 'b']  # I need to count the number of values < 1.0 in this pandas Series
To count all the "b" values < 1.0, I would first isolate b in its own DataFrame by swapping the minor axis and the items.
In [43]: b = P.swapaxes("minor","items").b
In [44]: b.where(b<1.0).stack().count()
Out[44]: 30
Thanks for thinking with me, guys, but I managed to figure out a surprisingly easy solution after many hours of attempts. I thought I should share it in case someone else is looking for something similar.
for dataset in P:
    abc = P.loc[dataset, :, 'b']
    abc_low = sum(i < 1.0 for i in abc)
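A slightly more direct variant of the same idea, as a sketch against the Panel P from the question: (series < 1.0).sum() counts the True values without a Python-level loop. Note that Panel has been removed from recent pandas versions, where a dict of DataFrames would take its place.
counts = {}
for dataset in P:
    b = P.loc[dataset, :, 'b'].dropna()   # drop NaNs introduced by the unequal date ranges
    counts[dataset] = int((b < 1.0).sum())
print(counts)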
