I'm a beginner Python user and I am wondering if it is possible to store multiple dataframes generated from a loop in a list.
Unfortunately, I do not have a reproducible example. What I am trying to do is read in a directory of PDF files, make row 0 the header, drop that row, and store the result as one dataframe within a list.
    master_df = []
    for i in range(1, len(pdffiles)):
        df = read_pdf(pdffiles[i])
        df.columns = df.iloc[0, ]            # get col names
        df = df.reindex(df.index.drop(0))    # drop first row
        df = df.replace(np.nan, '', regex=True, inplace=True)
        master_df = df
This is the code that I have, but I am getting errors at the df.columns, reindex and replace lines:

    AttributeError: 'NoneType' object has no attribute 'replace'
Could anyone point me in the right direction?
Update:
May I ask why the following code does not work? I'm trying to skip to the next file with a continue when the dataframe is None.
    master_df = []
    for i in range(len(pdffiles)):
        df = read_pdf(pdffiles[i])
        if df is not None:
            continue
        df.columns = df.iloc[0, :]           # get col names
        df = df.reindex(df.index.drop(0))    # drop first row
        df = df.fillna('')
        master_df.append(df)
Yes, it is possible to store DataFrames in a list:
    master_df = []
    for i in range(len(pdffiles)):
        df = read_pdf(pdffiles[i])
        df.columns = df.iloc[0, :]           # get col names
        df = df.reindex(df.index.drop(0))    # drop first row
        df = df.fillna('')
        master_df.append(df)
You can use df.fillna() to replace NaN values with ''.
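Since read_pdf() can return None for a file it cannot parse (which is what produces the NoneType error above), it is worth guarding before touching df. Here is a minimal, self-contained sketch of the same header-promotion pattern, with toy in-memory DataFrames standing in for what read_pdf() would return:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for read_pdf() results; None simulates a failed parse.
raw_frames = [
    pd.DataFrame([['name', 'age'], ['alice', 30], ['bob', np.nan]]),
    None,
]

master_df = []
for raw in raw_frames:
    if raw is None:                      # skip files that could not be parsed
        continue
    df = raw.copy()
    df.columns = df.iloc[0]              # promote row 0 to the header
    df = df.drop(df.index[0])            # drop the old header row
    df = df.fillna('')                   # blank out NaN values
    master_df.append(df)

print(len(master_df))                    # 1
print(list(master_df[0].columns))        # ['name', 'age']
```

Note that fillna('') is assigned back to df rather than called with inplace=True, which is what made the original code end up with None.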
I need to scrape hundreds of pages, and instead of storing the whole JSON of each page, I want to store just several columns from each page in a pandas dataframe. However, at the beginning, when the dataframe is empty, I have a problem: I need to fill an empty dataframe that has no columns or rows. So the loop below is not working correctly:
    import pandas as pd
    import requests

    cids = [4100, 4101, 4102, 4103, 4104]
    df = pd.DataFrame()
    for i in cids:
        url_info = requests.get(f'myurl/{i}/profile')
        jdata = url_info.json()
        df['Customer_id'] = i
        df['Name'] = jdata['user']['profile']['Name']
        ...
In this case, what should I do?
You can solve this by using enumerate(), together with loc:
    for index, i in enumerate(cids):
        url_info = requests.get(f'myurl/{i}/profile')
        jdata = url_info.json()
        df.loc[index, 'Customer_id'] = i
        df.loc[index, 'Name'] = jdata['user']['profile']['Name']
If you specify your column names when you create your empty dataframe, as follows:

    df = pd.DataFrame(columns=['Customer_id', 'Name'])

then you can just append your new data using:

    df = df.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']}, ignore_index=True)

(plus any other columns you populate), adding a row to the dataframe on each iteration of your for loop.
    import pandas as pd
    import requests

    cids = [4100, 4101, 4102, 4103, 4104]
    df = pd.DataFrame(columns=['Customer_id', 'Name'])
    for i in cids:
        url_info = requests.get(f'myurl/{i}/profile')
        jdata = url_info.json()
        df = df.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']}, ignore_index=True)
It should be noted that using append on a DataFrame in a loop is usually inefficient (see here), and DataFrame.append was removed entirely in pandas 2.0, so a better way is to save your results as a list of lists (df_data) and then turn that into a DataFrame once at the end:
    cids = [4100, 4101, 4102, 4103, 4104]
    df_data = []
    for i in cids:
        url_info = requests.get(f'myurl/{i}/profile')
        jdata = url_info.json()
        df_data.append([i, jdata['user']['profile']['Name']])

    df = pd.DataFrame(df_data, columns=['Customer_id', 'Name'])
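As a self-contained illustration of that list-of-lists pattern (with a hard-coded dict standing in for the API response, since myurl is not a real endpoint):

```python
import pandas as pd

cids = [4100, 4101]
# Stand-in for url_info.json(); the real code would call requests.get(...).json()
fake_responses = {
    4100: {'user': {'profile': {'Name': 'Alice'}}},
    4101: {'user': {'profile': {'Name': 'Bob'}}},
}

df_data = []
for i in cids:
    jdata = fake_responses[i]
    df_data.append([i, jdata['user']['profile']['Name']])

# One DataFrame construction at the end, instead of growing it row by row
df = pd.DataFrame(df_data, columns=['Customer_id', 'Name'])
print(df.shape)  # (2, 2)
```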
I have Python code that pulls data from a third-party API. Below is the code.
    for sub in sublocation_ids:
        city_num_int = sub['id']
        city_num_str = str(city_num_int)
        city_name = sub['name']
        filter_text_new = filter_text.format(city_num_str)
        data = json.dumps({"filters": [filter_text_new], "sort_by": "fb_tw_and_li",
                           "size": 200, "from": 1580491663000, "to": 1588184960000,
                           "content_type": "stories"})
        r = requests.post(url=api_endpoint, data=data).json()
        if r['articles'] != empty_list:
            articles_list = r["articles"]
            time.sleep(5)
            articles_list_normalized = json_normalize(articles_list)
            df = articles_list_normalized
            df['publication_timestamp'] = pd.to_datetime(df['publication_timestamp'])
            df['publication_timestamp'] = df['publication_timestamp'].apply(lambda x: x.now().strftime('%Y-%m-%d'))
            df['citystate'] = city_name
            df = df.drop('has_video', axis=1)
            df.to_excel(writer, sheet_name=city_name)
    writer.save()
Now, city_num_int = sub['id'] is a unique ID for each city. The API returns a "videos" column for some cities but not for others, and I want to get rid of that column before the data is written to the Excel file.
I was able to drop the "has_video" column using df.drop, as that column is present in every city's data pull. But how do I conditionally drop the "videos" column, since it is only present for a few cities?
You can ignore the errors raised by DataFrame.drop:

    df = df.drop(['videos'], axis=1, errors='ignore')
Another way is to first check whether the column is present in the DataFrame, and only then delete it.
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
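That check-then-drop approach might look like this (sketched with a toy DataFrame in place of the API data):

```python
import pandas as pd

# City whose payload includes the optional "videos" column
df = pd.DataFrame({'title': ['a', 'b'], 'videos': [1, 2]})
if 'videos' in df.columns:
    df = df.drop(columns=['videos'])

# City whose payload does not include it: the drop is simply skipped
df2 = pd.DataFrame({'title': ['c']})
if 'videos' in df2.columns:
    df2 = df2.drop(columns=['videos'])

print(list(df.columns))   # ['title']
print(list(df2.columns))  # ['title']
```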
You can use a list comprehension on the column names to achieve what you want:

    cols_to_keep = [c for c in df.columns if c != "videos"]
    df = df[cols_to_keep]
I need to write a program that loops through the columns of data, resetting a variable based on the cell value, with each column representing a variable.
The variables in the exercise depend on the values being looped through.
How can I loop through the rows, with each iteration of the loop increasing the index by 1?
    df = pd.DataFrame(r'C:/Users/user.name/Desktop/P_list.xlsx', sheet_name='Sheet1')
    for i in range(0, 5000):
        df2 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx', sheet_name='Sheet1'), index = list(range(i,5000,1), columns=list(range(0)))
        df3 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx', sheet_name='Sheet1'), index = list(range(i,5000,1), columns=list(range(1)))
        df4 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx', sheet_name='Sheet1'), index = list(range(i,5000,1), columns=list(range(2)))
        df5 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx', sheet_name='Sheet1'), index = list(range(i,5000,1), columns=list(range(3)))
        firstname = df2
        lastname = df3
        address = df4
        number = df5
        # performed exercise
I have tried this in Jupyter. This is needed to load the Excel file into a df:

    import numpy as np
    import pandas as pd
    import xlrd

    df = pd.read_excel('Sample.xlsx', sheet_name=0)
Then looping over the column names looks like this:

    for col in df.columns:
        print(col)

And looping over the data, column by column, looks like this:

    for col in df.columns:
        print("NEW COLUMN ----------")
        for val in df[col]:
            print(val)
Another way to do it is to loop through the columns and the rows by position (iloc is the positional indexer, so it works regardless of the index labels):

    columns = len(df.columns)
    rows = len(df)

    for column in range(columns):
        print("-----------")
        for row in range(rows):
            print(df.iloc[row, column])
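If the goal is row-by-row access, pandas also offers itertuples(), which is usually faster than positional indexing in a double loop; a small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'firstname': ['Ann', 'Ben'], 'age': [30, 25]})

names = []
for row in df.itertuples(index=False):
    # Each row is a namedtuple, so columns are attributes
    names.append(row.firstname)

print(names)  # ['Ann', 'Ben']
```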
I am trying to parse multiple Excel sheets with pandas into separate, individual DataFrames.
My code so far is:
    sheet_names = [tab1, tab2]
    df_names = [1, 2]

    def initilize_dataframes(sheet_names):
        for name in sheet_names:
            df = xls_file.parse(name)    # parse the xlsx sheet
            df = df.transpose()          # transpose dates to index
            new_header = df.iloc[0]      # column header names
            df = df[1:]                  # drop 1st row
            df.rename(columns=new_header, inplace=True)  # rename the columns
        return df

    for i in df_names:
        df_(i) = initilize_dataframes(sheet_names)  # something like this idk
I cannot wrap my head around the last two lines. I get that the function will return the df, but I would like it to take the values from the df_names list and label the DataFrame accordingly.
For example, for tab1 in the Excel file, the DataFrame should be named df_1, and likewise df_2 for tab2.
It is possible with globals():

    for i, val in enumerate(df_names):
        globals()['df_' + str(val)] = initilize_dataframes(sheet_names[i])
But it is better to use a dict of DataFrames. The sheet names are selected by position using the index from enumerate, which already counts from 0, so it lines up with sheet_names directly:

    dfs = {}
    for i, val in enumerate(df_names):
        dfs[val] = initilize_dataframes(sheet_names[i])

    print(dfs[1])
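The same mapping can also be written with zip, which pairs each entry of df_names with its sheet directly. A runnable sketch, using a stub in place of the real parse/transpose function (note the stub takes a single sheet name, unlike the question's function, which takes a list):

```python
import pandas as pd

def initilize_dataframes(sheet_name):
    # Stub standing in for the real xls_file.parse/transpose logic
    return pd.DataFrame({'sheet': [sheet_name]})

sheet_names = ['tab1', 'tab2']
df_names = [1, 2]

dfs = {name: initilize_dataframes(sheet)
       for name, sheet in zip(df_names, sheet_names)}

print(dfs[1]['sheet'].iloc[0])  # tab1
```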
I have a problem with appending to a dataframe.
I try to execute this code:
    df_all = pd.read_csv('data.csv', error_bad_lines=False, chunksize=1000000)
    urls = pd.read_excel('url_june.xlsx')
    substr = urls.url.values.tolist()
    df_res = pd.DataFrame()

    for df in df_all:
        for i in substr:
            res = df[df['url'].str.contains(i)]
            df_res.append(res)
And when I try to save df_res, I get an empty dataframe.
df_all looks like:
    ID,"url","used_at","active_seconds"
    b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:25,1
    b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:31,30
    f85ce4b2f8787d48edc8612b2ccaca83,"4pda.ru/forum/index.php?showtopic=634566&view=getnewpost",2015-10-01 00:01:49,2
    d3b0ef7d85dbb4dbb75e8a5950bad225,"shop.mts.ru/smartfony/mts/smartfon-smart-sprint-4g-sim-lock-white.html?utm_source=admitad&utm_medium=cpa&utm_content=300&utm_campaign=gde_cpa&uid=3",2015-10-01 00:03:19,34
    078d388438ebf1d4142808f58fb66c87,"market.yandex.ru/product/12675734/spec?hid=91491&track=char",2015-10-01 00:03:48,2
    d3b0ef7d85dbb4dbb75e8a5950bad225,"avito.ru/yoshkar-ola/telefony/mts",2015-10-01 00:04:21,4
    d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:25,1
    d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:26,9
and urls looks like:

    url
    shoppingcart.aliexpress.com/order/confirm_order
    ozon.ru/?context=order_done&number=
    lk.wildberries.ru/basket/orderconfirmed
    lamoda.ru/checkout/onepage/success/quick
    mvideo.ru/confirmation?_requestid=
    eldorado.ru/personal/order.php?step=confirm
When I print res inside the loop, it is not empty. But when I print df_res after the append, it is still an empty dataframe.
I can't find my error. How can I fix it?
If you look at the documentation for pd.DataFrame.append:

    Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.

(emphasis mine). append does not modify df_res in place; it returns a new DataFrame, so you must assign the result back:

    df_res = df_res.append(res)
Incidentally, note that pandas isn't that efficient at creating a DataFrame by successive concatenations. You might try this instead:

    all_res = []
    for df in df_all:
        for i in substr:
            res = df[df['url'].str.contains(i)]
            all_res.append(res)
    df_res = pd.concat(all_res)
This first creates a list of all the parts, then creates a DataFrame from all of them once at the end.
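A self-contained version of that pattern, with toy chunks and hypothetical URLs standing in for the real files:

```python
import pandas as pd

# Toy stand-ins for the CSV chunks and the URL substrings
df_all = [
    pd.DataFrame({'url': ['shop.example.com/order/confirm',
                          'news.example.com/article/1']}),
    pd.DataFrame({'url': ['shop.example.com/order/confirm?x=1']}),
]
substr = ['order/confirm']

all_res = []
for df in df_all:
    for i in substr:
        # regex=False treats the substring literally (it contains '/')
        res = df[df['url'].str.contains(i, regex=False)]
        all_res.append(res)

df_res = pd.concat(all_res)
print(len(df_res))  # 2
```

Passing regex=False (or escaping the pattern) is worth considering in the real code too, since URL substrings often contain characters like '?' that are special in regular expressions.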
If we want to append based on index (for example, taking the 10 rows before the i-th index), the same pattern works with iloc (.ix has been removed from pandas):

    df_res = pd.DataFrame(data=None, columns=df.columns)
    all_res = []
    d1 = df.iloc[index-10:index]   # take the 10 rows before the i-th index
    all_res.append(d1)
    df_res = pd.concat(all_res)