Python: iterating through an Excel spreadsheet

I need to write a program that loops through columns of data, with each column representing a variable and the variable being reset based on the cell value.
The variables in the exercise depend on the values being looped through.
How can I loop through the rows, with each iteration of the loop increasing the row index by 1?
df = pd.DataFrame(r'C:/Users/user.name/Desktop/P_list.xlsx', sheet_name='Sheet1')
for i in range(0, 5000):
    df2 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx', sheet_name='Sheet1', index=list(range(i, 5000, 1)), columns=list(range(0)))
    df3 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx', sheet_name='Sheet1', index=list(range(i, 5000, 1)), columns=list(range(1)))
    df4 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx', sheet_name='Sheet1', index=list(range(i, 5000, 1)), columns=list(range(2)))
    df5 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx', sheet_name='Sheet1', index=list(range(i, 5000, 1)), columns=list(range(3)))
    firstname = df2
    lastname = df3
    address = df4
    number = df5
    # performed exercise
#performed exercise

I have tried this in Jupyter. This is what is needed to load the Excel file into a DataFrame:
import numpy as np
import pandas as pd
import xlrd
df = pd.read_excel('Sample.xlsx', sheet_name = 0)
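One note on the imports: recent versions of xlrd only read legacy .xls files, so for .xlsx files pandas uses openpyxl instead. If reading an .xlsx file fails with an xlrd error, this variant (assuming openpyxl is installed) makes the engine explicit:
df = pd.read_excel('Sample.xlsx', sheet_name=0, engine='openpyxl')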
Then you can loop over the column names like this:
for col in df.columns:
    print(col)
And loop over the values in each column like this:
for col in df.columns:
    print("NEW COLUMN ----------")
    for val in df[col]:
        print(val)
Another way to do it is to loop through the columns and the rows by position (iloc is the positional accessor, and avoids the chained df.loc[row][column] lookup):
columns = len(df.columns)
rows = len(df)
for column in range(columns):
    print("-----------")
    for row in range(rows):
        print(df.iloc[row, column])
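Closer to what the question asks (advancing one row per iteration and resetting variables from the cells), one pattern is to iterate the rows directly. A minimal sketch, assuming the sheet's first four columns hold first name, last name, address, and number (those variable names are placeholders, not columns from the original file):
for row in df.itertuples(index=False):
    firstname, lastname, address, number = row[0], row[1], row[2], row[3]
    # perform the exercise with these values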

Related

How to fill cell by cell of an empty pandas dataframe which has zero columns with a loop?

I need to scrape hundreds of pages and instead of storing the whole json of each page, I want to just store several columns from each page into a pandas dataframe. However, at the beginning when the dataframe is empty, I have a problem. I need to fill an empty dataframe without any columns or rows. So the loop below is not working correctly:
import pandas as pd
import requests

cids = [4100, 4101, 4102, 4103, 4104]
df = pd.DataFrame()
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df['Customer_id'] = i
    df['Name'] = jdata['user']['profile']['Name']
    ...
In this case, what should I do?
You can solve this by using enumerate() together with loc:
for index, i in enumerate(cids):
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df.loc[index, 'Customer_id'] = i
    df.loc[index, 'Name'] = jdata['user']['profile']['Name']
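One caveat: enlarging an empty frame row by row with loc may not preserve integer dtypes, depending on the pandas version, so Customer_id can come back as float or object. If that happens, a cast at the end restores it (assuming the column holds only integers):
df['Customer_id'] = df['Customer_id'].astype(int)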
If you specify your column names when you create your empty dataframe, as follows:
df = pd.DataFrame(columns=['Customer_id', 'Name'])
then you can just append your new data using:
df = df.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']}, ignore_index=True)
(plus any other columns you populate), adding a row to the dataframe for each iteration of your for loop.
import pandas as pd
import requests

cids = [4100, 4101, 4102, 4103, 4104]
df = pd.DataFrame(columns=['Customer_id', 'Name'])
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df = df.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']}, ignore_index=True)
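Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current pandas, the per-row equivalent of the loop body above uses pd.concat (still with the same per-row overhead):
row = pd.DataFrame([{'Customer_id': i, 'Name': jdata['user']['profile']['Name']}])
df = pd.concat([df, row], ignore_index=True)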
It should be noted that using append on a DataFrame in a loop is usually inefficient, so a better way is to save your results as a list of lists (df_data) and then turn that into a DataFrame, as below:
cids = [4100, 4101, 4102, 4103, 4104]
df_data = []
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df_data.append([i, jdata['user']['profile']['Name']])
df = pd.DataFrame(df_data, columns=['Customer_id', 'Name'])

Update a pandas dataframe in each iteration and append to previous

I am trying to take a section of an existing pandas dataframe and duplicate that section, with some updates, in a loop. Basically, for all 273 rows of the section, I want to update each person's "GivenName" by replacing "Name1" with "Name2", "Name3", ..., "Name5".
data1 = data[0:273]  # creating the subset
data2 = data1.copy()
df = []
for i in range(4):
    data2["GivenName"] = "Name" + str(i + 2)  # for all 273 rows replace the name
    df.append(data2)
appended_data = pd.concat(df)
What I end up with instead is a dataframe where only the last value, "Name5", is appended 4 times, instead of "Name2", "Name3", ..., "Name5". How can I update the "GivenName" values in each iteration and append all the results?
Or as a one-liner:
pd.concat(data[0:273].assign(GivenName=f'Name{i+2}') for i in range(4))
What's happening is that your list df is just getting four references to the same data2 DataFrame. In other words, the list looks like this:
[
    data2,
    data2,
    data2,
    data2
]
and you're setting data2["GivenName"] = "Name5" in the final iteration. The most straightforward way to get the behavior you're expecting is to move the DataFrame copy into the for loop:
df = []
for i in range(4):
    data2 = data1.copy()
    data2["GivenName"] = "Name" + str(i + 2)  # for all 273 rows replace the name
    df.append(data2)
There are a few issues here:
(1) df = [] creates a list, not a dataframe. Try df = pd.DataFrame()
(2) df.append(data2) should be df = df.append(data2), because append does not happen in place.
With those fixes, df is already the accumulated result, so no final pd.concat is needed:
data1 = data[0:273]  # creating the subset
data2 = data1.copy()
df = pd.DataFrame()
for i in range(4):
    data2["GivenName"] = "Name" + str(i + 2)  # for all 273 rows replace the name
    df = df.append(data2)
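A caveat on this last answer: DataFrame.append was removed in pandas 2.0. On current pandas, the same result can be built by collecting modified copies and concatenating once, e.g. (a sketch equivalent to the one-liner above):
frames = [data1.assign(GivenName=f"Name{i+2}") for i in range(4)]
appended_data = pd.concat(frames, ignore_index=True)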

How to drop a column from a dataframe (df) in Pandas on the condition that the column is present in df?

I have Python code that pulls data from a 3rd-party API. Below is the code.
for sub in sublocation_ids:
    city_num_int = sub['id']
    city_num_str = str(city_num_int)
    city_name = sub['name']
    filter_text_new = filter_text.format(city_num_str)
    data = json.dumps({"filters": [filter_text_new], "sort_by": "fb_tw_and_li", "size": 200, "from": 1580491663000, "to": 1588184960000, "content_type": "stories"})
    r = requests.post(url=api_endpoint, data=data).json()
    if r['articles'] != empty_list:
        articles_list = r["articles"]
        time.sleep(5)
        articles_list_normalized = json_normalize(articles_list)
        df = articles_list_normalized
        df['publication_timestamp'] = pd.to_datetime(df['publication_timestamp'])
        df['publication_timestamp'] = df['publication_timestamp'].apply(lambda x: x.now().strftime('%Y-%m-%d'))
        df['citystate'] = city_name
        df = df.drop('has_video', axis=1)
        df.to_excel(writer, sheet_name=city_name)
        writer.save()
Now, city_num_int = sub['id'] is a unique ID for the different cities. The API returns a "videos" column for a few cities and not for others. I want to get rid of that videos column before the data gets written to the Excel file.
I was able to drop the "has_video" column using df.drop, as that column is present in each and every city's data pull. But how do I conditionally drop the "videos" column, since it is only present for a few cities?
You can ignore the errors raised by DataFrame.drop:
df = df.drop(['videos'], axis=1, errors='ignore')
Another way is to first check whether the column is present in the df, and only then delete it.
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
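That check could look like this (a minimal sketch):
if 'videos' in df.columns:
    df = df.drop(columns=['videos'])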
You can also use a list comprehension on the column names to achieve what you want:
cols_to_keep = [c for c in df.columns if c != "videos"]
df = df[cols_to_keep]

Loop to Store multiple pandas dataframe into a List

I'm a beginner Python user and I am wondering if it is possible to store multiple dataframes generated from a loop in a list.
Unfortunately, I do not have a reproducible example. What I am trying to do is to read in a directory of pdf files, make row 0 into the header, drop that row and store it into one dataframe within a list.
master_df = []
for i in range(1, len(pdffiles)):
    df = read_pdf(pdffiles[i])
    df.columns = df.iloc[0,]  # get col names
    df = df.reindex(df.index.drop(0))  # drop first row
    df = df.replace(np.nan, '', regex=True, inplace=True)
    master_df = df
This is the code that I have, but I am getting this error at the df.columns, reindex, and replace lines:
AttributeError: 'NoneType' object has no attribute 'replace'
Could anyone point me in the right direction?
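Two things in this loop can yield a None: read_pdf may return None for a file with no extractable table, and df.replace(..., inplace=True) itself returns None, so assigning its result back to df replaces the DataFrame with None. Dropping either the inplace=True or the reassignment avoids the second problem.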
Update:
May I ask why the following code does not work? I'm trying to skip ahead with continue when the dataframe is not None.
master_df = []
for i in range(len(pdffiles)):
    df = read_pdf(pdffiles[i])
    if df is not None:
        continue
    df.columns = df.iloc[0, :]  # get col names
    df = df.reindex(df.index.drop(0))  # drop first row
    df = df.fillna('')
    master_df.append(df)
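(Note that the guard in this update is inverted: if df is not None: continue skips every successfully parsed file, and the remaining Nones then crash on df.columns. The intended check is presumably the opposite:
if df is None:
    continue
)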
It is possible to store data frames in a list:
master_df = []
for i in range(len(pdffiles)):
    df = read_pdf(pdffiles[i])
    df.columns = df.iloc[0, :]  # get col names
    df = df.reindex(df.index.drop(0))  # drop first row
    df = df.fillna('')
    master_df.append(df)
You can use df.fillna() to replace NaN values with ''.
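If you eventually want a single DataFrame rather than a list, the collected frames can be combined at the end (assuming they share the same columns):
combined = pd.concat(master_df, ignore_index=True)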

Return multiple DataFrames from a function with Pandas

I am trying to parse multiple excel sheets with Pandas into separate individual DataFrames.
My code so far is:
sheet_names = ['tab1', 'tab2']
df_names = [1, 2]
def initilize_dataframes(sheet_names):
    for name in sheet_names:
        df = xls_file.parse(name)  # parse the xlsx sheet
        df = df.transpose()  # transpose dates to index
        new_header = df.iloc[0]  # column header names
        df = df[1:]  # drop 1st row
        df.rename(columns=new_header, inplace=True)  # rename the columns
    return df

for i in df_names:
    df_(i) = initilize_dataframes(sheet_names)  # something like this, idk
I can't wrap my head around the last two lines. I get that the function will return the df, but I would like it to take the values from the df_names list and label the DataFrame accordingly.
For example, for tab1 in the Excel sheet the DataFrame should be named df_1, and likewise df_2 for tab2.
It is possible with globals():
for i, val in enumerate(df_names):
    globals()['df_' + str(val)] = initilize_dataframes(sheet_names[i])
But it is better to use a dict of DataFrames. Here enumerate supplies the position i used to select from sheet_names; Python counts from 0, so i is already one less than the corresponding value in df_names:
dfs = {}
for i, val in enumerate(df_names):
    # assumes initilize_dataframes is adapted to take a single sheet name
    dfs[val] = initilize_dataframes(sheet_names[i])

print(dfs[1])
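The same mapping can also be built in one pass with zip, as a sketch (again assuming initilize_dataframes accepts a single sheet name):
dfs = {val: initilize_dataframes(name) for val, name in zip(df_names, sheet_names)}
print(dfs[1])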
