I am trying to take a section of an existing pandas dataframe and duplicate that section with some updates in a loop. Basically, for all 273 rows of the section, I want to update the person's "GivenName" by replacing "Name1" with "Name2", "Name3", ..., "Name5".
data1 = data[0:273]  # creating the subset
data2 = data1.copy()
df = []
for i in range(4):
    data2["GivenName"] = "Name"+str(i+2)  # for all 273 rows replace name
    df.append(data2)
appended_data = pd.concat(df)
What I end up with instead is a dataframe where only the last value, "Name5", is appended 4 times, instead of "Name2", "Name3", ..., "Name5". How can I update the "GivenName" values for each iteration and append all results?
Or as a one-liner:
pd.concat(data[0:273].assign(GivenName=f'Name{i+2}') for i in range(4))
What's happening is your list df is just getting four references to the same data2 DataFrame. In other words, the list looks like this:
[
    data2,
    data2,
    data2,
    data2
]
and you're setting data2["GivenName"] = "Name5" in the final iteration. The most straightforward way to get the behavior you're expecting is moving the DataFrame copy into the for loop:
df = []
for i in range(4):
    data2 = data1.copy()
    data2["GivenName"] = "Name"+str(i+2)  # for all 273 rows replace name
    df.append(data2)
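The difference between appending the same object four times and appending a fresh copy each time can be checked on a tiny frame. This is a minimal sketch: the three-row `data1` below is made-up stand-in data, not the asker's 273 rows, and the names `shared`, `aliased`, `fresh` and `separate` are only for illustration.

```python
import pandas as pd

# Made-up stand-in for the 273-row subset.
data1 = pd.DataFrame({"GivenName": ["Name1"] * 3, "Value": [10, 20, 30]})

# Buggy pattern: one copy made outside the loop, so the list ends up
# holding four references to the same DataFrame object.
shared = data1.copy()
aliased = []
for i in range(4):
    shared["GivenName"] = "Name" + str(i + 2)
    aliased.append(shared)

# Fixed pattern: a fresh copy per iteration.
separate = []
for i in range(4):
    fresh = data1.copy()
    fresh["GivenName"] = "Name" + str(i + 2)
    separate.append(fresh)

buggy_names = pd.concat(aliased)["GivenName"].unique().tolist()
fixed_names = pd.concat(separate, ignore_index=True)["GivenName"].unique().tolist()
print(buggy_names)  # ['Name5']
print(fixed_names)  # ['Name2', 'Name3', 'Name4', 'Name5']
```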
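There are a few issues here:
(1) df = [] creates a list, not a dataframe. That is fine if you finish with pd.concat(df), but to accumulate directly into a dataframe, start from df = pd.DataFrame().
(2) With a dataframe, df.append(data2) would have to be df = df.append(data2), because DataFrame.append does not happen in place. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use df = pd.concat([df, data2]) instead.
(3) Once df is already a dataframe, the final pd.concat(df) is not needed (and raises a TypeError).
data1 = data[0:273]  # creating the subset
data2 = data1.copy()
df = pd.DataFrame()
for i in range(4):
    data2["GivenName"] = "Name"+str(i+2)  # for all 273 rows replace name
    df = pd.concat([df, data2])  # copies the current values of data2 into a new frame
appended_data = df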
Related
I need to scrape hundreds of pages and instead of storing the whole json of each page, I want to just store several columns from each page into a pandas dataframe. However, at the beginning when the dataframe is empty, I have a problem. I need to fill an empty dataframe without any columns or rows. So the loop below is not working correctly:
import pandas as pd
import requests
cids = [4100,4101,4102,4103,4104]
df = pd.DataFrame()
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df['Customer_id'] = i
    df['Name'] = jdata['user']['profile']['Name']
    ...
In this case, what should I do?
You can solve this by using enumerate(), together with loc:
for index, i in enumerate(cids):
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df.loc[index, 'Customer_id'] = i
    df.loc[index, 'Name'] = jdata['user']['profile']['Name']
If you specify your column names when you create your empty dataframe, as follows:
df = pd.DataFrame(columns = ['Customer_id', 'Name'])
then you can add a row to the dataframe on each iteration of your for loop (including any other columns you populate) using:
df = df.append({'Customer_id' : i, 'Name' : jdata['user']['profile']['Name']}, ignore_index=True)
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this only works on older versions.
import pandas as pd
import requests
cids = [4100,4101,4102,4103,4104]
df = pd.DataFrame(columns = ['Customer_id', 'Name'])
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df = df.append({'Customer_id' : i, 'Name' : jdata['user']['profile']['Name']}, ignore_index=True)
It should be noted that using append on a DataFrame in a loop is usually inefficient, because every call builds a new DataFrame and copies all the existing rows. A better way is to save your results as a list of lists (df_data), and then turn that into a DataFrame, as below:
cids = [4100,4101,4102,4103,4104]
df_data = []
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df_data.append([i, jdata['user']['profile']['Name']])
df = pd.DataFrame(df_data, columns = ['Customer_id', 'Name'])
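The same collect-then-build idea also works with a list of dicts, which keeps each value next to its column name. The sketch below runs without any network access: `fake_responses` is a hypothetical stand-in for the JSON that each `requests.get(...).json()` call would return, and the names in it are invented.

```python
import pandas as pd

# Hypothetical stand-ins for the JSON returned by each profile request.
fake_responses = {
    4100: {"user": {"profile": {"Name": "Ann"}}},
    4101: {"user": {"profile": {"Name": "Bob"}}},
}

rows = []
for cid, jdata in fake_responses.items():
    # One dict per page; the keys become the column names.
    rows.append({"Customer_id": cid, "Name": jdata["user"]["profile"]["Name"]})

# Build the DataFrame once, after the loop.
df = pd.DataFrame(rows, columns=["Customer_id", "Name"])
print(df)
```

Passing `columns` explicitly fixes the column order and still yields the expected (empty) columns if no page was scraped.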
I'm going over files in a folder, and I want to merge the datasets based on the variable called key. This is my code so far, along with an example of what the datasets might look like and what I expect the final result to look like:
dfs=[]
for f in files:
    for name, sheet in sheets_dict.items():
        if name=="Main":
            data = sheet
            dfs.append(data)
Example of dfs:
df1 = {'key': ["A","B"], 'Answer':["yes","No"]}
df1 = pd.DataFrame(data=df1)
df2={'key': ["A","C"], 'Answer':["No","c"]}
df2 = pd.DataFrame(data=df2)
final output
final={'A': ["yes","No"], 'B':["No",""],'C':["","c"],'file':['df1','df2']}
final = pd.DataFrame(data=final)
This is what I have tried but I can't make it work:
df_key={'key': ["A","B","C"]}
df_key = pd.DataFrame(data=df_key)
df_final=[]
for df in dfs:
    temp= pd.merge(df_key[['key']],df, on=['key'], how = 'left')
    temp_t= temp.transpose()
    df_final.append(temp_t)
Reshaping and concatenating the dataframes is pretty straightforward. But in order to add the file value you will need to either a) have the names of the dataframes in a list of strings, or b) generate new names as you go along.
Here is the code
dfs = [df1, df2]  # populate dfs as needed
master_df = []
# no separate key dataframe is needed: pd.concat aligns the series on the union of their keys
for i, df in enumerate(dfs):
    df = df.set_index('key').squeeze()  # one 'Answer' series indexed by key
    df.loc['file'] = f'df{i+1}'
    master_df.append(df)
# or iterate the dfs alongside their file names
# for fname, df in zip(file_names, dfs):
#     df = df.set_index('key').squeeze()
#     df.loc['file'] = fname
#     master_df.append(df)
master_df = pd.concat(master_df, axis=1).T
# rearrange columns
master_df = master_df[
    master_df.columns.difference(['file']).to_list() + ['file']
]
# fill NaNs with empty string
master_df.fillna('', inplace=True)
Output
          A   B  C file
Answer  yes  No     df1
Answer   No      c  df2
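A shorter variant of the same reshaping is to hand pd.concat a dict that maps each file name to its 'Answer' series, so that the file names travel with the data. This is a sketch under the assumption that each frame has exactly the columns key and Answer; the dict `frames` and the name `wide` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B"], "Answer": ["yes", "No"]})
df2 = pd.DataFrame({"key": ["A", "C"], "Answer": ["No", "c"]})
frames = {"df1": df1, "df2": df2}  # file name -> dataframe

# The dict keys become column labels; transposing gives one row per file.
wide = pd.concat(
    {name: f.set_index("key")["Answer"] for name, f in frames.items()},
    axis=1,
).T

# Put the key columns in a fixed order and blank out the missing answers.
wide = wide.reindex(columns=sorted(wide.columns)).fillna("")

# Move the file names from the index into an ordinary column.
wide["file"] = wide.index
wide = wide.reset_index(drop=True)
print(wide)
```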
I need to write a program that loops through the data, resetting each variable based on the cell value, with each column representing one variable.
The variables in the exercise are dependent on these values that are being looped through.
How can I loop through the rows with each iteration of the loop increasing the value by 1?
df=pd.DataFrame(r'C:/Users/user.name/Desktop/P_list.xlsx',sheet_name = 'Sheet1')
for i in range(0,5000):
    df2 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx',sheet_name = 'Sheet1'), index = list(range(i,5000,1), columns=list(range(0)))
    df3 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx',sheet_name = 'Sheet1'), index = list(range(i,5000,1), columns=list(range(1)))
    df4 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx',sheet_name = 'Sheet1'), index = list(range(i,5000,1), columns=list(range(2)))
    df5 = pd.read_excel(r'C:/Users/user.name/Desktop/P_list.xlsx',sheet_name = 'Sheet1'), index = list(range(i,5000,1), columns=list(range(3)))
    firstname = df2
    lastname = df3
    address = df4
    number= df5
    #performed exercise
I have tried this on Jupyter. This is needed to load the Excel to df:
import numpy as np
import pandas as pd
import xlrd
df = pd.read_excel('Sample.xlsx', sheet_name = 0)
Then looping over the column names is like this:
for col in df.columns:
    print(col)
And looping over the data, column by column, is this:
for col in df.columns:
    print("NEW COLUMN ----------")
    for val in df[col]:
        print(val)
Another way to do it is to loop through the columns and the rows by position:
columns = len(df.columns)
rows = len(df)
for column in range(columns):
    print("-----------")
    for row in range(rows):
        print(df.iloc[row, column])  # iloc selects by integer position
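If the goal is to walk down the rows and have each column available as its own variable on every iteration, itertuples may be a closer fit than nested loops. This is a minimal sketch: the small frame and its column names are made up to stand in for the Excel sheet, which would normally be read once with pd.read_excel before the loop.

```python
import pandas as pd

# Made-up stand-in for the sheet; the column names are assumptions.
df = pd.DataFrame({
    "firstname": ["Ada", "Alan"],
    "lastname": ["Lovelace", "Turing"],
    "address": ["London", "Wilmslow"],
    "number": [1, 2],
})

seen = []
for row in df.itertuples(index=False):
    # Each row unpacks into one variable per column, in column order.
    firstname, lastname, address, number = row
    seen.append(f"{firstname} {lastname}, {address}, {number}")

print(seen)
```

Reading the file once and iterating over the loaded frame avoids re-opening the workbook on every pass of the loop.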
I am trying to parse multiple excel sheets with Pandas into separate individual DataFrames.
My code so far is:
sheet_names =[tab1, tab2]
df_names = [1,2]
def initilize_dataframes(sheet_names):
    for name in sheet_names:
        df = xls_file.parse(name) #parse the xlxs sheet
        df = df.transpose() #transpose dates to index
        new_header = df.iloc[0] #column header names
        df = df[1:] #drop 1st row
        df.rename(columns=new_header, inplace= True) #rename the columns
        return df

for i in df_names:
    df_(i) = initilize_dataframes(sheet_names)#something like this idk
The last two lines I can not wrap my head around. I get that the function will return the df, but I would like it to take the values from the df_names list. And label the DataFrame accordingly.
For example, tab1 in the excel sheet the DataFrame should be named df_1 and looping for tab2 and df_2 respectively.
It is possible with globals():
for i, val in enumerate(df_names):
    globals()['df_' + str(val)] = initilize_dataframes(sheet_names[i])
But it is better to use a dict of DataFrames. enumerate counts from 0, so i selects the matching entry of sheet_names by position, while val supplies the key. Both versions assume initilize_dataframes is changed to take a single sheet name instead of looping over the whole list:
dfs = {}
for i, val in enumerate(df_names):
    dfs[val] = initilize_dataframes(sheet_names[i])
print (dfs[1])
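The dict approach can be tried without an Excel file. In the sketch below, `raw_sheets` is a made-up stand-in for what xls_file.parse(name) would return, the sheet layout (dates across the columns) is an assumption, and `tidy` is a hypothetical single-sheet version of the question's function.

```python
import pandas as pd

# Made-up stand-ins for xls_file.parse(name); the layout is an assumption.
raw_sheets = {
    "tab1": pd.DataFrame({"metric": ["sales", "costs"], "2020-01": [1, 2], "2020-02": [3, 4]}),
    "tab2": pd.DataFrame({"metric": ["sales", "costs"], "2020-01": [5, 6], "2020-02": [7, 8]}),
}

def tidy(df):
    """Transpose so dates become the index and the first row becomes the header."""
    df = df.transpose()
    df.columns = df.iloc[0]  # first transposed row holds the header names
    return df.iloc[1:]       # drop that header row from the data

sheet_names = ["tab1", "tab2"]
df_names = [1, 2]

# zip pairs each label with its sheet, so no positional index is needed.
dfs = {label: tidy(raw_sheets[name]) for label, name in zip(df_names, sheet_names)}
print(dfs[1])
```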
I have a problem with appending to a dataframe.
I am trying to execute this code:
df_all = pd.read_csv('data.csv', error_bad_lines=False, chunksize=1000000)
urls = pd.read_excel('url_june.xlsx')
substr = urls.url.values.tolist()
df_res = pd.DataFrame()
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        df_res.append(res)
And when I try to save df_res I get empty dataframe.
df_all looks like
ID,"url","used_at","active_seconds"
b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:25,1
b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:31,30
f85ce4b2f8787d48edc8612b2ccaca83,"4pda.ru/forum/index.php?showtopic=634566&view=getnewpost",2015-10-01 00:01:49,2
d3b0ef7d85dbb4dbb75e8a5950bad225,"shop.mts.ru/smartfony/mts/smartfon-smart-sprint-4g-sim-lock-white.html?utm_source=admitad&utm_medium=cpa&utm_content=300&utm_campaign=gde_cpa&uid=3",2015-10-01 00:03:19,34
078d388438ebf1d4142808f58fb66c87,"market.yandex.ru/product/12675734/spec?hid=91491&track=char",2015-10-01 00:03:48,2
d3b0ef7d85dbb4dbb75e8a5950bad225,"avito.ru/yoshkar-ola/telefony/mts",2015-10-01 00:04:21,4
d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:25,1
d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:26,9
and urls looks like
url
shoppingcart.aliexpress.com/order/confirm_order
ozon.ru/?context=order_done&number=
lk.wildberries.ru/basket/orderconfirmed
lamoda.ru/checkout/onepage/success/quick
mvideo.ru/confirmation?_requestid=
eldorado.ru/personal/order.php?step=confirm
When I print res in the loop, it isn't empty. But when I print df_res in the loop after the append, it is an empty dataframe.
I can't find my error. How can I fix it?
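If you look at the documentation for pd.DataFrame.append:
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
The key phrase is "returning a new object": append does not modify df_res in place, so the result has to be assigned back.
Try
df_res = df_res.append(res)
(DataFrame.append was removed in pandas 2.0; on current versions use df_res = pd.concat([df_res, res]) instead.)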
Incidentally, note that pandas isn't that efficient for creating a DataFrame by successive concatenations. You might try this, instead:
all_res = []
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        all_res.append(res)
df_res = pd.concat(all_res)
This first creates a list of all the parts, then creates a DataFrame from all of them once at the end.
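One more thing to watch in this particular loop: str.contains treats its argument as a regular expression by default, so a URL fragment such as "ozon.ru/?context=..." is not matched literally (the "?" is read as a quantifier). The sketch below escapes the fragments and combines them into a single pattern. The two small frames in `chunks` are made-up stand-ins for the chunked CSV reader.

```python
import re
import pandas as pd

# Made-up stand-ins: two small "chunks" in place of the chunked CSV reader.
chunks = [
    pd.DataFrame({"url": ["shoppingcart.aliexpress.com/order/confirm_order", "avito.ru/telefony"]}),
    pd.DataFrame({"url": ["ozon.ru/?context=order_done&number=123", "4pda.ru/forum"]}),
]
substr = [
    "shoppingcart.aliexpress.com/order/confirm_order",
    "ozon.ru/?context=order_done&number=",
]

# Escape each fragment so "?" and "." match literally, then test all
# fragments in one pass per chunk instead of one pass per fragment.
pattern = "|".join(re.escape(s) for s in substr)

parts = [chunk[chunk["url"].str.contains(pattern)] for chunk in chunks]
df_res = pd.concat(parts, ignore_index=True)
print(df_res)
```

Combining the fragments also avoids collecting the same row twice when a URL happens to match more than one fragment.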
If we want to append rows selected by position instead (here index is the position of the row of interest):
df_res = pd.DataFrame(data = None, columns= df.columns)
all_res = []
d1 = df.iloc[index-10:index]  # the 10 rows before position index (.ix was removed in pandas 1.0)
all_res.append(d1)
df_res = pd.concat(all_res)