isd = pd.DataFrame()
ind = pd.DataFrame()
exd = pd.DataFrame()
psd = pd.DataFrame()
visd = pd.DataFrame()
vind = pd.DataFrame()
vexd = pd.DataFrame()
sd = pd.DataFrame()
ise = pd.DataFrame()
idb = pd.DataFrame()
mdd = pd.DataFrame()
add = pd.DataFrame()
Is there any alternate way to make it elegant and faster?
Use a dictionary of dataframes, especially if the code for some of the dataframes is going to share some similarities. This allows doing some operations using loops or functions:
dct = {n: pd.DataFrame() for n in ['isd', 'ind', 'exd']}
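For example, a setup or post-processing step shared by all the frames becomes a single loop (the assign(source=...) call here is just an illustrative operation, not something from the question):

```python
import pandas as pd

# one DataFrame per name, created in a single comprehension
dct = {n: pd.DataFrame() for n in ['isd', 'ind', 'exd']}

# a step shared by every frame becomes one loop
# instead of one line per variable
for name in dct:
    dct[name] = dct[name].assign(source=name)

print(sorted(dct))  # ['exd', 'ind', 'isd']
```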
If you want to avoid needing to numerically index each of the DataFrames, but would rather be able to access them directly by their name:
import pandas as pd
table_names = ['df1', 'df2', 'df3']
for name in table_names:
    exec('%s = pd.DataFrame()' % name, locals(), locals())
print(df1)
This approach uses exec, which essentially runs a string as if it were Python code. Each of the predetermined names is formatted into the string inside the for-loop.
You can do something like this:
dfs = ['isd', 'ind', 'exd']
df_list = [pd.DataFrame() for _ in dfs]
I think you can go this way:
import pandas as pd
a, b, c, d = [pd.DataFrame() for _ in range(4)]
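One pitfall to avoid here: [pd.DataFrame()]*4 repeats a reference to a single object, so all four names would end up sharing one DataFrame:

```python
import pandas as pd

shared = [pd.DataFrame()] * 4         # four references to ONE object
print(shared[0] is shared[1])         # True

distinct = [pd.DataFrame() for _ in range(4)]  # four separate objects
print(distinct[0] is distinct[1])     # False
```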
I need to scrape hundreds of pages, and instead of storing the whole JSON of each page I want to store just several columns from each page in a pandas dataframe. However, I have a problem at the beginning, when the dataframe is still empty with no columns or rows, so the loop below is not working correctly:
import pandas as pd
import requests
cids = [4100,4101,4102,4103,4104]
df = pd.DataFrame()
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df['Customer_id'] = i
    df['Name'] = jdata['user']['profile']['Name']
...
In this case, what should I do?
You can solve this by using enumerate(), together with loc:
for index, i in enumerate(cids):
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df.loc[index, 'Customer_id'] = i
    df.loc[index, 'Name'] = jdata['user']['profile']['Name']
If you specify your column names when you create your empty dataframe, as follows:
df = pd.DataFrame(columns = ['Customer_id', 'Name'])
Then you can just append your new data using:
df = df.append({'Customer_id' : i, 'Name' : jdata['user']['profile']['Name']}, ignore_index=True)
(plus any other columns you populate), adding a row to the dataframe on each iteration of your for loop. Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions collect rows and concatenate once instead.
import pandas as pd
import requests
cids = [4100,4101,4102,4103,4104]
df = pd.DataFrame(columns = ['Customer_id', 'Name'])
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df = df.append({'Customer_id' : i, 'Name' : jdata['user']['profile']['Name']}, ignore_index=True)
It should be noted that using append on a DataFrame in a loop is usually inefficient, so a better way is to save your results as a list of lists (df_data) and then turn that into a DataFrame once at the end, as below:
cids = [4100,4101,4102,4103,4104]
df_data = []
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df_data.append([i, jdata['user']['profile']['Name']])
df = pd.DataFrame(df_data, columns = ['Customer_id', 'Name'])
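The same pattern also works with a list of dicts, which lets pandas infer the column names. In this runnable sketch the HTTP call is replaced with a hypothetical fake_pages stub that mirrors the JSON shape from the question:

```python
import pandas as pd

cids = [4100, 4101]
# stand-in for the parsed JSON of each page (hypothetical shape from the question)
fake_pages = {
    4100: {'user': {'profile': {'Name': 'Alice'}}},
    4101: {'user': {'profile': {'Name': 'Bob'}}},
}

rows = []
for i in cids:
    jdata = fake_pages[i]          # in real code: requests.get(...).json()
    rows.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']})

df = pd.DataFrame(rows)            # columns inferred from the dict keys
print(df)
```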
I want to create n empty dataframes using for loop.
Something like :
import pandas as pd
n=6
for i in range(0,n):
    df(i) = pd.DataFrame()
Output like:
df1,df2,df3,df4........dfn
You could store them in a list of dataframes:
dfs = []
n = 6
for i in range(n):
    dfs.append(pd.DataFrame())
An alternative would be using a dictionary with meaningful names (these could of course also just be the numbers 1 to 6):
names = ['df1', 'df2', 'df3']
dfs = {}
for name in names:
    dfs[name] = pd.DataFrame()
Create dataframes and append them to a list:
df_list = list()
for i in range(6):
    d_one = pd.DataFrame()
    df_list.append(d_one)
Access an individual dataframe by simply indexing normally:
df_list[0]
The short answer for the example you gave is, you don't.
This is what collections (lists, dicts) are for.
With a comprehension it's a fairly trivial task.
# as a list
list_of_df = [pd.DataFrame() for _ in range(n)]
print(list_of_df[0])
If you still want to refer to them by their name it might make more sense in the form of a dict.
dict_of_df = {f'df{i}': pd.DataFrame() for i in range(1, n + 1)}
print(dict_of_df['df1'])
Although it is possible to modify the dict of globals(), if you use an IDE, your linter will hate you, and you'll be fighting it at every corner.
# don't do this
for i in range(1, n + 1):
    globals()[f'df{i}'] = pd.DataFrame()
print(df1)
It's a hackier way of creating your own dict, and if you do it you'll have to hard-code the variable names anyway.
I'm going over files in a folder, and I want to merge the datasets based on the variable called key. This is my code so far, along with an example of what the datasets might look like and what I expect the final result to look like:
dfs=[]
for f in files:
    for name, sheet in sheets_dict.items():
        if name=="Main":
            data = sheet
            dfs.append(data)
Example of dfs:
df1 = {'key': ["A","B"], 'Answer':["yes","No"]}
df1 = pd.DataFrame(data=df1)
df2={'key': ["A","C"], 'Answer':["No","c"]}
df2 = pd.DataFrame(data=df2)
Final output:
final={'A': ["yes","No"], 'B':["No",""],'C':["","c"],'file':['df1','df2']}
final = pd.DataFrame(data=final)
This is what I have tried but I can't make it work:
df_key={'key': ["A","B","C"]}
df_key = pd.DataFrame(data=df_key)
df_final=[]
for df in dfs:
    temp = pd.merge(df_key[['key']], df, on=['key'], how='left')
    temp_t = temp.transpose()
    df_final.append(temp_t)
Reshaping and concatenating the dataframes is pretty straightforward. But in order to add the file value you will need to either a) have the names of the dataframes in a list of strings, or b) generate new names as you go along.
Here is the code:
dfs = [df1, df2] # populate dfs as needed
master_df = []
df_key = {'key': ["A","B","C"]}
df_key = pd.DataFrame(df_key) # assuming you already have this dataframe created
master_df.append(pd.Series(index=df_key.columns))
for i, df in enumerate(dfs):
    df = df.set_index('key').squeeze()
    df.loc['file'] = f'df{i+1}'
    master_df.append(df)
# or iterate the dfs alongside their file names
# for fname, df in zip(file_names, dfs):
#     df = df.set_index('key').squeeze()
#     df.loc['file'] = fname
#     master_df.append(df)
master_df = pd.concat(master_df, axis=1).T
# rearrange columns
master_df = master_df[
    master_df.columns.difference(['file']).to_list() + ['file']
]
# fill NaNs with empty string
master_df.fillna('', inplace=True)
Output
          A   B  C file
Answer  yes  No      df1
Answer   No      c   df2
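A compact, runnable version of the same idea, using the df1/df2 examples from the question and reindex so that every key appears as a column even when a frame is missing it:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'Answer': ['yes', 'No']})
df2 = pd.DataFrame({'key': ['A', 'C'], 'Answer': ['No', 'c']})
keys = ['A', 'B', 'C']

rows = []
for i, df in enumerate([df1, df2], start=1):
    s = df.set_index('key')['Answer'].reindex(keys)  # align on all keys
    s['file'] = f'df{i}'                             # tag the source frame
    rows.append(s)

final = pd.concat(rows, axis=1).T.fillna('')
print(final)
```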
I am trying to take a section of an existing pandas dataframe and duplicate that section, with some updates, in a loop. Basically, for all 273 rows of the section, I want to update the person's "GivenName" by replacing "Name1" with "Name2", "Name3"..."Name5".
data1 = data[0:273] #creating the subset
data2 = data1.copy()
df = []
for i in range(4):
    data2["GivenName"] = "Name"+str(i+2) #for all 273 rows replace name
    df.append(data2)
appended_data = pd.concat(df)
What I end up instead is with a dataframe where only the last value "Name5" is appended 4 times instead of "Name2", "Name3"..."Name5" etc. How can I update the "GivenName" values for each iteration and append all results?
Or as a one-liner:
pd.concat(data[0:273].assign(GivenName=f'Name{i+2}') for i in range(4))
What's happening is your list df is just getting four references to the same data2 DataFrame. In other words, the list looks like this:
[
    data2,
    data2,
    data2,
    data2
]
and you're setting data2["GivenName"] = "Name5" in the final iteration. The most straightforward way to get the behavior you're expecting is moving the DataFrame copy into the for loop:
df = []
for i in range(4):
    data2 = data1.copy()
    data2["GivenName"] = "Name"+str(i+2) #for all 273 rows replace name
    df.append(data2)
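With the copy moved inside the loop, each list element is a distinct frame. Here is a runnable miniature of that fix, where data1 is a small stand-in for the 273-row slice from the question:

```python
import pandas as pd

data1 = pd.DataFrame({'GivenName': ['Name1'] * 3})  # stand-in for the 273-row slice

frames = []
for i in range(4):
    d = data1.copy()                    # fresh copy each iteration
    d['GivenName'] = 'Name' + str(i + 2)
    frames.append(d)

appended = pd.concat(frames, ignore_index=True)
print(appended['GivenName'].unique())   # ['Name2' 'Name3' 'Name4' 'Name5']
```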
There are a few issues here:
(1) df = [] creates a list, not a dataframe. Try df = pd.DataFrame()
(2) df.append(data2) should be df = df.append(data2) because append does not happen in-place.
data1 = data[0:273] #creating the subset
data2 = data1.copy()
df = pd.DataFrame()
for i in range(4):
    data2["GivenName"] = "Name"+str(i+2) #for all 273 rows replace name
    df = df.append(data2)
# df is now the combined result; the original's final pd.concat(df) call
# is unnecessary and would raise a TypeError on a single DataFrame
I have a problem with appending to a dataframe. I am trying to execute this code:
df_all = pd.read_csv('data.csv', error_bad_lines=False, chunksize=1000000)
urls = pd.read_excel('url_june.xlsx')
substr = urls.url.values.tolist()
df_res = pd.DataFrame()
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        df_res.append(res)
And when I try to save df_res I get an empty dataframe.
df_all looks like
ID,"url","used_at","active_seconds"
b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:25,1
b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:31,30
f85ce4b2f8787d48edc8612b2ccaca83,"4pda.ru/forum/index.php?showtopic=634566&view=getnewpost",2015-10-01 00:01:49,2
d3b0ef7d85dbb4dbb75e8a5950bad225,"shop.mts.ru/smartfony/mts/smartfon-smart-sprint-4g-sim-lock-white.html?utm_source=admitad&utm_medium=cpa&utm_content=300&utm_campaign=gde_cpa&uid=3",2015-10-01 00:03:19,34
078d388438ebf1d4142808f58fb66c87,"market.yandex.ru/product/12675734/spec?hid=91491&track=char",2015-10-01 00:03:48,2
d3b0ef7d85dbb4dbb75e8a5950bad225,"avito.ru/yoshkar-ola/telefony/mts",2015-10-01 00:04:21,4
d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:25,1
d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:26,9
and urls looks like
url
shoppingcart.aliexpress.com/order/confirm_order
ozon.ru/?context=order_done&number=
lk.wildberries.ru/basket/orderconfirmed
lamoda.ru/checkout/onepage/success/quick
mvideo.ru/confirmation?_requestid=
eldorado.ru/personal/order.php?step=confirm
When I print res inside the loop it isn't empty. But when I print df_res after the append, it is still an empty dataframe.
I can't find my error. How can I fix it?
If you look at the documentation for pd.DataFrame.append
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
(emphasis mine).
Try
df_res = df_res.append(res)
Incidentally, note that pandas isn't that efficient at building a DataFrame by successive appends (and DataFrame.append has been removed entirely in pandas 2.0). You might try this instead:
all_res = []
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        all_res.append(res)
df_res = pd.concat(all_res)
This first creates a list of all the parts, then creates a DataFrame from all of them once at the end.
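As a further performance note, str.contains accepts a regular expression, so the inner loop over substrings can be collapsed into one combined pattern and a single scan per chunk. A sketch with toy data from the question, using re.escape so characters like ? and . match literally:

```python
import re
import pandas as pd

df = pd.DataFrame({'url': ['shoppingcart.aliexpress.com/order/confirm_order',
                           'avito.ru/yoshkar-ola/telefony/mts']})
substr = ['shoppingcart.aliexpress.com/order/confirm_order',
          'ozon.ru/?context=order_done&number=']

# escape each substring so it matches literally, then OR them together
pattern = '|'.join(map(re.escape, substr))
df_res = df[df['url'].str.contains(pattern)]
print(len(df_res))  # 1
```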
If you want to append rows selected by index position:
all_res = []
d1 = df.iloc[index-10:index]  # takes the 10 rows before the i-th index (.ix is removed in modern pandas)
all_res.append(d1)
df_res = pd.concat(all_res)