Create URLs for different data frames - python

I have a data frame that I split into chunks of 100 rows (so that Python can process it). This gives me several data frames (df1 to df..). For each of those data frames, I want to create a URL as shown below.
When I use type(df), it tells me it is a data frame; however, when I use for j in dfs: print(type(j)), it tells me j is a string. I need the data frame itself to be able to create the URL.
Can you please help me with what the loop for creating the URLs for all data frames could look like?
Thank you so much for your help!
df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')

n = 100  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

dfs = {}
for idx, df in enumerate(list_df, 1):
    dfs[f'df{idx}'] = df

type(df1)

for j in dfs:
    print(type(j))

def create_url():
    url = "https://api.twitter.com/2/tweets?{}&{}".format("ids=" + str(df1['id'].tolist())[1:-1].replace(" ", ""), tweet_fields)
    return url

dfs is a dictionary, so for j in dfs: gives you only the keys, which are strings.
You need .values()
for j in dfs.values():
or .items()
for key, j in dfs.items():
or you have to use dfs[j]
for j in dfs:
    print(type(dfs[j]))
EDIT:
Frankly, you could do it all in one loop, without list_df:
import pandas as pd

#df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')
df = pd.DataFrame({'id': range(1000)})

tweet_fields = 'something'

n = 100  # chunk row size
for i in range(0, df.shape[0], n):
    ids = df[i:i+n]['id'].tolist()
    ids_str = ','.join(str(x) for x in ids)
    url = "https://api.twitter.com/2/tweets?ids={}&{}".format(ids_str, tweet_fields)
    print(url)
You can also use groupby on the index, if the index is the default numbering 0, 1, ...:
for i, group in df.groupby(df.index // 100):
    ids = group['id'].tolist()
    ids_str = ','.join(str(x) for x in ids)
    url = "https://api.twitter.com/2/tweets?ids={}&{}".format(ids_str, tweet_fields)
    print(url)
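As a side note, instead of building the URL string by hand, you can let requests encode the query string for you. A minimal sketch, reusing ids_str from the loop above and assuming a hypothetical bearer token (tweet.fields is the Twitter API v2 parameter name; 'created_at' is just an example value):
import requests

# YOUR_BEARER_TOKEN is a placeholder; the Twitter API v2 endpoint requires authentication
headers = {"Authorization": "Bearer YOUR_BEARER_TOKEN"}
params = {"ids": ids_str, "tweet.fields": "created_at"}
response = requests.get("https://api.twitter.com/2/tweets", params=params, headers=headers)
print(response.url)     # the fully encoded URL that was actually requested
print(response.json())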

Related

How to keep no return (zero rows) in a concatenation loop?

I am using code for a query; sometimes there is no return for an input (it does not find anything, so the result is an empty row). However, when I use pd.concat, those empty rows disappear. Is there a way to keep these no-return rows in the loop as well, so that the final output.csv also contains the empty rows?
import numpy as np
import pandas as pd
from dl import authClient as ac, queryClient as qc
from dl.helpers.utils import convert
import openpyxl as xl

wb = xl.load_workbook('/Users/somethingfile.xlsx')
sheet = wb['Sheet 1']

df = pd.DataFrame([], columns=['col1','col2',...,'coln'])

for row in range(3, sheet.max_row + 1):
    a0, b0, r = sheet.cell(row,1).value, sheet.cell(row,2).value, 0.001
    query = """
    SELECT a,b,c,d,e FROM smthng
    WHERE q3c_radial_query(a,b,{:f},{:f},{:f}) LIMIT 1
    """.format(a0, b0, r)
    response = qc.query(sql=query, format='csv')
    temp_df = convert(response, 'pandas')
    df = pd.concat([df, temp_df])

df.to_csv('output.csv')
For your specific question, it works if you check whether temp_df is empty in each step and, if it is, replace it with a one-row DataFrame of NaNs.
Another note on the implementation: concatenating in each iteration is a very expensive operation. It is much faster to store the temp_dfs in a list and concatenate once after the loop is over.
lst = []  # <-- empty list to fill later
for row in range(3, sheet.max_row + 1):
    a0, b0, r = sheet.cell(row,1).value, sheet.cell(row,2).value, 0.001
    query = """
    SELECT a,b,c,d,e FROM smthng
    WHERE q3c_radial_query(a,b,{:f},{:f},{:f}) LIMIT 1
    """.format(a0, b0, r)
    response = qc.query(sql=query, format='csv')
    temp_df = convert(response, 'pandas')
    if temp_df.empty:
        temp_df = pd.DataFrame([np.nan])
    lst.append(temp_df)

df = pd.concat(lst)  # <-- concat once
df.to_csv('output.csv', index=False)
So, as far as I understand, the problem is that when temp_df is empty you want to add a blank row. You should be able to do that using the .append() method, appending an empty Series (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):
if len(temp_df) == 0:
    temp_df = temp_df.append(pd.Series(dtype=object), ignore_index=True)
# Then concat...
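Since DataFrame.append is gone in modern pandas, here is an equivalent sketch using pd.concat: pd.DataFrame([{}]) is a one-row, zero-column frame, so concatenating it onto temp_df yields a single all-NaN row under temp_df's existing columns.
import pandas as pd

if len(temp_df) == 0:
    # add one blank (all-NaN) row while keeping temp_df's columns
    temp_df = pd.concat([temp_df, pd.DataFrame([{}])], ignore_index=True)
# Then concat as before...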

Is it possible to merge data from several function calls into one DataFrame?

The function below is called with a name from a loop and produces 10 or more data sets. I want to merge them into one DataFrame and write it to a CSV file.
def f(name):
    url = "https://apiurl"
    response = requests.get(url, params={'page': 1})
    records = []
    for page_number in range(1, response.json().get("pages") + 1):
        response = requests.get(url, params={'page': page_number})
        records += response.json().get('records')
    df = pd.DataFrame(records)
    return df
The loop that calls the function:
for row in valdf.itertuples():
    name = valdf.loc[row.Index, 'Account_ID']
    df1 = f(name)
    print(df1)
When I tried df.to_csv('df.csv'), it only takes the last result from the loop. Is it possible to merge them into one DataFrame and export that?
Create a list outside of the loop and use pd.concat():
dfs = []
for row in valdf.itertuples():
    name = valdf.loc[row.Index, 'Account_ID']
    df1 = f(name)
    dfs.append(df1)

all_df = pd.concat(dfs)
This assumes the DataFrames all have the same columns.
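If you also want to know which account each row came from, a small sketch that tags every chunk before concatenating (the Account_ID column name comes from the question; the output file name is an assumption):
dfs = []
for row in valdf.itertuples():
    name = valdf.loc[row.Index, 'Account_ID']
    df1 = f(name)
    df1['Account_ID'] = name  # tag each row with its source account
    dfs.append(df1)

all_df = pd.concat(dfs, ignore_index=True)
all_df.to_csv('df.csv', index=False)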

How to fill, cell by cell, an empty pandas dataframe which has zero columns, with a loop?

I need to scrape hundreds of pages, and instead of storing the whole JSON of each page, I want to store just a few columns from each page in a pandas dataframe. However, at the beginning, when the dataframe is empty, I have a problem: I need to fill an empty dataframe without any columns or rows. So the loop below is not working correctly:
import pandas as pd
import requests

cids = [4100, 4101, 4102, 4103, 4104]
df = pd.DataFrame()
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df['Customer_id'] = i
    df['Name'] = jdata['user']['profile']['Name']
    ...
In this case, what should I do?
You can solve this by using enumerate() together with .loc:
for index, i in enumerate(cids):
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df.loc[index, 'Customer_id'] = i
    df.loc[index, 'Name'] = jdata['user']['profile']['Name']
If you specify your column names when you create your empty dataframe, as follows:
df = pd.DataFrame(columns=['Customer_id', 'Name'])
then you can append your new data using:
df = df.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']}, ignore_index=True)
(plus any other columns you populate), which adds a row to the dataframe on each iteration of your for loop.
import pandas as pd
import requests

cids = [4100, 4101, 4102, 4103, 4104]
df = pd.DataFrame(columns=['Customer_id', 'Name'])
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df = df.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']}, ignore_index=True)
It should be noted that using append on a DataFrame in a loop is usually inefficient (and DataFrame.append was removed entirely in pandas 2.0), so a better way is to save your results as a list of lists (df_data) and then turn that into a DataFrame, as below:
cids = [4100, 4101, 4102, 4103, 4104]
df_data = []
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df_data.append([i, jdata['user']['profile']['Name']])

df = pd.DataFrame(df_data, columns=['Customer_id', 'Name'])
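Equivalently, you can collect a list of dicts, which keeps the column names next to the values; a sketch under the same assumed JSON layout:
records = []
for i in cids:
    jdata = requests.get(f'myurl/{i}/profile').json()
    records.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']})

df = pd.DataFrame.from_records(records)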

How to create a loop for changing data frame names

I have several data frames, and for all of them I want to carry out the same procedure. As an example, for one data frame (df1), the code looks like the following:
def create_url():
    df_id1 = str(df1['id'].tolist())[1:-1].replace(" ", "")
    ids1 = "ids=" + df_id1
    url1 = "https://api.twitter.com/2/tweets?{}&{}".format(ids1, tweet_fields)
    return url1
However, I want to loop this code over all the dfs I have, in order to get one URL per df (named url1, url2, etc.). To get the whole list of df names, I used the following code:
for j in dfs:
    print(j)
Hope somebody can help!
Thank you a lot in advance!
If I understand correctly, try this:
def create_url(dfx):
    df_id = str(dfx['id'].tolist())[1:-1].replace(" ", "")
    ids = "ids=" + df_id
    url = "https://api.twitter.com/2/tweets?{}&{}".format(ids, tweet_fields)
    return url

for j in dfs:
    print(j)
    print(create_url(dfs[j]))  # j is only the key, so look the data frame up in dfs
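If you want one URL per df without creating numbered variables by hand, a small sketch that collects them into a dict keyed by the df names (this reuses the dfs dictionary from the first question above):
urls = {key: create_url(dfx) for key, dfx in dfs.items()}
print(urls['df1'])  # the URL built from the first chunk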

How to write no more than 1 million records per Excel sheet from a python data frame

I have a python data frame with more than 50 million records. I want to write it to an Excel file where each sheet contains no more than 1 million records (Excel's hard limit is 1,048,576 rows per sheet).
You can use .iloc to access certain rows of your data and then dump them to Excel. Here's an example where 1000 rows are written per sheet; the same basic idea applies when you raise it to 1000000:
import pandas as pd

df = pd.DataFrame({'Val': [i for i in range(5000)]})

GROUP_LENGTH = 1000
writer = pd.ExcelWriter('test.xlsx')
for i in range(0, len(df), GROUP_LENGTH):
    print(i)
    df.iloc[i:i+GROUP_LENGTH].to_excel(writer, sheet_name='Row {}'.format(i))
writer.close()  # close() saves the file; writer.save() was removed in pandas 2.0
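With newer pandas you can also let a context manager handle saving and closing; a minimal sketch of the same idea:
with pd.ExcelWriter('test.xlsx') as writer:
    for i in range(0, len(df), GROUP_LENGTH):
        df.iloc[i:i+GROUP_LENGTH].to_excel(writer, sheet_name='Row {}'.format(i))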
An idea is to split your df into 50 dfs inside a list and write each one to its own sheet. Note that you need a single ExcelWriter; calling to_excel with a file name inside the loop would overwrite the file on every iteration:
writer = pd.ExcelWriter("file.xlsx")
for i in range(50):
    list_of_dfs[i].to_excel(writer, sheet_name=f"Sheet{i+1}")
writer.close()
First, split the data you have and save it in several variables. Here I have fetched 2.5 million rows from the database and split them into three (as many as you need in your case) variables:
pserializer = fetchdataSerializers(all_dataobj, many=True)

res = [item for item in pserializer.data if 1 <= item.get('id') <= 1000000]
res1 = [item for item in pserializer.data if 1000000 < item.get('id') <= 2000000]
res2 = [item for item in pserializer.data if item.get('id') > 2000000]
Then build three (as many as you need in your case) different dataframes from them (DataFrame.append, which this pattern used to rely on, was removed in pandas 2.0, so construct the frames directly):
df = pd.DataFrame(res)
df1 = pd.DataFrame(res1)
df2 = pd.DataFrame(res2)
Then write them into an Excel file with three (as many as you need in your case) different sub-sheets:
writer = pd.ExcelWriter('fetchdata_sheet15.xlsx')
df.to_excel(writer, sheet_name='Sheet1', index=False)
df1.to_excel(writer, sheet_name='Sheet2', index=False)
df2.to_excel(writer, sheet_name='Sheet3', index=False)
writer.close()
That is it. Check if it works for you. Thank you.
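A more generic sketch that avoids writing one variable per slice: np.array_split splits a DataFrame into roughly equal parts along the rows (the chunk count and file name here are assumptions):
import numpy as np
import pandas as pd

chunks = np.array_split(df, 3)  # three roughly equal row-wise parts
with pd.ExcelWriter('fetchdata_sheet15.xlsx') as writer:
    for i, chunk in enumerate(chunks, 1):
        chunk.to_excel(writer, sheet_name='Sheet{}'.format(i), index=False)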
