How to create a loop for changing data frame names - python

I have several data frames, and for all of them I want to carry out the same procedure. As an example, for one data frame (df1), the code looks as follows:
def create_url():
    df_id1 = (str(str((df1['id'].tolist()))[1:-1])).replace(" ", "")
    ids1 = "ids=" + df_id1
    url1 = "https://api.twitter.com/2/tweets?{}&{}".format(ids1, tweet_fields)
    return url1
However, I want to loop this code over all of my data frames in order to get one URL per data frame (named url1, url2, etc.). To get the whole list of data frame names, I used the following code:
for j in dfs:
    print(j)
Hope somebody can help!
Thank you a lot in advance!

If I understand correctly, try this:
def create_url(dfx):
    df_id = (str(str((dfx['id'].tolist()))[1:-1])).replace(" ", "")
    ids = "ids=" + df_id
    url = "https://api.twitter.com/2/tweets?{}&{}".format(ids, tweet_fields)
    return url

for j in dfs:
    print(j)
    create_url(j)
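Note that if dfs is a dictionary mapping names to data frames (as in the related question further down), iterating over it yields the name strings, so the frame itself has to be looked up. A minimal sketch of collecting one URL per data frame under that assumption, with tweet_fields already defined:

urls = {}
for name, frame in dfs.items():
    urls[name] = create_url(frame)  # e.g. urls['df1'], urls['df2'], ...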

Related

How can I optimally save my data in rows?

I'm a newbie in this community and I hope you can help me with my problem. In my current project I want to scrape a page listing gas stations, each with multiple pieces of information. Right now all the information from the petrol stations is stored in one variable, but I want each gas station to have its own row so that I get a large data frame. Each individual gas station has an id, and the ids are stored in the variable ids.
ids = results["objectID"].tolist()
id_details = []
for i, id in enumerate(ids):
    input_dict = {
        'diff_time_zone': -1,
        'objectID': id,
        'poiposition': '50.5397219 8.7328552',
        'stateAll': '2',
        'category': 1,
        'language': 'de',
        'prognosis_offset': -1,
        'windowSize': 305
    }
    encoded_input_string = json.dumps(input_dict, indent=2).encode('utf-8')
    encoded_input_string = base64.b64encode(encoded_input_string).decode("utf-8")
    r = s.post("https://example.me/getObject_detail.php", headers=headers, data="post=" + encoded_input_string)
    soup = BeautifulSoup(r.text, "lxml")
    lists = soup.find('div', class_='inside')
    rs = lists.find_all("p")
    final = []
    for lists in rs:
        txt = lists if type(lists) == NavigableString else lists.text
        id_details.append(txt)

df = pd.DataFrame(id_details, columns=['place'])
Well, personally I would use a database rather than a data frame in that case, and probably not save to a file at all. Since the data is dictionary-based, it could easily go into Elasticsearch, for example.
If there is some reason that rules out any kind of database and forces you to use a data frame, then opening the file and appending to the end of it works fine. You should make your chunks as large as possible, because opening and writing the file is the bottleneck here; I say chunks because RAM is not unlimited.
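As a minimal sketch of the database route (using SQLite from the standard library here instead of Elasticsearch, purely for illustration; the table layout and the scraped_rows iterable are assumptions, not part of the original code):

import sqlite3

conn = sqlite3.connect("stations.db")
conn.execute("CREATE TABLE IF NOT EXISTS details (object_id TEXT, place TEXT)")
for object_id, place in scraped_rows:  # hypothetical iterable of (id, text) pairs from the scraper
    conn.execute("INSERT INTO details VALUES (?, ?)", (object_id, place))
conn.commit()
conn.close()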
--- Update regarding the second way.
Some parts of your code are missing, but you will get the idea.
file_name = 'My_File.csv'
cols = ['place']  # e.g. creating an empty csv with only one column - place - using pandas
data = dict(zip(cols, [[] for i in range(len(cols))]))
df = pd.DataFrame(data)  # creating df
df.to_csv(file_name, mode='w', index=False, header=True)  # saving

id_details = {'place': []}
for i, id in enumerate(ids):
    # Some algo...
    for lists in rs:
        id_details['place'].append(txt)
    if i % 100 == 0:
        df_main = pd.DataFrame(id_details)
        df_main.to_csv(file_name, mode='a', index=False, header=False)
        id_details['place'] = []

# flush whatever is left over after the loop
df_main = pd.DataFrame(id_details)
df_main.to_csv(file_name, mode='a', index=False, header=False)

Scraping Table Data from Multiple URLs, but first link is repeating

I'm looking to iterate through the URL with "count" as a variable running from 1 to 65.
Right now I'm close, but I'm really struggling to figure out the last piece: I'm receiving the same table (from count 1) 65 times, instead of receiving the different tables.
import requests
import pandas as pd

url = 'https://basketball.realgm.com/international/stats/2023/Averages/Qualified/All/player/All/desc/{count}'
res = []
for count in range(1, 65):
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list[-1]
    res.append(df)
    print(res)
df.to_csv('my data.csv')
Any thoughts?
A few errors:
Your URL is templated incorrectly: it stays at .../{count} literally, because a plain string never substitutes the loop variable (you would need an f-string or .format).
If you want to get pages 1 to 65, use range(1, 66).
Unless you want to export only the last dataframe, you need to concatenate all of them first.
# No count here, we will add it later
url = 'https://basketball.realgm.com/international/stats/2023/Averages/Qualified/All/player/All/desc'

res = []
for count in range(1, 66):
    # pd.read_html accepts a URL too, so no need to make a separate request
    df_list = pd.read_html(f"{url}/{count}")
    res.append(df_list[-1])

pd.concat(res).to_csv('my data.csv')

Is it possible to merge several data sets into one DataFrame from a function?

The function uses a name from a loop and prints 10 or more data sets. I want to merge them into one DataFrame and write it to a CSV file.
def f(name):
    url = "https://apiurl"
    response = requests.get(url, params={'page': 1})
    records = []
    for page_number in range(1, response.json().get("pages") + 1):
        response = requests.get(url, params={'page': page_number})
        records += response.json().get('records')
    df = pd.DataFrame(records)
    return df
The for loop that calls the function:
for row in valdf.itertuples():
    name = valdf.loc[row.Index, 'Account_ID']
    df1 = f(name)
    print(df1)
When I tried df.to_csv('df.csv'), it only takes the last result from the loop. Is it possible to merge them into one DataFrame and export it?
Create a list outside of the loop and use pd.concat():
dfs = []
for row in valdf.itertuples():
    name = valdf.loc[row.Index, 'Account_ID']
    df1 = f(name)
    dfs.append(df1)

all_df = pd.concat(dfs)
This assumes the dfs all have the same columns; if they differ, pd.concat will by default align on the union of columns and fill the gaps with NaN.
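A small toy illustration of that behaviour (made-up frames, not the API data from the question):

import pandas as pd

a = pd.DataFrame({'id': [1, 2], 'x': [10, 20]})
b = pd.DataFrame({'id': [3], 'y': [30]})
print(pd.concat([a, b], ignore_index=True))  # 'x' and 'y' become NaN where a frame lacked that column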

Create URLs for different data frames

I have a data frame that I split into different data frames of size 100 (to make it manageable for Python to process).
Therefore, I get different data frames (df1 to df..). For all those data frames, I want to create a URL as shown below.
When I use type(df), it shows me it is a data frame; however, when I use for j in dfs: print(type(j)), it shows that j is a string. I need the data frame itself to be able to create the URL.
Can you please help me with what the loop for creating the URLs for all data frames could look like?
Thank you so much for your help!
df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')

n = 100  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

dfs = {}
for idx, df in enumerate(list_df, 1):
    dfs[f'df{idx}'] = df

type(df1)

for j in dfs:
    print(type(j))

def create_url():
    url = "https://api.twitter.com/2/tweets?{}&{}".format("ids=" + (str(str((df1['id'].tolist()))[1:-1])).replace(" ", ""), tweet_fields)
    return url
dfs is a dictionary, so for j in dfs: gives you only the keys, which are strings.
You need .values()
for j in dfs.values():
or .items()
for key, j in dfs.items():
or you have to use dfs[j]
for j in dfs:
    print(type( dfs[j] ))
EDIT:
Frankly, you could do it all in one loop without list_df
import pandas as pd

#df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')
df = pd.DataFrame({'id': range(1000)})
tweet_fields = 'something'

n = 100  # chunk row size
for i in range(0, df.shape[0], n):
    ids = df[i:i+n]['id'].tolist()
    ids_str = ','.join(str(x) for x in ids)
    url = "https://api.twitter.com/2/tweets?ids={}&{}".format(ids_str, tweet_fields)
    print(url)
You can also use groupby on the index, if the index uses consecutive numbers 0, 1, ...:
for i, group in df.groupby(df.index // 100):
    ids = group['id'].tolist()
    ids_str = ','.join(str(x) for x in ids)
    url = "https://api.twitter.com/2/tweets?ids={}&{}".format(ids_str, tweet_fields)
    print(url)

Why is my for loop overwriting instead of appending CSV?

I am trying to scrape the IB website. I have created the URLs to iterate over, and I am able to extract the required information, but it seems the dataframe keeps being overwritten instead of appended to.
import pandas as pd
from pandas import DataFrame as df
from bs4 import BeautifulSoup
import csv
import requests

base_url = "https://www.interactivebrokers.com/en/index.php?f=2222&exch=mexi&showcategories=STK&p=&cc=&limit=100"

n = 1
url_list = []
while n <= 2:
    url = (base_url + "&page=%d" % n)
    url_list.append(url)
    n = n + 1

def parse_websites(url_list):
    for url in url_list:
        html_string = requests.get(url)
        soup = BeautifulSoup(html_string.text, 'lxml')  # Parse the HTML as a string
        table = soup.find('div', {'class': 'table-responsive no-margin'})  # Grab the first table
        df = pd.DataFrame(columns=range(0, 4), index=[0])  # I know the size
        for row_marker, row in enumerate(table.find_all('tr')):
            column_marker = 0
            columns = row.find_all('td')
            try:
                df.loc[row_marker] = [column.get_text() for column in columns]
            except ValueError:
                # It's a safe way when [column.get_text() for column in columns] is an empty list.
                continue
        print(df)
        df.to_csv('path_to_file\\test1.csv')

parse_websites(url_list)
Can you please take a look at my code and advise what I am doing wrong?
One solution, if you want to append the data frames to the file, is to write in append mode:
df.to_csv('path_to_file\\test1.csv', mode='a', header=False)
Otherwise you should create the data frame outside the loop, as mentioned in the comments.
If you define a data structure from within a loop, each iteration of the loop will redefine the data structure, meaning that the work is being rewritten. The dataframe should be defined outside of the loop if you do not want it to be overwritten.
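A minimal sketch of that pattern, collecting rows in a plain list and building a single DataFrame outside the loop (the row extraction here is simplified, not the exact parsing from the question):

all_rows = []
for url in url_list:
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    table = soup.find('div', {'class': 'table-responsive no-margin'})
    for row in table.find_all('tr'):
        cells = [td.get_text() for td in row.find_all('td')]
        if cells:
            all_rows.append(cells)

# One DataFrame, one write, nothing gets overwritten
pd.DataFrame(all_rows).to_csv('path_to_file\\test1.csv', index=False)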
