The function below takes a name from a loop and produces 10 or more data sets. I want to merge them into one DataFrame and write it to a CSV file.
import requests
import pandas as pd

def f(name):
    url = "https://apiurl"
    response = requests.get(url, params={'page': 1})
    records = []
    for page_number in range(1, response.json().get("pages") + 1):
        response = requests.get(url, params={'page': page_number})
        records += response.json().get('records')
    df = pd.DataFrame(records)
    return df
The for loop that calls the function:
for row in valdf.itertuples():
    name = valdf.loc[row.Index, 'Account_ID']
    df1 = f(name)
    print(df1)
When I tried df.to_csv('df.csv'), it only kept the last result from the loop. Is it possible to merge them into one DataFrame and export it?
Create a list outside of the loop and use pd.concat():
dfs = []
for row in valdf.itertuples():
    name = valdf.loc[row.Index, 'Account_ID']
    df1 = f(name)
    dfs.append(df1)

all_df = pd.concat(dfs)
This assumes the individual DataFrames all have the same columns; otherwise pd.concat will align them and fill the gaps with NaN.
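To get the single CSV file the question asks for, the concatenated frame can then be written out in one go; a minimal sketch (ignore_index just renumbers the rows, and the file name is only an example):

all_df = pd.concat(dfs, ignore_index=True)
all_df.to_csv('all_accounts.csv', index=False)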
Related
I have a function that loops through two lists, makes API calls based on the values contained in those lists and then merges responses into a single dataframe as below:
master_df = pd.DataFrame()
links = [109340678, 60375713, ...]
ids = [353474, 184335, ...]

for id in ids:
    url = "https://myurl/{}.json".format(id)
    response = requests.get(url, headers=headers)
    json_obj = json.loads(response.content)
    data = json_obj['data']
    df = pd.DataFrame.from_dict(data)
    df = df.set_index('field').T
    df = df.replace(to_replace='.*pattern.*', value=np.NaN, regex=True)  # replace a particular pattern with NaN, insert content from the loop below
    for link in links:
        url_download = "https://myurl/{id}/{link}".format(id=id, link=link)
        response = requests.get(url_download, headers=headers)
        df.at[0, link] = response.content
    master_df = master_df.append(df)
I save the results in an Excel workbook, but the output gains an extra row on every iteration, where the value is the response.content. I inspected every iteration of the loop and can't quite understand why, despite specifying 0 as the row index, an extra row is created every time.
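For what it's worth, a minimal sketch (made-up data, not the original API response) that reproduces the behaviour being described: after set_index('field').T the single row is labelled with the old column name rather than 0, so df.at[0, link] enlarges the frame with a new row labelled 0 instead of writing into the existing row.

import pandas as pd

# toy stand-in for pd.DataFrame.from_dict(data)
df = pd.DataFrame({'field': ['a', 'b'], 'value': [1, 2]})
df = df.set_index('field').T
print(df.index.tolist())   # ['value'] -- the row label is not 0

df.at[0, 109340678] = 'response content'  # label 0 does not exist, so .at adds a new row
print(len(df))             # 2 -- an extra row labelled 0 was created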
I need to scrape hundreds of pages, and instead of storing the whole JSON of each page I want to store just several columns from each page in a pandas DataFrame. However, at the beginning, when the DataFrame is empty, I have a problem: I need to fill an empty DataFrame that has no columns or rows, so the loop below is not working correctly:
import pandas as pd
import requests

cids = [4100, 4101, 4102, 4103, 4104]
df = pd.DataFrame()
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df['Customer_id'] = i
    df['Name'] = jdata['user']['profile']['Name']
    ...
In this case, what should I do?
You can solve this by using enumerate(), together with loc:
for index, i in enumerate(cids):
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df.loc[index, 'Customer_id'] = i
    df.loc[index, 'Name'] = jdata['user']['profile']['Name']
If you specify your column names when you create your empty dataframe, as follows:
df = pd.DataFrame(columns = ['Customer_id', 'Name'])
then you can just append your new data using:
df = df.append({'Customer_id' : i, 'Name' : jdata['user']['profile']['Name']}, ignore_index=True)
(plus any other columns you populate), adding a row to the dataframe for each iteration of your for loop.
import pandas as pd
import requests

cids = [4100, 4101, 4102, 4103, 4104]
df = pd.DataFrame(columns=['Customer_id', 'Name'])
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df = df.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']}, ignore_index=True)
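Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current pandas the same per-row append can be written with pd.concat, roughly like this:

new_row = pd.DataFrame([{'Customer_id': i, 'Name': jdata['user']['profile']['Name']}])
df = pd.concat([df, new_row], ignore_index=True)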
It should be noted that using append on a DataFrame in a loop is usually inefficient (see here) so a better way is to save your results as a list of lists (df_data), and then turn that into a DataFrame, as below:
cids = [4100, 4101, 4102, 4103, 4104]
df_data = []
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df_data.append([i, jdata['user']['profile']['Name']])

df = pd.DataFrame(df_data, columns=['Customer_id', 'Name'])
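A list of dicts works just as well if you prefer named fields; a sketch under the same assumptions about the response structure:

rows = []
for i in cids:
    jdata = requests.get(f'myurl/{i}/profile').json()
    rows.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']})

df = pd.DataFrame(rows)  # column names come from the dict keys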
I have a data frame that I split into smaller data frames of size 100 (so that Python is able to process it).
Therefore, I get different data frames (df1 to df..). For all those data frames, I want to create a URL as shown below.
When I use type(df), it shows me it is a data frame; however, when I use for j in dfs: print(type(j)), it shows it is a string. I need the data frame to be able to create the URL.
Can you please tell me what the loop for creating the URLs for all data frames could look like?
Thank you so much for your help!
df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')

n = 100  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

dfs = {}
for idx, df in enumerate(list_df, 1):
    dfs[f'df{idx}'] = df

type(df1)

for j in dfs:
    print(type(j))

def create_url():
    url = "https://api.twitter.com/2/tweets?{}&{}".format("ids=" + (str(str((df1['id'].tolist()))[1:-1])).replace(" ", ""), tweet_fields)
    return url
dfs is a dictionary, so for j in dfs: gives you only the keys, which are strings.
You need .values():
for j in dfs.values():
or .items()
for key, j in dfs.items():
or you have to use dfs[j]
for j in dfs:
    print(type(dfs[j]))
EDIT:
Frankly, you could do it all in one loop, without list_df:
import pandas as pd

#df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')
df = pd.DataFrame({'id': range(1000)})

tweet_fields = 'something'

n = 100  # chunk row size
for i in range(0, df.shape[0], n):
    ids = df[i:i+n]['id'].tolist()
    ids_str = ','.join(str(x) for x in ids)
    url = "https://api.twitter.com/2/tweets?ids={}&{}".format(ids_str, tweet_fields)
    print(url)
You can also use groupby with the index, if the index uses consecutive numbers 0, 1, ...:
for i, group in df.groupby(df.index // 100):
    ids = group['id'].tolist()
    ids_str = ','.join(str(x) for x in ids)
    url = "https://api.twitter.com/2/tweets?ids={}&{}".format(ids_str, tweet_fields)
    print(url)
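As a quick illustration of the grouping key, a toy example with chunks of 2 instead of 100 (assuming a default RangeIndex):

import pandas as pd

print((pd.RangeIndex(5) // 2).tolist())  # [0, 0, 1, 1, 2] -- consecutive rows share a group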
I have Python code that pulls data from a 3rd-party API. Below is the code.
for sub in sublocation_ids:
    city_num_int = sub['id']
    city_num_str = str(city_num_int)
    city_name = sub['name']
    filter_text_new = filter_text.format(city_num_str)
    data = json.dumps({"filters": [filter_text_new], "sort_by": "fb_tw_and_li", "size": 200, "from": 1580491663000, "to": 1588184960000, "content_type": "stories"})
    r = requests.post(url=api_endpoint, data=data).json()
    if r['articles'] != empty_list:
        articles_list = r["articles"]
        time.sleep(5)
        articles_list_normalized = json_normalize(articles_list)
        df = articles_list_normalized
        df['publication_timestamp'] = pd.to_datetime(df['publication_timestamp'])
        df['publication_timestamp'] = df['publication_timestamp'].apply(lambda x: x.now().strftime('%Y-%m-%d'))
        df['citystate'] = city_name
        df = df.drop('has_video', 1)
        df.to_excel(writer, sheet_name=city_name)
        writer.save()
Here city_num_int = sub['id'] is a unique ID for each city. The API returns a "videos" column for a few cities and not for others. I want to get rid of that "videos" column before the data gets written to the Excel file.
I was able to drop the "has_video" column using df.drop, as that column is present in every city's data pull. But how do I do conditional dropping for the "videos" column, as it is only present for a few cities?
You can ignore the errors raised by DataFrame.drop:
df = df.drop(['videos'], axis=1, errors='ignore')
Another way is to first check whether the column is present in the DataFrame, and only delete it if it is.
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
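For example, that check could look like this (a minimal sketch, using the "videos" column from the question):

if 'videos' in df.columns:
    df = df.drop(columns=['videos'])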
You can use a list comprehension on the column names to achieve what you want:
cols_to_keep = [c for c in df.columns if c != "videos"]
df = df[cols_to_keep]
I have a problem with appending to a dataframe.
I try to execute this code:
df_all = pd.read_csv('data.csv', error_bad_lines=False, chunksize=1000000)
urls = pd.read_excel('url_june.xlsx')
substr = urls.url.values.tolist()
df_res = pd.DataFrame()
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        df_res.append(res)
And when I try to save df_res, I get an empty dataframe.
df_all looks like:
ID,"url","used_at","active_seconds"
b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:25,1
b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:31,30
f85ce4b2f8787d48edc8612b2ccaca83,"4pda.ru/forum/index.php?showtopic=634566&view=getnewpost",2015-10-01 00:01:49,2
d3b0ef7d85dbb4dbb75e8a5950bad225,"shop.mts.ru/smartfony/mts/smartfon-smart-sprint-4g-sim-lock-white.html?utm_source=admitad&utm_medium=cpa&utm_content=300&utm_campaign=gde_cpa&uid=3",2015-10-01 00:03:19,34
078d388438ebf1d4142808f58fb66c87,"market.yandex.ru/product/12675734/spec?hid=91491&track=char",2015-10-01 00:03:48,2
d3b0ef7d85dbb4dbb75e8a5950bad225,"avito.ru/yoshkar-ola/telefony/mts",2015-10-01 00:04:21,4
d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:25,1
d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:26,9
and urls looks like:
url
shoppingcart.aliexpress.com/order/confirm_order
ozon.ru/?context=order_done&number=
lk.wildberries.ru/basket/orderconfirmed
lamoda.ru/checkout/onepage/success/quick
mvideo.ru/confirmation?_requestid=
eldorado.ru/personal/order.php?step=confirm
When I print res in the loop, it isn't empty. But when I print df_res in the loop after the append, it returns an empty dataframe.
I can't find my error. How can I fix it?
If you look at the documentation for pd.DataFrame.append
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
(emphasis mine).
Try
df_res = df_res.append(res)
Incidentally, note that pandas isn't that efficient for creating a DataFrame by successive concatenations. You might try this, instead:
all_res = []
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        all_res.append(res)

df_res = pd.concat(all_res)
This first creates a list of all the parts, then creates a DataFrame from all of them once at the end.
If we want to append based on index:
df_res = pd.DataFrame(data=None, columns=df.columns)
all_res = []
d1 = df.iloc[index-10:index]  # takes the 10 rows before the i-th index (df.ix is removed in current pandas, so use iloc)
all_res.append(d1)
df_res = pd.concat(all_res)
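Assembled into the original loop, the index-based variant could look something like this; a sketch that assumes the goal is to keep up to 10 rows of context before each matching row (positions rather than labels are used, since chunks after the first do not start at index 0):

import numpy as np

all_res = []
for df in df_all:
    for i in substr:
        hit_positions = np.flatnonzero(df['url'].str.contains(i).to_numpy())
        for pos in hit_positions:
            all_res.append(df.iloc[max(pos - 10, 0):pos])  # up to 10 rows before each match

df_res = pd.concat(all_res)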