I'm a newbie in this community and I hope you can help me with my problem. In my current project I want to scrape a page that lists gas stations, each with several pieces of information. At the moment all the information from the petrol stations ends up in one variable, but I want one row per gas station so that I get a large data frame. Each individual gas station has an id, and the ids are stored in the variable ids.
import base64
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup, NavigableString

# results is a DataFrame produced earlier; s is a requests.Session and
# headers are set up earlier in the script
ids = results["objectID"].tolist()
id_details = []
for i, object_id in enumerate(ids):
    input_dict = {
        'diff_time_zone': -1,
        'objectID': object_id,
        'poiposition': '50.5397219 8.7328552',
        'stateAll': '2',
        'category': 1,
        'language': 'de',
        'prognosis_offset': -1,
        'windowSize': 305
    }
    # the endpoint expects the JSON payload base64-encoded in a "post" form field
    encoded_input_string = json.dumps(input_dict, indent=2).encode('utf-8')
    encoded_input_string = base64.b64encode(encoded_input_string).decode("utf-8")
    r = s.post("https://example.me/getObject_detail.php", headers=headers, data="post=" + encoded_input_string)
    soup = BeautifulSoup(r.text, "lxml")
    inside = soup.find('div', class_='inside')
    paragraphs = inside.find_all("p")
    for p in paragraphs:
        txt = p if isinstance(p, NavigableString) else p.text
        id_details.append(txt)
df = pd.DataFrame(id_details, columns=['place'])
Well, personally I would use a database rather than a data frame in that case, and probably not save to a file at all. As far as I can see this is dictionary-shaped data that could easily be loaded into Elasticsearch, for example.
If there is a reason that rules out any kind of database and forces you to use a DataFrame, then opening the file and appending to the end of it works fine. You should make your chunks as large as possible, because opening and writing the file is the bottleneck here, but keep them bounded, because RAM is not unlimited.
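For illustration, a minimal sketch of the Elasticsearch route, assuming the official elasticsearch Python client (8.x syntax), a local node, a hypothetical index name stations, and one extracted place string per station id:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node
for station_id, place in zip(ids, id_details):
    # one document per gas station; "stations" is a hypothetical index name
    es.index(index="stations", id=station_id, document={"place": place})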
Update, in response to a question about the second way:
Some parts of your code are missing, but you will get the idea.
file_name = 'My_File.csv'
cols = ['place']  # e.g. creating an empty csv with only one column - place - using pandas
data = dict(zip(cols, [[] for i in range(len(cols))]))
df = pd.DataFrame(data)  # creating an empty df with the right columns
df.to_csv(file_name, mode='w', index=False, header=True)  # saving it with the header

id_details = {'place': []}
for i, object_id in enumerate(ids):
    # Some algo...
    for p in paragraphs:
        id_details['place'].append(txt)
    if i % 100 == 0:
        # flush the current chunk to the csv and empty the in-memory buffer
        df_main = pd.DataFrame(id_details)
        df_main.to_csv(file_name, mode='a', index=False, header=False)
        id_details['place'] = []
# write whatever is left in the buffer after the loop
df_main = pd.DataFrame(id_details)
df_main.to_csv(file_name, mode='a', index=False, header=False)
Related
My code scrapes data from a public site, stores it in a DataFrame and saves it as a CSV, but it is not working well.
for ano in lista_ano:
    for distribuidora in lista_distribuidores:
        for mes in lista_mes:
            scraping = pd.read_html('https://www2.aneel.gov.br/aplicacoes/indicadores_de_qualidade/decFecSegMensal.cfm?mes={}&ano={}&regiao=SE&distribuidora={}&tipo=d'.format(mes, ano, distribuidora))
            dfs = pd.DataFrame(scraping[0])
            dfs.drop(dfs.tail(3).index, inplace=True)
            dfs.drop(dfs.head(2).index, inplace=True)
            dfs = dfs.assign(MES='{}'.format(mes))
            dfs = dfs.assign(ANO='{}'.format(ano))
            dfs = dfs.assign(DISTRIBUIDORA='{}'.format(distribuidora))
            all_dfs = pd.DataFrame(dfs)
            all_dfs.to_csv('final_data.csv', encoding='utf-8')
My problem here is that all_dfs.to_csv creates a new csv on every loop iteration instead of storing the data in the same file.
You are overwriting the existing csv on each iteration.
To fix it, simply indicate that you want to append instead of write:
all_dfs.to_csv('final_data.csv', encoding='utf-8', mode='a')
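One thing to watch out for: with mode='a', to_csv also re-writes the header row on every iteration by default. A small sketch that writes the header only when the file does not exist yet:

import os

write_header = not os.path.exists('final_data.csv')
all_dfs.to_csv('final_data.csv', encoding='utf-8', mode='a', header=write_header, index=False)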
I'm working with an API and trying to pull out a complete list of surveys by looping through every user's API token. My idea for the loop is that it reads each API token (stored in a list) one at a time, stores the data, converts it to a pandas dataframe and then stores the data in a CSV file. So far I've created a script that successfully loops through the list of API tokens but just overwrites the CSV file each time. Here is my script currently:
import json

import pandas as pd
import requests

apiToken = ["n0000000000001", "N0000000002"]
for x in apiToken:
    baseUrl = "https://group.qualtrics.com/API/v3/surveys"
    headers = {
        "x-api-token": x,
    }
    response = requests.get(baseUrl, headers=headers)
    surveys = response.text
    surveys2 = json.loads(response.text)
    surveys3 = surveys2["result"]["elements"]
    df = pd.DataFrame(surveys3)
    df.to_csv('survey_list.csv', index=False)
What changes do I need to make so that the new rows are appended to the CSV rather than overwriting it?
Assuming your code is right, this should work: build a list of the separate dataframes and concat them together. This works if all the DataFrames in the list (dfs) have the same column names.
apiToken = ["n0000000000001", "N0000000002"]
dfs = []
for x in apiToken:
    baseUrl = "https://group.qualtrics.com/API/v3/surveys"
    headers = {
        "x-api-token": x,
    }
    response = requests.get(baseUrl, headers=headers)
    surveys = response.text
    surveys2 = json.loads(response.text)
    surveys3 = surveys2["result"]["elements"]
    # collect one DataFrame per token instead of writing inside the loop
    dfs.append(pd.DataFrame(surveys3))
final_df = pd.concat(dfs)
final_df.to_csv('survey_list.csv', index=False)
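As a side note, each per-token frame carries its own 0-based index, so the concatenated index will contain repeats; if that matters, pd.concat can renumber it:

final_df = pd.concat(dfs, ignore_index=True)  # renumber 0..n-1 across all tokens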
Using the csv package you can append rows to your csv as follows:
import csv
...
with open(filename, 'a', newline='') as filecsv:
    writer = csv.writer(filecsv)
    line = [...]  # whatever csv line you want to append, as a list of values
    writer.writerow(line)
I'm new to Python and trying my luck.
I have a JSON file from which I extract particular items; those items are saved in variables, and using a for loop I print the entire JSON data as output.
Basically, I want that entire console output in an Excel file with the help of a pandas DataFrame, or any alternative way would be much appreciated.
import pandas as pd
import json

with open('i4.json', encoding='utf-8-sig') as f:
    data = json.load(f)

for ib in data['documents']:
    tit = ib['title']
    stat = ib['status']
    print(tit, stat)

df = pd.DataFrame({'Title': [tit], 'Status': [stat]})
df.to_excel('fromSIM.xls', index=False)
The output is, for example:
title1 pass
title2 fail
The problem with the Excel file is that it gets saved with only the last row:
Title Status
title2 fail
Can anyone enlighten me on how to change the above code so that all of the output is saved to the Excel file, one row below the other?
The problem is that you are overwriting the data frame in each loop iteration. You should create the data frame outside the for loop, and then only append the new rows to the DataFrame inside the loop.
import pandas as pd

columns = ['Title', 'Status']
df_ = pd.DataFrame(columns=columns)
for ib in data['documents']:
    tit = ib['title']
    stat = ib['status']
    print(tit, stat)
    # append one row per document (DataFrame.append was removed in pandas 2.0)
    df_ = df_.append(pd.Series([tit, stat], index=df_.columns), ignore_index=True)
df_.to_excel('fromSIM.xls', index=False)
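On pandas 2.0 and later, where DataFrame.append has been removed, the same idea can be expressed by collecting the rows in a plain list and building the frame once at the end; a sketch under that assumption:

rows = []
for ib in data['documents']:
    # one dict per row; the keys become the column names
    rows.append({'Title': ib['title'], 'Status': ib['status']})
df_ = pd.DataFrame(rows, columns=['Title', 'Status'])
df_.to_excel('fromSIM.xls', index=False)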
I'm working on a project and have run into a messy situation where I have to split a data frame based on its first column. The data frame comes from SQL queries and I'm doing a lot of manipulation on it, which is why I'm not posting that code here.
Target: the data frame I have looks like the screenshot below, and it is available as an xlsx file.
Output: I'm looking for output like the attached file.
The thing is, I'm not able to come up with the logic for doing this on the dataframe itself, as I'm a newbie in Python.
I think you can do this:
df = df.set_index('Placement# Name')
df['Date'] = df['Date'].dt.strftime('%m-%d-%Y')  # %m is month; %M would be minutes
# build a subtotal row per placement (on newer pandas use .groupby(level=0).sum()
# instead of .sum(level=0), which has been removed)
df_sub = df[['Delivered Impressions', 'Clicks', 'Conversion', 'Spend']].sum(level=0)\
    .assign(Date='Subtotal')
df_sub['CTR'] = df_sub['Clicks'] / df_sub['Delivered Impressions']
df_sub['eCPA'] = df_sub['Spend'] / df_sub['Conversion']
df_out = pd.concat([df, df_sub]).set_index('Date', append=True).sort_index(level=0)

# write each placement's block into the same sheet, two blank rows apart
startline = 0
writer = pd.ExcelWriter('testxls.xlsx', engine='openpyxl')
for n, g in df_out.groupby(level=0):
    g.to_excel(writer, startrow=startline, index=True)
    startline += len(g) + 2
writer.save()
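On recent pandas versions writer.save() has also been removed; using the writer as a context manager does the same job and closes the file on exit:

startline = 0
with pd.ExcelWriter('testxls.xlsx', engine='openpyxl') as writer:
    for n, g in df_out.groupby(level=0):
        g.to_excel(writer, startrow=startline, index=True)
        startline += len(g) + 2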
Load the Excel file into a Pandas dataframe, then extract rows based on a condition:
import pandas

dframe = pandas.read_excel("sample.xlsx")
dframe = dframe.loc[dframe["Placement# Name"] == "Needed value"]
Where "Needed value" is the value of one of those rows.
Background: My first Excel-related script. Using openpyxl.
There is an Excel sheet with loads of different types of data about products in different columns.
My goal is to extract certain types of data from certain columns (e.g. price, barcode, status), assign those to the unique product code and then output product code, price, barcode and status to a new excel doc.
I have succeeded in extracting the data and putting it in the following dictionary format:
productData = {'AB123': {'barcode': 123456, 'price': 50, 'status': 'NEW'}}
My general thinking on getting this output to a new report is something like this (although I know that this is wrong):
newReport = openpyxl.Workbook()
newSheet = newReport.active
newSheet.title = 'Output'
newSheet['A1'].value = 'Product Code'
newSheet['B1'].value = 'Price'
newSheet['C1'].value = 'Barcode'
newSheet['D1'].value = 'Status'
for row in range(2, len(productData) + 1):
    newSheet['A' + str(row)].value = productData[productCode]
    newSheet['B' + str(row)].value = productPrice
    newSheet['C' + str(row)].value = productBarcode
    newSheet['D' + str(row)].value = productStatus
newReport.save('ihopethisworks.xlsx')
What do I actually need to do to output the data?
I would suggest using Pandas for that. It has the following syntax:
df = pd.read_excel('your_file.xlsx')
df['Column name you want'].to_excel('new_file.xlsx')
You can do a lot more with it. Openpyxl might not be the right tool for your task (it is more low-level and general-purpose).
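If you need several columns at once, the same pattern takes a list of column names; the names below come from the question and are assumed to match the sheet:

# select multiple columns and write them to a new file
df[['Price', 'Barcode', 'Status']].to_excel('new_file.xlsx', index=False)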
P.S. I would have left this as a comment, but Stack Overflow, in their wisdom, decided to let anyone leave answers but not comments.
The logic you use to extract the data is missing, but I suspect the best approach is to loop over the two worksheets in parallel. You can then avoid using a dictionary entirely and just append rows to the new worksheet inside the loop.
Pseudocode:
ws1  # source worksheet
ws2  # new worksheet

for row in ws1:           # loop over the source rows
    code = ws1[…]         # some lookup
    barcode = ws1[…]
    price = ws1[…]
    status = ws1[…]
    ws2.append([code, price, barcode, status])
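If you do keep the dictionary from the question, a minimal runnable sketch of the output side using openpyxl's append (assuming productData is shaped exactly as shown in the question):

import openpyxl

newReport = openpyxl.Workbook()
newSheet = newReport.active
newSheet.title = 'Output'
newSheet.append(['Product Code', 'Price', 'Barcode', 'Status'])
for code, details in productData.items():
    # append writes one worksheet row per product code
    newSheet.append([code, details['price'], details['barcode'], details['status']])
newReport.save('output.xlsx')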
Pandas will work best for this; here are some examples:
import pandas as pd

# df columns: Date Open High Low Close Volume
# reading data from an excel file
df = pd.read_excel('GOOG-NYSE_SPY.xls')
# set the index to the column of your choice, in this case the date
df.set_index('Date', inplace=True)
# choosing the columns of your choice for further manipulation
df = df[['Open', 'Close']]
# divide two columns to get the % change
df = (df['Open'] - df['Close']) / df['Close'] * 100
print(df.head())
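To tie this back to the question above, the result can be written straight back to Excel; the file and sheet names here are arbitrary:

# df is a Series after the division above; to_excel works on it directly
df.to_excel('pct_change.xlsx', sheet_name='pct_change')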