My code scrapes data from a public site, stores it in a DataFrame, and saves it as a CSV, but it is not working as expected.
for ano in lista_ano:
    for distribuidora in lista_distribuidores:
        for mes in lista_mes:
            scraping = pd.read_html('https://www2.aneel.gov.br/aplicacoes/indicadores_de_qualidade/decFecSegMensal.cfm?mes={}&ano={}&regiao=SE&distribuidora={}&tipo=d'.format(mes, ano, distribuidora))
            dfs = pd.DataFrame(scraping[0])
            dfs.drop(dfs.tail(3).index, inplace=True)
            dfs.drop(dfs.head(2).index, inplace=True)
            dfs = dfs.assign(MES='{}'.format(mes))
            dfs = dfs.assign(ANO='{}'.format(ano))
            dfs = dfs.assign(DISTRIBUIDORA='{}'.format(distribuidora))
            all_dfs = pd.DataFrame(dfs)
            all_dfs.to_csv('final_data.csv', encoding='utf-8')
My problem here is that all_dfs.to_csv creates a fresh CSV on each loop iteration instead of accumulating the data in the same file.
You are overwriting the existing CSV on each iteration.
To fix it, tell to_csv to append instead of overwrite by passing mode='a':
all_dfs.to_csv('final_data.csv', encoding='utf-8', mode='a')
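Note that mode='a' by itself will repeat the header row (and the index column) on every append. A minimal alternative sketch, reusing the loop above with the scraping body elided, collects each cleaned frame in a list and writes the file once at the end:

import pandas as pd

frames = []  # one cleaned DataFrame per (mes, ano, distribuidora)
for ano in lista_ano:
    for distribuidora in lista_distribuidores:
        for mes in lista_mes:
            # ... same read_html, drop and assign steps as above, producing `dfs` ...
            frames.append(dfs)

# a single concatenation and a single write: one header, no duplicates
pd.concat(frames, ignore_index=True).to_csv('final_data.csv', encoding='utf-8', index=False)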
I'm a newbie in this community and I hope you can help me with my problem. In my current project I want to scrape a page listing gas stations with multiple pieces of information. Right now all the information from the petrol stations is stored in one variable. However, I want each gas station to have its own row so that I end up with a large data frame. Each individual gas station has an id, and the ids are stored in the variable ids.
import base64
import json

import pandas as pd
from bs4 import BeautifulSoup, NavigableString

# `results`, `s` (a requests session) and `headers` are defined earlier
ids = results["objectID"].tolist()
id_details = []
for i, id in enumerate(ids):
    input_dict = {
        'diff_time_zone': -1,
        'objectID': id,
        'poiposition': '50.5397219 8.7328552',
        'stateAll': '2',
        'category': 1,
        'language': 'de',
        'prognosis_offset': -1,
        'windowSize': 305
    }
    encoded_input_string = json.dumps(input_dict, indent=2).encode('utf-8')
    encoded_input_string = base64.b64encode(encoded_input_string).decode("utf-8")
    r = s.post("https://example.me/getObject_detail.php", headers=headers, data="post=" + encoded_input_string)
    soup = BeautifulSoup(r.text, "lxml")
    lists = soup.find('div', class_='inside')
    rs = lists.find_all("p")
    final = []
    for lists in rs:
        txt = lists if type(lists) == NavigableString else lists.text
        id_details.append(txt)
df = pd.DataFrame(id_details, columns=['place'])
Personally, I would use a database rather than a data frame in that case, and probably not save to a file at all. Since the data is dictionary-based, it could easily be loaded into Elasticsearch, for example.
If there is some reason that rules out any kind of database and forces you to use a DataFrame, opening the file and appending to the end of it will work fine. You should make your chunks as large as you can, because file access and writing act as the bottleneck here, but they have to stay chunks because RAM is not unlimited.
--- Update for the second approach.
Some parts of your code are missing, but you will get the idea:
file_name = 'My_File.csv'
cols = ['place']  # e.g. create an empty csv with only one column - place - using pandas
data = dict(zip(cols, [[] for i in range(len(cols))]))
df = pd.DataFrame(data)  # creating df
df.to_csv(file_name, mode='w', index=False, header=True)  # saving

id_details = {'place': []}
for i, id in enumerate(ids):
    # Some algo...
    for lists in rs:
        id_details['place'].append(txt)
    if i % 100 == 0:
        # flush the buffered rows to disk every 100 ids
        df_main = pd.DataFrame(id_details)
        df_main.to_csv(file_name, mode='a', index=False, header=False)
        id_details['place'] = []

# write whatever is left in the buffer after the loop
df_main = pd.DataFrame(id_details)
df_main.to_csv(file_name, mode='a', index=False, header=False)
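As a follow-up on the chunk size: flushing every 100 ids is only an example; a larger interval means fewer disk writes at the cost of holding more rows in RAM, so tune it to the memory you have available.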
I need to read data from several sheets in an xlsx file and save each sheet's data as a dataframe with the same name as the sheet. Here is the code I use. It can read data from the different sheets; however, every dataframe ends up assigned to the same variable, temp. How should I change it? Thanks.
import pandas as pd

sheet_name_list = ['sheet1', 'sheet2', 'sheet3']
for temp in sheet_name_list:
    temp = pd.read_excel("data_spreadsheet.xlsx", sheet_name=temp)
You can use a dictionary:
pd_dict = {}
for temp in sheet_name_list:
    pd_dict[temp] = pd.read_excel("data_spreadsheet.xlsx", sheet_name=temp)
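Each frame is then reachable by its sheet name, e.g. pd_dict['sheet1'].head(). As a side note, pandas can build that dictionary in one call: passing sheet_name=None reads every sheet at once. A minimal sketch:

import pandas as pd

# sheet_name=None returns a dict mapping each sheet name to its DataFrame
pd_dict = pd.read_excel("data_spreadsheet.xlsx", sheet_name=None)
print(pd_dict['sheet1'].head())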
I am trying to load all my data-set files in Python using pandas, but no results are shown.
import os

import pandas as pd

print(os.listdir("C:/Users/Smile/.spyder-py3/datasets"))
# Any results you write to the current directory are saved as output.
data = ["name","version","tool_name","wmc","dit","noc","cbo","rfc","lcom","ca","ce","npm","lcom3","loc","dam","moa","mfa","cam","ic","cbm","amc","max_cc","avg_cc","bug"]
data = pd.DataFrame()
for file in os.listdir():
    if file.endswith('.csv'):
        data = pd.read_csv(file)
        data.set_index('name', inplace=True)
        data = data.append(data, ignore_index=True)
print(data.head(5))
My output is given below:
Empty DataFrame
Columns: []
Index: []
You overwrite data each time you read a new CSV.
Replace the data variable with a temp variable, like this:
data = pd.DataFrame()
for file in os.listdir():
    if file.endswith('.csv'):
        csv_data = pd.read_csv(file)
        csv_data.set_index('name', inplace=True)
        data = data.append(csv_data, ignore_index=True)
print(data.head(5))
By using data to read each new CSV ('data = pd.read_csv(file)'), you overwrite the rows you already appended in the previous iteration. You need to keep the accumulated frame intact in order to keep appending to it, so each CSV must be read into a separate variable.
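Note that DataFrame.append was deprecated and removed in pandas 2.0, so on a recent pandas the same idea is usually written by collecting the frames in a list and concatenating once. A sketch under the same assumptions (CSV files in the current directory, each with a 'name' column):

import os

import pandas as pd

# read each CSV into its own frame, then concatenate once
frames = [pd.read_csv(f).set_index('name')
          for f in os.listdir() if f.endswith('.csv')]

data = pd.concat(frames)
print(data.head(5))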
I'm new to Python and trying my luck.
I have a JSON file from which I extract particular items into variables, and using a FOR loop I print the entire JSON data as output.
Basically, I want that entire console output in an Excel file with the help of a dataframe (pandas), or an alternative way if there is one, much appreciated.
import pandas as pd
import json

with open('i4.json', encoding='utf-8-sig') as f:
    data = json.load(f)

for ib in data['documents']:
    tit = ib['title']
    stat = ib['status']
    print(tit, stat)

df = pd.DataFrame({'Title': [tit], 'Status': [stat]})
df.to_excel('fromSIM.xls', index=False)
The output is, for example:
title1 pass
title2 fail
The problem is with the Excel file: it gets saved as below,
Title Status
title2 fail
Can anyone enlighten me on how to change the above code so that all of the output values are saved in the Excel file, one below the other?
The problem is that you are overwriting the data frame in each loop iteration. You should create the data frame outside the for loop, and then only append the new rows to the DF inside the loop.
import pandas as pd

columns = ['Title', 'Status']
df_ = pd.DataFrame(columns=columns)

for ib in data['documents']:
    tit = ib['title']
    stat = ib['status']
    print(tit, stat)
    # append one row per document instead of rebuilding the frame
    df_ = df_.append(pd.Series([tit, stat], index=df_.columns), ignore_index=True)

df_.to_excel('fromSIM.xls', index=False)
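On pandas 2.0 and later, DataFrame.append no longer exists, so an equivalent sketch collects the rows in a list and builds the frame once at the end (writing .xlsx here, since the old .xls writer has also been dropped from recent pandas):

import json

import pandas as pd

with open('i4.json', encoding='utf-8-sig') as f:
    data = json.load(f)

# one dict per document becomes one row in the frame
rows = [{'Title': ib['title'], 'Status': ib['status']} for ib in data['documents']]

pd.DataFrame(rows).to_excel('fromSIM.xlsx', index=False)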
I have multiple (25k) .csv files that I'm trying to append into an HDFStore file. They all share identical headers. I am using the code below, but for some reason whenever I run it the stored dataframe doesn't end up with all of the files appended; it contains only the last file in the list.
filenames = []  # list of .csv file paths that I've already populated
dtypes = {dict of datatypes}

store = pd.HDFStore('store.h5')
store.put('df', pd.read_csv(filenames[0], dtype=dtypes, parse_dates=["date"]))  # store one data frame

for f in filenames:
    try:
        temp_csv = pd.DataFrame()
        temp_csv = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
        store.append('df', temp_csv)
    except:
        pass
I've tried using a subset of the filenames list, but I always get the last entry. For some reason the loop is not appending to my file but rather overwriting it every single time. Any advice would be appreciated, as this is driving me bonkers. (Python 3, Windows)
I think the problem is related to:
store.append('df', temp_csv)
If I understand correctly what you're trying to do, 'df' should change on every iteration; right now you're just overwriting it.
You're creating/storing a new DataFrame with each iteration, as @SeaMonkey said. Your consolidated dataframe should be built outside your loop, something like this:
filenames = []  # list of .csv file paths that I've already populated
dtypes = {dict of datatypes}

df = pd.DataFrame()
for f in filenames:
    df_tmp = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
    df = df.append(df_tmp)

store = pd.HDFStore('store.h5')
store.put('df', df)
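With 25k files, repeated DataFrame.append re-copies the accumulated frame on every iteration, and append itself is gone in pandas 2.0, so a concat-once variant may be noticeably faster. A sketch under the same assumptions (filenames and dtypes as defined above):

import pandas as pd

frames = (pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"]) for f in filenames)

with pd.HDFStore('store.h5') as store:
    store.put('df', pd.concat(frames, ignore_index=True))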