I have to append data to a CSV file. The problem I am facing is that instead of appending I am overwriting the data, so I am not able to retain the old data. Example:
finalDf = pd.DataFrame(columns=['sourcez', 'tergetz', 'TMP'])
df = pd.DataFrame()
df["sourcez"] = ["str(source_Path)"]
df["tergetz"] = ["str(target_path)"]
df["TMP"] = ["total_matching_points"]
finalDf = finalDf.append(df)
finalDf.to_csv('Testing.csv', index=False)
Now if I add a new value:
finalDf = pd.DataFrame(columns=['sourcez', 'tergetz', 'TMP'])
df = pd.DataFrame()
df["sourcez"] = ["str(source_Path)_New"]
df["tergetz"] = ["str(target_path)_New"]
df["TMP"] = ["total_matching_points_New"]
finalDf = finalDf.append(df)
finalDf.to_csv('Testing.csv', index=False)
It keeps only the latest data in the CSV; instead I want both sets of data written to the CSV. Any idea?
I have tried to create a new CSV with a pandas DataFrame, and I want to append the values instead of overwriting them.
I have tried:
finalDf = pd.DataFrame(columns=['sourcez', 'tergetz', 'TMP'])
df = pd.DataFrame()
df["sourcez"] = ["str(source_Path)"]
df["tergetz"] = ["str(target_path)"]
df["TMP"] = ["total_matching_points"]
finalDf = finalDf.append(df)
finalDf.to_csv('Testing.csv', index=False, mode='a+')
But the problem is that the heading is repeated in the CSV:
sourcez,tergetz,TMP
str(source_Path),str(target_path),total_matching_points
sourcez,tergetz,TMP
str(source_Path)_New,str(target_path)_New,total_matching_points_New
How do I remove the repeated headings sourcez,tergetz,TMP?
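One common way to avoid the repeated heading (a minimal sketch, assuming Testing.csv may or may not already exist on disk) is to append with mode='a' and write the column names only when the file is not there yet:

import os
import pandas as pd

df = pd.DataFrame()
df["sourcez"] = ["str(source_Path)_New"]
df["tergetz"] = ["str(target_path)_New"]
df["TMP"] = ["total_matching_points_New"]

# Append to the CSV; write the header row only if the file does not exist yet
write_header = not os.path.exists('Testing.csv')
df.to_csv('Testing.csv', index=False, mode='a', header=write_header)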
I need to scrape hundreds of pages, and instead of storing the whole JSON of each page, I want to store just several columns from each page in a pandas DataFrame. However, at the beginning, when the DataFrame is empty, I have a problem: I need to fill an empty DataFrame that has no columns or rows, so the loop below is not working correctly:
import pandas as pd
import requests

cids = [4100, 4101, 4102, 4103, 4104]
df = pd.DataFrame()
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df['Customer_id'] = i
    df['Name'] = jdata['user']['profile']['Name']
    ...
In this case, what should I do?
You can solve this by using enumerate(), together with loc:
for index, i in enumerate(cids):
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df.loc[index, 'Customer_id'] = i
    df.loc[index, 'Name'] = jdata['user']['profile']['Name']
If you specify your column names when you create your empty dataframe, as follows:
df = pd.DataFrame(columns = ['Customer_id', 'Name'])
then you can just append your new data using:
df = df.append({'Customer_id' : i, 'Name' : jdata['user']['profile']['Name']}, ignore_index=True)
(plus any other columns you populate), adding a row to the dataframe on each iteration of your for loop.
import pandas as pd
import requests

cids = [4100, 4101, 4102, 4103, 4104]
df = pd.DataFrame(columns=['Customer_id', 'Name'])
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df = df.append({'Customer_id': i, 'Name': jdata['user']['profile']['Name']}, ignore_index=True)
It should be noted that using append on a DataFrame in a loop is usually inefficient (see here), so a better way is to save your results as a list of lists (df_data) and then turn that into a DataFrame, as below:
cids = [4100, 4101, 4102, 4103, 4104]
df_data = []
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df_data.append([i, jdata['user']['profile']['Name']])
df = pd.DataFrame(df_data, columns=['Customer_id', 'Name'])
I am scanning a directory of text files and adding them to a Pandas dataframe:
text_path = "/home/tdun0002/stash/cloud_scripts/aws_scripts/output_files/memory_stats/text/"
filelist = os.listdir(text_path)
final_df = pd.DataFrame()
for filename in filelist:
    my_file = text_path + filename
    try:
        df = pd.read_csv(my_file, delim_whitespace=True, header=None)
        final_df = final_df.append(df)
    except Exception as e:
        print(e)

pd.options.display.max_rows
print(f"\n***Full Data Frame: {df}\n***")
Each file in the directory holds the memory of a server:
bastion001-memory.txt
permissions001-memory.txt
haproxy001-memory.txt
The contents of the files look something like this:
cat haproxy001-memory.txt
7706172
On each pass of adding the file, it reports this:
Data Frame: Empty DataFrame
Columns: [7706172]
Index: []
And when I print out the full data frame it only has the last entry:
***Full Data Frame:
Empty DataFrame
Columns: [7706172]
Index: []
***
Why is it reporting that the dataframe is empty? Why is it only showing the last file that was input? I think I may need to append the data.
2 things:
You need to provide header=None in the pd.read_csv call so that the value in the text file is treated as data; by default, pandas assumes the first row is the header.
Since you are reading multiple files, you need to append each DataFrame to an accumulating one. Currently you are overwriting df on each iteration.
The code should look like this:
text_path = "/home/tdun0002/stash/cloud_scripts/aws_scripts/output_files/memory_stats/text/"
filelist = os.listdir(text_path)
final_df = pd.DataFrame()
for filename in filelist:
    my_file = text_path + filename
    try:
        df = pd.read_csv(my_file, delim_whitespace=True, header=None)
        final_df = final_df.append(df)
        print(f"Data Frame: {df}")
    except Exception as e:
        # skip files that cannot be parsed
        print(f"Could not read {my_file}: {e}")

pd.set_option('display.max_rows', None)  # show all rows when printing
print(f"\n***Full Data Frame: {final_df}\n***")
I have Python code that pulls data from a 3rd-party API. Below is the code:
for sub in sublocation_ids:
    city_num_int = sub['id']
    city_num_str = str(city_num_int)
    city_name = sub['name']
    filter_text_new = filter_text.format(city_num_str)
    data = json.dumps({"filters": [filter_text_new], "sort_by": "fb_tw_and_li", "size": 200,
                       "from": 1580491663000, "to": 1588184960000, "content_type": "stories"})
    r = requests.post(url=api_endpoint, data=data).json()
    if r['articles'] != empty_list:
        articles_list = r["articles"]
        time.sleep(5)
        articles_list_normalized = json_normalize(articles_list)
        df = articles_list_normalized
        df['publication_timestamp'] = pd.to_datetime(df['publication_timestamp'])
        df['publication_timestamp'] = df['publication_timestamp'].apply(lambda x: x.now().strftime('%Y-%m-%d'))
        df['citystate'] = city_name
        df = df.drop('has_video', 1)
        df.to_excel(writer, sheet_name=city_name)
        writer.save()
Now, city_num_int = sub['id'] is a unique ID for different cities. The API returns a "videos" column for a few cities and not for others. I want to get rid of that "videos" column before the data gets written to the Excel file.
I was able to drop the "has_video" column using df.drop, as that column is present in each and every city's data pull. But how do I conditionally drop the "videos" column, since it is only present for a few cities?
You can ignore the errors raised by DataFrame.drop:
df = df.drop(['videos'], axis=1, errors='ignore')
Another way is to first check if the column is present in the DataFrame, and only then delete it.
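For example, a minimal check-then-drop sketch:

# Drop the "videos" column only when it actually exists in this city's data
if 'videos' in df.columns:
    df = df.drop(columns=['videos'])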
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
You can use list comprehension on the column names to achieve what you want:
cols_to_keep = [c for c in df.columns if c != "videos"]
df = df[cols_to_keep]
I'm wondering how to get parsed tables from pandas into a single CSV. I have managed to get each table into a separate CSV, but I would like them all in one CSV. This is my current code, which produces multiple CSVs:
import pandas as pd
import csv

url = "https://fasttrack.grv.org.au/RaceField/ViewRaces/228697009?raceId=318809897"
data = pd.read_html(url, attrs={'class': 'ReportRaceDogFormDetails'})
for i, datas in enumerate(data):
    datas.to_csv("new{}.csv".format(i), header=False, index=False)
I think you need only concat, because data is a list of DataFrames:
df = pd.concat(data, ignore_index=True)
df.to_csv(file, header=False, index=False)
You have 2 options:
You can tell pandas to append data while writing to the CSV file.
data = pd.read_html(url, attrs={'class': 'ReportRaceDogFormDetails'})
for datas in data:
    datas.to_csv("new.csv", header=False, index=False, mode='a')
Merge all the tables into one DataFrame and then write that into the CSV file.
data = pd.read_html(url, attrs = {'class': 'ReportRaceDogFormDetails'} )
df = pd.concat(data, ignore_index=True)
df.to_csv("new.csv", header=False, index=False)
Edit
To still separate the dataframes within the CSV file, we have to stick with option #1, but with a few additions:
data = pd.read_html(url, attrs={'class': 'ReportRaceDogFormDetails'})
with open('new.csv', 'a') as csv_stream:
    for datas in data:
        datas.to_csv(csv_stream, header=False, index=False)
        csv_stream.write('\n')
all_dfs = []
for i, datas in enumerate(data):
    datas.to_csv("new{}.csv".format(i), header=False, index=False)  # to_csv writes the file and returns None
    all_dfs.append(datas)                                           # so collect the DataFrame itself
result = pd.concat(all_dfs, ignore_index=True)
I have used the following code to read the data from the files. I tried to put the time series data into one data frame, but I am missing something.
files = glob.glob('*.txt')
files.sort()
for infile in files:
    year, formatt = infile.split('.')
    year = year.split('_')[1]
    ws = [4, 9, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7]
    df = pd.read_fwf(infile, widths=ws, header=9, nrows=31, keep_default_na=False)
    df = df.drop('Day', 1)
    df = np.array(df.T)
    df = df[df != '']
    data = pd.DataFrame([])
    data['Discharge'] = df
    data = data.set_index(pd.date_range(year, periods=len(data), freq='D'),
                          drop=True, append=False, inplace=False, verify_integrity=False)
    new = pd.DataFrame([])
    all_ = new.append(data)
print all_
Can anyone help me figure out my problem?
My sample data is at this link: https://drive.google.com/open?id=0B2rkXkOkG7ExSWQ5djloNkpNWmc
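A likely fix (a sketch that reuses the parsing above and assumes it reads each yearly file correctly) is to collect the per-year frames in a list and concatenate them once after the loop, instead of appending to a fresh empty DataFrame on every iteration:

import glob
import numpy as np
import pandas as pd

files = sorted(glob.glob('*.txt'))
frames = []
for infile in files:
    year = infile.split('.')[0].split('_')[1]
    ws = [4, 9, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7]
    df = pd.read_fwf(infile, widths=ws, header=9, nrows=31, keep_default_na=False)
    df = df.drop('Day', axis=1)
    values = np.array(df.T)
    values = values[values != '']            # drop the blank padding cells
    data = pd.DataFrame({'Discharge': values})
    data = data.set_index(pd.date_range(year, periods=len(data), freq='D'))
    frames.append(data)                      # keep every year, not just the last one

all_ = pd.concat(frames)
print(all_)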