Add rows back to the top of a dataframe - python

I have a raw dataframe that looks like this
I am trying to import this data as a csv, do some calculations on the data, and then export the data. Before doing this, however, I need to remove the three lines of "header information", but keep the data as I will need to add it back to the dataframe prior to exporting. I have done this using the following lines of code:
import pandas as pd
data = pd.read_csv(r"test.csv", header = None)
info = data.iloc[0:3,]
data = data.iloc[3:,]
data.columns = data.iloc[0]
data = data[1:]
data = data.reset_index(drop = True)
The problem I am having is: how do I add the rows stored in "info" back to the top of the dataframe, so that the format matches the csv I imported?
Thank you

You can just use pandas' append() function to merge the two data frames. Verify the result by printing final_data.
import pandas as pd
data = pd.read_csv(r"test.csv", header = None)
info = data.iloc[0:3,]
data = data.iloc[3:,]
data.columns = data.iloc[0]
data = data[1:]
data = data.reset_index(drop = True)
# Here first row of data is column header so converting back to row
data = data.columns.to_frame().T.append(data, ignore_index=True)
data.columns = range(len(data.columns))
final_data = info.append(data)
final_data = final_data.reset_index(drop = True)
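Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same round trip is spelled with pd.concat. A minimal sketch with an in-memory stand-in for the csv (the toy values are assumptions, not from the question):

```python
import pandas as pd

# Toy stand-in for the imported csv: three "info" rows followed by
# a header row and data rows, all read with header=None.
raw = pd.DataFrame([
    ["meta1", "x"],
    ["meta2", "y"],
    ["meta3", "z"],
    ["col_a", "col_b"],
    ["1", "2"],
    ["3", "4"],
])

info = raw.iloc[0:3]
body = raw.iloc[3:].reset_index(drop=True)

# ... calculations on body would happen here ...

# pd.concat puts the info rows back on top (DataFrame.append was
# removed in pandas 2.0, so concat is the forward-compatible spelling).
final_data = pd.concat([info, body], ignore_index=True)
```

Exporting with final_data.to_csv(path, index=False, header=False) then reproduces the original layout.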

Related

Append Values to CSV and retain the old data

I have to append data to a CSV. The problem I am facing is that instead of appending I am overwriting the data, so I am not able to retain the old data. Example:
finalDf = pd.DataFrame(columns=['sourcez', 'tergetz', 'TMP'])
df = pd.DataFrame()
df["sourcez"] = ["str(source_Path)"]
df["tergetz"] = ["str(target_path)"]
df["TMP"] = ["total_matching_points"]
finalDf = finalDf.append(df)
finalDf.to_csv('Testing.csv', index=False)
Now if I add a new value
finalDf = pd.DataFrame(columns=['sourcez', 'tergetz', 'TMP'])
df = pd.DataFrame()
df["sourcez"] = ["str(source_Path)_New"]
df["tergetz"] = ["str(target_path)_New"]
df["TMP"] = ["total_matching_points_New"]
finalDf = finalDf.append(df)
finalDf.to_csv('Testing.csv', index=False)
It keeps only the latest data in the csv; instead, I want both sets of data in the csv. Any idea?
I have tried to create a new csv with a pandas dataframe, and I want to append the values instead of overwriting them.
I have tried:
finalDf = pd.DataFrame(columns=['sourcez', 'tergetz', 'TMP'])
df = pd.DataFrame()
df["sourcez"] = ["str(source_Path)"]
df["tergetz"] = ["str(target_path)"]
df["TMP"] = ["total_matching_points"]
finalDf = finalDf.append(df)
finalDf.to_csv('Testing.csv', index=False, mode='a+')
But the problem is that the heading is repeated in the csv:
sourcez,tergetz,TMP
str(source_Path),str(target_path),total_matching_points
sourcez,tergetz,TMP
str(source_Path)_New,str(target_path)_New,total_matching_points_New
How can I remove the repeated headings sourcez,tergetz,TMP?
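One common fix (a sketch, under the assumption that the file may or may not already exist) is to write the header only on the first write, by checking for the file with os.path.exists:

```python
import os
import tempfile
import pandas as pd

def append_row(path, row):
    # Write the header only when the file does not yet exist;
    # later calls append data rows without repeating it.
    df = pd.DataFrame([row], columns=['sourcez', 'tergetz', 'TMP'])
    df.to_csv(path, mode='a', index=False, header=not os.path.exists(path))

# A temp directory stands in for the real working directory.
path = os.path.join(tempfile.mkdtemp(), 'Testing.csv')
append_row(path, ['a', 'b', 1])
append_row(path, ['c', 'd', 2])
```

Reading the file back now yields one header line and both data rows.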

How can I export for loop results to CSV or Excel with pandas?

I have a for loop that gets data from a website, and I would like to export the results to an xlsx or csv file.
When I print the result of the loop I can see the whole list, but when I export it to an xlsx file I only get the last item. Where is the problem? Can you help?
for item1 in spec:
    spec2 = item1.find_all('th')
    expl2 = item1.find_all('td')
    spec2x = spec2[a].text
    expl2x = expl2[a].text
    yazim = spec2x + ': ' + expl2x
    cumle = yazim
    patern = r"(Brand|Series|Model|Operating System|CPU|Screen|MemoryStorage|Graphics Card|Video Memory|Dimensions|Screen Size|Touchscreen|Display Type|Resolution|GPU|Video Memory|Graphic Type|SSD|Bluetooth|USB)"
    if re.search(patern, cumle):
        speclist = translator.translate(cumle, lang_tgt='tr')
        specl = speclist
        #print(specl)
        import pandas as pd
        exp = [{ 'Prospec': specl,},]
        df = pd.DataFrame(exp, columns = ['Prospec',])
        df.to_excel('output1.xlsx',)
Create an empty list and, at each iteration in your for loop, append a data frame to the list. You will end up with a list of data frames. After the loop, use pd.concat() to create a new data frame by concatenating every element of your list. You can then save the resulting df to an excel file.
Your code would look something like this:
import pandas as pd
df_list = []
for item1 in spec:
    ......
    if re.search(patern, cumle):
        ....
        df_list.append(pd.DataFrame(.....))
df = pd.concat(df_list)
df.to_excel(.....)
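Filled in with toy data so it runs end to end (the scraped strings and the startswith filter are stand-ins for the BeautifulSoup calls and re.search in the question), the pattern looks like:

```python
import pandas as pd

# Toy stand-in for the scraped rows; in the real code these come
# from BeautifulSoup's find_all() calls.
scraped = ['Brand: Acme', 'CPU: i7', 'Weight: 2kg']

df_list = []
for cumle in scraped:
    if cumle.startswith(('Brand', 'CPU')):  # stand-in for re.search(patern, cumle)
        df_list.append(pd.DataFrame([{'Prospec': cumle}]))

# One concat after the loop keeps every matched row.
df = pd.concat(df_list, ignore_index=True)
```

Writing once after the loop (df.to_excel('output1.xlsx')) then saves all rows, not just the last one.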

Reading Excel file dynamically in Python

I am trying to read an Excel file which has some blank rows as well as blank columns. The process is complicated further by some junk values before the header. Currently, I am hardcoding a column name to locate the table. This has two drawbacks: what if that column is not present in the table, and what if the column name also appears as a value inside the column? Is there a way to write a program that dynamically detects the table header and reads the table?
snippet of the code:
raw_data = pd.read_excel('test_data1.xlsx', 'Sheet8', header=None)
data_duplicate = pd.DataFrame()
for row in range(raw_data.shape[0]):
    for col in range(raw_data.shape[1]):
        if raw_data.iloc[row, col] == 'Currency':
            data_duplicate = raw_data.iloc[(row+1):].reset_index(drop=True)
            data_duplicate.columns = list(raw_data.iloc[row])
            break
data_duplicate.dropna(axis=1, how='all', inplace=True)
data_duplicate
Also, the number of blank rows + garbage rows before the header is not fixed.
Here's my way: you can drop all rows and all columns that contain only NaN values:
data = pd.read_excel('test.xlsx')
data = data.dropna(how='all', axis = 1)
data = data.dropna(how='all', axis = 0)
data = data.reset_index(drop = True)
It's better to put this into a function if you need to open multiple DataFrames in the same script:
data = pd.read_excel('test.xlsx')
def remove_nans(df):
    x = df.dropna(how='all', axis=1)
    x = x.dropna(how='all', axis=0)
    x = x.reset_index(drop=True)
    return x
df = remove_nans(data)
print(df)
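If the junk rows contain real values rather than only NaN, dropping empty rows is not enough. One heuristic sketch (an assumption, not part of the answer above): treat the first fully populated row as the header row, instead of searching for a hardcoded name like 'Currency':

```python
import pandas as pd

# Toy sheet: a junk row, a blank row, then the header and data rows,
# standing in for pd.read_excel(..., header=None).
raw = pd.DataFrame([
    ['report', None, None],
    [None, None, None],
    ['Currency', 'Rate', 'Date'],
    ['USD', 1.0, '2020-01-01'],
    ['EUR', 0.9, '2020-01-01'],
])

# Heuristic: the first row with no missing cells is the header.
# (idxmax on a boolean Series returns the first True position.)
header_idx = raw.notna().all(axis=1).idxmax()
table = raw.iloc[header_idx + 1:].reset_index(drop=True)
table.columns = list(raw.iloc[header_idx])
```

This avoids hardcoding a column name, at the cost of assuming the header row is the first fully populated one.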

comparing data frames from multiple data frames to filter data and extract relevant features

I am trying to load all my dataset files in python using pandas, but the results are not shown.
import os
print(os.listdir("C:/Users/Smile/.spyder-py3/datasets"))
# Any results you write to the current directory are saved as output.
data = ["name","version","tool_name","wmc","dit","noc","cbo","rfc","lcom","ca","ce","npm","lcom3","loc","dam","moa","mfa","cam","ic","cbm","amc","max_cc","avg_cc","bug"]
data = pd.DataFrame()
for file in os.listdir():
    if file.endswith('.csv'):
        data = pd.read_csv(file)
        data.set_index('name', inplace=True)
        data = data.append(data, ignore_index=True)
print(data.head(5))
My output is given below:
Empty DataFrame
Columns: []
Index: []
You overwrite data each time you read a new CSV.
Replace the data variable with a temporary variable, like this:
data = pd.DataFrame()
for file in os.listdir():
    if file.endswith('.csv'):
        csv_data = pd.read_csv(file)
        csv_data.set_index('name', inplace=True)
        data = data.append(csv_data, ignore_index=True)
print(data.head(5))
By reading each new csv into data ('data = pd.read_csv(file)'), you overwrite the rows you appended in the previous iteration. The accumulated frame must stay intact in order to keep appending to it, so each CSV must be read into a separate variable.
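Note that DataFrame.append was removed in pandas 2.0. The same fix written with pd.concat, using temporary files as a stand-in for the dataset directory (file names and columns here are toy assumptions):

```python
import os
import tempfile
import pandas as pd

# Build two small CSVs to stand in for the dataset directory.
d = tempfile.mkdtemp()
pd.DataFrame({'name': ['a'], 'bug': [1]}).to_csv(os.path.join(d, 'one.csv'), index=False)
pd.DataFrame({'name': ['b'], 'bug': [0]}).to_csv(os.path.join(d, 'two.csv'), index=False)

# Collect each file into its own frame, then combine once at the end.
frames = []
for file in sorted(os.listdir(d)):
    if file.endswith('.csv'):
        csv_data = pd.read_csv(os.path.join(d, file))
        csv_data.set_index('name', inplace=True)
        frames.append(csv_data)
data = pd.concat(frames)
```

Concatenating once after the loop is also faster than appending inside it, since each append copies the whole frame.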

Variable containing list of values to be saved in an excel using python

I'm new to python and trying my luck.
I have a JSON file from which I extract particular items; those items are saved in variables, and using a for loop I display the entire JSON data as output.
Basically, I want the entire console output in an Excel file with the help of a dataframe (pandas), or an alternative way, much appreciated.
import pandas as pd
import json

with open('i4.json', encoding='utf-8-sig') as f:
    data = json.load(f)
for ib in data['documents']:
    tit = ib['title']
    stat = ib['status']
    print(tit, stat)
    df = pd.DataFrame({'Title': [tit], 'Status': [stat]})
    df.to_excel('fromSIM.xls', index=False)
The output is, for example:
title1 pass
title2 fail
The problem is that the excel file is saved as below:
Title Status
title2 fail
Can anyone adjust the above code so that all of the output is saved in the excel file, each value below the previous one?
The problem is that you are overwriting the data frame in each loop iteration. You should create the data frame outside the for loop, and only append the new rows to the DF inside the loop.
import pandas as pd

columns = ['Title', 'Status']
df_ = pd.DataFrame(columns=columns)
for ib in data['documents']:
    tit = ib['title']
    stat = ib['status']
    print(tit, stat)
    df_ = df_.append(pd.Series([tit, stat], index=df_.columns), ignore_index=True)
df_.to_excel('fromSIM.xls', index=False)
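Since DataFrame.append was removed in pandas 2.0, a forward-compatible variant collects plain dicts and builds the frame once at the end (the toy documents dict below is a stand-in for json.load(f)):

```python
import pandas as pd

# Toy stand-in for the parsed JSON; the real data comes from i4.json.
data = {'documents': [{'title': 'title1', 'status': 'pass'},
                      {'title': 'title2', 'status': 'fail'}]}

# Collect one dict per row, then build the frame in a single call.
rows = [{'Title': ib['title'], 'Status': ib['status']}
        for ib in data['documents']]
df = pd.DataFrame(rows, columns=['Title', 'Status'])

# Writing is then a single call once the frame is complete
# (requires an Excel engine such as openpyxl):
# df.to_excel('fromSIM.xlsx', index=False)
```

Building from a list of dicts is also faster than row-by-row append, which copies the whole frame on every iteration.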
