I am trying to add a row (a list of ints) to a pandas dataframe, but I am somehow unable to append it. The thing is, the dataframe has no header, and every example I find puts data in by specifying column names. I don't understand how to do it without a header. Below is my dataframe, named sheet:
sheet = pd.read_csv("labels.csv",header = None) #sheet = [[1,1,1,2,2 ...]]
I want to append the resulting list simple_list = [1,2,3,4,2,1,1,1] to it.
Does the following work?
sheet = sheet.append(pd.DataFrame([[1,2,3,4,2,1,1,1]]), ignore_index=True)
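Note that DataFrame.append was removed in pandas 2.0; on recent versions the same thing can be done with pd.concat. A minimal sketch, using sheet and simple_list from the question:
sheet = pd.concat([sheet, pd.DataFrame([simple_list])], ignore_index=True)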
You need to create a series out of simple_list.
simple_series = pd.Series(simple_list)
Then you can append it as a row to your dataframe with the argument ignore_index=True:
sheet = sheet.append(simple_series, ignore_index=True)
This works for me. sheet.append returns a new dataframe, so you need to assign the result to the same variable or to a new one to use it; otherwise "sheet" will always keep the original CSV content.
import pandas as pd
if __name__ == '__main__':
    sheet = pd.read_csv("labels.csv", header=None)
    sheet = sheet.append([[41, 42, 42, 44, 45]], ignore_index=True)
I have this code which appends a column of a csv file as a row to another csv file:
def append_pandas(s, d):
    import pandas as pd
    df = pd.read_csv(s, sep=';', header=None)
    df_t = df.T
    df_t.iloc[0:1, 0:1] = 'Time Point'
    df_t.at[1, 0] = 1
    df_t.columns = df_t.iloc[0]
    df_new = df_t.drop(0)
    pdb = pd.read_csv(d, sep=';')
    newpd = pdb.append(df_new)
    newpd.to_csv(d, sep=';')
The result is supposed to look like this:
Instead, every time the row is appended, there is an extra "Unnamed" column appearing on the left:
Do you know how to fix that?..
Please, help :(
My csv documents from which I select a column look like this:
You have to add index=False to your to_csv() call.
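Applied to the snippet above, the last line would become something like:
newpd.to_csv(d, sep=';', index=False)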
I have a list of values; if any of them appears in the column 'Books', I would like that row to be returned.
I think I have achieved this with the below code:
import pandas as pd

def return_Row():
    file = 'TheFile.xls'
    df = pd.read_excel(file)
    listOfValues = ['A','B','C']
    return df.loc[df['Column'].isin(listOfValues)]
This currently only seems to work on the first worksheet. Since there are multiple worksheets in 'TheFile.xls', how would I go about looping through them to return any rows where one of the listOfValues appears in the 'Books' column of the other sheets?
Any help would be greatly appreciated.
Thank you
The thing is, pd.read_excel() returns a dataframe for the first sheet only if you don't specify the sheet_name argument. If you want to get all the sheets in the Excel file without specifying their names, you can pass None to sheet_name as follows:
df = pd.read_excel(file, sheet_name=None)
This will give you a dict with a separate dataframe for each sheet, which you can loop over and process however you want. For example, you can append the rows you need to a list and return the list:
def return_Row():
    file = 'TheFile.xls'
    results = []
    dfs = pd.read_excel(file, sheet_name=None)
    listOfValues = ['A','B','C']
    for df in dfs.values():
        results.append(df.loc[df['Column'].isin(listOfValues)])
    return results
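If you would rather get a single DataFrame back instead of a list of per-sheet results, you could concatenate them before returning, for example:
return pd.concat(results, ignore_index=True)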
In the following code I'm trying to read multiple sheets from an Excel file, remove the empty cells, group the columns, and store the result in another Excel file:
import pandas as pd
sheets = ['R9_14062021','LOGS R9','LOGS R7 01032021']
df = pd.read_excel('LOGS.xlsx',sheet_name = sheets )
df.dropna(inplace = True)
df['Dt'] = pd.to_datetime(df['Dt']).dt.date
df1 = df.groupby(['Dt','webApp','mw'])['chgtCh','accessRecordModule','playerPlay startOver','playerPlay PdL','playerPlay PVR','contentHasAds','pdlComplete','lirePdl','lireVod'].sum()
df1.reset_index(inplace=True)
df1.to_excel(r'logs1.xlsx', index = False)
When I execute my script I get the following error:
AttributeError: 'dict' object has no attribute 'dropna'
How can I fix it?
When you provide a list of sheets for the sheet_name param, the return value is a dict of DataFrame, as described here.
dropna is a method of DataFrame, so you have to select a sheet first. For example:
df['R9_14062021'].dropna(inplace=True)
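If you need to clean every sheet rather than just one, you can loop over the dict that read_excel returned (a small sketch, assuming df is that dict of DataFrames):
for name, frame in df.items():
    frame.dropna(inplace=True)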
Taken from pandas documentation for pd.read_excel:
If you give sheet_name a list, you will receive a dict of dataframes, keyed by sheet name.
That means you'll have to go over each dataframe and call dropna() separately, because you can't call dropna() on a dictionary. Your code will look like this:
import pandas as pd
sheets = ['R9_14062021','LOGS R9','LOGS R7 01032021']
dfs_list = pd.read_excel('LOGS.xlsx',sheet_name = sheets )
for i in dfs_list:
    df = dfs_list[i]
    df.dropna(inplace=True)
    df['Dt'] = pd.to_datetime(df['Dt']).dt.date
    df1 = df.groupby(['Dt','webApp','mw'])['chgtCh','accessRecordModule','playerPlay startOver','playerPlay PdL','playerPlay PVR','contentHasAds','pdlComplete','lirePdl','lireVod'].sum()
    df1.reset_index(inplace=True)
    df1.to_excel(r'logs1.xlsx', index=False)
The main difference here is the usage of
for i in dfs_list:
    df = dfs_list[i]
in order to apply each change you are making to each dataframe. If you want a specific dataframe, you can select it by sheet name, for example dfs_list['R9_14062021'].dropna().
Hope this helps and this is what you were aiming for.
I would like to convert an Excel file to a pandas dataframe. All the sheet names contain spaces, for instance 'part 1 of 22', 'part 2 of 22', and so on. In addition, the first column is the same for all the sheets.
I would like to convert this Excel file to a single dataframe. However, I don't know what happens with the names in Python. I mean, I was able to import the sheets, but I do not know the names of the resulting dataframes. After this I would like to use another 'for' loop and pd.merge() in order to create a single dataframe.
for sheet_name in Matrix.sheet_names:
    sheet_name = pd.read_excel(Matrix, sheet_name)
    print(sheet_name.info())
Using only the code snippet you have shown, each sheet (each DataFrame) will be assigned to the variable sheet_name. Thus, this variable is overwritten on each iteration and you will only have the last sheet as a DataFrame assigned to that variable.
To achieve what you want to do you have to store each sheet, loaded as a DataFrame, somewhere, a list for example. You can then merge or concatenate them, depending on your needs.
Try this:
all_my_sheets = []
for sheet_name in Matrix.sheet_names:
    sheet_name = pd.read_excel(Matrix, sheet_name)
    all_my_sheets.append(sheet_name)
Or, even better, using list comprehension:
all_my_sheets = [pd.read_excel(Matrix, sheet_name) for sheet_name in Matrix.sheet_names]
You can then concatenate them into one DataFrame like this:
final_df = pd.concat(all_my_sheets, sort=False)
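Since you mentioned wanting a pd.merge() on the shared first column, here is a minimal sketch of that, assuming the common column is named 'ID' (replace it with the real column name from your sheets):
from functools import reduce
merged_df = reduce(lambda left, right: pd.merge(left, right, on='ID'), all_my_sheets)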
You might consider using the openpyxl package:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filename=file_path, read_only=True)
all_my_sheets = wb.sheetnames

# Assuming your sheets have the same headers and footers
records = []
first_row = 1
for name in all_my_sheets:
    ws = wb[name]
    for row in ws.iter_rows(min_row=first_row):
        records.append([cell.value for cell in row])
    # Make sure you don't duplicate the header on the following sheets
    first_row = 2

# Set the column names from the first collected row
header = records.pop(0)

# Create your df
df = pd.DataFrame(records, columns=header)
It may be easiest to call read_excel() once and get all the sheets back from that single call.
So, the first step would look like this:
dfs = pd.read_excel(Matrix, sheet_name=["Sheet 1", "Sheet 2", "Sheet 3"])
Note that the sheet names you use in the list should be the same as those in the excel file. Then, if you wanted to vertically concatenate these sheets, you would just call:
final_df = pd.concat(dfs.values())
Note that this solution would result in a final_df that includes column headers from all three sheets. So, ideally they would be the same. It sounds like you want to merge the information, which would be done differently; we can't help you with the merge without more information.
I hope this helps!
I have a lot of different tables (and other unstructured data) in an Excel sheet. I need to create a dataframe out of the range 'A3:D20' from 'Sheet2' of the Excel file 'data'.
All the examples that I come across drill down only to the sheet level, not to an exact range.
import openpyxl
import pandas as pd
wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.get_sheet_by_name('Sheet2')
range = ['A3':'D20'] #<-- how to specify this?
spots = pd.DataFrame(sheet.range) #what should be the exact syntax for this?
print (spots)
Once I get this, I plan to look up data in column A and find its corresponding value in column B.
Edit 1: I realised that openpyxl takes too long, and so have changed that to pandas.read_excel('data.xlsx','Sheet2') instead, and it is much faster at that stage at least.
Edit 2: For the time being, I have put my data in just one sheet and:
removed all other info
added column names,
applied index_col on my leftmost column
then used wb.loc[]
Use the following arguments from pandas read_excel documentation:
skiprows : list-like
Rows to skip at the beginning (0-indexed)
nrows: int, default None
Number of rows to parse.
parse_cols : int or list, default None
If None then parse all columns,
If int then indicates last column to be parsed
If list of ints then indicates list of column numbers to be parsed
If string then indicates comma separated list of column names and column ranges (e.g. “A:E” or “A,C,E:F”)
I imagine the call will look like:
df = pd.read_excel(filename, 'Sheet2', skiprows=2, nrows=18, parse_cols='A:D')
EDIT:
in later versions of pandas, parse_cols has been renamed to usecols, so the above call should be rewritten as:
df = pd.read_excel(filename, 'Sheet2', skiprows=2, nrows=18, usecols='A:D')
One way to do this is to use the openpyxl module.
Here's an example:
from openpyxl import load_workbook
wb = load_workbook(filename='data.xlsx',
read_only=True)
ws = wb['Sheet2']
# Read the cell values into a list of lists
data_rows = []
for row in ws['A3':'D20']:
    data_cols = []
    for cell in row:
        data_cols.append(cell.value)
    data_rows.append(data_cols)
# Transform into dataframe
import pandas as pd
df = pd.DataFrame(data_rows)
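For the lookup described in the question (find a value in column A and return its counterpart from column B), a minimal sketch using the default integer column labels 0 and 1:
lookup = df.set_index(0)[1]   # column A becomes the index, column B the values
value = lookup['some_key']    # 'some_key' is a placeholder for the value you search for in column A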
My answer, tested with pandas 0.25, worked well:
pd.read_excel('resultat-elections-2012.xls', sheet_name = 'France entière T1T2', skiprows = 2, nrows= 5, usecols = 'A:H')
pd.read_excel('resultat-elections-2012.xls', index_col = None, skiprows= 2, nrows= 5, sheet_name='France entière T1T2', usecols=range(0,8))
So:
I need the data after the first two lines; I selected the desired lines (5) and columns A to H.
Be careful: @shane's answer needs to be improved and updated with the new pandas parameters.