I am copying data from my clipboard that contains no headers. I don't want the index column, and I want to name the columns dynamically by count (i.e. 1, 2, 3, ...), skipping the first column. The output data set should look like the following.
1 2 3 4 5 6 7 8 9 10
1981 5012.0 8269.0 10907.0 11805.0 13539.0 16181.0 18009.0 18608.0 18662.0 18834.0
Here is the code I'm starting with. The code works, but the column headers aren't dynamic, and the data set may not always have the same number of columns. I'm not sure how to make the column headers dynamic.
import pandas as pd
df = pd.read_clipboard(index_col = 0, names = ["","1","2","3","4","5","6","7","8","9","10"])
To get exactly what you are looking for you can use:
df = pd.read_clipboard(header=None, index_col=0).rename_axis(None)
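A minimal sketch of why this works, using an in-memory string as a hypothetical stand-in for the clipboard (read_clipboard passes its keyword arguments on to read_csv, so the same options apply). With header=None pandas numbers the columns 0, 1, 2, ...; taking column 0 as the index leaves 1..N as the column labels, however many columns there are.

```python
import io
import pandas as pd

# Hypothetical stand-in for the clipboard: whitespace-separated, no header row.
raw = "1981 5012.0 8269.0 10907.0\n1982 106.0 4285.0 5396.0\n"

# header=None -> columns are numbered 0..N; index_col=0 moves column 0 into the index,
# so the remaining labels are 1..N regardless of the column count.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None, index_col=0)

# Drop the index name (0) so no extra header line is shown above the index.
df = df.rename_axis(None)

print(df)
```

The resulting labels are integers; if string labels are needed, df.columns.astype(str) converts them.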
How to read the first cell from my csv file and store it as a variable
For example, my file is:
header 1  header 2
AM
Depth     Value
10        20
30        122
60        222
How can I read the AM cell and store it in a variable x?
And how can I then ignore the AM cell and start my data frame from the headers (Depth, Value)?
You should be able to get a specific row/column using indexing; iloc can help.
For example, df.iloc[0,0] returns AM.
Also, pandas.read_csv allows you to skip rows when reading the data. You can use pd.read_csv("test.csv", sep="\t",skiprows=1) to skip the first row.
Result:
0 10 20
1 30 122
2 60 222
Use pd.read_csv and then select the first row:
import pandas as pd
df = pd.read_csv('your file.csv')
x = df.iloc[0]['header 1']
Then, to delete it, use df.drop:
df.drop(0, inplace=True)
Hi, I am using a dummy CSV file generated from the data you posted in this question.
import pandas as pd
# read data
df = pd.read_csv('test.csv')
File contents are as follows:
header 1 header 2
0 AM NaN
1 Depth Value
2 10 20
3 30 122
4 60 222
You can use the usecols parameter to load only selected columns. This file has two columns, so the valid positions are 0 and 1; pass [0] if you are only interested in the first column.
You can save the result to x or whichever variable you want as follows:
# Change usecols to load various columns in the data
x = pd.read_csv('test.csv',usecols=[0])
Header:
# Set the header parameter to the (0-based) number of the line to use as the header
pd.read_csv('test.csv',header=2)
Depth Value
0 10 20
1 30 122
2 60 222
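The pieces from the answers above can be combined into one sketch: read only the first data row to pick up the AM cell, then re-read the file with the Depth/Value line as the header. The csv_text string is a hypothetical stand-in for test.csv, and it assumes a comma-separated file laid out as in the question.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv, laid out as in the question.
csv_text = "header 1,header 2\nAM,\nDepth,Value\n10,20\n30,122\n60,222\n"

# nrows=1 reads the header line plus one data row, which is enough to get the AM cell.
x = pd.read_csv(io.StringIO(csv_text), nrows=1).iloc[0, 0]

# header=2 uses the third line (0-based index 2) as the header and skips the lines before it.
df = pd.read_csv(io.StringIO(csv_text), header=2)

print(x)
print(df)
```

Reading the table separately with header=2 lets pandas infer numeric dtypes for Depth and Value, which is not the case if AM and the header text are loaded into the same columns first.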
I used a pivot table in pandas and got the dataframe format I wanted, but now I have two header rows. The resulting dataframe after the pivot is as follows:
scenario Actual Plan
LY_USD_AMT USD_AMT LY_USD_AMT USD_AMT
package
Africa 3 3 0 0
Brazil 1 1 1 1
Canada 1 1 1 1
Mexico 0 0 1 1
I have managed to delete the last row of the header using the following:
pd_piv.columns = pd_piv.columns.droplevel(-1)
But at this point, it becomes difficult to identify which column is which, as it renders column names like
LY_USD_AMT USD_AMT LY_USD_AMT USD_AMT
Is there any way to resolve this issue, maybe by combining the two headers to get a simpler tabular dataframe like the one below? I need a simple table since I am going to feed this to an external system which recognises only one header line.
ACTUAL_LY_USD_AMT ACTUAL_USD_AMT Plan_LY_USD_AMT Plan_USD_AMT
You can combine both the headers:
df.columns = [c[0] + "_" + c[1] for c in df.columns]
This would change the multiple headers to a combined header.
Eg.:
My dataframe with multiple headers:
location location2
S1 S2 S3 S1 S2 S3
a -1.268587 0.014928 0.121195 -1.250765 0.321319 0.017481
Output from the above code:
location_S1 location_S2 location_S3 location2_S1 location2_S2 location2_S3
a -1.268587 0.014928 0.121195 -1.250765 0.321319 0.017481
You can assign any list to the columns, and pandas will convert it to a proper Index under the hood. So if the values that make up your column headings are strings, you can do something as simple as this:
pd_piv.columns = ['_'.join(header).upper() for header in pd_piv.columns]
So your columns end up being:
ACTUAL_LY_USD_AMT ACTUAL_USD_AMT PLAN_LY_USD_AMT PLAN_USD_AMT
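As a self-contained sketch of the flattening step, the frame below is built by hand to resemble the pivot output in the question (the values and the pd_piv name are taken from the question; the construction itself is an assumption, since the original pivot_table call is not shown). Casting each label to str before joining is an added precaution for non-string labels.

```python
import pandas as pd

# Hand-built stand-in for the pivot result: two column levels, "package" as the index.
cols = pd.MultiIndex.from_product([["Actual", "Plan"], ["LY_USD_AMT", "USD_AMT"]])
pd_piv = pd.DataFrame(
    [[3, 3, 0, 0], [1, 1, 1, 1]],
    index=pd.Index(["Africa", "Brazil"], name="package"),
    columns=cols,
)

# Each column label is a tuple such as ("Actual", "LY_USD_AMT"); join its parts with "_".
pd_piv.columns = ["_".join(map(str, col)) for col in pd_piv.columns]

# Move "package" out of the index so the table has a single, flat header line.
flat = pd_piv.reset_index()
print(flat)
```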
I have to read an Excel file in pandas which contains multiple sheets.
Unfortunately, the number of blank rows before the header starts differs between sheets:
pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')
Is there an elegant way to fix this and read the Excel file into a pandas.DataFrame with an additional column which contains the name of each sheet?
I.e. how can
pd.read_excel(file_name, sheet_name=None)
be passed a varying header argument or choose at least the 2 first (non empty) rows as header?
edit
dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.
edit2
Description of exact file structure:
... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2
Currently there are either 0 or 1 empty rows. But as pointed out in the comments, it would be great if this were handled more dynamically.
I am certain this could be done more neatly, but a way to achieve (I think) what you want is:
import openpyxl
import pandas as pd

book = openpyxl.load_workbook(PATH_TO_FILE)
frames = []
for sh in book.sheetnames:
    # Drop fully empty rows so every sheet has the same layout
    a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
    a.columns = a.iloc[1]  # row 0 is the irrelevant row, row 1 holds the header
    a = a.iloc[2:].rename_axis(None, axis=1)
    a["sheet"] = sh
    frames.append(a)
# DataFrame.append was removed in pandas 2.0, so combine with pd.concat
b = pd.concat(frames)
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd
Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.
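For the two-row header the question asks about, one possible sketch is a small helper that drops blank rows, skips the irrelevant row, and turns the next two rows into a column MultiIndex. The helper name promote_headers and its parameters are hypothetical, and the in-memory frames stand in for what pd.read_excel(file_name, sheet_name=None, header=None) would return (a dict mapping sheet names to DataFrames).

```python
import pandas as pd

def promote_headers(raw, skip=1, n_header=2):
    """Drop blank rows, skip `skip` rows, then use the next `n_header` rows as column labels."""
    body = raw.dropna(how="all").reset_index(drop=True)
    header_rows = body.iloc[skip:skip + n_header]
    data = body.iloc[skip + n_header:].reset_index(drop=True)
    # One list per header row -> one level of the column MultiIndex per row.
    data.columns = pd.MultiIndex.from_arrays(header_rows.to_numpy().tolist())
    return data

# Hypothetical sheets with a different number of leading blank rows.
sheets = {
    "first": pd.DataFrame([[None, None], [None, None], ["junk", None],
                           ["A", "A"], ["x", "y"], [1, 2], [3, 4]]),
    "second": pd.DataFrame([[None, None], ["junk", None],
                            ["A", "A"], ["x", "y"], [5, 6]]),
}

# Passing a dict to concat uses its keys (the sheet names) as an outer index level.
combined = pd.concat(
    {name: promote_headers(df) for name, df in sheets.items()},
    names=["sheet", "row"],
)
print(combined)
```

Here the sheet name is kept as an index level rather than a column, which avoids mixing a one-level label into two-level columns. The sketch assumes blank rows are entirely empty and that exactly one irrelevant row precedes the headers.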
I'm a beginner at Python and I have a school project where I need to analyze an Excel document. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains a code that we use to identify materials. A material code looks like this -> 3A8356. There are different material codes in the same column, and they repeat a lot. I want to identify them and make a list with each code appearing only once. Is there a way I can analyze the column and extract the codes so I can make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it to something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
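If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials']). The default keep='first' retains one row per code; keep=False would instead drop every code that appears more than once.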
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
The subset argument indicates which columns to consider when identifying duplicates.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Per the docs, the new data frame with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to reset the index, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
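A minimal sketch on the sample column from the question, with the data typed in by hand instead of loaded from the Excel file. It keeps one row per code, sorts the codes to match the desired output, and also shows what keep=False returns for comparison.

```python
import pandas as pd

# Sample codes from the question, typed in as a stand-in for the Excel column.
df = pd.DataFrame({"Materials": ["3A8356", "3A8376", "3A8356", "3A8356", "3A8346", "3A8346"]})

# keep='first' (the default) retains the first occurrence of each code.
unique_codes = (
    df.drop_duplicates(subset=["Materials"])
      .sort_values("Materials")
      .reset_index(drop=True)
)

# keep=False removes every code that occurs more than once.
only_singletons = df.drop_duplicates(subset=["Materials"], keep=False)

print(unique_codes)
print(only_singletons)
```

If only the values are needed, sorted(df["Materials"].unique()) gives the same codes as a plain list.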
I'm building a program that collects data and adds it to an ongoing excel sheet weekly (read_excel() and concat() with the new data). The issue I'm having is that I need the columns to have the same name for presentation (it doesn't look great with x.1, x.2, ...).
I only need this on the final output. Is there any way to accomplish this? Would it be too time consuming to modify pandas?
You can create a list of custom headers that will be written to Excel:
newColNames = ['x','x','x'.....]
df.to_excel(path,header=newColNames)
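A sketch of applying the aliases only at export time, so the dataframe keeps its unique x, x.1, x.2 names internally. It writes CSV to an in-memory buffer so it runs without an Excel engine; to_excel accepts a list for header in the same way. Deriving the display name by cutting at the first dot is an assumption that only holds if the real column names contain no dots of their own.

```python
import io
import pandas as pd

# Columns as pandas names them after reading duplicate headers.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["x", "x.1", "x.2"])

# Strip the ".N" suffix for display only (assumes original names contain no dots).
display_names = [name.split(".")[0] for name in df.columns]

# header=<list> writes these aliases instead of the real column names.
buf = io.StringIO()
df.to_csv(buf, header=display_names, index=False)

first_line = buf.getvalue().splitlines()[0]
print(first_line)
```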
You can add spaces to the end of the column name. It will look the same in Excel, but pandas can tell the names apart.
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['x','x ','x  '])  # zero, one, and two trailing spaces
df
x x x
0 1 2 3
1 4 5 6
2 7 8 9