How to get column names after importing pickle file into Pandas - python

I am new to hands-on python and programming in general. I have imported a 6 GB pickle file into pandas and have been able to display its contents, but the output doesn't look well ordered. My dataframe has a varying number of rows and 842 columns.
My next task is to:

1. get the column names of all 842 columns, so I can find columns that have similar features;
2. create a new column (Series) with data from (1) above;
3. "append" the new column to the original dataframe.
Thus far I have tried the "functions" column, col, and dataframe.columns to get the column names, but none of them work. Please see what my program looks like (code and output).

You can get the list of your dataframe's column names using this:
list(your_dataframe.columns)
For adding new columns, check this: new-columns-in pandas
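As a rough end-to-end sketch of the three steps in the question (the pickle path and the "price" substring are purely illustrative assumptions):
import pandas as pd

# hypothetical pickle path; substitute your own file
df = pd.read_pickle("your_file.pkl")

# 1. get the names of all 842 columns
col_names = list(df.columns)

# 2. pick out columns with similar features, e.g. sharing a substring
similar_cols = [c for c in col_names if "price" in c.lower()]

# 3. build a new Series from those columns and append it to the dataframe
df["price_total"] = df[similar_cols].sum(axis=1)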

Related

Accessing imported data from Excel in Pandas

I'm new to python and just trying to redo my first project from MATLAB. I've written code in VS Code to import an excel file using pandas:
import pandas as pd

filename = r'C:\Users\user\Desktop\data.xlsx'
sheet = ['data']
with pd.ExcelFile(filename) as xls:
    Dateee = pd.read_excel(xls, sheet, index_col=0)
Then I want to access data in a row and column.
I tried to print the data using the code below:
for key in dateee.keys():
    print(dateee.keys())
but this returns nothing.
Is there anyway to access the data (as a list)?
You can iterate over each column, converting the contents of each to a list:
for c in df:
    print(df[c].to_list())
Here df is whatever name the dataframe was assigned to. (The OP's code used inconsistent variable names, so I didn't reuse them.)
Look into df.iterrows() or df.itertuples() if you want to iterate by row. Example:
for row in df.itertuples():
    print(row)
Look into df.iloc and df.loc for row and column selection of individual values; see Pandas iloc and loc – quickly select rows and columns in DataFrames. Or df.iat and df.at for getting or setting single values.
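A minimal sketch of those selection methods (the dataframe here is invented for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

print(df.loc[0, 'a'])   # label-based lookup: row label 0, column 'a'
print(df.iloc[0, 1])    # position-based lookup: first row, second column
print(df.at[2, 'b'])    # fast scalar access by label
df.iat[2, 1] = 60       # fast scalar assignment by position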

Converting for loop to numpy calculation for pandas dataframes

So I have a python script that compares two dataframes and finds any rows that are not in both. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that the iteration is the problem. However, I haven't had much luck with various numpy and pandas methods such as merge and where.
A couple of caveats:

- The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
- I want to use only the column names from one of the dataframes.
- df_new represents new information to be checked against what is currently on file (df_current).
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if not df_new[new_col_name][i] in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs in differently named columns in two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe:
test_ids = df2['cola_id'].unique().tolist()
Then filter the first dataframe for those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter:
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
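For the example dataframes above, with current_col_name = 'partno' and new_col_name = 'mfgr_num', a minimal runnable sketch (extra columns omitted for brevity):
import pandas as pd

df_current = pd.DataFrame({'partno': [123, 456],
                           'description': ['Logo T-Shirt', 'Knitted Shirt']})
df_new = pd.DataFrame({'mfgr_num': [456, 789],
                       'desc': ['Knitted Shirt', 'Logo Vest']})

current_col_name = 'partno'
new_col_name = 'mfgr_num'

# keep only the rows of df_new whose ID is absent from df_current
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)  # only the mfgr_num 789 row remains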

How to make the pandas row a column name?

When I create the pandas dataframe, it detects the empty line at the top of the excel file as the column names and shows them as Unnamed. But my column names should be the concentration names on the line below it. How can I do this in pandas? (Editing in Excel is a solution, but I want to automatically edit multiple excel files with python.)
I think the column shown there doesn't represent an actual column; it is simply an indication that there are many columns. If it is a column and you don't want it, you can simply drop it:
df.drop("...")
If it is still not resolved, do comment.
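A minimal sketch of another common fix, assuming the real headers sit on the second row of each sheet: tell read_excel which row holds the column names (header is zero-indexed) and loop over the files.
import glob
import pandas as pd

# header=1 skips the empty first row and uses the second row as column names
for path in glob.glob('*.xlsx'):  # hypothetical file pattern
    df = pd.read_excel(path, header=1)
    print(df.columns)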

How to write dataframe to csv with a single row header(5k columns)?

I am trying to export a pandas dataframe with to_csv so it can be processed by another tool before using it again with python. It is a token dataset with 5k columns. When exported, the header is split across two rows. This might not be an issue for pandas, but in this case I need to export it as a csv with a single header row. Is this a pandas limitation or a csv format one?
So far, searching has returned no compatible results. The only solution I have come up with is writing the column names and the values separately, e.g. writing a str column list first and then a numpy array to the csv. Can this be implemented, and if so, how?
For me this problem was caused by the columns having multiple levels (a MultiIndex). The easiest way to resolve this issue is to specify your own headers. I found reference to an option called tupleize_cols, but it doesn't exist in current (1.2.2) pandas.
I was using the following aggregation:
df.groupby(["device"]).agg({
    "outage_length": ["count", "sum"],
}).to_csv("example.csv")
This resulted in the following csv output:
,outage_length,outage_length
,count,sum
device,,
device0001,3,679.0
device0002,1,113.0
device0003,2,400.0
device0004,1,112.0
I specified my own headers in the call to to_csv, excluding my group-by column, as follows:
}).to_csv("example.csv", header=("flaps", "downtime"))
And got the following csv output, which was much more pleasing to spreadsheet software:
device,flaps,downtime
device0001,3,679.0
device0002,1,113.0
device0003,2,400.0
device0004,1,112.0
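An alternative sketch (not from the original answer): flatten the MultiIndex columns yourself before exporting, which also keeps the header on a single row and preserves descriptive names.
import pandas as pd

# illustrative data roughly matching the output above
df = pd.DataFrame({"device": ["device0001", "device0001", "device0001", "device0002"],
                   "outage_length": [300.0, 266.0, 113.0, 113.0]})

flat = df.groupby(["device"]).agg({"outage_length": ["count", "sum"]})
# join the two header levels, e.g. ("outage_length", "count") -> "outage_length_count"
flat.columns = ["_".join(col) for col in flat.columns]
flat.to_csv("example.csv")  # header row: device,outage_length_count,outage_length_sum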

Iterate Through Folder and Add One Column of Each CSV to Dataframe

I have a folder that contains ~90 CSV files. Each relevant file is named xxxxx-2012 and has the same column names.
I would like to create a single DataFrame with a specific column power(MW) from each file, i.e. 90 columns in total, naming the column in the resulting DataFrame by the file name.
My objective with problems like this is to get to a simple datastructure as quickly as possible. In this case, that could be a dictionary of filenames to DataFrames.
frames = {filename: pd.read_csv(filename) for filename in os.listdir()}
You may have to filter out bad filenames, e.g. by extension, or you may be better off using glob... Either way, this breaks up the problem, so it shouldn't be too bad.
Then the question becomes much easier*:

- How do I get one column from a DataFrame? df[colname].
- How do I concat a list of columns into a DataFrame?

*Assuming you know your way around python data structures, e.g. list comprehensions.
Another option is to just concat the entire dict:
pd.concat(frames)
(which gives you a MultiIndex with all the information.)
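Putting it together, a minimal sketch for the original task (the *-2012*.csv glob pattern is an assumption about the naming scheme; power(MW) is the column named in the question):
import glob
import os
import pandas as pd

# read each matching CSV into a dict keyed by filename
frames = {os.path.basename(f): pd.read_csv(f) for f in glob.glob('*-2012*.csv')}

# take the power(MW) column from each file, labelling it with the filename
result = pd.concat({name: df['power(MW)'] for name, df in frames.items()}, axis=1)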
