Adding filename to column header in dataframe - python

I have a dataframe which I created by merging one column from 7 different Excel files. Below is the code I used:
import pandas as pd
import glob
my_excel_files = glob.glob(r"C:\Users\.........\*.xlsx")
total_dataframe = pd.DataFrame()
for file in my_excel_files:
    df = pd.read_excel(file)  # read the current workbook
    new_df = df['Comments']
    total_dataframe = pd.concat([total_dataframe, new_df], axis=1)  # puts all Comments columns side by side
As you can see from the code, I grab the 'Comments' column from each Excel file and put them together into a new df. The only issue is that I want to add the filename to the column name so I know which column comes from which Excel file; right now all of them are just called 'Comments'. So ideally one of the column headers would be 'Comments (first_response.xlsx)'.

Let's use pathlib and pd.concat.
Using a dict comprehension we can grab the .name attribute from each pathlib Path object, and when using concat the filename will be set as the index.
import pandas as pd
from pathlib import Path

dfs = pd.concat({f.name: pd.read_excel(f) for f in Path(r'C:\Users\..').glob('*.xlsx')})
This will create an index with the file name; you can reset_index if you want to place it as a column.
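If you want the filename embedded in the column header itself, as in 'Comments (first_response.xlsx)', here is a minimal sketch along the same lines (it assumes each workbook really has a 'Comments' column):

import pandas as pd
from pathlib import Path

# Build one Series per file, named after the file, then line them up
# side by side; the dict keys become the column headers.
parts = {
    f"Comments ({f.name})": pd.read_excel(f)["Comments"]
    for f in Path(r'C:\Users\..').glob('*.xlsx')
}
total_dataframe = pd.concat(parts, axis=1)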

Related

Looping over a list of dataframes and appending values to different dataframes

I have a list of files and a list of Dataframes, and I want to use 1 "for" loop to open the first file of the list, extract some data and write it into the first Dataframe, then open the second file, do the same thing and write it into the second dataframe, etc. So I wrote this:
import pandas as pd
filename1 = 'file1.txt'
filename2 = 'file2.txt'
filenames = [filename1, filename2]
columns = ['a', 'b', 'c']
df1 = pd.DataFrame(columns = columns)
df2 = pd.DataFrame(columns = columns)
dfs = [df1, df2]
for name, df in zip(filenames, dfs):
    info = open(name, 'r')
    # go through the file, find some values
    df = df.append({...})  # dictionary with found values
However, when I run the code, instead of having my data written into df1 and df2, which I created in the beginning, those dataframes stay empty, and a new dataframe called df appears in the list of variables, where my data is stored; it also seems to be re-written at every execution of the loop. How do I solve this in the simplest way? The main goal is to have several different dataframes, each corresponding to a different file, at the end of the loop over the list of files. So I don't really care when and how the dataframes are created; I only want a new dataframe to be filled with values when a new file is opened.
Each time you loop through dfs, df is just a local name bound to one of the DataFrame objects in the list. When you assign a new DataFrame to df, you rebind that name to a new object instead of modifying the one in the list (DataFrame.append returns a new DataFrame rather than mutating in place). Re-write your code like this:
dfs = []
for name in filenames:
    with open(name, 'r') as info:
        dfs.append(pd.read_csv(info))
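If you also want to remember which DataFrame came from which file, a small variation is to key them by filename (a sketch, assuming the files parse as CSV):

dfs_by_file = {name: pd.read_csv(name) for name in filenames}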
If the text files are dictionaries, or can be converted to dictionaries after reading (say, one JSON object per file), with keys a, b, and c just like the dataframe columns you created, then they can be assigned this way:
import json
import pandas as pd

filename1 = 'file1.txt'
filename2 = 'file2.txt'
filenames = [filename1, filename2]
columns = ['a', 'b', 'c']
df1 = pd.DataFrame(columns=columns)
df2 = pd.DataFrame(columns=columns)
dfs = [df1, df2]
for name, df in zip(filenames, dfs):
    with open(name, 'r') as info:
        data = json.load(info)  # assumes each file holds one JSON-style dictionary
        for key in data.keys():
            df[key] = [data[key]]  # wrap the value in a list so it becomes a row
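For example, if file1.txt holds the object {"a": 1, "b": 2, "c": 3}, json.load turns it into a dict and df1 ends up with a single row containing those three values. Because the assignment mutates the DataFrame in place, df1 and df2 inside dfs really are filled.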
The reason for this is that Python doesn't know you're trying to re-assign the variable names "df1" and "df2". The list you declare "dfs" is simply a list of two empty dataframes. You never alter that list after creation, so it remains a list of two empty dataframes, which happen to individually be referenced as "df1" and "df2".
I don't know how you're constructing a DF from the file, so I'm just going to assume you have a function somewhere called make_df_from_file(filename) that handles the open() and parsing of a CSV, dict, whatever.
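For concreteness, a minimal hypothetical stand-in (swap the body for whatever parsing your files actually need):

import pandas as pd

def make_df_from_file(filename):
    # Placeholder: parse one file into a DataFrame.
    return pd.read_csv(filename)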
If you want to have a list of dataframes, it's easiest to just declare a list and add them one at a time, rather than trying to give each DF a separate name:
df_list = []
for name in filenames:
    df_list.append(make_df_from_file(name))
If you want to get a bit slicker (and faster) about it, you can use a list comprehension which combines the previous script into a single line:
df_list = [make_df_from_file(name) for name in filenames]
To reference individual dataframes in that list, you can just pull them out by index as you would with any other list:
df1 = df_list[0]
df2 = df_list[1]
...
but that's often more trouble than it's worth.
If you want to then combine all the DFs into a single one, pandas.concat() is your friend:
from pandas import concat
dfs = concat(df_list)
or, if you don't care about df_list other than as an intermediate step:
from pandas import concat
dfs = concat([make_df_from_file(name) for name in filenames])
And if you absolutely need to give separate names to all the dataframes, you can get ultra-hacky with it. (Seriously, you shouldn't normally do this, but it's fun and awful. See this link for more bad ideas along these lines.)
for n, d in enumerate(df_list):
    locals()[f'df{n+1}'] = d
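A tamer way to keep the names without touching locals() is a plain dict (a sketch):

named_dfs = {f'df{n+1}': d for n, d in enumerate(df_list)}
df1 = named_dfs['df1']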

Deleting an unnamed column from a csv file Pandas Python

I am trying to write code that deletes the unnamed column that comes right before Unix Timestamp. After deleting, I will save the modified dataframe into data.csv. How would I be able to get the expected output below?
import pandas as pd
data = pd.read_csv('data.csv')
data.drop('')  # my attempt; this doesn't remove the unnamed column
data.to_csv('data.csv')
data.csv file
,Unix Timestamp,Date,Symbol,Open,High,Low,Close,Volume
0,1635686220,2021-10-31 13:17:00,BTCUSD,60638.0,60640.0,60636.0,60638.0,0.4357009185659157
1,1635686160,2021-10-31 13:16:00,BTCUSD,60568.0,60640.0,60568.0,60638.0,3.9771881707839967
2,1635686100,2021-10-31 13:15:00,BTCUSD,60620.0,60633.0,60565.0,60568.0,1.3977284440628714
Updated csv (Expected Output):
Unix Timestamp,Date,Symbol,Open,High,Low,Close,Volume
1635686220,2021-10-31 13:17:00,BTCUSD,60638.0,60640.0,60636.0,60638.0,0.4357009185659157
1635686160,2021-10-31 13:16:00,BTCUSD,60568.0,60640.0,60568.0,60638.0,3.9771881707839967
1635686100,2021-10-31 13:15:00,BTCUSD,60620.0,60633.0,60565.0,60568.0,1.3977284440628714
This is the index. Use index=False in to_csv.
data.to_csv('data.csv', index=False)
Set the first column as the index with df = pd.read_csv('data.csv', index_col=0), and set index=False when writing the results.
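Putting both suggestions together, a minimal round trip:

import pandas as pd

# Treat the unnamed first column as the index on the way in...
data = pd.read_csv('data.csv', index_col=0)
# ...and don't write the index back out.
data.to_csv('data.csv', index=False)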
You can follow the code below. It takes the columns from the 1st position onward, and then you can save that df to csv without index values.
df = df.iloc[:,1:]
df.to_csv("data.csv",index=False)

Creating a dataframe from several .txt files - each file being a row with 25 values

So, I have 7200 txt files, each with 25 lines. I would like to create a dataframe from them, with 7200 rows and 25 columns -- each line of the .txt file would be a value in a column.
For that, first I have created a list column_names with length 25, and tested importing one single .txt file.
However, when I try this:
pd.read_csv('Data/fake-meta-information/1-meta.txt', delim_whitespace=True, names=column_names)
I get a 25x25 dataframe, with values only in the first column. How do I read this into the dataframe so that the txt lines become values across the columns, instead of everything going into the first column and creating 25 rows?
My next step would be creating a for loop to append each text file as a new row.
Probably something like this:
dir1 = *folder_path*
files = os.listdir(dir1)  # renamed from "list" to avoid shadowing the builtin
number_files = len(files)
for i in range(number_files):
    title = files[i]
    df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, names=column_names)
    df = df.append(df_temp, ignore_index=True)
I hope I have been clear. Thank you all in advance!
read_csv generates a row per line in the source file but you want them to be columns. You could read the rows and pivot to columns, but since these files have a single value per line, you can just read them in numpy and use each resulting array as a row in a dataframe.
import numpy as np
import pandas as pd
from pathlib import Path
dir1 = Path(".")
df = pd.DataFrame([np.loadtxt(filename) for filename in dir1.glob("*.txt")])
print(df)
tdelaney's answer is probably "better" than mine, but if you want to keep your code more stylistically closer to what you are currently doing the following is another option.
You are getting your current output (25x25 with data in the first column only) because your read data is 25x1 but you are forcing the dataframe to have 25 columns with your names=column_names parameter.
To solve, just wait until the end to apply the column names:
Get a 25x1 df (drop the names param, and pass header=None so the first line is kept as data):
df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, header=None)
Collect the 25x1 dfs and place them side by side, forming a 25x7200 df (a row-wise append would stack them into one long column instead): df = pd.concat(df_temps, axis=1)
Transpose the df, forming the final 7200x25 df: df = df.T
Add column names: df.columns = column_names
These steps are combined in the sketch below.
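Putting those steps together, a sketch (it assumes column_names is the 25-element list from the question; os.listdir order is not guaranteed, hence the sort):

import os
import pandas as pd

dir1 = 'Data/fake-meta-information/'
df_temps = [pd.read_csv(os.path.join(dir1, title), delim_whitespace=True, header=None)
            for title in sorted(os.listdir(dir1)) if title.endswith('.txt')]
df = pd.concat(df_temps, axis=1).T.reset_index(drop=True)  # 7200 rows x 25 columns
df.columns = column_names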

How to concatenate multiple selected sheets from many XL spreadsheets

I'm relatively new to python and pandas, and I face the following problem: I have 20+ spreadsheets, each with multiple sheets. I'd like to concatenate the second sheet from each spreadsheet into a single spreadsheet. I'm using the code below, which works to the point that it creates a list of sheets, but it doesn't concatenate them correctly: the combined file has only the sheet from the first file. Each sheet has the same header row and the same structure.
Any help would be appreciated. The code I'm using is below:
import os
import glob
import pandas as pd
os.chdir(r"C:\Users\Site_Users")
extension = 'xlsx'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# combine all files in the list
xl_list = []
for f in all_filenames:
    df = pd.read_excel(f, sheet_name=1)
    xl_list.append(df)
combined = pd.concat(xl_list, ignore_index=True)
combined.to_excel("combined.xlsx", index=False)
Working under the assumption that you have a list of df's, try adding axis=0 to your concat (axis=0 is the default, so this mostly makes the row-wise intent explicit).
i.e.
combined = pd.concat(xl_list, axis=0, ignore_index=True)
Just to close the loop on this: I found the answer. The code was correct, but there were a number of rows that looked empty while actually containing formulas; to the code these looked like non-empty cells, so those rows were added to the combined sheet. Because of this I missed the added rows, as they were 400 rows below the empty-looking ones.
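If you hit the same situation, one way to clean it up after concatenating (a sketch; it assumes the formula cells come through as empty strings) is to treat those as missing and drop all-empty rows:

import pandas as pd

# Turn empty strings into NA, then drop rows where every cell is missing.
combined = combined.replace('', pd.NA).dropna(how='all')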

Python Pandas Dataframe Append Rows

I'm trying to append the data frame values as rows, but it's appending them as columns. I have 32 files from which I would like to take the second column (called dataset_code) and append it. But it's creating 32 rows and 101 columns. I would like 1 column and 3232 rows.
import pandas as pd
import os

source_directory = r'file_path'
df_combined = pd.DataFrame(columns=["dataset_code"])
for file in os.listdir(source_directory):
    if file.endswith(".csv"):
        # Read the new CSV to a dataframe.
        df = pd.read_csv(source_directory + '\\' + file)
        df = df["dataset_code"]
        df_combined = df_combined.append(df)
print(df_combined)
You already have two perfectly good answers, but let me make a couple of recommendations.
If you only want the dataset_code column, tell pd.read_csv directly (usecols=['dataset_code']) instead of loading the whole file into memory only to subset the dataframe immediately.
Instead of appending to an initially-empty dataframe, collect a list of dataframes and concatenate them in one fell swoop at the end. Appending rows to a pandas DataFrame is costly (it has to create a whole new one), so your approach creates 65 DataFrames: one at the beginning, one when reading each file, one when appending each of the latter — maybe even 32 more, with the subsetting. The approach I am proposing only creates 33 of them, and is the common idiom for this kind of importing.
Here is the code:
import os
import pandas as pd

source_directory = r'file_path'
dfs = []
for file in os.listdir(source_directory):
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join(source_directory, file),
                         usecols=['dataset_code'])
        dfs.append(df)
df_combined = pd.concat(dfs)
df["dataset_code"] is a Series, not a DataFrame. Since you want to append one DataFrame to another, you need to change the Series object to a DataFrame object.
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> type(df['dataset_code'])
<class 'pandas.core.series.Series'>
To make the conversion, do this:
df = df["dataset_code"].to_frame()
Alternatively, you can create a dataframe with double square brackets:
df = df[["dataset_code"]]
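Either way, the result is a proper DataFrame:

>>> type(df['dataset_code'].to_frame())
<class 'pandas.core.frame.DataFrame'>
>>> type(df[['dataset_code']])
<class 'pandas.core.frame.DataFrame'>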
