Why is pandas adding new columns to my new excel file - python

I am trying to concatenate two Excel files with the same column names, but new empty columns are being added to the merged file, and I don't know why.
I used pd.concat(), which I expected to combine the two files into a single sheet in a new file, but when the table from the second file is added to the first, new empty columns appear in the merged file.
import glob
import pandas as pd
file_list = glob.glob(path + "/*.xlsx")
# read every workbook into a list of DataFrames
dfs = [pd.read_excel(p) for p in file_list]
print(dfs[0].shape)
res = pd.concat(dfs)
That is a snippet of my code.
I have also added a picture of the result I am getting now.

pd.concat aligns on column names, so it is not a plain positional concatenation. Check that the column names are identical across all your source files, including case and stray whitespace. If they are not, normalize or rename them before concatenating, or move to a position-based format such as NumPy arrays.
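As a minimal sketch of that effect, the two frames below are made up for illustration and differ only by a trailing space in one header. The name-based alignment produces an extra column until the labels are stripped:

```python
import pandas as pd

# Two hypothetical frames whose headers differ only by a trailing space.
first = pd.DataFrame({"name": ["a", "b"], "price": [1, 2]})
second = pd.DataFrame({"name": ["c"], "price ": [3]})

# concat aligns on column labels, so "price" and "price " stay separate
# and the gaps are filled with NaN.
raw = pd.concat([first, second], ignore_index=True)
print(list(raw.columns))

# Stripping whitespace from the labels first makes the frames line up.
frames = [df.rename(columns=lambda c: str(c).strip()) for df in (first, second)]
clean = pd.concat(frames, ignore_index=True)
print(list(clean.columns))
```

Printing the column lists of each source file side by side is usually the quickest way to spot which header differs.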

Related

How to read in excel files from a folder and join them into a single df?

First-time poster here! I have perused these forums for a while and I am taken aback by how supportive this community is.
My problem involves several excel files with the same name, column headers, data types, that I am trying to read in with pandas. After reading them in, I want to compare the column 'Agreement Date' across all the data-frames and create a yes/no column if they match. I then want to export the data frame.
I am still learning Python and Pandas so I am struggling with this task. This is my code so far:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# read .xlsx file into a list
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files
for excelfiles in allfiles:
    raw_excel = pd.read_excel(allfiles)
# place all the pulled dataframe into a list
list = [raw_excel]
From here though I am quite lost. I do not know how to join all of my files together on my id column and then compare the 'Agreement Date' column? Any help would be greatly appreciated!
THANKS!!
In your loop you need to pass the loop variable, not the whole list, to read_excel.
You have to append to the list inside the loop; otherwise only the last item will be in the list.
Do not overwrite Python builtins such as list, or you may run into bugs that are hard to debug.
Here's what I would change:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# get the list of .xls files in the directory (use "*.xlsx" for .xlsx workbooks)
allfiles = glob.glob(xlpath + "*.xls")
# read every file and collect the resulting DataFrames in a list
dataframes_list = []
for file in allfiles:
    dataframes_list.append(pd.read_excel(file))
You can then combine the DataFrames like this:
# DataFrame.append returned a new frame instead of modifying in place, and it was removed in pandas 2.0
merged_df = pd.concat(dataframes_list, ignore_index=True)
Use ignore_index=True if the indexes overlap and cause problems. If they are already distinct and you want to keep them, set it to False.
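To see what ignore_index changes, here is a small sketch with two made-up frames. It uses pd.concat, which accepts the same flag:

```python
import pandas as pd

# Hypothetical frames; each one gets the default index 0, 1.
a = pd.DataFrame({"id": [1, 2]})
b = pd.DataFrame({"id": [3, 4]})

kept = pd.concat([a, b])                            # original labels are kept
renumbered = pd.concat([a, b], ignore_index=True)   # a fresh 0..n-1 index

print(kept.index.tolist())
print(renumbered.index.tolist())
```

With the original labels kept, a lookup such as kept.loc[0] returns two rows, which is the kind of overlap the flag avoids.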

Saving each DataFrame column to separate CSV files

I have some dataframes, one of them is the following:
L_M_P = pd.read_csv('L_M_P.csv') # dimensions 17520x33
I would like to be able to save each column as an independent CSV file, without having to do it manually as follows:
L_M_P.iloc[:, 0].to_csv('column0.csv')
L_M_P.iloc[:, 1].to_csv('column1.csv')
...
In that case, I would have 33 new '.csv' files, each with dimensions 17520x1.
You can iterate over the columns and write each one to its own file.
for column in L_M_P.columns:
    L_M_P[column].to_csv(f"{column}.csv")
Note: I am assuming the language is Python, since the question mentions pd and all of the code shown is pandas.
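A self-contained sketch of the same loop follows. The two-column frame and the temporary output directory are stand-ins for the real 17520x33 data, and index=False is an optional choice that keeps the row index out of each file:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the real frame.
frame = pd.DataFrame({"load": [1.0, 2.0], "price": [3.0, 4.0]})

out_dir = Path(tempfile.mkdtemp())  # scratch directory for the example
for column in frame.columns:
    # index=False drops the row index, so each file holds a single column
    frame[column].to_csv(out_dir / f"{column}.csv", index=False)

written = sorted(p.name for p in out_dir.glob("*.csv"))
print(written)
```

If the column names contain characters that are not valid in file names, they would need to be sanitized before being used this way.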

How to use a dictionary to loop through different file names to create multiple databases in python?

I have two file locations I would like to iterate through to read the .tsv files. The first location is:
"C:\Users\User\Documents\Research\STITCH\0NN-human-STITCH\stitch.tsv"
The second is:
"C:\Users\User\Documents\Research\STITCH\1AQ-human-STITCH\stitch.tsv"
Both .tsv files have the same name but are located in different folders.
Instead of using glob, I'd like to create a loop and dictionary to search through each of the files, like this:
import pandas as pd
file_name = 'C:/Users/User/Documents/Research/STITCH/{}-human-STITCH/stitch_interactions.tsv'
df_list = []
for i in range('ONN','1AQ'):
    df_list.append(pd.read_csv(file_name.format(i)))
df = pd.concat(df_list)
After searching through one file, I'd then like to add an element from that file to an excel sheet.
I receive an error:
for i in range('ONN','1AQ'):
TypeError: 'str' object cannot be interpreted as an integer
Thanks
range returns a sequence of integers, so it does not work with strings.
With only two values, you can simply iterate over them as a tuple.
file_name = 'C:/Users/User/Documents/Research/STITCH/{}-human-STITCH/stitch_interactions.tsv'
df_list = []
for i in ('ONN','1AQ'):
    # sep='\t' assumes the .tsv files are tab-separated
    df_list.append(pd.read_csv(file_name.format(i), sep='\t'))
df = pd.concat(df_list)
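Since the question asks about a dictionary, here is one possible sketch that maps each folder code to its path and keeps the loaded frames under the same keys. The helper name load_all is hypothetical, and sep="\t" is an assumption based on the .tsv extension:

```python
import pandas as pd

template = "C:/Users/User/Documents/Research/STITCH/{}-human-STITCH/stitch_interactions.tsv"
codes = ("ONN", "1AQ")

# Map each folder code to its full path.
paths = {code: template.format(code) for code in codes}


def load_all(path_by_code):
    """Read every path into a DataFrame, keyed by its folder code (hypothetical helper)."""
    return {code: pd.read_csv(path, sep="\t") for code, path in path_by_code.items()}


for code, path in paths.items():
    print(code, "->", path)
```

Calling load_all(paths) on a machine where the files exist would return a dictionary of frames, which can be passed to pd.concat if a single table is needed.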
Try an f-string with a list comprehension:
concat_df = pd.concat([pd.read_csv(
    f'C:/Users/User/Documents/Research/STITCH/{i}-human-STITCH/stitch_interactions.tsv') for i in ('ONN', '1AQ')])
Edit: the 'str' error comes from range, so iterate over the folder names directly instead.
If you want to write the whole DataFrame to Excel, you can use df.to_excel (see the pandas.DataFrame.to_excel documentation).
You can also append using an ExcelWriter (the documentation page has an example further down).
If you want to write a specific row to Excel, select it first with iloc if you know the row number, or with loc if you know the row label (see pandas.DataFrame.iloc and pandas.DataFrame.loc).
Each documentation page has usage examples further down.
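A short sketch of the row selection, using a made-up frame with hypothetical column names and row labels. The write step is left as a comment because to_excel needs an Excel engine such as openpyxl to be installed:

```python
import pandas as pd

# Hypothetical data; the labels r1..r3 stand in for real row identifiers.
df = pd.DataFrame(
    {"protein": ["P1", "P2", "P3"], "score": [0.9, 0.5, 0.7]},
    index=["r1", "r2", "r3"],
)

second_by_position = df.iloc[[1]]   # double brackets keep a one-row DataFrame
second_by_label = df.loc[["r2"]]    # same row, selected by its label

# Either selection could then be written out, for example:
# second_by_label.to_excel("row.xlsx")
print(second_by_label)
```

Selecting with a list (double brackets) rather than a single value keeps the result a DataFrame, so the column headers survive when it is written.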

Creating and assigning different variables using a for loop

So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index,file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object, and read the first item to df1, 2nd to df2, etc. But upon trying to do this I just realized - I have no idea how to change the variable being assigned when doing the assigning via an iteration!!!
That's what prompts my question. I managed to find another way to get my original job done without trouble, but assigning to different variables during an iteration is something I haven't been able to find clear answers on!
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib (added in Python 3.4) instead of os.
from pathlib import Path
import pandas as pd
csvs = Path.cwd().glob('*.csv') # returns a generator of matching paths
# replace Path.cwd() with Path(your_path) if the script is in a different location
dfs = {} # let's hold the DataFrames in this dictionary
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=3) # nrows = number of rows to read; change to your spec
# or with a dict comprehension (raw string so the backslashes are not treated as escapes)
dfs = {file.stem: pd.read_csv(file) for file in Path(r'location\of\your\files').glob('*.csv')}
This returns a dictionary of DataFrames keyed by the CSV file name; .stem gives the name without the extension.
It looks much like:
{
    'csv_1' : dataframe,
    'csv_2' : dataframe
}
If you want to concatenate these, do:
df = pd.concat(dfs)
The outer level of the resulting index will be the CSV file name.
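A small sketch of what concatenating a dictionary looks like, using in-memory frames in place of real CSV files. The keys csv_1 and csv_2 are placeholders:

```python
import pandas as pd

# Stand-ins for frames that would normally come from pd.read_csv.
dfs = {
    "csv_1": pd.DataFrame({"x": [1, 2]}),
    "csv_2": pd.DataFrame({"x": [3]}),
}

# Passing a dict makes the keys the outer level of a two-level index;
# the inner level is each frame's original row index.
combined = pd.concat(dfs)
print(combined.index.tolist())

# Rows that came from one file can be pulled back out by key.
one_file = combined.loc["csv_2"]
print(one_file)
```

Keeping the file name in the index this way means the source of every row stays recoverable after the frames are merged.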

How to read in multiple files into pandas?

I have a folder that has hundreds of files which contain comma-separated data; however, the files themselves have no file extensions (i.e., EPI or DXPX, not EPI.csv or DXPX.csv).
I am trying to create a loop that reads in only certain files that I need (between 15-20 files). I do not want to concat or append the dfs. I merely want to read each df into memory and be able to call the df by name.
Even though there is no extension, I can read the file in as .csv
YRD = pd.read_csv('YRD', low_memory=False)
My expected result from the loop below is two dfs: one labeled YRD and another labeled HOUSE. However, I only get one df named raw_df, and it is only the final file in the list. Sorry if this is a silly question, but I cannot figure out what I am missing.
df_list = ['YRD','HOUSE']
for raw_df in df_list:
    raw_df = pd.read_csv(raw_df, low_memory=False)
This is because you reassign raw_df every time you read a new file.
You should store each DataFrame somewhere new instead of reusing the same variable, for example in a list:
mydfs = []
for raw_df in df_list:
    mydfs.append(pd.read_csv(raw_df, low_memory=False))
Or you can put them into a dictionary:
mydfs = {}
for raw_df in df_list:
    mydfs[raw_df] = pd.read_csv(raw_df, low_memory=False)
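The dictionary approach can be tried end to end with the sketch below. It writes two tiny extensionless files into a temporary folder first, so the file contents and the folder are made up for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create two small extensionless files to read back (illustrative data only).
folder = Path(tempfile.mkdtemp())
(folder / "YRD").write_text("a,b\n1,2\n")
(folder / "HOUSE").write_text("a,b\n3,4\n5,6\n")

wanted = ["YRD", "HOUSE"]

# One DataFrame per file, keyed by the file name.
mydfs = {name: pd.read_csv(folder / name, low_memory=False) for name in wanted}

house = mydfs["HOUSE"]  # look a frame up by name
print(house)
```

Looking frames up by key gives the "call the df by name" behaviour without creating a separate variable for every file.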
