Check csv columns before adding to df?

I want to import CSV files into a DataFrame using pd.read_csv.
I have many CSV files to import. They do not all have exactly the same columns, but they share a few.
I cannot change the CSV files: they come from different sources, they arrive mixed together, and their names do not let me tell them apart. I also cannot import everything and filter the DataFrame afterwards, because some columns are shared between the sources.
Is there a way to check the number of columns, or whether a certain column is present in the CSV file, before adding it to the DataFrame?
Something like:
read_csv(source) if 'XXXX' is in CSV
Thank you!

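In case the answer is useful to anyone:
Since I was using a list comprehension, I added an if clause to it. The keys passed to pd.concat have to match the files that were actually read, so the filtered list is built first:
import glob
import pandas as pd
files = glob.glob(path + "/*.csv")
# keep only files whose header has every column in colonnes_data
kept = [f for f in files if all(c in pd.read_csv(f, nrows=1).columns for c in colonnes_data)]
df = pd.concat([pd.read_csv(f) for f in kept], keys=kept, axis=0)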
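A minimal, self-contained sketch of the same check, reading only the header of each file before deciding whether to load it. The helper names `has_required_columns` and `load_matching` and the `required_columns` parameter are hypothetical, chosen for illustration; `nrows=0` asks `pd.read_csv` for the column names without any data rows.

```python
import glob
import os

import pandas as pd


def has_required_columns(csv_path, required_columns):
    """Return True if the CSV header contains every required column."""
    # nrows=0 parses only the header line, so no data rows are loaded.
    header = pd.read_csv(csv_path, nrows=0).columns
    return set(required_columns).issubset(header)


def load_matching(folder, required_columns):
    """Concatenate only the CSV files in `folder` that have the required columns."""
    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    kept = [f for f in files if has_required_columns(f, required_columns)]
    if not kept:
        # Nothing matched: return an empty frame instead of letting concat fail.
        return pd.DataFrame()
    # keys must line up with the frames actually read, hence `kept`, not `files`.
    return pd.concat([pd.read_csv(f) for f in kept], keys=kept, axis=0)
```

Calling `load_matching(path, colonnes_data)` would then play the role of the one-liner, with each file's header parsed once for the check and the file read in full only if it passes.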

Related

Why is pandas adding new columns to my new excel file

I am trying to concatenate two Excel files that have the same column names, but new empty columns are being added to the merged file and I don't know why.
I used pd.concat(), which I expected to combine the two files into a single sheet. Instead, when the table from the second file is added to the first, new empty columns appear in the merged file.
file_list = glob.glob(path + "/*.xlsx")
dfs = [pd.read_excel(p) for p in file_list]
print(dfs[0].shape)
res = pd.concat(dfs)
That is a snippet of my code.
I also added a picture of the result I am getting now.

pd.concat aligns on column names, so it does not behave like a plain vector concatenation. Check whether the column names are identical across all your source files. If they are not, you can normalize or rename them, or move to a purely positional format such as NumPy arrays.
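As an illustration of that check, here is a small sketch that reports which frames carry column names the first frame lacks, then normalizes the names before concatenating. It assumes the mismatch comes from stray whitespace or letter case, which may not be the cause in the asker's files; `normalize_columns` and `column_differences` are hypothetical helper names.

```python
import pandas as pd


def normalize_columns(df):
    """Return a copy whose column names are stripped of whitespace and lowercased."""
    return df.rename(columns=lambda name: str(name).strip().lower())


def column_differences(frames):
    """Map each frame's position to the columns it has beyond the first frame's."""
    reference = set(frames[0].columns)
    return {
        i: sorted(set(df.columns) - reference)
        for i, df in enumerate(frames)
        if set(df.columns) - reference
    }


# Two frames whose headers differ only by case and trailing whitespace.
first = pd.DataFrame({"Name": ["a"], "Value": [1]})
second = pd.DataFrame({"name ": ["b"], "value": [2]})

# Reports the names in the second frame that the first frame does not have.
print(column_differences([first, second]))

# After normalization the names agree, so concat stacks rows instead of adding columns.
merged = pd.concat(
    [normalize_columns(first), normalize_columns(second)], ignore_index=True
)
print(merged.columns.tolist())
```

Without the normalization step, concatenating `first` and `second` would produce four columns, two of them mostly empty, which is the symptom described in the question.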

How to read in excel files from a folder and join them into a single df?

First-time poster here! I have perused these forums for a while and I am taken aback by how supportive this community is.
My problem involves several excel files with the same name, column headers, data types, that I am trying to read in with pandas. After reading them in, I want to compare the column 'Agreement Date' across all the data-frames and create a yes/no column if they match. I then want to export the data frame.
I am still learning Python and Pandas so I am struggling with this task. This is my code so far:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# read .xlsx file into a list
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files
for excelfiles in allfiles:
    raw_excel = pd.read_excel(allfiles)
# place all the pulled dataframe into a list
list = [raw_excel]
From here though I am quite lost. I do not know how to join all of my files together on my id column and then compare the 'Agreement Date' column? Any help would be greatly appreciated!
THANKS!!

In your loop you need to pass the current loop value, not the whole list, to read_excel.
You have to append to the list inside the loop; otherwise only the last item ends up in the list.
Do not overwrite Python builtins such as list, or you may run into behavior that is hard to debug.
Here's what I would change:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# get the list of .xls file names in the directory
allfiles = glob.glob(xlpath + "*.xls")
# read every file and collect the resulting DataFrames in a list
dataframes_list = []
for file in allfiles:
    dataframes_list.append(pd.read_excel(file))
You can then combine the DataFrames like this:
merged_df = pd.concat(dataframes_list, ignore_index=True)
Use ignore_index=True if the indexes overlap and cause problems. If they are already distinct and you want to keep them, set it to False.
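For the comparison step the question asks about, a rough sketch with two small stand-in frames follows. The column name `id`, the suffixes, and the `dates_match` column are assumptions made for illustration; the question does not show the real id column or the data.

```python
import pandas as pd

# Stand-ins for two of the frames read from Excel; "id" is an assumed column name.
left = pd.DataFrame(
    {"id": [1, 2, 3], "Agreement Date": ["2020-01-01", "2020-02-01", "2020-03-01"]}
)
right = pd.DataFrame(
    {"id": [1, 2, 3], "Agreement Date": ["2020-01-01", "2020-02-15", "2020-03-01"]}
)

# Join on the id column; the suffixes keep the two date columns apart.
joined = left.merge(right, on="id", suffixes=("_a", "_b"))

# Row-wise comparison turned into a yes/no label.
same = joined["Agreement Date_a"] == joined["Agreement Date_b"]
joined["dates_match"] = same.map({True: "yes", False: "no"})

print(joined)
```

The resulting frame could then be exported with `joined.to_csv(...)` or `joined.to_excel(...)`. With more than two files, the same merge could be repeated pairwise, though the right approach depends on how the files relate to each other.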

Saving each DataFrame column to separate CSV files

I have some dataframes, one of them is the following:
L_M_P = pd.read_csv('L_M_P.csv') # dimensions 17520x33
I would like to be able to save each column as an independent csv file, without having to do it manually as follows:
L_M_P.iloc[:, 0].to_csv('column0.csv')
L_M_P.iloc[:, 1].to_csv('column1.csv')
...
In that case, I would have 33 new '.csv' files, each with dimensions 17520x1.

You can iterate over the columns and write each one to its own file. Passing index=False leaves out the row index, so each file holds a single column.
for column in df.columns:
    df[column].to_csv(f"{column}.csv", index=False)
Note: I am assuming the language is Python, since the question mentions pd and all the code shown is pandas.
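A runnable sketch of that loop on a tiny stand-in frame, writing into a temporary directory so nothing lands in the working folder. The function name `save_columns` and the demo data are hypothetical.

```python
import os
import tempfile

import pandas as pd


def save_columns(df, out_dir):
    """Write each column of `df` to `<out_dir>/<column>.csv` and return the paths."""
    paths = []
    for column in df.columns:
        # The f-string also copes with non-string labels such as integers.
        path = os.path.join(out_dir, f"{column}.csv")
        # index=False keeps each file to one column of values under its header.
        df[column].to_csv(path, index=False)
        paths.append(path)
    return paths


demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
out_dir = tempfile.mkdtemp()
written = save_columns(demo, out_dir)
print(written)
```

Column names that contain path separators or other characters not allowed in file names would need cleaning first; the sketch does not handle that.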

Extracting a column from a collection of csv files and constructing a new table with said data

I'm a newbie when it comes to Python, with a bit more experience in MATLAB. I'm trying to write a script that loops through a folder, picks up all the .csv files, and builds a new table from column 14 of each one: column 14 of csv file 1 becomes column 1 of the new table, column 14 of csv file 2 becomes column 2, and so on. Ideally, each header in the new table would show the name of the file its column was extracted from.
I know Python is zero-based, so I've double-checked that the code reads the desired column. As it stands, though, I can only get it to print every file's 14th column in one long run, and I'm not sure how to split that up into a table. Perhaps a DataFrame would do it, although I'm not entirely sure how they work.
Any help would be greatly appreciated!
Code attached below:
import os
import csv
pathName = "D:/GLaDOS-CAMPUS/data/TestData-AB/"
numFiles = []
fileNames = os.listdir(pathName)
for fileName in fileNames:
    if fileName.endswith(".csv"):
        numFiles.append(fileName)
print(numFiles)
for i in numFiles:
    # newline="" is what the csv module expects; the "rU" mode was removed in Python 3.11
    with open(os.path.join(pathName, i), newline="") as file:
        reader = csv.reader(file, delimiter=',')
        for row in reader:
            # index 13 is the 14th field of each row
            print(row[13])

Finding files
I'm not sure whether your way of finding files is right, since I do not have a folder of csv files to test with. In any case, glob is a simpler way to get the list of files:
from glob import glob
files = glob("/Path/To/Files/*.csv")
This returns all csv files in that folder.
Reading CSV files
Now we need to read every file and take column 14, which is index 13 because pandas counts columns from zero. It may be overkill, but I prefer pandas and numpy for this.
To read a single column of a csv file with pandas you can use:
pd.read_csv(file, usecols=[COL])
Now we can loop over the files and take that column from each:
columns = [pd.read_csv(file, usecols=[13]).values[:, 0] for file in files]
Notice that .values converts each column to a numpy array.
Merging all columns
Each element of the columns list holds one file's column, so stacked as an array they are technically rows, not columns.
Taking the transpose turns them into columns, provided every file has the same number of rows:
pd.DataFrame(np.transpose(columns))
The code
The whole code would look like:
from glob import glob
import pandas as pd
import numpy as np
files = glob("/Path/To/Files/*.csv")
# index 13 selects the 14th column of each file
columns = [pd.read_csv(file, usecols=[13]).values[:, 0] for file in files]
print(pd.DataFrame(np.transpose(columns)))
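The question also asks for the file names as headers. One possible sketch keeps each column as a pandas Series and lets `pd.concat` label the result. The function name `collect_column` and its `position` parameter are hypothetical; files of different lengths are padded with NaN rather than rejected.

```python
import os
from glob import glob

import pandas as pd


def collect_column(folder, position):
    """Build a table with one column per CSV file, taken from the given position."""
    series_by_file = {}
    for path in sorted(glob(os.path.join(folder, "*.csv"))):
        # usecols with a list of integers selects columns by zero-based position.
        column = pd.read_csv(path, usecols=[position]).iloc[:, 0]
        series_by_file[os.path.basename(path)] = column.reset_index(drop=True)
    if not series_by_file:
        return pd.DataFrame()
    # Concatenating a dict along axis=1 uses its keys as the column headers
    # and aligns rows by position, padding shorter files with NaN.
    return pd.concat(series_by_file, axis=1)
```

For the layout in the question this would be called as `collect_column(pathName, 13)`, giving one column per file with the file name as its header.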

Combining multiple .csv files using pandas and keeping the original structure

I have around 60 .csv files which I would like to combine in pandas. So far I've used this:
import pandas as pd
import glob
total_files = glob.glob("something*.csv")
data = []
for csv in total_files:
    list = pd.read_csv(csv, encoding="utf-8", sep='delimiter', engine='python')
    data.append(list)
biggerlist = pd.concat(data, ignore_index=True)
biggerlist.to_csv("output.csv")
This works somewhat. The files I would like to combine all have the same structure of 15 columns with the same headers, but when I use this code only one column is filled, containing the whole row, and its name is all the column names run together (e.g. SEARCH_ROW, DATE, TEXT, etc.).
How can I combine these csv files while keeping the structure of the original files?
Edit:
Perhaps I should be a bit more specific about my data. This is a snapshot of one of the .csv files I'm using:
As you can see, it is just newspaper data, where the last column is 'TEXT', which isn't shown completely when you open the file.
This is part of how it looks when I have combined the data using my code.
Separately, I can read any of these .csv files without problems using
data = pd.read_csv("something.csv",encoding="utf-8", sep='delimiter', engine='python')

I solved it!
The problem was the number of commas in the text part of my .csv files. After removing all commas (just using search/replace), I used:
import pandas
import glob
filenames = glob.glob("something*.csv")
# read each file with ";" as the separator and collect the frames
frames = []
for filename in filenames:
    frames.append(pandas.read_csv(filename, encoding="utf-8", sep=";"))
df = pandas.concat(frames)
Thanks for all the help.
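Two observations may help anyone with the same symptom. First, `sep='delimiter'` passes the literal word "delimiter" as the separator. It never occurs in the data, so each row lands in a single column. Second, if the text fields are wrapped in double quotes, pandas can keep their commas without any search/replace. Whether the original files were quoted is not shown, so the sample below is hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample: the TEXT field holds commas but is wrapped in double quotes.
raw = io.StringIO(
    "SEARCH_ROW,DATE,TEXT\n"
    '1,2020-01-01,"First article, with commas, in the text"\n'
    '2,2020-01-02,"Second article"\n'
)

# With sep="," and the default quote character, commas inside quotes stay in one field.
df = pd.read_csv(raw, sep=",", quotechar='"')

print(df.shape)
print(df.loc[0, "TEXT"])
```

If the files are not quoted, a separator that never appears in the text (such as the ";" used above) remains the simpler route.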

