Merge CSV columns with irregular timestamps and different header names per file - python
I have long CSV files with different headers in every file.
The first column is always a timestamp, but the timings are irregular, so timestamps rarely match across files.
file1.csv
time,L_pitch,L_roll,L_yaw
2020-08-21T09:58:07.570,-0.0,-6.1,0.0
2020-08-21T09:58:07.581,-0.0,-6.1,0.0
2020-08-21T09:58:07.591,-0.0,-6.1,0.0
....
file2.csv
time,R_pitch,R_roll,R_yaw
2020-08-21T09:58:07.591,1.3,-5.7,360.0
2020-08-21T09:58:07.607,1.3,-5.7,360.0
2020-08-21T09:58:07.617,1.3,-5.7,360.0
....
file3.csv
time,L_accel_lat,L_accel_long,L_accel_vert
2020-08-21T09:58:07.420,-0.00,-0.00,0.03
2020-08-21T09:58:07.430,-0.00,0.00,0.03
2020-08-21T09:58:07.440,-0.00,0.00,0.03
....
At the moment there can be up to 6 CSV files in that format in a folder.
I would like to merge these CSV files into one file where all columns are recognized and the rows are sorted according to the timestamps. When timestamps match, the data is merged into the corresponding line. If a timestamp has no match, it gets its own line with empty fields.
The result should look like this.
time,L_pitch,L_roll,L_yaw,R_pitch,R_roll,R_yaw,L_accel_lat,L_accel_long,L_accel_vert
2020-08-21T09:58:07.420,,,,,,,-0.00,-0.00,0.03
2020-08-21T09:58:07.430,,,,,,,-0.00,0.00,0.03
2020-08-21T09:58:07.440,,,,,,,-0.00,0.00,0.03
....
2020-08-21T09:58:07.581,-0.0,-6.1,0.0,,,,,,
2020-08-21T09:58:07.591,-0.0,-6.1,0.0,1.3,-5.7,360.0,,,
The last line is an example of a matching timestamp, where the data is merged into one line.
So far I tried this GitHub link, but it merges the filenames into the CSV and does no sorting.
Pandas in Python seems to be up to the task, but my skills are not. I also tried some Python scripts from GitHub...
This one seemed the most promising after changing the user-specific parts, but it runs without ever finishing (files too big?).
Is it possible to do this in a PowerShell .ps1 or a (for me) somewhat "easy" Python script?
I would build this into a batch file to work in several folders.
Thanks in advance
goam
As you mentioned, you can solve your problems rather conveniently using pandas.
import pandas as pd
import glob

tmp = []
for f in glob.glob("file*"):
    print(f)
    # parse the first column as datetimes and use it as the index
    tmp.append(pd.read_csv(f, index_col=0, parse_dates=True))

# align on the timestamp index, sort it, and write the result
pd.concat(tmp, axis=1, sort=True).to_csv('merged.csv')
Some explanation:
Here, we use glob to get the list of files matching the wildcard pattern file*. We loop over this list and read each file with pandas read_csv. Note that we parse the dates (converting them to dtype datetime64[ns]) and use the date column as the index of each dataframe. We store the dataframes in a list called tmp. Finally, we concatenate the individual dataframes in tmp using concat, which aligns and sorts them on the timestamp index, and immediately write the result to a file called merged.csv using pandas to_csv.
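Since you mention running this from a batch file across several folders, here is a minimal sketch of how the same logic could be wrapped in a folder loop; the folder layout (one set of file*.csv inputs per subfolder) and the output name are assumptions:
import glob
import os

import pandas as pd

# assumption: each subfolder of `root` holds one set of file*.csv inputs to merge
root = "."
for folder in sorted(glob.glob(os.path.join(root, "*/"))):
    frames = [pd.read_csv(f, index_col=0, parse_dates=True)
              for f in sorted(glob.glob(os.path.join(folder, "file*.csv")))]
    if frames:
        # outer-join on the timestamp index and sort it, as in the snippet above
        pd.concat(frames, axis=1, sort=True).to_csv(os.path.join(folder, "merged.csv"))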
Related
How to merge 3 columns into a new column and add resulting column to existing CSV file in Python (without using Panda)
Assume we have a file called 'teams.csv'. We want to apply the operation below to all the rows in 'teams.csv' and return a file with the same name but with only 3 columns instead of 5, and we also need to name the new column 'sport'. In the file, '***' indicates that a person does not play that particular sport. I have a CSV with the following columns, and I want the CSV file with only 3 columns as shown below.
You could use something like this answer to create a list of objects based on the contents of the CSV file, manipulate the data as necessary and then write back to the CSV file. Sharing the code you have already tried would also be a good idea ;-)
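For illustration, a minimal csv-module sketch of that approach; the column names (name, surname and the three sport columns) are hypothetical, since the question's column list is not shown:
import csv

# hypothetical layout: name, surname, football, basketball, hockey,
# where '***' marks a sport the person does not play
with open('teams.csv', newline='') as f:
    rows = list(csv.DictReader(f))

sports_cols = ['football', 'basketball', 'hockey']
new_rows = []
for row in rows:
    played = [s for s in sports_cols if row[s] != '***']
    new_rows.append({'name': row['name'],
                     'surname': row['surname'],
                     'sport': played[0] if played else ''})

# write back to the same file with only the three wanted columns
with open('teams.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'surname', 'sport'])
    writer.writeheader()
    writer.writerows(new_rows)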
How to merge multiple csv files and copy the data into an existing txt file
I have multiple folders, and each folder contains 4 CSV files, each containing one column of data. For each folder, I want to merge these files together so that the new dataframe carries the 4 columns from those CSV files. Next I want to copy these 4 columns into an existing txt file that already has 3 columns, so the 4 columns (from the CSV files) will be placed next to the existing columns. This operation will be done for multiple folders. I would greatly appreciate some help.
You can use pandas for it. You can use this link: https://stackoverflow.com/a/21232849/4561068 to make a list of ALL the CSV filenames, make a dataframe out of each, and append each one to a list of dataframes; then, at the end of iterating through all of them, you can concatenate all the dataframes in that list together as shown in the link. Afterward, you can simply write the dataframe to a txt file. Hope that helps!
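A rough sketch of that idea for a single folder; the file names, the whitespace delimiter of the txt file, and the absence of headers in the CSVs are all assumptions:
import glob
import os

import pandas as pd

# assumptions: four header-less single-column CSVs per folder, plus an existing
# whitespace-delimited data.txt with three columns
folder = "some_folder"
csv_cols = pd.concat(
    [pd.read_csv(f, header=None) for f in sorted(glob.glob(os.path.join(folder, "*.csv")))],
    axis=1)
txt_cols = pd.read_csv(os.path.join(folder, "data.txt"), sep=r"\s+", header=None)

# place the four CSV columns next to the three existing txt columns
combined = pd.concat([txt_cols, csv_cols], axis=1)
combined.to_csv(os.path.join(folder, "data_combined.txt"),
                sep=" ", header=False, index=False)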
Can you use pandas/python to concatenate a folder of .xlsx files based on row 2?
I'm having trouble using pandas to concatenate a very large folder of .xlsx files. The issue is we have some text written in the first row of each document that can't be removed. My path to the folder is set and the concatenation works. The issue is that after the first file, it removes the ID #'s in the first 2 columns when concatenating the rest of the files. So not only does the data not match going down each column, but I have also lost my unique identifiers. My best guess is this is due to the 1st row of text in each document. This is what I have so far:
files = [f for f in os.listdir(path) if f.endswith('.xlsx')]
iep_boy_df = pd.concat([pd.read_excel(os.path.join(path, f), sheetname='Academic Outlier List', encoding='utf-8') for f in files], keys=files, names=['File Name', 'Row']).reset_index()
I've seen some ways to parse files using Python, but can you parse 50+ Excel documents to skip row 1 and then pass them into pandas to concat into a DF? All in all, I want row 1 to be excluded when concatenating. Still an intermediate here with Python, so any help would be greatly appreciated!
I'm not sure whether this will completely solve your import problems, but pandas read_excel() has a skiprows parameter that you can pass to skip the first row. Note that it can take a zero-indexed list of row numbers. Reference: http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_excel.html
I would echo piRSQUARED's answer: pd.read_excel has skiprows, but remember to pass skiprows as an iterable.
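For reference, a minimal adaptation of the code from the question using skiprows (written with the current sheet_name spelling; path is assumed to point at the folder of .xlsx files):
import os

import pandas as pd

path = "."  # assumption: folder that holds the .xlsx files

files = [f for f in os.listdir(path) if f.endswith('.xlsx')]
iep_boy_df = pd.concat(
    [pd.read_excel(os.path.join(path, f),
                   sheet_name='Academic Outlier List',
                   skiprows=[0])        # zero-indexed: drop the unwanted first row
     for f in files],
    keys=files, names=['File Name', 'Row']).reset_index()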
can I use Numpy genfromtxt to get names then skip a specifc row before continuing?
I have a series of large CSV files containing sensor data samples. The format of each file is:
Row 0: column names
Row 1: column units of measurement
Row 2 to EOF: timestamp of the sample, then the voltage measurements from all sensors
I cannot modify the original files, so I would like to use numpy.genfromtxt(filename, names=True, delimiter=',', dtype=None). So far, to avoid corrupting the output, I have skipped the header lines and manually added the column names later. This is not ideal, as each file potentially has a different order of sensors, and the information is there for the taking. Any help/direction would be greatly appreciated.
I can see several options:
- Open the file to read just the header line and parse the names yourself, then run genfromtxt with the custom names and skip_header.
- Trick genfromtxt into treating the 2nd line as a comment line.
- Open the file yourself and pass the lines through a filter to genfromtxt; the filter function would remove the second line. (genfromtxt works with a list of lines, or anything that feeds it lines.)
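A small sketch of the filter idea, assuming a hypothetical sensors.csv laid out as described (names in row 0, units in row 1):
import numpy as np

def drop_units_line(filename):
    """Yield all lines of the file except the second one (the units row)."""
    with open(filename) as f:
        for i, line in enumerate(f):
            if i != 1:
                yield line

# genfromtxt accepts a generator of lines, so the names in row 0 are kept
data = np.genfromtxt(drop_units_line('sensors.csv'),
                     names=True, delimiter=',', dtype=None, encoding=None)
print(data.dtype.names)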
Portable, machine-readable way to store column names from several CSV files
Right now I have a folder with several comma-separated data files and I would like to extract their column names to store in some kind of index for later reference. This data will be used by multiple people on both Mac and Windows machines (so newlines could pose a problem), in both R and Python. Ideally, I'd like to write or use a script that takes a regex as an argument and returns a list of file names that contain that column name. E.g. I could write, say, cl col 'Years at' and return all of the files with a column containing the text Years at, or cl file 'Academic Data' and return all of the column names in that file. I only have a few files and only a few columns in each, but I would like to be able to scale this up to situations where I have a large number of files, and/or where each file has a large number of columns. Is there a "best practice" in this situation? Is there a "right way" to store this data? I'm thinking about JSON, but the only way I can think of getting it into JSON format would be by manually echoing all the braces and newlines, which would be ugly. And I'd have no idea how to get the data back out. This is my current solution:
find . -iname "*.csv" | while read f; do
    echo -e "$f\n$(tr "\r" "\n" < "$f" | head -n1)\n" >> column_index.txt
done
which produces:
./File 1.csv
column 1, column 2, column 3
./File 2.csv
column 1, column 2, column 3
There are two problems with it: 1) it's in bash, so a Windows user can't use it without Cygwin; 2) the output is readable but hard to parse safely. Problem 2 is the point of the question, but I'll be happy to hear suggestions that also tackle Problem 1 somehow.
You can use pandas to manipulate CSV files:
df = pd.read_csv(name)
print(df.columns)  # will print all the columns
Check this tutorial for more details. I would suggest: loop over all the CSV files, storing the columns and their original files; store this information in another CSV file (or JSON); and write a Python script that does the search inside this index, so any Windows/Mac user can use it.
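A minimal sketch of that suggestion using a JSON index (no pandas needed); file and key names are just placeholders:
import csv
import glob
import json

# build an index mapping each CSV file to its header row
index = {}
for path in glob.glob("*.csv"):
    with open(path, newline='') as f:
        index[path] = next(csv.reader(f))

with open("column_index.json", "w") as f:
    json.dump(index, f, indent=2)

# later, from Python: which files contain a column matching some text?
query = "Years at"
print([p for p, cols in index.items() if any(query in c for c in cols)])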