Add a third column to an existing CSV - python

I have a CSV file with 2 columns (x,y) and 5653 rows, formatted like this:
0,0
1,0
2,0
3,0
4,0
5,0
....
102,0
102,1
101,1
....
0,1
0,2
1,2
....
Now I want to add a third column to it from another CSV that contains measured values, e.g. -89 (these are mean values). That file also has 5653 rows, and the values are in its first column.
Now how can I read the first file, read the second file, and combine them like this:
0,0,-89
1,0,-89
2,0,-89
3,0,-89
4,0,-90
5,0,-90
6,0,-89
7,0,-89
8,0,-89
9,0,-89
So I want the values to stay the same, just combined into one CSV.

You could use the library pandas, which is built to work with tabular data.
Typical workflow:
import pandas as pd

df1 = pd.read_csv("your_path")  # df is shorthand for DataFrame, a name for tabular data
df2 = pd.read_csv("csv2")
# concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
result = pd.concat([df1, df2], axis=1)  # axis=1 joins the frames side by side, column-wise
result.to_csv("path")

You can use the csv module, which unlike pandas does not require you to install any third-party libraries. You can just zip the two readers:
import csv

with open('in1.csv') as fin1, open('in2.csv') as fin2, \
     open('out.csv', 'w', newline='') as fout:
    r1 = csv.reader(fin1)  # iterator over lists of strings
    r2 = csv.reader(fin2)
    w = csv.writer(fout)
    for row1, row2 in zip(r1, r2):
        w.writerow(row1 + row2[:1])  # row from file 1 + first column from file 2
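Note that zip stops at the end of the shorter file, so if the two CSVs ever get out of sync the extra rows are silently dropped; itertools.zip_longest is one way to surface such a mismatch instead.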

Related

Changing Headers in .csv files

Right now I am trying to read in data which is provided in a format that is messy to read in. Here is an example:
#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
When working with one or two of these files, I have manually changed the ['DATA'] header to ['x', 'y'] and am able to read in the data just fine by skipping the first few rows and not reading the last line.
However, right now I have 30+ files, split between two different folders, and I am trying to figure out the best way to read in the files and change the header of each file from ['DATA'] to ['x', 'y'].
The files are in a folder one level below the file that is supposed to read them (i.e. folder 1 contains the code below, folder 2 contains the csv files, and folder 1 contains folder 2).
Here is what I have right now:
#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located
def reader(sets, df, dataLabels, skip, newHeader, pathName):
    for i in range(len(sets)):
        df_temp = pd.read_csv(glob.glob(pathName + sets[i] + ".csv"), sep=r'\s*,', skiprows=skip, engine='python')[:-1]
        df_temp.column.value[0] = [newHeader]
        for j in range(len(dataLabels)):
            df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]], errors='coerce')
        df.append(df_temp)
    return df
When I run my code, I run into the error:
No columns to parse from file
I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.
Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.
If the #[DATA] row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with # and then specify your own names:
import pandas as pd
df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])
which gives
   x  y
0  1  2
1  3  4
2  5  6
Expanding Kraigolas's answer, to do this with multiple files you can use a list comprehension (note that glob.glob returns a list for each pattern, so the matches are flattened here):
import glob

files = [f for set_num in sets for f in glob.glob(f"{pathName}{set_num}.csv")]
df = pd.concat([pd.read_csv(f, comment="#", names=["x", "y"]) for f in files])
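If you want a clean 0..n row index on the combined frame rather than each file's own index repeated, you can also pass ignore_index=True to pd.concat.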
If you're lucky, you can use Kraigolas' answer to treat those lines as comments.
In other cases you may be able to use the skiprows argument to skip the header rows:
df = pd.read_csv(path, skiprows=10, skipfooter=2, names=["x", "y"])
And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.
Unfortunately I also have very unfortunate files where the number of heading rows changes.
In that case I use the following code to iterate until I find the first "good" row, then create a new DataFrame from the rest of the rows. The column names are taken from the first "good" row and the types from the first data row.
This is certainly not fast; it's a last-resort solution. If I had a better solution I'd use it:
data = df
if first_col not in df.columns:
    # Skip rows until we find the first col header
    for i, row in df.iterrows():
        if row[0] == first_col:
            data = df.iloc[(i + 1):].reset_index(drop=True)
            # Read the column names
            series = df.iloc[i]
            series = series.str.strip()
            data.columns = list(series)
            # Use only existing column types
            types = {k: v for k, v in dtype.items() if k in data.columns}
            # Apply the column types again
            data = data.astype(dtype=types)
            break
return data
In this case the condition is finding the first column name (first_col) in the first cell.
This can be adapted to use different conditions, e.g. looking for the first numeric cell:
columns = ["x", "y"]
dtypes = {"x":"float64", "y": "float64"}
data = df
# Skip until we find the first numeric value
for i, row in df.iterrows():
if row[0].isnumeric():
data = df.iloc[(i + 1):].reset_index(drop=True)
# Apply names and types
data.columns = columns
data = data.astype(dtype=dtypes)
break
return data
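Both snippets are meant to live inside a function (hence the bare return), where first_col and dtype/dtypes are parameters: the expected first column header and a dict of column types. They also assume the frame was read as strings (e.g. with dtype=str), since row[0].isnumeric() is a string method.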

Convert .txt file to .csv with specific columns PYTHON

I have a text file that I want to load into my Python code, but the format of the txt file is not suitable.
Here is what it contains:
SEQ MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNY
SS3 CCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
95024445656543114678678999999999999999888889998886
SS8 CCHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
96134445555554311253378999999999999999999999999987
SA EEEbBBBBBBBBBBbEbEEEeeEeBeEbBEEbbEeBeEbbeebBbBbBbb
41012123422000000103006262214011342311110000030001
TA bhHHHHHHHHHHHHHgIihiHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
00789889988663201010099999999999999999898999998741
CD NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
54433221111112221122124212411342243234323333333333
I want to convert it into a pandas DataFrame with SEQ, SS3, SS8, SA, TA and CD as the columns and the line next to each label as the row values.
I tried pd.read_csv but it doesn't give me the result I want.
Thank you!
To read a text file with the pandas.read_csv() method, the text file should contain comma-separated data:
SEQ, SS3, ....
MSSSSWLLLSLVAVTAAQSTIEEQ..., CCCHHHHHHHHHHHHCCCCCCHHHHHHH.....
Steps
Use pd.read_fwf() to read files in a fixed-width format.
Fill the missing values with the last available value using df.ffill().
Assign a group number gp for the row number in the output using a groupby-cumcount construct.
Move gp (0, 1) to columns with df.pivot, and then transpose into the desired output.
Note: this solution works with an arbitrary number (including zero, and of course not too many) of consecutive lines with omitted values in the first column.
Code
import pandas as pd

# demo data: only the first 3 characters of each value are kept for display below
file_path = "/mnt/ramdisk/input.txt"
df = pd.read_fwf(file_path, names=["col", "val"])
# fill the blank labels with the last available value
df["col"] = df["col"].ffill()
# get the correct row location within each label
df["gp"] = df.groupby("col").cumcount()
# pivot gp (0, 1) to columns and then transpose
df_ans = df.pivot(index="col", columns="gp", values="val").transpose()
Result
print(df_ans)  # show the first 3 characters only
col   CD   SA  SEQ  SS3  SS8   TA
gp
0    NNN  EEE  MSS  CCC  CCH  bhH
1    544  410  NaN  950  961  007
Then you can save the resulting DataFrame using df_ans.to_csv().
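For example (a minimal sketch; the output file name is arbitrary):
df_ans.to_csv("converted.csv", index=False)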
You can use this script to load the .txt file into a DataFrame and save it as a csv file:
import pandas as pd

data = {}
with open('<your file.txt>', 'r') as f_in:
    for line in f_in:
        line = line.split()
        if len(line) == 2:
            data[line[0]] = [line[1]]

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)

This saves a one-row CSV with SEQ, SS3, SS8, SA, TA and CD as the columns. Note that the bare score lines split into a single token, so the len(line) == 2 check skips them.

How to read an array in a dataframe?

I have a tsv file containing an array column which has been read using read_csv().
The dtype of that column is shown as object. How do I read it and access it as an array?
For example:
df =
id values
1 [0,1,0,3,5]
2 [0,0,2,3,4]
3 [1,1,0,2,3]
4 [2,4,0,3,5]
5 [3,5,0,3,5]
Currently I am unpacking it as below:
for index, row in df.iterrows():
    string = row['col2']
    string = string.replace('[', "")
    string = string.replace(']', "")
    v1, v2, v3, v4, v5 = string.split(",")
    v1 = int(v1)
    v2 = int(v2)
    v3 = int(v3)
    v4 = int(v4)
    v5 = int(v5)
Is there any alternative to this?
I want to do this because I want to create another column in the dataframe taking the average of all the values.
Adding additional details:
My tsv file looks as below:
id values
1 [0,1,0,3,5]
2 [0,0,2,3,4]
3 [1,1,0,2,3]
4 [2,4,0,3,5]
5 [3,5,0,3,5]
I am reading the tsv file as follows:
df = pd.read_csv('tsv_file_name.tsv', sep='\t', header=0)
You can use json to simplify your parsing:
import json

df['col2'] = df['col2'].apply(json.loads)
Edit: following your comment, getting the average is easy:
import numpy as np

# using numpy
df['col2_mean'] = df['col2'].apply(lambda t: np.array(t).mean())
# by hand
df['col2_mean'] = df['col2'].apply(lambda t: sum(t) / len(t))
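You can also do the parsing at read time with the converters argument of read_csv (a sketch using the column name 'values' from the question's file):
import json
import pandas as pd

df = pd.read_csv('tsv_file_name.tsv', sep='\t', header=0,
                 converters={'values': json.loads})
df['mean'] = df['values'].apply(lambda v: sum(v) / len(v))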
import csv

with open('myfile.tsv') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    ...
OR, with pandas (using pd.read_csv, since DataFrame.from_csv was removed in recent pandas):
import pandas as pd

df = pd.read_csv("myfile.tsv", sep="\t")

Merging csv columns while checking the ID of the first column

I have 4 csv files exported from an e-shop database. I need to merge them by columns, which I could maybe manage on my own, but the problem is matching the right columns.
First file:
"ep_ID","ep_titleCS","ep_titlePL".....
"601","Kancelářská židle šedá",NULL.....
...
Second file:
"pe_photoID","pe_productID","pe_sort"
"459","603","1"
...
Third file:
"epc_productID","epc_categoryID","epc_root"
"2155","72","1"
...
Fourth file:
"ph_ID","ph_titleCS"...
"379","5391132275.jpg"
...
I need to match the rows so that rows with the same "ep_ID" and "epc_productID" are merged together, and rows with the same "ph_ID" and "pe_photoID" too. I don't really know where to start; hopefully I wrote it understandably.
Update:
I am using:
files = ['produkty.csv', 'prirazenifotek.csv', 'pprirazenikategorii.csv', 'adresyfotek.csv']
dfs = []
for f in files:
    df = pd.read_csv(f, low_memory=False)
    dfs.append(df)

first_and_third = pd.merge(dfs[0], dfs[1], left_on="ep_ID", right_on="pe_photoID")
first_and_third.to_csv('new_filepath.csv', index=False)
OK, this code works, but it does two things differently from what I need:
When there is a row in file one with ID = 1, for example, and in file two there are 5 rows with bID = 1, it creates 5 rows in the final file. I would like to have one row that holds the values from every row with bID = 1 in file number two. Is that possible?
And it seems to be deleting some rows... not sure, till I get rid of the "duplicates"...
You can use pandas's merge method to merge the csvs together. In your question you only provide keys between the 1st and 3rd files, and the 2nd and 4th files. Not sure if you want one giant table that has them all together -- if so you will need to find another intermediary key, maybe one you haven't listed(?).
import pandas as pd

files = ['path_to_first_file.csv', 'second_file.csv', 'third_file.csv', 'fourth_file.csv']
dfs = []
for f in files:
    df = pd.read_csv(f)
    dfs.append(df)

first_and_third = dfs[0].merge(dfs[2], left_on='ep_ID', right_on='epc_productID', how='left')
second_and_fourth = dfs[1].merge(dfs[3], left_on='pe_photoID', right_on='ph_ID', how='left')
If you want to save the dataframe back down to a file you can do so:
first_and_third.to_csv('new_filepath.csv', index=False)
index=False assumes you have no index set on the dataframe and don't want the dataframe's row numbers to be included in the final csv.
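On the follow-up about getting one row per ID instead of five: a sketch (untested against your data) is to aggregate the right-hand table per key before merging, so each ID appears only once:
# collapse the photo-assignment rows into lists, one row per product
collapsed = dfs[1].groupby('pe_productID').agg(list).reset_index()
one_row_per_product = dfs[0].merge(collapsed, left_on='ep_ID',
                                   right_on='pe_productID', how='left')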

Selecting DataFrame column names from a csv file

I have a .csv to read into a DataFrame, and the names of the columns are in the same .csv file in the previous rows. Usually I drop all the 'unnecessary' rows to create the DataFrame and then hardcode the names of each column:
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the column names are in double quotes: "Tin_MIX_Air", "Tout_Fan2b", etc. There are exactly 16 rows with names.
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here the values of each variable start.
What I need to do is create a DataFrame from this .csv and use those names as the column names. I'm new to Python and I'm not very sure how to do it.
import pandas as pd

path = r'path-to-file.csv'
data = pd.DataFrame()
with open(path, 'r') as f:
    for line in f:
        data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True)

data.drop(data.index[range(0, 29)], inplace=True)
x = len(data.iloc[0])
data.drop(data.columns[[0, 1, 2, x-1, x-2, x-3]], axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my dataframe with the useful data. I'm dropping all the other columns that aren't useful to me and keeping only the values; the last three lines reset the row/column indexes and convert the whole df to floats. What I would like is to name the columns with each of the names I showed in the first piece of code. As I said before, I'm doing this manually as:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file, since there's a possibility of the CH# - "Name" combination changing.
Thank you very much for the help!
Comment: is it possible for it to work within the other "open" loop that I have?
Assume the column names are in rows 3 up to 6 and the data runs from row 7 up to EOF.
For instance (untested code):
import pandas as pd

columns = []
rows = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if 2 < row <= 6:
            # e.g. CH1, "Tin_MIX_Air",TEMP,...  ->  keep the quoted name
            ch, name = line.split(',')[:2]
            columns.append(name.strip().strip('"'))
        elif row > 6:
            rows.append(tuple(line.strip().split(',')))

data = pd.DataFrame(rows, columns=columns)
Question: ... I would like to get them from the .csv file
Start with:
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2:
            ch, name = line.split(',')[:2]
