Python how to add a column to a text file - python

I'm new to Python and I have a challenge. I need to add a column to a text file delimited by ";". So far so good ... except that the value of this column depends on the value of another column. I'll leave an example in case I was not clear.
My file looks like this:
Account;points
1;500
2;600
3;1500
If the value of the points column is greater than 1000, enter 2; if less, enter 1.
In this case the file would look like this:
Account;points;column_created
1;500;1
2;600;1
3;1500;2

An approach without using pandas; this code assumes your points column will always be in the second position.
with open('stats.txt', 'r+') as file:
    lines = file.readlines()
    file.seek(0, 0)
    # the first line is the header; give the new column a name there
    file.write(lines[0].strip() + ";column_created\n")
    for line in lines[1:]:
        columns = line.strip().split(";")
        if int(columns[1]) > 1000:
            file.write(";".join(columns) + ";2\n")
        else:
            file.write(";".join(columns) + ";1\n")

A file on disk can't have new items inserted between existing elements. You have to read all the data into memory, add the new column, and write everything back to the file.
You could use pandas to easily add a new column based on the value of another column.
In the example I use io.StringIO() only to create a minimal working example that everyone can copy and test. Use read_csv('input.csv', sep=';') with your own file.
import pandas as pd
import io
text = '''Account;points
1;500
2;600
3;1500'''
#df = pd.read_csv('input.csv', sep=';')
df = pd.read_csv(io.StringIO(text), sep=';')
print('--- before ---')
print(df)
df['column_created'] = df['points'].apply(lambda x: 2 if x > 1000 else 1)
print('--- after ---')
print(df)
df.to_csv('output.csv', sep=';', index=False)
Result
--- before ---
   Account  points
0        1     500
1        2     600
2        3    1500
--- after ---
   Account  points  column_created
0        1     500               1
1        2     600               1
2        3    1500               2

You can also use Python's built-in csv library to create CSV files. Here is the link to the documentation:
https://docs.python.org/3/library/csv.html
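As a rough sketch of that approach (reusing stats.txt from the question; stats_new.txt is a hypothetical output name):
import csv

# read the ;-delimited input, append the computed column, write a new file
with open('stats.txt', newline='') as src, open('stats_new.txt', 'w', newline='') as dst:
    reader = csv.reader(src, delimiter=';')
    writer = csv.writer(dst, delimiter=';')
    header = next(reader)  # first row holds the column names
    writer.writerow(header + ['column_created'])
    for row in reader:
        writer.writerow(row + ['2' if int(row[1]) > 1000 else '1'])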

Related

Convert .txt file to .csv with specific columns PYTHON

I have a text file that I want to load into my Python code, but the format of the txt file is not suitable.
Here is what it contains:
SEQ MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNY
SS3 CCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
95024445656543114678678999999999999999888889998886
SS8 CCHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
96134445555554311253378999999999999999999999999987
SA EEEbBBBBBBBBBBbEbEEEeeEeBeEbBEEbbEeBeEbbeebBbBbBbb
41012123422000000103006262214011342311110000030001
TA bhHHHHHHHHHHHHHgIihiHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
00789889988663201010099999999999999999898999998741
CD NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
54433221111112221122124212411342243234323333333333
I want to convert it into a pandas DataFrame with SEQ, SS3, SS8, SA, TA, CD as the columns and the lines next to them as the rows.
I tried pd.read_csv but it doesn't give me the result I want.
Thank you!
To read a text file using the pandas.read_csv() method, the text file should contain comma-separated data:
SEQ, SS3, ....
MSSSSWLLLSLVAVTAAQSTIEEQ..., CCCHHHHHHHHHHHHCCCCCCHHHHHHH.....
Steps
Use pd.read_fwf() to read files in a fixed-width format.
Fill the missing values with the last available value by df.ffill().
Assign group number gp for the row number in the output using a groupby-cumcount construct.
Move gp=(0,1) to columns by df.pivot, and then transpose again into the desired output.
Note: this solution works with an arbitrary number (including zero, and of course not too many) of consecutive lines with omitted values in the first column.
Code
import pandas as pd

# data (the result below shows 3 characters of the second column only)
file_path = "/mnt/ramdisk/input.txt"
df = pd.read_fwf(file_path, names=["col", "val"])
# fill the blank values with the last available value
df["col"] = df["col"].ffill()
# get correct row location
df["gp"] = df.groupby("col").cumcount()
# pivot group (0,1) to columns and then transpose
df_ans = df.pivot(index="col", columns="gp", values="val").transpose()
Result
print(df_ans)  # show the first 3 characters only

col   CD   SA  SEQ  SS3  SS8   TA
gp
0    NNN  EEE  MSS  CCC  CCH  bhH
1    544  410  NaN  950  961  007
Then you can save the resulting DataFrame using df_ans.to_csv().
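For example (the output file name here is hypothetical):
df_ans.to_csv('converted.csv')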
You can use this script to load the .txt file into a DataFrame and save it as a CSV file:
import pandas as pd

data = {}
with open('<your file.txt>', 'r') as f_in:
    for line in f_in:
        line = line.split()
        if len(line) == 2:
            data[line[0]] = [line[1]]

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)
This saves the DataFrame to data.csv.

Pandas: Remove rows from the dataframe that begin with a letter and save CSV

Here is a sample CSV I'm working with (the data is shown under "Another Edit" below).
Here is my code:
import numpy as np
import pandas as pd

def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[df['productOMS'].str.contains('^[a-z]+')]
    #REGEX to filter anything that would ^ (start with) a letter

inputFile = inputFile
deleteSearchTerm(inputFile)
What I want to do:
Anything in the column ProductOMS that begins with a letter is a row that I don't want. So I'm trying to delete those rows based on a condition, and I was also trying to use regular expressions just so I'd get a little bit more comfortable with them.
I tried to do that with:
df = df[df['productOMS'].str.contains('^[a-z]+')]
where, if a row starts with any lowercase letter, I would drop it (I think).
Please let me know if I need to add anything to my post!
Edit:
Here is a link to a copy of the file I'm working with.
https://drive.google.com/file/d/1Dsw2Ana3WVIheNT43Ad4Dv6C8AIbvAlJ/view?usp=sharing
Another Edit: Here is the dataframe I'm working with
productNum,ProductOMS,productPrice
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
New Edit:
Here's some updated code using an answer
import numpy as np
import pandas as pd

def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'], errors='coerce').isnull()]
    print(df)

inputFile = inputFile
deleteSearchTerm(inputFile)
When I run this code and print out the dataframes, it gets rid of the rows that start with 'search'. However, my CSV file is not updating.
The issue here is that you're most likely dealing with mixed data types.
If you just want numeric values you can use pd.to_numeric:
df = pd.DataFrame({'A': [0, 1, 2, 3, 'a12351', '123a6']})
df[~pd.to_numeric(df['A'], errors='coerce').isnull()]

   A
0  0
1  1
2  2
3  3
but if you only want to test the first letter then:
df[~df['A'].astype(str).str.contains('^[a-z]')==True]

       A
0      0
1      1
2      2
3      3
5  123a6
Edit: it seems the first solution works, but you need to write this back to your csv? You need to use the to_csv method; I'd recommend you read 10 Minutes to pandas here.
As for your function, let's edit it a little to take a source csv file and write out an edited version; it will save the file to the same location with _edited added on. Feel free to edit/change it.
import pandas as pd
from pathlib import Path

def delete_search_term(input_file, column):
    """
    Takes in a file and removes any strings from a given column

    input_file : path to your file.
    column     : column with strings that you want to remove.
    """
    file_path = Path(input_file)
    if not file_path.is_file():
        raise Exception('This file path is not valid')
    df = pd.read_csv(input_file)
    # (2) Filter out every row where the value is not numeric
    df = df[~pd.to_numeric(df[column], errors='coerce').isnull()]
    print(f"Creating file as:\n{file_path.parent.joinpath(f'{file_path.stem}_edited.csv')}")
    return df.to_csv(file_path.parent.joinpath(f"{file_path.stem}_edited.csv"), index=False)
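A hypothetical usage, assuming the sample data from the question was saved as products.csv:
delete_search_term('products.csv', 'ProductOMS')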
Solution:
import numpy as np
import pandas as pd

def deleteSearchTerm(inputFile):
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'], errors='coerce').isnull()]
    print(df)
    return df.to_csv(inputFile)

inputFile = filePath
inputFile = deleteSearchTerm(inputFile)
Data from the source csv as shared at the google drive location:
'''
productNum,ProductOMS,productPrice,Unnamed: 3
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
'''
import pandas as pd
df = pd.read_clipboard(sep=',')  # sep=',' so the comma-separated sample parses into columns
Output:
   productNum                 ProductOMS  productPrice  Unnamed: 3
0     2463448                 1002623072        419.95         NaN
1     2463413                 1002622872        289.95         NaN
2     2463430                 1002622974        309.95         NaN
3     2463419                 1002622908        329.95         NaN
4     2463434  search?searchTerm=2463434           NaN         NaN
5     2463423                 1002622932        469.95         NaN
df1 = df.loc[df['ProductOMS'].str.isdigit()]
print(df1)
Output:
   productNum  ProductOMS  productPrice  Unnamed: 3
0     2463448  1002623072        419.95         NaN
1     2463413  1002622872        289.95         NaN
2     2463430  1002622974        309.95         NaN
3     2463419  1002622908        329.95         NaN
5     2463423  1002622932        469.95         NaN
I hope it helps you:
df = pd.read_csv(filename)
df = df[~df['ProductOMS'].str.contains('^[a-z]+')]
df.to_csv(filename)
For the most part your function is fine, but you seem to have forgotten to save the CSV, which is done by the df.to_csv() method.
Let me rewrite the code for you:
import pandas as pd

def processAndSaveCSV(filename):
    # Read the CSV file
    df = pd.read_csv(filename)
    # Retain only the rows with `ProductOMS` being numeric
    df = df[df['ProductOMS'].str.contains(r'^\d+')]
    # Save CSV file - rewrites the file
    df.to_csv(filename)
Hope this helps :)
It looks like a scope problem to me.
First we need to return df:
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'], errors='coerce').isnull()]
    print(df)
    return df
Then replace the line
deleteSearchTerm(inputFile)
with:
inputFile = deleteSearchTerm(inputFile)
Basically your function is not returning anything.
After you fix that, you just need to redefine your inputFile variable to the new dataframe your function is returning.
If you already defined df earlier in your code and you're trying to manipulate it, the function is not actually changing your existing global df variable; instead it's making a new local variable with the same name.
To fix this, we first return the local df and then re-assign the global df to the local one.
You should be able to find more information about variable scope at this link:
https://www.geeksforgeeks.org/global-local-variables-python/
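A minimal sketch of that scope behavior (the frame and function names here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

def drop_first(frame):
    # rebinding the parameter only changes the local name,
    # not the caller's df
    frame = frame.iloc[1:]
    return frame

drop_first(df)       # global df is unchanged; the return value is discarded
df = drop_first(df)  # re-assigning captures the returned frame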
It also appears you never actually update your original file.
Try adding this to the end of your code:
df.to_csv('CSV file name', index=True)
The index argument just says whether you want to include a row index.

Python script that efficiently drops columns from a CSV file

I have a csv file where the columns are separated by a tab delimiter but the number of columns is not constant. I need to read the file up to the 5th column. (I don't want to read the whole file and then extract the columns; I would like to read, for example, line by line and skip the remaining columns.)
You can use the usecols argument in pd.read_csv to limit the number of columns to be read.
import pandas as pd

# test data
s = '''a,b,c
1,2,3'''
with open('a.txt', 'w') as f:
    print(s, file=f)

df1 = pd.read_csv("a.txt", usecols=range(1))
df2 = pd.read_csv("a.txt", usecols=range(2))
print(df1)
print()
print(df2)

# output
#    a
# 0  1
#
#    a  b
# 0  1  2
You can use the usecols parameter of pd.read_csv to read only the first five columns (pass sep='\t' as well if your file is tab-delimited, as in the question):
import pandas as pd

df = pd.read_csv('out122.txt', usecols=[0,1,2,3,4])
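If you'd rather read line by line without pandas, as the question suggests, a minimal sketch with the stdlib csv module (the tab delimiter comes from the question; 'out122.txt' is reused as the file name) might look like:
import csv

rows = []
with open('out122.txt', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        rows.append(row[:5])  # keep only the first five columns of each line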

Having trouble removing headers when using pd.read_csv

I have a .csv that contains column headers and is displayed below. I need to suppress the column labeling when I ingest the file as a data frame.
date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7
When I issue the following command:
df = pd.read_csv('c:/temp1/test_csv.csv', usecols=[4,5], names = ["zip","weight"], header = 0, nrows=10)
I get:
       zip   weight
0  1417464  3546600
I have tried various manipulations of header=True and header=0. If I don't use header=0, then the columns will all print out on top of the rows like so:
       zip   weight
    height   locale
0  1417464  3546600
I have tried skiprows=0 and skiprows=1, but neither removes the headers; the command just skips the line specified.
I could really use some additional insight or a solution. Thanks in advance for any assistance you could provide.
Tiberius
Using the example of @jezrael, if you want to skip the header and suppress the column labeling:
import pandas as pd
import numpy as np
import io

temp = u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
# after testing replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=None, skiprows=1)
print(df)

         4    5
0  3546600  254
I'm not sure I entirely understand why you want to remove the headers, but you could comment out the header line as follows as long as you don't have any other rows that begin with 'd':
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='d')  # comments out lines beginning with 'date,color' ...
>>> df
         3        4
0  1417464  3546600
It would be better to comment out the line in the csv file with the crosshatch character (#) and then use the same approach (again, as long as you have not commented out any other lines with a crosshatch):
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='#')  # comments out lines with #
>>> df
         3        4
0  1417464  3546600
I think you are right.
So you can change column names to a and b:
import pandas as pd
import numpy as np
import io

temp = u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
# after testing replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], names=["a","b"], header=0, nrows=10)
print(df)

         a    b
0  3546600  254
Now these columns have new names instead of weight and height.
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=0, nrows=10)
print(df)

    weight  height
0  3546600     254
You can check the docs for read_csv:
header : int, list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

How can I combine two csv files into one, by adding one column to the end of the first one?

I simply need to add the column of the second CSV file to the first CSV file.
Example CSV file #1
Time Press RH Dewpt Alt
Value Value Value Value Value
For N number of rows.
Example CSV file #2
SmoothedTemperature
Value
I simply want to make it
Time Press RH Dewpt Alt SmoothedTemperature
Value Value Value Value Value Value
Also, one file has headers and the other does not.
Here is sample code of what I have so far; however, the output is the final row of file #1 repeated, with the full data set of file #2 next to it.
##specifying what they want to open
File = open(askopenfilename(), 'r')
##reading in the other file
Averaged = open('Moving_Average_Adjustment.csv','r')
##opening the new file via raw input to write to
filename = raw_input("Enter desired filename, EX: YYYYMMDD_SoundingNumber_Time.csv; must end in csv")
New_File = open(filename,'wb')
R = csv.reader(File, delimiter = ',')
## i feel the issue is here in my loop, i don't know how to print the first columns
## then also print the last column from the other CSV file on the end to make it mesh well
Write_New_File = csv.writer(New_File)
data = ["Time,Press,Dewpt,RH,Alt,AveragedTemp"]
Write_New_File.writerow(data)
for i, line in enumerate(R):
    if i <= (header_count + MovingAvg/2):
        continue
    Time,Press,Temp,Dewpt,RH,Ucmp,Vcmp,spd,Dir,Wcmp,Lon,Lat,Ele,Azi,Alt,Qp,Qt,Qrh,Qu,Qv,QdZ = line
    for i, line1 in enumerate(Averaged):
        if i == 1:
            continue
        SmoothedTemperature = line1
        Calculated_Data = [Time,Press,Dewpt,RH,Alt,SmoothedTemperature]
        Write_New_File.writerow(Calculated_Data)
If you want to go down this path, pandas makes csv manipulation very easy. Say your first two sample tables are in files named test1.csv and test2.csv:
>>> import pandas as pd
>>> test1 = pd.read_csv("test1.csv")
>>> test2 = pd.read_csv("test2.csv")
>>> test3 = pd.concat([test1, test2], axis=1)
>>> test3
   Time  Press  RH  Dewpt  Alt  SmoothedTemperature
0     1      2   3      4    5                    6

[1 rows x 6 columns]
This new table can be saved to a .csv file with the DataFrame method to_csv.
If, as you mention, one of the files has no headers, you can specify this when reading the file:
>>> test2 = pd.read_csv('test2.csv', header=None)
and then change the header row manually in pandas.
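For example (continuing the session above; SmoothedTemperature is the column name from the question):
>>> test2 = pd.read_csv('test2.csv', header=None)
>>> test2.columns = ['SmoothedTemperature']  # set the header manually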
