So I recently concatenated multiple csv files into one. Since the filenames were dates, I also included "filename" as a column for reference. However, the filename has info that I would not like to include, such as the time and file extension. As a beginner, I'm only familiar with importing and printing the file to view it. What code is recommended to mass remove the info after the date?
answer filename
7 2018-04-12 21_01_01.csv
7 2018-04-18 18_36_30.csv
7 2018-04-18 21_01_32.csv
8 2018-04-20 15_21_02.csv
7 2018-04-20 21_00_44.csv
7 2018-04-22 21_01_05.csv
It could be done with plain Python and it's not that difficult, but a very easy way with pandas would be:
import pandas as pd
df = pd.read_csv(<your name of the csv here>, sep=r'\s\s+', engine='python')
# note: str.rstrip('.csv') strips a set of characters, not a suffix, so use a regex replace instead
df['filename'] = df['filename'].str.replace(r'\.csv$', '', regex=True)
print(df)
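If the time component should go as well (the question asks to keep only the date), one option, assuming the same DataFrame, is to split on the first space:
# keep only the date part, e.g. '2018-04-12 21_01_01' -> '2018-04-12'
df['filename'] = df['filename'].str.split(' ').str[0]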
When working with tabular data in Python, I highly recommend using the pandas package.
import pandas as pd
df = pd.read_csv("../test_data.csv")
def rem_part(string):
    return string.split(' ')[0]  # could also split on '.' if you want to keep the time
df['date'] = df['filename'].apply(rem_part)
df.drop('filename', axis=1, inplace=True)  # remove the filename column if you so please
df.to_csv("output.csv")  # save the file as a new CSV or overwrite the old
The test_data.csv file contains the following:
answer,filename
7,2018-04-12 21_01_01.csv
7,2018-04-18 18_36_30.csv
7,2018-04-18 21_01_32.csv
8,2018-04-20 15_21_02.csv
7,2018-04-20 21_00_44.csv
7,2018-04-22 21_01_05.csv
I haven't been able to find a solution in similar questions yet so I'll have to give it a go here.
I am importing a csv file looking like this in notepad:
",""ItemName"""
"Time,""Raw Values"""
"7/19/2019 10:31:29 PM,"" 0"","
"7/19/2019 10:32:01 PM,"" 1"","
What I want when I save it as a new csv is to reformat the date/time and the corresponding value like this (required by the analysis software). The semicolon as separator, including at the end, is important, and I don't really need a header.
2019-07-19 22:31:29;0;
2019-07-19 22:32:01;1;
This is what it looks like in Python:
Item1 = pd.read_csv(r'.\Datafiles\ItemName.csv')
Item1
#Output:
# ,"ItemName"
# 0 Time,"Raw Values"
# 1 7/19/2019 10:31:29 PM," 0",
# 2 7/19/2019 10:32:01 PM," 1",
Item1.dtypes
# ,"ItemName" object
# dtype: object
I have tried using datetime without any luck but there might be something fishy with the datatypes that I am not aware of.
What you want, in principle, is to read the file into a DataFrame, convert the datetime column, and export the df to csv again. I think you will need to get rid of the quote characters to get the import correct. You can do so by reading the file content into a string, replacing the '"', and feeding that string to pandas.read_csv. Example:
import os
from io import StringIO
import pandas as pd
# this is just to give an example:
s='''",""ItemName"""
"Time,""Raw Values"""
"7/19/2019 10:31:29 PM,"" 0"","
"7/19/2019 10:32:01 PM,"" 1"","'''
f = StringIO(s)
# in your script, make f a file pointer instead, e.g.
# with open('path_to_input.csv', 'r') as f:
# now get rid of the "
csvcontent = ''
for row in f:
    csvcontent += row.replace('"', '')
# read to DataFrame
df = pd.read_csv(StringIO(csvcontent), sep=',', skiprows=1, index_col=False)
df['Time'] = pd.to_datetime(df['Time'])
# save cleaned output as ;-separated csv
dst = 'path_where_to_save.csv'
df.to_csv(dst, index=False, sep=';', line_terminator=';'+os.linesep)
I am doing some data cleaning and I have a csv with a date column containing “month day”, for example: April 12. I want to add the year 2020 to each date in that column, so that I have: April 12 2020.
I’ve tried using pandas and datetime, but I feel like I am clearly missing an easy answer.
Thanks!
edit:
I should have said this before, I have already imported the csv and I want to add the year after the fact. Furthermore, I have already told pandas that the ‘onset’ column contains dates.
edit 2:
Thanks to MrNobody33, who commented: "You can try df['onset'] = df['onset'].apply(lambda dt: dt.replace(year=2020)) in that case."
That worked! Thanks for all the help. I'll try to make my future posts clearer and add my data when asking a question. I knew there had to be a simple answer!
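For reference, a minimal sketch of that accepted approach, assuming a hypothetical file name data.csv and that the 'onset' column is parsed as dates:
import pandas as pd
# hypothetical file name; parse_dates makes 'onset' a datetime column
df = pd.read_csv("data.csv", parse_dates=['onset'])
# overwrite the year on each parsed date with 2020
df['onset'] = df['onset'].apply(lambda dt: dt.replace(year=2020))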
try this...
df['onset'] = df['onset'].astype(str) + ' 2020'
If you are trying to edit the csv itself, you can try something like this:
import pandas as pd
path = 'test.csv' #path of the .csv
df = pd.read_csv(path) #reads the file
df['onset'] = df['onset'].astype(str) +' 2020' #Add the year
df.to_csv("test.csv", index=False) #modify the file
Or, if you want to edit the dataframe imported from that csv, you can try this:
import pandas as pd
path = 'test.csv' #path of the .csv
df = pd.read_csv(path) #reads the file
df['onset'] = df['onset'].astype(str) +' 2020' #Add the year
Here is a sample CSV I'm working with
Here is my code:
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[df['productOMS'].str.contains('^[a-z]+')]
    #REGEX to filter anything that would ^ (start with) a letter
inputFile = inputFile
deleteSearchTerm(inputFile)
What I want to do:
Anything in the column ProductOMS that begins with a letter would be a row that I don't want. So I'm trying to delete them based on a condition, and I was also trying to use regular expressions just so I'd get a little more comfortable with them.
I tried to do that with:
df = df[df['productOMS'].str.contains('^[a-z]+')]
where if any of the rows starts with any lower case letter I would drop it (I think)
Please let me know if I need to add anything to my post!
Edit:
Here is a link to a copy of the file I'm working with.
https://drive.google.com/file/d/1Dsw2Ana3WVIheNT43Ad4Dv6C8AIbvAlJ/view?usp=sharing
Another Edit: Here is the dataframe I'm working with
productNum,ProductOMS,productPrice
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
New Edit:
Here's some updated code using an answer
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'],errors='coerce').isnull()]
    print(df)
inputFile = inputFile
deleteSearchTerm(inputFile)
When I run this code and print out the dataframes, it gets rid of the rows that start with 'search'. However, my CSV file is not updating.
The issue here is that you're most likely dealing with mixed data types.
If you just want numeric values, you can use pd.to_numeric:
df = pd.DataFrame({'A' : [0,1,2,3,'a12351','123a6']})
df[~pd.to_numeric(df['A'],errors='coerce').isnull()]
A
0 0
1 1
2 2
3 3
But if you only want to test the first letter, then:
df[~df['A'].astype(str).str.contains('^[a-z]')==True]
A
0 0
1 1
2 2
3 3
5 123a6
Edit: it seems the first solution works, but you need to write this back to your csv?
You need to use the to_csv method; I'd recommend you read 10 Minutes to pandas here.
As for your function, let's edit it a little to take a source csv file and write out an edited version; it will save the file to the same location with _edited added on. Feel free to edit/change.
from pathlib import Path
import pandas as pd

def delete_search_term(input_file, column):
    """
    Takes in a file and removes any strings from a given column.
    input_file : path to your file.
    column : column with strings that you want to remove.
    """
    file_path = Path(input_file)
    if not file_path.is_file():
        raise Exception('This file path is not valid')
    df = pd.read_csv(input_file)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df[column], errors='coerce').isnull()]
    print(f"Creating file as:\n{file_path.parent.joinpath(f'{file_path.stem}_edited.csv')}")
    return df.to_csv(file_path.parent.joinpath(f"{file_path.stem}_edited.csv"), index=False)
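An example call (file name hypothetical):
delete_search_term('products.csv', 'ProductOMS')  # writes products_edited.csv next to the source file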
Solution:
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'],errors='coerce').isnull()]
    print(df)
    return df.to_csv(inputFile)
inputFile = filePath
inputFile = deleteSearchTerm(inputFile)
Data from the source csv as shared at the google drive location:
'''
productNum,ProductOMS,productPrice,Unnamed: 3
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
'''
import pandas as pd
df = pd.read_clipboard()
Output:
productNum ProductOMS productPrice Unnamed: 3
0 2463448 1002623072 419.95 NaN
1 2463413 1002622872 289.95 NaN
2 2463430 1002622974 309.95 NaN
3 2463419 1002622908 329.95 NaN
4 2463434 search?searchTerm=2463434 NaN NaN
5 2463423 1002622932 469.95 NaN
df1 = df.loc[df['ProductOMS'].str.isdigit()]
print(df1)
Output:
productNum ProductOMS productPrice Unnamed: 3
0 2463448 1002623072 419.95 NaN
1 2463413 1002622872 289.95 NaN
2 2463430 1002622974 309.95 NaN
3 2463419 1002622908 329.95 NaN
5 2463423 1002622932 469.95 NaN
I hope it helps you:
df = pd.read_csv(filename)
df = df[~df['ProductOMS'].str.contains('^[a-z]+')]
df.to_csv(filename)
For the most part your function is fine, but you seem to have forgotten to save the CSV, which is done with the df.to_csv() method.
Let me rewrite the code for you:
import pandas as pd
def processAndSaveCSV(filename):
    # Read the CSV file
    df = pd.read_csv(filename)
    # Retain only the rows with `ProductOMS` being numeric
    df = df[df['ProductOMS'].str.contains(r'^\d+')]
    # Save CSV file - rewrites the file
    df.to_csv(filename)
Hope this helps :)
It looks like a scope problem to me.
First we need to return df:
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'],errors='coerce').isnull()]
    print(df)
    return df
Then replace the line
deleteSearchTerm(inputFile)
with:
inputFile = deleteSearchTerm(inputFile)
Basically your function is not returning anything.
After you fix that you just need to redefine your inputFile variable to the new dataframe your function is returning.
If you already defined df earlier in your code and you're trying to manipulate it, then the function is not actually changing your existing global df variable. Instead it's making a new local variable under the same name.
To fix this we first return the local df and then re-assign the global df to the local one.
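Here is a minimal sketch of the same issue, with hypothetical names:
import pandas as pd

df = pd.DataFrame({'A': ['search', '123']})  # global df

def clean():
    df = pd.DataFrame({'A': ['123']})  # local df, shadows the global name
    return df

clean()       # the global df is still unchanged
df = clean()  # re-assigning the return value is what updates it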
You should be able to find more information about variable scope at this link:
https://www.geeksforgeeks.org/global-local-variables-python/
It also appears you never actually update your original file.
Try adding this to the end of your code:
df.to_csv('CSV file name', index=True)
The index argument just says whether you want to write the row index.
I am struggling to read a simple csv into pandas; the actual problem is that it doesn't separate on ",".
import pandas as pd
df = pd.read_csv('C:\\Users\\xxx\\1.csv', header=0, delimiter ="\t")
print(df)
I have tried sep=',' and it does not separate.
Event," 2016-02-01"," 2016-02-02"," 2016-02-03"," 2016-02-04","
Contact joined,"5","7","18","20",
Launch first time,"30","62","86","110",
It should look like one header with the dates and 2 rows:
2016-02-01 2016-02-02 etc
0 5 7
1 30 62
UPDATE: Yes, the problem was in the csv itself, with unnecessary quotes and characters.
You seem to be using both delimiter= and sep=, which both do the same thing. If it is actually comma separated, try:
import pandas as pd
df = pd.read_csv('C:\\Users\\xxx\\1.csv')
print(df)
sep=',' is the default, so it's not necessary to explicitly state that. The same goes for header=0. delimiter= is just an alias for sep=.
You still seem to have a problem with the formatting of your column names. If you post an example of your csv, I can try to fix that...
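Since the update says the real problem was stray quotes in the csv itself, a possible clean-up, as a sketch (reusing the path from the question):
import pandas as pd
from io import StringIO

# read the raw text, drop the stray quote characters, then parse normally
with open('C:\\Users\\xxx\\1.csv', 'r') as f:
    cleaned = f.read().replace('"', '')
df = pd.read_csv(StringIO(cleaned))
print(df)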
I have a csv file as follows:
Date,Data
01-01-01,111
02-02-02,222
03-03-03,333
The Date has the following format YEAR-MONTH-DAY. I would like to calculate from these dates the monthly average values of the data (there are way more than 3 dates in my file).
For that I wish to use the following code:
import pandas as pd
import dateutil
import datetime
import os,sys,math,time
from os import path
os.chdir("in/base/dir")
data = pd.DataFrame.from_csv("data.csv")
data['Month'] = pd.DatetimeIndex(data['Date']).month
mean_data = data.groupby('Month').mean()
with open("data_monthly.csv", "w") as f:
    print(mean_data, file=f)
For some reason this gives me the error KeyError: 'Date'.
So it seems that the header is not read by pandas. Does anyone know how to fix that?
Your Date column header is read, but put into the index. You've got to use:
data['Month'] = pd.DatetimeIndex(data.reset_index()['Date']).month
Another solution is to use index_col=None while making the dataframe from csv.
data = pd.DataFrame.from_csv("data.csv", index_col=None)
After which your code would be fine.
The ideal solution would be to use read_csv().
data = pd.read_csv("data.csv")
Use the read_csv method. By default it is comma separated.
import pandas as pd
df = pd.read_csv(filename)
print(pd.to_datetime(df["Date"]))
Output:
0 2001-01-01
1 2002-02-02
2 2003-03-03
Name: Date, dtype: datetime64[ns]
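To get the monthly averages the question asks for, a minimal sketch building on this (column names taken from the sample file):
df['Month'] = pd.to_datetime(df['Date']).dt.month
# numeric_only skips the string 'Date' column when averaging
mean_data = df.groupby('Month').mean(numeric_only=True)
mean_data.to_csv("data_monthly.csv")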