I am struggling to read a simple CSV into pandas. The actual problem is that it doesn't split the columns on ",".
import pandas as pd
df = pd.read_csv('C:\\Users\\xxx\\1.csv', header=0, delimiter ="\t")
print(df)
I have also tried sep=',' and it still does not split the columns.
Event," 2016-02-01"," 2016-02-02"," 2016-02-03"," 2016-02-04","
Contact joined,"5","7","18","20",
Launch first time,"30","62","86","110",
It should look like one header row with dates and two data rows:
2016-02-01 2016-02-02 etc
0 5 7
1 30 62
UPDATE: Yes, the problem was in the CSV itself, which had unnecessary quotes and characters.
You seem to be using both delimiter= and sep=, which do the same thing. If the file is actually comma-separated, try:
import pandas as pd
df = pd.read_csv('C:\\Users\\xxx\\1.csv')
print(df)
sep=',' is the default, so it's not necessary to explicitly state that. The same goes for header=0. delimiter= is just an alias for sep=.
You still seem to have a problem with the formatting of your column names. If you post an example of your CSV, I can try to fix that...
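For a file like the one in the question, a minimal sketch may help. It assumes a shortened, cleaned-up copy of the sample held in a string (the name raw is only for illustration, and the stray trailing quote and trailing commas have been removed). It shows that the default comma separator already splits the columns and that the leading spaces in the quoted header names can be stripped afterwards:

```python
import io

import pandas as pd

# Shortened, cleaned-up copy of the sample from the question (assumption:
# the stray trailing quote and the trailing commas have been removed).
raw = '''Event," 2016-02-01"," 2016-02-02"
Contact joined,"5","7"
Launch first time,"30","62"
'''

df = pd.read_csv(io.StringIO(raw))     # sep=',' is the default
df.columns = df.columns.str.strip()    # drop the leading spaces in the header names
print(df)
```

With a real file, io.StringIO(raw) would be replaced by the file path.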
Here is a sample CSV I'm working with
Here is my code:
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[df['productOMS'].str.contains('^[a-z]+')]
    #REGEX to filter anything that would ^ (start with) a letter
inputFile = inputFile
deleteSearchTerm(inputFile)
What I want to do:
Any row whose value in the ProductOMS column begins with a letter is a row that I don't want. So I'm trying to delete those rows based on a condition, and I was also trying to use regular expressions so that I'd get a little more comfortable with them.
I tried to do that with:
df = df[df['productOMS'].str.contains('^[a-z]+')]
where if any of the rows starts with any lower case letter I would drop it (I think)
Please let me know if I need to add anything to my post!
Edit:
Here is a link to a copy of the file I'm working with.
https://drive.google.com/file/d/1Dsw2Ana3WVIheNT43Ad4Dv6C8AIbvAlJ/view?usp=sharing
Another Edit: Here is the dataframe I'm working with
productNum,ProductOMS,productPrice
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
New Edit:
Here's some updated code using an answer
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'],errors='coerce').isnull()]
    print(df)
inputFile = inputFile
deleteSearchTerm(inputFile)
When I run this code and print out the dataframes, the rows that start with 'search' are gone. However, my CSV file is not updating.
The issue here is that you're most likely dealing with mixed data types.
If you just want numeric values, you can use pd.to_numeric:
df = pd.DataFrame({'A' : [0,1,2,3,'a12351','123a6']})
df[~pd.to_numeric(df['A'],errors='coerce').isnull()]
A
0 0
1 1
2 2
3 3
But if you only want to test the first character, then:
df[~df['A'].astype(str).str.contains('^[a-z]')==True]
A
0 0
1 1
2 2
3 3
5 123a6
Edit: it seems the first solution works, but you need to write the result back to your CSV.
For that, use the to_csv method. I'd recommend reading "10 minutes to pandas" in the pandas documentation.
As for your function, let's edit it a little so that it takes a source CSV file and writes out an edited version, saved to the same location with _edited added to the name. Feel free to edit or change it.
from pathlib import Path
import pandas as pd

def delete_search_term(input_file, column):
    """
    Takes in a file and removes any rows with non-numeric strings in a given column
    input_file : path to your file.
    column : column with strings that you want to remove.
    """
    file_path = Path(input_file)
    if not file_path.is_file():
        raise Exception('This file path is not valid')
    df = pd.read_csv(input_file)
    # Keep only the rows whose value in `column` can be parsed as a number
    df = df[~pd.to_numeric(df[column],errors='coerce').isnull()]
    # Write next to the source file, with _edited appended to the name
    output_path = file_path.parent.joinpath(f"{file_path.stem}_edited.csv")
    print(f"Creating file as:\n{output_path}")
    return df.to_csv(output_path,index=False)
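To check the read, filter and write round trip without touching a real file, something like the following sketch could be used. It assumes a shortened copy of the sample data without the trailing commas, written to a temporary directory; the names sample, source, target and result are made up for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Shortened copy of the sample data (assumption: trailing commas removed).
sample = (
    "productNum,ProductOMS,productPrice\n"
    "2463448,1002623072,419.95\n"
    "2463434,search?searchTerm=2463434,\n"
    "2463423,1002622932,469.95\n"
)

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "products.csv"
    source.write_text(sample)

    df = pd.read_csv(source)
    # Values that cannot be parsed as numbers become NaN and are dropped.
    numeric = pd.to_numeric(df["ProductOMS"], errors="coerce")
    kept = df[numeric.notna()]

    # Write the filtered rows next to the source file.
    target = source.with_name(f"{source.stem}_edited.csv")
    kept.to_csv(target, index=False)

    # Read the new file back to confirm what was saved.
    result = pd.read_csv(target)

print(result)
```

Writing to a separate _edited file keeps the original intact, which makes it easy to compare the two while testing.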
Solution:
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'],errors='coerce').isnull()]
    print(df)
    return df.to_csv(inputFile)
inputFile = filePath
inputFile = deleteSearchTerm(inputFile)
Data from the source csv as shared at the google drive location:
'''
productNum,ProductOMS,productPrice,Unnamed: 3
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
'''
import pandas as pd
df = pd.read_clipboard()
Output:
productNum ProductOMS productPrice Unnamed: 3
0 2463448 1002623072 419.95 NaN
1 2463413 1002622872 289.95 NaN
2 2463430 1002622974 309.95 NaN
3 2463419 1002622908 329.95 NaN
4 2463434 search?searchTerm=2463434 NaN NaN
5 2463423 1002622932 469.95 NaN
df1 = df.loc[df['ProductOMS'].str.isdigit()]
print(df1)
Output:
productNum ProductOMS productPrice Unnamed: 3
0 2463448 1002623072 419.95 NaN
1 2463413 1002622872 289.95 NaN
2 2463430 1002622974 309.95 NaN
3 2463419 1002622908 329.95 NaN
5 2463423 1002622932 469.95 NaN
I hope it helps you:
df = pd.read_csv(filename)
df = df[~df['ProductOMS'].str.contains('^[a-z]+')]
df.to_csv(filename)
For the most part your function is fine, but you seem to have forgotten to save the CSV, which is done with the df.to_csv() method.
Let me rewrite the code for you:
import pandas as pd
def processAndSaveCSV(filename):
    # Read the CSV file
    df = pd.read_csv(filename)
    # Retain only the rows where `ProductOMS` starts with a digit
    # (raw string, so that \d is not treated as a string escape)
    df = df[df['ProductOMS'].str.contains(r'^\d+')]
    # Save CSV File - Rewrites file
    df.to_csv(filename)
Hope this helps :)
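One detail worth checking with a pattern like this: a regex anchored only at the start matches any value that begins with a digit, which is not the same as the whole value being numeric. A small sketch on made-up values (the Series s is hypothetical) shows the difference:

```python
import pandas as pd

# Made-up values: one numeric ID, one search URL, one mixed string.
s = pd.Series(["1002623072", "search?searchTerm=2463434", "123a6"])

# Anchored only at the start: true for anything that begins with a digit.
starts_with_digit = s.str.contains(r"^\d")

# Anchored at both ends: true only when every character is a digit.
all_digits = s.str.contains(r"^\d+$")

print(pd.DataFrame({"value": s,
                    "starts_with_digit": starts_with_digit,
                    "all_digits": all_digits}))
```

For the sample data in the question both patterns keep the same rows, so the choice only matters if mixed values such as 123a6 can occur.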
It looks like a scope problem to me.
First we need to return df:
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'],errors='coerce').isnull()]
    print(df)
    return df
Then replace the line
deleteSearchTerm(inputFile)
with:
df = deleteSearchTerm(inputFile)
Basically, your function is not returning anything.
After you fix that, you just need to assign the DataFrame your function returns to a variable.
If you already defined df earlier in your code and you're trying to manipulate it, then the function is not actually changing your existing global df variable. Instead, it's making a new local variable with the same name.
To fix this, we first return the local df and then re-assign the global df to the returned one.
You should be able to find more information about variable scope at this link:
https://www.geeksforgeeks.org/global-local-variables-python/
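The scope point can be seen in a short sketch. The function name drop_non_numeric and the two-row DataFrame are made up for illustration; the filter is the to_numeric approach from the earlier answer:

```python
import pandas as pd


def drop_non_numeric(frame, column):
    # Rebinding `frame` here only changes the local name inside the function.
    frame = frame[pd.to_numeric(frame[column], errors="coerce").notna()]
    return frame


df = pd.DataFrame({"ProductOMS": ["1002623072", "search?searchTerm=2463434"]})

drop_non_numeric(df, "ProductOMS")        # return value discarded
rows_before = len(df)                     # the caller's df is unchanged

df = drop_non_numeric(df, "ProductOMS")   # re-assign to keep the result
rows_after = len(df)

print(rows_before, rows_after)
```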
It also appears you never actually update your original file.
Try adding this to the end of your code:
df.to_csv('CSV file name', index=True)
The index argument controls whether the row index is written to the file as an extra first column.
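As a small illustration of what the index argument does, the sketch below writes a made-up two-row DataFrame to a string both ways and reads the results back. When the index is written, it comes back as an extra unnamed column:

```python
import io

import pandas as pd

df = pd.DataFrame({"ProductOMS": [1002623072, 1002622932]})

# With no path given, to_csv returns the CSV text as a string.
with_index = df.to_csv(index=True)
without_index = df.to_csv(index=False)

# Reading the text back shows the extra column created by index=True.
reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(list(reread_with.columns))
print(list(reread_without.columns))
```

If a file is read and rewritten repeatedly, index=False avoids accumulating such columns.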
I'm currently trying to use Python to read a text file into SQLite3 using pandas. Here are a few entries from the text file:
1 Michael 462085 2.2506 Jessica 302962 1.5436
2 Christopher 361250 1.7595 Ashley 301702 1.5372
3 Matthew 351477 1.7119 Emily 237133 1.2082
The data consists of popular baby names, and I have to separate male names and female names into their own tables and perform queries on them. My method consists of first placing all the data into both tables, then dropping the unneeded columns afterwards. My issue is that when I try to add names to the columns, I get a value error: The expected axis has 6 elements, but 7 values. I'm assuming it's because Pandas possibly isn't reading the last values of each line, but I can't figure out how to fix it. My current delimiter is a whitespace delimiter that you can see below.
Here is my code:
import sqlite3
import pandas as pd
import csv
con = sqlite3.connect("C:\\****\\****\\****\\****\\****\\baby_names.db")
c=con.cursor()
# Please note that most of these functions will be commented out, because they will only be run once.
def create_and_insert():
    # load data
    df = pd.read_csv('babynames.txt', index_col=0, header=None, sep=r'\s+', engine='python')
    # Reading the textfile
    df.columns = ['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber', 'Girlpercent']
    # Adding Column names
    df.columns = df.columns.str.strip()
    con = sqlite3.connect("*************\\baby_names.db")
    # drop data into database
    df.to_sql("Combined", con)
    df.to_sql("Boys", con)
    df.to_sql("Girls", con)
    con.commit()
    con.close()
create_and_insert()
def test():
    c.execute("SELECT * FROM Boys WHERE Rank = 1")
    print(c.fetchall())
test()
con.commit()
con.close()
I've tried adding multiple delimiters, but it didn't seem to do anything. Using just regular space as the delimiter seems to just create 'blank' column names. From reading the Pandas docs, it says that multiple delimiters are possible, but I can't quite figure it out. Any help would be greatly appreciated!
Note that:
your input file contains 7 columns,
but the initial column is set as the index (you passed index_col=0),
so your DataFrame contains only 6 regular columns.
Print df to confirm it.
Now, when you run df.columns = ['Rank', ...], you attempt to assign the
7 passed names to the 6 existing data columns.
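The column count can be confirmed with a short sketch. It assumes the three sample lines from the question held in a string rather than the real babynames.txt:

```python
import io

import pandas as pd

# The three sample lines from the question.
sample = """1 Michael 462085 2.2506 Jessica 302962 1.5436
2 Christopher 361250 1.7595 Ashley 301702 1.5372
3 Matthew 351477 1.7119 Emily 237133 1.2082
"""

# With index_col=0 the first field becomes the index, leaving 6 data columns.
with_index = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s+", index_col=0)

# Without index_col all 7 fields are regular columns.
without_index = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s+")

print(with_index.shape, without_index.shape)
```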
Probably you should:
read the DataFrame without setting the index (for now),
assign all 7 column names,
set Rank column as the index.
The code to do it is:
df = pd.read_csv('babynames.txt', header=None, sep=r'\s+', engine='python')
df.columns = ['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber',
              'Girlpercent']
df.set_index('Rank', inplace=True)
Or even shorter (all in one):
df = pd.read_csv('babynames.txt', sep=r'\s+', engine='python',
                 names=['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber',
                        'Girlpercent'], index_col='Rank')
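Carrying this through to the database step, a possible end-to-end sketch is shown below. It is an illustration under assumptions, not the asker's exact setup: the sample lines are held in a string, the database is an in-memory SQLite database, and the boys and girls frames are built by selecting columns rather than by dropping them afterwards:

```python
import io
import sqlite3

import pandas as pd

sample = """1 Michael 462085 2.2506 Jessica 302962 1.5436
2 Christopher 361250 1.7595 Ashley 301702 1.5372
3 Matthew 351477 1.7119 Emily 237133 1.2082
"""

names = ["Rank", "BoyName", "Boynumber", "Boypercent",
         "Girlname", "Girlnumber", "Girlpercent"]
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=names, index_col="Rank")

# Select the columns each table needs; Rank stays as the index.
boys = df[["BoyName", "Boynumber", "Boypercent"]]
girls = df[["Girlname", "Girlnumber", "Girlpercent"]]

con = sqlite3.connect(":memory:")   # assumption: in-memory database for the demo
boys.to_sql("Boys", con)            # the index is written as a "Rank" column
girls.to_sql("Girls", con)

top_boy = con.execute('SELECT BoyName FROM Boys WHERE "Rank" = 1').fetchall()
top_girl = con.execute('SELECT Girlname FROM Girls WHERE "Rank" = 1').fetchall()
con.close()

print(top_boy, top_girl)
```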
So I recently concatenated multiple csv files into one. Since the filenames were dates, I also included "filename" as a column for reference. However, the filename has info that I would not like to include such as the time and file extension. As a beginner, I'm only familiar with importing and printing the file to view. What code is recommended to mass remove the info after the date?
answer filename
7 2018-04-12 21_01_01.csv
7 2018-04-18 18_36_30.csv
7 2018-04-18 21_01_32.csv
8 2018-04-20 15_21_02.csv
7 2018-04-20 21_00_44.csv
7 2018-04-22 21_01_05.csv
It could be done with plain Python without much difficulty, but a very easy way with pandas would be:
import pandas as pd
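df = pd.read_csv('your_file.csv', sep=r'\s\s+', engine='python')  # placeholder: replace with the path to your CSV
# str.rstrip('.csv') strips any trailing '.', 'c', 's' or 'v' characters, not the suffix, so use a regex instead
df['filename'] = df['filename'].str.replace(r'\.csv$', '', regex=True)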
print(df)
When working with tabular data in Python, I highly recommend using the pandas package.
import pandas as pd
df = pd.read_csv("../test_data.csv")
def rem_part(string):
    return string.split(' ')[0] # could also split on '.' if you want to keep the time
df['date'] = df['filename'].apply(rem_part)
df.drop('filename', axis = 1, inplace=True) # remove the filename column if you so please
df.to_csv("output.csv") # save the file as a new CSV or overwrite the old
The test_data.csv file contains the following:
answer,filename
7,2018-04-12 21_01_01.csv
7,2018-04-18 18_36_30.csv
7,2018-04-18 21_01_32.csv
8,2018-04-20 15_21_02.csv
7,2018-04-20 21_00_44.csv
7,2018-04-22 21_01_05.csv
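As an alternative to a helper function, the date could also be pulled out with a vectorised string method. This is a sketch on two rows of the test data held in a string; it assumes every filename starts with a date in YYYY-MM-DD form:

```python
import io

import pandas as pd

csv_text = """answer,filename
7,2018-04-12 21_01_01.csv
8,2018-04-20 15_21_02.csv
"""

df = pd.read_csv(io.StringIO(csv_text))

# Capture the leading YYYY-MM-DD part of each filename.
df["date"] = df["filename"].str.extract(r"^(\d{4}-\d{2}-\d{2})", expand=False)

# Optional: convert the text to real datetime values for sorting or filtering.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

print(df)
```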
I have a .csv file having multiple rows and columns:
chain Resid Res K B C Tr Kw Bw Cw Tw
A 1 ASP 1 0 0.000104504 NA 0 0 0.100087974 0.573972285
A 2 GLU 2 627 0.000111832 0 0.033974309 0.004533331 0.107822844 0.441666022
Whenever I open the file using pandas or using with open, it shows that there is only one column and multiple rows:
629 rows x 1 columns
Here is the code I'm using:
data= pd.read_csv("F:/file.csv", sep='\t')
print(data)
and the result I'm getting is this:
A,1,ASP,1,0,0.0001045041279130...
I want the output to be in DataFrame form so that I can carry out further calculations. Any help will be highly appreciated.
The separator is a comma, so it is possible to omit the sep parameter, because sep=',' is the default separator in read_csv:
data= pd.read_csv("F:/file.csv")
You can read the CSV using the following code snippet:
import pandas as pd
data = pd.read_csv('F:/file.csv', sep=',')
Don't use '\t', because your file is not tab-separated (the output shows that the fields are separated by commas), so use the below:
data = pd.read_csv("F:/file.csv")
Or, if the fields really are separated by two or more spaces, use:
data = pd.read_csv("F:/file.csv", sep=r'\s{2,}', engine='python')
This keeps data values that contain single spaces intact.
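The effect of a wrong separator is easy to reproduce. The sketch below assumes a comma-separated copy of the first rows from the question held in a string and reads it once with sep='\t' and once with the default:

```python
import io

import pandas as pd

# Comma-separated copy of the first rows from the question.
csv_text = """chain,Resid,Res,K,B,C,Tr,Kw,Bw,Cw,Tw
A,1,ASP,1,0,0.000104504,NA,0,0,0.100087974,0.573972285
A,2,GLU,2,627,0.000111832,0,0.033974309,0.004533331,0.107822844,0.441666022
"""

# No tab in the text, so each whole line ends up in a single column.
wrong = pd.read_csv(io.StringIO(csv_text), sep="\t")

# The default separator is a comma, which splits the fields as intended.
right = pd.read_csv(io.StringIO(csv_text))

print(wrong.shape, right.shape)
```

A result with exactly one column is usually a sign that the separator passed to read_csv does not occur in the file.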
I have a .csv that contains column headers, as shown below. I need to suppress the column labels when I ingest the file as a data frame.
date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7
When I issue the following command:
df = pd.read_csv('c:/temp1/test_csv.csv', usecols=[4,5], names = ["zip","weight"], header = 0, nrows=10)
I get:
zip weight
0 1417464 3546600
I have tried various manipulations of header=True and header=0. If I don't use header=0, then the columns will all print out on top of the rows like so:
zip weight
height locale
0 1417464 3546600
I have tried skiprows=0 and skiprows=1, but neither removes the headers. The command does skip the specified line, however.
I could really use some additional insight or a solution. Thanks in advance for any assistance you could provide.
Tiberius
Using the example from @jezrael, if you want to skip the header and suppress the column labeling:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=None, skiprows=1)
print(df)
4 5
0 3546600 254
I'm not sure I entirely understand why you want to remove the headers, but you could comment out the header line as follows, as long as no other line contains the character 'd' (pandas ignores everything from the comment character to the end of the line):
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='d') # comments out lines beginning with 'date,color' . . .
>>> df
3 4
0 1417464 3546600
It would be better to comment out the line in the CSV file with the hash character (#) and then use the same approach (again, as long as no other line contains a hash):
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='#') # comments out lines with #
>>> df
3 4
0 1417464 3546600
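The comment approach can be tried without editing a file on disk. The sketch below assumes the sample data held in a string, with a hash already placed in front of the header line:

```python
import io

import pandas as pd

# Sample data with the header line commented out by a leading '#'.
csv_text = """#date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7
"""

# Fully commented lines are skipped, so only the data row is parsed.
df = pd.read_csv(io.StringIO(csv_text), usecols=[3, 4], header=None, comment="#")

print(df)
```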
I think you are right.
So you can change column names to a and b:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], names = ["a","b"], header = 0 , nrows=10)
print(df)
a b
0 3546600 254
Now these columns have new names instead of weight and height.
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header = 0 , nrows=10)
print(df)
weight height
0 3546600 254
You can check the read_csv docs:
header : int, list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
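The three behaviours discussed in this thread can be compared side by side in a short sketch. It reuses the sample data from the answers above; the variable names kept, renamed and unlabeled are only for illustration:

```python
import io

import pandas as pd

temp = """date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7
"""

# Default: the first line supplies the column names.
kept = pd.read_csv(io.StringIO(temp), usecols=[4, 5])

# header=0 plus names: the first line is consumed and its names are replaced.
renamed = pd.read_csv(io.StringIO(temp), usecols=[4, 5], names=["a", "b"], header=0)

# header=None plus skiprows=1: the first line is skipped and the columns
# are labeled by their integer positions.
unlabeled = pd.read_csv(io.StringIO(temp), usecols=[4, 5], header=None, skiprows=1)

print(list(kept.columns), list(renamed.columns), list(unlabeled.columns))
```

A DataFrame always has column labels of some kind, so the closest thing to "no header" is the integer labels that the last call produces.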