I have text data that looks like this:
3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"
I want to transform it into a table like this:
a b e r
1 2 5 7
23 45 76 76
I've tried to use a pandas data frame for that, but the data size is quite big, around 40 MB.
So what should I do to solve it?
Sorry for my bad explanation. I hope you can understand what I mean. Thanks!
import os
import pandas as pd
from io import StringIO
a = pd.read_csv(StringIO("12test.txt"), sep=",", header=None, error_bad_lines=False)
df = pd.DataFrame([row.split('.') for row in a.split('\n')])
print(df)
I've tried this, but it doesn't work.
Some errors occurred, like "'DataFrame' object has no attribute 'split'", the data frame containing the string "12test.txt" rather than the data inside the file, memory problems, etc.
Try:
>>> s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
>>> pd.DataFrame([[x.strip('"') for x in i.split(',')[1:]] for i in s.splitlines()[1:]], columns=[x.strip('"') for x in s.splitlines()[0].split(',')[1:]])
a b e r
0 1 2 5 7
1 23 45 76 76
>>>
Use a list comprehension then convert it to a pandas.DataFrame.
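Spelled out step by step, the same transform looks like this (a sketch using the sample string from the question):

```python
import pandas as pd

s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'

rows = []
for line in s.splitlines():
    # drop the leading count field, then strip the quotes from each value
    rows.append([x.strip('"') for x in line.split(',')[1:]])

# the first row is the header, the rest are data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```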
To read in-memory text data as if it were a file you can use StringIO; stripping the digit that follows each \n produces a string that read_csv can parse.
import io
import re
import pandas as pd
s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
s = re.sub(r'[\n][0-9]', "\n", s)
df = pd.read_csv(io.StringIO(s))
# drop the column generated by the first character, which contains NaN values
df = df.drop(df.columns[0], axis=1)
Related
Suppose I have this csv file named sample.csv:
CODE AGE SEX CITY
---- --- --- ----
E101 25 M New York
E102 42 F New York
E103 31 M Chicago
E104 67 F Chicago
I wish to count the number of males and females in the data. For instance, for this one, the answer would be:
M : 2
F : 2
Where should I start and how should I code it?
You can do this:
import pandas as pd
df = pd.read_csv("sample.csv")
print(f"M : {len(df[df['SEX'] == 'M'])}")
print(f"F : {len(df[df['SEX'] == 'F'])}")
>>> import csv
>>> M, F = 0, 0
>>> with open('file.csv') as csvfile:
...     data = csv.reader(csvfile)
...     next(data)  # skip the header row
...     for row in data:
...         if row[2] == "M":
...             M += 1
...         else:
...             F += 1
Import the CSV file, then section out the 'SEX' column.
import pandas as pd
data = pd.read_csv('sample.csv')
num_males = sum(data['SEX'] == 'M')
num_females = len(data['SEX']) - num_males
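The same idea with the sample data inlined, so the snippet runs without sample.csv on disk (the header names are taken from the question):

```python
import io
import pandas as pd

data = (
    "CODE,AGE,SEX,CITY\n"
    "E101,25,M,New York\n"
    "E102,42,F,New York\n"
    "E103,31,M,Chicago\n"
    "E104,67,F,Chicago\n"
)
df = pd.read_csv(io.StringIO(data))

# comparing a column to a scalar gives a boolean Series;
# summing it counts the True values
num_males = (df['SEX'] == 'M').sum()
num_females = (df['SEX'] == 'F').sum()
print(f"M : {num_males}")
print(f"F : {num_females}")
```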
Another solution is to use the pandas package.
import pandas as pd
csv_path_file = '' # your csv path file
separator = ';'
df = pd.read_csv(csv_path_file, sep = separator)
df['SEX'].value_counts()
will return a pd.Series object with 'M' and 'F' as the index and the counts as values.
It is also a great way to spot bad data: you'll immediately notice another category or missing values.
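A minimal sketch, with the data hardcoded in place of the csv file:

```python
import pandas as pd

# hardcoded sample standing in for the csv file
df = pd.DataFrame({'SEX': ['M', 'F', 'M', 'F']})

# one row per category, counts as values
counts = df['SEX'].value_counts()
print(counts)
```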
The simplest way is using Pandas to read data from csv and group by:
import pandas as pd
df = pd.read_csv('sample.csv')  # read data from csv
result = df.groupby('SEX').size()  # use .size() to get the row counts
Output:
SEX
F    2
M    2
dtype: int64
After you read the file, using either the external pandas package or the built-in csv module, you can use the built-in collections module's Counter to count occurrences. Consider this example:
import collections
import pandas as pd
df = pd.DataFrame({'CODE':['E101','E102','E103','E104'],'SEX':['M','F','M','F']})
for key, value in collections.Counter(df['SEX']).items():
    print(key, ":", value)
Output:
M : 2
F : 2
Note I hardcoded the data for simplicity. Explanation: collections.Counter is a dict-like object which accepts an iterable on creation and counts the occurrences in it.
Here is a sample CSV I'm working with
Here is my code:
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[df['productOMS'].str.contains('^[a-z]+')]
    #REGEX to filter anything that would ^ (start with) a letter

inputFile = inputFile
deleteSearchTerm(inputFile)
What I want to do:
Anything in the column ProductOMS that begins with a letter is a row I don't want, so I'm trying to delete those rows based on a condition. I was also trying to use regular expressions, just to get a little more comfortable with them.
I tried to do that with:
df = df[df['productOMS'].str.contains('^[a-z]+')]
where, if a row starts with any lowercase letter, I would drop it (I think).
Please let me know if I need to add anything to my post!
Edit:
Here is a link to a copy of the file I'm working with.
https://drive.google.com/file/d/1Dsw2Ana3WVIheNT43Ad4Dv6C8AIbvAlJ/view?usp=sharing
Another Edit: Here is the dataframe I'm working with
productNum,ProductOMS,productPrice
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
New Edit:
Here's some updated code using an answer
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'], errors='coerce').isnull()]
    print(df)

inputFile = inputFile
deleteSearchTerm(inputFile)
When I run this code and print out the dataframes, it gets rid of the rows that start with 'search'. However, my CSV file is not updating.
The issue here is that you're most likely dealing with mixed data types.
If you just want numeric values, you can use pd.to_numeric:
df = pd.DataFrame({'A' : [0,1,2,3,'a12351','123a6']})
df[~pd.to_numeric(df['A'],errors='coerce').isnull()]
A
0 0
1 1
2 2
3 3
but if you only want to test the first letter, then:
df[~df['A'].astype(str).str.contains('^[a-z]')==True]
A
0 0
1 1
2 2
3 3
5 123a6
Edit: it seems the first solution works, but you need to write this back to your csv? You need to use the to_csv method; I'd recommend you read 10 Minutes to pandas here.
As for your function, let's edit it a little to take a source csv file and write out an edited version; it will save the file to the same location with _edited added on. Feel free to edit/change.
import pandas as pd
from pathlib import Path

def delete_search_term(input_file, column):
    """
    Takes in a file and removes any strings from a given column.

    input_file : path to your file.
    column : column with strings that you want to remove.
    """
    file_path = Path(input_file)
    if not file_path.is_file():
        raise Exception('This file path is not valid')
    df = pd.read_csv(input_file)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df[column], errors='coerce').isnull()]
    print(f"Creating file as:\n{file_path.parent.joinpath(f'{file_path.stem}_edited.csv')}")
    return df.to_csv(file_path.parent.joinpath(f"{file_path.stem}_edited.csv"), index=False)
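A quick end-to-end sketch of the same filter-and-save round trip, using a temporary file so it runs anywhere (the file name and contents are made up for the demo):

```python
import tempfile
from pathlib import Path

import pandas as pd

# build a throwaway CSV resembling the one in the question
tmp_dir = Path(tempfile.mkdtemp())
src = tmp_dir / 'data.csv'
src.write_text(
    'productNum,ProductOMS,productPrice\n'
    '2463448,1002623072,419.95\n'
    '2463434,search?searchTerm=2463434,\n'
)

df = pd.read_csv(src)
# keep only the rows whose ProductOMS coerces to a number
df = df[~pd.to_numeric(df['ProductOMS'], errors='coerce').isnull()]

# write the result next to the source with _edited added on
out = src.with_name(f'{src.stem}_edited.csv')
df.to_csv(out, index=False)
print(pd.read_csv(out))
```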
Solution:
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'], errors='coerce').isnull()]
    print(df)
    return df.to_csv(inputFile)

inputFile = filePath
inputFile = deleteSearchTerm(inputFile)
Data from the source csv as shared at the google drive location:
'''
productNum,ProductOMS,productPrice,Unnamed: 3
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
'''
import pandas as pd
df = pd.read_clipboard(sep=',')
Output:
productNum ProductOMS productPrice Unnamed: 3
0 2463448 1002623072 419.95 NaN
1 2463413 1002622872 289.95 NaN
2 2463430 1002622974 309.95 NaN
3 2463419 1002622908 329.95 NaN
4 2463434 search?searchTerm=2463434 NaN NaN
5 2463423 1002622932 469.95 NaN
df1 = df.loc[df['ProductOMS'].str.isdigit()]
print(df1)
Output:
productNum ProductOMS productPrice Unnamed: 3
0 2463448 1002623072 419.95 NaN
1 2463413 1002622872 289.95 NaN
2 2463430 1002622974 309.95 NaN
3 2463419 1002622908 329.95 NaN
5 2463423 1002622932 469.95 NaN
I hope this helps you:
df = pd.read_csv(filename)
df = df[~df['ProductOMS'].str.contains('^[a-z]+')]
df.to_csv(filename, index=False)
For the most part your function is fine, but you seem to have forgotten to save the CSV, which is done with the df.to_csv() method.
Let me rewrite the code for you:
import pandas as pd

def processAndSaveCSV(filename):
    # Read the CSV file
    df = pd.read_csv(filename)
    # Retain only the rows with `ProductOMS` being numeric
    df = df[df['ProductOMS'].str.contains(r'^\d+')]
    # Save CSV file - rewrites the file
    df.to_csv(filename)
Hope this helps :)
It looks like a scope problem to me.
First we need to return df:
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'], errors='coerce').isnull()]
    print(df)
    return df
Then replace the line
deleteSearchTerm(inputFile)
with:
inputFile = deleteSearchTerm(inputFile)
Basically your function is not returning anything.
After you fix that you just need to redefine your inputFile variable to the new dataframe your function is returning.
If you already defined df earlier in your code and you're trying to manipulate it, then the function is not actually changing your existing global df variable. Instead it's making a new local variable under the same name.
To fix this we first return the local df and then re-assign the global df to the local one.
You should be able to find more information about variable scope at this link:
https://www.geeksforgeeks.org/global-local-variables-python/
It also appears you never actually update your original file.
Try adding this to the end of your code:
df.to_csv('CSV file name', index=True)
index just says whether you want to write a row index column.
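A minimal sketch of the difference, writing to an in-memory buffer instead of a file so the output is easy to see:

```python
import io

import pandas as pd

df = pd.DataFrame({'A': [1, 2]})

with_index = io.StringIO()
df.to_csv(with_index, index=True)   # the row index becomes the first (unnamed) column
print(with_index.getvalue())

without_index = io.StringIO()
df.to_csv(without_index, index=False)  # only the data columns are written
print(without_index.getvalue())
```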
I am struggling to read a simple csv into pandas; the actual problem is that it doesn't split on ",".
import pandas as pd
df = pd.read_csv('C:\\Users\\xxx\\1.csv', header=0, delimiter ="\t")
print(df)
I have tried sep=',' and it does not separate either.
Event," 2016-02-01"," 2016-02-02"," 2016-02-03"," 2016-02-04","
Contact joined,"5","7","18","20",
Launch first time,"30","62","86","110",
It should look like 1 header with dates and 2 rows:
2016-02-01 2016-02-02 etc
0 5 7
1 30 62
UPDATE: Yes, the problem was in the csv itself, with unnecessary quotes and characters.
You seem to be using both delimiter= and sep=, which both do the same thing. If it is actually comma-separated, try:
import pandas as pd
df = pd.read_csv('C:\\Users\\xxx\\1.csv')
print(df)
sep=',' is the default, so it's not necessary to state that explicitly. The same goes for header=0. delimiter= is just an alias for sep=.
You still seem to have a problem with the formatting of your column names. If you post an example of your csv, I can try to fix that...
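Since the update says the problem was stray quotes and characters in the csv itself, here is a sketch of cleaning such a file after reading it (the sample string mimics the data shown in the question):

```python
import io

import pandas as pd

# sample standing in for 1.csv, as shown in the question
raw = (
    'Event," 2016-02-01"," 2016-02-02"\n'
    'Contact joined,"5","7"\n'
    'Launch first time,"30","62"\n'
)

df = pd.read_csv(io.StringIO(raw))
# the quoted headers carry a leading space, so strip the column names
df.columns = [c.strip() for c in df.columns]
print(df)
```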
I have a .csv file having multiple rows and columns:
chain Resid Res K B C Tr Kw Bw Cw Tw
A 1 ASP 1 0 0.000104504 NA 0 0 0.100087974 0.573972285
A 2 GLU 2 627 0.000111832 0 0.033974309 0.004533331 0.107822844 0.441666022
Whenever I open the file using pandas or using with open, it shows that there is only one column and multiple rows:
629 rows x 1 columns
Here is the code I'm using:
data= pd.read_csv("F:/file.csv", sep='\t')
print(data)
and the result I'm getting is this:
A,1,ASP,1,0,0.0001045041279130...
I want the output to be in a dataframe form so that I can carry out future calculations. Any help will be highly appreciated
There is a , separator, so it is possible to omit the sep parameter, because sep=',' is the default separator in read_csv:
data= pd.read_csv("F:/file.csv")
You can read the csv using the following code snippet:
import pandas as pd
data = pd.read_csv('F:/file.csv', sep=',')
Don't use '\t', because the file is comma-separated, not tab-separated; use:
data = pd.read_csv("F:/file.csv")
Or, if you really need a whitespace separator (and your data values may contain single spaces), use:
data = pd.read_csv("F:/file.csv", sep=r'\s{2,}', engine='python')
I have the following csv with the 1st row as header:
A B
test 23
try 34
I want to read this in as a dictionary, so I'm doing this:
dt = pandas.read_csv('file.csv').to_dict()
However, this reads in the header row as the keys. I want the values in column 'A' to be the keys. How do I do that, i.e. get an answer like this:
{'test':'23', 'try':'34'}
dt = pandas.read_csv('file.csv', sep=r'\s+', index_col=0, dtype=str)['B'].to_dict()
Duplicating data:
import pandas as pd
from io import StringIO
data="""
A B
test 23
try 34
"""
df = pd.read_csv(StringIO(data), delimiter='\s+')
Converting to a dictionary:
print(dict(df.values))
Will give:
{'try': 34, 'test': 23}
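Equivalently, once the frame is loaded you can zip the two columns into a dict (a sketch using the same sample data):

```python
import io

import pandas as pd

data = "A B\ntest 23\ntry 34\n"
df = pd.read_csv(io.StringIO(data), sep=r'\s+')

# pair each value of column A with the matching value of column B
dt = dict(zip(df['A'], df['B']))
print(dt)
```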