I am translating some code from R to Python to improve performance, but I am not very familiar with the pandas library.
I have a CSV file that looks like this:
O43657,GO:0005737
A0A087WYV6,GO:0005737
A0A087WZU5,GO:0005737
Q8IZE3,GO:0015630 GO:0005654 GO:0005794
X6RHX1,GO:0015630 GO:0005654 GO:0005794
Q9NSG2,GO:0005654 GO:0005739
I would like to split the second column on a delimiter (here, a space) and get the unique values in this column. In this case, the code should return [GO:0005737, GO:0015630, GO:0005654, GO:0005794, GO:0005739].
In R, I would do this using the following code:
df <- read.csv("data.csv")
unique <- unique(unlist(strsplit(df[,2], " ")))
In python, I have the following code using pandas:
df = pd.read_csv("data.csv")
split = df.iloc[:, 1].str.split(' ')
unique = pd.unique(split)
But this produces the following error:
TypeError: unhashable type: 'list'
How can I get the unique values in a column of a CSV file after splitting on a delimiter in python?
Setup
from io import StringIO
import pandas as pd
txt = """O43657,GO:0005737
A0A087WYV6,GO:0005737
A0A087WZU5,GO:0005737
Q8IZE3,GO:0015630 GO:0005654 GO:0005794
X6RHX1,GO:0015630 GO:0005654 GO:0005794
Q9NSG2,GO:0005654 GO:0005739"""
s = pd.read_csv(StringIO(txt), header=None, index_col=0).squeeze("columns")  # the squeeze= keyword was removed in pandas 2.0
Solution
pd.unique(s.str.split(expand=True).stack())
array(['GO:0005737', 'GO:0015630', 'GO:0005654', 'GO:0005794', 'GO:0005739'], dtype=object)
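On pandas 0.25 or newer, Series.explode gives an equivalent one-liner; a minimal sketch reusing the series s from the setup above:

unique_terms = s.str.split().explode().unique()  # flatten the split lists into one Series, then deduplicate
print(unique_terms)
# ['GO:0005737' 'GO:0015630' 'GO:0005654' 'GO:0005794' 'GO:0005739']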
Related
I would like to create a pandas dataframe out of a list variable.
With pd.DataFrame() I am not able to declare a delimiter, so each list entry ends up as one row in a single column.
If I use pd.read_csv() instead, I of course receive the following error:
ValueError: Invalid file path or buffer object type: <class 'list'>
Is there a way to use pd.read_csv() with my list, without first saving the list to a CSV and reading that file back in a second step?
I also tried pd.read_table(), which also needs a file or buffer object.
Example data (separated by tab stops):
Col1 Col2 Col3
12 Info1 34.1
15 Info4 674.1
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
Current workaround:
with open(f'{filepath}tmp.csv', 'w', encoding='UTF8') as f:
    for line in consolidated_file:
        f.write(line + "\n")
df = pd.read_csv(f'{filepath}tmp.csv', sep='\t', index_col=1)
import pandas as pd
df = pd.DataFrame([x.split('\t') for x in test])
print(df)
If you want the first row to be the header, then:
df.columns = df.iloc[0]
df = df[1:]
It seems simpler to convert it to a nested list, as in the other answer:
import pandas as pd
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
data = [line.split('\t') for line in test]
df = pd.DataFrame(data[1:], columns=data[0])
but you can also join it back into a single string (or receive it directly as a single string from a file, socket, or network) and then use io.StringIO or io.BytesIO to simulate a file in memory.
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
single_string = "\n".join(test)
file_like_object = io.StringIO(single_string)
df = pd.read_csv(file_like_object, sep='\t')
or, shorter:
df = pd.read_csv(io.StringIO("\n".join(test)), sep='\t')
This method is popular when you receive the data from the network (a socket, a web API) as a single string.
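The same trick works for raw bytes (for example, the body of an HTTP response) via io.BytesIO; a minimal sketch with made-up byte data:

import io
import pandas as pd

raw = b"Col1\tCol2\tCol3\n12\tInfo1\t34.1\n15\tInfo4\t674.1"
df = pd.read_csv(io.BytesIO(raw), sep='\t')  # pandas reads the bytes as if from a file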
I am trying to load a CSV file into Python and clean the text, but I keep getting an error. I saved the CSV file in a variable called data_file, and the function below cleans the text and is supposed to return the cleaned data_file.
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
df = pd.read_csv("/Users/yoshithKotla/Desktop/janTweet.csv")
data_file = df
print(data_file)
def cleanTxt(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)   # remove @mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)   # remove hashtags
    text = re.sub(r'RT[\s]+', '', text)         # remove RT (retweet) markers
    text = re.sub(r'https?:\/\/\S+', '', text)  # remove hyperlinks
    return text
df['data_file'] = df['data_file'].apply(cleanTxt)
df
I get a key error here.
The KeyError comes from the fact that you are trying to apply a function to the column data_file of the dataframe df, which does not contain such a column.
You just created a copy of df in your line data_file = df; that does not add a column named data_file.
To change the column names of your dataframe df use:
df.columns = ['list', 'of', 'values', 'corresponding', 'to', 'your', 'columns']
Then you can either apply the function to the right column or on the whole dataframe.
To apply a function to the whole dataframe you may want to use the .applymap() method (renamed to DataFrame.map in pandas 2.1).
EDIT
For clarity's sake:
To print your column names and the length of your dataframe columns:
print(df.columns)
print(len(df.columns))
To modify your column names:
df.columns = ['list', 'of', 'values', 'corresponding', 'to', 'your', 'columns']
To apply your function on a column:
df['your_column_name'] = df['your_column_name'].apply(cleanTxt)
To apply your function to your whole dataframe:
df = df.applymap(cleanTxt)
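Putting this together for the question's data, a minimal sketch reusing the cleanTxt function from the question; it assumes the CSV holds a single, headerless column of tweets, and the column name 'tweet' is a placeholder:

import pandas as pd

df = pd.read_csv("/Users/yoshithKotla/Desktop/janTweet.csv", header=None, names=["tweet"])  # 'tweet' is a placeholder name
df["tweet"] = df["tweet"].apply(cleanTxt)  # apply the cleaner to that named column
print(df.head())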
I am trying to import data from .dat files.
The files have the following structure (and there are a few hundred for each measurement):
#-G8k5perc
#acf0
4e-07 1.67466
8e-07 1.57061
...
13.4217728 0.97419
&
#fit0
2.4e-06 1.5376
3.2e-06 1.5312
...
13.4 0.99578
&
...
#cnta0
#with g2
#cnta0
0 109.74
0.25 107.97
...
19.75 104.05
#rate0 107.2
I have tried:
1)
df = pd.read_csv("G8k5perc-1.dat")
which only gives one column.
Adding sep=' ', delimiter=' ', or delim_whitespace=True leads to
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
2)
I have seen someone using:
from string import find, rfind, split, strip
which raises ImportError: cannot import name 'find' from 'string' for all four (these module-level functions were removed from the string module in Python 3; they are str methods now).
3)
Creating slices and changing them afterwards won't work either:
acf = df[1:179]
acf["#-G8k5perc"] = acf["#-G8k5perc"].str.split(" ", n=1, expand=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
app.launch_new_instance()
Any ideas on how to get two columns for each set of data (acf0, fit0, etc.) in the files?
You cannot read a .dat file with this structure directly with a CSV reader.
Try the code below to convert it to a CSV first:
import csv
datContent = [i.strip().split() for i in open("./yourdata.dat").readlines()]
with open("./yourdata.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(datContent)
Then try to use pandas to make new columns:
import pandas as pd

def your_func(row):
    return row['x-momentum'] / row['mass']

columns_to_keep = ['#time', 'x-momentum', 'mass']  # replace with the columns in your file
dataframe = pd.read_csv("./yourdata.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)
print(dataframe)
Replace yourdata.csv with your input file name.
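The CSV round-trip above flattens the file and ignores its block structure (the '#name' headers and the '&' terminators). A minimal sketch that instead parses each block into its own two-column DataFrame, under the layout assumptions shown in the question; the column names 'x' and 'y' are placeholders:

import pandas as pd

def read_dat_blocks(path):
    # Collect two-column blocks keyed by their '#name' header line;
    # a '&' line closes the current block. Non-numeric lines are skipped.
    blocks, name, rows = {}, None, []
    def flush():
        if name is not None and rows:
            blocks[name] = pd.DataFrame(rows, columns=["x", "y"])
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line:
                continue
            if line.startswith("#"):
                flush()  # a new header also ends the previous block (covers lines like '#rate0 107.2')
                name, rows = line.lstrip("#").split()[0], []
            elif line == "&":
                flush()
                name, rows = None, []
            else:
                try:
                    rows.append([float(v) for v in line.split()])
                except ValueError:
                    pass  # skip lines that are not two numbers
    flush()
    return blocks

blocks = read_dat_blocks("G8k5perc-1.dat")
print(blocks["acf0"].head())  # each named block is now its own DataFrame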
I'm trying to use NLTK word_tokenize on an excel file I've opened as a data frame. The column I want to use word_tokenize on contains sentences. How can I pull out that specific column from my data frame to tokenize it? The name of the column I'm trying to access is called "Complaint / Query Detail".
import pandas as pd
from nltk import word_tokenize
file = "List of Complaints.xlsx"
df = pd.read_excel(file, sheet_name = "All Complaints" )
token = df["Complaint / Query Detail"].apply(word_tokenize)
I tried this method but I keep getting errors.
Try this:
import nltk

df['Complaint / Query Detail'] = df.apply(
    lambda row: nltk.word_tokenize(row['Complaint / Query Detail']), axis=1)
This is a for loop for tokenizing columns in a dataframe.
Pass in the dataframe you loaded from your CSV file where df goes:
def tokenize_text(df):
    for col in df.columns:
        df["tokenized_" + col] = df.apply(lambda row: nltk.word_tokenize(row[col]), axis=1)
    return df
print(df)
I hope it's helpful.
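Since the question doesn't include the traceback, one common culprit is worth guarding against: word_tokenize raises a TypeError on NaN cells (floats), which Excel imports often contain. A minimal sketch, assuming blank cells should simply tokenize to empty lists:

from nltk import word_tokenize

col = "Complaint / Query Detail"
df["tokens"] = df[col].fillna("").astype(str).apply(word_tokenize)  # 'tokens' is a placeholder column name; NaN becomes ''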
I am loading a txt file containing a mix of float and string data. I want to store them in an array where I can access each element. Right now I am just doing:
import pandas as pd
data = pd.read_csv('output_list.txt', header = None)
print(data)
This is the structure of the input file: 1 0 2000.0 70.2836942112 1347.28369421 /file_address.txt.
Now the data are imported as a single column. How can I split it, so as to store the different elements separately (and access them as data[i,j])? And how can I define a header?
You can use:
data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["a", "b", "c", "etc."]
Add sep=" " in your code, leaving a blank space between the quotes. So pandas can detect spaces between values and sort in columns. Data columns is for naming your columns.
I'd like to add to the above answers: you could directly use
df = pd.read_fwf('output_list.txt')
fwf stands for fixed-width formatted lines.
You can do it like this:
import pandas as pd
df = pd.read_csv(r'file_location\filename.txt', delimiter="\t")
(for example, df = pd.read_csv(r'F:\Desktop\ds\text.txt', delimiter="\t"); the raw-string prefix keeps the backslashes from being treated as escapes)
#Pietrovismara's solution is correct, but I'd just like to add: rather than having a separate line to add column names, it's possible to do this from pd.read_csv:
df = pd.read_csv('output_list.txt', sep=" ", header=None, names=["a", "b", "c"])
You can use this:
import pandas as pd
dataset = pd.read_csv("filepath.txt", delimiter="\t")
If you don't have an index assigned to the data and you are not sure what the spacing is, you can use the following to let pandas assign an index and look for multiple spaces:
df = pd.read_csv('filename.txt', delimiter= '\s+', index_col=False)
Based on the latest changes in pandas, you can use read_csv; read_table is deprecated:
import pandas as pd
pd.read_csv("file.txt", sep = "\t")
If you want to load the txt file with specified column names, you can use the code below. It worked for me.
import pandas as pd
data = pd.read_csv('file_name.txt', sep = "\t", names = ['column1_name','column2_name', 'column3_name'])
You can import the text file using the read_table command like so:
import pandas as pd
df = pd.read_table('output_list.txt', header=None)
Preprocessing will need to be done after loading.
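What that preprocessing looks like depends on the file; a minimal sketch for the space-separated sample above, splitting the single loaded column (the header names are placeholders):

import pandas as pd

df = pd.read_table('output_list.txt', header=None)
df = df[0].str.split(expand=True)            # split the lone column on whitespace
df.columns = ['a', 'b', 'c', 'd', 'e', 'f']  # placeholder header for the six fields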
I usually take a look at the data first, or just try to import it and do data.head(); if you see that the columns are separated with \t, then you should specify sep="\t", otherwise sep=" ".
import pandas as pd
data = pd.read_csv('data.txt', sep=" ", header=None)
You can use the following, which I find most helpful:
df = pd.read_csv('data.txt', sep="\t", skiprows=[0,1], names=['FromNode','ToNode'])