I am trying to load a CSV file into Python and clean the text, but I keep getting an error. I saved the CSV file in a variable called data_file, and the function below cleans the text and is supposed to return the cleaned data_file.
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
df = pd.read_csv("/Users/yoshithKotla/Desktop/janTweet.csv")
data_file = df
print(data_file)
def cleanTxt(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)   # remove @mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)   # remove hashtags
    text = re.sub(r'RT[\s]+', '', text)         # remove the RT retweet marker
    text = re.sub(r'https?:\/\/\S+', '', text)  # remove hyperlinks
    return text
df['data_file'] = df['data_file'].apply(cleanTxt)
df
I get a KeyError here.
The KeyError comes from the fact that you are trying to apply a function to a column named data_file, but the dataframe df contains no such column. You just created a copy of df in your line data_file = df; that does not create a column called data_file.
To change the column names of your dataframe df use:
df.columns = ['list', 'of', 'values', 'corresponding', 'to', 'your', 'columns']
Then you can apply the function either to the right column or to the whole dataframe.
To apply a function to the whole dataframe, you may want to use the .applymap() method.
EDIT
For clarity's sake:
To print your column names and how many columns your dataframe has:
print(df.columns)
print(len(df.columns))
To modify your column names:
df.columns = ['list', 'of', 'values', 'corresponding', 'to', 'your', 'columns']
To apply your function on a column:
df['your_column_name'] = df['your_column_name'].apply(cleanTxt)
To apply your function to your whole dataframe:
df = df.applymap(cleanTxt)
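Putting it together for the original question — a minimal sketch, assuming the tweet text lives in a column named text (substitute whatever print(df.columns) actually shows):

import pandas as pd

df = pd.read_csv("/Users/yoshithKotla/Desktop/janTweet.csv")
print(df.columns)                         # check the real column names first
df['text'] = df['text'].apply(cleanTxt)   # apply cleanTxt to the actual text column
print(df)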
First time posting here, and I am new to Python. My program should take a JSON file and convert it to CSV. I have to check each field for validity, and for any record that does not have all valid fields, I need to output that record to a file. My question is: how would I take an invalid data entry and save it to a text file? Currently, the program can check for validity, but I do not know how to extract the data that is invalid.
import numpy as np
import pandas as pd
import logging
import re as regex
from validate_email import validate_email
# Variables for characters
passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"
# Read in json file to dataframe df variable
# Read in data as a string
df = pd.read_json('j2.json', dtype=str)
# Find NaN values and replace them with a string
# df = df.replace(np.nan, 'Error.log', regex=True)
# Data validation check for columns
df['accountValid'] = df['account'].str.contains(nameRegex, regex=True)
df['userNameValid'] = df['userName'].str.contains(nameRegex, regex=True)
df['valid_email'] = df['email'].apply(lambda x: validate_email(x))
df['valid_number'] = df['phone'].apply(lambda x: len(str(x)) == 11)
# Prepend 86 to phone number column
df['phone'] = ('86' + df['phone'])
# Convert dataframe to csv file
df.to_csv('test.csv', index=False)
The JSON file I am using has thousands of rows.
Thank you in advance!
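One way to pull out the failing records — a sketch, assuming the four validity columns created above and a hypothetical output file invalid_records.txt:

# A row is valid only if every individual check passed
valid_mask = (df['accountValid'] & df['userNameValid']
              & df['valid_email'] & df['valid_number'])

# Invert the mask to select the invalid records and write them out
df[~valid_mask].to_csv('invalid_records.txt', index=False, sep='\t')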
I would like to create a pandas dataframe out of a list variable.
With pd.DataFrame() I am not able to declare a delimiter, which leads to just one column per list entry.
If I use pd.read_csv() instead, I of course receive the following error:
ValueError: Invalid file path or buffer object type: <class 'list'>
Is there a way to use pd.read_csv() with my list, without first saving the list to a csv and reading the csv file in a second step?
I also tried pd.read_table(), which also needs a file or buffer object.
Example data (separated by tab stops):
Col1 Col2 Col3
12 Info1 34.1
15 Info4 674.1
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
Current workaround:
with open(f'{filepath}tmp.csv', 'w', encoding='UTF8') as f:
    for line in test:
        f.write(line + "\n")
df = pd.read_csv(f'{filepath}tmp.csv', sep='\t', index_col=1)
import pandas as pd
df = pd.DataFrame([x.split('\t') for x in test])
print(df)
If you want the first row as the header, then:
df.columns = df.iloc[0]
df = df[1:]
It seems simpler to convert it to a nested list, as in the other answer:
import pandas as pd
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
data = [line.split('\t') for line in test]
df = pd.DataFrame(data[1:], columns=data[0])
but you can also convert it back to a single string (or get it directly from a file or socket/network as a single string) and then use io.StringIO (or io.BytesIO for bytes) to simulate a file in memory.
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
single_string = "\n".join(test)
file_like_object = io.StringIO(single_string)
df = pd.read_csv(file_like_object, sep='\t')
or shorter
df = pd.read_csv(io.StringIO("\n".join(test)), sep='\t')
This approach is common when you get data from the network (a socket or a web API) as a single string.
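For instance, reading CSV text fetched over HTTP — a sketch, with a hypothetical URL:

import io
import pandas as pd
import requests

response = requests.get('https://example.com/data.tsv')  # hypothetical endpoint
df = pd.read_csv(io.StringIO(response.text), sep='\t')   # parse the response body directly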
How do I skip rows based on a certain value in the first column of the dataset? For example: the first column has some unwanted content in the first few rows, and I want to skip those rows up to a trigger value. Please help me with importing the CSV in Python.
You can achieve this by using the argument skiprows.
Here is sample code below to start with:
import pandas as pd
df = pd.read_csv('users.csv', skiprows=<number of rows to skip at the start>)
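To skip rows up to a trigger value specifically, one option is to locate the trigger first and then re-read — a sketch, assuming a hypothetical users.csv whose first column eventually contains the value 'TRIGGER':

import pandas as pd

raw = pd.read_csv('users.csv', header=None)          # read everything, no header
trigger_idx = raw[raw[0] == 'TRIGGER'].index[0]      # first row holding the trigger value
df = pd.read_csv('users.csv', skiprows=trigger_idx)  # re-read, skipping the rows above it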
For a series of CSV files in a folder, you could use a for loop: read each CSV file, remove the rows containing the string from the df, and lastly concatenate it to df_overall.
Example:
import glob
from pandas import DataFrame, concat, read_csv

df_overall = DataFrame()
dir_path = 'Insert your directory path'
for file_name in glob.glob(dir_path + '*.csv'):
    df = read_csv(file_name, header=None)                      # read the current file
    df = df[~df[<column_name>].str.contains("<your_string>")]  # drop rows containing the string
    df_overall = concat([df_overall, df])                      # append to the running result
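A slightly more efficient variant — a sketch with the same placeholders, assuming the string lives in column 0 — collects the frames in a list and concatenates once at the end, instead of growing df_overall inside the loop:

import glob
import pandas as pd

frames = []
for file_name in glob.glob(dir_path + '*.csv'):
    frame = pd.read_csv(file_name, header=None)
    frames.append(frame[~frame[0].str.contains("<your_string>")])  # keep only clean rows
df_overall = pd.concat(frames, ignore_index=True)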
I'm trying to use NLTK's word_tokenize on an Excel file I've opened as a data frame. The column I want to use word_tokenize on contains sentences. How can I pull that specific column out of my data frame to tokenize it? The name of the column I'm trying to access is "Complaint / Query Detail".
import pandas as pd
from nltk import word_tokenize
file = "List of Complaints.xlsx"
df = pd.read_excel(file, sheet_name="All Complaints")
token = df["Complaint / Query Detail"].apply(word_tokenize)
I tried this method but I keep getting errors.
Try this:
import nltk

df['Complaint / Query Detail'] = df.apply(
    lambda row: nltk.word_tokenize(row['Complaint / Query Detail']), axis=1)
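If word_tokenize raises a LookupError, the punkt tokenizer models are probably missing; downloading them once fixes it:

import nltk
nltk.download('punkt')  # one-time download of the models word_tokenize depends on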
Here is a for loop for tokenizing every column in a dataframe. Pass in the dataframe you loaded from your CSV file as df:
def tokenize_text(df):
    # add a tokenized copy of every column
    for column in df.columns:
        df["tokenized_" + column] = df.apply(lambda row: nltk.word_tokenize(row[column]), axis=1)
    return df

df = tokenize_text(df)
print(df)
I hope it's helpful.
I am translating some code from R to Python to improve performance, but I am not very familiar with the pandas library.
I have a CSV file that looks like this:
O43657,GO:0005737
A0A087WYV6,GO:0005737
A0A087WZU5,GO:0005737
Q8IZE3,GO:0015630 GO:0005654 GO:0005794
X6RHX1,GO:0015630 GO:0005654 GO:0005794
Q9NSG2,GO:0005654 GO:0005739
I would like to split the second column on a delimiter (here, a space), and get the unique values in this column. In this case, the code should return [GO:0005737, GO:0015630, GO:0005654, GO:0005794, GO:0005739].
In R, I would do this using the following code:
df <- read.csv("data.csv")
unique <- unique(unlist(strsplit(df[,2], " ")))
In python, I have the following code using pandas:
df = pd.read_csv("data.csv")
split = df.iloc[:, 1].str.split(' ')
unique = pd.unique(split)
But this produces the following error:
TypeError: unhashable type: 'list'
How can I get the unique values in a column of a CSV file after splitting on a delimiter in python?
setup
from io import StringIO
import pandas as pd
txt = """O43657,GO:0005737
A0A087WYV6,GO:0005737
A0A087WZU5,GO:0005737
Q8IZE3,GO:0015630 GO:0005654 GO:0005794
X6RHX1,GO:0015630 GO:0005654 GO:0005794
Q9NSG2,GO:0005654 GO:0005739"""
s = pd.read_csv(StringIO(txt), header=None, index_col=0).squeeze("columns")  # squeeze=True was removed in pandas 2.0
solution
pd.unique(s.str.split(expand=True).stack())
array(['GO:0005737', 'GO:0015630', 'GO:0005654', 'GO:0005794', 'GO:0005739'], dtype=object)
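On newer pandas, an equivalent route (a sketch over the same Series s) is to split and explode instead of stacking:

unique_values = s.str.split().explode().unique()
print(unique_values)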