Extract Invalid Data From Dataframe to a File (.txt) - python

First time posting here, and I'm new to Python. My program should take a JSON file and convert it to CSV. I have to check each field for validity, and for any record that does not have all valid fields, I need to output that record to a file. My question is: how would I take an invalid data entry and save it to a text file? Currently, the program can check for validity, but I do not know how to extract the data that is invalid.
import numpy as np
import pandas as pd
import logging
import re as regex
from validate_email import validate_email
# Variables for characters
passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"
# Read in json file to dataframe df variable,
# forcing each field to be read as a string
df = pd.read_json('j2.json', dtype={'account': str, 'userName': str,
                                    'email': str, 'phone': str})
# Find nan values and replace it with string
#df = df.replace(np.nan, 'Error.log', regex=True)
# Data validation check for columns
df['accountValid'] = df['account'].str.contains(nameRegex, regex=True)
df['userNameValid'] = df['userName'].str.contains(nameRegex, regex=True)
df['valid_email'] = df['email'].apply(lambda x: validate_email(x))
df['valid_number'] = df['phone'].apply(lambda x: len(str(x)) == 11)
# Prepend 86 to phone number column
df['phone'] = ('86' + df['phone'])
# Convert dataframe to csv file
df.to_csv('test.csv', index=False)
The JSON file I am using has thousands of rows.
Thank you in advance!
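One way to extract the failing rows is to combine the validity flags into a single mask and use boolean indexing. This is a hedged sketch using a small hypothetical DataFrame in place of j2.json (only the account, userName, and phone columns from the question, and hypothetical file names errors.txt):

```python
import pandas as pd

nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"

# Hypothetical sample rows standing in for j2.json
df = pd.DataFrame({
    "account": ["acme-corp", "??"],
    "userName": ["alice", "bob"],
    "phone": ["12345678901", "123"],
})

# Same style of validity flags as in the question
df["accountValid"] = df["account"].str.contains(nameRegex, regex=True)
df["userNameValid"] = df["userName"].str.contains(nameRegex, regex=True)
df["valid_number"] = df["phone"].apply(lambda x: len(str(x)) == 11)

# A row is invalid if any flag is False; ~ negates the combined mask
checks = ["accountValid", "userNameValid", "valid_number"]
invalid = df[~df[checks].all(axis=1)]
valid = df[df[checks].all(axis=1)]

# Write the failing records to a text file and the rest to CSV
invalid.to_csv("errors.txt", sep="\t", index=False)
valid.to_csv("test.csv", index=False)
```

`to_csv` works for plain text files as well; `sep="\t"` simply writes tab-separated lines, which keeps the output readable in any text editor.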

Related

Unable to clean the csv file in python

I am trying to load a CSV file into Python and clean the text, but I keep getting an error. I saved the CSV file in a variable called data_file, and the function below cleans the text and is supposed to return the cleaned data_file.
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
df = pd.read_csv("/Users/yoshithKotla/Desktop/janTweet.csv")
data_file = df
print(data_file)
def cleanTxt(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)    # removes @mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)    # removes hashtags
    text = re.sub(r'RT[\s]+', '', text)          # removes retweet markers
    text = re.sub(r'https?:\/\/\S+', '', text)   # removes links
    return text
df['data_file'] = df['data_file'].apply(cleanTxt)
df
I get a key error here.
The KeyError comes from the fact that you are applying a function to a column named data_file, and the dataframe df does not contain such a column.
The line data_file = df just binds a second name to the same dataframe; it does not create a column called data_file.
To change the column names of your dataframe df use:
df.columns = [list,of,values,corresponding,to,your,columns]
Then you can either apply the function to the right column or on the whole dataframe.
To apply a function on the whole dataframe you may want to use the .applymap() method.
EDIT
For clarity's sake:
To print your column names and the length of your dataframe columns:
print(df.columns)
print(len(df.columns))
To modify your column names:
df.columns = [list,of,values,corresponding,to,your,columns]
To apply your function on a column:
df['your_column_name'] = df['your_column_name'].apply(cleanTxt)
To apply your function to your whole dataframe:
df = df.applymap(cleanTxt)
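Putting those steps together, a minimal sketch (with a hypothetical one-column DataFrame standing in for janTweet.csv, and a hypothetical column name tweets) looks like this:

```python
import re
import pandas as pd

def cleanTxt(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)    # remove @mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)    # remove hashtags
    text = re.sub(r'RT[\s]+', '', text)          # remove retweet markers
    text = re.sub(r'https?:\/\/\S+', '', text)   # remove links
    return text

# Hypothetical one-column stand-in for janTweet.csv
df = pd.DataFrame({"tweets": ["RT @user: hello #python https://t.co/x"]})

print(df.columns)  # check the real column name before applying
df["tweets"] = df["tweets"].apply(cleanTxt)
```

Printing df.columns first avoids the KeyError entirely: you apply the function to a column name that actually exists.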

How do I filter out only those reviews from excel file that have the specific words I mention in the code?

I have an Excel file loaded into pandas and want to filter out the movie reviews that contain a specific word I mention in the code. I have written some code, but it is not giving complete and correct output.
import pandas as pd
import numpy as np
# reading data file
df = pd.read_excel('./data-for-group-project1(original).xlsx')
# preparing data
# reading data file replacing any NaN value in the data
df = df.replace(np.nan, '', regex=True)
# filtering data
df = df[df['reviewcontent'].str.contains('jamaican')]
# displaying
print(df[['rid', 'user_name', 'rating', 'numberofuseful', 'reviewcontent']])
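One likely reason for incomplete output is that str.contains is case-sensitive by default, so 'jamaican' misses 'Jamaican'. A hedged sketch with hypothetical sample rows (standing in for the Excel file, using the rid and reviewcontent columns from the question):

```python
import pandas as pd

# Hypothetical reviews standing in for the Excel data
df = pd.DataFrame({
    "rid": [1, 2, 3],
    "reviewcontent": ["Great Jamaican food!", "Average pasta.", "jamaican vibes"],
})

# case=False matches 'Jamaican', 'JAMAICAN', etc.;
# na=False treats missing values as non-matches instead of raising
mask = df["reviewcontent"].str.contains("jamaican", case=False, na=False)
filtered = df[mask]
```
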

Losing "+" in phone numbers , when converting csv to txt in Pandas

Phone number "+49374425070" is getting converted to "49374425070.0" when I try to use to_csv:
source.to_csv('CVV' + '_source.txt', sep='\t', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE)
import pandas as pd
import csv
#Source
source = pd.read_csv('source2.csv') #source csv name
source.to_csv('CVV' + '_source.txt', sep='\t', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE)  # float_format='%.0f'
print('archive.txt and source.txt generated')
You want to read the column as a string rather than letting pandas infer a numeric type, which is what drops the leading "+". To do this, use something like:
pd.read_csv('sample.csv', dtype={'phone': str})
This will work in Pandas >= 0.9.1.
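A small demonstration, using an in-memory string as a hypothetical stand-in for source2.csv:

```python
import io
import pandas as pd

# Hypothetical stand-in for source2.csv
raw = "name,phone\nAnna,+49374425070\n"

# dtype={'phone': str} keeps the column verbatim instead of
# letting pandas parse it as a number and drop the '+'
source = pd.read_csv(io.StringIO(raw), dtype={"phone": str})
print(source["phone"].iloc[0])  # +49374425070

# The '+' then survives the round trip to a tab-separated text file
out = source.to_csv(sep="\t", index=False)
```
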

How to split a column into multiple columns? [duplicate]

I am importing a .csv file in Python with pandas.
Here is the file format from the .csv :
a1;b1;c1;d1;e1;...
a2;b2;c2;d2;e2;...
.....
Here is how I get it:
from pandas import *
csv_path = "C:...."
data = read_csv(csv_path)
Now when I print the file I get this:
0 a1;b1;c1;d1;e1;...
1 a2;b2;c2;d2;e2;...
And so on. So I need help to read the file and split the values into columns, using the semicolon character ;.
read_csv takes a sep param, in your case just pass sep=';' like so:
data = read_csv(csv_path, sep=';')
The reason it failed in your case is that the default value is ',' so it scrunched up all the columns as a single column entry.
In response to Morris' question above:
"Is there a way to programatically tell if a CSV is separated by , or ; ?"
This will tell you:
import pandas as pd
df_comma = pd.read_csv(your_csv_file_path, nrows=1, sep=",")
df_semi = pd.read_csv(your_csv_file_path, nrows=1, sep=";")
if df_comma.shape[1] > df_semi.shape[1]:
    print("comma delimited")
else:
    print("semicolon delimited")

