First-time poster here and new to Python. My program should take a JSON file and convert it to CSV. I have to check each field for validity. Records that do not have all valid fields need to be written to a separate file. My question is: how do I take an invalid data entry and save it to a text file? Currently, the program can check for validity, but I do not know how to extract the data that is invalid.
import numpy as np
import pandas as pd
import logging
import re as regex
from validate_email import validate_email
# Variables for characters
passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"
# Read in json file to dataframe df variable
# Read in data as a string
df = pd.read_json('j2.json', dtype=str)
# Find nan values and replace it with string
#df = df.replace(np.nan, 'Error.log', regex=True)
# Data validation check for columns
df['accountValid'] = df['account'].str.contains(nameRegex, regex=True)
df['userNameValid'] = df['userName'].str.contains(nameRegex, regex=True)
df['valid_email'] = df['email'].apply(lambda x: validate_email(x))
df['valid_number'] = df['phone'].apply(lambda x: len(str(x)) == 11)
# Prepend 86 to phone number column
df['phone'] = ('86' + df['phone'])
# Convert dataframe to csv file
df.to_csv('test.csv', index=False)
The JSON file I am using has thousands of rows.
Thank you in advance!
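One possible approach, sketched under assumptions: combine the boolean validity columns into a single mask, then write the rows that fail it to their own file. The small frame, the output file names, and the reduced set of checks below are invented for illustration; the email check is left out so the example runs without validate_email.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the JSON file.
df = pd.DataFrame({
    "account": ["acme-01", "x"],
    "phone": ["13800138000", "12345"],
})

# Same style of checks as in the question (simplified).
name_regex = r"^[a-zA-Z0-9\s\-]{2,80}$"
df["accountValid"] = df["account"].str.contains(name_regex, regex=True)
df["valid_number"] = df["phone"].str.len() == 11

# A row is valid only if every validity column is True.
check_cols = ["accountValid", "valid_number"]
is_valid = df[check_cols].all(axis=1)

valid_rows = df[is_valid]
invalid_rows = df[~is_valid]

# Write each group to its own file (the paths are placeholders).
out_dir = tempfile.mkdtemp()
valid_path = os.path.join(out_dir, "valid.csv")
invalid_path = os.path.join(out_dir, "invalid.txt")
valid_rows.to_csv(valid_path, index=False)
invalid_rows.to_csv(invalid_path, index=False)
```

Because the mask is computed for the whole column at once, this should scale to thousands of rows without an explicit loop.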
I'm using pandas with Excel, and I would like to get the Excel column letter of a header by searching for the column name.
Here's an example:
I would like to do something like this: df.columns.get_loc("SR Status") and have it return "D".
I have already done this:
import pandas as pd
df = pd.read_excel("file.xls")
df.columns.get_loc("SR Status")
Assume the data will NOT always be in the same place.
Sometimes the column might be under header "A", but other times it could be somewhere else.
Thanks in advance.
You can use get_column_letter:
import pandas as pd
from openpyxl.utils import get_column_letter
df = pd.read_excel('data.xlsx', usecols='D:F')
offset = 4 # D
col = get_column_letter(df.columns.get_loc('SR Status') + offset)
print(col) # Output: D
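If the sheet is read starting at column A (no usecols), the offset is simply 1, because get_loc is zero-based while Excel columns are one-based. The sketch below illustrates that arithmetic with an in-memory frame. The helper index_to_letter is a hypothetical stand-in for get_column_letter, written out only so the example runs without openpyxl, and the column names are invented.

```python
import pandas as pd


def index_to_letter(n):
    """Convert a 1-based column number to an Excel letter (1 -> A, 27 -> AA)."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


# Invented headers; "SR Status" happens to be the fourth column here.
df = pd.DataFrame(columns=["ID", "Owner", "Opened", "SR Status"])

# get_loc is zero-based, Excel columns are one-based, hence the + 1.
col = index_to_letter(df.columns.get_loc("SR Status") + 1)
print(col)  # D
```

Since the position is looked up by name each time, the result follows the column wherever it moves in the sheet.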
I am importing a .csv file in Python with pandas.
Here is the file format of the .csv:
a1;b1;c1;d1;e1;...
a2;b2;c2;d2;e2;...
.....
Here is how I read it:
from pandas import *
csv_path = "C:...."
data = read_csv(csv_path)
Now when I print the file I get this:
0 a1;b1;c1;d1;e1;...
1 a2;b2;c2;d2;e2;...
And so on... So I need help reading the file and splitting the values into columns on the semicolon character ;.
read_csv takes a sep parameter; in your case, just pass sep=';' like so:
data = read_csv(csv_path, sep=';')
The reason it failed in your case is that the default separator is ',', so each whole line was read as a single column entry.
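To see the difference without touching a real file, here is a small sketch that parses the same text twice from an in-memory buffer. The sample text is invented, and header=None is an assumption that the file has no header row.

```python
import io

import pandas as pd

raw = "a1;b1;c1\na2;b2;c2\n"

# Default separator is a comma, so each line becomes a single field.
one_col = pd.read_csv(io.StringIO(raw), header=None)

# With sep=';' the values are split into separate columns.
split = pd.read_csv(io.StringIO(raw), sep=";", header=None)

print(one_col.shape)  # (2, 1)
print(split.shape)    # (2, 3)
```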
In response to Morris' question above:
"Is there a way to programatically tell if a CSV is separated by , or ; ?"
This will tell you:
import pandas as pd
df_comma = pd.read_csv(your_csv_file_path, nrows=1, sep=",")
df_semi = pd.read_csv(your_csv_file_path, nrows=1, sep=";")
if df_comma.shape[1] > df_semi.shape[1]:
    print("comma delimited")
else:
    print("semicolon delimited")
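As an alternative to parsing the file twice, the standard library's csv.Sniffer can guess the delimiter from a sample of the text. This is only a sketch: the sample is invented, and the candidate set ",;" is an assumption about which delimiters are possible. pandas can also do the detection itself if you pass sep=None together with engine="python".

```python
import csv
import io

import pandas as pd

raw = "a1;b1;c1\na2;b2;c2\na3;b3;c3\n"

# Let the standard library choose between the candidate delimiters.
dialect = csv.Sniffer().sniff(raw, delimiters=",;")

# Feed the detected delimiter to pandas.
df = pd.read_csv(io.StringIO(raw), sep=dialect.delimiter, header=None)
print(dialect.delimiter, df.shape)
```

For a file on disk, reading the first few kilobytes as the sample is usually enough for the sniffer.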
I am loading a txt file containing a mix of float and string data. I want to store them in an array where I can access each element. For now I am just doing
import pandas as pd
data = pd.read_csv('output_list.txt', header = None)
print(data)
This is the structure of the input file: 1 0 2000.0 70.2836942112 1347.28369421 /file_address.txt.
Now the data are imported as a single column. How can I split it so that the elements are stored separately (so I can call data[i,j])? And how can I define a header?
You can use:
data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["a", "b", "c", "etc."]
Add sep=" " to your code, with a single space between the quotes, so pandas splits each line into columns at the spaces. Assigning to data.columns names your columns; the list must contain exactly one name per column.
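To address the data[i,j] part of the question, here is a minimal sketch of positional and label-based access after the split. The second data line and the column names are invented; only the first line follows the format given in the question.

```python
import io

import pandas as pd

# First line as in the question, plus an invented second line.
raw = (
    "1 0 2000.0 70.2836942112 1347.28369421 /file_address.txt\n"
    "2 1 1500.0 12.5 900.25 /other_address.txt\n"
)

data = pd.read_csv(io.StringIO(raw), sep=" ", header=None)

# Invented names; the list needs exactly one entry per column.
data.columns = ["id", "flag", "x", "y", "z", "path"]

# Row 0, column 2 by position (the pandas counterpart of data[i, j]).
first_x = data.iloc[0, 2]

# Row 1, column "path" by label.
second_path = data.loc[1, "path"]

print(first_x, second_path)
```

If a plain array is really needed, data.to_numpy() returns one, though mixed float and string columns give it object dtype.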
I'd like to add to the above answers: you could directly use
df = pd.read_fwf('output_list.txt')
fwf stands for fixed-width formatted lines.
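You can do it like this:
import pandas as pd
# Raw strings (r'...') keep backslashes in Windows paths from being read as escapes such as \f or \t
df = pd.read_csv(r'file_location\filename.txt', delimiter="\t")
(for example, df = pd.read_csv(r'F:\Desktop\ds\text.txt', delimiter="\t"))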
@Pietrovismara's solution is correct, but I'd just like to add: rather than using a separate line to add column names, you can do this directly in pd.read_csv.
df = pd.read_csv('output_list.txt', sep=" ", header=None, names=["a", "b", "c"])
You can use this:
import pandas as pd
dataset=pd.read_csv("filepath.txt",delimiter="\t")
If you don't have an index assigned to the data and you are not sure what the spacing is, you can pass index_col=False so that pandas assigns an index itself, and use the regular expression \s+ as the delimiter to match runs of whitespace.
df = pd.read_csv('filename.txt', delimiter=r'\s+', index_col=False)
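A quick sketch of why the regular expression helps: with uneven spacing, a single-space separator would produce empty columns, while \s+ treats each run of whitespace as one delimiter. The sample text is invented, and header=None assumes there is no header row.

```python
import io

import pandas as pd

# Columns separated by a varying number of spaces.
raw = "1   0  2000.0\n2 1     1500.0\n"

df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)
print(df.shape)  # (2, 3)
```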
In recent pandas versions you can use read_csv with an explicit separator (read_table was deprecated in an earlier release and has since been reinstated, so both remain available):
import pandas as pd
pd.read_csv("file.txt", sep = "\t")
If you want to load the txt file with specified column names, you can use the code below. It worked for me.
import pandas as pd
data = pd.read_csv('file_name.txt', sep = "\t", names = ['column1_name','column2_name', 'column3_name'])
You can import the text file using the read_table command as so:
import pandas as pd
df=pd.read_table('output_list.txt',header=None)
read_table splits on tabs by default, so a space-separated file still loads as a single column and needs further preprocessing after loading.
I usually take a look at the data first, or just import it and call data.head(). If you see that the columns are separated by \t, specify sep="\t"; otherwise, use sep=" ".
import pandas as pd
data = pd.read_csv('data.txt', sep=" ", header=None)
You can use this, which skips the first two rows and names the columns in the same call:
df = pd.read_csv('data.txt', sep="\t", skiprows=[0,1], names=['FromNode','ToNode'])
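Here is a small sketch of what that call does, using an in-memory buffer in place of data.txt. The two leading comment lines and the node values are invented to stand in for whatever the first two rows of the real file contain.

```python
import io

import pandas as pd

# Two leading lines to skip, followed by tab-separated data.
raw = "# comment line\n# another comment\n0\t1\n0\t2\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    skiprows=[0, 1],               # drop the first two lines of the file
    names=["FromNode", "ToNode"],  # supply the column names directly
)
print(df)
```

Passing names means pandas does not look for a header row in the remaining lines, so every line after the skipped ones is treated as data.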