How to store integers as strings in a CSV file with Python

When a number is saved as a string in a CSV file, and the file is then saved and read back, the value comes back as numpy.int64 instead of a string. How can this be solved so that reading the CSV file returns a string, not an int?
Here is a Python script that demonstrates the case:
import pandas as pd
df = pd.DataFrame(columns=['ID'])
ID = '1'
df = df.append(pd.DataFrame([[ID]], columns=['ID']))
df.to_csv('test.csv', index=False)
"""
now the csv file looks like this:
ID
1
"""
df = pd.read_csv('test.csv')
print(df['ID'].iloc[0] == ID) # this will print False
print(type(df['ID'].iloc[0])) # this will print <class 'numpy.int64'>

The CSV file format doesn't distinguish between different data types. You have to specify the data type when reading the CSV with pandas.
df = pd.read_csv("test.csv", dtype=str)
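If only some columns should stay as strings, read_csv also accepts a per-column mapping; a minimal sketch, reusing the test.csv written above:
import pandas as pd
# read only the ID column as a string and let pandas infer the rest
df = pd.read_csv("test.csv", dtype={"ID": str})
print(type(df['ID'].iloc[0]))   # <class 'str'>
print(df['ID'].iloc[0] == '1')  # True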

Related

Extract Invalid Data From Dataframe to a File (.txt)

First time posting here and new to Python. My program should take a json file and convert it to csv. I have to check each field for validity. For a record that does not have all valid fields, I need to output those records to a file. My question is, how would I take an invalid data entry and save it to a text file? Currently, the program can check for validity, but I do not know how to extract the data that is invalid.
import numpy as np
import pandas as pd
import logging
import re as regex
from validate_email import validate_email
# Variables for characters
passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"
# Read in json file to dataframe df variable
# Read in data as a string
df = pd.read_json('j2.json', dtype={'account': str, 'userName': str, 'email': str, 'phone': str})
# Find nan values and replace it with string
#df = df.replace(np.nan, 'Error.log', regex=True)
# Data validation check for columns
df['accountValid'] = df['account'].str.contains(nameRegex, regex=True)
df['userNameValid'] = df['userName'].str.contains(nameRegex, regex=True)
df['valid_email'] = df['email'].apply(lambda x: validate_email(x))
df['valid_number'] = df['phone'].apply(lambda x: len(str(x)) == 11)
# Prepend 86 to phone number column
df['phone'] = ('86' + df['phone'])
# Convert dataframe to csv file
df.to_csv('test.csv', index=False)
The json file I am using has thousands of rows
Thank you in advance!
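One way to extract the failing rows, sketched against the validity columns the script above already builds (the output file name invalid_records.txt is only an assumption):
# hypothetical extraction step: a row is invalid if any check failed
checks = ['accountValid', 'userNameValid', 'valid_email', 'valid_number']
valid_all = df[checks].fillna(False).all(axis=1)
invalid_rows = df[~valid_all]
# write the invalid records to a plain-text file
invalid_rows.to_csv('invalid_records.txt', sep='\t', index=False)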

Handle variable as file with pandas dataframe

I would like to create a pandas dataframe out of a list variable.
With pd.DataFrame() I am not able to declare a delimiter, which leads to just one column per list entry.
If I use pd.read_csv() instead, I of course receive the following error
ValueError: Invalid file path or buffer object type: <class 'list'>
Is there a way to use pd.read_csv() with my list without first saving the list to a csv and reading that csv file in a second step?
I also tried pd.read_table(), which also needs a file or buffer object.
Example data (separated by tab stops):
Col1 Col2 Col3
12 Info1 34.1
15 Info4 674.1
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
Current workaround:
with open(f'{filepath}tmp.csv', 'w', encoding='UTF8') as f:
    [f.write(line + "\n") for line in consolidated_file]
df = pd.read_csv(f'{filepath}tmp.csv', sep='\t', index_col=1)
import pandas as pd
df = pd.DataFrame([x.split('\t') for x in test])
print(df)
and if you want the first row as the header, then
df.columns = df.iloc[0]
df = df[1:]
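Note that splitting this way leaves every cell as a string; if the numeric columns should be numeric again, they can be converted afterwards, for example:
# convert the numeric columns back from strings (column names from the example above)
df['Col1'] = pd.to_numeric(df['Col1'])
df['Col3'] = pd.to_numeric(df['Col3'])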
It seems simpler to convert it to a nested list, as in the other answer,
import pandas as pd
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
data = [line.split('\t') for line in test]
df = pd.DataFrame(data[1:], columns=data[0])
but you can also convert it back to a single string (or get it directly from a file or a socket/network as a single string) and then use io.BytesIO or io.StringIO to simulate a file in memory.
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
single_string = "\n".join(test)
file_like_object = io.StringIO(single_string)
df = pd.read_csv(file_like_object, sep='\t')
or shorter
df = pd.read_csv(io.StringIO("\n".join(test)), sep='\t')
This method is popular when you get the data from a network source (socket, web API) as a single string.
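For completeness, the io.BytesIO variant mentioned above works the same way once the string is encoded to bytes:
df = pd.read_csv(io.BytesIO("\n".join(test).encode('utf-8')), sep='\t')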

Python pandas xlsx / csv

I want to convert xlsx to csv and it works, but after the conversion Python adds ".0" to the values...
Sample xlsx :
Name, Age
Mark, 20
CSV after conversion:
Name, Age
Mark, 20.0 <- ".0" added
What could the problem be?
#importing pandas as pd
import pandas as pd
# Read and store content
# of an excel file
read_file = pd.read_excel ("EXPORT.xlsx")
# Write the dataframe object
# into csv file
read_file.to_csv ("data.csv",
index = True,
header=True,
encoding='utf-8-sig')
# read csv file and convert
# into a dataframe object
df = pd.DataFrame(pd.read_csv("data.csv"))
# show the dataframe
df
I've tried to reproduce this behavior, but in my case pd.read_excel() automatically assigned the int64 dtype to the Age column with the presented Excel sheet.
However, this case can easily be solved with the df.astype() function, which can transform data types, e.g. in your case from floating point to integer format.
#importing pandas as pd
import pandas as pd
# Read and store content
# of an excel file
read_file = pd.read_excel ("EXPORT.xlsx")
# transform data type of column "Age" to int64
read_file = read_file.astype({'Age': 'int64'})
# Write the dataframe object
# into csv file
read_file.to_csv ("data.csv",
index = True,
header=True,
encoding='utf-8-sig')
# read csv file and convert
# into a dataframe object
df = pd.DataFrame(pd.read_csv("data.csv"))
# show the dataframe
print(df)
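If the Age column contains missing values, astype({'Age': 'int64'}) will raise an error because NaN cannot be stored in a plain integer column; in that case the nullable integer dtype 'Int64' (note the capital I) can be used instead, assuming a reasonably recent pandas version:
# keep integers while allowing missing values
read_file = read_file.astype({'Age': 'Int64'})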
I added the float_format option and it seems to work:
read_file.to_csv ("basf.csv",
index = None,
header=True,
encoding='utf-8-sig',
decimal=',',
float_format='%d'
)

Converting date and time format when importing csv file in Python

I haven't been able to find a solution in similar questions yet so I'll have to give it a go here.
I am importing a csv file looking like this in notepad:
",""ItemName"""
"Time,""Raw Values"""
"7/19/2019 10:31:29 PM,"" 0"","
"7/19/2019 10:32:01 PM,"" 1"","
What I want when I save it as a new csv is to reformat the date/time and the corresponding value to this (required by the analysis software). The semicolon as a separator and at the end is important, and I don't really need a header.
2019-07-19 22:31:29;0;
2019-07-19 22:32:01;1;
This is what it looks like in Python:
Item1 = pd.read_csv(r'.\Datafiles\ItemName.csv')
Item1
#Output:
# ,"ItemName"
# 0 Time,"Raw Values"
# 1 7/19/2019 10:31:29 AM," 0",
# 2 7/19/2019 10:32:01 AM," 1",
valve_G1.dtypes
# ,"ItemName" object
# dtype: object
I have tried using datetime without any luck but there might be something fishy with the datatypes that I am not aware of.
What you want in principle is to read into a DataFrame, convert the datetime column, and export the df to csv again. I think you will need to get rid of the quote characters to get the import correct. You can do so by reading the file content into a string, replacing the '"' characters, and feeding that string to pandas.read_csv. EX:
import os
from io import StringIO
import pandas as pd
# this is just to give an example:
s='''",""ItemName"""
"Time,""Raw Values"""
"7/19/2019 10:31:29 PM,"" 0"","
"7/19/2019 10:32:01 PM,"" 1"","'''
f = StringIO(s)
# in your script, make f a file pointer instead, e.g.
# with open('path_to_input.csv', 'r') as f:
# now get rid of the "
csvcontent = ''
for row in f:
    csvcontent += row.replace('"', '')
# read to DataFrame
df = pd.read_csv(StringIO(csvcontent), sep=',', skiprows=1, index_col=False)
df['Time'] = pd.to_datetime(df['Time'])
# save cleaned output as ;-separated csv
dst = 'path_where_to_save.csv'
df.to_csv(dst, index=False, sep=';', line_terminator=';'+os.linesep)
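Since the question says the header is not needed, header=False can also be passed to to_csv; a minimal variant of the last call (same dst and separator):
df.to_csv(dst, index=False, header=False, sep=';', line_terminator=';' + os.linesep)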

How to export DataFrame to_json in append mode - Python Pandas?

I have an existing json file in the format of a list of dicts.
$cat output.json
[{'a':1, 'b':2}, {'a':2, 'b':3}]
And I have a DataFrame
df = pd.DataFrame({'a': pd.Series([1,2], index=list('CD')), \
                   "b": pd.Series([3,4], index=list('CD'))})
I want to save "df" with to_json to append it to file output.json:
df.to_json('output.json', orient='records') # mode='a' not available for to_json
* There is append mode='a' for to_csv, but not for to_json really.
The expected generated output.json file will be:
[{'a':1, 'b':2}, {'a':2, 'b':3}, {'a':1, 'b':3}, {'a':2, 'b':4}]
The existing file output.json can be huge (say terabytes); is it possible to append the new dataframe result without loading the whole file?
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_json.html
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_csv.html
You could do this. It will write each record/row as JSON on a new line.
f = open(outfile_path, mode="a")
for chunk_df in data:
    f.write(chunk_df.to_json(orient="records", lines=True))
f.close()
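A file written this way is in JSON Lines format (one JSON object per line), so it can be read back with the lines flag, for example:
df = pd.read_json(outfile_path, lines=True)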
No, you can't append to a json file without re-writing the whole file using pandas or the json module. You might be able to modify the file "manually" by opening the file in read/write mode, seeking to the correct position and inserting your data. I wouldn't recommend this, though. Better to just use a file format other than json if your file is going to be larger than your RAM.
This answer also might help. It doesn't create valid json files (instead each line is a json string), but its goal is very similar to yours.
Maybe you need to think in terms of orient='records':
def to_json_append(df, file):
    '''
    Load the file with
    pd.read_json(file, orient='records', lines=True)
    '''
    df.to_json('tmp.json', orient='records', lines=True)
    # append
    f = open('tmp.json', 'r')
    k = f.read()
    f.close()
    f = open(file, 'a')
    f.write('\n')  # Prepare next data entry
    f.write(k)
    f.close()
df=pd.read_json('output.json')
#Save again as lines
df.to_json('output.json',orient='records',lines=True)
#new data
df = pd.DataFrame({'a': pd.Series([1,2], index=list('CD')), \
                   "b": pd.Series([3,4], index=list('CD'))})
#append:
to_json_append(df,'output.json')
To load the full data:
pd.read_json('output.json',orient='records',lines=True)
I've solved it just by using built-in pandas.DataFrame methods. You need to keep performance in mind in case of huge dataframes (there are ways to deal with that).
Code:
if os.path.isfile(dir_to_json_file):
    # if it exists, open and read it
    df_read = pd.read_json(dir_to_json_file, orient='index')
    # add the data that you want to save
    df_read = pd.concat([df_read, df_to_append], ignore_index=True)
    # in case of adding too much unnecessary data (if you need it)
    df_read.drop_duplicates(inplace=True)
    # save it to the json file in AppData.bin
    df_read.to_json(dir_to_json_file, orient='index')
else:
    df_to_append.to_json(dir_to_json_file, orient='index')
Use case: write a big amount of data to a JSON file with little memory.
Let's say we have 1,000 dataframes, and each dataframe is about 1,000,000 lines of JSON. Each dataframe needs 100MB, so the total file size would be 1000 * 100MB = 100GB.
Solution:
use a buffer to store the content of each dataframe
use pandas to dump it to text
use append mode to write the text to the end of the file
import io
import pandas as pd
from pathlib_mate import Path
n_lines_per_df = 10
n_df = 3
columns = ["id", "value"]
value = "alice#example.com"
f = Path(__file__).change(new_basename="big-json-file.json")
if not f.exists():
    for nth_df in range(n_df):
        data = list()
        for nth_line in range(nth_df * n_lines_per_df, (nth_df + 1) * n_lines_per_df):
            data.append((nth_line, value))
        df = pd.DataFrame(data, columns=columns)
        buffer = io.StringIO()
        df.to_json(
            buffer,
            orient="records",
            lines=True,
        )
        with open(f.abspath, "a") as file:
            file.write(buffer.getvalue())
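To read such a file back without loading all of it at once, the JSON Lines output can again be streamed in chunks (the chunksize used here is arbitrary):
# iterate over the file chunk by chunk instead of loading everything into memory
n_rows = 0
for chunk in pd.read_json(f.abspath, lines=True, chunksize=n_lines_per_df):
    n_rows += len(chunk)
print(n_rows)  # 30 with the small example numbers above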
