Extracting values from a csv file - python

So I'm working on taking a txt file and converting it into a csv data table.
I have managed to convert the data into a csv file and put it into a table, but I have a problem with extracting the numbers. In the data table that I made, it's giving me text as well as the value (intensity = 12345).
How do I only put the numerical values into the table?
I tried using regular expressions, but I couldn't get them to work. I would also like to delete all the lines that contain saturated, fragmented and merged. I initially wrote code that deletes every odd-numbered line, but this code will be used for several files, and the odd lines in other files might hold different data. How would I go about doing that?
This is the code that I currently have, plus a picture of what the output looks like.
import pandas as pd
parameters = pd.read_csv("ScanHeader1.txt", header=None)
parameters.columns = ['Packet Number', 'Intensity','Mass/Position']
parameters.to_csv('ScanHeader1.csv', index=None)
df = pd.read_csv('ScanHeader1.csv')
print(df)
I would really appreciate some tips or pointers on how I can do this. Thanks :)

You can try this:
def fun_eq(x):
    # Keep the text after ' = '
    return x.split(' = ')[1]
def fun_hash(x):
    # Keep the text after ' # '
    return x.split(' # ')[1]
# Keep every other row, starting with the first; copy() avoids a SettingWithCopyWarning
df = df.iloc[::2].copy()
df['Intensity'] = df['Intensity'].apply(fun_eq)
df['Mass/Position'] = df['Mass/Position'].apply(fun_eq)
df['Packet Number'] = df['Packet Number'].apply(fun_hash)
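If dropping rows by position is too fragile for other files, the rows can be filtered by keyword instead, and a regular expression can pull out just the number. This is only a sketch: the sample cells below are made up, and it assumes that unwanted lines mention saturated, fragmented or merged somewhere in the row and that each remaining cell contains exactly one number worth keeping.

```python
import pandas as pd

# Hypothetical sample mimicking the layout described in the question.
df = pd.DataFrame({
    "Packet Number": ["Packet # 1", "saturated = 0", "Packet # 2", "fragmented = 0"],
    "Intensity": ["Intensity = 12345", "merged = 0", "Intensity = 678.5", "merged = 0"],
    "Mass/Position": ["Mass/Position = 100.25", "x", "Mass/Position = 200.5", "x"],
})

# Drop every row in which any cell mentions one of the unwanted keywords.
pattern = "saturated|fragmented|merged"
flagged = df.apply(lambda col: col.str.contains(pattern, case=False, na=False)).any(axis=1)
clean = df[~flagged].copy()

# Keep only the first number (integer or decimal) found in each cell.
for col in clean.columns:
    clean[col] = pd.to_numeric(clean[col].str.extract(r"(-?\d+(?:\.\d+)?)")[0])

print(clean)
```

Filtering on content rather than on `iloc[::2]` means the same script should keep working when the unwanted lines are not on every second row.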

Related

Data cleanup in Python, removing CSV rows based on a condition

I've come across a bit of a challenge where I need to sanitize data in a CSV file based on the following criteria:
If the data exists with a date, remove the one with an NA value from the file;
If it is a duplicate, remove it; and
If the data exists only on its own, leave it alone.
I am currently able to do both 2 and 3; however, I am struggling to write a condition that captures criterion 1.
Sample CSV File
Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
Server_B,Test,NA,NA
Current Code So Far
import csv
input_file = 'sample.csv'
output_file = 'completed_output.csv'
with open(input_file, 'r') as inputFile, open(output_file, 'w') as outputFile:
    seen = set()
    for line in inputFile:
        if line in seen:
            continue
        seen.add(line)
        outputFile.write(line)
Currently, this helps with duplicates and capturing the unique values. However, I cannot work out the best way to remove the NA row for a server that also appears with a date.
Because the set type is unordered, I wasn't sure of the best way to compare rows by column and then filter down from there.
Any suggestions or solutions that could help me would be greatly appreciated.
Current Output So Far
Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
Expected Output
Name,Environment,Available,Date
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
You can use pandas instead of doing all of that manually. I have written a short function called custom_filter which takes the criteria into consideration.
One potential source of bugs is the missing-value check: comparing with == pd.NA never evaluates to True, so the function uses pd.notna instead.
import pandas as pd
df = pd.read_csv('sample.csv')  # the literal "NA" is parsed as a missing value by default
df = df.drop_duplicates()
data_present = []  # names that have at least one dated row
def custom_filter(x):
    if pd.notna(x['Date']):
        data_present.append(x['Name'])
        return True
    elif x['Name'] not in data_present:
        return True
    else:
        return False
# Missing dates sort last, so dated rows are seen before their NA counterparts
df = df.sort_values('Date')
df = df[df.apply(custom_filter, axis=1)]
df.to_csv('completed_output.csv', index=False, na_rep='NA')
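The same three rules can also be expressed without a row-wise function or shared state. This is a sketch, assuming the columns are named as in the sample file and relying on `read_csv` parsing the literal `NA` as a missing value by default; the CSV text is inlined only so that the example runs on its own.

```python
import io
import pandas as pd

sample = """Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
Server_B,Test,NA,NA
"""

df = pd.read_csv(io.StringIO(sample)).drop_duplicates()

# True for every row whose server has at least one dated row.
has_date = df["Date"].notna().groupby(df["Name"]).transform("any")

# Keep dated rows, plus NA rows for servers that never have a date.
result = df[df["Date"].notna() | ~has_date]
print(result.to_csv(index=False, na_rep="NA"))
```

Because nothing is sorted, the rows stay in their original file order.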

how to read data (using pandas?) so that it is correctly formatted?

I have a txt file with following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved in one line. A CSV file is not available.
I would like to have it as a data frame in pandas. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep = ",")
the output of print(df) is something like [0 rows x 3455.. columns]
So currently everything is read in as one line. However, I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). I would appreciate any tips, thank you very much.
Is a pandas dataframe even suitable for this? The text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the python code used to read it in:
import json
import pandas as pd
# Read test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f:
    data = json.load(f)
# Get the first element of results list, and first element of series list
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']
df = pd.DataFrame(values, columns=columns)
print(df)
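If the real data holds more than one entry under results or series, the loop mentioned in the comments could look roughly like this. It is a sketch under assumptions: the helper name `series_to_frames` and the extra `series_name` column are made up for illustration, and the inlined JSON is a shortened version of the posted file with only three of the columns.

```python
import json
import pandas as pd

def series_to_frames(data):
    """Build one DataFrame per series found anywhere under 'results'."""
    frames = []
    for result in data.get("results", []):
        for series in result.get("series", []):
            frame = pd.DataFrame(series["values"], columns=series["columns"])
            frame["series_name"] = series.get("name")  # remember where rows came from
            frames.append(frame)
    return frames

# Shortened stand-in for the posted file.
raw = (
    '{"results":[{"statement_id":0,"series":[{"name":"datalogger",'
    '"columns":["time","SNR","Vibra0_X"],'
    '"values":[["2017-10-06T08:50:25.347Z","41762721",-196],'
    '["2017-10-06T08:50:25.348Z","41762721",null]]}]}]}'
)
data = json.loads(raw)

df = pd.concat(series_to_frames(data), ignore_index=True)
print(df)
```

JSON `null` values end up as missing values in the resulting frame.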

Can't read .txt file with pandas because it's in a weird shape

I have a data set that contains information from an experiment about particles. You can find it here (I hope links are OK; if not, let me know and I'll remove it immediately):
http://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification
I am trying to read this set with pandas, and the problem is that pandas reads the txt file as a data frame with 130.064 lines, which is correct, but only 1 column. If you check the txt file in the link, you will see that it is in a weird shape, with spaces at the beginning of each line and then 2 spaces between each column.
I tried the command
df = pd.read_csv("path/file.txt", header = None)
and also
df = pd.read_csv("path/file.txt", sep = "  ", header = None)
where I set 2 spaces as the separator. Nothing works. The file also, in the 1st line, has 2 numbers that just represent the number of rows, which I deleted. For someone who can't/doesn't want to open the link or the data set, here is a picture of some columns.
This is just a portion of it and not the whole data. In the leftmost side, there are 2 spaces between the edge of the window and the first column, as I said. When reading it using pandas this is what I get
Any advice/help would be appreciated. Thanks
EDIT
I tried doing the following and I think it worked. First I imported the .txt file using NumPy, after deleting the first row from the data frame which contains the two irrelevant numbers.
df1 = np.loadtxt("path/file.txt")
This, for some reason, worked and the resulting array was correct. Then I converted this array to data frame using the command
df = pd.DataFrame(df1)
df.columns = ['X' + str(x) for x in range(50) ]
And yeah, I think it works. Check the following picture.
I think it's correct, but if you find something wrong, let me know.
Edited
columns = ['Obs' + str(i) for i in range(1, 51)]
# names= sets the column labels; sep=r"\s+" treats any run of whitespace as one delimiter
df = pd.read_csv("path/file.txt", sep=r"\s+", names=columns, skiprows=1)
You could try creating the dataframe from lists instead of the txt file, something like the following:
import pandas as pd
# Put all the lines in a list
with open("dataset.txt") as fp:
    data = fp.read().split('\n')
df_data = []
for item in data:
    if item.strip():  # skip blank lines
        # split() with no argument splits on any run of whitespace (1 or 2 spaces)
        df_data.append(item.split())
# df_data should be something like [[row1col1, row1col2, row1col3], [row2col1, row2col2, row2col3]]
# List to dataframe
df = pd.DataFrame(df_data)
I am writing this from memory, so watch out for syntax errors. Hope this helps!
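For a file laid out like this, `read_csv` can also do the splitting itself when the separator is the regular expression `\s+`, which matches any run of whitespace and ignores leading spaces. A minimal sketch, assuming a first line that holds the two row counts; the numbers below are invented and only three columns are shown:

```python
import io
import pandas as pd

# Made-up excerpt: a count line, then rows with leading and double spaces.
text = (
    "  36499 93565\n"
    "  2.59413  0.468803  20.6916\n"
    "  3.86388  0.645781  18.1375\n"
)

# skiprows=1 drops the count line; header=None because the file has no column names.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, skiprows=1)
df.columns = ["Obs" + str(i) for i in range(1, df.shape[1] + 1)]
print(df)
```

With this approach the values are parsed as floats directly, and the count line does not have to be deleted from the file by hand.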

Parsing Dirty Text File with Pandas Header Issue

I am trying to parse a text file created back in '99 that is slightly difficult to deal with. The headers are in the first row and are delimited by '^' (the entire file is ^ delimited). The issue is that there are characters that appear to be thrown in; long runs of spaces, for example, appear to separate the headers from the rest of the data points in the file. Example files are at https://www.chicagofed.org/applications/bhc/bhc-home (my example references Q3 1999).
Issues:
1) Too many headers to manually create them and I need to do this for many files that may have new headers as we move forward or backwards throughout the time series
2) I need to recreate the headers from the file and then remove them so that I don't pollute my entire first row with header duplicates. I realize I could probably slice the dataframe [1:] after the fact and just get rid of it, but that's sloppy and I'm sure there's a better way.
3) The unreported fields by company appear to show up as "^^^^^^^^^", which is fine, but will pandas automatically populate NaNs in that scenario?
My attempt below is simply trying to isolate the headers, but I'm really stuck on the larger issue of the way the text file is structured. Any recommendations or obvious easy tricks I'm missing?
from zipfile import ZipFile
import pandas as pd
def main():
    #Driver
    FILENAME_PREFIX = 'bhcf'
    FILE_TYPE = '.txt'
    field_headers = []
    with ZipFile('reg_data.zip', 'r') as zip:
        with zip.open(FILENAME_PREFIX + '9909'+ FILE_TYPE) as qtr_file:
            headers_df = pd.read_csv(qtr_file, sep='^', header=None)
            headers_df = headers_df[:1]
            headers_array = headers_df.values[0]
            parsed_data = pd.read_csv(qtr_file, sep='^',header=headers_array)
I tried this with the file you linked and with another one I downloaded, I think from 2015:
import pandas as pd
df = pd.read_csv('bhcf9909.txt',sep='^')
first_headers = df.columns.tolist()
df_more_actual = pd.read_csv('bhcf1506.txt',sep='^')
second_headers = df_more_actual.columns.tolist()
print(df.shape)
print(df_more_actual.shape)
# df_more_actual has more columns than first one
# Normalize column names to avoid duplicate columns
df.columns = df.columns.str.upper()
df_more_actual.columns = df_more_actual.columns.str.upper()
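# Stack the rows of both frames; columns missing from one file are filled with NaN
new_df = pd.concat([df, df_more_actual], ignore_index=True, sort=False)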
print(new_df.shape)
The final dataframe has the rows of both files and the union of their columns.
You can do this for each quarter's file and keep combining them, so that you end up with all the rows and the union of all the columns.
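On the question's other points, `read_csv` takes the first row as the header by default, so it never appears as a data row, and empty fields between consecutive `^` delimiters are read as NaN. A small sketch with made-up data (the column names `ID`, `ASSETS`, `EQUITY` and `INCOME` are invented, not real field names from the files):

```python
import io
import pandas as pd

# Hypothetical excerpts for two quarters; the second one has an extra column.
q1 = "id^assets^equity\n1001^500^\n1002^^70\n"
q2 = "ID^ASSETS^EQUITY^INCOME\n1001^550^60^5\n"

frames = []
for text in (q1, q2):
    frame = pd.read_csv(io.StringIO(text), sep="^")  # first row becomes the header
    frame.columns = frame.columns.str.upper()        # normalize names before combining
    frames.append(frame)

# Rows are stacked; columns absent from one quarter are filled with NaN.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)
```

Collecting the frames in a list and calling `pd.concat` once at the end avoids repeatedly copying a growing dataframe when many quarters are involved.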

Why is the cdc_list getting updated after calling the function read_csv() in total_list?

# Program to combine data from 2 csv files
The cdc_list gets updated after the second call to read_csv:
overall_list = []
def read_csv(filename):
    file_read = open(filename,"r").read()
    file_split = file_read.split("\n")
    string_list = file_split[1:len(file_split)]
    #final_list = []
    for item in string_list:
        int_fields = []
        string_fields = item.split(",")
        string_fields = [int(x) for x in string_fields]
        int_fields.append(string_fields)
        #final_list.append()
        overall_list.append(int_fields)
    return(overall_list)
cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
print(len(cdc_list)) #3652
total_list = read_csv("US_births_2000-2014_SSA.csv")
print(len(total_list)) #9131
print(len(cdc_list)) #9131
The code you pasted does explain it: overall_list is created once, outside the function, and every call to read_csv appends to it and returns that same list. cdc_list and total_list are therefore two names for one list, so the second call grows both.
However, if all you want to do is merge two CSVs (assuming they both have the same columns), you can use pandas' read_csv, concat and DataFrame.to_csv to achieve this with 3 lines of code (not including imports):
import pandas as pd
# Read the first CSV file into a pandas DataFrame
df = pd.read_csv("first.csv")
# Read the 2nd CSV file and concatenate it onto the first
df = pd.concat([df, pd.read_csv("second.csv")], ignore_index=True)
# Write the merged DataFrame (with both CSVs' data) to file
df.to_csv("merged.csv", index=False)
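To keep the two results separate in plain Python, the list can be created inside the function rather than at module level. This is a minimal sketch, not the asker's exact code: the function name `parse_csv_text` is made up, and it takes the file contents as a string (with invented sample rows) only so that the example runs without the real files.

```python
def parse_csv_text(text):
    """Return a new list of integer rows, skipping the header line."""
    rows = []  # created on every call, so results are never shared
    for line in text.split("\n")[1:]:
        if line:  # ignore a trailing empty line
            rows.append([int(x) for x in line.split(",")])
    return rows

# Hypothetical stand-ins for the two files.
cdc_text = "year,month,births\n1994,1,8096\n1994,2,7772\n"
ssa_text = "year,month,births\n2000,1,9083\n"

cdc_list = parse_csv_text(cdc_text)
total_list = parse_csv_text(ssa_text)
print(len(cdc_list), len(total_list))
```

Each call now returns its own list, so a later call cannot change the length of an earlier result.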
