Python Pandas replacing part of a string

I'm trying to filter data that is stored in a .csv file that contains time and angle values and save filtered data in an output .csv file. I solved the filtering part, but the problem is that time is recorded in hh:mm:ss:msmsmsms (12:55:34:500) format and I want to change that to hhmmss (125534) or in other words remove the : and the millisecond part.
I tried using the .replace function but I keep getting the KeyError: 'time' error.
Input data:
time,angle
12:45:55,56
12:45:56,89
12:45:57,112
12:45:58,189
12:45:59,122
12:46:00,123
Code:
import pandas as pd
#define min and max angle values
alpha_min = 110
alpha_max = 125
#read input .csv file
data = pd.read_csv('test_csv3.csv', index_col=0)
#filter by angle size
data = data[(data['angle'] < alpha_max) & (data['angle'] > alpha_min)]
#replace ":" with "" in time values
data['time'] = data['time'].replace(':','')
#display results
print(data)
#write results
data.to_csv('test_csv3_output.csv')

That's because index_col=0 turns time into the index, so data['time'] is no longer a column and the lookup raises KeyError. Remove index_col=0 when reading the file:
data = pd.read_csv('test_csv3.csv')
Then replace the .replace line with:
data['time'] = pd.to_datetime(data['time']).dt.strftime('%H%M%S')
Output:
     time  angle
2  124557    112
4  124559    122
5  124600    123
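
A string-based alternative may be worth sketching here, because Series.replace matches whole cell values by default, so .replace(':','') would leave the colons in place even once time is a regular column. The sketch below uses .str.replace instead and assumes the four-part hh:mm:ss:ms layout described in the question; the inline sample data is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in the hh:mm:ss:ms layout described in the question
csv_text = """time,angle
12:45:55:500,56
12:45:57:250,112
12:45:59:000,122
12:46:00:750,123
"""

alpha_min, alpha_max = 110, 125

# No index_col here, so 'time' stays an ordinary column
data = pd.read_csv(io.StringIO(csv_text))
data = data[(data["angle"] > alpha_min) & (data["angle"] < alpha_max)].copy()

# Strip the colons, then keep only the first six characters (hhmmss)
data["time"] = data["time"].str.replace(":", "", regex=False).str[:6]
print(data)
```

Because this works on plain strings, it does not depend on pandas being able to parse the millisecond suffix as a timestamp.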

What do print(data.keys()) and print(data.head()) show? It looks like there may be a stray character before or after the time header string. This happens from time to time, depending on how the CSV was created versus how it was read (see this question).
If this isn't part of a bigger project and you just want the data, you could use a quick workaround such as timeKeyString = list(data.columns.values)[0] (assuming time is the first column).

Related

How to convert a single column containing JSON with 250 variables to 250 separate column dataset using arrays?

I have an issue converting a JSON column (which contains around 250 variables) into 250 separate columns. I'm able to use Pandas dataframe, but just for 46k rows it takes 30 minutes and sometimes kernel is crashing due to low memory (for 0.5 million rows in database).
Can somebody help me with code using NumPy arrays (which should decrease conversion time and reduce file size)?
The JSON column has data in below format:
My code :
for x in records:
    list_ = list(x)
    json_acceptable_string = list_[4].read()
    list_features.append(json.loads(json_acceptable_string))
Once I have list_features, I preprocess it and feed it into a machine learning pipeline. This isn't working for large data.

I think this could help with building your NumPy array:
import json
import numpy as np
variable_name_list = ['var1','var2',....,'var250']  # list all 250 names here
# dtype=object keeps whole values; dtype=str would allocate 1-character cells and truncate them
list_features = np.empty(shape=(len(records), len(variable_name_list)), dtype=object)
for index in range(len(records)):
    list_ = list(records[index])
    json_acceptable_string = list_[4].read()
    tmp_feature_list = []
    tmp_feature_dict = json.loads(json_acceptable_string)
    for var_name in variable_name_list:
        if var_name not in tmp_feature_dict:
            tmp_feature_list.append("missing_val")
        else:
            tmp_feature_list.append(tmp_feature_dict[var_name])
    # write the collected values into row `index`
    list_features[index] = tmp_feature_list
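
If the goal is one column per JSON key, a shorter route may be to parse every JSON string once and hand the resulting list of dicts to the DataFrame constructor. This is only a sketch: raw_rows stands in for the strings read from column 4 of each record, and the two variable names are placeholders for the real 250.

```python
import json
import pandas as pd

# Hypothetical stand-in for the JSON strings read from each record
raw_rows = [
    '{"var1": 1, "var2": "a"}',
    '{"var1": 2}',  # var2 is missing in this row
]
variable_names = ["var1", "var2"]  # placeholder for the full list of names

# Parse each string once, then let pandas align the keys into columns
parsed = [json.loads(s) for s in raw_rows]
features = pd.DataFrame(parsed, columns=variable_names)
print(features)
```

Keys that are absent from a row come out as NaN, which can be filled afterwards with fillna if a sentinel such as "missing_val" is preferred.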

Change dateformat

I want to change the date format in this code, but I only manage to convert a single value and not the whole dataset.
Code:
import pandas as pd
df = pd.read_csv ("data_q_3.csv")
result = df.groupby ("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
print ("Covid 19 top 10 countries based on confirmed case:")
print(result)
from datetime import datetime
datetime.fromisoformat("2020-03-18T12:13:09").strftime("%Y-%m-%d-%H:%M")
Does anyone know how to adapt the code so that the datetime format changes in the whole dataset?
Thanks!

After looking at your problem for a while, I figured out how to change the values in the 'DateTime' column. The only problem that may arise is if the 'Country/Region' column has duplicate location names.
Editing the time is simple: all you have to do is use Python's string slicing. You can slice a string by typing
string = 'abcdefghijklmnopqrstuvwxyz'
print(string[0:5])
which will result in abcde (the end index is not included).
Below is the finished code.
import pandas as pd
# read unknown data
df = pd.read_csv("data_q_3.csv")
# List of unknown data
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
# you need a for loop to go through the whole column
for row in result.index:
    # get the current stored time
    time = result.at[row, 'DateTime']
    # reformat the time string by slicing the
    # string from index 0 to 10 (the date), and from index 11 to 16 (hh:mm),
    # and putting a dash in the middle; index 10 is the 'T' separator
    time = time[0:10] + "-" + time[11:16]
    # store the new time in the result
    result.at[row, 'DateTime'] = time
# print result
print("Covid 19 top 10 countries based on confirmed case:")
print(result)
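
A vectorised alternative to the loop, sketched under the assumption that the 'DateTime' values are ISO strings like the 2020-03-18T12:13:09 example in the question: parse the whole column with pd.to_datetime and format it in one step. The small frame below is made-up sample data; only the column name and the target format come from the thread.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from data_q_3.csv
result = pd.DataFrame(
    {
        "Confirmed": [100, 50],
        "DateTime": ["2020-03-18T12:13:09", "2020-03-19T08:05:00"],
    },
    index=["CountryA", "CountryB"],
)

# Parse every value in the column, then write it back in the new format
result["DateTime"] = pd.to_datetime(result["DateTime"]).dt.strftime("%Y-%m-%d-%H:%M")
print(result)
```

Unlike slicing, this fails loudly on values that are not valid timestamps, which can help catch malformed rows early.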

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows I cannot figure out for example how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if(category!=label):
    difference+=1
    totalChecked+=1
else:
    correct+=1
    totalChecked+=1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?

Convert both to pandas DataFrames and compare them in a similar way to this example. Whatever dataset you're working on, loading it with the pandas module (alongside any other relevant modules) and turning the data into lists and DataFrames would be the first step to working with it, in my opinion.
I've taken the time to work through this myself, as it will be useful to me going forward. The columns don't have to have the same length in this example, so that's good. I've tested the code below (Python 3.8) and it works.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import re
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv') #dropped the S fom _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)
my_unknown_id = A['Unknown_sample_no'].tolist() #Unknown_sample_no
my_unknown_seq = A['Unknown_sample_seq'].tolist() #Unknown_sample_seq
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist() #it was Reference_sequences
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1)) #it was Reference_sequences
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
filename = 'seq_match_compare2.csv'
f = open(filename, 'a') #in his eg it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
I also wrote a version myself, assuming your columns contain integers and following your specification as best I can at the moment. It's my first attempt at this kind of task without web scraping, so go easy. You could use the code below as a starting point for your question.
Basically it gives you the skeleton of what you want: it imports a CSV in Python using the pandas module, converts it to a DataFrame, works on specific columns only, makes a new column with the results, prints the results alongside the original data in the terminal, and saves everything to a new CSV. It's as messy as my Python is, but it works. Personally and professionally it's a milestone for me, and I hope to come back to it later to improve its readability, scope and functionality.
# This is a work in progress: it shows how to turn columns into lists with pandas,
# do calculations on them, and write the results to a new CSV.
# It does in basic terms what was asked for; there is much more you can build on top.
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
# fill missing categories so they stay visible in the output
A["Category"] = A["Category"].fillna("empty data - missing value")
#A["Blank1"] = A["Blank1"].fillna("empty data - missing value")
# ...etc
print(A.columns)
# columns converted to plain lists, in case you want to loop over them
MyCat = A['Category'].tolist()
MyLab = A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)
print("Given Dataframe :\n", A)
# new result column: numeric difference between the two columns
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of Category1 and Label1 :\n", A)
# you can do other matches, comparisons and calculations here and add them to the output
# save the original data plus the result column to a new CSV
A.to_csv('some_name5523.csv')
Yes, I know it's by no means perfect, but I wanted to give you a heads-up about pandas and DataFrames for doing what you want going forward.
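
For the accuracy count in the question's pseudocode, a direct sketch with pandas could look like the following. The column names category and label and the inline data are assumptions for illustration; with a real file the two columns can also be picked by position, for example df.iloc[:, 8] and df.iloc[:, 10].

```python
import io
import pandas as pd

# Hypothetical sample; a real file would be loaded with pd.read_csv(path)
csv_text = """id,category,label
1,a,a
2,b,c
3,c,c
4,d,a
"""
df = pd.read_csv(io.StringIO(csv_text))

# Row-by-row comparison of the two columns gives a boolean Series
matches = df["category"] == df["label"]

correct = int(matches.sum())          # rows where both columns agree
total_checked = len(df)               # every row is checked once
difference = total_checked - correct  # rows where they disagree
accuracy = 100 * correct / total_checked

print(correct, difference, total_checked, accuracy)
```

Comparing whole columns at once avoids the explicit loop and the manual counters.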

Multiplication of values in a dataframe with scalars

I am working on a problem where I want to convert X and Y pixel values to physical coordinates. I have a huge folder containing many CSV files. I load each one, pass it to my function, compute the coordinates, overwrite the columns and return the data frame, and then overwrite the file outside the function. I have a formula that does this correctly, but I am having some problems implementing it in Python.
Each CSV file has many columns. The columns I am interested in are Latitude (degree), Longitude (degree), XPOS and YPOS. The former two are blank, and the latter two hold the data I need to fill in the former two.
import pandas as pd
import glob
max_long = float(XXXX)
max_lat = float(XXXX)
min_long = float(XXXX)
min_lat = float(XXXX)
hoi = int(909)
woi = int(1070)
def pixel2coor (filepath, max_long, max_lat, min_lat, min_long, hoi, woi):
    data = pd.read_csv(filepath) #reading Csv
    data2 = data.set_index("Log File") #Setting index of dataframe with first column
    data2.loc[data2['Longitude (degree)']] = (((max_long-min_long)/hoi)*[data2[:,'XPOS']]+min_long) #Computing Longitude & Overwriting
    data2.loc[data2['Latitude (degree)']] = (((max_lat-min_lat)/woi)*[data2[:,'YPOS']]+min_lat) #Computing Latitude & Overwriting
    return data2 #Return dataframe
filenames = sorted(glob.glob('*.csv'))
for file in filenames:
    df = pixel2coor (file, max_long, max_lat, min_lat, min_long, hoi, woi) #Calling pixel 2 coor function and passing a csv file in every iteration
    df.to_csv(file) #overwriting the file with the dataframe
I am getting the following error:
TypeError: '(slice(None, None, None), 'XPOS')' is an invalid key

It looks to me like your syntax is off. In the following line:
data2.loc[data2['Longitude (degree)']] = (((max_long-min_long)/hoi)*[data2[:,'XPOS']]+min_long) #Computing Longitude & Overwriting
The left side of your equation appears to be referring to a column, but you have it in the 'row' section of .loc slicer. So it should be:
data2.loc[:, 'Longitude (degree)']
On the right side of your equation, you've forgotten .loc or need to drop the ':,' so two possible solutions:
(((max_long-min_long)/hoi)*data2.loc[:,'XPOS']+min_long)
(((max_long-min_long)/hoi)*data2['XPOS']+min_long)
Also, I would add that your brackets on the right side should be more explicit. It's a bit unclear how you want the scalars to act on the series. Do you want to add min_long first, or multiply by ((max_long-min_long)/hoi) first?
Your final line might look like this, forcing addition first as an example:
data2.loc[:, 'Longitude (degree)'] = ((max_long-min_long)/hoi)*(data2.loc[:,'XPOS']+min_long)
This applies to your next line as well. You may get more errors after you fix this.
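
As a runnable sketch of the corrected assignment: without extra parentheses Python multiplies before it adds, which matches the scale-then-offset form of the question's formula, so that is the variant shown here. The bounds, image sizes and pixel values are made-up numbers; only the column names and variable names come from the question.

```python
import pandas as pd

# Hypothetical bounds and image dimensions, for illustration only
min_long, max_long = 10.0, 20.0
min_lat, max_lat = 50.0, 60.0
hoi, woi = 100, 200

data2 = pd.DataFrame(
    {
        "Longitude (degree)": [None, None],
        "Latitude (degree)": [None, None],
        "XPOS": [0, 50],
        "YPOS": [0, 100],
    }
)

# Scale the pixel column by a scalar, then shift by the minimum (multiply first, add second)
data2["Longitude (degree)"] = (max_long - min_long) / hoi * data2["XPOS"] + min_long
data2["Latitude (degree)"] = (max_lat - min_lat) / woi * data2["YPOS"] + min_lat
print(data2)
```

Assigning to data2["column"] replaces the whole column, which is equivalent here to the data2.loc[:, "column"] form.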

Dask read_csv: skip periodically occurring lines

I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the number of atoms, the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom count and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3
def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2
pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like the Dask dataframe does not support passing a function to skiprows.
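
To make the pandas variant easy to try, here is a self-contained sketch that reads the two sample time steps from an in-memory string. It uses sep=r"\s+" for whitespace splitting; the column names and the derived frame column are additions for illustration and are not part of the original snippet.

```python
import io
import pandas as pd

xyz_text = """3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
"""
atom_nr = 3

def skip(line_nr):
    # The first two lines of every (atom_nr + 2)-line block are count and comment
    return line_nr % (atom_nr + 2) < 2

df = pd.read_csv(
    io.StringIO(xyz_text),
    skiprows=skip,
    sep=r"\s+",
    header=None,
    names=["atom", "x", "y", "z"],  # hypothetical column names
)
# Each consecutive group of atom_nr rows belongs to one time step
df["frame"] = df.index // atom_nr
print(df)
```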
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO
import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed
atom_nr = ...
filename = ...
def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2
def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)
bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.

Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offers this kind of functionality (to my knowledge).
Long Answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (and then up until the next delimiter). You can control how fine you want your partitions to be using this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data
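
The same "keep only the first few pieces" idea can be illustrated without Dask. The sketch below is a plain-pandas stand-in, not Dask's API: it cuts the text into frame-sized blocks, parses each block separately, and stops after n_frames. The function load_frames and its parameters are hypothetical names.

```python
from io import StringIO
import pandas as pd

def load_frames(xyz_text, atom_nr, n_frames=None):
    """Parse an XYZ string into one DataFrame per time step (hypothetical helper)."""
    lines = xyz_text.splitlines()
    frame_len = atom_nr + 2  # count line + comment line + atom lines
    frames = []
    for start in range(0, len(lines), frame_len):
        if n_frames is not None and len(frames) >= n_frames:
            break  # stop early: later frames are never parsed
        # Drop the count and comment lines, keep the atom lines of this block
        body = "\n".join(lines[start + 2 : start + frame_len])
        frames.append(pd.read_csv(StringIO(body), sep=r"\s+", header=None))
    return frames

xyz_text = """3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
"""

first_frame_only = load_frames(xyz_text, atom_nr=3, n_frames=1)
print(first_frame_only[0])
```

With Dask, the equivalent step is slicing the list of delayed objects before building the dataframe, as in the dfs[:5] line above.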
