Change CSV numerical values based on header name in file using Python

I have a .csv file filled with observations from sensors in the field. The sensors write data as millimeters and I need it as meters to import into another application. My idea was to use Python and possibly pandas to:
1. Read in the .csv as dataframe
2. Find the headers of the data I need to modify (divide each actual value by 1000)
3. Divide each value in the chosen column by 1000 to convert it to meters
4. Write the resulting updated file to disk
Objective: I need to modify all the values except those with a header that contains "rad" in it.
This is what the data looks like (screenshot omitted):
Here is what I have done so far:
Read data into a dataframe:
import pandas as pd
import numpy as np
delta_df = pd.read_csv('SAAF_121581_67_500.dat',index_col=False)
Filter out all the data that I don't want to touch:
delta_df.filter(like='rad', axis=1)
Here is where I got stuck, as I couldn't work out how to filter the dataframe with the opposite condition, i.e. not like='rad'.
How can I do this?

It's easier if you post the dataframe rather than an image, as the image is not reproducible.
You can use DataFrame.select to keep all the columns containing 'rad' (this needs import re; note that select was deprecated in pandas 0.21 and later removed, with filter as the modern equivalent):
import re
delta_df = delta_df.select(lambda x: re.search('rad', x), axis=1)
In case you are trying to remove all the columns containing 'rad', use
delta_df = delta_df.select(lambda x: not re.search('rad', x), axis=1)
Alternate solution without regex:
df.filter(like='rad',axis=1)
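If you want the complement without regex on modern pandas, one option (a sketch reusing the same filter call) is to drop the matched columns:
df_norad = df.drop(columns=df.filter(like='rad').columns)  # keeps everything except the 'rad' columns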
EDIT:
Given the two dataframes, one without the 'rad' columns and one with them:
df_norad = df.select(lambda x: not re.search('rad', x), axis=1)
df_rad = df.select(lambda x: re.search('rad', x), axis=1)
You can convert the values of df_norad to meters and then merge it with df_rad.
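For the conversion step itself, a minimal sketch (assuming every remaining column in df_norad is a numeric millimetre reading):
df_norad = df_norad / 1000  # millimetres -> metres, applied element-wise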
merged = pd.concat([df_norad, df_rad], axis = 1)
You can write the merged dataframe to CSV using to_csv:
merged.to_csv('yourfilename.csv')
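If you don't want the row index written as an extra first column, pass index=False:
merged.to_csv('yourfilename.csv', index=False)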

Off the top of my head I believe you can do something like this:
delta_df.filter(regex='^(?!.*rad)', axis=1)
Here we use the regex parameter instead of the like parameter (note: regex, like, and items are mutually exclusive). The negative lookahead '^(?!.*rad)' matches only those column labels that do not contain 'rad' anywhere, which is what we want here.
Again, I don't have an environment set up to test this, but I hope this motivates the idea well enough.
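Putting the pieces together, here is a minimal end-to-end sketch on modern pandas (the output filename is made up, and it assumes every non-'rad' column is numeric):
import pandas as pd

delta_df = pd.read_csv('SAAF_121581_67_500.dat', index_col=False)
cols = [c for c in delta_df.columns if 'rad' not in c]  # columns to convert
delta_df[cols] = delta_df[cols] / 1000                  # millimetres -> metres
delta_df.to_csv('SAAF_121581_67_500_m.csv', index=False)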

Related

How to merge two columns in a CSV, combining them in list format with a comma between the merged values?

I am working on a pandas data frame where I want to merge two columns, put a comma , between the merged values, and then wrap the whole cell in [].
Example:
I have this kind of data frame (note: the sample data is uploaded at this link):
bboxes class_names
[[0,0,2336,2836],[0,0,2336,2836],[0,0,2336,2836]] ['No finding','No finding','No finding']
and I want to merge the two columns, adding a comma between the contents, then enclose the merged cell in [] like below:
final_bboxes
[[[0,0,2336,2836],[0,0,2336,2836],[0,0,2336,2836]],['No finding','No finding','No finding']]
Thank you so much
You first need to convert the lists stored as strings into actual lists before combining them. I have used ast.literal_eval to do this safely.
import ast
df["final_bboxes"] = df.apply(lambda row: [ast.literal_eval(row["bboxes"]), ast.literal_eval(row["class_names"])], axis=1)
Try this:
df['new_col'] = [[x, y] for [x, y] in df[['bboxes', 'class_names']].values]
If the cells are stored as strings (a quoting issue), evaluate them first:
df['new_col'] = [[eval(x), eval(y)] for [x, y] in df[['bboxes', 'class_names']].values]  # you can use the ast module's literal_eval as well
# I am not a fan of eval tbh; don't use it unless you have no other way

How to extract the inside of a column into several columns

I have an Excel file that I import into a dataframe. I want to split the contents of one column into several columns.
Here is the original (screenshot omitted).
After importing into pandas in Python, I get this data with '\n' (screenshot omitted).
So, I want to extract the inside of the column. Could you all share an idea or code?
My expected columns are.... (screenshot omitted)
Don't worry, no one is born knowing everything about SO. Considering the data you gave, especially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np

data = pd.read_excel("the_data.xlsx")

ok = []
l = len(data['Details'])
for n in range(l):
    x = data['Details'][n].split()
    x[2] = x[2].replace('Vector:', '', 1)  # strip the literal prefix; lstrip('Vector:') would eat any of those characters
    x = [v for v in x if v not in ['Type:', 'Mission:']]
    ok += x

values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this:
First you split each element of the Details column into a list of strings. Second, you deal with the 'Vector:...' special case and filter out the label tokens. Third, you store all the values in a list which is in turn converted to a numpy array of shape (length, 3). Finally, you drop the old 'Details' column and concatenate the original data with the df created from the split strings.
You may want a more efficient way to transform your data at read time, by applying these ideas inside the pd.read_excel method via its converters parameter, as sketched below.
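For instance, a sketch of that idea (assuming the same 'Type: ... Vector:... Mission: ...' cell layout as above; the parsing mirrors the loop):
import pandas as pd

def split_details(cell):
    # split the cell, strip the literal 'Vector:' prefix, drop the label tokens
    x = cell.split()
    x[2] = x[2].replace('Vector:', '', 1)
    return [v for v in x if v not in ('Type:', 'Mission:')]

data = pd.read_excel('the_data.xlsx', converters={'Details': split_details})
details = pd.DataFrame(data.pop('Details').tolist(),
                       columns=['Type', 'Vector', 'Mission'])
final = pd.concat([data, details], axis=1)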

Edit distance between two pandas columns

I have a pandas DataFrame consisting of two columns of strings. I would like to create a third column containing the Edit Distance of the two columns.
from nltk.metrics import edit_distance
df['edit'] = edit_distance(df['column1'], df['column2'])
For some reason this seems to go to some sort of infinite loop in the sense that it remains unresponsive for quite some time and then I have to terminate it manually.
Any suggestions are welcome.
nltk's edit_distance function compares one pair of strings. Passing whole columns makes it treat the two Series themselves as the sequences to align, which is why it appears to hang for so long. If you want to compute the edit distance between corresponding pairs of strings, apply it separately to each row's strings like this:
results = df.apply(lambda x: edit_distance(x["column1"], x["column2"]), axis=1)
Or like this (probably a little more efficient), to avoid including the irrelevant columns of the dataframe:
results = df.loc[:, ["column1", "column2"]].apply(lambda x: edit_distance(*x), axis=1)
To add the results to your dataframe, you'd use it like this:
df["distance"] = df.loc[:, ["column1","column2"]].apply(lambda x: edit_distance(*x), axis=1)

How to keep leading zeros in a column when reading CSV with Pandas?

I am importing study data into a Pandas data frame using read_csv.
My subject codes are 6-digit numbers encoding, among other things, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, there is a simple solution: use the converters option for the column in the read_csv function:
converters={'column_name': str}
Let's say I have a csv file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
For example, the code below trims the leading zeros:
from pandas import read_csv
dataframe = read_csv('projects.csv')
print(dataframe)
Result:
      project_name  project_id
0     Some Project         245
1  Another Project         478
Solution code example:
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)
Required result:
      project_name project_id
0     Some Project     000245
1  Another Project     000478
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
Here is a shorter, robust, fully working solution: simply define a mapping (dictionary) between variable names and the desired data types:
dtype_dic = {'subject_id': str,
             'subject_number': 'float'}
Use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype=dtype_dic)
et voila!
If you have a lot of columns and you don't know which ones contain leading zeros that might be missed, or you might just need to automate your code, you can do the following:
df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the compete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
You can do this; it works on all versions of Pandas:
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
You can use converters to pad the numbers to a fixed width if you know the width.
For example, if the width is 5, then
data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{int(x):05}"})  # read_csv passes the raw field to the converter as a string, hence int(x)
This will do the trick. It works for pandas==0.23.0 and also read_excel.
Python 3.6 or higher is required (for f-strings).
I don't think you can specify a column type the way you want (unless there have been changes recently, and unless the 6-digit number is a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there, as sketched below.
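A sketch of that idea (the file name is made up; everything is read as strings and the frame is built from the raw array):
import numpy as np
import pandas as pd

raw = np.genfromtxt('my_data.csv', delimiter=',', dtype=str)  # header row included
df = pd.DataFrame(raw[1:], columns=raw[0])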
EDIT: Take a look at Wes McKinney's blog; there might be something there for you. It seems that a new parser is coming in pandas 0.10 in November.
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
   id  A
0  03  5
1  04  6
