I am importing a text file, adding two columns, and performing some basic math on the two newly created columns based on two other existing columns. Periodically the structure of my original text file changes from 10 columns to 7, so I am trying to catch this with an if/else statement, but I get the error below. What should I be converting this to? And how can I perform the function on a column by identifying it by column number rather than header name, so that instead of mru['t1'] = math.sqrt(mru['r1']**2 + mru['p1']**2) I could write something like mru['t1'] = math.sqrt(mru[1]**2 + mru[2]**2)?
"cannot convert the series to {0}".format(str(converter)))
TypeError: cannot convert the series to <type 'float'>
My Code is:
mru = pd.read_csv(r"C:\some.txt", skipinitialspace=True, names=['time', 'r1', 'p1', 'h1', 'r2', 'p2', 'h2', 'r3', 'p3', 'h3'])
# Identify column count
col = len(mru.columns)
# Calculate tilt
if col == 10:
    converted = mru[mru.columns[-9:]].convert_objects(convert_numeric=True)
    mru[mru.columns[-9:]] = converted
    mru['t1'] = math.sqrt(mru['r1']**2 + mru['p1']**2)
    mru['t2'] = math.sqrt(mru['r2']**2 + mru['p2']**2)
    mru['t3'] = math.sqrt(mru['r3']**2 + mru['p3']**2)
else:
    mru = pd.read_csv(r"C:\Dan\20150330_150831_C.txt", skipinitialspace=True, names=['time', 'r1', 'p1', 'h1', 'r2', 'p2', 'h2'])
    converted = mru[mru.columns[-6:]].convert_objects(convert_numeric=True)
    mru[mru.columns[-6:]] = converted
    mru['t1'] = math.sqrt(mru['r1']**2 + mru['p1']**2)
    mru['t2'] = math.sqrt(mru['r2']**2 + mru['p2']**2)
A snippet of my data (10-column example):
15:08:31.898,-0.3000,0.1400,0.0000,-0.3100,0.5300,0.6234,0.3357,-0.1500,0.0000
15:08:32.898,-0.3000,0.1400,0.0000,-0.1500,0.2800,-0.0984,0.0905,0.0100,0.0000
You can't use normal math functions on a Series, which is an array; use np.sqrt:
import numpy as np
mru['t1'] = np.sqrt(mru['r1']**2 + mru['p1']**2)
The TypeError is telling you that it is expecting a float and not a pandas Series:
TypeError: cannot convert the series to <type 'float'>
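If you really want to address columns by position rather than header name, you can use iloc (a minimal sketch, assuming the 10-column layout above):

import numpy as np

# column 0 is 'time', so 'r1' is column 1 and 'p1' is column 2
mru['t1'] = np.sqrt(mru.iloc[:, 1]**2 + mru.iloc[:, 2]**2)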
As for your other problem, after you've named your cols you can filter them using a list comprehension:
p_cols = [col for col in df if 'p' in col]
Then just generate the same for the t and r cols, and then iterate over each of them in tandem and select the cols:
In [76]:
df = pd.DataFrame(columns = ['time', 'r1', 'p1', 'h1', 'r2', 'p2', 'h2', 'r3', 'p3', 'h3'])
df
Out[76]:
Empty DataFrame
Columns: [time, r1, p1, h1, r2, p2, h2, r3, p3, h3]
Index: []
In [83]:
r_cols = [col for col in df if 'r' in col]
p_cols = [col for col in df if 'p' in col]
for i in range(3):
    r = df[r_cols[i]]
    p = df[p_cols[i]]
    t_col = 't' + str(i+1)
    print(r_cols[i], p_cols[i], t_col)
    # do something like this
    #df[t_col] = np.sqrt(r**2 + p**2)
r1 p1 t1
r2 p2 t2
r3 p3 t3
So the above shows a skeleton of how you could modify your code to achieve what you want in a dynamic way.
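For completeness, a minimal runnable version of that skeleton might look like this (the r/p values here are lifted from the question's data snippet; only the r and p columns are mocked up):

import numpy as np
import pandas as pd

mru = pd.DataFrame({'r1': [-0.30, -0.30], 'p1': [0.14, 0.14],
                    'r2': [-0.31, -0.15], 'p2': [0.53, 0.28],
                    'r3': [0.3357, 0.0905], 'p3': [-0.15, 0.01]})

r_cols = [col for col in mru if col.startswith('r')]
p_cols = [col for col in mru if col.startswith('p')]

for i, (r, p) in enumerate(zip(r_cols, p_cols), start=1):
    # vectorised tilt: np.sqrt works on whole columns, math.sqrt does not
    mru['t' + str(i)] = np.sqrt(mru[r]**2 + mru[p]**2)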
Related
I am trying to assign all three unique groups from the group column in df to different variables (see my code) using Python. How do I incorporate this inside a for loop? Obviously var + i does not work.
import pandas as pd
data = {
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'num': list(range(7))
}
df = pd.DataFrame(data)
unique_groups = df['group'].unique()
# How do I incorporate this logic inside a for loop??
var1 = df[df['group'] == unique_groups[0]]
var2 = df[df['group'] == unique_groups[1]]
var3 = df[df['group'] == unique_groups[2]]
# My approach:
for i in range(len(unique_groups)):
    var + i = df[df['group'] == unique_groups[i]]  # obviously "var + i" does not work
From your comment it seems it is okay for all_vars to be a list so that all_vars[0] is the first group, all_vars[1] the second, etc. In that case, consider using groupby instead:
all_vars = [group for name, group in df.groupby("group")]
You can do this using a dictionary, basically:
all_vars = {}
for i in range(len(unique_groups)):
    all_vars[f"var{i}"] = df[df['group'] == unique_groups[i]]
I am trying to create a function that loops through specific columns in a dataframe and replaces the values with the column names. I have tried the below but it does not change the values in the columns.
def value_replacer(df):
    cols = ['Account Name', 'Account Number', 'Maintenance Contract']
    x = [i for i in df.columns if i not in cols]
    for i in x:
        for j in df[i]:
            if isinstance(j, str):
                j.replace(j, i)
    return df
What should be added to the function to change the values?
Similar to #lazy's solution, but using difference to get the unlisted columns and using a mask instead of the list comprehension:
df = pd.DataFrame({'w': ['a', 'b', 'c'], 'x': ['d', 'e', 'f'], 'y': [1, 2, '3'], 'z': [4, 5, 6]})
def value_replacer(df):
    cols_to_skip = ['w', 'z']
    for col in df.columns.difference(cols_to_skip):
        mask = df[col].map(lambda x: isinstance(x, str))
        df.loc[mask, col] = col
    return df
Output:
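   w  x  y  z
0  a  x  1  4
1  b  x  2  5
2  c  x  y  6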
Loop through only the columns of interest once, and only evaluate each row within each column to see if it is a string or not, then use the resulting mask to bulk update all strings with the column name.
Note that this will change the dataframe inplace, so make a copy if you want the original, and you don't necessarily need the return statement.
I have a piece of code as below:
a = df[['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']]
a_indices = np.argmax(a.ne(0).values, axis=1)
a_df = pd.DataFrame(a.values[np.arange(len(a)), a_indices])
b = df[['col2_1', 'col2_2', 'col2_3', 'col3', 'col1']]
b_indices = np.argmax(b.ne(0).values, axis=1)
b_df = pd.DataFrame(b.values[np.arange(len(b)), b_indices])
....
This code is repetitive, and I am hoping to loop through it. The idea is to have all the combinations of different orders of col1, col2 (col2_1, col2_2, col2_3), and col3. The return should be a combined dataframe of a_df and b_df.
Note: col2_1, col2_2, and col2_3 can have different orders, but they always stay next to each other. Anyways to make this piece of code simpler?
What you can do first is define the maximum number of iterations to loop over; so far you have 5 columns to loop on.
list_columns = ['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']
print(len(list_columns)) # returns 5
Then, you can define your column names based on what you want to put in your dataframe. Suppose you have 5 iterations to make; your column names would be ['A', 'B', 'C', 'D', 'E']. This is the columns argument of your dataframe. An easier way to concatenate several columns at once is to create a dictionary first, with each new column name as the key and a list of equal length as the value.
list_columns = ['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']
new_columns = ['A', 'B', 'C', 'D', 'E']
# Use a dictionary comprehension in my case
data_dict = {column: [] for column in new_columns}
n = 50 # Assume the number of loops is arbitrary there
for i in range(n):
    for col in new_columns:
        # do something
        data_dict[col].append(something)
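Once the loop has filled each list, the dictionary converts to a dataframe in one call (result is just a placeholder name):

# each key becomes a column, each list that column's values
result = pd.DataFrame(data_dict)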
In your case it looks like you can directly operate on the lists by providing a NumPy array instead. Therefore:
list_cols = ['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']
new_cols = ['A', 'B', 'C', 'D', 'E']
data_df = {}
for i, (col, new_col) in enumerate(zip(list_cols, new_cols)):
    print(col, list_cols[0:i] + list_cols[i+1:])
    temp_df = df[[col] + list_cols[0:i] + list_cols[i+1:]]
    temp_indices = np.argmax(temp_df.ne(0).values, axis=1)
    data_df[new_col] = temp_df.values[np.arange(len(temp_df)), temp_indices]
final_df = pd.DataFrame(data_df)
What I basically did was a double unpacking, combining enumerate to get the index and zip to get your final result. The columns are then selected, with each column placed in front of the rest of the list in no particular order.
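If you really need every ordering of the three blocks, with the col2_* columns kept adjacent as you describe, a sketch using itertools.permutations could look like this (it assumes your df from above; the perm_ column names are just placeholders, and permuting within the col2 block would work the same way):

from itertools import permutations

import numpy as np
import pandas as pd

# keep the col2_* columns together by permuting whole blocks
blocks = [['col1'], ['col2_1', 'col2_2', 'col2_3'], ['col3']]

data = {}
for i, order in enumerate(permutations(blocks)):
    cols = [c for block in order for c in block]  # flatten the block ordering
    sub = df[cols]
    # index of the first non-zero value in each row
    first_nonzero = np.argmax(sub.ne(0).values, axis=1)
    data['perm_' + str(i)] = sub.values[np.arange(len(sub)), first_nonzero]

final_df = pd.DataFrame(data)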
I would like to format the "Status" column in a csv and retain the string inside single quotes adjoining a comma ('sometext',).
Example:
Input (screenshot omitted): as in rows 2 & 3, if more than one value is found in any column, the values should be concatenated with a pipe symbol (|), e.g. Phone|Charger.
The expected output should be pasted into the same Status column, as below (screenshot omitted).
My attempt (not working):
import re
import pandas as pd
df = pd.read_csv("test projects.csv")
scol = df.columns.get_loc("Status")
statusRegex = re.compile("'\t',"?"'\t',")
mo = statusRegex.search(scol.column)
Let's say you have df as:
df = pd.DataFrame([[[{'a':'1', 'b': '4'}]], [[{'a':'1', 'b': '2'}, {'a':'3', 'b': '5'}]]], columns=['pr'])
df:
pr
0 [{'a': '1', 'b': '4'}]
1 [{'a': '1', 'b': '2'}, {'a': '3', 'b': '5'}]
df['comb'] = df.pr.apply(lambda x: '|'.join([i['a'] for i in x]))
df:
pr comb
0 [{'a': '1', 'b': '4'}] 1
1 [{'a': '1', 'b': '2'}, {'a': '3', 'b': '5'}] 1|3
import pandas as pd
# simplified mock data
df = pd.DataFrame(dict(
    value=[23432] * 3,
    Status=[
        [{'product.type': 'Laptop'}],
        [{'product.type': 'Laptop'}, {'product.type': 'Charger'}],
        [{'product.type': 'TV'}, {'product.type': 'Remote'}]
    ]
))
# make a method to do the desired formatting / extration of data
def da_piper(cell):
    """extracts product.type and concatenates with a pipe"""
    vals = [_['product.type'] for _ in cell]  # get only the product.type values
    return '|'.join(vals)  # join them with a pipe
# save to desired column
df['output'] = df['Status'].apply(da_piper) # apply the method to the Status col
Additional help: you do not need to use read_excel, since csv is not an Excel format; it is comma-separated values, which is a standard format. In this case you can just do this:
import pandas as pd
# make a method to do the desired formatting / extration of data
def da_piper(cell):
    """extracts product.type and concatenates with a pipe"""
    vals = [_['product.type'] for _ in cell]  # get only the product.type values
    return '|'.join(vals)  # join them with a pipe
# read csv to dataframe
df = pd.read_csv("test projects.csv")
# apply method and save to desired column
df['Status'] = df['Status'].apply(da_piper) # apply the method to the Status col
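One caveat: if the csv stores the Status column as string representations of those lists rather than actual Python objects, you would need to parse it first, for example with ast.literal_eval:

import ast

# turn "[{'product.type': 'Laptop'}]" strings back into lists of dicts
df['Status'] = df['Status'].apply(ast.literal_eval)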
Thank you all for the help and suggestions. Please find the final working code below.
import re

df = pd.read_csv('test projects.csv')
rows = len(df['input'])

def get_values(value):
    m = re.findall("'(.+?)'", value)
    word = ""
    for mm in m:
        if 'value' not in str(mm):
            if 'autolabel_strategy' not in str(mm):
                if 'String Matching' not in str(mm):
                    word += mm + "|"
    return str(word).rsplit('|', 1)[0]

al_lst = []
ans_lst = []
for r in range(rows):
    auto_label = df['autolabeledValues'][r]
    answers = df['answers'][r]
    al = get_values(auto_label)
    ans = get_values(answers)
    al_lst.append(al)
    ans_lst.append(ans)

df['a'] = al_lst
df['b'] = ans_lst
df.to_csv("Output.csv", index=False)
How can I properly use dask delayed for a group-wise quotient calculation over multiple columns?
Some sample data:
raw_data = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'name': ['A', 'B', 'C', 'D', 'E'],
    'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
    'alotdifferent': ['x', 'y', 'z', 'x', 'a'],
    'target': [0, 0, 0, 1, 1],
    'age_group': [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'alotdifferent','target','age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a.alotdifferent = df_a.alotdifferent.astype('category')
df_a.name = df_a.name.astype('category')
Some setup code which determines the string/categorical columns:
FACTOR_FIELDS = df_a.select_dtypes(include=['category']).columns
columnsToDrop = ['alotdifferent']
columnsToBias_keep = FACTOR_FIELDS[~FACTOR_FIELDS.isin(columnsToDrop)]
target = 'target'
The main part is the calculation of the group-wise quotients:
def compute_weights(da, colname):
    # group only a single time
    grouped = da.groupby([colname, target]).size()
    # calculate first ratio
    df = grouped / da[target].sum()
    nameCol = "pre_" + colname
    grouped_res = df.reset_index(name=nameCol)
    grouped_res = grouped_res[grouped_res[target] == 1]
    grouped_res = grouped_res.drop(target, 1)
    # todo persist the result in dict for transformer
    result_1 = grouped_res
    return result_1, nameCol
And now, actually calling it on multiple columns:
original = df_a.copy()
output_df = original
ratio_weights = {}
for colname in columnsToBias_keep.union(columnsToDrop):
    result_1, nameCol = compute_weights(original, colname)
    # persist the result in dict for transformer
    # this is required to separate fit and transform stage (later on in a sklearn transformer)
    ratio_weights[nameCol] = result_1
When trying to use dask delayed, I need to call compute, which breaks the DAG. How can I circumvent this in order to create a single big computational graph which is calculated in parallel?
compute_weights = delayed(compute_weights)
a, b = delayed_res_name.compute()
ratio_weights = {}
ratio_weights[b] = a
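A sketch of one way to keep everything in a single graph (assuming compute_weights as defined above) is to build all the delayed calls first and trigger a single dask.compute at the end:

from dask import compute, delayed

# build the whole graph lazily; nothing runs yet
lazy_results = [delayed(compute_weights)(original, colname)
                for colname in columnsToBias_keep.union(columnsToDrop)]

# a single compute call materialises the graph in parallel
results = compute(*lazy_results)

# each task returned (result_1, nameCol)
ratio_weights = {nameCol: result_1 for result_1, nameCol in results}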