I would like to add a string to the beginning of each row label, either positive or negative, depending on the values in the columns:
I keep getting a ValueError, as per the screenshot.
For a generic method to handle any number of columns, use pandas.from_dummies:
cols = ['positive', 'negative']
user_input_1.index = (pd.from_dummies(user_input_1[cols]).squeeze()
                      + '_' + user_input_1.index
                      )
Example input:
Score positive negative
A 1 1 0
B 2 0 1
C 3 1 0
Output:
Score positive negative
positive_A 1 1 0
negative_B 2 0 1
positive_C 3 1 0
Use Series.map for prefixes by conditions and add to index:
df.index = df['positive'].eq(1).map({True:'positive_', False:'negative_'}) + df.index
Or use numpy.where:
df.index = np.where(df['positive'].eq(1), 'positive_','negative_') + df.index
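For reference, here is a minimal end-to-end sketch of the Series.map variant on the example input above. The frame is rebuilt by hand from that example, and the name prefix is only an illustrative choice.

```python
import pandas as pd

# Example input from above, rebuilt by hand
df = pd.DataFrame(
    {'Score': [1, 2, 3], 'positive': [1, 0, 1], 'negative': [0, 1, 0]},
    index=['A', 'B', 'C'],
)

# Pick the prefix per row: 'positive_' where the flag is 1, else 'negative_'
prefix = df['positive'].eq(1).map({True: 'positive_', False: 'negative_'})

# Element-wise string concatenation of the prefix with the old labels
df.index = prefix + df.index
print(df)
```

This assumes every row has exactly one of the two flags set; rows with neither flag would be labelled negative_ by this logic.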
I have 2 dataframes:
df1=pd.DataFrame({'number': ['14578', '45621', '1564']})
df2=pd.DataFrame({'number': ['1457891521', '123456215', '15643']})
My question: how can I determine whether each value in df1['number'] is contained in the corresponding df2['number'], matching strictly from the left?
Desirable result:
number full number
0 14578 1457891521
1 45621 0
2 1564 15643
You unfortunately need to loop here:
df1['full number'] = [b if b.startswith(a) else ''
                      for a, b in zip(df1['number'], df2['number'])]
Output:
number full number
0 14578 1457891521
1 45621
2 1564 15643
Another possible solution, using numpy, specifically numpy.char.startswith:
x2, x1 = df2['number'].values.astype(str), df1['number'].values.astype(str)
out = df1
out['full number'] = df2.loc[np.char.startswith(x2, x1), 'number']
Output:
number full number
0 14578 1457891521
1 45621 NaN
2 1564 15643
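If the goal is to reproduce the 0 placeholder from the desired result, the row-wise prefix test can be combined with Series.where. This is only a sketch: it assumes the placeholder should be the string '0' and that the two frames are aligned row by row.

```python
import pandas as pd

df1 = pd.DataFrame({'number': ['14578', '45621', '1564']})
df2 = pd.DataFrame({'number': ['1457891521', '123456215', '15643']})

# Row-wise test: does the full number start with the short one?
matches = pd.Series(
    [full.startswith(short) for short, full in zip(df1['number'], df2['number'])],
    index=df2.index,
)

# Keep the full number where the prefix matches, otherwise fall back to '0'
df1['full number'] = df2['number'].where(matches, '0')
print(df1)
```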
With pandas crosstab, I currently get the output below when the other column contains only zeros:
0
0 5
1 2
But I need to get an output for the other column even if it contains all zero.
0 1
0 5 0
1 2 0
I am using below code to create cross tab:
data_crosstab = pd.crosstab(data[df_all.columns[56]],
                            data[df_all.columns[57]],
                            margins=False, dropna=False)
Use DataFrame.reindex:
#margins=False is default value, so removed
#https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
data_crosstab = pd.crosstab(data[df_all.columns[56]],
                            data[df_all.columns[57]], dropna=False)
data_crosstab = data_crosstab.reindex(columns=[0,1], fill_value=0)
More general solution:
data_crosstab = data_crosstab.reindex(columns=[0,1],index=[0,1], fill_value=0)
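A small self-contained sketch of the reindex step. The column names a and b and the toy data are made up to match the counts shown in the question.

```python
import pandas as pd

# Hypothetical data: column 'b' only ever takes the value 0
data = pd.DataFrame({'a': [0] * 5 + [1] * 2, 'b': [0] * 7})

ct = pd.crosstab(data['a'], data['b'])

# Force both categories onto rows and columns; missing cells become 0
ct = ct.reindex(index=[0, 1], columns=[0, 1], fill_value=0)
print(ct)
```

The list of expected categories has to be supplied explicitly, since crosstab cannot know about values that never occur in the data.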
I have a dataframe with some numbers (or strings, it doesn't actually matter). The thing is that I need to add a character in the middle of them. The dataframe looks like this (I got it from Google Takeout)
id A B
1 512343 -1234
1 213 1231345
1 18379 187623
And I want to add a comma in the second position
id A B
1 51,2343 -12,34
1 21,3 12,31345
1 18,379 18,7623
A and B are actually longitude and latitude, so I don't think the comma can be placed reliably: there is no way to know whether a coordinate is supposed to have one or two digits before the separator. Still, putting the comma in the second position would do the trick.
This should do the trick:
df[["A", "B"]]=df[["A", "B"]].astype(str).replace(r"(\d{2})(\d+)", r"\1,\2", regex=True)
Outputs:
id A B
0 1 51,2343 -12,34
1 1 21,3 12,31345
2 1 18,379 18,7623
Here's another approach with str.extract:
for c in ['A','B']:
    # raw string so that \d is not treated as a string escape
    df[c] = df[c].astype(str).str.extract(r'(-?\d{2})(\d*)').agg(','.join, axis=1)
Output:
id A B
0 1 51,2343 -12,34
1 1 21,3 12,31345
2 1 18,379 18,7623
You could do something like this -
import numpy as np
df['A'] = np.where(df['A']>=0,'', '-') + ( df['A'].abs().astype(str).str[:2] + ',' + df['A'].abs().astype(str).str[2:] )
df['B'] = np.where(df['B']>=0,'', '-') + ( df['B'].abs().astype(str).str[:2] + ',' + df['B'].abs().astype(str).str[2:] )
df
id A B
0 1 51,2343 -12,34
1 1 21,3 12,31345
2 1 18,379 18,7623
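The same idea can be wrapped in a small helper that keeps an optional leading minus sign and only inserts the comma when more digits follow. This is a sketch on the sample data; insert_comma is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1],
    'A': [512343, 213, 18379],
    'B': [-1234, 1231345, 187623],
})

def insert_comma(col):
    # Optional sign plus two digits at the start, followed by at least one more digit
    return col.astype(str).str.replace(r'^(-?\d{2})(?=\d)', r'\1,', regex=True)

df[['A', 'B']] = df[['A', 'B']].apply(insert_comma)
print(df)
```

Values with only one or two digits are left unchanged by this pattern, which may or may not be what you want for coordinates.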
We have a large dataset that needs to be modified based on specific criteria.
Here is a sample of the data:
Input
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
1 0 0 1 0 0 1
SampleData1 = pd.DataFrame([[0,1,1,1,0],[0,0,1,0,0]],
                           columns=['BL.DB',
                                    'BL.KB',
                                    'MI.RO',
                                    'MI.RA',
                                    'MI.XZ'])
The fields of this data are all formatted 'family.member', and a family may have any number of members. We need to remove all rows of the dataframe which have all 0's for any family.
Simply put, we want to only keep rows of the data that contain at least one member of every family.
We have no reproducible code for this problem because we are unsure of where to start.
We thought about using iterrows() but the documentation says:
#You should **never modify** something you are iterating over.
#This is not guaranteed to work in all cases. Depending on the
#data types, the iterator returns a copy and not a view, and writing
#to it will have no effect.
Other questions on S.O. do not quite solve our problem.
Here is what we want the SampleData to look like after we run it:
Expected output
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
SampleData1 = pd.DataFrame([[0,1,1,1,0]],
                           columns=['BL.DB',
                                    'BL.KB',
                                    'MI.RO',
                                    'MI.RA',
                                    'MI.XZ'])
Also, could you please explain why we should not modify data we are iterating over, when we do that all the time with for loops, and what the correct way to modify a DataFrame is?
Thanks for the help in advance!
Start by copying df and reformatting its columns into a MultiIndex:
df2 = df.copy()
df2.columns = df.columns.str.split(r'\.', expand=True)
The result is:
BL MI
DB KB RO RA XZ
0 0 1 1 1 0
1 0 0 1 0 0
To generate "family totals", i.e. sums of elements in rows over the top
(0) level of column index, run:
df2.groupby(level=[0], axis=1).sum()
The result is:
BL MI
0 1 2
1 0 1
But actually we want to count zeroes in each row of the above table,
so extend the above code to:
(df2.groupby(level=[0], axis=1).sum() == 0).astype(int).sum(axis=1)
The result is:
0 0
1 1
dtype: int64
meaning:
row with index 0 has no "family zeroes",
row with index 1 has one such zero (for one family).
And to print what we are looking for, run:
df[(df2.groupby(level=[0], axis=1).sum() == 0)\
    .astype(int).sum(axis=1) == 0]
i.e. print rows from df, with indices for which the count of
"family zeroes" in df2 is zero.
It's possible to group along axis=1. For each row, check that all families (grouped on the column name before '.') have at least one 1, then slice by this Boolean Series to retain these rows.
m = df.groupby(df.columns.str.split('.').str[0], axis=1).any().all(axis=1)
df[m]
# BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
#0 0 1 1 1 0 1
As an illustration, here's what grouping along axis=1 looks like; it partitions the DataFrame by columns.
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1):
    print(idx, gp, '\n')
#BL BL.DB BL.KB
#0 0 1
#1 0 0
#MAY MAY.BE
#0 1
#1 1
#MI MI.RO MI.RA MI.XZ
#0 1 1 0
#1 1 0 0
Now it's rather straightforward to find the rows where all of these groups have any single non-zero column, by using those with axis=1.
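In recent pandas releases, groupby(..., axis=1) is deprecated, so the same check can be sketched on the transposed frame instead. This is a minimal sketch rather than the code above; the frame is rebuilt from the sample table in the question (including MAY.BE), and family_of is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 1, 1, 0, 1], [0, 0, 1, 0, 0, 1]],
    columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ', 'MAY.BE'],
)

def family_of(column_name):
    # 'BL.DB' -> 'BL'
    return column_name.split('.')[0]

# After transposing, the column names are the row labels, so a normal
# row-wise groupby can partition them by family.
has_member = df.T.groupby(family_of).any()

# Keep the original rows where every family has at least one member
keep = has_member.all()
result = df[keep]
print(result)
```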
You basically want to group on families and retain rows where there is one or more member for all families in the row.
One way to do this is to transpose the original dataframe and then split the index on the period, taking the first element which is the family identifier. The columns are the index values in the original dataframe.
We can then group on the families (level=0) and sum the number of members in each for every record (df2.groupby(level=0).sum()). Now we retain the index values with at least one member in each family (.gt(0).all()). We create a mask from these values and use it as a boolean index on the original dataframe to get the relevant rows.
df2 = SampleData1.T
df2.index = [idx.split('.')[0] for idx in df2.index]
# >>> df2
# 0 1
# BL 0 0
# BL 1 0
# MI 1 1
# MI 1 0
# MI 0 0
# >>> df2.groupby(level=0).sum()
# 0 1
# BL 1 0
# MI 2 1
mask = df2.groupby(level=0).sum().gt(0).all()
>>> SampleData1[mask]
BL.DB BL.KB MI.RO MI.RA MI.XZ
0 0 1 1 1 0
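On the side question about iterrows(): the warning quoted in the question can be illustrated with a small sketch. The frame below is made up and mixes dtypes on purpose, so that each row handed out by iterrows() is a copy; assigning to it leaves the DataFrame untouched, whereas a boolean mask with .loc modifies it in place.

```python
import pandas as pd

# Hypothetical frame with mixed dtypes
df = pd.DataFrame({'name': ['x', 'y'], 'value': [0, 5]})

for _, row in df.iterrows():
    row['value'] = 100  # changes only the temporary row object

unchanged = df['value'].tolist()
print(unchanged)  # the original values are still there

# Vectorized alternative: select rows with a mask and assign through .loc
df.loc[df['value'].eq(0), 'value'] = -1
print(df)
```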
This is my pandas DataFrame with original column names.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
1 3 0 0
2 1 1 5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column per each unique cm. In this example there should be 2 new columns.
Finally in each new column I should store the total count of non-zero original column values, i.e.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
1 3 0 0 2 0
2 1 1 5 2 1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns
How to proceed with steps 2 and 3, so that the solution is automatic (I don't want to manually define column names cm1 and cm2, because in original data set I might have many cm variations.
You can use:
print(df)
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
0 1 3 0 0
1 2 1 1 5
First, filter the columns whose names contain the string cm, so that columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can change columns to new values like cm1, cm2, cm3.
print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
cm1 cm1 cm2
0 1 3 0
1 2 1 1
Now you can count the non-zero values: convert df1 to a boolean DataFrame and sum, since True is counted as 1 and False as 0. You need counts per unique column name, so group by columns and sum the values.
df1 = df1.astype(bool)
print(df1)
cm1 cm1 cm2
0 True True False
1 True True True
print(df1.groupby(df1.columns, axis=1).sum())
cm1 cm2
0 2 0
1 2 1
You need the unique column names, which will be added to the original df:
print(df1.columns.unique())
['cm1' 'cm2']
Finally, add the new columns (here df[['cm1','cm2']]) from the groupby result:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print(df)
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
0 1 3 0 0 2 0
1 2 1 1 5 2 1
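The steps above can also be sketched in Python 3 without grouping along axis=1. This is a sketch under assumptions: it takes the cm tag to be "cm" followed by digits, and the names tags and counts are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'old_dt_cm1_tt': [1, 2],
    'old_dm_cm1': [3, 1],
    'old_rr_cm2_epf': [0, 1],
    'old_gt': [0, 5],
})

# Extract the 'cm<digits>' tag from each column name; drop columns without one
tags = df.columns.to_series().str.extract(r'(cm\d+)', expand=False).dropna()

# Flag non-zero cells as 1, transpose so the tagged columns become rows,
# sum per tag, then transpose back to one column per tag
counts = df[tags.index].ne(0).astype(int).T.groupby(tags).sum().T

result = df.join(counts)
print(result)
```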
Once you know which columns have cm in them you can map them (with a dict) to the desired new column with an adapted version of this answer:
col_map = {c:'cm'+c[c.index('cm') + len('cm')] for c in ind}
# note: len('cm') is 2, so if you are hard-coding this you might as well use 2
so that instead of the string after cm it is cm and the character directly following, in this case it would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col,new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a!=0) for a in df[col]]
    else:
        df[new_col] += [int(a!=0) for a in df[col]]
Note that int(a!=0) simply gives 0 if the value is 0 and 1 otherwise. One caveat: before Python 3.7, dicts did not guarantee insertion order, so it may be preferable to add the new columns in sorted order of the target names:
import operator
for col,new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a!=0) for a in df[col]]
    else:
        df[new_col] = [int(a!=0) for a in df[col]]
to ensure the new columns are inserted in order.