I have a huge data set with columns like "Eas_1", "Eas_2", and so on up to "Eas_40", and "Nor_1" to "Nor_40". I want to automatically create multiple separate data sets, each consisting of all columns that end with the same number (grouped by the number in the column name), with that number pasted as values in a new column (Bin).
My data frame:
import pandas as pd

df = pd.DataFrame({
"Eas_1": [3, 4, 9, 1],
"Eas_2": [4, 5, 10, 2],
"Nor_1": [9, 7, 9, 2],
"Nor_2": [10, 8, 10, 3],
"Error_1": [2, 5, 1, 6],
"Error_2": [5, 0, 3, 2],
})
I don't know how to create the Bin column and paste the column name values into it, but I could separate the data sets manually like this:
df1 = df.filter(regex='_1')
df2 = df.filter(regex='_2')
This would take a lot of effort, plus I would have to change the script every time I get new data. This is how I imagine the end result:
df1 = pd.DataFrame({
"Eas_1": [3, 4, 9, 1],
"Nor_1": [9, 7, 9, 2],
"Error_1": [2, 5, 1, 6],
"Bin": [1, 1, 1, 1],
})
Thanks in advance!
You can extract the suffixes with .str.extract, then groupby on those:
suffixes = df.columns.str.extract(r'(\d+)$', expand=False)

for label, data in df.groupby(suffixes, axis=1):
    print('-'*10, label, '-'*10)
    print(data)
Note: to collect your dataframes, you can do:
dfs = [data for _, data in df.groupby(suffixes, axis=1)]
# access the second dataframe
dfs[1]
Output:
---------- 1 ----------
   Eas_1  Nor_1  Error_1
0      3      9        2
1      4      7        5
2      9      9        1
3      1      2        6
---------- 2 ----------
   Eas_2  Nor_2  Error_2
0      4     10        5
1      5      8        0
2     10     10        3
3      2      3        2
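To also get the Bin column from your expected output, here is a minimal sketch that builds on the groupby above (assumption: the extracted suffix is stored as an integer, and the axis=1 groupby from the answer is reused):

dfs = []
for label, data in df.groupby(suffixes, axis=1):
    data = data.copy()
    data['Bin'] = int(label)  # suffix taken from the column names
    dfs.append(data)
# dfs[0] matches the df1 from the question, with the Bin column added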
Given a DataFrame that represents instances of called customers:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]})
The data is ordered by time such that every customer is a time-series and every customer has different timestamps. Thus I need a column that consists of the ranked timepoints:
df_2 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5],
"call_nr" : [0,1,2,0,1,0,1,2,3,0,0,1]})
After trying different approaches I came up with this to create call_nr:
np.concatenate([np.arange(df_1["customer_id"].value_counts().loc[i]) for i in df_1["customer_id"].unique()])
It works, but I doubt this is best practice. Is there a better solution?
A simpler solution would be to groupby your 'customer_id' and use cumcount:
>>> df_1.groupby('customer_id').cumcount()
0     0
1     1
2     2
3     0
4     1
5     0
6     1
7     2
8     3
9     0
10    0
11    1
which you can assign back as a column in your dataframe.
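For example, a short sketch of assigning it back, using the call_nr column name from the question:

# number each customer's calls in order of appearance (0, 1, 2, ... per customer)
df_1['call_nr'] = df_1.groupby('customer_id').cumcount()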
I am stuck with an issue on a massive pandas table. I would like to get a boolean check for where 2 series cross.
df = pd.DataFrame({'A': [1, 2, 3, 4],
'B': [10, 1, 2, 8]})
I would like to add one column to my dataframe to get a result like this one:
df = pd.DataFrame({'A': [1, 2, 3, 4],
'B': [10, 1, 2, 8],
'C': [0, -1, 0, 1]
})
So basically I want to get:
0 when there is no cross between series B and A
-1 when series B crosses down below series A
1 when series B crosses up above series A
I need a vectorized calculation because my real table has more than one million rows.
Thank you
You can compute the relative position of the 2 columns with lt, then convert to integer and compute the diff:
m = df['A'].lt(df['B'])
df['C'] = m.astype(int).diff().fillna(0, downcast='infer')
output:
   A   B  C
0  1  10  0
1  2   1 -1
2  3   2  0
3  4   8  1
Visual of A/B (chart omitted).
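To spell out the steps, here is a commented sketch of the same idea on the example frame from the question (fillna(0).astype(int) is used in place of the downcast argument and gives the same integer result):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8]})

# True where B is above A, False where it is below
m = df['A'].lt(df['B'])

# a switch from True to False is a cross down (-1),
# a switch from False to True is a cross up (+1), no switch is 0
df['C'] = m.astype(int).diff().fillna(0).astype(int)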
I have a dataframe below:
import pandas as pd
d = {'id': [1, 2, 3, 4, 4, 6, 1, 8, 9], 'cluster': [7, 2, 3, 3, 3, 6, 7, 8, 8]}
df = pd.DataFrame(data=d)
df = df.sort_values('cluster')
I want to keep ALL the rows of a cluster if that cluster contains more than one distinct id, AND keep every row from that cluster even when the id repeats, since a different id appeared AT LEAST once within that cluster. The code I have been using to achieve this is below, but the only problem with it is that it drops too many rows for what I am looking for.
df = (df.assign(counts=df.count(axis=1))
.sort_values(['id', 'counts'])
.drop_duplicates(['id','cluster'], keep='last')
.drop('counts', axis=1))
The output dataframe I am expecting, which the code above does not produce, would drop the rows at dataframe indexes 1, 5, 0, and 6 but leave dataframe indexes 2, 3, 4, 7, and 8, essentially resulting in what the code below produces:
df = df.loc[[2, 3, 4, 7, 8]]
I have looked at many deduplication pandas posts on Stack Overflow but have yet to find this scenario. Any help would be greatly appreciated.
I think we can do this with a single boolean mask, using .groupby().nunique():
con1 = df.groupby('cluster')['id'].nunique() > 1
# of these we only want the True indexes:
cluster
2    False
3     True
6    False
7    False
8     True
df.loc[(df['cluster'].isin(con1[con1].index))]
   id  cluster
2   3        3
3   4        3
4   4        3
7   8        8
8   9        8
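Equivalently, a one-line sketch using groupby().transform('nunique'), which builds the boolean mask directly on the rows and skips the index lookup (an alternative to the code above, not part of it):

# keep rows whose cluster contains more than one distinct id
df[df.groupby('cluster')['id'].transform('nunique') > 1]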
I wanted to use column values in one csv file to mask rows in another csv, as in:
df6 = pd.read_csv('py_all1a.csv')    # file with multiple columns
df7 = pd.read_csv('artexclude1.csv') # file with multiple columns
#
# csv df6 col 1 has the same header and data type as col 8 in df7.
# I want to mask rows in df6 that have a matching col value to any
# in df7. The data in each column is a text value (single word).
#
mask = df6.iloc[:,1].isin(df7.iloc[:,8])
df6[~mask].to_csv('py_all1b.csv', index=False)
#
On that last line, I tried [mask] with the tilde, which resulted in no change to the df6 output file (py_all1b.csv), and without the tilde, which produced a file with just the column headers.
An answer using a specific data set is provided below; at first it did not work for me because there were inconsistencies between the text values, namely, one entry had a space while another did not. The answer below is correct, and I have added a paragraph to it showing how the whitespace issue can also be resolved.
Try converting to a set first:
mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))
This ensures your comparison is against values.
Example
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# 0 1 2
# 0 1 2 3
# 1 4 5 6
# 2 7 8 9
# 3 10 11 12
df2 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]])
# 0 1 2
# 0 1 2 3
# 1 1 2 3
# 2 1 2 3
# 3 1 2 3
mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))
df1[mask]
# 0 1 2
# 0 1 2 3
With strings
It still works:
df1 = pd.DataFrame([['a', 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df2 = pd.DataFrame([['a', 2, 3], ['a', 2, 3], ['a', 2, 3], ['a', 2, 3]])
mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))
df1[mask]
# 0 1 2
# 0 a 2 3
When you are dealing with string data, there may be problems with whitespace that can cause matches to be missed. As described in this answer, you may need to instead use:
df6 = pd.read_csv('py_all1a.csv', skipinitialspace=True) # file with multiple columns
df7 = pd.read_csv('artexclude1.csv', skipinitialspace=True) # file with multiple columns
mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))
df6[~mask].to_csv('py_all1b.csv', index=False)
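If the stray whitespace is not only after the delimiters, another option is to strip it explicitly before the membership test. A small sketch, assuming both columns hold strings; the frames and column names here are hypothetical stand-ins for column 1 of df6 and column 8 of df7:

import pandas as pd

df6 = pd.DataFrame({'keep': ['x', 'y'], 'word': ['apple ', 'pear']})
df7 = pd.DataFrame({'word': ['apple', 'banana']})

# strip leading/trailing whitespace on both sides before comparing
mask = df6['word'].str.strip().isin(set(df7['word'].str.strip()))
df6[~mask]  # keeps only the 'pear' row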
I'm trying to drop all columns from a df that start with any of a list of strings. I needed to copy these columns to their own dfs, and now want to drop them from a copy of the main df to make it easier to analyze.
df.columns = ["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"...]
I entered some code that gave me these dataframes with these columns:
aaa.columns = ["AAA1234", "AAA5678"]
bbb.columns = ["BBB1234", "BBB5678"]
I did get the final df that I wanted, but my code felt rather clunky:
droplist_cols = [aaa, bbb]
droplist = []
for x in droplist_cols:
    for col in x.columns:
        droplist.append(col)

df1 = df.drop(labels=droplist, axis=1)
Columns of final df:
df1.columns = ["CCC123", "DDD123"...]
Is there a better way to do this?
--Edit for sample data--
df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 3, 4, 2, 1], [4, 6, 9, 8, 3], [1, 3, 4, 2, 1], [3, 2, 5, 7, 1]], columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123"])
Desired result:
   CCC123
0       5
1       1
2       3
3       1
4       1
IIUC (if I understand correctly).
Let's begin with a dataframe:
df = pd.DataFrame({"A": [0]})
Modify the dataframe to include your columns:
df2 = df.reindex(columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"], fill_value=0)
Drop all columns starting with A:
df3 = df2.loc[:, ~df2.columns.str.startswith('A')]
If you need to drop, say, columns starting with A OR B, I would use:
df3 = df2.loc[:, ~(df2.columns.str.startswith('A') | df2.columns.str.startswith('B'))]
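Note that in recent pandas versions Index.str.startswith also accepts a tuple of prefixes, so a sketch covering any list of prefixes on the sample frame from the question could be:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 3, 4, 2, 1], [4, 6, 9, 8, 3],
                   [1, 3, 4, 2, 1], [3, 2, 5, 7, 1]],
                  columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123"])

# prefixes of the columns that were copied out into their own dataframes
prefixes = ("AAA", "BBB")

# drop every column whose name starts with any of the prefixes
df1 = df.loc[:, ~df.columns.str.startswith(prefixes)]
# df1 now contains only CCC123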