I have a pandas DataFrame called "bag" with columns called beans1, beans2, and beans3:
bag = pd.DataFrame({'beans1': [3,1,2,5,6,7], 'beans2': [2,2,1,1,5,6], 'beans3': [1,1,1,3,3,2]})
bag
Out[50]:
beans1 beans2 beans3
0 3 2 1
1 1 2 1
2 2 1 1
3 5 1 3
4 6 5 3
5 7 6 2
I want to use a loop to subset each column with observations greater than 1, so that I get:
beans1
0 3
2 2
3 5
4 6
5 7
beans2
0 2
1 2
4 5
5 6
beans3
3 3
4 3
5 2
The way to do it manually is:
beans1 = bag.loc[bag['beans1'] > 1, ['beans1']]
beans2 = bag.loc[bag['beans2'] > 1, ['beans2']]
beans3 = bag.loc[bag['beans3'] > 1, ['beans3']]
But I need to employ a loop, with something like:
for i in range(1,4):
beans+str(i).loc[beans.loc[bag['beans'+i]>1,['beans'+str(i)]]
But it didn't work. I need a Python equivalent of R's eval(parse(text="")).
Any help appreciated. Thanks much!
It is possible, but not recommended, with globals:
for i in range(1, 4):
    globals()['beans' + str(i)] = bag.loc[bag['beans' + str(i)] > 1, ['beans' + str(i)]]
# or, looping over the column names directly:
for c in bag.columns:
    globals()[c] = bag.loc[bag[c] > 1, [c]]
print(beans1)
beans1
0 3
2 2
3 5
4 6
5 7
A better approach is to create a dictionary of DataFrames, keyed by column name:
d = {c: bag.loc[bag[c] > 1, [c]] for c in bag}
print(d['beans1'])
beans1
0 3
2 2
3 5
4 6
5 7
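As a minimal, self-contained sketch of the dictionary approach above (the name `subsets` is only illustrative; the data is the `bag` frame from the question):

```python
import pandas as pd

bag = pd.DataFrame({
    'beans1': [3, 1, 2, 5, 6, 7],
    'beans2': [2, 2, 1, 1, 5, 6],
    'beans3': [1, 1, 1, 3, 3, 2],
})

# Iterating over a DataFrame yields its column labels.
# For each column, keep only the rows whose value is greater than 1,
# and store the result as a single-column DataFrame under that label.
subsets = {col: bag.loc[bag[col] > 1, [col]] for col in bag}

# Each subset is looked up by name instead of living in its own variable.
for name, frame in subsets.items():
    print(name, frame.index.tolist())
```

Keeping the subsets in one dictionary means later code can loop over them or look one up by a computed name, without touching globals().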
I have multiple CSV files which I merged together. To identify the data from each individual CSV in the merged file, I want to create a new column in pandas called Serial.
The Serial column should be numbered on the basis of the data in the Sequence Number column (for example 1,1,1,..., 2,2,2,..., 3,3,3,..., with the number increasing for every new CSV). I have also attached a snapshot of the CSV file.
Sequence Number
1
2
3
4
5
1
2
1
2
3
4
I want output like this:
Serial Sequence Number
1 1
1 2
1 3
1 4
1 5
2 1
2 2
3 1
3 2
3 3
3 4
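Use DataFrame.insert to add the new column in the first position. Build its values by comparing Sequence Number with 1 using Series.eq (==), which gives a boolean mask marking the start of each file, and then take the cumulative sum of that mask with Series.cumsum:
df.insert(0, 'Serial', df['Sequence Number'].eq(1).cumsum())
print(df)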
Serial Sequence Number
0 1 1
1 1 2
2 1 3
3 1 4
4 1 5
5 2 1
6 2 2
7 3 1
8 3 2
9 3 3
10 3 4
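To make the two steps visible, here is a small sketch on the question's data that keeps the intermediate mask in its own variable (`is_start` and `serial` are illustrative names). It assumes, as in the question, that every file's Sequence Number restarts at 1:

```python
import pandas as pd

df = pd.DataFrame({'Sequence Number': [1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4]})

# True on every row where the sequence restarts at 1 (start of a new file).
is_start = df['Sequence Number'].eq(1)

# Summing booleans cumulatively counts how many starts have been seen so far,
# which is exactly the running file number.
serial = is_start.cumsum()

# Put the new column in position 0, ahead of 'Sequence Number'.
df.insert(0, 'Serial', serial)
print(df)
```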
I'm pretty new to Python.
I am trying to concatenate two DataFrames (df1, df2) so that a row from the second is added to the first only if it does not already exist there.
I don't want to use .concat().drop_duplicates() because I don't want duplicates within the same DataFrame to be removed.
Backstory:
I have multiple CSV files that are exported from software in different locations, and once in a while I want to merge these into one file. The problem is that each exported file contains the same data as before along with the new records made within that period of time. Therefore I need to check whether a record is already in there, as I will be executing the same code each time I export the data.
for the sake of example:
import pandas as pd
main_df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[2,5,4,5],[9,8,7,6],[8,5,6,7]])
df1 = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[1,5,4,8],[7,3,5,7],[4,3,8,5],[4,3,8,5]])
main_df
0 1 2 3
0 1 2 3 4 --duplicates I want to include--
1 1 2 3 4 --duplicates I want to include--
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
df1
0 1 2 3
0 1 2 3 4 --duplicates I want to exclude--
1 1 2 3 4 --duplicates I want to exclude--
2 4 2 5 1 --duplicates I want to exclude--
3 2 4 1 5 --duplicates I want to exclude--
4 1 5 4 8
5 7 3 5 7
6 4 3 8 5 --duplicates I want to include--
7 4 3 8 5 --duplicates I want to include--
I need the end result to be
main_df (after code execution)
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
7 1 5 4 8
8 7 3 5 7
9 4 3 8 5
10 4 3 8 5
I hope I have explained my issue in a clear way. Thank you
Check for every row in df1 whether it exists in main_df using pandas apply, and turn that into a mask by negating it with the ~ operator. I like using functools partial to make explicit that we are comparing to main_df.
import pandas as pd
from functools import partial
main_df = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[2,5,4,5],
[9,8,7,6],
[8,5,6,7]
])
df1 = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[1,5,4,8],
[7,3,5,7],
[4,3,8,5],
[4,3,8,5]
])
def has_row(df, row):
    return (df == row).all(axis = 1).any()
main_df_has_row = partial(has_row, main_df)
duplicate_rows = df1.apply(main_df_has_row, axis = 1)
df1_add = df1.loc[~duplicate_rows]
main_df = pd.concat([main_df, df1_add], ignore_index=True)
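For large frames the row-wise apply can be slow. A possible alternative, which is a different technique from the answer above, is a left merge with indicator=True. This is only a sketch on the question's data; `lookup`, `merged`, `new_rows`, and `result` are illustrative names:

```python
import pandas as pd

main_df = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4], [4, 2, 5, 1], [2, 4, 1, 5],
                        [2, 5, 4, 5], [9, 8, 7, 6], [8, 5, 6, 7]])
df1 = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4], [4, 2, 5, 1], [2, 4, 1, 5],
                    [1, 5, 4, 8], [7, 3, 5, 7], [4, 3, 8, 5], [4, 3, 8, 5]])

# Deduplicate main_df only for the lookup, so that each df1 row matches
# at most one lookup row and is not multiplied by the merge.
lookup = main_df.drop_duplicates()

# Left merge on all shared columns; the '_merge' column records whether
# each df1 row was also found in the lookup ('both') or not ('left_only').
merged = df1.merge(lookup, how='left', indicator=True)

# Keep only the rows that are new, including their duplicates within df1.
new_rows = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# main_df itself is untouched, so its own duplicates are preserved.
result = pd.concat([main_df, new_rows], ignore_index=True)
print(result)
```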
I have a DataFrame with 2.3 million rows and 33 columns. As I always do when selecting columns to keep, I use
colsToKeep = ['A','B','C','D','E','F','G','H','I']
df = df[colsToKeep]
However, this time the data under these columns becomes completely jumbled up on running the code. Entries for column A might end up under column D, for example, seemingly at random.
Has anybody experienced this kind of behavior before? There is nothing out of the ordinary about the data and the df is totally fine before running these lines. Code run before problem begins:
with open('file.dat','r') as f:
    df = pd.DataFrame(l.rstrip().split() for l in f)
#rename columns with the first row
df.columns = df.iloc[0]
#drop first row which is now duplicated
df = df.iloc[1:]
#remove all the NaN columns that appeared (leaving 33 columns)
df = df.loc[:,df.columns.notnull()]
colsToKeep = ['A','B','C','D','E','F','G','H','I']
df = df[colsToKeep]
Data suddenly goes from being nicely formatted such as:
A B C D E F G H I
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
to something more random like:
A B C D E F G H I
7 9 3 4 5 1 2 8 6
3 2 9 2 1 6 7 8 4
2 1 3 6 5 4 7 9 8
I have multiple DataFrames I need to compare. However, because of the way the data was gathered, one DataFrame has 25 columns and another has 20. Keep in mind that the column labels are the same (the 20 columns exist in the 25-column DataFrame).
I can't figure out how to remove the columns from df_cont that don't exist in df_red, while also not including the columns of df_red that are not in df_cont.
df_cont A B C D E F
01-01-2019 1 2 3 4 5 5
02-01-2019 1 3 4 4 6 5
df_red A B D F G
01-01-2019 2 5 6 4 3
02-01-2019 2 5 6 4 3
Code:
df_cont1 = df_cont.query(df_cont.columns == df_red.columns)
Expected:
df_cont1 A B D F
01-01-2019 1 2 4 5
02-01-2019 1 3 4 5
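As @busybear already stated, you can use
df_cont = df_cont[df_red.columns]
in your special case, where every column of df_red also exists in df_cont (a label that is missing from df_cont raises a KeyError).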
This alternative solution is a bit safer if you don't know which DataFrame is the bigger one:
df_cont[df_cont.columns.intersection(df_red.columns)]
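A short sketch of the intersection approach on the question's sample data (`idx`, `common`, and `df_red1` are illustrative names, and the dates are kept as plain string labels):

```python
import pandas as pd

idx = ['01-01-2019', '02-01-2019']
df_cont = pd.DataFrame({'A': [1, 1], 'B': [2, 3], 'C': [3, 4],
                        'D': [4, 4], 'E': [5, 6], 'F': [5, 5]}, index=idx)
df_red = pd.DataFrame({'A': [2, 2], 'B': [5, 5], 'D': [6, 6],
                       'F': [4, 4], 'G': [3, 3]}, index=idx)

# Labels present in both frames; 'C' and 'E' (only in df_cont) and
# 'G' (only in df_red) are left out.
common = df_cont.columns.intersection(df_red.columns)

# Selecting with the shared labels works on either frame without a KeyError.
df_cont1 = df_cont[common]
df_red1 = df_red[common]
print(df_cont1)
```

Because `common` contains only labels that exist in both frames, the same selection can be applied to either one, whichever happens to be wider.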