Why do I get NaN values when adding values to the b column but not to the a column?
This is the code:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
#extract all rows where a is present in the grps column
#for each 'a' row, create an entry in newdf's 'a' column from the corresponding value in that row
newdf = pd.DataFrame()
newdf['a'] = df[df["grps"] == 'a']['vals']
#print(df[df["grps"] == 'b']['vals'])
newdf['b']=df[df["grps"] == 'b']['vals']
print(newdf)
This is the output:
a b
0 12 NaN
1 345 NaN
2 3 NaN
6 4 NaN
7 52 NaN
Try the following:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
#extract all rows where a is present in the grps column
#for each 'a' row, create an entry in newdf's 'a' column from the corresponding value in that row
newdf = pd.DataFrame()
newdf['a'] = df[df["grps"] == 'a']['vals'].values
#print(df[df["grps"] == 'b']['vals'])
newdf['b']=df[df["grps"] == 'b']['vals'].values
print(newdf)
The problem is that df[df["grps"] == 'b']['vals'] is a pd.Series, i.e. an index array plus a value array, and newdf already has an index that it inherited from the previous line newdf['a'] = df[df["grps"] == 'a']['vals'] (the labels of the 'a' rows: 0, 1, 2, 6, 7). When you then assign the 'b' Series, pandas aligns it on that existing index; the 'b' rows carry different labels (3, 4, 8, 12, 13), so nothing matches and every cell ends up NaN.
By adding the .values attribute you assign only the value array, so no alignment happens and newdf keeps a default index, which here is just [0, 1, 2, 3, 4].
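As a minimal alternative sketch of the same fix (assuming, as here, that the 'a' and 'b' groups have the same number of rows), you can reset each filtered Series' index before assigning it, so the columns line up positionally:
import pandas as pd

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})

newdf = pd.DataFrame()
# reset_index(drop=True) gives each filtered Series a fresh 0..n-1 index,
# so the two columns align row by row instead of by the original row labels
newdf['a'] = df[df["grps"] == 'a']['vals'].reset_index(drop=True)
newdf['b'] = df[df["grps"] == 'b']['vals'].reset_index(drop=True)
print(newdf)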
I have a large collection of CSVs, each containing
an "id" column of values, and
a "neighbors" column with lists of strings.
Values from id do not occur in any neighbors lists. Values in id are unique per CSV, no single CSV contains all the id values, and if two CSVs share an id value, then the rows refer to the same object.
I would like to create something akin to a bipartite adjacency matrix from these CSVs, with a row for each id value i, a column for each neighbors string j, and a 1 in cell (i, j) if any of the CSVs contains a row with id value i whose neighbors list includes the string j.
The code below does what it's supposed to, but it takes very long.
Is there a more efficient way of getting the job done?
import pandas as pd
adjacency_matrix = pd.DataFrame()
list_of_csv_files = [csv_file1, csv_file2, ...]
for file in list_of_csv_files:
    df = pd.read_csv(file)
    df.set_index('id', inplace=True)
    for i in df.index:
        for j in df.at[i, 'neighbors']:
            adjacency_matrix.at[i, j] = 1
Example:
Given that the CSVs are loaded into the list list_of_dataframes = [df1, df2], with
df1 = pd.DataFrame(data={'id': ['11', '12'], 'neighbors': [['a'], ['a', 'b']]})
id neighbors
0 11 [a]
1 12 [a, b]
df2 = pd.DataFrame(data={'id': ['11', '13'], 'neighbors': [['c'], ['d']]})
id neighbors
0 11 [c]
1 13 [d]
I seek the dataframe
a b c d
11 1 NaN 1 NaN
12 1 1 NaN NaN
13 NaN NaN NaN 1
Yes, there is a cleaner and more efficient way using concat, explode and get_dummies. You can try this:
import pandas as pd
list_of_csv_files = [csv_file1, csv_file2, ...]
list_of_dfs = [pd.read_csv(file) for file in list_of_csv_files]
out = pd.concat(list_of_dfs)
out = out.explode('neighbors').drop_duplicates(ignore_index=True)
out.set_index('id', inplace=True)
out = pd.get_dummies(out, prefix='', prefix_sep='')
out = out.groupby('id').sum()
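As a quick check against the example data (built in memory here rather than read from files), the same pipeline yields the expected matrix, except that missing pairs come out as 0 rather than NaN:
import pandas as pd

df1 = pd.DataFrame({'id': ['11', '12'], 'neighbors': [['a'], ['a', 'b']]})
df2 = pd.DataFrame({'id': ['11', '13'], 'neighbors': [['c'], ['d']]})

out = pd.concat([df1, df2])
out = out.explode('neighbors').drop_duplicates(ignore_index=True)
out.set_index('id', inplace=True)
out = pd.get_dummies(out, prefix='', prefix_sep='')
out = out.groupby('id').sum()
print(out)
#     a  b  c  d
# id
# 11  1  0  1  0
# 12  1  1  0  0
# 13  0  0  0  1
If you need NaN instead of 0 to match the desired output exactly, out.where(out != 0) converts the zeros afterwards.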
I have a dataset where I need to match values in column A and fetch the corresponding value from the next column, B.
For example, I have to check whether 1 appears in column A; if it does, print "First Page".
Similarly, every value in column A has to be matched against some value, say X, and if it matches, the corresponding value in column B should be printed.
Example:
By using df.iloc you can get the row or column you want by index.
By using a boolean mask you can filter the data frame down to the row you want (where the first column equals some value) and take the value in the second column with .iloc[0, 1].
import pandas as pd
d = {'col1': [1, 2,3,4], 'col2': [4,3,2,1]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 4
1 2 3
2 3 2
3 4 1
# a is the value in the first column and df is the data frame
def a2b(a, df):
    return df[df.iloc[:, 0] == a].iloc[0, 1]
a2b(2,df)
returns 3
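One caveat with the helper above: if a has no match in the first column, df[df.iloc[:,0]==a] is empty and .iloc[0,1] raises an IndexError. A slightly more defensive sketch (the default argument and the name a2b_safe are my additions, not part of the original answer):
def a2b_safe(a, df, default=None):
    # boolean mask over the first column, keep only the matching rows
    matches = df[df.iloc[:, 0] == a]
    if matches.empty:              # no row has this value in the first column
        return default
    return matches.iloc[0, 1]      # second-column value of the first match

a2b_safe(2, df)    # -> 3
a2b_safe(99, df)   # -> None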
My dataframe looks like this:
And I need to drop the first 4 rows because they have NaN as a value in the first column. Since I'll have to do this to slightly different dataframes I can't just drop them by index.
To achieve this I thought of iterating over the df by rows, checking if the value is NaN using numpy's isnan function and then dropping the row - sadly it doesn't seem to work.
first_col = df.columns[0]
for i, row in df.iterrows():
    if np.isnan(row[first_col]):
        df.drop(i, axis=0, inplace=True)
    else:
        break
isnan does not work though.
So I tried replacing NaN values with a blank string df.fillna("", inplace=True) and replaced the if condition:
first_col = df.columns[0]
for i, row in df.iterrows():
    if row[first_col] == '':
        df.drop(i, inplace=True, axis=0)
    else:
        break
This works, but it's pretty ugly alright. Is there a faster/neater way to achieve this?
I can't replicate your full dataset because of the way you posted it, but you can do this:
Assume a df (which is similar to your first column):
num.ord.tariffa
0 NaN
1 NaN
2 NaN
3 NaN
4 5
5 6
6 7
Use .loc and argmax():
new_df = df.loc[df.notnull().all(axis=1).argmax():]
and get back:
num.ord.tariffa
4 5
5 6
6 7
This removes the leading NaN rows up to the first non-NaN row, which is your desired result.
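A closely related sketch (an alternative I am adding, not part of the answer above) uses first_valid_index(), which is label-based and only inspects the first column, matching the question; it assumes that column has at least one non-NaN value:
# keep everything from the first row whose first column is not NaN
first_col = df.columns[0]
new_df = df.loc[df[first_col].first_valid_index():]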
You may try this:
df['num.ord.tariffa'] = df['num.ord.tariffa'].fillna('Remove')
newdf = df[df['num.ord.tariffa'] != 'Remove']
EDIT:
final = pd.DataFrame()
n = 4
for index, row in df.iterrows():
    if index < n:
        if pd.isna(row['c1']):   # skip rows with NaN in c1 (comparing with == np.nan is always False)
            pass
        else:
            new = pd.DataFrame([[row['c1'], row['c2']]], columns=['c1', 'c2'])
            final = final.append(new)
    else:
        new = pd.DataFrame([[row['c1'], row['c2']]], columns=['c1', 'c2'])
        final = final.append(new)
You can drop rows with NaN values, passing subset to restrict the check to the column you are interested in. Note that this removes every row with a NaN in that column, not only the leading block:
df = df.dropna(subset=['num.ord.tariffa'])
I have two DataFrames and want to remove rows in df1 whose value in column 'a' also appears in df2. Moreover, each common value in df2 should remove only one row.
df1 = pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})
df2 = pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})
result = pd.DataFrame({'a':[1,1,3,4],'b':[1,2,4,6],'c':[6,5,3,1]})
Use Series.isin + Series.duplicated to create a boolean mask and use this mask to filter the rows from df1:
m = df1['a'].isin(df2['a']) & ~df1['a'].duplicated()
df = df1[~m]
Result:
print(df)
a b c
0 1 1 6
1 1 2 5
3 3 4 3
5 4 6 1
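If df2 could contain the same value more often than df1, and each occurrence should cancel a separate row, a more general sketch (my own variation, not the answer above) pairs every value with its occurrence number via groupby().cumcount():
# number repeated values within each group: 0 for the first occurrence, 1 for the second, ...
df1_keys = pd.Series(list(zip(df1['a'], df1.groupby('a').cumcount())), index=df1.index)
df2_keys = list(zip(df2['a'], df2.groupby('a').cumcount()))

# drop a df1 row only when the same (value, occurrence) pair exists in df2
result = df1[~df1_keys.isin(df2_keys)]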
Try This:
import pandas as pd
df1=pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})
df2=pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})
df2a = df2['a'].tolist()
def remove_df2_dup(x):
    if x in df2a:
        df2a.remove(x)
        return False
    return True
df1[df1.a.apply(remove_df2_dup)]
It creates a list from df2['a'], then checks that list against each value of df1['a'], removing a value from the list each time there's a match in df1.
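The list.remove call makes this quadratic for large frames; a sketch of the same idea with collections.Counter (my variation, not the original answer) keeps each per-row check constant time. Note the counter is consumed, so rebuild it before re-running:
from collections import Counter

counts = Counter(df2['a'])

def keep_row(x):
    # drop the row while there is still an unmatched occurrence of x in df2
    if counts[x] > 0:
        counts[x] -= 1
        return False
    return True

df1[df1['a'].apply(keep_row)]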
Try this:
df1=pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})
df2=pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})
for x in df2.a:
    if x in df1.a.values:
        df1.drop(df1[df1.a == x].index[0], inplace=True)
print(df1)
I have a pandas dataframe that looks like this:
I would like to iterate through column 3 and if an element exists, add a new row to the dataframe, using the value in column 3 as the new value in column 2, while also using the data in columns 0 and 1 from the row where it was found as the values for columns 0 and 1 in the newly added row:
Here, row 2 is the newly added row. The values in columns 0 and 1 in this row come from the row where "D" was found, and now column 2 of the new row contains the value from column 3 in the first row, "D".
Here is one way to do it, but surely there must be a more general solution, especially if I wish to scan more than a single column:
a = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
b = a.copy()
for tu in a.itertuples(index=False):  # Iterate by row
    if pd.notnull(tu[3]):  # If the value exists (is not NaN)
        b = b.append([[tu[0], tu[1], tu[3]]], ignore_index=True)  # Append a new row using the relevant tuple elements
You can do this without any loops by creating a new df with the columns you want and appending it to the original.
import pandas as pd
import numpy as np
df = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
ndf = df[pd.notnull(df[3])][[0,1,3]]
ndf.columns = [0,1,2]
df = df.append(ndf, ignore_index=True)
This will leave NaN for the new missing values, which you can then change to None.
df[3] = df[3].where((pd.notnull(df[3])), None)
prints
0 1 2 3
0 A B C D
1 1 2 C None
2 A B D None
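On pandas 2.x, where DataFrame.append has been removed, the same approach can be sketched with pd.concat instead:
import pandas as pd

df = pd.DataFrame([['A', 'B', 'C', 'D'], [1, 2, 'C']])

# rows where column 3 is present, keeping columns 0, 1 and 3
ndf = df[pd.notnull(df[3])][[0, 1, 3]]
ndf.columns = [0, 1, 2]                       # the column-3 value becomes column 2
df = pd.concat([df, ndf], ignore_index=True)  # replaces the removed df.append
df[3] = df[3].where(pd.notnull(df[3]), None)
print(df)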
This may be a bit more general (assuming your columns are integers and that you are always looking to fill the previous columns in this pattern)
import pandas as pd
def append_rows(scan_row, scanned_dataframe):
    new_df = pd.DataFrame()
    for i, row in scanned_dataframe.iterrows():
        if pd.notnull(row[scan_row]):
            new_row = [row[c] for c in range(scan_row - 1)]
            new_row.append(row[scan_row])
            print(new_row)
            new_df = new_df.append([new_row], ignore_index=True)
    return new_df
a = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
b = a.copy()
b = b.append(append_rows(3,a))