Combining two columns with related data into a single column (python, pandas)

I am looking for the correct logic to combine two columns with related data from an .xlsx file using pandas in Python. It is similar to the post Merge 2 columns in pandas into one columns that have data in python, except that I also want to transform the data as I combine the columns, so it's not really a true merge of the two. I want to be able to say "if column wbc_na has the value "checked" in row x, place "Not available" in row x under column wbc". Once combined, I want to drop the column "wbc_na", since "wbc" now contains all the information I need. For example:
input:
ID,wbc, wbc_na
1,9.0,-
2,NaN,checked
3,10.2,-
4,8.8,-
5,0,checked
output:
ID,wbc
1,9.0
2,Not available
3,10.2
4,8.8
5,Not available
Thanks for your suggestions.

You can use loc to find the rows where column 'wbc_na' is 'checked' and assign 'Not available' to column 'wbc' for those rows:
In [18]:
df.loc[df['wbc_na'] == 'checked', 'wbc'] = 'Not available'
df
Out[18]:
ID wbc wbc_na
0 1 9 -
1 2 Not available checked
2 3 10.2 -
3 4 8.8 -
4 5 Not available checked
[5 rows x 3 columns]
In [19]:
# now drop the extra column
df.drop(labels='wbc_na', axis=1, inplace=True)
df
Out[19]:
ID wbc
0 1 9
1 2 Not available
2 3 10.2
3 4 8.8
4 5 Not available
[5 rows x 2 columns]

You could also use a list comprehension to reassign the values in column wbc:
import numpy as np
import pandas as pd
data = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'wbc': [9, np.nan, 10, 8, 0], 'wbc_nan': ['-', 'checked', '-', '-', 'checked']})
data['wbc'] = [(item if data['wbc_nan'][x] != 'checked' else 'Not available') for x, item in enumerate(data['wbc'])]
data = data.drop('wbc_nan', axis=1)
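For reference, here is a self-contained sketch that ties the loc approach together end to end (the file name data.xlsx and the use of pandas.read_excel are assumptions for illustration; the question only says the data comes from an .xlsx file):
import pandas as pd
# Read the spreadsheet (file name assumed for illustration)
df = pd.read_excel('data.xlsx')
# Where 'wbc_na' is 'checked', overwrite 'wbc' with 'Not available'
df.loc[df['wbc_na'] == 'checked', 'wbc'] = 'Not available'
# 'wbc' now carries all the information, so drop the helper column
df = df.drop(columns='wbc_na')
print(df)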

Related

How to insert a column at the beginning of csv file using python

How can I append a column with all 1s at the beginning of the CSV file?
Original:
y z
1 5
2 6
3 7
Required:
x y z
1 1 5
1 2 6
1 3 7
If you're using a library like pandas,
df.insert(0, "col1", [100, 100], allow_duplicates=True)
0 is the position at which the column is inserted, col1 is the name of that column, and the list holds the values you're inserting.
As referenced here:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html
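Applied to this question's data, a minimal sketch might look like the following (the file name input.csv is an assumption; note that insert also accepts a scalar, which is broadcast to every row):
import pandas as pd
df = pd.read_csv('input.csv')   # columns y, z
df.insert(0, 'x', 1)            # insert 'x' at position 0, filled with 1s
df.to_csv('output.csv', index=False)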
pandas is a great library for reading CSV files and manipulating their data. Read the data, add the x column filled with 1s, then write the content back out:
import pandas as pd
df = pd.read_csv("export.csv")
df['x'] = 1
df = df[['x', 'y', 'z']] # reorder cols
df.to_csv("export.csv", index=False)

Python pandas - select by row

I am trying to select rows in a pandas data frame based on its values matching those of another data frame. Crucially, I only want to match values in rows, not throughout the whole series. For example:
df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
df2 = pd.DataFrame({'a':[3, 2, 1], 'b':[4, 5, 6]})
I want to select rows where both 'a' and 'b' values from df1 match any row in df2. I have tried:
df1[(df1['a'].isin(df2['a'])) & (df1['b'].isin(df2['b']))]
This of course returns all rows, as all the values are present in df2 at some point, but not necessarily in the same row. How can I limit this so the values tested for 'b' are only those rows where the value 'a' was found? So with the example above, I am expecting only row index 1 ([2, 5]) to be returned.
Note that data frames may be of different shapes, and contain multiple matching rows.
Similar to this post, here's one using broadcasting -
df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
The idea is:
1) Use np.all for the "both" part in "both 'a' and 'b' values".
2) Use np.any for the "any" part in "from df1 match any row in df2".
3) Use broadcasting to do all of this in a vectorized fashion, extending dimensions with None/np.newaxis.
Sample run -
In [41]: df1
Out[41]:
a b
0 1 4
1 2 5
2 3 6
In [42]: df2 # Modified to add another row : [1,4] for variety
Out[42]:
a b
0 3 4
1 2 5
2 1 6
3 1 4
In [43]: df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
Out[43]:
a b
0 1 4
1 2 5
Use numpy broadcasting:
pd.DataFrame((df1.values[:, None] == df2.values).all(2),
             pd.Index(df1.index, name='df1'),
             pd.Index(df2.index, name='df2'))
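This builds a boolean alignment matrix whose entry (i, j) is True when row i of df1 equals row j of df2. To actually filter df1 with it, you could reduce the matrix along the df2 axis, for example (a small follow-up sketch, not part of the original answer):
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [3, 2, 1], 'b': [4, 5, 6]})
# (i, j) is True when df1 row i equals df2 row j
match = pd.DataFrame((df1.values[:, None] == df2.values).all(2),
                     pd.Index(df1.index, name='df1'),
                     pd.Index(df2.index, name='df2'))
# Keep a df1 row if it matches any row of df2
print(df1[match.any(axis=1).values])   # only row index 1 ([2, 5])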

Find mean of the grouped rows of pandas dataframe

I am at a very basic level of Python and I am stuck with a problem; can someone help me out?
I have a large pandas dataframe, and I want to find rows and take their mean if the first column of each row has a similar value (e.g. some integer separated by '_' from another integer).
I tried to use .split to match the first number; it works for a single row, but if I iterate over rows it throws an error.
my data frame looks like:
d = {'ID' : pd.Series(['1_1', '2_1', '1_2', '2_2' ], index=['0','1','2', '3']),
'one' : pd.Series([2.5, 2, 3.5, 2.5], index=['0','1', '2', '3']),
'two' : pd.Series([1, 2, 3, 4], index=['0', '1', '2', '3'])}
df2 = pd.DataFrame(d)
requirement:
mean of the rows that have the same ID at the first position after the split, e.g. the mean of 1_1 and 1_2, and of 2_1 and 2_2
output:
ID one two
0 1 3 2
1 2 2.25 3
Here is my code.
Working version:
((df2.ix[0,0]).split('_'))[0]
Failing version:
for i in df2.iterrows():
    df2[df2.columns[((df2.ix[0,0]).split('_'))[0] == ((df2.ix[0,0]).split('_'))[0]]]
Looking forward to a reply.
Thanks in advance.
You could create a new column holding only the first number of your ID column using the str methods (http://pandas.pydata.org/pandas-docs/stable/text.html#splitting-and-replacing-strings) and then use the groupby method:
df['groupedID'] = df.ID.str.split('_').str.get(0)
In [347]: df
Out[347]:
ID one two groupedID
0 10_1 2.5 1 10
1 2_1 2.0 2 2
2 10_2 3.5 3 10
3 2_2 2.5 4 2
df1 = df.groupby('groupedID').mean()
In [349]: df1
Out[349]:
one two
groupedID
10 3.00 2
2 2.25 3
If you need to change name of the index back to 'ID':
df1.index.name = 'ID'
In [351]: df1
Out[351]:
one two
ID
10 3.00 2
2 2.25 3
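Applied directly to the question's df2, a compact version might look like this (a sketch, not part of the original answer; as_index=False keeps the grouping key as a regular ID column, matching the desired output):
import pandas as pd
d = {'ID': ['1_1', '2_1', '1_2', '2_2'],
     'one': [2.5, 2, 3.5, 2.5],
     'two': [1, 2, 3, 4]}
df2 = pd.DataFrame(d)
out = (df2.assign(ID=df2['ID'].str.split('_').str.get(0))
          .groupby('ID', as_index=False)
          .mean())
print(out)
#   ID   one  two
# 0  1  3.00  2.0
# 1  2  2.25  3.0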

How to delete columns from a dataframe with columns with the same label?

I have a dataframe where some column labels occur multiple times (i.e., some columns have the same label). This is causing me problems -- I may post more about this separately, because some of the behavior seems a little strange, but here I just wanted to ask about deleting some of these columns. That is, for each column label that occurs multiple times, I would like to delete all but the first column it heads. Here's an example:
In [5]: arr = np.array([[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]])
In [6]: df = pd.DataFrame(data=arr, columns=['A', 'C', 'E', 'A'])
In [7]: df
Out[7]:
A C E A
0 0 1 2 3
1 4 5 6 7
If I drop columns using the label, all columns headed by that label are dropped:
In [9]: df.drop('A', axis=1)
Out[9]:
C E
0 1 2
1 5 6
So I thought I'd try dropping by the column index, but that also deletes all the columns headed by that label:
In [12]: df.drop(df.columns[3], axis=1)
Out[12]:
C E
0 1 2
1 5 6
How can I do what I want, that is, for each such label, delete all but one of the columns? For the above example, I'd want to end up with:
A C E
0 0 1 2
1 4 5 6
For now I've relabeled the columns, as follows:
columns = {}
new_columns = []
duplicate_num = 0
for n in df.columns:
    if n in columns:
        new_columns.append("duplicate%d" % (duplicate_num))
        duplicate_num += 1
    else:
        columns[n] = 1
        new_columns.append(n)
df.columns = new_columns
This works fine for my needs, but it doesn't seem like the best/cleanest solution. Thanks.
Edit: I don't see how this is a duplicate of the other question. For one thing, that deals with duplicate columns, not duplicate column labels. For another, the suggested solution there involved transposing the dataframe (twice), but as mentioned there, transposing large dataframes is inefficient, and in fact I am dealing with large dataframes.
In [18]:
df.loc[:, ~df.columns.duplicated()]
Out[18]:
A C E
0 0 1 2
1 4 5 6
Explanation
In [19]:
~df.columns.duplicated()
Out[19]:
array([ True, True, True, False], dtype=bool)
As you can see, you first need to check whether each column name is duplicated or not; notice the ~ added at the beginning, which inverts the boolean mask.
You can then slice the columns using the non-duplicated positions.
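For completeness, a self-contained, runnable sketch of the same approach:
import numpy as np
import pandas as pd
arr = np.array([[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]])
df = pd.DataFrame(data=arr, columns=['A', 'C', 'E', 'A'])
# Keep only the first occurrence of each column label
deduped = df.loc[:, ~df.columns.duplicated()]
print(deduped)
#      A    C    E
# 0  0.0  1.0  2.0
# 1  4.0  5.0  6.0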

add columns different length pandas

I have a problem with adding columns in pandas.
I have a DataFrame whose dimensions are n x k. In the process I will need to add columns of dimension m x 1, where m is in [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
the result is:
AssertionError: Length of values does not match length of index
Can I add columns with different lengths?
If you use the accepted answer, you'll lose your column names, as shown in the accepted answer's example and as described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To preserve column names, use pandas.concat, but don't set ignore_index (its default value is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
    'Age': [10, 12, 13],
    'Gender': ['M', 'F', 'F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
    'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of each list:
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize them all to the determined max length:
if not max_len == la:
    a.extend([''] * (max_len - la))
if not max_len == lb:
    b.extend([''] * (max_len - lb))
if not max_len == lc:
    c.extend([''] * (max_len - lc))
Now all the lists are the same length, so create the DataFrame:
pd.DataFrame({'A':a,'B':b,'C':c})
The final output (with the shorter lists padded by empty strings) is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
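The same padding idea can be wrapped into a small reusable helper (a sketch; the helper name is mine, not from the answer above):
import pandas as pd
def frame_from_unequal_lists(**columns):
    """Pad shorter lists with empty strings so every column has the same length."""
    max_len = max(len(values) for values in columns.values())
    padded = {name: list(values) + [''] * (max_len - len(values))
              for name, values in columns.items()}
    return pd.DataFrame(padded)
print(frame_from_unequal_lists(A=[0, 1, 2, 3], B=list(range(10)), C=[0, 1]))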
I had the same issue: two different dataframes without a common column. I just needed to put them beside each other in a CSV file.
Merge:
In this case, "merge" does not work; even adding a temporary column to both dfs and then dropping it. Because this method makes both dfs with the same length. Hence, it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index()
df2 = df2.reset_index()
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs beside each other (column-wise), each with its own length.
If somebody would like to replace a specific column with one of a different size instead of adding it:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))    # build a dict {row index: value} from the list
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict of dicts
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T  # rebuild the DataFrame from the dict and return it
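A hypothetical usage sketch (the sample data is made up for illustration):
import pandas as pd
df = pd.DataFrame({'Age': [10, 12, 13], 'Gender': ['M', 'F', 'F']})
# Replace (or add) the 'Name' column with a list longer than the frame
df = fill_column(df, ['Nate A', 'Jessie A', 'Daniel H', 'John D'], 'Name')
print(df)
#   Age Gender      Name
# 0  10      M    Nate A
# 1  12      F  Jessie A
# 2  13      F  Daniel H
# 3  NaN    NaN    John D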
