A clean and efficient way to update cells in pandas DataFrames - python

I am looking for a cleaner way to achieve the following:
I have a DataFrame with certain columns that I want to update if new information arrives. This "new information" in for of a pandas DataFrame (from a CSV file) can have more or less rows, however, I am only interested in adding
Original DataFrame
DataFrame with new information
(Note the missing name "c" here and the change in "status" for name "a")
Now, I wrote the following "inconvenient" code to update the original DataFrame with the new information
Updating the "status" column based on the "name" column
for idx,row in df_base.iterrows():
if not df_upd[df_upd['name'] == row['name']].empty:
df_base.loc[idx, 'status'] = df_upd.loc[df_upd['name'] == row['name'], 'status'].values
It achieves exactly what I want, but it just does neither look nice nor efficient, and I hope that there might be a cleaner way. I tried the pd.merge method, however, the problem is that it would be adding new columns instead of "updating" the cells in that column.
pd.merge(left=df_base, right=df_upd, on=['name'], how='left')
I am looking forward to your tips and ideas.

You could set_index("name") and then call .update:
>>> df_base = df_base.set_index("name")
>>> df_upd = df_upd.set_index("name")
>>> df_base.update(df_upd)
>>> df_base
status
name
a 0
b 1
c 0
d 1
More generally, you can set the index to whatever seems appropriate, update, and then reset as needed.

Related

Pandas style-dataframe has multi-index columns misaligned

I need to style a Pandas dataframe that has a multi-index column arrangement. As an example of my dataframe, consider:
df = pd.DataFrame([[True for a in range(12)]])
df.columns = pd.MultiIndex.from_tuples([(a,b,c) for a in ['Tim','Sarah'] for b in ['house','car','boat'] for c in ['new','used']])
df
This displays the multi-index columns just as I would expect it to, and is easy to read:
However, when I convert it to a style dataframe:
df.style
The column headers suddenly shift and it becomes confusing to figure out where each level begins and ends:
Can anyone help me undo this, to return it to the more readable left-justified setup? I looked through about 10 other posts on SO but none addressed this issue.
TIA.
UPDATE 12/9:
My primary product is a .to_excel() output, and it displays correctly, I just learned. So while this issue is still open for solutions, I am not in urgent need of a solution, but am in search of one nonetheless. Thanks.

Iterate through dataframe and create new dataframe every time condition is met

From a SQL query I pull a list of activities that happen in the shop, one of the column has a label called 'STATUS'.
I want to go down the dataframe and pull the whole row every time the 'STATUS' label changes
I've created a dataframe from the query and called it df
Changed the column types to what I needed
Created a list of all the column headers
Used that list to create an empty dataframe to which I intended to append
Tried creating a for loop with the condition described above
headerlist = df.columns.values.tolist()
newdf = pd.DataFrame(columns=headerlist)
for index, row in df.iterrows():
if df.STATUS[i] != df.STATUS[i-1]:
newdf = newdf.append(i)
I've uploaded an image here that represents what I'm trying to achieve.
Thank you in advance
https://imgur.com/a/XXEOlKs
Not sure that's the best answer : (Im kinda new to Python too)
let's say that's your original dataframe is df
for even, odd in zip(df.iloc[::2], df.iloc[1::2]):
if even.status!=odd.status:
#do you append here
You should give a runnable df example so we could run our code before posting it to be sure it really works
I think I got it! I tried for an hour before posting and as soon as I posted it clicked
It's a lot simpler than a for loop.
I created a new column called 'MATCH' that has a boolean value of whether or not the new value of 'STATUS' is equal or not to the previous.
Then I just filter by df.STATUS == False
'''python
df['MATCH'] = df.STATUS.eq(df.STATUS.shift())'''
Try this
df[df['STATUS'].ne(df['STATUS'].shift().bfill())]

Pandas - merge/join/vlookup df and delete all rows that get a match

I am trying to reference a list of expired orders from one spreadsheet(df name = data2), and vlookup them on the new orders spreadsheet (df name = data) to delete all the rows that contain expired orders. Then return a new spreadsheet(df name = results).
I am having trouble trying to mimic what I do in excel vloookup/sort/delete in pandas. Please view psuedo code/steps as code:
Import simple.xls as dataframe called 'data'
Import wo.xlsm, sheet
name "T" as dataframe called 'data2'
Do a vlookup , using Column
"A" in the "data" to be used to as the values to be
matched with any of the same values in Column "A" of "data2" (there both just Order Id's)
For all values that exist inside Column A in 'data2'
and also exist in Column "A" of the 'data',group ( if necessary) and delete the
entire row(there is 26 columns) for each matched Order ID found in Column A of both datasets. To reiterate, deleting the entire row for the matches found in the 'data' file. Save the smaller dataset as results.
import pandas as pd
data = pd.read_excel("ors_simple.xlsx", encoding = "ISO-8859-1",
dtype=object)
data2 = pd.read_excel("wos.xlsm", sheet_name = "T")
results = data.merge(data2,on='Work_Order')
writer = pd.ExcelWriter('vlookuped.xlsx', engine='xlsxwriter')
results.to_excel(writer, sheet_name='Sheet1')
writer.save()
I re-read your question and think I undertand it correctly. You want to find out if any order in new_orders (you call it data) have expired using expired_orders (you call it data2).
If you rephrase your question what you want to do is: 1) find out if a value in a column in a DataFrame is in a column in another DataFrame and then 2) drop the rows where the value exists in both.
Using pd.merge is one way to do this. But since you want to use expired_orders to filter new_orders, pd.merge seems a bit overkill.
Pandas actually has a method for doing this sort of thing and it's called isin() so let's use that! This method allows you to check if a value in one column exists in another column.
df_1['column_name'].isin(df_2['column_name'])
isin() returns a Series of True/False values that you can apply to filter your DataFrame by using it as a mask: df[bool_mask].
So how do you use this in your situation?
is_expired = new_orders['order_column'].isin(expired_orders['order_column'])
results = new_orders[~is_expired].copy() # Use copy to avoid SettingWithCopyError.
~is equal to not - so ~is_expired means that the order wasn't expired.

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the datafarme indicating if the desired value is present
for col in df.columns:
results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
for x in row.values))
print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be little daunting when first approached, even for someone who knows python well, so I will try to walk through this. And I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified so I will assume, that you can get it into a frame, or if you can not get it into a frame, will ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
for x in row.values))
print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question, with the pandas keyword, they likely did not notice this question. But that allows me to take my cut at answering before they notice. The pandas keyword is very well covered for well formed questions, so I am pretty sure that if this answer is not optimum, someone else will come by and improve it. So in the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you were new python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in python even more powerful, so I highly recommend them.

pandas dataframe add new column based on calulation on other column and avoid chained index

I have a pandas dataframe and I need to add a new column, which would be based on calculation of specific columns,indicated by a column 'site'. I have found a way to do this with resort to numpy, but always it gives warning about chained index. I am sure there should be better solution, please help if you know.
df_num_bin1['Chip_id_3']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_89_S1]*0x100+df_num_bin1[WB_78_S1],df_num_bin1[WB_89_S2]*0x100+df_num_bin1[WB_78_S2])
df_num_bin1['Chip_id_2']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_67_S1]*0x100+df_num_bin1[WB_56_S1],df_num_bin1[WB_67_S2]*0x100+df_num_bin1[WB_56_S2])
df_num_bin1['Chip_id_1']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_45_S1]*0x100+df_num_bin1[WB_34_S1],df_num_bin1[WB_45_S2]*0x100+df_num_bin1[WB_34_S2])
df_num_bin1['Chip_id_0']=np.where(df_num_bin1[key_site_num]==1,df_num_bin1[WB_23_S1]*0x100+df_num_bin1[WB_12_S1],df_num_bin1[WB_23_S2]*0x100+df_num_bin1[WB_12_S2])
df_num_bin1['mac_low']=(df_num_bin1['Chip_id_1'].map(int) % 0x10000) *0x100+df_num_bin1['Chip_id_0'].map(int) // 0x1000000
The code above have 2 issues:
1: Here the value of column [key_site_num] determines which columns I should extract chip id data from. In this example it is only of site 0 or 1, but actually it could be 2 or 3 as well. I would need a general solution.
2: it generates chained index warning;
C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:35: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Well, i´m not too sure about your first quest but i think that this will help you.
import pandas as pd
reader = pd.read_csv(path,engine='python')
reader['new'] = reader['treasury.maturity.rate']+reader['bond.yield']
reader.to_csv('test.csv',index=False)
As you can see,you don´t need get the values before operate with them only reference the column where they are; and to do the same for only a specific row you could filter the dataframe before create the new column.

Categories