Pandas concat flips all my values in the DataFrame - python

I have a dataframe called 'running_tally'
  list jan_to jan_from
0   LA   True    False
1   NY  False     True
I am trying to append new data to it in the form of a single column dataframe called 'new_data'
list
0 HOU
1 LA
I concat these two dfs based on their 'list' column for further processing, but immediately after I do that all the boolean values unexpectedly flip.
running_tally = pd.concat([running_tally,new_data]).groupby('list',as_index=False).first()
the above statement will produce:
  list jan_to jan_from
0   LA  False     True
1   NY   True    False
2  HOU    NaN      NaN
NaN values are expected for the new row, but I don't know why the bools all flip. What could be the reason for this? The code logically makes sense to me, so I'm not sure where I'm going wrong. Thanks.
EDIT: I made an edit to 'new_data' to include a repeat with LA. The final output should not have repeats, which my code currently handles correctly; it just has the boolean flipping.
EDIT 2: Turns out that when concatenating, the columns flipped in order, leading me to believe the bools flipped. Still an open issue, however.

I am not sure why you want to use a groupby in this case... when using concat there is no need to specify which columns you want to use, as long as their names are identical.
Simple concatenation like this should do:
running_tally = pd.concat([running_tally,new_data], ignore_index=True, sort=False)
EDIT to take question edit into account: this should do the same job, without duplicates.
running_tally = running_tally.merge(new_data, on="list", how="outer")
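For reference, a quick sketch of what that outer merge gives with the sample frames from the question (the frames are re-created here just for illustration):
import pandas as pd

running_tally = pd.DataFrame({'list': ['LA', 'NY'],
                              'jan_to': [True, False],
                              'jan_from': [False, True]})
new_data = pd.DataFrame({'list': ['HOU', 'LA']})

out = running_tally.merge(new_data, on="list", how="outer")
print(out)
#   list jan_to jan_from
# 0   LA   True    False
# 1   NY  False     True
# 2  HOU    NaN      NaN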

I don't get the booleans flipped like you do, but you can try this too:
running_tally=running_tally.append(new_data,ignore_index=True)
print(running_tally)
Output:
  list jan_to jan_from
0   LA   True    False
1   NY  False     True
2  HOU    NaN      NaN
EDIT: Since the question was edited, you could try with:
running_tally=running_tally.append(new_data,ignore_index=True).groupby('list',as_index=False).first()

The actual column order was being flipped when using concat with pandas 0.20.1:
How to concat pandas Dataframes without changing the column order in Pandas 0.20.1?
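If you are stuck on an older pandas, one way to sidestep the column reordering is to reselect the original column order after the concat (a sketch; plain column selection works even without the sort= keyword):
combined = pd.concat([running_tally, new_data], ignore_index=True)
combined = combined[running_tally.columns]  # restore running_tally's column order
running_tally = combined.groupby('list', as_index=False).first()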

Highlight result of dataframe comparison where values differ

I need to compare 2 DataFrames (which should be identical) and output an Excel sheet that shows the comparison between them, with any mismatched values highlighted. This was the format requested by the analysts working with the reports.
I'm currently using df.compare() to do this, which gives a result like the below, where orig is the original df and new is the new df.
In the below, both values in col_1 at index 3 should be highlighted, because they didn't match between the dataframes:
      col_1      col_2      col_3
       orig new   orig new   orig new
index
1         1   1      2   2      3   3
2         1   1      2   2      3   3
3         1   2      2   2      3   3
While I can do this on my own, the dataframes could be very large, and there will be hundreds of comparisons. So I need your help in doing it efficiently!
My idea was to do
orig.compare(new, keep_equal=False)
and use that to create a mask. This would work because keep_equal=False only returns values that differ; all other cells are NaN. Then I could run the comparison again with keep_equal=True, which populates all cells. Then finally apply the mask using
df.style.apply
to highlight the values that didn't match.
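In code, the plan would look roughly like this (just a sketch; the highlight colour is a placeholder):
import numpy as np

diff = orig.compare(new, keep_equal=False)   # differing cells only, everything else NaN
full = orig.compare(new, keep_equal=True)    # same rows/columns, but every cell filled in
mask = diff.notna()                          # True where the two frames disagree

styled = full.style.apply(
    lambda _: np.where(mask, 'background-color: yellow', ''), axis=None
)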
Is there a faster way to do this? It requires processing all the cells in the df several times.
Thanks for any help you can provide.
orig and new are the two dataframes you want to compare.
Use:
import numpy as np

def highlight_diffs(orig, props=''):
    # style every cell where the two frames disagree
    return np.where(orig != new, props, '')

orig.style.apply(highlight_diffs, props='color:white;background-color:darkblue', axis=None)
Reference: Styler Functions. Acting on Data.
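If the highlighted result also needs to end up in an Excel sheet, the styled object can be written out directly (assuming openpyxl is installed):
styled = orig.style.apply(highlight_diffs,
                          props='color:white;background-color:darkblue', axis=None)
styled.to_excel('comparison.xlsx', engine='openpyxl')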

Filling nan values

I have a dataset that contains nan values. These values are dependent on another variable, and I am trying to clean the data using it. I wrote some code to replace the nan values, but it doesn't work. The code is:
df.loc[(df["house"]=="rented") & (df["car"]=="yes")]["debt"].fillna(2, inplace=True)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
Conditional that returns a boolean Series with column labels specified
df.loc[df['shield'] > 6, ['max_speed']]
            max_speed
sidewinder          7
Based on the documentation it should be converted to this:
df.loc['filter','selected column']
Give it a try like this, assigning the result back (calling fillna(..., inplace=True) on a .loc slice operates on a copy, so it would not modify df):
mask = (df["house"] == "rented") & (df["car"] == "yes")
df.loc[mask, "debt"] = df.loc[mask, "debt"].fillna(2)
Or switch from df.loc to a plain loop over the index (using .loc for the assignment to avoid chained indexing):
for val in df.index:
    if (df["house"][val] == "rented") and (df["car"][val] == "yes"):
        df.loc[val, "debt"] = 2
If I understand you correctly, you do not want to just fill in the na values. Rather, you'd like to fill the na values only when the house is rented and you have a car. To fill all na values in the "debt" column,
df["debt"].fillna(2, inplace=True)
should be used rather than your second line of code.
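A vectorised alternative that only fills the missing values matching the condition (a sketch, assuming the column names from the question):
import numpy as np

mask = (df["house"] == "rented") & (df["car"] == "yes")
df["debt"] = np.where(mask & df["debt"].isna(), 2, df["debt"])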

Assign value to dataframe from another dataframe based on two conditions

I am trying to assign values from a column in df2['values'] to a column df1['values']. However, values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for-loops when working with dataframes. List comprehensions are great; however, since I do not have a lot of experience yet, I struggle with formulating more complicated code.
How can I work through this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?
The code above tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
        date category
0 2015-01-07       f2
1 2015-01-26       f2
2 2015-01-26       f2
3 2015-04-08       f2
4 2015-04-10       f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
  values category
0     01       f1
1     02       f1
2    2.1       f1
3    2.2       f1
4     03       f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import pandas as pd
import numpy as np

df3 = df1.merge(df2, on="category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns (per OP's comment), we instead use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
common_dates = df1.merge(subset, on="date", how="left")  # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
N.B.: If more than one df1["date"] matches the date range, you'll have to drop some values, otherwise the duplicates will throw off the alignment above.
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1[date] inside the merged dataframe that are not contained in df2[date_range]. Unfortunately, I would need more information about the content of df1[date] and df2[date_range] to write code here that does exactly that.
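As a rough illustration of that idea (a sketch only, assuming df2['date_range'] holds a per-row DatetimeIndex as shown in the question and df1['date'] is already a datetime):
merged = df1.merge(df2, on="category", how="inner")
# keep only the rows whose date falls inside that row's date_range
in_range = [date in date_range
            for date, date_range in zip(merged["date"], merged["date_range"])]
matches = merged.loc[in_range]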

pandas dataframe throwing an empty list

I have a table where the column names are not really organized: they have different years of data with different column numbers.
So I have to access the data through specific column names.
I am using this syntax to access a column.
df = df[["2018/12"]]
But when I just want to extract numbers under that column, using
df.iloc[0,0]
it throws an error like
single positional indexer is out-of-bounds
So I am using
df.loc[0]
but it has the column name with the numeric data.
How can I extract just the number of each row?
Below is the CSV data
Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
""
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
When writing this dumb question, I was just a beginner who didn't even know what I wanted to ask.
The OP's question comes down to "getting the row as a list", since he ended his post asking
how to get the numbers (though he said "number", maybe by mistake) of each row.
The answer is that he made the mistake of using double square brackets in his example, and that caused the problems.
The solution is to use df = df["2018/12"] instead of df = df[["2018/12"]].
As for the things I (me at the time of writing this) mentioned, I will answer them one by one:
Let's say the table looks like this
  Unnamed: 0  2018/12        country  drives_right
0         US      809  United States          True
1        AUS      731      Australia         False
2        JAP      588          Japan         False
3         IN       18          India         False
4         RU      200         Russia          True
5        MOR       70        Morocco          True
6         EG       45          Egypt          True
1> df = df[["2018/12"]]
: it will output a dataframe which only has the column "2018/12" and the index column on the left side.
2> df.iloc[0,0]
Now, since from 1> we have a new dataframe with only one column (apart from the index column holding the index values), this will output the first element of that column.
In the example above, the outcome will be 809, since it's the first element of the column.
3>
But when I just want to extract numbers under that column, using
df.iloc[0,0]
-> this doesn't make sense if you want to extract numbers. It will just output one element,
809, from the sub-dataframe you created using df = df[["2018/12"]].
it throws an error like
single positional indexer is out-of-bounds
Maybe you are confused about the outcome. (Maybe in this case "df" is the one from before your subset assignment df = df[["2018/12"]]?) Since df = df[["2018/12"]] outputs a dataframe, that indexing itself works fine.
4>
So I am using
df.loc[0]
but it has the column name with the numeric data.
: Yes, df.loc[0] on df = df[["2018/12"]] will return the column name together with the first element of that column.
5>
How can I extract just the number of each row?
You mean the "numbers" of each row, right?
Use this:
print(df["2018/12"].values.tolist())
In terms of finding varying column or row names, and then accessing each row and column, you should think about using regex.
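For example, a small sketch of picking out the year columns by pattern (the file name and regex here are just placeholders):
import pandas as pd

df = pd.read_csv("data.csv", index_col="Closing Date")
# keep only the columns whose name looks like "YYYY/12"
year_cols = df.filter(regex=r"^\d{4}/12$")
print(year_cols.columns.tolist())  # ['2014/12', '2015/12', '2016/12', '2017/12', '2018/12']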

Create multiindex from existing dataframe

I've spent hours browsing everywhere now to try to create a multiindex from a dataframe in pandas. This is the dataframe I have (posting an Excel sheet mockup; I do have this as a pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access it via new_df.index, where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
For clarification of future users I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
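To actually attach it to the frame (a short sketch; whether to also drop the original columns afterwards is up to you):
currentDataFrame.index = midx
print(type(currentDataFrame.index))  # pandas.core.indexes.multi.MultiIndex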
There are two ways to do it, albeit not exactly like you have shown, but they work.
Say you have the following df:
     A      B  C    D
0  nil    one  1  NaN
1  bar    one  5  5.0
2  foo    two  3  8.0
3  bar  three  2  1.0
4  foo    two  4  2.0
5  bar    two  6  NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has its index set to ['user_id','account_num'].
newmulti.index will return the MultiIndex object.
