Pandas: combine row elements into one - python

I have a Pandas DataFrame that is populated from a CSV. After that I read the columns and iterate element by element through each row (for each element in a column) and write that element to a file. My problem is that I have elements in a row that I want joined into one element.
Say I have columns A through Z, and let's say their elements are 1 to 23. Let's say that I want to join the numbers 9 and 10 (columns I and J) into one element only (columns I and J become one and its value becomes [9, 10]).
How do I achieve that using pandas (while iterating)?
My code is long but you can find it here. I've tried groupby, but I think it only works with booleans and ints (correct me if I'm wrong).
Also, I'm pretty new to Python, so any advice on my code would be much appreciated!

Here is an example. It adds a new column where each entry is a list built from two other columns. I hope it helps!
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 4))
df[4] = [[df[2][x], df[3][x]] for x in range(df.shape[0])]  # pair the values of columns 2 and 3 per row

You can concat the columns, then convert to a list using numpy's tolist():
In [56]: df = pd.DataFrame(dict(A=[1,1,1], I=[9,9,9], J=[10,10,10]))
In [57]: df
Out[57]:
A I J
0 1 9 10
1 1 9 10
2 1 9 10
In [58]: df["IJ"] = pd.concat((df.I, df.J), axis=1).values.tolist()
In [59]: df.drop(["I","J"], axis=1)
Out[59]:
A IJ
0 1 [9, 10]
1 1 [9, 10]
2 1 [9, 10]
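Note that drop returns a new DataFrame here rather than modifying df in place, so assign the result if you want to keep it. The same list column can also be built without pd.concat (a small variation on the above):
df["IJ"] = df[["I", "J"]].values.tolist()
df = df.drop(["I", "J"], axis=1)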

Related

Selecting rows based on Boolean values in a non-dangerous way

This is an easy question since it is so fundamental. In R, when you want to slice rows from a dataframe based on some condition, you just write the condition and it selects the corresponding rows. For example, if you have a condition such that only the third row in the dataframe meets it, it returns the third row. Easy peasy.
In Python, you have to use loc. If the index matches the row numbers then everything is great. If you have been removing rows or re-ordering them for any reason, you have to remember that loc is based on the index, not the row position. So if the third row of your current dataframe matches the boolean condition in the loc statement, loc will retrieve whatever row carries the index label 3, which could be the 50th row rather than your current third row. This seems like an incredibly dangerous way to select rows, so I know I am doing something wrong.
So what is the best-practice method of ensuring you select the nth row based on a boolean condition? Is it just to use loc and always remember to reset_index, because if you forget it even once your entire dataframe is wrecked? This can't be it.
Use iloc instead of loc for integer-based indexing:
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])
df
Dataset:
A B C
1 1 4 7
2 2 5 8
3 3 6 9
Label-based index:
df.loc[1]
Results:
A 1
B 4
C 7
Integer-based index:
df.iloc[1]
Results:
A 2
B 5
C 8
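If the row is picked by a condition rather than by a known position, a boolean mask built from the same DataFrame lines up with its rows no matter what the index labels are, so no reset_index is needed. A minimal sketch using the same frame:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=[1, 2, 3])

mask = df['A'] == 2    # boolean Series aligned row-by-row with df
print(df[mask])        # the second row, regardless of its index label
print(df.iloc[[1]])    # the same row selected purely by position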

Quick sum of all rows that fill a condition in DataFrame

I have a pandas dataframe that looks something like this:
df = pd.DataFrame(np.array([[1,1, 0], [5, 1, 4], [7, 8, 9]]),columns=['a','b','c'])
a b c
0 1 1 0
1 5 1 4
2 7 8 9
I want to find the first column in which the majority of elements in that column are equal to 1.0.
I currently have the following code, which works, but in practice, my dataframes usually have thousands of columns and this code is in a performance critical part of my application, so I wanted to know if there is a way to do this faster.
for col in df.columns:
    amount_votes = len(df[df[col] == 1.0])
    if amount_votes > len(df) / 2:
        return col
In this case, the code should return 'b', since that is the first column in which the majority of elements are equal to 1.0
Try:
print((df.eq(1).sum() > len(df) // 2).idxmax())
Prints:
b
Find the columns where more than half of the values equal 1.0:
cols = df.eq(1.0).sum().gt(len(df) / 2)
Get the first one:
cols[cols].head(1)
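Put together as a runnable sketch on the example frame from the question (the helper name is just for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1, 0], [5, 1, 4], [7, 8, 9]]), columns=['a', 'b', 'c'])

def first_majority_column(df, value=1.0):
    # Count matches per column in one vectorized pass, then return the first
    # column label where the count exceeds half the number of rows.
    majority = df.eq(value).sum() > len(df) / 2
    return majority.idxmax() if majority.any() else None

print(first_majority_column(df))  # prints 'b'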

How to iterate over every cell in pandas Dataframe?

How to iterate over every cell of a pandas Dataframe?
Like
for every_cell in df.iterrows:
    print(cell_value)
printing of course is not the goal.
Cell values of the df should be updated in a MongoDB.
If it has to be a for loop, you can do it like this:
def up_da_ter3(df):
    columns = df.columns.tolist()
    for _, i in df.iterrows():
        for c in columns:
            print(i[c])
        print("############")
You can use applymap. It will iterate down each column, starting with the left-most. But in general you almost never need to iterate over every value of a DataFrame; pandas has much more performant ways to accomplish calculations.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(6).reshape(-1, 2), columns=['A', 'B'])
# A B
#0 0 1
#1 2 3
#2 4 5
df.applymap(lambda x: print(x))
0
2
4
1
3
5
If you need it to go through the DataFrame across the rows first (row-major order), you can transpose first:
df.T.applymap(lambda x: print(x))
0
1
2
3
4
5
Another option would be a double for loop inside a list comprehension, such as
for i in [df[j][k] for k in range(0, len(df)) for j in df.columns]:
    print(i)
This iterates from the first column to the last column of the first row, and then repeats the same process for each following row, in the order the comprehension builds the list.
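If the cell values are only needed as a flat stream (for example, to build the MongoDB updates mentioned in the question), iterating over the underlying NumPy array is usually much faster than iterrows or applymap. A minimal sketch, assuming row-major order is wanted:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(-1, 2), columns=['A', 'B'])

# Row-major order (across each row, then down): 0, 1, 2, 3, 4, 5
for value in df.to_numpy().ravel():
    print(value)

# Or one dict per row, which maps naturally onto document-style updates
records = df.to_dict(orient='records')  # [{'A': 0, 'B': 1}, {'A': 2, 'B': 3}, ...]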

Keep column and row order when storing pandas dataframe in json

When storing data in a json object with to_json, and reading it back with read_json, rows and columns are returned sorted alphabetically. Is there a way to keep the results ordered or reorder them upon retrieval?
You could use orient='split', which stores the index and column information in lists, which preserve order:
In [34]: df
Out[34]:
A C B
5 0 1 2
4 3 4 5
3 6 7 8
In [35]: df.to_json(orient='split')
Out[35]: '{"columns":["A","C","B"],"index":[5,4,3],"data":[[0,1,2],[3,4,5],[6,7,8]]}'
In [36]: pd.read_json(df.to_json(orient='split'), orient='split')
Out[36]:
A C B
5 0 1 2
4 3 4 5
3 6 7 8
Just remember to use orient='split' on reading as well, or you'll get
In [37]: pd.read_json(df.to_json(orient='split'))
Out[37]:
columns data index
0 A [0, 1, 2] 5
1 C [3, 4, 5] 4
2 B [6, 7, 8] 3
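The same applies when writing to and reading from a file on disk; a minimal round-trip sketch (the file name is just an example):
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]], columns=['A', 'C', 'B'], index=[5, 4, 3])

df.to_json('data.json', orient='split')
restored = pd.read_json('data.json', orient='split')
print(restored.columns.tolist())  # ['A', 'C', 'B'] -- original column order preserved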
If you want a format like orient='records' and still keep the column order, you can write a function like the one below. I don't think it is a wise approach and do not recommend it, because it does not guarantee the order.
import json

def df_to_json(df):
    res_arr = []
    ldf = df.copy()
    ldf = ldf.fillna('')
    lcolumns = [ldf.index.name] + list(ldf.columns)
    for key, value in ldf.iterrows():
        lvalues = [key] + list(value)
        res_arr.append(dict(zip(lcolumns, lvalues)))
    return json.dumps(res_arr)
In addition, for reading without the columns being sorted, please refer to: Python json.loads changes the order of the object.
Good luck!
Let's say you have a pandas dataframe that you read:
import pandas as pd
df = pd.read_json('/abc.json')
df.head()
There are two ways to save it back to JSON using pandas to_json.
df.sample(200).to_json('abc_sample.json', orient='split')
stores the columns, index and data in separate lists (the 'split' layout shown above).
However, to preserve the order the way a CSV does, use this one:
df.sample(200).to_json('abc_sample_2nd.json', orient='records')
which writes one record per row, with the columns in their original order.

Apply condition on pandas columns to create a boolean indexing array

I want to drop specific rows from a pandas dataframe. Usually you can do that using something like
df[df['some_column'] != 1234]
What df['some_column'] != 1234 does is create a boolean indexing array that indexes the new df, keeping only the rows where the value is True.
But in some cases, like mine, I don't see how I can express the condition in such a way, and iterating over pandas rows is way too slow to be considered a viable option.
To be more specific, I want to drop all rows where the value of a column is also a key in a dictionary, in a similar manner with the example above.
In a perfect world I would consider something like
df[df['some_column'] not in my_dict.keys()]
Which is obviously not working. Any suggestions?
What you're looking for is isin()
import pandas as pd
df = pd.DataFrame([[1, 2], [1, 3], [4, 6],[5,7],[8,9]], columns=['A', 'B'])
In[9]: df
Out[9]:
A B
0 1 2
1 1 3
2 4 6
3 5 7
4 8 9
mydict = {1:'A',8:'B'}
df[df['A'].isin(mydict.keys())]
Out[11]:
A B
0 1 2
1 1 3
4 8 9
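Since the question is about dropping those rows rather than keeping them, invert the mask with ~ (a small follow-up using the same frame and dict):
df[~df['A'].isin(mydict.keys())]
# keeps only the rows whose 'A' value is not a key of mydict:
#    A  B
# 2  4  6
# 3  5  7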
You could use query for this purpose:
keys = list(my_dict.keys())
df.query('some_column not in @keys')
You can use the function isin() to select rows whose column value is in an iterable.
Using lists:
my_list = ['my', 'own', 'data']
df.loc[df['column'].isin(my_list)]
Using dicts:
my_dict = {'key1':'Some value'}
df.loc[df['column'].isin(my_dict.keys())]
