Shuffling Several DataFrames Together - python

Is it possible to shuffle several DataFrames together?
For example I have a DataFrame df1 and a DataFrame df2. I want to shuffle the rows randomly, but for both DataFrames in the same way.
Example
df1:
|___|_______|
| 1 | ... |
| 2 | ... |
| 3 | ... |
| 4 | ... |
df2:
|___|_______|
| 1 | ... |
| 2 | ... |
| 3 | ... |
| 4 | ... |
After shuffling a possible order for both DataFrames could be:
|___|_______|
| 2 | ... |
| 3 | ... |
| 4 | ... |
| 1 | ... |

You can reindex both DataFrames with the same permutation of the index, created by numpy.random.permutation, but both DataFrames must have the same length and the same unique index values:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a':range(5)})
print (df1)
a
0 0
1 1
2 2
3 3
4 4
df2 = pd.DataFrame({'a':range(5)})
print (df2)
a
0 0
1 1
2 2
3 3
4 4
idx = np.random.permutation(df1.index)
print (df1.reindex(idx))
a
2 2
4 4
1 1
3 3
0 0
print (df2.reindex(idx))
a
2 2
4 4
1 1
3 3
0 0
Alternative with reindex_axis (deprecated since pandas 0.21 and removed in 1.0; use reindex instead):
print (df1.reindex_axis(idx, axis=0))
print (df2.reindex_axis(idx, axis=0))
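A related pattern, shown here as a minimal sketch, shuffles one DataFrame with DataFrame.sample and reuses the shuffled index for the second one; it relies on the same assumption that both frames share identical index values:
import pandas as pd

df1 = pd.DataFrame({'a': range(5)})
df2 = pd.DataFrame({'b': range(5)})

# shuffle df1, then reorder df2 by the same shuffled index;
# random_state is optional and only makes the shuffle reproducible
shuffled1 = df1.sample(frac=1, random_state=42)
shuffled2 = df2.reindex(shuffled1.index)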

Related

Group by as a list and create new column for each value

I have a dataframe where every row is a user id and an item he has:
| user | item_id |
|------|---------|
| 1 | a |
| 1 | b |
| 2 | b |
| 3 | c |
| 4 | a |
| 4 | c |
I want to create n columns, where n is the number of distinct item_id values, with one row per user, filled with 1/0 according to whether the item is present for that user.
| user | item_a | item_b | item_c |
|------|---------|---------|----------|
| 1 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 1 |
Use pivot_table:
import pandas as pd
df = pd.DataFrame({'user': [1,1,2,3,4,4], 'item_id': list('abbcac')})
out = (df.assign(val=1)
         .pivot_table(values='val',
                      index='user',
                      columns='item_id',
                      fill_value=0))
Or crosstab:
pd.crosstab(df.user, df.item_id).add_prefix('item_').reset_index()
Yet another approach is to use get_dummies and a groupby sum:
df = pd.get_dummies(df, columns=['item_id']).groupby('user').sum().reset_index()
Result:
user item_id_a item_id_b item_id_c
0 1 1 1 0
1 2 0 1 0
2 3 0 0 1
3 4 1 0 1
And to strip the _id part from the column names:
df.columns = df.columns.str.replace(r"_id", "")
df
user item_a item_b item_c
0 1 1 1 0
1 2 0 1 0
2 3 0 0 1
3 4 1 0 1
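One caveat with the groupby sum: if the input ever contains duplicate (user, item_id) pairs, the summed dummies can exceed 1. A minimal sketch on the same sample frame that keeps clean 0/1 indicators uses max instead:
import pandas as pd

df = pd.DataFrame({'user': [1,1,2,3,4,4], 'item_id': list('abbcac')})
# max() keeps each indicator at 0 or 1 even when a
# (user, item_id) pair occurs more than once
res = (pd.get_dummies(df, columns=['item_id'], dtype=int)
         .groupby('user').max().reset_index())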

Find count of unique values of each column and save to CSV

I have data like this:
+---+---+---+
| A | B | C |
+---+---+---+
| 1 | 2 | 7 |
| 2 | 2 | 7 |
| 3 | 2 | 1 |
| 3 | 2 | 1 |
| 3 | 2 | 1 |
+---+---+---+
I need to count the occurrences of each unique value per column and report them like below:
+---+---+---+
| A | 3 | 3 |
| A | 2 | 1 |
| A | 1 | 1 |
| B | 2 | 5 |
| C | 1 | 3 |
| C | 7 | 2 |
+---+---+---+
I have no issue when the number of columns is limited and I can name them manually, but when the input file is big it becomes hard. I need a simple way to produce the output.
Here is the code I have:
import pandas as pd
df=pd.read_csv('1.csv')
A=df['A']
B=df['B']
C=df['C']
df1=A.value_counts()
df2=B.value_counts()
df3=C.value_counts()
all = {'A': df1,'B': df2,'C': df3}
result = pd.concat(all)
result.to_csv('out.csv')
Use DataFrame.stack with SeriesGroupBy.value_counts and then convert the Series to a DataFrame with Series.rename_axis and Series.reset_index:
df=pd.read_csv('1.csv')
result = (df.stack()
            .groupby(level=1)
            .value_counts()
            .rename_axis(['X','Y'])
            .reset_index(name='Z'))
print (result)
X Y Z
0 A 3 3
1 A 1 1
2 A 2 1
3 B 2 5
4 C 1 3
5 C 7 2
result.to_csv('out.csv', index=False)
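An equivalent sketch with melt instead of stack, assuming the same input file, spells the reshaping out a bit more explicitly:
import pandas as pd

df = pd.read_csv('1.csv')
# reshape wide -> long, then count occurrences of each value per column
result = (df.melt(var_name='X', value_name='Y')
            .groupby('X')['Y']
            .value_counts()
            .reset_index(name='Z'))
result.to_csv('out.csv', index=False)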
You can loop over the columns and insert the counts into a dictionary. Initialize the dictionary with all={}. To be scalable, read the columns with colm=df.columns; this gives you all the columns in your df.
Try this code:
import pandas as pd

df = pd.read_csv('1.csv')
all = {}           # note: this shadows the builtin all()
colm = df.columns  # every column in the DataFrame
for i in colm:
    all.update({i: df[i].value_counts()})
result = pd.concat(all)
result.to_csv('out.csv')
To find the unique values of a column:
df.A.unique()
To get the count of those unique values:
len(df.A.unique())
unique() returns an array; use the len() function to count its elements.
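If only the number of distinct values per column is needed, DataFrame.nunique computes it for all columns at once; a small sketch with the sample data from the question:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 3],
                   'B': [2, 2, 2, 2, 2],
                   'C': [7, 7, 1, 1, 1]})
# number of distinct values in every column
print(df.nunique())
# A    3
# B    1
# C    2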

Python pandas count occurrences in each column

I am new to pandas. Can someone help me calculate the frequencies of values for each column?
Dataframe:
id|flag1|flag2|flag3|
---------------------
1 | 1 | 2 | 1 |
2 | 3 | 1 | 1 |
3 | 3 | 4 | 4 |
4 | 4 | 1 | 4 |
5 | 2 | 3 | 2 |
I want something like
id|flag1|flag2|flag3|
---------------------
1 | 1 | 2 | 2 |
2 | 1 | 1 | 1 |
3 | 2 | 1 | 0 |
4 | 1 | 1 | 2 |
Explanation: the value 1 occurs once in flag1, twice in flag2 and twice in flag3.
First select only the flag columns with filter (or by removing the id column), then apply value_counts to each column, and finally replace NaNs with 0 and cast to int:
df = df.filter(like='flag').apply(lambda x: x.value_counts()).fillna(0).astype(int)
print (df)
flag1 flag2 flag3
1 1 2 2
2 1 1 1
3 2 1 0
4 1 1 2
Or:
df = df.drop('id', axis=1).apply(lambda x: x.value_counts()).fillna(0).astype(int)
print (df)
flag1 flag2 flag3
1 1 2 2
2 1 1 1
3 2 1 0
4 1 1 2
Thank you, Bharath, for the suggestion (value_counts is passed as a function, not called):
df = df.filter(like='flag').apply(pd.Series.value_counts).fillna(0).astype(int)
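An equivalent sketch that avoids apply, assuming the same sample frame, builds one value_counts Series per flag column and concatenates them side by side:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'flag1': [1, 3, 3, 4, 2],
                   'flag2': [2, 1, 4, 1, 3],
                   'flag3': [1, 1, 4, 4, 2]})
# one value_counts per flag column, glued side by side;
# values missing from a column become 0
counts = pd.concat({c: df[c].value_counts() for c in df.filter(like='flag')},
                   axis=1).fillna(0).astype(int)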

Pandas: replace zero value with value of another column

How do I replace a zero value in a column with the value from the same row of another column, but only where the previous row's value is also zero, i.e. only where a non-zero value has not been encountered yet?
For example: Given a dataframe with columns a, b and c:
+----+-----+-----+----+
| | a | b | c |
|----+-----+-----+----|
| 0 | 2 | 0 | 0 |
| 1 | 5 | 0 | 0 |
| 2 | 3 | 4 | 0 |
| 3 | 2 | 0 | 3 |
| 4 | 1 | 8 | 1 |
+----+-----+-----+----+
Replace zero values in b and c with values of a where the previous value is zero:
+----+-----+-----+----+
| | a | b | c |
|----+-----+-----+----|
| 0 | 2 | 2 | 2 |
| 1 | 5 | 5 | 5 |
| 2 | 3 | 4 | 3 |
| 3 | 2 | 0 | 3 | <-- zero in this row is not replaced because of
| 4 | 1 | 8 | 1 | non-zero value (4) in row before it.
+----+-----+-----+----+
In [90]: (df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
    ...:    .fillna(pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]),
    ...:                         columns=df.columns, index=df.index))
    ...:    .astype(int)
    ...: )
Out[90]:
a b c
0 2 2 2
1 5 5 5
2 3 4 3
3 2 0 3
4 1 8 1
Explanation:
In [91]: df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
Out[91]:
a b c
0 2 NaN NaN
1 5 NaN NaN
2 3 4.0 NaN
3 2 0.0 3.0
4 1 8.0 1.0
Now we can fill the NaNs with the corresponding values from the DataFrame below (which is built as three concatenated copies of column a):
In [92]: pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]), columns=df.columns, index=df.index)
Out[92]:
a b c
0 2 2 2
1 5 5 5
2 3 3 3
3 2 2 2
4 1 1 1
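A shorter sketch of the same idea, assuming the sample frame from the question, uses DataFrame.mask to substitute column a row-wise instead of building the tiled helper frame:
import pandas as pd

df = pd.DataFrame({'a': [2, 5, 3, 2, 1],
                   'b': [0, 0, 4, 0, 8],
                   'c': [0, 0, 0, 3, 1]})
# True where the cell is zero and the cell above it was zero too
m = df.eq(0) & df.shift().fillna(0).eq(0)
# replace those cells with the value of column a from the same row
out = df.mask(m, df['a'], axis=0)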

Interleaving Pandas Dataframes by Timestamp

I've got 2 pandas DataFrames, each containing 2 columns. One of the columns is a timestamp column [t], the other one contains sensor readings [s].
I now want to create a single DataFrame, containing 4 columns, that is interleaved on the timestamp column.
Example:
First Dataframe:
+----+----+
| t1 | s1 |
+----+----+
| 0 | 1 |
| 2 | 3 |
| 3 | 3 |
| 5 | 2 |
+----+----+
Second DataFrame:
+----+----+
| t2 | s2 |
+----+----+
| 1 | 5 |
| 2 | 3 |
| 4 | 3 |
+----+----+
Target:
+----+----+----+----+
| t1 | t2 | s1 | s2 |
+----+----+----+----+
| 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 5 |
| 2 | 1 | 3 | 5 |
| 2 | 2 | 3 | 3 |
| 3 | 2 | 3 | 3 |
| 3 | 4 | 3 | 3 |
| 5 | 4 | 2 | 3 |
+----+----+----+----+
I had a look at pandas.merge, but that left me with a lot of NaNs and an unsorted table.
a.merge(b, how='outer')
Out[55]:
t1 s1 t2 s2
0 0 1 NaN NaN
1 2 3 2 3
2 3 3 NaN NaN
3 5 2 NaN NaN
4 1 NaN 1 5
5 4 NaN 4 3
Merging will put NaNs in the common columns that you merge on if those values are not present in both indexes. It will not create new data that is not present in the dataframes being merged.
For example, index 0 in your target dataframe shows t2 with a value of 0. This is not present in the second dataframe, so you cannot expect it to appear in the merged dataframe either. The same applies to the other rows.
What you can do instead is reindex the dataframes to a common index. In your case, since the maximum timestamp in the target dataframe is 5, let's use this list to reindex both input dataframes:
In [382]: ind
Out[382]: [0, 1, 2, 3, 4, 5]
Now we reindex both inputs to this index:
In [372]: x = a.set_index('t1').reindex(ind).fillna(0).reset_index()
In [373]: x
Out[373]:
t1 s1
0 0 1
1 1 0
2 2 3
3 3 3
4 4 0
5 5 2
In [374]: y = b.set_index('t2').reindex(ind).fillna(0).reset_index()
In [375]: y
Out[375]:
t2 s2
0 0 0
1 1 5
2 2 3
3 3 0
4 4 3
5 5 0
And now we merge them to get something close to the target dataframe:
In [376]: x.merge(y, left_on=['t1'], right_on=['t2'], how='outer')
Out[376]:
t1 s1 t2 s2
0 0 1 0 0
1 1 0 1 5
2 2 3 2 3
3 3 3 3 0
4 4 0 4 3
5 5 2 5 0
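For completeness, pandas also ships pd.merge_ordered, which performs an ordered outer merge with optional forward-filling; a minimal sketch on the question's data (note that, unlike the target table, rows sharing the same timestamp collapse into a single row):
import pandas as pd

a = pd.DataFrame({'t1': [0, 2, 3, 5], 's1': [1, 3, 3, 2]})
b = pd.DataFrame({'t2': [1, 2, 4], 's2': [5, 3, 3]})

# ordered outer merge on the timestamps, forward-filling the last
# known reading; leading NaNs (before the first reading) become 0
merged = (pd.merge_ordered(a, b, left_on='t1', right_on='t2',
                           fill_method='ffill')
            .fillna(0))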
