(I'm a pandas n00b) I have some oddly formatted CSV data that resembles this:
i   A        B        C
    x  y  z  x  y  z  x  y  z
-----------------------------
1   1  2  3  4  5  6  7  8  9
2   1  2  3  3  2  1  2  1  3
3   9  8  7  6  5  4  3  2  1
where A, B, C are categorical and the properties x, y, z are present for each. What I think I want to do (part of a larger split-apply-combine step) is to read the data with pandas such that I have dimensionally homogeneous observations like this:
i   id  GRP   x  y  z
---------------------
1    1   A    1  2  3
2    1   B    4  5  6
3    1   C    7  8  9
4    2   A    1  2  3
5    2   B    3  2  1
6    2   C    2  1  3
7    3   A    9  8  7
8    3   B    6  5  4
9    3   C    3  2  1
So how best to accomplish this?
#1: I thought about reading the file using basic read_csv() options, then iterating/slicing/transposing/whatever to create another dataframe that has the structure I want. But in my case the number of categories (A, B, C) and properties (x, y, z) is large and is not known ahead of time. I'm also worried about memory issues when scaling to large datasets.
#2: I like the idea of setting the iterator param in read_csv() and then yielding multiple observations per line. (Any reason not to set chunksize=1?) At least I wouldn't be creating multiple dataframes this way.
What's the smarter way to do this?
First I constructed a sample dataframe like yours:
import numpy as np
import pandas as pd

# build the two-level column index (A/B/C on top, x/y/z underneath)
column = pd.MultiIndex.from_product([['A', 'B', 'C'], ['x', 'y', 'z']])
df = pd.DataFrame(np.random.randint(1, 10, size=(3, 9)),
                  columns=column, index=[1, 2, 3])
print(df)
# A B C
# x y z x y z x y z
# 1 5 7 4 7 7 8 9 1 9
# 2 8 5 1 8 5 9 4 4 2
# 3 4 9 6 2 1 4 6 1 6
To get your desired output, reshape the dataframe using df.stack() and then reset the index:
df = df.stack(0).reset_index()
df.index += 1 # to make index begin from 1
print(df)
# level_0 level_1 x y z
# 1 1 A 5 7 4
# 2 1 B 7 7 8
# 3 1 C 9 1 9
# 4 2 A 8 5 1
# 5 2 B 8 5 9
# 6 2 C 4 4 2
# 7 3 A 4 9 6
# 8 3 B 2 1 4
# 9 3 C 6 1 6
Then you can just rename the columns as you want. Hope it helps.
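For example, to get exactly the column names from your desired output (id and GRP), a small follow-up building on the df above would be:
df = df.rename(columns={'level_0': 'id', 'level_1': 'GRP'})  # level_0 was the original row label, level_1 the category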
Related
I have an Excel dataset which contains 100 rows and 100 columns with order frequencies at locations described by x and y (a grid-like structure).
I'd like to convert it to the following structure with 3 columns:
x-Coördinaten | y-Coördinaten | value
The "value" column only contains positive integers. The x and y column contain float type data (geograohical coordinates.
The order does not matter, as it can easily be sorted afterwards.
So, basically, a merge of lists could work, e.g.:
[[1,5,3,5], [4,2,5,6], [2,3,1,5]] ==> [1,5,3,5,4,2,5,6,2,3,1,5]
But then I would lose the location, which is key for my project.
What is the best way to accomplish this?
Assuming this input:
l = [[1,5,3,5],[4,2,5,6],[2,3,1,5]]
df = pd.DataFrame(l)
you can use stack:
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
output:
x y value
0 0 0 1
1 0 1 5
2 0 2 3
3 0 3 5
4 1 0 4
5 1 1 2
6 1 2 5
7 1 3 6
8 2 0 2
9 2 1 3
10 2 2 1
11 2 3 5
or melt for a different order:
df2 = df.rename_axis('x').reset_index().melt('x', var_name='y', value_name='value')
output:
x y value
0 0 0 1
1 1 0 4
2 2 0 2
3 0 1 5
4 1 1 2
5 2 1 3
6 0 2 3
7 1 2 5
8 2 2 1
9 0 3 5
10 1 3 6
11 2 3 5
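Since the grid actually lives in an Excel file with the coordinates along its edges, here is a minimal sketch going straight from the file to the three-column layout. The file name is hypothetical, and it assumes the first column holds the x coordinates and the header row holds the y coordinates:
import pandas as pd

grid = pd.read_excel('orders.xlsx', index_col=0)   # hypothetical file name
long = (grid.rename_axis(index='x', columns='y')
            .stack()
            .reset_index(name='value'))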
You should be able to get the results with a melt operation -
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.columns = [2, 3, 4]
df.loc[:, 'x'] = [3, 4, 5]
This is what df looks like
2 3 4 x
0 0 1 2 3
1 3 4 5 4
2 6 7 8 5
The melt operation -
df.melt(id_vars='x', var_name='y')
output -
x y value
0 3 2 0
1 4 2 3
2 5 2 6
3 3 3 1
4 4 3 4
5 5 3 7
6 3 4 2
7 4 4 5
8 5 4 8
I have a dataframe that looks like the following:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[[1,2,3],[1,2,3],[1,2,3]], 'c': [[4,5,6],[4,5,6],[4,5,6]]})
I want to explode the dataframe with column b and c. I know that if we only use one column then we can do
df.explode('column_name')
However, I can't find a way to use it with two columns. So here is the desired output.
output = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c': [4,5,6,4,5,6,4,5,6]})
I have tried
df.explode(['a','b'])
but it does not work and gives me a
ValueError: column must be a scalar.
Thanks.
Let us try
df = pd.concat([df[x].explode() for x in ['b', 'c']], axis=1).join(df[['a']]).reindex(columns=df.columns)
Out[179]:
a b c
0 1 1 4
0 1 2 5
0 1 3 6
1 2 1 4
1 2 2 5
1 2 3 6
2 3 1 4
2 3 2 5
2 3 3 6
You can use itertools chain, along with zip, to get your result:
from itertools import chain

pd.DataFrame(chain.from_iterable(zip([a] * df.shape[-1], b, c)
                                 for a, b, c in df.to_numpy()))
0 1 2
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
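That result comes back with default integer column names; a small follow-up (same frame and import assumed) passes the original labels through:
out = pd.DataFrame(chain.from_iterable(zip([a] * df.shape[-1], b, c)
                                       for a, b, c in df.to_numpy()),
                   columns=df.columns)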
The list comprehension from @Ben is the fastest. However, if you aren't too concerned about speed, you may use apply with pd.Series.explode:
df.set_index('a').apply(pd.Series.explode).reset_index()
Or simply apply; on non-list columns it will return the original values:
df.apply(pd.Series.explode).reset_index(drop=True)
Out[42]:
a b c
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
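Note that newer pandas versions (1.3 and later) accept a list of columns in explode directly, so this becomes a one-liner (a sketch using the df from the question):
df.explode(['b', 'c'], ignore_index=True)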
I'm new to python and pandas and I need some ideas. Say I have the following DataFrame:
0 1 2 3 4 5
1 5 5 5 5 5
2 5 5 5 5 5
3 5 5 5 5 5
4 5 5 5 5 5
I want to iterate through each row and change the values of specific columns. Say I wanted to change all of the values in columns (2,3,4) to a 3.
This is what I've tried, am I going down the right path?
for row in df.iterrows():
    for col in range(2, 4):
        df.set_value('row', 'col', 3)
EDIT:
Thanks for the responses. The simple solutions are obvious, but what if I wanted to change the values to this... for example:
0 1 2 3 4 5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
4 16 17 18 19 20
If you are using a loop when working with dataframes, you are almost always not on the right track.
For this you can use a vectorized assignment:
df[[2, 3, 4]] = 3
Example:
df = pd.DataFrame({1: [1, 2], 2: [1, 2]})
print(df)
# 1 2
# 0 1 1
# 1 2 2
df[[1, 2]] = 3
print(df)
# 1 2
# 0 3 3
# 1 3 3
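The same idea covers the pattern in your edit: put an array on the right-hand side instead of a scalar. A sketch assuming the 4x6 frame from the edit, where column 0 is left untouched:
import numpy as np

df.iloc[:, 1:] = np.arange(1, 21).reshape(4, 5)  # fills columns 1-5 row by row with 1..20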
You can do this:
df.iloc[:, 2] = 3  # column 2
df.iloc[:, 3] = 3  # column 3
df.iloc[:, 4] = 3  # column 4
I would like to iterate through multiple dataframe columns looking for the top n values in each column. If the value in the column is in the top n values then keep that value, otherwise bucket in "other". Also, I would like to create new columns from this.
However, I'm not sure how to use .apply in this case as it seems like I need to reference both columns and rows.
np.random.seed(0)
example_df = pd.DataFrame(np.random.randint(low=0, high=10, size=(15, 5)),columns=['a', 'b', 'c', 'd', 'e'])
cols_to_group = ['a','b','c']
top = 2
So for the example below, here's my pseudo code that I'm not sure how to execute:
Pseudo Code:
# loop through each column
for column in example_df[cols_to_group]:
    # loop through each value in the column and check if it's in the top values for the column
    for single_value in column:
        if single_value.isin(column.value_counts()[:top].values):
            # return value if it is in top values
            return single_value
        else:
            return "other"
    # create new column in your df that has bucketed values
    example_df[column.name + str("bucketed") + str(top)] = column
Expected output:
Crude example where top = 2.
a b c d e a_bucketed b_bucketed
0 4 6 4 3 1 4 6
1 8 8 1 5 7 8 8
2 8 6 0 0 2 8 6
3 4 1 0 7 4 4 Other
4 7 8 7 7 7 Other 8
Here is one way. But no treatment for ties has been prescribed.
df['a_bucketed'] = np.where(df['a'].isin(df['a'].value_counts().index[:2]), df['a'], 'Other')
df['b_bucketed'] = np.where(df['b'].isin(df['b'].value_counts().index[:2]), df['b'], 'Other')
# a b c d e a_bucketed b_bucketed
# 0 5 0 3 3 7 Other Other
# 1 9 3 5 2 4 9 3
# 2 7 6 8 8 1 Other Other
# 3 6 7 7 8 1 Other Other
# 4 5 9 8 9 4 Other 9
# 5 3 0 3 5 0 3 Other
# 6 2 3 8 1 3 Other 3
# 7 3 3 7 0 1 3 3
# 8 9 9 0 4 7 9 9
# 9 3 2 7 2 0 3 Other
# 10 0 4 5 5 6 Other Other
# 11 8 4 1 4 9 Other Other
# 12 8 1 1 7 9 Other Other
# 13 9 3 6 7 2 9 3
# 14 0 3 5 9 4 Other 3
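To cover every column in cols_to_group without repeating that line, the same expression can run in a loop. A sketch building on example_df, cols_to_group, and top from the question; the '_bucketed' suffix is chosen to match the expected output:
for col in cols_to_group:
    top_vals = example_df[col].value_counts().index[:top]
    example_df[col + '_bucketed'] = np.where(
        example_df[col].isin(top_vals), example_df[col], 'Other')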
I have a DataFrame x with three columns:
a b c
1 1 10 4
2 5 6 5
3 4 6 5
4 2 11 9
5 1 2 10
... and a Series y of two values:
t
1 3
2 7
Now I'd like to get a DataFrame z with two columns:
t sum_c
1 3 18
2 7 13
... with t from y and sum_c the sum of c from x for all rows where t was larger than a and smaller than b.
Would anybody be able to help me with this?
Here is a possible solution based on the given condition (the expected results listed in your question don't quite line up with the given condition):
In[99]: df1
Out[99]:
a b c
0 1 10 4
1 5 6 5
2 4 6 5
3 2 11 9
4 1 2 10
In[100]: df2
Out[100]:
t
0 3
1 5
Then write a function to be used by pandas apply() later:
In[101]: def cond_sum(x):
             return df1.loc[(df1['a'] < x.iloc[0]) & (df1['b'] > x.iloc[0]), 'c'].sum()
finally:
In[102]: df3 = df2.apply(cond_sum, axis=1)
In[103]: df3
Out[103]:
0 13
1 18
dtype: int64
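To get the result in exactly the shape asked for (a DataFrame z with columns t and sum_c), a small follow-up building on df1, df2, and cond_sum above:
z = df2.assign(sum_c=df2.apply(cond_sum, axis=1))  # keeps t from df2 and adds the conditional sum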