Python Pandas Pivot - Why It Fails - python

I have tried for a while to get this to work and I can't - I have read the documentation and I must be misunderstanding something.
I have a DataFrame in long format and I want to make it wide - this is quite common. But I get an error:
from pandas import DataFrame
data = DataFrame({'value' : [1,2,3,4,5,6,7,8,9,10,11,12],
'group' : ['a','a','a','b','b','b','b','c','c','c','d','d']})
data.pivot(columns='group')
The error I get is (the last part, as the traceback is quite extensive): ValueError: cannot label index with a null key
I tried this in Python (notebook) and also on the regular Python command line in OS X, with the same result.
Thanks for any insight, I am sure it will be something basic

From what you were trying to do, you were only passing 'group' as columns and no index, so the pivot fails.
It should be:
data.pivot(data.index, 'group')
or,
# the format is pivot(index=None, columns=None, values=None)
data.pivot(index=data.index, columns='group')
However, I'm not entirely sure what output you expect. If you just want a shorter presentation, you can always transpose:
data.T
or, best for presentation in your case, use groupby:
data.groupby('group').sum()
value
group
a 6
b 22
c 27
d 23
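If the goal is a genuinely wide layout with one column per group and the group's values stacked by position, one pattern (just a sketch; the 'pos' helper column is my own name, and it assumes the row order within each group is the order you want to keep) is to build a position counter with groupby().cumcount() and pivot on that:
import pandas as pd
data = pd.DataFrame({'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                     'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd']})
# position of each row within its group: 0, 1, 2, ...
data['pos'] = data.groupby('group').cumcount()
# long -> wide: one column per group, rows aligned by position (NaN where a group is shorter)
wide = data.pivot(index='pos', columns='group', values='value')
print(wide)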

Related

convert integer header to a string in order to apply tuple in pandas

I have this dataframe:
01100MS,02200MS,02500MS,03100MS,22
626323,616720,616288,611860,622375
5188431,5181393,5173583,5165895,5152605
1915,1499,1310,1235,1907
1,4.1,4.41,4.441,4.4441
2,4.2,4.42,4.442,4.4442
3,4.3,4.43,4.443,4.4443
4,4.4,4.44,4.444,4.4444
5,4.5,4.45,4.445,4.4445
6,4.6,4.46,4.446,4.4446
7,4.7,4.47,4.447,4.4447
8,4.8,4.48,4.448,4.4448
9,4.9,4.49,4.449,4.4449
10,5,4.5,4.45,4.445
11,5.1,4.51,4.451,4.4451
I would like to have multiple headers. According to this post, I have done this:
dfr = pd.read_csv(file_input,sep=',',header=None,skiprows=0)
cols = tuple(zip(dfr.iloc[0], (dfr.iloc[1]).apply(lambda x: x[1:-1])))
However, I get an error:
TypeError: 'float' object is not subscriptable
The problem, I suppose, is due to the fact that 22 in the header is an integer. Indeed if I substitute 22 with A22 it works.
Because I have to work with multiple large dataframes, I cannot do this by hand. As a consequence, I have tried this solution:
dfr.iloc[0] = dfr.iloc[0].apply(str)
but it does not seem to work.
Do you have some suggestions?
apply(lambda x: x[1:-1]) removes the first and last character. That was needed in the other post you quote, because the format there was [col1]; in your case you want the same value as in the file.
The problem is that 22 is parsed as a number rather than a string (hence the 'float' object is not subscriptable error). Just remove the apply call and then you can build the MultiIndex.
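A minimal sketch of the full flow under that advice (it assumes the first two rows of the file are the two header levels; file_input is the path variable from the question):
import pandas as pd
dfr = pd.read_csv(file_input, sep=',', header=None, skiprows=0)
# no slicing needed: the file values are already the labels you want;
# astype(str) keeps the numeric header 22 as the string '22'
cols = pd.MultiIndex.from_tuples(list(zip(dfr.iloc[0].astype(str), dfr.iloc[1].astype(str))))
# drop the two header rows and attach the MultiIndex columns
dfr = dfr.iloc[2:].reset_index(drop=True)
dfr.columns = cols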

pandas dataframe how to convert object to array and extract the array value

Forgive me if my question is a bit ambiguous; I will try to be clearer.
Question:
I have DataFrame as below which I received from hive DB.
How can I extract the values 'cat', 'animal', and 'dog' from column col2?
In[]:
sample = {'col1': ['cat', 'dog'], 'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])']}
df = pd.DataFrame(data=sample)
df
Out[]:
col1 col2
-----------------------------------------
0 cat WrappedArray([animal], [cat])
1 dog WrappedArray([animal], [dog])
I tried to convert the object to an array and extract the data with this code.
In[]: df['col2'][0][1]
Out[]: cat
If I'm doing this wrong, I will have to try another way, because I am a newbie with Pandas DataFrames.
Could someone let me know how this works?
Thanks in advance.
The data in the second column, col2, appear to be simply strings.
The output from df['col2'][0][1] would be "r", which is the second character (index 1) in the first string. To get "cat" you would need to alter the strings and remove the 'WrappedArray([animal]...' stuff, leaving only the actual data: "cat", "dog", etc.
You could try df['col2'].iloc[0][24:27], but that's not a general solution. It would also be brittle and unmaintainable.
If you have any control over how the data is exported from the database, try to get the data out in a cleaner format, i.e. without the WrappedArray(... stuff.
Regular expressions might be helpful here.
You could try something like this:
import re
wrapped = re.compile(r'\[(.*?)\].+\[(.*?)\]')
element = wrapped.search(df['col2'].iloc[0]).group(2)
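To pull that second bracketed field out of every row at once, a sketch along the same lines (it assumes every row follows the same WrappedArray pattern; 'col2_value' is just a made-up name for the new column):
# second capture group = the value inside the second pair of brackets
df['col2_value'] = df['col2'].str.extract(r'\[.*?\].+\[(.*?)\]', expand=False)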
* Danger Danger Danger *
If you really need that functionality, you could create a WrappedArray function that returns the contents as a list of strings or the like, and then call it with eval(df['col2'][0]).
Don't do this.
FYI:
Your dtypes likely defaulted to object, because you didn't specify them when you created your data frame. You can do that like this:
df = pd.DataFrame(data=sample, dtype='string')
Also, it's recommended to use iloc to index dataframes by position.
I solved it as #rkedge advised me.
The data is written in a foreign language.
As I said, the DataFrame has object data written as 'WrappedArray([우주ごぎゅ],[ぎゃ],[한국어])'.
df_ = df['col2'].str.extractall(r'([REGEX expression]+)')
df_
0 0 우주ごぎゅ
0 1 ぎゃ
0 2 한국어
1 0 cat
2 0 animal

Python - Access one part of a named tuple inside a two-dimensional list

In Python, I have row data that I'm trying to put into a pandas data frame. However, each cell is a named tuple, so my output data contains:
Cell(r=1, c=2, v='value')
All I want is the v from each named tuple. How would I go about building my dataframe with only the cell values?
This is what I use to set the rows to the dataframe:
df = pandas.DataFrame(data=cells)
Named tuple and example code below:
import collections
Cell = collections.namedtuple('Cell',['r','c','v'])
cells = [[Cell(1,3,5),Cell(6,233,22)],[Cell(6,88,22),Cell(6454,2344443,34)]]
Desired result:
5 22
22 34
I thought someone posted an answer here...
df.applymap(lambda x: x.v)
Basically, this accesses the value of v in each Cell.
Edit: This was #JohnE's solution; not sure what the etiquette is here? It would have taken me a little while to get there.
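Putting it together, a small sketch using the example data from the question (note that in recent pandas versions DataFrame.map is the preferred name for applymap):
import collections
import pandas as pd
Cell = collections.namedtuple('Cell', ['r', 'c', 'v'])
cells = [[Cell(1, 3, 5), Cell(6, 233, 22)], [Cell(6, 88, 22), Cell(6454, 2344443, 34)]]
df = pd.DataFrame(data=cells)
# keep only the v field of each Cell
values = df.applymap(lambda x: x.v)
print(values)
#     0   1
# 0   5  22
# 1  22  34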

Python dataframe how to group by one column and get sum of other column

I want to create a new data frame with 2 columns: 'Striker_Id' to group by, and a second column with the sum of 'Batsman_Scored' for each Striker_Id.
Eg:
Striker_ID Batsman_Scored
1 0
2 8
...
I tried ball.groupby(['Striker_Id'])['Batsman_Scored'].sum(), but this is what I get:
Striker_Id
1 0000040141000010111000001000020000004001010001...
2 0000000446404106064011111011100012106110621402...
3 0000121111114060001000101001011010010001041011...
4 0114110102100100011010000000006010011001111101...
5 0140016010010040000101111100101000111410011000...
6 1100100000104141011141001004001211200001110111...
It doesn't sum; it only concatenates all the numbers. What's the alternative?
For some reason, your columns were loaded as strings. While loading them from a CSV, try applying a converter -
df = pd.read_csv('file.csv', converters={'Batsman_Scored' : int})
Or,
df = pd.read_csv('file.csv', converters={'Batsman_Scored' : pd.to_numeric})
If that doesn't work, then convert to integer after loading -
df['Batsman_Scored'] = df['Batsman_Scored'].astype(int)
Or,
df['Batsman_Scored'] = pd.to_numeric(df['Batsman_Scored'], errors='coerce')
Now, performing the groupby should work -
r = df.groupby('Striker_Id')['Batsman_Scored'].sum()
Without access to your data, I can only speculate, but it seems that at some point your data contains non-numeric values that prevent pandas from performing the conversion, so those columns are kept as strings. It's a little difficult to pinpoint the problematic data until you actually load it and do something like
(~df.col.str.isdigit()).any()
That'll tell you if there are any non-numeric items. Note that this only works for integers; float columns cannot be debugged like this.
Also, another way of seeing what columns have corrupt data would be to query dtypes -
df.dtypes
Which will give you a listing of all columns and their datatypes. Use this to figure out what columns need parsing -
for c in df.columns[df.dtypes == object]:
    print(c)
You can then apply the methods outlined above to fix them.
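For example, a rough sketch that coerces every object column found by that loop (errors='coerce' turns anything unparseable into NaN, so check the result afterwards):
for c in df.columns[df.dtypes == object]:
    df[c] = pd.to_numeric(df[c], errors='coerce')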

Should pandas dataframes be nested?

I am creating a Python script that drives an old Fortran code to locate earthquakes. I want to vary the input parameters to the Fortran code in the Python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (i.e., a dataframe assigned to an element of another dataframe). So for example:
import pandas as pd
import numpy as np
def some_operation(row):
    results = np.random.rand(50, 3) * row['p1'] / row['p2']
    res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
    return res
# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object) # make sure generic types can be used
# loop over each row, call some_operation and store results DataFrame
for ind, row in df_master.iterrows():
    df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe as a convenient container, because the column/index notation is quite slick. If DataFrames should not be used in this way, is there a similar alternative? I was looking at the Panel class but I am not sure if it is the proper solution for my application. I would hate to forge ahead, apply the hack shown above to some code, and then have it not be supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
>>> df2 = pd.DataFrame({'a':[100], 'b':[200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
a b
0 100 200
1 2 5
2 3 6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do), I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store trained scikit-learn estimators during cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...).
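If you do want to avoid object columns altogether, one possible alternative (just a sketch of one layout, reusing df_master and some_operation from the question) is to keep the parameters in the DataFrame and the per-run results in a plain dict keyed by the same index:
results = {}
for ind, row in df_master.iterrows():
    results[ind] = some_operation(row)
# the parameters and results for run 0 stay linked through the shared index
print(df_master.loc[0, ['p1', 'p2']])
print(results[0].head())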
