Get Pandas DataFrame first column - python

This question is odd, since I know HOW to do something, but I don't know WHY I can't do it another way.
Suppose a simple data frame:
import pandas as pd
a = pd.DataFrame([[0, 1], [2, 3]])
I can slice this data frame very easily: the first column is a[[0]], the second is a[[1]]. Simple, isn't it?
Now, let's have a more complex data frame. This is part of my code:
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in range(1, num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns = ["Variable"], index = row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
Data frame frame is also a pandas DataFrame, just like a. I can get the second column very easily as frame[[1]]. But when I try frame[[0]] I get an error:
Traceback (most recent call last):
File "<ipython-input-55-0c56ffb47d0d>", line 1, in <module>
frame[[0]]
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 1991, in __getitem__
return self._getitem_array(key)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 2035, in _getitem_array
indexer = self.ix._convert_to_indexer(key, axis=1)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1184, in _convert_to_indexer
indexer = labels._convert_list_indexer(objarr, kind=self.name)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\indexes\base.py", line 1112, in _convert_list_indexer
return maybe_convert_indices(indexer, len(self))
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1856, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
I can still use frame.iloc[:,0], but the problem is that I don't understand why I can't use simple slicing with [[]]. I use WinPython Spyder 3, if that helps.

Using your code:
import pandas as pd
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in range(1, num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns = ["Variable"], index = row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
If you print out frame you get:
Variable 1
loc_1 0 0
loc_2 1 1
loc_3 2 8
loc_4 3 27
loc_5 4 64
loc_6 5 125
......
So the cause of your problem becomes obvious: you have no column called 0.
On the first line you define a list called var_vec.
On line 4 you make a DataFrame out of that list, but you specify the index values and the column name (which is usually good practice).
The numerical column names 0, 1, ... as in the first example only appear when you don't specify column names; the number is a label, not a column position indexer.
If you want to access columns by their position, you can:
df[df.columns[0]]
What happens then is that you get the list of columns of the df, pick the label at position 0, and pass it back to the df as a reference.
Hope that helps you understand.
Edit:
Another (better) way would be:
df.iloc[:, 0]
where ":" stands for all rows (also indexed by number, from 0 up to the number of rows).
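To summarize the label-vs-position distinction, here is a minimal sketch on a small frame like the asker's (note that recent pandas versions raise a KeyError for frame[[0]] rather than the IndexError shown above):

```python
import pandas as pd

# A small frame like the asker's: the columns carry the *labels* "Variable" and 1
frame = pd.DataFrame({"Variable": range(5)})
frame[1] = [i ** 3 for i in range(5)]

by_label = frame["Variable"]      # label-based access
by_position = frame.iloc[:, 0]    # position-based access (what the asker wanted)
assert by_label.equals(by_position)

# frame[[0]] fails because no column carries the label 0
try:
    frame[[0]]
except KeyError:
    pass
```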


how can I make a for loop to populate a DataFrame?

First of all, thanks to everyone who seeks to help.
I have started to learn Python and came across an opportunity to use Python to my advantage at work.
I basically made a script that reads a Google Sheets file, imports it into pandas and cleans up the data.
In the end, I just want to have the names of the agents in the columns, with all of their values for the resolucao column below them, so I can take the average amount of time for all of the agents, but I'm struggling to make it with a list comprehension / for loop.
This is what the DataFrame looks like after I cleaned it up
And this is the Code that I tried to Run
PS: Sorry for the messy code.
agentes_unique = list(df['Agente'].unique())
agentes_duplicated = df['Agente']
value_resolucao_duplicated = df['resolucao']
n_of_rows = []
for row in range(len(df)):
    n_of_rows.append(row)
i = 0
while n_of_rows[i] < len(n_of_rows):
    df2 = pd.DataFrame({agentes_unique[i]: (value for value in df['resolucao'][i] if df['Agente'][i] == agentes_unique[i])})
    i += 1
df2.to_excel('teste.xlsx', index=True, header=True)
But in the end it came to this error:
Traceback (most recent call last):
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\indexes\range.py", line 385, in get_loc
return self._range.index(new_key)
ValueError: 0 is not in range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\FELIPE\Desktop\Python\webscraping\bot_csv_extract\bot_login\main.py", line 50, in <module>
df2 = pd.DataFrame({'Agente': (valor for valor in df['resolucao'][i] if df['Agente'][i] == 'Gabriel')})
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\indexes\range.py", line 387, in get_loc
raise KeyError(key) from err
KeyError: 0
I feel like I'm making some obvious mistake, but I can't fix it.
Again, thanks to anyone who tries to help.
Are you looking to do something like this? This is just sample data, but a good start for what you are looking to do, if I understand what you want.
data = {
    'Column1': ['Data', 'Another_Data', 'More_Data', 'Last_Data'],
    'Agente': ['Name', 'Another_Name', 'More_Names', 'Last_Name'],
    'Data': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
df = df.pivot(index='Column1', columns=['Agente'], values='Data')
df.reset_index()
It is not recommended to use for loops on pandas DataFrames: it is considered messy and inefficient.
With some practice you will be able to approach problems in such a way that you will not need for loops in these cases.
From what I understand, your goal can be achieved in 3 simple steps:
1. Select the 2 columns of interest. I recommend you take a look at how to access different elements of a DataFrame:
df = df[["Agente", "resolucao"]]
2. Convert the column you want to average to a numeric value. Say seconds:
df["resolucao"] = pd.to_timedelta(df['resolucao'].astype(str)).dt.total_seconds()
3. Apply an average aggregation, via the groupby() function:
df = df.groupby(["Agente"]).mean().reset_index()
Hope this helps.
For next time, I also recommend that you not post the data as an image, so that others can reproduce your code.
Cheers and keep it up!
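The three steps above can be sketched end to end on made-up data (the agent names and durations below are assumptions, not the asker's sheet):

```python
import pandas as pd

# Hypothetical stand-in for the cleaned-up sheet
df = pd.DataFrame({
    "Agente": ["Gabriel", "Gabriel", "Maria"],
    "resolucao": ["0:01:00", "0:03:00", "0:02:00"],
})

df = df[["Agente", "resolucao"]]                                       # 1. select the columns
df["resolucao"] = pd.to_timedelta(df["resolucao"]).dt.total_seconds()  # 2. convert to seconds
out = df.groupby("Agente", as_index=False)["resolucao"].mean()         # 3. average per agent
print(out)
```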

Pandas creating big dataFrame and filling it in loop

I have already created the columns of my dataframe:
id = [f'GeneID_region_{i}' for i in range(43)]
value = [f'GeneValue_region_{i}' for i in range(43)]
lst = []
for i in range(43):
    lst.append(id[i])
    lst.append(value[i])
df = pd.DataFrame(lst)
df = df.T
Now it looks like that:
df
Out[158]:
0 1 ... 84 85
0 GeneID_region_0 GeneValue_region_0 ... GeneID_region_42 GeneValue_region_42
[1 rows x 86 columns]
GeneID_region... are my columns, and now I want to fill the columns line by line. But I think I haven't defined my rows as rows yet, because I can't do:
df.GeneID_region_0
Traceback (most recent call last):
File "<ipython-input-159-2760f7e0dd61>", line 1, in <module>
df.GeneID_region_0
File "/home/anja/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5179, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'GeneID_region_0'
Can someone help me how to do that properly?
The result should look like the following:
I have a numpy array of dimension 43x25520.
I want to have 25520 values in column 'GeneID_region_0',
and then 25520 values in column 'GeneValue_region_0',
and so on. So in the end I want to have a pandas frame of dimension (25520, 86).
I am guessing that what you wanted was GeneID_region_n etc. for the column names, and then to fill your df with data. You can do this (using 0 as fake data since you didn't specify) like this:
id = [f'GeneID_region_{i}' for i in range(43)]
value = [f'GeneValue_region_{i}' for i in range(43)]
lst = []
for i in range(43):
    lst.append(id[i])
    lst.append(value[i])
df = pd.DataFrame([[0 for i in range(43 + 43)]], columns=lst)
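To get from there to the full (25520, 86) frame, the whole array can be dropped in at once rather than line by line. A minimal sketch, shrunk to 3 regions x 4 rows and using made-up numbers (the real case would be 43 x 25520), assuming one array with one row per output column:

```python
import numpy as np
import pandas as pd

n_regions, n_rows = 3, 4  # stand-ins for 43 and 25520
cols = []
for i in range(n_regions):
    cols.append(f'GeneID_region_{i}')
    cols.append(f'GeneValue_region_{i}')

# Hypothetical data: shape (2 * n_regions, n_rows), one array row per column.
# Transposing makes each array row a DataFrame column of length n_rows.
data = np.arange(2 * n_regions * n_rows).reshape(2 * n_regions, n_rows)
df = pd.DataFrame(data.T, columns=cols)
print(df.shape)
```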

plotting specific columns based on user input

I have a dataframe with the following setup:
a c g s
Ind b d t d
0 11 12 22 33
1 13 14 44 101
The goal is to receive input from the user via a GUI, save the input as a list, and compare it with the list of headers in the dataframe. If the two match, then plot the column where they match (the column will be the y-axis and the index will be the x-axis).
For example, if user selects [('c','d')] then I would like the code to plot that column. Here is what I have so far.
df = pd.read_csv('foo.csv', sep=r'\s*,\s*', encoding='ascii', engine='python')
header_list = [('a','b'), ('c','d'), ('g','t'), ('s','d')]
user_Input_list = [('c','d')]
sub_list = []
for contents in header_list:
    for contents2 in user_Input_list:
        if contents == contents2:
            ax = df.reset_index().plot(x='Ind', y=x[header_list], kind='bar')
            for p in ax.patches:
                ax.annotate('{:.2E}'.format(Decimal(str(p.get_height()))),
                            (p.get_x(), p.get_height()))
plt.tight_layout()
plt.show()
I think the problem lies in how I am trying to select the y-axis with y=x[header_list].
Edit
Here is the error message I get when I run the above code.
Traceback (most recent call last):
File "/home/JM9/PycharmProjects/subfolder/subfolder.py", line 360, in <module>
ax = x.reset_index().plot(x='Ind', y=x[header_list], kind='bar')
File "/home/JM9/PycharmProjects/subfolder/venv/lib/python3.6/site-packages/pandas/plotting/_core.py", line 780, in __call__
data = data[y].copy()
File "/home/JM9/PycharmProjects/subfolder/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 2982, in __getitem__
return self._getitem_frame(key)
File "/home/JM9/PycharmProjects/subfolder/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 3081, in _getitem_frame
raise ValueError("Must pass DataFrame with boolean values only")
ValueError: Must pass DataFrame with boolean values only
I'm having trouble figuring out your example code. One thing that can help to solve the problem is to re-create a simpler version of what you're trying to do, without hard-to-deal-with column names.
import random
import pandas as pd

data = {
    'a': [random.randint(0, 50) for i in range(4)],
    'b': [random.randint(0, 50) for i in range(4)],
    'c': [random.randint(0, 50) for i in range(4)]
}
df = pd.DataFrame(data)
df.index = df.index.rename('Ind')
user_input = 'b'
if user_input in df.columns:
    ax = df[user_input].plot(x='Ind', kind='bar')
Some useful takeaways for you: (a) instead of looping you can perform a simple membership test to see whether the user's input matches a DataFrame column (if user_input in df.columns:), and (b) you can call .plot() on individual columns.
Edit: changed 'b' to user_input
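Since the frame in the question has two header rows, the columns presumably form a MultiIndex (reading the CSV with header=[0, 1] would produce one), so the user's tuple can be tested against df.columns directly. A sketch under that assumption, with the values taken from the question's example:

```python
import pandas as pd

# Rebuild the two-row header as a MultiIndex
cols = pd.MultiIndex.from_tuples([('a', 'b'), ('c', 'd'), ('g', 't'), ('s', 'd')])
df = pd.DataFrame([[11, 12, 22, 33], [13, 14, 44, 101]], columns=cols)
df.index.name = 'Ind'

user_input_list = [('c', 'd')]
matches = [col for col in user_input_list if col in df.columns]
selected = df[matches]   # only the matching tuple column(s)
# selected.plot(kind='bar') would then draw them against the index
```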

How to apply a function to a certain column by name in a dataframe

I have a dataframe with columns that contain GPS coordinates. I want to convert the columns that are in degree seconds to degree decimals. For example, I have 2 columns named "lat_sec" and "long_sec" that are formatted with values like 186780.8954N. I tried to write a function that saves the last character of the string as the direction, divides the number part by 3600 to get the degree decimal, and then concatenates the two together to get the new format. I then tried to find the column by its name in the data frame and apply the function to it.
I'm new to Python and can't find other resources on this. I don't think I created my function properly. I have the word 'coordinate' in it because I did not know what to call the value that I am breaking down.
My data looks like this:
long_sec
635912.9277W
555057.2000W
581375.9850W
581166.2780W
df = pd.DataFrame(my_array)

def convertDec(coordinate):
    decimal = float(coordinate[:-1]/3600)
    direction = coordinate[-1:]
    return str(decimal) + str(direction)

df['lat_sec'] = df['lat_sec'].apply(lambda x: x.convertDec())
My error looks like this:
Traceback (most recent call last):
File "code.py", line 44, in <module>
df['lat_sec'] = df['lat_sec'].apply(lambda x: x.convertDec())
File "C:\Python\Python37\lib\site-packages\pandas\core\frame.py", line 2917, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Python\Python37\lib\site-packages\pandas\core\indexes\base.py", line 2604, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 129, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: 'lat_sec'
By doing float(coordinate[:-1]/3600) you are dividing a str by an int, which is not possible. What you can do is convert the str into a float and then divide it by the integer 3600, which gives you a float output.
Second, you are not using apply properly, and there is no lat_sec column to apply your function to.
import pandas as pd

df = pd.DataFrame(['635912.9277W', '555057.2000W', '581375.9850W', '581166.2780W'], columns=['long_sec'])

# function creation
def convertDec(coordinate):
    decimal = float(coordinate[:-1]) / 3600
    direction = coordinate[-1:]
    return str(decimal) + str(direction)

# if you just want to update the existing column
df['long_sec'] = df.apply(lambda row: convertDec(row['long_sec']), axis=1)
# if you want to create a new column, just change to the name that you want
df['lat_sec'] = df.apply(lambda row: convertDec(row['long_sec']), axis=1)
#OUTPUT
long_sec
0 176.64247991666667W
1 154.18255555555555W
2 161.49332916666665W
3 161.43507722222225W
If you don't want the output as a float but as an integer, just change float(coordinate[:-1])/3600 to int(float(coordinate[:-1])/3600).
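As a side note, the same conversion can be done without apply at all, using the vectorized .str accessor; a sketch on the same sample values:

```python
import pandas as pd

df = pd.DataFrame({'long_sec': ['635912.9277W', '555057.2000W']})

decimal = df['long_sec'].str[:-1].astype(float) / 3600  # numeric part / 3600
direction = df['long_sec'].str[-1]                      # trailing N/W letter
df['long_dec'] = decimal.round(4).astype(str) + direction
print(df['long_dec'].tolist())
```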
In your code above, inside the convertDec method, there is also an error in:
decimal = float(coordinate[:-1]/3600)
You need to convert the coordinate to a float first, before dividing it by 3600.
So your code should look like this:
import pandas as pd

# Your example dataset
dictCoordinates = {
    "long_sec": ["111111.1111W", "222222.2222W", "333333.3333W", "444444.4444W"],
    "lat_sec": ["555555.5555N", "666666.6666N", "777777.7777N", "888888.8888N"]
}

# Insert your dataset into a Pandas DataFrame
df = pd.DataFrame(data=dictCoordinates)

# Your conversion method here
def convertDec(coordinate):
    decimal = float(coordinate[:-1]) / 3600  # Drop the last character, convert to float, divide by 3600
    decimal = format(decimal, ".4f")  # Make sure the output has 4 digits after the decimal point
    direction = coordinate[-1]  # Extract the direction (N or W)
    return str(decimal) + direction  # Return the desired output

# Do the conversion for "long_sec"
df["long_sec"] = df.apply(lambda x: convertDec(x["long_sec"]), axis=1)

# Do the conversion for "lat_sec"
df["lat_sec"] = df.apply(lambda x: convertDec(x["lat_sec"]), axis=1)

print(df)
That's it. Hope this helps.

Appending DataFrame to List in Pandas, Python

I have a file of data and want to select a specific State. From there I need to return this as a list, but there will be years that correspond to dates with missing data, so I need to replace the missing data.
I am having some issues with my code; likely something is slightly off in my for loop:
def stateCountAsList(filepath, state):
    import pandas as pd
    pd.set_option('display.width', 200)
    import numpy as np
    dataFrame = pd.read_csv(filepath, header=0, sep='\t')
    df = dataFrame.iloc[0:638, :]
    dfState = df[df['State'] == state]
    yearList = range(1999, 2012)
    countsList = []
    for dfState['Year'] in yearList:
        countsList = dfState['Count']
    else:
        countsList.append(np.nan)
    return countsList
    print countsList.tolist()

state = 'California'
stateCountAsList(filepath, state)
Traceback:
C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py:59: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
for dfState['Year'] in yearList:
Traceback (most recent call last):
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 67, in <module>
stateCountAsList(filepath, state)
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 62, in stateCountAsList
countsList.append(np.nan)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\series.py", line 1466, in append
verify_integrity=verify_integrity)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\tools\merge.py", line 754, in concat
copy=copy)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\tools\merge.py", line 805, in __init__
raise TypeError("cannot concatenate a non-NDFrame object")
TypeError: cannot concatenate a non-NDFrame object
You have at least two different issues in your code:
The warning
A value is trying to be set on a copy of a slice from a DataFrame.
is triggered by for dfState['Year'] in yearList (line 59 in your code). In this line you try to loop over a range of years (1999 to 2012), but instead you implicitly try to assign each year value to dfState['Year']. This is not a copy but a "view" (http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy), since df = dataFrame.iloc[0:638,:] returns a view.
But as mentioned earlier, you don't want to assign a value to the DataFrame here, only loop over the years. So the for loop should look like:
for year in range(1999, 2012):
    ...
The second issue is in line 62. Here, you try to append np.nan to your "list" countsList, but countsList is not a list anymore: it is a pandas Series!
Two lines earlier, you assigned a pd.Series to it (countsList = dfState['Count']), effectively changing the type. This gives you the TypeError: cannot concatenate a non-NDFrame object.
With this information you should be able to correct your loop.
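Putting both fixes together, the corrected loop could look like this (the two-row dfState below is made-up sample data standing in for the filtered frame):

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the data: only some years are present
dfState = pd.DataFrame({'Year': [1999, 2001], 'Count': [10, 30]})

countsList = []
for year in range(1999, 2012):                        # plain loop variable, not dfState['Year']
    match = dfState.loc[dfState['Year'] == year, 'Count']
    if len(match):
        countsList.append(match.iloc[0])              # year present: take its count
    else:
        countsList.append(np.nan)                     # year missing: fill with NaN
print(countsList)
```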
As an alternative, you can get the desired result using pandas' query method (http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method-experimental):
def stateCountAsList(filepath, state):
    import pandas as pd
    import numpy as np
    dataFrame = pd.read_csv(filepath, header=0, sep='\t')
    df = dataFrame.iloc[0:638, :]
    stateList = df.query("(State == @state) & (1999 < Year < 2005)").Count.tolist()
    return stateList
