How can I print out just the index of a pandas dataframe? - python

I'm trying to make a table, and the way Pandas formats its indices is exactly what I'm looking for. That said, I don't want the actual data, and I can't figure out how to get Pandas to print out just the indices without the corresponding data.

You can access the index attribute of a df using .index:
In [277]:
df = pd.DataFrame({'a':np.arange(10), 'b':np.random.randn(10)})
df
Out[277]:
   a         b
0  0  0.293422
1  1 -1.631018
2  2  0.065344
3  3 -0.417926
4  4  1.925325
5  5  0.167545
6  6 -0.988941
7  7 -0.277446
8  8  1.426912
9  9 -0.114189
In [278]:
df.index
Out[278]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
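(On pandas 2.x the same call prints RangeIndex(start=0, stop=10, step=1) instead, since Int64Index was removed in pandas 2.0; either way df.index is an Index object.)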

.index.tolist() is another method with which you can get the index as a list:
In [1391]: datasheet.head(20).index.tolist()
Out[1391]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
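If the goal is literally to print the labels without the data, the list makes that a one-liner (a small sketch reusing the df built above):
print(*df.index.tolist(), sep='\n')  # one label per line, no data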

You can access the index attribute of a df using df.index[i]
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a': np.arange(5), 'b': np.random.randn(5)})
   a         b
0  0  1.088998
1  1 -1.381735
2  2  0.035058
3  3 -2.273023
4  4  1.345342
>>> df.index[1]   # second index
>>> df.index[-1]  # last index
>>> for i in range(len(df)):  # using a loop
...     print(df.index[i])
...
0
1
2
3
4

You can use a list comprehension:
index = [x for x in df.index]
print(index)

You can always try df.index. This attribute will show you the index (a RangeIndex by default).
Or you can always set your own index. Let's say you had a weather.csv file with the headers
'date', 'temperature' and 'event', and you want to set "date" as your index:
import pandas as pd
df = pd.read_csv('weather.csv')
df.set_index('date', inplace=True)
df
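A self-contained sketch of the same idea, with an inline CSV standing in for the hypothetical weather.csv:
import io
import pandas as pd

csv = io.StringIO("date,temperature,event\n2017-01-01,32,Rain\n2017-01-02,35,Sunny")
df = pd.read_csv(csv)
df.set_index('date', inplace=True)
print(df.index)  # just the index, no data: Index(['2017-01-01', '2017-01-02'], dtype='object', name='date')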

Related

Insert Row in Dataframe at certain place

I have the following DataFrame. Now I want to insert an empty row after every time the column "Zweck" equals 7.
So, for example, the third row should be an empty row.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
ind = df[df['f'] == 7].index
# insert a placeholder row of 33s after each row where f == 7
df = pd.DataFrame(np.insert(df.values, ind + 1, values=[33], axis=0))
df.rename(columns=ren_dict, inplace=True)
ind_empt = df['a'] == 33
df[ind_empt] = ''
print(df)
Output
   a  b  f
0  1  1  1
1  2  2  7
2
3  3  3  3
4  4  4  4
5  5  5  7
6
Here the dataframe is overwritten rather than appended to row by row, since repeated append operations would be resource-intensive. The inserted rows initially carry the placeholder value 33, because np.insert cannot place string values into a numeric array. df.rename then restores the original column names, and finally the rows where df['a'] == 33 are blanked out with empty strings.
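As a variation on the same technique (my sketch, not part of the answer above): casting the values to object dtype first lets np.insert place empty strings directly, so the 33 placeholder becomes unnecessary. Here df0 stands for the original frame and ind for the positions found above:
arr = df0.to_numpy(dtype=object)  # object dtype can hold strings
out = pd.DataFrame(np.insert(arr, ind + 1, values='', axis=0), columns=df0.columns)
print(out)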

using loc(index) on pandas DataFrame in Python

I think I'm missing something very basic about the pandas DataFrame:
I have the following program:
array = {'A':[1, 2, 3, 4], 'B':[5, 6, 7, 8]}
index = pd.DatetimeIndex(['09:30',
                          '09:31',
                          '09:32',
                          '09:33'])
data = pd.DataFrame(array, index=index)
data.index = data.index.strftime('%H:%M')
print(data)
print(data.loc('09:33'))
I get:
       A  B
09:30  1  5
09:31  2  6
09:32  3  7
09:33  4  8
Which is great, but I cannot access a row using its index '09:33'; I get:
ValueError: No axis named 09:33 for object type <class 'pandas.core.frame.DataFrame'>
What am I missing ?
Thank you,
Ehood
You need to use brackets [] instead: data.loc['09:33']
array = {'A':[1, 2, 3, 4], 'B':[5, 6, 7, 8]}
index = pd.DatetimeIndex(['09:30',
                          '09:31',
                          '09:32',
                          '09:33'])
data = pd.DataFrame(array, index=index)
data.index = data.index.strftime('%H:%M')
print(data)
print(data.loc['09:33']) # HERE!
Output:
       A  B
09:30  1  5
09:31  2  6
09:32  3  7
09:33  4  8
A    4
B    8
Name: 09:33, dtype: int64
You're close. When using loc or iloc, it's followed by square brackets, like this:
df.loc['viper']
In your case that ends up like this:
data.loc['09:33']
You can find the documentation for that here.
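As a quick contrast between the two indexers (both are standard pandas API, run against the data frame built above):
print(data.loc['09:33'])  # label-based: the row whose index label is '09:33'
print(data.iloc[-1])      # position-based: the last row, which is the same row here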

Build a dataframe from a dict with specified labels from a txt

I want to make a dataframe with defined labels, but I don't know how to tell pandas to take the labels from the list. Hope someone can help.
import numpy as np
import pandas as pd
df = []
thislist = []
thislist = ["A","D"]
thisdict = {
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9],
    "D": [7, 8, 9]
}
df = pd.DataFrame(data= thisdict[thislist]) # <- here is my problem
I want to get this:
df =  A  D
      1  7
      2  8
      3  9
Use:
df = pd.DataFrame(thisdict)[thislist]
print(df)
   A  D
0  1  7
1  2  8
2  3  9
We could also use DataFrame.drop
df = pd.DataFrame(thisdict).drop(columns = ['B','C'])
or DataFrame.reindex
df = pd.DataFrame(thisdict).reindex(columns = thislist)
or DataFrame.filter
df = pd.DataFrame(thisdict).filter(items=thislist)
We can also use filter to filter thisdict.items()
df = pd.DataFrame(dict(filter(lambda item: item[0] in thislist, thisdict.items())))
print(df)
   A  D
0  1  7
1  2  8
2  3  9
I think this answer is complemented by the solution of @anky_91.
Finally, I recommend you read up on how to index a DataFrame.
IIUC, use .loc[] with the dataframe constructor:
df = pd.DataFrame(thisdict).loc[:,thislist]
print(df)
   A  D
0  1  7
1  2  8
2  3  9
Use a dict comprehension to create a new dictionary that is a subset of your original so you only construct the DataFrame you care about.
pd.DataFrame({x: thisdict[x] for x in thislist})
   A  D
0  1  7
1  2  8
2  3  9
If you want to handle the possibility of missing keys, add some logic so it behaves like reindex:
pd.DataFrame({x: thisdict[x] if x in thisdict else np.nan for x in thislist})
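For example, with a label that is not in the dict (a hypothetical "Z" here), the extra column simply comes out as NaN:
pd.DataFrame({x: thisdict[x] if x in thisdict else np.nan for x in ["A", "Z"]})
#    A   Z
# 0  1 NaN
# 1  2 NaN
# 2  3 NaN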
df = pd.DataFrame(thisdict)
df[['A', 'D']]
another alternative for your input:
thislist = ["A","D"]
thisdict = {
"A": [1, 2, 3],
"B": [4, 5, 6],
"C": [7, 8, 9],
"D": [7, 8, 9]
}
df = pd.DataFrame(thisdict)
and then simply remove the columns that are not in thislist (you can do it directly on the df, or collect them first):
remove_columns = []
for c in df.columns:
    if c not in thislist:
        remove_columns.append(c)
and remove them:
df.drop(columns=remove_columns, inplace=True)
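The collect-then-drop step can also be written as a one-liner with a list comprehension, a small sketch of the same idea:
df = df.drop(columns=[c for c in df.columns if c not in thislist])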

Python: calculate new column based on conditions on existing columns

I want a new column based on certain conditions on existing columns. Below is what I am doing right now, but it takes too much time for huge data. Is there a more efficient or faster way to do it?
DF["A"][0] = 0
for x in range(1,rows):
if(DF["B"][x]>DF["B"][x-1]):
DF["A"][x] = DF["A"][x-1] + DF["C"][x]
elif(DF["B"][x]<DF["B"][x-1]):
DF["A"][x] = DF["A"][x-1] - DF["C"][x]
else:
DF["A"][x] = DF["A"][x-1]
If I got you right this is what you want:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [12, 15, 9, 8, 15],
                   'C': [3, 9, 12, 6, 8]})
df['A'] = np.where(df.index == 0,
                   0,
                   np.where(df['B'] > df['B'].shift(),
                            df['A'].shift() + df['C'],
                            np.where(df['B'] < df['B'].shift(),
                                     df['A'].shift() - df['C'],
                                     df['A'].shift())))
df
#       A   B   C
# 0   0.0  12   3
# 1  10.0  15   9
# 2 -10.0   9  12
# 3  -3.0   8   6
# 4  12.0  15   8
"a new column based on certain conditions of existing columns"
I'm using the DataFrame provided by @zipa:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [12, 15, 9, 8, 15],
                   'C': [3, 9, 12, 6, 8]})
First approach
Here's a function that efficiently implements what you specified. It works by leveraging pandas' indexing features, specifically boolean row masks:
def update(df):
    cond_larger = df['B'] > df['B'].shift().fillna(0)
    cond_smaller = df['B'] < df['B'].shift().fillna(0)
    cond_else = ~(cond_larger | cond_smaller)
    for cond, sign in [(cond_larger, +1),   # A[x-1] + C[x]
                       (cond_smaller, -1),  # A[x-1] - C[x]
                       (cond_else, 0)]:     # A[x-1] + 0
        if any(cond):
            df.loc[cond, 'A_updated'] = (df['A'].shift().fillna(0) +
                                         sign * df[cond]['C'])
    df['A'] = df['A_updated']
    df.drop(columns=['A_updated'], inplace=True)
    return df

update(df)
=>
      A   B   C
0   3.0  12   3
1  10.0  15   9
2 -10.0   9  12
3  -3.0   8   6
4  12.0  15   8
Optimized
It turns out you can use DataFrame.mask to achieve the same as above. Note you could combine the conditions into a single call of mask; however, I find it easier to read like this:
# specify conditions
cond_larger = df['B'] > df['B'].shift().fillna(0)
cond_smaller = df['B'] < df['B'].shift().fillna(0)
cond_else = ~(cond_larger | cond_smaller)
# apply
A_shifted = (df['A'].shift().fillna(0)).copy()
df.mask(cond_larger, A_shifted + df['C'], axis=0, inplace=True)
df.mask(cond_smaller, A_shifted - df['C'], axis=0, inplace=True)
df.mask(cond_else, A_shifted, axis=0, inplace=True)
=>
(same results as above)
Notes:
- I'm assuming the default value 0 for A/B[x-1]. If the first row should be treated differently, remove or replace .fillna(0); the results will be different.
- The conditions are checked in sequence. Depending on whether updates should use the original values in A or those updated by the previous condition, you may not need the helper column A_updated.
- See previous versions of this answer for a history of how I got here.
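Picking up the second note: if, as in the question's loop, every row should build on the freshly updated A[x-1] (with A[0] = 0, per the question's first line), the recursion collapses to a cumulative sum. A minimal sketch of that reading, on the df as originally constructed:
sign = np.sign(df['B'].diff()).fillna(0)  # +1, -1 or 0 per row; row 0 contributes nothing
df['A'] = (sign * df['C']).cumsum()
print(df['A'].tolist())                   # [0.0, 9.0, -3.0, -9.0, -1.0]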

Detect indexes of a dataframe from values contained in another dataframe with Pandas

I have 2 dataframes:
# dataframe 1
data = {'Name': ['PINO', 'PALO', 'TNCO', 'TNTO', 'CUCO', 'FIGO', 'ONGF', 'LABO'],
        'Id':   [10, 9, np.nan, 14, 3, np.nan, 7, np.nan]}
df1 = pd.DataFrame(data)
and
# dataframe 2
convert_table = {'XXX': ['ALLO','BELO','CACO','CUCO','DADO','FIGO','FIGO','ONGF','PALO','PALO','PINO','TNCO','TNCO','TNCO','TNTO']}
df2 = pd.DataFrame(convert_table)
My goal is to identify the indexes of the elements of df2['XXX'] which follow these conditions:
Are present in df1['Name']
Have the corresponding df1['Id'] = NaN
I was able to achieve my goal by using the following lines of code:
nan_names = df1['Name'][df1['Id'].isnull()]
df3 = pd.DataFrame()
for name in nan_names:
    index = df2[df2['XXX']==name].index.tolist()
    if index:
        dic = {'name': [name], 'index': [index]}
        df3 = pd.concat([df3, pd.DataFrame(dic)], ignore_index=True)
However I would like to know if there is a more efficient and elegant way to achieve my goal.
The result should look like this:
          index  name
0  [11, 12, 13]  TNCO
1        [5, 6]  FIGO
Note: if the name is not found then, it is not needed to store any information.
You're looking for the method isin:
df = df2[df2['XXX'].isin(nan_names)]
This will return:
     XXX
5   FIGO
6   FIGO
11  TNCO
12  TNCO
13  TNCO
From there it's just a matter of formatting:
df.reset_index().groupby('XXX')['index'].apply(list)
This will return:
XXX
FIGO          [5, 6]
TNCO    [11, 12, 13]
The idea is to reset the index so that it becomes a column (named index). Grouping by name and applying the list function will return the list of original indices for each name.
Calling reset_index once more will return the result you were looking for.
Edit
Combining everything into a one-liner, this will be the output:
In [21]: df2[df2['XXX'].isin(nan_names)].reset_index().groupby('XXX')['index'].apply(list).reset_index()
Out[21]:
    XXX         index
0  FIGO        [5, 6]
1  TNCO  [11, 12, 13]
I think you can use merge with groupby and apply list:
nan_names = df1.loc[df1['Id'].isnull(), ['Name']]
print (nan_names)
   Name
2  TNCO
5  FIGO
7  LABO
df = pd.merge(df2.reset_index(), nan_names, on='Name', suffixes=('','_'))
print (df)
   index  Name
0      5  FIGO
1      6  FIGO
2     11  TNCO
3     12  TNCO
4     13  TNCO
print (df.groupby('Name')['index'].apply(list).reset_index())
   Name         index
0  FIGO        [5, 6]
1  TNCO  [11, 12, 13]
