Strange pandas nameless column - python

My csv file looks like this.
,timestamp,side,size,price,tickDirection,grossValue,homeNotional,foreignNotional
0,1569974396.557895,1,11668,8319.5,1,140248813.0,11668,1.40248813
1,1569974394.78865,0,5000,8319.0,0,60103377.0,5000,0.60103377
2,1569974392.355395,0,564,8319.0,0,6779660.999999999,564,0.06779661
3,1569974383.797042,0,100,8319.0,0,1202067.0,100,0.01202067
4,1569974382.944569,0,3,8319.0,0,36062.0,3,0.00036062
5,1569974382.944569,0,7412,8319.0,-1,89097247.0,7412,0.89097247
There's a nameless index column. I want to remove this column.
When I read this in pandas, it just interprets it as an index and moves on.
The problem is that when you then use df[::-1], it flips the index as well, so df[::-1]['timestamp'][0] is the same as df['timestamp'][0] if the file was read with the index column, but not if it was read without it.
How do I make it actually ignore the index column so that df[::-1] doesn't flip my indexes?
I tried usecols in read_csv, but it makes no difference: it reads the index as well as the columns specified. I tried del df[''], but that doesn't work because pandas doesn't interpret the index column as a column named '', even though that's what it is.

Just use index_col=0
df = pd.read_csv('data.csv', index_col=0)
print(df)
# Output
timestamp side size price tickDirection grossValue homeNotional foreignNotional
0 1.569974e+09 1 11668 8319.5 1 140248813.0 11668 1.402488
1 1.569974e+09 0 5000 8319.0 0 60103377.0 5000 0.601034
2 1.569974e+09 0 564 8319.0 0 6779661.0 564 0.067797
3 1.569974e+09 0 100 8319.0 0 1202067.0 100 0.012021
4 1.569974e+09 0 3 8319.0 0 36062.0 3 0.000361
5 1.569974e+09 0 7412 8319.0 -1 89097247.0 7412 0.890972
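If the concern is that df[::-1] drags the original row labels along with it, you can renumber the rows after reversing with reset_index. A minimal sketch, reusing the data.csv name from above (rev is just an illustrative name):
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)

# reverse the rows and renumber them 0..n-1 so label access matches position
rev = df[::-1].reset_index(drop=True)
print(rev['timestamp'][0])   # timestamp of the (originally) last row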

If I understand your issue correctly, you can just set timestamp as your index:
df.set_index('timestamp', drop=True)
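Note that set_index returns a new DataFrame rather than modifying df, so assign the result back (or pass inplace=True); a minimal sketch:
df = df.set_index('timestamp', drop=True)
# or: df.set_index('timestamp', drop=True, inplace=True)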

Related

Conditionally dropping columns in a pandas dataframe

I have this dataframe and my goal is to remove any columns that have less than 1000 entries.
Prior to pivoting the df I know I have 880 unique well_ids, with entry counts ranging from 4 to 60k+. I know I should end up with 102 well_ids.
I tried to accomplish this in a very naïve way by collecting the wells I am trying to remove in an array and using a loop, but I keep getting 'TypeError: Level type mismatch'; when I just use del without a for loop, it works.
#this works
del df[164301.0]
del df['TB-0071']
# this doesn't work
for id in unwanted_id:
    del df[id]
Any help is appreciated, Thanks.
You can use the dropna method:
df.dropna(axis=1, thresh=1000)  # thresh = how many non-NA values a column needs in order to be kept
The advantage of this method is that you don't need to build a list of columns first.
Also, don't forget the usual inplace=True if you want the changes made in place.
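As a small self-contained demo of thresh (toy data, not the well_id frame from the question):
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, np.nan, 5, np.nan]})

# keep only columns with at least 3 non-NA values ('b' has 1, so it is dropped)
print(toy.dropna(axis=1, thresh=3))
#    a
# 0  1
# 1  2
# 2  3
# 3  4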
You can use pandas drop method:
df.drop(columns=['colName'], inplace=True)
You can actually pass a list of column names:
unwanted_ids = [164301.0, 'TB-0071']
df.drop(columns=unwanted_ids, inplace=True)
Sample:
df[:5]
from to freq
0 A X 20
1 B Z 9
2 A Y 2
3 A Z 5
4 A X 8
df.drop(columns=['from', 'to'])
freq
0 20
1 9
2 2
3 5
4 8
And to get those column names with more than 1000 unique values, you can use something like this:
counts = df.nunique()[df.nunique()>1000].to_frame('uCounts').reset_index().rename(columns={'index':'colName'})
counts
colName uCounts
0 to 1001
1 freq 1050
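You can then feed those names straight back into a column selection or a drop; a minimal sketch building on the counts frame above (df_filtered is just an illustrative name):
# keep only the columns listed in counts['colName'] ...
df_filtered = df[counts['colName'].tolist()]

# ... or drop everything else in place
df.drop(columns=df.columns.difference(counts['colName']), inplace=True)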

pandas.Index.isin produces a different dataframe than simple slicing

I'm really new to pandas and python in general, so I apologize if this is too basic.
I have a list of indices that I must use to take a subset of the rows of a dataframe. First, I simply sliced the dataframe using the indices to produce (df_1). Then I tried to use index.isin just to see if it also works (df_2). Well, it works but it produces a shorter dataframe (and seemingly ignores some of the rows that are supposed to be selected).
df_1 = df.iloc[df_idx]
df_2 = df[df.index.isin(df_idx)]
So my question is, why are they different? How exactly does index.isin work and when is it appropriate to use it?
Synthesising duplicates in the index reproduces the behaviour you note. If your index has duplicates, it's absolutely expected that the two will give different results: iloc is purely positional, so df.iloc[df_idx] treats the label values as row positions (repeating positions 0, 0, 1, 1, 2, 2), whereas index.isin filters on labels and keeps each matching row exactly once. If you want to use these interchangeably, you need to ensure that your index values uniquely identify a row.
n = 6
df = pd.DataFrame({"idx":[i//2 for i in range(n)],"col1":[f"text {i}" for i in range(n)]}).set_index("idx")
df_idx = df.index
print(f"""
{df}
{df.iloc[df_idx]}
{df[df.index.isin(df_idx)]}
""")
output
col1
idx
0 text 0
0 text 1
1 text 2
1 text 3
2 text 4
2 text 5
col1
idx
0 text 0
0 text 0
0 text 1
0 text 1
1 text 2
1 text 2
col1
idx
0 text 0
0 text 1
1 text 2
1 text 3
2 text 4
2 text 5
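For completeness, here is a minimal sketch of making the labels unique so that label-based selection and isin agree (df_unique and sel are names invented for illustration):
df_unique = df.reset_index()       # move the duplicated 'idx' labels into a column
sel = [0, 2, 4]

a = df_unique.loc[sel]                       # label-based selection
b = df_unique[df_unique.index.isin(sel)]     # isin-based selection
assert a.equals(b)                           # identical once the labels are unique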

Update dataframe values that match a regex condition and keep remaining values intact

The following is an excerpt from my dataframe:
In[1]: df
Out[1]:
LongName BigDog
1 Big Dog 1
2 Mastiff 0
3 Big Dog 1
4 Cat 0
I want to use regex to update BigDog values to 1 if LongName is a mastiff. I need other values to stay the same. I tried this, and although it assigns 1 to mastiffs, it nulls all other values instead of keeping them intact.
import re

def BigDog(longname):
    if re.search('(?i)mastiff', longname):
        return '1'

df['BigDog'] = df['LongName'].apply(BigDog)
I'm not sure what to do, could anybody please help?
You don't need a loop or apply, use str.match with DataFrame.loc:
df.loc[df['LongName'].str.match('(?i)mastiff'), 'BigDog'] = 1
LongName BigDog
1 Big Dog 1
2 Mastiff 1
3 Big Dog 1
4 Cat 0
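For reference, the apply version nulled the other rows because the function falls through without a return statement when the regex does not match, so it returns None. If you prefer a vectorised form that keeps the existing values explicitly, np.where works too (a sketch, not part of the answer above):
import numpy as np

df['BigDog'] = np.where(df['LongName'].str.contains('mastiff', case=False),
                        1, df['BigDog'])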

In a DataFrame, how could we get a list of indexes with 0's in specific columns?

We have a large dataset that needs to be modified based on specific criteria.
Here is a sample of the data:
Input
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
1 0 0 1 0 0 1
SampleData1 = pd.DataFrame([[0, 1, 1, 1, 1], [0, 0, 1, 0, 0]],
                           columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ'])
The fields of this data are all formatted 'family.member', and a family may have any number of members. We need to remove all rows of the dataframe which have all 0's for any family.
Simply put, we want to only keep rows of the data that contain at least one member of every family.
We have no reproducible code for this problem because we are unsure of where to start.
We thought about using iterrows() but the documentation says:
#You should **never modify** something you are iterating over.
#This is not guaranteed to work in all cases. Depending on the
#data types, the iterator returns a copy and not a view, and writing
#to it will have no effect.
Other questions on S.O. do not quite solve our problem.
Here is what we want the SampleData to look like after we run it:
Expected output
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
SampleData1 = pd.DataFrame([[0, 1, 1, 1, 0]],
                           columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ'])
Also, could you please explain why we should not modify data we are iterating over, given that we do that all the time with for loops, and what the correct way to modify a DataFrame is?
Thanks for the help in advance!
Start from copying df and reformatting its columns into a MultiIndex:
df2 = df.copy()
df2.columns = df.columns.str.split(r'\.', expand=True)
The result is:
BL MI
DB KB RO RA XZ
0 0 1 1 1 0
1 0 0 1 0 0
To generate "family totals", i.e. sums of elements in rows over the top
(0) level of column index, run:
df2.groupby(level=[0], axis=1).sum()
The result is:
BL MI
0 1 2
1 0 1
But actually we want to count zeroes in each row of the above table,
so extend the above code to:
(df2.groupby(level=[0], axis=1).sum() == 0).astype(int).sum(axis=1)
The result is:
0 0
1 1
dtype: int64
meaning:
row with index 0 has no "family zeroes",
row with index 1 has one such zero (for one family).
And to print what we are looking for, run:
df[(df2.groupby(level=[0], axis=1).sum() == 0)\
.astype(int).sum(axis=1) == 0]
i.e. print rows from df, with indices for which the count of
"family zeroes" in df2 is zero.
It's possible to group along axis=1. For each row, check that all families (grouped on the column name before '.') have at least one 1, then slice by this Boolean Series to retain these rows.
m = df.groupby(df.columns.str.split('.').str[0], axis=1).any(1).all(1)
df[m]
# BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
#0 0 1 1 1 0 1
As an illustration, here's what grouping along axis=1 looks like; it partitions the DataFrame by columns.
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1):
    print(idx, gp, '\n')
#BL BL.DB BL.KB
#0 0 1
#1 0 0
#MAY MAY.BE
#0 1
#1 1
#MI MI.RO MI.RA MI.XZ
#0 1 1 0
#1 1 0 0
Now it's rather straightforward to find the rows where every one of these groups has at least one non-zero column, by chaining any and all along axis=1 as in the one-liner above.
You basically want to group on families and retain rows where there is one or more member for all families in the row.
One way to do this is to transpose the original dataframe and then split the index on the period, taking the first element which is the family identifier. The columns are the index values in the original dataframe.
We can then group on the families (level=0) and sum the number of members in each family for every record (df2.groupby(level=0).sum()). Now we retain the records where every family has at least one member (.gt(0).all()). We create a mask from these values and use it as a boolean index on the original dataframe to get the relevant rows.
df2 = SampleData1.T
df2.index = [idx.split('.')[0] for idx in df2.index]
# >>> df2
# 0 1
# BL 0 0
# BL 1 0
# MI 1 1
# MI 1 0
# MI 0 0
# >>> df2.groupby(level=0).sum()
# 0 1
# BL 1 0
# MI 2 1
mask = df2.groupby(level=0).sum().gt(0).all()
>>> SampleData1[mask]
BL.DB BL.KB MI.RO MI.RA MI.XZ
0 0 1 1 1 0

Create multiple columns based on values in single column in Pandas DataFrame

I have a column in a dataframe (df['Values']) with 1000 rows containing the repetitive codes A30, A31, A32, A33, A34. I want to create five separate columns with headings colA30, colA31, colA32, colA33, colA34 in the same dataframe (df), with a value of 0 or 1 in each new column depending on whether the row holds that code in df['Values'].
for Ex: df
Values colA30 colA31 colA32 colA33 colA34
A32 0 0 1 0 0
A30 1 0 0 0 0
A31 0 1 0 0 0
A34 0 0 0 0 1
A33 0 0 0 1 0
So if a row in df['Values'] is A32 then colA32 should be 1 and all other columns should be 0's and so on for rest of columns in df['Values'].
I did it in the following way. But is there any way to do it in one shot, as I have multiple columns with several codes for which new columns need to be created?
df['A30']=df['Values'].map(lambda x : 1 if x=='A30' else 0)
df['A31']=df['Values'].map(lambda x : 1 if x=='A31' else 0)
df['A32']=df['Values'].map(lambda x : 1 if x=='A32' else 0)
df['A33']=df['Values'].map(lambda x : 1 if x=='A33' else 0)
df['A34']=df['Values'].map(lambda x : 1 if x=='A34' else 0)
You can do this in many ways:
In pandas there is a function called pd.get_dummies() that converts a categorical column into binary indicator columns. Apply it to your categorical column and then concatenate the resulting dataframe with the original one; see the pandas documentation for get_dummies.
Another way would be to use the sklearn library and its OneHotEncoder. It does essentially the same thing as above, but the object you work with is not the same: you fit an instance of the OneHotEncoder class to your categorical data.
For your case I'd use pd.get_dummies(); it's simpler to use.
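A minimal sketch of the get_dummies route, using the column and prefix names from the question (prefix and prefix_sep produce the colA30-style headings):
import pandas as pd

df = pd.DataFrame({'Values': ['A32', 'A30', 'A31', 'A34', 'A33']})

# one 0/1 indicator column per code, named colA30 ... colA34
dummies = pd.get_dummies(df['Values'], prefix='col', prefix_sep='').astype(int)
df = pd.concat([df, dummies], axis=1)
print(df)
#   Values  colA30  colA31  colA32  colA33  colA34
# 0    A32       0       0       1       0       0
# 1    A30       1       0       0       0       0
# ...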
