Read all lines of csv file using .read_csv - python

I am trying to read a simple csv file using pandas, but I can't figure out how to not "lose" the first row.
For example:
my_file.csv
Looks like this:
45
34
77
But when I try to read it:
In [18]: import pandas as pd
In [19]: df = pd.read_csv('my_file.csv', header=False)
In [20]: df
Out[20]:
45
0 34
1 77
[2 rows x 1 columns]
This is not what I am after; I want to have 3 rows. I want my DataFrame to look exactly like this:
In [26]: my_list = [45,34,77]
In [27]: df = pd.DataFrame(my_list)
In [28]: df
Out[28]:
0
0 45
1 34
2 77
[3 rows x 1 columns]
How can I use .read_csv to get the result I am looking for?

Yeah, this is a bit of a UI problem. We should handle False; right now pandas thinks you want the header on row 0 (since False == 0). Use None instead:
>>> df = pd.read_csv("my_file.csv", header=False)
>>> df
45
0 34
1 77
>>> df = pd.read_csv("my_file.csv", header=None)
>>> df
0
0 45
1 34
2 77
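If you also want a meaningful column name instead of the default 0, header=None can be combined with names=. A small sketch; the name "value" is just an example, not something from the question:
>>> df = pd.read_csv("my_file.csv", header=None, names=["value"])
>>> df
   value
0     45
1     34
2     77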

Related

Pandas: How to sort rows based on particular suffix values?

My Pandas data frame contains the following data reading from a csv file:
id,values
1001-MAC, 10
1034-WIN, 20
2001-WIN, 15
3001-MAC, 45
4001-LINUX, 12
4001-MAC, 67
df = pd.read_csv('example.csv')
df.set_index('id', inplace=True)
I have to sort this data frame based on the id column, ordered by the given suffix list ["WIN", "MAC", "LINUX"]. Thus, I would like to get the following output:
id,values
1034-WIN, 20
2001-WIN, 15
1001-MAC, 10
3001-MAC, 45
4001-MAC, 67
4001-LINUX, 12
How can I do that?
Here is one way to do that:
import pandas as pd
df = pd.read_csv('example.csv')
idx = df.id.str.split('-').str[1].sort_values(ascending=False).index
df = df.loc[idx]
df.set_index('id', inplace=True)
print(df)
Try:
df = df.sort_values(
    by=["id"], key=lambda x: x.str.split("-").str[1], ascending=False
)
print(df)
Prints:
id values
1 1034-WIN 20
2 2001-WIN 15
0 1001-MAC 10
3 3001-MAC 45
5 4001-MAC 67
4 4001-LINUX 12
Add a column to the dataframe that contains only the suffixes (use str.split() for that) and sort the whole df based on that new column.
import pandas as pd
df = pd.DataFrame({
    "id": ["1001-MAC", "1034-WIN", "2001-WIN", "3001-MAC", "4001-LINUX", "4001-MAC"],
    "values": [10, 20, 15, 45, 12, 67]
})
df["id_postfix"] = df["id"].apply(lambda x: x.split("-")[1])
df = df.sort_values("id_postfix", ascending=False)
df = df[["id", "values"]]
print(df)
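Note that ascending=False in the answers above only gives the desired order because reverse-alphabetical order (WIN, MAC, LINUX) happens to match the given list. If the suffix list were arbitrary, one option (a sketch, assuming pandas >= 1.1 for the key argument and the df with the id column from the question) is to map each suffix to its position in the list and sort on that; kind="stable" keeps the original order within each suffix:
suffix_order = ["WIN", "MAC", "LINUX"]
rank = {suf: i for i, suf in enumerate(suffix_order)}
df = df.sort_values(by="id", kind="stable",
                    key=lambda s: s.str.split("-").str[1].map(rank))
print(df)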

Adding a list to a column when iterating over a dataframe row

I have the following dataframe:
import numpy as np
import pandas as pd
import random
df = pd.DataFrame(np.random.randint(0,20,size=(2, 2)), columns=list('AB'))
df
A B
0 13 4
1 16 17
Then I create another dataframe in a loop where the columns of the dataframe are lists. There is a post here (Pandas split column of lists into multiple columns) that shows how to split the columns.
tmp_lst_1 = []
for index, row in df.iterrows():
    tmp_lst_2 = []
    for r in range(len(row)):
        tmp_lst_2.insert(r, random.sample(range(1, 50), 2))
    tmp_lst_1.insert(index, tmp_lst_2)
df1 = pd.DataFrame(tmp_lst_1)
df1
0 1
0 [21, 5] [6, 42]
1 [49, 40] [8, 45]
but I was wondering if there is a more efficient way to create this dataframe without needing to split all the columns individually. I am looking to get something like this:
df1
C D E F
0 21 5 6 42
1 49 40 8 45
I think the DataFrame.iterrows loop is not necessary here; you can use a nested list comprehension that flattens the sampled lists:
df = pd.DataFrame(np.random.randint(0,20,size=(2, 2)), columns=list('AB'))
tmp_lst_1 = [[x for r in range(len(df.columns))
                for x in random.sample(range(1, 50), 2)]
             for i in range(len(df))]
df1 = pd.DataFrame(tmp_lst_1, index=df.index)
print (df1)
0 1 2 3
0 23 24 42 48
1 26 43 24 5
Alternative without list comprehension:
tmp_lst_1 = []
for i in range(len(df)):
    flat_list = []
    for r in range(len(df.columns)):
        for x in random.sample(range(1, 50), 2):
            flat_list.append(x)
    tmp_lst_1.append(flat_list)
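Either way, if you also want the specific column names C, D, E, F from the desired output, you can pass them to the DataFrame constructor; the list of names just has to match the number of generated columns (a sketch reusing tmp_lst_1 from above):
df1 = pd.DataFrame(tmp_lst_1, index=df.index, columns=list('CDEF'))
print(df1)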

Looping over data frame to cap and sum another data frame

I am trying to use the entries from df1 to cap the amounts in df2, then add them up by type and summarize the result in df3. I'm not sure how to get there; the for loop using iterrows below is my best guess, but it's not complete.
Code:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Caps': ['25', '50', '100']})
df2 = pd.DataFrame({'Amounts': ['45', '25', '65', '35', '85', '105', '80'],
                    'Type': ['a', 'b', 'b', 'c', 'a', 'b', 'd']})
df3 = pd.DataFrame({'Type': ['a', 'b', 'c', 'd']})
df1['Caps'] = df1['Caps'].astype(float)
df2['Amounts'] = df2['Amounts'].astype(float)
for index1, row1 in df1.iterrows():
    for index2, row2 in df3.iterrows():
        df3[str(row1['Caps']) + 'limit'] = df2['Amounts'].where(
            df2['Type'] == row2['Type']).where(
            df2['Amounts'] <= row1['Caps'], row1['Caps']).sum()
# My ideal output would be this:
df3 = pd.DataFrame({'Type': ['a', 'b', 'c', 'd'],
                    'Total': ['130', '195', '35', '80'],
                    '25limit': ['50', '75', '25', '25'],
                    '50limit': ['95', '125', '35', '50'],
                    '100limit': ['130', '190', '35', '80'],
                    })
Output:
>>> df3
Type Total 25limit 50limit 100limit
0 a 130 50 95 130
1 b 195 75 125 190
2 c 35 25 35 35
3 d 80 25 50 80
Use numpy to compare all the Amounts values with the Caps by broadcasting to a 2d array a, then build a DataFrame with the constructor, sum per column level, transpose with DataFrame.T and append the suffix with DataFrame.add_suffix.
For the aggregated Total column, use DataFrame.insert to place GroupBy.sum in the first position:
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
am = df2['Amounts'].to_numpy()
ca = df1['Caps'].to_numpy()
#pandas below 0.24
#am = df2['Amounts'].values
#ca = df1['Caps'].values
a = np.where(am <= ca[:, None], am[None, :], ca[:, None])
df1 = (pd.DataFrame(a, columns=df2['Type'], index=df1['Caps'])
         .sum(axis=1, level=0).T.add_suffix('limit'))
df1.insert(0, 'Total', df2.groupby('Type')['Amounts'].sum())
df1 = df1.reset_index().rename_axis(None, axis=1)
print (df1)
Type Total 25limit 50limit 100limit
0 a 130 50 95 130
1 b 195 75 125 190
2 c 35 25 35 35
3 d 80 25 50 80
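The broadcasting step is easiest to see on a tiny standalone example (made-up numbers, independent of the question's data):
import numpy as np
am = np.array([45, 25, 65])   # amounts
ca = np.array([25, 50])       # caps
# row i holds the amounts capped at ca[i]
print(np.where(am <= ca[:, None], am[None, :], ca[:, None]))
# [[25 25 25]
#  [45 25 50]]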
Here is my solution that stays mostly within pandas; however, it is about two times slower than @jezrael's solution (10.5 ms vs. 5.07 ms).
limcols = df1.Caps.to_list()
df2 = df2.reindex(columns=["Amounts", "Type"] + limcols)
df2[limcols] = df2[limcols].transform(
    lambda sc: np.where(df2.Amounts.le(sc.name), df2.Amounts, sc.name))
# Summations:
g = df2.groupby("Type")
df3 = g[limcols].sum()
df3.insert(0, "Total", g.Amounts.sum())
# Renaming columns:
c_dic = {lim: f"{lim:.0f}limit" for lim in limcols}
df3 = df3.rename(columns=c_dic).reset_index()
# Cleanup:
# df2 = df2.drop(columns=limcols)

Non-breaking space column indexing in pandas dataframe

I have a pandas groupby object that contains a column with a non-breaking space as its name. Although the following snippet is able to print it:
In [25]: for key, item in grouped_df:
    ...:     print(key)
Output:
... other names
I'm not able to index it with grouped_df[key]:
In [29]: for key, item in grouped_df:
    ...:     print(key, grouped_df[key].count())
which results in:
KeyError: 'Column not found: '
[Update]
A partial solution was to use .agg(['count']). However, that only handles the specific example I'm giving, not the main problem.
Here is code which reproduces the problem:
import numpy as np
import pandas as pd
N = 100
df = pd.DataFrame({'col': np.random.choice([1, 2, 3, 4, ' '], size=N),
                   'col2': np.random.randint(10, size=N)})
grouped_df = df.groupby('col')
for key, item in grouped_df:
    print(key)
print(grouped_df[' '])
grouped_df is a DataFrameGroupBy object, not a DataFrame.
To extract a DataFrame from grouped_df, use the get_group method:
In [231]: grouped_df.get_group(' ')
Out[231]:
col col2
3 9
9 2
14 5
29 0
30 4
33 6
38 7
41 0
53 7
57 6
73 8
75 7
83 0
92 1
98 8
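Alternatively, inside the for key, item loop the variable item already is the group's DataFrame, so there is no need to index the GroupBy object at all (a small sketch based on the reproduction code above):
for key, item in grouped_df:
    print(repr(key), len(item))   # item is the sub-DataFrame for this group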

Iterating through pandas string index turned them into floats

I have a csv file:
SID done good_ecg good_gsr good_resp comment
436 0 1 1
2411 1 1 1
3858 0 1 1
4517 0 1 1 117 min diff between files
9458 1 0 1 ######### error in my script
9754 0 1 1 trigger fehler
#REF!
88.8888888889
which I load into a pandas dataframe like this:
df = pandas.read_csv(f, delimiter="\t", dtype="str", index_col='SID')
I want to iterate through the index and print each one. But when I try
for subj in df.index:
    print subj
I get
436.0
2411.0
...
Now there is this '.0' at the end of each number. What am I doing wrong?
I have also tried iterating with iterrows() and have the same problem.
Thank you for any help!
EDIT: Here is the whole code I am using:
import pandas

def write():
    df = pandas.read_csv("overview.csv", delimiter="\t", dtype="str", index_col='SID')
    for subj in df.index:
        print subj

write()
Ah. The dtype parameter doesn't apply to the index_col:
>>> !cat sindex.csv
a,b,c
123,50,R
234,51,R
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col="a")
>>> df
b c
a
123 50 R
234 51 R
>>> df.index
Int64Index([123, 234], dtype='int64', name='a')
Instead, read it in without an index_col (None is actually the default, so you don't need index_col=None at all, but here I'll be explicit) and then set the index:
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col=None)
>>> df = df.set_index("a")
>>> df
b c
a
123 50 R
234 51 R
>>> df.index
Index(['123', '234'], dtype='object', name='a')
(I can't think of circumstances under which df.index would have dtype object but when you iterate over it you'd get integers, but you didn't actually show any self-contained code that generated that problem.)
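Applied to the code from the question, that would look roughly like this (an untested sketch; overview.csv and the SID column are taken from the question):
import pandas

def write():
    df = pandas.read_csv("overview.csv", delimiter="\t", dtype="str", index_col=None)
    df = df.set_index("SID")
    for subj in df.index:
        print subj  # now prints '436', '2411', ... as strings

write()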
