How to create a DataFrame from custom values - python

I am reading in a text file, on each line there are multiple values. I am parsing them based on requirements using function parse.
def parse(line):
......
......
return line[0],line[2],line[5]
I want to create a dataframe, with each line as a row and the three returened values as columns
df = pd.DataFrame()
with open('data.txt') as f:
for line in f:
df.append(line(parse(line)))
When I run the above code, I get all values as a single column. Is it possible to get it in proper tabular format.

You shouldn't .append to DataFrame in a loop, that is very inefficient anyway. Do something like:
colnames = ['col1','col2','col3'] # or whatever you want
with open('data.txt') as f:
df = pd.DataFrame([parse(l) for l in f], columns=colnames)
Note, the fundamental problem is that pd.DataFrame.append expects another data-frame, and it appends the rows of that other data-frame. It interpretes a list as a bunch of single rows. So note, if you structure your list to have "rows" it would work as intended. But you shouldn't be using .append here anyway:
In [6]: df.append([1,2,3])
Out[6]:
0
0 1
1 2
2 3
In [7]: df = pd.DataFrame()
In [8]: df.append([[1, 2, 3]])
Out[8]:
0 1 2
0 1 2 3

Uma forma rĂ¡pida de fazer isso (TL;DR):
Creating the new column:
`df['com_zeros'] = '0'`
Applying the condition::
for b in df.itertuples():
df.com_zeros[b.Index] = '0'+str(b.battles) if b.battles<9 else str(b.battles)
Result:
df
regiment company deaths battles size com_zeros
0 Nighthawks 1st kkk 5 l 05
1 Nighthawks 1st 52 42 ll 42
2 Nighthawks 2nd 25 2 l 02
3 Nighthawks 2nd 616 2 m 02
See the example by https://repl.it/JHW6.
Obs.:
The example running on repl.it seems to hang, but that is not the case, the load of pandas on repl.it is always time consuming.
To suppress warnings on jupyter notebook:
import warnings
warnings.filterwarnings('ignore')

In addition to #juanpa.arrilaga,
It seems that you do have a structured file and just need the 1st 3rd and 5th item in the file.
load it and use drop
df = pd.read_csv('file')
df.drop([columns],axis = 1)

Related

How to skip the first two lines when reading in multiple files into a Pandas df [duplicate]

I'm trying to import a .csv file using pandas.read_csv(), however, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).
I can't see how not to import it because the arguments used with the command seem ambiguous:
From the pandas website:
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the
start of the file."
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?
You can try yourself:
>>> import pandas as pd
>>> from StringIO import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
0 1
0 1 2
1 5 6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
0 1
0 3 4
1 5 6
I don't have reputation to comment yet, but I want to add to alko answer for further reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows
I got the same issue while running the skiprows while reading the csv file.
I was doning skip_rows=1 this will not work
Simple example gives an idea how to use skiprows while reading csv file.
import pandas as pd
#skiprows=1 will skip first line and try to read from second line
df = pd.read_csv('my_csv_file.csv', skiprows=1) ## pandas as pd
#print the data frame
df
All of these answers miss one important point -- the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that are the labels, next comes a line that describes the date types, and last the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:
> # ----------------------------- WARNING ----------------------------------
> # Some of the data that you have obtained from this U.S. Geological Survey database
> # may not have received Director's approval. ... agency_cd site_no datetime tz_cd 139719_00065 139719_00065_cd
> 5s 15s 20d 6s 14n 10s USGS 08041780 2018-05-06 00:00 CDT 1.98 A
It would be nice if there was a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)
Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). Column names are skipped as well (index 0) and the top line is used for column names. To add column names use names = ['col1', 'col2'] parameter:
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d
skip[1] will skip second line, not the first one.

using pandas read 0th row of csv and save it into list

this is my dataframe:-
A B C D
1 23 34 23
3 23 21 4
4 5 6 7
what i want to do is read 0th row and add it into a list like this [A,B,C,D]
(here this column name can be anything)
I have used
data = pd.read_csv(file_path, nrows=0)
But i am not able to add/store it into list
How to do that?
You can easily do this:
list(data)
but it is more explicit to write like t his:
list(data.columns.values)
Edit: As RafAelC said correctly it is better to use tolist() like t his:
data.columns.values.tolist()

How do I skip first rows with Pandas in Python [duplicate]

I'm trying to import a .csv file using pandas.read_csv(), however, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).
I can't see how not to import it because the arguments used with the command seem ambiguous:
From the pandas website:
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the
start of the file."
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?
You can try yourself:
>>> import pandas as pd
>>> from StringIO import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
0 1
0 1 2
1 5 6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
0 1
0 3 4
1 5 6
I don't have reputation to comment yet, but I want to add to alko answer for further reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows
I got the same issue while running the skiprows while reading the csv file.
I was doning skip_rows=1 this will not work
Simple example gives an idea how to use skiprows while reading csv file.
import pandas as pd
#skiprows=1 will skip first line and try to read from second line
df = pd.read_csv('my_csv_file.csv', skiprows=1) ## pandas as pd
#print the data frame
df
All of these answers miss one important point -- the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that are the labels, next comes a line that describes the date types, and last the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:
> # ----------------------------- WARNING ----------------------------------
> # Some of the data that you have obtained from this U.S. Geological Survey database
> # may not have received Director's approval. ... agency_cd site_no datetime tz_cd 139719_00065 139719_00065_cd
> 5s 15s 20d 6s 14n 10s USGS 08041780 2018-05-06 00:00 CDT 1.98 A
It would be nice if there was a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)
Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). Column names are skipped as well (index 0) and the top line is used for column names. To add column names use names = ['col1', 'col2'] parameter:
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d
skip[1] will skip second line, not the first one.

Pandas Dataframes: how to build them efficiently

I have a file with 1M rows that I'm trying to read into 20 DataFrames. I do not know in advance which row belongs to which DataFrame or how large each DataFrame will be. How can I process this file into DataFrames efficiently? I've tried to do this several different ways. Here is what I currently have:
data = pd.read_csv(r'train.data', sep=" ", header = None) # Not slow
def collectData(row):
id = row[0]
df = dictionary[id] # Row content determines which dataframe this row belongs to
next = len(df.index)
df.loc[next] = row
data.apply(collectData, axis=1)
It's very slow. What am I doing wrong? If I just apply an empty function, my code runs in 30 sec. With the actual function it takes at least 10 minutes and I'm not sure if it would finish.
Here are a few sample rows from the dataset:
1 1 4
1 2 2
1 3 10
1 4 4
The full dataset is available here (if you click on Matlab version)
Your approach is not a vectored one, because you apply a python function row by row.
Rather that creating 20 dataframes , make a dictionary containing an index (in range(20)) for each key[0]. Then add this information to your DataFrame:
data['dict']=data[0].map(dictionary)
Then reorganize :
data2=data.reset_index().set_index(['dict','index'])
data2 is like :
0 1 2
dict index
12 0 1 1 4
1 1 2 2
2 1 3 10
3 1 4 4
4 1 5 2
....
and data2.loc[i] is one of the Dataframe you want.
EDIT:
It seems that dictionary is describe in train.label.
You can set the dictionary before by:
with open(r'train.label') as f: u=f.readlines()
v=[int(x) for x in u] # len(v) = 11269 = data[0].max()
dictionary=dict(zip(range(1,len(v)+1),v))
Since, the full data set is easily loaded into memory, the following should be fairly quick
data_split = {i: data[data[0] == i] for i in range(1, 21)}
# to access each dataframe, do a dictionary lookup, i.e.
data_split[2].head()
0 1 2
769 2 12 4
770 2 16 2
771 2 23 4
772 2 27 2
773 2 29 6
you may also want to reset the indices or copy the data frame when you're slicing the data frame into smaller data frames.
additional reading:
copy
reset_index
view-vs-copy
If you want to build them efficiently, I think you need some good raw materials:
wood
cement
Are robust and durable.
Try to avoid using hay as the dataframe can be blown up with a little wind.
Hope that helps

How to select and delete columns with duplicate name in pandas DataFrame

I have a huge DataFrame, where some columns have the same names. When I try to pick a column that exists twice, (eg del df['col name'] or df2=df['col name']) I get an error. What can I do?
You can adress columns by index:
>>> df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['a','a'])
>>> df
a a
0 1 2
1 3 4
2 5 6
>>> df.iloc[:,0]
0 1
1 3
2 5
Or you can rename columns, like
>>> df.columns = ['a','b']
>>> df
a b
0 1 2
1 3 4
2 5 6
This is not a good situation to be in. Best would be to create a hierarchical column labeling scheme (Pandas allows for multi-level column labeling or row index labels). Determine what it is that makes the two different columns that have the same name actually different from each other and leverage that to create a hierarchical column index.
In the mean time, if you know the positional location of the columns in the ordered list of columns (e.g. from dataframe.columns) then you can use many of the explicit indexing features, such as .ix[], or .iloc[] to retrieve values from the column positionally.
You can also create copies of the columns with new names, such as:
dataframe["new_name"] = data_frame.ix[:, column_position].values
where column_position references the positional location of the column you're trying to get (not the name).
These may not work for you if the data is too large, however. So best is to find a way to modify the construction process to get the hierarchical column index.
Another solution:
def remove_dup_columns(frame):
keep_names = set()
keep_icols = list()
for icol, name in enumerate(frame.columns):
if name not in keep_names:
keep_names.add(name)
keep_icols.append(icol)
return frame.iloc[:, keep_icols]
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.random.randint(0, 50, (5, 4)), columns=['A', 'A', 'B', 'B'])
print(frame)
print(remove_dup_columns(frame))
The output is
A A B B
0 18 44 13 47
1 41 19 35 28
2 49 0 30 16
3 39 29 43 41
4 26 19 48 13
A B
0 18 13
1 41 35
2 49 30
3 39 43
4 26 48
The following function removes columns with dublicate names and keeps only one. Not exactly what you asked for, but you can use snips of it to solve your problem. The idea is to return the index numbers and then you can adress the specific column indices directly. The indices are unique while the column names aren't
def remove_multiples(df,varname):
"""
makes a copy of the first column of all columns with the same name,
deletes all columns with that name and inserts the first column again
"""
from copy import deepcopy
dfout = deepcopy(df)
if (varname in dfout.columns):
tmp = dfout.iloc[:, min([i for i,x in enumerate(dfout.columns == varname) if x])]
del dfout[varname]
dfout[varname] = tmp
return dfout
where
[i for i,x in enumerate(dfout.columns == varname) if x]
is the part you need

Categories