pandas read text file into a dataframe


I have a .txt file
[7, 9, 20, 30, 50] [1-8]
[9, 14, 27, 31, 45] [2-5]
[7, 10, 22, 27, 38] [1-7]
that I am trying to read into a DataFrame with two columns using df = pd.read_fwf(readfile, header=None).
Instead of two columns, it produces a DataFrame with three columns, and sometimes it splits the first list of numbers into five columns:
0 1 2
0 [7, 9, 20, 30, 50] [1-8]
1 [9, 14, 27, 31, 45] [2-5]
2 [7, 10, 22, 27, 38] [1-7]
I do not understand what I am doing wrong. Could someone please help?

You can exploit the two spaces between the lists:
pd.read_csv(readfile, sep=r'\s\s', header=None, engine='python')
Out:
0 1
0 [7, 9, 20, 30, 50] [1-8]
1 [9, 14, 27, 31, 45] [2-5]
2 [7, 10, 22, 27, 38] [1-7]
pd.read_fwf without an explicit widths argument tries to infer the fixed widths. But the length of the first list varies, so there is no fixed width that separates each line into two columns.
The widths argument is very useful if your data has no delimiter but a fixed number of characters per value. 40 years ago this was a common data format.
# data.txt
20200810ITEM02PRICE30COUNT001
20200811ITEM03PRICE31COUNT012
20200812ITEM12PRICE02COUNT107
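As a minimal sketch of how widths could be applied to the sample above: the field layout (an 8-character date followed by fixed-length label/value pairs) and the column names are assumptions made for illustration, and io.StringIO stands in for the real file.

```python
import io

import pandas as pd

# Sample records copied from data.txt above; StringIO stands in for the file.
data = """\
20200810ITEM02PRICE30COUNT001
20200811ITEM03PRICE31COUNT012
20200812ITEM12PRICE02COUNT107
"""

# Assumed layout: 8-char date, then label/value pairs of fixed length.
widths = [8, 4, 2, 5, 2, 5, 3]
# Hypothetical column names, chosen only for this example.
names = ["date", "item_label", "item", "price_label", "price", "count_label", "count"]

df = pd.read_fwf(io.StringIO(data), widths=widths, names=names, header=None)

# Keep only the value columns; the label columns are constant filler.
df = df[["date", "item", "price", "count"]]
print(df)
```

Because the values are parsed as numbers, leading zeros such as in 001 are dropped; reading the columns as strings would preserve them.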
The sep argument of pd.read_csv accepts multi-character and regex delimiters. This is often a more flexible way to split strings into columns.
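A self-contained sketch of the regex-separator approach follows. It assumes the two lists on each line are separated by at least two spaces, as this answer states; the column names are hypothetical, and r"\s{2,}" is used here as a slightly more tolerant variant of the two-space pattern.

```python
import io

import pandas as pd

# Assumption: the lists are separated by two spaces, while the numbers
# inside the first list are separated by a comma and a single space.
text = (
    "[7, 9, 20, 30, 50]  [1-8]\n"
    "[9, 14, 27, 31, 45]  [2-5]\n"
    "[7, 10, 22, 27, 38]  [1-7]\n"
)

df = pd.read_csv(
    io.StringIO(text),          # stands in for the real file path
    sep=r"\s{2,}",              # split only on runs of two or more whitespace chars
    header=None,
    names=["numbers", "range"], # hypothetical column names
    engine="python",            # regex separators need the python engine
)
print(df)
```

Both columns come back as plain strings; converting them into real lists or ranges would be a separate step.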

You can read it in a single line with pandas:
import pandas as pd
df = pd.read_csv(readfile, sep=r'\s\s', header=None, engine='python')

Related

Add Row Numbers To an array [duplicate]

How do you add row numbers to an array using numpy?
I wish to print an array to look like the following:
[1, 39, 41, 43],
[2, 38, 32, 18],
[3, 27, 14, 17],
[4, 22, 21, 22],
[5, 20, 28, 23]
with 1-5 being the row numbers.
So far I can only print the array without row numbers.

np.insert(array, 0, np.arange(1, array.shape[0] + 1), axis=1)
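A small runnable sketch using the values from the question shows the idea. The variable names are illustrative, and np.column_stack is included only as an equivalent alternative, not as part of the original answer.

```python
import numpy as np

# The array from the question, without row numbers.
array = np.array([
    [39, 41, 43],
    [38, 32, 18],
    [27, 14, 17],
    [22, 21, 22],
    [20, 28, 23],
])

# Row numbers starting at 1, one per row.
row_numbers = np.arange(1, array.shape[0] + 1)

# Insert the row numbers as a new first column.
numbered = np.insert(array, 0, row_numbers, axis=1)

# Equivalent alternative: stack the numbers and the array column-wise.
alternative = np.column_stack((row_numbers, array))

print(numbered)
```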

How to use Python Pandas to calculate the mean for skipped backward rows?

Here is the data:
data = {'col1': [12, 13, 5, 2, 12, 12, 13, 23, 32, 65, 33, 52, 63, 12, 42, 65, 24, 53, 35]}
df = pd.DataFrame(data)
I want to create a new column, skipped_mean. Only the last 3 rows have a valid value for this variable. It looks back 6 rows at a time, 3 times in a row, and takes the average of the three numbers.
How can this be done?

You could do it with a weighted rolling mean approach:
import numpy as np
weights = np.array([1/3,0,0,0,0,0,1/3,0,0,0,0,0,1/3])
df['skipped_mean'] = df['col1'].rolling(13).apply(lambda x: np.sum(weights*x))
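The weights pick out the current row, the row 6 steps back, and the row 12 steps back. Under that reading, the same result could also be sketched with shift, which avoids calling a Python function per window. The column name skipped_mean_shift is hypothetical, and raw=True is an optional addition that passes NumPy arrays to the lambda.

```python
import numpy as np
import pandas as pd

data = {'col1': [12, 13, 5, 2, 12, 12, 13, 23, 32, 65,
                 33, 52, 63, 12, 42, 65, 24, 53, 35]}
df = pd.DataFrame(data)

# Weighted rolling mean from the answer: nonzero weights at offsets 0, 6 and 12.
weights = np.array([1/3, 0, 0, 0, 0, 0, 1/3, 0, 0, 0, 0, 0, 1/3])
df['skipped_mean'] = df['col1'].rolling(13).apply(
    lambda x: np.sum(weights * x), raw=True
)

# Alternative sketch: average the current value with the values 6 and 12 rows back.
s = df['col1']
df['skipped_mean_shift'] = (s + s.shift(6) + s.shift(12)) / 3

print(df)
```

Both columns are NaN for the first 12 rows, because a full 13-row window is not yet available there.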

vs code does not print full 2-D list in terminal

I'm trying to print data from a .mat file that contains about 400 rows and 800 columns of data, represented by a 2-D list.
I used scipy.io to load the file, and I'm trying to print the 2-D list in VS Code,
but the result is shortened with "..." to something like
[[55, 23, 12...52, 13, 14]
[28, 26, 12...1, 14, 25]
[28, 13, 12...1, 15, 33]
...
[5, 2, 22...32, 11, 4]
[2, 6, 14...15, 54, 25]
[2, 11, 16...3, 25, 13]]
Is there a setting in VS Code that allows me to print the entire 400x800 matrix?
Any suggestion is appreciated!

Scoring pandas column's vs other columns

I want to rank how many of the other columns in a DataFrame are greater than or equal to a reference column. Given testdf:
testdf = pd.DataFrame({'RefCol': [10, 20, 30, 40],
                       'Col1': [11, 19, 29, 40],
                       'Col2': [12, 21, 28, 39],
                       'Col3': [13, 22, 31, 38]
                       })
I am using the helper function:
def sorter(row):
    sortedrow = row.sort_values()
    return sortedrow.index.get_loc('RefCol')
as:
testdf['Score'] = testdf.apply(sorter, axis=1)
With actual data this method is very slow. How can I speed it up? Thanks

It looks like you need to compare each column with RefCol and count how many columns are less than RefCol. Use:
testdf.lt(testdf['RefCol'],axis=0).sum(1)
0 0
1 1
2 2
3 2
For greater than or equal to, use:
testdf.drop(columns='RefCol').ge(testdf['RefCol'], axis=0).sum(axis=1)
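Put together as a runnable sketch with the question's data, both counts can be computed without a row-wise apply. The variable names others, less_count and ge_count are illustrative only.

```python
import pandas as pd

testdf = pd.DataFrame({'RefCol': [10, 20, 30, 40],
                       'Col1': [11, 19, 29, 40],
                       'Col2': [12, 21, 28, 39],
                       'Col3': [13, 22, 31, 38]})

# All columns except the reference column.
others = testdf.drop(columns='RefCol')

# Per row: how many other columns are strictly less than RefCol.
less_count = others.lt(testdf['RefCol'], axis=0).sum(axis=1)

# Per row: how many other columns are greater than or equal to RefCol.
ge_count = others.ge(testdf['RefCol'], axis=0).sum(axis=1)

print(less_count.tolist())  # [0, 1, 2, 2]
print(ge_count.tolist())    # [3, 2, 1, 1]
```

The comparisons run column-wise over the whole frame, which is usually much faster than sorting every row inside apply.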

Splitting a list to 5 lists

I'm trying to split a list into 5 lists. I searched the internet, but the only thing I could find was how to split a list into n lists with the same number of items in each.
This sadly doesn't solve my problem. What I want to do is split a list into 5 lists with different numbers of items.
Let's say the list has 35 items (this is not always 35, but it is never more than 45).
I want to split it into:
a list containing items 1-5
a list containing items 5-13
a list containing items 13-20
a list containing items 20-27
and a list containing items 27-35
Everything I saw was aimed at splitting a list into sub-lists of the same size, so I was wondering whether this is even possible.

You can achieve this using basic list slicing, like below:
In [1]: l = list(range(35))
In [2]: l[0:5], l[5:13], l[13:20], l[20:27], l[27:35]
Out[2]:
([0, 1, 2, 3, 4],
[5, 6, 7, 8, 9, 10, 11, 12],
[13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26],
[27, 28, 29, 30, 31, 32, 33, 34])
I couldn't find any repeatable pattern between the numbers 1, 5, 13, 20, 27, 35, but if there is one, you can easily calculate the nth and n+1th terms, to get the slices dynamically instead of hardcoding.
Also note that list indexes in Python begin at 0, and that a slice list[x:y] contains only the elements list[x], list[x+1], ..., list[y-1]; list[y] is not part of the output.
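If the cut points are kept in a list rather than hardcoded in each slice, the slices can be generated in a loop. The sketch below assumes the boundaries from the question (0, 5, 13, 20, 27 and the end of the list); the names items, boundaries and chunks are illustrative.

```python
items = list(range(35))

# Cut points: each chunk runs from one boundary up to (not including) the next.
# Using len(items) as the last boundary lets the final chunk absorb a
# slightly shorter or longer list.
boundaries = [0, 5, 13, 20, 27, len(items)]

# Pair each boundary with its successor and slice between them.
chunks = [items[start:stop] for start, stop in zip(boundaries, boundaries[1:])]

for chunk in chunks:
    print(chunk)
```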
