Converting a multiline string to a dataframe

I have the following string:
Hoy
1
5
14
3
0
23
and I would like to turn it into a df.
I thought it would be a good idea to convert it with list(string) and then call pd.DataFrame(list(string)); however, converting it to a list returns the following output:
['\n', 'H', 'o', 'y', '\n', '1', '\n', '5', '\n', '1', '4', '\n', '3', '\n', '0', '\n', '2', '3', '\n', '2', ',', '8', '3', '*', '\n']
Is there an alternative way to turn the initial string into a df like this?
0
0 Hoy
1 1
2 5
3 14
4 3
5 0
6 23

Use pd.read_csv, passing an IO buffer to it:
import pandas as pd
from io import StringIO
text = '''Hoy
1
5
14
3
0
23
'''
pd.read_csv(StringIO(text), header=None)
0
0 Hoy
1 1
2 5
3 14
4 3
5 0
6 23
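
If the text holds one value per line and no delimiters, splitting on line breaks also works. This is a minimal sketch rather than part of the answer above; it assumes the values should stay as strings and that blank lines can be dropped, and the variable names are illustrative.

```python
import pandas as pd

text = '''Hoy
1
5
14
3
0
23
'''

# list(text) splits on every character; splitlines() splits on line breaks.
# The filter drops any empty lines.
rows = [line for line in text.splitlines() if line.strip()]
df = pd.DataFrame(rows)
print(df)
```

Every cell stays a string here, so a numeric conversion would still be needed before doing arithmetic.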

This should act as an argument for accepting @COLDSPEED's answer by observing how ugly this answer is.
txt = """Hoy
1
5
14
3
0
23"""
(lambda x: pd.Series(pd.to_numeric(x[1:], 'ignore'), name=x[0]))(
txt.split('\n')
).to_frame()
Hoy
0 1
1 5
2 14
3 3
4 0
5 23

Related

Counting list entries in specific columns in Pandas Dataframe

I have a Dataframe like this
1 2 ... 9
0 ['1'] [] ['9']
1 ['1'] ['2', '2', '2'] ['9', '9']
2 ['1', '1'] ['2', '2', '2'] []
3 ['1', '1', '1'] ['2'] []
I want to count the occurrences in each column so that the output would be like this
1 2 ... 9
0 1 0 1
1 1 3 2
2 2 3 0
3 3 1 0
This seems to work with the following code
df['1'].apply(lambda x: x.count('1'))
but how can I automate it for all my columns so I don't have to run the above code for each individual column?
In addition, I used
df['1'].apply(lambda x: x.count('1')).sum()
to count the total over the rows, which seems to give the right answer. Is there a better way, though?
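
One possible approach, sketched under assumptions: every cell holds a Python list, the column labels are strings, and counting occurrences is equivalent to taking each list's length (true for the sample, where column n only ever contains 'n'). Only the three columns shown in the question are reproduced.

```python
import pandas as pd

df = pd.DataFrame({
    '1': [['1'], ['1'], ['1', '1'], ['1', '1', '1']],
    '2': [[], ['2', '2', '2'], ['2', '2', '2'], ['2']],
    '9': [['9'], ['9', '9'], [], []],
})

# Apply len to every cell, one column at a time.
counts = df.apply(lambda col: col.map(len))
print(counts)

# Per-column totals over all rows.
totals = counts.sum()
print(totals)
```

If a column could contain mixed values, the original x.count(...) call would be needed in place of len.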

How to sort the values in dataframe?

I am trying to sort the values but am not getting the desired result. Can you please help me do this?
Example:
df = pd.read_csv("D:/Users/SPate233/Downloads/iMedical/sqoop/New folder/metadata_file_imedical.txt", delimiter='~')
#df.sort_values(by = ['dependency'], inplace = True)
df.sort_values('dependency', ascending=True, inplace=True)
print(list(df['dependency'].unique()))
Output:
['0', '1', '1,10,11,26,28,55', '1,26,28', '10', '11', '12', '17,42', '2', '26,28', '33', '42', '6']
Desired output:
['0', '1', '2', '6', '10', '11', '12', '33', '42', '17,42', '26,28', '1,26,28', '1,10,11,26,28,55']

Order by the length of the string, and then by its value:
df.assign(len = df.dependency.str.len()).sort_values(["len", "dependency"])
The output is (leaving the len column in for clarity):
dependency len
0 0 1
1 1 1
8 2 1
12 6 1
4 10 2
5 11 2
6 12 2
10 33 2
11 42 2
7 17,42 5
9 26,28 5
3 1,26,28 7
2 1,10,11,26,28,55 16
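
The same idea as a self-contained sketch, using the unique values from the question in place of the original file (which is not available here). It relies on string length as a stand-in for numeric order, so it assumes the values have no leading zeros or stray whitespace.

```python
import pandas as pd

df = pd.DataFrame({'dependency': [
    '0', '1', '1,10,11,26,28,55', '1,26,28', '10', '11', '12',
    '17,42', '2', '26,28', '33', '42', '6',
]})

# Sort by string length first, then alphabetically within each length.
ordered = (
    df.assign(len=df['dependency'].str.len())
      .sort_values(['len', 'dependency'])
)
result = ordered['dependency'].tolist()
print(result)
```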

Concatenate strings based on inner join

I have two DataFrames containing the same columns; an id, a date and a str:
df1 = pd.DataFrame({'id': ['1', '2', '3', '4', '10'],
'date': ['4', '5', '6', '7', '8'],
'str': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': ['1', '2', '3', '4', '12'],
'date': ['4', '5', '6', '7', '8'],
'str': ['A', 'B', 'C', 'D', 'Q']})
I would like to join these two datasets on the id and date columns, and create a resulting column that is the concatenation of str:
df3 = pd.DataFrame({'id': ['1', '2', '3', '4', '10', '12'],
'date': ['4', '5', '6', '7', '8', '8'],
'str': ['aA', 'bB', 'cC', 'dD', 'e', 'Q']})
I guess I can make an inner join and then concatenate the strings, but is there an easier way to achieve this?

If I understand correctly, you can use concat + groupby:
pd.concat([df1,df2]).groupby(['date','id'])['str'].sum().reset_index()
Out[9]:
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 10 e
5 8 12 Q
If efficiency matters, set the keys as the index and sum over the index levels:
pd.concat([df1,df2]).set_index(['date','id']).groupby(level=[0,1]).sum().reset_index()
Out[12]:
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 10 e
5 8 12 Q
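
A self-contained variant of the concat + groupby idea, as a sketch. It swaps sum() for agg(''.join), an assumption made here so that the string concatenation is explicit; the frames are the ones from the question.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['1', '2', '3', '4', '10'],
                    'date': ['4', '5', '6', '7', '8'],
                    'str': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': ['1', '2', '3', '4', '12'],
                    'date': ['4', '5', '6', '7', '8'],
                    'str': ['A', 'B', 'C', 'D', 'Q']})

# Stack both frames, then join the strings that share the same (date, id) key.
# Rows from df1 come first, so each pair is joined as lower case + upper case.
combined = (
    pd.concat([df1, df2])
      .groupby(['date', 'id'])['str']
      .agg(''.join)
      .reset_index()
)
print(combined)
```

Keys that appear in only one frame ('10' and '12') pass through unchanged, which matches the outer-join style result the question asks for.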

Using radd:
i = df1.set_index(['date', 'id'])
j = df2.set_index(['date', 'id'])
j['str'].radd(i['str'], fill_value='').reset_index()
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 10 e
5 8 12 Q
This should be pretty fast.

Python splitting list by whitespace

I have a text file with numbers in the following form: "12345679010111213"
I have written a script that reads the file and appends the values to a list, using a variable called "numbersoflist": list1.append(numbersoflist)
But when I call list1.split('') it still prints out the values as they appear in the text file, without whitespace. My goal is to have them look like "1 2 3 4 5 6 ..."

>>> s = '12345679010111213'
>>> list(s)
['1', '2', '3', '4', '5', '6', '7', '9', '0', '1', '0', '1', '1', '1', '2', '1', '3']
>>> ' '.join(list(s))
'1 2 3 4 5 6 7 9 0 1 0 1 1 1 2 1 3'
>>> ' '.join(s) # works since str is also an iterable
'1 2 3 4 5 6 7 9 0 1 0 1 1 1 2 1 3'
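
Applied to file contents, the same join can be sketched as follows. io.StringIO stands in for the real text file here, which is an assumption made to keep the example self-contained; with an actual file, open(...) would take its place.

```python
import io

# Stand-in for the text file from the question.
fin = io.StringIO('12345679010111213\n')

digits = fin.read().strip()  # drop the trailing newline
spaced = ' '.join(digits)    # a str iterates character by character
print(spaced)
```

Note that split belongs to str, not list, so the join is applied to the string that was read rather than to the list it was appended to.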

Read each entire Column of CSV file using python (preferably by help of pandas )

I have some data in Microsoft Excel that I save as a CSV file for ease of use. The data structure is like this:
MS Excel format:
L1
0 1 0 0 0 1 1
0 0 1 0 0 1 0
0 0 0 1 0 0 1
0 0 0 0 1 0 0
1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
CSV format
L1,,,,,,,,,,,,,,
0,1,0,0,0,1,1,
0,0,1,0,0,1,0,
0,0,0,1,0,0,1,
0,0,0,0,1,0,0,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
As you can see, only the first column has a label. I want to read the CSV file (or the Excel file, if that is easier), get each column, and do some bit manipulation operations on them. How can I achieve this? I have read a bit about pandas, but I can't find anything useful for fetching each column.

Given the .csv file temp.csv
L1x,,,,,,,
0,1,0,0,0,1,1,
0,0,1,0,0,1,0,
0,0,0,1,0,0,1,
0,0,0,0,1,0,0,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
read it in as follows:
import pandas
a = pandas.read_csv('temp.csv', names = ["c%d" % i for i in range(8)], skiprows = 1)
a
Output:
c0 c1 c2 c3 c4 c5 c6 c7
0 0 1 0 0 0 1 1 NaN
1 0 0 1 0 0 1 0 NaN
2 0 0 0 1 0 0 1 NaN
3 0 0 0 0 1 0 0 NaN
4 1 1 1 1 1 1 1 NaN
5 1 1 1 1 1 1 1 NaN
6 1 1 1 1 1 1 1 NaN
7 1 1 1 1 1 1 1 NaN
The 'NaN's in the last column come from the pesky trailing commas. The 8 in the range needs to match the number of columns. To access the columns in a use either
a.c3
or
a['c3']
both of which result in
0 0
1 0
2 1
3 0
4 1
5 1
6 1
7 1
Name: c3
The cool thing about pandas is that if you want to XOR two columns, you can do so very simply:
a.c0^a.c2
Output
0 0
1 1
2 0
3 0
4 0
5 0
6 0
7 0
Name: c0
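
The steps above can be put together in one runnable sketch. It reads the CSV text from an in-memory buffer instead of temp.csv and drops the all-NaN column left by the trailing commas; both choices are assumptions made for the example.

```python
import io
import pandas as pd

csv_text = """L1,,,,,,,,,,,,,,
0,1,0,0,0,1,1,
0,0,1,0,0,1,0,
0,0,0,1,0,0,1,
0,0,0,0,1,0,0,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
"""

# Eight names: seven data columns plus the empty one after the trailing comma.
names = ["c%d" % i for i in range(8)]
a = pd.read_csv(io.StringIO(csv_text), names=names, skiprows=1)

# Remove the column that contains nothing but NaN.
a = a.dropna(axis=1, how='all')

# Element-wise XOR of two integer columns.
xor = a['c0'] ^ a['c2']
print(xor.tolist())
```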

Assume you have a spreadsheet that you save as a CSV file that looks like so:
L1,,,
L2,0,10,20
L3,1,11,21
L4,2,12,22
L5,3,13,23
L6,4,14,24
L7,5,15,25
L8,6,16,26
L9,7,17,27
L10,8,18,28
To get any column, use the csv reader and transpose with zip:
import csv
# newline='' is what the csv module expects when opening files in Python 3
with open('test.csv', newline='') as fin:
    reader = csv.reader(fin)
    data = list(reader)
print('data:', data)
# data: [['L1', '', '', ''], ['L2', '0', '10', '20'], ['L3', '1', '11', '21'], ['L4', '2', '12', '22'], ['L5', '3', '13', '23'], ['L6', '4', '14', '24'], ['L7', '5', '15', '25'], ['L8', '6', '16', '26'], ['L9', '7', '17', '27'], ['L10', '8', '18', '28']]
Notice that data is a list of rows. You can transpose that list of lists using zip to get a list of columns:
# zip returns an iterator in Python 3, so wrap it in list() to allow indexing
trans = list(zip(*data))
print('trans:', trans)
# trans: [('L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10'), ('', '0', '1', '2', '3', '4', '5', '6', '7', '8'), ('', '10', '11', '12', '13', '14', '15', '16', '17', '18'), ('', '20', '21', '22', '23', '24', '25', '26', '27', '28')]
Then just index to get a specific column:
print(trans[0])
# ('L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10')
Of course if you want to do arithmetic on the cells, you will need to convert the string to ints or floats as appropriate.
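
That conversion step might look like the sketch below. It reads the same CSV text from an in-memory buffer rather than test.csv and skips the empty cell that comes from the label row; both are assumptions made for the example.

```python
import csv
import io

csv_text = """L1,,,
L2,0,10,20
L3,1,11,21
L4,2,12,22
L5,3,13,23
L6,4,14,24
L7,5,15,25
L8,6,16,26
L9,7,17,27
L10,8,18,28
"""

rows = list(csv.reader(io.StringIO(csv_text)))
columns = list(zip(*rows))  # transpose rows into columns

# Convert one column to integers, ignoring the blank cell in the first row.
col1 = [int(cell) for cell in columns[1] if cell != '']
print(col1)
print(sum(col1))
```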

import pandas as pd
pd.read_excel("foo.xls", "Sheet 1",
names=["c%d" % i for i in range(7)])
Output:
c0 c1 c2 c3 c4 c5 c6
0 0 1 0 0 0 1 1
1 0 0 1 0 0 1 0
2 0 0 0 1 0 0 1
3 0 0 0 0 1 0 0
4 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1

Sample code that returns a column as a list:
input = """L1,,,,,,,,,,,,,,
0,1,0,0,0,1,1,
0,0,1,0,0,1,0,
0,0,0,1,0,0,1,
0,0,0,0,1,0,0,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
"""
def getColumn(data, column_number):
    dump_array = []
    for line in data.splitlines():
        if not line:
            continue  # skip blank lines
        tmp_cell = line.split(",")
        # use the requested column instead of a hard-coded index
        dump_array.append(tmp_cell[column_number])
    return dump_array
# for example, get column 3
getColumn(input, 3)
This may give you an idea of how to manipulate your grid...
Note: I don't have an interpreter for testing the code right now, so sorry if there is a typo...
