Converting a multiline string to a dataframe - python

I have the following string:
Hoy
1
5
14
3
0
23
and I would like to turn it into a df.
I thought it would be a good idea to convert the string with list(string) and then call pd.DataFrame(list(string)). However, converting it to a list returns the following output:
['\n', 'H', 'o', 'y', '\n', '1', '\n', '5', '\n', '1', '4', '\n', '3', '\n', '0', '\n', '2', '3', '\n', '2', ',', '8', '3', '*', '\n']
Is there an alternative way to turn the initial string into a df like this?
0
0 Hoy
1 1
2 5
3 14
4 3
5 0
6 23

Use pd.read_csv, passing an IO buffer to it:
import pandas as pd
from io import StringIO
text = '''Hoy
1
5
14
3
0
23
'''
pd.read_csv(StringIO(text), header=None)
0
0 Hoy
1 1
2 5
3 14
4 3
5 0
6 23
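If going through a CSV parser feels like more than is needed, the string can also be split into lines directly. The following is only a minimal sketch, assuming the text has one value per line as in the question. Every value stays a string here, because nothing infers the types.

```python
import pandas as pd

text = '''Hoy
1
5
14
3
0
23
'''

# splitlines() breaks the text on newlines and does not produce
# an empty trailing element for the final newline
lines = text.splitlines()

# One list element per row, in a single column labelled 0
df = pd.DataFrame(lines)
print(df)
```

Calling list(text) iterates over the string character by character, which is why the attempt in the question produced single letters and '\n' entries.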

This should act as an argument for accepting @COLDSPEED's answer by observing how ugly this answer is.
txt = """Hoy
1
5
14
3
0
23"""
(lambda x: pd.Series(pd.to_numeric(x[1:], 'ignore'), name=x[0]))(
txt.split('\n')
).to_frame()
Hoy
0 1
1 5
2 14
3 3
4 0
5 23

Related

Counting list entries in specific columns in Pandas Dataframe

I have a Dataframe like this
1 2 ... 9
0 ['1'] [] ['9']
1 ['1'] ['2', '2', '2'] ['9', '9']
2 ['1', '1'] ['2', '2', '2'] []
3 ['1', '1', '1'] ['2'] []
I want to count the occurrences in each column so that the output would be like this
1 2 ... 9
0 1 0 1
1 1 3 2
2 2 3 0
3 3 1 0
This seems to work with the following code
df['1'].apply(lambda x: x.count('1'))
but how can I automate it for all my columns so I don't have to run the above code for each individual column?
In addition, I used
df['1'].apply(lambda x: x.count('1')).sum()
to count the total over the rows, which seems to give the right answer. Is there a better way, though?
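One possible way to cover every column at once is to map len over each column. This is only a sketch, assuming every cell holds a real Python list (not its string representation) and that each list contains only the value being counted, as in the sample. The frame below is rebuilt by hand from the three columns shown in the question.

```python
import pandas as pd

# Sample frame rebuilt from the question (only the columns that are shown)
df = pd.DataFrame({
    '1': [['1'], ['1'], ['1', '1'], ['1', '1', '1']],
    '2': [[], ['2', '2', '2'], ['2', '2', '2'], ['2']],
    '9': [['9'], ['9', '9'], [], []],
})

# Apply len to every cell, one column at a time
counts = df.apply(lambda col: col.map(len))

# Column totals, the equivalent of calling .sum() per column
totals = counts.sum()

print(counts)
print(totals)
```

If a list could contain mixed values, len would overcount, and counting a specific value with x.count(...) as in the question would be the safer choice.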

How to sort the values in dataframe?

I am trying to sort the values but am not getting the desired result. Can you please help me with this?
Example:
df = pd.read_csv("D:/Users/SPate233/Downloads/iMedical/sqoop/New folder/metadata_file_imedical.txt", delimiter='~')
#df.sort_values(by = ['dependency'], inplace = True)
df.sort_values('dependency', ascending=True, inplace=True)
print(list(df['dependency'].unique()))
Output:
['0', '1', '1,10,11,26,28,55', '1,26,28', '10', '11', '12', '17,42', '2', '26,28', '33', '42', '6']
Desired output:
['0', '1', '2', '6', '10', '11', '12', '33', '42', '17,42', '26,28', '1,26,28', '1,10,11,26,28,55']
Order by the length of the string, and then by its value:
df.assign(len = df.dependency.str.len()).sort_values(["len", "dependency"])
The output is (leaving the len column in for clarity):
dependency len
0 0 1
1 1 1
8 2 1
12 6 1
4 10 2
5 11 2
6 12 2
10 33 2
11 42 2
7 17,42 5
9 26,28 5
3 1,26,28 7
2 1,10,11,26,28,55 16
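Sorting by string length works for this data, but it compares text rather than numbers. As a hedged alternative, the sketch below sorts the unique values by how many numbers each entry holds and then by the numbers themselves. The helper name dependency_key is made up for illustration, and the list is copied from the output in the question.

```python
# Unique values as printed in the question
values = ['0', '1', '1,10,11,26,28,55', '1,26,28', '10', '11', '12',
          '17,42', '2', '26,28', '33', '42', '6']

def dependency_key(value):
    """Hypothetical sort key: (how many numbers, the numbers as ints)."""
    numbers = tuple(int(part) for part in value.split(','))
    return (len(numbers), numbers)

ordered = sorted(values, key=dependency_key)
print(ordered)
```

This assumes every entry is a comma-separated list of integers. An empty or non-numeric entry would raise a ValueError in int().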

Concatenate strings based on inner join

I have two DataFrames containing the same columns; an id, a date and a str:
df1 = pd.DataFrame({'id': ['1', '2', '3', '4', '10'],
'date': ['4', '5', '6', '7', '8'],
'str': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': ['1', '2', '3', '4', '12'],
'date': ['4', '5', '6', '7', '8'],
'str': ['A', 'B', 'C', 'D', 'Q']})
I would like to join these two datasets on the id and date columns, and create a resulting column that is the concatenation of str:
df3 = pd.DataFrame({'id': ['1', '2', '3', '4', '10', '12'],
'date': ['4', '5', '6', '7', '8', '8'],
'str': ['aA', 'bB', 'cC', 'dD', 'e', 'Q']})
I guess I can make an inner join and then concatenate the strings, but is there an easier way to achieve this?
IIUC, use concat + groupby:
pd.concat([df1,df2]).groupby(['date','id']).str.sum().reset_index()
Out[9]:
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 10 e
5 8 12 Q
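And if efficiency is a concern, use sum() based on the index levels (the level argument to sum() was removed in pandas 2.0; there, .groupby(level=[0,1]).sum() does the same job):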
pd.concat([df1,df2]).set_index(['date','id']).sum(level=[0,1]).reset_index()
Out[12]:
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 10 e
5 8 12 Q
Using radd:
i = df1.set_index(['date', 'id'])
j = df2.set_index(['date', 'id'])
j['str'].radd(i['str'], fill_value='').reset_index()
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 10 e
5 8 12 Q
This should be pretty fast.
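For comparison, the join-then-concatenate idea from the question can be sketched as below. Note that the expected df3 keeps ids 10 and 12, so this assumes an outer join rather than an inner one. The suffixes _1 and _2 are arbitrary names chosen for the sketch.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['1', '2', '3', '4', '10'],
                    'date': ['4', '5', '6', '7', '8'],
                    'str': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': ['1', '2', '3', '4', '12'],
                    'date': ['4', '5', '6', '7', '8'],
                    'str': ['A', 'B', 'C', 'D', 'Q']})

# Outer join keeps rows that exist in only one of the frames
merged = df1.merge(df2, on=['id', 'date'], how='outer', suffixes=('_1', '_2'))

# Missing strings become '' so that concatenation still works
merged['str'] = merged['str_1'].fillna('') + merged['str_2'].fillna('')

result = merged[['id', 'date', 'str']]
print(result)
```

The row order after an outer merge can differ from the order in df3, so sort the result explicitly if the order matters.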

Python splitting list by whitespace

I have a text file with numbers in the following manner: "12345679010111213"
I have written a script that reads the file and appends the values to a list, using a variable called "numbersoflist": list1.append(numbersoflist)
But when I call list1.split('') it still prints out the values as they appear in the text file, without whitespace. My goal is to have them look like "1 2 3 4 5 6 ..."
>>> s = '12345679010111213'
>>> list(s)
['1', '2', '3', '4', '5', '6', '7', '9', '0', '1', '0', '1', '1', '1', '2', '1', '3']
>>> ' '.join(list(s))
'1 2 3 4 5 6 7 9 0 1 0 1 1 1 2 1 3'
>>> ' '.join(s) # works since str is also an iterable
'1 2 3 4 5 6 7 9 0 1 0 1 1 1 2 1 3'
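A list has no split method, which is probably why the call in the question did not behave as hoped. As a minimal sketch, assuming list1 ends up holding chunks of digits read from the file (the two chunks below are made up for illustration), the pieces can be joined into one string first and then spaced out:

```python
# Hypothetical contents: what list1 might hold after reading the file
list1 = ['12345679', '010111213']

# Join the chunks into one string, then put a space between every character
digits = ''.join(list1)
spaced = ' '.join(digits)
print(spaced)
```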

Read each entire Column of CSV file using python (preferably by help of pandas )

I have some data in Microsoft Excel that I save as a CSV file for ease of use. The data structure is like this:
MS Excel format:
L1
0 1 0 0 0 1 1
0 0 1 0 0 1 0
0 0 0 1 0 0 1
0 0 0 0 1 0 0
1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
CSV format
L1,,,,,,,,,,,,,,
0,1,0,0,0,1,1,
0,0,1,0,0,1,0,
0,0,0,1,0,0,1,
0,0,0,0,1,0,0,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
As you can see, only the first column has a label. Now I want to read the CSV file (or, if it's easier, the Excel file) to get each column and do some bit manipulation operations on them. How can I achieve this? I have read something about pandas, but I can't find anything useful for fetching each column.
Given the .csv file temp.csv
L1x,,,,,,,
0,1,0,0,0,1,1,
0,0,1,0,0,1,0,
0,0,0,1,0,0,1,
0,0,0,0,1,0,0,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
read it in as follows:
import pandas
a = pandas.read_csv('temp.csv', names = ["c%d" % i for i in range(8)], skiprows = 1)
a
Output:
c0 c1 c2 c3 c4 c5 c6 c7
0 0 1 0 0 0 1 1 NaN
1 0 0 1 0 0 1 0 NaN
2 0 0 0 1 0 0 1 NaN
3 0 0 0 0 1 0 0 NaN
4 1 1 1 1 1 1 1 NaN
5 1 1 1 1 1 1 1 NaN
6 1 1 1 1 1 1 1 NaN
7 1 1 1 1 1 1 1 NaN
The 'NaN's in the last column come from the pesky trailing commas. The 8 in the range needs to match the number of columns. To access the columns in a use either
a.c3
or
a['c3']
both of which result in
0 0
1 0
2 1
3 0
4 1
5 1
6 1
7 1
Name: c3
The cool thing about pandas is that if you want to XOR two columns, you can do so very simply:
a.c0^a.c2
Output
0 0
1 1
2 0
3 0
4 0
5 0
6 0
7 0
Name: c0
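If counting the columns by hand is inconvenient, one option is to let pandas number them and then drop the all-empty column left by the trailing commas. This is only a sketch, assuming the file contents match the sample in the question. It reads from an in-memory string so that it runs without a file on disk.

```python
import io
import pandas as pd

csv_text = """L1,,,,,,,,,,,,,,
0,1,0,0,0,1,1,
0,0,1,0,0,1,0,
0,0,0,1,0,0,1,
0,0,0,0,1,0,0,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
"""

# Skip the label row and let pandas label the columns 0, 1, 2, ...
table = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=1)

# The trailing comma on each row yields one column that is entirely NaN
table = table.dropna(axis=1, how='all')

# Bitwise XOR of the first and third columns
xor_result = table[0] ^ table[2]
print(xor_result)
```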
Assume I have a spreadsheet which I can save into a CSV file that looks like so:
L1,,,
L2,0,10,20
L3,1,11,21
L4,2,12,22
L5,3,13,23
L6,4,14,24
L7,5,15,25
L8,6,16,26
L9,7,17,27
L10,8,18,28
To get any column, use csv.reader and transpose with zip:
import csv
# newline='' is the mode the csv module expects for files in Python 3
with open('test.csv', newline='') as fin:
    reader = csv.reader(fin)
    data = list(reader)
print('data:', data)
# data: [['L1', '', '', ''], ['L2', '0', '10', '20'], ['L3', '1', '11', '21'], ['L4', '2', '12', '22'], ['L5', '3', '13', '23'], ['L6', '4', '14', '24'], ['L7', '5', '15', '25'], ['L8', '6', '16', '26'], ['L9', '7', '17', '27'], ['L10', '8', '18', '28']]
Notice that data is a list of rows. You can transpose that list of lists using zip to get a list of columns:
# zip returns an iterator in Python 3, so wrap it in list() to index it
trans = list(zip(*data))
print('trans:', trans)
# trans: [('L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10'), ('', '0', '1', '2', '3', '4', '5', '6', '7', '8'), ('', '10', '11', '12', '13', '14', '15', '16', '17', '18'), ('', '20', '21', '22', '23', '24', '25', '26', '27', '28')]
Then just index to get a specific column:
print(trans[0])
# ('L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10')
Of course if you want to do arithmetic on the cells, you will need to convert the string to ints or floats as appropriate.
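The conversion step mentioned above could look roughly like this. It is a sketch that reads the sample CSV from an in-memory string instead of test.csv, and it assumes that empty cells should simply be skipped.

```python
import csv
import io

csv_text = """L1,,,
L2,0,10,20
L3,1,11,21
L4,2,12,22
L5,3,13,23
L6,4,14,24
L7,5,15,25
L8,6,16,26
L9,7,17,27
L10,8,18,28
"""

rows = list(csv.reader(io.StringIO(csv_text)))

# Transpose the rows into columns
columns = list(zip(*rows))

# Convert the second column to ints, skipping the empty cell from the L1 row
numbers = [int(cell) for cell in columns[1] if cell != '']
print(numbers)
```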
import pandas as pd
pd.read_excel("foo.xls", "Sheet 1",
names=["c%d" % i for i in range(7)])
Output:
c0 c1 c2 c3 c4 c5 c6
0 0 1 0 0 0 1 1
1 0 0 1 0 0 1 0
2 0 0 0 1 0 0 1
3 0 0 0 0 1 0 0
4 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1
Sample code that returns a column as a list:
input = """L1,,,,,,,,,,,,,,
0,1,0,0,0,1,1,
0,0,1,0,0,1,0,
0,0,0,1,0,0,1,
0,0,0,0,1,0,0,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
"""
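def getColumn(data, column_number):
    dump_array = []
    lines = data.split("\n")
    for line in lines:
        if not line:
            continue  # skip blank lines, such as the one after the final newline
        tmp_cell = line.split(",")
        dump_array.append(tmp_cell[column_number])
    return dump_array
# for example, get column 3
getColumn(input, 3)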
This may give you an idea of how to manipulate your grid...
Note: I don't have an interpreter for testing the code right now, so sorry if there is a typo...
