Search pandas series for value and split series at that value - python

Python 3.3.3
Pandas 0.12.0
I have a single-column .csv file with hundreds of float values separated by an arbitrary string (the string contains letters and, edit: will vary from run to run). I'm a pandas beginner, hoping to find a way to load that .csv file and split the float values into two columns at the level of that string.
I'm so stuck on the first part (searching for the string) that I haven't yet been able to work on the second, which I thought would be much easier.
So far, I've been trying to use raw = pandas.read_csv('myfile.csv', squeeze=True), then something like raw.str.findall('[a-z]'), but I'm not having much luck. I'd really appreciate it if someone could lend a hand. I'm planning to use this process on a number of similar .csv files, so I'd hope to find a fairly automated way of performing the task.
Example input.csv:
123.4932
239.348
912.098098989
49391.1093
....
This is a fake string that splits the data.
....
1323.4942
2445.34223
914432.4
495391.1093090
Desired eventual DataFrame:
Column A       Column B
123.4932       1323.4942
239.348        2445.34223
912.098098989  914432.4
49391.1093     495391.1093090
...            ...
Thanks again if you can point me in the right direction.
EDIT (2013-11-23): Thank you for the responses thus far. I've updated the question to reflect that the splitting string will not remain constant, hence my attempt to find a solution employing a regex, raw.str.findall('[a-z]'), instead of using .contains.
My solution at this point is to just read the .csv file, split it on newlines, accumulate runs of consecutive float values into lists, and load those into pandas.
import pandas as pd

raw = open('myfile.csv', 'r').read().split('\n')
df = pd.DataFrame()
keeper = []
counter = 0
# Iterate through the rows. Consecutive rows that can be made into
# floats are accumulated; any row that can't is treated as a separator.
for row in raw:
    try:
        keeper.append(float(row))
    except ValueError:
        if keeper:
            df = pd.concat([df, pd.DataFrame(keeper, columns=[counter])], axis=1)
            counter += 1
        keeper = []
# Get the last column, assuming the file hasn't ended on a line
# that will trigger the exception in the above loop.
if keeper:
    df = pd.concat([df, pd.DataFrame(keeper, columns=[counter])], axis=1)
df.describe()
Thank you for any further suggestions.
EDIT 2 (2018-07-29): One other possible solution, using itertools.groupby:
import itertools
import re

import pandas as pd
txt = """123.4932
239.348
912.098098989
49391.1093
This is a fake string that splits the data.
1323.4942
2445.34223
914432.4
495391.1093090
fake again
31323.4942
42445.34223
2914432.4
5495391.1093090
23423432""".splitlines()
groups = itertools.groupby(
    txt,
    key=lambda x: not re.match(r'^[\d.]+$', x)
)
df = pd.concat(
    (pd.Series(list(g)) for k, g in groups if not k),
    axis=1
)
print(df)
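As an aside (my caveat, not part of the original edit), the r'^[\d.]+$' key only recognizes plain unsigned decimals; if the numeric lines could be negative or in scientific notation, a float-conversion key is more robust. A sketch, reusing txt and itertools from the snippet above:

def is_separator(line):
    # treat any line that doesn't parse as a float as a separator
    try:
        float(line)
        return False
    except ValueError:
        return True

groups = itertools.groupby(txt, key=is_separator)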

Use numpy.split():
import io
import numpy as np
import pandas as pd
txt = """123.4932
239.348
912.098098989
49391.1093
This is a fake string that splits the data.
1323.4942
2445.34223
914432.4
495391.1093090
fake again
31323.4942
42445.34223
2914432.4
5495391.1093090
23423432"""
s = pd.read_csv(io.StringIO(txt), header=None, squeeze=True)
mask = s.str.contains("fake")
pos = np.where(mask)[0]
# shift each split point left by the number of separator rows already dropped
pos -= np.arange(len(pos))
arrs = [chunk.reset_index(drop=True) for chunk in np.split(s[~mask], pos)]
pd.concat(arrs, axis=1, ignore_index=True).astype(float)
output:
   0              1               2
0  123.4932       1323.4942       31323.4942
1  239.348        2445.34223      42445.34223
2  912.098098989  914432.4        2914432.4
3  49391.1093     495391.1093090  5495391.1093090
4  NaN            NaN             23423432
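To see why the pos adjustment works, here is a tiny self-contained illustration (hypothetical numbers, not from the original answer): dropping the separator rows shifts every later split point left by the number of separators already seen.

import numpy as np

pos = np.array([4, 9])                    # separator rows in the full column
pos -= np.arange(len(pos))                # -> [4, 8] once separators are gone
data = np.delete(np.arange(14), [4, 9])   # stand-in for s[~mask]
print([len(c) for c in np.split(data, pos)])  # -> [4, 4, 4]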

If you know you only have two columns, then you could do something like
>>> ser = pd.read_csv("colsplit.csv", header=None, squeeze=True)
>>> split_at = ser.str.contains("fake string that splits").idxmax()
>>> parts = [ser[:split_at], ser[split_at+1:]]
>>> parts = [part.reset_index(drop=True) for part in parts]
>>> df = pd.concat(parts, axis=1)
>>> df.columns = ["Column A", "Column B"]
>>> df
  Column A       Column B
0 123.4932       ....
1 239.348        1323.4942
2 912.098098989  2445.34223
3 49391.1093     914432.4
4 ....           495391.1093090
5 NaN            extra test element
If you have an arbitrary number of places to split at, then you can use a boolean Series/shift/cumsum/groupby pattern (sketched below), but if you can get away without it, so much the better.
(PS: I'm sure there's a better way than idxmax, but for the life of me I can't remember the idiom to find the first True right now. With the boolean mask in hand, mask[mask].index[0] would do it, but I'm not sure that's much better.)
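For the arbitrary-split case, a minimal sketch of that pattern (my assumptions: ser is the squeezed string Series from above, and the separator lines are the only non-numeric rows; shift turns out to be unnecessary once the separator rows themselves are dropped):

import pandas as pd

nums = pd.to_numeric(ser, errors='coerce')   # NaN on every separator line
is_sep = nums.isna()
group_id = is_sep.cumsum()                   # increments at each separator
parts = [g.reset_index(drop=True)
         for _, g in nums[~is_sep].groupby(group_id[~is_sep])]
df = pd.concat(parts, axis=1, ignore_index=True)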

Related

I want to use the outputs as data and sum them

import numpy as np
import pandas as pd
df = pd.read_csv('test_python.csv')
print(df.groupby('fifth').sum())
This is my data:
I am summing the first three columns for every word in fifth.
The result is this, and it is correct:
The next thing I want to do is take those results and sum them together, for example:
buy = 6
cheese = 8
file = 12
...
word = 13
How can I do this? How can I use the results?
Also, I now want to use the column second as a new column named second2, with the results as data. How can I do that?
For summing you can use apply with a lambda:
df = pd.DataFrame({"first": [1] * 14,
                   "second": np.arange(1, 15),
                   "third": [0] * 14,
                   "forth": ["one", "two", "three", "four"] * 3 + ["one", "two"],
                   "fifth": ["hello", "no", "hello", "hi", "buy", "hello", "cheese",
                             "water", "hi", "juice", "file", "word", "hi", "red"]})
df1 = df.groupby(['fifth'])[['first', 'second', 'third']].agg('sum').reset_index()
df1["sum_3_Col"] = df1.apply(lambda x: x["first"] + x["second"] + x["third"], axis=1)
df1.rename(columns={'second': 'second2'}, inplace=True)
Output of df1:
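As a hedged aside (same df1 as above), the apply line can be replaced with a vectorized sum, which is usually faster on large frames:

# equivalent to the apply-lambda line above
df1["sum_3_Col"] = df1[["first", "second", "third"]].sum(axis=1)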

Calculating most frequently occurring row-specific combinations in a dataframe in Python

I have a dataframe that contains text separated by commas:
1 a,b,c,d
2 a,b,e,f
3 a,b,e,f
I am trying to produce output that prints the top 2 most common combinations of 2 letters, plus the number of occurrences across the entire dataframe. Based on the above dataframe, the output would be
(a,b,3) (e,f,2)
The combination of a and b occurs 3 times, and the combination of e and f occurs 2 times. (Yes, there are more combos that occur 2 times, but we can cut it off here to keep it simple.) I am really stumped on how even to start this. I was thinking of maybe looping through each row and somehow storing all combinations, and at the end printing out the top n combinations and how many times they occurred in the dataframe.
Below is what I have so far, reflecting what I have in mind.
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep=";")
for index, row in df.iterrows():
    # (somehow get and store all possible 2-letter combos?)
    pass
You can do it this way:
import numpy as np
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep=";")
df['Date'] = df['Date'].apply(lambda x: x.split(','))
# adjacent letter pairs within each row
df['combinations'] = df['Date'].apply(lambda x: [(x[i], x[i + 1]) for i in range(len(x) - 1)])
df = df.explode('combinations')
df = df.groupby('combinations').agg('count').reset_index()
df.sort_values('Date', inplace=True, ascending=False)
df['combinations'] = df.values.tolist()
df.drop('Date', axis=1, inplace=True)
df['combinations'] = df['combinations'].apply(np.hstack)
print(df.iloc[:2, :])
Output:
combinations
0 [a, b, 3]
2 [b, e, 2]
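Note that the snippet above counts adjacent pairs only, which is why (b, e) shows up instead of (e, f). If you want all unordered 2-letter combinations per row, as the desired (a,b,3) (e,f,2) output suggests, a hedged sketch using itertools.combinations and collections.Counter:

import itertools
from collections import Counter
from io import StringIO
import pandas as pd

StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep=";")

counts = Counter()
for letters in df['Date'].str.split(','):
    counts.update(itertools.combinations(letters, 2))
# top 2 pairs; ties among the 2-count pairs are broken by first appearance
print(counts.most_common(2))  # e.g. [(('a', 'b'), 3), (('a', 'e'), 2)]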

How to extract duplicate values in each column separately?

I want to extract only the values with two or more occurrences in each column separately, and write them to a separate file with column headers.
Example file (the actual csv file is 1.5 GB; this is a summary of it). The first row is the header row:
AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3
I have tried to write code in R and even Python pandas but failed to get the result.
Expected outcome:
AO1   BO1   CO1   DO1   EO1   FO1
pep2  xcv4  iop3  typ3  ert3  rtf5
pep2  xcv4  iop3  typ3  ert3  rtf5
pep2  xcv4        typ3        rtf5
                  wer3        rtf5
                  wer3        rtf5
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3"""))
d = {}
for col in df.columns:
    counts = df[col].value_counts()
    # values that occur at least twice in this column
    repeated_values = counts[counts >= 2].index.tolist()
    cond = df[col].isin(repeated_values)
    d[col] = df[cond][col]
final = pd.concat(d, axis=1)
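Since the question also asks to write the result to a separate file with column headers, a minimal follow-up (the filename is hypothetical):

# headers come from the original columns; rows where a column
# has no repeated value are left empty
final.to_csv('repeated_values.csv', index=False)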
df <- data.table::fread('AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3'
, data.table = FALSE)
lapply(df, function (x) x[duplicated(x) | duplicated(x, fromLast = T)])
You could write a csv directly in the lapply call as well

python subtract every even column from previous odd column

Sorry if this has been asked before -- I couldn't find this specific question.
In python, I'd like to subtract every even column from the previous odd column:
so go from:
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113
to
101.849 110.349 68.513
109.95 110.912 61.274
100.612 110.05 62.15
107.75 118.687 59.712
There will be an unknown number of columns. Should I use something in pandas or numpy?
Thanks in advance.
You can accomplish this using pandas. You can select the even- and odd-indexed columns separately and then subtract them.
#hiro protagonist, I didn't know you could do that StringIO magic. That's spicy.
import pandas as pd
import io
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
df = pd.read_csv(data, sep=r'\s+')
Note that the even/odd terms may be counterintuitive because python is 0-indexed, meaning that the signal columns are actually even-indexed and the background columns odd-indexed. If I understand your question properly, this is contrary to your use of the even/odd terminology. Just pointing out the difference to avoid confusion.
# strip the columns into their appropriate signal or background groups
bg_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 1]]
signal_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 0]]
# subtract the values of the data frames and store the results in a new data frame
result_df = pd.DataFrame(signal_df.values - bg_df.values)
result_df contains columns which are the difference between the signal and background columns. You will probably want to rename these columns, though.
>>> result_df
         0        1       2
0  101.849  110.349  68.513
1  109.950  110.912  61.274
2  100.612  110.050  62.150
3  107.750  118.687  59.712
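As a hedged aside (same df as above), step slicing expresses the same even/odd selection more compactly than the index-list comprehensions:

# even-indexed columns minus odd-indexed columns, positionally
result_df = pd.DataFrame(df.iloc[:, ::2].values - df.iloc[:, 1::2].values)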
import io
# faking the data file
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
header = next(data)  # read the first line from data
# print(header[:-1])
for line in data:
    # print(line)
    floats = [float(val) for val in line.split()]  # create a list of floats
    for prev, cur in zip(floats[::2], floats[1::2]):
        print('{:6.3f}'.format(prev - cur), end=' ')
    print()
with output:
101.849 110.349 68.513
109.950 110.912 61.274
100.612 110.050 62.150
107.750 118.687 59.712
If you know what data[start:stop:step] means and how zip works, this should be easy to understand.

Splitting regex response column in Python

I am receiving an object array after applying re.findall for links and hashtags on Tweets data. My data looks like:
b=['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
Now I want to split it into columns. I am using the following:
df = pd.DataFrame(b.str.split(',',1).tolist(),columns = ['flips','row'])
But it is not working, because of the weird datatype I guess. I tried a few other solutions as well; nothing worked. This is what I am expecting, two separate columns:
https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
https://t.co/CJZWjaBfJU
https://t.co/4GMhoXhBQO https://t.co/0V
https://t.co/Erutsftlnq
https://t.co/86VvLJEzvG
It's not clear from your question what exactly is part of your data. (Does it include the square brackets and single quotes?) In any case, the pandas read_csv function is very versatile and can handle ragged data:
from io import StringIO
import pandas as pd
raw_data = """
['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
"""
# You'll probably replace the StringIO part with the filename of your data.
df = pd.read_csv(StringIO(raw_data), header=None, names=('flips', 'row'))
# Get rid of the square brackets, single quotes, and any stray spaces
for col in ('flips', 'row'):
    df[col] = df[col].str.strip("[]' ")
df
Output:
flips row
0 https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
1 https://t.co/CJZWjaBfJU NaN
2 https://t.co/4GMhoXhBQO https://t.co/0V
3 https://t.co/Erutsftlnq NaN
4 https://t.co/86VvLJEzvG https://t.co/zCYv5WcFDS
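If b is actually still in memory as a list of lists (as the snippets in the question suggest) rather than in a file, a hedged alternative is to build the frame directly; pandas pads short rows with None:

import pandas as pd

# hypothetical: b as a list of lists of URLs, one list per tweet
b = [['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q'],
     ['https://t.co/CJZWjaBfJU']]
df = pd.DataFrame(b, columns=['flips', 'row'])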
