Splitting regex response column in Python

I am receiving an object array after applying re.findall for links and hashtags on Tweets data. My data looks like this:
b=['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
Now I want to split it into columns. I am using the following:
df = pd.DataFrame(b.str.split(',',1).tolist(),columns = ['flips','row'])
But it is not working, I guess because of the weird datatype. I tried a few other solutions as well; nothing worked. This is what I am expecting, two separate columns:
https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
https://t.co/CJZWjaBfJU
https://t.co/4GMhoXhBQO https://t.co/0V
https://t.co/Erutsftlnq
https://t.co/86VvLJEzvG

It's not clear from your question what exactly is part of your data (does it include the square brackets and single quotes?). In any case, the pandas read_csv function is very versatile and can handle ragged data:
import io
import pandas as pd
raw_data = """
['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
"""
# You'll probably replace the StringIO part with the filename of your data.
df = pd.read_csv(io.StringIO(raw_data), header=None, names=('flips','row'))
# Get rid of the square brackets, single quotes and stray spaces
for col in ('flips', 'row'):
    df[col] = df[col].str.strip(" []'")

df
Output:
flips row
0 https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
1 https://t.co/CJZWjaBfJU NaN
2 https://t.co/4GMhoXhBQO https://t.co/0V
3 https://t.co/Erutsftlnq NaN
4 https://t.co/86VvLJEzvG https://t.co/zCYv5WcFDS
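If b is actually a plain Python list of lists (one list of URLs per tweet, which is what calling re.findall per tweet would produce), there is no .str accessor to call; you can build the frame directly. A minimal sketch under that assumption:
import pandas as pd
# b as a list of lists, one inner list of URLs per tweet
b = [['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q'],
     ['https://t.co/CJZWjaBfJU'],
     ['https://t.co/4GMhoXhBQO', 'https://t.co/0V']]
df = pd.DataFrame(b)              # shorter rows are padded with None
df.columns = ['flips', 'row']     # valid as long as no tweet has more than two links
print(df)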

Related

Python Pandas Dataframe from API JSON Response

I am new to Python. Can I please seek some help from the experts here?
I wish to construct a dataframe from the https://api.cryptowat.ch/markets/summaries JSON response, based on the following filter criteria:
Kraken-listed currency pairs (please note there are kraken-futures; I don't want those)
Currencies paired with USD only, i.e. aaveusd, adausd, ...
The ideal DataFrame I am looking for is shown below (somehow Excel loads this JSON perfectly; see the screenshot):
Dataframe_Excel_Screenshot
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
kraken_assets = resp.json()
df = pd.json_normalize(kraken_assets)
print(df)
Output:
result.binance-us:aaveusd.price.last result.binance-us:aaveusd.price.high ...
0 264.48 267.32 ...
[1 rows x 62688 columns]
When I just paste the link in a browser the JSON response comes with double quotes ("), but when I get it via Python code all double quotes (") are changed to single quotes ('). Any idea why? Though I tried to solve it with json_normalize, the response then becomes [1 rows x 62688 columns]. I am not sure how to even go about working with 1 row and 62k columns, and I don't know how to extract the exact info in the DataFrame format I need (please see the Excel screenshot).
Any help is much appreciated. Thank you!
The result JSON is a dict. The approach:
- load this into a dataframe
- decode the columns into products and measures
- filter down to the required data
import requests
import pandas as pd
import numpy as np
# load results into a data frame
df = pd.json_normalize(requests.get("https://api.cryptowat.ch/markets/summaries").json()["result"])
# columns are encoded as product and measure. decode columns and transpose into rows that include product and measure
cols = np.array([c.split(".", 1) for c in df.columns]).T
df.columns = pd.MultiIndex.from_arrays(cols, names=["product","measure"])
df = df.T
# finally filter down to required data and structure measures as columns
df.loc[df.index.get_level_values("product").str[:7]=="kraken:"].unstack("measure").droplevel(0,1)
sample output:
product          price.last  price.high   price.low  price.change.percentage  price.change.absolute       volume  volumeQuote
kraken:aaveaud       347.41      347.41      338.14                0.0274147                   9.27      1.77707      613.281
kraken:aavebtc     0.008154    0.008289    0.007874                0.0219326               0.000175      403.506       3.2797
kraken:aaveeth       0.1327      0.1346      0.1327              -0.00673653                -0.0009      287.113      38.3549
kraken:aaveeur       219.87      226.46      209.07                0.0331751                   7.06      1202.65       259205
kraken:aavegbp       191.55      191.55      179.43                 0.030559                   5.68      6.74476      1238.35
kraken:aaveusd       259.53      267.48      246.64                0.0339841                   8.53      3623.66       929624
kraken:adaaud       1.61792     1.64602       1.563                0.0211692                0.03354      5183.61      8366.21
kraken:adabtc     3.757e-05   3.776e-05   3.673e-05                0.0110334                4.1e-07       252403      9.41614
kraken:adaeth     0.0006108     0.00063   0.0006069               -0.0175326              -1.09e-05       590839      367.706
kraken:adaeur       1.01188     1.03087    0.977345                0.0209986               0.020811  1.99104e+06  1.98693e+06
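The question also asks to keep only the pairs quoted in USD. A minimal follow-up sketch that repeats the steps above and then filters; the endswith("usd") check is my assumption about the pair naming and may need tightening if other quote currencies also end in "usd":
import numpy as np
import pandas as pd
import requests
# rebuild the transposed, MultiIndexed frame as in the answer above
df = pd.json_normalize(requests.get("https://api.cryptowat.ch/markets/summaries").json()["result"])
cols = np.array([c.split(".", 1) for c in df.columns]).T
df.columns = pd.MultiIndex.from_arrays(cols, names=["product", "measure"])
df = df.T
# restrict to Kraken spot pairs and unstack measures into columns
kraken = (
    df.loc[df.index.get_level_values("product").str[:7] == "kraken:"]
      .unstack("measure")
      .droplevel(0, 1)
)
# keep only pairs quoted in USD, e.g. kraken:aaveusd, kraken:adausd
usd_pairs = kraken[kraken.index.str.endswith("usd")]
print(usd_pairs)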
Hello, try the code below. I have understood the structure of the dataset and modified it to get the desired output.
import requests
import pandas as pd

resp = requests.get("https://api.cryptowat.ch/markets/summaries")
a = resp.json()

# creating a DataFrame from key='result'
da = pd.DataFrame(a['result'])

# using transpose to get the required columns and index
da = da.transpose()

# the 'price' column contains a dict, which needs to become separate columns in the data frame
db = da['price'].to_dict()
da.drop('price', axis=1, inplace=True)

# initialising a separate data frame for price
z = pd.DataFrame({})
for i in db.keys():
    i = pd.DataFrame(db[i], index=[i])
    z = pd.concat([z, i], axis=0)

da = pd.concat([z, da], axis=1)
da.to_excel('nex.xlsx')

Regex problem: TypeError: expected string or bytes-like object

I am trying the code:
s='{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'
data_raw=re.split(r'[\{\}]',s)
data_raw=data_raw[1::2]
data=pd.DataFrame(data_raw)
data[0]=str(data[0])
data['r_id']=data[0].apply(lambda x:re.search(r'(r_id)',data[0]))
data['level']=data[0].apply(lambda x:re.search(r'(level)',data[0]))
print(data)
I wish I could get the result:
r_id level
1312 307
1111 NAN
But it shows the error: expected string or bytes-like object.
So how can I use re.search with pandas, or how else can I get this result?
My two cents...
import re
pattern = re.compile(r'^.*?id\":\"(\d+)\",\"level\":(\d+).*id\":\"(\d+).*$')
string = r'{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'
data = pattern.findall(string)
data
Which returns an array:
[('1312', '307', '1111')]
And you can access items with, for example:
data[0][2]
Regex demo: https://regex101.com/r/Inv4gp/1
The below works for me. The type problem arises because you cannot change the type of all the rows like that; you would need a lambda function for that too.
There is an additional problem: the regex and the exception-case handling won't work like that. I propose a solution for this, but you might want to consider a different regex if you want this to work for other columns.
I'm a novice with regex, so there might be a more general-purpose solution for your problem.
import re
import pandas as pd
import numpy as np
s='{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'
data_raw=re.split(r'[\{\}]',s)
data_raw=data_raw[1::2]
data=pd.DataFrame(data_raw)
# This is a regex wrapper which pulls the requested column's value out of a row of our pandas dataframe.
def regex_wrapper(row, column):
    match = re.search(r'"' + column + r'":"?(\d+)"?', str(row))
    if match:
        return match.group(1)
    else:
        return np.nan

data['r_id'] = data[0].apply(lambda row: regex_wrapper(row, "r_id"))
data['level'] = data[0].apply(lambda row: regex_wrapper(row, "level"))
del data[0]
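Since the captured groups come back as strings (or NaN when nothing matches), you may want to convert them to numbers afterwards. A minimal self-contained sketch, using the values from the question's expected output:
import numpy as np
import pandas as pd
# toy frame in the shape the wrapper above produces: extracted values are strings or NaN
data = pd.DataFrame({'r_id': ['1312', '1111'], 'level': ['307', np.nan]})
data['r_id'] = data['r_id'].astype(float)
data['level'] = data['level'].astype(float)   # float, because NaN cannot be stored in an integer column
print(data)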

Drop 0 values, NaN values, and empty strings

import pandas as pd
import csv
import numpy as np
readfile = pd.read_csv('50.csv')
filevalues= readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(r'^\s*$', np.nan, regex=True)
filevalues = filevalues.fillna(int(0))
int_series = filevalues.astype(int)
calculated_series = int_series.apply(lambda x: x*(1/1.2))
print(calculated_series)
So I have hundreds of CSV files with many empty spots for values. Some of the blank spaces are detected as NaNs and others as empty strings. This has forced me to create my code the way it is right now; the reason is that I need to apply a formula to each value, so I changed all such NaNs and empty strings to 0 so that I am able to apply any formula (in this example, 1/1.2). The problem is that I do not want to see values that are 0, NaN, or empty strings when printing my DataFrame.
I have tried to use the following:
filevalues = filevalues.dropna()
But because certain CSV files have empty strings, this method does not fully work and I get the error:
ValueError: invalid literal for int() with base 10: ' '
I have also tried the following after converting all values to 0:
filevalues = filevalues.loc[:, (filevalues != 0).all(axis=0)]
and
mask = np.any(np.isnan(filevalues) | np.equal(a, 0), axis=1)
Every method seems to be giving different errors. Is there a clean way to not count these types of values when I am printing my pandas dataframe? Please let me know if an example csv file is needed.
Got it to work! Here is the answer if it is of use to anyone.
import pandas as pd
import csv
import numpy as np
readfile = pd.read_csv('50.csv')
filevalues= readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(" ", "", regex=True)
filevalues.replace("", np.nan, inplace=True) # replace empty string with np.nan
filevalues.dropna(inplace=True) # drop nan values
int_series = filevalues.astype(int) # change type
calculated_series = int_series.apply(lambda x: x*(1/1.2))
print(calculated_series)
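A shorter route that avoids the manual whitespace handling is to coerce everything to numbers first. A hedged sketch, assuming the same file and column names as above; pd.to_numeric with errors='coerce' turns blanks, whitespace and other junk into NaN:
import pandas as pd
readfile = pd.read_csv('50.csv')
filevalues = readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
numeric = pd.to_numeric(filevalues, errors='coerce')   # blanks and empty strings become NaN
cleaned = numeric.dropna()                             # drop the NaN values
cleaned = cleaned[cleaned != 0]                        # drop the zero values
print(cleaned * (1 / 1.2))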

Python: subtract every even column from the previous odd column

Sorry if this has been asked before -- I couldn't find this specific question.
In python, I'd like to subtract every even column from the previous odd column:
so go from:
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113
to
101.849 110.349 68.513
109.95 110.912 61.274
100.612 110.05 62.15
107.75 118.687 59.712
There will be an unknown number of columns. Should I use something in pandas or numpy?
Thanks in advance.
You can accomplish this using pandas. You can select the even- and odd-indexed columns separately and then subtract them.
#hiro protagonist, I didn't know you could do that StringIO magic. That's spicy.
import pandas as pd
import io
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
df = pd.read_csv(data, sep='\s+')
Note that the even/odd terms may be counterintuitive because Python is 0-indexed, meaning that the signal columns are actually even-indexed and the background columns odd-indexed. If I understand your question properly, this is contrary to your use of the even/odd terminology. Just pointing out the difference to avoid confusion.
# strip the columns into their appropriate signal or background groups
bg_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 1]]
signal_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 0]]
# subtract the values of the data frames and store the results in a new data frame
result_df = pd.DataFrame(signal_df.values - bg_df.values)
result_df contains columns which are the difference between the signal and background columns. You probably want to rename these column names, though.
>>> result_df
0 1 2
0 101.849 110.349 68.513
1 109.950 110.912 61.274
2 100.612 110.050 62.150
3 107.750 118.687 59.712
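As noted, the result columns are just 0, 1, 2. One way to carry the names over, reusing signal_df and result_df from the snippet above:
# carry the signal column names (ROI121, ROI124, ROI126) over to the result
result_df.columns = signal_df.columns
print(result_df)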
import io
# faking the data file
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
header = next(data) # read the first line from data
# print(header[:-1])
for line in data:
    # print(line)
    floats = [float(val) for val in line.split()]  # create a list of floats
    for prev, cur in zip(floats[::2], floats[1::2]):
        print('{:6.3f}'.format(prev - cur), end=' ')
    print()
with output:
101.849 110.349 68.513
109.950 110.912 61.274
100.612 110.050 62.150
107.750 118.687 59.712
If you know what data[start:stop:step] means and how zip works, this should be easy to understand.
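For reference, a tiny self-contained illustration of the slicing and zip pattern used in the loop above, with values taken from the first data row:
floats = [292.087, 190.238, 299.837, 189.488]
print(floats[::2])     # every other value starting at index 0 -> [292.087, 299.837]
print(floats[1::2])    # every other value starting at index 1 -> [190.238, 189.488]
# zip pairs them up so each background value is subtracted from the signal value before it
print([prev - cur for prev, cur in zip(floats[::2], floats[1::2])])
# approximately [101.849, 110.349]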

Search pandas series for value and split series at that value

Python 3.3.3
Pandas 0.12.0
I have a single-column .csv file with hundreds of float values separated by an arbitrary string (the string contains letters; edit: it will vary run to run). I'm a pandas beginner, hoping to find a way to load that .csv file and split the float values into two columns at the level of that string.
I'm so stuck at the first part (searching for the string) that I haven't yet been able to work on the second, which I thought should be much easier.
So far, I've been trying to use raw = pandas.read_csv('myfile.csv', squeeze=True), then something like raw.str.findall('[a-z]'), but I'm not having much luck. I'd really appreciate if someone could lend a hand. I'm planning to use this process on a number of similar .csv files, so I'd hope to find a fairly automated way of performing the task.
Example input.csv:
123.4932
239.348
912.098098989
49391.1093
....
This is a fake string that splits the data.
....
1323.4942
2445.34223
914432.4
495391.1093090
Desired eventual DataFrame:
Column A Column B
123.4932 1323.4942
239.348 2445.34223
912.098098989 914432.4
49391.1093 495391.1093090
... ...
Thanks again if you can point me in the right direction.
EDIT (20131123): Thank you for the responses thus far. Updated to reflect that the splitting string will not remain constant, hence my statement that I'd been trying to find a solution employing a regex, raw.str.findall('[a-z]'), instead of using .contains.
My solution at this point is to just read the .csv file and split with re, accumulate into lists, and load those into pandas.
import pandas as pd
import re
raw = open('myfile.csv', 'r').read().split('\n')
df = pd.DataFrame()
keeper = []
counter = 0
# Iterate through the rows. Consecutive rows that can be made into float are accumulated.
for row in raw:
    try:
        keeper.append(float(row))
    except:
        if keeper:
            df = pd.concat([df, pd.DataFrame(keeper, columns=[counter])], axis=1)
            counter += 1
            keeper = []

# Get the last column, assuming the file hasn't ended on a line
# that will trigger the exception in the above loop.
if keeper:
    df = pd.concat([df, pd.DataFrame(keeper, columns=[counter])], axis=1)

df.describe()
Thank you for any further suggestions.
EDIT2 (20180729): One other possible solution using itertools.groupby:
import io
import itertools
import re
import numpy as np
import pandas as pd
txt = """123.4932
239.348
912.098098989
49391.1093
This is a fake string that splits the data.
1323.4942
2445.34223
914432.4
495391.1093090
fake again
31323.4942
42445.34223
2914432.4
5495391.1093090
23423432""".splitlines()
groups = itertools.groupby(
    txt,
    key=lambda x: not re.match('^[\d.]+$', x)
)
df = pd.concat(
    (pd.Series(list(g)) for k, g in groups if not k),
    axis=1
)
print(df)
use numpy.split():
import io
import numpy as np
import pandas as pd
txt = """123.4932
239.348
912.098098989
49391.1093
This is a fake string that splits the data.
1323.4942
2445.34223
914432.4
495391.1093090
fake again
31323.4942
42445.34223
2914432.4
5495391.1093090
23423432"""
s = pd.read_csv(io.StringIO(txt), header=None, squeeze=True)
mask = s.str.contains("fake")
pos = np.where(mask)[0]
pos -= np.arange(len(pos))
arrs = [s.reset_index(drop=True) for s in np.split(s[~mask], pos)]
pd.concat(arrs, axis=1, ignore_index=True).astype(float)
output:
0 1 2
0 123.4932 1323.4942 31323.4942
1 239.348 2445.34223 42445.34223
2 912.098098989 914432.4 2914432.4
3 49391.1093 495391.1093090 5495391.1093090
4 NaN NaN 23423432
If you know you only have two columns, then you could do something like
>>> ser = pd.read_csv("colsplit.csv", header=None, squeeze=True)
>>> split_at = ser.str.contains("fake string that splits").idxmax()
>>> parts = [ser[:split_at], ser[split_at+1:]]
>>> parts = [part.reset_index(drop=True) for part in parts]
>>> df = pd.concat(parts, axis=1)
>>> df.columns = ["Column A", "Column B"]
>>> df
Column A Column B
0 123.4932 ....
1 239.348 1323.4942
2 912.098098989 2445.34223
3 49391.1093 914432.4
4 .... 495391.1093090
5 NaN extra test element
If you have an arbitrary number of places to split at, then you can use a boolean Series/shift/cumsum/groupby pattern, but if you can get away without it, so much the better.
(PS: I'm sure there's a better way than idxmax, but for the life of me I can't remember the idiom to find the first True right now. split_at[split_at].index[0] would do it, but I'm not sure that's much better.)
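For the arbitrary-number-of-splits case mentioned above, here is a minimal sketch of the boolean-mask/cumsum/groupby idea. It assumes, as in the question, that every separator line contains at least one letter, and it reuses the colsplit.csv filename from the answer above:
import pandas as pd
ser = pd.read_csv("colsplit.csv", header=None, squeeze=True).astype(str)
# mark separator lines (anything containing a letter), then number the chunks between them
is_sep = ser.str.contains('[A-Za-z]', regex=True)
chunk_id = is_sep.cumsum()
parts = []
for _, chunk in ser.groupby(chunk_id):
    values = chunk[~is_sep[chunk.index]]        # drop the separator line from this chunk
    if len(values):
        parts.append(values.reset_index(drop=True).astype(float))
df = pd.concat(parts, axis=1, ignore_index=True)
print(df)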
