Dataset selective picking and transformation - python

I have a dataset in .xlsx format with hundreds of thousands of rows, as follows:
slug      symbol  name      date        ranknow  open      high      low       close     volume      market      close_ratio  spread
companyA  AAA     companyA  28/04/2013  1        135,3     135,98    132,1     134,21    0           1500520000  0,5438       3,88
companyA  AAA     companyA  29/04/2013  1        134,44    147,49    134       144,54    0           1491160000  0,7813       13,49
companyA  AAA     companyA  30/04/2013  1        144       146,93    134,05    139       0           1597780000  0,3843       12,88
....
companyA  AAA     companyA  17/04/2018  1        8071,66   8285,96   7881,72   7902,09   6900880000  1,3707E+11  0,0504       404,24
....
lancer    LA      Lancer    09/01/2018  731      0,347111  0,422736  0,345451  0,422736  3536710     0           1            0,08
lancer    LA      Lancer    10/01/2018  731      0,435794  0,512958  0,331123  0,487106  2586980     0           0,8578       0,18
lancer    LA      Lancer    11/01/2018  731      0,479738  0,499482  0,309485  0,331977  950410      0           0,1184       0,19
....
lancer    LA      Lancer    17/04/2018  731      0,027279  0,041106  0,02558   0,031017  9936        1927680     0,3502       0,02
....
yocomin   YC      Yocomin   21/01/2016  732      0,008135  0,010833  0,002853  0,002876  63          139008      0,0029       0,01
yocomin   YC      Yocomin   22/01/2016  732      0,002872  0,008174  0,001192  0,005737  69          49086       0,651        0,01
yocomin   YC      Yocomin   23/01/2016  732      0,005737  0,005918  0,001357  0,00136   67          98050       0,0007       0
....
yocomin   YC      Yocomin   17/04/2018  732      0,020425  0,021194  0,017635  0,01764   12862       2291610     0,0014       0
....
....
Let's say I have a .txt file with the list of symbols whose time series I want to extract. For example:
AAA
LA
YC
I would like to get a dataset that looks as follows:
date        AAA      LA        YC
28/04/2013  134,21   NaN       NaN
29/04/2013  144,54   NaN       NaN
30/04/2013  139      NaN       NaN
....
17/04/2018  7902,09  0,031017  0,01764
where under each stock symbol (like AAA, etc.) I get the "close" price. I'm open to both Python and R. Any help would be great!

In Python, using pandas, this should work:
import pandas as pd

df = pd.read_excel("/path/to/file/Book1.xlsx")
# keep only the columns we need
df = df.loc[:, ['symbol', 'name', 'date', 'close']]
# pivot the (symbol, name) levels into columns of close prices
df = df.set_index(['symbol', 'name', 'date'])
df = df.unstack(level=[0, 1])
df = df['close']
To read the symbols file and then filter out symbols not in the dataframe:
symbols = pd.read_csv('/path/to/file/symbols.txt', sep=" ", header=None)
symbols = symbols[0].tolist()
symbols = pd.Index(symbols).unique()
symbols = symbols.intersection(df.columns.get_level_values(0))
And the output will look like:
print(df[symbols])
symbol                    AAA        LA       YC
name                 companyA    Lancer  Yocomin
date
2018-09-01 00:00:00      None  0,422736     None
2018-10-01 00:00:00      None  0,487106     None
2018-11-01 00:00:00      None  0,331977     None
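For reference, pivot_table can do the same reshaping in one step. A minimal sketch under the same paths and column names as above (note that, unlike unstack, pivot_table silently averages duplicate (date, symbol) pairs by default):
import pandas as pd

df = pd.read_excel("/path/to/file/Book1.xlsx")
symbols = pd.read_csv("/path/to/file/symbols.txt", header=None)[0].unique()
# one 'close' column per symbol, indexed by date; missing dates become NaN
wide = df.pivot_table(index='date', columns='symbol', values='close')
# keep only the requested symbols that actually occur in the data
wide = wide[[s for s in symbols if s in wide.columns]]
print(wide)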

Related

How to transform multiple columns of days into week columns

I would like to know how I can transform the day columns into week columns.
I tried groupby.sum() but there is no column-name pattern, so I don't know what to group by.
The result should have column names like 'weekX': week1 (sum of the first 7 days), week2, week3, and so on.
Thanks in advance.
You can try:
# integer division maps every run of 7 date columns to the same group number
idx = pd.RangeIndex(len(df.columns[4:])) // 7
# sum each group of 7 columns and label the groups Week1, Week2, ...
out = df.iloc[:, 4:].groupby(idx, axis=1).sum().rename(columns=lambda x: f'Week{x+1}')
out = pd.concat([df.iloc[:, :4], out], axis=1)
print(out)
# Output
    Province/State         Country/Region        Lat  ...  Week26  Week27  Week28
0              NaN            Afghanistan   33.93911  ...  247210  252460  219855
1              NaN                Albania    41.1533  ...   28068   32671   32113
2              NaN                Algeria    28.0339  ...  157675  187224  183841
3              NaN                Andorra    42.5063  ...    6147    6283    5552
4              NaN                 Angola   -11.2027  ...    4741    6341    6978
..             ...                    ...        ...  ...     ...     ...     ...
261            NaN  Sao Tome and Principe    0.18636  ...    5199    5813    5231
262            NaN                  Yemen  15.552727  ...   11089   11717   10363
263            NaN                Comoros   -11.6455  ...    2310    2419    2292
264            NaN             Tajikistan     38.861  ...   47822   50032   44579
265            NaN                Lesotho     -29.61  ...    2259    3011    3922
[266 rows x 32 columns]
You can use the melt method to combine all your date columns into a single 'Date' column:
df = df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='Date', value_name='Value')
From this point it should be straightforward to group the 'Date' column by week, and then unstack it if you want the weeks as columns again; see the sketch below.
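For example, a minimal sketch of that second step, assuming the date columns parse with pd.to_datetime (note this bins by calendar week, so the edges may differ slightly from the fixed 7-column blocks above):
df_long = df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
                  var_name='Date', value_name='Value')
df_long['Date'] = pd.to_datetime(df_long['Date'])
# sum per country and calendar week, then spread the weeks back into columns
weekly = (df_long
          .groupby(['Country/Region', pd.Grouper(key='Date', freq='W')])['Value']
          .sum()
          .unstack('Date'))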

How to use plotly graph_objects scatter with date

Starting from a known public data set, which I copied to my own server. The data set is here: https://www.kaggle.com/imdevskp/corona-virus-report/download
import pandas as pd

# df = pd.read_csv("http://g0mesp1res.dynip.sapo.pt/covid_19_clean_complete.csv", index_col=4, parse_dates=True)
df = pd.read_csv("http://g0mesp1res.dynip.sapo.pt/covid_19_clean_complete.csv")
df = df.drop(columns=['Province', 'Lat', 'Long'])
df['Date'] = pd.to_datetime(df['Date'])
list_countries = ['Portugal', 'Brazil', 'Spain', 'Italy', 'Korea, South', 'Japan']
df = df[df['Country'].isin(list_countries)]
df_pt = df[df.Country == 'Portugal']
df_es = df[df.Country == 'Spain']
df_it = df[df.Country == 'Italy']
print(df_pt.head())
print(df_pt.tail())
I get what I expected:
       Country       Date  Confirmed  Deaths  Recovered
59    Portugal 2020-01-22          0       0          0
345   Portugal 2020-01-23          0       0          0
631   Portugal 2020-01-24          0       0          0
917   Portugal 2020-01-25          0       0          0
1203  Portugal 2020-01-26          0       0          0
        Country       Date  Confirmed  Deaths  Recovered
15503  Portugal 2020-03-16        331       0          3
15789  Portugal 2020-03-17        448       1          3
16075  Portugal 2020-03-18        448       2          3
16361  Portugal 2020-03-19        785       3          3
16647  Portugal 2020-03-20       1020       6          5
However, when plotting, it seems that all data is in January!
import plotly.graph_objects as go
fig = go.Figure( go.Scatter(x=df.Date, y=df_pt.Confirmed, name='Portugal'))
fig.show()
[plotly output graph: all the points appear bunched together in January]
What is missing?
Change the x axis from x=df.Date to x=df_pt.Date. df still contains the rows for every selected country, so plotly pairs df_pt.Confirmed with the first dates of the whole frame, which all fall in January:
import plotly.graph_objects as go

fig = go.Figure(go.Scatter(x=df_pt.Date,
                           y=df_pt.Confirmed,
                           name='Portugal'))
fig.show()
and you get the expected plot, with the confirmed cases spread along the date axis.
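As a side note, since df already holds several countries, you can add one trace per country instead of slicing each one by hand; a small sketch using the same column names:
import plotly.graph_objects as go

fig = go.Figure()
for country, frame in df.groupby('Country'):
    # one line per country, each paired with its own dates
    fig.add_trace(go.Scatter(x=frame['Date'], y=frame['Confirmed'], name=country))
fig.show()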

problem merging list and dataframe in Python

I have CSV files that I want to merge with a list of structs (a class) I made.
In the CSV I have a field 'sector' and other fields with information about that sector.
The list's element type is a class I made with fields: name, x, y, where x, y is the location that belongs to that name.
This is how I defined the list (I generated it from a CSV file as well, in which each antenna appears many times with different parameters, so I extracted only the ones I need):
# ant_file is the CSV with all the antennas, ant_list_name is the list with
# only antenna names and ant_list_tot is the list with the name and also the
# x, y fields
for rowA in range(size_ant_file):
    rec = ant_file.iloc[rowA]['name']
    if rec not in ant_list_name:
        ant_list_name.append(rec)
        A = Antenna(ant_file.iloc[rowA]['name'], ant_file.iloc[rowA]['x'],
                    ant_file.iloc[rowA]['y'])
        ant_list_tot.append(A)
print(antenna_list)
[Antenna(name='UWE33', x=34.9, y=31.9), Antenna(name='UTN00', x=34.8, y=32.1),
 Antenna(name='UWE02', x=34.8, y=32.1)]
I tried to do it with a double for loop:
@dataclass
class Antenna:
    name: str
    x: float
    y: float
# records is the csv file and antenna_list is the list of type Antenna
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
The resulting CSV file is partly right: at the end there are rows where all fields are correct except x and y, which are 0, and some rows that have x and y values but lack the information from the original fields.
It seems like there is a big shift of rows, but I can't understand why.
I checked that there are no missing values.
example:
records.csv at the beginning (date, hour and user_id are random numbers and not important):
sector  date      hour   user_id  x  y
abc     1.1.19    20:00  123      0  0
dfs     5.8.17    12:40  876      0  0
ngh     6.9.19    08:12  962      0  0
yjt     10.10.16  17:18  492      0  0
abc     6.8.16    22:10  985      0  0
dfs     7.1.15    19:15  542      0  0
antenna_list in the form of (name, x, y) (also here, x and y are random numbers for now and not important):
antenna_list[0] = (abc,12,16)
antenna_list[1] = (dfs,6,20)
antenna_list[2] = (ngh,13,98)
antenna_list[3] = (yjt,18,41)
the result I want to see is:
sector  date      hour   user_id  x   y
abc     1.1.19    20:00  123      12  16
dfs     5.8.17    12:40  876      6   20
ngh     6.9.19    08:12  962      13  98
yjt     10.10.16  17:18  492      18  41
abc     6.8.16    22:10  985      12  16
dfs     7.1.15    19:15  542      6   20
but the real result is:
sector  date      hour   user_id  x   y
abc     1.1.19    20:00  123      12  16
dfs     5.8.17    12:40  876      6   20
ngh     6.9.19    08:12  962      0   0
yjt     10.10.16  17:18  492      0   0
abc     6.8.16    22:10  985      0   0
dfs     7.1.15    19:15  542      0   0
                                  13  98
                                  18  41
                                  12  16
                                  6   20
TIA
If you save antenna_list as two dicts,
antenna_dict_x = {'abc':12, 'dfs':6, 'ngh':13, 'yjt':18}
antenna_dict_y = {'abc':16, 'dfs':20, 'ngh':98, 'yjt':41}
then creating the two columns is an easy map:
data['x'] = data['sector'].map(antenna_dict_x)
data['y'] = data['sector'].map(antenna_dict_y)
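If you already have the list of Antenna objects from the question, the two dicts can be built with comprehensions:
antenna_dict_x = {a.name: a.x for a in antenna_list}
antenna_dict_y = {a.name: a.y for a in antenna_list}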
So if you do:
import pandas as pd

class Antenna():
    def __init__(self, name, x, y):
        self.name = name
        self.x = x
        self.y = y

antenna_list = [Antenna('abc', 12, 16), Antenna('dfs', 6, 20), Antenna('ngh', 13, 98), Antenna('yjt', 18, 41)]
records = pd.read_csv('something.csv')
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
print(records)
you get:
  sector      date   hour  user_id   x   y
0    abc    1.1.19  20:00      123  12  16
1    dfs    5.8.17  12:40      876   6  20
2    ngh    6.9.19   8:12      962  13  98
3    yjt  10.10.16  17:18      492  18  41
4    abc    6.8.16  22:10      985  12  16
5    dfs    7.1.15  19:15      542   6  20
Which is what you were expecting. Also, if you do:
import pandas as pd
from dataclasses import dataclass

@dataclass
class Antenna:
    name: str
    x: float
    y: float

antenna_list = [Antenna('abc', 12, 16), Antenna('dfs', 6, 20), Antenna('ngh', 13, 98), Antenna('yjt', 18, 41)]
records = pd.read_csv('something.csv')
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
print(records)
print(records)
you get:
  sector      date   hour  user_id   x   y
0    abc    1.1.19  20:00      123  12  16
1    dfs    5.8.17  12:40      876   6  20
2    ngh    6.9.19   8:12      962  13  98
3    yjt  10.10.16  17:18      492  18  41
4    abc    6.8.16  22:10      985  12  16
5    dfs    7.1.15  19:15      542   6  20
Which is, again, what you were expecting. You did not post how you created the antenna list, but I assume that is where your error is. One other thing to check: the loop mixes positional lookup (records.iloc[index]) with label-based assignment (records.at[index, ...]); if records does not have a default 0..n-1 RangeIndex (for example, after filtering rows), the writes land on the wrong rows, which would produce exactly the shift you describe.

Blank quotes usage in dataframe

I am trying to combine OR (|) with df.loc to extract data, but the code I have written extracts everything in the csv file. Here is the original csv file: https://drive.google.com/open?id=16eo29mF0pn_qNw-BGpZyVM9PBxv2aN1G
import pandas as pd

df = pd.read_csv("yelp_business.csv")
df = df.loc[(df['categories'].str.contains('chinese', case=False)) |
            (df['name'].str.contains('subway', case=False)) |
            (df['categories'].str.contains('', case=False)) |
            (df['address'].str.contains('', case=False))]
print df
It looks like the blank quotes '' are not working in str.contains, or the OR | doesn't work in df.loc. Instead of returning just the rows with Chinese restaurants (4,171 of them) and the rows with the restaurant name Subway, it returns all 174,568 rows.
EDITED
The output I want is all the rows of category chinese and all the rows of name subway, taking into consideration that the address might have no assigned value or be null.
import pandas as pd

df = pd.read_csv("yelp_business.csv")
cusine = 'chinese'
name = 'subway'
address  # address has no assigned value or is NULL
df = df.loc[(df['categories'].str.contains(cusine, case=False)) |
            (df['name'].str.contains(name, case=False)) |
            (df['address'].str.contains(address, case=False))]
print df
This code gives me the error NameError: name 'address' is not defined.
It is possible to chain conditions with | for the categories column; to find literal empty quotes, use the regex ^""$, which matches a value that starts and ends with a quote character and has nothing in between. (Your original code returned everything because str.contains('') matches every row: every string contains the empty substring.)
df = pd.read_csv("yelp_business.csv")
df1 = df.loc[(df['categories'].str.contains('chinese|^""$', case=False)) |
             (df['name'].str.contains('subway', case=False)) |
             (df['address'].str.contains('^""$', case=False))]
print (len(df1))
11320
print (df1.head())
               business_id                     name neighborhood  \
9   TGWhGNusxyMaA4kQVBNeew  "Detailing Gone Mobile"          NaN
53  4srfPk1s8nlm1YusyDUbjg                 "Subway"    Southeast
57  spDZkD6cp0JUUm6ghIWHzA              "Kitchen M"   Unionville
63  r6Jw8oRCeumxu7Y1WRxT7A           "D&D Cleaning"          NaN
88  YhV93k9uiMdr3FlV4FHjwA        "Caviness Studio"          NaN
                          address       city state postal_code   latitude  \
9                              ""  Henderson    NV       89014  36.055825
53  "6889 S Eastern Ave, Ste 101"  Las Vegas    NV       89119  36.064652
57            "8515 McCowan Road"    Markham    ON     L3P 5E5  43.867918
63                             ""     Urbana    IL       61802  40.110588
88                             ""    Phoenix    AZ       85001  33.449967
     longitude  stars  review_count  is_open  \
9  -115.046350    5.0             7        1
53 -115.118954    2.5             6        1
57  -79.283687    3.0            80        1
63  -88.207270    5.0             4        0
88 -112.070223    5.0             4        1
                                    categories
9                    Automotive;Auto Detailing
53            Fast Food;Restaurants;Sandwiches
57                         Restaurants;Chinese
63  Home Cleaning;Home Services;Window Washing
88  Marketing;Men's Clothing;Restaurants;Graphic D...
EDIT: If you need to filter out empty and NaN values (note the parentheses around the two | conditions, so the ~ filter applies to both branches; & binds tighter than |):
df2 = df.loc[((df['categories'].str.contains('chinese', case=False)) |
              (df['name'].str.contains('subway', case=False))) &
             ~((df['address'] == '""') | (df['categories'] == '""'))]
print (df2.head())
                business_id              name     neighborhood  \
53   4srfPk1s8nlm1YusyDUbjg          "Subway"        Southeast
57   spDZkD6cp0JUUm6ghIWHzA       "Kitchen M"       Unionville
96   dTWfATVrBfKj7Vdn0qWVWg  "Flavor Cuisine"      Scarborough
126  WUiDaFQRZ8wKYGLvmjFjAw    "China Buffet"  University City
145  vzx1WdVivFsaN4QYrez2rw          "Subway"              NaN
                                 address       city state postal_code  \
53         "6889 S Eastern Ave, Ste 101"  Las Vegas    NV       89119
57                   "8515 McCowan Road"    Markham    ON     L3P 5E5
96                "8 Glen Watford Drive"    Toronto    ON     M1S 2C1
126  "8630 University Executive Park Dr"  Charlotte    NC       28262
145                    "5111 Boulder Hwy"  Las Vegas    NV       89122
      latitude   longitude  stars  review_count  is_open  \
53   36.064652 -115.118954    2.5             6        1
57   43.867918  -79.283687    3.0            80        1
96   43.787061  -79.276166    3.0             6        1
126  35.306173  -80.752672    3.5            76        1
145  36.112895 -115.062353    3.0             3        1
                                 categories
53         Fast Food;Restaurants;Sandwiches
57                      Restaurants;Chinese
96           Restaurants;Chinese;Food Court
126  Buffets;Restaurants;Sushi Bars;Chinese
145        Sandwiches;Restaurants;Fast Food
Find detailed information about contains at:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html
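As an alternative to the regex, here is a sketch that treats missing addresses explicitly; na=False makes str.contains return False for NaN instead of propagating it:
mask = (df['categories'].str.contains('chinese', case=False, na=False) |
        df['name'].str.contains('subway', case=False, na=False))
empty_address = df['address'].isna() | df['address'].eq('""')
df3 = df.loc[mask & ~empty_address]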

Combine MultiIndex columns to a single index in a pandas dataframe

With my code I merge 2 databases into 1. The problem is that when I add one more column to my databases, the result is not as expected. I use Python 2.7.
code:
import pandas as pd
import pandas.io.formats.excel
import numpy as np

# Read both files and load them into DataFrames
df1 = pd.read_excel("archivo1.xlsx")
df2 = pd.read_excel("archivo2.xlsx")
df = (pd.concat([df1, df2])
        .set_index(["Cliente", 'Fecha'])
        .stack()
        .unstack(1)
        .sort_index(ascending=(True, False)))
m = df.index.get_level_values(1) == 'Impresiones'
df.index = np.where(m, 'Impresiones', df.index.get_level_values(0))
# Create the output xlsx
pandas.io.formats.excel.header_style = None
with pd.ExcelWriter("Data.xlsx",
                    engine='xlsxwriter',
                    date_format='dd/mm/yyyy',
                    datetime_format='dd/mm/yyyy') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
archivo1:
Fecha     Cliente  Impresiones  Impresiones 2  Revenue
20/12/17  Jose     1312         35             $12
20/12/17  Martin   12           56             $146
20/12/17  Pedro    5443         124            $1,256
20/12/17  Esteban  667          1235           $1
archivo2:
Fecha     Cliente  Impresiones  Impresiones 2  Revenue
21/12/17  Jose     25           5              $2
21/12/17  Martin   6347         523            $123
21/12/17  Pedro    2368         898            $22
21/12/17  Esteban  235          99             $7,890
Expected result (posted as an image in the original question): one block per Cliente, with the client name as a header row above its Revenue, Impresiones 2 and Impresiones rows, and one column per Fecha.
I tried m1 = df.index.get_level_values(1) == 'Impresiones 2' followed by df.index = np.where(m1, 'Impresiones 2', df.index.get_level_values(0)), but I get this error: IndexError: Too many levels: Index has only 1 level, not 2.
The first bit of the solution is similar to jezrael's answer to your previous question, using concat + set_index + stack + unstack + sort_index.
df = pd.concat([df1, df2])\
       .set_index(['Cliente', 'Fecha'])\
       .stack()\
       .unstack(-2)\
       .sort_index(ascending=[True, False])
Now comes the challenging part: we have to incorporate the names in the 0th level into the 1st level, and then reset the index.
I use np.insert to insert each name above its 'Revenue' entry in the index.
i, j = df.index.get_level_values(0), df.index.get_level_values(1)
k = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
Now, I create a new MultiIndex which I then use to reindex df -
idx = pd.MultiIndex.from_arrays([i.unique().repeat(len(df.index.levels[1]) + 1), k])
df = df.reindex(idx).fillna('')
Now, drop the extra level -
df.index = df.index.droplevel()
df
Fecha          20/12/17  21/12/17
Esteban
Revenue              $1    $7,890
Impresiones 2      1235        99
Impresiones         667       235
Jose
Revenue             $12        $2
Impresiones 2        35         5
Impresiones        1312        25
Martin
Revenue            $146      $123
Impresiones 2        56       523
Impresiones          12      6347
Pedro
Revenue          $1,256       $22
Impresiones 2       124       898
Impresiones        5443      2368
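For reference, the same layout can also be assembled without the index surgery, by writing one block per client under a header row; a sketch assuming df is the concatenated/stacked frame from the first step above:
blocks = []
for cliente, block in df.groupby(level=0, sort=True):
    # one empty row carrying the client's name as its index label
    header = pd.DataFrame([[''] * df.shape[1]], index=[cliente], columns=df.columns)
    blocks.append(pd.concat([header, block.droplevel(0)]))
out = pd.concat(blocks)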
