Getting fundatmental data
I have tried to obtain cash flow and balance sheet data from tsetmc.com. It was not as straightforward as I thought. Here is what I encountered:
- Easy parts:
- Each stock has a unique number at the at the end of address
- Mutual funds and delisted and trade-able stocks have different structure than others
- Numbers are in English and not Persian
- Balance sheets and Cash flow data have separate pages that has no unique addresses and need to be clicked on
- Troublesome parts:
- Different years balance sheets and cash flow data have no special “id” or “name” or anything else that would make it easy to separate them
- data for different years are just separable by text part
- With R I could not use regex on the text and separate different fields automatically
- It seems that the site uses some Arabic characters instead of Persian for some sounds like “ii”
- there is no unique pattern for balance sheet or cash flow data, it seemed to me that they could be distinguished only by using regex
By seeing these I reached to the conclusion that since regex with partly Persian character and partly Arabic is not possible in R- by my knowledge- so I would use just the data that have unique path. I used RSelenium for web scrapping but since I would not click on the pages or fill forms, RSelenium is not necessary. The some of the data that have unique path included: Book value of outstanding stocks, free float rate, MCap, EPS, P/E and groups P/E. The last one as I said before and considering the kind of group separation that is used is almost useless yet I get it in order to see whether it has any value that I am neglecting. These data seemed to me to be more than enough for splitting the stocks by small vs big criteria.
The URL for each stock is like this: “http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=42354736493447489” which has two part the part before equal sign is common part for different stocks and the number after that is unique to each stock.
I used Docker for Selenium server on Firefox.
tsetmc_export<-"http://www.tsetmc.com/loader.aspx?ParTree=151311&i="
sym.id<- "42354736493447489"
link_dn<- paste(tsetmc_export, sym.id, sep= "")
Doc_name<- "Pars.Khodro"
remDr <- remoteDriver(port = 4445L)
remDr$open()
remDr$navigate(link_dn)
Sys.sleep(4)
remDr$screenshot(display = TRUE)
is.mutual.fund<- tryCatch( remDr$findElement( using = 'xpath',
".//*[@id='d11']/div")),
error = function(e){ TRUE})
if( is.mutual.fund != TRUE){
wxbox <- remDr$findElement(using = 'xpath',
".//*[@id='d11']/div")
MCAP<- wxbox$getElementAttribute("title")[[1]]
} else { temp <- NULL}
is.delisted<- tryCatch( remDr$findElement( using = 'xpath',
".//*[@id='TopBox']/div[1]/div[4]/table/tbody/tr[1]/td[2]/div"),
error=function(e){ TRUE})
if( is.delisted != TRUE ){
wxbox <- remDr$findElement(using = 'xpath',
".//*[@id='TopBox']/div[1]/div[4]/table/tbody/tr[1]/td[2]/div")
BookValue<- wxbox$getElementAttribute("title")[[1]]
wxbox <- remDr$findElement(using = 'xpath',
".//*[@id='TopBox']/div[1]/div[4]/table/tbody/tr[3]/td[2]")
freefloat<-wxbox$getElementText()[[1]]
wxbox <- remDr$findElement(using = 'xpath',
".//*[@id='TopBox']/div[1]/div[6]/table/tbody/tr[1]/td[2]")
EPS<- wxbox$getElementText()[[1]]
wxbox <- remDr$findElement(using = 'xpath',
".//*[@id='TopBox']/div[1]/div[6]/table/tbody/tr[2]/td[4]")
G.PE<- wxbox$getElementText()[[1]]
wxbox <- remDr$findElement(using = 'xpath',
".//*[@id='d12']")
PE<- wxbox$getElementText()[[1]]
temp<- cbind.data.frame(sym = Doc_name, MCAP = MCAP,
BookValue = BookValue, freefloat = freefloat,
EPS = EPS, PE = PE, G.PE = G.PE)
} else{ temp<- NULL}
The first if statement checks whether the id is for a mutual fund or not and the second one checks whether the stock is delisted or not.
Conclusion
By making a list of the ids for URL I made a data frame of some fundamental data. Using that in next post I would consider momentum strategy. Please let me know if you have a find a good way for regex in R or any other language that could be used for getting data from tsetmc.com, I would deeply appreciate it.