R programming for quantitative investment: part 1 - fetch market data from web [originally posted in #crptocurrency]

in #programming8 years ago (edited)

[Originally posted in #cryptocurrency; I'm now posting in #programming after being recommended]

Fetch crypto price chart data from web and analyze data with Excel, R and other programs


In this post we'll learn to use R to "read" web data and, as an illustrative application, use the data to plot a simple pie chart. I have been writing about a secure trading platform first before writing about trading strategies themselves. Instead of writing an intro-to-R type of post, I actually wanted to start with an application so the usefulness for the language is better realized. R users might find the post trivial; we'll soon write about advanced forecasting strategies with R's Finance and Time Series Analysis packages and other advanced techniques. Most important, we need data to conduct any kind of analysis. Here we'll learn to collect web data. For beginners, well, I guess to be able to plot - just with a few key strokes - with, say, R pietop20.R command, in couple of seconds - the below pie chart of percent of market share of top 20 (or any number) currencies should be enough motivation to learn R (or web scraping with other languages). I'm new to R - I will write as I learn. Suggestions and recommendations are much appreciated.


Figure: Market shares of top 20 cryptocurrencies

It looks cool, right? We can plot not only cool but also useful plots. If you don't understand the code now, just get along and get the concept. We'll talk about specific commands in other posts. I will leave links to resources I found helpful. If you don't want to install R, use SageMath online R console. So let's do it...

We'll "read" CryptoCurrency Market Capitalizations (CMC) price chart from their website and then analyze the data:

Figure: CryptoCurrency Market Capitalizations (CMC) price chart

First load R packages, rvest, plotrix and dplyr. url_cmc is CMC price chart URL. In the code, texts entered after # are comments and ignored by R.

R> url_cmc <- "https://coinmarketcap.com/currencies/views/all/"
R> library(rvest) # for data fetching
R> library(plotrix) # for 3Dplot
R> library(dplyr)

Now do the actual reading:

R> url_cmc %>%
   read_html() %>%
   html_nodes(css = "table") %>%
   html_table() %>%
   as.data.frame() -> "tbl_cmc" # save table to variable <tbl_cmc>

The above code snippet "reads" crypto price chart and stores it to the variable named tbl_cmc.

Let's see what our program fetched by issuing head(tbl_cmc) command:

R> head(tbl_cmc)
  X.             Name Symbol      Market.Cap     Price                                                                    Circulating.Supply   Volume..24h.  X..1h X..24h   X..7d
1  1          Bitcoin    BTC $42,533,685,929  $2593.64                                                                            16,399,225 $1,122,030,000 -0.10%  0.76% -10.30%
2  2         Ethereum    ETH $34,637,433,018   $374.00                                                                            92,613,953   $688,324,000 -0.24% -1.57%  -1.84%
3  3           Ripple    XRP $10,845,834,234 $0.283253 38,290,271,363\n                    \n                    \n                        *   $253,494,000  0.12%  4.85%   6.05%
4  4         Litecoin    LTC  $2,584,999,454    $50.08                                                                            51,617,607   $991,078,000 -0.30% 14.08%  59.22%
5  5 Ethereum Classic    ETC  $2,112,575,943    $22.78                                                                            92,744,703   $249,724,000 -0.95%  1.48%   3.51%
6  6              NEM    XEM  $1,806,894,000 $0.200766  8,999,999,999\n                    \n                    \n                        *     $8,277,050  0.27%  2.42%  -9.58%

It did fetch the table, but the table contains a few unwanted characters such as new line, \n, spaces, %. In order to conduct analysis, we need to get rid off these characters. Computers don't seem to go well with texts; they understand numbers better. I would like to remove the first column(coins' ranking from CMC - we'll create rankings based on our custom criteria), make the column names small, lowercase and meaningful. Having a look at the website should make this post's naming. Let's do...

R> tbl_cmc[] <- lapply(tbl_cmc, gsub, pattern = "\\\n|\\s|[%*$,?]", replacement = "")
R> tbl_cmc$X. <- NULL
R> names(tbl_cmc) <- c("name", "symb", "mcap", "price", "supply", "vol", "ch1h", "ch24h", "ch7d")

See how our table looks with head(tbl_cmc):

R> head(tbl_cmc)
             name symb        mcap    price      supply        vol  ch1h ch24h   ch7d
1         Bitcoin  BTC 42533685929  2593.64    16399225 1122030000 -0.10  0.76 -10.30
2        Ethereum  ETH 34637433018   374.00    92613953  688324000 -0.24 -1.57  -1.84
3          Ripple  XRP 10845834234 0.283253 38290271363  253494000  0.12  4.85   6.05
4        Litecoin  LTC  2584999454    50.08    51617607  991078000 -0.30 14.08  59.22
5 EthereumClassic  ETC  2112575943    22.78    92744703  249724000 -0.95  1.48   3.51
6             NEM  XEM  1806894000 0.200766  8999999999    8277050  0.27  2.42  -9.58

Nice! We now have pretty nice looking table. Suppose, we're interested in top 100 crytocoins of total market capitalization. In order to select top 100, we can sort (descending) market capitalization, tbl_cmc$mcap. Oh, wait, can we run mathematical operations on market capitalization variable, mcap? Let's see:

R> typeof(tbl_cmc$mcap)
[1] "character"

We can't do maths on texts. Let's convert the character type to numeric:

R> num_tbl_cmc <- lapply(tbl_cmc[-c(1:2)], as.numeric) %>%
  as.data.frame()
R> tbl_clean <- cbind(tbl_cmc$name, tbl_cmc$symb, num_tbl_cmc)
R> names(tbl_clean) <- c("name", "symb", "mcap", "price", "supply", "vol", "ch1h", "ch24h", "ch7d")

Check again with typeof(tbl_clean$mcap) command:

R> typeof(tbl_clean$mcap)
[1] "double"

double is a floating point data type in R (and all other languages) on which mathematical operations are valid. So let's select top 100 crypto by market capitalization and store the table in variable named top_100_mcap:

R> top_100_mcap <- arrange(.data = tbl_clean, desc(mcap))[1:100, ]

Now we might be interested to know how much the total of top 100 crypto market caps add up to. Well, do:

R> sum(top_100_mcap$mcap)
[1] 107433383060

So top 100 crypto add up to $107.433bn. Well, what percentage does BTC or ETH have of that total market cap? Fine, why not calculate market share percentage for all 100 currencies? We create mcap_prcnt variable where mcap_prcnt [i] = mcap[i] / sum(mcap). Based on the percentage data, we then plot a pie chart for top 20 coins by market cap:

R> top_100_mcap$mcap_prcnt <- top_100_mcap$mcap/sum(top_100_mcap$mcap)
R> top_20_mcap <- arrange(.data = top_100_mcap, desc(mcap))[1:20, ]
R> head(top_20_mcap, 10) # Display first top 10 currencies
              name  symb        mcap       price      supply        vol  ch1h ch24h   ch7d  mcap_prcnt
1          Bitcoin   BTC 42533685929 2593.640000    16399225 1122030000 -0.10  0.76 -10.30 0.395907536
2         Ethereum   ETH 34637433018  374.000000    92613953  688324000 -0.24 -1.57  -1.84 0.322408473
3           Ripple   XRP 10845834234    0.283253 38290271363  253494000  0.12  4.85   6.05 0.100954042
4         Litecoin   LTC  2584999454   50.080000    51617607  991078000 -0.30 14.08  59.22 0.024061417
5  EthereumClassic   ETC  2112575943   22.780000    92744703  249724000 -0.95  1.48   3.51 0.019664055
6              NEM   XEM  1806894000    0.200766  8999999999    8277050  0.27  2.42  -9.58 0.016818739
7             Dash  DASH  1461649683  198.280000     7371496  112050000 -0.96 12.24   4.92 0.013605172
8             IOTA MIOTA  1142659340    0.411098  2779530283    2694160 -1.49  1.00     NA 0.010635980
9        BitShares   BTS   872772343    0.336173  2596200000   59671400  1.49  2.14  -7.89 0.008123847
10         Stratis STRAT   729533748    7.410000    98430281    7947920  0.37 -2.83  -8.76 0.006790569

BTC's market cap is $42.5bn, followed by ETH's $34.6bn. Now we plot the intended pie chart and create labels for it:

R> lbls <- paste0(top_20_mcap$symb, " : ", sprintf("%.2f", top_20_mcap$mcap_prcnt*100), "%")
R> pie3D(top_20_mcap$mcap_prcnt, labels = lbls,
      explode=0.1, main="Top 20 cryptocoin market share")

This final code snippet generates the plot we saw at the beginning. Now let's say, we'd like to see which currency appreciated the most in last hour, then we order again by 24h growth and yet again by 7d.

R> top_perf <- arrange(.data = top_100_mcap, desc(ch1h), desc(ch24h), desc(ch7d))
R> head(top_perf, 10) # Display first 10 rows
         name  symb      mcap    price       supply      vol  ch1h  ch24h   ch7d   mcap_prcnt
1    VeriCoin   VRC  18300343 0.604918     30252602  1977260 10.56  -0.15  66.81 1.703413e-04
2        Asch   XAS  47499525 0.633327     75000000 15971000 10.33 133.39 180.12 4.421300e-04
3    Blocknet BLOCK  23716930 6.060000      3910516    44296  9.23  18.37  -3.52 2.207594e-04
4    Mooncoin  MOON  23303904 0.000105 222009604607   180970  8.79  -6.74 -34.73 2.169149e-04
5   XtraBYtes   XBY  16459430 0.025322    650000000   115598  7.61 -25.02   8.62 1.532059e-04
6  Cryptonite   XCN  10643830 0.031960    333032866  4084900  5.26  32.47 241.90 9.907377e-05
7      BitBay   BAY  34102242 0.033847   1007555925   268915  4.23  20.89  14.56 3.174269e-04
8   RaiBlocks   XRB  13358903 0.163423     81744327   380100  3.45  11.06 362.08 1.243459e-04
9       Steem STEEM 486078959 2.070000    234664310  3165330  3.36   2.95  -8.51 4.524469e-03
10 FedoraCoin  TIPS  14751163 0.000033 443168182458   255852  3.06  26.78  15.14 1.373052e-04

We see, Vericoin (VRC), grew the most, by 10.56% -0.15% 66.81% in the last hour, last 24 hours and last week. It'd have been nice to know how a currency, BTC or ETH, for example, is performing compared to the market, right? Or suppose we've a portfolio of BTC, ETH, ETC, XMR, DASH, SC, IOTA and ZEC. Wouldn't it be nice to know how our entire portfolio is performing relative to the entire crypto market? We'll see some more examples in next posts. Stay networked! Keep Steeming!

Code

This is the code that generates our intended plot. It takes less than a quarter-minute to read web chart and generate the plot with realtime market data. Save the script, for example, as pietop20.R, and run from terminal with R pietop20.R command to get the realtime market pie chart. In R console, run install.packages(package-name) to install package <package-name>.

# Load required packages and find price table URL
url_cmc <- "https://coinmarketcap.com/currencies/views/all/"
library(rvest) # for data fetching
library(plotrix) # for 3Dplot
library(dplyr) # for data manipulation

# Read the price table
url_cmc %>%
  read_html() %>%
  html_nodes(css = "table") %>%
  html_table() %>%
  as.data.frame() -> "tbl_cmc" # save table to variable <tbl_cmc>


# Clean data
tbl_cmc[] <- lapply(tbl_cmc, gsub, pattern = "\\\n|\\s|[%*$,?]", replacement = "")
tbl_cmc$X. <- NULL
names(tbl_cmc) <- c("name", "symb", "mcap", "price", "supply", "vol", "ch1h", "ch24h", "ch7d")

# Prepare data for mathematical operations
num_tbl_cmc <- lapply(tbl_cmc[-c(1:2)], as.numeric) %>%
  as.data.frame()
tbl_clean <- cbind(tbl_cmc$name, tbl_cmc$symb, num_tbl_cmc)
names(tbl_clean) <- c("name", "symb", "mcap", "price", "supply", "vol", "ch1h", "ch24h", "ch7d")

# Select top 100 by market cap
top_100_mcap <- arrange(.data = tbl_clean, desc(mcap))[1:100, ]
# Add a column named <mcap_prcnt> which percentage of a crypto assets market share
top_100_mcap$mcap_prcnt <- top_100_mcap$mcap/sum(top_100_mcap$mcap)


# 1h growth, highest, then 2h highest and 7d
perf <- arrange(.data = top_100_mcap, desc(ch1h), desc(ch24h), desc(ch7d))
head(perf, 10) # Display first 10 rows

sum(top_100_mcap$mcap) # coin market cap, aggregate

# Top 20
top_20_mcap <- arrange(.data = top_100_mcap, desc(mcap))[1:20, ]

# Actual plot
lbls <- paste0(top_20_mcap$symb, " : ", # Create labels for plot
               sprintf("%.2f", top_20_mcap$mcap_prcnt*100), "%")
pie3D(top_20_mcap$mcap_prcnt, labels = lbls,
      explode=0.1, main="Top 20 cryptocoin market share")


# Export collected data to Excel, SPSS, etc. `table_clean` variable contians data on 756 coins in analyzable format.
# We export the whole data set:

library(xlsx) # for Excel
library(foreign) # for SPSS and more including Excel
R> write.xlsx(tbl_clean, "clean_chart.xlsx")
# The above line of code saves our table in a file <clean_chart.xlsx> in your
# current working directory. Run getwd() to display working directory.
R> write.csv(x = tbl_clean, file = "clean_chart.csv", sep = ",", row.names = FALSE)
# This saves the table in a comma-separated CSV file

References and R resources

It should list more materials. Later.

Sort:  

Nice post. Thanks.

Wow, this is great man.
Thanks for sharing!

You're more than welcome! Looking forward to putting some more useful stuff.

Nice write up. Very clear instructions (a non R-user could even follow this).

Coin Marketplace

STEEM 0.22
TRX 0.25
JST 0.040
BTC 104076.88
ETH 3257.73
SBD 4.15