Why yet another YouTube downloader?
I started studying computer science and gained a lot of valuable skills, which led me to a better understanding of how YouTube transfers data between server and client. Before majoring in CS I had already built a YouTube downloader once, and it worked fine back then. However, it was horrible. I crammed all my classes into one big file of almost 8000 lines of code. That is not much for a whole project (leaving aside that it all lived in a single file), but for the inexperienced me it was still a rather big amount. But that sin pales in comparison to my coding style: I had extremely tight coupling because I loved having very narrowly defined parameters instead of parameters with a high level of compatibility. To clarify: I had methods like:
Public Async Function Download(downloadable As IHasDownloadableFilePath) As Task(Of Mp3File)
    'Download file
End Function
So I figured reprogramming it with my new knowledge could be a good thing.
TL;DR: Because I can.
So, how does YouTube work now?
Exactly. “now”. That’s the point. I’m definitely not the only one programming a YouTube downloader and one of the challenges is maintainability. But we’ll come to that later.
Figuring out how the website works
Grab your web browser of choice and open a new tab. Navigate to YouTube and search for “Cutest Cats Compilation 2017 | Best Cute Cat Videos Ever” (trust me on this, it’ll work). In case you’re as lazy as I am: Here’s the link.
Watch the video 5-6 times (important!), then open the developer tools and switch to the view that shows the network traffic of this tab. You may have to reload the page to clear the previous entries. Now you should see entries like:
Status | Method | Domain | File |
---|---|---|---|
200 | GET | www.youtube.com | /watch?v=rNSnfXl1ZjU |
200 | POST | www.youtube.com | /youtubei/v1/log_event?alt=json&key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8 |
499 | GET | s.youtube.com | /api/stats/watchtime?ns=yt&el=detailpage&cpn=GS_JfBXhmDuUP4B-&docid=rNSnfXl1ZjU&… |
499 | GET | s.youtube.com | /api/stats/qoe?event=streamingstats&fmt=248&afmt=251&cpn=GS_JfBXhmDuUP4B-&… |
304 | GET | www.youtube.com | /yts/cssbin/www-core-vflEw3T4V.css |
200 | GET | www.youtube.com | /yts/jsbin/scheduler-vflFMss6b/scheduler.js |
304 | GET | www.youtube.com | /yts/jsbin/player-vflmgXZN3/de_DE/base.js |
304 | GET | www.youtube.com | /yts/cssbin/player-vfldNZ2yp/www-player.css |
304 | GET | www.youtube.com | /yts/cssbin/www-pageframe-vflvmMK_J.css |
304 | GET | www.youtube.com | /yts/cssbin/www-watch-transcript-vflp9_n_i.css |
304 | GET | www.youtube.com | /yts/img/pixel-vfl3z5WfW.gif |
304 | GET | yt3.ggpht.com | /…/photo.jpg |
200 | GET | www.youtube.com | /yts/jsbin/spf-vflktUosI/spf.js |
304 | GET | www.youtube.com | /yts/jsbin/www-en_US-vflNi-csa/base.js |
204 | GET | r7---sn-nfpnnjvh-9anz.googlevideo.com | /generate_204?conn2 |
204 | GET | r7---sn-nfpnnjvh-9anz.googlevideo.com | /generate_204 |
499 | GET | s.youtube.com | /api/stats/qoe?event=streamingstats&fmt=248&afmt=251&cpn=rV7I2hNtggAhESMv&… |
200 | GET | r7---sn-nfpnnjvh-9anz.googlevideo.com | /videoplayback?itag=248&ipbits=0&signature=… |
200 | GET | r7---sn-nfpnnjvh-9anz.googlevideo.com | /videoplayback?itag=251&ipbits=0&signature=… |
200 | GET | s.ytimg.com | /yts/imgbin/www-hitchhiker-vfl-Nn88d.png |
200 | GET | s.ytimg.com | /yts/img/icn_loading_animated-vflff1Mjj.gif |
304 | GET | www.youtube.com | /yts/jsbin/player-vflmgXZN3/de_DE/captions.js |
304 | GET | www.youtube.com | /yts/jsbin/player-vflmgXZN3/de_DE/endscreen.js |
304 | GET | www.youtube.com | /yts/jsbin/player-vflmgXZN3/de_DE/annotations_module.js |
304 | GET | www.youtube.com | /yts/jsbin/player-vflmgXZN3/de_DE/remote.js |
304 | GET | www.youtube.com | /yts/jsbin/player-vflmgXZN3/de_DE/annotations_module.js |
499 | GET | www.youtube.com | /api/stats/ads?ver=2&ns=1&event=5&device=1&content_v=rNSnfXl1ZjU&el=detailpageP&… |
200 | GET | www.googleapis.com | /youtube/v3/videos?id=DGilp8SwVXA&part=snippet,status,statistics&… |
200 | GET | www.youtube.com | /get_video_metadata?video_id=DGilp8SwVXA&html5=1&page_subscribe=0&authuser=0 |
304 | GET | i1.ytimg.com | /vi/rNSnfXl1ZjU/mqdefault.jpg |
499 | GET | www.youtube.com | /api/stats/ads?ver=2&ns=1&event=5&device=1&content_v=rNSnfXl1ZjU&… |
200 | POST | www.youtube.com | /annotations_invideo?cap_hist=1&video_id=rNSnfXl1ZjU&client=1&… |
204 | POST | www.youtube.com | /get_endscreen?v=rNSnfXl1ZjU&ei=SDZOWdSuNMWmWKy5jLAP&client=1 |
304 | GET | www.youtube.com | /mac_204?action_fcts=1 |
200 | GET | r7---sn-nfpnnjvh-9anz.googlevideo.com | /videoplayback?itag=248&ipbits=0&signature=… |
204 | GET | s.youtube.com | /api/stats/playback?ns=yt&el=detailpage&cpn=rV7I2hNtggAhESMv&docid=rNSnfXl1ZjU&… |
499 | GET | www.youtube.com | /ptracking?html5=1&video_id=rNSnfXl1ZjU&cpn=rV7I2hNtggAhESMv&plid=… |
200 | GET | r7---sn-nfpnnjvh-9anz.googlevideo.com | /videoplayback?itag=251&ipbits=0&signature=… |
200 | GET | yt3.ggpht.com | /proxy/_b7z2mMA2Pu7_OvgnFQpciztYnnWR3c6r7yxU8UhwUOnSvFooNPP3ILI9… |
200 | GET | yt3.ggpht.com | /proxy/P7rR4kPFGWDO7-1De2dy0dt6LsxYjKiEAR3QBU6KF8uOr0y4zcUb-Codg… |
200 | GET | yt3.ggpht.com | /proxy/qDqcmyBYm6z6nH2PHKUAgQq3HgZRAJ2yJ1Z4JyAUQH4YmlCeFATtjVp2e… |
200 | GET | yt3.ggpht.com | /proxy/2r2ORz4a8RzuPT3jMWIXY2Too0wzVwIvQEi8RJoZCPcfaMWscT3lS1jLc… |
Now, the requests that interest us are the video playbacks, which normally go to the nearest stream server. In my case that was `r7---sn-nfpnnjvh-9anz.googlevideo.com`. The request goes to `/videoplayback` with the following query parameters:
Query | Value |
---|---|
itag | 248 |
ipbits | 0 |
signature | 433FE0E9FF5840DBF0F3F0B36850FAEFE87C9871.B996B69BC9875122EC16D5F518BD1571C4EBB9AA |
ms | au |
mv | m |
mt | 1498297837 |
keepalive | yes |
source | youtube |
requiressl | yes |
clen | 177291236 |
key | yt6 |
mn | sn-nfpnnjvh-9anz |
mm | 31 |
ei | SDZOWdSuNMWmWKy5jLAP |
id | o-AJL6xVtQH6NhA7g0PkCTaxNiOjSV2rmWpEDfaN5Pmuym |
initcwndbps | 6528750 |
mime | video/webm |
lmt | 1490303207744796 |
ip | IPv4/IPv6 address of client |
sparams | clen,dur,ei,gir,id,initcwndbps,ip,ipbits,itag,keepalive,lmt,mime,mm,mn,ms,mv,pl,requiressl,source,expire |
gir | yes |
expire | 1498319529 |
dur | 614.833 |
pl | 45 |
alr | yes |
ratebypass | yes |
cpn | rV7I2hNtggAhESMv |
c | WEB |
cver | 1.20170622 |
range | 0-766271 |
rn | 0 |
rbuf | 0 |
I go into greater detail about all of those queries in another post.
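To make the table above a bit more tangible, here is a minimal sketch of how such a request URL can be taken apart with the standard `URL` class. The URL is a shortened copy of the example (most parameters omitted), `parseQuery` is just an illustrative helper name, and treating `range` as an inclusive byte range is my assumption, based on how HTTP byte ranges usually work.

```javascript
// Shortened copy of the request URL from the example; most parameters omitted.
const requestUrl =
  "https://r7---sn-nfpnnjvh-9anz.googlevideo.com/videoplayback" +
  "?itag=248&mime=video%2Fwebm&clen=177291236&dur=614.833" +
  "&sparams=clen%2Cdur%2Cei&range=0-766271";

// Turn the query string into a plain object of decoded key/value pairs.
function parseQuery(url) {
  return Object.fromEntries(new URL(url).searchParams.entries());
}

const query = parseQuery(requestUrl);

// Assumption: "range" is an inclusive byte range, so its length is end - start + 1.
const [rangeStart, rangeEnd] = query.range.split("-").map(Number);
const chunkBytes = rangeEnd - rangeStart + 1;

// "sparams" is itself a comma-separated list of parameter names.
const signedParams = query.sparams.split(",");

console.log(query.itag, query.mime, signedParams, chunkBytes);
```

All values come back as strings, so numeric fields such as `clen` or `dur` need an explicit `Number(...)` before doing arithmetic on them.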
What is important, however, is that we distinguish between the various formats via their `itag`s. Let’s find out where this information comes from, shall we? It turns out there’s a big JSON object in the main HTML file that contains all of this format information in `args.adaptive_fmts` and `args.url_encoded_fmt_stream_map`. Here’s the one from the example above:
Key | Value |
---|---|
url | too long |
itag | 248 |
bitrate | 3055840 |
size | 1920x1080 |
lmt | 1490303207744796 |
type | video/webm; codecs="vp9" |
init | 0-242 |
projection_type | 1 |
xtags | |
index | 243-2311 |
clen | 177291236 |
fps | 30 |
quality_label | 1080p |
url: https://r7---sn-nfpnnjvh-9anz.googlevideo.com/videoplayback?itag=248&ipbits=0&signature=433FE0E9FF5840DBF0F3F0B36850FAEFE87C9871.B996B69BC9875122EC16D5F518BD1571C4EBB9AA&ms=au&mv=m&mt=1498297837&keepalive=yes&source=youtube&requiressl=yes&clen=177291236&key=yt6&mn=sn-nfpnnjvh-9anz&mm=31&ei=SDZOWdSuNMWmWKy5jLAP&id=o-AJL6xVtQH6NhA7g0PkCTaxNiOjSV2rmWpEDfaN5Pmuym&initcwndbps=6528750&mime=video%2Fwebm&lmt=1490303207744796&ip=XXXX&sparams=clen%2Cdur%2Cei%2Cgir%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Ckeepalive%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cpl%2Crequiressl%2Csource%2Cexpire&gir=yes&expire=1498319529&dur=614.833&pl=45
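For illustration, here is a rough sketch of how such a format list could be turned into objects like the one in the table. It assumes that the raw value is a comma-separated list in which every entry is a URL-encoded query string, which is what the name `url_encoded_fmt_stream_map` suggests. The sample string, the `example.invalid` URLs, the second entry's bitrate and the helper name `parseFormats` are all made up for the example.

```javascript
// Hypothetical, heavily shortened stand-in for the raw args.adaptive_fmts string.
const adaptiveFmts =
  "itag=248&bitrate=3055840&size=1920x1080&quality_label=1080p" +
  "&type=video%2Fwebm%3B+codecs%3D%22vp9%22" +
  "&url=https%3A%2F%2Fexample.invalid%2Fvideoplayback%3Fitag%3D248" +
  ",itag=251&bitrate=160000" +
  "&type=audio%2Fwebm%3B+codecs%3D%22opus%22" +
  "&url=https%3A%2F%2Fexample.invalid%2Fvideoplayback%3Fitag%3D251";

// Assumption: entries are separated by commas and each entry is a
// URL-encoded query string. Commas inside values are percent-encoded,
// so splitting on "," does not cut a value in half.
function parseFormats(raw) {
  return raw
    .split(",")
    .map((entry) => Object.fromEntries(new URLSearchParams(entry)));
}

const formats = parseFormats(adaptiveFmts);

// Pick a format by its itag (values are strings after decoding).
const wanted = formats.find((format) => format.itag === "248");

console.log(wanted.quality_label, wanted.type, wanted.url);
```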
Again: What those keys and values mean is something I address in another post. Now, to make this download link work, we have to add a `&ratebypass=yes` query parameter and we’re done.
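A small sketch of that last step, assuming the `url` value has already been decoded into a normal URL string. `withRateBypass` is a hypothetical helper name and the `example.invalid` URL is a placeholder. The parameter is appended as plain text so that the rest of the query string stays byte-for-byte as YouTube delivered it.

```javascript
// Append ratebypass=yes unless the URL already carries that parameter.
function withRateBypass(url) {
  if (new URL(url).searchParams.has("ratebypass")) {
    return url; // already present, leave the link untouched
  }
  const separator = url.includes("?") ? "&" : "?";
  return url + separator + "ratebypass=yes";
}

const downloadUrl = withRateBypass(
  "https://example.invalid/videoplayback?itag=248&pl=45"
);
console.log(downloadUrl);
```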
At least in this simple case. Most of the time we have to deal with formats that have an `s` key whose value resembles a signature. However, we can’t just add it to our download link: we have to do some manipulation on it to turn it into a valid signature. This is where it gets nasty, because the algorithm to decipher this string lives in a JavaScript file named `base.js`, which is minified and therefore barely readable. Its exact path can be found in the JSON object at `assets.js`. Inside `base.js`, one can find a call that sets the `signature` to the result of another method call.
That method can look different for every version of `base.js`. This one, for example, looks like this:
yE = function(a) {
    a = a.split("");
    xE.hI(a, 2);
    xE.iF(a, 33);
    xE.iF(a, 16);
    xE.iF(a, 44);
    xE.hI(a, 1);
    xE.iF(a, 12);
    xE.dy(a, 23);
    xE.iF(a, 19);
    return a.join("")
};
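Since this function changes between versions of `base.js`, it can help to read the call sequence out of the source text instead of copying it by hand. The following is only a sketch: the regular expression is my own and assumes that every step has the shape `object.method(a, number)`, which holds for this version but is not guaranteed for others. `extractSteps` is a hypothetical helper name.

```javascript
// The function source exactly as it appears in this version of base.js.
const decipherSource = `yE = function(a) {
a = a.split("");
xE.hI(a, 2);
xE.iF(a, 33);
xE.iF(a, 16);
xE.iF(a, 44);
xE.hI(a, 1);
xE.iF(a, 12);
xE.dy(a, 23);
xE.iF(a, 19);
return a.join("")
};`;

// Collect every call of the form object.method(a, <number>) in source order.
// a.split("") and a.join("") do not match because they lack the "(a," part.
function extractSteps(fnSource) {
  const callPattern = /([\w$]+)\.([\w$]+)\(a,\s*(\d+)\)/g;
  return [...fnSource.matchAll(callPattern)].map(
    ([, object, method, arg]) => ({ object, method, arg: Number(arg) })
  );
}

const steps = extractSteps(decipherSource);
console.log(steps);
```

The result is a list of method names and arguments that still has to be mapped to actual operations.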
We have three different methods here. One splices, one reverses and the last one swaps chars of the `s` array. To find out which one does what, we have to identify them. In this case, the setup is as follows:
var xE = {
    dy: function(a) {
        a.reverse()
    },
    hI: function(a, b) {
        a.splice(0, b)
    },
    iF: function(a, b) {
        var c = a[0];
        a[0] = a[b % a.length];
        a[b] = c
    }
};
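Because the helper names are different in every version, one way to identify them is to look at what their bodies do. This is a heuristic sketch that works for the three helpers shown here. The patterns and the name `classifyHelper` are my own assumptions, and other versions of `base.js` may need different checks.

```javascript
// Helper bodies as plain text, copied from the object in this version of base.js.
const helperBodies = {
  dy: "function(a) { a.reverse() }",
  hI: "function(a, b) { a.splice(0, b) }",
  iF: "function(a, b) { var c = a[0]; a[0] = a[b % a.length]; a[b] = c }",
};

// Guess the kind of operation from a characteristic piece of the body.
function classifyHelper(body) {
  if (/\.reverse\(/.test(body)) return "reverse";
  if (/\.splice\(/.test(body)) return "splice";
  // The swap is recognised by the assignment a[0] = a[b % a.length].
  if (/\[0\]\s*=\s*\w+\[\w+\s*%\s*\w+\.length\]/.test(body)) return "swap";
  return "unknown";
}

const helperKinds = Object.fromEntries(
  Object.entries(helperBodies).map(([name, body]) => [name, classifyHelper(body)])
);
console.log(helperKinds);
```

Returning `"unknown"` instead of guessing makes it obvious when a new player version introduces a helper this sketch does not cover.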
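Now it is pretty clear which one does what: `dy` reverses the array, `hI` removes the first `b` elements, and `iF` swaps the first element with the one at position `b`. With that we can rebuild the function `yE` and finally complete our `signature` query parameter.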
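Put together, a readable rebuild of this particular version could look like the sketch below. The step list is copied from `yE` and the operations mirror the three helpers. The names `operations`, `decipherSteps`, `decipher` and `signedUrl` are my own, and the list is only valid for the `base.js` version shown here.

```javascript
// Readable equivalents of the three helpers from this version of base.js.
const operations = {
  reverse: (chars) => {
    chars.reverse();
  },
  splice: (chars, count) => {
    chars.splice(0, count); // drop the first `count` characters
  },
  swap: (chars, index) => {
    const first = chars[0];
    chars[0] = chars[index % chars.length];
    chars[index] = first; // mirrors the original, which writes without the modulo
  },
};

// Call sequence copied from yE; the argument of "reverse" is ignored.
const decipherSteps = [
  ["splice", 2], ["swap", 33], ["swap", 16], ["swap", 44],
  ["splice", 1], ["swap", 12], ["reverse", 23], ["swap", 19],
];

// Apply all steps to the value of the `s` key.
function decipher(s) {
  const chars = s.split("");
  for (const [name, arg] of decipherSteps) {
    operations[name](chars, arg);
  }
  return chars.join("");
}

// Attach the deciphered value to the format's URL as the `signature` parameter.
function signedUrl(url, s) {
  return url + "&signature=" + encodeURIComponent(decipher(s));
}
```

Keeping the steps as data rather than as hard-coded calls means that only `decipherSteps` has to change when a new player version shuffles the sequence.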
So that’s basically how you get to the download link. As I already mentioned in the beginning, I’m currently programming the whole thing and you can find the repository here.