Crawling websites using curl and bash.
Every now and then I need to crawl a website or automate a web form. The easiest way to do this is usually a simple shell script that parses the data by hand. I did this recently with www.samair.ru and decided to do a quick writeup.
Finding open proxy servers that actually work is often a little annoying. Very few of the sites that let you download their complete database allow it for long, and most try to obfuscate the data to prevent crawling. Take a look at http://www.samair.ru/proxy/proxy-01.htm: it seems easy enough to grab the data from this site by hand, but since the list updates occasionally a script would be better. When we look at the page source, however, we get the following:
<tr><td>62.148.136.79<script type="text/javascript">document.write(":"+o+s)</script>
This shows that they tried to complicate things a bit. Looking at the top of the page we can see the definitions they use for outputting port numbers. These change every couple of minutes, so we can’t simply hard-code them either:
<script type="text/javascript"> q=7;s=0;u=4;z=3;g=6;i=9;p=1;o=8;h=2;m=5;</script></head>
Since curl doesn’t support JavaScript we have to parse it ourselves, but this is fairly simple using sed and grep. With the definitions above, o=8 and s=0, so ":"+o+s comes out as :80. Not all of these proxies work, so we have to test them as well: some block certain regions of the net, and some may never have worked at all. Either way, we use curl to connect to a site through the proxy and search for some known piece of text. If we don’t find it, the proxy doesn’t work. Easy. If anyone is interested, here’s the script for crawling the site. On a complete pass it gives about 150 working HTTP proxies. You can use these for faking online polls, SQL injection, you name it.
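As a minimal sketch of the two steps described above (not necessarily the script referred to), the decoding and the proxy test could look something like this. The function names decode_proxies and check_proxy are made up for illustration, the test URL and search text are placeholders, and the patterns assume the page looks exactly like the two snippets shown earlier. It relies on grep -o, which GNU and BSD grep both provide.

```shell
# decode_proxies: read one listing page on stdin, print "ip:port" lines.
# Assumes the letter definitions and the rows look like the snippets above.
decode_proxies() (
    page=$(cat)

    # Take the first run of definitions ("q=7;s=0;...") and turn every
    # "x=N" pair into a sed rule "s/x/N/g".
    sed_prog=$(printf '%s\n' "$page" |
        grep -o '\([a-z]=[0-9];\)\{2,\}' |
        head -n 1 |
        grep -o '[a-z]=[0-9]' |
        sed 's|\(.\)=\(.\)|s/\1/\2/g|')

    # Pull out each 'IP<script ...>document.write(":"+a+b)' fragment,
    # reduce it to "IP:ab", then substitute the letters with their digits.
    printf '%s\n' "$page" |
        grep -o '[0-9]\{1,3\}\(\.[0-9]\{1,3\}\)\{3\}<script type="text/javascript">document\.write(":"[^)]*)' |
        sed -e 's/<script[^>]*>document\.write("//' -e 's/"//' -e 's/[+)]//g' |
        sed "$sed_prog"
)

# check_proxy HOST:PORT [URL] [TEXT]: succeed if TEXT shows up when URL is
# fetched through the proxy. URL and TEXT defaults are placeholders.
check_proxy() {
    curl --silent --max-time 10 --proxy "$1" "${2:-http://example.com/}" |
        grep -qF -- "${3:-Example Domain}"
}
```

Fed the two lines of page source shown above, decode_proxies prints 62.148.136.79:80. Something like `check_proxy 62.148.136.79:80 && echo works` then tests a single entry. Because the letter definitions change every few minutes, each page has to be decoded with the definitions from that same download.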
There’s a lot of room for improvement, detecting redirecting proxies for example. Interestingly, the pages are only numbered up to 25, but we can crawl much farther than that: the script maxed out at 72 the last time I used it. Detecting the last page doesn’t always work, though. You can get an empty page while the following pages still have more loot, so we just crawl to page 99.
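The "just crawl to page 99" approach might be sketched like this. Only proxy-01.htm is known from the URL above, so the zero-padded pattern for the other pages is an assumption, and page_url and crawl_pages are hypothetical names.

```shell
# page_url N: URL of listing page N (zero-padded pattern is assumed).
page_url() {
    printf 'http://www.samair.ru/proxy/proxy-%02d.htm\n' "$1"
}

# crawl_pages [LAST] [HANDLER...]: fetch pages 1..LAST (default 99) and pipe
# each page separately into HANDLER (default: cat).
crawl_pages() (
    last=${1:-99}
    [ "$#" -gt 0 ] && shift
    [ "$#" -gt 0 ] || set -- cat
    n=1
    while [ "$n" -le "$last" ]; do
        # An empty or failed page is skipped rather than treated as the end,
        # since later pages may still contain entries.
        curl --silent --max-time 30 "$(page_url "$n")" | "$@" || true
        n=$((n + 1))
    done
)
```

Each page goes through the handler on its own, which matters here because the port definitions differ from one download to the next. A call such as `crawl_pages 99 my_parser | sort -u` would collect the results, where my_parser stands in for whatever command turns one page into a list of proxies.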

