Scrubbing Car websites for statistics | DIMI’s place

DIMI's place

My thoughts on different things

22 August 2021

Scrubbing Car websites for statistics

I would like to share how to get data from almost any web page (unless its developers decided to make it really obscure for everyone).

I’ll be using the Chrome browser with the Tampermonkey extension.

TLDR:

You can find examples in the separate repo and on the GitHub-hosted page.

Preparation phase

Save the web page that you would like to scrape locally. Open its HTML source code and remove all <script> tags.

Add

<script src = "https://code.jquery.com/jquery-3.4.1.min.js" ></script>

to the <head> tag.
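The clean-up described above can also be scripted instead of done by hand. Here is a minimal sketch in plain JavaScript (it should run in Node or in the browser console). The names prepareSavedPage and JQUERY_TAG are made up for this example, and the regex is a rough shortcut rather than a real HTML parser, so scripts with unusual markup may slip through.

```javascript
// Hypothetical helper, not part of the original workflow.
const JQUERY_TAG = '<script src="https://code.jquery.com/jquery-3.4.1.min.js"></script>';

function prepareSavedPage(html) {
  // Drop every <script>...</script> pair (crude regex, fine for a one-off clean-up).
  const withoutScripts = html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '');
  // Put the jQuery tag right after the opening <head> tag, if the page has one.
  // A function is used as the replacement so "$" characters are never interpreted.
  return withoutScripts.replace(/<head\b[^>]*>/i, (head) => head + JQUERY_TAG);
}
```

Pass it the saved page as a string and write the returned string back to the file before opening it in the browser.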

Open the page in the browser and check that jQuery is loaded by running this command in the console:

$().jquery

If no error appears and you see the jQuery version, you are ready for scrubbing!

This lets you run jQuery queries in the console and test your selectors.

A few words about jQuery methods

The main goal is to find something that makes selection possible: a class on the element, a specific tag, or a specific position among sibling tags.

Below are some examples that might help in fetching data:

$(item).find(".classname")                //Elements with .classname under item
$(item).find(".classname:first")          //First element with .classname
$(item).find(".classname:nth-child(3)")   //.classname element that is the 3rd child of its parent
$(item).find(".classname > div")          //div tags directly under .classname
$(item).find(".classname").attr('href')   //href attribute of the first .classname element

Note that you have to add .text() at the end of a query to convert the jQuery object into text (.attr() already returns a string).

Implementation

Now go fishing for items. Most often they are stored under one parent or share the same class name:

var elements = $(".list_parent_class .subitem_class")
// or, if no parent is defined:
var elements = $(".subitem_class")

After that we’ll iterate over all items like this:

$(elements).each((index, item) => extractAllFields(item));

The extractAllFields function contains the code that matches the selectors and stores the results in variables:

function extractAllFields(item) {
     let nameField=$(item).find(".headline-block").text().replace(/\s\s+/g, ' ')
     ...
}
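Every extracted field usually needs the same whitespace clean-up, so it may be worth pulling it into a small helper. The sketch below is only a suggestion: cleanText is a made-up name, and unlike the /\s\s+/g pattern above it also replaces single newlines or tabs with a space and trims both ends.

```javascript
// Hypothetical helper, not from the original script.
function cleanText(raw) {
  // Collapse every run of whitespace (spaces, tabs, newlines) into one space,
  // then drop leading and trailing whitespace.
  return String(raw).replace(/\s+/g, ' ').trim();
}
```

Inside extractAllFields it could be used as cleanText($(item).find(".headline-block").text()).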

Each result is stored in an object and then pushed to the list:

var results = [];

let result = {
    name: nameField,
    km: <>,
    year: <>,
    ...
}

results.push(result);
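Since the goal is statistics, it can help to store numeric fields as numbers rather than text. This is a rough sketch under assumptions: toNumber and buildResult are made-up names, rawFields stands for the strings already extracted for one item, and the sample values are invented for illustration only. toNumber simply keeps the digits, so it is not suitable for decimals or negative values.

```javascript
// Hypothetical helper: keeps only the digits of strings such as "120 000 km".
function toNumber(raw) {
  const digits = String(raw).replace(/\D/g, '');
  return digits === '' ? null : Number(digits);
}

// rawFields is assumed to hold the text already extracted for one item.
function buildResult(rawFields) {
  return {
    name: rawFields.name,
    km: toNumber(rawFields.km),
    year: toNumber(rawFields.year),
  };
}

// Invented sample values, only to show the shape of the resulting list.
const sampleResults = [];
sampleResults.push(buildResult({ name: 'Example car', km: '120 000 km', year: '2015' }));
```

A field with no digits at all becomes null, which makes missing values easy to filter out later.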

The final step is putting the result on the clipboard:

GM_setClipboard(JSON.stringify(results));

You’ll end up with the JSON on your clipboard.
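For GM_setClipboard to be available, the Tampermonkey script has to request it in its metadata block. Below is a minimal sketch of how the header and the export step might look. The script name and the @match URL are placeholders, and exportResults is a made-up wrapper that falls back to console.log when GM_setClipboard is not defined (for example when testing in the plain console).

```javascript
// ==UserScript==
// @name     Car listing export (placeholder name)
// @match    https://example.com/*
// @require  https://code.jquery.com/jquery-3.4.1.min.js
// @grant    GM_setClipboard
// ==/UserScript==

// Hypothetical wrapper: serializes the results and hands them to a copy function.
// The optional second argument lets you pass your own copy function.
function exportResults(results, copy) {
  const json = JSON.stringify(results, null, 2); // pretty-printed with 2 spaces
  const fallback = typeof GM_setClipboard === 'function'
    ? GM_setClipboard
    : (text) => console.log(text);
  (copy || fallback)(json);
  return json;
}
```

Calling exportResults(results) at the end of the script copies the pretty-printed JSON, ready to paste into a file or a spreadsheet import.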

TODO’s and TODONT’s

tags: life