src/extractors/custom/README.md
+32-29
@@ -8,18 +8,19 @@ Custom parsers allow you to write CSS selectors that will find the content you'r
8
8
9
9
You can query for every field returned by the Mercury Parser:
10
10
11
-
- title
12
-
- author
13
-
- content
14
-
- date_published
15
-
- lead_image_url
16
-
- dek
17
-
- next_page_url
18
-
- excerpt
11
+
- title
12
+
- author
13
+
- content
14
+
- date_published
15
+
- lead_image_url
16
+
- dek
17
+
- next_page_url
18
+
- excerpt
19
19
20
20
### Using selectors
21
21
22
22
#### Basic selectors
23
+
23
24
To demonstrate, let's start with something simple: Your selector for the page's title might look something like this:
24
25
25
26
```javascript
@@ -41,12 +42,13 @@ As you might guess, the selectors key provides an array of selectors that Mercur
41
42
The selector you choose should return one element. If more than one element is returned by your selector, it will fail (and Mercury will fall back to its generic extractor).
42
43
43
44
#### Selecting an attribute
44
-
Sometimes the information you want to return lives in an element's attribute rather than its text — e.g., sometimes a more exact ISO-formatted date/time will be stored in an attribute of an element.
45
+
46
+
Sometimes the information you want to return lives in an element's attribute rather than its text — e.g., sometimes a more exact ISO-formatted date/time will be stored in an attribute of an element.
The text you want isn't the text inside a matching element, but rather, inside the datetime attribute. To write a selector that returns an attribute, you provide your custom parser with a two-element array. The first element is your selector; the second element is the attribute you'd like to return.
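As a minimal sketch of that two-element form, a `date_published` entry might look like the following. The domain and the `time.published` selector are hypothetical placeholders chosen for illustration; only the `selectors` key and the `[selector, attribute]` shape come from the description above.

```javascript
// Hypothetical custom extractor: the domain and selector are placeholders.
const ExampleDateExtractor = {
  domain: 'www.example.com',

  date_published: {
    selectors: [
      // [element selector, attribute to return]
      ['time.published', 'datetime'],
    ],
  },
};
```

With an entry like this, the value returned would come from the element's `datetime` attribute rather than from its text.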
@@ -71,7 +73,7 @@ This is all you'll need to know to handle most of the fields Mercury parses (tit
71
73
72
74
An article's content can be more complex than the other fields, meaning you sometimes need to do more than just provide the selector(s) in order to return clean content.
73
75
74
-
For example, sometimes an article's content will contain related content that doesn't translate or render well when you just want to see the article's content. The clean key allows you to provide an array of selectors identifying elements that should be removed from the content.
76
+
For example, sometimes an article's content will contain related content that doesn't translate or render well when you just want to see the article's content. The clean key allows you to provide an array of selectors identifying elements that should be removed from the content.
75
77
76
78
Here's an example:
77
79
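One possible shape for such a configuration is sketched below. The domain and every selector in it (`div.article-body`, `.related-links`, `aside.newsletter-signup`) are hypothetical and would need to be replaced with selectors taken from the actual page.

```javascript
// Hypothetical custom extractor: all selectors here are placeholders.
const ExampleContentExtractor = {
  domain: 'www.example.com',

  content: {
    // Selectors that locate the article body.
    selectors: ['div.article-body'],

    // Elements matching these selectors are removed from the content.
    clean: ['.related-links', 'aside.newsletter-signup'],
  },
};
```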
@@ -195,21 +197,21 @@ If you look at your parser's test file, you'll see a few instructions to guide y
195
197
By default, the first test, which ensures your custom extractor is being selected properly, should be passing. The first failing test checks to see whether your extractor returns the correct title:
196
198
197
199
```javascript
198
-
it('returns the title', (async) () => {
199
-
// To pass this test, fill out the title selector
200
-
// in ./src/extractors/custom/www.newyorker.com/index.js.
const { title } = await Mercury.parse(articleUrl, html, { fallback: false });
210
+
211
+
// Update these values with the expected values from
212
+
// the article.
213
+
assert.equal(title, 'Schrödinger’s Hack');
214
+
});
213
215
```
214
216
215
217
As you can see, to pass this test, we need to fill out our title selector, which means first working out what that selector is. To find it, open the HTML fixture the generator downloaded for you in the [`fixtures`](/fixtures) directory. In our example, that file is `fixtures/www.newyorker.com/1475248565793.html`. Now open that file in your web browser.
@@ -223,7 +225,7 @@ So, back to the title: We want to make sure our test finds the same title we see
223
225
The selector for this title appears to be `h1.title`. To verify that we're right, click on the Console tab in Chrome's Developer Tools and run the following check:
224
226
225
227
```javascript
226
-
$$('h1.title')
228
+
$$('h1.title');
227
229
```
228
230
229
231
If that returns only one match (i.e., an array with just one element), and the text of that element looks like the title we want, you're good to go!
Save the file, and... uh oh, our example still fails.
248
250
249
251
```javascript
250
-
AssertionError: 'Hacking, Cryptography, and the Countdown to Quantum Computing' == 'Schrödinger’s Hack'
252
+
AssertionError: 'Hacking, Cryptography, and the Countdown to Quantum Computing' ==
253
+
'Schrödinger’s Hack';
251
254
```
252
255
253
256
When Mercury generated our test, it took a guess at the page's title, and in this case, it got it wrong. So update the test with the title we expect, save it, and your test should pass!
@@ -259,7 +262,7 @@ We've been moving at a slow pace, but as you can see, once you understand the ba
259
262
For a slightly more complex example, you'll find after a bit of looking that the best place to get the most accurate datetime on the page is in the head of the document, in the value attribute of a meta tag:
As [explained above](#selecting-an-attribute), to return an attribute rather than the text inside an element, your selector should be an array where the first element is the element selector and the second element is the attribute you want to return. So, in this example, the date_published selector should look like this:
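A sketch of that array form for a meta tag follows. The meta tag's `name` is a placeholder, since the real one has to be read from the fixture. The `describeSelector` helper is hypothetical and not part of Mercury. It only illustrates how the plain-string form and the array form differ.

```javascript
// Hypothetical extractor fragment: replace the meta name with the one
// found in the head of your fixture.
const datePublished = {
  selectors: [['meta[name="example:published_time"]', 'value']],
};

// Illustrative helper (not part of Mercury): describe what a selector
// entry asks for, depending on whether it is a string or an array.
function describeSelector(entry) {
  if (Array.isArray(entry)) {
    const [selector, attribute] = entry;
    return `attribute "${attribute}" of ${selector}`;
  }
  return `text of ${entry}`;
}

console.log(describeSelector(datePublished.selectors[0]));
console.log(describeSelector('h1.title'));
```

The first call describes an attribute lookup on the meta tag, and the second describes a plain text lookup, mirroring the `h1.title` selector used for the title earlier.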