Help Request - XpathOnUrl or CsQueryOnUrl for Infinite Scroll Pages + Work-Around


#1

Working with Infinite Scroll???

Hey, any idea how to run a query on a page with infinite scroll?

I'm trying to pull posts from a user's profile on a mobile app where users share memes. Yeah. MEMES!

The site uses infinite scrolling, so I have to keep scrolling to the bottom until every post on the page has loaded before I can see the entire HTML contents.


Right now, my setup is using an iPhone 7 user-agent in the Global HTTP Settings.

The query I'm using is a simple Dump of a CsQueryOnUrl that selects anchors whose href attribute contains a substring, and returns the href value:

=Dump(CsQueryOnUrl("URL","a[href*='substring']","href"))

The query returns only the first batch of results. In my case, exactly 10.

Right now, since I don't really know how to use SeoTools with VBA (which I'm assuming would be my best bet), my work-around is to use a UA spoofer in Chrome plus an autoscroll extension to let the entire page load. THEN I have to save the entire raw HTML page locally with an .xml extension and point the URL to it. But the last time I used this work-around, I was using XpathOnUrl. So using CsQueryOnUrl won't work with just a local HTML file, right? That's fine, I can switch to XPath queries. I just need to figure out this infinite scroll issue.
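
For reference, the local-file version of the query looks something like this (the file path is just a made-up example):

=Dump(XpathOnUrl("C:\scrape\profile.xml","//a[contains(@href,'substring')]","href"))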

Thanks in advance for any help! I'm open to any suggestions :grin:


Additional similar request: ↴

Some of these pages I'm pulling may contain over 10K posts. If each scroll loads 10 posts, that's a lot of HTML code.

Any ideas on how to speed things up? Or, with the autoscroll extension I use in Chrome, can I somehow tell Chrome to load a page without actually "loading" the page contents?
I'm not sure if that makes sense or if that's how HTML works, but if I disable the loading of the actual images on the page and load only the code, it should process much faster, no? I don't actually need the images just yet. I just need the hrefs for them.

These posts are all image based.


#2

Can you give us a URL/name of that app so we can try to find a solution?


#3

The app name is "iFunny :)".

The site is https://ifunny.co/

Their homepage has the infinite scroll.

I'm trying to parse particular user profiles. But it's the same setup.

It looks like when I save the HTML, it gets rid of the elements and keeps just the JavaScript.


#4

There is no such string in the source code of ifunny.co.


#5

Hi,

@diskborste and I created a custom connector for you. Save it to a file named IFunny.xml and put it inside your Connectors folder. Restart Excel and it will be available under Social -> iFunny -> Post list. Enter the username of the user and the number of desired links; it will stop either when it reaches your limit or when there's nothing left to scrape. You might also want to add a random delay between requests (in Settings -> Global HTTP Settings).

<?xml version="1.0" encoding="utf-8" ?>
<Suite Title="iFunny" Id="IFunny" Category="Social"
       SourceUrl="https://github.com/nielsbosma/SeoTools-for-Excel-Connectors/"
       HelpUrl="http://seotoolsforexcel.com/">

  <Author Name="dovydasm" Url="https://github.com/dovydasm"/>

  <RestConnector Id="PostList" Title="Post list" HelpUrl="https://ifunny.co/">
    <Parameters>
      <Text Id="UserId" Title="Username" Debug.DefaultValue="JelloSenpai" Required="true"/>
    </Parameters>
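    <!-- Paging: each batch of results carries a data-next token on its stream items; it becomes the cursor for the next /timeline/ request -->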
    <Paging PageSize="10">
      <Parse>
        <XPath Id="NextPageToken" Expr="//li[contains(@@class, 'stream__item')]" Attribute="data-next"></XPath>
      </Parse>
    </Paging>
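    <!-- Fetch: the first page is the plain profile URL; later pages call the /timeline/ endpoint with the token. X-Requested-With mimics the site's own AJAX requests. -->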
    <Fetch>
      <HttpSettings>
        <RequestHeaders>
          <Header Name='X-Requested-With'>XMLHttpRequest</Header>
        </RequestHeaders>
      </HttpSettings>
      <Fetch.Url>
        <![CDATA[
        https://ifunny.co/user/@Model.UserId
        @if(Model.PageCursor.Page != 0)
        {
          @: /timeline/@Model.NextPageToken
          @: ?batch=@(Model.PageCursor.Page+1)
          @: &mode=grid
        }
        ]]>
      </Fetch.Url>
    </Fetch>
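    <!-- Parse: each post's media block contains a grid__link anchor; prepend the site root to its relative href -->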
    <Parse>
      <XPath Expr="//div[@@class='post__media']">
        <Compute Id="Url" Title="URL">
          <Compute.Expr>
            <![CDATA[
              https://ifunny.co@(Model.Path)
            ]]>
          </Compute.Expr>
          <XPath Id="Path" Expr="a[contains(@@class, 'grid__link')]" Attribute="href"></XPath>
        </Compute>
      </XPath>
    </Parse>
  </RestConnector>
</Suite>

#6

@dovydasm Oh wow. That's very nice of you guys. I'll have to try this out first thing tomorrow when I get off work.

The connector looks pretty simple. I'll use this to add more parameters if needed.

Really appreciate this.

BTW, default user is "JelloSenpai" :thinking:
Is that just a random account you found? lol

Thanks again!!
You too @diskborste !!!


#7

Yeah, it was the first user I found that had a decent amount of posts. If it's annoying to see it every time you load the connector, you can delete this part from the XML file:

Debug.DefaultValue="JelloSenpai"


#8

Hey @dovydasm and @diskborste, sorry to bring this up again. Should I start a new thread? My current situation is now with pastebin.com.

It's similar to infinite scroll, but this time it's infinite pagination? If that makes sense.

Here's an example link:
https://pastebin.com/search?q=excel

I'm not requesting another custom connector (even though that would be awesome :upside_down_face:), but maybe you can point me in the right direction, either by adapting the "iFunny" connector you attached earlier, or maybe there's another way to go about the data extraction process?

I'm not looking for anything in particular, but if I had to, how could I paginate the link I sent and return all the hrefs from the results?

I understand the XPath part of the connector, but I'm not sure how you worked out the Compute expression.

I noticed you guys added "/timeline/" in the CDATA block, and when I inspect the page I can see requests with that exact path changing with every scroll. Can you point me to a guide or discussion on how to work these pages out myself?
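
To make the question concrete, here's my rough guess at a page-number version of the Paging setup, modeled on the iFunny connector. It's an untested sketch: the Query parameter and the &page= URL parameter are just my assumptions about how pastebin's search might work, and if the results are injected client-side by a script, a plain HTML fetch won't see them at all.

  <RestConnector Id="PastebinSearch" Title="Pastebin search">
    <Parameters>
      <Text Id="Query" Title="Search query" Required="true"/>
    </Parameters>
    <Paging PageSize="20"/>
    <Fetch>
      <Fetch.Url>
        <![CDATA[
        https://pastebin.com/search?q=@Model.Query
        @if(Model.PageCursor.Page != 0)
        {
          @: &page=@(Model.PageCursor.Page+1)
        }
        ]]>
      </Fetch.Url>
    </Fetch>
    <Parse>
      <!-- One column: every result link's href -->
      <XPath Id="Url" Title="URL" Expr="//a[contains(@@href, 'pastebin.com')]" Attribute="href"/>
    </Parse>
  </RestConnector>

Is that roughly the right pattern, or does pagination always need a token parse like the data-next one?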

I'll gladly create my own repository of connectors and share them with anyone who can make use of them. I get content from a lot of different sites, so maybe we can start a trend here and have users create as much new content as possible.

Any info would be appreciated! Thanks!


#9

Nevermind, I think I understand this now.

Turns out Google's inspector console tells the whole story.


#10

Hi, apologies for the late reply. Did you manage to figure it out? If not, I will give the pastebin site a go and see if I can paginate the results.

Great mindset with the connectors crowdsourcing! That is the whole purpose of the easy XML markup. Also, SeoTools 8.0 is being released as we speak, so the Connectors Library should make the sharing process a whole lot simpler :slight_smile: